Fixes to make life easier with the nlp library #6423
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6423 +/- ##
==========================================
+ Coverage 77.51% 79.79% +2.27%
==========================================
Files 150 150
Lines 27789 27807 +18
==========================================
+ Hits 21542 22188 +646
+ Misses 6247 5619 -628
Continue to review full report at Codecov.
@@ -2318,7 +2318,7 @@ def _concat_inputs_history(self, inputs: List[List[int]], histories: List[Option
     max_len = max([len(item) for item in outputs])
     outputs = [output + [self.pad_token_id] * (max_len - len(output)) for output in outputs]
     outputs = BatchEncoding(
-        {"input_ids": outputs, "attention_mask": [1] * len(outputs)}, tensor_type=self.framework
+        {"input_ids": outputs, "attention_mask": [[1] * len(outputs)]}, tensor_type=self.framework,
This is the only place where the change of dim in BatchEncoding.convert_to_tensors breaks something, but in this case it was a bit magical that the dimension was automatically added, so I don't think this is a serious failure.
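For context, here is a minimal sketch of the behavior being discussed; the shapes shown reflect my reading of the change and are not code from the PR. After this change, `BatchEncoding.convert_to_tensors` no longer unsqueezes a 1-D list into a 2-D tensor, so callers that relied on the implicit batch dimension, like the pipeline code in the diff above, must add it explicitly:

```python
# A sketch of the dimension change discussed above (assumed behavior,
# not code taken from the PR itself).
from transformers import BatchEncoding

# A flat 1-D list, e.g. labels for a batch of three examples.
enc = BatchEncoding({"labels": [0, 1, 0]}, tensor_type="pt")
# After this PR: shape [3]. Before it, convert_to_tensors added a
# batch dimension automatically, yielding shape [1, 3].
print(enc["labels"].shape)

# The pipeline diff above therefore wraps the attention mask in an
# extra list to supply the batch dimension itself:
enc2 = BatchEncoding({"attention_mask": [[1] * 2]}, tensor_type="pt")
print(enc2["attention_mask"].shape)  # torch.Size([1, 2])
```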
LGTM, great that you added tests for all three frameworks.
Merging then; we can follow up next week when @thomwolf is back, if he has more comments.
)" This reverts commit 79dd29d.
This PR adds two things to make the interface easier with the `nlp` library:

- `BatchEncoding` stops enforcing two dimensions for every tensor, which caused problems for labels (these should be one vector of shape `[batch_size]`).
- `PreTrainedTokenizerBase.pad` accepts tensors as inputs, which makes it easy to use this function for data collation (see the sketch after this list).

Added proper documentation and tests from @thomwolf's initial work.
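To illustrate the second point, here is a hedged sketch of using `pad` as a collate function; the checkpoint name and the example token ids are illustrative assumptions, not taken from the PR:

```python
# A sketch of the data-collation use case this PR enables (checkpoint
# and feature values are illustrative, not from the PR).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pre-tokenized, variable-length examples, e.g. as stored by the nlp library.
features = [
    {"input_ids": torch.tensor([101, 7592, 102])},
    {"input_ids": torch.tensor([101, 7592, 2088, 999, 102])},
]

def collate_fn(batch):
    # pad() now accepts tensors directly, so it can pad a batch of
    # examples to a common length and build the attention mask.
    return tokenizer.pad(batch, padding=True, return_tensors="pt")

batch = collate_fn(features)
print(batch["input_ids"].shape)   # torch.Size([2, 5])
print(batch["attention_mask"])    # zeros mark the padded positions
```

This is essentially the pattern a padding data collator builds on: pass `collate_fn` to a `torch.utils.data.DataLoader` and each batch comes out padded to its own longest sequence.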