Fixes to make life easier with the nlp library #6423
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6423 +/- ##
==========================================
+ Coverage 77.51% 79.79% +2.27%
==========================================
Files 150 150
Lines 27789 27807 +18
==========================================
+ Hits 21542 22188 +646
+ Misses 6247 5619 -628
Continue to review full report at Codecov.
@@ -2318,7 +2318,7 @@ def _concat_inputs_history(self, inputs: List[List[int]], histories: List[Option
     max_len = max([len(item) for item in outputs])
     outputs = [output + [self.pad_token_id] * (max_len - len(output)) for output in outputs]
     outputs = BatchEncoding(
-        {"input_ids": outputs, "attention_mask": [1] * len(outputs)}, tensor_type=self.framework
+        {"input_ids": outputs, "attention_mask": [[1] * len(outputs)]}, tensor_type=self.framework,
This is the only place where the change of dim in BatchEncoding.convert_to_tensors breaks something, but in this case it was a bit magical that the dimension was automatically added, so I don't think this is a serious failure.
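For context, here is a minimal sketch of the behavior being discussed; the shapes shown reflect my reading of the change and are not code from the PR. After this change, `BatchEncoding.convert_to_tensors` no longer unsqueezes a 1-D list into a 2-D tensor, so callers that relied on the implicit batch dimension, like the pipeline code in the diff above, must add it explicitly:

```python
# A sketch of the dimension change discussed above (assumed behavior,
# not code taken from the PR itself).
from transformers import BatchEncoding

# A flat 1-D list, e.g. labels for a batch of three examples.
enc = BatchEncoding({"labels": [0, 1, 0]}, tensor_type="pt")
# After this PR: shape [3]. Before it, convert_to_tensors added a
# batch dimension automatically, yielding shape [1, 3].
print(enc["labels"].shape)

# The pipeline diff above therefore wraps the attention mask in an
# extra list to supply the batch dimension itself:
enc2 = BatchEncoding({"attention_mask": [[1] * 2]}, tensor_type="pt")
print(enc2["attention_mask"].shape)  # torch.Size([1, 2])
```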
LGTM, great that you added tests for all three frameworks.
Merging then; we can follow up next week when @thomwolf is back, if he has more comments.
)" This reverts commit 79dd29d.
This PR adds two things to make the interface easier with the `nlp` library:

- `BatchEncoding` stops enforcing two dimensions for every tensor, which caused problems for labels (these should be one vector of shape `[batch_size]`).
- `PreTrainedTokenizerBase.pad` accepts tensors as inputs, which makes it easy to use this function for data collation (see the sketch after this list).

Added proper documentation and tests from @thomwolf's initial work.
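To illustrate the second point, here is a hedged sketch of using `pad` as a collate function; the checkpoint name and the example token ids are illustrative assumptions, not taken from the PR:

```python
# A sketch of the data-collation use case this PR enables (checkpoint
# and feature values are illustrative, not from the PR).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pre-tokenized, variable-length examples, e.g. as stored by the nlp library.
features = [
    {"input_ids": torch.tensor([101, 7592, 102])},
    {"input_ids": torch.tensor([101, 7592, 2088, 999, 102])},
]

def collate_fn(batch):
    # pad() now accepts tensors directly, so it can pad a batch of
    # examples to a common length and build the attention mask.
    return tokenizer.pad(batch, padding=True, return_tensors="pt")

batch = collate_fn(features)
print(batch["input_ids"].shape)   # torch.Size([2, 5])
print(batch["attention_mask"])    # zeros mark the padded positions
```

This is essentially the pattern a padding data collator builds on: pass `collate_fn` to a `torch.utils.data.DataLoader` and each batch comes out padded to its own longest sequence.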