Data collator with padding #6398

Closed · wants to merge 47 commits into from

Conversation

sgugger (Collaborator) commented Aug 10, 2020

Add a data collator to dynamically pad samples during batching. This is necessary for the training set, since padding can't be applied beforehand when we use shuffling (unless we pad to a fixed max_length).
This should make it more straightforward to plug nlp into the Trainer.
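
For illustration, here is a minimal sketch of what such a collator can look like. It is based on the diff excerpt quoted later in this thread; everything beyond the `pad_to_multiple_of` field and the call to `tokenizer.pad` is an assumption, not the PR's exact code:

```python
from dataclasses import dataclass
from typing import Optional, Union

from transformers import PreTrainedTokenizerBase


@dataclass
class DataCollatorWithPadding:
    """Pads each batch to the length of its longest sample at collation time."""

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str] = True          # True: pad to the longest sample in the batch
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None  # e.g. 8, to line up with fp16 tensor cores

    def __call__(self, features):
        # features: List[Dict[str, List[int]]], as produced by a tokenizer
        return self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
```

Because padding happens inside `__call__`, each shuffled batch is padded only to the length of its own longest sample rather than to a global maximum.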

codecov bot commented Aug 10, 2020

Codecov Report

Merging #6398 into master will increase coverage by 0.18%.
The diff coverage is 50.00%.


```diff
@@            Coverage Diff             @@
##           master    #6398      +/-   ##
==========================================
+ Coverage   79.55%   79.73%   +0.18%
==========================================
  Files         148      148
  Lines       27206    27226      +20
==========================================
+ Hits        21644    21710      +66
+ Misses       5562     5516      -46
```
| Impacted Files | Coverage Δ |
|---|---|
| src/transformers/__init__.py | 99.24% <ø> (ø) |
| src/transformers/data/data_collator.py | 89.05% <50.00%> (-7.54%) ⬇️ |
| src/transformers/tokenization_ctrl.py | 78.64% <0.00%> (-17.48%) ⬇️ |
| src/transformers/data/processors/squad.py | 28.13% <0.00%> (-5.51%) ⬇️ |
| src/transformers/tokenization_dpr.py | 53.15% <0.00%> (-4.51%) ⬇️ |
| src/transformers/file_utils.py | 82.18% <0.00%> (-0.26%) ⬇️ |
| src/transformers/generation_tf_utils.py | 86.46% <0.00%> (-0.26%) ⬇️ |
| src/transformers/tokenization_utils_base.py | 93.84% <0.00%> (+1.39%) ⬆️ |
| src/transformers/tokenization_transfo_xl.py | 42.48% <0.00%> (+8.92%) ⬆️ |
| src/transformers/tokenization_gpt2.py | 97.22% <0.00%> (+9.72%) ⬆️ |
| ... and 2 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3425936...4bed573.

patrickvonplaten and others added 7 commits August 10, 2020 23:25
* improve names and tests longformer

* more and better tests for longformer

* add first tf test

* finalize tf basic op functions

* fix merge

* tf shape test passes

* narrow down discrepancies

* make longformer local attn tf work

* correct tf longformer

* add first global attn function

* add more global longformer func

* advance tf longformer

* finish global attn

* upload big model

* finish all tests

* correct false any statement

* fix common tests

* make all tests pass except keras save load

* fix some tests

* fix torch test import

* finish tests

* fix test

* fix torch tf tests

* add docs

* finish docs

* Update src/transformers/modeling_longformer.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_tf_longformer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* apply Lysandre's suggestions

* revert to assert statement because the function will fail otherwise

* applying Sylvain's recommendations

* Update src/transformers/modeling_longformer.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/modeling_tf_longformer.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* Chunked feed forward for Bert

This is an initial implementation to test applying feed forward chunking for BERT.
Will need additional modifications based on output and benchmark results.

* Black and cleanup

* Feed forward chunking in BertLayer class.

* Isort

* add chunking for all models

* fix docs

* Fix typo

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
* add pl_glue example test

* for now just test that it runs, next validate results of eval or predict?

* complete the run_pl_glue test to validate the actual outcome

* worked on my machine, CI gets less accuracy - trying higher epochs

* match run_pl.sh hparams

* more epochs?

* trying higher lr

* for now just test that the script runs to a completion

* correct the comment

* if cuda is available, add --fp16 --gpus=1 to cover more bases

* style
* testing utils: capturing std streams context manager

* style

* missing import

* add the origin of this code

LysandreJik (Member) left a comment:

This is a welcome addition. Can this collator be used alongside other collators? It seems to be an option that we would like on all of them, given its utility.

yobekiko and others added 3 commits August 11, 2020 04:49
* fix tokenizer saving and loading bugs when adding AddedToken to additional special tokens

* Add tokenizer test

* Style

* Style 2

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Warn if debug requested without TPU fixes (#6308)
Check whether a PyTorch-compatible TPU is available before attempting to print TPU metrics after training has completed. This way, users who apply `--debug` without reading the documentation aren't surprised by a stacktrace.

* Style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Optimized banned token masking

* Avoid duplicate EOS masking if in bad_words_id

* Updated mask generation to handle empty banned token list

* Addition of unit tests for the updated bad_words_ids masking

* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test

* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test (timeout does not work on Windows)

* Moving Marian import to the test context to allow TF only environments to run

* Moving imports to torch_available test

* Updated operations device and test

* Updated operations device and test

* Added docstring and comment for in-place scores modification

* Moving test to own test_generation_utils, use of lighter models for testing

* removed unneeded imports in test_modeling_common

* revert formatting change for ModelTesterMixin

* Updated caching, simplified eos token id test, removed unnecessary @require_torch

* formatting compliance

sgugger (Collaborator, Author) commented Aug 11, 2020

It's hard to see how to combine it with another data collator, since a data collator's function is to create batches, and you can't create batches if your tensors are not padded to the same size.
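
As a concrete illustration of why (hypothetical values; this is just standard PyTorch behavior):

```python
import torch

# Batching means stacking per-sample tensors, and stacking ragged tensors fails:
a = torch.tensor([101, 2023, 102])  # length 3
b = torch.tensor([101, 102])        # length 2
batch = torch.stack([a, b])  # RuntimeError: stack expects each tensor to be equal size
```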

* Create README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
The review thread below is anchored on these lines of the diff:

```python
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        batch = self.tokenizer.pad(
```
Contributor:

As it is coded now, `features` cannot be a dict of `torch.Tensor`: `self.tokenizer.pad(...)` would fail, right? This is a bit different from some other DataCollators, which expect `torch.Tensor`s as input.

Do we want to go more towards an approach where we preprocess everything in "Python" using the nlp library, and then use the data collators to transform a dataset of Python lists into a dataset of torch Tensors?
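
To make the point concrete (hypothetical values; the failure mode is as asserted above, not independently verified), using the collator sketch from earlier in this page:

```python
# Works: dicts of Python lists, straight from a tokenizer.
features = [
    {"input_ids": [101, 2023, 2003, 102]},
    {"input_ids": [101, 2232, 102]},
]
batch = collator(features)  # both samples padded to length 4, returned as tensors

# Per the comment above, dicts holding torch.Tensor values would not get
# through self.tokenizer.pad(...) as the PR is currently written:
# features = [{"input_ids": torch.tensor([101, 2023, 2003, 102])}, ...]
```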

sgugger (Collaborator, Author):

I disagree that data collators expect `torch.Tensor` as inputs: `default_data_collator` accepts both. IMO it's `DataCollatorForLanguageModeling` and its variants that are not great, in the sense that they don't accept both tensors and lists of ints as inputs. On that note, this data collator should also work with tensors, I agree. IMO it's the `pad` method that should ultimately be fixed to accept tensors.

I think we want to support everything in terms of preprocessing, but nlp should be a first-class citizen.

Contributor:

Ok, no, I agree with you. I also think we should refactor the other data collators to also accept Python list inputs.

IMO, a nice workflow is (a sketch follows below):

  1. Download a dataset with nlp; this will be in the format Dict[str, List[int or str]].
  2. Then we can use the dataset.map() functionality to process the data into input_ids, attention_mask, etc. Since nlp's .map(...) does not allow returning torch.Tensors, we have to stay in Python list format here anyway.
  3. This dataset can then ideally already be used with the default_data_collator, because all tokenization methods were applied during the .map() operations.

Here is an example of this workflow: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16#training-script. (Note that the tensors would not even have to be put into torch tensor format.)
What do you think @sgugger ?
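
A sketch of that three-step workflow (the dataset, checkpoint, and column names are illustrative, not from the PR):

```python
import nlp
from transformers import AutoTokenizer
from transformers.data.data_collator import default_data_collator

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 1. Download a dataset with nlp: every column is a plain Python list.
dataset = nlp.load_dataset("glue", "sst2", split="train")

# 2. Tokenize via dataset.map(), staying in Python list format. Padding to a
#    fixed max_length here is what lets step 3 use default_data_collator.
def tokenize(batch):
    return tokenizer(batch["sentence"], padding="max_length", max_length=128, truncation=True)

dataset = dataset.map(tokenize, batched=True)
dataset.set_format(columns=["input_ids", "attention_mask", "label"])

# 3. Fixed-length samples collate directly with default_data_collator,
#    e.g. as the collate_fn of a DataLoader or inside the Trainer.
```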

sgugger (Collaborator, Author):

I agree, except step 3 should use default_data_collator only when you are padding to a fixed max length. With the shuffling of the training set, you need dynamic padding, as provided by the data collator introduced in this PR :-)
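
The dynamic-padding variant of the same workflow (again a sketch with illustrative names, using the collator this PR adds):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Tokenize WITHOUT padding, so samples keep their natural lengths.
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

train_dataset = dataset.map(tokenize, batched=True)
train_dataset.set_format(columns=["input_ids", "attention_mask", "label"])

# Each shuffled batch is now padded only to its own longest sample.
loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=DataCollatorWithPadding(tokenizer),
)
```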


sgugger (Collaborator, Author) commented Aug 11, 2020

Things I tried to fix here were actually addressed by @thomwolf in #6423, so I'm waiting for that PR to be merged before merging this one.

sshleifer and others added 25 commits August 11, 2020 15:57
* Create README.md

* Update README.md
* [wip] add get_polynomial_decay_schedule_with_warmup

* style

* add assert

* change lr_end to a much smaller default number

* check for exact equality

* [model_cards] electra-base-turkish-cased-ner (#6350)

* for electra-base-turkish-cased-ner

* Add metadata

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Temporarily de-activate TPU CI

* Update modeling_tf_utils.py (#6372)

fix typo: ckeckpoint->checkpoint

* the test now works again (#6371)

* correct pl link in readme (#6364)

* refactor almost identical tests (#6339)

* refactor almost identical tests

* important to add a clear assert error message

* make the assert error even more descriptive than the original bt

* Small docfile fixes (#6328)

* Patch models (#6326)

* TFAlbertFor{TokenClassification, MultipleChoice}

* Patch models

* BERT and TF BERT info

* Update check_repo

* Ci GitHub caching (#6382)

* Cache Github Actions CI

* Remove useless file

* Colab button (#6389)

* Add colab button

* Add colab link for tutorials

* Fix links for open in colab (#6391)

* Update src/transformers/optimization.py

consistently use lr_end=1e-7 default

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [wip] add get_polynomial_decay_schedule_with_warmup

* style

* add assert

* change lr_end to a much smaller default number

* check for exact equality

* Update src/transformers/optimization.py

consistently use lr_end=1e-7 default

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* remove dup (leftover from merge)

* convert the test into the new refactored format

* stick to using the current_step as is, without ++

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Alexander Measure <ameasure@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Create README.md

* add results on SAIL dataset

* Update model_cards/rohanrajpal/bert-base-multilingual-codemixed-cased-sentiment/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* Create README.md

* Update model_cards/rohanrajpal/bert-base-codemixed-uncased-sentiment/README.md

* Update model_cards/rohanrajpal/bert-base-codemixed-uncased-sentiment/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
#4323)

* Fix FFN dropout in TFAlbertLayer, and split dropout in TFAlbertAttention into two separate dropout layers.

* Same dropout fixes for PyTorch.
…ut (#6422)

* replace capsys with the more refined CaptureStderr/CaptureStdout

* Update examples/seq2seq/test_seq2seq_examples.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* allow using tokenizer.pad as a collate_fn in pytorch

* allow using tokenizer.pad as a collate_fn in pytorch

* Add documentation and tests

* Make attention mask the right shape

* Better test

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
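
Those commits track #6423; a sketch of what it enables (assuming `encoded_dataset` holds already-tokenized, unpadded samples and `tokenizer` is the matching tokenizer):

```python
from torch.utils.data import DataLoader

# tokenizer.pad itself serves as the collate_fn: it pads each batch to its
# longest member, fixes up the attention_mask, and returns PyTorch tensors.
loader = DataLoader(
    encoded_dataset,
    batch_size=16,
    shuffle=True,
    collate_fn=lambda features: tokenizer.pad(features, return_tensors="pt"),
)
```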
* Activate check on the CI

* Fix repo inconsistencies

* Don't document too much

sgugger (Collaborator, Author) commented Aug 12, 2020

Rebase made the PR unreadable. Opening a new clean one.
