Data collator with padding #6398

Closed · wants to merge 47 commits into from

Conversation

sgugger (Collaborator) commented Aug 10, 2020

Add a data collator to dynamically pad samples during batching. This is necessary for the training set, since padding can't be applied beforehand when we use shuffling (unless we pad to a fixed max_length).
This should make it more straightforward to plug nlp into the Trainer.
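
For illustration, here is a minimal sketch of what such a collator can look like. It is based on the diff excerpt quoted later in this thread; everything beyond the `pad_to_multiple_of` field and the call to `tokenizer.pad` is an assumption, not the PR's exact code:

```python
from dataclasses import dataclass
from typing import Optional, Union

from transformers import PreTrainedTokenizerBase


@dataclass
class DataCollatorWithPadding:
    """Pads each batch to the length of its longest sample at collation time."""

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str] = True          # True: pad to the longest sample in the batch
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None  # e.g. 8, to line up with fp16 tensor cores

    def __call__(self, features):
        # features: List[Dict[str, List[int]]], as produced by a tokenizer
        return self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
```

Because padding happens inside `__call__`, each shuffled batch is padded only to the length of its own longest sample rather than to a global maximum.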

codecov bot commented Aug 10, 2020

Codecov Report

Merging #6398 into master will increase coverage by 0.18%.
The diff coverage is 50.00%.


```diff
@@            Coverage Diff             @@
##           master    #6398      +/-   ##
==========================================
+ Coverage   79.55%   79.73%   +0.18%
==========================================
  Files         148      148
  Lines       27206    27226      +20
==========================================
+ Hits        21644    21710      +66
+ Misses       5562     5516      -46
```
| Impacted Files | Coverage Δ |
|---|---|
| src/transformers/__init__.py | 99.24% <ø> (ø) |
| src/transformers/data/data_collator.py | 89.05% <50.00%> (-7.54%) ⬇️ |
| src/transformers/tokenization_ctrl.py | 78.64% <0.00%> (-17.48%) ⬇️ |
| src/transformers/data/processors/squad.py | 28.13% <0.00%> (-5.51%) ⬇️ |
| src/transformers/tokenization_dpr.py | 53.15% <0.00%> (-4.51%) ⬇️ |
| src/transformers/file_utils.py | 82.18% <0.00%> (-0.26%) ⬇️ |
| src/transformers/generation_tf_utils.py | 86.46% <0.00%> (-0.26%) ⬇️ |
| src/transformers/tokenization_utils_base.py | 93.84% <0.00%> (+1.39%) ⬆️ |
| src/transformers/tokenization_transfo_xl.py | 42.48% <0.00%> (+8.92%) ⬆️ |
| src/transformers/tokenization_gpt2.py | 97.22% <0.00%> (+9.72%) ⬆️ |
| ... and 2 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3425936...4bed573.

patrickvonplaten and others added 7 commits August 10, 2020 23:25
* improve names and tests longformer

* more and better tests for longformer

* add first tf test

* finalize tf basic op functions

* fix merge

* tf shape test passes

* narrow down discrepancies

* make longformer local attn tf work

* correct tf longformer

* add first global attn function

* add more global longformer func

* advance tf longformer

* finish global attn

* upload big model

* finish all tests

* correct false any statement

* fix common tests

* make all tests pass except keras save load

* fix some tests

* fix torch test import

* finish tests

* fix test

* fix torch tf tests

* add docs

* finish docs

* Update src/transformers/modeling_longformer.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_tf_longformer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* apply Lysandre's suggestions

* revert to assert statement because the function will fail otherwise

* applying Sylvain's recommendations

* Update src/transformers/modeling_longformer.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/modeling_tf_longformer.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* Chunked feed forward for Bert

This is an initial implementation to test applying feed forward chunking for BERT.
Will need additional modifications based on output and benchmark results.

* Black and cleanup

* Feed forward chunking in BertLayer class.

* Isort

* add chunking for all models

* fix docs

* Fix typo

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
* add pl_glue example test

* for now just test that it runs, next validate results of eval or predict?

* complete the run_pl_glue test to validate the actual outcome

* worked on my machine, CI gets less accuracy - trying higher epochs

* match run_pl.sh hparams

* more epochs?

* trying higher lr

* for now just test that the script runs to a completion

* correct the comment

* if cuda is available, add --fp16 --gpus=1 to cover more bases

* style
* testing utils: capturing std streams context manager

* style

* missing import

* add the origin of this code

LysandreJik (Member) left a comment:

This is a welcome addition. Can this collator be used alongside other collators? It seems to be an option that we would like on all of them, given its utility.

yobekiko and others added 3 commits August 11, 2020 04:49
* fix tokenizer saving and loading bugs when adding AddedToken to additional special tokens

* Add tokenizer test

* Style

* Style 2

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Warn if debug requested without TPU fixes (#6308)
Check whether a PyTorch-compatible TPU is available before attempting to print TPU metrics after training has completed. This way, users who apply `--debug` without reading the documentation aren't surprised by a stacktrace.

* Style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Optimized banned token masking

* Avoid duplicate EOS masking if in bad_words_id

* Updated mask generation to handle empty banned token list

* Addition of unit tests for the updated bad_words_ids masking

* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test

* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test (timeout does not work on Windows)

* Moving Marian import to the test context to allow TF only environments to run

* Moving imports to torch_available test

* Updated operations device and test

* Updated operations device and test

* Added docstring and comment for in-place scores modification

* Moving test to own test_generation_utils, use of lighter models for testing

* removed unneeded imports in test_modeling_common

* revert formatting change for ModelTesterMixin

* Updated caching, simplified eos token id test, removed unnecessary @require_torch

* formatting compliance

sgugger (Collaborator, Author) commented Aug 11, 2020

It's hard to see how to combine it with another data collator, since a data collator's function is to create batches, and you can't create batches if your tensors are not padded to the same size.
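
As a concrete illustration of why (hypothetical values; this is just standard PyTorch behavior):

```python
import torch

# Batching means stacking per-sample tensors, and stacking ragged tensors fails:
a = torch.tensor([101, 2023, 102])  # length 3
b = torch.tensor([101, 102])        # length 2
batch = torch.stack([a, b])  # RuntimeError: stack expects each tensor to be equal size
```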

* Create README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

* Update model_cards/akhooli/gpt2-small-arabic-poetry/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
The review thread below is anchored on these lines of the diff:

```python
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        batch = self.tokenizer.pad(
```
Contributor:

As it is coded now, `features` cannot be a dict of `torch.Tensor`: `self.tokenizer.pad(...)` would fail, right? This is a bit different from some other DataCollators, which expect `torch.Tensor`s as input.

Do we want to go more towards an approach where we preprocess everything in "Python" using the nlp library, and then use the data collators to transform a dataset of Python lists into a dataset of torch Tensors?
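
To make the point concrete (hypothetical values; the failure mode is as asserted above, not independently verified), using the collator sketch from earlier in this page:

```python
# Works: dicts of Python lists, straight from a tokenizer.
features = [
    {"input_ids": [101, 2023, 2003, 102]},
    {"input_ids": [101, 2232, 102]},
]
batch = collator(features)  # both samples padded to length 4, returned as tensors

# Per the comment above, dicts holding torch.Tensor values would not get
# through self.tokenizer.pad(...) as the PR is currently written:
# features = [{"input_ids": torch.tensor([101, 2023, 2003, 102])}, ...]
```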

sgugger (Collaborator, Author):

I disagree that data collators expect `torch.Tensor` as inputs: `default_data_collator` accepts both. IMO it's `DataCollatorForLanguageModeling` and its variants that are not great, in the sense that they don't accept both tensors and lists of ints as inputs. On that note, this data collator should also work with tensors, I agree. IMO it's the `pad` method that should ultimately be fixed to accept tensors.

I think we want to support everything in terms of preprocessing, but nlp should be a first-class citizen.

Contributor:

Ok, no, I agree with you. I also think we should refactor the other data collators to also accept Python list inputs.

IMO, a nice workflow is (a sketch follows below):

  1. Download a dataset with nlp; this will be in the format Dict[str, List[int or str]].
  2. Then we can use the dataset.map() functionality to process the data into input_ids, attention_mask, etc. Since nlp's .map(...) does not allow returning torch.Tensors, we have to stay in Python list format here anyway.
  3. This dataset can then ideally already be used with the default_data_collator, because all tokenization methods were applied during the .map() operations.

Here is an example of this workflow: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16#training-script. (Note that the tensors would not even have to be put into torch tensor format.)
What do you think @sgugger ?
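
A sketch of that three-step workflow (the dataset, checkpoint, and column names are illustrative, not from the PR):

```python
import nlp
from transformers import AutoTokenizer
from transformers.data.data_collator import default_data_collator

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 1. Download a dataset with nlp: every column is a plain Python list.
dataset = nlp.load_dataset("glue", "sst2", split="train")

# 2. Tokenize via dataset.map(), staying in Python list format. Padding to a
#    fixed max_length here is what lets step 3 use default_data_collator.
def tokenize(batch):
    return tokenizer(batch["sentence"], padding="max_length", max_length=128, truncation=True)

dataset = dataset.map(tokenize, batched=True)
dataset.set_format(columns=["input_ids", "attention_mask", "label"])

# 3. Fixed-length samples collate directly with default_data_collator,
#    e.g. as the collate_fn of a DataLoader or inside the Trainer.
```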

sgugger (Collaborator, Author):

I agree, except step 3 should use default_data_collator only when you are padding to a fixed max length. With the shuffling of the training set, you need dynamic padding, as provided by the data collator introduced in this PR :-)
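
The dynamic-padding variant of the same workflow (again a sketch with illustrative names, using the collator this PR adds):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Tokenize WITHOUT padding, so samples keep their natural lengths.
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

train_dataset = dataset.map(tokenize, batched=True)
train_dataset.set_format(columns=["input_ids", "attention_mask", "label"])

# Each shuffled batch is now padded only to its own longest sample.
loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=DataCollatorWithPadding(tokenizer),
)
```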


sgugger (Collaborator, Author) commented Aug 11, 2020

Things I tried to fix here were actually addressed by @thomwolf in #6423, so I'm waiting for that PR to be merged before merging this one.

sshleifer and others added 25 commits August 11, 2020 15:57
* Create README.md

* Update README.md
* [wip] add get_polynomial_decay_schedule_with_warmup

* style

* add assert

* change lr_end to a much smaller default number

* check for exact equality

* [model_cards] electra-base-turkish-cased-ner (#6350)

* for electra-base-turkish-cased-ner

* Add metadata

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Temporarily de-activate TPU CI

* Update modeling_tf_utils.py (#6372)

fix typo: ckeckpoint->checkpoint

* the test now works again (#6371)

* correct pl link in readme (#6364)

* refactor almost identical tests (#6339)

* refactor almost identical tests

* important to add a clear assert error message

* make the assert error even more descriptive than the original bt

* Small docfile fixes (#6328)

* Patch models (#6326)

* TFAlbertFor{TokenClassification, MultipleChoice}

* Patch models

* BERT and TF BERT info

* Update check_repo

* Ci GitHub caching (#6382)

* Cache Github Actions CI

* Remove useless file

* Colab button (#6389)

* Add colab button

* Add colab link for tutorials

* Fix links for open in colab (#6391)

* Update src/transformers/optimization.py

consistently use lr_end=1e-7 default

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [wip] add get_polynomial_decay_schedule_with_warmup

* style

* add assert

* change lr_end to a much smaller default number

* check for exact equality

* Update src/transformers/optimization.py

consistently use lr_end=1e-7 default

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* remove dup (leftover from merge)

* convert the test into the new refactored format

* stick to using the current_step as is, without ++

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Alexander Measure <ameasure@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Create README.md

* add results on SAIL dataset

* Update model_cards/rohanrajpal/bert-base-multilingual-codemixed-cased-sentiment/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* Create README.md

* Update model_cards/rohanrajpal/bert-base-codemixed-uncased-sentiment/README.md

* Update model_cards/rohanrajpal/bert-base-codemixed-uncased-sentiment/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
#4323)

* Fix FFN dropout in TFAlbertLayer, and split dropout in TFAlbertAttention into two separate dropout layers.

* Same dropout fixes for PyTorch.
…ut (#6422)

* replace capsys with the more refined CaptureStderr/CaptureStdout

* Update examples/seq2seq/test_seq2seq_examples.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* allow using tokenizer.pad as a collate_fn in pytorch

* allow using tokenizer.pad as a collate_fn in pytorch

* Add documentation and tests

* Make attention mask the right shape

* Better test

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
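
Those commits track #6423; a sketch of what it enables (assuming `encoded_dataset` holds already-tokenized, unpadded samples and `tokenizer` is the matching tokenizer):

```python
from torch.utils.data import DataLoader

# tokenizer.pad itself serves as the collate_fn: it pads each batch to its
# longest member, fixes up the attention_mask, and returns PyTorch tensors.
loader = DataLoader(
    encoded_dataset,
    batch_size=16,
    shuffle=True,
    collate_fn=lambda features: tokenizer.pad(features, return_tensors="pt"),
)
```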
* Activate check on the CI

* Fix repo inconsistencies

* Don't document too much

sgugger (Collaborator, Author) commented Aug 12, 2020

Rebase made the PR unreadable. Opening a new clean one.
