Data collator with padding #6398
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #6398      +/-   ##
==========================================
+ Coverage   79.55%   79.73%   +0.18%
==========================================
  Files         148      148
  Lines       27206    27226      +20
==========================================
+ Hits        21644    21710      +66
+ Misses       5562     5516      -46

Continue to review full report at Codecov.
This is a welcome addition. Can this collator be used alongside other collators? It seems to be an option that we would like on all of them, given its utility.
It's hard to see how to combine it with another data collator, since a data collator's function is to create batches, and you can't create batches if your tensors are not padded to the same size.
pad_to_multiple_of: Optional[int] = None

def __call__(self, features):
    batch = self.tokenizer.pad(
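The fragment above delegates padding to `tokenizer.pad`. As a rough pure-Python sketch of what dynamic batch-time padding does (the name `pad_batch` is hypothetical; the actual collator returns torch tensors via `self.tokenizer.pad`):

```python
# Hypothetical sketch of dynamic padding at batch time. The real
# DataCollatorWithPadding delegates to tokenizer.pad; this toy version
# only illustrates the idea with plain Python lists.
def pad_batch(features, pad_token_id=0, pad_to_multiple_of=None):
    """Pad a list of {"input_ids": [...]} dicts to a common length."""
    max_len = max(len(f["input_ids"]) for f in features)
    if pad_to_multiple_of is not None:
        # round up to the next multiple (friendlier shapes for hardware)
        max_len = -(-max_len // pad_to_multiple_of) * pad_to_multiple_of
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        pad_len = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad_len)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad_len)
    return batch
```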
As it is coded now, `features` cannot be a dict of `torch.Tensor`: `self.tokenizer.pad(...)` would fail, right? This is a bit different from some other DataCollators that expect `torch.Tensor`s as input.
Do we want to go more towards an approach where we preprocess everything in Python using the `nlp` library, and then use the data collators to transform a dataset of Python lists into a dataset of torch Tensors?
I disagree that data collators expect `torch.Tensor` as inputs: `default_data_collator` accepts both. IMO it's `DataCollatorForLanguageModeling` and its variants that are not great, in the sense that they don't accept both tensors and lists of ints as inputs. On that note, this data collator should also work with tensors, I agree. IMO it's the `pad` method that should ultimately be fixed to accept tensors.
I think we want to support everything in terms of preprocessing, but `nlp` should be a first-class citizen.
Ok, no, I agree with you. I also think we should refactor the other data collators to accept Python list inputs as well.
IMO, a nice workflow is:

1. Download a dataset with `nlp`; this will be in the format `Dict[str, List[int or str]]`.
2. Use the `dataset.map()` functionality to process the data into `input_ids`, `attention_mask`, etc. Since `nlp.map(...)` does not allow returning `torch.Tensors`, we have to stay in Python list format here anyway.
3. This dataset can then ideally already be used with the `default_data_collator`, because all tokenization methods were applied during the `.map()` operations.

Here would be an example of this workflow: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16#training-script. (Note that the tensors would not even have to be put into torch tensor format.)

What do you think @sgugger ?
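The steps above could be sketched like this (a toy stand-in: the list of dicts plays the role of an `nlp` dataset and `toy_tokenize` plays the role of a real tokenizer; both are hypothetical, for illustration only):

```python
# Rough sketch of the list-based preprocessing workflow described above.
def toy_tokenize(text):
    # fake tokenizer: one integer id per whitespace-separated word
    return [len(word) for word in text.split()]

# step 1: the "downloaded" dataset, columns of Python lists/strings
raw_dataset = [{"text": "hello brave new world"}, {"text": "hi"}]

# step 2: a map()-style step producing input_ids / attention_mask as
# Python lists, mirroring what nlp's dataset.map() would return
def preprocess(example):
    ids = toy_tokenize(example["text"])
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

processed = [preprocess(example) for example in raw_dataset]
# step 3: `processed` can now be handed to a collator that pads Python lists
```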
I agree, except step 3 should be `default_data_collator` only when you are padding to a fixed max length. With the shuffling of the training set, you need dynamic padding, as provided by the data collator introduced in this PR :-)
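To see why dynamic padding matters once the training set is shuffled, here is a toy comparison (assumed sequence lengths, illustration only) of how many pad tokens fixed `max_length` padding costs versus per-batch padding:

```python
# Toy cost comparison: padding everything to a global max_length vs.
# padding each batch only to that batch's longest sequence.
def padded_tokens_fixed(lengths, max_length):
    return sum(max_length - length for length in lengths)

def padded_tokens_dynamic(lengths, batch_size):
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - length for length in batch)
    return total

lengths = [8, 12, 40, 9, 37, 10]  # assumed example lengths
# fixed padding to 40 wastes 124 pad tokens; dynamic batches of 2 waste 62
```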
* allow using tokenizer.pad as a collate_fn in pytorch
* Add documentation and tests
* Make attention mask the right shape
* Better test

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Rebase made the PR unreadable. Opening a new clean one.
Add a data collator to dynamically pad samples during batching. This is necessary for the training set, since padding can't be applied beforehand when we use shuffling (unless we pad to a fixed `max_length`). This should make it more straightforward to plug `nlp` into the Trainer.