Add tokenizer to Trainer #6689

Merged: 2 commits merged into master on Aug 25, 2020

Conversation

sgugger (Collaborator) commented on Aug 24, 2020

I'm not entirely sure about this change, as there is a trade-off between API complexity and ease of use.

This PR adds tokenizer as an optional argument to Trainer (if this is approved, I will do the same for TFTrainer; I have a few recent changes to port there but was mainly waiting for @jplu to be back from vacation to bring the two APIs on par).

The benefits are:

  • we can have a smart default data_collator that automatically pads examples when the tokenizer is provided, so the user doesn't have to learn about data collators for simple examples (see the sketch after this list).
  • we can save the tokenizer along with the model directly inside Trainer for the intermediary checkpoints, so a checkpoint folder can be used directly with our scripts when resuming an interrupted training.
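
To make the first point concrete, here is a minimal sketch of the kind of collator selection this enables, assuming the library's DataCollatorWithPadding; the helper name is illustrative and this is not the exact code of the PR:

```python
from transformers import DataCollatorWithPadding, default_data_collator

# Hypothetical helper mirroring the behavior described above; the real
# change wires this logic into Trainer itself.
def pick_data_collator(tokenizer=None):
    if tokenizer is not None:
        # Dynamically pads each batch to its longest example.
        return DataCollatorWithPadding(tokenizer)
    # Without a tokenizer, fall back to plain tensor stacking, which
    # assumes the features were already padded to equal length.
    return default_data_collator
```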

As for the downside, it is simply that it adds one more argument to Trainer.
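
In practice, the new argument would be used along these lines (a sketch under assumptions: the model name and output directory are placeholders, and the saving behavior is as described above):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    tokenizer=tokenizer,  # the new optional argument from this PR
)

# With a tokenizer attached, saving writes the tokenizer files next to
# the model weights, so "out" is a complete, reloadable checkpoint.
trainer.save_model("out")
```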

codecov bot commented on Aug 24, 2020

Codecov Report

Merging #6689 into master will decrease coverage by 1.73%.
The diff coverage is 55.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6689      +/-   ##
==========================================
- Coverage   78.98%   77.24%   -1.74%     
==========================================
  Files         156      156              
  Lines       28398    28405       +7     
==========================================
- Hits        22429    21941     -488     
- Misses       5969     6464     +495     
Impacted Files Coverage Δ
src/transformers/trainer.py 53.66% <55.55%> (-0.13%) ⬇️
src/transformers/modeling_tf_albert.py 21.47% <0.00%> (-69.44%) ⬇️
src/transformers/tokenization_xlm.py 16.26% <0.00%> (-66.67%) ⬇️
src/transformers/pipelines.py 25.63% <0.00%> (-54.32%) ⬇️
src/transformers/optimization.py 58.88% <0.00%> (-36.67%) ⬇️
src/transformers/modeling_tf_gpt2.py 65.68% <0.00%> (-29.33%) ⬇️
src/transformers/optimization_tf.py 33.33% <0.00%> (-24.33%) ⬇️
src/transformers/modeling_tf_auto.py 48.79% <0.00%> (-18.08%) ⬇️
src/transformers/data/processors/squad.py 13.76% <0.00%> (-14.38%) ⬇️
src/transformers/modeling_auto.py 64.36% <0.00%> (-14.37%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abc0202...54feec2.

LysandreJik (Member) left a comment

I like this a lot and feel like it should have been like this from the start! Since we consider models effectively as model-tokenizer pairs, I think it makes a lot of sense for the trainer to handle both.

Especially with regard to saving/reloading, which has always been an issue: users didn't understand how to reload from checkpoints because the tokenizers were not saved in the same folder.
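
As an illustration of the reload path this enables, a sketch with a hypothetical checkpoint directory (Trainer names intermediate checkpoints checkpoint-<step>):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "out/checkpoint-500"  # hypothetical intermediate checkpoint folder
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
# Works because the tokenizer files are now saved alongside the model.
tokenizer = AutoTokenizer.from_pretrained(ckpt)
```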

jplu (Contributor) commented on Aug 25, 2020

Nice! I like it. Ok for me to do the same on the TF one 👍

sgugger merged commit 124c3d6 into master on Aug 25, 2020
sgugger deleted the trainer_tokenizer branch on Aug 25, 2020 at 11:47
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020