Add tokenizer to Trainer #6689
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #6689      +/-   ##
==========================================
- Coverage   78.98%   77.24%    -1.74%
==========================================
  Files         156      156
  Lines       28398    28405        +7
==========================================
- Hits        22429    21941      -488
- Misses       5969     6464      +495
Continue to review full report at Codecov.
I like this a lot and feel like it should have been like this from the start! Since we consider models effectively as model-tokenizer pairs, I think it makes a lot of sense for the trainer to handle both.
Especially with regard to saving/reloading, which has always been an issue: users didn't understand how to reload from checkpoints because the tokenizer was not saved in the same folder.
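For illustration, here is a minimal sketch of what reloading from a checkpoint folder looks like once the tokenizer is saved alongside the model; the checkpoint path and model class are hypothetical placeholders, not taken from the PR:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint folder written by Trainer during training.
checkpoint_dir = "output/checkpoint-500"

# Because the tokenizer files now live next to the model weights,
# both pieces can be reloaded from the same directory.
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir)
```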
Nice! I like it. Ok for me to do the same on the TF one 👍
This reverts commit 6d89ce0.
Not entirely sure about this change, as there is a trade-off between API complexity and ease of use.
This PR adds `tokenizer` as an optional argument to `Trainer` (if this is approved, I will do the same for `TFTrainer`; I have a few recent changes to port there but was mainly waiting for @jplu to be back from vacation to bring the two APIs on par). The benefits are:
- a smarter default `data_collator` that will automatically pad examples if the tokenizer is provided, so the user doesn't have to learn about data collators for simple examples;
- the tokenizer is saved along with the model by `Trainer` for the intermediary checkpoints, so a checkpoint folder can be used directly with our scripts when resuming an interrupted training.

As for the bad part, it's just that it adds a new argument to `Trainer`.
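As a rough sketch of the proposed usage, assuming the argument works as described above (the model checkpoint, dummy dataset, and output directory below are placeholder assumptions, not taken from the PR):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Tiny dummy dataset: tokenized examples of different lengths,
# so the default collator has something to pad.
train_dataset = [
    {**tokenizer("a short example"), "labels": 0},
    {**tokenizer("a somewhat longer example sentence"), "labels": 1},
]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output"),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # the new optional argument this PR introduces
)
trainer.train()
```

With `tokenizer` supplied, the user gets dynamic padding without constructing a data collator by hand, and each checkpoint under `output/` also contains the tokenizer files.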