Count of tokens seen during training in Trainer #27027
Comments
+1
cc @muellerzr seems nice if we can make it efficient!
Is the
Yeah, tokens/sec doesn't cut it for many use cases (although it is still very useful!) -- similar to how tracking steps/sec doesn't obviate the need for a global step count. If you can add it, that would be amazing; I am sure this would be a useful feature to almost anyone training a language model. And I think there are some subtleties to making it work correctly in a distributed setting that you would probably be much better at handling...
agree tokens/sec/gpu is useful, but it fails to track the cumulative number of tokens seen
thanks @muellerzr!!
Feature request
The `Trainer` API should track and log the number of tokens seen during training.

While it might sometimes be possible to back out the number of tokens seen from the FLOS, or by iterating over the whole dataset, it would make a lot of sense for the `Trainer` API to track this count directly. It shouldn't be necessary to iterate over the entire training dataloader just to compute the token count, which is what the only current token-related helper in `Trainer`, `Trainer.num_tokens()`, does.

This can't currently be implemented in a callback, because callbacks only have access to the trainer state, not the training data.
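For reference, here is a minimal sketch of the kind of workaround this forces today: subclassing `Trainer` and counting tokens per batch. The class and attribute names are illustrative (not part of the library), and the count is per-process rather than aggregated across workers, which is exactly the distributed subtlety mentioned in the comments above.

```python
# Hypothetical workaround sketch (names are illustrative, not a transformers API).
from transformers import Trainer


class TokenCountingTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokens_seen = 0  # per-process count; not aggregated across workers

    def training_step(self, model, inputs, *args, **kwargs):
        # Prefer the attention mask so padding tokens are not counted.
        if "attention_mask" in inputs:
            self.tokens_seen += int(inputs["attention_mask"].sum().item())
        elif "input_ids" in inputs:
            self.tokens_seen += inputs["input_ids"].numel()
        return super().training_step(model, inputs, *args, **kwargs)
```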
Motivation
Number of tokens seen is an essential metric tracked in nearly every LLM training run. It is widely considered one of the fundamental drivers of model quality (the number of tokens seen during training is reported for nearly every major LLM release). Any language model developer using Hugging Face would presumably like to know this metric for their training runs -- it may be even more important and useful than the FLOS, and is perhaps as important as the number of gradient steps.
In any case, it's an extremely useful number to have, and it must be tracked during training as the model consumes examples.
Your contribution
I'm willing to contribute this but would like some guidance on the overall design first.
In particular, here's what I think a reasonable implementation would include (a rough sketch follows the list):

- Adding a field `global_tokens_seen` (or similar) to the `TrainerState`. This would add only a single integer value to the `TrainerState`.
- Updating that counter from each batch inside `Trainer._inner_training_loop()`.
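To make the proposal concrete, here is a rough sketch under the assumptions above; the field name, the helper function, and the exact insertion point in the loop are all assumptions for illustration, not the actual change.

```python
# Illustrative sketch only: the real change would live inside transformers'
# own TrainerState and Trainer._inner_training_loop(), not in user code.
from dataclasses import dataclass

import torch
from transformers import TrainerState


@dataclass
class TrainerStateWithTokens(TrainerState):
    global_tokens_seen: int = 0  # the single integer added to the state


def count_batch_tokens(inputs: dict) -> torch.Tensor:
    """Tokens in one batch; uses the attention mask to skip padding when present."""
    if "attention_mask" in inputs:
        return inputs["attention_mask"].sum()
    return torch.tensor(inputs["input_ids"].numel())


# Inside Trainer._inner_training_loop(), roughly where global_step is updated,
# something like the following would keep the count global across processes:
#
#     batch_tokens = count_batch_tokens(inputs).to(self.args.device)
#     self.state.global_tokens_seen += int(self.accelerator.gather(batch_tokens).sum())
```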
What do the folks at HF think about that?