
Count of tokens seen during training in Trainer #27027

Closed · jpgard opened this issue Oct 23, 2023 · 6 comments · Fixed by #27274

Labels: Feature request (Request for a new feature)


jpgard commented Oct 23, 2023

Feature request

The Trainer API should track and log the number of tokens seen during training.

While it might sometimes be possible to back the number of tokens seen out of the FLOS, or to compute it by iterating over the whole dataset, it would make a lot of sense for the Trainer API to track it directly. It shouldn't be necessary to iterate over a model's entire training data just to compute the token count, yet that is what the only current token-related utility in Trainer, Trainer.num_tokens(), does.
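For reference, computing the count after the fact amounts to a full pass over the data, along these lines. This is a minimal sketch only, meant to illustrate why recomputing is too expensive; the helper name count_tokens is hypothetical, and it assumes each batch is a dict carrying an input_ids tensor:

```python
from torch.utils.data import DataLoader

def count_tokens(train_dataloader: DataLoader) -> int:
    """Count tokens with a full pass over the dataloader.

    Hypothetical helper: it iterates every batch just to sum sizes,
    which is exactly the cost this feature request wants to avoid.
    """
    total = 0
    for batch in train_dataloader:
        # Assumes each batch is a dict with an `input_ids` tensor.
        total += batch["input_ids"].numel()
    return total
```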

This can't currently be implemented in a callback, because callbacks don't have access to the training data (only the trainer state).
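To make the limitation concrete, here is a sketch of the per-step callback hook: it receives the TrainerState and assorted objects via kwargs, but not the batch that was just consumed, so there is nothing to count.

```python
from transformers import TrainerCallback

class TokenCountCallback(TrainerCallback):
    """Sketch illustrating the limitation, not a working counter."""

    def on_step_end(self, args, state, control, **kwargs):
        # kwargs carries objects such as the model, but not the batch
        # that was just consumed, so there is no input_ids tensor
        # available to count here.
        pass
```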

Motivation

Number of tokens seen is an essential metric tracked in nearly every LLM training run. It is widely considered one of the fundamental drivers of model quality, and tokens seen during training is reported for nearly every major LLM release. Any language model developer using Hugging Face would likely want this metric for their training runs; it may be even more important and useful than the FLOS, and perhaps as important as the number of gradient steps.

In any case, it's an extremely useful number to have, and it has to be tracked during training, as the model consumes examples, rather than reconstructed after the fact.

Your contribution

I'm willing to contribute this but would like some guidance on the overall design first.

In particular, here's what I think a reasonable implementation would include (a rough sketch follows the list):

  • Add a global_tokens_seen field, or similar, to the TrainerState. This would add only a single integer value to the TrainerState.
  • Increment it during Trainer._inner_training_loop().
  • Probably add this information to the logging outputs.
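As one possible shape for this, here is a minimal sketch that overrides Trainer.training_step for illustration rather than patching _inner_training_loop directly. The field name num_input_tokens_seen is an assumption, and the cross-process aggregation a real implementation would need is only noted in a comment:

```python
from transformers import Trainer

class TokenCountingTrainer(Trainer):
    """Sketch: accumulate a token count on the TrainerState as batches
    are consumed. `num_input_tokens_seen` is an assumed field name."""

    def training_step(self, model, inputs):
        if "input_ids" in inputs:
            # Counts only this process's shard; a distributed run would
            # need to all-reduce (sum) the count across processes.
            self.state.num_input_tokens_seen = (
                getattr(self.state, "num_input_tokens_seen", 0)
                + inputs["input_ids"].numel()
            )
        return super().training_step(model, inputs)

    def log(self, logs):
        # Surface the running count in the logging outputs.
        logs["num_input_tokens_seen"] = getattr(
            self.state, "num_input_tokens_seen", 0
        )
        super().log(logs)
```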

What do the folks at HF think about that?

@geronimi73

+1
I think we need this feature

@ArthurZucker ArthurZucker added the Feature request Request for a new feature label Oct 24, 2023
@ArthurZucker (Collaborator)

cc @muellerzr seems nice if we can make it efficient!

muellerzr (Contributor) commented Oct 24, 2023

Is the tokens_per_second we already have as part of #25858 enough? Otherwise we can definitely add it :)

jpgard (Author) commented Oct 24, 2023

Yeah, tokens/sec doesn't cut it for many use cases (although it is still very useful!!) -- similar to how tracking steps/sec doesn't obviate the need for a global step count.

If you can add it, that would be amazing; I am sure this would be a useful feature for almost anyone training a language model. And I think there are some subtleties to making it work right in a distributed setting that you would probably be much better at handling.

@muellerzr muellerzr self-assigned this Oct 24, 2023
@raghukiran1224

Agree tokens/sec/gpu is useful, but it fails to account for pad tokens, and if we were to use SFTTrainer with packing set to False, this number can be way off. So we need a feature that tracks actual tokens seen.
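A hedged sketch of counting only real tokens: sum the attention mask rather than taking input_ids.numel(), assuming the collator emits an attention_mask with zeros at padding positions (the helper name is hypothetical):

```python
def count_real_tokens(batch: dict) -> int:
    """Count non-padding tokens in a batch.

    Assumes the collator produced an `attention_mask` tensor in which
    padding positions are 0 and real tokens are 1.
    """
    if "attention_mask" in batch:
        return int(batch["attention_mask"].sum().item())
    # No mask available: fall back to counting every position.
    return batch["input_ids"].numel()
```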

jpgard (Author) commented Nov 15, 2023

thanks @muellerzr !!
