
Count of tokens seen during training in Trainer #27027

Closed · jpgard opened this issue Oct 23, 2023 · 6 comments · Fixed by #27274

Labels: Feature request (Request for a new feature)


jpgard commented Oct 23, 2023

Feature request

The Trainer API should track and log the number of tokens seen during training.

While it might sometimes be possible to back the number of tokens seen out of the FLOS, or to compute it by iterating over the whole dataset, it would make a lot of sense for the Trainer API to track it directly. It shouldn't be necessary to iterate over a model's entire training data just to compute the token count, yet that is what the only current token-related utility in Trainer, Trainer.num_tokens(), does.
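For reference, computing the count after the fact amounts to a full pass over the data, along these lines. This is a minimal sketch only, meant to illustrate why recomputing is too expensive; the helper name count_tokens is hypothetical, and it assumes each batch is a dict carrying an input_ids tensor:

```python
from torch.utils.data import DataLoader

def count_tokens(train_dataloader: DataLoader) -> int:
    """Count tokens with a full pass over the dataloader.

    Hypothetical helper: it iterates every batch just to sum sizes,
    which is exactly the cost this feature request wants to avoid.
    """
    total = 0
    for batch in train_dataloader:
        # Assumes each batch is a dict with an `input_ids` tensor.
        total += batch["input_ids"].numel()
    return total
```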

This can't currently be implemented in a callback, because callbacks don't have access to the training data (only the trainer state).
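To make the limitation concrete, here is a sketch of the per-step callback hook: it receives the TrainerState and assorted objects via kwargs, but not the batch that was just consumed, so there is nothing to count.

```python
from transformers import TrainerCallback

class TokenCountCallback(TrainerCallback):
    """Sketch illustrating the limitation, not a working counter."""

    def on_step_end(self, args, state, control, **kwargs):
        # kwargs carries objects such as the model, but not the batch
        # that was just consumed, so there is no input_ids tensor
        # available to count here.
        pass
```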

Motivation

Number of tokens seen is an essential metric tracked in nearly every LLM training run. It is widely considered one of the fundamental drivers of model quality, and tokens seen during training is reported for nearly every major LLM release. Any language model developer using Hugging Face would likely want this metric for their training runs; it may be even more important and useful than the FLOS, and perhaps as important as the number of gradient steps.

In any case, it's an extremely useful number to have, and it has to be tracked during training, as the model consumes examples, rather than reconstructed after the fact.

Your contribution

I'm willing to contribute this but would like some guidance on the overall design first.

In particular, here's what I think a reasonable implementation would include (a rough sketch follows the list):

  • Add a global_tokens_seen field, or similar, to the TrainerState. This would add only a single integer value to the TrainerState.
  • Increment it during Trainer._inner_training_loop().
  • Probably add this information to the logging outputs.
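As one possible shape for this, here is a minimal sketch that overrides Trainer.training_step for illustration rather than patching _inner_training_loop directly. The field name num_input_tokens_seen is an assumption, and the cross-process aggregation a real implementation would need is only noted in a comment:

```python
from transformers import Trainer

class TokenCountingTrainer(Trainer):
    """Sketch: accumulate a token count on the TrainerState as batches
    are consumed. `num_input_tokens_seen` is an assumed field name."""

    def training_step(self, model, inputs):
        if "input_ids" in inputs:
            # Counts only this process's shard; a distributed run would
            # need to all-reduce (sum) the count across processes.
            self.state.num_input_tokens_seen = (
                getattr(self.state, "num_input_tokens_seen", 0)
                + inputs["input_ids"].numel()
            )
        return super().training_step(model, inputs)

    def log(self, logs):
        # Surface the running count in the logging outputs.
        logs["num_input_tokens_seen"] = getattr(
            self.state, "num_input_tokens_seen", 0
        )
        super().log(logs)
```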

What do the folks at HF think about that?

@geronimi73

+1
I think we need this feature

@ArthurZucker ArthurZucker added the Feature request Request for a new feature label Oct 24, 2023
@ArthurZucker (Collaborator)

cc @muellerzr seems nice if we can make it efficient!

muellerzr (Contributor) commented Oct 24, 2023

Is the tokens_per_second we already have as part of #25858 enough? Otherwise we can definitely add it :)

jpgard (Author) commented Oct 24, 2023

Yeah, tokens/sec doesn't cut it for many use cases (although it is still very useful!!) -- similar to how tracking steps/sec doesn't obviate the need for a global step count.

If you can add it, that would be amazing; I am sure this would be a useful feature for almost anyone training a language model. And I think there are some subtleties to making it work right in a distributed setting that you would probably be much better at handling.

@muellerzr muellerzr self-assigned this Oct 24, 2023
@raghukiran1224

Agree tokens/sec/gpu is useful, but it fails to account for pad tokens, and if we were to use SFTTrainer with packing set to False, this number can be way off. So we need a feature that tracks actual tokens seen.
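A hedged sketch of counting only real tokens: sum the attention mask rather than taking input_ids.numel(), assuming the collator emits an attention_mask with zeros at padding positions (the helper name is hypothetical):

```python
def count_real_tokens(batch: dict) -> int:
    """Count non-padding tokens in a batch.

    Assumes the collator produced an `attention_mask` tensor in which
    padding positions are 0 and real tokens are 1.
    """
    if "attention_mask" in batch:
        return int(batch["attention_mask"].sum().item())
    # No mask available: fall back to counting every position.
    return batch["input_ids"].numel()
```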

jpgard (Author) commented Nov 15, 2023

thanks @muellerzr !!
