Closed
Labels: Feature request (request for a new feature)
Description
Feature request
The Trainer enables tracking the number of input tokens. I was puzzled that the logged number of input tokens was so much higher than the number of tokens in my dataset, until I discovered that the logged count does not exclude padding tokens:
```python
input_tokens = inputs[main_input_name].numel()
input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
self.state.num_input_tokens_seen += (
    self.accelerator.gather(input_tokens).sum().cpu().item()
)
```
I see two solutions:

1. Setting `include_num_input_tokens_seen=True` should exclude padding tokens by default
2. Introduce a new `TrainingArguments` flag like `include_num_input_tokens_seen_excluding_padding_tokens`
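Either way, the counting itself could sum the batch's `attention_mask` (1 for real tokens, 0 for padding) instead of calling `numel()`. A minimal sketch of that idea, assuming the batch dict carries an `attention_mask` (the helper name `count_input_tokens` is hypothetical, not part of the Trainer):

```python
import torch

def count_input_tokens(inputs, main_input_name="input_ids"):
    """Count tokens in a batch, excluding padding when a mask is available."""
    if "attention_mask" in inputs:
        # attention_mask is 1 for real tokens and 0 for padding,
        # so its sum is the number of non-padding tokens
        return int(inputs["attention_mask"].sum().item())
    # Fallback: no mask available, count every element (includes padding)
    return inputs[main_input_name].numel()

# Example: a batch of 2 sequences padded to length 5
batch = {
    "input_ids": torch.tensor([[5, 6, 7, 0, 0], [5, 6, 7, 8, 9]]),
    "attention_mask": torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]),
}
print(count_input_tokens(batch))  # 8 real tokens, versus numel() == 10
```

The same `gather(...).sum()` aggregation across processes would still apply afterwards; only the per-batch count changes.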
Motivation
I was puzzled that the logged number of input tokens was so much higher than the number of tokens in my dataset, until I discovered that the logged count does not exclude padding tokens.
Your contribution
If you tell me which solution you prefer, I might be able to draft a PR :)