num_input_tokens_seen should exclude padding tokens #40401

@RylanSchaeffer

Description

Feature request

The Trainer can track the number of input tokens seen during training. I was puzzling over why the logged count was so much higher than the number of tokens in my dataset, and then I discovered that it does not exclude padding tokens:

# From the Trainer's inner training loop: numel() counts every element in the
# batch, so padding tokens are included in the running total.
input_tokens = inputs[main_input_name].numel()
input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
self.state.num_input_tokens_seen += (
    self.accelerator.gather(input_tokens).sum().cpu().item()
)
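
To see the discrepancy concretely, here is a minimal, self-contained example; the token IDs and the pad token id of 0 are made up for illustration:

import torch

# Hypothetical padded batch: true sequence lengths 3 and 5, right-padded to
# length 5 with an assumed pad token id of 0.
input_ids = torch.tensor([
    [101, 102, 103, 0, 0],
    [101, 102, 103, 104, 105],
])
attention_mask = (input_ids != 0).long()

print(input_ids.numel())            # 10 -- what the code above counts
print(attention_mask.sum().item())  # 8  -- actual non-padding tokens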

I see two solutions:

  1. Setting include_num_input_tokens_seen=True should exclude padding tokens by default (see the sketch after this list)

  2. Introduce a new TrainingArguments flag like include_num_input_tokens_seen_excluding_padding_tokens
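
A minimal sketch of what option 1 might look like, spliced into the loop quoted above and assuming the collated batch carries an attention_mask; this is an illustration, not the actual transformers implementation:

# Count only attended tokens when the batch provides an attention_mask;
# fall back to the current numel() behavior otherwise.
if "attention_mask" in inputs:
    input_tokens = inputs["attention_mask"].sum()
else:
    input_tokens = torch.tensor(inputs[main_input_name].numel())
input_tokens = input_tokens.to(device=self.args.device, dtype=torch.int64)
self.state.num_input_tokens_seen += (
    self.accelerator.gather(input_tokens).sum().cpu().item()
)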

Motivation

As described above, the logged number of input tokens includes padding, so it can be much higher than the actual number of tokens in the dataset, which makes the metric misleading.

Your contribution

If you tell me which solution you prefer, I might be able to draft a PR :)
