num_input_tokens_seen should exclude padding tokens #40401

@RylanSchaeffer

Description

Feature request

The Trainer can track the number of input tokens seen during training. I was puzzling over why the logged count was so much higher than the number of tokens in my dataset, and then I discovered that it does not exclude padding tokens:

# From the Trainer's inner training loop: numel() counts every element in the
# batch, so padding tokens are included in the running total.
input_tokens = inputs[main_input_name].numel()
input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
self.state.num_input_tokens_seen += (
    self.accelerator.gather(input_tokens).sum().cpu().item()
)
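
To see the discrepancy concretely, here is a minimal, self-contained example; the token IDs and the pad token id of 0 are made up for illustration:

import torch

# Hypothetical padded batch: true sequence lengths 3 and 5, right-padded to
# length 5 with an assumed pad token id of 0.
input_ids = torch.tensor([
    [101, 102, 103, 0, 0],
    [101, 102, 103, 104, 105],
])
attention_mask = (input_ids != 0).long()

print(input_ids.numel())            # 10 -- what the code above counts
print(attention_mask.sum().item())  # 8  -- actual non-padding tokens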

I see two solutions:

  1. Setting include_num_input_tokens_seen=True should exclude padding tokens by default (see the sketch after this list)

  2. Introduce a new TrainingArguments flag like include_num_input_tokens_seen_excluding_padding_tokens
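
A minimal sketch of what option 1 might look like, spliced into the loop quoted above and assuming the collated batch carries an attention_mask; this is an illustration, not the actual transformers implementation:

# Count only attended tokens when the batch provides an attention_mask;
# fall back to the current numel() behavior otherwise.
if "attention_mask" in inputs:
    input_tokens = inputs["attention_mask"].sum()
else:
    input_tokens = torch.tensor(inputs[main_input_name].numel())
input_tokens = input_tokens.to(device=self.args.device, dtype=torch.int64)
self.state.num_input_tokens_seen += (
    self.accelerator.gather(input_tokens).sum().cpu().item()
)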

Motivation

As described above, the logged number of input tokens includes padding, so it can be much higher than the actual number of tokens in the dataset, which makes the metric misleading.

Your contribution

If you tell me which solution you prefer, I might be able to draft a PR :)
