feature: Add robust token counting with padding exclusion #40416
Conversation
…ens_seen variable, kept bool for backward compatibility, added string support, kept the default as is, and created robust test cases
…t and also fixed a code quality issue
Hello, I made the changes and the feature test case passes. I am now working on getting the remaining checks to pass. On my first commit, run_tests succeeded but code_quality failed, which I fixed. On the 3rd, 4th, and 5th commits, run_tests reports an inconsistent number of failures (3, 2, and 1 respectively). Could this be an environment issue, or something else?
cc @SunMarc
Thanks 😉
Thank you all!
…e#40416)
* created robust token counting by using the existing include_num_input_tokens_seen variable, kept bool for backward compatibility, added string support, and kept the default as is; also created robust test cases
* some codebase mismatched between my local and remote; committing to resolve it, and also fixed a code quality issue
* ci: retrigger tests
* another attempt to trigger CI for checks
Fixes #40401
This pull request improves the Trainer's input-token counting by adding an option to exclude padding tokens. It does this by extending the existing include_num_input_tokens_seen argument in TrainingArguments while preserving full backward compatibility.
What was the feature?
The goal was to give users more precise control over how input tokens are counted during training: padding tokens can now be excluded from the total count. This is useful for accurate logging and throughput analysis, especially in tasks with variable sequence lengths; a usage sketch follows.
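As a minimal, hypothetical usage sketch (output_dir and the commented-out Trainer wiring are placeholders; the accepted values are the ones described in this PR), enabling padding-aware counting might look like this:

```python
from transformers import TrainingArguments

# Hypothetical sketch: "non_padding" excludes padding tokens from the running
# num_input_tokens_seen count; True / "all" keeps the previous behavior.
args = TrainingArguments(
    output_dir="out",  # placeholder
    include_num_input_tokens_seen="non_padding",
    logging_steps=10,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()  # logged num_input_tokens_seen would now exclude padding
```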
What was done and why?
To implement this without introducing an extra boolean flag, the following changes were made:
Updated Existing Parameter: The include_num_input_tokens_seen argument in TrainingArguments now accepts the string values "all" and "non_padding" in addition to booleans. This gives clearer control while keeping full backward compatibility (True maps to "all", False to "no").
Improved Counting Logic: The Trainer's token counting logic was made more reliable. When "non_padding" is selected, the Trainer follows a prioritized approach (a combined sketch follows this list):
It first tries to use attention_mask.sum() for the most accurate count of non-padded tokens.
If attention_mask is not available, it counts tokens where input_ids are not equal to the pad_token_id.
If neither method works, it counts all tokens and logs a warning to inform the user.
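Below is a minimal sketch of the two pieces above — the backward-compatible normalization of the argument and the prioritized counting. The helper names (_normalize_token_count_mode, count_input_tokens) are illustrative assumptions, not the actual Trainer internals:

```python
import logging
from typing import Optional, Union

import torch

logger = logging.getLogger(__name__)


def _normalize_token_count_mode(value: Union[bool, str]) -> str:
    """Backward-compatible mapping: True -> "all", False -> "no"."""
    if value is True:
        return "all"
    if value is False:
        return "no"
    return value  # already "all", "non_padding", or "no"


def count_input_tokens(inputs: dict, pad_token_id: Optional[int] = None) -> int:
    """Prioritized non-padding count, mirroring the fallback order above."""
    input_ids = inputs.get("input_ids")
    attention_mask = inputs.get("attention_mask")

    # 1) Preferred: sum the attention mask (1 = real token, 0 = padding).
    if attention_mask is not None:
        return int(attention_mask.sum().item())

    # 2) Fallback: count positions whose id differs from pad_token_id.
    if pad_token_id is not None and input_ids is not None:
        return int((input_ids != pad_token_id).sum().item())

    # 3) Last resort: count everything and warn that padding is included.
    logger.warning(
        "Could not determine padding tokens (no attention_mask or pad_token_id); "
        "counting all tokens."
    )
    return int(input_ids.numel()) if input_ids is not None else 0
```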
Testing:
To ensure the reliability of this feature, a thorough test suite was added to tests/trainer/test_trainer.py (a simplified example appears after this list). The new tests cover:
All token counting modes ("all", "non_padding", True, False).
The new fallback logic, with specific test cases for when attention_mask is present, when it is absent (falling back to pad_token_id), and when neither is available (testing the warning and fallback to counting all tokens).
Full backward compatibility.
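For illustration only, tests along these lines (greatly simplified — the real tests in tests/trainer/test_trainer.py exercise the full Trainer; count_input_tokens is the hypothetical helper from the sketch above) would check the attention-mask path and the pad-token fallback:

```python
import torch


def test_non_padding_count_prefers_attention_mask():
    # Two sequences padded to length 4; 5 real tokens in total.
    batch = {
        "input_ids": torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]]),
        "attention_mask": torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]]),
    }
    assert count_input_tokens(batch) == 5


def test_non_padding_count_falls_back_to_pad_token_id():
    # No attention_mask: positions equal to pad_token_id (0) are excluded.
    batch = {"input_ids": torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])}
    assert count_input_tokens(batch, pad_token_id=0) == 5
```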
I noticed that torch_dtype has been replaced by dtype (#39782), so I made the corresponding manual changes in the affected files to avoid merge issues.
I also clicked the Update Branch button.
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.