@mjaliz mjaliz commented Nov 11, 2025

What does this PR do?

This PR fixes a bug in the DataCollatorForLanguageModeling class that causes a TypeError when instantiating the data collator with whole_word_mask=True.

Problem

When users try to instantiate DataCollatorForLanguageModeling with whole_word_mask=True, they encounter the following error:

TypeError: category must be a Warning subclass, not 'str'

This error occurs in the __post_init__ method at lines 724-727, where warnings.warn() is called with two separate string arguments. The second string is interpreted as the category parameter, which must be a Warning subclass (such as UserWarning), not a string.
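
For reference, the failure can be reproduced without transformers at all. The snippet below is a minimal sketch using only the standard library; `buggy_warn` is a hypothetical stand-in for the affected lines in `__post_init__`, not the actual library code.

```python
import warnings


def buggy_warn():
    # Hypothetical stand-in for the original call. The comma after the first
    # string makes the second string a separate positional argument, so it
    # fills the `category` parameter of warnings.warn().
    warnings.warn(
        "Random token replacement is not supported with whole word masking.",
        "Setting mask_replace_prob to 1.",
    )


try:
    buggy_warn()
    error_message = None
except TypeError as exc:
    # warnings.warn() rejects a non-Warning category before emitting anything.
    error_message = str(exc)

print(error_message)
```

The TypeError is raised regardless of the active warning filters, which is why instantiation fails outright rather than just printing a malformed warning.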

Solution

The fix combines the two warning message strings into a single message and explicitly passes UserWarning as the category parameter, following the pattern used elsewhere in the same file (e.g., lines 719-722).

Before:
warnings.warn(
    "Random token replacement is not supported with whole word masking.",
    "Setting mask_replace_prob to 1.",
)
After:
warnings.warn(
    "Random token replacement is not supported with whole word masking. "
    "Setting mask_replace_prob to 1.",
    UserWarning,
)
This ensures the data collator can be instantiated correctly when using whole word masking for masked language modeling tasks.
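
A quick way to check the corrected call in isolation is to capture the warning with the standard library. This is only a sketch: `fixed_warn` is a hypothetical helper that mirrors the "After" snippet, and the check does not exercise DataCollatorForLanguageModeling itself.

```python
import warnings


def fixed_warn():
    # Hypothetical helper mirroring the corrected call. The two adjacent
    # string literals (no comma between them) are joined into one message,
    # and UserWarning is passed as the category.
    warnings.warn(
        "Random token replacement is not supported with whole word masking. "
        "Setting mask_replace_prob to 1.",
        UserWarning,
    )


with warnings.catch_warnings(record=True) as caught:
    # Make sure the warning is recorded even if it was already shown once.
    warnings.simplefilter("always")
    fixed_warn()

first = caught[0]
print(first.category.__name__, "-", first.message)
```

Note the trailing space inside the first literal: without it, the joined message would read "masking.Setting".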

Before submitting

 This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
 Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?
 Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
 Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
 Did you write any new necessary tests?

Who can review?

@SunMarc @ArthurZucker - This is a small bug fix in the data collator that prevents instantiation with whole word masking enabled.
Note: Discovered and fixed by @mjaliz

@Rocketknight1 Rocketknight1 left a comment

Yes, thank you for the fix! The extra comma made the second line appear to be the warning category, so we crashed in this case instead of throwing the warning correctly.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 Rocketknight1 merged commit 2072f30 into huggingface:main Nov 11, 2025
23 checks passed
mjaliz commented Nov 11, 2025

Thank you so much for your attention.
