Fix mask token handling #14364
Conversation
Hey @qqaatw, are you sure this is an issue with all the tokenizers you refactored here? If so, then ideally there would be a test for all of them. The test would fail on current master. If the problem is as widespread as you show it here, then it might even make sense to add it to the common tests.
Hey @LysandreJik, thank you for your response. I think this change will be covered by the test below once we extend it to both the Python and Rust tokenizers (discussed on this thread):
transformers/tests/test_tokenization_common.py, lines 651 to 656 at 1cc453d
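A rough sketch of what such an extension could look like (hypothetical test name; `get_tokenizer` / `get_rust_tokenizer` are assumed to be the usual `TokenizerTesterMixin` helpers, and this is not the test at the lines referenced above):

```python
def test_mask_token_matches_raw_text(self):
    # Hypothetical sketch, not the test at the referenced lines: input containing
    # the literal mask token should still produce the mask token id, even when the
    # tokenizer normalizes (e.g. lowercases) its input, for both the Python and
    # the Rust-backed tokenizer.
    for tokenizer in [self.get_tokenizer(), self.get_rust_tokenizer()]:
        if tokenizer.mask_token is None:
            continue
        ids = tokenizer.encode(f"Hello {tokenizer.mask_token}", add_special_tokens=False)
        self.assertIn(tokenizer.mask_token_id, ids)
```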
As a matter of fact, not all tokenizers would fail on current master, since it depends on the kind of special tokens each one uses.
However, if the mask token is
Therefore, changing all tokenizers that do special mask token handling is simply meant to ensure they behave consistently throughout the codebase.
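A minimal sketch of the pattern being discussed, assuming the usual AddedToken-based setup used by the slow tokenizers (the exact flags vary per tokenizer, which is where the inconsistency comes from):

```python
from transformers import AddedToken

# Sketch only: many tokenizers wrap the mask token in an AddedToken so its
# matching behavior is explicit. lstrip=True lets "[MASK]" absorb the space in
# front of it; whether the token is matched against raw or normalized text is a
# separate setting, and keeping it consistent across tokenizers is the point above.
mask_token = "[MASK]"
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
```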
Thank you very much for the additional information, @qqaatw. I agree with you that it is more "intuitive" that the default behavior for a mask token is
However, since this doesn't necessarily solve a problem and could potentially introduce changes for our users, maybe it's worth leaving the settings as they were before. What do you think?
This reverts commit daaa3f5.
@SaulLu I agree with your point. Except for FNet, I have reverted the changes, so we only need to modify the FNet tokenizer.
Thanks a lot for this analysis and reverting the changes @qqaatw !
And I agree with you about the change for FNet. For me too, given the default mask token argument and the do_lower_case argument, this change is justified. 🙂
Great, thanks a lot for iterating on this @qqaatw !
What does this PR do?
This PR fixes an issue where the mask token incorrectly attempts to match against normalized input text.
It is related to #13594 and #13930.
@SaulLu @LysandreJik
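For illustration, a small hedged example of the failure mode; ALBERT and the `albert-base-v2` checkpoint are just one lowercasing setup picked as an example, not necessarily one of the tokenizers touched by this PR:

```python
from transformers import AlbertTokenizerFast

tok = AlbertTokenizerFast.from_pretrained("albert-base-v2")

# ALBERT lowercases its input during normalization. If the mask token were
# matched against the normalized text, the literal "[MASK]" would already have
# become "[mask]" and would no longer be recognized as a special token.
ids = tok.encode(f"Paris is the {tok.mask_token} of France.", add_special_tokens=False)
assert tok.mask_token_id in ids  # expected to hold once the mask token matches the raw text
```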