
Fix mask token handling #14364

Merged · 3 commits · Dec 1, 2021
Conversation

qqaatw
Contributor

@qqaatw qqaatw commented Nov 11, 2021

What does this PR do?

This PR fixes an issue where the mask token was incorrectly matched against normalized input text.

This PR is related to #13594 and #13930.

@SaulLu @LysandreJik

@qqaatw qqaatw mentioned this pull request Nov 11, 2021
@LysandreJik LysandreJik requested a review from SaulLu November 11, 2021 15:56
@LysandreJik
Member

Hey @qqaatw, are you sure this is an issue with all tokenizers you refactored here? If so, then ideally there would be a test for all of them. The test would fail on current master, and would be solved by your PR.

If the problem is as widespread as you show it here, then it might even make sense to add it to the common tests.

@qqaatw
Contributor Author

qqaatw commented Nov 16, 2021

Hey @LysandreJik, thank you for your response.

I think this change will be covered by the following test once we extend it to both the Python and Rust tokenizers (discussed on this thread):

# Check that none of the special tokens are lowercased
sequence_with_special_tokens = "A " + " yEs ".join(tokenizer.all_special_tokens) + " B"
tokenized_sequence = tokenizer.tokenize(sequence_with_special_tokens)
for special_token in tokenizer.all_special_tokens:
    self.assertTrue(special_token in tokenized_sequence)

In fact, not all tokenizers fail on current master; it depends on the kind of special token.
For example, if the mask token is [MASK], then on current master the tokenizer incorrectly normalizes the input text first and only then tries to match the mask token, resulting in:

Today is a [MASK] day --normalize--> today is a [mask] day -> cannot match [MASK]

However, if the mask token is <mask>, the test always passes, whether on current master or with this PR, because the mask token is unchanged by normalization:

Today is a <mask> day --normalize--> today is a <mask> day -> can match <mask>

Therefore, the point of changing all tokenizers with special mask token handling is to ensure they behave consistently throughout the codebase.
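The ordering issue above can be sketched in plain Python. This is a minimal illustration, not the real tokenizer pipeline (real normalizers do far more than lowercasing, and the helper name is hypothetical):

```python
def mask_survives(text: str, mask_token: str, normalize_before_match: bool) -> bool:
    """Sketch of the ordering bug: if normalization (modeled here as
    plain lowercasing) runs before the special-token matcher, an
    uppercase mask token such as [MASK] can no longer be found."""
    if normalize_before_match:
        text = text.lower()  # buggy order: normalize first, match second
    return mask_token in text

# [MASK] is destroyed by lowercasing, so the order matters:
print(mask_survives("Today is a [MASK] day", "[MASK]", True))   # normalize first: no match
print(mask_survives("Today is a [MASK] day", "[MASK]", False))  # match first: found
# <mask> is already lowercase, so either order works:
print(mask_survives("Today is a <mask> day", "<mask>", True))
```

This is why only tokenizers with an uppercase mask token (and a lowercasing normalizer) fail on master.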

@SaulLu
Contributor

SaulLu commented Nov 16, 2021

Thank you very much for the additional information @qqaatw .

I agree with you that it is more "intuitive" for the default behavior of a mask token to be normalized=False.

However, since this doesn't necessarily solve a problem and potentially introduces behavior changes for our users, maybe it's worth leaving the settings as they were before. What do you think?

@qqaatw
Contributor Author

qqaatw commented Dec 1, 2021

@SaulLu I agree with your point. Except for the Albert and FNet tokenizers, the other tokenizers with special mask token handling don't have a do_lower_case option, so they would not fail the test_added_tokens_do_lower_case test.

So we only need to modify FNet, because Albert was already addressed in another PR.
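Concretely, the FNet fix amounts to declaring the mask token as a non-normalized added token. The class below is a hypothetical stand-in for tokenizers.AddedToken, written just to show the relevant flags rather than the real API:

```python
from dataclasses import dataclass

@dataclass
class AddedTokenSketch:
    """Hypothetical stand-in for tokenizers.AddedToken, showing only the
    flags relevant here. normalized=False tells the matcher to look for
    the token in the raw, un-normalized text, so do_lower_case can no
    longer break the match against an uppercase token like [MASK]."""
    content: str
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = True  # the usual default; the fix flips it off for the mask token

# The change for FNet is essentially:
mask_token = AddedTokenSketch("[MASK]", lstrip=True, rstrip=False, normalized=False)
print(mask_token.normalized)  # False
```

With normalized=False, the special-token matcher sees the original text, so the [MASK] example from earlier in the thread matches regardless of lowercasing.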

Contributor

@SaulLu SaulLu left a comment


Thanks a lot for this analysis and reverting the changes @qqaatw !

And I agree with you about the change for FNet. Given the default mask token argument and the do_lower_case argument, I too find this change justified. 🙂

Member

@LysandreJik LysandreJik left a comment


Great, thanks a lot for iterating on this @qqaatw !

@LysandreJik LysandreJik merged commit 934e279 into huggingface:master Dec 1, 2021