
Fix mask token handling #14364

Merged · 3 commits · Dec 1, 2021
Conversation

qqaatw
Contributor

@qqaatw qqaatw commented Nov 11, 2021

What does this PR do?

This PR fixes an issue where the mask token was incorrectly matched against normalized input text.

This PR is related to #13594 and #13930.

@SaulLu @LysandreJik

@qqaatw qqaatw mentioned this pull request Nov 11, 2021
@LysandreJik LysandreJik requested a review from SaulLu November 11, 2021 15:56
@LysandreJik
Member

Hey @qqaatw, are you sure this is an issue with all tokenizers you refactored here? If so, then ideally there would be a test for all of them. The test would fail on current master, and would be solved by your PR.

If the problem is as widespread as you show it here, then it might even make sense to add it to the common tests.

@qqaatw
Contributor Author

qqaatw commented Nov 16, 2021

Hey @LysandreJik, thank you for your response.

I think this change will be covered by the following test once we extend it to both the Python and Rust tokenizers (discussed on this thread):

# Check that none of the special tokens are lowercased
sequence_with_special_tokens = "A " + " yEs ".join(tokenizer.all_special_tokens) + " B"
tokenized_sequence = tokenizer.tokenize(sequence_with_special_tokens)
for special_token in tokenizer.all_special_tokens:
    self.assertTrue(special_token in tokenized_sequence)

In fact, not all tokenizers fail on current master; it depends on the kind of special token.
For example, if the mask token is [MASK], then on current master the tokenizer incorrectly normalizes the input text first and only then tries to match the mask token, resulting in:

Today is a [MASK] day --normalize--> today is a [mask] day -> cannot match [MASK]

However, if the mask token is <mask>, the test always passes, whether on current master or with this PR, because the mask token is unchanged by normalization:

Today is a <mask> day --normalize--> today is a <mask> day -> can match <mask>

Therefore, the point of changing all tokenizers with special mask token handling is to ensure they behave consistently throughout the codebase.
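The ordering issue above can be sketched in plain Python. This is a minimal illustration, not the real tokenizer pipeline (real normalizers do far more than lowercasing, and the helper name is hypothetical):

```python
def mask_survives(text: str, mask_token: str, normalize_before_match: bool) -> bool:
    """Sketch of the ordering bug: if normalization (modeled here as
    plain lowercasing) runs before the special-token matcher, an
    uppercase mask token such as [MASK] can no longer be found."""
    if normalize_before_match:
        text = text.lower()  # buggy order: normalize first, match second
    return mask_token in text

# [MASK] is destroyed by lowercasing, so the order matters:
print(mask_survives("Today is a [MASK] day", "[MASK]", True))   # normalize first: no match
print(mask_survives("Today is a [MASK] day", "[MASK]", False))  # match first: found
# <mask> is already lowercase, so either order works:
print(mask_survives("Today is a <mask> day", "<mask>", True))
```

This is why only tokenizers with an uppercase mask token (and a lowercasing normalizer) fail on master.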

@SaulLu
Contributor

SaulLu commented Nov 16, 2021

Thank you very much for the additional information @qqaatw .

I agree with you that it is more "intuitive" for the default behavior of a mask token to be normalized=False.

However, since this doesn't necessarily solve a problem and potentially introduces behavior changes for our users, maybe it's worth leaving the settings as they were before. What do you think?

@qqaatw
Contributor Author

qqaatw commented Dec 1, 2021

@SaulLu I agree with your point. Except for the Albert and FNet tokenizers, the other tokenizers with special mask token handling don't have a do_lower_case option, so they would not fail the test_added_tokens_do_lower_case test.

So we only need to modify FNet, because Albert was already addressed in another PR.
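Concretely, the FNet fix amounts to declaring the mask token as a non-normalized added token. The class below is a hypothetical stand-in for tokenizers.AddedToken, written just to show the relevant flags rather than the real API:

```python
from dataclasses import dataclass

@dataclass
class AddedTokenSketch:
    """Hypothetical stand-in for tokenizers.AddedToken, showing only the
    flags relevant here. normalized=False tells the matcher to look for
    the token in the raw, un-normalized text, so do_lower_case can no
    longer break the match against an uppercase token like [MASK]."""
    content: str
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = True  # the usual default; the fix flips it off for the mask token

# The change for FNet is essentially:
mask_token = AddedTokenSketch("[MASK]", lstrip=True, rstrip=False, normalized=False)
print(mask_token.normalized)  # False
```

With normalized=False, the special-token matcher sees the original text, so the [MASK] example from earlier in the thread matches regardless of lowercasing.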

Contributor

@SaulLu SaulLu left a comment


Thanks a lot for this analysis and reverting the changes @qqaatw !

And I agree with you about the change for FNet. Given the default mask token argument and the do_lower_case argument, I too find this change justified. 🙂

Member

@LysandreJik LysandreJik left a comment


Great, thanks a lot for iterating on this @qqaatw !

@LysandreJik LysandreJik merged commit 934e279 into huggingface:master Dec 1, 2021