Fix fast tokenization problems #13930
Conversation
I'll let @SaulLu decide on this since she is the one who suggested the changes :-)
(FYI she's on vacation until the end of next week.)
Thanks, no problem.
Thank you very much for working on this issue. To take up the two problems you mention:
- I'm not quite sure this is a "problem". As explained in this docstring, the recommended workflow is to do:
# You can link tokens to special vocabulary when instantiating
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
# You should be sure '<unk>' is in the vocabulary when doing that.
# Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)
Do you have a use case in mind in which this workflow would not suit you?
Moreover, the slow tokenizer does not add the special token to the vocabulary if it does not exist, so unless I'm missing something, this PR would introduce a difference in behavior between the slow and the fast tokenizers of type Albert.
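To illustrate why the recommended workflow insists that the token already be in the vocabulary, here is a toy pure-Python sketch (invented for illustration, not the transformers implementation) of what goes wrong when a configured `unk_token` has no vocabulary id:

```python
# Toy sketch (not the real transformers code): a minimal vocab lookup
# showing why a special token must exist in the vocabulary before use.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

def encode(tokens, vocab, unk_token="[UNK]"):
    # Unknown tokens fall back to the unk_token's id; if the configured
    # unk_token itself is missing from the vocab, lookup fails entirely.
    if unk_token not in vocab:
        raise KeyError(f"unk_token {unk_token!r} is not in the vocabulary")
    unk_id = vocab[unk_token]
    return [vocab.get(t, unk_id) for t in tokens]

print(encode(["hello", "xyz"], vocab))  # [1, 0]
# Configuring unk_token='<unk>' without adding it first would raise KeyError.
```

This mirrors the docstring's advice: either pick a token already in the vocabulary, or call `tokenizer.add_special_tokens(...)` first.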
With your PR, I obtain:
from transformers import AlbertTokenizerFast, AlbertTokenizer
tokenizer = AlbertTokenizer("tests/fixtures/spiece.model", cls_token="[OAA]")
print(tokenizer._cls_token)
print(tokenizer.tokenize('this is a [OAA]'))
# Outputs:
# [OAA]
# ['▁this', '▁is', '▁a', '▁[', 'oa', 'a', ']']
tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model", cls_token="[OAA]")
print(tokenizer._cls_token)
print(tokenizer.tokenize('this is a [OAA]'))
# Outputs:
# [OAA]
# ['▁this', '▁is', '▁a', '▁', '[OAA]']
Really happy to hear your thoughts about that!
- As mentioned in this PR, I completely agree with this analysis and personally I think that what you propose in
src/transformers/models/albert/tokenization_albert.py
reflects more accurately the way to use the mask token.
Nevertheless, it changes the behavior of this tokenizer, and I would like to be sure that it is a change also approved by @sgugger and @LysandreJik. If it is, we could make another PR applying the same change to all the other tokenizers that have a "similar" mask token to Albert (i.e. the line `mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token` is present in many tokenizers, and I suspect, though we would have to check case by case, that this same change is also relevant for them).
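For context, `lstrip=True` on an `AddedToken` tells the tokenizer to absorb whitespace to the left of the special token when matching. A rough pure-Python sketch of that matching behavior (an illustration only, not the actual implementation in the `tokenizers` library):

```python
import re

def split_on_special(text, special, lstrip=True, rstrip=False):
    # Build a pattern that optionally swallows whitespace around the
    # special token, mimicking AddedToken(lstrip=..., rstrip=...).
    pattern = re.escape(special)
    if lstrip:
        pattern = r"\s*" + pattern
    if rstrip:
        pattern = pattern + r"\s*"
    pieces = re.split(f"({pattern})", text)
    return [p for p in pieces if p]

print(split_on_special("this is a [MASK] token", "[MASK]"))
# With lstrip=True the space before [MASK] is attached to the match:
# ['this is a', ' [MASK]', ' token']
```

With `lstrip=False`, the leading space would instead remain attached to the preceding text, which changes what the downstream model sees around the mask position.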
Hi @SaulLu, thank you very much for your detailed reply! I'm sorry if I didn't explain the purpose of these changes clearly. For the first change, please temporarily ignore the inconsistency between the slow and fast tokenizers, as I just used the fast one as an example. As you can see in the snippet, the problem is that currently, if we specify a new cls token, it will not be treated as a special token because it does not exist in the vocabulary. However, after saving this tokenizer and loading it again, the tokenizer's state is no longer the same as before reloading. I think this is weird behavior. What do you think about this?
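One plausible way such a save/reload asymmetry arises can be sketched with a toy tokenizer (all names here are invented for illustration; this is not the transformers API): the fresh init leaves the configured special token out of the vocabulary, while the loading path sanitizes special tokens into it.

```python
import json
import os
import tempfile

class ToyTokenizer:
    # Hypothetical toy tokenizer illustrating the asymmetry described above.
    def __init__(self, vocab, cls_token="[CLS]", add_missing_specials=False):
        self.vocab = dict(vocab)
        self.cls_token = cls_token
        # On a fresh init the special token is NOT added to the vocab...
        if add_missing_specials and cls_token not in self.vocab:
            self.vocab[cls_token] = len(self.vocab)

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"vocab": self.vocab, "cls_token": self.cls_token}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            cfg = json.load(f)
        # ...but the loading path sanitizes special tokens into the vocab,
        # so the state after a save/load round trip differs from before.
        return cls(cfg["vocab"], cfg["cls_token"], add_missing_specials=True)

tok = ToyTokenizer({"hello": 0}, cls_token="[OAA]")
path = os.path.join(tempfile.mkdtemp(), "tok.json")
tok.save(path)
reloaded = ToyTokenizer.load(path)
print("[OAA]" in tok.vocab, "[OAA]" in reloaded.vocab)  # False True
```

The round trip changes observable state: the same configuration tokenizes text differently before and after reloading, which is the inconsistency being reported.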
Also, I was actually not aware of the workflow docstring you mentioned. Maybe we could add it to the constructor docstring or another prominent place for better visibility? (Users might not be aware of this workflow.) For the second change, I will reply on your review threads. Thank you again for taking care of this!
Hi @qqaatw, thank you so much for your detailed answers! 🤗 I perfectly understand why this behavior surprises you now! Your proposal seems very relevant to me, but I am still a bit worried about the side effects it could bring, because it is a change that will affect all tokenizers, and I personally need some time to examine its impact, in particular given the previous workflow recommended in the docstring. On this subject, if this change is accepted, it will probably require dedicated tests. In order not to block the rest, maybe the best thing would be to divide this PR into two: the change of the arguments for the mask token on one side, and the change of the value of a special token in the init on the other. What do you think about it?
Hello @SaulLu, thanks for your reply! Sure, I can separate these into different PRs.
Thank you very much for all this discussion and your work! Everything looks good to me in order to adjust how the mask token is handled by Albert 👍
* Fix albert mask token tokenization.
* Ensure special tokens sanitized.
* Style
* Fix
* Apply suggestions from code review
What does this PR do?
This PR addresses two problems of fast tokenization:
1. Special tokens specified at instantiation are not added to the vocabulary (the `tokenizer._add_tokens` function). Edited: We finally decided to not include this change in this PR.
2. Albert's `[MASK]` token incorrectly normalizes the texts to be matched. (Special tokens should exactly match the original text, not the normalized one.)
Who can review?
@LysandreJik @sgugger @SaulLu