Fix fast tokenization problems #13930

Merged: 5 commits into huggingface:master on Nov 10, 2021

Conversation

@qqaatw (Contributor) commented on Oct 8, 2021

What does this PR do?

This PR addresses two problems of fast tokenization:

  1. The special tokens specified directly via kwargs were not sanitized (i.e. not added through the tokenizer._add_tokens function).
    Edited: we ultimately decided not to include this change in this PR.
from transformers import AlbertTokenizerFast

tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model", mask_token="[OAO]")
print(tokenizer._mask_token)
print(tokenizer.tokenize('[OAO]'))

# Outputs:
# [OAO]
# ['▁[', 'o', 'ao', ']']

# Fixed outputs:
# [OAO]
# ['[OAO]']
  2. The special handling of Albert's [MASK] token incorrectly normalizes the text to be matched. (Special tokens should match the original text exactly, not the normalized text.)
from transformers import AlbertTokenizerFast

tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model")

print(tokenizer._mask_token)
print(tokenizer.tokenize('[MASK]'))

# Outputs:
# [MASK]
# ['▁[', 'mask', ']']

# Fixed outputs:
# [MASK]
# ['[MASK]']

Who can review?

@LysandreJik @sgugger @SaulLu

@qqaatw mentioned this pull request on Oct 8, 2021
@sgugger (Collaborator) left a comment


I'll let @SaulLu decide on this since she is the one who suggested the changes :-)
(FYI she's on vacation until the end of next week.)

@qqaatw (Contributor, Author) commented on Oct 8, 2021

I'll let @SaulLu decide on this since she is the one who suggested the changes :-) (FYI she's on vacation until the end of next week.)

Thanks, no problem.

@SaulLu self-requested a review on October 14, 2021 at 15:12
@SaulLu (Contributor) left a comment


Thank you very much for working on this issue. To address the two problems you mention:

  1. I'm not quite sure this is a "problem". As explained in this docstring, the recommended workflow is:
            # You can link tokens to special vocabulary when instantiating
            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
            # You should be sure '<unk>' is in the vocabulary when doing that.
            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)

Do you have a use case in mind in which this workflow would not suit you?
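
For concreteness, a minimal sketch (illustrative, assuming the current add_special_tokens API) of that workflow applied to the "[OAO]" example from the PR description:

from transformers import AlbertTokenizerFast

# Sketch: register the new special token explicitly so it is added to the
# vocabulary, instead of passing it only as a constructor kwarg.
tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model")
tokenizer.add_special_tokens({"mask_token": "[OAO]"})
print(tokenizer.tokenize("[OAO]"))  # expected: ['[OAO]']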

Moreover, the slow tokenizer does not add the special token to the vocabulary if it does not exist, so unless I am missing something this PR would introduce a difference in behavior between the slow and fast Albert tokenizers.

With your PR, I obtain:

from transformers import AlbertTokenizerFast, AlbertTokenizer
tokenizer = AlbertTokenizer("tests/fixtures/spiece.model", cls_token="[OAA]")
print(tokenizer._cls_token)
print(tokenizer.tokenize('this is a [OAA]'))

# Outputs:
# [OAA]
# ['▁this', '▁is', '▁a', '▁[', 'oa', 'a', ']']

tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model", cls_token="[OAA]")
print(tokenizer._cls_token)
print(tokenizer.tokenize('this is a [OAA]'))

# Outputs:
# [OAA]
# ['▁this', '▁is', '▁a', '▁', '[OAA]']

Really happy to hear your thoughts about that!

  2. As mentioned in this PR, I completely agree with this analysis, and I personally think that what you propose in src/transformers/models/albert/tokenization_albert.py reflects more accurately how the mask token should be used.
    Nevertheless, it changes the behavior of this tokenizer, and I would like to be sure that this change is also approved by @sgugger and @LysandreJik. If it is, we could open another PR to make this same change for all the other tokenizers that have a mask token "similar" to Albert's (i.e. the line mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token is present in many tokenizers, and I suspect, though we would have to check case by case, that this same change is also relevant there; see the sketch below).
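
A minimal sketch of the pattern under discussion (an illustration of the general idea, not necessarily the exact diff merged in this PR): constructing the mask token as an AddedToken that is matched against the raw, un-normalized text.

from tokenizers import AddedToken

# Sketch: normalized=False asks the fast tokenizer to match "[MASK]" against the
# original input text rather than the normalized (e.g. lowercased) text, while
# lstrip=True keeps the existing behavior of stripping whitespace to its left.
mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)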

@qqaatw (Contributor, Author) commented on Oct 15, 2021

Hi @SaulLu,

Thank you very much for your detailed reply! I'm sorry if I didn't explain the purpose of these changes clearly.

For the first change, please temporarily ignore the inconsistency between the slow and fast tokenizers, as I just used the fast one as an example.

You can see in the snippet below that the problem is: if we currently specify a new cls token, it is not treated as a single token because it does not exist in the vocabulary. However, after saving this tokenizer and loading it again through the from_pretrained method, it is surprisingly treated as a single token, because from_pretrained calls tokenizer.sanitize_special_tokens(), as the linked line below shows.

Therefore, I think it is weird that the tokenizer's state before and after reloading is not the same. What do you think about this?

added_tokens = tokenizer.sanitize_special_tokens()

tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model", cls_token="[OAO]")
print(tokenizer.tokenize('[OAO]')) # Output: ['▁[', 'o', 'ao', ']']

tokenizer.save_pretrained("./test_storage")

tokenizer = AlbertTokenizerFast.from_pretrained("./test_storage")
print(tokenizer.tokenize('[OAO]')) # Output: ['[OAO]']
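
For illustration, a minimal sketch (assuming the sanitize_special_tokens API linked above) of reproducing the post-reload behavior without saving and reloading:

from transformers import AlbertTokenizerFast

# Sketch: sanitize_special_tokens() adds any special tokens missing from the
# vocabulary, which is what from_pretrained ends up doing on reload.
tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model", cls_token="[OAO]")
tokenizer.sanitize_special_tokens()
print(tokenizer.tokenize("[OAO]"))  # expected: ['[OAO]']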

Also, I was actually not aware of the workflow docstring you mentioned; maybe we could add it to the constructor docstring or somewhere more visible? (Users might not be aware of this workflow if they don't use the from_pretrained method.)

For the second change, I will reply on your review threads.

Thank you again for taking care of this!

@SaulLu (Contributor) commented on Oct 26, 2021

Hi @qqaatw,

Thank you so much for your detailed answers! 🤗 I now understand perfectly why this behavior surprises you!

Your proposal seems very relevant to me, but I am still a bit worried about the side effects it could bring: it is a change that will affect all tokenizers, and I personally need some time to examine its impact, in particular given the previously recommended workflow in the docstring.

On this subject, if this change is accepted, it will probably require dedicated tests. In order not to block the rest, maybe the best thing would be to split this PR in two: the change to the mask token's arguments on one side, and the change to how a special token's value is set in the init on the other. What do you think?

@qqaatw (Contributor, Author) commented on Oct 30, 2021

Hello @SaulLu,

Thanks for your reply!

Sure, I can separate these into different PRs.

@qqaatw requested a review from SaulLu on November 10, 2021 at 06:31
@SaulLu (Contributor) left a comment


Thank you very much for all this discussion and your work! Everything looks good to me for adjusting how the mask token is handled by Albert 👍

@SaulLu merged commit ea163d0 into huggingface:master on Nov 10, 2021
@qqaatw mentioned this pull request on Nov 11, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* Fix albert mask token tokenization.

* Ensure special tokens sanitized.

* Style

* Fix

* Apply suggestions from code review