Fix fast tokenization problems #13930
Conversation
I'll let @SaulLu decide on this since she is the one who suggested the changes :-)
(FYI she's on vacation until the end of next week.)
Thanks, no problem.
Thank you very much for working on this issue. To take up the two problems you mention:
- I'm not quite sure this is a "problem". As explained in this docstring, the recommended workflow is to do:
# You can link tokens to special vocabulary when instantiating
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
# You should be sure '<unk>' is in the vocabulary when doing that.
# Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)
Do you have a use case in mind in which this workflow would not suit you?
Moreover, the slow tokenizer does not add the special token to the vocabulary if it does not exist, so unless I'm missing something, this PR would introduce a difference in behavior between the slow and the fast tokenizers of type Albert.
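To illustrate why the recommended workflow insists that the token already be in the vocabulary, here is a toy pure-Python sketch (invented for illustration, not the transformers implementation) of what goes wrong when a configured `unk_token` has no vocabulary id:

```python
# Toy sketch (not the real transformers code): a minimal vocab lookup
# showing why a special token must exist in the vocabulary before use.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

def encode(tokens, vocab, unk_token="[UNK]"):
    # Unknown tokens fall back to the unk_token's id; if the configured
    # unk_token itself is missing from the vocab, lookup fails entirely.
    if unk_token not in vocab:
        raise KeyError(f"unk_token {unk_token!r} is not in the vocabulary")
    unk_id = vocab[unk_token]
    return [vocab.get(t, unk_id) for t in tokens]

print(encode(["hello", "xyz"], vocab))  # [1, 0]
# Configuring unk_token='<unk>' without adding it first would raise KeyError.
```

This mirrors the docstring's advice: either pick a token already in the vocabulary, or call `tokenizer.add_special_tokens(...)` first.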
With your PR, I obtain:
from transformers import AlbertTokenizerFast, AlbertTokenizer
tokenizer = AlbertTokenizer("tests/fixtures/spiece.model", cls_token="[OAA]")
print(tokenizer._cls_token)
print(tokenizer.tokenize('this is a [OAA]'))
# Outputs:
# [OAA]
# ['▁this', '▁is', '▁a', '▁[', 'oa', 'a', ']']
tokenizer = AlbertTokenizerFast("tests/fixtures/spiece.model", cls_token="[OAA]")
print(tokenizer._cls_token)
print(tokenizer.tokenize('this is a [OAA]'))
# Outputs:
# [OAA]
# ['▁this', '▁is', '▁a', '▁', '[OAA]']
Really happy to hear your thoughts about that!
- As mentioned in this PR, I completely agree with this analysis and personally I think that what you propose in
src/transformers/models/albert/tokenization_albert.py
reflects more accurately the way to use the mask token.
Nevertheless, it changes the behavior of this tokenizer, and I would like to be sure that it is a change also approved by @sgugger and @LysandreJik. If it is, we could make another PR applying the same change to all the other tokenizers that have a "similar" mask token to Albert (i.e. the line `mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token` is present in many tokenizers, and I suspect, though we would have to check case by case, that this same change is also relevant for them).
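For context, `lstrip=True` on an `AddedToken` tells the tokenizer to absorb whitespace to the left of the special token when matching. A rough pure-Python sketch of that matching behavior (an illustration only, not the actual implementation in the `tokenizers` library):

```python
import re

def split_on_special(text, special, lstrip=True, rstrip=False):
    # Build a pattern that optionally swallows whitespace around the
    # special token, mimicking AddedToken(lstrip=..., rstrip=...).
    pattern = re.escape(special)
    if lstrip:
        pattern = r"\s*" + pattern
    if rstrip:
        pattern = pattern + r"\s*"
    pieces = re.split(f"({pattern})", text)
    return [p for p in pieces if p]

print(split_on_special("this is a [MASK] token", "[MASK]"))
# With lstrip=True the space before [MASK] is attached to the match:
# ['this is a', ' [MASK]', ' token']
```

With `lstrip=False`, the leading space would instead remain attached to the preceding text, which changes what the downstream model sees around the mask position.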
Hi @SaulLu, thank you very much for your detailed reply! I'm sorry if I didn't explain the purpose of these changes clearly. For the first change, please temporarily ignore the inconsistency between the slow and fast tokenizers, as I just used the fast one as an example. As you can see in the snippet, the problem is that currently, if we specify a new cls token, it will not be treated as a special token because it does not exist in the vocabulary. However, after saving this tokenizer and loading it again, the tokenizer's state is no longer the same as before reloading. I think this is weird behavior. What do you think about this?
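One plausible way such a save/reload asymmetry arises can be sketched with a toy tokenizer (all names here are invented for illustration; this is not the transformers API): the fresh init leaves the configured special token out of the vocabulary, while the loading path sanitizes special tokens into it.

```python
import json
import os
import tempfile

class ToyTokenizer:
    # Hypothetical toy tokenizer illustrating the asymmetry described above.
    def __init__(self, vocab, cls_token="[CLS]", add_missing_specials=False):
        self.vocab = dict(vocab)
        self.cls_token = cls_token
        # On a fresh init the special token is NOT added to the vocab...
        if add_missing_specials and cls_token not in self.vocab:
            self.vocab[cls_token] = len(self.vocab)

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"vocab": self.vocab, "cls_token": self.cls_token}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            cfg = json.load(f)
        # ...but the loading path sanitizes special tokens into the vocab,
        # so the state after a save/load round trip differs from before.
        return cls(cfg["vocab"], cfg["cls_token"], add_missing_specials=True)

tok = ToyTokenizer({"hello": 0}, cls_token="[OAA]")
path = os.path.join(tempfile.mkdtemp(), "tok.json")
tok.save(path)
reloaded = ToyTokenizer.load(path)
print("[OAA]" in tok.vocab, "[OAA]" in reloaded.vocab)  # False True
```

The round trip changes observable state: the same configuration tokenizes text differently before and after reloading, which is the inconsistency being reported.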
Also, I was actually not aware of the workflow docstring you mentioned. Maybe we could add it to the constructor docstring or another prominent place for better visibility? (Users might not be aware of this workflow.) For the second change, I will reply on your review threads. Thank you again for taking care of this!
Hi @qqaatw, thank you so much for your detailed answers! 🤗 I perfectly understand why this behavior surprises you now! Your proposal seems very relevant to me, but I am still a bit worried about the side effects it could bring, because it is a change that will affect all tokenizers, and I personally need some time to examine its impact, in particular given the previous workflow recommended in the docstring. On this subject, if this change is accepted, it will probably require dedicated tests. In order not to block the rest, maybe the best thing would be to divide this PR into two: the change of the arguments for the mask token on one side, and the change of the value of a special token in the init on the other. What do you think about it?
Hello @SaulLu, thanks for your reply! Sure, I can separate these into different PRs.
Thank you very much for all this discussion and your work! Everything looks good to me in order to adjust how the mask token is handled by Albert 👍
* Fix albert mask token tokenization.
* Ensure special tokens sanitized.
* Style
* Fix
* Apply suggestions from code review
What does this PR do?
This PR addresses two problems of fast tokenization:
1. Special tokens specified at instantiation are not added to the vocabulary (the `tokenizer._add_tokens` function). Edited: We finally decided to not include this change in this PR.
2. Albert's `[MASK]` token incorrectly normalizes the texts to be matched. (Special tokens should exactly match the original text, not the normalized one.)
Who can review?
@LysandreJik @sgugger @SaulLu