[tokenizers] Several small improvements and bug fixes #5287
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #5287   +/-   ##
=======================================
  Coverage   79.08%   79.08%
=======================================
  Files         138      138
  Lines       24078    24093    +15
=======================================
+ Hits        19041    19054    +13
- Misses       5037     5039     +2
Continue to review full report at Codecov.
Very cool, LGTM!
@@ -149,6 +149,9 @@ def __init__(
        add_prefix_space=False,
        **kwargs
    ):
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
This is in order to set up an unknown token if need be? It shouldn't be set by default, right, given it's a byte-level BPE?
A little bit of background (will copy this in the description):
We now align the behavior of the byte-level BPE tokenizer with the Fast version: except for the mask token, which is assumed to represent a word and thus has a prefix space, all the other special tokens are assumed not to have a prefix space.
This is already built in for the Fast tokenizer. Here I update the slow tokenizer to have this behavior using the newly introduced AddedToken, which lets you control the space behavior of the special tokens.
The unk token for GPT2 is indeed a bit strange and is basically only here for our tests (all the tokens are known for GPT2...), so I give it this behavior just for consistency.
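As a minimal sketch of the pattern discussed above (not the exact PR code; the AddedToken import path varies across transformers versions):

from transformers import AddedToken  # older versions expose it under transformers.tokenization_utils

# GPT-2 style special tokens: no prefix space is absorbed on either side.
unk_token = AddedToken("<|endoftext|>", lstrip=False, rstrip=False)

# Roberta's mask token: lstrip=True so "<mask>" absorbs the space in front of it
# and behaves like a full word, which is what fill-mask usage expects.
mask_token = AddedToken("<mask>", lstrip=True, rstrip=False)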
That's very cool
* avoid recursion in id checks for fast tokenizers
* better typings and fix huggingface#5232
* align slow and fast tokenizers behaviors for Roberta and GPT2
* style and quality
* fix tests - improve typings
Various improvements for tokenizers:
convert_tokens_to_string for Fast tokenizers

A little bit of background on the modifications in the Roberta tokenizer:
We now align the behavior of the byte-level BPE tokenizer with the Fast version, which is the most consistent with the way the original tokenizer behaved: all the special tokens are assumed not to have a prefix space, so the user can control whether or not to include a space in the string.
We make an exception for the mask token in Roberta, which is assumed to represent a word and thus has a prefix space by default (this can be overridden at initialization). This is necessary to be able to use Roberta for fill-mask completion easily.
This is already built in for the Fast tokenizer. Here I update the slow tokenizer to have this behavior using the newly introduced AddedToken, which lets you control the space behavior of the special tokens.
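For illustration only (not part of the PR itself), here is roughly how the default mask-token behavior can be overridden at initialization; the checkpoint name and import location are assumptions:

from transformers import RobertaTokenizer, AddedToken  # AddedToken may live elsewhere in older versions

# Default behavior: the mask token absorbs the space in front of it (lstrip=True),
# so it behaves like a full word in fill-mask usage.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Override at initialization if you want the mask token to behave like the other
# special tokens (no prefix space absorbed):
tokenizer_no_prefix = RobertaTokenizer.from_pretrained(
    "roberta-base",
    mask_token=AddedToken("<mask>", lstrip=False, rstrip=False),
)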