
[tokenizers] Several small improvements and bug fixes #5287

Merged
thomwolf merged 5 commits into master from fix-5256 on Jun 25, 2020

Conversation

@thomwolf (Member) commented on Jun 25, 2020

Various small improvements and bug fixes for the tokenizers.

A little bit of background on the modifications in Roberta tokenizer:
We now align the behavior of the byte-level BPE tokenizer with that of the Fast version, which is the most consistent with the way the original tokenizer behaved: all special tokens are assumed to have no prefix space, so the user can control whether or not a space appears in the string.

We make an exception for the mask token in Roberta, which is assumed to represent a word and thus has a prefix space by default (this can be overridden at initialization). This is necessary to use Roberta easily for fill-mask completion.

This is already built in for the Fast tokenizer. Here I update the slow tokenizer to have the same behavior, using the newly introduced AddedToken, which lets you control the whitespace behavior of the special tokens.
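
As a concrete illustration of the default described above, here is a minimal sketch (the model name, example sentence, and AddedToken import path are illustrative assumptions; exact token output depends on the installed transformers/tokenizers versions):

```python
# Minimal sketch of the mask-token behavior described above (illustrative only;
# exact token strings depend on the transformers/tokenizers versions installed).
from transformers import RobertaTokenizer
from transformers.tokenization_utils_base import AddedToken  # import location may vary by version

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# The mask token defaults to lstrip=True, so the space before "<mask>" is
# absorbed by the special token instead of being left in the surrounding text,
# which is what makes fill-mask inputs behave as expected.
print(tokenizer.tokenize("The capital of France is <mask>."))

# The default can be overridden at initialization, e.g. to drop the
# prefix-space behavior of the mask token entirely.
tokenizer_no_prefix_space = RobertaTokenizer.from_pretrained(
    "roberta-base",
    mask_token=AddedToken("<mask>", lstrip=False, rstrip=False),
)
```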

codecov bot commented on Jun 25, 2020

Codecov Report

Merging #5287 into master will increase coverage by 0.00%.
The diff coverage is 97.14%.


@@           Coverage Diff           @@
##           master    #5287   +/-   ##
=======================================
  Coverage   79.08%   79.08%           
=======================================
  Files         138      138           
  Lines       24078    24093   +15     
=======================================
+ Hits        19041    19054   +13     
- Misses       5037     5039    +2     
Impacted Files Coverage Δ
src/transformers/tokenization_utils_base.py 93.15% <94.44%> (-0.01%) ⬇️
src/transformers/tokenization_gpt2.py 97.18% <100.00%> (+0.06%) ⬆️
src/transformers/tokenization_roberta.py 94.52% <100.00%> (ø)
src/transformers/tokenization_utils_fast.py 94.20% <100.00%> (-0.09%) ⬇️
src/transformers/tokenization_utils.py 91.16% <0.00%> (-0.32%) ⬇️

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24f46ea...209dcc7.

@LysandreJik (Member) left a comment:

Very cool, LGTM!

@@ -149,6 +149,9 @@ def __init__(
add_prefix_space=False,
**kwargs
):
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
Reviewer comment (Member):

This is in order to set up an unknown token if need be? It shouldn't be set by default, right, given that it's a byte-level BPE?

@thomwolf (Member, Author) replied:

A little bit of background (I will copy this into the description):
We now align the behavior of the byte-level BPE tokenizer with that of the Fast version, i.e. except for the mask token, which is assumed to represent a word and thus has a prefix space, all the other special tokens are assumed to have no prefix space.

This is already built in for the Fast tokenizer. Here I update the slow tokenizer to have the same behavior, using the newly introduced AddedToken, which lets you control the whitespace behavior of the special tokens.

The unk token for GPT2 is indeed a bit strange and is basically only here for our tests (all tokens are known for GPT2...), so I give it this behavior just for consistency.
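
For reference, a sketch of how the mask-token exception looks on the slow Roberta side, mirroring the pattern of the GPT-2 lines quoted in the diff above (the exact code in tokenization_roberta.py may differ slightly):

```python
from transformers.tokenization_utils_base import AddedToken  # import location may vary by version

mask_token = "<mask>"
# Only the mask token gets lstrip=True, so it absorbs the whitespace on its
# left (the prefix space); bos/eos/unk/sep/cls/pad keep lstrip=False.
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
```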

Reviewer reply (Member):

That's very cool

@thomwolf merged commit 315f464 into master on Jun 25, 2020
@thomwolf deleted the fix-5256 branch on June 25, 2020 at 20:17
jplu pushed a commit to jplu/transformers that referenced this pull request Jun 29, 2020
* avoid recursion in id checks for fast tokenizers

* better typings and fix huggingface#5232

* align slow and fast tokenizers behaviors for Roberta and GPT2

* style and quality

* fix tests - improve typings