
[tokenizers] Several small improvements and bug fixes #5287

Merged
thomwolf merged 5 commits into master from fix-5256 on Jun 25, 2020

Conversation

@thomwolf (Member) commented on Jun 25, 2020

Various small improvements and bug fixes for the tokenizers.

A little bit of background on the modifications in Roberta tokenizer:
We now align the behavior of the byte-level BPE tokenizer with that of the Fast version, which is the most consistent with the way the original tokenizer behaved: all special tokens are assumed to have no prefix space, so the user can control whether or not a space appears in the string.

We make an exception for the mask token in Roberta, which is assumed to represent a word and thus has a prefix space by default (this can be overridden at initialization). This is necessary to use Roberta easily for fill-mask completion.

This is already built in for the Fast tokenizer. Here I update the slow tokenizer to have the same behavior, using the newly introduced AddedToken, which lets you control the whitespace behavior of the special tokens.
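
As a concrete illustration of the default described above, here is a minimal sketch (the model name, example sentence, and AddedToken import path are illustrative assumptions; exact token output depends on the installed transformers/tokenizers versions):

```python
# Minimal sketch of the mask-token behavior described above (illustrative only;
# exact token strings depend on the transformers/tokenizers versions installed).
from transformers import RobertaTokenizer
from transformers.tokenization_utils_base import AddedToken  # import location may vary by version

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# The mask token defaults to lstrip=True, so the space before "<mask>" is
# absorbed by the special token instead of being left in the surrounding text,
# which is what makes fill-mask inputs behave as expected.
print(tokenizer.tokenize("The capital of France is <mask>."))

# The default can be overridden at initialization, e.g. to drop the
# prefix-space behavior of the mask token entirely.
tokenizer_no_prefix_space = RobertaTokenizer.from_pretrained(
    "roberta-base",
    mask_token=AddedToken("<mask>", lstrip=False, rstrip=False),
)
```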

codecov bot commented on Jun 25, 2020

Codecov Report

Merging #5287 into master will increase coverage by 0.00%.
The diff coverage is 97.14%.


@@           Coverage Diff           @@
##           master    #5287   +/-   ##
=======================================
  Coverage   79.08%   79.08%           
=======================================
  Files         138      138           
  Lines       24078    24093   +15     
=======================================
+ Hits        19041    19054   +13     
- Misses       5037     5039    +2     
Impacted Files Coverage Δ
src/transformers/tokenization_utils_base.py 93.15% <94.44%> (-0.01%) ⬇️
src/transformers/tokenization_gpt2.py 97.18% <100.00%> (+0.06%) ⬆️
src/transformers/tokenization_roberta.py 94.52% <100.00%> (ø)
src/transformers/tokenization_utils_fast.py 94.20% <100.00%> (-0.09%) ⬇️
src/transformers/tokenization_utils.py 91.16% <0.00%> (-0.32%) ⬇️

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24f46ea...209dcc7.

@LysandreJik (Member) left a comment:

Very cool, LGTM!

@@ -149,6 +149,9 @@ def __init__(
add_prefix_space=False,
**kwargs
):
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
Reviewer comment (Member):

This is in order to set up an unknown token if need be? It shouldn't be set by default, right, given that it's a byte-level BPE?

@thomwolf (Member, Author) replied:

A little bit of background (I will copy this into the description):
We now align the behavior of the byte-level BPE tokenizer with that of the Fast version, i.e. except for the mask token, which is assumed to represent a word and thus has a prefix space, all the other special tokens are assumed to have no prefix space.

This is already built in for the Fast tokenizer. Here I update the slow tokenizer to have the same behavior, using the newly introduced AddedToken, which lets you control the whitespace behavior of the special tokens.

The unk token for GPT2 is indeed a bit strange and is basically only here for our tests (all tokens are known for GPT2...), so I give it this behavior just for consistency.
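
For reference, a sketch of how the mask-token exception looks on the slow Roberta side, mirroring the pattern of the GPT-2 lines quoted in the diff above (the exact code in tokenization_roberta.py may differ slightly):

```python
from transformers.tokenization_utils_base import AddedToken  # import location may vary by version

mask_token = "<mask>"
# Only the mask token gets lstrip=True, so it absorbs the whitespace on its
# left (the prefix space); bos/eos/unk/sep/cls/pad keep lstrip=False.
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
```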

Reviewer reply (Member):

That's very cool

@thomwolf merged commit 315f464 into master on Jun 25, 2020
@thomwolf deleted the fix-5256 branch on June 25, 2020 at 20:17
jplu pushed a commit to jplu/transformers that referenced this pull request Jun 29, 2020
* avoid recursion in id checks for fast tokenizers

* better typings and fix huggingface#5232

* align slow and fast tokenizers behaviors for Roberta and GPT2

* style and quality

* fix tests - improve typings