Integrate fast tokenizers library inside transformers #2674

Merged: 76 commits, Feb 19, 2020
Conversation

@mfuntowicz (Member) commented on Jan 29, 2020:

Integrate the BPE-based tokenizers inside transformers.

  • Bert (100% match)
  • DistilBert (100% match)
  • OpenAI GPT (100% match)
  • GPT2 (100% match if no trailing \n)
  • Roberta (100% match if no trailing \n)
  • TransformerXL
  • CTRL (No binding will be provided).

Added priority for tokenizers with a fast implementation in AutoTokenizer. This is done through a new mapping (name: class) -> (name: Tuple[class, class]) which holds both the Python and the Rust implementation classes; if no Rust implementation is available, the second entry is simply set to None. AutoTokenizer will pick the Rust class if it is not None, otherwise it falls back to the Python one.
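For illustration, a minimal sketch of that dispatch logic; TOKENIZER_CLASSES and pick_tokenizer_class are illustrative names, not the identifiers used in the PR:

# Hedged sketch only: the mapping and helper names are hypothetical.
from typing import Dict, Optional, Tuple, Type

from transformers import (
    BertTokenizer, BertTokenizerFast,
    GPT2Tokenizer, GPT2TokenizerFast,
    CTRLTokenizer,
)

TOKENIZER_CLASSES: Dict[str, Tuple[Type, Optional[Type]]] = {
    "bert": (BertTokenizer, BertTokenizerFast),
    "gpt2": (GPT2Tokenizer, GPT2TokenizerFast),
    "ctrl": (CTRLTokenizer, None),  # no Rust binding provided
}

def pick_tokenizer_class(name: str) -> Type:
    python_cls, rust_cls = TOKENIZER_CLASSES[name]
    # Prefer the fast (Rust-backed) class when one exists, else fall back to Python.
    return rust_cls if rust_cls is not None else python_cls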

Added some matching tests which basically check that a very high percentage of tokens match element-wise. The tolerance is set arbitrarily to 0.05 (5%) [i.e. at most 5% of tokens may differ between the Python and the Rust implementations].
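As an illustration of that check (a sketch only; the helper name and assertion style are not taken from the PR's test code):

from typing import List

def assert_tokenizations_mostly_match(
    python_tokens: List[str], rust_tokens: List[str], max_diff_ratio: float = 0.05
) -> None:
    # Compare tokens element-wise and require that at most 5% of them differ.
    assert len(python_tokens) == len(rust_tokens), "token sequences must be the same length"
    differences = sum(p != r for p, r in zip(python_tokens, rust_tokens))
    assert differences / max(len(python_tokens), 1) <= max_diff_ratio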

Added a return_offsets_mapping=False parameter to the encoding methods, which returns the offset mapping when a Rust tokenizer is used. When a Python tokenizer is used, a warning message is emitted through the module logger and the argument is discarded.
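A usage sketch, assuming a recent transformers release where AutoTokenizer returns a fast tokenizer for bert-base-uncased:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello world", return_offsets_mapping=True)
# Each entry is a (start, end) character span; special tokens get (0, 0).
print(encoded["offset_mapping"])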

@julien-c (Member):

only took a superficial look, but looks very clean 👍

Excited to use fast tokenizers by default!

@mfuntowicz (Author):

The current CI issues are real and "normal": we need to release the next version of the tokenizers lib, which will bring in all the dependencies.

Resolved review thread on src/transformers/tokenization_utils.py (outdated).

def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
    return super().encode_batch(
        [seq.strip() if isinstance(seq, str) else (seq[0].strip(), seq[1].strip()) for seq in sequences]
Member:

This should probably be an additional Normalizer. It would also let us keep track of the offsets. What do you think?
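For illustration, a sketch of what that could look like with the present-day tokenizers API (the vocab file name and normalizer choice are placeholders):

from tokenizers import Tokenizer, models, normalizers

tokenizer = Tokenizer(models.WordPiece.from_file("vocab.txt", unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Strip(),           # replaces the manual seq.strip() calls above
    normalizers.BertNormalizer(),  # whatever normalization the model already applies
])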

@mfuntowicz (Author):

Yep, makes sense 👍

Member:

Or maybe this deserves a 0.4.3 haha

Resolved review thread on src/transformers/tokenization_utils.py (outdated).
if return_special_tokens_mask:
-    encoding_dict["special_tokens_mask"] = encoding.special_tokens_mask
+    encoding_dict["special_tokens_mask"] = [e.special_tokens_mask for e in encodings]
if return_offsets_mapping:
Member:

If we don't give access to the normalized string somehow, we should maybe provide offsets to the original string here. Wdyt?

@mfuntowicz (Author):

Hmm, currently the offsets are given w.r.t. the normalized string? If that is the case, then yes, we may want to provide offsets in the original string, or expose a utility method doing the mapping in Python.

Is it something we can easily expose on Encoding?

Member:

Yes, offsets are related to the normalized string. You can retrieve the original offsets by doing encoding.original_str.offsets[encoding.offsets[X]]

@julien-c (Member):

I haven't followed the discussion very closely, but shouldn't offsets be returned based on the original string by default?

As an end user I don't think I really care about the normalized (internal) version of my input.

Thoughts?

@mfuntowicz (Author):

I tend to agree with @julien-c, the normalized string is more of an internal representation.

Member:

Same for me, we should probably default to the original string.
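For context, a sketch of what offsets relative to the original string look like with a present-day fast tokenizer (the model name is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Offsets point into the ORIGINAL text."
encoded = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
for token, (start, end) in zip(encoded.tokens(), encoded["offset_mapping"]):
    # Slicing the raw input with the offsets recovers each token's surface form.
    print(token, repr(text[start:end]))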

@codecov-io commented on Feb 11, 2020:

Codecov Report

Merging #2674 into master will increase coverage by 0.29%.
The diff coverage is 83.01%.


@@            Coverage Diff            @@
##           master   #2674      +/-   ##
=========================================
+ Coverage      75%   75.3%   +0.29%     
=========================================
  Files          94      94              
  Lines       15288   15424     +136     
=========================================
+ Hits        11467   11615     +148     
+ Misses       3821    3809      -12
Impacted Files Coverage Δ
src/transformers/__init__.py 98.87% <100%> (ø) ⬆️
src/transformers/tokenization_roberta.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_bert.py 96.92% <100%> (+0.3%) ⬆️
src/transformers/pipelines.py 70.88% <100%> (+0.14%) ⬆️
src/transformers/tokenization_distilbert.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_gpt2.py 96.85% <100%> (+0.58%) ⬆️
src/transformers/tokenization_auto.py 97.22% <100%> (+0.25%) ⬆️
src/transformers/tokenization_transfo_xl.py 37.91% <51.42%> (+5.04%) ⬆️
src/transformers/tokenization_openai.py 82.27% <81.57%> (+0.46%) ⬆️
src/transformers/tokenization_utils.py 90.08% <87.23%> (+3.98%) ⬆️
... and 30 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 20fc18f...56748e8.

@n1t0 (Member) left a comment:

Looks really good! Great job @mfuntowicz!

Resolved review threads (all outdated): setup.py, src/transformers/tokenization_bert.py, and src/transformers/tokenization_utils.py (4 threads).


@mfuntowicz force-pushed the tokenizers-v2 branch 2 times, most recently from 88ffca6 to 63660c4, on February 12, 2020 at 13:23.
@thomwolf (Member) left a comment:

Great work @mfuntowicz!

setup.py (outdated)
@@ -89,7 +89,7 @@
     packages=find_packages("src"),
     install_requires=[
         "numpy",
-        "tokenizers == 0.0.11",
+        "tokenizers == 0.4.2",
Member:

>= ?

@mfuntowicz (Author):

As we don't have that many unit tests on the tokenizers Python bindings for now, I would tend to stick to a specific version that is tested on the CI. Otherwise we might introduce flaky tests whenever new tokenizers versions are released.

Member:

Well, given that we are still introducing breaking changes pretty often in tokenizers, I would strongly advise against that.
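For illustration, the two flavors of version specifier being discussed (version numbers are examples only, not a recommendation from the PR):

# Exact pin: CI exercises one known-good tokenizers release (what the PR does).
install_requires = ["numpy", "tokenizers == 0.4.2"]

# Compatible range: picks up new releases automatically, at the risk of breakage
# while tokenizers is still making frequent breaking changes.
# install_requires = ["numpy", "tokenizers >= 0.4.2, < 0.5.0"]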

Resolved review threads on src/transformers/tokenization_utils.py (2 threads, one outdated).


# Prepare inputs as tensors if asked
if return_tensors == "tf" and is_tf_available():
-    encoding_dict["input_ids"] = tf.constant([encoding_dict["input_ids"]])
+    encoding_dict["input_ids"] = tf.constant(encoding_dict["input_ids"])
Member:

Ok nice, so this will be a tensor with a "batch dimension" equal to the number of encodings when they are split into overflowing tokens. I like this solution, it's clean. We should document this behavior though.
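A sketch of that behavior with the present-day transformers API (assumes TensorFlow is installed; the model name and lengths are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "a sentence long enough to be split into several overflowing chunks of eight tokens",
    max_length=8,
    truncation=True,
    stride=2,
    padding="max_length",
    return_overflowing_tokens=True,
    return_tensors="tf",
)
# The first dimension is the number of chunks produced by overflow handling.
print(encoded["input_ids"].shape)  # (num_chunks, 8)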

Resolved review thread on src/transformers/tokenization_bert.py (outdated).
@@ -213,3 +222,91 @@ def save_vocabulary(self, save_directory):
index += 1

return vocab_file, merge_file


class _OpenAIGPTCharBPETokenizer(BaseTokenizer):
Member:

Why do we have to have this class here?

Don't we have an implementation of char-level BPE in tokenizers now?
Here: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/implementations/char_level_bpe.py#L9

@mfuntowicz (Author):

We do need a special OpenAI GPT implementation because it slightly differs from the char-level BPE we have in tokenizers (a rough sketch follows this list):

  • The Normalizer is the same as Bert's (BertNormalizer)
  • The PreTokenizer is not Whitespace; it is the same as Bert's (BertPreTokenizer)
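For illustration only, here is that customization sketched with the present-day tokenizers API; the vocab/merges file names are placeholders and the exact options differ from the PR-era bindings:

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers

bpe = models.BPE.from_file("vocab.json", "merges.txt", unk_token="<unk>", end_of_word_suffix="</w>")
tokenizer = Tokenizer(bpe)
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)  # same normalizer as Bert
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()        # Bert-style, not Whitespace
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")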

@mfuntowicz (Author):

If we put TransformerXL into tokenizers.implementations, maybe this one can make its way to tokenizers too. cc @n1t0

Member:

Honestly, I'm not too sure about this. I think tokenizers should stay a library with some generic implementations, with an easy way for everybody to build their own custom tokenizer when needed. So I'd like to avoid introducing specific implementations for each new model/tokenizer. Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation, and then we'll have as many implementations as there are models in transformers... I think it makes more sense to keep model-specific customization details in transformers, next to the model that actually uses the custom tokenizer.

Member:

> Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation

You mean have all the things that made the success of Transformers? 😜

Just kidding. OK for me to keep these in Transformers then.

Resolved review threads on src/transformers/tokenization_roberta.py (2 outdated threads).
@@ -280,6 +293,108 @@ def _tokenize(self, line, add_eos=False, add_double_eos=False):
return symbols


class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
Member:

Can we move this class upstream in tokenizers now that we have a word-level model?

It would be more consistent.
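For illustration, a word-level, delimiter-split tokenizer sketched with the present-day tokenizers API (the file name, delimiter, and normalization are placeholders):

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.WordLevel.from_file("vocab.json", unk_token="<unk>"))
tokenizer.normalizer = normalizers.Lowercase()
# Split on a single delimiter character, in the spirit of a delimiter-based lookup tokenizer.
tokenizer.pre_tokenizer = pre_tokenizers.CharDelimiterSplit(" ")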

@mfuntowicz (Author):

cc @n1t0, what do you think? I can put the content of this class into tokenizers.implementations.

Member:

cf comment for OpenAIGPTCharBPETokenizer

(Commit timeline: 76 commits merged; the displayed commits are signed off by Morgan Funtowicz <morgan@huggingface.co>.)
@mfuntowicz changed the title from "[WIP] Integrate fast tokenizers library inside transformers" to "Integrate fast tokenizers library inside transformers" on Feb 19, 2020.
@LysandreJik (Member) left a comment:

This is really cool, great job @mfuntowicz !

Comment on lines -569 to -597 (code removed by this PR):

        self._tokenizer = tk.Tokenizer(tk.models.WordPiece.from_files(vocab_file, unk_token=unk_token))
        self._update_special_tokens()
        self._tokenizer.with_pre_tokenizer(
            tk.pre_tokenizers.BertPreTokenizer.new(
                do_basic_tokenize=do_basic_tokenize,
                do_lower_case=do_lower_case,
                tokenize_chinese_chars=tokenize_chinese_chars,
                never_split=never_split if never_split is not None else [],
            )
        )
        self._tokenizer.with_decoder(tk.decoders.WordPiece.new())

        if add_special_tokens:
            self._tokenizer.with_post_processor(
                tk.processors.BertProcessing.new(
                    (sep_token, self._tokenizer.token_to_id(sep_token)),
                    (cls_token, self._tokenizer.token_to_id(cls_token)),
                )
            )
        if max_length is not None:
            self._tokenizer.with_truncation(max_length, stride=stride, strategy=truncation_strategy)
        self._tokenizer.with_padding(
            max_length=max_length if pad_to_max_length else None,
            direction=self.padding_side,
            pad_id=self.pad_token_id,
            pad_type_id=self.pad_token_type_id,
            pad_token=self.pad_token,
        )
        self._decoder = tk.decoders.WordPiece.new()
Member:

This is very satisfying

        else:
            return max(0, len(self.encode(self.mask_token or "")) - 1)

    def tokenize(self, text, **kwargs):
Member:

This seems like it would take too much memory before being useful, as you really never have 2 sequences with 512/1024 tokens which are exactly the same
