
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer #7141

Merged
merged 47 commits into master from fast-sentencepiece on Oct 8, 2020

Conversation

thomwolf (Member) commented Sep 15, 2020

This pull request adds the "fast" Rust tokenizers for the SentencePiece-based tokenizers as well (a short usage sketch follows the list of tokenizers below).

Based on unreleased v0.9.0 of tokenizers.

Tokenizers:

  • Albert
  • Bart
  • Bert
  • Camembert
  • DistilBert
  • DPR
  • Electra
  • Funnel
  • GPT2
  • LongFormer
  • LXMert
  • MBart
  • MobileBert
  • OpenAI GPT
  • Pegasus
  • Reformer
  • RetriBert
  • Roberta
  • T5
  • XLM-Roberta
  • XLNet
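
For illustration, here is a minimal usage sketch (my example, not part of the pull request) showing how one of the models above can be loaded with its fast tokenizer; the checkpoint name is only an example, and the snippet assumes a matching tokenizers release is installed.

# Minimal sketch (not from this PR): load a SentencePiece-based model with its
# fast tokenizer by passing use_fast=True to AutoTokenizer.
from transformers import AutoTokenizer

# "albert-base-v2" is an example checkpoint; any model in the list above with a
# fast implementation should behave the same way.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2", use_fast=True)

print(tokenizer.is_fast)                         # True when the Rust backend is in use
print(tokenizer("Hello, fast tokenizers!")["input_ids"])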

Breaking:

  • The fast version of the Transformer-XL tokenizer (which gave different tokenization results) is removed.

Remaining tokenizers without Fast implementations (no fast tokenizers expected in the short/mid-term):

  • BertJapanese (special python libs for multi-linguality)
  • CTRL (would require a specific BPE to handle missing merges)
  • XLM (uses special python libs for multi-linguality)
  • Flaubert (same as XLM)
  • Transformer-XL (same as XLM)

Other fixes:

thomwolf marked this pull request as ready for review on October 6, 2020 at 07:36

thomwolf (Member, Author) commented Oct 6, 2020

Ready for review; the remaining failing tests should be OK after the next tokenizers RC release.

LysandreJik (Member) left a comment:

Great work, incredible to now have support for all tokenizers for which it is possible!

@@ -0,0 +1,546 @@
from typing import Dict, List, Tuple

Member:

Should add the copyright here

Member:

I don't exactly understand what this script does. Does it convert from original implementations to ours, or from our slow implementations to our fast ones? Some docstrings would be very welcome!

Member Author:

Indeed, I'll add more documentation.

This file contains utilities to convert slow tokenizers into their fast tokenizer counterparts.

All the conversions are grouped here to keep the SentencePiece dependencies outside of the fast tokenizer files and to make our dependency on SentencePiece optional.
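
As a rough illustration of this conversion path, here is a minimal sketch; it assumes the function convert_slow_tokenizer is importable from transformers.convert_slow_tokenizer, as the file name discussed here suggests, and that sentencepiece is installed for the slow tokenizer.

# Minimal sketch: turn an existing slow (Python/SentencePiece) tokenizer into
# its Rust counterpart, a tokenizers.Tokenizer object.
from transformers import AlbertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = AlbertTokenizer.from_pretrained("albert-base-v2")   # needs sentencepiece
fast_backend = convert_slow_tokenizer(slow)                # tokenizers.Tokenizer

print(type(fast_backend))
print(fast_backend.encode("Hello world").tokens)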

src/transformers/tokenization_albert.py (resolved, outdated)
src/transformers/tokenization_bert.py (resolved)
src/transformers/tokenization_pegasus.py (resolved, outdated)
src/transformers/tokenization_utils_base.py (resolved)
Comment on lines +791 to +801
# def test_swap_special_token(self):
# tokenizers = self.get_tokenizers(do_lower_case=False)
# for tokenizer in tokenizers:
# with self.subTest(f"{tokenizer.__class__.__name__}"):
# # Our mask token
# mask = "<mask>"
# # We take a single word in the middle of the vocabulary
# all_tokens = sorted(tokenizer.get_vocab().keys())
# word = tokenizer.decode(tokenizer.encode(all_tokens[len(all_tokens)//2], add_special_tokens=False)[:1])

# sequence_0 = "Encode " + word + " sequence"

Member:

Why is this test removed?

Member Author:

It is just too mind-bending to make it work in the general setting of tokenizers with arbitrary vocabularies, and I don't think it's a useful test in the end.

@@ -1,23 +1,52 @@
import logging

Member:

Needs copyright here.

The diff for this file is slightly complicated to read. Did you wrap the tests in subtests, iterating over every tokenizer? Is that better than doing a mixin like we do in other test classes?

Member Author:

I kinda kept the original setup made by @mfuntowicz even though I agree switching to a mixin would probably be easier to read in the end.
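
For readers less familiar with the pattern under discussion, a hypothetical sketch of the subtest approach (not the actual test file) could look like this, with an example Bert checkpoint standing in for the full set of tokenizers:

# Hypothetical sketch of the subtest pattern: one test method iterates over
# several tokenizers and reports each as its own subtest, instead of
# generating per-tokenizer test classes through a mixin.
import unittest

from transformers import BertTokenizer, BertTokenizerFast


class TokenizerCommonTests(unittest.TestCase):
    def get_tokenizers(self, **kwargs):
        # Example checkpoint chosen for illustration only.
        return [
            BertTokenizer.from_pretrained("bert-base-uncased", **kwargs),
            BertTokenizerFast.from_pretrained("bert-base-uncased", **kwargs),
        ]

    def test_encode_is_not_empty(self):
        for tokenizer in self.get_tokenizers():
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                ids = tokenizer.encode("Hello world", add_special_tokens=False)
                self.assertGreater(len(ids), 0)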

n1t0 (Member) commented Oct 6, 2020

Great job! I'm not entirely up to date with everything in transformers, but this looks very nice and clean!

sgugger (Collaborator) left a comment:

Made my changes directly on the branch.
This is amazing work! The only comment I have left is that it would be nice to have some documentation of convert_slow_tokenizer.py. Also, if it needs some updates when adding a new model, it should be documented in the new model template, so that we or external contributors don't forget.

thomwolf (Member, Author) commented Oct 8, 2020

OK, yes, I'll add documentation. We will probably wait until there is clean documentation in tokenizers as well, so we can do proper cross-linking.

thomwolf changed the title from "[WIP] Adding Fast tokenizers for SentencePiece based tokenizers" to "Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer" on Oct 8, 2020
thomwolf merged commit 9aeacb5 into master on Oct 8, 2020
thomwolf deleted the fast-sentencepiece branch on October 8, 2020 at 09:32
for key, value in special_tokens_map.items():
    if isinstance(value, dict):

Contributor:

@thomwolf - this change currently breaks the RagTokenizer.
If one runs the slow test:

tests/test_modeling_rag.py::RagModelIntegrationTests::test_rag_sequence_generate_batch

and puts a breakpoint before convert_added_tokens, one can see why:

Previously a dict object corresponding e.g. to the BOS token, such as

'bos_token': {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}

would have been processed by the line

value = AddedToken(**value)

Now such a dict, because it has no __type attribute, is processed differently, which leads to errors later.
Not 100% sure how to solve it here. Do you have any good ideas, @LysandreJik @thomwolf @n1t0?
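
To make the reported behaviour change concrete, here is a minimal sketch (an illustrative reconstruction, not the actual transformers code); AddedToken comes from the tokenizers library, and the "__type" tag check mirrors the description above:

# Sketch of the behaviour change described above (illustrative reconstruction).
from tokenizers import AddedToken

bos_entry = {
    "content": "<s>", "single_word": False, "lstrip": False,
    "rstrip": False, "normalized": True,
}

def convert_old(value):
    # Old behaviour: any dict in special_tokens_map was expanded into an AddedToken.
    return AddedToken(**value) if isinstance(value, dict) else value

def convert_new(value):
    # New behaviour as described above: only dicts tagged with "__type" ==
    # "AddedToken" are converted; RagTokenizer's saved dicts lack that tag,
    # so they pass through unchanged and break later on.
    if isinstance(value, dict) and value.get("__type") == "AddedToken":
        value = {k: v for k, v in value.items() if k != "__type"}
        return AddedToken(**value)
    return value

print(type(convert_old(bos_entry)))  # AddedToken
print(type(convert_new(bos_entry)))  # plain dict -> errors downstream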

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (huggingface#7141)

* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenzier implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limite tokenizer warning to one occurence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Revert "Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (huggingface#7141)"

This reverts commit 324bd77.

Successfully merging this pull request may close these issues.

running dataset.map, it raises TypeError: can't pickle Tokenizer objects
6 participants