
Add M2M100TokenizerFast (+ convert_slow_tokenizer implementation) #25478

Closed
wants to merge 12 commits

Conversation

xenova
Contributor

@xenova xenova commented Aug 13, 2023

What does this PR do?

Adds fast tokenizer support for M2M100Tokenizer. I also added a convert_slow_tokenizer config to support generating a tokenizer.json file. For example, the generated tokenizer files for facebook/m2m100_418M can be found here.

The fast tokenizer format is needed for transformers.js (see issue and PR). This may or may not be super relevant (considering the models are quite old), but there was a feature request in transformers.js, and I needed to write this code to get it working - so I thought I might as well share it here.

I have run the tokenizer on a variety of test cases, and it passes each one. Additionally, it fixes a bug/inconsistency with the slow tokenizer (actually present in all sentencepiece tokenizers), where whitespace after special tokens is removed. To make the test cases pass, I basically hardcoded a fix for it (for now at least).
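
For illustration, here is a minimal sketch of the kind of check I used around special tokens (the input string is just an example; exact token outputs depend on the checkpoint):

from transformers import AutoTokenizer

# Slow (sentencepiece-based) tokenizer; the fast one is produced by the conversion shown below.
slow = AutoTokenizer.from_pretrained('facebook/m2m100_418M', use_fast=False)

# Whitespace directly after a special token is where the slow tokenizer
# and the converted fast tokenizer originally disagreed.
text = '</s> Hello world'
print(slow.tokenize(text))
print(slow(text).input_ids)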

Example conversion code

from transformers import convert_slow_tokenizer, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=False)
fast_tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer)
fast_tokenizer.save('tokenizer.json')

Example usage code

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=True)
assert tokenizer('Hello world').input_ids == fast_tokenizer('Hello world').input_ids

NOTE: To get facebook/m2m100_418M working, we need to remove the "tokenizer_file": null line from tokenizer_config.json.
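
A quick way to patch that (just a sketch; the local path is an assumption about where the checkpoint was downloaded):

import json

# Hypothetical local copy of facebook/m2m100_418M.
config_path = 'm2m100_418M/tokenizer_config.json'

with open(config_path) as f:
    config = json.load(f)

# Drop the null "tokenizer_file" entry mentioned above.
config.pop('tokenizer_file', None)

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2, ensure_ascii=False)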

Fixes # (issue)

TODO

  • Unit tests
  • Other config files + docs as done with NllbTokenizerFast
  • Check that all special tokens are added (including </s> and similar tokens)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@xenova xenova changed the title from "Add convert_slow_tokenizer config for M2M100Tokenizer" to "Add M2M100TokenizerFast (+ convert_slow_tokenizer implementation)" on Aug 13, 2023
xenova added a commit to xenova/transformers.js that referenced this pull request Aug 14, 2023
* Add `M2M100Tokenizer`

* Allow `added_tokens` list to be empty

* Apply hot-fix for issue in HF's `M2M100Tokenizer`

* Skip M2M100 tokenizer tests for now

TODO: Remove when huggingface/transformers#25478 is merged

* Fix `_build_translation_inputs` for `M2M100Tokenizer`

* Add example code in JSDoc for `TranslationPipeline`

* Update supported_models.py
@ArthurZucker
Collaborator

Feel free to ping me when you need a review!

@xenova
Contributor Author

xenova commented Aug 19, 2023

@ArthurZucker Hey! 👋 I think this is ready for your review. The tokenizer is quite similar to NLLB, which was added here and then updated by you here.

I haven't added a tokenizer before, but one thing I am missing is a unit test for hf-internal-testing/tiny-random-m2m_100; this is due to the missing tokenizer.json file.

Here is a tokenizer file generated from the code above, then updated by hand to work correctly. So that functionality (i.e., added_tokens) is still missing from the conversion script, but I'm not sure what the best way to fix it is. Would appreciate some help here too 😄
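
For reference, this is roughly the kind of manual patching I mean, done directly with the tokenizers library (sketch only - the token list is abbreviated, and exactly which tokens need re-adding is the open question):

from tokenizers import Tokenizer

# Load the tokenizer.json produced by convert_slow_tokenizer.
tok = Tokenizer.from_file('tokenizer.json')

# Re-register the tokens that the conversion currently drops, e.g. the
# M2M100 language codes (abbreviated here for illustration).
tok.add_special_tokens(['__en__', '__fr__', '__de__'])

tok.save('tokenizer.json')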

Collaborator

@ArthurZucker ArthurZucker left a comment

Hey! Looks good already, could you give me more details on how you converted the model? The tests should also be updated I think (apply integration tests to both tokenizers!)

with open(self.original_vocab_file) as fp:
vocab = json.load(fp)

vocab = list(sorted(vocab.items(), key=lambda x: x[1]))
Collaborator

If <s>, <unk>, etc. are not part of the original_vocab_file, then you need to add them here at the beginning.
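
Roughly along these lines, just as a sketch of what I mean (the token set and ordering here are assumptions based on the usual fairseq convention):

import json

# Assumed local path to the slow tokenizer's vocab file.
original_vocab_file = "vocab.json"

with open(original_vocab_file) as fp:
    vocab = json.load(fp)

# Prepend the core special tokens if they are missing, shifting the
# remaining ids so they stay contiguous.
for token in reversed(["<s>", "<pad>", "</s>", "<unk>"]):
    if token not in vocab:
        vocab = {token: 0, **{k: v + 1 for k, v in vocab.items()}}

vocab = sorted(vocab.items(), key=lambda x: x[1])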

Contributor Author

They are in https://huggingface.co/facebook/m2m100_418M/raw/main/vocab.json 👀, so I think they should already be included here, no?

[t for t in additional_special_tokens if t not in _additional_special_tokens]
)

self.add_special_tokens({"additional_special_tokens": _additional_special_tokens})
Collaborator

This should be adding the tokens to the list of special tokens; the problem is that, for fast tokenizers, if the tokens are already part of the vocabulary, they will not be added again. For Llama we used:

tokenizer.add_special_tokens(
    [
        AddedToken("<unk>"),
        AddedToken("<s>"),
        AddedToken("</s>"),
    ]
)

in the convert_slow_tokenizer.py script.

Did you convert the slow tokenizer using from_slow=True?
(I am not really surprised by this bug)

Contributor Author

Did you convert the tokenizer slow using from_slow=True?

I didn't add any parameters other than the ones in the original comment 😅 ... Is this needed, considering there is no fast tokenizer to convert from?
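
For context, my understanding is that it would be used like this once the fast class is registered (sketch only):

from transformers import AutoTokenizer

# Force the fast tokenizer to be rebuilt from the slow sentencepiece
# tokenizer via the converter, instead of loading an existing tokenizer.json.
fast_tokenizer = AutoTokenizer.from_pretrained(
    'facebook/m2m100_418M', use_fast=True, from_slow=True
)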

self._src_lang = new_src_lang
self.set_src_lang_special_tokens(self._src_lang)

def build_inputs_with_special_tokens(
Collaborator

Missing copied from statements here and wherever you copied from!

Contributor Author

@xenova xenova Aug 21, 2023

Oh yes, my bad! Will add that :)

It was from NllbTokenizerFast

class NllbTokenizerFast(PreTrainedTokenizerFast):

@xenova
Contributor Author

xenova commented Aug 21, 2023

Looks good already, could you give me more details on how you converted the model?

I just used Optimum to convert the model to ONNX, but this shouldn't matter for this PR (which only concerns the tokenizer).

@ArthurZucker
Collaborator

Sorry, by model I meant the tokenizer model (the backend sentencepiece model, if you will). I am trying to understand why you had to manually add the added tokens; from_slow is used to initialize a fast tokenizer from a slow tokenizer via the conversion method!

@xenova
Contributor Author

xenova commented Aug 22, 2023

Sorry by model I meant the tokenizer model (the backend sentencepiece model if you will).

Oh yes of course 😆 (silly me). I converted it with this code:

from transformers import convert_slow_tokenizer, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=False)
fast_tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer)
fast_tokenizer.save('tokenizer.json')

I am trying to understand why you had to manually add the added tokens

I'm not 100% sure. I just ran some unit tests for transformers.js and found that it failed (after inspecting, I found that it was just missing added tokens). I'll play around with a few more things today!

@xenova
Contributor Author

xenova commented Aug 22, 2023

@ArthurZucker I added the "copied from" annotations and also added the special tokens (as is done for Llama). On the latter point, the tokenizer includes <madeupwordX> tokens (for some reason) - should these be classified as special tokens?

Here's the output from running the above code (zipped because GH doesn't like JSON 🤷)
m2m.zip

@ArthurZucker
Collaborator

Regarding the madeup words, it depends; I would just say let's follow what's done for the slow tokenizer!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@xenova xenova marked this pull request as draft September 16, 2023 15:36
@github-actions github-actions bot closed this Sep 25, 2023
@LysandreJik
Member

Should this be reopened, @xenova? :)

@xenova
Contributor Author

xenova commented Sep 25, 2023

I marked it as a draft since I have been quite busy on some more important things, with the idea that I would return to it eventually once I had more time 😅. It's basically just missing some unit tests and example tokenizer files.
