
Add M2M100TokenizerFast (+ convert_slow_tokenizer implementation) #25478

Closed
wants to merge 12 commits

Conversation

xenova
Contributor

@xenova xenova commented Aug 13, 2023

What does this PR do?

Adds fast tokenizer support for M2M100Tokenizer. I also added a convert_slow_tokenizer config to support generating a tokenizer.json file. For example, the generated tokenizer files for facebook/m2m100_418M can be found here.

The fast tokenizer format is needed for transformers.js (see issue and PR). This may or may not be super relevant (considering the models are quite old), but there was a feature request in transformers.js, and I needed to write this code to get it working - so I thought I might as well share it here.

I have run the tokenizer on a variety of test cases, and it passes each one. Additionally, it fixes a bug/inconsistency with the slow tokenizer (actually present in all sentencepiece tokenizers), where whitespace after special tokens is removed. To make the test cases pass, I basically hardcoded a fix for it (for now at least).
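
For illustration, here is a minimal sketch of the kind of check I used around special tokens (the input string is just an example; exact token outputs depend on the checkpoint):

from transformers import AutoTokenizer

# Slow (sentencepiece-based) tokenizer; the fast one is produced by the conversion shown below.
slow = AutoTokenizer.from_pretrained('facebook/m2m100_418M', use_fast=False)

# Whitespace directly after a special token is where the slow tokenizer
# and the converted fast tokenizer originally disagreed.
text = '</s> Hello world'
print(slow.tokenize(text))
print(slow(text).input_ids)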

Example conversion code

from transformers import convert_slow_tokenizer, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=False)
fast_tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer)
fast_tokenizer.save('tokenizer.json')

Example usage code

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=True)
assert tokenizer('Hello world').input_ids == fast_tokenizer('Hello world').input_ids

NOTE: To get facebook/m2m100_418M working, we need to remove the "tokenizer_file": null line from tokenizer_config.json.
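
A quick way to patch that (just a sketch; the local path is an assumption about where the checkpoint was downloaded):

import json

# Hypothetical local copy of facebook/m2m100_418M.
config_path = 'm2m100_418M/tokenizer_config.json'

with open(config_path) as f:
    config = json.load(f)

# Drop the null "tokenizer_file" entry mentioned above.
config.pop('tokenizer_file', None)

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2, ensure_ascii=False)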

Fixes # (issue)

TODO

  • Unit tests
  • Other config files + docs as done with NllbTokenizerFast
  • Check that all special tokens are added (including </s> and similar tokens)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@xenova xenova changed the title from "Add convert_slow_tokenizer config for M2M100Tokenizer" to "Add M2M100TokenizerFast (+ convert_slow_tokenizer implementation)" on Aug 13, 2023
xenova added a commit to xenova/transformers.js that referenced this pull request Aug 14, 2023
* Add `M2M100Tokenizer`

* Allow `added_tokens` list to be empty

* Apply hot-fix for issue in HF's `M2M100Tokenizer`

* Skip M2M100 tokenizer tests for now

TODO: Remove when huggingface/transformers#25478 is merged

* Fix `_build_translation_inputs` for `M2M100Tokenizer`

* Add example code in JSDoc for `TranslationPipeline`

* Update supported_models.py
@ArthurZucker
Collaborator

Feel free to ping me when you need a review!

@xenova
Contributor Author

xenova commented Aug 19, 2023

@ArthurZucker Hey! 👋 I think this is ready for your review. The tokenizer is quite similar to NLLB, which was added here and then updated by you here.

I haven't added a tokenizer before, but one thing I am missing is a unit test for hf-internal-testing/tiny-random-m2m_100; this is due to the missing tokenizer.json file.

Here is a tokenizer file generated from the code above, then updated by hand to work correctly. So that functionality (i.e., added_tokens) is still missing from the conversion script, but I'm not sure what the best way to fix it is. Would appreciate some help here too 😄
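
For reference, this is roughly the kind of manual patching I mean, done directly with the tokenizers library (sketch only - the token list is abbreviated, and exactly which tokens need re-adding is the open question):

from tokenizers import Tokenizer

# Load the tokenizer.json produced by convert_slow_tokenizer.
tok = Tokenizer.from_file('tokenizer.json')

# Re-register the tokens that the conversion currently drops, e.g. the
# M2M100 language codes (abbreviated here for illustration).
tok.add_special_tokens(['__en__', '__fr__', '__de__'])

tok.save('tokenizer.json')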

Collaborator

@ArthurZucker ArthurZucker left a comment

Hey! Looks good already, could you give me more details on how you converted the model? The tests should also be updated I think (apply integration tests to both tokenizers!)

with open(self.original_vocab_file) as fp:
vocab = json.load(fp)

vocab = list(sorted(vocab.items(), key=lambda x: x[1]))
Collaborator

If <s>, <unk>, etc. are not part of the original_vocab_file, then you need to add them here at the beginning.
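
Roughly along these lines, just as a sketch of what I mean (the token set and ordering here are assumptions based on the usual fairseq convention):

import json

# Assumed local path to the slow tokenizer's vocab file.
original_vocab_file = "vocab.json"

with open(original_vocab_file) as fp:
    vocab = json.load(fp)

# Prepend the core special tokens if they are missing, shifting the
# remaining ids so they stay contiguous.
for token in reversed(["<s>", "<pad>", "</s>", "<unk>"]):
    if token not in vocab:
        vocab = {token: 0, **{k: v + 1 for k, v in vocab.items()}}

vocab = sorted(vocab.items(), key=lambda x: x[1])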

Contributor Author

They are in https://huggingface.co/facebook/m2m100_418M/raw/main/vocab.json 👀, so I think they should already be included here, no?

[t for t in additional_special_tokens if t not in _additional_special_tokens]
)

self.add_special_tokens({"additional_special_tokens": _additional_special_tokens})
Collaborator

This should be adding the tokens to the list of special tokens; the problem is that, for fast tokenizers, if the tokens are already part of the vocabulary, they will not be added again. For Llama we used:

tokenizer.add_special_tokens(
    [
        AddedToken("<unk>"),
        AddedToken("<s>"),
        AddedToken("</s>"),
    ]
)

in the convert_slow_tokenizer.py script.

Did you convert the slow tokenizer using from_slow=True?
(I am not really surprised by this bug)

Contributor Author

Did you convert the tokenizer slow using from_slow=True?

I didn't add any parameters other than the ones in the original comment 😅 ... Is this needed, considering there is no fast tokenizer to convert from?
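
For context, my understanding is that it would be used like this once the fast class is registered (sketch only):

from transformers import AutoTokenizer

# Force the fast tokenizer to be rebuilt from the slow sentencepiece
# tokenizer via the converter, instead of loading an existing tokenizer.json.
fast_tokenizer = AutoTokenizer.from_pretrained(
    'facebook/m2m100_418M', use_fast=True, from_slow=True
)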

self._src_lang = new_src_lang
self.set_src_lang_special_tokens(self._src_lang)

def build_inputs_with_special_tokens(
Collaborator

Missing copied from statements here and wherever you copied from!

Contributor Author

@xenova xenova Aug 21, 2023

Oh yes, my bad! Will add that :)

It was from NllbTokenizerFast

class NllbTokenizerFast(PreTrainedTokenizerFast):

@xenova
Contributor Author

xenova commented Aug 21, 2023

Looks good already, could you give me more details on how you converted the model?

I just used Optimum to convert the model to ONNX, but this shouldn't matter for this PR (which only concerns the tokenizer).

@ArthurZucker
Collaborator

Sorry, by model I meant the tokenizer model (the backend sentencepiece model, if you will). I am trying to understand why you had to manually add the added tokens; from_slow is used to initialize a fast tokenizer from a slow tokenizer via the conversion method!

@xenova
Contributor Author

xenova commented Aug 22, 2023

Sorry by model I meant the tokenizer model (the backend sentencepiece model if you will).

Oh yes of course 😆 (silly me). I converted it with this code:

from transformers import convert_slow_tokenizer, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=False)
fast_tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer)
fast_tokenizer.save('tokenizer.json')

I am trying to understand why you had to manually add the added tokens

I'm not 100% sure. I just ran some unit tests for transformers.js and found that it failed (after inspecting, I found that it was just missing added tokens). I'll play around with a few more things today!

@xenova
Contributor Author

xenova commented Aug 22, 2023

@ArthurZucker I added the "copied from" annotations and also added the special tokens (as is done for Llama). On the latter point, the tokenizer includes <madeupwordX> tokens (for some reason) - should these be classified as special tokens?

Here's the output from running the above code (zipped because GH doesn't like JSON 🤷)
m2m.zip

@ArthurZucker
Collaborator

Regarding the madeup words, it depends; I would just say let's follow what's done for the slow tokenizer!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@xenova xenova marked this pull request as draft September 16, 2023 15:36
@github-actions github-actions bot closed this Sep 25, 2023
@LysandreJik
Member

Should this be reopened, @xenova? :)

@xenova
Contributor Author

xenova commented Sep 25, 2023

I marked it as a draft since I have been quite busy on some more important things, with the idea that I would return to it eventually once I had more time 😅. It's basically just missing some unit tests and example tokenizer files.
