Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer #7141
Conversation
Ready for review, the remaining failing tests should be ok after the next
Great work, incredible to now have support for all tokenizers for which it is possible!
@@ -0,0 +1,546 @@
from typing import Dict, List, Tuple
Should add the copyright here
I don't exactly understand what this script does. Does it convert from original implementations to ours, or from our slow implementations to our fast ones? Some docstrings would be very welcome!
Indeed, I'll add more doc
This file contains utilities to convert slow tokenizers into their fast tokenizer counterparts.
All the conversions are grouped here to gather the SentencePiece dependencies outside of the fast tokenizer files, which lets us make our dependency on SentencePiece optional.
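To make the conversion-utility design above concrete, here is a minimal, dependency-free sketch of the pattern: one converter class per slow tokenizer, plus a dispatch table keyed by class name. The `DummySlowTokenizer` and `DummyConverter` names are illustrative stand-ins, not the actual transformers implementation (a real converter would build a Rust-backed `tokenizers` object rather than return a vocab dict).

```python
class Converter:
    """Base class: wraps a slow tokenizer and builds its fast counterpart."""

    def __init__(self, original_tokenizer):
        self.original_tokenizer = original_tokenizer

    def converted(self):
        raise NotImplementedError


class DummySlowTokenizer:
    # Stand-in for a real slow tokenizer with a vocabulary.
    vocab = {"hello": 0, "world": 1}


class DummyConverter(Converter):
    def converted(self):
        # A real converter would construct a fast (Rust-backed) tokenizer
        # here; we just return the vocab to keep the sketch self-contained.
        return dict(self.original_tokenizer.vocab)


# Dispatch table: slow tokenizer class name -> converter class.
SLOW_TO_FAST_CONVERTERS = {"DummySlowTokenizer": DummyConverter}


def convert_slow_tokenizer(tokenizer):
    """Look up the right converter for a slow tokenizer and run it."""
    converter_class = SLOW_TO_FAST_CONVERTERS[tokenizer.__class__.__name__]
    return converter_class(tokenizer).converted()
```

Keeping all converters behind one dispatch table is what allows the SentencePiece import to live in a single module, so the rest of the fast-tokenizer code never touches it.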
# def test_swap_special_token(self):
#     tokenizers = self.get_tokenizers(do_lower_case=False)
#     for tokenizer in tokenizers:
#         with self.subTest(f"{tokenizer.__class__.__name__}"):
#             # Our mask token
#             mask = "<mask>"
#             # We take a single word in the middle of the vocabulary
#             all_tokens = sorted(tokenizer.get_vocab().keys())
#             word = tokenizer.decode(tokenizer.encode(all_tokens[len(all_tokens)//2], add_special_tokens=False)[:1])
#             sequence_0 = "Encode " + word + " sequence"
Why is this test removed?
It is just too mind-bending to make it work in the general setting of tokenizers with arbitrary vocabularies, and I don't think it's a useful test in the end.
@@ -1,23 +1,52 @@
import logging
Needs copyright here.
The diff for this file is slightly complicated to read. Did you wrap the tests in subtests, iterating over every tokenizer? Is that better than using a mixin like we do in other test classes?
I kinda kept the original setup made by @mfuntowicz even though I agree switching to a mixin would probably be easier to read in the end.
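For readers unfamiliar with the pattern being discussed, here is a minimal sketch of the subtest approach: one test method iterating over several tokenizer instances, with `self.subTest` reporting each one's pass/fail separately. The class and the `str.lower`/`str.upper` stand-ins are illustrative, not the actual test suite.

```python
import unittest


class TokenizerCommonTest(unittest.TestCase):
    def get_tokenizers(self):
        # Stand-ins for a list of real slow/fast tokenizer instances.
        return [str.lower, str.upper]

    def test_tokenize_not_empty(self):
        for tokenizer in self.get_tokenizers():
            # Each iteration gets its own reported result, so one failing
            # tokenizer does not mask failures in the others.
            with self.subTest(tokenizer.__name__):
                self.assertTrue(tokenizer("Hello"))
```

The mixin alternative would instead define the shared tests once in a base class and create one concrete test class per tokenizer, which makes individual failures easier to select and re-run.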
Great job! I'm not entirely up to date with everything in transformers, but this looks very nice and clean!
Made my changes directly on the branch.
This is amazing work! The only comment I have left is that it would be nice to have some documentation of convert_slow_tokenizer.py. Also, if it needs some updates when adding a new model, that should be documented in the new model template, so that we or external contributors don't forget.
Ok, yes I'll add documentation. We will probably wait to have a clean documentation in
for key, value in special_tokens_map.items():
    if isinstance(value, dict):
@thomwolf - this change currently breaks the RagTokenizer.
If one runs the slow test:
tests/test_modeling_rag.py::RagModelIntegrationTests::test_rag_sequence_generate_batch
and puts a breakpoint before convert_added_tokens, one can see why:
Previously a dict object corresponding e.g. to the BOS token, such as
'bos_token': {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}
would have been processed by the line
value = AddedToken(**value)
Now such a dict, because it has no __type attribute, is processed differently, which leads to errors later.
Not 100% sure how to solve it here. Do you have any good ideas: @LysandreJik @thomwolf @n1t0 ?
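To illustrate the breakage described above, here is a simplified, self-contained sketch of the two deserialization behaviors. `AddedToken` here is a stand-in for the real `tokenizers.AddedToken`, and the two converter functions are illustrative, not the actual transformers code.

```python
class AddedToken:
    """Minimal stand-in for tokenizers.AddedToken."""

    def __init__(self, content, single_word=False, lstrip=False,
                 rstrip=False, normalized=True):
        self.content = content
        self.single_word = single_word
        self.lstrip = lstrip
        self.rstrip = rstrip
        self.normalized = normalized


def convert_added_tokens_old(value):
    # Old behavior: any dict in special_tokens_map is rebuilt as an AddedToken.
    return AddedToken(**value) if isinstance(value, dict) else value


def convert_added_tokens_new(value):
    # New behavior: only dicts explicitly tagged with "__type" are converted;
    # an untagged dict (like RagTokenizer's serialized bos_token) falls
    # through unchanged, which is what leads to errors later.
    if isinstance(value, dict) and value.get("__type") == "AddedToken":
        value = dict(value)
        value.pop("__type")
        return AddedToken(**value)
    return value


# The serialized BOS entry from the comment above:
bos = {"content": "<s>", "single_word": False, "lstrip": False,
       "rstrip": False, "normalized": True}
```

With `bos` as input, the old path returns an `AddedToken` while the new path returns the raw dict, so any downstream code expecting an `AddedToken` breaks.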
… remove Transfo-XL fast tokenizer (huggingface#7141)

* [WIP] SP tokenizers
* fixing tests for T5
* WIP tokenizers
* serialization
* update T5
* WIP T5 tokenization
* slow to fast conversion script
* Refactoring to move tokenzier implementations inside transformers
* Adding gpt - refactoring - quality
* WIP adding several tokenizers to the fast world
* WIP Roberta - moving implementations
* update to dev4 switch file loading to in-memory loading
* Updating and fixing
* advancing on the tokenizers - updating do_lower_case
* style and quality
* moving forward with tokenizers conversion and tests
* MBart, T5
* dumping the fast version of transformer XL
* Adding to autotokenizers + style/quality
* update init and space_between_special_tokens
* style and quality
* bump up tokenizers version
* add protobuf
* fix pickle Bert JP with Mecab
* fix newly added tokenizers
* style and quality
* fix bert japanese
* fix funnel
* limite tokenizer warning to one occurence
* clean up file
* fix new tokenizers
* fast tokenizers deep tests
* WIP adding all the special fast tests on the new fast tokenizers
* quick fix
* adding more fast tokenizers in the fast tests
* all tokenizers in fast version tested
* Adding BertGenerationFast
* bump up setup.py for CI
* remove BertGenerationFast (too early)
* bump up tokenizers version
* Clean old docstrings
* Typo
* Update following Lysandre comments

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
…reaking: remove Transfo-XL fast tokenizer (huggingface#7141)" This reverts commit 324bd77.
This pull request adds the "fast" Rust tokenizers for the SentencePiece-based tokenizers as well.
Based on the unreleased v0.9.0 of tokenizers.
Tokenizers:
Breaking:
Remaining tokenizers without Fast implementations (no fast tokenizers expected in the short/mid-term):
Other fixes: