Integrate fast tokenizers library inside transformers #2674
Conversation
Only took a superficial look, but looks very clean 👍 Excited to use fast tokenizers by default!
Current CI issues are real and "normal": we need to release the next version of the tokenizers lib, which will bring in all the dependencies.
Force-pushed from aed66b0 to 66cd67e (compare)
def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
    return super().encode_batch(
        [seq.strip() if isinstance(seq, str) else (seq[0].strip(), seq[1].strip()) for seq in sequences]
This should probably be an additional Normalizer. It would also let us keep track of the offsets. What do you think?
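For illustration, a minimal sketch of what that could look like, assuming the constructor-style normalizers API of later tokenizers releases rather than the 0.x bindings pinned in this PR:

from tokenizers import Tokenizer, models, normalizers

# Hypothetical sketch: move the stripping into the tokenizer's normalizer so
# the offsets keep tracking the original, unstripped input. The empty BPE
# model is only a stand-in for whichever model the tokenizer actually uses.
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Strip(),      # replaces the Python-side seq.strip() above
    normalizers.Lowercase(),  # plus whatever normalization the model already applies
])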
Yep, makes sense 👍
Or maybe this deserves a 0.4.3
haha
  if return_special_tokens_mask:
-     encoding_dict["special_tokens_mask"] = encoding.special_tokens_mask
+     encoding_dict["special_tokens_mask"] = [e.special_tokens_mask for e in encodings]
  if return_offsets_mapping:
If we don't give access to the normalized string somehow, we should maybe provide offsets to the original string here. Wdyt?
Hmm, currently the offsets are given w.r.t. the normalized string? If that's the case, then yes, we may want to provide offsets in the original string, or expose a utility method doing the mapping in Python.
Is it something we can easily expose on Encoding?
Yes, offsets are related to the normalized string. You can retrieve the original offsets by doing encoding.original_str.offsets[encoding.offsets[X]]
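A tiny, purely illustrative loop over what that expression gives you, assuming tokenizer is one of the fast tokenizers discussed here and the original_str/offsets API quoted just above:

encoding = tokenizer.encode("Héllo   world")

for i, token in enumerate(encoding.tokens):
    normalized_span = encoding.offsets[i]
    # Map the span in the normalized string back onto the original string,
    # using the indexing shown in the comment above.
    original_span = encoding.original_str.offsets[normalized_span]
    print(token, normalized_span, original_span)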
I haven't followed the discussion very closely, but shouldn't offsets be returned based on the original string by default?
As an end user I don't think I really care about the normalized (internal) version of my input.
Thoughts?
I tend to agree with @julien-c, the normalized string is more of an internal representation.
Same for me, we should probably default to the original string.
Force-pushed from 8c70bc6 to ef42cf5 (compare)
Codecov Report
@@            Coverage Diff            @@
##           master    #2674     +/-   ##
=========================================
+ Coverage      75%    75.3%   +0.29%
=========================================
  Files          94       94
  Lines       15288    15424     +136
=========================================
+ Hits        11467    11615     +148
+ Misses       3821     3809      -12
Continue to review full report at Codecov.
Force-pushed from b6cad60 to b9305c0 (compare)
Looks really good! Great job @mfuntowicz!
Force-pushed from 88ffca6 to 63660c4 (compare)
Great work @mfuntowicz!
setup.py (Outdated)
@@ -89,7 +89,7 @@
      packages=find_packages("src"),
      install_requires=[
          "numpy",
-         "tokenizers == 0.0.11",
+         "tokenizers == 0.4.2",
>= ?
As we don't have many unit tests on the tokenizers Python bindings for now, I would tend to stick to a specific version that is tested on the CI. Otherwise, releasing new tokenizers versions might introduce flaky tests.
Well, given that we are still introducing breaking changes pretty often in tokenizers, I would strongly advise against that.
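For reference only, a bounded range is a common middle ground between an exact pin and an open >=; a hypothetical variant of the line above could look like this:

install_requires=[
    "numpy",
    # Accept bug-fix releases of the tested series, but not the next
    # (potentially breaking) minor version.
    "tokenizers >= 0.4.2, < 0.5.0",
],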
  # Prepare inputs as tensors if asked
  if return_tensors == "tf" and is_tf_available():
-     encoding_dict["input_ids"] = tf.constant([encoding_dict["input_ids"]])
+     encoding_dict["input_ids"] = tf.constant(encoding_dict["input_ids"])
Ok nice, so this will be a tensor with a "batch dimension" equal to the number of encodings when they are split into overflowing tokens. I like this solution, it's clean. We should document this behavior though.
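For the documentation, a hedged usage sketch of the behavior described above; the class name, checkpoint, and exact argument set are assumptions based on this PR, and only the leading dimension of the returned tensor is the point:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
long_text = "hello world " * 1000  # deliberately longer than max_length

enc = tokenizer.encode_plus(
    long_text,
    max_length=512,
    return_overflowing_tokens=True,
    return_tensors="tf",
)
# input_ids now carries a "batch" dimension equal to the number of overflowing
# chunks, e.g. shape (num_chunks, 512) instead of (1, 512).
print(enc["input_ids"].shape)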
@@ -213,3 +222,91 @@ def save_vocabulary(self, save_directory):
            index += 1

        return vocab_file, merge_file


class _OpenAIGPTCharBPETokenizer(BaseTokenizer):
Why do we have to have this class here? Don't we have an implementation of char-level BPE in tokenizers now?
Here: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/implementations/char_level_bpe.py#L9
We do need a special OpenAI GPT implementation because it slightly differs from the char-level BPE we have in tokenizers (see the sketch below):
- the Normalizer is the same as for Bert (BertNormalizer)
- the PreTokenizer is not Whitespace, it's the same as for Bert (BertPreTokenizer)
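A rough, hypothetical sketch of that customization using the building blocks shown elsewhere in this PR; vocab_file/merges_file and the .new()/.from_files() constructors are assumed from the snippets above, and the actual _OpenAIGPTCharBPETokenizer may differ:

import tokenizers as tk

# Hypothetical sketch: a char-level BPE model, but with the Bert-style
# normalization/pre-tokenization instead of a plain Whitespace split.
tokenizer = tk.Tokenizer(tk.models.BPE.from_files(vocab_file, merges_file))
tokenizer.with_pre_tokenizer(
    tk.pre_tokenizers.BertPreTokenizer.new(
        do_basic_tokenize=True,
        do_lower_case=True,
        tokenize_chinese_chars=True,
        never_split=[],
    )
)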
If we put TransformerXL into tokenizers.implementations, maybe this one can make its way into tokenizers too. cc @n1t0
Honestly, I'm not too sure about this. I think tokenizers should stay a library with some generic implementations, with an easy way for everybody to build their own custom tokenizer when needed. So I'd like to avoid introducing specific implementations for each new model/tokenizer. Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation, and then we'll have as many implementations as there are models in transformers... I think it makes more sense to have specific customization details in transformers, next to the model that actually uses the custom tokenizer.
> Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation

You mean have all the things that made the success of transformers? 😜
Just kidding. Well, OK for me to keep these in transformers then.
@@ -280,6 +293,108 @@ def _tokenize(self, line, add_eos=False, add_double_eos=False):
        return symbols


class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
Can we move this class upstream in tokenizers now that we have a word-level model? It would be more consistent.
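For context, a hypothetical sketch of what a word-level, delimiter-based tokenizer could look like with the same kind of building blocks; WordLevel.from_files and CharDelimiterSplit are assumed here following the style of the other snippets, and the real _TransfoXLDelimiterLookupTokenizer may differ:

import tokenizers as tk

# Hypothetical sketch: pure vocabulary lookup (word-level model) applied
# after splitting the input on a single delimiter character.
tokenizer = tk.Tokenizer(tk.models.WordLevel.from_files(vocab_file, unk_token="<unk>"))
tokenizer.with_pre_tokenizer(tk.pre_tokenizers.CharDelimiterSplit.new(" "))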
cc @n1t0, what do you think? I can put the content of this into tokenizers.implementations.
cf. the comment for OpenAIGPTCharBPETokenizer above.
…e on masked input. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…tences_pair to PreTrainedTokenizerFast + tests. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…ist object. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…vior. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…e current tokenizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…rategy + unittest. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…tensors="..." Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…h axis. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…meters in kwargs Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…g_tokens is True. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…tensor. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Force-pushed from 330b068 to 3342897 (compare)
…ce max) Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
This is really cool, great job @mfuntowicz !
        self._tokenizer = tk.Tokenizer(tk.models.WordPiece.from_files(vocab_file, unk_token=unk_token))
        self._update_special_tokens()
        self._tokenizer.with_pre_tokenizer(
            tk.pre_tokenizers.BertPreTokenizer.new(
                do_basic_tokenize=do_basic_tokenize,
                do_lower_case=do_lower_case,
                tokenize_chinese_chars=tokenize_chinese_chars,
                never_split=never_split if never_split is not None else [],
            )
        )
        self._tokenizer.with_decoder(tk.decoders.WordPiece.new())

        if add_special_tokens:
            self._tokenizer.with_post_processor(
                tk.processors.BertProcessing.new(
                    (sep_token, self._tokenizer.token_to_id(sep_token)),
                    (cls_token, self._tokenizer.token_to_id(cls_token)),
                )
            )
        if max_length is not None:
            self._tokenizer.with_truncation(max_length, stride=stride, strategy=truncation_strategy)
        self._tokenizer.with_padding(
            max_length=max_length if pad_to_max_length else None,
            direction=self.padding_side,
            pad_id=self.pad_token_id,
            pad_type_id=self.pad_token_type_id,
            pad_token=self.pad_token,
        )
        self._decoder = tk.decoders.WordPiece.new()
This is very satisfying
        else:
            return max(0, len(self.encode(self.mask_token or "")) - 1)

    def tokenize(self, text, **kwargs):
This seems like it would take too much memory before being useful, as you really never have two sequences of 512/1024 tokens that are exactly the same.
Integrate the BPE-based tokenizers inside transformers.

Added priority for tokenizers with a fast implementation in AutoTokenizer. This is done through a new mapping, (name: class) -> (name: Tuple[class, class]), which holds both the Python and the Rust implementation classes; if no Rust implementation is available, it is simply set to None. AutoTokenizer will try to pick the Rust class if it is not None, otherwise it defaults to the Python one (see the sketch below).

Added some matching tests which basically check that a very high percentage of tokens match element-wise between the two implementations. The threshold is arbitrarily set to 0.05 (5%), i.e. at most 5% of differences between Python and Rust.

Added a return_offsets_mapping=False parameter to the encoding methods, which returns the offset mapping when using a Rust tokenizer. When using a Python tokenizer, a warning message is displayed through the module logger and the argument is discarded.
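A minimal sketch of the selection logic described above; the mapping name and the class names are stand-ins for illustration, not the exact identifiers added in this PR:

# Hypothetical mapping: model name -> (Python tokenizer class, Rust-backed class or None).
TOKENIZER_MAPPING = {
    "bert": ("BertTokenizer", "BertTokenizerFast"),
    "xlm": ("XLMTokenizer", None),  # no Rust implementation available yet
}

def pick_tokenizer_class(name):
    """Prefer the fast (Rust) implementation when available, else fall back to Python."""
    python_cls, rust_cls = TOKENIZER_MAPPING[name]
    return rust_cls if rust_cls is not None else python_cls

assert pick_tokenizer_class("bert") == "BertTokenizerFast"
assert pick_tokenizer_class("xlm") == "XLMTokenizer"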