Integrate fast tokenizers library inside transformers #2674

Merged: 76 commits, Feb 19, 2020
Conversation

@mfuntowicz (Member) commented on Jan 29, 2020:

Integrate the BPE-based tokenizers inside transformers.

  • Bert (100% match)
  • DistilBert (100% match)
  • OpenAI GPT (100% match)
  • GPT2 (100% match if no trailing \n)
  • Roberta (100% match if no trailing \n)
  • TransformerXL
  • CTRL (No binding will be provided).

Added priority for tokenizers with a fast implementation in AutoTokenizer. This is done through a new mapping (name: class) -> (name: Tuple[class, class]) which holds both the Python and the Rust implementation classes; if no Rust implementation is available, the second entry is simply set to None. AutoTokenizer will pick the Rust class if it is not None, otherwise it falls back to the Python one.
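For illustration, a minimal sketch of that dispatch logic; TOKENIZER_CLASSES and pick_tokenizer_class are illustrative names, not the identifiers used in the PR:

# Hedged sketch only: the mapping and helper names are hypothetical.
from typing import Dict, Optional, Tuple, Type

from transformers import (
    BertTokenizer, BertTokenizerFast,
    GPT2Tokenizer, GPT2TokenizerFast,
    CTRLTokenizer,
)

TOKENIZER_CLASSES: Dict[str, Tuple[Type, Optional[Type]]] = {
    "bert": (BertTokenizer, BertTokenizerFast),
    "gpt2": (GPT2Tokenizer, GPT2TokenizerFast),
    "ctrl": (CTRLTokenizer, None),  # no Rust binding provided
}

def pick_tokenizer_class(name: str) -> Type:
    python_cls, rust_cls = TOKENIZER_CLASSES[name]
    # Prefer the fast (Rust-backed) class when one exists, else fall back to Python.
    return rust_cls if rust_cls is not None else python_cls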

Added some matching tests which basically check that a very high percentage of tokens match element-wise. The tolerance is set arbitrarily to 0.05 (5%) [i.e. at most 5% of tokens may differ between the Python and the Rust implementations].
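As an illustration of that check (a sketch only; the helper name and assertion style are not taken from the PR's test code):

from typing import List

def assert_tokenizations_mostly_match(
    python_tokens: List[str], rust_tokens: List[str], max_diff_ratio: float = 0.05
) -> None:
    # Compare tokens element-wise and require that at most 5% of them differ.
    assert len(python_tokens) == len(rust_tokens), "token sequences must be the same length"
    differences = sum(p != r for p, r in zip(python_tokens, rust_tokens))
    assert differences / max(len(python_tokens), 1) <= max_diff_ratio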

Added a return_offsets_mapping=False parameter to the encoding methods, which returns the offset mapping when a Rust tokenizer is used. When a Python tokenizer is used, a warning message is emitted through the module logger and the argument is discarded.
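A usage sketch, assuming a recent transformers release where AutoTokenizer returns a fast tokenizer for bert-base-uncased:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello world", return_offsets_mapping=True)
# Each entry is a (start, end) character span; special tokens get (0, 0).
print(encoded["offset_mapping"])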

@julien-c (Member):

only took a superficial look, but looks very clean 👍

Excited to use fast tokenizers by default!

@mfuntowicz (Author):

The current CI issues are real and "normal": we need to release the next version of the tokenizers lib, which will bring in all the dependencies.

Resolved review thread on src/transformers/tokenization_utils.py (outdated).

def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
    return super().encode_batch(
        [seq.strip() if isinstance(seq, str) else (seq[0].strip(), seq[1].strip()) for seq in sequences]
Member:

This should probably be an additional Normalizer. It would also let us keep track of the offsets. What do you think?
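For illustration, a sketch of what that could look like with the present-day tokenizers API (the vocab file name and normalizer choice are placeholders):

from tokenizers import Tokenizer, models, normalizers

tokenizer = Tokenizer(models.WordPiece.from_file("vocab.txt", unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Strip(),           # replaces the manual seq.strip() calls above
    normalizers.BertNormalizer(),  # whatever normalization the model already applies
])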

@mfuntowicz (Author):

Yep, makes sense 👍

Member:

Or maybe this deserves a 0.4.3 haha

Resolved review thread on src/transformers/tokenization_utils.py (outdated).
if return_special_tokens_mask:
-    encoding_dict["special_tokens_mask"] = encoding.special_tokens_mask
+    encoding_dict["special_tokens_mask"] = [e.special_tokens_mask for e in encodings]
if return_offsets_mapping:
Member:

If we don't give access to the normalized string somehow, we should maybe provide offsets to the original string here. Wdyt?

@mfuntowicz (Author):

Hmm, currently the offsets are given w.r.t. the normalized string? If that is the case, then yes, we may want to provide offsets in the original string, or expose a utility method doing the mapping in Python.

Is it something we can easily expose on Encoding?

Member:

Yes, offsets are related to the normalized string. You can retrieve the original offsets by doing encoding.original_str.offsets[encoding.offsets[X]]

@julien-c (Member):

I haven't followed the discussion very closely, but shouldn't offsets be returned based on the original string by default?

As an end user I don't think I really care about the normalized (internal) version of my input.

Thoughts?

@mfuntowicz (Author):

I tend to agree with @julien-c, the normalized string is more of an internal representation.

Member:

Same for me, we should probably default to the original string.
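For context, a sketch of what offsets relative to the original string look like with a present-day fast tokenizer (the model name is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Offsets point into the ORIGINAL text."
encoded = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
for token, (start, end) in zip(encoded.tokens(), encoded["offset_mapping"]):
    # Slicing the raw input with the offsets recovers each token's surface form.
    print(token, repr(text[start:end]))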

@codecov-io commented on Feb 11, 2020:

Codecov Report

Merging #2674 into master will increase coverage by 0.29%.
The diff coverage is 83.01%.


@@            Coverage Diff            @@
##           master   #2674      +/-   ##
=========================================
+ Coverage      75%   75.3%   +0.29%     
=========================================
  Files          94      94              
  Lines       15288   15424     +136     
=========================================
+ Hits        11467   11615     +148     
+ Misses       3821    3809      -12
Impacted Files Coverage Δ
src/transformers/__init__.py 98.87% <100%> (ø) ⬆️
src/transformers/tokenization_roberta.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_bert.py 96.92% <100%> (+0.3%) ⬆️
src/transformers/pipelines.py 70.88% <100%> (+0.14%) ⬆️
src/transformers/tokenization_distilbert.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_gpt2.py 96.85% <100%> (+0.58%) ⬆️
src/transformers/tokenization_auto.py 97.22% <100%> (+0.25%) ⬆️
src/transformers/tokenization_transfo_xl.py 37.91% <51.42%> (+5.04%) ⬆️
src/transformers/tokenization_openai.py 82.27% <81.57%> (+0.46%) ⬆️
src/transformers/tokenization_utils.py 90.08% <87.23%> (+3.98%) ⬆️
... and 30 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 20fc18f...56748e8.

@n1t0 (Member) left a comment:

Looks really good! Great job @mfuntowicz!

Resolved review threads (all outdated): setup.py, src/transformers/tokenization_bert.py, and src/transformers/tokenization_utils.py (4 threads).


@mfuntowicz force-pushed the tokenizers-v2 branch 2 times, most recently from 88ffca6 to 63660c4, on February 12, 2020 at 13:23.
@thomwolf (Member) left a comment:

Great work @mfuntowicz!

setup.py (outdated)
@@ -89,7 +89,7 @@
     packages=find_packages("src"),
     install_requires=[
         "numpy",
-        "tokenizers == 0.0.11",
+        "tokenizers == 0.4.2",
Member:

>= ?

@mfuntowicz (Author):

As we don't have that many unit tests on the tokenizers Python bindings for now, I would tend to stick to a specific version that is tested on the CI. Otherwise we might introduce flaky tests whenever new tokenizers versions are released.

Member:

Well, given that we are still introducing breaking changes pretty often in tokenizers, I would strongly advise against that.
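For illustration, the two flavors of version specifier being discussed (version numbers are examples only, not a recommendation from the PR):

# Exact pin: CI exercises one known-good tokenizers release (what the PR does).
install_requires = ["numpy", "tokenizers == 0.4.2"]

# Compatible range: picks up new releases automatically, at the risk of breakage
# while tokenizers is still making frequent breaking changes.
# install_requires = ["numpy", "tokenizers >= 0.4.2, < 0.5.0"]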

Resolved review threads on src/transformers/tokenization_utils.py (2 threads, one outdated).


# Prepare inputs as tensors if asked
if return_tensors == "tf" and is_tf_available():
-    encoding_dict["input_ids"] = tf.constant([encoding_dict["input_ids"]])
+    encoding_dict["input_ids"] = tf.constant(encoding_dict["input_ids"])
Member:

Ok nice, so this will be a tensor with a "batch dimension" equal to the number of encodings when they are split into overflowing tokens. I like this solution, it's clean. We should document this behavior though.
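A sketch of that behavior with the present-day transformers API (assumes TensorFlow is installed; the model name and lengths are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "a sentence long enough to be split into several overflowing chunks of eight tokens",
    max_length=8,
    truncation=True,
    stride=2,
    padding="max_length",
    return_overflowing_tokens=True,
    return_tensors="tf",
)
# The first dimension is the number of chunks produced by overflow handling.
print(encoded["input_ids"].shape)  # (num_chunks, 8)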

Resolved review thread on src/transformers/tokenization_bert.py (outdated).
@@ -213,3 +222,91 @@ def save_vocabulary(self, save_directory):
index += 1

return vocab_file, merge_file


class _OpenAIGPTCharBPETokenizer(BaseTokenizer):
Member:

Why do we have to have this class here?

Don't we have an implementation of char-level BPE in tokenizers now?
Here: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/implementations/char_level_bpe.py#L9

@mfuntowicz (Author):

We do need a special OpenAI GPT implementation because it slightly differs from the char-level BPE we have in tokenizers (a rough sketch follows this list):

  • The Normalizer is the same as Bert's (BertNormalizer)
  • The PreTokenizer is not Whitespace; it is the same as Bert's (BertPreTokenizer)
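For illustration only, here is that customization sketched with the present-day tokenizers API; the vocab/merges file names are placeholders and the exact options differ from the PR-era bindings:

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers

bpe = models.BPE.from_file("vocab.json", "merges.txt", unk_token="<unk>", end_of_word_suffix="</w>")
tokenizer = Tokenizer(bpe)
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)  # same normalizer as Bert
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()        # Bert-style, not Whitespace
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")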

@mfuntowicz (Author):

If we put TransformerXL into tokenizers.implementations, maybe this one can make its way to tokenizers too. cc @n1t0

Member:

Honestly, I'm not too sure about this. I think tokenizers should stay a library with some generic implementations, with an easy way for everybody to build their own custom tokenizer when needed. So I'd like to avoid introducing specific implementations for each new model/tokenizer. Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation, and then we'll have as many implementations as there are models in transformers... I think it makes more sense to keep model-specific customization details in transformers, next to the model that actually uses the custom tokenizer.

Member:

> Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation

You mean have all the things that made the success of Transformers? 😜

Just kidding. OK for me to keep these in Transformers then.

Resolved review threads on src/transformers/tokenization_roberta.py (2 outdated threads).
@@ -280,6 +293,108 @@ def _tokenize(self, line, add_eos=False, add_double_eos=False):
return symbols


class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
Member:

Can we move this class upstream in tokenizers now that we have a word-level model?

It would be more consistent.
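For illustration, a word-level, delimiter-split tokenizer sketched with the present-day tokenizers API (the file name, delimiter, and normalization are placeholders):

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.WordLevel.from_file("vocab.json", unk_token="<unk>"))
tokenizer.normalizer = normalizers.Lowercase()
# Split on a single delimiter character, in the spirit of a delimiter-based lookup tokenizer.
tokenizer.pre_tokenizer = pre_tokenizers.CharDelimiterSplit(" ")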

@mfuntowicz (Author):

cc @n1t0, what do you think? I can put the content of this class into tokenizers.implementations.

Member:

cf comment for OpenAIGPTCharBPETokenizer

(Commit timeline: 76 commits merged; the displayed commits are signed off by Morgan Funtowicz <morgan@huggingface.co>.)
@mfuntowicz changed the title from "[WIP] Integrate fast tokenizers library inside transformers" to "Integrate fast tokenizers library inside transformers" on Feb 19, 2020.
@LysandreJik (Member) left a comment:

This is really cool, great job @mfuntowicz !

Comment on lines -569 to -597 (code removed by this PR):

        self._tokenizer = tk.Tokenizer(tk.models.WordPiece.from_files(vocab_file, unk_token=unk_token))
        self._update_special_tokens()
        self._tokenizer.with_pre_tokenizer(
            tk.pre_tokenizers.BertPreTokenizer.new(
                do_basic_tokenize=do_basic_tokenize,
                do_lower_case=do_lower_case,
                tokenize_chinese_chars=tokenize_chinese_chars,
                never_split=never_split if never_split is not None else [],
            )
        )
        self._tokenizer.with_decoder(tk.decoders.WordPiece.new())

        if add_special_tokens:
            self._tokenizer.with_post_processor(
                tk.processors.BertProcessing.new(
                    (sep_token, self._tokenizer.token_to_id(sep_token)),
                    (cls_token, self._tokenizer.token_to_id(cls_token)),
                )
            )
        if max_length is not None:
            self._tokenizer.with_truncation(max_length, stride=stride, strategy=truncation_strategy)
        self._tokenizer.with_padding(
            max_length=max_length if pad_to_max_length else None,
            direction=self.padding_side,
            pad_id=self.pad_token_id,
            pad_type_id=self.pad_token_type_id,
            pad_token=self.pad_token,
        )
        self._decoder = tk.decoders.WordPiece.new()
Member:

This is very satisfying

        else:
            return max(0, len(self.encode(self.mask_token or "")) - 1)

    def tokenize(self, text, **kwargs):
Member:

This seems like it would take too much memory before being useful, as you really never have 2 sequences with 512/1024 tokens which are exactly the same
