
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer #7141

Merged
merged 47 commits into master from fast-sentencepiece on Oct 8, 2020

Conversation

thomwolf (Member) commented Sep 15, 2020

This pull request adds the "fast" Rust tokenizers for the SentencePiece-based tokenizers as well (a short usage sketch follows the list of tokenizers below).

Based on unreleased v0.9.0 of tokenizers.

Tokenizers:

  • Albert
  • Bart
  • Bert
  • Camembert
  • DistilBert
  • DPR
  • Electra
  • Funnel
  • GPT2
  • LongFormer
  • LXMert
  • MBart
  • MobileBert
  • OpenAI GPT
  • Pegasus
  • Reformer
  • RetriBert
  • Roberta
  • T5
  • XLM-Roberta
  • XLNet
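
For illustration, here is a minimal usage sketch (my example, not part of the pull request) showing how one of the models above can be loaded with its fast tokenizer; the checkpoint name is only an example, and the snippet assumes a matching tokenizers release is installed.

# Minimal sketch (not from this PR): load a SentencePiece-based model with its
# fast tokenizer by passing use_fast=True to AutoTokenizer.
from transformers import AutoTokenizer

# "albert-base-v2" is an example checkpoint; any model in the list above with a
# fast implementation should behave the same way.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2", use_fast=True)

print(tokenizer.is_fast)                         # True when the Rust backend is in use
print(tokenizer("Hello, fast tokenizers!")["input_ids"])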

Breaking:

  • The fast version of the Transformer-XL tokenizer (which gave different tokenization results) is removed.

Remaining tokenizers without Fast implementations (no fast tokenizers expected in the short/mid-term):

  • BertJapanese (special python libs for multi-linguality)
  • CTRL (would require a specific BPE to handle missing merges)
  • XLM (uses special python libs for multi-linguality)
  • Flaubert (same as XLM)
  • Transformer-XL (same as XLM)

Other fixes:

thomwolf marked this pull request as ready for review on October 6, 2020 at 07:36

thomwolf (Member, Author) commented Oct 6, 2020

Ready for review; the remaining failing tests should be OK after the next tokenizers RC release.

LysandreJik (Member) left a comment:

Great work, incredible to now have support for all tokenizers for which it is possible!

@@ -0,0 +1,546 @@
from typing import Dict, List, Tuple

Member:

Should add the copyright here

Member:

I don't exactly understand what this script does. Does it convert from original implementations to ours, or from our slow implementations to our fast ones? Some docstrings would be very welcome!

Member Author:

Indeed, I'll add more documentation.

This file contains utilities to convert slow tokenizers into their fast tokenizer counterparts.

All the conversions are grouped here to keep the SentencePiece dependencies outside of the fast tokenizer files and to make our dependency on SentencePiece optional.
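
As a rough illustration of this conversion path, here is a minimal sketch; it assumes the function convert_slow_tokenizer is importable from transformers.convert_slow_tokenizer, as the file name discussed here suggests, and that sentencepiece is installed for the slow tokenizer.

# Minimal sketch: turn an existing slow (Python/SentencePiece) tokenizer into
# its Rust counterpart, a tokenizers.Tokenizer object.
from transformers import AlbertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = AlbertTokenizer.from_pretrained("albert-base-v2")   # needs sentencepiece
fast_backend = convert_slow_tokenizer(slow)                # tokenizers.Tokenizer

print(type(fast_backend))
print(fast_backend.encode("Hello world").tokens)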

src/transformers/tokenization_albert.py (resolved, outdated)
src/transformers/tokenization_bert.py (resolved)
src/transformers/tokenization_pegasus.py (resolved, outdated)
src/transformers/tokenization_utils_base.py (resolved)
Comment on lines +791 to +801
# def test_swap_special_token(self):
# tokenizers = self.get_tokenizers(do_lower_case=False)
# for tokenizer in tokenizers:
# with self.subTest(f"{tokenizer.__class__.__name__}"):
# # Our mask token
# mask = "<mask>"
# # We take a single word in the middle of the vocabulary
# all_tokens = sorted(tokenizer.get_vocab().keys())
# word = tokenizer.decode(tokenizer.encode(all_tokens[len(all_tokens)//2], add_special_tokens=False)[:1])

# sequence_0 = "Encode " + word + " sequence"

Member:

Why is this test removed?

Member Author:

It is just too mind-bending to make it work in the general setting of tokenizers with arbitrary vocabularies, and I don't think it's a useful test in the end.

@@ -1,23 +1,52 @@
import logging

Member:

Needs copyright here.

The diff for this file is slightly complicated to read. Did you wrap the tests in subtests, iterating over every tokenizer? Is that better than doing a mixin like we do in other test classes?

Member Author:

I kinda kept the original setup made by @mfuntowicz even though I agree switching to a mixin would probably be easier to read in the end.
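
For readers less familiar with the pattern under discussion, a hypothetical sketch of the subtest approach (not the actual test file) could look like this, with an example Bert checkpoint standing in for the full set of tokenizers:

# Hypothetical sketch of the subtest pattern: one test method iterates over
# several tokenizers and reports each as its own subtest, instead of
# generating per-tokenizer test classes through a mixin.
import unittest

from transformers import BertTokenizer, BertTokenizerFast


class TokenizerCommonTests(unittest.TestCase):
    def get_tokenizers(self, **kwargs):
        # Example checkpoint chosen for illustration only.
        return [
            BertTokenizer.from_pretrained("bert-base-uncased", **kwargs),
            BertTokenizerFast.from_pretrained("bert-base-uncased", **kwargs),
        ]

    def test_encode_is_not_empty(self):
        for tokenizer in self.get_tokenizers():
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                ids = tokenizer.encode("Hello world", add_special_tokens=False)
                self.assertGreater(len(ids), 0)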

n1t0 (Member) commented Oct 6, 2020

Great job! I'm not entirely up to date with everything in transformers, but this looks very nice and clean!

sgugger (Collaborator) left a comment:

Made my changes directly on the branch.
This is amazing work! The only comment I have left is that it would be nice to have some documentation of convert_slow_tokenizer.py. Also, if it needs some updates when adding a new model, it should be documented in the new model template, so that we or external contributors don't forget.

thomwolf (Member, Author) commented Oct 8, 2020

OK, yes, I'll add documentation. We will probably wait until there is clean documentation in tokenizers as well, so we can do proper cross-linking.

thomwolf changed the title from "[WIP] Adding Fast tokenizers for SentencePiece based tokenizers" to "Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer" on Oct 8, 2020
thomwolf merged commit 9aeacb5 into master on Oct 8, 2020
thomwolf deleted the fast-sentencepiece branch on October 8, 2020 at 09:32
for key, value in special_tokens_map.items():
    if isinstance(value, dict):

Contributor:

@thomwolf - this change currently breaks the RagTokenizer.
If one runs the slow test:

tests/test_modeling_rag.py::RagModelIntegrationTests::test_rag_sequence_generate_batch

and puts a breakpoint before convert_added_tokens, one can see why:

Previously a dict object corresponding e.g. to the BOS token, such as

'bos_token': {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}

would have been processed by the line

value = AddedToken(**value)

Now such a dict, because it has no __type attribute, is processed differently, which leads to errors later.
Not 100% sure how to solve it here. Do you have any good ideas, @LysandreJik @thomwolf @n1t0?
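
To make the reported behaviour change concrete, here is a minimal sketch (an illustrative reconstruction, not the actual transformers code); AddedToken comes from the tokenizers library, and the "__type" tag check mirrors the description above:

# Sketch of the behaviour change described above (illustrative reconstruction).
from tokenizers import AddedToken

bos_entry = {
    "content": "<s>", "single_word": False, "lstrip": False,
    "rstrip": False, "normalized": True,
}

def convert_old(value):
    # Old behaviour: any dict in special_tokens_map was expanded into an AddedToken.
    return AddedToken(**value) if isinstance(value, dict) else value

def convert_new(value):
    # New behaviour as described above: only dicts tagged with "__type" ==
    # "AddedToken" are converted; RagTokenizer's saved dicts lack that tag,
    # so they pass through unchanged and break later on.
    if isinstance(value, dict) and value.get("__type") == "AddedToken":
        value = {k: v for k, v in value.items() if k != "__type"}
        return AddedToken(**value)
    return value

print(type(convert_old(bos_entry)))  # AddedToken
print(type(convert_new(bos_entry)))  # plain dict -> errors downstream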

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (huggingface#7141)

* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenzier implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limite tokenizer warning to one occurence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Revert "Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (huggingface#7141)"

This reverts commit 324bd77.

Successfully merging this pull request may close these issues.

running dataset.map, it raises TypeError: can't pickle Tokenizer objects
6 participants