
Add BlenderbotTokenizerFast #13720

Merged: 19 commits merged into huggingface:master on Oct 29, 2021

Conversation

@stancld (Contributor) commented Sep 23, 2021

What does this PR do?

This PR adds the fast (Rust) implementation of BlenderbotTokenizer.

Fixes #13634
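
For illustration, a minimal usage sketch of the new fast tokenizer (the facebook/blenderbot-3B checkpoint is used only as an example):

from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

# The fast tokenizer is loaded exactly like the slow one and is expected to
# produce matching encodings.
slow_tok = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B")
fast_tok = BlenderbotTokenizerFast.from_pretrained("facebook/blenderbot-3B")

text = "Hello, how are you?"
print(slow_tok(text)["input_ids"])
print(fast_tok(text)["input_ids"])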

Before submitting

Who can review?

@LysandreJik

Anyone in the community is free to review the PR :)

@stancld changed the title from "Add BlenderbotTokenizerFast" to "[WIP] Add BlenderbotTokenizerFast" on Sep 23, 2021
@stancld changed the title from "[WIP] Add BlenderbotTokenizerFast" to "Add BlenderbotTokenizerFast" on Sep 24, 2021
@patil-suraj (Contributor) left a comment


LGTM, thanks a lot for adding this!

src/transformers/models/auto/tokenization_auto.py (outdated review thread, resolved)


class Blenderbot3BTokenizerTests(unittest.TestCase):
    @cached_property
    def tokenizer_3b(self):
        return BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B")

    @cached_property
    def rust_tokenizer_3b(self):
        # presumed completion (truncated in the review excerpt): the fast tokenizer loaded the same way
        return BlenderbotTokenizerFast.from_pretrained("facebook/blenderbot-3B")
Inline review comment (Contributor):
(nit) maybe call this fast_tokenizer for consistency

@stancld (Contributor, Author) replied:

In test files, it seems to me the rust_tokenizer naming is used instead of fast_tokenizer. However, I can change it here :)

@SaulLu (Contributor) left a comment


Thank you so much for this addition! 😄

src/transformers/convert_slow_tokenizer.py (outdated review thread, resolved)
@SaulLu self-requested a review on September 29, 2021
@SaulLu (Contributor) commented Sep 30, 2021

Let me ping @Narsil in case he has some leads to fix the failing pipeline tests. 🙂

@Narsil (Contributor) commented Oct 4, 2021

The fix is here: Narsil@605725f

I'm not sure it's the best fix, so I'll describe the issue in a bit more detail.

Blenderbot has implementations for both AutoModelForSeq2SeqLM (the real way to use it) and AutoModelForCausalLM (I don't think the latter is really used in practice, but it is implemented in the lib).

The pipeline tests run against every model that implements a supported architecture, so BlenderbotForCausalLM is used as well. BUT the test config for pipelines is taken from the model tester, which (understandably) builds the encoder/decoder config (with encoder_no_repeat_ngram_size=3). When that config is used with the decoder-only BlenderbotForCausalLM, the tests fail.

test_pipeline_common can have a very specific override for this, since the situation (a model implementing both) should be very marginal within the lib and also very consistent (CausalLM = decoder-only, and encoder_no_repeat_ngram_size doesn't make any sense for decoder-only models).
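
A rough sketch of the kind of override described above (illustrative only, not the actual fix in Narsil@605725f; the attribute name follows the standard generation config):

# Hypothetical sketch: give the Blenderbot model tester a pipeline-specific
# config in which the encoder-decoder-only setting is disabled, so the
# decoder-only BlenderbotForCausalLM variant can still generate in the
# pipeline tests.
def get_pipeline_config(self):
    config = self.get_config()
    config.encoder_no_repeat_ngram_size = 0  # only meaningful for encoder-decoder models
    return config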

@Narsil (Contributor) commented Oct 18, 2021

Something might have changed since last time, since the embeddings are now an issue.
Suggesting a diff that overrides only the config used for the pipeline tests on Blenderbot.

index 33d506492..9e04ec89d 100644
--- a/tests/test_modeling_blenderbot.py
+++ b/tests/test_modeling_blenderbot.py
@@ -137,6 +137,11 @@ class BlenderbotModelTester:
             pad_token_id=self.pad_token_id,
         )
 
+    def get_pipeline_config(self):
+        config = self.get_config()
+        config.max_position_embeddings = 100
+        return config
+
     def prepare_config_and_inputs_for_common(self):
         config, inputs_dict = self.prepare_config_and_inputs()
         return config, inputs_dict

(By default, max_position_embeddings is 20, which is not enough for some pipeline tests.)
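
For context, a sketch of why adding that method is enough: the pipeline test harness can prefer a get_pipeline_config method on the model tester and fall back to get_config otherwise (the helper name below is illustrative, not the actual test_pipeline_common code):

# Illustrative fallback only; the real lookup lives in the pipeline test common code.
def config_for_pipeline_tests(model_tester):
    getter = getattr(model_tester, "get_pipeline_config", model_tester.get_config)
    return getter()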

@stancld (Contributor, Author) commented Oct 20, 2021

@Narsil Thank you very much for your help! :) It looks like everything works now.

@LysandreJik (Member) left a comment


This looks good to me! @SaulLu, does it look good to you too? Feel free to merge if so.

@SaulLu (Contributor) left a comment


It's all good for me too! Thank you for this addition and for fixing the latest problems with the pipeline @stancld. 😄

Thank you also @Narsil for giving the right leads to fix the last tests!

@LysandreJik merged commit d37f1fb into huggingface:master on Oct 29, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request on Jan 27, 2022:
* Add the support for the fast (rust) implementation of BlenderbotTokenizer

* Fix a converter and a typo in a doc

* Apply the patil-suraj's suggestion

* (Nitpick) Fast tokenization -> Fast Tokenization in doc

* Apply the SaulLu's suggestion

* Apply Narsil's suggestion to fix test pipelines

* Add encoder_no_repeat_ngram_size according to the Narsil's suggestion

* Revert the last (unnecessary) commit

* Override pipeline config for Blenderbot to allow for larger pos. emb.

* make fix-copies
@jonatanklosko (Contributor) commented:
Hey, I opened PRs to add tokenizer.json to the repos:

While the conversions work, tokenizer.json files are useful for us because they allow loading directly using the tokenizers Rust bindings, so if those can be merged it would be appreciated :)
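
For reference, a minimal sketch of the direct-loading path this enables, shown with the Python bindings of tokenizers (the repo name is only an example):

from tokenizers import Tokenizer

# With tokenizer.json in the model repo, the tokenizers library can load the
# tokenizer directly, without any slow-to-fast conversion step in transformers.
tok = Tokenizer.from_pretrained("facebook/blenderbot-3B")
print(tok.encode("Hello there!").tokens)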

Merging this pull request closes: Add the fast implementation of BlenderbotTokenizer (#13634).