
Add BlenderbotTokenizerFast #13720

Merged: 19 commits merged into huggingface:master on Oct 29, 2021

Conversation

@stancld (Contributor) commented Sep 23, 2021

What does this PR do?

This PR adds the fast (Rust) implementation of BlenderbotTokenizer.

Fixes #13634
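
For illustration, a minimal usage sketch of the new fast tokenizer (the facebook/blenderbot-3B checkpoint is used only as an example):

from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

# The fast tokenizer is loaded exactly like the slow one and is expected to
# produce matching encodings.
slow_tok = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B")
fast_tok = BlenderbotTokenizerFast.from_pretrained("facebook/blenderbot-3B")

text = "Hello, how are you?"
print(slow_tok(text)["input_ids"])
print(fast_tok(text)["input_ids"])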

Before submitting

Who can review?

@LysandreJik

Anyone in the community is free to review the PR :)

@stancld changed the title from "Add BlenderbotTokenizerFast" to "[WIP] Add BlenderbotTokenizerFast" on Sep 23, 2021
@stancld changed the title from "[WIP] Add BlenderbotTokenizerFast" to "Add BlenderbotTokenizerFast" on Sep 24, 2021
@patil-suraj (Contributor) left a comment


LGTM, thanks a lot for adding this!

src/transformers/models/auto/tokenization_auto.py (outdated review thread, resolved)


class Blenderbot3BTokenizerTests(unittest.TestCase):
    @cached_property
    def tokenizer_3b(self):
        return BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B")

    @cached_property
    def rust_tokenizer_3b(self):
        # presumed completion (truncated in the review excerpt): the fast tokenizer loaded the same way
        return BlenderbotTokenizerFast.from_pretrained("facebook/blenderbot-3B")
Inline review comment (Contributor):
(nit) maybe call this fast_tokenizer for consistency

@stancld (Contributor, Author) replied:

In test files, it seems to me the rust_tokenizer naming is used instead of fast_tokenizer. However, I can change it here :)

@SaulLu (Contributor) left a comment


Thank you so much for this addition! 😄

src/transformers/convert_slow_tokenizer.py (outdated review thread, resolved)
@SaulLu self-requested a review on September 29, 2021
@SaulLu (Contributor) commented Sep 30, 2021

Let me ping @Narsil in case he has some leads to fix the failing pipeline tests. 🙂

@Narsil (Contributor) commented Oct 4, 2021

The fix is here: Narsil@605725f

I'm not sure it's the best fix, so I'll describe the issue in a bit more detail.

Blenderbot has implementations for both AutoModelForSeq2SeqLM (the real way to use it) and AutoModelForCausalLM (I don't think the latter is really used in practice, but it is implemented in the lib).

The pipeline tests run against every model that implements a supported architecture, so BlenderbotForCausalLM is used as well. BUT the test config for pipelines is taken from the model tester, which (understandably) builds the encoder/decoder config (with encoder_no_repeat_ngram_size=3). When that config is used with the decoder-only BlenderbotForCausalLM, the tests fail.

test_pipeline_common can have a very specific override for this, since the situation (a model implementing both) should be very marginal within the lib and also very consistent (CausalLM = decoder-only, and encoder_no_repeat_ngram_size doesn't make any sense for decoder-only models).
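
A rough sketch of the kind of override described above (illustrative only, not the actual fix in Narsil@605725f; the attribute name follows the standard generation config):

# Hypothetical sketch: give the Blenderbot model tester a pipeline-specific
# config in which the encoder-decoder-only setting is disabled, so the
# decoder-only BlenderbotForCausalLM variant can still generate in the
# pipeline tests.
def get_pipeline_config(self):
    config = self.get_config()
    config.encoder_no_repeat_ngram_size = 0  # only meaningful for encoder-decoder models
    return config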

@Narsil (Contributor) commented Oct 18, 2021

Something might have changed since last time, since the embeddings are now an issue.
Suggesting a diff that overrides only the config used for the pipeline tests on Blenderbot.

index 33d506492..9e04ec89d 100644
--- a/tests/test_modeling_blenderbot.py
+++ b/tests/test_modeling_blenderbot.py
@@ -137,6 +137,11 @@ class BlenderbotModelTester:
             pad_token_id=self.pad_token_id,
         )
 
+    def get_pipeline_config(self):
+        config = self.get_config()
+        config.max_position_embeddings = 100
+        return config
+
     def prepare_config_and_inputs_for_common(self):
         config, inputs_dict = self.prepare_config_and_inputs()
         return config, inputs_dict

(By default, max_position_embeddings is 20, which is not enough for some pipeline tests.)
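
For context, a sketch of why adding that method is enough: the pipeline test harness can prefer a get_pipeline_config method on the model tester and fall back to get_config otherwise (the helper name below is illustrative, not the actual test_pipeline_common code):

# Illustrative fallback only; the real lookup lives in the pipeline test common code.
def config_for_pipeline_tests(model_tester):
    getter = getattr(model_tester, "get_pipeline_config", model_tester.get_config)
    return getter()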

@stancld (Contributor, Author) commented Oct 20, 2021

@Narsil Thank you very much for your help! :) It looks like everything works now.

@LysandreJik (Member) left a comment


This looks good to me! @SaulLu, does it look good to you too? Feel free to merge if so.

@SaulLu (Contributor) left a comment


It's all good for me too! Thank you for this addition and for fixing the latest problems with the pipeline @stancld. 😄

Thank you also @Narsil for giving the right leads to fix the last tests!

@LysandreJik merged commit d37f1fb into huggingface:master on Oct 29, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request on Jan 27, 2022:
* Add the support for the fast (rust) implementation of BlenderbotTokenizer

* Fix a converter and a typo in a doc

* Apply the patil-suraj's suggestion

* (Nitpick) Fast tokenization -> Fast Tokenization in doc

* Apply the SaulLu's suggestion

* Apply Narsil's suggestion to fix test pipelines

* Add encoder_no_repeat_ngram_size according to the Narsil's suggestion

* Revert the last (unnecessary) commit

* Override pipeline config for Blenderbot to allow for larger pos. emb.

* make fix-copies
@jonatanklosko (Contributor) commented:
Hey, I opened PRs to add tokenizer.json to the repos:

While the conversions work, tokenizer.json files are useful for us because they allow loading directly using the tokenizers Rust bindings, so if those can be merged it would be appreciated :)
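
For reference, a minimal sketch of the direct-loading path this enables, shown with the Python bindings of tokenizers (the repo name is only an example):

from tokenizers import Tokenizer

# With tokenizer.json in the model repo, the tokenizers library can load the
# tokenizer directly, without any slow-to-fast conversion step in transformers.
tok = Tokenizer.from_pretrained("facebook/blenderbot-3B")
print(tok.encode("Hello there!").tokens)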

Merging this pull request closes: Add the fast implementation of BlenderbotTokenizer (#13634).