Add BlenderbotTokenizerFast #13720
Conversation
LGTM, thanks a lot for adding this!
src/transformers/models/blenderbot/tokenization_blenderbot_fast.py
class Blenderbot3BTokenizerTests(unittest.TestCase):
    @cached_property
    def tokenizer_3b(self):
        return BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B")

    @cached_property
    def rust_tokenizer_3b(self):
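The test class above memoizes the tokenizers with `@cached_property`, so the expensive `from_pretrained` load runs at most once per instance. A minimal self-contained sketch of that pattern (the `ResourceHolder` class and its dict payload are made up here, standing in for the test class and the tokenizer):

```python
from functools import cached_property


class ResourceHolder:
    """Stands in for the test class; the property body runs only once per instance."""

    load_count = 0

    @cached_property
    def resource(self):
        # In the real test this line would be:
        # BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B")
        ResourceHolder.load_count += 1
        return {"name": "expensive-resource"}


holder = ResourceHolder()
first = holder.resource
second = holder.resource  # served from the per-instance cache, body not re-run
print(first is second)  # True
```

The second access returns the cached object, so repeated test methods on the same instance do not re-download the tokenizer.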
(nit) maybe call this `fast_tokenizer` for consistency
In test files, it seems to me the `rust_tokenizer` naming is used instead of `fast_tokenizer`. However, I can change it here :)
Thank you so much for this addition! 😄
Let me ping @Narsil if he ever has some leads to fix the failed pipeline tests. 🙂

The fix is here: Narsil@605725f. Not sure it's the best fix, so I will describe the issue a bit more. Blenderbot implements AutoModelForSeq2SeqLM (the real way to use it), and the pipeline tests every model that implements its supported architecture.
Something might have changed since last time, since the embeddings are now an issue. (By default the max position embeddings are 20 long, which is not enough for some pipeline tests.)
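The failure mode described here is that inputs longer than the model's learned position-embedding table index out of range. A toy guard illustrating the constraint (pure Python; the function name is made up for illustration, and 20 mirrors the default mentioned above):

```python
def clamp_to_position_embeddings(token_ids, max_position_embeddings=20):
    """Truncate a token-id sequence so every position has a learned embedding.

    Models with learned (rather than sinusoidal) position embeddings look up
    each position in a table of size max_position_embeddings; a longer input
    would index past the end of that table, which is why some pipeline tests
    failed until the test config allowed a larger table.
    """
    if len(token_ids) <= max_position_embeddings:
        return token_ids
    return token_ids[:max_position_embeddings]


print(len(clamp_to_position_embeddings(list(range(50)))))  # 20
```

The actual fix in the PR goes the other way, overriding the pipeline test config to allow larger position embeddings instead of truncating.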
@Narsil Thank you very much for your help! :) It looks like everything works now.
This looks good to me! @SaulLu, does it look good to you too? Feel free to merge if so.
* Add support for the fast (Rust) implementation of BlenderbotTokenizer
* Fix a converter and a typo in a doc
* Apply patil-suraj's suggestion
* (Nitpick) Fast tokenization -> Fast Tokenization in doc
* Apply SaulLu's suggestion
* Apply Narsil's suggestion to fix test pipelines
* Add encoder_no_repeat_ngram_size according to Narsil's suggestion
* Revert the last (unnecessary) commit
* Override pipeline config for Blenderbot to allow for larger pos. emb.
* make fix-copies
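One commit above sets `encoder_no_repeat_ngram_size`, which during generation bans any n-gram that already appears in the encoder input, so the decoder cannot copy source n-grams verbatim. A simplified, self-contained sketch of the banned-token computation (the real implementation lives in `transformers`' generation logits processors; the function name here is illustrative):

```python
def banned_next_tokens(encoder_ids, generated_ids, ngram_size):
    """Tokens that would complete an ngram_size-gram already in encoder_ids.

    If the last (ngram_size - 1) generated tokens match the prefix of an
    n-gram in the encoder input, the token that would finish that n-gram
    is banned.
    """
    if ngram_size <= 0 or len(generated_ids) < ngram_size - 1:
        return set()
    # Collect every n-gram from the encoder input.
    ngrams = {
        tuple(encoder_ids[i:i + ngram_size])
        for i in range(len(encoder_ids) - ngram_size + 1)
    }
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    return {ng[-1] for ng in ngrams if ng[:-1] == prefix}


# With 3-grams: the encoder contains (5, 6, 7) and the decoder just
# produced 5, 6, so 7 is banned as the next token.
print(banned_next_tokens([4, 5, 6, 7, 8], [1, 5, 6], 3))  # {7}
```

In the generation loop, the banned tokens' logits are set to negative infinity before sampling, which is how logits processors enforce such constraints.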
Hey, I opened PRs to add While the conversions work,
What does this PR do?
This PR adds the fast (Rust) implementation of `BlenderbotTokenizer`.

Fixes #13634
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@LysandreJik
Anyone in the community is free to review the PR :)