fast tok update #13036
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@sayakpaul here is the fix around the

```python
tokenizer._update_trie()
# set correct total vocab size after removing tokens
tokenizer._update_total_vocab_size()
# Fast tokenizers: serialize, filter tokens, reload
```
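Below is a minimal sketch of the "serialize, filter tokens, reload" approach for fast tokenizers mentioned in the last comment above. The function name, the `tokens_to_remove` argument, and the exact JSON layout are illustrative assumptions, not the PR's actual code:

```python
import json

from tokenizers import Tokenizer


def remove_added_tokens_fast(tokenizer, tokens_to_remove):
    # Serialize the Rust backend tokenizer to its JSON representation
    state = json.loads(tokenizer.backend_tokenizer.to_str())
    # Filter the unwanted entries out of the added-tokens table
    state["added_tokens"] = [
        entry for entry in state["added_tokens"]
        if entry["content"] not in tokens_to_remove
    ]
    # Reload the filtered state as the new backend tokenizer
    # (assigning to the private `_tokenizer` attribute is an assumption
    # about how PreTrainedTokenizerFast stores its backend)
    tokenizer._tokenizer = Tokenizer.from_str(json.dumps(state))
    return tokenizer
```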
Would it work with transformers (< v5) as well?
If not, maybe we could keep maintaining two code paths? One for v5 and another one for < v5? That way, in the next release cycle, we can pin the transformers version to >=5.0.0.
WDYT?
sounds good! I added it back
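As a rough illustration of the two-code-path idea agreed on above, the version gate could look something like this; `remove_tokens_fast` is a hypothetical helper standing in for the fast-tokenizer path, and only the two `_update_*` calls come from the diff itself:

```python
from packaging import version

import transformers

IS_TRANSFORMERS_V5 = version.parse(transformers.__version__) >= version.parse("5.0.0")

if IS_TRANSFORMERS_V5:
    # v5+: only fast tokenizers exist, so filter via serialize/reload
    # (remove_tokens_fast is a hypothetical helper, not the PR's code)
    remove_tokens_fast(tokenizer, tokens_to_remove)
else:
    # < v5: slow tokenizers keep a Trie that must be rebuilt after
    # tokens are removed, and the vocab size must be recomputed
    tokenizer._update_trie()
    tokenizer._update_total_vocab_size()
```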
sayakpaul left a comment
Thanks for this work! I left one comment regarding versioning. LMK what you think.
sayakpaul left a comment
Thanks a lot!
@bot /style

Style bot fixed some files and pushed the changes.
Following transformers v5, we no longer have "slow" tokenizers that use a Trie; fast tokenizers are the default. This script previously assumed slow tokenizers were always used, so it is updated to work with fast ones!
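For context, a quick way to see the v5 behavior described here; the model ID is just an example, and `is_fast` is the standard transformers flag distinguishing Rust-backed tokenizers from the removed slow ones:

```python
from transformers import AutoTokenizer

# Under transformers v5, AutoTokenizer returns a fast (Rust-backed)
# tokenizer by default; there is no slow, Trie-based fallback anymore.
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.is_fast)  # True
```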