Add pad_to_multiple_of on tokenizers (reimport) #5054
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #5054      +/-   ##
==========================================
+ Coverage   79.08%   79.09%   +0.01%
==========================================
  Files         138      138
  Lines       24078    24081       +3
==========================================
+ Hits        19041    19047       +6
+ Misses       5037     5034       -3
Continue to review full report at Codecov.
LGTM!
Force-pushed from 15f8c2a to 1179cc8
* Add new parameter `pad_to_multiple_of` on tokenizers.
* Unit test for `pad_to_multiple_of`.
* Add `.name` when logging enum.
* Fix missing `.items()` on dict in tests.
* Add special check + warning if the tokenizer doesn't have a proper pad_token.
* Use the correct logger format specifier.
* Ensure tokenizers with no pad_token do not modify the underlying padding strategy.
* Skip test if tokenizer doesn't have a pad_token.
* Fix RobertaTokenizer on empty input.
* Format.
* Fix and update to a simpler API.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Reimported from #4731.
Introduce `pad_to_multiple_of` on both slow and fast tokenizers. This parameter enables the "bucketization behaviour", also referred to as Shape Polymorphism, which is especially useful when targeting NN-dedicated accelerators such as:

Bonus: fix a crash in `text[0].is_space()` on empty input (RobertaTokenizer corner case with empty string #3608).

Edit (@thomwolf): raises a `ValueError` if you request truncation to a length which is not a multiple of `pad_to_multiple_of`.
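To illustrate the behaviour this parameter adds, here is a minimal sketch of the length rounding and the `ValueError` check described above. The function names are hypothetical, not the actual helpers added in this PR; the sketch only shows the arithmetic: padded length is rounded up to the next multiple, so sequences land in a small number of shape buckets.

```python
def pad_length(length: int, pad_to_multiple_of: int) -> int:
    """Round `length` up to the nearest multiple of `pad_to_multiple_of`."""
    if length % pad_to_multiple_of == 0:
        return length
    return (length // pad_to_multiple_of + 1) * pad_to_multiple_of


def check_truncation(max_length: int, pad_to_multiple_of: int) -> None:
    """Mirror the check described in the PR: truncating to a length that is
    not a multiple of `pad_to_multiple_of` is rejected, since padding could
    never restore a bucket-aligned shape afterwards."""
    if max_length % pad_to_multiple_of != 0:
        raise ValueError(
            f"truncation length {max_length} is not a multiple "
            f"of pad_to_multiple_of={pad_to_multiple_of}"
        )
```

For example, with `pad_to_multiple_of=8`, a 13-token sequence is padded to 16 tokens, and a 16-token one stays at 16, so an accelerator compiled for a handful of fixed shapes can serve all input lengths.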