
Add pad_to_multiple_of on tokenizers (reimport) #5054

Merged: 11 commits merged into master from pad_to_multiple_of_v2 on Jun 26, 2020

Conversation

mfuntowicz (Member) commented on Jun 16, 2020

Reimported from #4731.

Introduces pad_to_multiple_of on both slow and fast tokenizers. This parameter enables the "bucketization" behaviour, also referred to as shape polymorphism: padded lengths are rounded up to the next multiple of a fixed value, so batch shapes fall into a small set of buckets.

This is especially useful when targeting dedicated NN accelerators such as the following (a usage sketch follows the list):

  • NVIDIA Tensor Cores (Volta architecture and newer)
  • XLA (PyTorch TPU)
  • XLA (Jax / Flax)
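A minimal usage sketch of the new parameter (not taken from this PR's diff; the checkpoint name and example sentences are illustrative assumptions, and the exact sequence lengths depend on the tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = ["Hello world", "A slightly longer sentence here"]

# Without pad_to_multiple_of, the batch is padded to the longest sequence in it.
plain = tokenizer(sentences, padding=True, return_tensors="pt")

# With pad_to_multiple_of=8, the padded length is rounded up to the next multiple
# of 8, so batch shapes fall into a small set of "buckets" (shape polymorphism).
bucketed = tokenizer(sentences, padding=True, pad_to_multiple_of=8, return_tensors="pt")

print(plain["input_ids"].shape)     # e.g. torch.Size([2, 7])
print(bucketed["input_ids"].shape)  # e.g. torch.Size([2, 8]); length rounded up to a multiple of 8
```

Fixed bucket shapes let accelerators such as Tensor Cores or XLA reuse compiled kernels instead of recompiling for every new sequence length.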

Bonus:

Edit (@thomwolf):

  • updated to the new API
  • raise a ValueError if you try to truncate to a length which is not a multiple of pad_to_multiple_of (see the sketch below)
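A hedged sketch of how that check surfaces at call time (reusing the tokenizer from the sketch above; the exact error message wording is not quoted from the PR):

```python
text = "some long input text " * 20

# Truncating to max_length=30 while requesting pad_to_multiple_of=8 is inconsistent,
# because 30 is not a multiple of 8, so the call is expected to raise a ValueError.
try:
    tokenizer(text, padding="max_length", truncation=True, max_length=30, pad_to_multiple_of=8)
except ValueError as err:
    print(f"Rejected as expected: {err}")

# A max_length that is already a multiple of pad_to_multiple_of is accepted.
encoded = tokenizer(text, padding="max_length", truncation=True, max_length=32, pad_to_multiple_of=8)
print(len(encoded["input_ids"]))  # 32
```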


codecov bot commented Jun 16, 2020

Codecov Report

Merging #5054 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #5054      +/-   ##
==========================================
+ Coverage   79.08%   79.09%   +0.01%     
==========================================
  Files         138      138              
  Lines       24078    24081       +3     
==========================================
+ Hits        19041    19047       +6     
+ Misses       5037     5034       -3     
Impacted Files                                  Coverage Δ
src/transformers/tokenization_utils.py          91.48% <ø> (ø)
src/transformers/tokenization_utils_fast.py     94.28% <ø> (ø)
src/transformers/tokenization_roberta.py        94.52% <100.00%> (ø)
src/transformers/tokenization_utils_base.py     93.70% <100.00%> (+0.54%) ⬆️
src/transformers/file_utils.py                  76.42% <0.00%> (-0.39%) ⬇️
src/transformers/modeling_utils.py              91.26% <0.00%> (+0.12%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24f46ea...449cba1.

LysandreJik (Member) left a comment

LGTM!

@thomwolf thomwolf self-assigned this Jun 23, 2020
@thomwolf thomwolf merged commit 135791e into master Jun 26, 2020
@thomwolf thomwolf deleted the pad_to_multiple_of_v2 branch June 26, 2020 09:55
jplu pushed a commit to jplu/transformers that referenced this pull request Jun 29, 2020
* Add new parameter `pad_to_multiple_of` on tokenizers.

* unittest for pad_to_multiple_of

* Add .name when logging enum.

* Fix missing .items() on dict in tests.

* Add special check + warning if the tokenizer doesn't have proper pad_token.

* Use the correct logger format specifier.

* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.

* Skip test if tokenizer doesn't have pad_token

* Fix RobertaTokenizer on empty input

* Format.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix and updating to simpler API

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>