Add pad_to_multiple_of on tokenizers (reimport) #5054
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #5054      +/-   ##
==========================================
+ Coverage   79.08%   79.09%   +0.01%
==========================================
  Files         138      138
  Lines       24078    24081       +3
==========================================
+ Hits        19041    19047       +6
+ Misses       5037     5034       -3
Continue to review full report at Codecov.
LGTM!
Force-pushed from 15f8c2a to 1179cc8
* Add new parameter `pad_to_multiple_of` on tokenizers.
* Unit test for `pad_to_multiple_of`.
* Add `.name` when logging enum.
* Fix missing `.items()` on dict in tests.
* Add special check + warning if the tokenizer doesn't have a proper pad_token.
* Use the correct logger format specifier.
* Ensure tokenizers with no pad_token do not modify the underlying padding strategy.
* Skip test if tokenizer doesn't have a pad_token.
* Fix RobertaTokenizer on empty input.
* Format.
* Fix and update to a simpler API.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Reimported from #4731.
Introduce `pad_to_multiple_of` on both slow and fast tokenizers. This parameter enables the "bucketization behaviour", also referred to as Shape Polymorphism, which is especially useful when targeting NN-dedicated accelerators such as:

Bonus: fix a crash in `text[0].is_space()` on empty input (RobertaTokenizer corner case with empty string #3608).

Edit (@thomwolf): raises a `ValueError` if you request truncation to a length which is not a multiple of `pad_to_multiple_of`.
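To illustrate the behaviour this parameter adds, here is a minimal sketch of the length rounding and the `ValueError` check described above. The function names are hypothetical, not the actual helpers added in this PR; the sketch only shows the arithmetic: padded length is rounded up to the next multiple, so sequences land in a small number of shape buckets.

```python
def pad_length(length: int, pad_to_multiple_of: int) -> int:
    """Round `length` up to the nearest multiple of `pad_to_multiple_of`."""
    if length % pad_to_multiple_of == 0:
        return length
    return (length // pad_to_multiple_of + 1) * pad_to_multiple_of


def check_truncation(max_length: int, pad_to_multiple_of: int) -> None:
    """Mirror the check described in the PR: truncating to a length that is
    not a multiple of `pad_to_multiple_of` is rejected, since padding could
    never restore a bucket-aligned shape afterwards."""
    if max_length % pad_to_multiple_of != 0:
        raise ValueError(
            f"truncation length {max_length} is not a multiple "
            f"of pad_to_multiple_of={pad_to_multiple_of}"
        )
```

For example, with `pad_to_multiple_of=8`, a 13-token sequence is padded to 16 tokens, and a 16-token one stays at 16, so an accelerator compiled for a handful of fixed shapes can serve all input lengths.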