
Is it possible to mimic trim_batch using new tokenizer strategies? #5181

Closed
sshleifer opened this issue Jun 22, 2020 · 3 comments · Fixed by #5252
Labels: Core: Tokenization (Internals of the library; Tokenization)

Comments

@sshleifer
Contributor

I am trying to use the new tokenizer kwargs to replace the old workflow of calling batch_encode_plus to make tensors of shape (n_examples, model_max_length) and then calling trim_batch to reduce padding computation.
Is this possible?
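
For context, trim_batch just drops the columns of a padded batch that are made up entirely of pad tokens. A minimal sketch of that helper (assuming PyTorch tensors for input_ids and attention_mask) is roughly:

def trim_batch(input_ids, pad_token_id, attention_mask=None):
    # keep only the columns that contain at least one non-pad token
    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
    if attention_mask is None:
        return input_ids[:, keep_column_mask]
    return input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask]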
The following code does not seem to truncate inputs longer than 512 (the final assert fails; see the traceback below).

Attempt:

from transformers import BartTokenizer
# trim_batch here is the old helper (sketched above) that drops all-padding columns

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
# note: this mixes the old pad_to_max_length kwarg with the new padding/truncation kwargs
kw = dict(max_length=512, pad_to_max_length=True, padding=True,
          return_tensors='pt', truncation='only_first')

# the short batch comes back 7 tokens wide, so these two asserts pass
batch = tokenizer(['tiny sentence 1', 'tiny_sentence2'], **kw)
assert batch.input_ids.shape[1] == 7, batch.input_ids.shape[1]
input_ids, mask = trim_batch(**batch, pad_token_id=tokenizer.pad_token_id)
assert input_ids.shape[1] == 7, input_ids.shape[1]

# a sentence far longer than max_length=512 should be truncated, but is not
batch_overflow = tokenizer(['tiny sentence 1' * 1000, 'tiny_sentence2'], **kw)
assert batch_overflow.input_ids.shape[1] == 512, batch_overflow.input_ids.shape[1]

Traceback:

assert batch_overflow.input_ids.shape[1] == 512, batch_overflow.input_ids.shape[1]

AssertionError: 3002

Help much appreciated, @mfuntowicz @thomwolf

sshleifer added the Core: Tokenization label Jun 22, 2020
@thomwolf
Member

Hi @sshleifer, you should read the detailed description on the tokenizers refactoring PR #4510 (comment).

Until it's added to the docs (which will be soon), it's required reading for all core contributors of transformers.

@sshleifer
Contributor Author

sshleifer commented Jun 22, 2020

Thanks. I read that, and I'm still somewhat confused about why I pass truncation=True and get entries that are longer than tokenizer.model_max_length. The PR description says:

[screenshot of the PR description explaining the padding/truncation behavior]

Here is a simplified example:

from transformers import BartTokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
assert tokenizer.model_max_length == 1024

# tokenizer.batch_encode_plus returns ids shaped (2, 1024)
batch_sentences = ['tiny sentence 1'*1000, 'tiny_sentence2']
ids = tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True, max_length=tokenizer.model_max_length,
                                  truncation=True, return_tensors='pt').input_ids
assert ids.shape[1] <= tokenizer.model_max_length, ids.shape[1]

# tokenizer.__call__ returns ids shaped (2, 3002)
ids = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt',
                max_length=tokenizer.model_max_length).input_ids
assert ids.shape[1] <= tokenizer.model_max_length, ids.shape[1]
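
For reference, the new-style call that should mimic the old batch_encode_plus + trim_batch workflow, assuming truncation is applied against model_max_length as documented, is padding='longest' (pad only up to the longest example in the batch) together with truncation=True:

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
batch = tokenizer(
    ['tiny sentence 1' * 1000, 'tiny_sentence2'],
    padding='longest',                       # pad only to the longest example in the batch
    truncation=True,                         # truncate anything above max_length
    max_length=tokenizer.model_max_length,
    return_tensors='pt',
)
# the long example is truncated to model_max_length and the short one is padded up to it,
# so no column is all-padding and there is nothing left for trim_batch to remove
assert batch.input_ids.shape[1] <= tokenizer.model_max_length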

@thomwolf
Member

I'll take a look

thomwolf added a commit that referenced this issue Jun 25, 2020
…efault logging level in tests to WARNING (#5252)

* fix-5181

Padding to max sequence length while truncation to another length was wrong on slow tokenizers

* clean up and fix #5155

* fix XLM test

* Fix tests for Transfo-XL

* logging only above WARNING in tests

* switch slow tokenizers tests in @slow

* fix Marian truncation tokenization test

* style and quality

* make the test a lot faster by limiting the sequence length used in tests
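
With that fix in place, one would expect the two workflows from the examples above to agree. A hedged sanity-check sketch, reusing the trim_batch helper sketched near the top of the thread:

import torch
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
sents = ['tiny sentence 1' * 1000, 'tiny_sentence2']

# old workflow: pad everything out to model_max_length, then trim all-pad columns
old = tokenizer.batch_encode_plus(sents, pad_to_max_length=True, truncation=True,
                                  max_length=tokenizer.model_max_length,
                                  return_tensors='pt')
old_ids, old_mask = trim_batch(old.input_ids, tokenizer.pad_token_id,
                               attention_mask=old.attention_mask)

# new workflow: pad only to the longest sequence in the batch, truncating at model_max_length
new = tokenizer(sents, padding=True, truncation=True,
                max_length=tokenizer.model_max_length, return_tensors='pt')

assert torch.equal(old_ids, new.input_ids)
assert torch.equal(old_mask, new.attention_mask)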
jplu pushed a commit to jplu/transformers that referenced this issue Jun 29, 2020
…icit - move back the default logging level in tests to WARNING (huggingface#5252)
