Is it possible to mimic trim_batch using new tokenizer strategies? #5181
I am trying to use the new tokenizer kwargs to replace the old workflow of calling batch_encode_plus to make tensors of shape (n_examples, model_max_length) and then calling trim_batch to reduce padding computation. Is this possible?

The following code does not seem to truncate inputs longer than 512 (the second assert breaks).

Attempt:

Traceback:

Help much appreciated, @mfuntowicz @thomwolf

Comments
Hi @sshleifer, you should read the detailed description on the tokenizers refactoring PR #4510 (comment). Until it's added to the docs (which will be soon), it's required reading for all core contributors.
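For readers following along, here is a minimal sketch of the padding/truncation strategies that PR #4510 describes, using only the public `__call__` kwargs; the example strings are made up for illustration:

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
batch = ['a very long sentence ' * 500, 'a short sentence']

# padding='longest' (same as padding=True) pads only up to the longest
# sequence in the batch after truncation; this is the dynamic padding that
# trim_batch used to recover by stripping all-pad columns.
enc = tokenizer(batch, padding='longest', truncation=True,
                max_length=tokenizer.model_max_length, return_tensors='pt')

# padding='max_length' instead pads every sequence out to max_length,
# reproducing the old pad_to_max_length=True behavior.
enc = tokenizer(batch, padding='max_length', truncation=True,
                max_length=tokenizer.model_max_length, return_tensors='pt')
```

In other words, `padding='longest'` plus `truncation=True` should make a separate trim_batch pass unnecessary.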
Thanks. I read that, and am still somewhat confused about why passing max_length together with truncation=True does not cap the sequence length. Here is a simplified example:

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
assert tokenizer.model_max_length == 1024

batch_sentences = ['tiny sentence 1' * 1000, 'tiny_sentence2']

# tokenizer.batch_encode_plus returns ids shaped (2, 1024)
ids = tokenizer.batch_encode_plus(batch_sentences, pad_to_max_length=True,
                                  max_length=tokenizer.model_max_length,
                                  truncation=True, return_tensors='pt').input_ids
assert ids.shape[1] <= tokenizer.model_max_length, ids.shape[1]

# tokenizer.__call__ returns ids shaped (2, 3002)
ids = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt',
                max_length=tokenizer.model_max_length).input_ids
assert ids.shape[1] <= tokenizer.model_max_length, ids.shape[1]  # fails: 3002 > 1024
```
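For reference, the `trim_batch` helper being replaced here (reconstructed from the summarization example utilities; treat this as a sketch rather than the canonical source) simply drops columns that contain nothing but padding:

```python
import torch

def trim_batch(input_ids, pad_token_id, attention_mask=None):
    """Remove columns that are populated exclusively by pad_token_id."""
    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
    if attention_mask is None:
        return input_ids[:, keep_column_mask]
    return input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask]
```

So the goal is for a single `__call__` with `padding=True, truncation=True, max_length=...` to produce the same shapes in one step.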
I'll take a look
This issue was referenced by a commit:

…icit - move back the default logging level in tests to WARNING (#5252)
* fix-5181: padding to max sequence length while truncating to another length was wrong on slow tokenizers
* clean up and fix #5155
* fix XLM test
* fix tests for Transfo-XL
* logging only above WARNING in tests
* switch slow tokenizer tests to @slow
* fix Marian truncation tokenization test
* style and quality
* make the test a lot faster by limiting the sequence length used in tests
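A quick way to check the behavior after that fix (a sketch assuming post-#5252 semantics; the short sentences are made up):

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')

# The failing example from above: the long input is now truncated to 1024.
batch_sentences = ['tiny sentence 1' * 1000, 'tiny_sentence2']
ids = tokenizer(batch_sentences, padding=True, truncation=True,
                max_length=tokenizer.model_max_length, return_tensors='pt').input_ids
assert ids.shape[1] <= tokenizer.model_max_length  # no longer 3002

# A batch of short inputs is padded only to its longest member,
# mimicking the old batch_encode_plus + trim_batch pipeline.
short_batch = ['a tiny sentence', 'another slightly longer tiny sentence']
ids = tokenizer(short_batch, padding=True, truncation=True,
                max_length=tokenizer.model_max_length, return_tensors='pt').input_ids
assert ids.shape[1] < tokenizer.model_max_length
```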