Description
Hi,
`truncation='do_not_truncate'` does not behave equivalently to `truncation=False`.
When `truncation=False` is used and `max_length` is provided, the tokenizer falls back to the `'longest_first'` truncation strategy.
Whether or not this default is natural, isn't `False` supposed to be identical to `'do_not_truncate'`?
This leads to a situation where the user explicitly specifies `truncation=False` but the text is still truncated.
Both this guide: https://huggingface.co/docs/transformers/pad_truncation and this doc: https://huggingface.co/docs/transformers/main_classes/tokenizer say:

> `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior.
This means the two values are supposed to be equivalent (regardless of what they do, they should behave the same).
I suggest that `False` should simply mean "no truncation", regardless of whether `max_length` was supplied.
Here is a short example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
sent = 'The quick brown fox jumps over the lazy dog'

len(tokenizer.encode(sent, max_length=5, truncation='do_not_truncate'))  # prints: 11
len(tokenizer.encode(sent, max_length=5, truncation=False))              # prints: 5
```
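To make the suggestion concrete, here is a minimal, hypothetical sketch of the strategy-resolution logic being proposed. The function name `resolve_truncation_strategy` is invented for illustration and is not the actual code inside `transformers`; the point is only that `False` would map to `'do_not_truncate'` unconditionally, even when `max_length` is given:

```python
# Hypothetical sketch, NOT the real transformers internals: resolve the
# user-supplied `truncation` argument to a strategy name, with the proposed
# rule that False always means "no truncation".
def resolve_truncation_strategy(truncation, max_length=None):
    """Map a `truncation` argument to a truncation strategy name."""
    if truncation is False or truncation == "do_not_truncate":
        # Proposed behavior: max_length is ignored here, so False and
        # 'do_not_truncate' are always equivalent.
        return "do_not_truncate"
    if truncation is True:
        # True selects the default truncating strategy.
        return "longest_first"
    if truncation in ("longest_first", "only_first", "only_second"):
        return truncation
    raise ValueError(f"Unknown truncation argument: {truncation!r}")

# Under this proposal both spellings behave identically, with or without
# max_length:
assert resolve_truncation_strategy(False, max_length=5) == "do_not_truncate"
assert resolve_truncation_strategy("do_not_truncate", max_length=5) == "do_not_truncate"
```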
Thanks,
Uri