truncation='do_not_truncate' is not working equivalently to truncation=False #19334

@urialon

Hi,
truncation='do_not_truncate' is not working equivalently to truncation=False.
When truncation=False is passed together with max_length, the tokenizer falls back to the 'longest_first' truncation strategy.
Whether this default behavior is natural or not, isn't False supposed to be identical to 'do_not_truncate'?

This leads to a situation where the user explicitly specifies truncation=False but the text is truncated anyway.

This manual: https://huggingface.co/docs/transformers/pad_truncation and this doc https://huggingface.co/docs/transformers/main_classes/tokenizer
say that:

False or 'do_not_truncate': no truncation is applied. This is the default behavior.

Which means that they are supposed to be equivalent (regardless of what they do, they should behave the same).

I suggest that False should simply mean "no truncation", regardless of whether max_length was supplied.

Here is a short example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
sent = 'The quick brown fox jumps over the lazy dog'

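# Explicit string strategy: max_length is ignored. Prints: 11
print(len(tokenizer.encode(sent, max_length=5, truncation='do_not_truncate')))

# Boolean False: the input is truncated to max_length. Prints: 5
print(len(tokenizer.encode(sent, max_length=5, truncation=False)))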
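
To make the suggestion concrete, here is a minimal sketch of the behavior I would expect. It is not the library's actual code: resolve_truncation and apply_truncation are hypothetical helpers written only for illustration, and the slicing stands in for real truncation of a single sequence. The strategy names and the True/'longest_first' pairing are taken from the documentation quoted above.

```python
from typing import List, Optional, Union

# Strategy names as listed in the transformers documentation.
VALID_STRATEGIES = {"longest_first", "only_first", "only_second", "do_not_truncate"}


def resolve_truncation(truncation: Union[bool, str]) -> str:
    """Hypothetical helper: map the user-facing argument to a strategy name.

    The point of the proposal is that False and 'do_not_truncate' resolve to
    the same strategy, independent of any max_length argument.
    """
    if truncation is False:
        return "do_not_truncate"
    if truncation is True:
        return "longest_first"
    if truncation not in VALID_STRATEGIES:
        raise ValueError(f"Unknown truncation strategy: {truncation!r}")
    return truncation


def apply_truncation(
    token_ids: List[int],
    truncation: Union[bool, str],
    max_length: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper: truncate a single sequence (simplified to a slice)."""
    strategy = resolve_truncation(truncation)
    if strategy == "do_not_truncate" or max_length is None:
        return list(token_ids)
    return list(token_ids)[:max_length]


ids = list(range(11))  # stand-in for the 11 token ids from the example above
print(len(apply_truncation(ids, "do_not_truncate", max_length=5)))
print(len(apply_truncation(ids, False, max_length=5)))
```

With this rule both calls return the full 11 tokens, and only True or an explicit truncating strategy shortens the input to max_length.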

Thanks,
Uri
