[BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.) #1544

MilkClouds · 2024-06-04T05:30:10Z

When I'm trying to add some tokens in vocab, there's 3 issue in Fast type tokenizers; there's no problem in python tokenizer, though.

Spacebar before additional token would be deleted if added token was not special token
If the added token was not a special token, entering additional tokens in a row (without spaces) would prevent the subsequent token from being tokenized
A single spacebar(ID 28705) between two additional tokens was treated as two (ID 259) when the added token was a special token.

Source code to recall issue

from transformers import (
    AutoProcessor,
    LlamaTokenizer,
    LlamaTokenizerFast,
)

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=True,
)
print(processor.tokenizer)


def test_tokenizer(tokenizer):
    # print(f"Tokenizer: {tokenizer}")
    print("=======")
    test_texts = [
        "!@#",
        "!@# ",
        "!@# <ACTION_1>",
        "!@# <ACTION_1> ",
        "!@# <ACTION_1> <ACTION_2>",
        "!@# <ACTION_1><ACTION_2>",
    ]
    for text in test_texts:
        print(f"{text:30}", tokenizer(text))


tokenizer = LlamaTokenizer.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)])
test_tokenizer(tokenizer)


tokenizer = LlamaTokenizer.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)], special_tokens=True)
test_tokenizer(tokenizer)

tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)])
test_tokenizer(tokenizer)


tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)], special_tokens=True)
test_tokenizer(tokenizer)

execution result

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
LlamaTokenizerFast(name_or_path='HuggingFaceM4/idefics2-8b', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>', 'additional_special_tokens': ['<fake_token_around_image>', '<image>', '<end_of_utterance>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
        0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        32000: AddedToken("<fake_token_around_image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        32001: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        32002: AddedToken("<end_of_utterance>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 32004], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 32004, 28705], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 32004, 28789, 17615, 28730, 28750, 28767], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 259], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 259, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Additional Note

If I use from_slow option to load Fast Tokenizer, it have no problem.
tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b", from_slow=True)

The text was updated successfully, but these errors were encountered:

MilkClouds · 2024-06-04T05:36:02Z

$ pip list | grep tokenizer                                                                                                                                                                   ✹
tokenizers                0.19.1
$ pip list | grep transformers                                                                                                                                                               
sentence-transformers     3.0.0
transformers              4.41.2

ArthurZucker · 2024-06-05T07:31:14Z

Hey! I think most of these can be removed if you set the legacy=False flag when initializing the tokenizer. I'll talk to the M4 team about this.

Basically the normalizer is prepending a space before each token, and before each split! For more details huggingface/transformers#28881

github-actions · 2024-07-06T01:50:40Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

MilkClouds · 2024-07-12T01:19:16Z

processor.tokenizer.legacy is already False, but same issue is occuring until now.

transformers version: 4.42.3
tokenizers version: 0.19.1

the output is exactly same as the time I first reported this issue.

ArthurZucker · 2024-07-12T13:29:44Z

Hey! As you mention:

tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b", from_slow=True)

the issue is that if you do not call "from_slow" I cannot update the tokenizer for you. I mean I'll ping the team internally for sure, but we need to re-upload tokenizer.json!

ArthurZucker · 2024-07-12T13:30:18Z

Can you open a PR on the hub and ping me here with the link? 🤗

ArthurZucker · 2024-08-06T05:29:43Z

Closing as the issue is with the model on the hub!

MilkClouds changed the title ~~Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.)~~ [BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.) Jun 4, 2024

github-actions bot added the Stale label Jul 6, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 11, 2024

ArthurZucker reopened this Jul 12, 2024

github-actions bot removed the Stale label Jul 13, 2024

ArthurZucker closed this as completed Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.) #1544

[BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.) #1544

MilkClouds commented Jun 4, 2024 •

edited

Loading

MilkClouds commented Jun 4, 2024

ArthurZucker commented Jun 5, 2024

github-actions bot commented Jul 6, 2024

MilkClouds commented Jul 12, 2024 •

edited

Loading

ArthurZucker commented Jul 12, 2024

ArthurZucker commented Jul 12, 2024

ArthurZucker commented Aug 6, 2024

[BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.) #1544

[BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.) #1544

Comments

MilkClouds commented Jun 4, 2024 • edited Loading

Source code to recall issue

execution result

Additional Note

MilkClouds commented Jun 4, 2024

ArthurZucker commented Jun 5, 2024

github-actions bot commented Jul 6, 2024

MilkClouds commented Jul 12, 2024 • edited Loading

ArthurZucker commented Jul 12, 2024

ArthurZucker commented Jul 12, 2024

ArthurZucker commented Aug 6, 2024

MilkClouds commented Jun 4, 2024 •

edited

Loading

MilkClouds commented Jul 12, 2024 •

edited

Loading