LLaMATokenizerFast works abnormally #23818

Closed
jiangwangyi opened this issue May 27, 2023 · 14 comments · Fixed by #24042 or #23909

@jiangwangyi (Contributor) commented May 27, 2023

System Info

platform==Ubuntu18.04
python==3.10
transformers==4.29.2

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

</s> is a special token of LLaMATokenizer(Fast), so it is expected to be recognized as a single token when encoding text. However, the two tokenizers behave differently:

>>> import transformers
>>> t1 = transformers.AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)
>>> t2 = transformers.AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)
>>> text = "I love you.</s>"
>>> t1(text)
{'input_ids': [1, 306, 5360, 366, 21106, 29879, 29958], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
>>> t2(text)
{'input_ids': [1, 306, 5360, 366, 29889, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}

Also, LLaMATokenizerFast returns token_type_ids, but LLaMATokenizer does not.

Expected behavior

LLaMATokenizerFast should be consistent with LLaMATokenizer.

@jiangwangyi changed the title from "LLaMATokenizerFast cannot recognize special tokens when encoding" to "LLaMATokenizerFast works abnormally" on May 30, 2023
@NielsRogge (Contributor) commented May 30, 2023

I also have two questions related to LlamaTokenizerFast:

First, loading a fast tokenizer from a saved slow one takes very long:

from transformers import LlamaTokenizer, LlamaTokenizerFast

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.save_pretrained(".")

# the following line takes > 1 min
fast_tokenizer = LlamaTokenizerFast.from_pretrained(".")

This is not the case for other tokenizers like BertTokenizerFast.
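
One workaround, sketched under the assumption (confirmed further down) that the slow part is converting the sentencepiece model, is to do the conversion once, save the fast tokenizer, and reload it from its own tokenizer.json (the local paths here are just placeholders):

from transformers import LlamaTokenizer, LlamaTokenizerFast

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.save_pretrained("./llama-slow")  # only the sentencepiece files are saved

# this load converts the sentencepiece model to the fast format and is slow
fast_tokenizer = LlamaTokenizerFast.from_pretrained("./llama-slow")

# saving the fast tokenizer writes a tokenizer.json, so later loads skip the conversion
fast_tokenizer.save_pretrained("./llama-fast")
fast_tokenizer = LlamaTokenizerFast.from_pretrained("./llama-fast")  # quick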

Second, for a new model I'm working on (#23460) I wonder how to get the same behaviour between slow and fast tokenizers for the following:

import torch
from transformers import LlamaTokenizer, LlamaTokenizerFast

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", truncation_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer.add_special_tokens({"bos_token": "</s>"})
tokenizer.add_special_tokens({"eos_token": "</s>"})
tokenizer.add_special_tokens({"unk_token": "</s>"})

fast_tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", truncation_side="left")
fast_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
fast_tokenizer.add_special_tokens({"bos_token": "</s>"})
fast_tokenizer.add_special_tokens({"eos_token": "</s>"})
fast_tokenizer.add_special_tokens({"unk_token": "</s>"})

prompt = "What is unusual about this image?"

encoding = tokenizer(prompt, return_tensors="pt")

fast_encoding = fast_tokenizer(prompt, return_tensors="pt")

for k,v in encoding.items():
    assert torch.allclose(fast_encoding[k], v)

=> this assertion fails since the input_ids differ:

tensor([[    2,  1724,   338, 22910,  1048,   445,  1967, 29973]])
tensor([[    1,  1724,   338, 22910,  1048,   445,  1967, 29973]])
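
A hedged guess at the cause: the fast tokenizer's post-processor, which prepends the BOS id, is built when the tokenizer is converted, so changing bos_token afterwards may not update it. One way to inspect this is to dump the backend tokenizer state and look at the post_processor section:

import json

state = json.loads(fast_tokenizer.backend_tokenizer.to_str())
print(state["post_processor"])  # shows which id/token is prepended as BOS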

@NielsRogge (Contributor)

cc'ing @ArthurZucker and @Narsil here

@ArthurZucker (Collaborator)

Hey! Thanks for opening this issue.

  • return_token_type_ids should be None by default and is then set based on whether "token_type_ids" is in self.model_input_names. This is specific to the fast tokenizer and is a known difference. I am not sure why this was added only in the fast tokenizer, but it is more than two years old!
  • The BPE model splits on spaces before encoding the tokens. When converting the models from slow to fast, the special tokens were added to the BPE vocabulary with a score of 0. We probably forgot to add them to the list of additional_special_tokens, which is why they are not properly split. (Quick fix: t1.additional_special_tokens = ["</s>", ...]; see the sketch after this list.)
  • @NielsRogge when you load a fast tokenizer from a slow one, it takes a long time because the sentencepiece BPE model has to be converted, which is very slow. Nothing we can do about that.
  • About your second question, the best thing would be to open a new issue. It seems like it might be another slow/fast discrepancy, but you are not completely doing this the way the API is designed! (Check that each call to add a token actually adds it!)
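
A rough sketch of the workarounds mentioned above (the token_type_ids behaviour and the quick fix; note that the follow-up below reports the quick fix is not enough for </s>, which is already the registered eos_token):

# ask the fast tokenizer not to return token_type_ids for a single call
t1("I love you.</s>", return_token_type_ids=False)

# or drop them from the inputs it produces by default
t1.model_input_names = ["input_ids", "attention_mask"]

# the quick fix suggested above: declare </s> as an additional special token
t1.additional_special_tokens = ["</s>"]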

@jiangwangyi (Contributor, Author)

In the tokenizer_config.json of huggyllama/llama-7b, </s> is already a registered special token (the eos_token). Adding </s> to t1.additional_special_tokens does not fix the problem.

@ArthurZucker (Collaborator)

Indeed, sorry for the confusion. I added a different token <//s> with add_special_tokens, which worked as expected (meaning that whether there was a space or not, the output was properly encoded), which is why the issue most probably lies in the handling of the special tokens (maybe we should not have added them to the vocab? I'll check). I'll dig into this!

@jiangwangyi (Contributor, Author)

@ArthurZucker Any progress on this?

@ArthurZucker (Collaborator)

I am still working on this, top priority! My PR did not fix it yet, so I am opening a new one just for LLaMA and will see about the other models.

@jiangwangyi (Contributor, Author)

Thanks for working on this! I appreciate the update and look forward to getting the issue resolved.

@ArthurZucker (Collaborator)

Update: in order to fix this, the tokenizer.json should be modified: the special tokens should not be normalized (so set normalized = False). There is a more profound issue, since the slow tokenizer is not bothered by this and handles it differently.
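
A minimal sketch of that edit, assuming a locally saved copy of the tokenizer files (the added_tokens layout is the standard one written by the tokenizers library):

import json

# flip the special tokens to normalized = False in a local tokenizer.json
with open("tokenizer.json") as f:
    state = json.load(f)

for added in state["added_tokens"]:
    if added["content"] in ("<s>", "</s>", "<unk>"):
        added["normalized"] = False

with open("tokenizer.json", "w") as f:
    json.dump(state, f, indent=2, ensure_ascii=False)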

@jiangwangyi (Contributor, Author) commented Jun 11, 2023

@ArthurZucker
My transformers version is 4.30.1. I did not change the tokenizer_config.json; instead, I replaced the default special tokens with add_special_tokens, like so:

>>> from transformers import AutoTokenizer
>>> lt = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
>>> lt
LlamaTokenizerFast(name_or_path='huggyllama/llama-7b', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)
>>> lt.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
>>> lt
LlamaTokenizerFast(name_or_path='huggyllama/llama-7b', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False)
>>> lt("ok</s>")
{'input_ids': [1, 3431, 829, 29879, 29958], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

It seems that the problem still exists?

@ArthurZucker (Collaborator)

Hey, as mentioned in #23889 as well as in #24042, the tokenizer.json has to be modified. I did not have time to open PRs on all models yet, but you still have normalized = True on the special tokens, which is why they are split.

@jiangwangyi (Contributor, Author)

As shown in your example in #23889, if I do not modify the tokenizer.json, resetting the bos_token and eos_token when initializing the fast tokenizer, or using the add_special_tokens method, does not work (the normalized=True attribute still exists), even though the special_tokens_dict attribute has been changed to {"bos_token": "<s>", "eos_token": "</s>"}. Is that true?

@ArthurZucker (Collaborator)

Yes. Basically, you have to correctly add the tokens when converting, otherwise the underlying regex is not properly updated. We are thinking of adding an update_tokens feature, which would allow modifying a token that is already part of the vocab.
See the following problem:

In [1]: from tokenizers import AddedToken

In [2]: lt.add_special_tokens({"eos_token": AddedToken("<//s>", normalized = False)})
Out[2]: 1

In [3]: lt.encode("Another tests<//s>")
Out[3]: [1, 7280, 6987, 32000]

In [4]: lt.add_special_tokens({"eos_token": AddedToken("<//s>", normalized = True)})
Out[4]: 0

In [5]: lt.encode("Another tests<//s>")
Out[5]: [1, 7280, 6987, 32000]

In [6]: lt.add_special_tokens({"eos_token": AddedToken("<///s>", normalized = True)})
Out[6]: 1

In [7]: lt.encode("Another tests<///s>")
Out[7]: [1, 7280, 6987, 29966, 6658, 29879, 29958]

@jiangwangyi (Contributor, Author)

Thank you for your kind guidance!
