Special token handling breaks idempotency of sentencepiece due to extra spaces #31513
Do you have a reproducer?
Llama-based tokenizers don't have this issue anymore; it was fixed by the metaspace refactoring.
Are you using ...?
Also, the snippet shared:

```python
from transformers import LlamaTokenizer

model_id = "lmsys/vicuna-13b-delta-v1.1"
tokenizer = LlamaTokenizer.from_pretrained(model_id, add_bos_token=False)
message = "<s>hello</s>"
decoded = tokenizer.decode(tokenizer(message)['input_ids'])
print(decoded, decoded == message)
```

this is on ...
Update:

```python
In [2]: tokenizer.tokenize(message)
Out[2]: ['<s>', '▁hello', '</s>']
```

This is kind of expected: we add a prefix space at the beginning.
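For context, a minimal sketch of how that prefix space breaks the round-trip, reusing the model id from the snippet above (the exact decoded strings are assumptions and depend on the transformers version):

```python
from transformers import LlamaTokenizer

# Same setup as the reproducer above.
tok = LlamaTokenizer.from_pretrained("lmsys/vicuna-13b-delta-v1.1", add_bos_token=False)

message = "<s>hello</s>"
decoded = tok.decode(tok(message)["input_ids"])
redecoded = tok.decode(tok(decoded)["input_ids"])

# If a prefix space is injected after "<s>", decoded != message and the
# round-trip is no longer idempotent (on some versions the space keeps growing).
print(repr(message), repr(decoded), repr(redecoded))
```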
cc @itazap
Hi! #31315 will fix this with:

```python
tokenizer = LlamaTokenizer.from_pretrained(model_id, add_bos_token=False, legacy=False, add_prefix_space=False)
```
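For illustration, a sketch of how one might verify the round-trip with those flags (this assumes a transformers release where `LlamaTokenizer` accepts `add_prefix_space`, which is what the linked PR adds):

```python
from transformers import LlamaTokenizer

model_id = "lmsys/vicuna-13b-delta-v1.1"
# Flags suggested above; add_prefix_space requires a recent enough transformers.
tok = LlamaTokenizer.from_pretrained(model_id, add_bos_token=False, legacy=False, add_prefix_space=False)

message = "<s>hello</s>"
decoded = tok.decode(tok(message)["input_ids"])
# With the prefix space disabled, decode(encode(x)) should give back x.
assert decoded == message, (decoded, message)
```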
Hi,
... gives this, with an extra space after the added_tokens.
If so, will the PR be merged shortly? Thanks.
Hey @vince62s 😊! Passing:

```python
tokenizer = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2", padding_side='left', legacy=False, add_prefix_space=False)
# Output
'<s><|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant
'
```

Would this be suitable for your use-case?
Well, I thought this setting was part of the unmerged #31315, but there is some strange behavior.
They should have different behaviours, but for `False` it is correct that the space disappears. I think if ...
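To make the contrast concrete, a small sketch (model and flags as in the comments above, assuming a transformers version that accepts `add_prefix_space` for this tokenizer; the resulting token strings are illustrative, not verified output):

```python
from transformers import AutoTokenizer

model_id = "Unbabel/TowerInstruct-7B-v0.2"
text = "<s>hello</s>"

for add_prefix_space in (True, False):
    tok = AutoTokenizer.from_pretrained(model_id, legacy=False, add_prefix_space=add_prefix_space)
    # With True, "hello" is typically tokenized as "▁hello"; with False, as "hello".
    print(add_prefix_space, tok.tokenize(text))
```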
There is really something strange with the tokenizer behavior. Using `Unbabel/TowerInstruct-7B-v0.2`, with no flag:

```python
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2", padding_side='left')
prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
print(input_ids)
print(prompt)
outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
print(outputs)
```
```
tensor([[ 1, 32006, 1404, 13, 4300, 9632, 278, 1494, 1426, 515,
         4223, 964, 5332, 29889, 13, 24636, 29901, 15043, 3186, 13,
         29954, 3504, 29901, 32005, 29871, 13, 32006, 20255, 13]],
       device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|> user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|> \n<|im_start|> assistant\n']
```

So you see the space added 3 times: before "user", between "<|im_end|>" and "\n", and before "assistant". As said before, if we add the flag `add_prefix_space=True`:

```python
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2", padding_side='left', add_prefix_space=True)
prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
print(input_ids)
print(prompt)
outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
print(outputs)
```
```
tensor([[ 1, 32006, 1792, 13, 4300, 9632, 278, 1494, 1426, 515,
         4223, 964, 5332, 29889, 13, 24636, 29901, 15043, 3186, 13,
         29954, 3504, 29901, 32005, 13, 32006, 465, 22137, 13]],
       device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n']
```

You can note that the tokens are not the same (1792 = "user" instead of 1404 = "▁user", and "assistant" broken into 465, 22137 instead of "▁assistant" = 20255) => why the same behavior with False or True?

NOW, with `utter-project/EuroLLM-1.7B-Instruct` and no flag:

```python
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left')
prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
print(input_ids)
print(prompt)
outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
print(outputs)
```
```
tensor([[ 1, 3, 15236, 271, 31702, 31817, 557, 5302, 6001,
         1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
         4437, 271, 60457, 119782, 4, 119715, 271, 3, 58406,
         271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|> user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|> \n<|im_start|> assistant\n']
```

With the flag=True, the space is added (which is not the same behavior as with the llama2 one). With `add_prefix_space=False`:

```python
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left', add_prefix_space=False)
prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
print(input_ids)
print(prompt)
outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
print(outputs)
```
```
tensor([[ 1, 3, 13676, 271, 31702, 31817, 557, 5302, 6001,
         1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
         4437, 271, 60457, 119782, 4, 271, 3, 788, 35441,
         271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n']
```

Again, the tokens are not the same, of course. Can someone clarify exactly what is going on? Thanks
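One way to see where the "▁" markers actually land is to print the token strings rather than the ids. A sketch, reusing the Tower model id from above (whether `add_prefix_space` is accepted by this checkpoint's tokenizer class is an assumption):

```python
from transformers import AutoTokenizer

prompt = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"

for kwargs in ({}, {"add_prefix_space": True}, {"add_prefix_space": False}):
    tok = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2", **kwargs)
    ids = tok(prompt)["input_ids"]
    # convert_ids_to_tokens shows whether "user"/"assistant" got the "▁" prefix,
    # which is where the extra decoded space comes from.
    print(kwargs, tok.convert_ids_to_tokens(ids))
```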
For your second case, you should not use ...
Still unclear. For the first model (the llama2 one): why does it trigger a different token for "user" when not using the flag vs using the flag?
Now, it also affects decoding. Basically the decoding removes the added space. But that means decoding ...
The prompt is: ... When you set ...
I am not even talking about decoding at this point, just encoding.
Your script has ... To be honest, most of the issues are because of the ...
OK, keeping ...
How do you explain the choice of 5205 or 1788 for "system" / "▁system"?
When you set ... Again, you should always print ...
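For example, a sketch of that kind of inspection (the ids 5205/1788 come from the comment above; which checkpoint they belong to is not stated in the thread, so the model id below is only a guess):

```python
from transformers import AutoTokenizer

# Hypothetical check: map the two candidate ids back to their token strings,
# and the two surface forms to their ids, to see which one the prompt produced.
tok = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
print(tok.convert_ids_to_tokens([5205, 1788]))
print(tok.convert_tokens_to_ids(["system", "▁system"]))
```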
Does that make more sense for you? (linked PR)
If your PR goes in like this, my understanding is that you will break the behavior of a model like this one: https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct/discussions/6
I am not sure I understand what you expect from me at this point 😅 I can't "fix" the fact that the issue contaminated the initial training!
My point is that it is still very unclear. We are talking about Llama, but look at Mistral with sentencepiece (not tekken): ...

It will give you: ...

Again, 2606 = "user", 2956 = "▁user", and [1257, 11911] = ["ass", "istant"] while 14660 = "▁assistant". When you look at the legacy code of Mistral here: ... Am I clearer in the explanation of the issue?

EDIT: Is there somewhere a patch that forces add_prefix_space to True for Mistral?
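A sketch of the kind of check being described (the Mistral checkpoint below is an assumption, since the commenter did not say which one they used; the ids quoted above are theirs and are not re-verified here):

```python
from transformers import AutoTokenizer

# Assumed checkpoint; any Mistral sentencepiece (non-Tekken) model would illustrate the same point.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

for text in ("user", " user", "assistant", " assistant"):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    # Shows how the presence of a leading space decides between e.g. "user" and "▁user".
    print(repr(text), ids, tok.convert_ids_to_tokens(ids))
```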
We have this document that may be of help to understand this issue: https://github.com/mistralai/cookbook/blob/main/concept-deep-dive/tokenization/chat_templates.md It explains why exactly each template is slightly different.
Thanks @pandora-s-git, this doc is very clear. However, the fact that in some versions the word "user" becomes "▁user" and in others "user", and likewise "assistant" becomes "▁assistant" or ("ass", "istant") for the same model: does it trigger a big difference in quality in the end? (Since my understanding is that Mistral-Instruct-v0.3 supports both V2 and V3 tokenizers.)
From experience it can have a huge impact on completion. Specifically, let's say I provide the model (tokenizer v2 or v3): ... I hope this answers your question, but from experience these white spaces have a lot more importance than one may think. And be careful with Tekken: the reason V3-Tekken is considered a V3 template is that the Mistral Common implementation of the template is exactly the same as the normal V3, the only difference being that V3 uses sentencepiece and V3-Tekken uses Tiktoken. But this difference actually impacts the template itself if we use the string representations, becoming: ... And the tokenizer vocab being completely different, it will of course tokenize differently. V2 and V3 (sentencepiece) tokenizers are very similar; the only difference between them is the tool calling.
If I may (and this is valid for all models, in fact): it would be great to post in the model card the token IDs of an expected prompt with special tokens, so that one can verify that the HF flags are set correctly for both finetuning and inference. I have the impression this issue is hugely overlooked. Anyway, thanks for your answers.
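A sketch of what such a model-card check could look like (the reference prompt and ids below are placeholders, not real values published for any model):

```python
from transformers import AutoTokenizer

# Hypothetical reference published in a model card: a prompt plus its expected ids.
REFERENCE_PROMPT = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
REFERENCE_IDS = [1, 32006, 1792, 13]  # placeholder values, not a real checkpoint's output

tok = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
ids = tok(REFERENCE_PROMPT)["input_ids"]

if ids != REFERENCE_IDS:
    print("Tokenizer flags likely differ from the ones used at training time:")
    print("expected:", REFERENCE_IDS)
    print("got:     ", ids)
```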
Sentencepiece tokenizers have the property that `Decode(Encode(Normalize(input))) == Normalize(input)`. This property is very useful when combining and re-inferring prompts. However, when used through `tokenizers` with special tokens added for BOS/EOS etc., `tokenizers` will inject an extra space around special tokens when decoding, i.e. `<s>A` will become `<s> A`, which when encoded and decoded again will become `<s>  A`, `<s>   A`, etc.

A previous issue was raised about this but incorrectly closed as intended behavior/unfixable: huggingface/tokenizers#1237. Although not all tokenizers have this property, sentencepiece is very widely used now due to Llama and Mistral, so it would make sense for this behavior to be preserved.

There could be two fixes for this: either do not add the extra space, or tokenize `<s> A` the same as `<s>A` (I think this could be accomplished by changing the `AddedToken` params for these tokens).
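For the second fix, a sketch of what adjusting the `AddedToken` parameters might look like (whether strip flags alone are enough, and whether this also fixes the decode side, is an assumption rather than something the thread confirms):

```python
from transformers import AutoTokenizer, AddedToken

tok = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-delta-v1.1")

# Re-register the special tokens so surrounding whitespace is absorbed when matching,
# so that "<s> A" tokenizes the same as "<s>A". Decode behavior may still need changes.
tok.add_special_tokens({
    "bos_token": AddedToken("<s>", lstrip=False, rstrip=True, normalized=False),
    "eos_token": AddedToken("</s>", lstrip=True, rstrip=False, normalized=False),
})

message = "<s>hello</s>"
print(tok.decode(tok(message)["input_ids"]))
```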