Llama tokenizer inconsistency for the newline character for convert_tokens_to_ids #31030
Hello! 🤗 The character 'Ċ' is actually represented by 2 token ids (after the begin_of_text token), because its UTF-8 encoding is two bytes. Token 128 corresponds to another token. So, to decode 'Ċ' you will need:

As for why you are seeing 'Ċ' as opposed to '\n': this is related to the byte-level BPE algorithm, which maps whitespace bytes like newline and tab onto printable unicode characters. When such a character is then encoded as text, it may span multiple bytes and therefore multiple ids (2 values in this case). It is explained well in this comment here and here. I hope I answered your questions! Feel free to reply with any further questions.
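To make the byte-to-character mapping concrete, here is a minimal re-implementation of the byte-to-unicode table used by GPT-2-style byte-level BPE tokenizers (the same family of mapping that produces 'Ċ' for the newline byte in the Llama 3 tokenizer as shipped in transformers). This is a sketch for illustration, not transformers' internal code:

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character.

    Printable bytes map to themselves; control/whitespace bytes are
    shifted up into the range starting at U+0100 so every token string
    is made of visible characters.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable byte: assign it the next codepoint above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_to_char = bytes_to_unicode()
print(byte_to_char[ord("\n")])  # 'Ċ'  (U+010A)
print(byte_to_char[ord("\t")])  # 'ĉ'  (U+0109)
print(byte_to_char[ord(" ")])   # 'Ġ'  (U+0120)
```

This is why the vocabulary contains 'Ċ' rather than a literal '\n', and why 'Ġ' shows up in place of leading spaces.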
So when I use GenerationConfig, I want to initialize it like so:
Using the newline character as a stop string doesn't work for Llama 3 because internally it is using something similar to
Can you please share a small reproducer?
Hey, were you able to get it working? I too have the same issue. I want to use the stop_strings parameter for stopping generation. |
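One way around the token/character mismatch (a workaround sketch, not transformers' actual `stop_strings` implementation) is to check the stop condition on the *decoded* text instead of on token ids, since decoding turns 'Ċ' back into a real '\n':

```python
def should_stop(generated_text: str, stop_strings: list[str]) -> bool:
    """Return True once any stop string appears in the decoded output.

    Matching on decoded text sidesteps the byte-level representation
    ('Ċ' vs '\n') entirely.
    """
    return any(s in generated_text for s in stop_strings)

print(should_stop("Hello world\n", ["\n"]))  # True
print(should_stop("Hello world", ["\n"]))    # False
```

In a generation loop you would decode the accumulated ids each step and break when `should_stop` returns True; this is less efficient than id-level matching but avoids the inconsistency discussed above.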
System Info
transformers 4.41.0
torch 2.3.0
GPU: NVIDIA GeForce RTX 4090, CUDA version 12.3
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am trying to get the token id for the newline character for Llama 3, and found this weird inconsistency. Basically, `convert_tokens_to_ids('\n')` outputs `None`, but `tokenize('\n')` outputs `198`. But then `tokenizer.convert_ids_to_tokens(198)` gives me `Ċ`.
Expected behavior
I expected the output of `convert_tokens_to_ids('\n')` to be `128`.