- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)

Reproduction
```python
from transformers import AutoTokenizer

phi_2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

for name, tokenizer in (("phi-2", phi_2_tokenizer), ("phi-3", phi_3_tokenizer)):
    print(f"Tokenizer: {name}")
    tokens = tokenizer.encode("This is a test string")
    print(f"{tokens=}")
    print(tokenizer.decode(tokens))
    print("".join([tokenizer.decode(token) for token in tokens]))
    print("-" * 50)
```
```text
Tokenizer: phi-2
tokens=[1212, 318, 257, 1332, 4731]
This is a test string
This is a test string
--------------------------------------------------
Tokenizer: phi-3
tokens=[1, 910, 338, 263, 1243, 1347]
<s> This is a test string
<s>Thisisateststring
--------------------------------------------------
```
Expected behavior
I expect that, even when decoding a single token at a time, the resulting string contains the spaces between tokens.
As the output shows, the Phi-2 tokenizer has no such problem, but for some reason Phi-3 produces a concatenated string without spaces.
Hey @Andrei-Aksionov, thanks for the reproducer! It has to do with Phi-3 being based on LlamaTokenizerFast and Phi-2 on CodeGen. LlamaTokenizerFast strips leading whitespace so that it can manually add a prefix space when add_prefix_space is set. I'm looking into a fix now that handles this better!
System Info

transformers version: 4.41.2

Who can help?

@ArthurZucker