The string after encoding and decoding should round-trip unchanged. The original tokenizer behaves this way, but the transformers version drops the whitespace. Is this expected?
I know the original Llama-3 tokenizer is now based on tiktoken. Is that the reason for this difference?
Hey! This is because of a default in transformers: `print(tokenizer.decode(tokenizer("! ! !").input_ids, clean_up_tokenization_spaces=False))`
should do the trick.
Let's default it to False and deprecate it. cc @itazap!
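For context, the cleanup that eats the whitespace is roughly the following chain of string replacements. This is a simplified sketch of what transformers does when `clean_up_tokenization_spaces` is on, not the exact library source:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Sketch of the whitespace cleanup transformers applies to decoded
    text when clean_up_tokenization_spaces is True (simplified)."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("! ! !"))  # → "!!!" — the spaces before "!" are stripped
```

This is why the round-trip loses whitespace only around punctuation: the cleanup was designed for wordpiece-style tokenizers that emit punctuation as separate tokens, and it misfires on inputs where the spaces are real.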
System Info
transformers==4.41.2
tiktoken==0.7.0
and0.4.0
Who can help?
@ArthurZucker @younesbelkada
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The issue can be reproduced with the following snippet.
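The original snippet did not survive in this copy of the issue; a minimal reproduction sketch follows. It uses the public `gpt2` checkpoint as a stand-in, since the Meta-Llama-3 repository is gated; the decode default being demonstrated is shared across fast tokenizers:

```python
from transformers import AutoTokenizer

text = "! ! !"
try:
    # "gpt2" is a stand-in; the issue was reported against the (gated)
    # Meta-Llama-3 tokenizer, but the decode default behaves the same way.
    tok = AutoTokenizer.from_pretrained("gpt2")
except Exception:  # no network access / checkpoint unavailable
    tok = None

if tok is not None:
    ids = tok(text).input_ids
    cleaned = tok.decode(ids)  # default: cleanup applied, whitespace lost
    raw = tok.decode(ids, clean_up_tokenization_spaces=False)
    print(repr(cleaned))  # whitespace before "!" stripped by the cleanup
    print(repr(raw))      # round-trips the original string exactly
```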
Expected behavior
The decoded string should be identical to the original input, whitespace included, matching the behavior of the original tiktoken-based Llama-3 tokenizer.