Original Llama-3 tokenizer behaves differently from `transformers` version #31187

chawins · 2024-06-02T05:55:09Z

System Info

transformers==4.41.2
tiktoken==0.7.0 and 0.4.0

Who can help?

@ArthurZucker @younesbelkada

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The issue can be reproduced with the following snippet.

from transformers import AutoTokenizer

# Hugging Face tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ids = hf_tokenizer("! ! !").input_ids
print(ids)  # >> [128000, 0, 758, 758]
print(hf_tokenizer.decode(ids))  # >> <|begin_of_text|>!!! (this is wrong?)

# Original Llama-3 tokenizer from https://github.com/meta-llama/llama3
# Download tokenizer.model from https://llama.meta.com/llama-downloads/
from llama import Tokenizer

llama_tokenizer = Tokenizer("path/to/tokenizer.model")
ids = llama_tokenizer.encode("! ! !", bos=True, eos=False)
print(ids)  # >> [128000, 0, 758, 758]
print(llama_tokenizer.decode(ids))  # >> <|begin_of_text|>! ! ! (this is expected)

Expected behavior

Encoding a string and decoding it back should return the same string. The original tokenizer behaves this way, but the transformers version drops the whitespace between the exclamation marks, even though both produce the same token ids. Is this expected?

I know that the original Llama-3 tokenizer is now based on tiktoken. Is that the reason we see this difference?


ArthurZucker · 2024-06-02T10:27:42Z

Hey! This is because of a default in transformers: decode cleans up tokenization spaces unless told otherwise.
print(tokenizer.decode(tokenizer("! ! !").input_ids, clean_up_tokenization_spaces=False))
should do the trick.
Let's change the default to False and deprecate it. cc @itazap!
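
For readers wondering why the spaces disappear, here is a minimal sketch of the kind of post-decode cleanup that clean_up_tokenization_spaces=True triggers. It is a standalone approximation written for illustration, not the library's code: the function name and the exact replacement list are assumptions modeled on the cleanup step in transformers, and the real list may differ between versions.

```python
def clean_up_tokenization(text: str) -> str:
    """Approximate the space cleanup applied after decoding.

    Assumption: the replacement pairs below mirror the ones used by
    transformers; treat them as illustrative, not authoritative.
    """
    replacements = [
        (" .", "."),
        (" ?", "?"),
        (" !", "!"),   # this rule is the one that collapses "! ! !"
        (" ,", ","),
        (" ' ", "'"),
        (" n't", "n't"),
        (" 'm", "'m"),
        (" 's", "'s"),
        (" 've", "'ve"),
        (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# The raw decoded string still has its spaces; the cleanup removes them.
raw = "<|begin_of_text|>! ! !"
print(clean_up_tokenization(raw))  # >> <|begin_of_text|>!!!
```

Under this sketch the token ids are untouched and only the decoded string is rewritten, which is consistent with both tokenizers returning [128000, 0, 758, 758] above.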

chawins · 2024-06-02T17:00:07Z

Awesome! I confirmed that clean_up_tokenization_spaces=False fixed the issue. Thanks a lot for pointing it out.

itazap mentioned this issue Jun 4, 2024

depreciating all occurances of clean_up_tokenization_spaces #31232

Closed

ArthurZucker mentioned this issue Jun 5, 2024

llama3 tokenizer doesn't round trip huggingface/tokenizers#1543

Closed

chawins closed this as completed Jun 24, 2024

chawins mentioned this issue Jun 24, 2024

ModuleNotFoundError: No module named 'llama' chawins/pal#5

Open
