GemmaTokenizerFast word_ids() returns only zeros #31437

Alienmaster · 2024-06-15T10:15:55Z

System Info


transformers version: 4.41.2
Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.31
Python version: 3.10.13
Huggingface_hub version: 0.23.1
Safetensors version: 0.4.2
Accelerate version: 0.28.0
Accelerate config: not found
PyTorch version (GPU?): 2.2.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The method word_ids() returns only a list of zeros instead of the correct word indices.

from transformers import AutoTokenizer

sentence = "I love my cat"
tokenizer = AutoTokenizer.from_pretrained("google/Gemma-7b")  # model revision a0eac5b
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded.word_ids())
# [None, 0, 0, 0, 0]

I tried several of the configuration variations suggested in the issues linked from #28881, but none of them changes the result for Gemma. The llama3 tokenizer outputs the correct values with the same code.

Expected behavior

The output of word_ids() should look like this:
[None, 0, 1, 2, 3]
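
As a stopgap, the expected indices can be derived from the token strings themselves. The following is only a rough sketch: the helper name `word_ids_from_tokens` is hypothetical, the token list is an illustrative assumption rather than verified Gemma output, and it assumes that a leading "▁" marks the start of a new word.

```python
from typing import List, Optional, Set


def word_ids_from_tokens(
    tokens: List[str],
    special_tokens: Set[str],
    marker: str = "▁",
) -> List[Optional[int]]:
    """Assign a word index to each token (hypothetical helper, not a transformers API)."""
    word_ids: List[Optional[int]] = []
    current = -1  # no word seen yet
    for tok in tokens:
        if tok in special_tokens:
            # Special tokens such as <bos> belong to no word.
            word_ids.append(None)
            continue
        # A new word starts at the first real token or at any token
        # that begins with the whitespace marker.
        if current == -1 or tok.startswith(marker):
            current += 1
        word_ids.append(current)
    return word_ids


# Illustrative tokens only (assumed, not taken from a real tokenizer run).
tokens = ["<bos>", "I", "▁love", "▁my", "▁cat"]
print(word_ids_from_tokens(tokens, {"<bos>"}))
# [None, 0, 1, 2, 3]
```

With a real tokenizer, the token strings could come from `tokenizer.convert_ids_to_tokens(...)` and the special tokens from `tokenizer.all_special_tokens`. Sub-word continuations without the marker keep the index of the word they belong to.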


ArthurZucker · 2024-06-19T11:59:45Z

Hey! Will have a look, thanks for reporting.

ArthurZucker · 2024-07-16T08:42:58Z

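It seems that we need this:

from tokenizers.pre_tokenizers import Sequence, Split

# Split on the "▁" marker, keeping it attached to the piece that follows it
tokenizer._tokenizer.pre_tokenizer = Sequence([Split("▁", "merged_with_next")])
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded.word_ids())
# [None, 0, 1, 2, 3]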
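
To illustrate why this likely helps: as far as I understand, word indices follow the pieces produced by the pre-tokenizer, so without one the whole input counts as a single word 0. The sketch below is a pure-Python imitation of the "merged_with_next" behaviour, not the tokenizers implementation; the function name `split_merged_with_next` is hypothetical and the input assumes spaces have already been normalized to "▁".

```python
import re
from typing import List


def split_merged_with_next(text: str, delimiter: str = "▁") -> List[str]:
    """Split before every delimiter so it stays attached to the next piece."""
    # Zero-width lookahead: cut right in front of each delimiter (Python 3.7+).
    pieces = re.split(f"(?={re.escape(delimiter)})", text)
    # Drop the empty piece produced when the text starts with the delimiter.
    return [p for p in pieces if p]


normalized = "I▁love▁my▁cat"  # assumed normalized form of "I love my cat"
for word_id, piece in enumerate(split_merged_with_next(normalized)):
    print(word_id, piece)
# 0 I
# 1 ▁love
# 2 ▁my
# 3 ▁cat
```

Each piece gets its own index, which matches the [None, 0, 1, 2, 3] result once the special token at the start is accounted for.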

huggingface deleted a comment from github-actions bot Jul 16, 2024
