Gemma's Tokenizer fails to split on spaces #30416

vasqu · 2024-04-23T10:20:55Z

Preface

This is related to #29617 and describes a similar problem, so this issue is mostly a placeholder. It should be fixed by #28881.

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

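from transformers import AutoTokenizer

# google/gemma-7b is a gated model; token=True uses the locally stored Hugging Face login.
tokenizer = AutoTokenizer.from_pretrained('google/gemma-7b', token=True)
sentence = 'Sampel to demonstrate the issue'

input_encoding = tokenizer(sentence, add_special_tokens=True)
tokens = input_encoding.tokens()      # token strings (fast tokenizers only)
word_ids = input_encoding.word_ids()  # source word index for each token, None for special tokens

print(f'Tokens produced: {tokens}\nReferenced word ids: {word_ids}')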

Expected behavior

The sentence is split on spaces, and each token is assigned the word id of the space-separated word it belongs to.
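One way to state the expected result concretely is a small reference function that derives word ids from the token strings, which can then be compared against `word_ids()`. This is only a sketch under assumptions not stated in the report: that a leading "▁" marks a token preceded by a space (the SentencePiece convention), and that `<bos>` and `<eos>` are the special tokens. The helper `expected_word_ids` and the token list are hypothetical and hand-written for illustration; they are not actual tokenizer output.

```python
from typing import List, Optional, Sequence

# Assumption: "▁" (U+2581) prefixes a token that follows a space.
SPACE_MARKER = "\u2581"


def expected_word_ids(
    tokens: Sequence[str],
    special_tokens: Sequence[str] = ("<bos>", "<eos>"),  # assumed special tokens
) -> List[Optional[int]]:
    """Assign a word index to each token, starting a new word at every space marker."""
    word_ids: List[Optional[int]] = []
    current = -1  # no word seen yet
    for token in tokens:
        if token in special_tokens:
            word_ids.append(None)  # special tokens belong to no word
            continue
        # The first regular token, or any token carrying the space marker, opens a new word.
        if current == -1 or token.startswith(SPACE_MARKER):
            current += 1
        word_ids.append(current)
    return word_ids


# Hand-written, illustrative token list (not real output of the Gemma tokenizer).
example_tokens = ["<bos>", "Sam", "pel", "▁to", "▁demonstrate", "▁the", "▁issue"]
print(expected_word_ids(example_tokens))  # [None, 0, 0, 1, 2, 3, 4]
```

Comparing this list with `input_encoding.word_ids()` from the reproduction would show whether tokens from different space-separated words are being grouped under the same word id.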

amyeroberts added the Core: Tokenization label Apr 23, 2024

ArthurZucker mentioned this issue Apr 23, 2024

[LlamaTokenizerFast] Refactor default llama #28881

Merged

ArthurZucker closed this as completed in #28881 Apr 23, 2024
