
Custom tokenizer fails to encode despite characters being in mergeable_ranks #289

Open · afang-story opened this issue May 2, 2024 · 1 comment

@afang-story

Hello,

I'm trying to create a custom tokenizer, but I'm getting "pyo3_runtime.PanicException: no entry found for key" even though I'm sure every character in the input has an entry in mergeable_ranks. This seems to happen when a character that requires multiple bytes is immediately followed by another character.

Here is a simple example for reproducibility:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'“'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("a“")) # this works, [1, 0]
print(enc.encode("“a")) # fails: pyo3_runtime.PanicException: no entry found for key
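
For what it's worth, splitting the input with the encoding's pattern by hand suggests why the order matters (my reading of it, not confirmed): "a“" is split into two pieces that are each exact keys in mergeable_ranks, while "“a" becomes a single piece, so the encoder has to fall back to byte-level merges over bytes that have no ranks. A minimal check, assuming the regex package that tiktoken already depends on:

import regex
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat = regex.compile(cl100k_base._pat_str)

print(pat.findall("a“"))  # ['a', '“'] -> two pieces, each a whole vocab key
print(pat.findall("“a"))  # ['“a'] -> one piece, needs byte-level merges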

Any ideas for how to fix this?

Thanks in advance for the help.

@Muennighoff

It also happens the other way round with non-Latin characters, e.g. a multi-byte character immediately preceded by another character:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("aか")) # same panic: no entry found for key

Maybe there's some setting that needs to be changed, or some fallback that needs to be added, to cover this?
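
One possible workaround (an assumption based on how the byte-pair merge seems to index mergeable_ranks, not a confirmed fix): give every single byte its own rank, plus an entry for each intermediate merge of a multi-byte token, so the byte-level fallback never hits a missing key. A sketch:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Ranks 0-255 cover every raw byte, so no piece can hit a missing key.
tik_vocab = {bytes([b]): b for b in range(256)}
# 'か' is 3 bytes (e3 81 8b); BPE builds it up pairwise, so the
# intermediate 2-byte prefix has to be in the vocab as well.
tik_vocab['か'.encode()[:2]] = 256  # intermediate merge b'\xe3\x81'
tik_vocab['か'.encode()] = 257      # full 3-byte token

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens={},
)
print(enc.encode("aか"))  # expected [97, 257]: b'a' as a raw byte, then 'か'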
