
Custom tokenizer fails to encode despite characters being in mergeable_ranks #289

Open · afang-story opened this issue May 2, 2024 · 1 comment

@afang-story

Hello,

I'm trying to create a custom tokenizer, but I'm getting "pyo3_runtime.PanicException: no entry found for key" even though I'm sure every character in the input has an entry in mergeable_ranks. This seems to happen when a character that requires multiple bytes is immediately followed by another character.

Here is a simple example for reproducibility:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'“'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("a“")) # this works, [1, 0]
print(enc.encode("“a")) # fails: pyo3_runtime.PanicException: no entry found for key
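
For what it's worth, splitting the input with the encoding's pattern by hand suggests why the order matters (my reading of it, not confirmed): "a“" is split into two pieces that are each exact keys in mergeable_ranks, while "“a" becomes a single piece, so the encoder has to fall back to byte-level merges over bytes that have no ranks. A minimal check, assuming the regex package that tiktoken already depends on:

import regex
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat = regex.compile(cl100k_base._pat_str)

print(pat.findall("a“"))  # ['a', '“'] -> two pieces, each a whole vocab key
print(pat.findall("“a"))  # ['“a'] -> one piece, needs byte-level merges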

Any ideas for how to fix this?

Thanks in advance for the help.

@Muennighoff

It also happens the other way round with non-Latin characters, e.g. a multi-byte character immediately preceded by another character:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("aか")) # same panic: no entry found for key

Maybe there's some setting that needs to be changed, or some fallback that needs to be added, to cover this?
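
One possible workaround (an assumption based on how the byte-pair merge seems to index mergeable_ranks, not a confirmed fix): give every single byte its own rank, plus an entry for each intermediate merge of a multi-byte token, so the byte-level fallback never hits a missing key. A sketch:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Ranks 0-255 cover every raw byte, so no piece can hit a missing key.
tik_vocab = {bytes([b]): b for b in range(256)}
# 'か' is 3 bytes (e3 81 8b); BPE builds it up pairwise, so the
# intermediate 2-byte prefix has to be in the vocab as well.
tik_vocab['か'.encode()[:2]] = 256  # intermediate merge b'\xe3\x81'
tik_vocab['か'.encode()] = 257      # full 3-byte token

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens={},
)
print(enc.encode("aか"))  # expected [97, 257]: b'a' as a raw byte, then 'か'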
