Custom tokenizer fails to encode despite characters being in mergeable_ranks #289
Comments
It also happens with non-Latin characters the other way round, e.g.:

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str
tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}
enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("aか"))  # also raises pyo3_runtime.PanicException: no entry found for key
```

Maybe there's some setting that needs to be changed, or some fallback that needs to be added, to cover this?
I'm having the same issue, have you solved it?
You'll need to have individual bytes in your vocabulary. On top of that, tiktoken assumes that token index corresponds to merge priority (i.e. the sequence of merges that produces a token needs to go through intermediate tokens with increasing rank). See line 25 at commit 6352764.
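As a sketch of what that means in practice (this is not from the thread; the byte-level vocabulary, the rank values, and the encoding name are illustrative), a custom encoding that covers all 256 single bytes and gives multi-byte tokens higher ranks than the pieces they merge from encodes both orderings without panicking:

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Ranks 0-255: every individual byte, so any UTF-8 input can fall back to bytes.
mergeable_ranks = {bytes([i]): i for i in range(256)}

# 'か' is three UTF-8 bytes (b'\xe3\x81\x8b'). Add its intermediate merge first
# and the full token after it, so ranks increase along the merge path.
ka = "か".encode()
mergeable_ranks[ka[:2]] = 256  # b'\xe3\x81'
mergeable_ranks[ka] = 257      # b'\xe3\x81\x8b'

enc = tiktoken.Encoding(
    name="tik_test_bytes",  # illustrative name
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=mergeable_ranks,
    special_tokens={},
)

print(enc.encode("aか"))  # expected: [97, 257]
print(enc.encode("かa"))  # expected: [257, 97]
```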
Hello,
I'm trying to create a custom tokenizer but am getting "pyo3_runtime.PanicException: no entry found for key" despite being sure of coverage. This seems to happen when a character that requires multiple bytes is immediately followed by another character.
Here is a simple example for reproducibility:
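(A minimal sketch of the kind of setup that triggers this, assuming a two-entry vocabulary like the one in the comment above; the exact vocabulary and strings are illustrative, not the original snippet.)

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Only whole characters, no individual bytes, are in the vocabulary.
tik_vocab = {"か".encode(): 0, "a".encode(): 1}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens={},
)

print(enc.encode("か"))   # works: the whole piece is a vocabulary entry
print(enc.encode("かa"))  # panics: pyo3_runtime.PanicException: no entry found for key
```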
Any ideas for how to fix this?
Thanks in advance for the help