Thread Panic when decoding token id 100256 and others with cl100k_base tokenizer #47

Open
minimaxir opened this issue Mar 4, 2023 · 1 comment

minimaxir commented Mar 4, 2023

Code example:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
enc.decode([100256])

Trace:

thread '<unnamed>' panicked at 'no entry found for key', src/lib.rs:210:37
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/var/folders/m9/s4s3bdq96pn3dk13fbgpw6rm0000gn/T/ipykernel_9548/1299473396.py in
      1 enc = tiktoken.get_encoding("cl100k_base")
----> 2 enc.decode([100256])

/usr/local/lib/python3.9/site-packages/tiktoken/core.py in decode(self, tokens, errors)
    237         ```
    238         """
--> 239         return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
    240 
    241     def decode_single_token_bytes(self, token: int) -> bytes:

PanicException: no entry found for key

This also reproduces for token ids 100261 through 100275.

Even if these token ids are intentionally left unassigned, decoding them should not cause a panic.
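
Until the decoder handles these ids, one possible workaround is to filter them out before calling decode. The sketch below is only illustrative: the id layout it hard-codes (ordinary tokens at 0 through 100255, special tokens at 100257 through 100260 and 100276) is an assumption that is not stated in this thread, and `unassigned_ids` and `drop_unassigned` are hypothetical helper names, not part of tiktoken.

```python
# Assumed id layout for cl100k_base (an assumption, not taken from this thread):
# ordinary tokens occupy 0..100255, special tokens sit at the ids listed below.
ORDINARY_IDS = range(0, 100256)
SPECIAL_IDS = frozenset({100257, 100258, 100259, 100260, 100276})


def unassigned_ids(ordinary=ORDINARY_IDS, special=SPECIAL_IDS):
    """List every id below the highest assigned id that maps to no token."""
    upper = max(ordinary.stop, max(special) + 1)
    return [i for i in range(upper) if i not in ordinary and i not in special]


def drop_unassigned(tokens, ordinary=ORDINARY_IDS, special=SPECIAL_IDS):
    """Keep only ids that are assigned, so a later decode call sees no gaps."""
    return [t for t in tokens if t in ordinary or t in special]
```

Under that assumed layout, `unassigned_ids()` yields 100256 plus 100261 through 100275, which matches the ids reported above. Dropping ids silently changes the decoded text, so this is a stopgap rather than a fix.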

dbl001 commented Mar 5, 2023

I get the same exception.

ults of the COVID-2. For this results. In the first-19 to the results of the study, the COVID-19, and a study, as the pandemic, the first-19 and the first to the first-CoV--19 and a same, we also been been been a significant. A. It is
---------------
thread '<unnamed>' panicked at 'no entry found for key', src/lib.rs:155:37
stack backtrace:
   0:        0x105835d42 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h8d94e552d95b28cc
   1:        0x105849f6a - core::fmt::write::h421d4212716e9716
   2:        0x105833bac - std::io::Write::write_fmt::hdc28b71c2d62dad8
   3:        0x105835b0a - std::sys_common::backtrace::print::he11eab6b959c3b5b
   4:        0x105836ee6 - std::panicking::default_hook::{{closure}}::ha68ba8cbe26bbbe3
   5:        0x105836c37 - std::panicking::default_hook::h5cf85224a4df5bc6
   6:        0x10583762d - std::panicking::rust_panic_with_hook::hed342721bf9addfa
   7:        0x1058373e3 - std::panicking::begin_panic_handler::{{closure}}::h3d9af89e51f2fba9
   8:        0x1058361d8 - std::sys_common::backtrace::__rust_end_short_backtrace::hfb9719355016e93f
   9:        0x1058370ad - _rust_begin_unwind
  10:        0x10585af43 - core::panicking::panic_fmt::h1965fc2159be50bb
  11:        0x10584911b - core::panicking::panic_display::h841c2aac0ae11b23
  12:        0x1058490cc - core::panicking::panic_str::ha2b2b46922a69871
  13:        0x10585af09 - core::option::expect_failed::h5dc600f0ba669ad7
  14:        0x1057739e4 - _tiktoken::CoreBPE::_decode_native::hf970f41e2ffb103d
  15:        0x10576624b - pyo3::marker::Python::allow_threads::h9399c4884f71c380
  16:        0x10577705d - _tiktoken::CoreBPE::decode_bytes::hac2ea10696677c55
  17:        0x10576e572 - std::panicking::try::hdddd1e2b25b9d596
  18:        0x10577816e - _tiktoken::_::<impl _tiktoken::CoreBPE>::__pymethod_decode_bytes__::h7364fbad820d3301
  19:        0x1017d9ecf - _method_vectorcall_FASTCALL_KEYWORDS
  20:        0x1018e83ae - __PyEval_EvalFrameDefault
  21:        0x1017ca7f6 - __PyFunction_Vectorcall
  22:        0x1018e83ae - __PyEval_EvalFrameDefault
  23:        0x1017ca7f6 - __PyFunction_Vectorcall
  24:        0x1019107db - _call_function
  25:        0x1018e1d84 - __PyEval_EvalFrameDefault
  26:        0x1018ddb91 - __PyEval_Vector
  27:        0x101966460 - _run_mod
  28:        0x101966225 - _pyrun_file
  29:        0x101965d76 - __PyRun_SimpleFileObject
  30:        0x10196569f - __PyRun_AnyFileObject
  31:        0x10198a978 - _pymain_run_file_obj
  32:        0x10198a305 - _pymain_run_file
  33:        0x101989b38 - _pymain_run_python
  34:        0x101989975 - _Py_RunMain
  35:        0x101762598 - _main
  36:     0x7ff809a49310 - <unknown>
Traceback (most recent call last):
  File "/Users/davidlaxer/nanoGPT/sample.py", line 93, in <module>
    print(decode(y[0].tolist()))
  File "/Users/davidlaxer/nanoGPT/sample.py", line 79, in <lambda>
    decode = lambda l: enc.decode(l)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/tiktoken/core.py", line 239, in decode
    return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
pyo3_runtime.PanicException: no entry found for key

I'm running 'nanoGPT'

https://github.com/karpathy/nanoGPT

% RUST_BACKTRACE=full  python sample.py --out_dir=out --device='cpu' --compile=False

The failing decode call receives a list of 501 tokens, and I'm not sure which of them trigger the exception.
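
One way to narrow this down is to decode the tokens one at a time and collect the ids that fail. This is a minimal sketch, not part of tiktoken or nanoGPT: `find_undecodable_tokens` is a hypothetical helper, and it takes the decode function as a parameter so it does not depend on any particular encoding. It catches `BaseException` because pyo3's `PanicException` does not derive from `Exception`.

```python
from typing import Callable, Iterable, List


def find_undecodable_tokens(
    decode_one: Callable[[int], object], tokens: Iterable[int]
) -> List[int]:
    """Return the distinct token ids for which decode_one raises, in first-seen order."""
    bad: List[int] = []
    for token in dict.fromkeys(tokens):  # de-duplicate while keeping order
        try:
            decode_one(token)
        except (KeyboardInterrupt, SystemExit):
            raise  # never swallow interpreter-exit signals
        except BaseException:  # PanicException is not a subclass of Exception
            bad.append(token)
    return bad
```

With the names from the traceback above, a call would look like `find_undecodable_tokens(lambda t: enc.decode([t]), y[0].tolist())`. Each failing id will likely still print the Rust panic message to stderr, since only the Python-side exception is caught.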
