Panic (stack overflow) when encoding a certain string #245

Crazytieguy · 2024-01-18T23:40:46Z

Hi, I'm getting a panic when trying to encode the attached file with the gpt-4 tokenizer. This is from the AMPS dataset that was published along with the MATH dataset. Backtrace:

called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/result.rs:1652:5
   3: _tiktoken::CoreBPE::_encode_native
   4: _tiktoken::_::<impl _tiktoken::CoreBPE>::__pymethod_encode__
   5: pyo3::impl_::trampoline::fastcall_with_keywords
   6: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5258:29
   7: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
   8: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
   9: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  10: _PyObject_VectorcallTstate
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_call.h:92:11
  11: method_vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/classobject.c:89:18
  12: do_call_core
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:7357:12
  13: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5381:22
  14: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
  15: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
  16: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  17: do_call_core
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:7357:12
  18: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5381:22
  19: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
  20: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
  21: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  22: _PyObject_VectorcallTstate
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_call.h:92:11
  23: method_vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/classobject.c:67:20
  24: thread_run
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Modules/_threadmodule.c:1092
  25: pythread_wrapper
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/thread_pthread.h:241:5
  26: <unknown>
  27: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.```

The text was updated successfully, but these errors were encountered:

l0rinc · 2024-02-11T17:54:46Z

I've check every single JSON file in the MATH and AMPS pretraining datasets, all of them were passing this test.

def test_math_roundtrips():
    enc = tiktoken.get_encoding("cl100k_base")
    base_dir = '.../Downloads/amps'

    for dirpath, _, filenames in os.walk(base_dir):
        for filename in filenames:
            if filename.endswith(".json"):
                print(f'Checking ${filename}!')

                with open(os.path.join(dirpath, filename), 'rb') as f:
                    content = f.read().decode('utf-8')
                    encoded_content = enc.encode(content)
                    decoded_content = enc.decode(encoded_content)

                    assert content == decoded_content, f"Roundtrip mismatch for {filename}"

Can you please attach the file and a simple reproducer that fails?

hauntsaninja · 2024-02-11T20:40:18Z

Here's a simple repro of the problem:

import tiktoken
enc = tiktoken.get_encoding("r50k_base")
enc.encode("^" * 1000000)

There's a stack overflow in the Rust regex library we use. I haven't yet gotten a chance to see if it's easy to fix the Rust (fancy-)regex library, but one workaround would be to raise a more specific exception, catch it, and fall back to using the Python regex library to split the string as done in this private method: https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L360

l0rinc · 2024-02-11T22:25:39Z

Fixed init in my recently pushed PRs for cl100k_base (see #234 and #239) - and the backported possessives quantifiers to the legacy encoding in #258

Cherrypicking the PRs on top of each other not just passes, but is quite fast as well:

fedor57 · 2024-07-11T12:37:44Z

Hi, I have the same crash trying to tokenize this web page

https://www.mathauditor.com/2147483647-in-english.html

There is a roman representation of the number 2147483647, which contains about 2,147,483 letters M in one source token

thijs-hakkenberg · 2024-09-10T11:31:08Z

I have the same issue. The error according to ChatGPT why we cannot catch the exception in python:
_The error message you are seeing is because a Rust panic unwinding to Python is not an actual Python exception, and thus cannot be caught by a standard Python exception handler. It completely aborts the Python interpreter.

This is an issue with the Rust-Python interoperability. A Python program can't catch Rust panics because Rust panics are designed to unwind the stack, cleaning up as they go, until they reach the application boundary, at which point the application aborts. In this case, the application boundary is the Rust-Python boundary, so the panic unwinds the Rust stack, crosses the boundary, and causes the Python interpreter to abort.

However, latest PyO3 versions have a feature that allows converting Rust panics into Python exceptions, but it's opt-in and has to be enabled in the Rust library. The Python dependency using PyO3 can then be rerun so that Rust panics will become catchable Python RuntimeError exceptions.

Unfortunately, it seems like the tiktoken library doesn't use a new enough PyO3 version or has this feature disabled, and so it doesn't convert Rust panics into Python exceptions. Therefore, you can't catch the panic in Python code and the Python interpreter aborts.

For now, you could try to ensure your code never causes a panic in tiktoken. For instance, by checking properties of the input before passing it to tiktoken methods that might cause a panic.

Otherwise, this is a pretty hard issue to work around from Python. The best way to resolve it would be to open an issue on the repository of the library causing the issue or ask the maintainer to upgrade the PyO3 dependency and enable panic conversion._

l0rinc · 2024-09-10T11:36:37Z

You could add the PRs mentioned in #245 (comment) and build a custom TikToken version that supports big tokens.
@hauntsaninja, do you think we could merge some of those PRs instead?

fedor57 · 2024-09-10T11:43:16Z

What I did, I tested a safe long token length about 200000 chars. Then before calling the tiktoken, I've splited the input into batches, called the tiktoken separately on each batch and concatenated token arrays. The result is still correct if you decode the tokenized string. Theoretically suboptimal for LLM use, but in practice no difference.

Before that I was kindof searching runs of 200000 without spaces to insert batch break if needed. But later I deleted the code as not being truly useful.

Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via [possessive quantifiers](https://www.regular-expressions.info/possessive.html). <img width="400" alt="image" src="https://github.com/openai/tiktoken/assets/1841944/ed341153-4cf4-4c1c-93d6-3f5e32133569"> Interestingly these possesives make the encoding a lot faster again in `fancy-regex`. Before this change (but with large byte pair merge PR cherry-picked): ``` num_threads: 1, num_bytes: 98379553 tiktoken 11,946,036 bytes / s tiktoken 11,961,343 bytes / s tiktoken 11,995,846 bytes / s tiktoken 11,951,263 bytes / s tiktoken 11,983,405 bytes / s ``` Same, with these changes applied: ``` num_threads: 1, num_bytes: 98379553 tiktoken 14,511,827 bytes / s tiktoken 14,638,134 bytes / s tiktoken 14,644,029 bytes / s tiktoken 14,729,030 bytes / s tiktoken 14,666,903 bytes / s ``` Updating the regex libs makes it a tiny bit faster still: ``` num_threads: 1, num_bytes: 98379553 tiktoken 14,485,590 bytes / s tiktoken 14,854,049 bytes / s tiktoken 14,891,086 bytes / s tiktoken 14,843,007 bytes / s tiktoken 14,874,520 bytes / s ``` This is almost 2x faster than [before any of the optimizations](#234). ------- Opened an issue for increasing the [default backtrack limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407), see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore. --------- Co-authored-by: Lőrinc <lorinc.pap@gmail.com>

Crazytieguy changed the title ~~Panic when encoding a certain string~~ Panic (stack overflow) when encoding a certain string Jan 18, 2024

l0rinc mentioned this issue Feb 11, 2024

Add possessive quantifiers to avoid catastrophic backtracking #258

Merged

This was referenced Feb 12, 2024

Sync with tiktoken knuddelsgmbh/jtokkit#87

Draft

Inconsistent Backtrack Limit Handling in Regex Matching fancy-regex/fancy-regex#134

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Panic (stack overflow) when encoding a certain string #245

Panic (stack overflow) when encoding a certain string #245

Crazytieguy commented Jan 18, 2024 •

edited

Loading

l0rinc commented Feb 11, 2024

hauntsaninja commented Feb 11, 2024 •

edited

Loading

l0rinc commented Feb 11, 2024 •

edited

Loading

fedor57 commented Jul 11, 2024

thijs-hakkenberg commented Sep 10, 2024

l0rinc commented Sep 10, 2024

fedor57 commented Sep 10, 2024

Panic (stack overflow) when encoding a certain string #245

Panic (stack overflow) when encoding a certain string #245

Comments

Crazytieguy commented Jan 18, 2024 • edited Loading

l0rinc commented Feb 11, 2024

hauntsaninja commented Feb 11, 2024 • edited Loading

l0rinc commented Feb 11, 2024 • edited Loading

fedor57 commented Jul 11, 2024

thijs-hakkenberg commented Sep 10, 2024

l0rinc commented Sep 10, 2024

fedor57 commented Sep 10, 2024

Crazytieguy commented Jan 18, 2024 •

edited

Loading

hauntsaninja commented Feb 11, 2024 •

edited

Loading

l0rinc commented Feb 11, 2024 •

edited

Loading