-
Notifications
You must be signed in to change notification settings - Fork 842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panic (stack overflow) when encoding a certain string #245
Comments
I've check every single JSON file in the MATH and AMPS pretraining datasets, all of them were passing this test. def test_math_roundtrips():
enc = tiktoken.get_encoding("cl100k_base")
base_dir = '.../Downloads/amps'
for dirpath, _, filenames in os.walk(base_dir):
for filename in filenames:
if filename.endswith(".json"):
print(f'Checking ${filename}!')
with open(os.path.join(dirpath, filename), 'rb') as f:
content = f.read().decode('utf-8')
encoded_content = enc.encode(content)
decoded_content = enc.decode(encoded_content)
assert content == decoded_content, f"Roundtrip mismatch for {filename}" Can you please attach the file and a simple reproducer that fails? |
Here's a simple repro of the problem:
There's a stack overflow in the Rust regex library we use. I haven't yet gotten a chance to see if it's easy to fix the Rust (fancy-)regex library, but one workaround would be to raise a more specific exception, catch it, and fall back to using the Python |
Hi, I have the same crash trying to tokenize this web page https://www.mathauditor.com/2147483647-in-english.html There is a roman representation of the number 2147483647, which contains about 2,147,483 letters M in one source token |
I have the same issue. The error according to ChatGPT why we cannot catch the exception in python: This is an issue with the Rust-Python interoperability. A Python program can't catch Rust panics because Rust panics are designed to unwind the stack, cleaning up as they go, until they reach the application boundary, at which point the application aborts. In this case, the application boundary is the Rust-Python boundary, so the panic unwinds the Rust stack, crosses the boundary, and causes the Python interpreter to abort. However, latest PyO3 versions have a feature that allows converting Rust panics into Python exceptions, but it's opt-in and has to be enabled in the Rust library. The Python dependency using PyO3 can then be rerun so that Rust panics will become catchable Python RuntimeError exceptions. Unfortunately, it seems like the tiktoken library doesn't use a new enough PyO3 version or has this feature disabled, and so it doesn't convert Rust panics into Python exceptions. Therefore, you can't catch the panic in Python code and the Python interpreter aborts. For now, you could try to ensure your code never causes a panic in tiktoken. For instance, by checking properties of the input before passing it to tiktoken methods that might cause a panic. Otherwise, this is a pretty hard issue to work around from Python. The best way to resolve it would be to open an issue on the repository of the library causing the issue or ask the maintainer to upgrade the PyO3 dependency and enable panic conversion._ |
You could add the PRs mentioned in #245 (comment) and build a custom TikToken version that supports big tokens. |
What I did, I tested a safe long token length about 200000 chars. Then before calling the tiktoken, I've splited the input into batches, called the tiktoken separately on each batch and concatenated token arrays. The result is still correct if you decode the tokenized string. Theoretically suboptimal for LLM use, but in practice no difference. Before that I was kindof searching runs of 200000 without spaces to insert batch break if needed. But later I deleted the code as not being truly useful. |
Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via [possessive quantifiers](https://www.regular-expressions.info/possessive.html). <img width="400" alt="image" src="https://github.com/openai/tiktoken/assets/1841944/ed341153-4cf4-4c1c-93d6-3f5e32133569"> Interestingly these possesives make the encoding a lot faster again in `fancy-regex`. Before this change (but with large byte pair merge PR cherry-picked): ``` num_threads: 1, num_bytes: 98379553 tiktoken 11,946,036 bytes / s tiktoken 11,961,343 bytes / s tiktoken 11,995,846 bytes / s tiktoken 11,951,263 bytes / s tiktoken 11,983,405 bytes / s ``` Same, with these changes applied: ``` num_threads: 1, num_bytes: 98379553 tiktoken 14,511,827 bytes / s tiktoken 14,638,134 bytes / s tiktoken 14,644,029 bytes / s tiktoken 14,729,030 bytes / s tiktoken 14,666,903 bytes / s ``` Updating the regex libs makes it a tiny bit faster still: ``` num_threads: 1, num_bytes: 98379553 tiktoken 14,485,590 bytes / s tiktoken 14,854,049 bytes / s tiktoken 14,891,086 bytes / s tiktoken 14,843,007 bytes / s tiktoken 14,874,520 bytes / s ``` This is almost 2x faster than [before any of the optimizations](#234). ------- Opened an issue for increasing the [default backtrack limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407), see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore. --------- Co-authored-by: Lőrinc <lorinc.pap@gmail.com>
Hi, I'm getting a panic when trying to encode the attached file with the gpt-4 tokenizer. This is from the AMPS dataset that was published along with the MATH dataset. Backtrace:
The text was updated successfully, but these errors were encountered: