1) Optimize regular expressions used for splitting by ~20% #234

l0rinc · 2023-12-31T12:32:33Z

By combining the contractions into a single non-capturing group prefixed by `'`, we can speed up matches by roughly 20%.
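
A quick way to spot-check the contraction rewrite is to compare the old and new sub-patterns on a few inputs. This is a minimal sketch using Python's built-in `re`; the two patterns are copied from the diff in this PR, while the sample strings and the `first_match` helper are made up for illustration.

```python
import re

# Contraction sub-patterns as they appear in the diff (old vs. new form).
OLD = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")
NEW = re.compile(r"'(?i:[sdmt]|ll|ve|re)")

# Hypothetical sample inputs: valid contractions in mixed case plus near misses.
SAMPLES = ["'s", "'T", "'re", "'VE", "'m", "'Ll", "'d",
           "'x", "s", "'", "'lx", "don't", "we'LL go"]

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

# Collect every sample on which the two forms disagree.
mismatches = [s for s in SAMPLES if first_match(OLD, s) != first_match(NEW, s)]
print("mismatches:", mismatches)
```

No alternative in the old group is a prefix of another, so factoring out the leading `'` and folding the single-letter cases into `[sdmt]` should not change which text is matched.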

Using possessive quantifiers in the word and punctuation groups of cl100k_base avoids some backtracking.

The newline group can also be simplified to match a single newline explicitly (`\s*[\r\n]` instead of `\s*[\r\n]+`), since the preceding greedy `\s*` already consumes any earlier newlines.
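
The newline simplification can be checked in isolation. The sketch below compares only the `\s*[\r\n]+` and `\s*[\r\n]` alternatives from the diff, using Python's built-in `re`; the sample strings and the `spans` helper are illustrative assumptions, not part of tiktoken.

```python
import re

# The newline alternative before and after the change, taken from the diff.
OLD_NEWLINES = re.compile(r"\s*[\r\n]+")
NEW_NEWLINES = re.compile(r"\s*[\r\n]")

# Hypothetical whitespace-heavy samples.
SAMPLES = ["\n", "\n\n\n", "  \n \n  x", "\r\n\r\n",
           " \t\n\n", "a \n\n b", "   ", "\n \n\n \r"]

def spans(pattern, text):
    """Return the (start, end) span of every match in text."""
    return [m.span() for m in pattern.finditer(text)]

# Collect every sample on which the two forms produce different spans.
newline_mismatches = [s for s in SAMPLES
                      if spans(OLD_NEWLINES, s) != spans(NEW_NEWLINES, s)]
print("mismatches:", newline_mismatches)
```

The greedy `\s*` first swallows the whole whitespace run and then backs off to the last newline in it, so the trailing `[\r\n]+` only ever had one character left to match.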

Overall, the regex matches exactly the same character sequences as before, regardless of letter case and including Unicode input.

This is the first part of the optimizations I did for jtokkit, which reduced tokenization time from ~10.5 seconds to ~1.6 seconds in several big steps.
If this change is accepted I'll continue migrating the changes I've made.

I've modified benchmark.py locally to measure the improvement:

import os
import time

import tiktoken

def benchmark_batch(documents: list[str]) -> None:
    num_threads = int(os.environ.get("RAYON_NUM_THREADS", "1"))
    num_bytes = sum(map(len, map(str.encode, documents)))
    print(f"num_threads: {num_threads}, num_bytes: {num_bytes}")

    enc = tiktoken.get_encoding("cl100k_base")
    enc.encode("warmup")

    for _ in range(5):
        start = time.perf_counter_ns()
        enc.encode_ordinary_batch(documents, num_threads=num_threads)
        end = time.perf_counter_ns()
        bytes_per_second = num_bytes / (end - start) * 1e9
        print(f"tiktoken \t{bytes_per_second:,.0f} bytes / s")

Here the speedup is as follows:

Before:

num_threads: 1, num_bytes: 98359164
tiktoken 	8,040,959 bytes / s
tiktoken 	8,047,612 bytes / s
tiktoken 	8,059,961 bytes / s
tiktoken 	8,097,749 bytes / s
tiktoken 	8,125,161 bytes / s

After regex optimization:

num_threads: 1, num_bytes: 98359164
tiktoken 	9,861,159 bytes / s
tiktoken 	9,888,486 bytes / s
tiktoken 	9,918,514 bytes / s
tiktoken 	9,902,705 bytes / s
tiktoken 	9,917,494 bytes / s
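
For reference, the relative gain can be derived from the two runs above. This is a small sketch that averages the five reported throughput figures; taking the mean is a choice made here for illustration, not something the benchmark script does.

```python
from statistics import mean

# Throughput in bytes / s, copied from the "Before" and "After" runs above.
before = [8_040_959, 8_047_612, 8_059_961, 8_097_749, 8_125_161]
after = [9_861_159, 9_888_486, 9_918_514, 9_902_705, 9_917_494]

# Relative throughput gain of the optimized regex over the original.
speedup = mean(after) / mean(before) - 1
print(f"throughput gain: {speedup:.1%}")  # about 22.6%
```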

The other (50k) tokenizers are also sped up slightly, not just cl100k_base.

l0rinc · 2024-01-07T16:19:19Z

tiktoken_ext/openai_public.py

@@ -73,7 +73,7 @@ def cl100k_base():
    }
    return {
        "name": "cl100k_base",
-        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
+        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",


Besides the contractions that were grouped in the other regexes, here I've optimized using possessive quantifiers to avoid backtracking.
The changes were guided by JMH benchmarks for the same regex: https://github.com/knuddelsgmbh/jtokkit/pull/75/files#r1434984132
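
Why the possessive forms are safe here can be illustrated with a small sketch. It is an ASCII-only approximation, not the tiktoken pattern: `[a-zA-Z]` and `[0-9]` stand in for `\p{L}` and `\p{N}` because Python's built-in `re` has no Unicode property classes, and possessive quantifiers need Python 3.11 or newer. The sample strings and helper names are made up for illustration.

```python
import re
import sys

# ASCII stand-ins for the word and punctuation groups (assumption: a-z/0-9
# replace \p{L}/\p{N}, which the stdlib `re` module does not support).
GREEDY = r"[^\r\na-zA-Z0-9]?[a-zA-Z]+| ?[^\sa-zA-Z0-9]+[\r\n]*"
POSSESSIVE = r"[^\r\na-zA-Z0-9]?+[a-zA-Z]+| ?[^\sa-zA-Z0-9]++[\r\n]*"

# Hypothetical sample inputs.
SAMPLES = ["Hello, world!", " foo-bar\n\nbaz", "!!!\r\n", "a.b,c", "$$$ x"]

def split_all(pattern, texts):
    """Return the list of matched pieces for each text."""
    return [re.findall(pattern, t) for t in texts]

# Possessive quantifiers (?+, ++) were added to `re` in Python 3.11.
HAS_POSSESSIVE = sys.version_info >= (3, 11)
if HAS_POSSESSIVE:
    same = split_all(GREEDY, SAMPLES) == split_all(POSSESSIVE, SAMPLES)
    print("identical splits:", same)
```

The optional prefix class cannot match a letter, and the punctuation class cannot match `\r` or `\n`, so giving characters back can never turn a failed match into a successful one. Backtracking at those points was wasted work, which is what the possessive quantifiers skip.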

hauntsaninja

This is great, thank you! I reproduced the benchmarks. In some configurations / datasets, I actually see much more than a 20% win. I also tested that the possessive quantifier change preserves behaviour on a large and varied corpus, just in case I was missing something.

I'll get to your next PR soon. I appreciate this change and your patience and wanted to find a way to say thank you — please check your email :-)

Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via [possessive quantifiers](https://www.regular-expressions.info/possessive.html).

Interestingly, these possessives make the encoding a lot faster again in `fancy-regex`.

Before this change (but with the large byte pair merge PR cherry-picked):

```
num_threads: 1, num_bytes: 98379553
tiktoken 11,946,036 bytes / s
tiktoken 11,961,343 bytes / s
tiktoken 11,995,846 bytes / s
tiktoken 11,951,263 bytes / s
tiktoken 11,983,405 bytes / s
```

Same, with these changes applied:

```
num_threads: 1, num_bytes: 98379553
tiktoken 14,511,827 bytes / s
tiktoken 14,638,134 bytes / s
tiktoken 14,644,029 bytes / s
tiktoken 14,729,030 bytes / s
tiktoken 14,666,903 bytes / s
```

Updating the regex libs makes it a tiny bit faster still:

```
num_threads: 1, num_bytes: 98379553
tiktoken 14,485,590 bytes / s
tiktoken 14,854,049 bytes / s
tiktoken 14,891,086 bytes / s
tiktoken 14,843,007 bytes / s
tiktoken 14,874,520 bytes / s
```

This is almost 2x faster than [before any of the optimizations](#234).

-------

Opened an issue for increasing the [default backtrack limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407), see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore.

---------

Co-authored-by: Lőrinc <lorinc.pap@gmail.com>

l0rinc changed the title ~~Optimize regular expressions used for splitting by ~20%~~ 1) Optimize regular expressions used for splitting by ~20% Jan 6, 2024

l0rinc mentioned this pull request Jan 6, 2024

0) Add the jtokkit test suite examples to validate the cl100k_base, p50k_base & r50k_base encodings #237

Open

l0rinc commented Jan 7, 2024

l0rinc mentioned this pull request Jan 15, 2024

Optimize byte pair merge for really big tokens (40x faster for a 2500 token word) #239

Open

l0rinc and others added 2 commits January 30, 2024 13:18

Merge branch 'main' into paplorinc/optimize-regex

2f04faa

gpt-2 docs

6f261de

hauntsaninja approved these changes Feb 9, 2024

hauntsaninja merged commit 6cc3a46 into openai:main Feb 9, 2024
31 of 42 checks passed

l0rinc deleted the paplorinc/optimize-regex branch February 9, 2024 09:21

This was referenced Feb 11, 2024

Panic (stack overflow) when encoding a certain string #245

Open

Add possessive quantifiers to avoid catastrophic backtracking #258

Merged

stephentoub mentioned this pull request Feb 20, 2024

Optimize regexes used in tiktoken dotnet/machinelearning#7020

Merged
