1) Optimize regular expressions used for splitting by ~20% #234

Merged

hauntsaninja merged 3 commits into openai:main from paplorinc/optimize-regex on Feb 9, 2024

Conversation

@paplorinc paplorinc (Contributor) commented Dec 31, 2023

By combining the contractions into a single non-capturing group prefixed by "'", we can speed up matches by roughly 20%.

By using possessive quantifiers in the word and punctuation groups of the cl100k_base pattern, we avoid some backtracking.

The newline whitespace group can also be simplified to match a single newline explicitly, since the preceding \s* already consumes any newlines before it.

Overall, the regex matches exactly the same sequences of characters as before, in every case, including Unicode input.
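For reference, here is a minimal equivalence sketch, assuming the third-party regex module (the built-in re lacks \p{L}/\p{N} and only gained possessive quantifiers in Python 3.11). The two patterns are copied verbatim from the diff below; the sample string is just an illustration:

# Minimal sketch: check that the old and new cl100k_base patterns split a
# sample string identically. Requires the third-party `regex` module
# (pip install regex).
import regex

OLD = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
NEW = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

sample = "They'll've said: don't split 1234 tokens,\r\n\r\n  or WE'D know!!  "
assert regex.findall(OLD, sample) == regex.findall(NEW, sample)
print(regex.findall(NEW, sample))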

This is the first part of the optimizations I did for jtokkit, reducing the tokenization time from ~10.5 seconds to ~1.6 seconds in several big steps.
If this change is accepted I'll continue migrating the changes I've made.

I've modified benchmark.py locally to measure the improvement:

import os
import time

import tiktoken


def benchmark_batch(documents: list[str]) -> None:
    num_threads = int(os.environ.get("RAYON_NUM_THREADS", "1"))
    num_bytes = sum(map(len, map(str.encode, documents)))
    print(f"num_threads: {num_threads}, num_bytes: {num_bytes}")

    enc = tiktoken.get_encoding("cl100k_base")
    enc.encode("warmup")

    for _ in range(5):
        start = time.perf_counter_ns()
        enc.encode_ordinary_batch(documents, num_threads=num_threads)
        end = time.perf_counter_ns()
        bytes_per_second = num_bytes / (end - start) * 1e9
        print(f"tiktoken \t{bytes_per_second:,.0f} bytes / s")

Here the speedup is as follows:

Before:

num_threads: 1, num_bytes: 98359164
tiktoken 	8,040,959 bytes / s
tiktoken 	8,047,612 bytes / s
tiktoken 	8,059,961 bytes / s
tiktoken 	8,097,749 bytes / s
tiktoken 	8,125,161 bytes / s

After regex optimization:

num_threads: 1, num_bytes: 98359164
tiktoken 	9,861,159 bytes / s
tiktoken 	9,888,486 bytes / s
tiktoken 	9,918,514 bytes / s
tiktoken 	9,902,705 bytes / s
tiktoken 	9,917,494 bytes / s

The other 50k tokenizers are also sped up slightly, not just cl100k_base.

@paplorinc paplorinc changed the title Optimize regular expressions used for splitting by ~20% 1) Optimize regular expressions used for splitting by ~20% Jan 6, 2024
@@ -73,7 +73,7 @@ def cl100k_base():
     }
     return {
         "name": "cl100k_base",
-        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
+        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
@paplorinc paplorinc (Contributor, Author) commented:
Besides grouping the contractions, as in the other regexes, here I've also used possessive quantifiers to avoid backtracking.
The changes were guided by JMH benchmarks for the same regex: https://github.com/knuddelsgmbh/jtokkit/pull/75/files#r1434984132
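To illustrate why the possessive quantifiers are safe here, a small sketch, again assuming the third-party regex module and arbitrary test strings: in general a possessive quantifier can change what matches, but in the cl100k_base word group the optional class excludes letters while \p{L}+ requires one, so giving the optional character back could never rescue a failed match.

# Sketch of why the possessive quantifier preserves behaviour here.
import regex

# In general, possessive quantifiers can change the result: the greedy version
# backtracks out of "a?" to let the final "a" succeed, the possessive one doesn't.
assert regex.fullmatch(r"a?a", "a") is not None
assert regex.fullmatch(r"a?+a", "a") is None

# In the cl100k_base word group the optional class and \p{L}+ are disjoint, so
# the possessive form only skips useless backtracking and matches the same text.
greedy = r"[^\r\n\p{L}\p{N}]?\p{L}+"
possessive = r"[^\r\n\p{L}\p{N}]?+\p{L}+"
for text in (" word", ":word", "word", ": ", "123"):
    assert regex.findall(greedy, text) == regex.findall(possessive, text)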

@hauntsaninja hauntsaninja (Collaborator) left a comment

This is great, thank you! I reproduced the benchmarks. In some configurations / datasets, I actually see much more than a 20% win. I also tested that the possessive quantifier change preserves behaviour on a large and varied corpus, just in case I was missing something.

I'll get to your next PR soon. I appreciate this change and your patience and wanted to find a way to say thank you — please check your email :-)

@hauntsaninja hauntsaninja merged commit 6cc3a46 into openai:main Feb 9, 2024
31 of 42 checks passed
@paplorinc paplorinc deleted the paplorinc/optimize-regex branch February 9, 2024 09:21