-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1) Optimize regular expressions used for splitting by ~20% #234
Conversation
By combining the contractions to a single non-capturing group prefixed by "'", we can speed up matches by roughly 20%. By using possessive quantifiers for the cl100k_base in the word and punctuation groups we're avoiding some backtracking. The last whitespace groups can also be simplified to have a single newline matched explicitly, since the previous whitespace would already match it. Overall the regex matches the exact same sequence of characters as before for any case and for unicode sequences
@@ -73,7 +73,7 @@ def cl100k_base(): | |||
} | |||
return { | |||
"name": "cl100k_base", | |||
"pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""", | |||
"pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the contractions that were grouped in the other regexes, here I've optimized using possessive quantifiers to avoid backtracking.
The changes were guided by JMH benchmarks for the same regex: https://github.com/knuddelsgmbh/jtokkit/pull/75/files#r1434984132
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thank you! I reproduced the benchmarks. In some configurations / datasets, I actually see much more than a 20% win. I also tested that the possessive quantifier change preserves behaviour on a large and varied corpus, just in case I was missing something.
I'll get to your next PR soon. I appreciate this change and your patience and wanted to find a way to say thank you — please check your email :-)
By combining the contractions to a single non-capturing group prefixed by
'
, we can speed up matches by roughly 20%.By using possessive quantifiers for the
cl100k_base
in the word and punctuation groups we're avoiding some backtracking.The last whitespace groups can also be simplified to have a single newline matched explicitly, since the previous whitespace would already match it.
Overall the regex matches the exact same sequence of characters as before for any case and for unicode sequences.
This is the first part of the optimizations I did for jtokkit, reducing the speed of the tokenization from ~10.5 seconds to ~1.6 seconds in several big steps.
If this change is accepted I'll continue migrating the changes I've made.
I've modified
benchmark.py
locally to measure the improvement:Here the speedup is as follows:
Before:
After regex optimization:
The other 50k tokenizers are also sped up slightly, not just the C100k.