Performance ideas #33
Comments
I had a version using a trie to do single-pass-ish encoding of an input, but it wasn't correct. I'm not certain how fast a correct version of that trie would be.
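For reference, a rough sketch of what a trie-based single-pass encoder might look like (illustrative only, not the version mentioned above): greedy longest-prefix matching over a byte trie built from the vocabulary. Note that greedy longest-match is not equivalent to applying BPE merges in rank order, which is presumably why a naive version produces different output.

```rust
use std::collections::HashMap;

// Illustrative sketch: a byte trie over the vocabulary, used for greedy
// longest-prefix matching in a single left-to-right pass. This is NOT the
// same as BPE merge order, so its output can differ from the real encoder.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    token: Option<u32>, // token id if a vocabulary entry ends here
}

impl TrieNode {
    fn insert(&mut self, bytes: &[u8], id: u32) {
        let mut node = self;
        for &b in bytes {
            node = node.children.entry(b).or_default();
        }
        node.token = Some(id);
    }

    /// Greedily take the longest vocabulary entry at each position.
    fn encode_greedy(&self, input: &[u8]) -> Vec<u32> {
        let mut out = Vec::new();
        let mut pos = 0;
        while pos < input.len() {
            let (mut node, mut best) = (self, None);
            for (i, &b) in input[pos..].iter().enumerate() {
                match node.children.get(&b) {
                    Some(next) => {
                        node = next;
                        if let Some(id) = node.token {
                            best = Some((pos + i + 1, id));
                        }
                    }
                    None => break,
                }
            }
            // Byte-level BPE vocabularies contain every single byte, so some
            // prefix always matches and `best` is never None here.
            let (end, id) = best.expect("every single byte is in the vocab");
            out.push(id);
            pos = end;
        }
        out
    }
}
```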
Thanks, those are really nice results!
Previous script, full encode:
Feel free to close this if the ideas have been ideated.
Hello, it seems that the slow performance is due to an inefficient implementation of the negative lookahead clause (`\s+(?!\S)`) in the fancy_regex library. A possible way to mimic the negative lookahead is to remove it from the regex and manually re-add the spaces to the matched parts, such as words or numbers. Although this approach matches the performance of pcre2, it may not be the most elegant solution.
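A minimal sketch of one way that workaround could look (my reading of it, not an actual patch): use the plain `regex` crate, which has no lookaround support, with the `\s+(?!\S)` alternative dropped from the GPT-2-style pattern, and hand the final whitespace character back to the next match by rewinding the scan position. The function name is hypothetical.

```rust
use regex::Regex; // plain `regex` crate, no lookaround support

// Hypothetical sketch of the workaround described above: when a
// whitespace-only match is directly followed by non-whitespace, trim its
// last whitespace char so it fuses with the next token, which is exactly
// where `\s+(?!\S)` would have stopped one character earlier.
fn split_without_lookahead<'a>(text: &'a str) -> Vec<&'a str> {
    let pat = Regex::new(
        r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+",
    )
    .unwrap();

    let mut pieces = Vec::new();
    let mut pos = 0;
    while let Some(m) = pat.find_at(text, pos) {
        let (start, mut end) = (m.start(), m.end());
        let s = m.as_str();
        let followed_by_non_ws = text[end..]
            .chars()
            .next()
            .map_or(false, |c| !c.is_whitespace());
        if followed_by_non_ws && s.chars().all(char::is_whitespace) {
            // Give the last whitespace char back to the next match.
            if let Some((idx, _)) = s.char_indices().last() {
                if idx > 0 {
                    end = start + idx;
                }
            }
        }
        pieces.push(&text[start..end]);
        pos = end;
    }
    pieces
}
```

The trimming only fires when a whitespace run is directly followed by non-whitespace, so the resulting pieces should line up with what the lookahead-based pattern produces.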
I'm currently working on optimizing the tokenizer and the token counter (on the Java implementation at https://github.com/knuddelsgmbh/jtokkit, but most of the tricks should be applicable to other implementations as well).
So far it's 3x faster, but I still have a few ideas left.
I made a toy GPT-2 tokenizer as a Python extension written in Rust. It seems to be slightly faster than tiktoken in my tests. It looks like #31 may get most or all of the way there, but I thought I'd post the results from this script:
The text is 64 MiB of Wikipedia wikitext, probably enwik8, but I just found it on my hard drive.
There are no fancy optimizations here (like SIMD), but the library has a few things it might do differently from tiktoken:
I didn't implement the regexes in Rust, so I don't know whether the word splitting matters, though I could benchmark just the splitting part.
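For what it's worth, here is a sketch of how one might time just the splitting step in Rust with fancy_regex (the crate tiktoken uses). The pattern is the GPT-2-style split pattern, and the input file name is a placeholder, not part of the original post.

```rust
use fancy_regex::Regex;
use std::time::Instant;

fn main() {
    // Placeholder input file; swap in whatever corpus is being measured.
    let text = std::fs::read_to_string("enwik8.txt").expect("sample text");

    // GPT-2 style split pattern, including the `\s+(?!\S)` lookahead.
    let pat = Regex::new(
        r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+",
    )
    .unwrap();

    let start = Instant::now();
    let mut pieces = 0usize;
    for m in pat.find_iter(&text) {
        m.expect("regex error");
        pieces += 1;
    }
    let elapsed = start.elapsed();
    println!(
        "{pieces} pieces in {elapsed:.2?} ({:.1} MB/s)",
        text.len() as f64 / 1e6 / elapsed.as_secs_f64()
    );
}
```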