3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s #77
Conversation
lib/src/test/java/com/knuddels/jtokkit/reference/Cl100kBaseTest.java
…e piece

Since whole pieces were already checked, we don't have to try to re-encode them.

Before:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.365 ± 0.019   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.130 ± 0.024   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.393 ± 0.026   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.408 ± 0.015   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.073 ± 0.017   s/op

After:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.340 ± 0.023   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.096 ± 0.029   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.385 ± 0.017   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.372 ± 0.041   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.059 ± 0.026   s/op
…count

We're also skipping the last getMinRankIndex calculation when only 2 tokens remain.
Looks good to me :) See my in-line comment for a question on how you want to proceed; after that we can get this merged.
fetchUnicodeData().entrySet().stream().parallel().forEach(e -> {
    var expected = Character.toString(e.getKey());
    if (isValidUTF8(expected)) {
        var dst = new ByteArrayList();
I believe that at the point of this commit (89d6212) this class did not exist yet. I personally do not care whether this commit compiles on its own, but since you purposefully split your contributions into meaningful commits, you may prefer to have each commit in a compiling state? Let me know whether you want me to merge it this way or whether you want to re-organize your commits first
Yeah, this was likely the result of moving a change back to a previous commit.
Rebasing would mess with the comments, I usually only rebase before reviews.
I'm fine with merging as is, if you are.
Continuing #76 (and as an alternative to the optimized regular expressions in #75), here the
'(?:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+
regex for the cl100k parser is completely eliminated and replaced with a custom character-by-character parser for each segment, with optimized Unicode category detection. The first category is the contractions, followed by words, numbers, punctuation, and whitespace.
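As a rough illustration of the approach, here is a deliberately simplified sketch (not the actual implementation in this PR — it ignores contractions and the leading-space rules of the real pattern, and the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a character-by-character fragment splitter.
// The real cl100k splitting additionally handles contractions, leading
// spaces on words/punctuation, and newline-aware whitespace runs.
final class FragmentSplitterSketch {
    static List<String> split(String text) {
        List<String> fragments = new ArrayList<>();
        int i = 0, n = text.length();
        while (i < n) {
            int start = i;
            char c = text.charAt(i);
            if (Character.isLetter(c)) {                       // words
                while (i < n && Character.isLetter(text.charAt(i))) i++;
            } else if (Character.isDigit(c)) {                 // numbers, capped at 3 digits
                int max = Math.min(n, i + 3);
                while (i < max && Character.isDigit(text.charAt(i))) i++;
            } else if (Character.isWhitespace(c)) {            // whitespace runs
                while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            } else {                                           // punctuation / other
                while (i < n && !Character.isLetterOrDigit(text.charAt(i))
                             && !Character.isWhitespace(text.charAt(i))) i++;
            }
            fragments.add(text.substring(start, i));
        }
        return fragments;
    }
}
```

For example, `split("hello 1234!")` yields `hello`, ` `, `123`, `4`, `!` — the 3-digit cap splits `1234` into two fragments, mirroring the `\p{N}{1,3}` part of the regex.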
The start and end indexes are collected, and when a match is found it is converted to UTF-8 into a reusable array, which is presented to the fragmentConsumer (which will attempt to tokenize it).
For the last segment, the whitespace, the run is first completely consumed and split by newlines; if a non-whitespace character follows the last (non-newline) whitespace (e.g. a "\n a"), we pop off the last space so it joins the next token.
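That trailing-space hand-off could be sketched like this (hypothetical names, not the code in this PR):

```java
// Sketch of the "pop the last space" rule for whitespace runs.
final class WhitespaceRuleSketch {
    // Returns the exclusive end index of the whitespace fragment starting at `start`.
    static int whitespaceEnd(String text, int start) {
        int i = start;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
        // If a non-whitespace character follows and the run ends in a plain
        // space, leave that space for the next token: "\n a" splits as "\n" + " a".
        if (i < text.length() && i > start && text.charAt(i - 1) == ' ') {
            i--;
        }
        return i;
    }
}
```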
The isLetter, isNumeric, isLetterOrNumeric, isWhitespace, isNewline, isNotWhitespaceOrLetterOrNumeric, and isNotNewlineOrLetterOrNumeric helpers are highly optimized (they detect the common cases first, before doing the heavy calculations) to decide whether the next character matches a category.
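The common-case-first idea, in sketch form (approximate — the real helpers match the \p{L}/\p{N} categories of the regex, which are broader than isDigit):

```java
// Sketch of a common-case-first character classifier: ASCII is decided with
// cheap range checks, and only non-ASCII falls back to the Unicode tables.
// Note: isDigit approximates \p{N}, which also covers non-decimal numbers.
final class CharCategorySketch {
    static boolean isLetterOrNumeric(int cp) {
        if (cp < 0x80) { // the overwhelmingly common ASCII case
            return (cp >= 'a' && cp <= 'z')
                || (cp >= 'A' && cp <= 'Z')
                || (cp >= '0' && cp <= '9');
        }
        return Character.isLetter(cp) || Character.isDigit(cp);
    }
}
```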
addUtf8Bytes needed to be reimplemented since I couldn't find any other way to convert a code point directly into a reusable list (which avoids creating so much temporary garbage).
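The core of such a conversion is standard UTF-8 bit packing; a sketch that writes straight into a caller-provided buffer (the actual method targets the PR's ByteArrayList, so the signature here is hypothetical):

```java
// Sketch of encoding a single code point as UTF-8 into a reusable buffer,
// avoiding the temporary byte[] that String.getBytes(UTF_8) would allocate.
final class Utf8Sketch {
    // Writes 1-4 bytes starting at `pos` and returns the new position.
    static int addUtf8Bytes(int cp, byte[] dst, int pos) {
        if (cp < 0x80) {                         // 1 byte:  0xxxxxxx
            dst[pos++] = (byte) cp;
        } else if (cp < 0x800) {                 // 2 bytes: 110xxxxx 10xxxxxx
            dst[pos++] = (byte) (0xC0 | (cp >> 6));
            dst[pos++] = (byte) (0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            dst[pos++] = (byte) (0xE0 | (cp >> 12));
            dst[pos++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            dst[pos++] = (byte) (0x80 | (cp & 0x3F));
        } else {                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            dst[pos++] = (byte) (0xF0 | (cp >> 18));
            dst[pos++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            dst[pos++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            dst[pos++] = (byte) (0x80 | (cp & 0x3F));
        }
        return pos;
    }
}
```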
Lastly, to avoid all the boxing of primitives, I've added simple growable byte[]- and int[]-backed lists for the UTF-8 bytes and the tokens themselves.

Before:

After:
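A minimal version of such a primitive-backed list might look like this (a sketch of the idea, not the PR's actual classes):

```java
import java.util.Arrays;

// Sketch of a growable int[]-backed list: add/get operate on primitives, so
// there is no Integer boxing and no per-element allocation; clear() lets
// callers reuse the backing array across fragments.
final class IntListSketch {
    private int[] data = new int[16];
    private int size;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // amortized O(1) growth
        }
        data[size++] = value;
    }

    int get(int index) { return data[index]; }
    int size() { return size; }
    void clear() { size = 0; }
}
```

A byte[]-backed variant for the UTF-8 bytes would be identical apart from the element type.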
As usual, I recommend reviewing the PR commit-by-commit for the changes to make sense:
[image]
After the changes both the small and large token encoders became a lot faster, so now the VERY_LARGE_TOKENIZER_BYTE_THRESHOLD is at 500:

[image]