3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s #77
Conversation
lib/src/test/java/com/knuddels/jtokkit/reference/Cl100kBaseTest.java
…e piece

Since whole pieces were already checked, we don't have to try to re-encode them.

Before:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.365 ± 0.019   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.130 ± 0.024   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.393 ± 0.026   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.408 ± 0.015   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.073 ± 0.017   s/op

After:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.340 ± 0.023   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.096 ± 0.029   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.385 ± 0.017   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.372 ± 0.041   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.059 ± 0.026   s/op
…count

We're also skipping the last getMinRankIndex calculation when only 2 tokens remain.
Looks good to me :) See my in-line comment for a question on how you want to proceed; after that we can get this merged.
fetchUnicodeData().entrySet().stream().parallel().forEach(e -> {
    var expected = Character.toString(e.getKey());
    if (isValidUTF8(expected)) {
        var dst = new ByteArrayList();
I believe that at the point of this commit (89d6212) this class did not exist yet. I personally do not care whether this commit compiles on its own, but since you purposefully split your contributions into meaningful commits, you may prefer to have each commit in a compiling state? Let me know whether you want me to merge it this way or whether you want to re-organize your commits first
Yeah, this was likely the result of moving a change back to a previous commit.
Rebasing would mess with the comments, I usually only rebase before reviews.
I'm fine with merging as is, if you are.
Continuing #76 (and as an alternative to the optimized regular expressions in #75), here the
'(?:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+
regex for the cl100k parser is completely eliminated and replaced with a custom character-by-character parser for each segment, with optimized Unicode category detection. The first category is the contractions, followed by words, numbers, punctuation, and whitespace.
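As a rough illustration of the approach, here is a deliberately simplified sketch (not the actual implementation in this PR — it ignores contractions and the leading-space rules of the real pattern, and the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a character-by-character fragment splitter.
// The real cl100k splitting additionally handles contractions, leading
// spaces on words/punctuation, and newline-aware whitespace runs.
final class FragmentSplitterSketch {
    static List<String> split(String text) {
        List<String> fragments = new ArrayList<>();
        int i = 0, n = text.length();
        while (i < n) {
            int start = i;
            char c = text.charAt(i);
            if (Character.isLetter(c)) {                       // words
                while (i < n && Character.isLetter(text.charAt(i))) i++;
            } else if (Character.isDigit(c)) {                 // numbers, capped at 3 digits
                int max = Math.min(n, i + 3);
                while (i < max && Character.isDigit(text.charAt(i))) i++;
            } else if (Character.isWhitespace(c)) {            // whitespace runs
                while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            } else {                                           // punctuation / other
                while (i < n && !Character.isLetterOrDigit(text.charAt(i))
                             && !Character.isWhitespace(text.charAt(i))) i++;
            }
            fragments.add(text.substring(start, i));
        }
        return fragments;
    }
}
```

For example, `split("hello 1234!")` yields `hello`, ` `, `123`, `4`, `!` — the 3-digit cap splits `1234` into two fragments, mirroring the `\p{N}{1,3}` part of the regex.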
The start and end indexes are collected, and when a match is found it is converted to UTF-8 into a reusable array, which is presented to the fragmentConsumer (which will attempt to tokenize it).
For the last segment, the whitespace, the run is first completely consumed and split by newlines; if a non-whitespace character follows the last (non-newline) whitespace (e.g. a "\n a"), we pop off the last space so it joins the next token.
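That trailing-space hand-off could be sketched like this (hypothetical names, not the code in this PR):

```java
// Sketch of the "pop the last space" rule for whitespace runs.
final class WhitespaceRuleSketch {
    // Returns the exclusive end index of the whitespace fragment starting at `start`.
    static int whitespaceEnd(String text, int start) {
        int i = start;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
        // If a non-whitespace character follows and the run ends in a plain
        // space, leave that space for the next token: "\n a" splits as "\n" + " a".
        if (i < text.length() && i > start && text.charAt(i - 1) == ' ') {
            i--;
        }
        return i;
    }
}
```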
The isLetter, isNumeric, isLetterOrNumeric, isWhitespace, isNewline, isNotWhitespaceOrLetterOrNumeric, and isNotNewlineOrLetterOrNumeric helpers are highly optimized (they detect the common cases first, before doing the heavy calculations) to decide whether the next character matches a category.
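The common-case-first idea, in sketch form (approximate — the real helpers match the \p{L}/\p{N} categories of the regex, which are broader than isDigit):

```java
// Sketch of a common-case-first character classifier: ASCII is decided with
// cheap range checks, and only non-ASCII falls back to the Unicode tables.
// Note: isDigit approximates \p{N}, which also covers non-decimal numbers.
final class CharCategorySketch {
    static boolean isLetterOrNumeric(int cp) {
        if (cp < 0x80) { // the overwhelmingly common ASCII case
            return (cp >= 'a' && cp <= 'z')
                || (cp >= 'A' && cp <= 'Z')
                || (cp >= '0' && cp <= '9');
        }
        return Character.isLetter(cp) || Character.isDigit(cp);
    }
}
```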
addUtf8Bytes needed to be reimplemented since I couldn't find any other way to convert a code point directly into a reusable list (which avoids creating so much temporary garbage).
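The core of such a conversion is standard UTF-8 bit packing; a sketch that writes straight into a caller-provided buffer (the actual method targets the PR's ByteArrayList, so the signature here is hypothetical):

```java
// Sketch of encoding a single code point as UTF-8 into a reusable buffer,
// avoiding the temporary byte[] that String.getBytes(UTF_8) would allocate.
final class Utf8Sketch {
    // Writes 1-4 bytes starting at `pos` and returns the new position.
    static int addUtf8Bytes(int cp, byte[] dst, int pos) {
        if (cp < 0x80) {                         // 1 byte:  0xxxxxxx
            dst[pos++] = (byte) cp;
        } else if (cp < 0x800) {                 // 2 bytes: 110xxxxx 10xxxxxx
            dst[pos++] = (byte) (0xC0 | (cp >> 6));
            dst[pos++] = (byte) (0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            dst[pos++] = (byte) (0xE0 | (cp >> 12));
            dst[pos++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            dst[pos++] = (byte) (0x80 | (cp & 0x3F));
        } else {                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            dst[pos++] = (byte) (0xF0 | (cp >> 18));
            dst[pos++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            dst[pos++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            dst[pos++] = (byte) (0x80 | (cp & 0x3F));
        }
        return pos;
    }
}
```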
Lastly, to avoid all the boxing of primitives, I've added simple growable byte[]- and int[]-backed lists for the UTF-8 bytes and the tokens themselves.

Before:

After:
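A minimal version of such a primitive-backed list might look like this (a sketch of the idea, not the PR's actual classes):

```java
import java.util.Arrays;

// Sketch of a growable int[]-backed list: add/get operate on primitives, so
// there is no Integer boxing and no per-element allocation; clear() lets
// callers reuse the backing array across fragments.
final class IntListSketch {
    private int[] data = new int[16];
    private int size;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // amortized O(1) growth
        }
        data[size++] = value;
    }

    int get(int index) { return data[index]; }
    int size() { return size; }
    void clear() { size = 0; }
}
```

A byte[]-backed variant for the UTF-8 bytes would be identical apart from the element type.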
As usual, I recommend reviewing the PR commit-by-commit for the changes to make sense:
[image]
After the changes both the small and large token encoders became a lot faster, so now the VERY_LARGE_TOKENIZER_BYTE_THRESHOLD is at 500:

[image]