
3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s #77

Merged
10 commits merged into knuddelsgmbh:main on Feb 4, 2024

Conversation


@paplorinc (Contributor) commented on Dec 26, 2023

Continuing #76 (and as an alternative to the optimized regular expressions in #75), the `'(?:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+` regex of the cl100k parser is eliminated entirely and replaced with a custom character-by-character parser for each segment, with optimized Unicode category detection.

The first category is contractions, followed by words, numbers, punctuation, and whitespace.

The start and end indexes are collected, and when a match is found it is converted to UTF-8 into a reusable array, which is passed to the fragmentConsumer (which will attempt to tokenize it).

The last segment, whitespace, is first consumed completely and split by newlines; if a non-whitespace character follows the last (non-newline) whitespace (e.g. in "\n a"), we pop off the last space so it can start the next token.

The `isLetter`, `isNumeric`, `isLetterOrNumeric`, `isWhitespace`, `isNewline`, `isNotWhitespaceOrLetterOrNumeric`, and `isNotNewlineOrLetterOrNumeric` helpers are highly optimized (checking the common cases first, before doing the heavy calculations) to decide whether the next character belongs to the given category.
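As a rough sketch of the "common cases first" idea (the names mirror the helpers listed above, but this is a simplified reconstruction, not the PR's exact implementation), such a helper short-circuits on the ASCII range before falling back to the generic, slower Unicode category lookup:

```java
// Illustrative sketch: handle the frequent ASCII cases with cheap range
// checks, and only fall back to the general Unicode category machinery
// for the rare non-ASCII characters. Not the PR's exact code.
public final class CharCategories {
    private CharCategories() {
    }

    static boolean isLetter(int ch) {
        if (ch < 0x80) { // common case: ASCII
            return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
        }
        return Character.isLetter(ch); // full Unicode \p{L} check
    }

    static boolean isNumeric(int ch) {
        if (ch < 0x80) { // common case: ASCII digits
            return ch >= '0' && ch <= '9';
        }
        int type = Character.getType(ch); // full \p{N}: Nd, Nl, No
        return type == Character.DECIMAL_DIGIT_NUMBER
                || type == Character.LETTER_NUMBER
                || type == Character.OTHER_NUMBER;
    }

    static boolean isNewline(int ch) {
        return ch == '\r' || ch == '\n';
    }
}
```

The negated combinations (e.g. `isNotNewlineOrLetterOrNumeric`) follow the same pattern, composed so that the cheap checks run first.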

`addUtf8Bytes` needed to be reimplemented, since I couldn't find any other way to convert characters directly into a reusable list (which avoids creating a lot of temporary garbage).
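A minimal sketch of what encoding a code point straight into a caller-supplied buffer can look like (the class name and signature here are illustrative, not the PR's actual `addUtf8Bytes`, which writes into its reusable byte list):

```java
// Sketch: encode a Unicode code point as UTF-8 directly into a reusable
// byte[] buffer, avoiding the temporary String/byte[] garbage that
// String.getBytes(UTF_8) would allocate on every fragment.
public final class Utf8Sketch {
    /** Writes the UTF-8 bytes of codePoint at dst[pos..]; returns the new position. */
    static int addUtf8Bytes(int codePoint, byte[] dst, int pos) {
        if (codePoint < 0x80) {                                   // 1 byte
            dst[pos++] = (byte) codePoint;
        } else if (codePoint < 0x800) {                           // 2 bytes
            dst[pos++] = (byte) (0xC0 | (codePoint >> 6));
            dst[pos++] = (byte) (0x80 | (codePoint & 0x3F));
        } else if (codePoint < 0x10000) {                         // 3 bytes
            dst[pos++] = (byte) (0xE0 | (codePoint >> 12));
            dst[pos++] = (byte) (0x80 | ((codePoint >> 6) & 0x3F));
            dst[pos++] = (byte) (0x80 | (codePoint & 0x3F));
        } else {                                                  // 4 bytes
            dst[pos++] = (byte) (0xF0 | (codePoint >> 18));
            dst[pos++] = (byte) (0x80 | ((codePoint >> 12) & 0x3F));
            dst[pos++] = (byte) (0x80 | ((codePoint >> 6) & 0x3F));
            dst[pos++] = (byte) (0x80 | (codePoint & 0x3F));
        }
        return pos;
    }
}
```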

Lastly, to avoid boxing primitives, I've added simple growable byte[]- and int[]-backed lists for the UTF-8 bytes and the tokens themselves.
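The shape of such a primitive-backed list is roughly the following (a hedged sketch; the name `IntList` and the growth factor are illustrative, not the PR's exact classes):

```java
// Sketch of a boxing-free growable int list: stores primitives directly
// in an int[] instead of an ArrayList<Integer>, and supports clear() so
// the backing array can be reused across fragments without reallocation.
public final class IntList {
    private int[] data = new int[16];
    private int size;

    void add(int value) {
        if (size == data.length) {
            data = java.util.Arrays.copyOf(data, size * 2); // amortized O(1) growth
        }
        data[size++] = value;
    }

    int get(int index) {
        return data[index];
    }

    int size() {
        return size;
    }

    void clear() { // reuse the backing array for the next fragment
        size = 0;
    }
}
```

The byte[]-backed variant works the same way, just with a `byte[]` backing array.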

Before:

Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  4.470 ± 0.086   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  3.799 ± 0.020   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  5.290 ± 0.067   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  5.308 ± 0.086   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.962 ± 0.096   s/op

After:

Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.315 ± 0.025   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.095 ± 0.030   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.266 ± 0.027   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.298 ± 0.120   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  3.914 ± 0.026   s/op

As usual, I recommend reviewing the PR commit by commit for the changes to make sense.

After the changes, both the small and large token encoders became a lot faster, so the VERY_LARGE_TOKENIZER_BYTE_THRESHOLD is now set to 500.

@paplorinc force-pushed the optimize-token-splitter branch 3 times, most recently from eddff1e to 18da404 on December 26, 2023 21:17
@paplorinc paplorinc changed the title Add a dedicated fragment splitter for the cl100k encoding 3) Add a dedicated fragment splitter for the cl100k encoding Dec 27, 2023
@paplorinc paplorinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.2s Dec 27, 2023
@paplorinc paplorinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.2s 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s Dec 27, 2023
…e piece

Since whole pieces were already checked, we don't have to try to re-encode them.

Before:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.365 ± 0.019   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.130 ± 0.024   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.393 ± 0.026   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.408 ± 0.015   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.073 ± 0.017   s/op

After:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.340 ± 0.023   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.096 ± 0.029   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.385 ± 0.017   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.372 ± 0.041   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.059 ± 0.026   s/op
@paplorinc paplorinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2s Jan 4, 2024
@paplorinc paplorinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2s 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s Jan 4, 2024
Lőrinc added 2 commits January 6, 2024 20:10
…count

We're also skipping the last getMinRankIndex calculation when only 2 tokens remain.
@paplorinc force-pushed the optimize-token-splitter branch 2 times, most recently from e3534f5 to fce667a on January 12, 2024 13:10
@paplorinc force-pushed the optimize-token-splitter branch 2 times, most recently from b026dee to fad6ece on January 12, 2024 21:10
@tox-p (Contributor) left a comment:
Looks good to me :) See my in-line comment for a question about how you want to proceed; after that, we can get this merged.

```java
fetchUnicodeData().entrySet().stream().parallel().forEach(e -> {
    var expected = Character.toString(e.getKey());
    if (isValidUTF8(expected)) {
        var dst = new ByteArrayList();
```
@tox-p (Contributor):

I believe that at the point of this commit (89d6212) this class did not exist yet. I personally do not care whether this commit compiles on its own, but since you purposefully split your contributions into meaningful commits, you may prefer to have each commit in a compiling state? Let me know whether you want me to merge it this way or whether you want to re-organize your commits first

@paplorinc (Contributor, Author):

Yeah, this was likely the result of moving a change back to a previous commit.
Rebasing would mess with the comments, I usually only rebase before reviews.
I'm fine with merging as is, if you are.

@tox-p merged commit 4de1e4e into knuddelsgmbh:main on Feb 4, 2024
2 checks passed
@paplorinc mentioned this pull request on Feb 12, 2024