Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1) Optimize 50k & 100k splitter regular expressions - 10.5s to 8.9s #75

Merged
merged 8 commits into from
Dec 28, 2023

Conversation

l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Dec 22, 2023

As the first step in optimizing the c100k parser mostly (used for GPT 3.5 & 4), here's the regex optimization applying to all 50k and the 100k parsers.

The difference is not huge, but measurable:

Before:

Benchmark                                    (dataFolderPath)  Mode  Cnt   Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  10.548 ± 0.885   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10   9.999 ± 0.097   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  10.184 ± 0.131   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10   9.938 ± 0.076   s/op

After:

Benchmark                                    (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  8.947 ± 0.109   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10  9.419 ± 0.082   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  9.365 ± 0.073   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10  8.403 ± 0.080   s/op

Please review commit-by-commit for the changes to make sense:
image

Feel free to either comment - or if it's simpler -, add commits on top of these.

@l0rinc l0rinc requested a review from tox-p as a code owner December 22, 2023 11:21
Lőrinc added 3 commits December 25, 2023 11:03
Benchmark                                    (dataFolderPath)  Mode  Cnt   Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  10.548 ± 0.885   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10   9.999 ± 0.097   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  10.184 ± 0.131   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10   9.938 ± 0.076   s/op
Before:
Benchmark                                    (dataFolderPath)  Mode  Cnt   Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  10.548 ± 0.885   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10   9.999 ± 0.097   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  10.184 ± 0.131   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10   9.938 ± 0.076   s/op

After:
Benchmark                                    (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  8.947 ± 0.109   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10  9.419 ± 0.082   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  9.365 ± 0.073   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10  8.403 ± 0.080   s/op
@l0rinc l0rinc changed the title Optimize 50k & 100k splitter regular expressions 1) Optimize 50k & 100k splitter regular expressions Dec 27, 2023
@l0rinc l0rinc changed the title 1) Optimize 50k & 100k splitter regular expressions 1) Optimize 50k & 100k splitter regular expressions - 10.5s to 8.9s Dec 27, 2023
Copy link
Contributor

@tox-p tox-p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great PR, the structure and the helpful comments!

I have commented some questions that I would love to get resolved before merging, but overall they are quite minor 🙂

JUnit 5 doesn't require public modifiers anymore.
Private modifiers also don't make sense for non-helper tests since they aren't inherited.
Renamed Cl100kBaseTestTest to Cl100kBaseTest.
Removed dangling final modifiers.
Used `var` in all local declarations.
Copy link
Contributor

@tox-p tox-p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated in the in-line comments: I can easily add testing with all LTS JDKs after your changes are merged. Let's not let that delay us moving forward with the other PRs 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants