Skip to content

Additional performance improvements for cluster mode#680

Merged
zbjornson merged 10 commits into
prometheus:masterfrom
jdmarshall:perf/cluster
Jul 8, 2025
Merged

Additional performance improvements for cluster mode#680
zbjornson merged 10 commits into
prometheus:masterfrom
jdmarshall:perf/cluster

Conversation

@jdmarshall

@jdmarshall jdmarshall commented Jul 5, 2025

Copy link
Copy Markdown
Contributor

Most of these improvements are on the worker side of the connection, but when combined with #679 should provide some
relief for people using this mode.

This PR includes:

  • substantial additional code coverage in the benchmarks via addition of more scenarios
  • a workaround for benchmark.js not allowing async test bodies (despite supporting async setup and teardown?)
  • removal of truthy conditionals in hot paths
  • removal of double lookups during map insertions
  • reworking hashObject() to optimize for the common case, which is metrics with labels

No breaking changes.

This should partially address the concerns brought up in #628, and partly fix #676

Code coverage improvement in the benchmarks:

File                             | % Stmts | % Branch | % Funcs | % Lines |
---------------------------------|---------|----------|---------|---------|
All files                        |   46.54 |    22.47 |   42.65 |    47.3 | 
All files                        |    51.3 |    27.01 |   48.78 |   52.04 |

~~There's a lot of jitter in these tests, but the common theme is that the benchmarks with no labels show a tiny regression versus the label paths showing 10+% improvements.

Given that the no-label path was previously almost 500x faster, that's a reasonable tradeoff. If prom-client is not fast enough for anyone, it will be someone using a lot of labels.~~

Edit: most of the hashObject changes were dropped from this PR at the last moment, once I realized a much better option was possible. It still might be worth resurrecting these for the metrics aggregation logic, which still uses this code.

#687 introduces a new class which consolidates all of this logic and replaces the key lookup with a much more compact value.

⚠ counter ➭ inc is 0.06717% acceptably slower.
✓ counter ➭ inc with labels is 25.43% faster.
✓ gauge ➭ inc is 2.639% faster.
✓ gauge ➭ inc with labels is 21.03% faster.
✓ histogram ➭ observe#1 with 64 is 16.87% faster.
✓ histogram ➭ observe#2 with 8 is 10.97% faster.
⚠ histogram ➭ observe#2 with 4 and 2 with 2 is 1.345% acceptably slower.
⚠ histogram ➭ observe#2 with 2 and 2 with 4 is 3.383% acceptably slower.
⚠ histogram ➭ observe#6 with 2 is 3.862% acceptably slower.
✓ registry ➭ getMetricsAsJSON#1 with 64 is 4.661% faster.
✓ registry ➭ getMetricsAsJSON#2 with 8 is 5.511% faster.
✓ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 5.085% faster.
✓ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 5.989% faster.
✓ registry ➭ getMetricsAsJSON#6 with 2 is 4.661% faster.
✓ registry ➭ metrics#1 with 64 is 7.167% faster.
✓ registry ➭ metrics#1 with 64 and openMetrics is 4.562% faster.
✓ registry ➭ metrics#2 with 8 is 3.493% faster.
✓ registry ➭ metrics#2 with 8 and openMetrics is 4.773% faster.
✓ registry ➭ metrics#2 with 4 and 2 with 2 is 5.141% faster.
✓ registry ➭ metrics#2 with 4 and 2 with 2 and openMetrics is 2.863% faster.
✓ registry ➭ metrics#2 with 2 and 2 with 4 is 4.577% faster.
✓ registry ➭ metrics#2 with 2 and 2 with 4 and openMetrics is 3.778% faster.
✓ registry ➭ metrics#6 with 2 is 2.311% faster.
✓ registry ➭ metrics#6 with 2 and openMetrics is 1.362% faster.
✓ summary ➭ observe#1 with 64 is 18.31% faster.
✓ summary ➭ observe#2 with 8 is 5.565% faster.
✓ summary ➭ observe#2 with 4 and 2 with 2 is 0.6756% faster.
⚠ summary ➭ observe#2 with 2 and 2 with 4 is 5.010% acceptably slower.
⚠ summary ➭ observe#6 with 2 is 4.731% acceptably slower.
✓ cluster ➭ aggregate() is 1.476% faster.

Author notes:

Originally this PR was meant to include full e2e benchmarks of the cluster code, however the two benchmark frameworks being used are not particularly cooperative in this respect. I spent more time trying to get those to work than I did on the rest of the work combined. In the end I had code that ran properly, however the variability in the run time was so great (due to coordinating multiple runs of the benchmark code across multiple processes), that none of the output was actionable. That work inspired the addition of the debug module, and I opted to retain that code so that some value could be derived from that goose chase.

I believe this could be fixed with a single PR to benchmark-regressions to have it delegate setup() to benchmark.js, allowing the coordination overhead to occur between the measurement windows. However that project has been dormant for a very, very long time and I do not have any expectation of a PR being landed if offered.

@jdmarshall

Copy link
Copy Markdown
Contributor Author

Dropping hashCode rework commit - I have a better option in mind.

@zbjornson

Copy link
Copy Markdown
Contributor

Looks good.

Can you rebase please? Sorry for the hassle, you're right that the changelog requirement makes this a pain.

jdmarshall added 10 commits July 7, 2025 21:11
The receiver of this data expects metrics[] and we are naming the
data as if we sent the registries instead of their metrics.
As $645 notes, the overhead of calculating real labels is substantial
in cluster mode.

single letter labels with single letter values don't reflect real
world behavior.
Also rename the 'current' tests to align the output in a way one
can actually read while the screen is scrolling.

This is rescued from an attempt to support running cluster benchmarks.
After substantial effort the code ran but the results were inaccurate.
I would have preferred end-to-end tests for the cluster code but
that would require some substantial PRs to benchmark-regressions.
This function has been identified as being part of a hot path
by prometheus#628
@SimenB

SimenB commented Jul 8, 2025

Copy link
Copy Markdown
Collaborator

Author notes:

Originally this PR was meant to include full e2e benchmarks of the cluster code, however the two benchmark frameworks being used are not particularly cooperative in this respect. I spent more time trying to get those to work than I did on the rest of the work combined. In the end I had code that ran properly, however the variability in the run time was so great (due to coordinating multiple runs of the benchmark code across multiple processes), that none of the output was actionable. That work inspired the addition of the debug module, and I opted to retain that code so that some value could be derived from that goose chase.

I believe this could be fixed with a single PR to benchmark-regressions to have it delegate setup() to benchmark.js, allowing the coordination overhead to occur between the measurement windows. However that project has been dormant for a very, very long time and I do not have any expectation of a PR being landed if offered.

I'd be happy to replace it with a maintained lib 😀

@zbjornson zbjornson left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@zbjornson zbjornson merged commit e913c11 into prometheus:master Jul 8, 2025
12 checks passed
@jdmarshall

Copy link
Copy Markdown
Contributor Author

Turns out I was wrong about benchmark.js allowing async setup and teardown. That’s all benchmark-regression and why benchmark.js’s lifecycle events don’t fire in time to say conditionally start background processes.

I did file a PR for this. But benchmark.js using eval() to run benchmark code made that PR ten times harder than it needed to be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Poor code coverage in the benchmarks

3 participants