@Timed: histogram buckets have too high cardinality #1947

Closed
adericbourg opened this issue Mar 26, 2020 · 5 comments
Labels: question (A user question, probably better suited for StackOverflow)

Comments

@adericbourg

Using the @Timed annotation does not allow controlling the buckets for a timer (the way the sla() method does on a timer builder).

This results in a large number of buckets:

  • Most of them are not relevant
  • It causes performance issues in TSDBs (at least Prometheus) because of the resulting high cardinality

Example (with a Prometheus registry):

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001048576",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001398101",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001747626",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.002097151",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.002446676",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.002796201",} 0.0

// 54 other lines here

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="12.884901886",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="14.316557651",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="15.748213416",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="17.179869184",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="22.906492245",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="28.633115306",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="30.0",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="+Inf",} 3.0

As a library user:

  • I'd like to benefit from the current integration
  • I'd like to be able to define my own buckets

Currently, whether buckets are published at all is controlled (if I read the code correctly) by the histogram attribute of the @Timed annotation.
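
For reference, a minimal sketch of what the annotation exposes today (the class, method, and metric name are made up for illustration, and it assumes something like Micrometer's TimedAspect or Spring's built-in support is wired up to honor the annotation): histogram is a plain on/off switch, with no attribute for choosing bucket boundaries.

import io.micrometer.core.annotation.Timed;

class FooService {

    // histogram = true publishes the full percentile histogram (the ~70 buckets
    // shown above); @Timed has no attribute for picking the bucket boundaries.
    @Timed(value = "foo.requests", histogram = true)
    String handleFoo() {
        return "foo";
    }
}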

I don't see a "good" option right now:

  • adding a new property (e.g. an sla property) could cause confusion, as it would conflict with histogram
  • replacing histogram with sla would break the API

What do you think?

jkschneider added the question label on Mar 27, 2020
@jkschneider (Contributor) commented Mar 27, 2020

@adericbourg You can control the buckets added by @Timed with a MeterFilter that implements the configure method. Some shortcuts:

  • MeterFilter.maxExpected("http.server.requests", Duration.ofSeconds(1))
  • MeterFilter.minExpected("http.server.requests", Duration.ofMillis(10))

MeterFilter implementations simply need to be wired as a @Bean in a Spring app to take effect.
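
For example, a minimal sketch of wiring those shortcuts as beans (the class and bean names are made up; it assumes a Spring app with Micrometer auto-configuration, so registered MeterFilter beans are applied to the registry):

import java.time.Duration;

import io.micrometer.core.instrument.config.MeterFilter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class TimerHistogramLimits {

    // Trim percentile-histogram buckets above 1 s for http.server.requests timers.
    @Bean
    MeterFilter maxExpectedHttpTimings() {
        return MeterFilter.maxExpected("http.server.requests", Duration.ofSeconds(1));
    }

    // Trim percentile-histogram buckets below 10 ms for http.server.requests timers.
    @Bean
    MeterFilter minExpectedHttpTimings() {
        return MeterFilter.minExpected("http.server.requests", Duration.ofMillis(10));
    }
}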

As even more of a shortcut, Spring Boot has property-driven configuration for setting the minimum and maximum expected values:

https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#per-meter-properties

Also, there is no need to use @Timed on Spring MVC and WebFlux endpoints, as these are automatically instrumented by the framework.

In general we try to discourage fine-grained control of histogram buckets, because experience has shown that folks tend to select buckets that lead to high-error-bound percentile approximations (and the true error bound is in fact unknowable from the discretized distribution, so they never really know). Technically, sla options just add additional bucket values. So if you really, really want to fight the recommendation, you can turn off percentile histograms and add any number of SLA boundaries. This in effect yields a histogram with a defined set of bucket boundaries.
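
For the record, a rough sketch of that last option (the filter name and boundaries are illustrative; serviceLevelObjectives is the newer name for the sla option in Micrometer 1.5+, and for timers the values are given in nanoseconds): it disables the generated percentile histogram and publishes only the listed boundaries plus +Inf.

import java.time.Duration;

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;

class ExplicitBucketsFilter implements MeterFilter {

    @Override
    public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
        if (!id.getName().startsWith("http.server.requests")) {
            return config;
        }
        // Turn off the auto-generated percentile histogram and keep only the
        // explicit boundaries below (nanoseconds, since this is a timer).
        return DistributionStatisticConfig.builder()
                .percentilesHistogram(false)
                .serviceLevelObjectives(
                        Duration.ofMillis(100).toNanos(),
                        Duration.ofMillis(500).toNanos(),
                        Duration.ofSeconds(1).toNanos(),
                        Duration.ofSeconds(2).toNanos())
                .build()
                .merge(config);
    }
}

Registered as a @Bean in a Spring app, a filter like this also applies to the timers produced by @Timed.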

@adericbourg (Author)

Thanks a lot for that answer: it is very helpful!

I agree that tuning the histogram may cause "bad" approximations, and I don't mean to fight against that.
Setting explicit bounds is part of the solution, as the main issue I have here is that there are too many bounds. To provide some context, our SRE team is struggling to maintain Prometheus instances that scrape metrics with too high a cardinality. My goal here is to put constraints on that cardinality.

A complete solution could be:

  • setting min and max bounds
  • and setting the number of bounds (number of buckets)

Anyhow, setting the SLA boundaries manually and disabling the histogram property is a good workaround for now.

@jkschneider (Contributor) commented May 7, 2020

> and setting the number of bounds (number of buckets)

This temptation is actually precisely why we removed (obvious) configurability. It's just too easy to say "hey it's publishing 100 buckets right now, can we turn that down to 50 buckets?" But the impact on percentile approximation is unknowable. Narrowing min/max (provided your timings don't go outside of that range) doesn't affect the approximation's accuracy.

There are other options for selecting bucket functions. Somebody once suggested the E series for example, though it was based on an argument about the readability of buckets and not performance. Happy to take PRs with other bucketing functions if you discover one that demonstrates a decent error bound. Maybe adding more dynamism to buckets, such as was done recently for VictoriaMetrics histograms, is a good path forward.

@fzyzcjy commented Mar 3, 2023

@jkschneider Hi, thanks for the explanation. Now I see why Spring does not allow me to set the buckets explicitly and only lets me set min/max. However, I see people recommending that Prometheus metrics should not have too high a cardinality, e.g. that only about 10 buckets should be used. What do you think about that, and how do you handle it in your production environment? Thanks.

@renannprado (Contributor)

TL;DR: set explicit SLO boundaries for the meter via Spring Boot configuration:

management:
  metrics:
    distribution:
      slo:
        http.server.requests:
          - 100ms
          - 500ms
          - 1s
          - 2s
