@Timed: histogram buckets have too high cardinality #1947

Closed
adericbourg opened this issue Mar 26, 2020 · 5 comments
Labels: question (A user question, probably better suited for StackOverflow)

Comments

@adericbourg

Using the @Timed annotation does not allow controlling the buckets for a timer (the way the sla() method does on a timer builder).

This results in a large number of buckets:

  • Most of them are not relevant
  • It causes performance issues in TSDBs (at least Prometheus) because of the resulting high cardinality

Example (with a Prometheus registry):

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001048576",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001398101",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.001747626",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.002097151",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.002446676",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="0.002796201",} 0.0

// 54 other lines here

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="12.884901886",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="14.316557651",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="15.748213416",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="17.179869184",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="22.906492245",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="28.633115306",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="30.0",} 3.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/foo",le="+Inf",} 3.0

As a library user:

  • I'd like to benefit from the current integration
  • I'd like to be able to define my own buckets

Currently, whether buckets are published at all is controlled (if I read the code correctly) by the histogram attribute of the @Timed annotation.
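
For reference, a minimal sketch of what the annotation exposes today (the class, method, and metric name are made up for illustration, and it assumes something like Micrometer's TimedAspect or Spring's built-in support is wired up to honor the annotation): histogram is a plain on/off switch, with no attribute for choosing bucket boundaries.

import io.micrometer.core.annotation.Timed;

class FooService {

    // histogram = true publishes the full percentile histogram (the ~70 buckets
    // shown above); @Timed has no attribute for picking the bucket boundaries.
    @Timed(value = "foo.requests", histogram = true)
    String handleFoo() {
        return "foo";
    }
}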

I don't see a "good" option right now:

  • adding a new property (e.g. an sla property) could cause confusion, as it would conflict with histogram
  • replacing histogram with sla would break the API

What do you think?

jkschneider added the question label on Mar 27, 2020
@jkschneider (Contributor) commented Mar 27, 2020

@adericbourg You can control the buckets added by @Timed with a MeterFilter that implements the configure method. Some shortcuts:

  • MeterFilter.maxExpected("http.server.requests", Duration.ofSeconds(1))
  • MeterFilter.minExpected("http.server.requests", Duration.ofMillis(10))

MeterFilter implementations simply need to be wired as a @Bean in a Spring app to take effect.
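
For example, a minimal sketch of wiring those shortcuts as beans (the class and bean names are made up; it assumes a Spring app with Micrometer auto-configuration, so registered MeterFilter beans are applied to the registry):

import java.time.Duration;

import io.micrometer.core.instrument.config.MeterFilter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class TimerHistogramLimits {

    // Trim percentile-histogram buckets above 1 s for http.server.requests timers.
    @Bean
    MeterFilter maxExpectedHttpTimings() {
        return MeterFilter.maxExpected("http.server.requests", Duration.ofSeconds(1));
    }

    // Trim percentile-histogram buckets below 10 ms for http.server.requests timers.
    @Bean
    MeterFilter minExpectedHttpTimings() {
        return MeterFilter.minExpected("http.server.requests", Duration.ofMillis(10));
    }
}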

As even more of a shortcut, Spring Boot has property-driven configuration for setting the minimum and maximum expected values:

https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#per-meter-properties

Also, there is no need to use @Timed on Spring MVC and WebFlux endpoints, as these are automatically instrumented by the framework.

In general we try to discourage fine-grained control of histogram buckets, because experience has shown that folks tend to select buckets that lead to high-error-bound percentile approximations (and the true error bound is in fact unknowable from the discretized distribution, so they never really know). Technically, sla options just add additional bucket values. So if you really, really want to fight the recommendation, you can turn off percentile histograms and add any number of SLA boundaries. This in effect yields a histogram with a defined set of bucket boundaries.
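
For the record, a rough sketch of that last option (the filter name and boundaries are illustrative; serviceLevelObjectives is the newer name for the sla option in Micrometer 1.5+, and for timers the values are given in nanoseconds): it disables the generated percentile histogram and publishes only the listed boundaries plus +Inf.

import java.time.Duration;

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;

class ExplicitBucketsFilter implements MeterFilter {

    @Override
    public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
        if (!id.getName().startsWith("http.server.requests")) {
            return config;
        }
        // Turn off the auto-generated percentile histogram and keep only the
        // explicit boundaries below (nanoseconds, since this is a timer).
        return DistributionStatisticConfig.builder()
                .percentilesHistogram(false)
                .serviceLevelObjectives(
                        Duration.ofMillis(100).toNanos(),
                        Duration.ofMillis(500).toNanos(),
                        Duration.ofSeconds(1).toNanos(),
                        Duration.ofSeconds(2).toNanos())
                .build()
                .merge(config);
    }
}

Registered as a @Bean in a Spring app, a filter like this also applies to the timers produced by @Timed.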

@adericbourg (Author)

Thanks a lot for that answer: it is very helpful!

I agree that tuning the histogram may cause "bad" approximations, and I don't mean to fight against that.
Setting explicit bounds is part of the solution, as the main issue I have here is that there are too many bounds. To provide some context, our SRE team is struggling to maintain Prometheus instances that scrape metrics with too high a cardinality. My goal here is to put constraints on that cardinality.

A complete solution could be:

  • setting min and max bounds
  • and setting the number of bounds (number of buckets)

Anyhow, setting the SLA boundaries manually and disabling the histogram property is a good workaround for now.

@jkschneider (Contributor) commented May 7, 2020

> and setting the number of bounds (number of buckets)

This temptation is actually precisely why we removed (obvious) configurability. It's just too easy to say "hey it's publishing 100 buckets right now, can we turn that down to 50 buckets?" But the impact on percentile approximation is unknowable. Narrowing min/max (provided your timings don't go outside of that range) doesn't affect the approximation's accuracy.

There are other options for selecting bucket functions. Somebody once suggested the E series for example, though it was based on an argument about the readability of buckets and not performance. Happy to take PRs with other bucketing functions if you discover one that demonstrates a decent error bound. Maybe adding more dynamism to buckets, such as was done recently for VictoriaMetrics histograms, is a good path forward.

@fzyzcjy commented Mar 3, 2023

@jkschneider Hi, thanks for the explanation. Now I see why Spring does not allow me to set the buckets explicitly and only lets me set min/max. However, I see people recommending that Prometheus metrics should not have too high a cardinality, e.g. that only about 10 buckets should be used. What do you think about that, and how do you handle it in your production environment? Thanks.

@renannprado (Contributor)

TL;DR: set explicit SLO boundaries for the meter via Spring Boot configuration:

management:
  metrics:
    distribution:
      slo:
        http.server.requests:
          - 100ms
          - 500ms
          - 1s
          - 2s
