
histogram_quantile makes a big jump from 98th to 99th percentile #3611

Closed
JorritSalverda opened this Issue Dec 22, 2017 · 12 comments

JorritSalverda (Contributor) commented Dec 22, 2017

I'm investigating whether histogram_quantile might be presenting us with inaccurate values, since we see a big spike in our higher-percentile response times in Prometheus but don't see the same behaviour in the logged response times.

What did you do?

I run OpenResty and collect response-time metrics with the following buckets:

{0.1, 0.5, 1, 2, 5, 7.5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, 120}

When I run

sum(rate(nginx_http_request_duration_seconds_bucket{app="myapp"}[30s])) by (app, le)

it returns the following bucket values:

{app="myapp",le="+Inf"}  199.50187708588712
{app="myapp",le="120.0"} 199.50187708588712
{app="myapp",le="090.0"} 199.50187708588712
{app="myapp",le="080.0"} 199.50187708588712
{app="myapp",le="070.0"} 199.50187708588712
{app="myapp",le="060.0"} 199.45190207339337
{app="myapp",le="055.0"} 199.35191957026893
{app="myapp",le="050.0"} 199.35191957026893
{app="myapp",le="045.0"} 198.8021265871456
{app="myapp",le="040.0"} 198.75235057918144
{app="myapp",le="035.0"} 198.5526971815257
{app="myapp",le="030.0"} 196.45640317752432
{app="myapp",le="025.0"} 195.45769592817464
{app="myapp",le="020.0"} 194.15942538340332
{app="myapp",le="015.0"} 183.42920668757
{app="myapp",le="010.0"} 149.1837498477594
{app="myapp",le="007.5"} 118.63705228384505
{app="myapp",le="005.0"}  75.51447107481218
{app="myapp",le="002.0"}  32.191822707563105
{app="myapp",le="001.0"}  22.310299617882855
{app="myapp",le="000.5"}  15.272436699620076
{app="myapp",le="000.1"}   3.6436199646088867

If I calculate the 99th percentile from those buckets with

histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket{app="myapp"}[30s])) by (app, le))

it returns

{app="myapp"} 32.753189149053256

whereas the 98th percentile returns

{app="myapp"} 23.962643893755075

This results in pretty large bumps in response times according to Prometheus, as seen in this graph:

[graph: routing-proxy-percentile-bump]

The data from our logs, however, shows nothing like it:

[graph: prometheus-bump-but-not-logs]

What did you expect to see?

Approximately the same response times per percentile in Prometheus as in the logs.

I understand that the bucketing in Prometheus trades accuracy for efficiency, and the values might actually be accurate, with the difference coming from something else, but I want to make sure it's not a 'bug' in how we set up the buckets or calculate the percentiles.
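
For reference, here is my rough understanding of how histogram_quantile arrives at these numbers, worked through against the bucket values listed above (they were presumably evaluated at a slightly different moment than the quantile queries, so they won't match exactly). The target rank is the quantile times the +Inf count, and the result is interpolated linearly within the bucket that rank falls into:

0.99 * 199.50 ≈ 197.51, which falls between le=30 (196.46) and le=35 (198.55):
  p99 ≈ 30 + 5 * (197.51 - 196.46) / (198.55 - 196.46) ≈ 32.5

0.98 * 199.50 ≈ 195.51, which falls between le=25 (195.46) and le=30 (196.46):
  p98 ≈ 25 + 5 * (195.51 - 195.46) / (196.46 - 195.46) ≈ 25.3

Only about 1% of the observations land in the wide (30, 35] bucket, so the p99 estimate is interpolated across a full 5-second bucket and a handful of slow requests can move it by several seconds, while p98 still resolves within the lower bucket.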

What did you see instead? Under which circumstances?

A big difference between Prometheus percentiles and log-based percentiles.

The graphs are actually based on the following recording rules:

- record: app:nginx_http_request_duration_seconds_bucket:histogram_quantile99
  expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30s])) by (app, le))

- record: app:nginx_http_request_duration_seconds_bucket:histogram_quantile90
  expr: histogram_quantile(0.9, sum(rate(nginx_http_request_duration_seconds_bucket[30s])) by (app, le))

- record: app:nginx_http_request_duration_seconds_bucket:histogram_quantile50
  expr: histogram_quantile(0.5, sum(rate(nginx_http_request_duration_seconds_bucket[30s])) by (app, le))

Looking at the following comment in the Prometheus code, the use of recording rules might have some impact. But when graphing the directly calculated 99th percentile next to the one based on the recording rule, the only difference is a shift in time, not in height.

// The assumption that bucket counts increase monotonically with increasing
// upperBound may be violated during:
//
// * Recording rule evaluation of histogram_quantile, especially when rate()
// has been applied to the underlying bucket timeseries.
// * Evaluation of histogram_quantile computed over federated bucket
// timeseries, especially when rate() has been applied.
//
// This is because scraped data is not made available to rule evaluation or
// federation atomically, so some buckets are computed with data from the
// most recent scrapes, but the other buckets are missing data from the most
// recent scrape.
//
// Monotonicity is usually guaranteed because if a bucket with upper bound
// u1 has count c1, then any bucket with a higher upper bound u > u1 must
// have counted all c1 observations and perhaps more, so that c >= c1.
//
// Randomly interspersed partial sampling breaks that guarantee, and rate()
// exacerbates it. Specifically, suppose bucket le=1000 has a count of 10 from
// 4 samples but the bucket with le=2000 has a count of 7 from 3 samples. The
// monotonicity is broken. It is exacerbated by rate() because under normal
// operation, cumulative counting of buckets will cause the bucket counts to
// diverge such that small differences from missing samples are not a problem.
// rate() removes this divergence.)
//
// bucketQuantile depends on that monotonicity to do a binary search for the
// bucket with the φ-quantile count, so breaking the monotonicity
// guarantee causes bucketQuantile() to return undefined (nonsense) results.
//
// As a somewhat hacky solution until ingestion is atomic per scrape, we
// calculate the "envelope" of the histogram buckets, essentially removing
// any decreases in the count between successive buckets.
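
To check whether this recording-rule artefact is actually occurring here, a quick spot check (using the zero-padded le values this exporter emits) is to compare an adjacent pair of buckets; the expression only returns a sample when the lower bucket's rate exceeds the higher one's, i.e. when monotonicity is violated:

sum(rate(nginx_http_request_duration_seconds_bucket{app="myapp",le="025.0"}[30s])) by (app)
  > sum(rate(nginx_http_request_duration_seconds_bucket{app="myapp",le="030.0"}[30s])) by (app)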

Environment

  • System information:
Linux 4.4.64+ x86_64
  • Prometheus version:
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
  build user:       root@615b82cb36b6
  build date:       20171108-07:11:59
  go version:       go1.9.2
  • Prometheus configuration file:
scrape_configs:

- job_name: 'kubernetes-pods'

  kubernetes_sd_configs:
  - role: pod

  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
brian-brazil (Member) commented Dec 22, 2017

The results are in line with the data and accurate. Calculating a 99th percentile from only ~200 values means that one or two outliers can push the reported latency up by quite a bit.

JorritSalverda (Contributor Author) commented Dec 22, 2017

I'm handling 250 qps. I would think calculating the rate over 30 seconds would result in many more measurements in the buckets than 200. Any explanation for that?

JorritSalverda (Contributor Author) commented Dec 22, 2017

Ah, of course: when using rate the values are converted back to operations per second. Okay, maybe looking at the 99th percentile doesn't really make sense in that case.

brian-brazil (Member) commented Dec 22, 2017

Ah, that'd be ~7500 queries within the 30s range. That should be plenty for a 99th percentile.
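
(Roughly: 250 qps × 30 s ≈ 7500 observations. Equivalently, the ~199.5/s in the +Inf bucket above corresponds to about 199.5 × 30 ≈ 6000 observations inside the window; the per-second values in the bucket listing are rates, not raw sample counts.)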

JorritSalverda (Contributor Author) commented Dec 22, 2017

Wouldn't using sum_over_time instead of rate make it more accurate?

JorritSalverda (Contributor Author) commented Dec 22, 2017

At least I get pretty different outcomes:

histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket{app="myapp"}[30s])) by (app, le))

{app="myapp""} 19.291673542342327

histogram_quantile(0.99, sum(sum_over_time(nginx_http_request_duration_seconds_bucket{app="myapp"}[30s])) by (app, le))

{app="myapp"} 22.24807170346361
JorritSalverda (Contributor Author) commented Dec 22, 2017

I guess if you use rate instead of sum_over_time, empty buckets get more relative weight? I'll make recording rules for the sum_over_time-based response time percentiles, compare them, and report back here once I have some useful data.

brian-brazil (Member) commented Dec 22, 2017

sum_over_time is meaningless on top of a counter. It's the number of events in the time range that matters.
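
For example (hypothetical numbers): if a cumulative bucket counter is scraped at 100, 105 and 110 inside the 30s window, sum_over_time returns 100 + 105 + 110 = 315, even though only 110 - 100 = 10 new observations actually happened in that window, which is roughly what rate and increase capture.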

JorritSalverda (Contributor Author) commented Dec 22, 2017

Of course. increase is what I was after; I'll compare that instead. Thanks for pointing that out.

brian-brazil (Member) commented Dec 22, 2017

increase won't make a difference to histogram calculations.
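
For example, these two should return essentially the same value, because increase() is just rate() scaled by the window length and histogram_quantile only depends on the relative distribution of counts across the buckets:

histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket{app="myapp"}[30s])) by (app, le))

histogram_quantile(0.99, sum(increase(nginx_http_request_duration_seconds_bucket{app="myapp"}[30s])) by (app, le))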

JorritSalverda (Contributor Author) commented Dec 22, 2017

The results do indeed mostly overlap. Sometimes they differ a bit, but that's because my Grafana load-balances requests over two Prometheus instances. I know that might not be best practice, but we run them on preemptible VMs (lasting at most 24 hours) and want to be able to reach whichever instance is online while the other one gets relocated.

Now that it seems Prometheus is not to blame I'll close the ticket and focus my investigation elsewhere. Thanks for your quick responses :)

