histogram_quantile makes a big jump from 98th to 99th percentile #3611
Comments
The results are in line with the data and accurate. Calculating 99th percentiles with only 200 values means that 1-2 outliers can push the latency up by quite a bit.
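The point above can be sketched numerically: with only a couple of hundred samples, the 99th percentile is essentially an order statistic of the top few values, so one or two outliers dominate it. A minimal illustration with made-up latencies (the numbers are hypothetical, not from the thread):

```python
import random
import statistics

random.seed(1)

# Hypothetical latencies in ms: 198 well-behaved requests plus 2 slow outliers.
normal = [random.gauss(50, 5) for _ in range(198)]
outliers = [900.0, 950.0]

# Percentile cut points over all 200 samples.
cuts = statistics.quantiles(normal + outliers, n=100)
p98, p99 = cuts[97], cuts[98]

# With 200 samples, the 99th percentile interpolates between the 2nd- and
# 3rd-largest values, so two outliers are enough to drag it way up while
# the 98th percentile stays in the well-behaved bulk.
print(round(p98, 1), round(p99, 1))
```

The 98th percentile stays near the bulk of the distribution while the 99th jumps toward the outliers, which is exactly the kind of gap reported in this issue.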
brian-brazil added the kind/question label Dec 22, 2017
I'm handling 250 qps. I would think calculating the rate over 30 seconds would result in many more measurements in the buckets than 200. Any explanation for that?
Ah, it's calculated back to operations per second of course when using
Ah, that'd be ~7500 queries within the 30s range. That should be plenty for a 99th percentile.
Wouldn't make using
At least I get pretty different outcomes:
I guess if you use
Of course.
It indeed seems to mostly overlap. Sometimes it's a bit different, but that's because my Grafana balances requests over 2 Prometheus instances. I know that might not be best practice, but we run them on preemptibles (lasting at most 24 hours) and at least want to connect to an instance that's online while the other one gets relocated. Now that it seems Prometheus is not to blame, I'll close the ticket and focus my investigation elsewhere. Thanks for your quick responses :)
JorritSalverda closed this Dec 22, 2017
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
JorritSalverda commented Dec 22, 2017
I'm investigating whether histogram_quantile might be presenting us inaccurate values, since we see a big spike in our higher-percentile response times based on Prometheus, whereas we don't see the same behaviour in the logged response times.
What did you do?
I run openresty and collect response time metrics for the following buckets
When I run
It returns the following bucket values:
If I calculate the 99th percentile from those buckets with
it returns
whereas the 98th percentile returns
It results in pretty large bumps in response times according to Prometheus as seen in this graph
Whereas data from logs doesn't show anything like it.
What did you expect to see?
Approximately the same kind of response times per percentile in Prometheus as based on logs.
I understand the bucketing in Prometheus trades accuracy for efficiency. And this might actually be accurate - with the difference coming from something else - but I want to make sure it's not a 'bug' in how we set up the buckets or calculate the percentiles.
What did you see instead? Under which circumstances?
A big difference between Prometheus percentiles and log-based percentiles.
The graphs are actually based on the following recording rules:
Looking at the following comment in the Prometheus code, the use of recording rules might have some impact. But when graphing the directly calculated 99th percentile next to the one based on the recording rule, the only difference is a shift in time, not in height.
prometheus/promql/quantile.go
Lines 110 to 141 in abf7c97
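For reference, the interpolation those lines perform can be sketched in a few lines of Python. This is a simplified illustration, not the actual Prometheus implementation, and the bucket bounds and counts below are hypothetical; it shows how p98 and p99 can land in different buckets, with linear interpolation inside a wide upper bucket producing exactly this kind of jump:

```python
import math

def histogram_quantile(q, buckets):
    """Simplified sketch of Prometheus-style bucket interpolation.

    buckets: sorted list of (upper_bound, cumulative_count) pairs,
    ending with an (+Inf, total) bucket, as in a Prometheus histogram.
    """
    total = buckets[-1][1]
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            lower = buckets[i - 1][0] if i > 0 else 0.0
            prev_count = buckets[i - 1][1] if i > 0 else 0.0
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: return the highest
                # finite bucket bound, as Prometheus does.
                return lower
            # Linear interpolation within the winning bucket.
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)

# Hypothetical distribution: most requests are fast, a few land in a
# wide upper bucket, so p98 and p99 fall in different buckets.
buckets = [(0.1, 7000), (0.5, 7400), (2.5, 7440), (10.0, 7500), (math.inf, 7500)]

p98 = histogram_quantile(0.98, buckets)  # rank 7350 -> inside the 0.5s bucket
p99 = histogram_quantile(0.99, buckets)  # rank 7425 -> inside the 2.5s bucket
print(p98, p99)
```

With these made-up buckets, p98 interpolates inside the 0.1-0.5s bucket while p99 interpolates inside the much wider 0.5-2.5s bucket, so the reported latency jumps several-fold between adjacent percentiles even though the underlying data is smooth. Wide upper buckets amplify this effect, which is one reason bucket boundaries matter so much for high quantiles.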
Environment