Force buckets in a histogram to be monotonic for quantile estimation #2610
The assumption that bucket counts increase monotonically with increasing upper bound may be violated, for example when histogram_quantile is evaluated in a recording rule or over federated bucket timeseries.
This is because scraped data is not made available to RR evaluation or federation atomically, so some buckets are "older" than others at the time the quantile is computed.
Monotonicity is usually guaranteed because if a bucket with upper bound u1 has count c1, then any bucket with a higher upper bound must have counted at least those same c1 observations.
Randomly interspersed partial sampling breaks that guarantee, and rate() applied to the bucket timeseries can amplify the resulting deviations.
bucketQuantile depends on that monotonicity to do a binary search for the bucket containing the requested quantile, so non-monotonic counts leave its result undefined.
As a somewhat hacky solution until the Prometheus project is ready to make scraped data available atomically, this change forces the bucket counts to be monotonically non-decreasing before the quantile is estimated.
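To make that step concrete, here is a minimal sketch in Go (my illustration under the assumptions above, not the exact patch; the type and function names are mine): walk the buckets in order of ascending upper bound and replace the counts with their running maximum.

```go
package main

import "fmt"

type bucket struct {
	upperBound float64 // the "le" label: inclusive upper bound
	count      float64 // cumulative observation count
}

// ensureMonotonic raises each bucket's count to at least the count of the
// bucket before it, i.e. it replaces the series with its running maximum,
// so the cumulative counts are non-decreasing again before estimation.
func ensureMonotonic(buckets []bucket) {
	max := buckets[0].count
	for i := 1; i < len(buckets); i++ {
		if buckets[i].count > max {
			max = buckets[i].count
		} else {
			buckets[i].count = max
		}
	}
}

func main() {
	// le=2000 was scraped "older" than le=1000, so its count lags behind.
	b := []bucket{{1000, 10}, {2000, 7}, {4000, 12}}
	ensureMonotonic(b)
	fmt.Println(b) // [{1000 10} {2000 10} {4000 12}]
}
```

The running maximum never lowers a count, so buckets that were already consistent are left untouched.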
I can't see any significant downsides, as histograms are currently unusable for us and this will only help. It'd be great if this could be included in 1.6 if accepted.
> On Tue, 11 Apr 2017, Julius Volz wrote: The idea generally seems sensible to me, but I'll let the guardians of the histograms (@beorn7 and @brian-brazil) judge this.
If I take the bucket counts and plot them as a cumulative distribution (plot not reproduced in this excerpt), the red line represents where 0.99 of the observations fall. It should intersect the CDF curve at the bucket containing the 0.99 quantile. As you can see, we have multiple choices, and we use a binary search to find the bucket, so the binary search may find and return any of the candidate bucket boundaries.
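To illustrate what that search looks like, here is a simplified sketch (assumed, loosely modeled on the behavior described above; not the actual bucketQuantile code): with monotonic cumulative counts the 0.99 line crosses the curve once and the binary search finds that crossing, but with non-monotonic counts the search's precondition is violated and any crossing may be returned.

```go
package main

import (
	"fmt"
	"sort"
)

type bucket struct {
	upperBound float64 // the "le" label: inclusive upper bound
	count      float64 // cumulative observation count
}

// quantileBucket returns the index of the first bucket whose cumulative
// count reaches q times the total observation count. sort.Search assumes
// the predicate flips from false to true exactly once, which only holds
// when the counts are monotonic.
func quantileBucket(q float64, buckets []bucket) int {
	rank := q * buckets[len(buckets)-1].count
	return sort.Search(len(buckets), func(i int) bool {
		return buckets[i].count >= rank
	})
}

func main() {
	// Monotonic counts: the 0.99 line crosses the curve exactly once.
	mono := []bucket{{0.1, 50}, {0.5, 90}, {1, 99}, {5, 100}}
	fmt.Println(quantileBucket(0.99, mono)) // 2

	// Non-monotonic counts (e.g. rate() over a partial scrape): the line
	// crosses several times, and the result is unspecified by contract.
	broken := []bucket{{0.1, 50}, {0.5, 120}, {1, 80}, {5, 100}}
	fmt.Println(quantileBucket(0.99, broken)) // some index; unreliable
}
```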
We've run this code in production for a week at this point. Yup, we know the data is still less than accurate when this case happens. However, this has removed a lot of the noisy, very high peaks from the graphs. So we get a much better idea of the trend of the data.
I've only seen @brian-brazil comment on this at https://groups.google.com/d/msg/prometheus-developers/JgWahlzEhl4/E9AF-cB8AgAJ.
Let's review my current understanding:
I don't know what's worse. I'm leaning towards accepting it. @brian-brazil are there more considerations to this than my thoughts above?
One extreme case would, for example, be: in a non-atomic scrape, the first bucket is already at 1000 while all the others are still at an old value of 0, although their new values would really be high. Then all buckets would get the value 1000 assigned, and it would look like all requests fell into the first bucket, so you get an erroneously low 99th percentile.
Or the reverse: only the last bucket is already updated to 1000 and the others are still at 0, although in reality there were more requests in those lower buckets. Then the quantile calculation will yield much larger values than it's supposed to.
But it's not clear to me that either of those is worse than the current behavior...
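To put numbers on those two cases (a hypothetical sketch of the envelope idea, using made-up bucket counts):

```go
package main

import "fmt"

// envelope returns the monotonically non-decreasing running maximum of the
// cumulative bucket counts, as the patch proposes.
func envelope(counts []float64) []float64 {
	out := make([]float64, len(counts))
	max := 0.0
	for i, c := range counts {
		if c > max {
			max = c
		}
		out[i] = max
	}
	return out
}

func main() {
	// Case 1: only the first bucket was updated (to 1000); the rest are
	// stale at 0. The envelope pins every bucket to 1000, so the 0.99
	// quantile appears to fall into the first bucket: erroneously low.
	fmt.Println(envelope([]float64{1000, 0, 0, 0})) // [1000 1000 1000 1000]

	// Case 2: only the last bucket was updated. The series is already
	// monotonic, so the envelope changes nothing and the quantile lands
	// in the last bucket: erroneously high.
	fmt.Println(envelope([]float64{0, 0, 0, 1000})) // [0 0 0 1000]
}
```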
Actually, the second case can already happen with the current implementation.
So all the new behavior adds is that we can calculate quantiles that are way too low. The current behavior just produces nondeterministic gibberish.
Given the following reasons, I'm for including this in 1.6:
I read @brian-brazil's latest comment as "this is at least not making things worse". Since @juliusv is for merging it, and my own look at it (and somewhat superficial analysis) makes me think we should do this, let's get it into 1.6.
I have a few nits, though; see comments. (@juliusv will be proud of me, I hope…)