
Implement `quantile` and `quantile_over_time`. #552

Closed
beorn7 opened this Issue Feb 23, 2015 · 18 comments

Comments

beorn7 (Member) commented Feb 23, 2015

@beorn7 beorn7 self-assigned this Mar 3, 2015
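
For context, a sketch of how the two requested functions would be used, based on the form they eventually took in PromQL (the metric name is made up for illustration):

    # Aggregation across series at one instant: the 90th percentile of the
    # current values of some_metric, grouped by job.
    quantile(0.9, some_metric) by (job)

    # Aggregation over time: the 90th percentile of all samples of each series
    # within the last 5 minutes.
    quantile_over_time(0.9, some_metric[5m])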

beorn7 (Member, Author) commented Jun 18, 2015

@matthiasr just reported a use case for this. (Just noting it here to get an impression of how much this feature is actually needed...)

beorn7 (Member, Author) commented Jul 15, 2015

More requests for this have reached me. Since it is easy to implement, we should add it.

juliusv (Member) commented Aug 1, 2015

Just got one more request for this by email.

mwitkow (Contributor) commented Aug 3, 2015

Yeah, it'd be great for us for federation. Buckets and bucket-related metadata are easily aggregatable (just sum them) and allow you to infer rich statistics across datasets coming from multiple scrapers, e.g. a bucketed dataset of process_cpu_seconds_total:rate1m.
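
As an illustration of that aggregation pattern (with a hypothetical histogram metric, not one from this thread): sum the per-scraper bucket rates first, then derive a quantile from the aggregated buckets.

    # Sum the bucket rates across all scrapers/instances, then compute the
    # 90th percentile from the aggregated buckets (metric name is hypothetical).
    histogram_quantile(
      0.9,
      sum by (le) (rate(rpc_duration_seconds_bucket[1m]))
    )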

brian-brazil (Member) commented Jun 29, 2016

For quantile_over_time there's the question of whether we should use the timestamps to weight data points; a similar argument applies to avg_over_time. Thoughts?

matthiasr (Contributor) commented Jun 30, 2016

beorn7 (Member, Author) commented Jun 30, 2016

I assume you mean that samples with a larger gap to their neighbouring samples should get a higher weight?

brian-brazil (Member) commented Jun 30, 2016

Yes, that's the question.

beorn7 (Member, Author) commented Jun 30, 2016

Sounds good to me, but it might be tricky to implement at the start and end of a series. I guess we need a heuristic similar to the one used for the rate function.

brian-brazil (Member) commented Jun 30, 2016

Previously you were against such things for sum_ and count_over_time. Did something make you change your mind?

beorn7 (Member, Author) commented Jul 1, 2016

The use cases I have seen so far for count_over_time were all "I want to know how many samples I have scraped in that interval", so quite explicitly not a weighted average or anything like it. As @matthiasr said, our current implementation of avg_over_time is more like avg_of_samples_in_interval, which might also be something people want. For sum_over_time, I could also see both needs, i.e. a real integral over the interval, or just the plain sum of all samples.

But frankly, looking at the SoundCloud use case, there are too few uses of *_over_time in our codebase to make a clear call about what's needed most of the time. (I can only find two right now, and both are pretty hackish.)

A different aspect of the rate vs. *_over_time issue:

In the rate case, the critical point was how to deal with the start and end of a series, because the length of the time interval is in the denominator. In the *_over_time functions, the number of samples is in the denominator, so from that perspective we don't have the start-and-end problem. The problem here is irregularly spaced samples between the first and the last sample in the interval, which in turn is not a problem at all for rate and friends. Only once we introduce different weights do we run into the problem of finding the beginning of a series, and yes, then I'm all for applying the same heuristics.

Conclusion: I think we need to go back to the use cases for the *_over_time functions to find out whether we generally want "weighted over time" or "over samples in the time interval". In the worst case, we need both…
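
To make the two candidate semantics explicit (my notation, not from the thread): given samples v_1, …, v_n in the interval,

    \text{over samples:}\qquad \frac{1}{n}\sum_{i=1}^{n} v_i

    \text{weighted over time:}\qquad \frac{\sum_{i=1}^{n} w_i\, v_i}{\sum_{i=1}^{n} w_i},
    \qquad w_i = \text{span of time attributed to sample } i

The current avg_over_time computes the former; the weighting question above is about whether to also (or instead) offer the latter.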

matthiasr (Contributor) commented Jul 1, 2016

One major use case for avg_over_time right now is calculating availability according to an SLO metric.

We have a recording rule that, for every system, records whether it is "available" (within acceptable latency and error bounds). From this we compute the availability over a time period using avg_over_time. The current behaviour is acceptable because the rule-evaluation interval is mostly constant, but this really calls for a weighted average.

matthiasr (Contributor) commented Jul 1, 2016

For clarity, the "is available" metric is 0/1-valued, so avg_over_time gives us the straight-up "fraction of time it was available".
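
A minimal sketch of that setup, with invented metric and rule names (the real rules live in another repo):

    # Hypothetical recording rule expression, recorded as
    # instance:availability:bool at the regular rule-evaluation interval.
    # The `bool` modifier makes the comparison return 0 or 1 instead of filtering.
    instance:request_latency_seconds:p99_5m < bool 0.5

    # Availability over the last day: the average of the 0/1 series, i.e. the
    # fraction of rule evaluations in which the instance was "available".
    avg_over_time(instance:availability:bool[1d])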

brian-brazil (Member) commented Jul 1, 2016

The other question is whether the two methods would produce notably different results in standard use cases. For a small gap, weighting is unlikely to make much of a difference over a day or week, whereas the semantics to use for, say, an hour-long gap will depend on the use case. This may be a case where going with simplicity is the best option.

beorn7 (Member, Author) commented Jul 1, 2016

@matthiasr Good point with our SLO calculation. The query is in a different repo, and that's why I didn't find it in my little survey. That's probably the most legitimate use case, and you are right, if rule evaluations were irregular, a weighted average would be in order.

@brian-brazil Indeed, I assume the results wouldn't differ much in the regular case, since a constant scrape and rule-evaluation interval is the norm. One might argue that any irregularity could be seen as "no data", so that the data around a gap should not get more weight.

Let's expand on that:

Imagine we have a 1h gap in our availability metric (which is 0 or 1, depending on whether a microservice considers itself within its SLO). The 1h gap was created because the collecting Prometheus server was down. If our query now averages over the whole day (to get the availability for that day), we currently simply disregard the 1h gap. With weighting of the samples, the last sample before the gap and the first sample after the gap would each get the weight of half an hour, while all the other samples only have the weight of the usual rule-evaluation interval (e.g. 30s). If one of those two samples happens to be exceptional (like the only 0 among otherwise all 1s), it would bias the result.
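
To put rough numbers on that (my arithmetic, assuming a 30s evaluation interval, a day of 86400s, and the two border samples each absorbing half an hour of the gap): the 23 covered hours contain about 2760 samples, so if the only 0 is a sample bordering the gap,

    \text{unweighted:}\quad \frac{2759}{2760} \approx 0.9996 \quad (\approx 99.96\%\ \text{available})

    \text{weighted:}\quad \frac{86400 - 1800}{86400} \approx 0.979 \quad (\approx 97.9\%\ \text{available})

The single exceptional sample next to the gap drags the weighted result down by about two percentage points, while it barely registers in the unweighted one.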

The opposite case would be a change of the rule-evaluation interval during the averaging window, e.g. 1m until noon, 30s after noon. Now it would be sane to give the samples before noon double weight in the average.

A possible approach to cover both cases would be to apply the start/stop heuristics within the interval, too, i.e. treat a single longer gap as a real gap while compensating for changes in sample spacing that look more systematic. But that would add a lot of complexity.

At this point, I tend towards simplicity, i.e. declaring any changes in sample spacing as irregularities that the *_over_time functions will not try to compensate for. In other words: let's not do weighting.

brian-brazil (Member) commented Jul 1, 2016

I'm for simplicity here too.

beorn7 (Member, Author) commented Jul 1, 2016

Sounds like we have agreement. I have created prometheus/docs#470 to clarify the behavior in the docs.
