Proposal for improving rate/increase #3806

free opened this issue Feb 6, 2018 · 15 comments

@free (Contributor) commented Feb 6, 2018

I'm creating a separate, hopefully more focused (and civil) issue in an attempt to start a discussion on the problems (as seen by me and a number of others) with rate() and increase(), and on possible solutions.

First, let me start by acknowledging that considering Prometheus' self-imposed constraints -- in particular having foo[5m] in the context of rate(foo[5m]) only produce the data points that are strictly contained in the 5 minute range -- the current rate() and increase() implementations are essentially the most comprehensive solution for the particular problem.

That being said, I believe the core problems/limitations in the implementation stem from the constrained definition of what a time range is, and that an only slightly more generous definition (i.e. including the last point before the range) would provide a number of significant improvements. Why/how does including the last point before the range make any sense? Well, you could look at it as "the (rate of) increase between the value of the timeseries X minutes ago and now". And just as the current value of a timeseries is the last collected/evaluated point, so the value of the timeseries X minutes ago is the last collected/evaluated point at that time (i.e. the one immediately preceding the start of the range).

Or looking at it another way, if we consider a scrape interval of 1m (and, for now, no missed scrapes or series beginning/ending), a 5m rate calculated "now" should cover the last 5m (plus jitter) starting from the last point collected, just as the value of the timeseries "now" is the last point collected.
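To make this concrete, here is a minimal sketch (plain Python with made-up samples, not Prometheus code, and ignoring counter resets and extrapolation) of which samples a 5 minute window selects under the current semantics versus under the proposed "one extra point before the range" semantics, and of the raw increase each selection yields:

```python
# Hypothetical counter samples (timestamp in seconds, value), scraped every 60s.
samples = [(0, 0), (60, 10), (120, 20), (180, 30), (240, 40), (300, 50)]

def select_current(samples, start, end):
    # Current semantics: only samples whose timestamps fall inside the range.
    return [(t, v) for t, v in samples if start < t <= end]

def select_proposed(samples, start, end):
    # Proposed semantics: additionally keep the last sample at or before `start`.
    inside = select_current(samples, start, end)
    before = [(t, v) for t, v in samples if t <= start]
    return ([before[-1]] if before else []) + inside

def raw_increase(sel):
    # Difference between the first and last selected sample
    # (counter resets and extrapolation ignored for simplicity).
    return sel[-1][1] - sel[0][1]

# A 5m range evaluated "now" (t=330, half a scrape interval after the last sample).
start, end = 30, 330

print(raw_increase(select_current(samples, start, end)))   # 40: the 0 -> 10 step is not covered
print(raw_increase(select_proposed(samples, start, end)))  # 50: every increase in the last 5m is covered
```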

Moving on to the possible benefits of expanding the range to include the one extra point (in the past, not in the future):

  1. Evaluating a rate/increase over the last X minutes every X minutes without any loss of data. Currently the options are either (a) compute a rate over significantly more than X minutes (which results in rates being averaged out over unnecessarily long ranges); or (b) compute the rate over X minutes + scrape_interval, which will basically give you the actual rate over X minutes, but requires you to be consistently aware of both the evaluation and scrape intervals.

  2. rate() and increase() evaluations at the resolution of the underlying data, later aggregatable over arbitrarily long time ranges. Going back to the 1m scrape interval example, one would be able to compute a rate over 1m every minute, then, in a console/dashboard use avg_over_time() to obtain rates over any number of minutes/hours/days (and the same with increase and sum_over_time). Currently the rule of thumb is to calculate rates over 2.5-3.5x the scraping interval every 2 scraping intervals, resulting in (a) unnecessarily low resolution and (b) every resulting rate covering anywhere between 1 and 3 actual counter increases (2-4 points) with each increase arbitrarily included in either 1 or 2 rates. To be fair, the current implementation could be used to evaluate a rate every 1m, but the specified range would have to be 2m, which is (to say the least) counterintuitive.

  3. On-the-fly query_range rate/increase calculations with each increase only included in one rate/increase point and without data loss (which is what Grafana provides support for). Currently there are 2 issues that prevent this from working: (a) one needs to be aware of the scrape interval in order to bump the range by that much; and (b) neither Grafana nor Prometheus support the kind of time arithmetic necessary to query rate(foo[1h+1m]). (Not to mention the additional /(1h+1m) * 1h necessary to get a somewhat accurate increase).

  4. Finally, and this is a judgment call, one could get integer increases from integer counters. Particularly in the case of rare events, it is somewhat jarring to see increases of 1.5 or 1.2 or 2.4 requests. This would work perfectly for ranges that are multiples (including 1x) of the scraping interval. If the range is not a multiple of the scraping interval (and I am actually curious why someone would do this to begin with), then a smoothly increasing counter will result in aliasing. The alternative is to do what the current implementation does, i.e. to compute the rate first (which would be ~constant for a smoothly increasing counter) and multiply it by the range. It would yield essentially the same result for the multiple of the scrape interval case (as the difference between the timestamps of the 2 ends would be ~= to the range) and smooth (but not integer) increases for the arbitrary range case.
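To illustrate the last point with concrete numbers, here is a minimal sketch (plain Python, made-up samples; the exact extrapolation rules of the current increase() are simplified):

```python
# Hypothetical samples of an integer event counter, scraped every 60s
# (timestamp in seconds, value); only 3 events occur over these 5 minutes.
samples = [(0, 7), (60, 8), (120, 8), (180, 9), (240, 10), (300, 10)]

range_seconds = 300  # a 5m window evaluated shortly after the last scrape

# (a) Proposed: raw difference between the last sample and the sample just
#     before the range. When the range is a multiple of the scrape interval,
#     the two endpoints are ~300s apart, and the result is an integer count.
proposed_increase = samples[-1][1] - samples[0][1]        # 10 - 7 = 3

# (b) Roughly what increase() does today (extrapolation details simplified):
#     use only the samples inside the window, compute the delta over the time
#     they actually cover, then scale it up to the full range.
inside = samples[1:]                                      # samples at t=60..300
covered = inside[-1][0] - inside[0][0]                    # 240s actually covered
extrapolated_increase = (inside[-1][1] - inside[0][1]) / covered * range_seconds

print(proposed_increase)      # 3   (an integer number of events)
print(extrapolated_increase)  # 2.5 (fractional "events")
```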

To be clear, I am aware of the fact that timeseries don't come perfectly evenly spaced, that collections are missed at times and counters begin/end/restart, but I can't think of a particular case where my proposed improvement would behave differently from the current implementation. Please let me know if you think I'm missing something obvious.

Now let's move on to the downsides:

  1. This would require a change to the definition of what a range is. True, but it can (and should) be done in such a way as to provide an extra point before the actual range, which would then be ignored in the vast majority of cases. E.g. the <operation>_over_time functions should definitely not make use of this extra point, as, in my example above, avg_over_time(foo:rate_1m[5m]) already includes the increase between foo offset 5m and the first point in the range. It could even be implemented as a separate field from the list of points in the range. I understand this is a controversial change, but in my view the benefits are worth it. And it is actually more consistent with the way instant values work.

  2. Introducing yet another rate()/increase() function or replacing the existing ones and breaking backward compatibility is not ideal. Fully agreed, but again (personal opinion, confirmed by a handful of others) I believe the benefits are worth it. It could be pushed back to Prometheus 3.0 (if there is a plan for that) or hidden behind a flag or really whatever makes sense. I think it would improve the experience of lots of users, if made available in any way.

  3. One other possible argument against this approach is "if the last point before the range is almost one scrape interval away from the range, why should it count as much as the others/at all?". The exact same argument could be made against the current implementation: "what if the first point in the range is almost one scrape interval away from the start, does it make sense to extrapolate that far?". Or, looking at it another way, if it's OK for the instant value of a timeseries to be one scrape interval old, then it should be similarly OK for a rate over whatever interval to be one scrape interval old.

Thanks, and I really hope this leads to a discussion of the benefits/downsides of the proposal or, more generally, of how rate()/increase() could be improved.

@beorn7 (Member) commented Feb 7, 2018

Thanks for your attempt to take the toxicity out of the debate.
However, as said, I'm positive that @brian-brazil has seriously considered your arguments, despite the tone in the original issue. I am myself still in no position to invest the necessary time.
If you are fishing for a wider audience, the developers mailing list might be a better forum than a GitHub issue.

@free (Contributor, Author) commented Feb 7, 2018

I'm sorry to hear you don't have the time to look into this, even at a high level. I am not sure that a wider audience is what's needed here, as this is a rather contested issue (part of why the previous discussion devolved into an argument) and I'm not sure +1s and thumbs up are what's going to convince anyone to change their mind about it.

Regarding whether @brian-brazil has considered my arguments, I'm not entirely sure. Partly because the arguments I was making were spread out over a long series of long comments; and partly because the discussion we did have was entirely focused on Prometheus implementation/PromQL constraints/version compatibility rather than whether an implementation based on my proposal would actually materially improve user experience/data quality or not.

@free (Contributor, Author) commented Mar 28, 2018

I gave the prometheus-developers mailing list a try, but the thread I created there started marking all new posts as spam (mine and at least one other poster's, even though we both joined the group). As a result I'm attempting to move the discussion back here.

I'll also attempt to summarize the discussion that happened over there. I have put together a collection of screenshots at https://free.github.io/xrate/index.html, comparing rate and xrate on a couple of counters (one that is smoothly increasing and one that increases in steps -- like a low QPS server or an error counter would), plus a set of rolling restart simulations generated using @beorn7's rrsim.

I've tried 3 different variants:

  • xrate V1 does no adjustment whatsoever. It takes the difference between counter and counter offset $range (modulo any counter resets) and divides it by the difference in timestamps between the two data points.
  • xrate V2 also does no interpolation/extrapolation, but it does scale the difference between the 2 data points to the whole range: multiplied by the requested range, divided by the actually sampled range.
  • xrate V3 is very similar to rate in that it does what xrate V2 does, plus it interpolates to the left (when there exists a data point just before the range) and extrapolates to the right (under similar conditions as rate, up to 1.1x average interval away).
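For reference, here is a rough sketch (plain Python, based on my reading of the V1/V2 descriptions above rather than on the actual experimental code) of how the two simpler variants turn the selected samples -- including the one point just before the range -- into an increase/rate:

```python
def reset_adjusted_delta(samples):
    """Counter increase across `samples` (list of (timestamp, value)),
    treating any drop in value as a counter reset."""
    delta = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        delta += cur - prev if cur >= prev else cur
    return delta

def spanned(samples):
    """Time actually covered by the selected samples."""
    return samples[-1][0] - samples[0][0]

# V1: no adjustment at all.
def xincrease_v1(samples):
    return reset_adjusted_delta(samples)

def xrate_v1(samples):
    return reset_adjusted_delta(samples) / spanned(samples)

# V2: scale the delta from the time actually covered up to the requested range.
def xincrease_v2(samples, requested_range):
    return reset_adjusted_delta(samples) * requested_range / spanned(samples)

def xrate_v2(samples, requested_range):
    return xincrease_v2(samples, requested_range) / requested_range

# V3 (not shown) would additionally interpolate towards the left edge of the
# range when a sample exists just before it, and extrapolate to the right by up
# to ~1.1x the average sample interval, much like the existing rate().
```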

xrate V1 is a major departure from rate, and xrate V3 generates significant aliasing (a sawtooth pattern) with anything but a smooth counter when the series begins/ends (rate has the exact same problem).

xrate V2 is (I think) a good middle ground: even though smooth counter starts/ends produce a step pattern, I believe the only place where that is actually unwanted is on a high resolution graph. At resolutions equal to or lower than the collection interval, it makes much more sense for a rate to begin with .1, .2, .3 etc. than with .176, .276, .376, etc.

Finally, I will attempt to summarize the pluses and minuses of xrate V2 as compared to rate. I would really appreciate it if the pluses were also addressed, because thus far it has mostly been the minuses that were brought up.

  1. Minus: Documentation/maintenance/support cost of a new/additional rate() implementation.
  2. Minus: Introduction of an (internal to Prometheus) inconsistency in the definition of a range: the points that fall strictly inside the range for most functions (e.g. max_over_time()) vs. the same points plus the point just before for xrate()/xincrease(). I would, however, argue that what's important for a counter is the increase between successive values, so (conceptually) xrate()/xincrease() take as inputs all the increases that fall inside the range, in the same way that max_over_time() takes as inputs the actual gauge values in the range. I.e. a counter range should be different from a gauge range.
  3. Minus: An overhead of about 10% in CPU usage, because of the extra 5 minutes' worth of data that needs to be scanned (this is close to worst-case behavior, as I am testing with 10s ranges over 5s resolution data, so the extra range being scanned is ~95%).
  4. Neutral: The rrsim graphs show a drop in QPS during the rolling restart, of about 1/3rd of the QPS of one server. This is explained by the lack of extrapolation, but it means that the value produced by xrate V2 is actually a lower bound for the actual rate (whereas rate is neither a lower nor an upper bound due to the sawtooth aliasing). A value identical to rate can be obtained by using xrate V3 instead (with the same downsides).
  5. Plus: It is composable: i.e. avg_over_time(foo:xrate_1m[10m]) == xrate(foo[10m]) and sum_over_time(foo:xincrease_1m[10m]) == xincrease(foo[10m]), where foo:xrate_1m = xrate(foo[1m]) and foo:xincrease_1m = xincrease(foo[1m]) (see the sketch after this list).
  6. Plus: /query_range?expr=xrate(foo[60s])&step=60 includes every single increase between successive data points exactly once (the equivalent rate() expression doesn't), so e.g. a spike will never disappear upon refresh.
  7. Plus: xincrease(), while still being syntactic sugar for xrate(), is much closer to the actual increase than increase() is. xincrease() will only be off by the amount of jitter in the collection and/or eval interval, whereas increase() is off by that times (collection_interval + eval_interval) / eval_interval.
  8. Plus: No sawtooth aliasing when a series begins/ends.
  9. Plus: Much easier to understand how the result is computed. I believe the fact that no developers other than Brian and Björn are even willing to discuss this proposal is an indication that not even Prometheus developers have a good grasp of the implementation.
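To spell out the composability claim in point 5, here is a minimal sketch (plain Python, made-up per-minute samples, ignoring counter resets) of why summing 1m xincrease values over 10m gives the same result as a single 10m xincrease:

```python
# Hypothetical counter values, one sample per minute for 11 minutes (no counter
# resets). With the extra point before each range, a 1m xincrease is simply the
# difference between consecutive samples.
values = [0, 4, 9, 9, 15, 20, 22, 30, 31, 40, 47]

# foo:xincrease_1m, recorded every minute:
xincrease_1m = [cur - prev for prev, cur in zip(values, values[1:])]

# sum_over_time(foo:xincrease_1m[10m]) at the end of the series: the per-minute
# increases telescope to the difference between the first and last value.
summed = sum(xincrease_1m)

# xincrease(foo[10m]) evaluated at the same instant:
direct = values[-1] - values[0]

print(summed == direct)  # True: every scrape-to-scrape increase is counted exactly once
```
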
@roidelapluie (Member) commented Aug 19, 2018

Ping @free: #1227 might solve this.

@free (Contributor, Author) commented Aug 19, 2018

Thanks for the heads-up, @roidelapluie.

#1227 might fix this, AFAICT, at least to some extent. It wouldn't be the best solution, though: a low evaluation resolution will underestimate counter resets (as the samples right before or after the reset might not be selected), or even miss them altogether (e.g. if a counter resets twice within an interval); too high a resolution would generate unnecessary duplicate samples, thus being less efficient (and still wouldn't fully prevent the issues above).

But I'm keeping my fingers crossed, hoping that the resulting inconsistency between the 2 range syntaxes (with and without a colon) may end up convincing people that this is the common sense way of computing rates from samples.

@AKYD commented Sep 10, 2018

Wouldn't "faking" the interval in the xrate() function solve this without any changes to other Prometheus areas? Basically, if you want to do xrate[5m], just ask for a range of 5m + 50% -- so 450s -- and only keep the 5m range plus one extra point just before the range. This could be mentioned in bold in the documentation.

@free (Contributor, Author) commented Sep 10, 2018

Well, that's pretty much what it does, except it asks for 5m extra instead of 50% extra. The reason for that is that 5m is a "magical" constant for how far back Prometheus will look for a "current value" before considering a timeseries "stale".

(And, of course, there is the issue of extrapolation, which is, in my view, unnecessary in this brave new world.)

But Brian's argument is that any changes to the range will lead to inconsistencies and headache down the road.

@AKYD commented Sep 10, 2018

Well, if you just use the new range logic for the new function and don't change the range for any other functions, what's the harm? Users can then choose which rate()/increase() version to use. I personally would choose the more exact version even if it consumes some extra resources (by scanning more points than the user explicitly requested); it would also not break existing rules/dashboards.

@valyala commented Jan 26, 2019

> That being said, I believe the core problems/limitations in the implementation stem from the constrained definition of what is a time range and an only slightly more generous definition (i.e. including the last point before the range) would provide a number of significant improvements.

FYI, VictoriaMetrics takes into account the previous point before the range window when implementing all the PromQL functions accepting range vectors. This successfully resolves the problems outlined in the initial post by @free.
Additionally, VM automatically adjusts the range window to the scrape interval if the window is smaller than the interval. This resolves the issue with graphs disappearing when fewer than two points fall into the range window, as described at grafana/grafana#11451.

@JohannesRudolph commented Apr 16, 2019

I've been following this discussion and its tangents (e.g. #1227) for quite a while now. I believe that there's a lot of merit in providing an "intuitive" graphing experience for users using Prometheus counters + rate() + Grafana. PromQL's current evaluation model for rate() may be a good trade-off for a metrics/sampling-based monitoring system with its inherent observation anomalies; it provides a consistent query model and produces correct results on average.

But it doesn't match the "intuitive" expectation of a user who graphs a counter, then graphs the corresponding rate(), and realizes the two don't match up because rate() misses increases between step windows, like in this example:

[screenshot omitted: a counter graphed alongside its rate(), with increases between step windows missing from the rate]

@brian-brazil proposed that the PromQL client is the right place to solve this. To this extent, I've proposed grafana/grafana#16612 in Grafana. Happy to hear your thoughts on this.

@brian-brazil (Member) commented Apr 16, 2019

That's a different issue than what's being discussed here. That's basically re-implementing increase in Grafana, but without reset handling or aggregation.

@JohannesRudolph commented Apr 16, 2019

> That's a different issue than what's being discussed here. That's basically re-implementing increase in Grafana, but without reset handling or aggregation.

Not quite, @brian-brazil: resets and aggregation are still handled by Prometheus with a sum(metric) by (a, b) style query on the metric. Sorry if that wasn't clear from my post.

@free (Contributor, Author) commented Apr 17, 2019

@brian-brazil is right to point out that reset handling and aggregation could not possibly be handled in Grafana, because the query output is presumably already aggregated -- e.g. sum by(job) (cpu_utilization) -- and you'd need to do the reset handling on each counter before aggregating for it to work at all. (Hence the recommendation to do sum(rate()) rather than rate(sum()), which PromQL mostly prevents you from doing anyway).

Which makes even more puzzling the fact that Prometheus doesn't give you this out of the box (without requiring you to write slow, fragile PromQL queries).
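To make the reset-handling point above concrete, here is a small illustration (plain Python, made-up numbers) of why resets have to be handled per series, before aggregation:

```python
# Two hypothetical counters scraped at the same instants; counter b restarts
# (drops from 70 to 5) between the 3rd and 4th scrape.
a = [100, 110, 120, 130, 140]
b = [50, 60, 70, 5, 15]

def increase_with_resets(series):
    """Counter increase over the series, treating any drop as a counter reset."""
    total = 0
    for prev, cur in zip(series, series[1:]):
        total += cur - prev if cur >= prev else cur
    return total

# Reset handling per series, then aggregation (what sum(rate(...)) effectively does):
per_series = increase_with_resets(a) + increase_with_resets(b)    # 40 + 35 = 75

# Aggregation first, reset handling afterwards (what a rate over an already
# aggregated series would have to do):
summed = [x + y for x, y in zip(a, b)]           # [150, 170, 190, 135, 155]
aggregated_first = increase_with_resets(summed)  # the 190 -> 135 dip is misread
                                                 # as a full reset: 20+20+135+20 = 195

print(per_series, aggregated_first)  # 75 vs. 195: aggregating first badly overcounts
```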

@JohannesRudolph commented Apr 18, 2019

My bad -- I was somehow under the false impression that sum() would already handle counter resets for me, as it's already aggregating over targets.

So, in summary, these are the options and trade-offs:

  • use sum( rate(metric[$__interval]) ) by (x): handles resets correctly and gives correct results on average, but misses increases between step windows. Probably useful if you have lots of targets, a high reset probability, and low-value individual counter increases (i.e. missing one in the graph doesn't hurt).
  • use something like sum( metric ) by (x) - sum( metric offset $__interval ) by (x): doesn't handle resets, but correctly accounts for increases between step windows. Probably useful if you have few targets, a low reset probability, and high-value individual counter increases (i.e. missing one in the graph does hurt).

There appears to be no good middle ground or solution for a graphing scenario when my requirements do not align neatly with this model (i.e. high reset probability and high-value individual counter increases). I realize there's the position that this should be handled by a different stack altogether (you want log monitoring...), but that's kind of a bummer when Prometheus + Grafana already give me 98% of what I need and I just need this one tiny thing (a "continuous" rate that doesn't skip samples between query step windows) to get what I want.

Btw., maybe "continuous rate" / crate() could be an alternative name for xrate().

@brycedrennan commented Feb 6, 2020

@free @JohannesRudolph I think I found a formula that addresses both resets and increases between windows. Let me know if you see any issues with this:

sum(clamp_min(metric - (metric offset $Interval or (metric * 0)), 0)) by (x, y)