Proposal for improving rate/increase #3806
Comments
Thanks for your attempt to take the toxicity out of the debate.
I'm sorry to hear you don't have the time to look into this, even at a high level. I am not sure that a wider audience is what's needed here, as this is a rather contested issue (part of why the previous discussion devolved into an argument) and I'm not sure +1s and thumbs up are what's going to convince anyone to change their mind about it. Regarding whether @brian-brazil has considered my arguments, I'm not entirely sure. Partly because the arguments I was making were spread out over a long series of long comments; and partly because the discussion we did have was entirely focused on Prometheus implementation/PromQL constraints/version compatibility, rather than on whether an implementation based on my proposal would actually materially improve user experience/data quality or not.
I gave the prometheus-developers mailing list a try, but the thread I created there started marking all new posts as spam (mine and at least one other poster's, even though we both joined the group). As a result I'm attempting to move the discussion back here. I'll also attempt to summarize the discussion that happened over there. I have put together a collection of screenshots at https://free.github.io/xrate/index.html , comparing the 3 different variants I've tried:
Finally, I will attempt to summarize the pluses and minuses of […]
free referenced this issue on Apr 3, 2018: [Feature request] $__interval in Prometheus should never be less than double scrape interval #11451 (Open)
free referenced this issue on Apr 11, 2018: rate()/increase() extrapolation considered harmful #3746 (Closed)
Thanks for the heads-up, @roidelapluie. #1227 might fix this, AFAICT, at least to some extent. It wouldn't be the best solution, though: a low evaluation resolution will underestimate counter resets (as the samples right before or after the reset might not be selected), or even miss them altogether (e.g. if a counter resets twice within an interval); too high a resolution would generate unnecessary duplicate samples, thus being less efficient (and still wouldn't fully prevent the issues above). But I'm keeping my fingers crossed, hoping that the resulting inconsistency between the 2 range syntaxes (with and without a colon) may end up convincing people that this is the common-sense way of computing rates from samples.
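To make the "missed resets at low resolution" point concrete, here is a hedged Python sketch (illustrative only, with made-up sample values; not Prometheus code) showing how selecting samples at a coarse evaluation resolution can make two counter resets collapse into one:

```python
# Hypothetical sketch: how coarse sample selection can miss a counter reset.
def count_resets(samples):
    """Count value drops (counter resets) in an ordered list of samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)

# Raw scrapes every 15s: the counter resets twice within one minute.
raw = [100, 110, 5, 12, 3, 9]

# Keeping only every 4th sample (i.e. a 1m evaluation resolution) leaves
# [100, 3]: the two resets collapse into what looks like a single drop.
coarse = raw[::4]

print(count_resets(raw))     # 2 resets visible at full resolution
print(count_resets(coarse))  # only 1 detected -- one reset is missed
```

The increase lost to the missed reset can never be recovered downstream, which is why the comment above argues resolution only mitigates, and does not fully prevent, the problem.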
AKYD commented on Sep 10, 2018:
Wouldn't "faking" the interval in the […]
Well, that's pretty much what it does, except it asks for 5m extra instead of 50% extra. The reason for that is that 5m is a "magical" constant for how far back Prometheus will look for a "current value" before considering a timeseries "stale". (And, of course, there is the issue of extrapolation, which is, in my view, unnecessary in this brave new world.) But Brian's argument is that any changes to the range will lead to inconsistencies and headache down the road.
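The lookback arithmetic described above can be sketched as follows (an assumed simplification; the function name is hypothetical, but the 5m value matches Prometheus' default staleness window mentioned in the comment):

```python
# Sketch of the idea above: to reliably capture the sample immediately
# preceding a range, extend the fetch window by Prometheus' 5m staleness
# delta rather than by a percentage of the range.
STALENESS = 300  # seconds; Prometheus' default staleness window

def lookback_window(range_s):
    """Seconds of data to fetch so the point just before the range
    is included, regardless of how long the range itself is."""
    return range_s + STALENESS

print(lookback_window(60))    # 1m range  -> fetch 360s (6m) of samples
print(lookback_window(3600))  # 1h range  -> fetch 3900s (65m) of samples
```

The advantage of a fixed pad over a percentage is that short ranges still get enough slack to find the previous sample, while long ranges aren't inflated by 50%.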
AKYD commented on Sep 10, 2018:
Well, if you just use the new range logic for the new function and don't change the range for any other functions, what's the harm? Users can then choose which rate()/increase() version to use. I personally would choose the more exact version even if it consumes some extra resources (by requesting more points than the user actually asked for); it would also not break existing rules/dashboards.
valyala commented on Jan 26, 2019 (edited):
FYI, VictoriaMetrics takes into account the previous point before the range window when implementing all the PromQL functions accepting range vectors. This successfully resolves the problems outlined in the initial post by @free.
free referenced this issue on Feb 2, 2019: Prometheus query: query step is bound by min interval #14209 (Open)
JohannesRudolph referenced this issue on Apr 16, 2019: [Feature request] Delta "Mode Option" for graphs #16612 (Open)
JohannesRudolph commented on Apr 16, 2019:
I've been following this discussion and its tangents (e.g. #1227) for quite a while now. I believe there's a lot of merit in providing an "intuitive" graphing experience for users graphing Prometheus counters. The current behavior doesn't match the "intuitive" expectation of a user who graphs a counter, then graphs the corresponding rate() and realizes the two don't match up, because rate() misses increases between step windows (as in the example screenshot, not reproduced here). @brian-brazil proposed that the PromQL client is the right place to solve this. To this extent, I've proposed grafana/grafana#16612 in Grafana. Happy to hear your thoughts on this.
That's a different issue than what's being discussed here. That's basically re-implementing […]
JohannesRudolph commented on Apr 16, 2019 (edited):
Not quite, @brian-brazil: resets and aggregation are still handled by Prometheus with a […]
@brian-brazil is right to point out that reset handling and aggregation could not possibly be handled in Grafana, because the query output is presumably already aggregated (e.g. […]). Which makes it even more puzzling that Prometheus doesn't give you this out of the box (without requiring you to write slow, fragile PromQL queries).
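To see why reset handling has to happen on the raw counter samples, before any aggregation, here is a hedged Python sketch (illustrative logic with made-up values, not the actual PromQL implementation): a Prometheus-style counter only ever increases, so a drop in value signals a process restart, and the correction needs the individual samples around the drop.

```python
# Sketch: computing a counter's total increase while compensating for
# resets. Once samples are aggregated away, this correction is impossible.
def increase_with_resets(samples):
    """Total increase over an ordered sample window, reset-corrected."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: assume it restarted from 0, so the whole
            # current value counts as new increase.
            total += cur
    return total

samples = [100, 150, 20, 60]  # reset between 150 and 20
print(increase_with_resets(samples))  # 50 + 20 + 40 = 110.0
```

A naive `last - first` over the same window would give 60 - 100 = -40, which is why a client that only sees query output cannot reconstruct the true increase.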
JohannesRudolph commented on Apr 18, 2019:
My bad - I was somehow under the false impression that […]. So in summary, there are the following options & tradeoffs:
There appears to be no good middle ground or solution for a graphing scenario where, as in my case, requirements do not align neatly with this model (i.e. high reset probability and high-value individual counter increases). I realize there's the position that this should be handled by a different stack altogether ("you want log monitoring..."), but that's kind of a bummer when Prometheus + Grafana already give me 98% of what I need and I just need this one tiny thing (a "continuous" rate that doesn't skip samples between query step windows) to get what I want. Btw. maybe "continuous rate" / […]
free commented on Feb 6, 2018 (edited):
I'm creating a separate, hopefully more focused (and civil) issue in an attempt to start a discussion on the problems (as seen by me and a number of others) with, and possible solutions for, `rate()` and `increase()`.

First, let me start by acknowledging that considering Prometheus' self-imposed constraints -- in particular having `foo[5m]` in the context of `rate(foo[5m])` only produce the data points that are strictly contained in the 5 minute range -- the current `rate()` and `increase()` implementations are essentially the most comprehensive solution for the particular problem.

That being said, I believe the core problems/limitations in the implementation stem from the constrained definition of what a time range is, and that an only slightly more generous definition (i.e. including the last point before the range) would provide a number of significant improvements. Why/how does including the last point before the range make any sense? Well, you could look at it as "the (rate of) increase between the value of the timeseries X minutes ago and now". And just as the current value of a timeseries is the last collected/evaluated point, so the value of the timeseries X seconds ago is the last collected/evaluated point at that time (i.e. the one immediately preceding the start of the range).
Or looking at it another way, if we consider a scrape interval of 1m (and, for now, no missed scrapes or series beginning/ending), a 5m rate calculated "now" should cover the last 5m (plus jitter) starting from the last point collected, just as the value of the timeseries "now" is the last point collected.
Moving on to the possible benefits of expanding the range to include the one extra point (in the past, not in the future):
1. Evaluating a rate/increase over the last X minutes every X minutes without any loss of data. Currently the options are either (a) compute a rate over significantly more than X minutes (which results in rates being averaged out over unnecessarily long ranges); or (b) compute the rate over `X minutes + scrape_interval`, which will basically give you the actual rate over X minutes, but requires you to be consistently aware of both the evaluation and scrape intervals.

2. `rate()` and `increase()` evaluations at the resolution of the underlying data, later aggregatable over arbitrarily long time ranges. Going back to the 1m scrape interval example, one would be able to compute a rate over 1m every minute, then, in a console/dashboard, use `avg_over_time()` to obtain rates over any number of minutes/hours/days (and the same with `increase` and `sum_over_time`). Currently the rule of thumb is to calculate rates over 2.5-3.5x the scrape interval every 2 scrape intervals, resulting in (a) unnecessarily low resolution and (b) every resulting rate covering anywhere between 1 and 3 actual counter increases (2-4 points), with each increase arbitrarily included in either 1 or 2 rates. To be fair, the current implementation could be used to evaluate a rate every 1m, but the specified range would have to be 2m, which is (to say the least) counterintuitive.

3. On-the-fly `query_range` rate/increase calculations with each increase only included in one rate/increase point and without data loss (which is what Grafana provides support for). Currently there are 2 issues that prevent this from working: (a) one needs to be aware of the scrape interval in order to bump the range by that much; and (b) neither Grafana nor Prometheus support the kind of time arithmetic necessary to query `rate(foo[1h+1m])`. (Not to mention the additional `/ (1h+1m) * 1h` necessary to get a somewhat accurate increase.)

4. Finally, and this is a judgment call, one could get integer increases from integer counters. Particularly in the case of rare events, it is somewhat jarring to see increases of 1.5 or 1.2 or 2.4 requests. This would work perfectly for ranges that are multiples (including 1x) of the scrape interval. If the range is not a multiple of the scrape interval (and I am actually curious why someone would do this to begin with), then a smoothly increasing counter will result in aliasing. The alternative is to do what the current implementation does, i.e. compute the rate first (which would be ~constant for a smoothly increasing counter) and multiply it by the range. It would yield essentially the same result for the multiple-of-the-scrape-interval case (as the difference between the timestamps of the 2 ends would be ~= the range), and smooth (but not integer) increases for the arbitrary range case.
To be clear, I am aware of the fact that timeseries don't come perfectly evenly spaced, that collections are missed at times and counters begin/end/restart, but I can't think of a particular case where my proposed improvement would behave differently from the current implementation. Please let me know if you think I'm missing something obvious.
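As a concrete illustration of the benefits above, here is a hedged Python sketch (assumed sample data; simplified logic that omits Prometheus' boundary clamping; not the actual implementation) contrasting the current extrapolating `increase()` with the proposed variant that also uses the last sample before the range:

```python
# Illustrative sketch, NOT Prometheus source. A counter scraped every 60s
# increments by exactly 1, once, mid-window (a "rare event").
samples = list(zip(range(0, 301, 60), [0, 0, 0, 1, 1, 1]))  # (ts, value)

def increase_current(samples, start, end):
    """Simplified current behavior: only samples strictly inside
    [start, end], with the observed delta extrapolated to the full range."""
    inside = [(t, v) for t, v in samples if start <= t <= end]
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    return (v1 - v0) * (end - start) / (t1 - t0)

def increase_proposed(samples, start, end):
    """Proposed behavior: the delta from the last sample before `start`
    to the last sample in the range, with no extrapolation."""
    before = [v for t, v in samples if t < start]
    inside = [v for t, v in samples if start <= t <= end]
    v0 = before[-1] if before else inside[0]
    return inside[-1] - v0

# A 5m (300s) window whose start falls between scrapes:
print(increase_current(samples, 30, 330))   # 1.25 -- extrapolated, non-integer
print(increase_proposed(samples, 30, 330))  # 1 -- the exact integer increase
```

Neither function handles counter resets, missed scrapes or series churn; the point is only to show that, on clean integer data, the extra-point variant returns the exact increase where extrapolation returns a fractional one.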
Now let's move on to the downsides:
1. This would require a change to the definition of what a range is. True, but it can (and should) be done in such a way as to provide an extra point before the actual range, which would then be ignored in the vast majority of cases. E.g. the `<operation>_over_time` functions should definitely not make use of this extra point, as in my example above `avg_over_time(foo:rate_1m[5m])` would already include the increase between `foo offset 5m` and the first point in the range. It could even be implemented as a separate field from the list of points in the range. I understand this is a controversial change, but in my view the benefits are worth it. And it is actually more consistent with the way instant values work.

2. Introducing yet another `rate()`/`increase()` function, or replacing the existing ones and breaking backward compatibility, is not ideal. Fully agreed, but again (personal opinion, confirmed by a handful of others) I believe the benefits are worth it. It could be pushed back to Prometheus 3.0 (if there is a plan for that) or hidden behind a flag or really whatever makes sense. I think it would improve the experience of lots of users, if made available in any way.

3. One other possible argument against this approach is "if the last point before the range is almost one scrape interval away from the range, why should it count as much as the others/at all?". The exact same argument could be made against the current implementation: "what if the first point in the range is almost one scrape interval away from the start, does it make sense to extrapolate that far?". Or, looking at it another way, if it's OK for the instant value of a timeseries to be one scrape interval old, then it should be similarly OK for a rate over whatever interval to be one scrape interval old.
Thanks, and I really hope this leads to a discussion of the benefits/downsides of the proposal or, more generally, of how `rate()`/`increase()` could be improved.