promql: Remove extrapolation from rate/increase/delta. #1161
This is an improvement from both a theoretical and practical standpoint.

Theory: For simplicity I'll take the example of increase(). Counters are fundamentally lossy: if there's a counter reset or instance failure between one scrape and the next, we'll lose information about the period up to the reset/failure. Thus the samples we have allow us to calculate a lower bound on the increase() in a time period. Extrapolation multiplies this by an amount depending on timing, which is an upper bound based on what might have happened if everything continued as it was. This mix of upper and lower bounds means that we can't make any meaningful statements about the output of increase() in relation to what actually happened. By removing the extrapolation, we can once again reason that the result of increase() is a lower bound on the true increase of the counter.

Practical: Fixes #581. The extrapolation presumes that it's looking at a range within a continuous stream of samples. If in fact the time series starts or ends within this range, this leads to an over-correction.

For discrete counters and gauges, extrapolating to invalid values in their domain can be confusing and prevent rules being written that depend on exact values.

For those looking to graph things more accurately, irate() is a better choice than extrapolation on rate(). For those looking to calculate how a gauge is trending, deriv() is a better choice than delta().
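To make the lower-bound reasoning concrete, here is a minimal standalone sketch (not the actual promql implementation; the sample values and helper names are made up) of the non-extrapolated calculation: sum the observed increases, treat any drop in value as a counter reset, and divide by the full range. Whatever happened in the unobserved parts of the range can only add to the true increase, so the result is a lower bound.

```go
package main

import "fmt"

// sample is a simplified (timestamp in seconds, counter value) pair.
type sample struct {
	t, v float64
}

// increaseLowerBound returns the observed increase of a counter over the
// samples that fall inside the range, with counter resets handled the way
// rate()/increase() handle them: a drop in value is treated as a reset and
// the post-reset value counts as increase since the reset.
func increaseLowerBound(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	inc := 0.0
	for i := 1; i < len(samples); i++ {
		d := samples[i].v - samples[i-1].v
		if d < 0 {
			d = samples[i].v // reset: we only see what accumulated since the reset
		}
		inc += d
	}
	return inc
}

func main() {
	const rangeSeconds = 60.0
	// Samples inside a 60s range, scraped every 15s, with one counter reset
	// (the drop from 150 to 10). They span only 45s of the 60s range.
	samples := []sample{{5, 90}, {20, 120}, {35, 150}, {50, 10}}

	inc := increaseLowerBound(samples) // 30 + 30 + 10 = 70
	fmt.Printf("increase >= %g, rate >= %g/s\n", inc, inc/rangeSeconds)
	// Any increase lost around the reset or outside the observed samples can
	// only make the true values larger, so these are lower bounds.
}
```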
I like your arguments, but this is a major breaking change in what
As an example, if I scrape a counter every 15s and do a
Now yes, we can tell people to use
If that's the case we can't sell what we currently have as increase per second either. What we currently have can be off by anywhere between a scrape interval's and the whole range's worth of data.
Without extrapolation it can only be off by up to a scrape interval. This is a much smaller discrepancy, and is meaningful without having to look at the data.
Taking your example, the current rate can produce 55/s when the true value is 13.75/s. This is not more accurate.
Do they understand the current behaviour can be off by between 25% and 300% in that case?
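For anyone wanting to check those figures, here is a rough reconstruction (the concrete series below is illustrative; only the 13.75/s and 55/s values come from the example above) of how a 15s scrape interval and a 1m range produce them: with a continuous series the non-extrapolated rate is at most 25% low, while a series that only exists for the last 15s of the window makes the extrapolated rate 4x (300%) too high.

```go
package main

import "fmt"

func main() {
	const rangeSec = 60.0 // rate(...[1m])
	const interval = 15.0 // scrape interval

	// Case 1: continuous series increasing at 13.75/s. The samples in the
	// window span at least (rangeSec - interval) = 45s, so the
	// non-extrapolated rate is at least 45/60 = 75% of the truth, i.e. at
	// most 25% low. The extrapolated rate is essentially exact here.
	trueRate := 13.75
	observedInc := trueRate * (rangeSec - interval) // 618.75 over 45s of the window
	fmt.Printf("continuous: lower-bound rate = %g/s (true %g/s)\n",
		observedInc/rangeSec, trueRate)

	// Case 2: the series only exists for the last 15s of the window and
	// increased by 825 in that time. The true per-second increase over the
	// whole window is 825/60 = 13.75/s, which the lower bound reports
	// exactly. Extrapolating the 15s of data over the full minute yields
	// 825/15 = 55/s, i.e. 300% too high.
	inc := 825.0
	fmt.Printf("new series: lower-bound rate = %g/s, extrapolated rate = %g/s\n",
		inc/rangeSec, inc/interval)
}
```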
As with interpolation in evaluation this is something that sounds good in theory but doesn't work out in practice. We should never do any interpolation or extrapolation without informed opt-in by the user, as it can cause unexpected results and only the user knows if that's okay for their use case.
I understand the brokenness in either direction. I still don't know what would be the best. The more I think about it I'm unhappy with either direction...
My argument above is that in the current form, the inaccuracies happen mostly temporarily around edge cases (time series appearing or counter resets happening within the window causing temporary glitches), so it gives an accurate per-second value most of the time. The new behavior always consistently underreports, and sometimes very significantly. People will get consistently too low rates in their graphs, not only temporary glitches. People might never notice that they're getting consistently wrong results, whereas with temporary glitches, the impact is lower in general, and one would usually notice that something is "off" temporarily.
On one hand I really like your argument about the new behavior not causing values out of the original series's domain, but on the other hand, I would expect a different output domain as a result of a per-second rate operation. I.e. for an integer counter, it's totally expected to get non-integer results for a per-second rate. Otherwise I really can't call it a per-second rate anymore. For
I'm wondering about the option of keeping the extrapolation for
Maybe @beorn7 has another opinion on this?
We've already had multiple complaints around the edge cases, which can last hours to days in common usage (not that I'm a major fan of that particular usage, but anyway).
From a personal standpoint, I can't trust the current rate() output when it comes to debugging. Having to manually inspect the datapoints of hundreds to thousands of timeseries to figure out what they're really telling me every time I get into deep debugging would be rather annoying. Being able to reason about the behaviour of functions is essential - being able to know that it always underreports is a feature.
The problem is that it's "off" and broken for their use cases. I heard no complaints about this behaviour over several years, so I think the downsides are not a big problem. Where you care more is things like latency and error rates, which due to division will cancel the discrepancy.
One is syntactic sugar for the other, they should stay consistent for sanity's sake.
I still don't buy it that this won't be a bigger problem than the current rate issues - it's really easy and likely for users to get this wrong all the time and see consistently way too low rates on their dashboards without even noticing their error. It seems very dangerous to me, especially because it is not obvious in any way that you're doing anything wrong unless you happen to stumble over it somehow. That might also explain if we wouldn't get many reports about it. But if a user graphs their requests/s and it's off by 25% or so, you seriously don't consider that a big problem?
With irate(), I'd expect most rate()s to be over longer time periods. Experience with a setup where a range of 2.5X the scrape interval was the rule of thumb for many years says 25% won't be a problem, as the only complaint was the graphs being too slow to respond (which irate handles).
@beorn7 Yeah - well, at least I think we share a common understanding on the facts already, so that's good. My stance is just that I'd have a harder time defending a consistently too low rate before users than a rate which has an occasional (and visible) glitch.
@brian-brazil Regarding the lack of complaints regarding a ~25% too low rate, are you sure that people involved actually knew how much the rate was off? If people don't actually notice the brokenness, they will of course not complain, but will act on wrong information, which seems dangerous to me.
OK, I guess I got it all... here are my points:
Some easy observations/conclusions
Suggestion for a solution
The huge error spikes of the extrapolated value happen if there are gaps at the beginning or at the end of the range that are large compared to the normal scrape interval, or if the number of counter resets within the range is large compared to the number of samples within the range. Both cases can be detected.
As the simplest heuristic, we could refuse to calculate a rate if
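Roughly, based on the two conditions described above, such a heuristic could look like the following sketch (the threshold names and values are placeholders, not a concrete proposal): refuse to return a result when the gap between a range boundary and the nearest sample is large compared to the typical scrape interval, or when the number of counter resets is large compared to the number of samples.

```go
package main

import "fmt"

type sample struct {
	t, v float64
}

// tooUncertain reports whether a rate over [start, end] should be refused
// because the samples don't cover the range well enough. maxGapFactor and
// maxResetRatio are illustrative thresholds only.
func tooUncertain(samples []sample, start, end, scrapeInterval float64) bool {
	const maxGapFactor = 1.5  // gap at either boundary > 1.5 scrape intervals
	const maxResetRatio = 0.2 // more than 1 reset per 5 samples
	if len(samples) < 2 {
		return true
	}
	if samples[0].t-start > maxGapFactor*scrapeInterval {
		return true
	}
	if end-samples[len(samples)-1].t > maxGapFactor*scrapeInterval {
		return true
	}
	resets := 0
	for i := 1; i < len(samples); i++ {
		if samples[i].v < samples[i-1].v {
			resets++
		}
	}
	return float64(resets)/float64(len(samples)) > maxResetRatio
}

func main() {
	s := []sample{{60, 1}, {120, 2}, {180, 3}}
	fmt.Println(tooUncertain(s, 0, 210, 60)) // false: gaps are within bounds
	fmt.Println(tooUncertain(s, 0, 600, 60)) // true: last sample far from range end
}
```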
That's a good analysis.
We could always document it that way, rate for long durations and irate for shorter ones.
Thus the range of 2.5X scrape interval being the rule of thumb (and part of why we have irate). The sum wouldn't be a problem, as it'd all balance out with many targets.
This breaks the use cases where there's a long interval. Particularly for increase I think that asking for values of up to a week should work.
This breaks the case when a server takes a while to restart.
These suggestions involve silently throwing away data. While that's still a lower bound, it's not going to be a bounded one compared to the true value so I don't think they're viable options for whatever we call rate.
I think that'd be unexpected semantics for a range, it should only look at data within the range itself.
My verdict after considering all things so far:
I think we should adjust the current
As said: I think that a range with 4 or 5 samples in it is legitimate to use. If we tossed out
My example had 3.5*scrape_interval. I could have taken 4.5 as well. It would still be a very irritating stepping up and down. Aggregating over hundreds of targets would indeed alleviate the issue, but if you have a handful of targets, you get really weird jitter.
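To illustrate that stepping, here is a small self-contained sketch (my own illustration, not taken from the discussion): a counter increasing at exactly 1/s, scraped every 60s, with a non-extrapolated rate evaluated over a 210s range at successive points in time. Depending on whether three or four samples land in the window, the result flips between roughly 57% and 86% of the true rate.

```go
package main

import "fmt"

func main() {
	const rangeSec = 210.0
	const interval = 60.0

	// Counter increasing at exactly 1/s, scraped every 60s starting at t=0.
	var ts, vs []float64
	for t := 0.0; t <= 1200; t += interval {
		ts = append(ts, t)
		vs = append(vs, t) // value == time, i.e. a true rate of 1/s
	}

	// Non-extrapolated rate over (evalT-rangeSec, evalT] for successive
	// evaluation timestamps 30s apart.
	for evalT := 300.0; evalT <= 600; evalT += 30 {
		var inRange []int
		for i, t := range ts {
			if t > evalT-rangeSec && t <= evalT {
				inRange = append(inRange, i)
			}
		}
		first, last := inRange[0], inRange[len(inRange)-1]
		rate := (vs[last] - vs[first]) / rangeSec
		fmt.Printf("t=%3.0fs: %d samples, rate=%.2f/s (true 1/s)\n",
			evalT, len(inRange), rate)
	}
	// The output alternates between 0.57/s (3 samples spanning 120s) and
	// 0.86/s (4 samples spanning 180s) as the window slides.
}
```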
I don't think it's a breakage. If I ask “give me the increase over the last week” but there is only data for the last three days, I want “sorry, not sufficient data” as an answer, not some number that hides the fact that it is only the increase over the last three days. “Give me the increase over the last week even if you have only data for a much shorter period of time” would be a different question (for which we can offer a non-extrapolating version of
Good point. Counter resets are hard…
I'd start with my suggestion to limit the
This statement is highly misleading. No data is thrown away. More accurately, calculated results of low confidence are not reported.
We've already had several users looking for these functions. It also better handles target churn, which may not lead to obvious distortions in the graphs when there's many targets.
Any functions that do interpolation or extrapolation need to explicitly call that out in their name, as applying these techniques without user understanding and opt-in is unsafe. The current behaviour needs different names if we're keeping them around.
We'd still support it, with different semantics though. I believe that irate (potentially combined with avg_over_time) covers those uses sufficiently in practice.
Currently, when we drop something due to insufficient data, we only drop the affected timeseries. My understanding of your proposal is that we'd follow that, and thus that
The alternative where we break the entire
For consistency we'd also need to apply the same semantics to all the over_time functions, which will have the same issues.
This is why you don't do extrapolation without opt-in, it's not safe in general :)
I daresay that many techniques in Prometheus are unsafe if users don't understand them. However, I'd be willing to consider the proposal to rename the extrapolating
Yes. Do you want to say that you have a problem if the expression returns a much lower value than the true value?
(In case the irony is not obvious: Returning a lower value in a scenario where data in a sum is missing sounds like the right thing. Returning systematically a lower value even if the data is fine sounds only right if you know what you are dealing with.)
So far I like my suggestion. We should play with it a bit. If the most primitive approach (don't return anything if
That's not true. Those functions do not extrapolate.
That's an option. What (if anything) we call
You'd have to go via a rule. Increase is just syntactic sugar around rate, so
I don't think rate will be commonly used in graphs once irate gets fully out there. If anything I think we'll need to work to direct people where to use rate rather than irate, given that I'd expect almost all examples to be for graphs using irate.
I think that returning a lower value than we have to is a problem, particularly if it can surprise the end-user without warning.
What I'm proposing is the strongest possible lower bound. This has the same common and worst cases, which are okay overall in terms of expected value.
What you're proposing is something that can return anywhere between an empty vector and an overestimate. This has bad common (target churn is common) and terrible worst cases (sometimes really common, and the #581 case).
The current behaviour is an upper bound that assumes a linear model. It has an okay common case, but bad worst cases.
I believe that we need to avoid having to make users care about edge cases, and the smaller the amount of documentation we need to provide the better. "Is a strict lower bound, will be typically out by a constant factor" is easier to explain than "Is a non-strict upper bound, presumes a linear time series. Can be out by a small amount downwards or very large factor upwards. Only way to correct is manual inspection of data." which is easier to explain than "Will drop data that doesn't cover half the range, may be out by up to 100% downwards or 100% upwards. Only way to correct is manual inspection of data."
It mitigates it by breaking in a different way in some cases. It's papering over a symptom rather than attacking the core problem. Making it tunable is getting into epicycles, and hints that the core idea isn't right. Per above, the amount of documentation is much more than a line as you need to go through the failure modes and where it's not safe to use.
This is impractical as a general rate/increase function. The only vaguely common good use I can think of would be as an alternative to a for clause in an alert over certain types of data pending #422 - and that's due to the filtering, not the extrapolation.
They act over ranges, so to keep the language consistent they should have similar behaviour in relation to dropping data if it's incomplete. Consider
A thought that just occurred to me is that we could add a
In contrast to the idea to go through rules to allow
For simplicity, I'll call the proposed lower-bound rate function LBR in the following.
I disagree on almost every single point here:
As said many times before, this is understood. It's just not acceptable to call it
A lot of target churn can happen if you take a very long range. However, for very long ranges, the LBR is fine. That's a very good use-case for it and a perfect solution. However, I'm missing a good solution for the middle area of ranges with ~5 samples. Too many to replace with
I agree that the worst cases are still very bad with my suggestion. This affects mostly long ranges. I'm not happy with that and still thinking how to improve it...
I strongly disagree that this explanation is sufficient. It's also not true as the factor is not constant but depends on alignment. You also have to include an explanation how the underestimation factor depends on the number of samples in the range.
However, with LBR we do not solve the core problem, we simply solve a different problem. If the core problem is unsolvable, then fine, tell the user to ask different questions. I would like to try a bit harder to solve the core problem before doing that.
I don't see that argument. The
Makes sense. LUBR = linear upper bound rate, our current behaviour.
I measure comprehensibility in terms of the number of ideas the user needs to understand to use a feature properly.
I'd consider the LBR strictly more comprehensible than the LUBR, as it doesn't have the #583 edge case that the LUBR does. The average cases are about the same difficulty to explain (strongest lower bound vs linear upper bound).
Your proposal adds a corner case to the LUBR, so it's less comprehensible than the LUBR.
So I see it as roughly 1 vs 2 vs 3 ideas, plus the other 4ish ideas that are common across them (per-second rate, accumulating counter, calculated on server, don't need to maintain state across restarts/counter reset handling).
LBR is strictly more predictable than the LUBR, as it only depends on the range and scrape times. The LUBR also depends on those to determine the bounds, then on the data and its context to get the true value.
It's the semi bit where I take issue. Anything with an unqualified name like
What task are you trying to perform with those 5 samples? I can only see it coming up in alerting, and it doesn't matter much there as you're usually dealing with ratios which will cancel out any error.
I tried to keep all the explanations concise, so they're missing the full details.
By constant I mean that it's a constant relative to the Prometheus configuration, that is, it doesn't depend on the data. The underestimation amount doesn't depend on the number of samples in the range; you need to allow for at most a scrape interval's worth of data around the samples you have.
If rate and friends are going to try and fill in data that's missing from the range, then there's a reasonable argument that all other range functions should do the same for the same reasons - that is, that the result will be closer to the true expected value, rather than a lower bound.
I don't like this argument for the same reasons I don't like rate extrapolating, but it is a valid argument and the logical extension of the current rate logic and associated arguments.
That has been proven wrong above. I don't know how to demonstrate it more clearly than it has been done before.
I tried to explain why I don't think that there is a reasonable argument of that kind. I don't know how to explain it more clearly.
To get action items back on track:
I'm definitely in favor of offering functions as implemented in the PR, I just would like a naming scheme that calls out the lower-bound behavior (e.g.
I have an idea how to improve the extrapolation behavior we have currently implemented, which I have to flesh out. We have to see if we can improve
I don't see anything demonstrating that, and I can't see a way the underestimate can exceed one scrape's worth of data. I just ran through the numbers on the 210s example to be sure (I presume that's what you're referring to, if not please indicate otherwise) and on average it's out by a scrape's worth of data.
It comes down to what you expect these functions to do, and how they fit in the taxonomy of our functions as our users see them and how they're used, rather than how we see them as implementers. If you were used to the _over_time functions extrapolating you'd probably see the argument as you'd be used to thinking about them in that way. In practice those functions aren't much use until we figure out staleness.
Presume we've handled staleness and pgw metrics are timestamped. Consider sum_over_time on top of timestamped data coming from the pgw to calculate how many records were processed by a series of batch job runs over some period. That's exactly the same use case as
Designing a language to be consistent is non-trivial, things interact in unexpected ways.
count_over_time and sum_over_time are strongest lower bound functions. min_over_time is a strongest upper bound, max_over_time is a strongest lower bound. This doesn't matter for avg_over_time as it cancels, the same as for a ratio of rates. For consistency, whatever we do extrapolation-wise for rate/increase/delta we should do for the _over_time functions too, as they're all unqualified range functions and in addition there are isomorphic use cases.
I'd like to generally avoid abbreviations, users should have a good idea what a function does from just the name. The only reason
I had an idea to look back beyond the start of range, but that completely changes the semantics of ranges and gets very epicycley when you work through all the implications. We'd need to make sure that both made logical sense and worked in practice.
You never answered this question. If we can come up with another way to do what you need then that gives us more options.
You wanted an explicit answer to: “What task are you trying to perform with those 5 samples? I can only see it coming up in alerting, and it doesn't matter much there as you're usually dealing with ratios which will cancel out any error.“
The question I (and many of my coworkers) ask multiple times each day is: “How many QPS are we serving now?“ With a scrape interval typically between 15s and 30s,
Luckily, we are also in a domain where the brokenness of
Ideally, I want a
I don't want to have complicated syntax to answer my straightforward question above, especially not if it just reconstructs the semi-broken behavior
So let's take a note for later, when we discuss naming:
Let's now talk about ratios:
To not go abstract with formulas, I'll take a concrete example.
Going back to the case with the range of 210s and 60s scrape interval: if I sum up multiple LBRs, half of them will be underestimated by 14%, half of them by 43% (a window that catches four samples spans 180s of the 210s, one that catches only three spans just 120s). Let's say I calculate an average latency from
Now let's have the first target be underestimated by 43% and the second be underestimated by 14% (due to different scrape timings). The ratio becomes (114+129)/(57+43) = 2.43.
Funny enough, if the underestimations happen to be the other way round, we get (171+64)/(86+21) = 2.20.
So what originally was a guaranteed lower bound, turns into something that might be too high or too low, depending on scraping alignment, if you take a ratio.
Not really surprising, thinking about it. The mnemonic rhyme I learned in school was: „Aus Summen kürzen nur die Dummen.“ (roughly: “only fools cancel out of sums”). See here for a discussion of how to translate it. :)
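Here is a small self-contained illustration of that effect (my own numbers, chosen so the per-target underestimation factors match the 14%/43% case above; not the exact figures from the comment): the ratio within each target is unaffected, but the ratio of the summed lower bounds can land above or below the true value depending on which target is underestimated more.

```go
package main

import "fmt"

func main() {
	// True per-target rates: (latency-sum rate, request rate).
	latTrue := []float64{200, 150}
	cntTrue := []float64{100, 50}
	trueRatio := (latTrue[0] + latTrue[1]) / (cntTrue[0] + cntTrue[1]) // 2.33

	// Underestimation factors for the 210s range / 60s interval case:
	// a target reports either ~86% (4 samples) or ~57% (3 samples) of the truth.
	f4, f3 := 180.0/210.0, 120.0/210.0

	for _, factors := range [][2]float64{{f3, f4}, {f4, f3}} {
		latSum, cntSum := 0.0, 0.0
		for i := 0; i < 2; i++ {
			latSum += latTrue[i] * factors[i]
			cntSum += cntTrue[i] * factors[i]
		}
		fmt.Printf("factors %.2f/%.2f: ratio of summed LBRs = %.2f (true %.2f)\n",
			factors[0], factors[1], latSum/cntSum, trueRatio)
	}
	// Depending on which target is underestimated more, the ratio lands
	// above or below the true value -- the lower-bound guarantee is gone.
}
```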
This use-case is fundamentally violating Prometheus semantics. That's what we are telling everybody who tries to set it up that way. I don't think we should optimize our query language on fringe use-cases for which Prometheus was not meant to be used.
I still see a fundamental difference between the
OK, finally my 2nd try to fix
I'll use LBR for the
I'll start from fundamentals once more, sorry for that...
1. When are both LUBR and LBR exactly right?
If the first sample in the range is exactly at the beginning of the range and the last sample in the range is exactly at the end of the range. This case is unfortunately very unlikely. But keep in mind that both implementations approach the true rate the closer we get to this ideal situation.
2. When is LUBR right and LBR wrong?
LUBR is exactly right if the series starts before the range, continues after the range, and the counter increases linearly. Again, this will never happen in practice. But at least LUBR will estimate the correct value if the series starts before the range and continues after the range. Since this is in general the most frequent case, I argued that LUBR is correct as it yields the expectation value in general.
LBR systematically underestimates in this case. The exact value depends on the scrape interval and the alignment. The maximum amount of underestimation can be calculated from the scrape interval, but by design, the scrape interval may vary (jitter, missed scrapes), so it's fundamentally difficult to reason about the maximum possible underestimation and even more so about the underestimation in a concrete case.
3. When is LBR right and LUBR wrong?
In the case where a series starts within the range and ends within the range, LBR is exactly right and LUBR is off by an amount that depends on the ratio between covered range and total range, which can be a lot.
Obviously, the larger a range becomes, the more likely we end up in this situation. Things go totally haywire if target churn happens. The error of LBR multiplies if a number of time series appear and disappear again within the range one after another, which again is more likely with larger ranges. Indeed, the problems users reported with LUBR were usually related to a
4. What's bad for both LBR and LUBR?
Unfortunately, two relatively common cases haven't been touched yet: A time series exists before the range begins but ends within the range. Or a time series starts within the range and continues beyond the end of the range. In those cases, we want “left-sided“ or “right-sided“ extrapolation only. Computationally, that's not a problem.
5. The solution
Fundamentally, the farther away we get from case (1), the more we want to find out if the time series has existed before the range or will exist after the range, to determine if we are in (2), (3), or (4). The naive approach would be to scan backwards and forwards to see if there are samples in the same time series within a reasonable duration before and after the range. While in principle possible for the past, it is only possible for the future if we have a range that ends far enough in the past. In most cases (in particular rule evaluation), the range ends at the present, so we can't know if the time series will continue. Since we really don't want to change the evaluation depending on the evaluation time being the present or more than one scrape interval into the past, we should restrict ourselves to looking only at samples within the range. (Doing it only for the past, but not for the future, would be inconsistent. Perhaps worth a thought, but I sense danger…)
So we could do the following:
First concern: How to pick x? In an ideal world with constant scrape intervals, it would be 1. However, in practice, scrape intervals have a jitter, and we might even miss a scrape completely. To take jitter into account, we can add some ε to it (e.g. 0.1 for 10% jitter). To still extrapolate if a scrape at the border is lost, we would need x = 2+ε. That would however apply extrapolation way more often than necessary. Currently, I expect x = 1+ε to yield the best results.
Second concern: We'll get funny jumps between extrapolation and non-extrapolation if we are hovering around an edge case.
Overall, I think it's worth playing with it. I believe it will yield the expected results in most cases.
(FTR: I still think we should offer a lower-bound version of
That makes sense.
I put this down to #398.
I don't see this as extrapolation, just as I wouldn't consider
Ratios should only be taken on aligned data, as no matter which rate function you use you're going to have races causing weirdness otherwise. Wanting to take a ratio of rates from two distinct and thus unaligned sets of targets is not a common case (I've seen it come up in provisioning where I knew it was a little dodgy, it was based on a sum across many targets and with enough other noise that it worked out fine).
Sounds vaguely like Simpson's paradox.
I think we're talking about different things. This is about batch jobs, not daemons. This isn't a case of trying to make prometheus do push where you should be using pull.
We had one person in IRC asking for this. I don't see any fundamental problems with this use case, though it's obviously not how Prometheus is mainly used. We haven't ever really talked about how counters inside ephemeral jobs should work when the jobs are too brief to be caught sufficiently by polling. sum_over_time with timestamps was my first thought. I have another idea, but let's not get further sidetracked.
I think you typoed LUBR here.
Yeah, that's related to one of the reasons interpolation was removed.
This seems like a viable approach to me.
It may be better to calculate the differences between all the scrape intervals, then take the median. That way outliers due to missing scrapes won't be as problematic.
That feels right to me, use something like 10% or a higher quantile than median and then see how it goes.
We'll need to do some math to make sure we understand the edge cases, and that they are acceptable, but I think it'll work out.
If we can get the main rate working sufficiently well, then increase will cover those cases well enough too. More functions and variants thereof mean more confusion for users as to which to use, two rates is more than enough already so I'd prefer not to add more unless we have to.
Still distracted by non-Prometheus work, so only quick remarks:
Working staleness detection would indeed solve many of the problems here as we could know that a time series has ended and act accordingly.
The ratio from my example is aligned indeed. The problem happens before, when the LBR is taken on differently aligned targets (and then summed, and then divided).
Yes, I typoed LUBR->LBR above.
I like the idea of taking the median of the scrape intervals within the range. It's computationally more expensive but probably worth the effort. However, I think we should take 1.1*median as threshold, not a higher quantile. The latter is very noisy if we have only a handful of scrape intervals within the range.
Another thought that crossed my mind: If a series starts within the range (or, respectively, ends within the range), the expectation value for the time when the series actually started (or ended) is half a scrape interval before (after) the first (last) scrape. So we can improve the estimation by extrapolating half the median scrape interval to the left (right) if the first (last) sample within the range is more than 1.1*median after (before) the start (end) of the range. If it's closer, we do the full extrapolation (on the respective side).
So how to proceed from here? Implement the suggested modified extrapolation and play with it?
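To make the modified extrapolation concrete, here is a rough sketch of one way to implement it (my own reading of the suggestion; the 1.1 factor and the half-interval extrapolation come from the comments above, everything else is a placeholder): compute the median spacing of the samples in the range, extrapolate fully to a range boundary if the nearest sample is within 1.1x the median of it, and otherwise extrapolate by only half a median interval.

```go
package main

import (
	"fmt"
	"sort"
)

type sample struct {
	t, v float64
}

// medianSpacing returns the median gap between consecutive samples, which is
// robust against a single missed scrape inflating the estimate.
func medianSpacing(samples []sample) float64 {
	gaps := make([]float64, 0, len(samples)-1)
	for i := 1; i < len(samples); i++ {
		gaps = append(gaps, samples[i].t-samples[i-1].t)
	}
	sort.Float64s(gaps)
	return gaps[len(gaps)/2]
}

// extrapolatedRate computes the observed increase and extrapolates it over a
// window that extends to each range boundary only if the boundary gap is at
// most 1.1x the median scrape interval; otherwise it extends by half a
// median interval, the expected distance to where the series started/ended.
func extrapolatedRate(samples []sample, start, end float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	inc := 0.0
	for i := 1; i < len(samples); i++ {
		d := samples[i].v - samples[i-1].v
		if d < 0 {
			d = samples[i].v // counter reset
		}
		inc += d
	}
	med := medianSpacing(samples)
	sampled := samples[len(samples)-1].t - samples[0].t

	extendTo := func(gap float64) float64 {
		if gap <= 1.1*med {
			return gap // close to the boundary: extrapolate all the way
		}
		return med / 2 // far from the boundary: assume the series started/ended nearby
	}
	covered := sampled + extendTo(samples[0].t-start) + extendTo(end-samples[len(samples)-1].t)

	return inc * covered / sampled / (end - start)
}

func main() {
	// 60s scrapes over a 210s range; the series covers the whole range.
	full := []sample{{60, 60}, {120, 120}, {180, 180}}
	fmt.Printf("continuous series: %.2f/s\n", extrapolatedRate(full, 30, 240)) // ~1.00/s

	// The series only appears partway through the range.
	late := []sample{{180, 10}, {240, 70}}
	fmt.Printf("late-starting series: %.2f/s\n", extrapolatedRate(late, 30, 240)) // ~0.43/s
}
```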