
promql: Remove extrapolation from rate/increase/delta. #1161

Closed

Conversation

brian-brazil
Contributor

This is an improvement from both a theoretical and practical
standpoint.

Theory:
For simplicity I'll take the example of increase().

Counters are fundamentally lossy: if there's a counter reset
or instance failure between one scrape and the next, we'll
lose information about the period up to the reset/failure.
Thus the samples we have allow us to calculate a lower bound
on the increase() in a time period.

Extrapolation multiplies this by an amount depending on timing, which is
an upper bound based on what might have happened if everything continued
as it was.
This mix of upper and lower bounds means that we can't make any
meaningful statements about the output of increase() in relation to what
actually happened.

By removing the extrapolation, we can once again reason that the result
of increase() is a lower bound on the true increase of the counter.

Practical:
Fixes #581.
The extrapolation presumes that it's looking at a range within a
continuous stream of samples. If in fact the time series starts or
ends within this range, this leads to an over-correction.

For discrete counters and gauges, extrapolating to invalid values in
their domain can be confusing and prevent rules being written that
depend on exact values.

For those looking to graph things more accurately, irate() is a better
choice than extrapolation on rate().
For those looking to calculate how a gauge is trending, deriv() is a
better choice than delta().
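
To make the difference concrete, here is a minimal sketch of the two behaviours (a simplified model with made-up sample values, not the actual promql code; counter resets are ignored for brevity):

```go
package main

import "fmt"

type sample struct{ t, v float64 } // timestamp in seconds, counter value

// lowerBoundIncrease uses only what was actually observed: last sample minus first.
func lowerBoundIncrease(s []sample) float64 {
	return s[len(s)-1].v - s[0].v
}

// extrapolatedIncrease scales the observed delta up to the full range,
// assuming the series covers the whole range and grows linearly.
func extrapolatedIncrease(s []sample, rangeSeconds float64) float64 {
	covered := s[len(s)-1].t - s[0].t
	return lowerBoundIncrease(s) * rangeSeconds / covered
}

func main() {
	// A series that only existed for the last 30s of a 300s range.
	s := []sample{{270, 0}, {285, 100}, {300, 200}}
	fmt.Println("lower bound: ", lowerBoundIncrease(s))        // 200
	fmt.Println("extrapolated:", extrapolatedIncrease(s, 300)) // 2000, the over-correction described above
}
```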

@juliusv @fabxc This depends on #1158

@fabxc
Contributor

fabxc commented Oct 10, 2015

Thanks for the elaborate reasoning, sounds sane to me 👍

@juliusv
Member

juliusv commented Oct 13, 2015

I like your arguments, but this is a major breaking change in what rate() does. If we do this, we really cannot sell rate() as increase-per-second anymore. Whereas before it was trying its best to give you a per-second increase (assuming roughly linear counter growth in the rate window and putting edge cases aside), it is now way off from the per-second value if you get somewhat close to your sampling period with your averaging window.

As an example, if I scrape a counter every 15s and do a rate(foo[1m]) over it, I get a rate of 41/s without extrapolation, but 55/s with extrapolation, where the latter matches the true per-second counter increase pretty much perfectly and the unextrapolated result is way off. Of course, it gets even worse when making the window smaller (a rate of 28/s for a 30s window).

Now yes, we can tell people to use irate() for very small windows, but will they understand that even at an averaging window 4x as big as their sampling interval, their rate will be off by 25% already?
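
A quick check of the arithmetic behind these numbers (assuming a counter growing at a steady 55/s, scraped every 15s, with the evaluation time not aligned to a scrape; a sketch, not promql code):

```go
package main

import "fmt"

func main() {
	const trueRate, scrape = 55.0, 15.0
	for _, window := range []float64{60, 30} {
		// Typical alignment: the evaluation time doesn't coincide with a
		// scrape, so the window contains floor(window/scrape) samples
		// spanning (samples-1)*scrape seconds.
		samples := int(window / scrape)
		covered := float64(samples-1) * scrape
		delta := trueRate * covered
		fmt.Printf("%gs window: %.1f/s without extrapolation, %.1f/s with\n",
			window, delta/window, delta/covered)
	}
}
```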

@juliusv
Member

juliusv commented Oct 13, 2015

So while both variants of rate() would be broken in some ways, I'm wondering whether the new behavior wouldn't be way worse...

@brian-brazil
Contributor Author

If we do this, we really cannot sell rate() as increase-per-second anymore.

If that's the case we can't sell what we currently have as increase per second either. What we currently have can be off by an amount corresponding to anywhere from one scrape interval up to the whole range.

Without extrapolation it can only be off by up to a scrape interval. This is a much smaller discrepancy, and is meaningful without having to look at the data.

As an example, if I scrape a counter every 15s and do a rate(foo[1m]) over it, I get a rate of 41/s without extrapolation, but 55/s with extrapolation, where the latter matches the true per-second counter increase pretty much perfectly and the unextrapolated result is way off.

Taking your example, the current rate can produce 55/s when the true value is 13.75/s. This is not more accurate.

but will they understand that even at an averaging window 4x as big as their sampling interval, their rate will be off by 25% already?

Do they understand the current behaviour can be off by between 25% and 300% in that case?

As with interpolation in evaluation, this is something that sounds good in theory but doesn't work out in practice. We should never do any interpolation or extrapolation without informed opt-in by the user, as it can cause unexpected results and only the user knows if that's okay for their use case.

@juliusv
Member

juliusv commented Oct 13, 2015

I understand the brokenness in either direction. I still don't know what would be best. The more I think about it, the more I'm unhappy with either direction...

My argument above is that in the current form, the inaccuracies happen mostly temporarily around edge cases (time series appearing or counter resets happening within the window causing temporary glitches), so it gives an accurate per-second value most of the time. The new behavior always consistently underreports, and sometimes very significantly. People will get consistently too low rates in their graphs, not only temporary glitches. People might never notice that they're getting consistently wrong results, whereas with temporary glitches, the impact is lower in general, and one would usually notice that something is "off" temporarily.

On one hand I really like your argument about the new behavior not causing values out of the original series's domain, but on the other hand, I would expect a different output domain as a result of a per-second rate operation. I.e. for an integer counter, it's totally expected to get non-integer results for a per-second rate. Otherwise I really can't call it a per-second rate anymore. For increase(), that story could be different.

I'm wondering about the option of keeping the extrapolation for rate(), but dropping it for increase().

Maybe @beorn7 has another opinion on this?

@brian-brazil
Contributor Author

My argument above is that in the current form, the inaccuracies happen mostly temporarily around edge cases (...), so it gives an accurate per-second value most of the time.

We've already had multiple complaints around the edge cases, which can last hours to days in common usage (not that I'm a major fan of that particular usage, but anyway).

From a personal standpoint, I can't trust the current rate() output when it comes to debugging. Having to manually inspect the datapoints of hundreds to thousands of timeseries to figure out what they're really telling me every time I get into deep debugging would be rather annoying. Being able to reason about the behaviour of functions is essential - being able to know that it always underreports is a feature.

People might never notice that they're getting consistently wrong results, whereas with temporary glitches, the impact is lower in general, and one would usually notice that something is "off" temporarily.

The problem is that it's "off" and broken for their use cases. I heard no complaints about this behaviour over several years, so I think the downsides are not a big problem. Where you care more is things like latency and error rates, which, due to division, will cancel out the discrepancy.

I'm wondering about the option of keeping the extrapolation for rate(), but dropping it for increase().

One is syntactic sugar for the other; they should stay consistent for sanity's sake.

@juliusv
Member

juliusv commented Oct 14, 2015

I still don't buy that this won't be a bigger problem than the current rate issues - it's really easy and likely for users to get this wrong all the time and see consistently way too low rates on their dashboards without even noticing their error. It seems very dangerous to me, especially because it is not obvious in any way that you're doing anything wrong unless you happen to stumble over it somehow. That might also explain why we wouldn't get many reports about it. But if a user graphs their requests/s and it's off by 25% or so, you seriously don't consider that a big problem?

@brian-brazil
Contributor Author

But if a user graphs their requests/s and it's off by 25% or so, you seriously don't consider that a big problem?

With irate(), I'd expect most rate()s to be over longer time periods. Experience with a setup where a range of 2.5X the scrape interval was the rule of thumb for many years says 25% won't be a problem, as the only complaint was the graphs being too slow to respond (which irate handles).

@beorn7
Member

beorn7 commented Oct 14, 2015

I tried to leave this debate to the experts, but as it looks, another view might be needed here. I'll read up on the details ASAP, perhaps on the plane today. And then I'll voice my opinion. Stay tuned...

@juliusv
Member

juliusv commented Oct 15, 2015

@beorn7 Yeah - well, at least I think we share a common understanding on the facts already, so that's good. My stance is just that I'd have a harder time defending a consistently too low rate before users than a rate which has an occasional (and visible) glitch.

@brian-brazil Regarding the lack of complaints about a ~25% too low rate, are you sure that the people involved actually knew how much the rate was off? If people don't actually notice the brokenness, they will of course not complain, but will act on wrong information, which seems dangerous to me.

@beorn7
Member

beorn7 commented Oct 17, 2015

OK, I guess I got it all... here are my points:

Some easy observations/conclusions

  • I agree that delta, increase, and rate should all have consistent extrapolation behavior. If we really need different behavior, we should have separate sets of functions with a meaningful prefix (something like exp_rate for extrapolated rate or lb_rate for lower bound rate). Ideally, however, we can come up with one behavior that fits all needs.
  • The current behavior has the correct expectation value and a mean error approaching zero if counter resets and series beginnings are rare compared to the regular case. Individual values might be far off. Brian's suggestion gives a lower bound and not an expectation value at all. Neither is more correct or more wrong than the other; they are just different. From the debate so far, I cannot conclude which behavior is more appropriate in general. Perhaps we really need both…
  • With a 15s scrape interval, I find a rate(foo[1m]) legit, i.e. something where I wouldn't necessarily tell users to use irate instead. However, with the lower-bound proposal, the reported rate will be 20% too low on average. I do find that problematic, even after reading all the arguments here. I don't think we can assume users will regularly take long enough ranges to make the difference irrelevant.

Fancier stuff

  • What I find particularly worrying about the lower-bound approach is the behavior when the range is not a multiple of the scrape interval. Let's say we have a 60s scrape interval and are evaluating rate(foo[210s]). Then there is a 50% probability of having 3 samples in the range and a 50% probability of having 4. Depending on the evaluation time, the under-reporting will be 43% or 14%, respectively, resulting in an impressive square waveform if plotted (see the sketch after this list). If we do a sum(rate(foo[210s])) over different targets, the waveform gets even more interesting...
  • Fundamentally, it feels inconsistent that we got rid of interpolation for individual sample values, but still do extrapolation for rate and friends. However, for individual sample values, we always take the latest value before evaluation time. So arguably, we should start the range for a rate with the latest value before the range starts, and then take it from there. In that way, we would not systematically under-report anymore, but we would also not return a lower bound, and we would still have the weird step function behavior described in the previous item, now at least centered around the “true” value of the rate.
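
A small sketch of the square-wave effect from the first bullet (assuming scrapes at t = 0, 60, 120, ... and sweeping the evaluation-time offset; an illustration only, not promql code):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const scrape, rng = 60.0, 210.0
	for off := 5.0; off < 60; off += 10 {
		end := 600 + off   // evaluation time
		start := end - rng // beginning of the 210s range
		first := scrape * math.Ceil(start/scrape) // first scrape inside the range
		last := scrape * math.Floor(end/scrape)   // last scrape inside the range
		covered := last - first
		fmt.Printf("offset %2.0fs: %.0f samples, under-reporting %.0f%%\n",
			off, covered/scrape+1, 100*(1-covered/rng))
	}
}
```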

Suggestion for a solution

The huge error spikes of the extrapolated value happen if there are gaps at the beginning or at the end of the range that are large compared to the normal scrape interval, or if the number of counter resets within the range is large compared to the number of samples within the range. Both cases can be detected.

As the simplest heuristic, we could refuse to calculate a rate if intervalCorrection is 2 or greater.

Possible refinements:

  • The maximum intervalCorrection can be required to be lower if we have many samples in the interval.
  • The summed up time between consecutive pairs of samples with a counter reset in between must not exceed a certain fraction of the range.
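
A minimal sketch of the simplest heuristic above, taking intervalCorrection to be the range divided by the span covered by the samples (names and error handling here are illustrative, not the actual promql code):

```go
package main

import (
	"errors"
	"fmt"
)

var errLowConfidence = errors.New("refusing to extrapolate: less than half the range is covered")

// guardedRate returns the extrapolated per-second rate, but refuses when the
// extrapolation factor (range / covered span) is 2 or greater.
func guardedRate(delta, coveredSeconds, rangeSeconds float64) (float64, error) {
	intervalCorrection := rangeSeconds / coveredSeconds
	if intervalCorrection >= 2 {
		return 0, errLowConfidence
	}
	return delta * intervalCorrection / rangeSeconds, nil
}

func main() {
	if v, err := guardedRate(2475, 45, 60); err == nil {
		fmt.Printf("60s range, 45s covered: %.1f/s\n", v) // extrapolates to 55.0/s
	}
	if _, err := guardedRate(100, 120, 300); err != nil {
		fmt.Println("300s range, 120s covered:", err) // result withheld
	}
}
```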

@brian-brazil
Contributor Author

That's a good analysis.

I don't think we can assume users will regularly take long enough ranges to make the difference irrelevant.

We could always document it that way: rate for long durations and irate for shorter ones.

Depending on the evaluation time, the under-reporting will be 43% or 14%, respectively, resulting in an impressive square waveform if plotted. If we do a sum(rate(foo[210s])) over different targets, the waveform gets even more interesting...

Thus the range of 2.5X scrape interval being the rule of thumb (and part of why we have irate). The sum wouldn't be a problem, as it'd all balance out with many targets.

As the simplest heuristic, we could refuse to calculate a rate if intervalCorrection is 2 or greater.

This breaks the use cases where there's a long interval. Particularly for increase I think that asking for values of up to a week should work.

The summed up time between consecutive pairs of samples with a counter reset in between must not exceed a certain fraction of the range.

This breaks the case when a server takes a while to restart.

These suggestions involve silently throwing away data. While that's still a lower bound, it's not going to be a bounded one compared to the true value so I don't think they're viable options for whatever we call rate.

So arguably, we should start the range for a rate with the latest value before the range starts, and then take it from there.

I think that'd be unexpected semantics for a range; it should only look at data within the range itself.

@beorn7
Member

beorn7 commented Oct 17, 2015

My verdict after considering all things so far:

I think we should adjust the current rate function (and friends, i.e. delta and increase) to not return values that are likely to have a high extrapolation error. The use-cases for non-extrapolating versions of those functions seem legitimate to me but rather special. We can introduce those under a different name.

Detailed replies:

We could always document it that way, rate for long durations and irate for shorter ones.

As said: I think that a range with 4 or 5 samples in it is legitimate to use. If we tossed out rate as we know it and replaced it with a non-extrapolating version, we would not support that use-case anymore.

Square-wave issue:

Thus the range of 2.5X scrape interval being the rule of thumb (and part of why we have irate). The sum wouldn't be a problem, as it'd all balance out with many targets.

My example had 3.5*scrape_interval. I could have taken 4.5 as well. It would still be a very irritating stepping up and down. Aggregating over hundreds of targets would indeed alleviate the issue, but if you have a handful of targets, you get really weird jitter.

As the simplest heuristic, we could refuse to calculate a rate if intervalCorrection is 2 or greater.

This breaks the use cases where there's a long interval. Particularly for increase I think that asking for values of up to a week should work.

I don't think it's a breakage. If I ask “give me the increase over the last week” but there is only data for the last three days, I want “sorry, not sufficient data” as an answer, not some number that hides the fact that it is only the increase over the last three days. “Give me the increase over the last week even if you have only data for a much shorter period of time” would be a different question (for which we can offer a non-extrapolating version of increase).

The summed up time between consecutive pairs of samples with a counter reset in between must not exceed a certain fraction of the range.

This breaks the case when a server takes a while to restart.

Good point. Counter resets are hard…

I'd start with my suggestion to limit the intervalCorrection and see how much confusion is solved with that alone.

These suggestions involve silently throwing away data.

This statement is highly misleading. No data is thrown away. More accurately, calculated results of low confidence are not reported.

@brian-brazil
Contributor Author

The use-cases for non-extrapolating versions of those function seem legitimate to me but rather special. We can introduce those under a different name.

We've already had several users looking for these functions. It also better handles target churn, which may not lead to obvious distortions in the graphs when there are many targets.

Any functions that do interpolation or extrapolation need to explicitly call that out in their name, as applying these techniques without user understanding and opt-in is unsafe. The current behaviour needs different names if we're keeping them around.

As said: I think that a range with 4 or 5 samples in the range is legitimate to use. If we tossed out rate as we know it and replace it by a non-extrapolating version, we would not support that use-case anymore.

We'd still support it, with different semantics though. I believe that irate (potentially combined with avg_over_time) covers those uses sufficiently in practice.

This statement is highly misleading. No data is thrown away. More accurately, calculated result of low confidence are not reported.

Currently, the only times we drop due to insufficient data, we only drop the affected timeseries. My understanding of your proposal is that we'd follow that, and thus that sum(increase()) could operate on a subset, silently drop data, and potentially give a much lower value than the true value.

The alternative where we break the entire increase() would mean it'd break every time there was a new target - which is going to be continuous in large systems. We'd also need to add a by clause to break it along the correct boundaries. I don't see a way to make your proposal workable, and that's before considering how much additional documentation and tuning would be required if we found a way to make it work.

For consistency we'd also need to apply the same semantics to all the over_time functions, which will have the same issues.

This is why you don't do extrapolation without opt-in - it's not safe in general :)

@beorn7
Member

beorn7 commented Oct 17, 2015

Any functions that do interpolation or extrapolation need to explicitly call that out in their name, as applying these techniques without user understanding and opt-in is unsafe. The current behaviour needs different names if we're keeping them around.

I daresay that many techniques in Prometheus are unsafe if users don't understand them. Still, I'd be willing to consider the proposal to rename the extrapolating rate and friends. However, I'm one to two orders of magnitude more convinced that a rate that systematically underestimates the actual rate (by an amount depending on the ratio of scrape interval to range and their precise alignment) needs to call that out in its name.

I believe that irate (potentially combined with avg_over_time) covers those uses sufficiently in practice.

avg_over_time(irate(foo[2m])[2m]) is not valid promql. Even if it were, how would you do increase and delta? And that's not even mentioning that I would not ask users to use such an arcane construction for something very commonly used...

My understanding of your proposal is that we'd follow that, and thus that sum(increase()) could operate on a subset, silently drop data and potentially give a much lower value than the true value.

Yes. Are you saying that you have a problem if the expression returns a much lower value than the true value?

(In case the irony is not obvious: Returning a lower value in a scenario where data in a sum is missing sounds like the right thing. Returning systematically a lower value even if the data is fine sounds only right if you know what you are dealing with.)

I don't see a way to make your proposal workable, and that's before considering how much additional documentation and tuning would be required if we found a way to make it work.

So far I like my suggestion. We should play with it a bit. If the most primitive approach (don't return anything if intervalCorrection >= 2) is sufficient for fixing the worst problems with the extrapolation, the amount of documentation is ~1 line. I see a lot of documentation necessary for the non-extrapolating versions as they work against the intuition of most users, and effects like square-wave patterns will utterly confuse the less mathematically inclined.

For consistency we'd also need to apply the same semantics to all the over_time functions, which will have the same issues.

That's not true. Those functions do not extrapolate.

@brian-brazil
Contributor Author

However, I'm one to two orders of magnitude more convinced that a rate that systematically underestimates the actual rate (and that depending on the ratio of scrape interval and range and their precise alignment) needs to call that out in its name.

That's an option. What (if anything) we call rate needs to work in the general case, and be generally comprehensible. A consistent, predictable underestimate meets those criteria better than the current behaviour.

avg_over_time(irate(foo[2m])[2m]) is not valid promql. Even if it were, how would you do increase and delta? Not even mentioning that I would not ask users to use such an arcane construction for something very commonly used...

You'd have to go via a rule. Increase is just syntactic sugar around rate, so avg_over_time(irate(foo[2m])[2m]) * 60*2 - though I intend increase to be used over longer time periods where it won't matter much. There's no equivalent for delta as we don't offer a per-second version, but you should probably be using deriv anyway.

I don't think rate will be commonly used in graphs once irate gets fully out there. If anything I think we'll need to work to direct people where to use rate rather than irate, given that I'd expect almost all examples to be for graphs using irate.

Yes. Do you want to say that you have a problem if the expression returns a much lower value than the true value?
(In case the irony is not obvious: Returning a lower value in a scenario where data in a sum is missing sounds like the right thing. Returning systematically a lower value even if the data is fine sounds only right if you know what you are dealing with.)

I think that returning a lower value than we have to is a problem, particularly if it can surprise the end-user without warning.

What I'm proposing is the strongest possible lower bound. This has the same common and worst cases, which are okay overall in terms of expected value.

What you're proposing is something that can return anywhere between an empty vector and an overestimate. This has bad common (target churn is common) and terrible worst cases (sometimes really common, and the #581 case).

The current behaviour is an upper bound that assumes a linear model. It has an okay common case, but bad worst cases.

I believe that we need to avoid having to make users care about edge cases, and the smaller the amount of documentation we need to provide the better. "Is a strict lower bound, will be typically out by a constant factor" is easier to explain than "Is a non-strict upper bound, presumes a linear time series. Can be out by a small amount downwards or very large factor upwards. Only way to correct is manual inspection of data." which is easier to explain than "Will drop data that doesn't cover half the range, may be out by up to 100% downwards or 100% upwards. Only way to correct is manual inspection of data."

So far I like my suggestion. We should play with it a bit. If the most primitive approach (don't return anything if intervalCorrection >= 2) is sufficient for fixing the worst problems with the extrapolation, the amount of documentation is ~1 line.

It mitigates it by breaking in a different way in some cases. It's papering over a symptom rather than attacking the core problem. Making it tunable is getting into epicycles, and hints that the core idea isn't right. Per above, the amount of documentation is much more than a line as you need to go through the failure modes and where it's not safe to use.

This is impractical as a general rate/increase function. The only vaguely common good use I can think of would be as an alternative to a for clause in an alert over certain types of data pending #422 - and that's due to the filtering, not the extrapolation.

For consistency we'd also need to apply the same semantics to all the over_time functions, which will have the same issues.
That's not true. Those functions do not extrapolate.

They act over ranges, so to keep the language consistent they should have similar behaviour in relation to dropping data if it's incomplete. Consider sum_over_time, for example, to which your rate argument also applies and for which there's an argument for making it extrapolate.

@brian-brazil
Contributor Author

A thought that just occurred to me is that we could add a range_covered function that'd return the inverse of the intervalCorrection. Then you could get the current behaviour with rate(x[5m]) / range_covered(x[5m]) and you could do whatever filtering you like too.
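
A rough sketch of the arithmetic such a hypothetical range_covered() would expose (the function does not exist in promql; names and values here are purely illustrative):

```go
package main

import "fmt"

// rangeCovered returns the fraction of the range spanned by the observed
// samples - the inverse of the extrapolation factor - so that dividing a
// non-extrapolated rate by it reconstructs the extrapolated value.
func rangeCovered(firstSample, lastSample, rangeStart, rangeEnd float64) float64 {
	return (lastSample - firstSample) / (rangeEnd - rangeStart)
}

func main() {
	// 15s-spaced samples at 15s..60s inside a [0s, 60s] range.
	coverage := rangeCovered(15, 60, 0, 60) // 0.75
	lowerBoundRate := 2475.0 / 60           // 41.25/s without extrapolation
	fmt.Printf("coverage %.2f, reconstructed rate %.1f/s\n", coverage, lowerBoundRate/coverage)
}
```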

@beorn7
Member

beorn7 commented Oct 18, 2015

A thought that just occurred to me is that we could add a range_covered function that'd return the inverse of the intervalCorrection. Then you could get the current behaviour with rate(x[5m]) / range_covered(x[5m]) and you could do whatever filtering you like too.

In contrast to the idea to go through rules to allow avg_over_time(irate(foo[2m])[2m]), this would at least allow ad-hoc exploration. It's still pretty clunky for something I do a couple of times per day. (And after all the arguments, I'm still convinced I will continue to use the present rate over the lower-bound rate in most cases.)

@beorn7
Member

beorn7 commented Oct 18, 2015

For simplicity, I'll call the proposed lower-bound rate function LBR in the following.

What (if anything) we call rate needs to work in the general case, and be generally comprehensible. A consistent predictable underestimate meets that criteria better than the current behaviour.

I disagree on almost every single point here:

  • The LBR doesn't work in the general case because it generally doesn't return the rate. (But a lower bound of it. It only works in general as just that – a lower-bound rate. No problem if we call it that way.) But we have said that many times already, so probably no point discussing it further.
  • The LBR is far less generally comprehensible than the existing rate or my suggestion for an improved version. But since there is no analytic proof of the degree of general comprehensibility, it's moot to discuss this point even more.
  • The underestimation of the LBR is not easily predictable. It depends on the alignment of range and scrape times. That has been discussed before.

What I'm proposing is the strongest possible lower bound. This has the same common and worst cases, which are okay overall in terms of expected value.

As said many times before, this is understood. It's just not acceptable to call it rate, and it's also not acceptable to take away a (semi-)viable rate function from the user.

What you're proposing is something that can return anywhere between an empty vector and an overestimate. This has bad common (target churn is common) and terrible worst cases (sometimes really common, and the #581 case).

A lot of target churn can happen if you take a very long range. However, for very long ranges, the LBR is fine. That's a very good use-case for it and a perfect solution. However, I'm missing a good solution for the middle area of ranges with ~5 samples. Too many to replace with irate, too few to ignore the LBR underestimation.

I agree that the worst cases are still very bad with my suggestion. This affects mostly long ranges. I'm not happy with that and still thinking how to improve it...

I believe that we need to avoid having to make users care about edge cases, and the smaller the amount of documentation we need to provide the better. "Is a strict lower bound, will be typically out by a constant factor" is easier to explain than [...]

I strongly disagree that this explanation is sufficient. It's also not true as the factor is not constant but depends on alignment. You also have to include an explanation how the underestimation factor depends on the number of samples in the range.

It mitigates it by breaking in a different way in some cases. It's papering over a symptom rather than attacking the core problem. Making it tunable is getting into epicycles, and hints that the core idea isn't right.

However, with LBR we do not solve the core problem; we simply solve a different problem. If the core problem is unsolvable, then fine, tell the user to ask different questions. I would like to try a bit harder to solve the core problem before doing that.

Consider sum_over_time for example, for which your rate argument also applies to and for which there's an argument to making it extrapolate.

I don't see that argument. The _over_time functions are fundamentally concerned with taking every single sample in a range and doing something with it. rate and friends are fundamentally concerned with taking the first and last sample in a range and doing something with the difference (modulo counter-reset detection, but that's different from the former). The rate results get weirder the farther the first and last samples are from the actual beginning and end of the range. That's not a concern of the _over_time functions, so extrapolation is not required to fix the result.

@brian-brazil
Contributor Author

For simplicity, I'll call the proposed lower-bound rate function LBR in the following.

Makes sense. LUBR = linear upper bound rate, our current behaviour.

The LBR is far less generally comprehensible than the existing rate or my suggestion for an improved version. But since there is no analytic proof of the degree of general comprehensibility, it's moot to discuss this point even more.

I measure comprehensibility in terms of the number of ideas the user needs to understand to use a feature properly.

I'd consider the LBR strictly more comprehensible than the LUBR, as it doesn't have the #583 edge case that the LUBR does. The average cases are about the same difficulty to explain (strongest lower bound vs linear upper bound).

Your proposal adds a corner case to the LUBR, so it's less comprehensible than the LUBR.

So I see it as roughly 1 vs 2 vs 3 ideas, plus the other 4ish ideas that are common across them (per-second rate, accumulating counter, calculated on server, don't need to maintain state across restarts/counter reset handling).

The underestimation of the LBR is not easily predictable. It depends on the alignment of range and scrape times. That has been discussed before.

LBR is strictly more predictable than the LUBR, as it only depends on the range and scrape times. The LUBR also depends on those to determine the bounds, then on the data and its context to get the true value.

it's also not acceptable to take away a (semi-)viable rate function from the user.

It's the semi bit where I take issue. Anything with an unqualified name like rate needs to work reasonably well across all expected cases.

However, I'm missing a good solution for the middle area of ranges with ~5 samples. Too many to replace with irate, too few to ignore the LBR underestimation.

What task are you trying to perform with those 5 samples? I can only see it coming up in alerting, and it doesn't matter much there as you're usually dealing with ratios which will cancel out any error.

I strongly disagree that this explanation is sufficient. It's also not true as the factor is not constant but depends on alignment. You also have to include an explanation how the underestimation factor depends on the number of samples in the range.

I tried to keep all the explanations concise, so they're missing the full details.

By constant I mean that it's a constant relative to the Prometheus configuration, that is, it doesn't depend on the data. The underestimation amount doesn't depend on the number of samples in the range; you need to allow for at most a scrape interval worth of data around the samples you have.

I don't see that argument. The _over_time functions are fundamentally concerned with taking every single sample in a range and do something with it. rate and friends are fundamentally concerned with taking the first and last sample in a range and do something with the difference (modulo counter-reset detection, but that's different from the former). The rate results are getting the weirder the farther away from the actual begin and end of the range the first and last samples are. That's not a concern of _over_time function, so extrapolation is not required to fix the result.

If rate and friends are going to try and fill in data that's missing from the range, then there's a reasonable argument that all other range functions should do the same for the same reason - that the result will be closer to the true expected value, rather than a lower bound.

I don't like this argument for the same reasons I don't like rate extrapolating, but it is a valid argument and the logical extension of the current rate logic and associated arguments.

@beorn7
Member

beorn7 commented Oct 19, 2015

By constant I mean that it's a constant relative to the Prometheus configuration, that is, it doesn't depend on the data. The underestimation amount doesn't depend on the number of samples in the range; you need to allow for at most a scrape interval worth of data around the samples you have.

That has been proven wrong above. I don't know how to demonstrate it more clearly than it has been done before.

If rate and friends are going to try and fill in data that's missing from the range, then there's a reasonable argument that all other range functions should do the same for the same reason - that the result will be closer to the true expected value, rather than a lower bound.

I tried to explain why I don't think that there is a reasonable argument of that kind. I don't know how to explain it more clearly.

@beorn7
Member

beorn7 commented Oct 19, 2015

To get action items back on track:

I'm definitely in favor of offering functions as implemented in the PR; I just would like a naming scheme that calls out the lower-bound behavior (e.g. lb_rate, lb_delta, ... but open to suggestions). That could be added and released as 0.16.2 right now, as it doesn't break anything existing. (In that way, we could also get some feedback from a wider audience.)

I have an idea how to improve the extrapolation behavior we have currently implemented, which I have to flesh out. We have to see if we can improve rate and friends sufficiently to let it keep its name. I agree that in the current state of things, it has way too many misleading edge cases to be called a rate.

@brian-brazil
Contributor Author

That has been proven wrong above. I don't know how to demonstrate it more clearly than it has been done before.

I don't see anything demonstrating that, and I can't see a way the underestimate can exceed one scrape's worth of data. I just ran through the numbers on the 210s example to be sure (I presume that's what you're referring to; if not, please indicate otherwise) and on average it's out by a scrape's worth of data.

I tried to explain why I don't think that there is a reasonable argument of that kind. I don't know how to explain it more clearly.

It comes down to what you expect these functions to do, and how they fit in the taxonomy of our functions as our users see them and how they're used, rather than how we see them as implementers. If you were used to the _over_time functions extrapolating you'd probably see the argument as you'd be used to thinking about them in that way. In practice those functions aren't much use until we figure out staleness.

Presume we've handled staleness and pgw metrics are timestamped. Consider sum_over_time on top of timestamped data coming from the pgw to calculate how many records were processed by a series of batch job runs over some period. That's exactly the same use case as increase on a summary's sum but with a push per run rather than continuous pull. And as you put it, it'll get weirder the further the pushes are from the beginning and end of the period. If you think that rate should extrapolate, then you should also think that sum_over_time should extrapolate for that case. Conversely if you don't think that sum_over_time should extrapolate in that case, then neither should rate.

Designing a language to be consistent is non-trivial, things interact in unexpected ways.

count_over_time and sum_over_time are strongest lower bound functions. min_over_time is a strongest upper bound, max_over_time is a strongest lower bound. This doesn't matter for avg_over_time as it cancels, the same as for a ratio of rates. For consistency, whatever we do extrapolation-wise for rate/increase/delta we should do for the _over_time functions too, as they're all unqualified range functions and in addition there are isomorphic use cases.

(e.g. lb_rate, lb_delta, ... but open to suggestions)

I'd like to generally avoid abbreviations; users should have a good idea what a function does from just the name. The only reason irate isn't instant_rate is that it is going to be the most commonly used function in promql, and the longer name would have scared off too many users compared to its usefulness.

I have an idea how to improve the extrapolation behavior we have currently implemented, which I have to flesh out.

I had an idea to look back beyond the start of range, but that completely changes the semantics of ranges and gets very epicycley when you work through all the implications. We'd need to make sure that both made logical sense and worked in practice.

What task are you trying to perform with those 5 samples? I can only see it coming up in alerting, and it doesn't matter much there as you're usually dealing with ratios which will cancel out any error.

You never answered this question. If we can come up with another way to do what you need then that gives us more options.

@brian-brazil
Contributor Author

Hmm, rather than filtering results what if we turned extrapolation on/off based on the coverage?

@beorn7
Member

beorn7 commented Oct 21, 2015

Hmm, rather than filtering results what if we turned extrapolation on/off based on the coverage?

Yes, that goes in the direction I'm fleshing out at the moment. I'm just very busy with work right now. I'll write something more elaborate ASAP.

@beorn7
Member

beorn7 commented Oct 21, 2015

You wanted an explicit answer to: “What task are you trying to perform with those 5 samples? I can only see it coming up in alerting, and it doesn't matter much there as you're usually dealing with ratios which will cancel out any error.“

The question I (and many of my coworkers) ask multiple times each day is: “How many QPS are we serving now?“ With a scrape interval typically between 15s and 30s, irate is too spiky, but I also don't want an average over 1h or so. 2min to 5min is the sweet spot here, and then we are in the domain of ~5 samples that is problematic for LBR.

Luckily, we are also in a domain where the brokenness of rate isn't that impactful. It becomes worse if the range gets larger, but the larger the range the less problematic LBR becomes. That's why I think having both variants available is useful.

Ideally, I want a rate that just works.

I don't want to have complicated syntax to answer my straightforward question above, especially not if it just reconstructs the semi-broken behavior rate has right now. The avg_over_time(irate(foo[2m])[2m]) approach, ignoring that it is not valid promql at the moment, is BTW doing essentially the same extrapolation as rate now and running into the same problems with time series appearing and disappearing within the range. Contrived example: Multiple targets, sum(avg_over_time(irate(foo[2m])[2m])), and we do a switch-over in a blue-green deployment just in the middle of the 2m. irate will happily return both the old and the new targets, although they were serving at different times. The sum will overshoot by a factor of 2. (And if you use irate with very long ranges, you can construct exactly the same horrible failure cases as with rate. Obviously, irate isn't meant to be used with long ranges, but to be fool-proof, we should probably mention it in the docs.)

So let's take a note for later, when we discuss naming: irate is extrapolating in very much the same way as the current rate but doesn't call it out in its name.

Let's now talk about ratios:

To not go abstract with formulas, I'll take a concrete example.

Going back to the case with the range of 210s and a 60s scrape interval: if I sum up multiple LBRs, half of them will be underestimated by 14% and half of them by 43%. Let's say I calculate an average latency from latency_seconds_count and latency_seconds_sum of a summary or histogram, and that I have two targets: one has a “true” rate of 200 for the sum and 100 for the count, the other one a “true” rate of 150 for the sum and 50 for the count. The “true” average latency is (200+150)/(100+50) = 2.33.

Now let's have the first target be underestimated by 43% and the second be underestimated by 14% (due to different scrape timings). The ratio becomes (114+129)/(57+43) = 2.43.

Funnily enough, if the underestimations happen to be the other way round, we get (171+86)/(86+29) ≈ 2.25.

So what originally was a guaranteed lower bound, turns into something that might be too high or too low, depending on scraping alignment, if you take a ratio.

Not really surprising, thinking about it. The mnemonic rhyme I learned in school was: „Aus Summen kürzen nur die Dummen.“ (roughly: “only fools cancel out of sums”). See here for a discussion of how to translate it. :)
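
A small self-contained check of that ratio arithmetic, using the coverage factors 4/7 (43% under) and 6/7 (14% under) from the 210s/60s example (not promql, just the arithmetic):

```go
package main

import "fmt"

// latencyRatio computes sum(rate(..._sum)) / sum(rate(..._count)) for the two
// targets above, with each target's rates scaled by its coverage factor.
func latencyRatio(coverageA, coverageB float64) float64 {
	return (200*coverageA + 150*coverageB) / (100*coverageA + 50*coverageB)
}

func main() {
	fmt.Printf("true ratio:      %.2f\n", latencyRatio(1, 1))            // 2.33
	fmt.Printf("A -43%%, B -14%%:  %.2f\n", latencyRatio(4.0/7, 6.0/7))  // 2.43
	fmt.Printf("A -14%%, B -43%%:  %.2f\n", latencyRatio(6.0/7, 4.0/7))  // 2.25
}
```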

@beorn7
Member

beorn7 commented Oct 21, 2015

Presume we've handled staleness and pgw metrics are timestamped. Consider sum_over_time on top of timestamped data coming from the pgw to calculate how many records were processed by a series of batch job runs over some period. That's exactly the same use case as increase on a summary's sum but with a push per run rather than continuous pull. And as you put it, it'll get weirder the further the pushes are from the beginning and end of the period. If you think that rate should extrapolate, then you should also think that sum_over_time should extrapolate for that case. Conversely if you don't think that sum_over_time should extrapolate in that case, then neither should rate.

This use-case fundamentally violates Prometheus semantics. That's what we are telling everybody who tries to set it up that way. I don't think we should optimize our query language for fringe use-cases for which Prometheus was not meant to be used.

I still see a fundamental difference between the ..._over_time functions (which precisely process every single value in the range) and rate & friends (which are concerned with the values at the beginning and the end of a range and therefore necessarily need to return an estimate rather than an exact value).

@beorn7
Member

beorn7 commented Oct 21, 2015

OK, finally my 2nd try to fix rate so that it does what people usually expect it to do.

I'll use LBR for the rate implementation in this PR, and LUBR for the existing rate implementation. (But keep in mind that LUBR can still _under_estimate the true rate. It's only the upper bound under the assumption that the counter increases linearly.)

I'll start from fundamentals once more, sorry for that...

1. When are both LUBR and LBR exactly right?

If the first sample in the range is exactly at the beginning of the range and the last sample in the range is exactly at the end of the range. This case is unfortunately very unlikely. But keep in mind that both implementations approach the true rate the closer we get to this ideal situation.

2. When is LUBR right and LBR wrong?

LUBR is exactly right if the series starts before the range, continues after the range, and the counter increases linearly. Again, this will never happen exactly in practice. But at least LUBR will estimate the correct value if the series starts before the range and continues after the range. Since this is in general the most frequent case, I argued that LUBR is correct in the sense that it yields the expectation value in general.

LBR systematically underestimates in this case. The exact value depends on the scrape interval and the alignment. The maximum amount of underestimation can be calculated from the scrape interval, but by design, the scrape interval may vary (jitter, missed scrapes), so it's fundamentally difficult to reason about the maximum possible underestimation and even more so about the underestimation in a concrete case.

3. When is LBR right and LUBR wrong?

In the case where a series starts within the range and ends within the range, LBR is exactly right and LUBR is off by an amount that depends on the ratio between covered range and total range, which can be a lot.

Obviously, the larger a range becomes, the more likely we end up in this situation. Things go totally haywire if target churn happens. The error of LBR multiplies if a number of time series appear and disappear again within the range one after another, which again is more likely with larger ranges. Indeed, the problems users reported with LUBR were usually related to a sum over a very long-ranged rate.

4. What's bad for both LBR and LUBR?

Unfortunately, two relatively common cases haven't been touched yet: A time series exists before the range begins but ends within the range. Or a time series starts within the range and continues beyond the end of the range. In those cases, we want “left-sided“ or “right-sided“ extrapolation only. Computationally, that's not a problem.

5. The solution

Fundamentally, the farther away we get from case (1), the more we want to find out if the time series has existed before the range or will exist after the range, to determine if we are in (2), (3), or (4). The naive approach would be to scan backwards and forwards to see if there are samples in the same time series within a reasonable duration before and after the range. While in principle possible for the past, it is only possible for the future if we have a range that ends far enough in the past. In most cases (in particular rule evaluation), the range ends at the present, so we can't know if the time series will continue. Since we really don't want to change the evaluation depending on the evaluation time being the present or more than one scrape interval into the past, we should restrict ourselves to only looking at samples within the range. (Doing it only for the past, but not for the future, would be inconsistent. Perhaps worth a thought, but I sense danger…)

So we could do the following:

  1. Calculate the average scrape interval from the samples within the range (covered_range / number_of_samples), call it t. We also have a constant factor x (see below).
  2. If the first sample in the range is closer to the beginning of the range than t * x, assume the series started before the range and do left-sided extrapolation.
  3. If the last sample in the range is closer to the end of the range than t * x, assume the series continues beyond the range and do right-sided extrapolation.

First concern: How to pick x? In an ideal world with constant scrape intervals, it would be 1. However, in practice, scrape intervals have jitter, and we might even miss a scrape completely. To take jitter into account, we can add some ε (e.g. 0.1 for 10% jitter). To still extrapolate if a scrape at the border is lost, we would need an x = 2+ε. That would however apply extrapolation way more often than necessary. Currently, I expect x = 1+ε to yield the best results.

Second concern: We'll get funny jumps between extrapolation and non-extrapolation if we are hovering around an edge case.

Overall, I think it's worth playing with it. I believe it will yield the expected results in most cases.
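
A minimal Go sketch of the scheme described above (an illustration of the idea only, not the eventual implementation; x = 1+ε and the boundary handling follow the proposal):

```go
package main

import "fmt"

type sample struct{ t, v float64 } // timestamp in seconds, counter value

// conditionalRate extrapolates to a range boundary only when the gap between
// that boundary and the nearest sample is smaller than x times the average
// scrape interval observed inside the range. Counter resets are ignored here.
func conditionalRate(s []sample, rangeStart, rangeEnd float64) float64 {
	n := len(s)
	delta := s[n-1].v - s[0].v
	covered := s[n-1].t - s[0].t
	avgInterval := covered / float64(n-1)
	const x = 1.1 // 1+ε, allowing for ~10% scrape jitter

	start, end := s[0].t, s[n-1].t
	if s[0].t-rangeStart < x*avgInterval {
		start = rangeStart // left-sided extrapolation: the series likely existed before the range
	}
	if rangeEnd-s[n-1].t < x*avgInterval {
		end = rangeEnd // right-sided extrapolation: the series likely continues after the range
	}
	return delta * (end - start) / covered / (rangeEnd - rangeStart)
}

func main() {
	// 15s scrapes inside a 60s range; both boundary gaps are below 1.1*15s,
	// so we extrapolate on both sides and recover the true 55/s.
	s := []sample{{5, 0}, {20, 825}, {35, 1650}, {50, 2475}}
	fmt.Printf("%.1f/s\n", conditionalRate(s, 0, 60))
}
```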

(FTR: I still think we should offer a lower-bound version of rate & friends, which I expect more often to be used for increase and such, for cases where you explicitly don't want any extrapolation.)

@brian-brazil
Contributor Author

2min to 5min is the sweet spot here, and then we are in the domain of ~5 samples that is problematic for LBR.

That makes sense.

Contrived example: Multiple targets, sum(avg_over_time(irate(foo[2m])[2m])), and we do a switch-over in a blue-green deployment just in the middle of the 2m. irate will happily return both the old and the new targets, although they were serving at different times. The sum will overshoot by a factor of 2.

I put this down to #398. sum(process_resident_memory_bytes) has the same behaviour.

So let's take a note for later, when we discuss naming: irate is extrapolating in very much the same way as the current rate but doesn't call it out in its name.

I don't see this as extrapolation, just as I wouldn't consider sum to be doing extrapolation when faced with the same data. It's a problem arising from the staleness and computational models, not the functions themselves.

So what originally was a guaranteed lower bound, turns into something that might be too high or too low, depending on scraping alignment, if you take a ratio.

Ratios should only be taken on aligned data, as no matter which rate function you use, you're going to have races causing weirdness otherwise. Wanting to take a ratio of rates from two distinct and thus unaligned sets of targets is not a common case (I've seen it come up in provisioning, where I knew it was a little dodgy; it was based on a sum across many targets and had enough other noise that it worked out fine).

Not really surprising, thinking about it. The mnemonic rhyme I learned in school was: „Aus Summen kürzen nur die Dummen.“ See here for a discussion how to translate it. :)

Sounds vaguely like Simpson's paradox.

This use-case is fundamentally violating Prometheus semantics. That's what we are telling everybody who tries to set it up that way.

I think we're talking about different things. This is about batch jobs, not daemons. This isn't a case of trying to make prometheus do push where you should be using pull.

We had one person in IRC asking for this. I don't see any fundamental problems with this use case, though it's obviously not how Prometheus is mainly used. We haven't ever really talked about how counters inside ephemeral jobs should work when the jobs are too brief to be caught sufficiently by polling. sum_over_time with timestamps was my first thought. I have another idea, but let's not get further sidetracked.

The error of LBR multiplies if a number of time series appear

I think you typoed LUBR here.

(Doing it only for the past, but not for the future would be inconsistent. Perhaps worth a thought, but I sense danger…)

Yeah, that's related to one of the reasons interpolation was removed.

So we could do the following

This seems like a viable approach to me.

Calculate the average scrape interval from the samples within the range (covered_range / number_of_samples), call it t.

It may be better to calculate the differences between all consecutive samples (i.e. the individual scrape intervals) and then take the median. That way outliers due to missing scrapes won't be as problematic.

Currently, I expect x = 1+ε yielding best results.
Second concern: We'll get funny jumps between extrapolation and non-extrapolation if we are hovering around an edge case.
Overall, I think it's worth playing with it. I believe it will yield the expected results in most cases.

That feels right to me: use something like 10%, or a higher quantile than the median, and then see how it goes.

We'll need to do some math to make sure we understand the edge cases, and that they are acceptable, but I think it'll work out.

(FTR: I still think we should offer a lower-bound version of rate & friends, which I expect more often to be used for increase and such, for cases where you explicitly don't want any extrapolation.)

If we can get the main rate working sufficiently well, then increase will cover those cases well enough too. More functions and variants thereof mean more confusion for users as to which to use; two rates is more than enough already, so I'd prefer not to add more unless we have to.

@beorn7
Member

beorn7 commented Oct 23, 2015

Still distracted by non-Prometheus work, so only quick remarks:

Working staleness detection would indeed solve many of the problems here as we could know that a time series has ended and act accordingly.

The ratio from my example is aligned indeed. The problem happens before, when the LBR is taken on differently aligned targets (and then summed, and then divided). sum(rate(foo_sum[5m])) / sum(rate(foo_count[5m])) is a very common thing to do in Prometheus and completely by the book.

Yes, I typoed LUBR->LBR above.

I like the idea of taking the median of the scrape intervals within the range. It's computationally more expensive but probably worth the effort. However, I think we should take 1.1*median as threshold, not a higher quantile. The latter is very noisy if we have only a handful of scrape intervals within the range.

Another thought that crossed my mind: If a series starts within the range (or, respectively, ends within the range), the expectation value for the time when the series actually started (or ended) is half a scrape interval before (after) the first (last) scrape. So we can improve the estimation by extrapolating half the median scrape interval to the left (right) if the first (last) sample within the range is more than 1.1*median after (before) the start (end) of the range. If it's closer, we do the full extrapolation (on the respective side).

So how to proceed from here? Implement the suggested modified extrapolation and play with it?

@brian-brazil
Contributor Author

So how to proceed from here? Implement the suggested modified extrapolation and play with it?

That seems best, hopefully I'll have some time next week.

@brian-brazil
Contributor Author

Superseded by #1245
