promql: Limit extrapolation of delta/rate/increase #1245
I'm moving the fundamental discussion to the top level here so it doesn't drown in outdated commits.
I don't think that median is more complicated than average. (The fact that people confuse the two doesn't render one more complicated than the other.) Median is clearly the better heuristic given the nature of Prometheus data. The only point in the whole discussion so far I'd accept as valid is the concern about computational cost. However, as said, I'd be surprised if that turned out to be true.
About "half-median" extrapolation vs. "no extrapolation": What bothers me most with our general solution is that we have that sudden switch-over between "full extrapolation" and "reduced extrapolation" (may it be none at all or "half median"). Let's say we have a range that includes only three or four samples. Now one sample is missed. If that missed sample is at the end of the range, we'll suddenly switch to "no extrapolation" for a while. Same happens when the missed sample reaches the beginning of the interval. There will be jumps in the rate graph. With my "half median" suggestion, those jumps will at least be approx. half as big. That's one part of the deal. The other part is, that in addition the half-median extrapolation is in general the best heuristics for the start and end of a series. The only case where we get a systematic artifact is the very special one where somebody managed to coincide the start or end of a series with the scrape time. The artifact, however, will be of very limited size. Nothing like the weird overshoots users have run into. Even with the switch to "no extrapolation", we wouldn't fully accommodate the "I want no extrapolation" case as we would (wrongfully) extrapolate once the carefully coincided samples get close enough to the limits of the range.
Conclusion: Let's have non-extrapolating functions (or let's say at least a
About the explanations: The naive user (presumably most of them) wants to hear "Because of the sampled nature of the Prometheus time series data, we have to extrapolate the calculated value the best we can. If you want to know details, click here." In your case, we have to explain "Because of the sampled nature of the Prometheus time series data, we have to extrapolate the calculated value the best we can. Except that sometimes we don't, to accommodate a special use-case that's probably not the one you just ran into. If you want to know details, click here." The more interested users won't understand why we sometimes don't do the optimal extrapolation but apply assumptions (alignment of sample times with start and end of a series) that are in general not true.
It's not your or my opinion on statistics that matters, it's that of our users. That will include people who left high school early during the .com boom, and people who never had a chance to study statistics, for example in Ireland, where statistics is only an optional part of higher-level maths.
Even though median is the most appropriate statistical choice, that doesn't change the fact that for many of our users it'll be a brand new mathematical concept they've never had to properly deal with before.
As soon as our technical choices impact what users can reasonably run into, we need to consider the human factor too. This wouldn't be a concern in, say, the storage layer.
That bothers me too.
With my usual example (a counter that starts at 0, then increments to 1, then holds), this results in values that are further away from the true value (to the extent that it's no longer possible to return the true value), so this would make #581 worse compared to what this PR does.
That's not the only case where it's a problem, it's also a problem where the series isn't very linear such as above. Supporting non-linear series is critical, and whatever we call
I don't think delta is used often enough to justify variants, I'm not sure anyone is even using it.
I'm not sure that's a fair comparison; there are special cases on all sides.
We've seen them in the event/push-based requests. When things are batchy and have exact timestamps it'd be incorrect to extrapolate beyond them.
It does if there's an even number of samples ;)
Strictly speaking, all values that are larger than exactly half of the sample values are valid values for the median.
About the real discussion:
I'll come back to the use-cases tomorrow and will work through the effects the various schemes have on the results.
Two general statements, though:
(1) We have consistently advertised that pushing with an exact timestamp is not the use-case Prometheus is built for. Should we end up in a situation of a trade-off between proper support for the use-case Prometheus was built for vs. proper support for a use-case Prometheus was not built for, I will vote for the former.
(2) Picking our techniques based on the school curriculum in certain cultures seems outrageously wrong to me. I for one will pick the technique that is best suited to solve our problem.
It depends on how you define the problem. If the problem is "how do I estimate the sampling interval of a series" then it's silly, if the problem is "how do I build a useful monitoring system that's widely deployed" then cultural norms are really important.
There's no point in having the right answer if it does more harm than good overall.
Given that we have histograms, arguing that a user who wants to do monitoring cannot be expected to understand the median doesn't make sense.
I feel tempted now to discuss the fundamental incentives of an open source project in general and Prometheus in particular, but let's not do that in a PR.
Let's instead come up with a purely technical decision between median and average. It will then be a separate decision to deliberately go down the technically inferior way for non-technical reasons. (I expect to be strongly against it, should it come up, but as usual, I'm willing to "disagree but commit" if you other Prometheans think it's the better decision.)
"Half-extrapolation" vs. "no extrapolation" appears to me to be an orthogonal decision to be discussed separately.
Let's first get the median vs. average thing settled, on a purely technical basis.
Technically, median is computationally more expensive (by an amount that needs to be determined but presumably just a fraction of the cost of the rate calculation), which has to be compensated by advantages of some kind. If we take the "almost regular sampling intervals" as the normal case, median seems vastly superior to me. But @brian-brazil brought up the case of pushing with fixed time stamps. In that scenario, sampling intervals are arbitrary and not necessarily regular anymore. To support that, average would make more sense than median. If we want to support that case at all, I'd be willing to go for average. But do we? For comparison, the case where average hurts most is if a scrape is lost close to the start or end of a time series so that the average sampling interval is artificially inflated and the start/end of the time series is identified later than necessary. What happens more often? Non-regular timestamps set by the client or lost samples close to the start or end of a series? I would have thought the latter, but I could be convinced otherwise, especially once we provide a bulk-import API (as more users will use Prometheus as a general-purpose TSDB – which appears intriguing and dangerous to me at the same time...).
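To illustrate the difference between the two heuristics, here is a minimal sketch (assuming millisecond timestamps and at least two samples; the names are mine, not promql's):

```go
package main

import (
	"fmt"
	"sort"
)

// intervalStats returns the average and median gap between consecutive
// sample timestamps (in milliseconds). With regular scraping and one
// missed scrape, the median stays at the true interval while the
// average is inflated by the large gap.
func intervalStats(ts []int64) (average, median float64) {
	gaps := make([]float64, 0, len(ts)-1)
	var sum float64
	for i := 1; i < len(ts); i++ {
		g := float64(ts[i] - ts[i-1])
		gaps = append(gaps, g)
		sum += g
	}
	sort.Float64s(gaps) // the sort is the extra cost of the median
	mid := len(gaps) / 2
	if len(gaps)%2 == 0 {
		median = (gaps[mid-1] + gaps[mid]) / 2
	} else {
		median = gaps[mid]
	}
	return sum / float64(len(gaps)), median
}

func main() {
	// 1s scrapes with one missed scrape (the 2s gap).
	ts := []int64{0, 1000, 2000, 4000, 5000, 6000}
	average, median := intervalStats(ts)
	fmt.Println(average, median) // 1200 1000
}
```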
Unfortunately we don't have a reasonably accurate rate. If we could come up with something that worked reasonably in all common scenarios then we could go to town complexity wise.
Everything proposed thus far has enough oddness in behaviour that it's going to cause confusion for a good chunk of users and support problems for us, so we need to think how we're going to deal with that.
What I'm thinking of is a batch job. The start time should be regular, the completion time will vary though.
The question for me is which causes the more problematic artifacts and will thus cause more confusion/frustration. Our fundamental challenge is that we can't distinguish between a missed scrape and a timeseries appearing/disappearing, and on top of that we expect some natural variation in scrape intervals.
If it's a lost scrape, then having a longer period of extrapolation is desirable as we want to extrapolate up to when things get working again. If it's a timeseries appearing/disappearing then that's not what we want to do.
I think it'd be interesting to run all this through #581 type use cases, and a rolling restart of a service that changed all the instance labels to see how things look with a variety of sampling periods.
Obviously, we are on the same page here. The actual point is that I believe with "median/half-median" extrapolation we'll cause less frustration than with "average/no" extrapolation.
That's again not disputed. The real question is whether timeseries usually appear/disappear exactly where a sample happens to be, or at a random point within the sample interval (where I claim it's the latter in the typical "Promethean" use-case, while the former is a rare edge case where people are using Prometheus for something it is not really made for). The real question about "natural variation in scrape intervals" is whether we usually have a regular scrape interval where larger deviations are rare and then usually caused by a missed scrape (for which the median would be the best heuristic), or whether scrape intervals are more or less random (for which the average would be better). I claim the former case is by far the more common one.
I was talking about the following case:
If my range is from 1.8 to 7.8, the median heuristic will correctly identify the start of the series, while the average heuristic will assume the series started before the range.
Currently the extrapolation is always done no matter how much of the range is covered by samples. When a time-series appears or disappears, this can lead to significant artifacts. Instead, only extrapolate if the first/last sample is within 110% of the expected place we'd see the next sample's timestamp. This will still result in artifacts for appearing and disappearing timeseries, but they should be limited to +110% of the true value. Fixes #581
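In code, the cutoff amounts to something like the following sketch (illustrative names, simplified from what the actual diff does):

```go
// extrapolateToStart reports whether the rate calculation should
// extrapolate from the first sample in the range back to the start of
// the range: only if the first sample sits within 110% of the average
// sampling interval from the range start, i.e. the gap is small enough
// to plausibly be just another scrape interval rather than the start
// of the series.
func extrapolateToStart(rangeStart, firstSampleTime, averageInterval float64) bool {
	return firstSampleTime-rangeStart <= 1.1*averageInterval
}
```

The end of the range is checked the same way against the last sample.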
On this point I was thinking a bit last night, and I don't think start/end and a missed scrape are independent. A server that's still initialising or one that's overloaded is more likely to fail scrapes.
I was also pondering whether average might give the better expected value due to this and the chance of a doubly missed scrape (and we're talking about adding 10% on top of it all anyway).
This requires working through with examples methinks.
I'm currently implementing all four options we are discussing. Stay tuned, separate PRs imminent.
While working on it, I realized that counters are never negative by contract, so we can make use of that to never extrapolate below 0, but that's yet another front in the battle.
I had that thought too. Certainly helps for new timeseries, but it's also more complexity,
So let's have the discussion only in this PR. I created the other PRs to play with, but I didn't want to fragment the discussion.
Brian said about the "no extrapolation below zero" approach (#1255):
I don't understand this concern. I mean, yes, it fixes only a special case, but that's exactly what we need. The special case is that a counter (that keeps the contract) cannot be negative, and we are making use of that. How is disappearance or reappearance of a timeseries after a downtime isomorphic? I need more explanation to understand what you mean.
#1245 only gets it right when it doesn't extrapolate. But that's in general the wrong approach. It only happens to be the right approach in this special case. The only reason why we know it's the right approach is that the counter starts at zero. Exactly that knowledge is encoded in #1255 . So #1255 always does the right thing to the best of our knowledge while #1245 in general does the wrong thing and only happens to be right in special cases.
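For concreteness, here is a minimal sketch of the zero-cap idea as I understand #1255 (illustrative names, not the actual patch):

```go
// extrapolatedDelta extends the increase observed between the first
// and last sample to the full range, but never extrapolates a counter
// below zero: if the linear extension to the left would cross zero, it
// only extends back to the point in time where the counter was zero.
func extrapolatedDelta(firstT, lastT, firstV, lastV, rangeStart, rangeEnd float64) float64 {
	delta := lastV - firstV
	slope := delta / (lastT - firstT) // increase per second

	leftExt := firstT - rangeStart
	// firstV/slope is how long ago the extrapolated line was at zero.
	if slope > 0 && firstV/slope < leftExt {
		leftExt = firstV / slope // zero cap
	}
	rightExt := rangeEnd - lastT
	return delta + slope*(leftExt+rightExt)
}
```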
I'd really like to set up a test scenario with time series appearing and disappearing (reconstructing #581) and in particular the case where we have a ~100 instances performing a rolling restart. But I couldn't find time at all. Whoever feels like it, please play with the various PRs. I'll still try to do so ASAP.
has a problem, then
is going to have a similar problem.
Not quite. Each is wrong in the general case, and right in some special cases. The problem is that we don't have any good knowledge to work off, as we have an arbitrary time series that has an infinite number of valid interpretations.
The two examples are fundamentally different. In the first example, we can be sure that the missed scrapes cannot be represented by negative values (because of the contract of a counter), so a linear extrapolation to the left would be wrong. In the second example, we have no knowledge about the values in the missed scrapes. In the absence of other information, we can only assume that the series will continue as it started, i.e. one increase every 6 sampling intervals. It is generally wrong to assume that the one increase we have observed is the only one the series will ever have.
I guess it doesn't make sense to just repeat our contradicting assessments of #581 over and over.
Could you explain in more detail why you think your thought experiment is the essence of #581? For me, that's so obviously not the case that I don't even know where I can start to explain why.
Technically we can't, it's possible that there were missed scrapes which covered a counter reset.
It's also generally wrong to assume that the pattern we see will repeat.
It's the only thing we can assume in the absence of other information. In reality, a counter could increase more or less or not at all, but the null hypothesis is that it goes on as it has gone before. That's the foundation for doing extrapolation at all.
As concluded in the conversation of #581, the essence is that a series appears in the middle (or even close to the end) of a long range or – more generally – has relatively few samples in a relatively long range. The fact that the counter increases slowly or not at all is irrelevant.
Quote (emphasis is mine): "Meaning, rate deals poorly with time series appearing / disappearing completely, or having very infrequent samples (no matter if the sample values stay constant or not) compared to the time window over which we do the rate."
Even if we assume that the essence of #581 is that counters are incremented relatively rarely, your example still doesn't catch it. Your example assumes that the counter never increments again after it has incremented once. If it, in fact, increments rarely, then the only reasonable thing to do is to extrapolate by half a sampling interval.
It is relevant, the artifacts are much worse for slow moving counters. We wouldn't have a problem if the counter didn't increase.
I'm only looking at the first handful of samples of the counter's life, as that's where the biggest problems are. Maybe there's another increment at the next sample and maybe it's in the next hour - but what I think we need to focus on is what happens with rate in the time frame of those first few samples.
I disagree that that's the only reasonable thing to do. If it increments very rarely then it's unlikely there's another increment within the time frame, so not adding half a sampling interval is more likely to be correct.
The problem is that we don't know if the counter increases and if yes how much. The only piece of information is how much it has increased for the samples we see. So we must not throw away that only piece of information.
Luckily, counters start at 0 most of the time. #1255 makes use of that fact in the perfect way (much better than not extrapolating to the left). Problem solved.
We have no information about how often a counter increments. The only piece of information is how much it has increased for the samples we see, may it be rare or not. So we must not throw away that only piece of information by not extrapolating.
We are running in circles once more. I'm at a loss how I can improve my explanation. I'd like to spot a hole in my line of argument, but I can't, and I'm inclined to claim there is a deductive proof that not extrapolating by half a sampling interval is wrong in the general case. If we cannot convince each other, we need an experiment, i.e. somebody has to set up test data and run the different implementations on it. I'm still out of time. I will do so ASAP, but I'm happy to see others conducting the experiments.
Just because we have a piece of information doesn't mean we must use it. The salient question is whether the information overall helps us.
Counters start at 0 but that doesn't tell you if there's been a reset or it's a brand new counter. It may usually be correct, but it's not perfect.
That's clearly false. It's only correct in the special case of a linear time series that starts/stops exactly half a sampling interval away; in all other cases it produces the wrong result. Thus it's incorrect in the general case.
Being right in the general case means being right in all possible cases.
Let me rephrase: The only piece of helpful information is how much it has increased for the samples we see. So we must not throw away that only piece of helpful information.
We are only talking about the case where our heuristic has detected a "start of series". So the problem with a counter reset only happens if it coincides with a (wrongly or correctly identified) start of a series: a highly unlikely event.
We are talking about expectation values. We want to identify the correct expectation value. The expectation value of a roll of a perfect die is 3.5. What you are saying is: "That's clearly wrong. I can never roll a 3.5 on a die. Let's take 1 for our heuristics as this is the lowest value I can roll."
Obviously our calculated rate is strictly wrong in the general case. All of Prometheus is strictly wrong because we are sampling instead of event-logging. (That angry Mesosphere guy at Gophercon is yelling again in my head now… "Sampling is evil.") As we are in Promethean sampling-land anyway, correctness means "correct expectation values".
Sooo.... https://github.com/beorn7/rrsim is the little test program I wrote to simulate rolling restarts. So far, I haven't introduced lost scrapes (or varying scrape intervals) into the scenarios. So in this partial report, there is no difference between the approaches of using the median or the average sampling interval as the heuristic.
The most important result is finding a bug affecting all versions discussed so far. I fixed it in 98c2c36 . Unfortunately, the branch this PR ( #1245 ) is based upon was rebased on top of master (rather than merged). Since all my other implementations were built on top of the original branch, I have done the fix in my own branch https://github.com/prometheus/prometheus/commits/beorn7/rate-average-no, which corresponds to the original branch. (Rebasing considered harmful... ;)
With the fix, I got the nice and expected huge improvement compared to the original rate implementation. The differences between the various suggested solutions are small, as expected, but noticeable. (With the aforementioned exception that median vs. average doesn't matter in the scenarios I have tested so far.)
In all scenarios, the scrape interval is 1s to speed up data collection.
Scenario 1: 20 tasks, 10qps each, rolling restart over 1m, followed by 1m normal run.
The aggregated rate in this scenario should be 200qps, but is in practice slightly less because of implementation details in my test program (199.5). (This could be improved with a little effort but doesn't really matter for what we are trying to find out here.)
The screenshots always show the total aggregate rate over two rolling restarts and then the aggregate rate of a single batch of tasks that got started in the first restart and stopped in the second.
Thanks for running the tests, the data is pretty clear. I expect we'll have to trade off the better results of the below-zero approach for slow data against the not-below-zero approach doing better on rolling restarts.
That makes sense to me intuitively; you're turning off half the extrapolation. It'd be interesting to see what this would do with a 10s scrape interval.
I'd expect that if rolling restarts are okay, then random restarts should also be okay.
Yes, that was my initial thought, too. But then I realized that I'm also turning up the extrapolation in half of the cases: The extrapolation is decreased in cases where the zero point is closer to the first sample than half a scrape interval. But if the zero point is at a distance between half a scrape interval and one scrape interval, the extrapolation is increased compared to the usual "half sample interval" extrapolation. In short, I had expected that the zero cutoff would make the extrapolation more precise but would not change the average outcome... I have to think more about it. Perhaps there is an error in my code...
I think what matters is the ratio between scrape interval and counter increase frequency. So a 10s scrape interval should give us the same result as a 1s scrape interval with 100qps.
Scenario 2: 200 tasks, 0.05qps each, rolling restart over 3m, followed by 3m normal run.
This is the "slow moving" example, where a counter increment happens much less often than a scrape. The challenge here is that the extrapolation will overshoot quite a bit if one of the rare counter increments happens to be within the range, and that even more if the range is short. If no increment is in the range, the extrapolation undershoots (it's assuming no increments at all, ever), but that's the same as no extrapolation at all. Thus, in the case of slowly incrementing counters, avoiding extrapolation appears attractive (as it seemingly minimizes the errors we make), but on average, avoiding extrapolation will underestimate, while the "half interval" extrapolation will yield correct expectation values. This scenario is supposed to test this claim and to find out how big the differences are.
Short results: The practical differences are very small. Below is an example for "half interval" extrapolation without the zero cap (the approach that looked best in the previous scenario), but all the other "advanced" solutions look essentially the same. (You can check the screenshots in the rrsim repo.) I couldn't tell them apart from just looking at them. (The graph's y-axis is from 9.4 to 10.6.)
The conventional rate, of course, fails again:
Very interesting (and good for a laugh) is to look at the rate of a single task with a 60s range. Here we can nicely see how the conventional rate combines the extrapolation overshoot with the inherent error of the slow incrementing counter to some kind of "pointy ears" graph:
The graphs for our "advanced" rate approaches again all look very much the same, and way more reasonable:
In this scenario, it doesn't really matter which approach we take, as long as we don't stick with the old 'rate'.
In different news, I checked my zero-cap implementation with tests, and it looks as expected. https://github.com/prometheus/prometheus/blob/beorn7/rate-letsgocrazy/promql/testdata/legacy.test#L225-L241 even shows very nicely how the left- and right-side extrapolation end up with the same result on average, despite the former being zero-capped. So I still don't understand the strikingly different behavior in scenario 1.
Next, I'll play with jitter in the rate, and then lost scrapes. Perhaps that will enlighten me.
That doesn't sound right to me.
I was thinking more along the lines that the maximum graph granularity is 1s, so as the scrape interval is also 1s we may not be seeing some effects due to that.
Any kind of extrapolation will likely yield "impossible" (i.e. non-integral) results for integral counters. The requirement for integral increases might be one of the use cases for a non-extrapolating version of
Could you be more specific? https://github.com/prometheus/prometheus/blob/beorn7/rate-letsgocrazy/promql/testdata/legacy.test#L225-L241 illustrates quite nicely what I mean. If you think it's wrong, we need reasons.
Graph resolution can be set to fractions of a second. I played with that. Or what do you mean by "graph granularity"?
I'll post the reports for the other scenarios ASAP, but currently, I'm on holidays with little spare time. So probably not before the 6th.
I'd only ever expect the zero approach to have less extrapolation compared to the non-zero approach, so a situation where it increases the amount of extrapolation sounds odd to me.
That's what I meant, I had thought
Back from my travels. Will be able to report more results soon.
But first another take on explaining my expectation:
Both prerequisites are given in my first two scenarios.
With the prerequisites, if we detect the beginning of a series, we are always right, because a gap of more than 1.1·Δt between the start of the range t0 and the time of the first sample in the range t1 is a sufficient condition for the start of a series.

The true start of the series is somewhere between t1 − Δt and t1, uniformly distributed. The expectation value for the true start of the series is therefore (t1 − Δt + t1)/2 = t1 − ½·Δt, which is where the "half interval" extrapolation is coming from.

Hence, the easy proof is that the "half interval" approach extrapolates to the expectation value of the true start of the series, while the "zero approach" extrapolates to the true start of the series. The expectation value of the rate based on the true start of the series is equal to the rate based on the expectation value of the true start of the series. q.e.d.

To illustrate it a bit more: With the "half interval" approach, we are overestimating the rate in half of the cases (if the true start is after t1 − ½·Δt), and we are underestimating the rate in the other half of the cases (if the true start is before t1 − ½·Δt). In the former case, the "zero approach" yields a lower rate (in fact the true rate). In the latter case, the "zero approach" yields a higher rate (in fact the true rate again). On average, both approaches yield the same rate.
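The same argument in formulas (a sketch, with s denoting the unknown true start of the series, assumed uniformly distributed on [t1 − Δt, t1]):

```latex
\mathbb{E}[s] = \frac{(t_1 - \Delta t) + t_1}{2} = t_1 - \tfrac{1}{2}\Delta t
```

and since the extrapolated rate is a linear function of the assumed series start, taking the expectation commutes with the rate calculation:

```latex
\mathbb{E}[\mathrm{rate}(s)] = \mathrm{rate}(\mathbb{E}[s]) = \mathrm{rate}\!\left(t_1 - \tfrac{1}{2}\Delta t\right)
```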
Yeah, definitely something we should fix. Different construction site, though... (direct translation of a German expression, sorry for that... ;)
Scenario 3: QPS jitter.
I ran multiple scenarios. The screenshots below were done with 20 tasks, 10qps each, 1m rolling restart time and a jitter of 0.5 (i.e. the waiting time between counter increments was varied by a normal-distributed random value with σ/μ=0.5). The range is 30s.
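For reference, a minimal sketch of how such jittered increments could be generated (my reading of the scenario description; the actual rrsim code may differ):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredWait returns the waiting time until the next counter
// increment: the mean period 1/qps, varied by a normal-distributed
// random value with σ/μ equal to jitter, floored at zero.
func jitteredWait(qps, jitter float64, rng *rand.Rand) time.Duration {
	mean := 1 / qps
	w := mean * (1 + jitter*rng.NormFloat64())
	if w < 0 {
		w = 0
	}
	return time.Duration(w * float64(time.Second))
}

func main() {
	rng := rand.New(rand.NewSource(42))
	for i := 0; i < 3; i++ {
		// For qps=10 and jitter=0.5: around 100ms with σ=50ms.
		fmt.Println(jitteredWait(10, 0.5, rng))
	}
}
```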
Without extrapolation at the start and end of a series, we get the usual undershooting, even a bit stronger than in the equivalent scenario without jitter, which might be attributed to the random noise:
With half-interval extrapolation, the deviations are much smaller, but again larger than in the non-jitter case:
I'm very satisfied that the zero-capped extrapolation looks almost the same:
I have a theory that the weird results from the zero-capping in the first scenario are caused by the alignment of counter increases across tasks, and it's nicely confirmed by the weirdness going away with jittered counter increments.
Scenario 4: Missed scrapes.
I won't include screenshots here. (For reference: https://github.com/beorn7/rrsim/tree/master/screenshots/n20-qps10-jitter0-restart1m-run1m-lost0.2 ) The results are very clear, though, and not surprising once you have thought about it.
A missing sample will affect the calculated rate most if it is at the beginning or end of a range, because it will cause false detection of a series start or end. The errors introduced by using median vs. average are tiny compared to that. I could construct queries that would show the difference, but they would be highly contrived. So while I still believe the median makes more sense conceptually, its higher computational cost compared to the average outweighs the practical benefits.
Based on the results presented, I recommend the zero-capped half-average extrapolation approach.