promql: Limit extrapolation of delta/rate/increase #1245

Closed (1 commit)

Conversation

brian-brazil (Contributor, Author):

Currently the extrapolation is always done no matter
how much of the range is covered by samples.
When a time-series appears or disappears this can lead
to significant artifacts. Instead only extrapolate if
the first/last sample is within 110% of the expected place
we'd see the next sample's timestamp.

This will still result in artifacts for appearing and disappearing
timeseries, but they should be limited to +110% of the true value.

Fixes #581

@beorn7 following discussions on #1161
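
To make the 110% rule concrete, here is a minimal sketch of the boundary check, using plain float64 seconds and illustrative names rather than the PR's actual types:

```go
package main

import "fmt"

// extrapolateToBoundary reports whether rate/increase should extrapolate
// from the first (or last) sample in the range all the way out to the range
// boundary. gapToBoundary is the distance in seconds between that sample and
// the boundary; expectedInterval is the estimated distance to the next
// sample's timestamp. The 1.1 factor is the "within 110%" rule above.
func extrapolateToBoundary(gapToBoundary, expectedInterval float64) bool {
	return gapToBoundary <= 1.1*expectedInterval
}

func main() {
	// With a 10s expected interval, an 8s gap still extrapolates to the
	// boundary, while a 25s gap is treated as the series appearing or
	// disappearing inside the range.
	fmt.Println(extrapolateToBoundary(8, 10))  // true
	fmt.Println(extrapolateToBoundary(25, 10)) // false
}
```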

beorn7 (Member) commented Nov 28, 2015

Cool. Will have a thorough look Monday(-ish).

@@ -193,7 +193,7 @@ eval instant at 50m sum(http_requests) by (job) + min(http_requests) by (job) +
{job="app-server"} 4550
{job="api-server"} 1750


# http_requests{job="app-server", instance="1", group="canary"} 0+80x10
beorn7 (Member):
What's the meaning of this line?

brian-brazil (Contributor, Author):
It was a line to help me verify/understand that the test was acting as expected. Removed.

durationToEnd := rangeEnd.Sub(samples.Values[len(samples.Values)-1].Timestamp).Seconds()

sampledInterval := samples.Values[len(samples.Values)-1].Timestamp.Sub(samples.Values[0].Timestamp).Seconds()
averageDurationBetweenSamples := sampledInterval / float64(len(samples.Values))
beorn7 (Member):
We concluded that the median would be a better heuristic, as the sample intervals tend to be regular and a single outlier should probably be disregarded (as effectively happens with the median) instead of taken into account (as with averaging).

So I'd suggest calculating the median in the for loop above.
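
A minimal sketch of such a median calculation, using plain float64 timestamps in seconds instead of the PR's samples.Values (illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"sort"
)

// medianInterval returns the median of the durations (in seconds) between
// consecutive sample timestamps.
func medianInterval(timestamps []float64) float64 {
	if len(timestamps) < 2 {
		return 0
	}
	intervals := make([]float64, 0, len(timestamps)-1)
	for i := 1; i < len(timestamps); i++ {
		intervals = append(intervals, timestamps[i]-timestamps[i-1])
	}
	sort.Float64s(intervals)
	mid := len(intervals) / 2
	if len(intervals)%2 == 1 {
		return intervals[mid]
	}
	// Even number of intervals: average the two middle values.
	return (intervals[mid-1] + intervals[mid]) / 2
}

func main() {
	// A regular 10s scrape interval with one missed scrape (20s gap):
	// the median stays at 10, while the average is inflated to 12.
	ts := []float64{0, 10, 20, 40, 50, 60}
	fmt.Println(medianInterval(ts)) // 10
}
```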

brian-brazil (Contributor, Author):
I went for simplicity in the first version; reasoning through it, I don't think it'll make much of a difference in practice.

beorn7 (Member):
It's not much more complicated in the code. The concern would be that it's too expensive storage-wise (we'd need to allocate a slice with the length of the number of samples in the range). Ironically, the median would have the most benefit over the average with a small number of samples (~4 or so). I'd go with a median implementation initially and then see whether the resource cost is prohibitive...

brian-brazil (Contributor, Author):
This is going to be one of the more commonly used PromQL functions, so I would worry about cost. I'd prefer to see whether this is good enough, and what other issues this algorithm has, before making it more complicated.

beorn7 (Member):
I'm currently more worried that we have to explain to the users how it changes. I don't want to explain a trade-off of computational cost and why we took an approach we know is sub-optimal because of it. If we are really worried about the cost, we should benchmark it now and then come to a conclusion for good instead of changing the behavior later again.

The biggest jump in cost will happen anyway; it is for long-range functions that didn't need to take counter resets into account so far. Here again, providing an explicitly non-extrapolating function would allow for a low-cost function. With this PR, we always have to go through all samples, even without counter reset awareness.

beorn7 (Member):
More thoughts about the cost:

We have to allocate space for the sample values anyway, so the increase in allocation cost for finding the median will be only a fraction of the current cost. I would be surprised if the cost were significant enough to prevent using the median. I can't be sure, so we either do proper real-life data benchmarking now, or we take the bet and go for median now, reverting if needed. (I'd vote for not releasing without benchmarks, however.)
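
If we want numbers before deciding, a micro-benchmark along these lines could show the extra allocation and sorting cost of the median relative to a plain average (a standalone sketch with hypothetical names, not the PR's code):

```go
package ratebench

import (
	"math/rand"
	"sort"
	"testing"
)

// syntheticIntervals builds n roughly regular sample intervals.
func syntheticIntervals(n int) []float64 {
	r := rand.New(rand.NewSource(42))
	v := make([]float64, n)
	for i := range v {
		v[i] = 10 + r.Float64() // ~10s intervals with a little jitter
	}
	return v
}

func BenchmarkMeanInterval(b *testing.B) {
	iv := syntheticIntervals(300)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0.0
		for _, d := range iv {
			sum += d
		}
		_ = sum / float64(len(iv))
	}
}

func BenchmarkMedianInterval(b *testing.B) {
	iv := syntheticIntervals(300)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tmp := make([]float64, len(iv))
		copy(tmp, iv) // the extra allocation/copy plus sort is the cost in question
		sort.Float64s(tmp)
		_ = tmp[len(tmp)/2]
	}
}
```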

brian-brazil (Contributor, Author):
I'm currently more worried that we have to explain to the users how it changes.

I'm worried about explaining to users too, and the mean is far better for that. More people are going to know what the mean is and be able to calculate it than the median.

I prefer to keep things simple until we know we need to make them complicated.

beorn7 (Member):
On a completely different note: The expression has to be

averageDurationBetweenSamples := sampledInterval / float64(len(samples.Values)-1)

(There is one fewer sample interval in the total sampled interval than the number of samples.)
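
As a quick check: three samples at t = 0s, 10s, 20s span a sampledInterval of 20s but contain only two sample intervals, so the average should be 20/2 = 10s, not 20/3 ≈ 6.7s.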

brian-brazil (Contributor, Author):
Good catch, fixed.

beorn7 (Member) commented Nov 30, 2015

I'm moving the fundamental discussion here to top-level to not make it drown in outdated commits.

I don't think that median is more complicated than average. (The fact that people confuse the two doesn't render the one more complicated than the other.) Median is clearly the better heuristic given the nature of Prometheus data. The only point in the whole discussion so far I'd accept as valid is the concern about computational cost. However, as said, I'd be surprised if that turned out to be true.

About "half-median" extrapolation vs. "no extrapolation": What bothers me most with our general solution is that we have that sudden switch-over between "full extrapolation" and "reduced extrapolation" (may it be none at all or "half median"). Let's say we have a range that includes only three or four samples. Now one sample is missed. If that missed sample is at the end of the range, we'll suddenly switch to "no extrapolation" for a while. Same happens when the missed sample reaches the beginning of the interval. There will be jumps in the rate graph. With my "half median" suggestion, those jumps will at least be approx. half as big. That's one part of the deal. The other part is, that in addition the half-median extrapolation is in general the best heuristics for the start and end of a series. The only case where we get a systematic artifact is the very special one where somebody managed to coincide the start or end of a series with the scrape time. The artifact, however, will be of very limited size. Nothing like the weird overshoots users have run into. Even with the switch to "no extrapolation", we wouldn't fully accommodate the "I want no extrapolation" case as we would (wrongfully) extrapolate once the carefully coincided samples get close enough to the limits of the range.

Conclusion: Let's have non-extrapolating functions (or let's say at least a delta, which might be justified just because of the huge savings in computational cost – a non-extrapolating delta is the only function where you only have to look at the first and last sample in the range) for the special case with correctly aligned start and end values.

About the explanations: The naive user (presumably most of them) wants to hear "Because of the sampled nature of the Prometheus time series data, we have to extrapolate the calculated value the best we can. If you want to know details, click here." In your case, we have to explain "Because of the sampled nature of the Prometheus time series data, we have to extrapolate the calculated value the best we can. Except that sometimes we don't, to accommodate a special use-case that's probably not the one you just ran into. If you want to know details, click here." The more interested users won't understand why we sometimes don't do the optimal extrapolation but apply assumptions (alignment of sample times with the start and end of a series) that are in general not true.
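
To make the two boundary policies under discussion concrete, here is a toy sketch (the 1.1 threshold is taken from this PR's description; illustrative only, not either implementation):

```go
package main

import "fmt"

// boundaryExtrapolation returns how far (in seconds) to extrapolate beyond
// the first or last sample. If the gap to the range boundary is small, both
// policies extrapolate all the way to the boundary; if a series start/end is
// assumed, "half interval" extrapolates by half a sample interval while the
// "no extrapolation" policy stops at the sample itself.
func boundaryExtrapolation(gapToBoundary, interval float64, halfInterval bool) float64 {
	if gapToBoundary <= 1.1*interval {
		return gapToBoundary // sample close enough: extrapolate to the boundary
	}
	if halfInterval {
		return interval / 2 // assumed series start/end: half an interval
	}
	return 0 // assumed series start/end: no extrapolation
}

func main() {
	// 30s gap to the boundary, 10s sample interval: a series start is assumed.
	fmt.Println(boundaryExtrapolation(30, 10, true))  // 5
	fmt.Println(boundaryExtrapolation(30, 10, false)) // 0
}
```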

beorn7 (Member) commented Nov 30, 2015

@brian-brazil Perhaps you can showcase a typical use-case for an exact start and/or end of a series in the samples so that we can vet the various cases against each other?

And BTW: Calculating an average needs a fraction. Calculating a median does not. :-)

fabxc (Contributor) commented Nov 30, 2015

For an even number of samples the median is the average of the two middle values. So strictly speaking, it needs a fraction.

brian-brazil (Contributor, Author):

I don't think that median is more complicated than average.

It's not your or my opinion on statistics that matters, it's that of our users. That will include people who might have left high school early due to the .com boom. That'll include people who never had a chance to study statistics, such as in Ireland, where statistics is only an optional part of higher-level maths.

Even though median is the most appropriate statistical choice, that doesn't change the fact that for many of our users it'll be a brand-new mathematical concept they've not had to properly deal with before.

As soon as our technical choices impact what the users can reasonably run into, we need to consider the human factor too. This wouldn't be a concern in, say, the storage layer.

About "half-median" extrapolation vs. "no extrapolation": What bothers me most with our general solution is that we have that sudden switch-over between "full extrapolation" and "reduced extrapolation" (may it be none at all or "half median").

That bothers me too.

With my "half median" suggestion, those jumps will at least be approx. half as big.

With my usual example (a counter that starts at 0, then increments to 1, then holds), this results in values that are further away from the true value (to the extent that it's no longer possible to return the true value), so this would make #581 worse compared to what this PR does.

The only case where we get a systematic artifact is the very special one where somebody managed to coincide the start or end of a series with the scrape time.

That's not the only case where it's a problem; it's also a problem where the series isn't very linear, such as the one above. Supporting non-linear series is critical, and whatever we call rate needs to support them sanely.

or let's say at least a delta, which might be justified just because of the huge savings in computational cost

I don't think delta is used often enough to justify variants; I'm not sure anyone is even using it.

About the explanations

I'm not sure that's a fair comparison; there are special cases on all sides.

Perhaps you can showcase a typical use-case for an exact start and/or end of a series in the samples so that we can vet the various cases against each other?

We've seen them in event/push-based requests. When things are batchy and have exact timestamps, it'd be incorrect to extrapolate beyond them.

And BTW: Calculating an average needs a fraction. Calculating a median does not. :-)

It does if there's an even number of samples ;)

beorn7 (Member) commented Nov 30, 2015

Strictly speaking, all values that are larger than exactly half of the sample values are valid values for the median.

About the real discussion:

I'll come back to the use-cases tomorrow and will work through the effects the various schemes have on the results.

Two general statements, though:

(1) We have consistently advertised that pushing with an exact timestamp is not the use-case Prometheus is built for. Should we end up in a situation of a trade-off between proper support for the use-case Prometheus was built for vs. proper support for a use-case Prometheus was not built for, I will vote for the former.

(2) Picking our techniques based on the school curriculum in certain cultures seems outrageously wrong to me. I for one will pick the technique that is best suited to solve our problem.

brian-brazil (Contributor, Author):

Picking our techniques based on the school curriculum in certain cultures seems outrageously wrong to me. I for one will pick the technique that is best suited to solve our problem.

It depends on how you define the problem. If the problem is "how do I estimate the sampling interval of a series", then it's silly; if the problem is "how do I build a useful monitoring system that's widely deployed", then cultural norms are really important.

There's no point in having the right answer if it does more harm than good overall.

fabxc (Contributor) commented Dec 1, 2015

Given that we have histograms, arguing that an understanding of the median cannot be expected from a user who wants to do monitoring doesn't make sense.
Those users will in particular not care about what algorithms we used to implement a reasonably accurate rate. No idea why we would even argue for anything but the implementation with the best result at a reasonable cost.

beorn7 (Member) commented Dec 1, 2015

I feel tempted now to discuss the fundamental incentives of an open source project in general and Prometheus in particular, but let's not do that in a PR.

Let's instead come up with a purely technical decision between median and average. It will then be a separate decision to deliberately go down the technically inferior way for non-technical reasons. (I expect to be strongly against it, should it come up, but as usual, I'm willing to "disagree but commit" if you other Prometheans think it's the better decision.)

"Half-extrapolation" vs. "no extrapolation" appears to me to be an orthogonal decision to be discussed separately.

Let's first get the median vs. average thing settled, on a purely technical basis.

Technically, median is computationally more expensive (by an amount that needs to be determined but presumably just a fraction of the cost of the rate calculation), which has to be compensated by advantages of some kind. If we take the "almost regular sampling intervals" as the normal case, median seems vastly superior to me. But @brian-brazil brought up the case of pushing with fixed time stamps. In that scenario, sampling intervals are arbitrary and not necessarily regular anymore. To support that, average would make more sense than median. If we want to support that case at all, I'd be willing to go for average. But do we? For comparison, the case where average hurts most is if a scrape is lost close to the start or end of a time series so that the average sampling interval is artificially inflated and the start/end of the time series is identified later than necessary. What happens more often? Non-regular timestamps set by the client or lost samples close to the start or end of a series? I would have thought the latter, but I could be convinced otherwise, especially once we provide a bulk-import API (as more users will use Prometheus as a general-purpose TSDB – which appears intriguing and dangerous to me at the same time...).

brian-brazil (Contributor, Author):

Those users will in particular not care about what algorithms we used to implement a reasonably accurate rate.

Unfortunately we don't have a reasonably accurate rate. If we could come up with something that worked reasonably in all common scenarios, then we could go to town complexity-wise.

Everything proposed thus far has enough oddness in behaviour that it's going to cause confusion for a good chunk of users and support problems for us, so we need to think how we're going to deal with that.

In that scenario, sampling intervals are arbitrary and not necessarily regular anymore.

What I'm thinking of is a batch job. The start time should be regular, the completion time will vary though.

For comparison, the case where average hurts most is if a scrape is lost close to the start or end of a time series so that the average sampling interval is artificially inflated and the start/end of the time series is identified later than necessary.

The question for me is which causes the more problematic artifacts and will thus cause more confusion/frustration. Our fundamental challenge is that we can't distinguish between a missed scrape and a timeseries appearing/disappearing, and on top of that we expect some natural variation in scrape intervals.

If it's a lost scrape, then having a longer period of extrapolation is desirable as we want to extrapolate up to when things get working again. If it's a timeseries appearing/disappearing then that's not what we want to do.

I think it'd be interesting to run all this through #581-type use cases, and through a rolling restart of a service that changes all the instance labels, to see how things look with a variety of sampling periods.

beorn7 (Member) commented Dec 2, 2015

The question for me is which causes the more problematic artifacts and will thus cause more confusion/frustration.

Obviously, we are on the same page here. The actual point is that I believe with "median/half-median" extrapolation we'll cause less frustration than with "average/no" extrapolation.

Our fundamental challenge is that we can't distinguish between a missed scrape and a timeseries appearing/disappearing, and on top of that we expect some natural variation in scrape intervals.

That's again not disputed. The real question is whether timeseries usually appear/disappear exactly where a sample happens to be, or at a random point within the sample interval (where I claim that in the typical "Promethean" use-case it's the latter, while the former is a rare edge case where people are using Prometheus for something for which it is not really made). The real question about "natural variation in scrape intervals" is whether we usually have a regular scrape interval where larger deviations are rare and then usually caused by a missed scrape (for which the median would be the best heuristic), or if scrape intervals are more or less random (for which the average would be better). I claim the former case is by far the more common one.

If it's a lost scrape, then having a longer period of extrapolation is desirable as we want to extrapolate up to when things get working again. If it's a timeseries appearing/disappearing then that's not what we want to do.

I was talking about the following case:

       1st sample   missed scrape
              |         |
              v         v
              x    x         x    x    x    x    x    x    x
----|----|----|----|----|----|----|----|----|----|----|----|
    1    2    3    4    5    6    7    8    9   10   11   12
Time

If my range is from 1.8 to 7.8, the median heuristic will correctly identify the start of the series, while the average heuristic will assume the series started before the range.
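
Working through the numbers of that sketch (assuming the ticks are seconds and the 1.1 factor from this PR): the samples inside the 1.8-7.8 range sit at 3, 4, 6 and 7, giving intervals of 1, 2 and 1. The median interval is 1, so the 1.2 gap from the range start to the first sample exceeds 1.1 × 1 and a series start is detected; the average interval is 4/3 ≈ 1.33, so 1.1 × 1.33 ≈ 1.47 covers the 1.2 gap and the series is (wrongly) assumed to reach back beyond the range start.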

beorn7 (Member) commented Dec 2, 2015

What I also wanted to say with my example: It's a pretty rare double-coincidence (start of series + missed scrape). The question is if the special use-case of quasi-random sample timing (in contrast to quasi-regular intervals) is more or less rare.

brian-brazil (Contributor, Author):

What I also wanted to say with my example: It's a pretty rare double-coincidence (start of series + missed scrape).

On this point I was thinking a bit last night, and I don't think start/end and a missed scrape are independent. A server that is still initialising or one that's overloaded is more likely to fail scrapes.

I was also pondering whether average might give the better expected value due to this and the chance of a double missed scrape (and we're talking about adding 10% on top of it all anyway).

This requires working through with examples methinks.

fabxc (Contributor) commented Dec 2, 2015

Can confirm that. Heavily loaded Prometheus servers have a high number of failed scrapes on startup (the first 60-120s, I would say).

beorn7 (Member) commented Dec 2, 2015

I'm currently implementing all four options we are discussing. Stay tuned, separate PRs imminent.
We can then play with real-life data and see what it looks like.

While working on it, I realized that counters are never negative by contract, so we can make use of that to never extrapolate below 0, but that's yet another front in the battle.
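
One way to express the "never extrapolate below 0" idea as a sketch (illustrative names; not the code in any of the PRs):

```go
package main

import "fmt"

// leftExtrapolation limits how far (in seconds) we extrapolate to the left
// of the first sample: never further than the point at which a linear
// extrapolation of the counter would drop below zero.
func leftExtrapolation(wanted, firstValue, slope float64) float64 {
	if slope <= 0 {
		return wanted
	}
	zeroGap := firstValue / slope // distance at which the counter would hit 0
	if zeroGap < wanted {
		return zeroGap
	}
	return wanted
}

func main() {
	// First sample value 2, counter growing by 1/s: extrapolating the desired
	// 5s to the left would imply a negative counter, so we stop after 2s.
	fmt.Println(leftExtrapolation(5, 2, 1)) // 2
	fmt.Println(leftExtrapolation(5, 8, 1)) // 5
}
```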

brian-brazil (Contributor, Author):

While working on it, I realized that counters are never negative by contract, so we can make use of that to never extrapolate below 0, but that's yet another front in the battle.

I had that thought too. It certainly helps for new timeseries, but it's also more complexity.

beorn7 (Member) commented Dec 2, 2015

PRs are out. Let's play... :)

beorn7 (Member) commented Dec 2, 2015

#1254 is my current favorite (median/half).

beorn7 (Member) commented Dec 2, 2015

Another toy: #1255

beorn7 (Member) commented Dec 9, 2015

So let's have the discussion only in this PR. I created the other PRs to play with, but I didn't want to fragment the discussion.

Brian said about the "no extrapolation below zero" approach (#1255):

My problem with this is that it only fixes one special case. There's still isomorphic problems when a timeseries disappears, or reappears after a period of downtime.

I don't understand this concern. I mean, yes, it fixes only a special case, but that's exactly what we need. The special case is that a counter (that keeps the contract) cannot be negative, and we are making use of that. How is disappearance or reappearance of a timeseries after a downtime isomorphic? I need more explanation to understand what you mean.

Brian said:

I did this in a temporary text file as a thought experiment, this was calculating what value increase([1m]) would have for a 10s sample interval, calculated every 1s for data that was 0 1 1 1 1 1 ....

With #1245 with the right timing it's possible to get the true value of 1 for the increase, whereas the closest the half median gets is 1.5.

#1245 only gets it right when it doesn't extrapolate. But that's in general the wrong approach. It only happens to be the right approach in this special case. The only reason why we know it's the right approach is that the counter starts at zero. Exactly that knowledge is encoded in #1255 . So #1255 always does the right thing to the best of our knowledge while #1245 in general does the wrong thing and only happens to be right in special cases.
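
For concreteness, the 1 vs. 1.5 figures quoted above can be reconstructed as follows (an assumption: the 1m range is taken to end exactly on the newest sample): with samples 10s apart and values 0 then 1, the sampled slope is 0.1/s; not extrapolating at the assumed series start gives an increase of 0.1 × 10s = 1, the true value, while half-interval extrapolation adds 5s at the start and yields 0.1 × 15s = 1.5.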

I'd really like to set up a test scenario with time series appearing and disappearing (reconstructing #581), and in particular the case where we have ~100 instances performing a rolling restart. But I haven't found time at all. Whoever feels like it, please play with the various PRs. I'll still try to do so ASAP.

brian-brazil (Contributor, Author):

How is disappearance or reappearance of a timeseries after a downtime isomorphic? I need more explanation to understand what you mean.

If

- - - 0 1 1 1 1 1

has a problem, then

0 0 0 0 0 1 - - - 

is going to have a similar problem.

I think your thought experiment is distinctly different from the problem in #581.

My thought experiment is the essence of what's happening in #581. If we solve it, we solve #581.

So #1255 always does the right thing to the best of our knowledge while #1245 in general does the wrong thing and only happens to be right in special cases.

Not quite. Each is wrong in the general case, and right in some special cases. The problem is that we don't have any good knowledge to work off, as we have an arbitrary time series that has an infinite number of valid interpretations.

beorn7 (Member) commented Dec 14, 2015

If

- - - 0 1 1 1 1 1

has a problem, then

0 0 0 0 0 1 - - - 

is going to have a similar problem.

The two examples are fundamentally different. In the first example, we can be sure that the missed scrapes cannot be represented by negative values (because of the contract of a counter), so a linear extrapolation to the left would be wrong. In the second example, we have no knowledge about the values in the missed scrapes. In the absence of other information, we can only assume that the series will continue as it started, i.e. one increase every 6 sampling intervals. It is generally wrong to assume that the one increase we have observed is the only one the series will ever have.

I think your thought experiment is distinctly different from the problem in #581.

My thought experiment is the essence of what's happening in #581. If we solve it, we solve #581.

I guess it doesn't make sense to just repeat our contradicting assessments of #581 over and over.

Could you explain in more detail why you think your thought experiment is the essence of #581? For me, that's so obviously not the case that I don't even know where I can start to explain why.

brian-brazil (Contributor, Author):

In the first example, we can be sure that the missed scrapes cannot be represented by negative values (because of the contract of a counter)

Technically we can't; it's possible that there were missed scrapes which covered a counter reset.

In the absence of other information, we can only assume that the series will continue as it started, i.e. one increase every 6 sampling intervals. It is generally wrong to assume that the one increase we have observed is the only one the series will ever have.

It's also generally wrong to assume that the pattern we see will repeat.

Could you explain in more detail why you think your thought experiment is the essence of #581? For me, that's so obviously not the case that I don't even know where I can start to explain why.

The essence of #581 is that a timeseries appears, and is incremented relatively rarely. This is what my example does, and it shows the same behaviours as we see in #581.

beorn7 (Member) commented Dec 15, 2015

It's also generally wrong to assume that the pattern we see will repeat.

It's the only thing we can assume in the absence of other information. In reality, a counter could increase more or less or not at all, but the null hypothesis is that it goes on as it has gone before. That's the foundation for doing extrapolation at all.

The essence of #581 is that a timeseries appears, and is incremented relatively rarely.

As concluded in the conversation of #581, the essence is that a series appears in the middle (or even close to the end) of a long range or – more generally – has relatively few samples in a relatively long range. The fact that the counter increases slowly or not at all is irrelevant.

Quote (emphasis is mine): "Meaning, rate deals poorly with time series appearing / disappearing completely, or having very infrequent samples (no matter if the sample values stay constant or not) compared to the time window over which we do the rate."

This is what my example does, and it shows the same behaviours as we see in #581.

Even if we assume that the essence of #581 is that counters are incremented relatively rarely, then your example still doesn't capture it. Your example assumes that the counter never increments again after it has incremented once. If it, in fact, increments rarely, then the only reasonable thing to do is to extrapolate by half a sampling interval.

brian-brazil (Contributor, Author):

The fact that the counter increases slowly or not at all is irrelevant.

It is relevant, the artifacts are much worse for slow moving counters. We wouldn't have a problem if the counter didn't increase.

Your example assumes that the counter never increments again after it has incremented once.

I'm only looking at the first handful of samples of the counter's life, as that's where the biggest problems are. Maybe there's another increment at the next sample and maybe it's in the next hour - but what I think we need to focus on is what happens with rate in the time frame of those first few samples.

If it, in fact, increments rarely, then the only reasonable thing to do is to extrapolate by half a sampling interval.

I disagree that that's the only reasonable thing to do. If it increments very rarely then it's unlikely there's another increment within the time frame, so not adding half a sampling interval is more likely to be correct.

beorn7 (Member) commented Dec 15, 2015

It is relevant, the artifacts are much worse for slow moving counters. We wouldn't have a problem if the counter didn't increase.

The problem is that we don't know if the counter increases and if yes how much. The only piece of information is how much it has increased for the samples we see. So we must not throw away that only piece of information.

I'm only looking at the first handful of samples of the counter's life, as that's where the biggest problems are. Maybe there's another increment at the next sample and maybe it's in the next hour - but what I think we need to focus on is what happens with rate in the time frame of those first few samples.

Luckily, counters start at 0 most of the time. #1255 makes use of that fact in the perfect way (much better than not extrapolating to the left). Problem solved.

I disagree that that's the only reasonable thing to do. If it increments very rarely then it's unlikely there's another increment within the time frame, so not adding half a sampling interval is more likely to be correct.

We have no information about how often a counter increments. The only piece of information is how much it has increased for the samples we see, may it be rare or not. So we must not throw away that only piece of information by not extrapolating.

We are running in circles once more. I'm at a loss as to how I can improve my explanation. I'd like to spot a hole in my line of argument, but I can't, and I'm inclined to claim there is deductive proof that not extrapolating by half a sampling interval is wrong in the general case. If we cannot convince each other, we need an experiment, i.e. somebody has to set up test data and run the different implementations on it. I'm still out of time. I will do so ASAP, but I'm happy to see others conducting the experiments.

brian-brazil (Contributor, Author):

The only piece of information is how much it has increased for the samples we see. So we must not throw away that only piece of information.

Just because we have a piece of information doesn't mean we must use it. The salient question is whether the information overall helps us.

Luckily, counters start at 0 most of the time. #1255 makes use of that fact in the perfect way (much better than not extrapolating to the left). Problem solved.

Counters start at 0 but that doesn't tell you if there's been a reset or it's a brand new counter. It may usually be correct, but it's not perfect.

I'm inclined to claim there is deductive proof that not extrapolating by half a sampling interval is wrong in the general case.

That's clearly false. It's only correct in the special case of a linear time series that starts/stops exactly half a sampling interval away, in all other cases it produces the wrong result. Thus it's incorrect in the general case.

Being right in the general case means being right in all possible cases.

beorn7 (Member) commented Dec 15, 2015

Just because we have a piece of information doesn't mean we must use it. The salient question is whether the information overall helps us.

Let me rephrase: The only piece of helpful information is how much it has increased for the samples we see. So we must not throw away that only piece of helpful information.

Counters start at 0 but that doesn't tell you if there's been a reset or it's a brand new counter. It may usually be correct, but it's not perfect.

We are only talking about the case where our heuristic has detected a "start of series". So the problem with a counter reset only happens if it coincides with a (wrongly or correctly identified) start of a series. A highly unlikely event.

It's only correct in the special case of a linear time series that starts/stops exactly half a sampling interval away, in all other cases it produces the wrong result. Thus it's incorrect in the general case.

Being right in the general case means being right in all possible cases.

We are talking about expectation values. We want to identify the correct expectation value. The expectation value of a roll of a perfect die is 3.5. What you are saying is: "That's clearly wrong. I can never roll a 3.5 on a die. Let's take 1 for our heuristic as this is the lowest value I can roll."

Obviously our calculated rate is strictly wrong in the general case. All of Prometheus is strictly wrong because we are sampling instead of event-logging. (That angry Mesosphere guy at Gophercon is yelling again in my head now… "Sampling is evil.") As we are in Promethean sampling-land anyway, correctness means "correct expectation values".

beorn7 (Member) commented Dec 28, 2015

Sooo.... https://github.com/beorn7/rrsim is the little test program I wrote to simulate rolling restarts. So far, I haven't introduced lost scrapes (or varying scrape intervals) into the scenarios. So in this partial report, there is no difference between the approaches of using the median or the average sampling interval as the heuristic.

The most important result is finding a bug affecting all versions discussed so far. I fixed it in 98c2c36. Unfortunately, the branch this PR (#1245) is based upon was rebased on top of master (rather than merged). Since all my other implementations were built on top of the original branch, I have done the fix in my own branch https://github.com/prometheus/prometheus/commits/beorn7/rate-average-no, which corresponds to the original branch. (Rebasing considered harmful... ;)

With the fix, I got the nice and expected huge improvement compared to the original rate implementation. The differences between the various suggested solutions are small, as expected, but noticeable. (With the aforementioned exception that median vs. average doesn't matter in the scenarios I have tested so far.)

In all scenarios, the scrape interval is 1s to speed up data collection.

Scenario 1: 20 tasks, 10qps each, rolling restart over 1m, followed by 1m normal run.

The aggregated rate in this scenario should be 200qps, but is in practice slightly less because of implementation details in my test program (199.5). (This could be improved with a little effort but doesn't really matter for what we are trying to find out here.)

The screenshots always show the total aggregate rate over two rolling restarts and then the aggregate rate of a single batch of tasks that got started in the first restart and stopped in the second.

irate

This is just a side note for what is already well understood: With long ranges, irate doesn't work well in this scenario. We should highlight that fact in the documentation.

With [3s] (i.e. ~3 samples in the range), irate looks pretty good, with a few spikes. (We don't expect "real" noise here as the test tasks have a steady increase rate, jitter=0).
[screenshot: irate3s]

However, with [10s], things overshoot a lot. The same sample values are used (the most recent two), but time series that have ceased to exist stay in the sum for a bit too long.
[screenshot: irate10s]

Old rate

As expected, we have a crazy overshoot here; the longer the range, the bigger the overshoot.

[10s]:
[screenshot: rate10s]

[30s]:
[screenshot: rate30s]

Interestingly, the rate of the single batch of tasks is not completely horrible.

No extrapolation at assumed series start/end

I.e. Brian's implementation from this PR plus aforementioned bug fix.

To save space, I'm only including the [30s] plots here. They don't look much different from [10s]. Complete screenshots are checked into https://github.com/beorn7/rrsim.

[screenshot: rate-average-no-30s]

No overshoot here, but as predicted an "undershoot". The 199.5qps between restarts are probably accurate (see above), but during restarts, the line should stay more or less at the same value (modulo inaccuracies in my test program). However, we get about 196.5qps in the middle of the rolling restart (peak min 196qps). Not dramatic, but recognizable. The single-batch rate looks nicely smooth.

Half-sample-interval extrapolation at assumed series start/end

This is the method I championed, without the added "don't extrapolate below zero".

In the graph, the known display bug masks the y-axis labels. The bottom line is ~199.0, the top line is ~200.0. Take that into account when looking at the graph. The whole movement happens between 199 and 200 qps. The second rolling restart looks even nicer, with everything between 199.2 and 199.8, and - as predicted - no systematic overshoot or undershoot. If we look at the midpoint of the zigzag, there is almost no movement during the rolling restart.

[screenshot: rate-average-half-30s]

Half-sample-interval extrapolation at assumed series start/end but not below zero

[screenshot: rate-mean-half-zerocap-30s]

Interestingly, the curve looks very similar to no extrapolation at the assumed series start/end. However, looking more closely, the undershooting is only about half as large. Still, this is more of an undershoot than what I had expected. I haven't gained an understanding of it yet.

Summary

In this scenario, the "half sample-interval" method clearly wins.

I think the scenario I have tested here is the most common, and a solution to it solves most of the confusion around rate. Random restarts interspersed might screw some things up, but they are rare compared to the systematic restarts during a rolling restart.

The next scenario I have tested is the "slow increase", i.e. similar to this one, but a single task only increments its counter much less often than once per scrape interval. I'll write the report down tomorrow.

brian-brazil (Contributor, Author):

Thanks for running the tests; the data is pretty clear. I expect we'll have to trade off the better results of the below-zero approach for slow data against the not-below-zero approach doing better on rolling restarts.

For irate, I'd presume that that'll improve once #398 is fixed.

However, looking more closely, the undershooting is only about half as large. Still, this is more of an undershoot than what I had expected. I haven't gained an understanding of it yet.

That makes sense to me intuitively, you're turning off half the interpolation. It'd be interesting to see what this would do with a 10s scrape interval.

Random restarts interspersed might screw some things up, but they are rare compared to the systematic restarts during a rolling restart.

I'd expect that if rolling restarts are okay, then random restarts should also be okay.

beorn7 (Member) commented Dec 29, 2015

For irate, I'd presume that that'll improve once #398 is fixed.

Definitely. irate and staleness detection will be a match made in heaven...

However, looking more closely, the undershooting is only about half as large. Still, this is more of an undershoot than what I had expected. I haven't gained an understanding of it yet.

That makes sense to me intuitively, you're turning off half the interpolation.

Yes, that was my initial thought, too. But then I realized that I'm also turning up the interpolation in half of the cases: The interpolation is decreased in cases where the zero point is closer to the first sample than half a scrape interval. But if the zero point is at a distance between one scrape interval and half a scrape interval, the interpolation is increased compared to the usual "half sample interval" interpolation. In short, I had expected that the zero cutoff would make the interpolation more precise but would not change the average outcome... I have to think more about it. Perhaps there is an error in my code...

It'd be interesting to see what this would do with a 10s scrape interval.

I think what matters is the ratio between scrape interval and counter increase frequency. So a 10s scrape interval should give us the same result as a 1s scrape interval with 100qps.

beorn7 (Member) commented Dec 29, 2015

Scenario 2: 200 tasks, 0.05qps each, rolling restart over 3m, followed by 3m normal run.

This is the "slow moving" example, where a counter increment happens much less often than a scrape. The challenge here is that the extrapolation will overshoot quite a bit if one of the rare counter increments happens to be within the range, and that even more if the range is short. If no increment is in the range, the extrapolation undershoots (it's assuming no increments at all, ever), but that's the same as no extrapolation at all. Thus, in the case of slowly incrementing counters, avoiding extrapolation appears attractive (as it seemingly minimizes the errors we make), but on average, avoiding extrapolation will underestimate, while the "half interval" extrapolation will yield correct expectation values. This scenario is supposed to test this claim and to find out how big the differences are.

Short results: The practical differences are very small. Below is an example for "half interval" extrapolation without the zero cap (the approach that looked best in the previous scenario), but all the other "advanced" solutions look essentially the same. (You can check the screenshots in the rrsim repo.) I couldn't tell them apart from just looking at them. (The graph's y-axis is from 9.4 to 10.6.)

[screenshot: rate-average-half-30s]

The conventional rate, of course, fails again:

[screenshot: rate-30s]

Very interesting (and good for a laugh) is to look at the rate of a single task with a 60s range. Here we can nicely see how the conventional rate combines the extrapolation overshoot with the inherent error of the slowly incrementing counter into some kind of "pointy ears" graph:

[screenshot: rate-60s-single-task]

The graphs for our "advanced" rate approaches again all look very much the same, and way more reasonable:

[screenshot: rate-average-half-60s-single-task]

In this scenario, it doesn't really matter which approach we take, as long as we don't stick with the old 'rate'.

In other news, I checked my zero-cap implementation with tests, and it looks as expected. https://github.com/prometheus/prometheus/blob/beorn7/rate-letsgocrazy/promql/testdata/legacy.test#L225-L241 even shows very nicely how the left- and right-side extrapolation end up the same on average, despite the former being zero-capped. So I still don't understand the strikingly different behavior in scenario 1.

Next, I'll play with jitter in the rate, and then lost scrapes. Perhaps that will enlighten me.

brian-brazil (Contributor, Author):

But if the zero point is at a distance between one scrape interval and half a scrape interval, the interpolation is increased compared to the usual "half sample interval" interpolation

That doesn't sound right to me.

I think what matters is the ratio between scrape interval and counter increase frequency. So a 10s scrape interval should give us the same result as a 1s scrape interval with 100qps.

I was thinking more along the lines that the maximum graph granularity is 1s, so as the scrape interval is also 1s we may not be seeing some effects due to that.

brian-brazil (Contributor, Author):

The graphs for our "advanced" rate approaches again all look very much the same, and way more reasonable

That's looking pretty good, though slightly overshooting. That could produce impossible values for increase for integral counters.

beorn7 (Member) commented Jan 1, 2016

That could produce impossible values for increase for integral counters.

Any kind of extrapolation will likely yield "impossible" (i.e. non-integral) results for integral counters. The requirement for integral increases might be one of the use cases for a non-extrapolating version of increase and delta, as discussed before. I see no way to recognize that case automatically.

But if the zero point is at a distance between one scrape interval and half a scrape interval, the interpolation is increased compared to the usual "half sample interval" interpolation

That doesn't sound right to me.

Could you be more specific? https://github.com/prometheus/prometheus/blob/beorn7/rate-letsgocrazy/promql/testdata/legacy.test#L225-L241 illustrates quite nicely what I mean. If you think it's wrong, we need reasons.

I was thinking more along the lines that the maximum graph granularity is 1s, so as the scrape interval is also 1s we may not be seeing some effects due to that.

Graph resolution can be set to fractions of a second. I played with that. Or what do you mean by "graph granularity"?

I'll post the reports for the other scenarios ASAP, but currently, I'm on holidays with little spare time. So probably not before the 6th.

brian-brazil (Contributor, Author):

Could you be more specific?

I'd only expect the zero approach to ever have less interpolation compared to the non-zero approach, so a situation where it increases the amount of interpolation sounds odd to me.

Graph resolution can be set to fractions of a second. I played with that. Or what do you mean by "graph granularity"?

That's what I meant; I had thought step only had second granularity. Our duration parsing is all over the place.

beorn7 (Member) commented Jan 6, 2016

Back from my travels. Will be able to report more results soon.

But first another take on explaining my expectation:

Two prerequisites:

  1. Sample intervals are equal, i.e. no lost or delayed scrapes. The sampling interval is called Δt below.
  2. The increase of counters is linear.

Both prerequisites are given in my first two scenarios.

With the prerequisites, if we detect the beginning of a series, we are always right, because a 1.1 Δt gap between the start of the range t0 and the time of the 1st sample in the range t1 is a sufficient condition for the start of a series.

The true start of the series is somewhere between t1 - Δt and t1, equally distributed. The expectation value for the true start of the series is therefore (t1 - Δt + t1)/2 = t1 - Δt/2, which is where the "half interval" extrapolation is coming from.

Hence, the easy proof is that the "half interval approach" extrapolates to the expectation value of the true start of the series, while the "zero approach" extrapolates to the true start of the series. The extrapolation value of the rate based on the true start of the series is equal to the expectation value of the rate based on the expectation value of the true start of the series. q.e.d.

To illustrate it a bit more: With the "half interval approach", we are overestimating the rate in half of the cases (if the true start is after t1 - Δt/2), and we are underestimating the rate in the other half of the cases (if the true start is before t1 - Δt/2). In the former case, the "zero approach" yields a lower rate (in fact the true rate). In the latter case, the "zero approach" yields a higher rate (in fact the true rate again). On average, both approaches yield the same rate.
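
A quick numeric illustration of that argument: with Δt = 10s and the first sample at t1 = 100s, the true series start is uniformly distributed over [90s, 100s], so its expectation is 95s = t1 - Δt/2. Extrapolating to 95s (the half-interval approach) therefore matches, on average, extrapolating to the unknown true start, which is what the zero approach aims for.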

Our duration parsing is all over the place.

Yeah, definitely something we should fix. Different construction site, though... (direct translation of a German expression, sorry for that... ;)

brian-brazil (Contributor, Author):

But first another take on explaining my expectation:

That makes sense.

beorn7 (Member) commented Jan 7, 2016

Scenario 3: QPS jitter.

I ran multiple scenarios. The screenshots below were done with 20 tasks, 10qps each, 1m rolling restart time and a jitter of 0.5 (i.e. the waiting time between counter increments was varied by a normal-distributed random value with σ/μ=0.5). The range is 30s.

Without extrapolation at the start and end of a series, we get the usual undershooting, even a bit stronger than in the equivalent scenario without jitter, which might be attributed to the random noise:

[screenshot: rate-average-no-30s]

With half-interval extrapolation, the deviations are much smaller, but again larger than in the non-jitter case:

[screenshot: rate-average-half-30s]

I'm very satisfied that the zero-capped extrapolation looks almost the same:

[screenshot: rate-median-half-zerocap-30s]

I have a theory that the weird results from the zero-capping in the first scenario are caused by the alignment of counter increases across tasks, and it's nicely confirmed by the weirdness going away with jittered counter increments.

Scenario 4: Missed scrapes.

I won't include screenshots here. (For reference: https://github.com/beorn7/rrsim/tree/master/screenshots/n20-qps10-jitter0-restart1m-run1m-lost0.2 ) The results are very clear, though, and not surprising once you have thought about it.

A missing sample will affect the calculated rate most if it is at the beginning or end of a range because it will cause false detection of series start or end. The errors introduced by using median vs. average are tiny compared to that. While I could construct queries that would show the difference, their use would be highly contrived. While I still believe the median makes more sense conceptually, the higher computational effort compared to the average outweighs the practical benefits.

Conclusion

Based on the results presented, I recommend the zero-capped half-average extrapolation approach.

beorn7 (Member) commented Jan 7, 2016

The recommendation above is not implemented purely in any of the branches. I'll prepare one once we have agreement to go for it.

brian-brazil (Contributor, Author):

Thanks for working through this, I agree with your recommendation.

beorn7 (Member) commented Jan 8, 2016

I'll implement it in a new branch then. Closing this branch and all the others.
