
`rate` and `delta` deal poorly with slow moving data. #581

Closed
tomprince opened this issue Mar 6, 2015 · 2 comments
tomprince (Contributor) commented Mar 6, 2015

For data that typically changes less often than the sampling window, the delta extrapolation intended to avoid aliasing actually adds inaccuracy.

For example, there is a spike in the rate to 80 (see http://bit.ly/1A4suIw) which is an order of magnitude greater than the actual rate (which is <10).

juliusv (Member) commented Mar 10, 2015

Adding a screenshot of that in case the data disappears:

[screenshot: rate() graph showing the spike]

Adding some explanations to this issue:

So it's a bit more tricky than "it deals poorly with slow moving data". Actually, slow-moving data should be handled fine as long as the underlying time series are "there". Meaning, rate deals poorly with time series appearing / disappearing completely, or having very infrequent samples (no matter if the sample values stay constant or not) compared to the time window over which we do the rate. This is due to the way that rates are extrapolated from the actual samples that are found under the time window.

In this example, the reason for the high initial spike of the time series is a tricky one: you're doing the sum over multiple rates of individual time series. One of these time series (buildbot_finished_builds_total{builder="flocker/installed-package/fedora-20",buildmaster="build.clusterhq.com",instance="build.clusterhq.com:80",job="buildbot",result="failure",slave_class="fedora-vagrant",slave_number="0"}) doesn't exist at all for the first part of the graph, and springs into existence in the middle of the 4h-rate window. Meaning, the rate function will see something like this:

                             4h window
|------------------------------------------------------------|              ...
                                                                       X
                                                                  X
                                                            X
                                                      X

Seeing no earlier data points to the left, rate() then naively (and sometimes correctly?) assumes that if there had just been more points, the growth would have probably looked similar. The extrapolation generally happens because no matter what time window you choose, you're unlikely to find samples to exactly match the window boundaries, so you wouldn't actually get the rate over the selected window of time, but over whatever deltaT there really is between samples. You'd also get funny temporal aliasing due to that.
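The effect described above can be sketched numerically. This is a hypothetical simplification of the extrapolating behaviour being discussed, not Prometheus's actual code; the sample timestamps and values are illustrative:

```python
# Naive extrapolating rate: take the slope between the first and last
# sample found in the window and assume it held over the whole window.
# (Illustrative simplification, not the real Prometheus implementation.)

def extrapolated_rate(samples, window_seconds):
    """samples: list of (timestamp_seconds, value) pairs, sorted by time."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    slope = (v1 - v0) / (t1 - t0)      # per-second rate between the samples
    increase = slope * window_seconds  # extrapolated over the full window
    return increase / window_seconds   # equals the local slope

# A counter that springs into existence 15 minutes before the end of a
# 4h window and counts 4 events in that time:
window = 4 * 3600
samples = [(13500, 0), (13800, 1), (14100, 2), (14400, 4)]
print(extrapolated_rate(samples, window))  # ~0.0044/s, the spike
# The average over the whole window (4 events / 4h) is only ~0.00028/s,
# roughly an order of magnitude lower -- matching the spike in the issue.
```

Because the slope is computed only over the short span where samples exist, the extrapolation implicitly assumes the counter was growing at that rate for the entire window, which is exactly wrong for a series that just appeared.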

Now the question is: how should rates behave in abnormal situations like this, given that the above is just a more extreme case of the more general extrapolation one which is intended? Should they really act differently, or should one just avoid/ignore time series that disappear/appear like that (if the dimensions are known beforehand, one can initialize their counters to 0)? That's not always possible, but not sure what a sane(r) behavior for rate would be that wouldn't be wrong just as often in the other direction?

brian-brazil added a commit that referenced this issue Oct 10, 2015
This change is breaking, use increase() instead.

I'm not cleaning up the function in this PR, as my solution to #581 will
rewrite and simplify increase/rate/delta.
brian-brazil added a commit that referenced this issue Oct 10, 2015
This is an improvement from both a theoretical and practical
standpoint.

Theory:
For simplicity I'll take the example of increase().

Counters are fundamentally lossy: if there's a counter reset
or instance failure between one scrape and the next we'll
lose information about the period up to the reset/failure.
Thus the samples we have allow us to calculate a lower-bound
on the increase() in a time period.

Extrapolation multiplies this by an amount depending on timing, which is
an upper bound based on what might have happened if everything continued
as it was.
This mix of upper and lower bounds means that we can't make any
meaningful statements about the output of increase() in relation to what
actually happened.

By removing the extrapolation, we can once again reason that the result
of increase() is a lower bound on the true increase of the counter.

Practical:
Fixes #581.
The extrapolation presumes that it's looking at a range within a
continuous stream of samples. If in fact the time series starts or
ends within this range, this leads to an over-correction.

For discrete counters and gauges, extrapolating to invalid values in
their domain can be confusing and prevent rules being written that
depend on exact values.

For those looking to graph things more accurately, irate() is a better
choice than extrapolation on rate().
For those looking to calculate how a gauge is trending, deriv() is a
better choice than delta().
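The lower-bound increase() this commit argues for can be sketched as follows. This is illustrative only, with a simplified counter-reset heuristic (a drop in value is treated as a reset), not the actual Prometheus implementation:

```python
def lower_bound_increase(values):
    """values: counter samples in time order (at least one sample).

    Without extrapolation, the result is a lower bound on the true
    increase: anything that happened between scrapes, or that was lost
    when the counter reset, is simply not counted.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # Counter reset between scrapes: the counter restarted
            # near zero, so only `cur` events are visible since then.
            total += cur
        else:
            total += cur - prev
    return total

# Reset after the sample 12: we see 12 before it and 7 after it.
print(lower_bound_increase([0, 5, 12, 3, 7]))  # -> 19.0
```

Whatever really happened between the reset and the next scrape is lost, so the true increase is at least 19 here; that one-sided guarantee is what the commit message means by "lower bound".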
brian-brazil added a commit that referenced this issue Oct 11, 2015, with the same message as the commit above.
brian-brazil added a commit that referenced this issue Nov 30, 2015
Currently the extrapolation is always done no matter
how much of the range is covered by samples.
When a time-series appears or disappears this can lead
to significant artifacts. Instead only extrapolate if
the first/last sample is within 110% of the expected place
we'd see the next sample's timestamp.

This will still result in artifacts for appearing and disappearing
timeseries, but they should be limited to +110% of the true value.

Fixes #581
brian-brazil added a commit that referenced this issue Dec 2, 2015, with the same message as the commit above.
brian-brazil added a commit that referenced this issue Jan 8, 2016
The new implementation detects the start and end of a series by
looking at the average sample interval within the range. If the first
(last) sample in the range is more than 1.1*interval distant from the
beginning (end) of the range, it is considered the first (last) sample
of the series as a whole, and extrapolation is limited to half the
interval (rather than all the way to the beginning (end) of the
range). In addition, if the extrapolated starting point of a counter
(where it is zero) is within the range, it is used as the starting
point of the series.

Fixes #581
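The rule described in this commit message can be sketched as below. This is an illustrative reconstruction of the described behaviour (detect series start/end via 1.1× the average sample interval, extrapolate only half an interval past a detected boundary, and clamp counters at their extrapolated zero point), not the actual Prometheus source:

```python
def extrapolated_increase(samples, range_start, range_end, is_counter=True):
    """Sketch of the extrapolation rule from the commit message above.

    samples: at least two (timestamp, value) pairs, sorted by time,
    all lying inside [range_start, range_end]. Illustrative only.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    sampled_interval = t_last - t_first
    avg_interval = sampled_interval / (len(samples) - 1)
    delta = v_last - v_first

    # Distance to extrapolate toward the start of the range:
    extrap_start = t_first - range_start
    if extrap_start > 1.1 * avg_interval:
        # Gap is too large: treat t_first as the series' first sample
        # and extrapolate only half an average interval.
        extrap_start = avg_interval / 2

    # Same logic toward the end of the range:
    extrap_end = range_end - t_last
    if extrap_end > 1.1 * avg_interval:
        extrap_end = avg_interval / 2

    # Counters start at zero: never extrapolate back past the point
    # where the counter's extrapolated value would hit zero.
    if is_counter and delta > 0 and v_first >= 0:
        time_to_zero = v_first / (delta / sampled_interval)
        extrap_start = min(extrap_start, time_to_zero)

    factor = (sampled_interval + extrap_start + extrap_end) / sampled_interval
    return delta * factor
```

With this rule, a counter that appears mid-range with a first value of 0 gets no backward extrapolation at all, so the spike from the original report is bounded instead of scaling with the width of the range.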
@beorn7 beorn7 closed this in #1295 Jan 8, 2016
fabxc added a commit that referenced this issue Jan 11, 2016, with the same "This change is breaking, use increase() instead" message as brian-brazil's Oct 10, 2015 commit above.
fabxc added a commit that referenced this issue Jan 11, 2016, with the same series start/end detection message as the Jan 8, 2016 commit above.
lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019