rate and delta deal poorly with slow moving data #581
Adding a screenshot of that in case the data disappears:

Adding some explanations to this issue: it's a bit more tricky than "it deals poorly with slow moving data". Actually, slow-moving data should be handled fine as long as the underlying time series are "there". Meaning, rate deals poorly with time series appearing or disappearing completely, or having very infrequent samples (whether or not the sample values stay constant) compared to the time window over which we do the rate. This is due to the way that rates are extrapolated from the actual samples found within the time window. In this example, the reason for the high initial spike of the time series is a tricky one: you're summing the rates of multiple individual time series. One of these time series (
Seeing no earlier data points to the left, rate() then naively (and sometimes correctly?) assumes that if there had just been more points, the growth would probably have looked similar. The extrapolation generally happens because, no matter what time window you choose, you're unlikely to find samples that exactly match the window boundaries, so you wouldn't actually get the rate over the selected window of time, but over whatever deltaT there really is between samples. You'd also get funny temporal aliasing due to that. Now the question is: how should rates behave in abnormal situations like this, given that the above is just a more extreme case of the general extrapolation, which is intended? Should they really act differently, or should one just avoid/ignore time series that disappear/appear like that (if the dimensions are known beforehand, one can initialize their counters to 0)? That's not always possible, but I'm not sure what a sane(r) behavior for rate would be that wouldn't be wrong just as often in the other direction.
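To make the artifact concrete, here is a toy sketch (hypothetical Python, not Prometheus's actual implementation) of the naive extrapolation described above: the growth observed between the first and last sample inside the window is stretched over the entire window, so a series that only appears partway through the range gets heavily over-counted.

```python
def naive_increase(samples, window_start, window_end):
    """samples: ascending list of (timestamp, value) for a monotonic counter.
    Extrapolates the sampled growth over the entire window, as if the series
    had been growing at the same pace outside the sampled span."""
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None
    (t0, v0), (tn, vn) = in_window[0], in_window[-1]
    return (vn - v0) * (window_end - window_start) / (tn - t0)

# A counter that only existed for the last 60s of a 300s window:
samples = [(240, 0), (270, 30), (300, 60)]
print(naive_increase(samples, 0, 300))  # 300.0, though only 60 was observed
```

The observed growth of 60 over 60s is multiplied by 300/60, yielding 300: an artifact of the same shape as the spike discussed in this thread.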
This change is breaking; use increase() instead. I'm not cleaning up the function in this PR, as my solution to #581 will rewrite and simplify increase/rate/delta.
This is an improvement from both a theoretical and practical standpoint.

Theory: For simplicity I'll take the example of increase(). Counters are fundamentally lossy: if there's a counter reset or instance failure between one scrape and the next, we lose information about the period up to the reset/failure. Thus the samples we have allow us to calculate a lower bound on the increase() in a time period. Extrapolation multiplies this by an amount depending on timing, which is an upper bound based on what might have happened if everything continued as it was. This mix of upper and lower bounds means that we can't make any meaningful statements about the output of increase() in relation to what actually happened. By removing the extrapolation, we can once again reason that the result of increase() is a lower bound on the true increase of the counter.

Practical: Fixes #581. The extrapolation presumes that it's looking at a range within a continuous stream of samples. If in fact the time series starts or ends within this range, this leads to an over-correction. For discrete counters and gauges, extrapolating to invalid values in their domain can be confusing and prevents rules being written that depend on exact values. For those looking to graph things more accurately, irate() is a better choice than extrapolation on rate(). For those looking to calculate how a gauge is trending, deriv() is a better choice than delta().
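The lower-bound argument above can be sketched in a few lines (hypothetical Python, not the PR's code): summing only the observed positive deltas, and treating any decrease as a counter reset, yields a value that can never exceed the true increase.

```python
def lower_bound_increase(samples):
    """samples: ascending list of (timestamp, value) for a counter.
    Sums observed deltas, treating any decrease as a counter reset.
    Without extrapolation, the result is a lower bound on the true increase:
    growth between scrapes that was wiped out by a reset is simply lost."""
    total = 0.0
    for (t_prev, v_prev), (t, v) in zip(samples, samples[1:]):
        if v >= v_prev:
            total += v - v_prev
        else:
            # Counter reset: the growth before the reset is unrecoverable,
            # so count only what accumulated since the reset.
            total += v
    return total

# A reset occurs between t=20 and t=30:
samples = [(0, 100), (10, 110), (20, 120), (30, 5), (40, 15)]
print(lower_bound_increase(samples))  # 35.0
```

The true increase here was at least 35 (and possibly more, since the counter may have kept growing past 120 before resetting), which is exactly the property being argued for.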
Currently the extrapolation is always done, no matter how much of the range is covered by samples. When a time series appears or disappears, this can lead to significant artifacts. Instead, only extrapolate if the first/last sample is within 110% of the expected place we'd see the next sample's timestamp. This will still result in artifacts for appearing and disappearing time series, but they should be limited to +110% of the true value. Fixes #581
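The boundary check proposed above might look like this (a hypothetical sketch; names and signature are illustrative): extrapolation toward a range boundary is allowed only when the gap between that boundary and the nearest sample is at most 1.1 times the average sample interval, i.e. within 110% of where the next sample would be expected.

```python
def should_extrapolate(first_t, last_t, n, range_start, range_end):
    """first_t/last_t: timestamps of the first/last sample in the range;
    n: number of samples (>= 2). Returns (extrapolate_left, extrapolate_right)."""
    avg_interval = (last_t - first_t) / (n - 1)
    extrapolate_left = (first_t - range_start) <= 1.1 * avg_interval
    extrapolate_right = (range_end - last_t) <= 1.1 * avg_interval
    return extrapolate_left, extrapolate_right

# Samples only in the last 60s of a 0-300s range, scraped every 30s:
print(should_extrapolate(first_t=240, last_t=300, n=3, range_start=0, range_end=300))
# (False, True): don't stretch back toward t=0, but do extrapolate to t=300
```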
The new implementation detects the start and end of a series by looking at the average sample interval within the range. If the first (last) sample in the range is more than 1.1*interval distant from the beginning (end) of the range, it is considered the first (last) sample of the series as a whole, and extrapolation is limited to half the interval (rather than all the way to the beginning (end) of the range). In addition, if the extrapolated starting point of a counter (where it is zero) is within the range, it is used as the starting point of the series. Fixes #581
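The scheme just described can be sketched as follows (hypothetical Python modelled on this description, not the merged code; counter resets are ignored for brevity). The gap to each range boundary is capped at half the average interval when the series appears to start or end inside the range, and for counters the extrapolation is additionally clamped at the point where the counter would hit zero.

```python
def extrapolated_increase(samples, range_start, range_end, is_counter=True):
    """samples: ascending list of (timestamp, value), at least two samples.
    Extrapolates the sampled growth toward the range boundaries, limited
    as described above. Counter resets are not handled in this sketch."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    sampled_interval = tn - t0
    avg_interval = sampled_interval / (len(samples) - 1)
    delta = vn - v0

    start_gap = t0 - range_start
    if start_gap > 1.1 * avg_interval:
        # The series likely started inside the range: only extrapolate
        # half an interval back, not all the way to the range start.
        start_gap = avg_interval / 2
    if is_counter and delta > 0:
        # Don't extrapolate past the point where the counter would be zero.
        zero_gap = sampled_interval * v0 / delta
        start_gap = min(start_gap, zero_gap)

    end_gap = range_end - tn
    if end_gap > 1.1 * avg_interval:
        end_gap = avg_interval / 2

    return delta * (sampled_interval + start_gap + end_gap) / sampled_interval

# Same series as before: appears at t=240 with value 0, range is 0-300s.
samples = [(240, 0), (270, 30), (300, 60)]
print(extrapolated_increase(samples, 0, 300))  # 60.0 (was 300.0 with naive extrapolation)
```

For this series the zero-crossing clamp kicks in (the counter starts at 0 exactly at its first sample), so the result matches the observed growth instead of being inflated fivefold.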
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
For data that typically changes less often than the sampling window, the delta expansion to avoid aliasing actually adds inaccuracy.
For example, there is a spike in the rate to 80 (see http://bit.ly/1A4suIw) which is an order of magnitude greater than the actual rate (which is <10).