
`rate` and `delta` deal poorly with slow moving data. #581

Closed
tomprince opened this issue Mar 6, 2015 · 2 comments
tomprince (Contributor) commented Mar 6, 2015

For data that typically changes less often than the sampling window, the delta extrapolation intended to avoid aliasing actually adds inaccuracy.

For example, there is a spike in the rate to 80 (see http://bit.ly/1A4suIw) which is an order of magnitude greater than the actual rate (which is <10).

juliusv (Member) commented Mar 10, 2015

Adding a screenshot of that in case the data disappears:

[screenshot: rate() graph showing the spike]

Adding some explanations to this issue:

So it's a bit more tricky than "it deals poorly with slow moving data". Actually, slow-moving data should be handled fine as long as the underlying time series are "there". Meaning, rate deals poorly with time series appearing / disappearing completely, or having very infrequent samples (no matter if the sample values stay constant or not) compared to the time window over which we do the rate. This is due to the way that rates are extrapolated from the actual samples that are found under the time window.

In this example, the reason for the high initial spike of the time series is a tricky one: you're doing the sum over multiple rates of individual time series. One of these time series (buildbot_finished_builds_total{builder="flocker/installed-package/fedora-20",buildmaster="build.clusterhq.com",instance="build.clusterhq.com:80",job="buildbot",result="failure",slave_class="fedora-vagrant",slave_number="0"}) doesn't exist at all for the first part of the graph, and springs into existence in the middle of the 4h-rate window. Meaning, the rate function will see something like this:

                             4h window
|------------------------------------------------------------|              ...
                                                                       X
                                                                  X
                                                            X
                                                      X

Seeing no earlier data points to the left, rate() then naively (and sometimes correctly?) assumes that if there had just been more points, the growth would have probably looked similar. The extrapolation generally happens because no matter what time window you choose, you're unlikely to find samples to exactly match the window boundaries, so you wouldn't actually get the rate over the selected window of time, but over whatever deltaT there really is between samples. You'd also get funny temporal aliasing due to that.
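The effect described above can be sketched numerically. This is a hypothetical simplification of the extrapolating behaviour being discussed, not Prometheus's actual code; the sample timestamps and values are illustrative:

```python
# Naive extrapolating rate: take the slope between the first and last
# sample found in the window and assume it held over the whole window.
# (Illustrative simplification, not the real Prometheus implementation.)

def extrapolated_rate(samples, window_seconds):
    """samples: list of (timestamp_seconds, value) pairs, sorted by time."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    slope = (v1 - v0) / (t1 - t0)      # per-second rate between the samples
    increase = slope * window_seconds  # extrapolated over the full window
    return increase / window_seconds   # equals the local slope

# A counter that springs into existence 15 minutes before the end of a
# 4h window and counts 4 events in that time:
window = 4 * 3600
samples = [(13500, 0), (13800, 1), (14100, 2), (14400, 4)]
print(extrapolated_rate(samples, window))  # ~0.0044/s, the spike
# The average over the whole window (4 events / 4h) is only ~0.00028/s,
# roughly an order of magnitude lower -- matching the spike in the issue.
```

Because the slope is computed only over the short span where samples exist, the extrapolation implicitly assumes the counter was growing at that rate for the entire window, which is exactly wrong for a series that just appeared.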

Now the question is: how should rates behave in abnormal situations like this, given that the above is just a more extreme case of the more general extrapolation one which is intended? Should they really act differently, or should one just avoid/ignore time series that disappear/appear like that (if the dimensions are known beforehand, one can initialize their counters to 0)? That's not always possible, but not sure what a sane(r) behavior for rate would be that wouldn't be wrong just as often in the other direction?

brian-brazil added a commit that referenced this issue Oct 10, 2015
This change is breaking, use increase() instead.

I'm not cleaning up the function in this PR, as my solution to #581 will
rewrite and simplify increase/rate/delta.
brian-brazil added a commit that referenced this issue Oct 10, 2015
This is an improvement from both a theoretical and practical
standpoint.

Theory:
For simplicity I'll take the example of increase().

Counters are fundamentally lossy: if there's a counter reset
or instance failure between one scrape and the next we'll
lose information about the period up to the reset/failure.
Thus the samples we have allow us to calculate a lower-bound
on the increase() in a time period.

Extrapolation multiplies this by an amount depending on timing, which is
an upper bound based on what might have happened if everything continued
as it was.
This mix of upper and lower bounds means that we can't make any
meaningful statements about the output of increase() in relation to what
actually happened.

By removing the extrapolation, we can once again reason that the result
of increase() is a lower bound on the true increase of the counter.

Practical:
Fixes #581.
The extrapolation presumes that it's looking at a range within a
continuous stream of samples. If in fact the time series starts or
ends within this range, this leads to an over-correction.

For discrete counters and gauges, extrapolating to invalid values in
their domain can be confusing and prevent rules being written that
depend on exact values.

For those looking to graph things more accurately, irate() is a better
choice than extrapolation on rate().
For those looking to calculate how a gauge is trending, deriv() is a
better choice than delta().
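The lower-bound increase() this commit argues for can be sketched as follows. This is illustrative only, with a simplified counter-reset heuristic (a drop in value is treated as a reset), not the actual Prometheus implementation:

```python
def lower_bound_increase(values):
    """values: counter samples in time order (at least one sample).

    Without extrapolation, the result is a lower bound on the true
    increase: anything that happened between scrapes, or that was lost
    when the counter reset, is simply not counted.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # Counter reset between scrapes: the counter restarted
            # near zero, so only `cur` events are visible since then.
            total += cur
        else:
            total += cur - prev
    return total

# Reset after the sample 12: we see 12 before it and 7 after it.
print(lower_bound_increase([0, 5, 12, 3, 7]))  # -> 19.0
```

Whatever really happened between the reset and the next scrape is lost, so the true increase is at least 19 here; that one-sided guarantee is what the commit message means by "lower bound".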
brian-brazil added a commit that referenced this issue Oct 11, 2015, with the same message as the commit above.
brian-brazil added a commit that referenced this issue Nov 30, 2015
Currently the extrapolation is always done no matter
how much of the range is covered by samples.
When a time-series appears or disappears this can lead
to significant artifacts. Instead only extrapolate if
the first/last sample is within 110% of the expected place
we'd see the next sample's timestamp.

This will still result in artifacts for appearing and disappearing
timeseries, but they should be limited to +110% of the true value.

Fixes #581
brian-brazil added a commit that referenced this issue Dec 2, 2015, with the same message as the commit above.
brian-brazil added a commit that referenced this issue Jan 8, 2016
The new implementation detects the start and end of a series by
looking at the average sample interval within the range. If the first
(last) sample in the range is more than 1.1*interval distant from the
beginning (end) of the range, it is considered the first (last) sample
of the series as a whole, and extrapolation is limited to half the
interval (rather than all the way to the beginning (end) of the
range). In addition, if the extrapolated starting point of a counter
(where it is zero) is within the range, it is used as the starting
point of the series.

Fixes #581
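The rule described in this commit message can be sketched as below. This is an illustrative reconstruction of the described behaviour (detect series start/end via 1.1× the average sample interval, extrapolate only half an interval past a detected boundary, and clamp counters at their extrapolated zero point), not the actual Prometheus source:

```python
def extrapolated_increase(samples, range_start, range_end, is_counter=True):
    """Sketch of the extrapolation rule from the commit message above.

    samples: at least two (timestamp, value) pairs, sorted by time,
    all lying inside [range_start, range_end]. Illustrative only.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    sampled_interval = t_last - t_first
    avg_interval = sampled_interval / (len(samples) - 1)
    delta = v_last - v_first

    # Distance to extrapolate toward the start of the range:
    extrap_start = t_first - range_start
    if extrap_start > 1.1 * avg_interval:
        # Gap is too large: treat t_first as the series' first sample
        # and extrapolate only half an average interval.
        extrap_start = avg_interval / 2

    # Same logic toward the end of the range:
    extrap_end = range_end - t_last
    if extrap_end > 1.1 * avg_interval:
        extrap_end = avg_interval / 2

    # Counters start at zero: never extrapolate back past the point
    # where the counter's extrapolated value would hit zero.
    if is_counter and delta > 0 and v_first >= 0:
        time_to_zero = v_first / (delta / sampled_interval)
        extrap_start = min(extrap_start, time_to_zero)

    factor = (sampled_interval + extrap_start + extrap_end) / sampled_interval
    return delta * factor
```

With this rule, a counter that appears mid-range with a first value of 0 gets no backward extrapolation at all, so the spike from the original report is bounded instead of scaling with the width of the range.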
@beorn7 beorn7 closed this in #1295 Jan 8, 2016
fabxc added a commit that referenced this issue Jan 11, 2016, with the same "This change is breaking, use increase() instead" message as brian-brazil's Oct 10, 2015 commit above.
fabxc added a commit that referenced this issue Jan 11, 2016, with the same series start/end detection message as the Jan 8, 2016 commit above.
lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019