deriv() is broken due to floating point precision issues #2674

Closed
marcan opened this Issue May 3, 2017 · 16 comments

marcan commented May 3, 2017

I have a bunch of gauges collected via sql_exporter. The metric for a given label set only exists when its value is nonzero (they're mostly row counts; zero rows means no count means no metric).

This is what the input data looks like. The red timeseries is either 1 or does not exist here (graphing it alone yields no points where the combined graph shows it as 0):
[screenshot: raw input metric values]

Making the query deriv(metric[2m]) yields a hole in the middle:
[screenshot: deriv(metric[2m]) graph with a gap in the middle]
Presumably this is some kind of degenerate case where deriv() doesn't like the data coming and going.

This wouldn't be a big deal in and of itself, but when wrapping the expression in sum(), that hole punches a hole in the final sum:
[screenshot: sum() over the deriv() results, showing the gap propagated into the total]

And that is clearly not what I want. delta() doesn't have the problem, but does not scale per second. rate() works here but is obviously wrong on a gauge when the value goes down. There seems to be no rate() equivalent for gauges (deriv() is a lot more complex).

```
prometheus, version 1.5.2 (branch: , revision: bd1182d)
  build user:       portage@binhost
  build date:       20170329-06:45:32
  go version:       go1.7.5
```

brian-brazil commented May 3, 2017

> (they're mostly row counts; zero rows means no count means no metric).

This is going to make any result statistically invalid. You're also likely running into staleness issues.

Produce a 0 when there's 0.


marcan commented May 3, 2017

The problem is there's no easy way to pull a set of possible values for columns in a SQL table (and sql_exporter doesn't support anything like this anyway, otherwise I could've made a token attempt to hardcode some label sets in). All I can do is aggregate by certain columns and produce metrics for the tuples that happen to exist.

I know having missing metrics isn't nice, but surely there must be a way to handle this other than "don't do that". I don't even care about edge-case behavior in a time series that is just fluttering around 0; it's noise anyway. But it's killing an overall aggregation.


brian-brazil commented May 3, 2017

This smells like a NaN is propagating, but this code can't produce a NaN from a quick look.

Can you share your raw data?

> I know having missing metrics isn't nice, but surely there must be a way to handle this other than "don't do that".

There's not much you can do about time series that only exist sometimes if you have no way of figuring out their identities when they don't exist.


marcan commented May 3, 2017

Looking at the data returned by the underlying query_range API for the second graph I do see NaNs.

What is the correct way to pull out the raw data for this time range?


brian-brazil commented May 3, 2017

`metric[10m]` in the expression browser console


marcan commented May 3, 2017

This is historical (yesterday), but here's the problem part (it's the first time this metric exists):
"values":[[1493712456.939,"1"],[1493712486.939,"1"],[1493712516.939,"1"],[1493712546.939,"1"],[1493712576.939,"1"],[1493712606.939,"1"],[1493712636.939,"1"],[1493712666.939,"1"],[1493712696.939,"1"],[1493712726.939,"1"],[1493712816.939,"1"],[1493712846.939,"1"]]

Then there's a gap and at timestamp 1493716296.939 the metric exists again, but that's outside the range I was graphing.

The problem query returning NaNs is `deriv(metric{...}[5m])`. Via query_range, the requested time range is `start=1493712420&end=1493713320&step=3`. The (interpolated) return data becomes NaN starting at timestamp 1493713029 and continuing until 1493713116; there's no further data returned after that.


brian-brazil commented May 3, 2017

Doing the math by hand, the result comes out as 0 - which is what you'd expect for a horizontal line.


marcan commented May 10, 2017

Any idea what's going on here? Let me know if you need more information/data.


brian-brazil commented May 10, 2017

The data doesn't line up with the behaviour. Can you verify NaNs are being returned by the API?


marcan commented May 10, 2017

Yes, I see NaNs come back in the query_range response.

```
$ curl 'https://foo/api/v1/query?query=sql_party_users%7Bwith_seat%3D%22false%22,database%3D%22eps-prod%22,in_group%3D%22true%22%7D%5B20m%5D&time=1493713320&step=3'
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"sql_party_users","col":"count","database":"eps-prod","driver":"postgres","in_group":"true","instance":"tohru:9237","job":"sql_exporter","party":"ee25","sql_job":"global","with_seat":"false","with_ticket":"false"},"values":[[1493712456.939,"1"],[1493712486.939,"1"],[1493712516.939,"1"],[1493712546.939,"1"],[1493712576.939,"1"],[1493712606.939,"1"],[1493712636.939,"1"],[1493712666.939,"1"],[1493712696.939,"1"],[1493712726.939,"1"],[1493712816.939,"1"],[1493712846.939,"1"]]}]}}
$ curl 'https://foo/api/v1/query_range?query=deriv(sql_party_users%7Bwith_seat%3D%22false%22,database%3D%22eps-prod%22,in_group%3D%22true%22%7D%5B5m%5D)&start=1493712420&end=1493713320&step=30'
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"col":"count","database":"eps-prod","driver":"postgres","in_group":"true","instance":"tohru:9237","job":"sql_exporter","party":"ee25","sql_job":"global","with_seat":"false","with_ticket":"false"},"values":[[1493712510,"0"],[1493712540,"0"],[1493712570,"0"],[1493712600,"-0.00000000009313225746154786"],[1493712630,"0"],[1493712660,"0"],[1493712690,"0"],[1493712720,"0"],[1493712750,"0"],[1493712780,"0"],[1493712810,"0"],[1493712840,"0"],[1493712870,"0"],[1493712900,"0"],[1493712930,"0"],[1493712960,"-0.000000000038805107275644936"],[1493712990,"0"],[1493713020,"0"],[1493713050,"NaN"],[1493713080,"NaN"],[1493713110,"NaN"]]}]}}
```

brian-brazil commented May 10, 2017

At this point I'm suspecting floating point calculations are broken somewhere. Can you add a print at https://github.com/prometheus/prometheus/blob/master/promql/functions.go#L645 for one of the NaN evaluations and see where it's coming from?

For anyone else debugging, the two relevant data points are 1@1493712846.939 and 1@1493712816.939.


brian-brazil commented Jul 14, 2017

There's not enough to go on here and everything seems okay. Lacking further information I have to presume a hardware fault.


marcan commented Jul 14, 2017

I have a data dump of the problem storage, I just haven't had time to dig into it yet. I highly doubt it's a hardware fault, but I'll reopen when I get a chance to investigate.


marcan commented Jul 14, 2017

It's a floating point precision issue.

```
: s={%!f(model.Time=1493712816939) 1.000000} x=1493712816.939000 n=1.000000 sumY=1.000000 sumX=1493712816.939000 sumXY=1493712816.939000 sumX2=2231177979487842816.000000
: s={%!f(model.Time=1493712846939) 1.000000} x=1493712846.939000 n=2.000000 sumY=2.000000 sumX=2987425663.878000 sumXY=2987425663.878000 sumX2=4462356048598455296.000000
slope=NaN intercept=NaN covXY=0.000000 varX=0.000000
```

varX isn't really zero, but for large numbers like UNIX timestamps that differ only by a few seconds, the subtraction cancels all the way down to zero. Looks like the standard fix for this is to precompute the mean and subtract it out from the samples, instead of directly implementing the textbook formula.
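As an illustration of the cancellation, a minimal Go sketch of the textbook accumulation (using the variable names from the debug output above, not the actual Prometheus code) reproduces the NaN for the two samples in question:

```go
package main

import "fmt"

func main() {
	// Value 1 at the two nanosecond timestamps from the debug output above,
	// converted to float seconds the same way as the Python reproducer below.
	ts := []int64{1493712816939000000, 1493712846939000000}
	ys := []float64{1, 1}

	// Textbook least-squares accumulation (illustrative sketch only).
	var n, sumX, sumY, sumXY, sumX2 float64
	for i, t := range ts {
		x := float64(t) / 1e9
		n++
		sumX += x
		sumY += ys[i]
		sumXY += x * ys[i]
		sumX2 += x * x
	}
	covXY := sumXY - sumX*sumY/n // exactly 0 for a flat series
	varX := sumX2 - sumX*sumX/n  // true value is ~450, but it cancels to 0
	slope := covXY / varX        // 0 / 0 => NaN
	fmt.Println(covXY, varX, slope)
}
```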

Please reopen.


marcan commented Jul 14, 2017

Python3 reproducer:

```
>>> var = lambda a,b: (a*a+b*b) - ((a+b)*(a+b)/2)
>>> var(1493712816939000000/1e9, 1493712846939000000/1e9)
0.0
```

Note that the round trip through nanoseconds is necessary for the floating point stars to align:

```
>>> var(1493712816.939, 1493712846.939)
512.0
```

The precision we're getting here is in units of 512. This makes sense, because with UNIX timestamps we're at 31 bits, 62 bits squared, and the mantissa of double floats is only 53 bits: 2**9=512. The true variance is 450. Looks to me like even when the value doesn't underflow down to zero, the output of this function as implemented for deriv() with the magnitudes involved here is basically complete garbage and buried by floating point roundoff error; less than one bit of precision.
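Spelling out the arithmetic behind those two numbers (a worked check added for clarity): with the two samples exactly 30 seconds apart and n = 2, the true value is

```latex
\mathrm{varX} = (a^2 + b^2) - \frac{(a+b)^2}{2} = \frac{(a-b)^2}{2} = \frac{30^2}{2} = 450
```

The intermediate sums a*a + b*b ≈ 4.46e18 fall between 2**61 and 2**62, where adjacent float64 values are spaced 2**9 = 512 apart, so the computed result can only be a multiple of 512: it comes out as 0 or 512, never as the true 450.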

marcan changed the title from "sum(deriv()) does not like timeseries appearing/disappearing" to "deriv() is broken due to floating point precision issues" on Jul 14, 2017

brian-brazil reopened this Jul 14, 2017

brian-brazil added a commit that referenced this issue Jul 17, 2017

Use timestamp of a sample in deriv() to avoid FP issues
With the squaring of the timestamp, we run into the
limitations of the 53bit mantissa for a 64bit float.

By subtracting away a timestamp of one of the samples (which is how the
intercept is used) we avoid this issue in practice as it's unlikely
that it is used over a very long time range.

Fixes #2674
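In rough terms, the fix amounts to the following sketch (an illustration of the approach described in the commit message, not the literal patch): use the first sample's timestamp as the origin, so x and therefore x*x stay small enough to fit comfortably in the 53-bit mantissa.

```go
package main

import "fmt"

type sample struct {
	t int64   // timestamp in nanoseconds (hypothetical representation)
	v float64 // sample value
}

func main() {
	samples := []sample{
		{1493712816939000000, 1},
		{1493712846939000000, 1},
	}

	// Shift every x by the first sample's timestamp before accumulating, so
	// x*x is on the order of seconds squared instead of ~1e18.
	x0 := float64(samples[0].t) / 1e9
	var n, sumX, sumY, sumXY, sumX2 float64
	for _, s := range samples {
		x := float64(s.t)/1e9 - x0 // 0 and ~30 instead of ~1.49e9
		n++
		sumX += x
		sumY += s.v
		sumXY += x * s.v
		sumX2 += x * x
	}
	covXY := sumXY - sumX*sumY/n
	varX := sumX2 - sumX*sumX/n // ~450 now, instead of cancelling to 0
	slope := covXY / varX       // 0 / ~450 = 0, as expected for a flat series
	fmt.Println(varX, slope)
}
```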

brian-brazil added a commit that referenced this issue Aug 7, 2017

Use timestamp of a sample in deriv() to avoid FP issues (#2958)

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
