deriv() is broken due to floating point precision issues #2674

marcan commented on May 3, 2017

I have a bunch of gauges collected via sql_exporter. The metric for a given label set only exists when its value is nonzero (they're mostly row counts; zero rows means no count, which means no metric).

This is what the input data looks like. The red time series is either 1 or does not exist here (graphing it alone yields no points where the combined graph shows it as 0):

[graph: input data]

Making the query `deriv(metric[2m])` yields a hole in the middle:

[graph: deriv(metric[2m])]

Presumably this is some kind of degenerate case where `deriv()` doesn't like the data coming and going. This wouldn't be a big deal in and of itself, but when wrapping the expression in `sum()`, that hole punches a hole in the final sum:

[graph: sum(deriv(metric[2m]))]

And that is clearly not what I want. `delta()` doesn't have the problem, but does not scale per second. `rate()` works here, but is obviously wrong on a gauge when the value goes down. There seems to be no `rate()` equivalent for gauges (`deriv()` is a lot more complex).

Comments
This is going to make any result statistically invalid. You're also likely running into staleness issues. Produce a 0 when there's 0.
The problem is there's no easy way to pull the set of possible values for columns in a SQL table (and sql_exporter doesn't support anything like this anyway, otherwise I could've made a token attempt to hardcode some label sets in). All I can do is aggregate by certain columns and produce metrics for the tuples that happen to exist. I know having missing metrics isn't nice, but surely there must be a way to handle this other than "don't do that". I don't even care about edge-case behavior in a time series that is just fluttering around 0; it's noise anyway. But it's killing an overall aggregation.
This smells like a NaN is propagating, but this code can't produce a NaN from a quick look. Can you share your raw data?
You can't do much about time series that only exist sometimes if you have no way of figuring out their identities when they don't exist.
Looking at the data returned by the underlying query_range API for the second graph, I do see NaNs. What is the correct way to pull out the raw data for this time range?
`metric[10m]` in the expression browser console.
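For reference, the same raw samples can also be pulled over the HTTP API instead of the console. A minimal sketch in Python, assuming a Prometheus server on localhost:9090 and a hypothetical metric name `my_metric` (substitute the real metric and address):

```python
import json
import urllib.parse
import urllib.request

# Hypothetical metric name and server address; adjust to the actual setup.
params = urllib.parse.urlencode({
    "query": "my_metric[10m]",   # same range selector as in the console
})
url = "http://localhost:9090/api/v1/query?" + params

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Each result is {"metric": {...}, "values": [[unix_ts, "value"], ...]};
# non-finite samples show up as the strings "NaN", "+Inf" or "-Inf".
for series in data["data"]["result"]:
    print(series["metric"], series["values"])
```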
This is historical (yesterday), but here's the problem part (it's the first time this metric exists):

Then there's a gap, and at timestamp 1493716296.939 the metric exists again, but that's outside the range I was graphing. The problem query returning NaNs is:
Doing the math by hand, the result comes out as 0, which is what you'd expect for a horizontal line.
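To spell out the hand calculation: `deriv()` fits a least-squares line, and for a constant series the covariance term is exactly zero, so the slope should be 0. A quick sanity check in exact (rational) arithmetic, using hypothetical timestamps and a made-up constant gauge value:

```python
from fractions import Fraction

# Hypothetical flat series: the same value at five timestamps 15 s apart.
xs = [Fraction(1493712000) + 15 * i for i in range(5)]
ys = [Fraction(2)] * len(xs)   # made-up constant gauge value

n = Fraction(len(xs))
sum_x  = sum(xs)
sum_y  = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Textbook least-squares slope: covXY / varX.
cov_xy = sum_xy - sum_x * sum_y / n
var_x  = sum_x2 - sum_x * sum_x / n

print(cov_xy / var_x)   # 0 exactly: a horizontal line has zero slope
```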
Any idea what's going on here? Let me know if you need more information/data.
The data doesn't line up with the behaviour. Can you verify that NaNs are being returned by the API?
Yes, I see NaNs come back in the query_range response.
At this point I'm suspecting floating point calculations are broken somewhere. Can you add a print at https://github.com/prometheus/prometheus/blob/master/promql/functions.go#L645 for one of the NaN evaluations and see where the NaN is coming from? For anyone else debugging, the two relevant data points are:
There's not enough to go on here and everything seems okay. Lacking further information, I have to presume a hardware fault.
brian-brazil closed this on Jul 14, 2017
I have a data dump of the problem storage; I just haven't had time to dig into it yet. I highly doubt it's a hardware fault, but I'll reopen when I get a chance to investigate.
It's a floating point precision issue.
`varX` isn't really zero, but for large numbers like UNIX timestamps, which differ only by a few seconds, it underflows down to zero. Looks like the standard fix for this is to precompute the mean and subtract it out from the samples, instead of directly implementing the textbook formula. Please reopen.
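A minimal sketch of that kind of fix (in Python, not the actual Go patch), assuming the same textbook accumulation but with the x values shifted to a local origin before any squaring:

```python
def least_squares_slope(xs, ys):
    """Least-squares slope with x shifted to a local origin.

    Subtracting a reference point (the first timestamp here; the mean works
    just as well) keeps the squared x terms small, so varX no longer cancels
    away when the x values are UNIX-timestamp sized.
    """
    x0 = xs[0]
    xs = [x - x0 for x in xs]

    n = float(len(xs))
    sum_x  = sum(xs)
    sum_y  = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    cov_xy = sum_xy - sum_x * sum_y / n
    var_x  = sum_x2 - sum_x * sum_x / n
    return cov_xy / var_x
```

The merged fix, #2958 ("Use timestamp of a sample in deriv() to avoid FP issues"), takes this approach by anchoring the regression to a sample's timestamp rather than using raw UNIX timestamps.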
Python3 reproducer:
Note that the round trip through nanoseconds is necessary for the floating point stars to align:
The precision we're getting here is in units of 512. This makes sense, because UNIX timestamps are about 31 bits, so their squares are about 62 bits, and the mantissa of a double float is only 53 bits: 2**(62-53) = 2**9 = 512. The true variance is 450. Looks to me like even when the value doesn't underflow down to zero, the output of this function as implemented for deriv(), with the magnitudes involved here, is basically complete garbage buried in floating point roundoff error; less than one bit of precision.
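A sketch along those lines (hypothetical values, not the original reproducer): five timestamps 15 s apart, so the true variance is (900 + 225 + 0 + 225 + 900) / 5 = 450, with a nanosecond round trip thrown in as described above:

```python
# Hypothetical timestamps: a recent UNIX time, round-tripped through integer
# nanoseconds so the fractional part is not a "nice" float.
base = int(1493712000.123456 * 1e9) / 1e9
xs = [base + 15 * i for i in range(5)]   # five samples, 15 s apart

n = float(len(xs))
sum_x  = sum(xs)
sum_x2 = sum(x * x for x in xs)

# One-pass "textbook" variance: the difference of two numbers around 1e19.
# Each x*x is only representable to a granularity of about 512 (2**61-sized
# values, 53-bit mantissa), so the whole computation has only hundreds of
# units of absolute precision left; the result is frequently 0.0 or negative.
var_one_pass = (sum_x2 - sum_x * sum_x / n) / n

# Two-pass variance: subtract the mean first, then square the small deviations.
mean = sum_x / n
var_two_pass = sum((x - mean) ** 2 for x in xs) / n

print(var_one_pass)   # garbage at this magnitude
print(var_two_pass)   # ~450.0, the true variance
```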
marcan changed the title from "sum(deriv()) does not like timeseries appearing/disappearing" to "deriv() is broken due to floating point precision issues" on Jul 14, 2017
brian-brazil reopened this on Jul 14, 2017
brian-brazil added the kind/bug, priority/P1, and component/promql labels on Jul 14, 2017
brian-brazil added a commit that referenced this issue on Jul 17, 2017
brian-brazil referenced this issue on Jul 17, 2017: #2958 "Use timestamp of a sample in deriv() to avoid FP issues" (merged)
brian-brazil closed this in #2958 on Aug 7, 2017
brian-brazil added a commit that referenced this issue on Aug 7, 2017
lock bot commented on Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.