[Bug] Some values produced by recording rules are inaccurate. #4860
kminehart changed the title from "[Bug] Some values produced by Rules are inaccurate." to "[Bug] Some values produced by recording rules are inaccurate." on Nov 13, 2018
steven-aerts commented Nov 19, 2018
Would also like to know. I'd be happy to debug it if I knew where to start.
Weird. @steven-aerts Do you know if there was anything special happening around the timestamp where the rule and normal expression evaluation started diverging? Like a Prometheus version upgrade?
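One way to check for an upgrade or restart around such a timestamp, assuming the Prometheus servers scrape themselves (the job="prometheus" selector below is an assumption), is to graph the build-info and process-start metrics over the affected window:

```
# The "version" label of this series changes when a server is upgraded
prometheus_build_info{job="prometheus"}

# A jump in this value indicates a process restart
process_start_time_seconds{job="prometheus"}
```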
steven-aerts commented Nov 29, 2018
@juliusv 2.4.3 had already been running for 3 days when we saw this issue. After four days the issue disappeared again and the diff went back to 0. As this is our only recording rule, I cannot compare the behaviour of other rules at that moment in time. I do not think it is related to server restarts or config reloads, as none of them happened around that time (plotting …). I still find it very suspicious that both replicated Prometheus servers saw the issue at exactly the same time. Thanks, Steven
steven-aerts commented Nov 29, 2018
@juliusv I am now seeing that the diff has shifted over time. As in the graph I posted 10 days ago, the diff starts at 2018-11-15 06:18:54 UTC. That is exactly the moment when we upgraded our servers (verified by looking at …). So you are probably right with your guess that it might be upgrade related. I cannot explain the shift, but for me that is not a problem. Thanks, Steven
The original issue has a sum before rate, so that's likely it. For the second issue, what does the raw data say?
steven-aerts commented Nov 29, 2018
Looking at the raw data for the recording rule, I get the following when I query …
I looked at the raw data for the underlying series by querying …. I see that at the time the recording rule breaks, two new series start publishing data (some data anonymized).
Can this be some kind of race condition? There are two other series which keep on sending. Best regards, Steven
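A minimal way to do that raw-data comparison, assuming the rule name from the original report (the 1h window and the mode="idle" selector are only illustrative), is to evaluate range selectors for both sides in the console view and compare the samples around the breakpoint:

```
# Raw samples written by the recording rule
company:node_cpu_seconds_total{mode="idle"}[1h]

# Raw samples of the underlying series feeding the rule
node_cpu_seconds_total{mode="idle"}[1h]
```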
Can you elaborate on this?
steven-aerts commented Nov 29, 2018
@kminehart I think he is referring to https://www.robustperception.io/rate-then-sum-never-sum-then-rate
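For context, the distinction that article draws, sketched with this issue's metric names:

```
# rate() then sum(): counter resets are handled per source series
sum by (mode) (rate(node_cpu_seconds_total[5m]))

# sum() then rate(): taking rate() over an aggregate such as the recorded
# series treats the sum as a single counter, so any source series appearing,
# disappearing, or resetting shows up as a spurious jump or an apparent
# counter reset in the result.
rate(company:node_cpu_seconds_total[5m])
```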
Oh, that makes sense, but I don't think that applies in my case. My recording rule is recorded as company:node_cpu_seconds_total and records the value of the label-join query quoted in the original report below, which is just node_cpu_seconds_total summed by mode with node labels joined on. There's no rate applied inside the recording rule that could explain the inaccuracy being recorded, and we also don't apply a rate to this value in our graphs. We do the label joining so that we can see these metrics without the master node, which is unschedulable.
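For reference, a sketch of what such a rule looks like in rule-file form, reconstructed from the query quoted in the original report below (the group name is assumed, and the expression is only reformatted):

```yaml
groups:
  - name: node-recording-rules  # group name assumed for illustration
    rules:
      - record: company:node_cpu_seconds_total
        expr: |
          sum(
              node_cpu_seconds_total
            * on (instance) group_left (node)
              label_replace(
                max by (pod_ip, node) (kube_pod_info{pod=~"node-exporter.*"}),
                "instance", "$1:9100", "pod_ip", "(.*)"
              )
            * on (node) group_left (label_cs_role)
              kube_node_labels
          ) by (mode)
```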



kminehart commented Nov 12, 2018 • edited
Bug Report
What did you do?
One metric I've been collecting via recording rules has been inaccurate by up to 500 (in this case, seconds), which is a pretty big difference.
The rule is taking less than 10 ms to complete, so I don't think there's much of a "time drift" issue happening here.
This is the difference between the value in the rule and the query the rule uses. Since these are, to my understanding, essentially the same thing, the difference should be 0, or at least within ±1.
(sum(company:node_cpu_seconds_total) by (mode))
- (sum(node_cpu_seconds_total * on (instance) group_left (node) label_replace(max by (pod_ip, node) (kube_pod_info{pod=~"node-exporter.*"}), "instance", "$1:9100", "pod_ip", "(.*)") * on (node) group_left (label_cs_role) kube_node_labels) by (mode))
The second block is the query that the rule contains.
What did you expect to see?
I expected a reasonable margin of error; this much of a difference in values results in a huge difference in rates.
These graphs tell two completely different stories:
This one uses inaccurate data produced by the rule: [graph]
This one uses data collected directly from the node_exporters: [graph]
Environment
I'm using the Prometheus Operator on Kubernetes, but I don't think this issue is related to the Operator at all.
Prometheus version:
v2.3.2
Prometheus configuration file:
Rule config: