Float rounding in recording doesn't match runtime #4580
See https://www.robustperception.io/rate-then-sum-never-sum-then-rate, which is the recommended way to do this.
We have a static number of CPUs in the system and it's impossible for a single counter under aggregation to reset without resetting all of them, so I don't think that applies. It seems that aggregating multiple counters that come from the same exporter instance should be safe. Is there any other reason counters should not be aggregated when they come from the same instance? Having counters rather than rates stored in tsdb adds flexibility in terms of what period to compute rates over later. It works just fine with an online query, but breaks with a rule when it should work exactly the same way, which makes me think it's a bug in either tsdb or rule evaluation.
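For contrast, the rate-then-sum form recommended in the article linked above computes the per-CPU rate first and only then aggregates, so a reset in any single counter stays bounded instead of rewinding the whole sum. A hedged sketch in the same rule format that appears later in this thread; the rule name and the 5m window are illustrative, not taken from the issue:

- record: instance_mode:node_cpu:rate5m
  expr: sum(rate(node_cpu[5m])) without (cpu)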
@bobrik why do you think it might be tsdb related?
@krasi-georgiev maybe tsdb loses precision somehow on store? It could also be the rule engine passing an incorrect value; I'm not certain where exactly the bug is.
Sounds very, very unlikely, so I would say to look for the cause elsewhere.
@brian-brazil it really seems like a legitimate bug to me if recording rules do not match live query.
I think the culprit is that labels are not sorted somewhere. Floating-point addition is not associative, so a non-deterministic order of float64 addends results in a non-deterministic result. Consider this code (play.golang.org):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	measurements := []float64{3008.72, 21.28, 227.46, 211.33, 219.77, 205.26, 189.64, 231.46, 215.77, 204.09, 110.2, 229.25, 62.67, 244, 212.98, 209.35, 222.7, 177.96, 217.95, 243.82, 205.43, 291.03, 303.63, 110.65, 282.88, 285.47, 350.44, 1290.37, 1090.82, 1221.67, 1149.17, 1166.08, 1118.8, 906.82, 162.94, 1113.33, 1159.18, 1094.31, 1094.44, 1110.53, 1080.12, 1154.85, 1059.6, 36.61, 19.45, 178.84, 30.99, 36.78, 36.88, 38, 35.75, 39.32, 38.94, 43, 45.14, 39.44, 179.4, 38.93, 40.67, 38.51, 42.49, 40.06, 40.14, 36.92, 40.29, 48.84, 46.02, 210.45, 48.13, 47.02, 37.07, 40.96, 41.53, 39.73, 41.75, 48.45, 42.1, 42.73, 95.18, 169.66, 511.01, 430.52, 478.28, 451.9, 469.03, 481.56, 403.33, 447.84, 478.09, 204.87, 426.28, 451.85, 463.84, 432.19, 296.42, 426.75}

	// Sum in the original order, then shuffle and sum again.
	base := Sum(measurements)
	Shuffle(measurements)
	shuffled := Sum(measurements)

	fmt.Printf(" base sum : %v\n", base)
	fmt.Printf("shuffled sum : %v\n", shuffled)
}

// Shuffle randomly permutes vals in place.
func Shuffle(vals []float64) {
	for len(vals) > 0 {
		n := len(vals)
		randIndex := rand.Intn(n)
		vals[n-1], vals[randIndex] = vals[randIndex], vals[n-1]
		vals = vals[:n-1]
	}
}

// Sum adds the values strictly left to right.
func Sum(vals []float64) float64 {
	result := float64(0)
	for _, value := range vals {
		result += value
	}
	return result
}

Output from play.golang.org:
The question is why rules get an unordered set of labels.
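For reference, the effect above comes from the fact that floating-point addition is not associative; a minimal standalone Go sketch (the three constants are chosen purely for illustration and are not from the thread):

package main

import "fmt"

func main() {
	// Regrouping the same three constants changes the low-order bits of the sum.
	a, b, c := 0.1, 0.2, 0.3
	fmt.Println((a + b) + c)        // 0.6000000000000001
	fmt.Println(a + (b + c))        // 0.6
	fmt.Println((a+b)+c == a+(b+c)) // false
}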
Assuming that order is the culprit, I think this is where we populate available series:

This in turn goes to:

type Querier interface {
	// Select returns a set of series that matches the given label matchers.
	Select(...labels.Matcher) (SeriesSet, error)
	// ...
}

There is no mention of the result being sorted somehow, but it appears sorted by label names and values in my quick test. Apparently sometimes this invariant doesn't hold. Should it always?
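If the storage layer does not guarantee that ordering, one way to make any later float64 aggregation order-deterministic would be to sort the expanded series explicitly. A minimal hypothetical sketch, assuming the storage and labels packages referenced elsewhere in this thread (import paths vary between Prometheus versions); this is not code from the Prometheus tree:

package sketch

import (
	"sort"

	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/storage"
)

// sortSeries enforces the sorted-by-labels invariant on an expanded series
// slice, so that consumers summing the series always see the same order.
func sortSeries(res []storage.Series) {
	sort.Slice(res, func(i, j int) bool {
		return labels.Compare(res[i].Labels(), res[j].Labels()) < 0
	})
}

Whether callers should do this or the Querier should guarantee the order is exactly the question above.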
I applied the following patch:

commit e162c6fcfbadcb898ce6ef348334383f6ebe4265
Author: Ivan Babrou <ibobrik@gmail.com>
Date:   Wed Jan 23 20:51:33 2019 -0800

    Warn if series set is not sorted

    Trying to catch the issue in the wild to confirm the theory:

    * https://github.com/prometheus/prometheus/issues/4580
diff --git a/promql/engine.go b/promql/engine.go
index 721e0d25..3b71829c 100644
--- a/promql/engine.go
+++ b/promql/engine.go
@@ -551,7 +551,11 @@ func extractFuncFromPath(p []Node) string {
 	return extractFuncFromPath(p[:len(p)-1])
 }
 
+var foundInvalidOrder = false
+
 func expandSeriesSet(ctx context.Context, it storage.SeriesSet) (res []storage.Series, err error) {
+	prev := labels.Labels{}
+
 	for it.Next() {
 		select {
 		case <-ctx.Done():
@@ -559,6 +563,21 @@ func expandSeriesSet(ctx context.Context, it storage.SeriesSet) (res []storage.Series, err error) {
 		default:
 		}
 		res = append(res, it.At())
+
+		if !foundInvalidOrder {
+			curr := res[len(res)-1].Labels()
+			if len(prev) > 0 && labels.Compare(curr, prev) < 0 {
+				fmt.Printf("Unsorted labels:\n curr: %s\n<\n prev %s\n", curr.String(), prev.String())
+				fmt.Printf("Series set in the iterator so far:\n")
+				for i := 0; i < len(res); i++ {
+					fmt.Printf(" [%10d] %s\n", i, res[i].Labels().String())
+				}
+
+				foundInvalidOrder = true
+			}
+
+			prev = curr
+		}
 	}
 	return res, it.Err()
 }

And what do you know, it actually found an issue in production:
It took exactly 2h to trigger, coincidence?

a.Flag("storage.tsdb.min-block-duration", "Minimum duration of a data block before being persisted. For use in testing.").
	Hidden().Default("2h").SetValue(&cfg.tsdb.MinBlockDuration)

Seems like it could be related.
Are you using remote_read?
No, local storage only.
Hmm. It's not specified if there's an ordering (that's only for sort/topk), but postings should always come out sorted. Are you still on 2.2.1, and if so what happens at master?
I'm on 2.4.3, but can try master if there's a chance it may help.
I don't think the relevant logic has changed, but I did touch the code for 2.7.
Also, does this happen only for recent data (last 2-3 hours), or only for older data (more than 2-3 hours), or only when crossing that boundary?
I've only seen it happen in recording rules, never in live queries. I don't run many live queries that do this sort of calculation, however.
Every 2 hours tsdb truncates the head block (WAL) into a regular persisted block, so I guess it should be related.
So that's always head data you're hitting then via recording rules. Can you try it on older data?
See my very first comment: live queries do not have the same artifact as historical recording rules.
That doesn't make much sense, live queries at a given time should usually return the same value as recording rules executed at that time.
Sorting breaks pretty rarely, since I had to wait for 2h. Rules are evaluated at a regular interval and they write the calculated results into tsdb, so there's a high chance of recording an incorrect result and a 100% chance of seeing the issue afterwards, since it's persisted in tsdb. My queries are evaluated when I issue them (rarely), and for historic data sorting either doesn't break or breaks very rarely at the 2h inflection points, so there's a low chance of seeing the issue. It's not impossible for a live query to hit the same issue if I send it at the exact wrong moment.
While this is not a bug in and of itself, there's something odd going on here that may indicate a bug.
That doesn't seem to line up with the graph you shared, can you explain exactly what you mean?
My graph shows |
Found another fun artifact around machine reboot, not sure if it's the same issue or not. We have multiple Promethei scraping this machine every 60s and only two out of five got this. Graphs have a resolution of 15s. Here's how raw node_cpu looks:

Calculating rate over that partially rewound counter gives the expected garbage data:

And here's how the recording rule is calculated:

- record: instance_mode:node_cpu:sum
  expr: sum(node_cpu) without (cpu)

Doing rate over that looks much more sane:
I've looked through the code, and I don't see any way for things to get out of order. Could you find a historical query that has the issue, and send it and the blocks it's touching to me?
I cannot replicate this with a historical query; it barely appears even in recording rules. Anyway, I gave up on the idea of aggregating counters.
Bug Report
What did you do?
I added a recording rule:
What did you expect to see?
Recording rule matches expression at run time, counter never goes back.
What did you see instead? Under which circumstances?
Runtime calculation:
Rule based calculation:
As you can see, there's some rounding error. Specifically, this causes the counter to jump back:
This makes rate() upset:

Gist with raw node_cpu data: https://gist.github.com/bobrik/048435b6e1280926c0c2cdf87bf0bc2a

Environment