Float rounding in recording doesn't match runtime #4580

Closed

bobrik opened this issue Sep 5, 2018 · 27 comments

@bobrik (Contributor) commented Sep 5, 2018

Bug Report

What did you do?

I added a recording rule:

# Total CPU usage by mode and instance
- record: instance_mode:node_cpu:sum
  expr: sum(node_cpu) without (cpu)

What did you expect to see?

The recording rule to match the expression evaluated at runtime, with the counter never going backwards.

What did you see instead? Under which circumstances?

Runtime calculation:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {},
        "values": [
          [
            1536185100,
            "34438.13999999998"
          ],
          [
            1536185130,
            "34438.13999999998"
          ],
          [
            1536185160,
            "34438.13999999998"
          ],
          [
            1536185190,
            "34438.13999999998"
          ],
          [
            1536185220,
            "34438.13999999998"
          ],
          [
            1536185250,
            "34438.13999999998"
          ],
          [
            1536185280,
            "34438.13999999998"
          ],
          [
            1536185310,
            "34438.13999999998"
          ],
          [
            1536185340,
            "34438.14999999998"
          ],
          [
            1536185370,
            "34438.14999999998"
          ],
          [
            1536185400,
            "34438.14999999998"
          ],
          [
            1536185430,
            "34438.14999999998"
          ],
          [
            1536185460,
            "34438.14999999998"
          ]
        ]
      }
    ]
  }
}

Rule-based calculation:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "instance_mode:node_cpu:sum",
          "colo_id": "28",
          "colo_name": "sea01",
          "instance": "myhost",
          "job": "node_exporter",
          "mode": "nice",
          "node_type": "metal",
          "region": "wnam"
        },
        "values": [
          [
            1536185100,
            "34438.139999999985"
          ],
          [
            1536185130,
            "34438.139999999985"
          ],
          [
            1536185160,
            "34438.13999999999"
          ],
          [
            1536185190,
            "34438.13999999999"
          ],
          [
            1536185220,
            "34438.13999999997"
          ],
          [
            1536185250,
            "34438.13999999997"
          ],
          [
            1536185280,
            "34438.14"
          ],
          [
            1536185310,
            "34438.14"
          ],
          [
            1536185340,
            "34438.149999999994"
          ],
          [
            1536185370,
            "34438.149999999994"
          ],
          [
            1536185400,
            "34438.14999999999"
          ],
          [
            1536185430,
            "34438.14999999999"
          ],
          [
            1536185460,
            "34438.149999999994"
          ]
        ]
      }
    ]
  }
}

As you can see, there's some rounding error. Specifically, this causes the counter to jump backwards:

[
  [
    1536185190,
    "34438.13999999999"
  ],
  [
    1536185220,
    "34438.13999999997"
  ],
  [
    1536185250,
    "34438.13999999997"
  ]
]

This makes rate() upset:

[graph: rate() over the recorded series, showing spurious spikes]

Gist with raw node_cpu data: https://gist.github.com/bobrik/048435b6e1280926c0c2cdf87bf0bc2a

Environment

  • System information:
$ uname -srm
Linux 4.14.59-cloudflare-2018.7.5 x86_64
  • Prometheus version:
$ prometheus --version
prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10
@brian-brazil (Contributor)

See https://www.robustperception.io/rate-then-sum-never-sum-then-rate, which describes the recommended way to do this.
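
For reference, the rate-then-sum version of the rule from this report would look roughly like the sketch below; the 5m window and the recorded metric name are illustrative choices, not anything prescribed by the article:

# Sketch only: take the per-CPU rate first, then aggregate; window and rule name are illustrative
- record: instance_mode:node_cpu:rate5m
  expr: sum(rate(node_cpu[5m])) without (cpu)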

@bobrik (Contributor, Author) commented Dec 17, 2018

> Even if you've worked around this being an invalid expression with a recording rule, the real problem is what happens when one of the servers restarts. The counters from the restarted server will reset to 0, the sum will decrease, and that will then be treated by rate as a counter reset, giving you a large spurious spike in the result.

We have a static number of CPUs in the system and it's impossible for a single counter under aggregation to reset without resetting all of them, so I don't think that applies. It seems that aggregating multiple counters that come from the same exporter instance should be safe.

Is there any other reason counters should not be aggregated when they come from the same instance?

Having counters rather than rates stored in tsdb adds flexibility in terms of what period rate and irate can be applied over. It's the same reason we have counters and not rates in the first place.

It works just fine with a live query, but breaks with a recording rule when it should work exactly the same way, which makes me think it's a bug in either tsdb or rule evaluation.

cc @krasi-georgiev

@krasi-georgiev (Contributor)

@bobrik why do you think it might be tsdb related?
tsdb doesn't have any smart logic in there; it just stores and retrieves data based on a given filter/matcher.

@bobrik (Contributor, Author) commented Dec 18, 2018

@krasi-georgiev maybe tsdb loses precision somehow on store? It could also be the rule engine passing an incorrect value; I'm not certain where exactly the bug is.

@krasi-georgiev (Contributor)

That sounds very, very unlikely, so I would look for the cause elsewhere.

@bobrik (Contributor, Author) commented Jan 17, 2019

@brian-brazil it really seems like a legitimate bug to me if recording rules do not match live queries.

@bobrik (Contributor, Author) commented Jan 23, 2019

I think the culprit is that labels are not sorted somewhere. A non-deterministic order of float64 addends results in a non-deterministic sum.

Consider this code (play.golang.org):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	measurements := []float64{3008.72, 21.28, 227.46, 211.33, 219.77, 205.26, 189.64, 231.46, 215.77, 204.09, 110.2, 229.25, 62.67, 244, 212.98, 209.35, 222.7, 177.96, 217.95, 243.82, 205.43, 291.03, 303.63, 110.65, 282.88, 285.47, 350.44, 1290.37, 1090.82, 1221.67, 1149.17, 1166.08, 1118.8, 906.82, 162.94, 1113.33, 1159.18, 1094.31, 1094.44, 1110.53, 1080.12, 1154.85, 1059.6, 36.61, 19.45, 178.84, 30.99, 36.78, 36.88, 38, 35.75, 39.32, 38.94, 43, 45.14, 39.44, 179.4, 38.93, 40.67, 38.51, 42.49, 40.06, 40.14, 36.92, 40.29, 48.84, 46.02, 210.45, 48.13, 47.02, 37.07, 40.96, 41.53, 39.73, 41.75, 48.45, 42.1, 42.73, 95.18, 169.66, 511.01, 430.52, 478.28, 451.9, 469.03, 481.56, 403.33, 447.84, 478.09, 204.87, 426.28, 451.85, 463.84, 432.19, 296.42, 426.75}

	base := Sum(measurements)

	Shuffle(measurements)

	shuffled := Sum(measurements)

	fmt.Printf("    base sum : %v\n", base)
	fmt.Printf("shuffled sum : %v\n", shuffled)
}

func Shuffle(vals []float64) {
	for len(vals) > 0 {
		n := len(vals)
		randIndex := rand.Intn(n)
		vals[n-1], vals[randIndex] = vals[randIndex], vals[n-1]
		vals = vals[:n-1]
	}
}

func Sum(vals []float64) float64 {
	result := float64(0)

	for _, value := range vals {
		result += value
	}

	return result
}

Output from play.golang.org:

    base sum : 34438.14999999999
shuffled sum : 34438.14999999998

The question is why rule evaluation ends up with an unordered set of series.
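
To illustrate the point: if the addition order were fixed, for example by sorting the values before summing (just a sketch of the ordering argument, not how Prometheus actually aggregates), the result would come out the same no matter what order the series were delivered in:

package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// sumSorted copies the inputs, sorts the copy, and sums in that fixed order,
// so the result no longer depends on the order the values arrived in.
func sumSorted(vals []float64) float64 {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)

	sum := float64(0)
	for _, v := range sorted {
		sum += v
	}
	return sum
}

func main() {
	vals := []float64{3008.72, 21.28, 227.46, 211.33, 1090.82, 36.61, 430.52, 1221.67}

	// Shuffle the delivery order a few times; the sum over sorted inputs stays identical.
	for i := 0; i < 3; i++ {
		rand.Shuffle(len(vals), func(a, b int) { vals[a], vals[b] = vals[b], vals[a] })
		fmt.Printf("sum over sorted inputs: %v\n", sumSorted(vals))
	}
}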

@bobrik (Contributor, Author) commented Jan 23, 2019

Assuming that order is the culprit, I think the relevant place is where the PromQL engine expands the available series; this in turn goes to the Querier in tsdb:

type Querier interface {
    // Select returns a set of series that matches the given label matchers.
    Select(...labels.Matcher) (SeriesSet, error)

    // ...
}

There is no mention of the result being sorted in any way, but from my quick test it appears to be sorted by label names and values. Apparently this invariant sometimes doesn't hold. Should it always?

@bobrik (Contributor, Author) commented Jan 24, 2019

I applied the following patch:

commit e162c6fcfbadcb898ce6ef348334383f6ebe4265
Author: Ivan Babrou <ibobrik@gmail.com>
Date:   Wed Jan 23 20:51:33 2019 -0800

    Warn if series set is not sorted
    
    Trying to catch the issue in the wild to confirm the theory:
    
    * https://github.com/prometheus/prometheus/issues/4580

diff --git a/promql/engine.go b/promql/engine.go
index 721e0d25..3b71829c 100644
--- a/promql/engine.go
+++ b/promql/engine.go
@@ -551,7 +551,11 @@ func extractFuncFromPath(p []Node) string {
 	return extractFuncFromPath(p[:len(p)-1])
 }
 
+var foundInvalidOrder = false
+
 func expandSeriesSet(ctx context.Context, it storage.SeriesSet) (res []storage.Series, err error) {
+	prev := labels.Labels{}
+
 	for it.Next() {
 		select {
 		case <-ctx.Done():
@@ -559,6 +563,21 @@ func expandSeriesSet(ctx context.Context, it storage.SeriesSet) (res []storage.S
 		default:
 		}
 		res = append(res, it.At())
+
+		if !foundInvalidOrder {
+			curr := res[len(res)-1].Labels()
+			if len(prev) > 0 && labels.Compare(curr, prev) < 0 {
+				fmt.Printf("Unsorted labels:\n  curr: %s\n<\n  prev %s\n", curr.String(), prev.String())
+				fmt.Printf("Series set in the iterator so far:\n")
+				for i := 0; i < len(res); i++ {
+					fmt.Printf("  [%10d] %s\n", i, res[i].Labels().String())
+				}
+
+				foundInvalidOrder = true
+			}
+
+			prev = curr
+		}
 	}
 	return res, it.Err()
 }

And what do you know, it actually found an issue in production:

Jan 24 21:01:00 foodm prometheus[92486]: level=info ts=2019-01-24T21:01:00.30969338Z caller=compact.go:398 component=tsdb msg="write block" mint=1548352800000 maxt=1548360000000 ulid=01D20TK445PXFH78WBFQS5PZ4P
...
Jan 24 23:01:00 foodm prometheus[92486]: Unsorted labels:
Jan 24 23:01:00 foodm prometheus[92486]:   curr: {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo135", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Cache", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]: <
Jan 24 23:01:00 foodm prometheus[92486]:   prev: {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo33", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Ticket", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]: Series set in the iterator so far:
Jan 24 23:01:00 foodm prometheus[92486]:   [         0] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="P-256", instance="foo122", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Cache", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]:   [         1] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo136", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Ticket", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]:   [         2] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo33", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Ticket", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]:   [         3] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo135", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Cache", version="TLSv1.3"}

It took exactly 2h to trigger, coincidence?

	a.Flag("storage.tsdb.min-block-duration", "Minimum duration of a data block before being persisted. For use in testing.").
		Hidden().Default("2h").SetValue(&cfg.tsdb.MinBlockDuration)

Seems like it could be tsdb related after all.

@brian-brazil (Contributor)

Are you using remote_read?

@bobrik (Contributor, Author) commented Jan 24, 2019

No, local storage only.

@brian-brazil (Contributor)

Hmm. It's not specified whether there's an ordering (that's only for sort/topk), but postings should always come out sorted. Are you still on 2.2.1, and if so, what happens on master?

@bobrik (Contributor, Author) commented Jan 25, 2019

I'm on 2.4.3, but I can try master if there's a chance it may help.

@brian-brazil (Contributor)

I don't think the relevant logic has changed, but I did touch the code for 2.7.

@brian-brazil (Contributor)

Also, does this happen only for recent data (last 2-3 hours), or only for older data (more than 2-3 hours), or only when crossing that boundary?

@bobrik (Contributor, Author) commented Jan 25, 2019

I've only seen it happen in recording rules, never in live queries. I don't run many live queries that do this sort of calculation, however.

@xjewer (Contributor) commented Jan 25, 2019

> It took exactly 2h to trigger, coincidence?

Every 2 hours tsdb truncates the head block (WAL) into a regular block, so I'd guess it's related.

@brian-brazil (Contributor)

So it's always head data you're hitting via recording rules, then. Can you try it on older data?

@bobrik (Contributor, Author) commented Jan 25, 2019

See my very first comment: live queries do not have the same artifact as historical recording rules.

@brian-brazil (Contributor)

That doesn't make much sense; live queries at a given time should usually return the same value as recording rules evaluated at that time.

@bobrik (Contributor, Author) commented Jan 25, 2019

Sorting breaks pretty rarely, given that I had to wait 2h for it to trigger.

Rules are evaluated at a regular interval and write their results into tsdb -> high chance of recording an incorrect result, and a 100% chance of seeing the issue afterwards, since it's persisted in tsdb.

My queries are evaluated when I issue them (rarely), and for historical data sorting either doesn't break or breaks very rarely at the 2h inflection points -> low chance of seeing the issue.

It's not impossible for a live query to hit the same issue if I send it at exactly the wrong moment.

brian-brazil reopened this Jan 25, 2019
@brian-brazil (Contributor)

While this is not a bug in and of itself, there's something odd going on here that may indicate a bug.

@brian-brazil (Contributor)

> for historical data sorting either doesn't break or breaks very rarely at the 2h inflection points

That doesn't seem to line up with the graph you shared; can you explain exactly what you mean?

@bobrik (Contributor, Author) commented Jan 31, 2019

My graph shows rate() over the broken metric produced by the recording rule. If I take the rate over the rule's expression instead of the recorded metric, the graph is a flat line, as expected.

@bobrik (Contributor, Author) commented Feb 1, 2019

Found another fun artifact around a machine reboot; not sure whether it's the same issue. We have multiple Promethei scraping this machine every 60s, and only two out of five got this.

The graphs have a resolution of 15s.

Here's what raw node_cpu looks like:

[graph: raw node_cpu counters around the reboot]

Calculating rate over that partially rewound counter gives the expected garbage:

[graph: rate() over the partially rewound counter]

And here's what the recording rule calculates:

- record: instance_mode:node_cpu:sum
  expr: sum(node_cpu) without (cpu)

[graph: the recorded instance_mode:node_cpu:sum series]

Doing rate over that looks much saner:

[graph: rate() over the recorded series]

@brian-brazil (Contributor)

I've looked through the code, and I don't see any way for things to get out of order. Could you find a historical query that has the issue, and send it and the blocks it's touching to me?

@bobrik (Contributor, Author) commented Feb 3, 2019

I cannot replicate this with a historical query; it only rarely shows up even in recording rules.

Anyway, I gave up on the idea of aggregating counters.

bobrik closed this as completed Feb 3, 2019
lock bot locked and limited conversation to collaborators Aug 2, 2019