Float rounding in recording doesn't match runtime #4580

Closed

bobrik opened this issue Sep 5, 2018 · 27 comments

@bobrik (Contributor) commented Sep 5, 2018

Bug Report

What did you do?

I added a recording rule:

# Total CPU usage by mode and instance
- record: instance_mode:node_cpu:sum
  expr: sum(node_cpu) without (cpu)

What did you expect to see?

The recording rule to match the expression evaluated at runtime, with the counter never going backwards.

What did you see instead? Under which circumstances?

Runtime calculation:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {},
        "values": [
          [
            1536185100,
            "34438.13999999998"
          ],
          [
            1536185130,
            "34438.13999999998"
          ],
          [
            1536185160,
            "34438.13999999998"
          ],
          [
            1536185190,
            "34438.13999999998"
          ],
          [
            1536185220,
            "34438.13999999998"
          ],
          [
            1536185250,
            "34438.13999999998"
          ],
          [
            1536185280,
            "34438.13999999998"
          ],
          [
            1536185310,
            "34438.13999999998"
          ],
          [
            1536185340,
            "34438.14999999998"
          ],
          [
            1536185370,
            "34438.14999999998"
          ],
          [
            1536185400,
            "34438.14999999998"
          ],
          [
            1536185430,
            "34438.14999999998"
          ],
          [
            1536185460,
            "34438.14999999998"
          ]
        ]
      }
    ]
  }
}

Rule-based calculation:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "instance_mode:node_cpu:sum",
          "colo_id": "28",
          "colo_name": "sea01",
          "instance": "myhost",
          "job": "node_exporter",
          "mode": "nice",
          "node_type": "metal",
          "region": "wnam"
        },
        "values": [
          [
            1536185100,
            "34438.139999999985"
          ],
          [
            1536185130,
            "34438.139999999985"
          ],
          [
            1536185160,
            "34438.13999999999"
          ],
          [
            1536185190,
            "34438.13999999999"
          ],
          [
            1536185220,
            "34438.13999999997"
          ],
          [
            1536185250,
            "34438.13999999997"
          ],
          [
            1536185280,
            "34438.14"
          ],
          [
            1536185310,
            "34438.14"
          ],
          [
            1536185340,
            "34438.149999999994"
          ],
          [
            1536185370,
            "34438.149999999994"
          ],
          [
            1536185400,
            "34438.14999999999"
          ],
          [
            1536185430,
            "34438.14999999999"
          ],
          [
            1536185460,
            "34438.149999999994"
          ]
        ]
      }
    ]
  }
}

As you can see, there's some rounding error. Specifically, this causes the counter to jump backwards:

[
  [
    1536185190,
    "34438.13999999999"
  ],
  [
    1536185220,
    "34438.13999999997"
  ],
  [
    1536185250,
    "34438.13999999997"
  ]
]

This makes rate() upset:

[graph: rate() over the recorded series, showing spurious spikes]

Gist with raw node_cpu data: https://gist.github.com/bobrik/048435b6e1280926c0c2cdf87bf0bc2a

Environment

  • System information:
$ uname -srm
Linux 4.14.59-cloudflare-2018.7.5 x86_64
  • Prometheus version:
$ prometheus --version
prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10
@brian-brazil (Contributor)

See https://www.robustperception.io/rate-then-sum-never-sum-then-rate, which describes the recommended way to do this.
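
For reference, the rate-then-sum version of the rule from this report would look roughly like the sketch below; the 5m window and the recorded metric name are illustrative choices, not anything prescribed by the article:

# Sketch only: take the per-CPU rate first, then aggregate; window and rule name are illustrative
- record: instance_mode:node_cpu:rate5m
  expr: sum(rate(node_cpu[5m])) without (cpu)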

@bobrik (Contributor, Author) commented Dec 17, 2018

> Even if you've worked around this being an invalid expression with a recording rule, the real problem is what happens when one of the servers restarts. The counters from the restarted server will reset to 0, the sum will decrease, and that will then be treated by rate as a counter reset, giving you a large spurious spike in the result.

We have a static number of CPUs in the system and it's impossible for a single counter under aggregation to reset without resetting all of them, so I don't think that applies. It seems that aggregating multiple counters that come from the same exporter instance should be safe.

Is there any other reason counters should not be aggregated when they come from the same instance?

Having counters rather than rates stored in tsdb adds flexibility in terms of what period rate and irate can be applied over. It's the same reason we have counters and not rates in the first place.

It works just fine with a live query, but breaks with a recording rule when it should work exactly the same way, which makes me think it's a bug in either tsdb or rule evaluation.

cc @krasi-georgiev

@krasi-georgiev (Contributor)

@bobrik why do you think it might be tsdb related?
tsdb doesn't have any smart logic in there; it just stores and retrieves data based on a given filter/matcher.

@bobrik (Contributor, Author) commented Dec 18, 2018

@krasi-georgiev maybe tsdb loses precision somehow on store? It could also be the rule engine passing an incorrect value; I'm not certain where exactly the bug is.

@krasi-georgiev (Contributor)

That sounds very, very unlikely, so I would look for the cause elsewhere.

@bobrik (Contributor, Author) commented Jan 17, 2019

@brian-brazil it really seems like a legitimate bug to me if recording rules do not match live queries.

@bobrik (Contributor, Author) commented Jan 23, 2019

I think the culprit is that labels are not sorted somewhere. A non-deterministic order of float64 addends results in a non-deterministic sum.

Consider this code (play.golang.org):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	measurements := []float64{3008.72, 21.28, 227.46, 211.33, 219.77, 205.26, 189.64, 231.46, 215.77, 204.09, 110.2, 229.25, 62.67, 244, 212.98, 209.35, 222.7, 177.96, 217.95, 243.82, 205.43, 291.03, 303.63, 110.65, 282.88, 285.47, 350.44, 1290.37, 1090.82, 1221.67, 1149.17, 1166.08, 1118.8, 906.82, 162.94, 1113.33, 1159.18, 1094.31, 1094.44, 1110.53, 1080.12, 1154.85, 1059.6, 36.61, 19.45, 178.84, 30.99, 36.78, 36.88, 38, 35.75, 39.32, 38.94, 43, 45.14, 39.44, 179.4, 38.93, 40.67, 38.51, 42.49, 40.06, 40.14, 36.92, 40.29, 48.84, 46.02, 210.45, 48.13, 47.02, 37.07, 40.96, 41.53, 39.73, 41.75, 48.45, 42.1, 42.73, 95.18, 169.66, 511.01, 430.52, 478.28, 451.9, 469.03, 481.56, 403.33, 447.84, 478.09, 204.87, 426.28, 451.85, 463.84, 432.19, 296.42, 426.75}

	base := Sum(measurements)

	Shuffle(measurements)

	shuffled := Sum(measurements)

	fmt.Printf("    base sum : %v\n", base)
	fmt.Printf("shuffled sum : %v\n", shuffled)
}

func Shuffle(vals []float64) {
	for len(vals) > 0 {
		n := len(vals)
		randIndex := rand.Intn(n)
		vals[n-1], vals[randIndex] = vals[randIndex], vals[n-1]
		vals = vals[:n-1]
	}
}

func Sum(vals []float64) float64 {
	result := float64(0)

	for _, value := range vals {
		result += value
	}

	return result
}

Output from play.golang.org:

    base sum : 34438.14999999999
shuffled sum : 34438.14999999998

The question is why rule evaluation ends up with an unordered set of series.
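
To illustrate the point: if the addition order were fixed, for example by sorting the values before summing (just a sketch of the ordering argument, not how Prometheus actually aggregates), the result would come out the same no matter what order the series were delivered in:

package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// sumSorted copies the inputs, sorts the copy, and sums in that fixed order,
// so the result no longer depends on the order the values arrived in.
func sumSorted(vals []float64) float64 {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)

	sum := float64(0)
	for _, v := range sorted {
		sum += v
	}
	return sum
}

func main() {
	vals := []float64{3008.72, 21.28, 227.46, 211.33, 1090.82, 36.61, 430.52, 1221.67}

	// Shuffle the delivery order a few times; the sum over sorted inputs stays identical.
	for i := 0; i < 3; i++ {
		rand.Shuffle(len(vals), func(a, b int) { vals[a], vals[b] = vals[b], vals[a] })
		fmt.Printf("sum over sorted inputs: %v\n", sumSorted(vals))
	}
}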

@bobrik (Contributor, Author) commented Jan 23, 2019

Assuming that order is the culprit, I think the relevant place is where the PromQL engine expands the available series; this in turn goes to the Querier in tsdb:

type Querier interface {
    // Select returns a set of series that matches the given label matchers.
    Select(...labels.Matcher) (SeriesSet, error)

    // ...
}

There is no mention of the result being sorted in any way, but from my quick test it appears to be sorted by label names and values. Apparently this invariant sometimes doesn't hold. Should it always?

@bobrik (Contributor, Author) commented Jan 24, 2019

I applied the following patch:

commit e162c6fcfbadcb898ce6ef348334383f6ebe4265
Author: Ivan Babrou <ibobrik@gmail.com>
Date:   Wed Jan 23 20:51:33 2019 -0800

    Warn if series set is not sorted
    
    Trying to catch the issue in the wild to confirm the theory:
    
    * https://github.com/prometheus/prometheus/issues/4580

diff --git a/promql/engine.go b/promql/engine.go
index 721e0d25..3b71829c 100644
--- a/promql/engine.go
+++ b/promql/engine.go
@@ -551,7 +551,11 @@ func extractFuncFromPath(p []Node) string {
 	return extractFuncFromPath(p[:len(p)-1])
 }
 
+var foundInvalidOrder = false
+
 func expandSeriesSet(ctx context.Context, it storage.SeriesSet) (res []storage.Series, err error) {
+	prev := labels.Labels{}
+
 	for it.Next() {
 		select {
 		case <-ctx.Done():
@@ -559,6 +563,21 @@ func expandSeriesSet(ctx context.Context, it storage.SeriesSet) (res []storage.S
 		default:
 		}
 		res = append(res, it.At())
+
+		if !foundInvalidOrder {
+			curr := res[len(res)-1].Labels()
+			if len(prev) > 0 && labels.Compare(curr, prev) < 0 {
+				fmt.Printf("Unsorted labels:\n  curr: %s\n<\n  prev %s\n", curr.String(), prev.String())
+				fmt.Printf("Series set in the iterator so far:\n")
+				for i := 0; i < len(res); i++ {
+					fmt.Printf("  [%10d] %s\n", i, res[i].Labels().String())
+				}
+
+				foundInvalidOrder = true
+			}
+
+			prev = curr
+		}
 	}
 	return res, it.Err()
 }

And what do you know, it actually found an issue in production:

Jan 24 21:01:00 foodm prometheus[92486]: level=info ts=2019-01-24T21:01:00.30969338Z caller=compact.go:398 component=tsdb msg="write block" mint=1548352800000 maxt=1548360000000 ulid=01D20TK445PXFH78WBFQS5PZ4P
...
Jan 24 23:01:00 foodm prometheus[92486]: Unsorted labels:
Jan 24 23:01:00 foodm prometheus[92486]:   curr: {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo135", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Cache", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]: <
Jan 24 23:01:00 foodm prometheus[92486]:   prev: {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo33", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Ticket", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]: Series set in the iterator so far:
Jan 24 23:01:00 foodm prometheus[92486]:   [         0] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="P-256", instance="foo122", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Cache", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]:   [         1] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo136", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Ticket", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]:   [         2] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo33", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Ticket", version="TLSv1.3"}
Jan 24 23:01:00 foodm prometheus[92486]:   [         3] {__name__="ssl_connections", cert_type="ECDSA", cipher="AEAD-AES128-GCM-SHA256", colo_id="123", colo_name="foo01", curve="X25519", instance="foo135", job="nginx_ssl", node_status="v", node_type="metal", region="wnam", resumption="Reused-Cache", version="TLSv1.3"}

It took exactly 2h to trigger, coincidence?

	a.Flag("storage.tsdb.min-block-duration", "Minimum duration of a data block before being persisted. For use in testing.").
		Hidden().Default("2h").SetValue(&cfg.tsdb.MinBlockDuration)

Seems like it could be tsdb related after all.

@brian-brazil (Contributor)

Are you using remote_read?

@bobrik (Contributor, Author) commented Jan 24, 2019

No, local storage only.

@brian-brazil (Contributor)

Hmm. It's not specified whether there's an ordering (that's only for sort/topk), but postings should always come out sorted. Are you still on 2.2.1, and if so, what happens on master?

@bobrik (Contributor, Author) commented Jan 25, 2019

I'm on 2.4.3, but I can try master if there's a chance it may help.

@brian-brazil (Contributor)

I don't think the relevant logic has changed, but I did touch the code for 2.7.

@brian-brazil (Contributor)

Also, does this happen only for recent data (last 2-3 hours), or only for older data (more than 2-3 hours), or only when crossing that boundary?

@bobrik (Contributor, Author) commented Jan 25, 2019

I've only seen it happen in recording rules, never in live queries. I don't run many live queries that do this sort of calculation, however.

@xjewer (Contributor) commented Jan 25, 2019

> It took exactly 2h to trigger, coincidence?

Every 2 hours tsdb truncates the head block (WAL) into a regular block, so I'd guess it's related.

@brian-brazil (Contributor)

So it's always head data you're hitting via recording rules, then. Can you try it on older data?

@bobrik (Contributor, Author) commented Jan 25, 2019

See my very first comment: live queries do not have the same artifact as historical recording rules.

@brian-brazil (Contributor)

That doesn't make much sense; live queries at a given time should usually return the same value as recording rules evaluated at that time.

@bobrik (Contributor, Author) commented Jan 25, 2019

Sorting breaks pretty rarely, given that I had to wait 2h for it to trigger.

Rules are evaluated at a regular interval and write their results into tsdb -> high chance of recording an incorrect result, and a 100% chance of seeing the issue afterwards, since it's persisted in tsdb.

My queries are evaluated when I issue them (rarely), and for historical data sorting either doesn't break or breaks very rarely at the 2h inflection points -> low chance of seeing the issue.

It's not impossible for a live query to hit the same issue if I send it at exactly the wrong moment.

brian-brazil reopened this Jan 25, 2019
@brian-brazil (Contributor)

While this is not a bug in and of itself, there's something odd going on here that may indicate a bug.

@brian-brazil (Contributor)

> for historical data sorting either doesn't break or breaks very rarely at the 2h inflection points

That doesn't seem to line up with the graph you shared; can you explain exactly what you mean?

@bobrik (Contributor, Author) commented Jan 31, 2019

My graph shows rate() over the broken metric produced by the recording rule. If I take the rate over the rule's expression instead of the recorded metric, the graph is a flat line, as expected.

@bobrik (Contributor, Author) commented Feb 1, 2019

Found another fun artifact around a machine reboot; not sure whether it's the same issue. We have multiple Promethei scraping this machine every 60s, and only two out of five got this.

The graphs have a resolution of 15s.

Here's what raw node_cpu looks like:

[graph: raw node_cpu counters around the reboot]

Calculating rate over that partially rewound counter gives the expected garbage:

[graph: rate() over the partially rewound counter]

And here's what the recording rule calculates:

- record: instance_mode:node_cpu:sum
  expr: sum(node_cpu) without (cpu)

[graph: the recorded instance_mode:node_cpu:sum series]

Doing rate over that looks much saner:

[graph: rate() over the recorded series]

@brian-brazil (Contributor)

I've looked through the code, and I don't see any way for things to get out of order. Could you find a historical query that has the issue, and send it and the blocks it's touching to me?

@bobrik (Contributor, Author) commented Feb 3, 2019

I cannot replicate this with a historical query; it only rarely shows up even in recording rules.

Anyway, I gave up on the idea of aggregating counters.

bobrik closed this as completed Feb 3, 2019
lock bot locked and limited conversation to collaborators Aug 2, 2019