
node_cpu_seconds_total values are not monotonically increasing #1686

Closed
venkatbvc opened this issue Apr 24, 2020 · 10 comments · Fixed by #1711
Labels
bug platform/Linux Linux specific issue

Comments

@venkatbvc

Host operating system: output of uname -a

Linux ddebvnf-oame-1 3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.17.0 (branch: HEAD, revision: f6f6194)
build user: root@322511e06ced
build date: 20181130-15:51:33
go version: go1.11.2

node_exporter command line flags

node_exporter --collector.systemd
--collector.systemd.unit-whitelist=^(grafana|prometheus|node_exporter|rabbitmq-server|asprom|gmond|gmetad|mariadb.|ntpd|httpd|jaeger|metrics|gen3gppxml|alertmanager|etcd|alarmagtd|keepalived|zabbix.).service$
--collector.textfile.directory=/opt/node_exporter/metrics

Are you running node_exporter in Docker?

NO

What did you do that produced an error?

Nothing. Node exporter is running and Prometheus is scraping the metrics; the scrape interval is 5s.
When a graph is plotted for node_cpu_seconds_total, we saw a huge spike. The following query was used: rate(node_cpu_seconds_total{cpu="6",instance="osc1deacsdme1-oame-0",job="System",mode="iowait"}[2m])

What did you expect to see?

There should not be any huge spikes, and there should not be a dip in the node_cpu_seconds_total values.

What did you see instead?

There is a huge spike on 9 March at 00:26:30, because there is a dip in the node_cpu_seconds_total values.

The following is the data in Prometheus:
curl -g 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total{cpu="6",instance="osc1deacsdme1-oame-0",job="System",mode="iowait"}[2m]&time=1583693790'
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"name":"node_cpu_seconds_total","cpu":"6","instance":"osc1deacsdme1-oame-0","job":"System","mode":"iowait"},
"values":[[1583693670.227,"62176.51"],[1583693675.227,"62176.77"],[1583693680.227,"62176.98"],[1583693685.227,"62176.99"],[1583693690.227,"62176.99"],[1583693695.227,"62177.03"],
[1583693700.227,"62177.08"],[1583693705.228,"62177.08"],[1583693710.227,"62177.09"],[1583693715.227,"62177.09"],[1583693720.227,"62177.09"],[1583693725.227,"62177.09"],
[1583693730.227,"62177.09"],[1583693735.227,"62177.09"],[1583693740.227,"62177.09"],[1583693745.227,"62177.09"],[1583693750.227,"62177.09"],[1583693755.227,"62177.09"],
[1583693760.227,"62177.09"],[1583693765.227,"62177.09"],[1583693770.227,"62177.09"],[1583693775.227,"62177.24"],[1583693780.227,"62177.2"],[1583693785.227,"62177.2"]]}]}}

I would like to know why there is a dip in the counter value.

@discordianfish
Member

Interesting, but I would assume some kernel issue? We just return this from procfs.

@SuperQ
Member

SuperQ commented Apr 25, 2020

This is a known issue with iowait in the Linux kernel. We noticed this at SoundCloud years ago, but never got anywhere digging into it. Recently, I was looking into it again and we found some interesting info. It seems specifically broken for iowait due to the way the data collection is implemented in the kernel.

What we ended up doing to work around this was to break iowait out into a deriv() rule, separate from the rest of the CPU metrics. I was considering updating the example recording rules file to document this.
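For reference, a recording rule along those lines might look roughly like the following (a sketch only; the group and rule names and the 5m window are illustrative, not the actual SoundCloud rules):

groups:
  - name: cpu_iowait_workaround
    rules:
      - record: instance_cpu:node_cpu_seconds_iowait:deriv5m
        expr: deriv(node_cpu_seconds_total{mode="iowait"}[5m])

deriv() tolerates the occasional backwards step because it fits a linear regression over the window, whereas rate() treats any decrease as a counter reset and produces a huge spike.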

I've thought about bugging kernel people, but I'm not sure there would be any interest in fixing this, especially since it means having a lock, which is something kernel devs are very cautious about.

@venkatbvc
Author

@discordianfish @SuperQ Thanks for your response.
Should we use the deriv() function instead of rate(), so that these spikes are not seen in the graphs?

@discordianfish
Member

Hrm.. I mean.. it's kinda our problem now. We shouldn't expose it as counter if it's not really a counter after all.

Fixing this upstream would be great.. Or we could add a workaround that tracks the max value, prints an error, and returns that max if the current value is lower.

@MaheshGPai

As per the Prometheus docs, it should be used only with gauges:

deriv()
deriv(v range-vector) calculates the per-second derivative of the time series in a range vector v, using simple linear regression.
deriv should only be used with gauges.

Since the metric is currently exposed as a counter, I'm not sure how the Prometheus query engine will process it.
If there is no issue with using deriv() instead of irate()/rate(), then it should be fine.
Otherwise, changing the query to return only results <= 100 should eliminate the spikes seen in Grafana:

sum by (instance)(irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 <=100

@SuperQ
Member

SuperQ commented Apr 27, 2020

@discordianfish Yea, the only thing we can do is keep track of the data coming from /proc/stat and only output data if it goes up. We could log debug if there's a drop in values.

The question is what to do if the list of CPUs changes.
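A minimal sketch of that approach, in Go (the names here are hypothetical; the actual change landed later via #1711): remember the last value seen per CPU and mode, and if a new reading goes backwards, log it and keep exposing the cached maximum.

package collector

import "log"

// cpuKey identifies one counter in /proc/stat: a CPU plus a mode
// such as "idle" or "iowait".
type cpuKey struct {
	cpu  string
	mode string
}

// cpuStatCache remembers the highest value seen so far for each key.
type cpuStatCache struct {
	last map[cpuKey]float64
}

func newCPUStatCache() *cpuStatCache {
	return &cpuStatCache{last: make(map[cpuKey]float64)}
}

// clamp returns the value to expose: the new reading if it did not go
// backwards, otherwise the previously cached (higher) value, with a
// debug-style log noting the drop.
func (c *cpuStatCache) clamp(cpu, mode string, value float64) float64 {
	k := cpuKey{cpu: cpu, mode: mode}
	if prev, ok := c.last[k]; ok && value < prev {
		log.Printf("debug: %s %s dropped from %f to %f, exposing cached value", cpu, mode, prev, value)
		return prev
	}
	c.last[k] = value
	return value
}

The collector would call clamp() on every field parsed out of /proc/stat before emitting the metric.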

SuperQ added the bug label Apr 27, 2020
@SuperQ
Member

SuperQ commented Apr 27, 2020

I've seen this enough times that I think we should do the workaround for the bad kernel data. It's not typically best practice for an exporter to do this kind of stuff, but I think we need to in this case.

SuperQ added the platform/Linux (Linux specific issue) label Apr 27, 2020
@brian-brazil
Contributor

The question is what to do if the list of CPUs changes.

This has been stuck in my head, so I did some research. If I hotplug a CPU offline, the relevant cpu line disappears but the other cpu names in /proc/stat don't change; however, when I online the CPU again, at least the idle and iowait counters get reset:

Before:
cpu1 157846114 580231 38791682 1157995658 2587676 0 151288 0 0 0
After:
cpu1 157847655 580231 38792001 105 0 0 151288 0 0 0

So the problem isn't whether the list of CPUs changes, it's whether there's an actual counter reset.

This was on 4.15.0-66-generic.

@SuperQ
Member

SuperQ commented May 1, 2020

@brian-brazil Thanks. So I guess what we need is to track the list of CPUs; if the list changes, we invalidate the tracking cache.
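Extending the clamp sketch above (same hypothetical cpuStatCache type; an illustration of the idea, not the code from #1711), the invalidation could be as simple as dropping the whole cache whenever a previously cached CPU is missing from /proc/stat:

// invalidateIfChanged clears the cache when a CPU that we have cached
// values for no longer appears in /proc/stat, so that a genuine
// counter reset after hotplug is not clamped away on the next scrape.
func (c *cpuStatCache) invalidateIfChanged(cpus []string) {
	present := make(map[string]bool, len(cpus))
	for _, cpu := range cpus {
		present[cpu] = true
	}
	for k := range c.last {
		if !present[k.cpu] {
			c.last = make(map[cpuKey]float64)
			return
		}
	}
}

As brian-brazil notes below, this only helps if there is a scrape while the CPU is offlined.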

@brian-brazil
Contributor

As long as there's a scrape while the CPU is offlined.

If it's only iowait and not idle that's buggy, another approach would be to check for both going down. Plus, they can't increase by more than a second per second anyway, and I'd hope no one is toggling CPUs every scrape interval.
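Sketched as standalone helpers in the same vein as the cache above (hypothetical names, not code from the thread or the exporter), those two checks could look like:

// bothDropped reports whether idle and iowait both went backwards,
// which points to a genuine counter reset rather than the iowait-only
// accounting glitch.
func bothDropped(prevIdle, curIdle, prevIowait, curIowait float64) bool {
	return curIdle < prevIdle && curIowait < prevIowait
}

// plausibleIncrease reports whether a per-CPU counter advanced by no
// more than the wall-clock seconds elapsed between two readings.
func plausibleIncrease(prev, cur, elapsedSeconds float64) bool {
	return cur >= prev && cur-prev <= elapsedSeconds
}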

SuperQ added a commit that referenced this issue May 24, 2020
Cache CPU metrics to avoid counters (ie iowait) jumping backwards.

Fixes: #1686

Signed-off-by: Ben Kochie <superq@gmail.com>
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
Cache CPU metrics to avoid counters (ie iowait) jumping backwards.

Fixes: prometheus#1686

Signed-off-by: Ben Kochie <superq@gmail.com>