
node_cpu_seconds_total values are not monotonically increasing #1686

Closed
venkatbvc opened this issue Apr 24, 2020 · 10 comments · Fixed by #1711
Labels
bug platform/Linux Linux specific issue

Comments

@venkatbvc

Host operating system: output of uname -a

Linux ddebvnf-oame-1 3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.17.0 (branch: HEAD, revision: f6f6194)
build user: root@322511e06ced
build date: 20181130-15:51:33
go version: go1.11.2

node_exporter command line flags

node_exporter --collector.systemd
--collector.systemd.unit-whitelist=^(grafana|prometheus|node_exporter|rabbitmq-server|asprom|gmond|gmetad|mariadb.|ntpd|httpd|jaeger|metrics|gen3gppxml|alertmanager|etcd|alarmagtd|keepalived|zabbix.).service$
--collector.textfile.directory=/opt/node_exporter/metrics

Are you running node_exporter in Docker?

NO

What did you do that produced an error?

Nothing. Node exporter is running and Prometheus is scraping the metrics; the scrape interval is 5s.
When a graph is plotted for node_cpu_seconds_total, we saw a huge spike. The following query was used: rate(node_cpu_seconds_total{cpu="6",instance="osc1deacsdme1-oame-0",job="System",mode="iowait"}[2m])

What did you expect to see?

There should not be any huge spikes, and there should not be a dip in the node_cpu_seconds_total values.

What did you see instead?

There is a huge spike on 9 March at 00:26:30, because there is a dip in the node_cpu_seconds_total values.

The following is the data in Prometheus:
curl -g 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total{cpu="6",instance="osc1deacsdme1-oame-0",job="System",mode="iowait"}[2m]&time=1583693790'
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"name":"node_cpu_seconds_total","cpu":"6","instance":"osc1deacsdme1-oame-0","job":"System","mode":"iowait"},
"values":[[1583693670.227,"62176.51"],[1583693675.227,"62176.77"],[1583693680.227,"62176.98"],[1583693685.227,"62176.99"],[1583693690.227,"62176.99"],[1583693695.227,"62177.03"],
[1583693700.227,"62177.08"],[1583693705.228,"62177.08"],[1583693710.227,"62177.09"],[1583693715.227,"62177.09"],[1583693720.227,"62177.09"],[1583693725.227,"62177.09"],
[1583693730.227,"62177.09"],[1583693735.227,"62177.09"],[1583693740.227,"62177.09"],[1583693745.227,"62177.09"],[1583693750.227,"62177.09"],[1583693755.227,"62177.09"],
[1583693760.227,"62177.09"],[1583693765.227,"62177.09"],[1583693770.227,"62177.09"],[1583693775.227,"62177.24"],[1583693780.227,"62177.2"],[1583693785.227,"62177.2"]]}]}}

I would like to know why there is a dip in the counter value.

@discordianfish
Member

Interesting, but I would assume some kernel issue? We just return this from procfs.

@SuperQ
Member

SuperQ commented Apr 25, 2020

This is a known issue with iowait in the Linux kernel. We noticed this at SoundCloud years ago, but never got anywhere digging into it. Recently, I was looking into it again and we found some interesting info. It seems specifically broken for iowait due to the way the data collection is implemented in the kernel.

What we ended up doing to work around this was to break iowait out into a deriv() rule, separate from the rest of the CPU metrics. I was considering updating the example recording rules file to document this.
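For reference, a recording rule along those lines might look roughly like the following (a sketch only; the group and rule names and the 5m window are illustrative, not the actual SoundCloud rules):

groups:
  - name: cpu_iowait_workaround
    rules:
      - record: instance_cpu:node_cpu_seconds_iowait:deriv5m
        expr: deriv(node_cpu_seconds_total{mode="iowait"}[5m])

deriv() tolerates the occasional backwards step because it fits a linear regression over the window, whereas rate() treats any decrease as a counter reset and produces a huge spike.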

I've thought about bugging kernel people, but I'm not sure there would be any interest in fixing this, especially since it means having a lock, which is something kernel devs are very cautious about.

@venkatbvc
Author

@discordianfish @SuperQ Thanks for your response.
Should we use the deriv() function instead of rate(), so that these spikes are not seen in the graphs?

@discordianfish
Member

Hrm.. I mean.. it's kinda our problem now. We shouldn't expose it as counter if it's not really a counter after all.

Fixing this upstream would be great.. Or we could add a workaround that tracks the max value, prints an error, and returns that max if the current value is lower.

@MaheshGPai

As per the Prometheus docs, it should be used only with gauges:

deriv()
deriv(v range-vector) calculates the per-second derivative of the time series in a range vector v, using simple linear regression.
deriv should only be used with gauges.

Since the metric is currently exposed as a counter, I'm not sure how the Prometheus query engine will process it.
If there is no issue with using deriv() instead of irate()/rate(), then it should be fine.
Otherwise, changing the query to return only results <= 100 should eliminate the spikes seen in Grafana:

sum by (instance)(irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 <=100

@SuperQ
Member

SuperQ commented Apr 27, 2020

@discordianfish Yea, the only thing we can do is keep track of the data coming from /proc/stat and only output data if it goes up. We could log debug if there's a drop in values.

The question is what to do if the list of CPUs changes.
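A minimal sketch of that approach, in Go (the names here are hypothetical; the actual change landed later via #1711): remember the last value seen per CPU and mode, and if a new reading goes backwards, log it and keep exposing the cached maximum.

package collector

import "log"

// cpuKey identifies one counter in /proc/stat: a CPU plus a mode
// such as "idle" or "iowait".
type cpuKey struct {
	cpu  string
	mode string
}

// cpuStatCache remembers the highest value seen so far for each key.
type cpuStatCache struct {
	last map[cpuKey]float64
}

func newCPUStatCache() *cpuStatCache {
	return &cpuStatCache{last: make(map[cpuKey]float64)}
}

// clamp returns the value to expose: the new reading if it did not go
// backwards, otherwise the previously cached (higher) value, with a
// debug-style log noting the drop.
func (c *cpuStatCache) clamp(cpu, mode string, value float64) float64 {
	k := cpuKey{cpu: cpu, mode: mode}
	if prev, ok := c.last[k]; ok && value < prev {
		log.Printf("debug: %s %s dropped from %f to %f, exposing cached value", cpu, mode, prev, value)
		return prev
	}
	c.last[k] = value
	return value
}

The collector would call clamp() on every field parsed out of /proc/stat before emitting the metric.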

SuperQ added the bug label Apr 27, 2020
@SuperQ
Member

SuperQ commented Apr 27, 2020

I've seen this enough times that I think we should do the workaround for the bad kernel data. It's not typically best practice for an exporter to do this kind of stuff, but I think we need to in this case.

SuperQ added the platform/Linux (Linux specific issue) label Apr 27, 2020
@brian-brazil
Contributor

The question is what to do if the list of CPUs changes.

This has been stuck in my head, so I did some research. If I hotplug a CPU offline, the relevant cpu line disappears but the other cpu names in /proc/stat don't change; however, when I online the CPU again, at least the idle and iowait counters get reset:

Before:
cpu1 157846114 580231 38791682 1157995658 2587676 0 151288 0 0 0
After:
cpu1 157847655 580231 38792001 105 0 0 151288 0 0 0

So the problem isn't whether the list of CPUs changes, it's whether there's an actual counter reset.

This was on 4.15.0-66-generic.

@SuperQ
Member

SuperQ commented May 1, 2020

@brian-brazil Thanks. So I guess what we need is to track the list of CPUs; if the list changes, we invalidate the tracking cache.
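Extending the clamp sketch above (same hypothetical cpuStatCache type; an illustration of the idea, not the code from #1711), the invalidation could be as simple as dropping the whole cache whenever a previously cached CPU is missing from /proc/stat:

// invalidateIfChanged clears the cache when a CPU that we have cached
// values for no longer appears in /proc/stat, so that a genuine
// counter reset after hotplug is not clamped away on the next scrape.
func (c *cpuStatCache) invalidateIfChanged(cpus []string) {
	present := make(map[string]bool, len(cpus))
	for _, cpu := range cpus {
		present[cpu] = true
	}
	for k := range c.last {
		if !present[k.cpu] {
			c.last = make(map[cpuKey]float64)
			return
		}
	}
}

As brian-brazil notes below, this only helps if there is a scrape while the CPU is offlined.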

@brian-brazil
Contributor

As long as there's a scrape while the CPU is offlined.

If it's only iowait and not idle that's buggy, another approach would be to check for both going down. Plus, they can't increase by more than a second per second anyway, and I'd hope no one is toggling CPUs every scrape interval.
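Sketched as standalone helpers in the same vein as the cache above (hypothetical names, not code from the thread or the exporter), those two checks could look like:

// bothDropped reports whether idle and iowait both went backwards,
// which points to a genuine counter reset rather than the iowait-only
// accounting glitch.
func bothDropped(prevIdle, curIdle, prevIowait, curIowait float64) bool {
	return curIdle < prevIdle && curIowait < prevIowait
}

// plausibleIncrease reports whether a per-CPU counter advanced by no
// more than the wall-clock seconds elapsed between two readings.
func plausibleIncrease(prev, cur, elapsedSeconds float64) bool {
	return cur >= prev && cur-prev <= elapsedSeconds
}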

SuperQ added a commit that referenced this issue May 24, 2020
Cache CPU metrics to avoid counters (ie iowait) jumping backwards.

Fixes: #1686

Signed-off-by: Ben Kochie <superq@gmail.com>
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
Cache CPU metrics to avoid counters (ie iowait) jumping backwards.

Fixes: prometheus#1686

Signed-off-by: Ben Kochie <superq@gmail.com>