node_cpu_seconds_total values are not monotonically increasing #1686
Interesting, but I would assume this is some kernel issue? We just return this from procfs.
This is a known issue with iowait in the Linux kernel. We noticed this at SoundCloud years ago, but never got anywhere digging into it. Recently, I was looking into it again and we found some interesting info. It seems specifically broken in iowait due to the way the data collection is implemented in the kernel. What we ended up doing to work around this was to break out iowait into a … I've thought about bugging kernel people, but I'm not sure there would be any interest in fixing this, especially since it means adding a lock, which is something kernel devs are very cautious about.
@discordianfish @SuperQ Thanks for your response.
Hrm.. I mean, it's kinda our problem now. We shouldn't expose it as a counter if it's not really a counter after all. Fixing this upstream would be great. Or we could add a workaround that tracks the max value, prints an error, and returns that max if the current value is lower.
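A minimal sketch of that max-tracking workaround (this is illustrative only, with hypothetical names; it is not node_exporter's actual implementation):

```go
package main

import "fmt"

// cpuStatCache remembers the highest value seen per CPU/mode so a buggy
// kernel counter that jumps backwards is reported as flat instead of
// decreasing.
type cpuStatCache struct {
	max map[string]float64 // key: "cpu/mode", e.g. "cpu6/iowait"
}

func newCPUStatCache() *cpuStatCache {
	return &cpuStatCache{max: make(map[string]float64)}
}

// clamp returns the cached maximum if the new reading is lower, keeping
// the exposed metric monotonically non-decreasing.
func (c *cpuStatCache) clamp(key string, v float64) float64 {
	if prev, ok := c.max[key]; ok && v < prev {
		// Kernel reported a smaller value; warn and keep the old max.
		fmt.Printf("warning: %s went backwards: %f < %f\n", key, v, prev)
		return prev
	}
	c.max[key] = v
	return v
}

func main() {
	c := newCPUStatCache()
	fmt.Println(c.clamp("cpu6/iowait", 62177.24)) // 62177.24
	fmt.Println(c.clamp("cpu6/iowait", 62177.2))  // clamped to 62177.24
	fmt.Println(c.clamp("cpu6/iowait", 62178.0))  // 62178
}
```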
As per the Prometheus docs, it should be used only for gauges. Since the metric is currently exposed as a counter, I'm not sure how the Prometheus query engine will process it.
@discordianfish Yea, the only thing we can do is keep track of the data coming from /proc/stat. The question is what to do if the list of CPUs changes.
I've seen this enough times that I think we should do the workaround for the bad kernel data. It's not typically best practice for an exporter to do this kind of stuff, but I think we need to in this case.
This has been stuck in my head, so I did some research. If I hotplug a CPU offline, the relevant cpu line disappears but the other cpu names in /proc/stat don't change; however, when I bring the CPU online again, at least the idle and iowait counters get reset.
So the problem isn't if the list of CPUs changes, it's if there's an actual counter reset. This was on 4.15.0-66-generic.
@brian-brazil Thanks. So I guess what we need is to track the list of CPUs; if the list changes, we invalidate the tracking cache.
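That invalidation check could look something like this (a sketch under the assumption that CPU labels like "cpu0" are unique per scrape; `sameCPUSet` is a hypothetical helper, not node_exporter code):

```go
package main

import "fmt"

// sameCPUSet reports whether two observed CPU label sets are identical.
// If not, the per-CPU tracking cache from previous scrapes should be
// dropped, since the counters may legitimately reset.
func sameCPUSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[string]bool, len(a))
	for _, c := range a {
		seen[c] = true
	}
	for _, c := range b {
		if !seen[c] {
			return false
		}
	}
	return true
}

func main() {
	prev := []string{"cpu0", "cpu1", "cpu2", "cpu3"}
	cur := []string{"cpu0", "cpu1", "cpu3"} // cpu2 was hotplug-offlined
	if !sameCPUSet(prev, cur) {
		fmt.Println("CPU list changed, invalidating tracking cache")
	}
}
```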
As long as there's a scrape while the CPU is offlined. If it's only iowait and not idle that's buggy, another approach would be to check for both going down - plus they can't increase by more than a second per second anyway, and I'd hope no one is toggling CPUs every scrape interval.
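The heuristic described above could be sketched as follows (illustrative only; the function name and the decision to treat a lone iowait dip as the kernel bug are assumptions drawn from this thread):

```go
package main

import "fmt"

// looksLikeRealReset applies the heuristic from the discussion: treat a
// decrease as a genuine counter reset (CPU was offlined and re-onlined)
// only if both idle and iowait went down together. A drop in iowait
// alone is assumed to be the known kernel accounting bug.
func looksLikeRealReset(prevIdle, newIdle, prevIowait, newIowait float64) bool {
	return newIdle < prevIdle && newIowait < prevIowait
}

func main() {
	// iowait dips but idle keeps rising: kernel bug, ignore the dip.
	fmt.Println(looksLikeRealReset(1000, 1005, 62177.24, 62177.2)) // false
	// both counters restart near zero: the CPU really was reset.
	fmt.Println(looksLikeRealReset(1000, 0.3, 62177.24, 0.1)) // true
}
```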
Cache CPU metrics to avoid counters (ie iowait) jumping backwards. Fixes: #1686 Signed-off-by: Ben Kochie <superq@gmail.com>
Host operating system: output of uname -a:
Linux ddebvnf-oame-1 3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version:
node_exporter, version 0.17.0 (branch: HEAD, revision: f6f6194)
build user: root@322511e06ced
build date: 20181130-15:51:33
go version: go1.11.2
node_exporter command line flags
node_exporter --collector.systemd
--collector.systemd.unit-whitelist=^(grafana|prometheus|node_exporter|rabbitmq-server|asprom|gmond|gmetad|mariadb.|ntpd|httpd|jaeger|metrics|gen3gppxml|alertmanager|etcd|alarmagtd|keepalived|zabbix.).service$
--collector.textfile.directory=/opt/node_exporter/metrics
Are you running node_exporter in Docker?
NO
What did you do that produced an error?
Nothing. node_exporter is running and Prometheus is scraping the metrics. The scrape interval is 5s.
When a graph is plotted for node_cpu_seconds_total, we saw a huge spike. The following query was used: rate(node_cpu_seconds_total{cpu="6",instance="osc1deacsdme1-oame-0",job="System",mode="iowait"}[2m])
What did you expect to see?
There should not be any huge spikes.
What did you see instead?
There is a huge spike on the 9th of March at 00:26:30, as there is a dip in the node_cpu_seconds_total values.
The following is the data in Prometheus:
curl -g 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total{cpu="6",instance="osc1deacsdme1-oame-0",job="System",mode="iowait"}[2m]&time=1583693790'
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"node_cpu_seconds_total","cpu":"6","instance":"osc1deacsdme1-oame-0","job":"System","mode":"iowait"},
"values":[[1583693670.227,"62176.51"],[1583693675.227,"62176.77"],[1583693680.227,"62176.98"],[1583693685.227,"62176.99"],[1583693690.227,"62176.99"],[1583693695.227,"62177.03"],
[1583693700.227,"62177.08"],[1583693705.228,"62177.08"],[1583693710.227,"62177.09"],[1583693715.227,"62177.09"],[1583693720.227,"62177.09"],[1583693725.227,"62177.09"],
[1583693730.227,"62177.09"],[1583693735.227,"62177.09"],[1583693740.227,"62177.09"],[1583693745.227,"62177.09"],[1583693750.227,"62177.09"],[1583693755.227,"62177.09"],
[1583693760.227,"62177.09"],[1583693765.227,"62177.09"],[1583693770.227,"62177.09"],[1583693775.227,"62177.24"],[1583693780.227,"62177.2"],[1583693785.227,"62177.2"]]}]}}
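The spike follows from how PromQL's rate() handles counter resets: any decrease is assumed to be a reset to zero, so the tiny 0.04s dip from 62177.24 to 62177.2 is counted as ~62177 seconds of apparent growth. A small sketch of that accumulation logic (a simplified model of rate()'s reset handling, not Prometheus's actual code):

```go
package main

import "fmt"

// increaseWithResetHandling mimics how PromQL's rate()/increase()
// accumulate a counter: any decrease between adjacent samples is
// assumed to be a reset to zero, so the full new value is counted
// as growth.
func increaseWithResetHandling(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			// Assumed reset: the whole new value counts as increase.
			total += samples[i]
		} else {
			total += samples[i] - samples[i-1]
		}
	}
	return total
}

func main() {
	// The tail of the data above: a 0.04s dip in iowait.
	samples := []float64{62177.09, 62177.24, 62177.2, 62177.2}
	// The dip is read as a reset, so roughly 62177 seconds of CPU time
	// appear out of nowhere; over a 2m window that is a rate of about
	// 62177/120 ≈ 518 seconds of iowait per second.
	fmt.Printf("apparent increase: %.2f seconds\n", increaseWithResetHandling(samples))
}
```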
We would like to know why there is a dip in the counter value.