
Incorrect CPU statistics for mode "iowait" #2856

Closed
ArmanArbab opened this issue Jun 16, 2017 · 3 comments


ArmanArbab commented Jun 16, 2017

What did you do?

Attempted to show the percentage of time spent, at a particular instant, in a particular mode by the CPU (or rather, the average over all CPUs).

What did you expect to see?

Values for each mode under 100%, with the values of all modes summing to 100%.
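
As a sanity check on that expectation (based on my reading that node_cpu is a per-mode counter of CPU seconds), summing the rates over all modes for each CPU should come out to roughly 100%:

"(sum by (instance, cpu) (rate(node_cpu[5m])) * 100)"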

What did you see instead? Under which circumstances?
Unreasonable values for the "iowait" mode. Below are a few queries I executed and their associated outputs. Note: whether I specify either one of my two instances, "**:9100" or "***:9100", or neither (and thus average over both instances), the queries and their associated results/graphs are almost identical.

Time spent in "idle mode" looks reasonable:

"(avg by (mode) (rate(node_cpu{mode="idle"}[5m])) * 100)"

1.) [screenshot, 2017-06-16 9:55:52 AM]

Time spent in "iowait" mode does not: the graphed values are almost exclusively above 100%, which should not be possible (see also the note after the screenshot below):

"(avg by (mode) (rate(node_cpu{mode="iowait"}[5m])) * 100)"

2.) [screenshot, 2017-06-16 9:56:28 AM]
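
Since a single CPU cannot accumulate more than one second of iowait per second of wall-clock time, rate() should never exceed 1 for any one mode on any one core. A query like the following (my own sanity check, not one of the screenshots above) shows how far the worst-affected core sits above that bound:

"(max by (mode) (rate(node_cpu{mode="iowait"}[5m])) * 100)"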

The time spent in all modes, except for "iowait", is reasonable/correct:

"(avg by (mode) (rate(node_cpu{}[5m])) * 100)"

3.) [screenshot, 2017-06-16 9:57:01 AM]

I noticed that the "iowait" mode was also the only mode that had any counter resets. Perhaps this is related to the unreasonable values in the second screenshot, though my impression was that the rate() function was supposed to handle counter resets automatically (see the note after screenshot 4 below):

"avg by (mode) (resets (node_cpu{}[5m])) "

4.) [screenshot, 2017-06-16 9:57:47 AM]
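
My understanding of rate() (please correct me if I'm wrong) is that it treats any decrease in a counter as a reset to zero and compensates by adding the pre-decrease value back on. If the iowait counter merely dips backwards occasionally, rather than genuinely restarting from zero, that compensation would add large spurious increases and could explain rates far above 100%. Something like the following should list only the series that actually show such decreases:

"resets(node_cpu{mode="iowait"}[5m]) > 0"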

A closer inspection of the query in the second screenshot reveals that most CPUs have reasonable/ostensibly correct values, while a couple of CPUs are responsible for the large errors:

"(avg by (mode, cpu) (rate(node_cpu{mode="iowait"}[5m])) * 100)"

5.) [screenshot, 2017-06-16 10:05:21 AM]
6.) [screenshot, 2017-06-16 10:05:53 AM]
7.) [screenshot, 2017-06-16 10:06:10 AM]
8.) [screenshot, 2017-06-16 10:06:22 AM]
9.) [screenshot, 2017-06-16 10:16:34 AM]

Note that when changing the mode to "idle" (or any other mode for that matter), no CPUs have unreasonable values:

"(avg by (mode, cpu) (rate(node_cpu{mode="idle"}[5m])) * 100)"

10.) [screenshot, 2017-06-16 10:15:32 AM]
11.) [screenshot, 2017-06-16 10:15:58 AM]
12.) [screenshot, 2017-06-16 10:16:16 AM]

A similar pattern emerges when inspecting the counter resets of the iowait counter for each CPU: almost all CPUs have a value of zero, while a few (not necessarily the same ones that show unreasonably high values in the screenshots above) have non-zero reset counts:

"avg by (mode, cpu) (resets (node_cpu{mode="iowait"}[5m])) "

13.) [screenshot, 2017-06-16 10:10:04 AM]
14.) [screenshot, 2017-06-16 10:10:22 AM]
15.) [screenshot, 2017-06-16 10:18:08 AM]

Again, there are no counter resets for the "idle" mode (or any other mode for that matter, except "iowait"; refer to screenshot 4):

"(avg by (mode, cpu) (resets(node_cpu{mode="idle}[5m]))

16.) [screenshot, 2017-06-16 10:19:21 AM]

A quick side note: changing "rate" to "irate" does not result in reasonable values for the "iowait" mode either (see the note after the screenshot below):

"(avg by (mode) (irate(node_cpu{}[5m])) * 100)"

17.) [screenshot, 2017-06-16 10:29:23 AM]
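
If I understand irate() correctly, it only looks at the last two samples in the range, so a single backwards step between two adjacent scrapes is still treated as a reset and still produces an inflated per-second value. Breaking it out per CPU (a variation I have not screenshotted) should show the same cores being responsible:

"(avg by (mode, cpu) (irate(node_cpu{mode="iowait"}[5m])) * 100)"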

Environment

A Prometheus server running on a CentOS 7 VM, collecting statistics from two instances, each a server running Node Exporter on CentOS 7.

  • System information:

    Linux 3.10.0-514.10.2.el7.x86_64 x86_64

  • Prometheus version:

    prometheus, version 1.7.0 (branch: master, revision: bfa37c8)
    build user: root@7a6329cc02bb
    build date: 20170607-09:43:48
    go version: go1.8.3

  • Prometheus configuration file:

See attached file (GitHub doesn't support .yml, so I attached a .txt instead):

[prometheus-config.txt](https://github.com/prometheus/prometheus/files/1081534/prometheus-config.txt)


# my global config
global:
  scrape_interval:     2s # Set the scrape interval to every 2 seconds. Default is every 1 minute.
  evaluation_interval: 2s # Evaluate rules every 2 seconds. The default is every 1 minute.
  scrape_timeout: 2s
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'falconMonitoring'

    static_configs:
      - targets: ['**:9100', '***:9100']
        labels:
          group: 'falcon'
                         


juliusv commented Jul 8, 2017

@ArmanArbab Are you still getting this issue? The first thing I would suspect here is a broken data source, like /proc/stat reporting broken iowait values. The reason is that neither Prometheus nor the node exporter treats certain CPUs or modes specially in any way, nor should there be counter resets in individual modes. There should only be resets when a host reboots, but then all modes and cores should be affected at the same time.

Have you looked at the raw data exported by Node Exporter over time, via its metrics endpoints, to see whether the iowait counters on the affected cores go down or otherwise behave strangely?

It would also be interesting to just see node_cpu{mode="iowait"}[5m] in the tabular view for the cores that exhibit this problem.
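
For example, something along these lines, substituting one of the affected cores (the cpu value below is just a placeholder):

node_cpu{mode="iowait", cpu="cpu0", instance="**:9100"}[5m]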

brian-brazil commented

Closing as not a problem with Prometheus itself.


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019