
Incorrect CPU statistics for mode "iowait" #2856

Closed
ArmanArbab opened this issue Jun 16, 2017 · 3 comments


ArmanArbab commented Jun 16, 2017

What did you do?

Attempted to show the percentage of time spent, at a particular instant, in a particular mode by the CPU (or rather, the average over all CPUs).

What did you expect to see?

Values for each mode under 100%, with the values of all modes summing to 100%.
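
As a sanity check on that expectation (based on my reading that node_cpu is a per-mode counter of CPU seconds), summing the rates over all modes for each CPU should come out to roughly 100%:

"(sum by (instance, cpu) (rate(node_cpu[5m])) * 100)"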

What did you see instead? Under which circumstances?
Unreasonable values for the "iowait" mode. Below are a few queries I executed and their associated outputs. Note: whether I specify either one of my two instances, "**:9100" or "***:9100", or neither (and thus average over both instances), the queries and their associated results/graphs are almost identical.

Time spent in "idle mode" looks reasonable:

"(avg by (mode) (rate(node_cpu{mode="idle"}[5m])) * 100)"

1.) [screenshot, 2017-06-16 9:55:52 AM]

Time spent in "iowait" mode does not: the graphed values are almost exclusively above 100%, which should not be possible (see also the note after the screenshot below):

"(avg by (mode) (rate(node_cpu{mode="iowait"}[5m])) * 100)"

2.) [screenshot, 2017-06-16 9:56:28 AM]
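
Since a single CPU cannot accumulate more than one second of iowait per second of wall-clock time, rate() should never exceed 1 for any one mode on any one core. A query like the following (my own sanity check, not one of the screenshots above) shows how far the worst-affected core sits above that bound:

"(max by (mode) (rate(node_cpu{mode="iowait"}[5m])) * 100)"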

The time spent in all modes, except for "iowait", is reasonable/correct:

"(avg by (mode) (rate(node_cpu{}[5m])) * 100)"

3.) [screenshot, 2017-06-16 9:57:01 AM]

I noticed that the "iowait" mode was also the only mode that had any counter resets. Perhaps this is related to the unreasonable values in the second screenshot, though my impression was that the rate() function was supposed to handle counter resets automatically (see the note after screenshot 4 below):

"avg by (mode) (resets (node_cpu{}[5m])) "

4.) [screenshot, 2017-06-16 9:57:47 AM]
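
My understanding of rate() (please correct me if I'm wrong) is that it treats any decrease in a counter as a reset to zero and compensates by adding the pre-decrease value back on. If the iowait counter merely dips backwards occasionally, rather than genuinely restarting from zero, that compensation would add large spurious increases and could explain rates far above 100%. Something like the following should list only the series that actually show such decreases:

"resets(node_cpu{mode="iowait"}[5m]) > 0"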

A closer inspection of the query in the second screenshot reveals that most CPUs have reasonable/ostensibly correct values, while a couple of CPUs are responsible for the large errors:

"(avg by (mode, cpu) (rate(node_cpu{mode="iowait"}[5m])) * 100)"

5.) [screenshot, 2017-06-16 10:05:21 AM]
6.) [screenshot, 2017-06-16 10:05:53 AM]
7.) [screenshot, 2017-06-16 10:06:10 AM]
8.) [screenshot, 2017-06-16 10:06:22 AM]
9.) [screenshot, 2017-06-16 10:16:34 AM]

Note that when changing the mode to "idle" (or any other mode for that matter), no CPUs have unreasonable values:

"(avg by (mode, cpu) (rate(node_cpu{mode="idle"}[5m])) * 100)"

10.) [screenshot, 2017-06-16 10:15:32 AM]
11.) [screenshot, 2017-06-16 10:15:58 AM]
12.) [screenshot, 2017-06-16 10:16:16 AM]

A similar pattern emerges when inspecting the counter resets of the iowait counter for each CPU: almost all CPUs have a value of zero, while a few (not necessarily the same ones that show unreasonably high values in the screenshots above) have non-zero reset counts:

"avg by (mode, cpu) (resets (node_cpu{mode="iowait"}[5m])) "

13.) [screenshot, 2017-06-16 10:10:04 AM]
14.) [screenshot, 2017-06-16 10:10:22 AM]
15.) [screenshot, 2017-06-16 10:18:08 AM]

Again, there are no counter resets for the "idle" mode (or any other mode for that matter, except "iowait"; refer to screenshot 4):

"(avg by (mode, cpu) (resets(node_cpu{mode="idle}[5m]))

16.) [screenshot, 2017-06-16 10:19:21 AM]

A quick side note: changing "rate" to "irate" does not result in reasonable values for the "iowait" mode either (see the note after the screenshot below):

"(avg by (mode) (irate(node_cpu{}[5m])) * 100)"

17.) [screenshot, 2017-06-16 10:29:23 AM]
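
If I understand irate() correctly, it only looks at the last two samples in the range, so a single backwards step between two adjacent scrapes is still treated as a reset and still produces an inflated per-second value. Breaking it out per CPU (a variation I have not screenshotted) should show the same cores being responsible:

"(avg by (mode, cpu) (irate(node_cpu{mode="iowait"}[5m])) * 100)"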

Environment

A Prometheus server running on a CentOS 7 VM, collecting statistics from two instances, each a server running Node Exporter on CentOS 7.

  • System information:

    Linux 3.10.0-514.10.2.el7.x86_64 x86_64

  • Prometheus version:

    prometheus, version 1.7.0 (branch: master, revision: bfa37c8)
    build user: root@7a6329cc02bb
    build date: 20170607-09:43:48
    go version: go1.8.3

  • Prometheus configuration file:

See attached file (GitHub doesn't support .yml, so I attached a .txt instead):

[prometheus-config.txt](https://github.com/prometheus/prometheus/files/1081534/prometheus-config.txt)


# my global config
global:
  scrape_interval:     2s # Set the scrape interval to every 2 seconds. Default is every 1 minute.
  evaluation_interval: 2s # Evaluate rules every 2 seconds. The default is every 1 minute.
  scrape_timeout: 2s
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'falconMonitoring'

    static_configs:
      - targets: ['**:9100', '***:9100']
        labels:
          group: 'falcon'
                         


juliusv commented Jul 8, 2017

@ArmanArbab Are you still getting this issue? The first thing I would suspect here is a broken data source, like /proc/stat reporting broken iowait values. The reason is that neither Prometheus nor the node exporter treats certain CPUs or modes specially in any way, nor should there be counter resets in individual modes. There should only be resets when a host reboots, but then all modes and cores should be affected at the same time.

Have you looked at the raw data exported by Node Exporter over time, via its metrics endpoints, to see whether the iowait counters on the affected cores go down or otherwise behave strangely?

It would also be interesting to just see node_cpu{mode="iowait"}[5m] in the tabular view for the cores that exhibit this problem.
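
For example, something along these lines, substituting one of the affected cores (the cpu value below is just a placeholder):

node_cpu{mode="iowait", cpu="cpu0", instance="**:9100"}[5m]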

brian-brazil commented

Closing as not a problem with Prometheus itself.


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019