
Negative Numbers on wmi_cpu_time_total #259

Closed
DevOps-Dad opened this issue Sep 16, 2018 · 25 comments

Comments

@DevOps-Dad

I have a Windows 2012R2 server that is running the v0.4.3 wmi_exporter.

When I use the query below, I pretty regularly get negative numbers for CPU usage. This is a very underutilized server running on VMware with 4 cores assigned to it.

100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m])) * 100)

image

I appreciate any guidance.

Thanks,
Joe

@carlpett
Collaborator

Hey @DevOps-Dad,
Thanks for reporting! This seems odd. Could you try making a stacked diagram of all modes?
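
For reference, a minimal sketch of one way to build such a stacked per-mode graph in Grafana, assuming the same $server variable as the original query:

avg by (mode) (irate(wmi_cpu_time_total{instance="$server"}[5m]))

Stacked, the per-mode series should sum to roughly 1, since each core accounts for about one second of CPU time per second.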

@avonwyss

After each negative entry there is a spike... how long is the time between such spikes? About 16-17 minutes, by any chance?

@DevOps-Dad
Author

@avonwyss, yes, the graph is just as you said. See below. I will post a stacked graph in a couple of minutes.
image

@DevOps-Dad
Author

@carlpett Below is the stacked graph. No negative numbers here. Very weird.
image

@carlpett
Collaborator

Thanks, although I'm not much wiser... Some other things to check:

  1. It could potentially be related to clock drift in your VM - the CPU accounting is based on consumption per second, and if your clock is being adjusted some of the consumption might be accounted for in the "wrong" time. Could you check if irate(wmi_system_time[5m]) has irregularities around when you have those negative usages?
  2. What is your scrape interval? Do you have any failed scrapes around these times? (See the sketch after this list.)
  3. As @avonwyss mentioned, are the negative spikes periodic and repeat every 16-17 minutes? There is a known Windows issue where WMI does an expensive refresh operation every 1000 seconds. I'm not sure if this could cause this behaviour, but would be interesting to see.
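
Regarding point 2, a minimal sketch for spotting failed scrapes around the negative samples, assuming the scrape job is named wmi_exporter:

# 1 whenever a scrape of the exporter fails, 0 otherwise
up{job="wmi_exporter", instance="$server"} == bool 0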

@DevOps-Dad
Author

Thank you for all the assistance.

  1. I don't seem to have a stat getting scraped called wmi_system_time. The only similar thing I have is uptime, but that had nothing interesting. Time adjustments in the log were 3 or fewer a day.

  2. My scrape interval is every minute.

  3. Yes, I am seeing the spikes periodically, roughly every 16-17 minutes. See the graph below. Assuming this is the issue, is there a way to work around the MS issue?
    image

@carlpett
Collaborator

  1. Oops, wmi_os_time is the name. Sorry.
  2. Ok, good, then we can probably strike that bit (having too few scrapes within the interval can lead to strange effects)
  3. Sadly, we don't have a workaround for it. There are additional details in issue #89 (CPU collector blocks every ~17 minutes on call to wmi.Query). I'm somewhat puzzled that this could lead to the effect you are seeing, though... I guess if the data you get is so delayed that when Prometheus sets the timestamp, it is skewed so much from the reporting time that the interpolation goes wrong, or something like that? What do you get from scrape_duration_seconds{job="wmi_exporter"} at these times?
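
A minimal sketch of one way to line that up with the negative readings, assuming the same job and $server labels used elsewhere in this thread (two queries on one graph):

scrape_duration_seconds{job="wmi_exporter", instance="$server"}
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m])) * 100)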

@DevOps-Dad
Author

Okay, this is kind of interesting. There is a time change at each negative value. I put wmi_os_time and wmi_cpu_time_total on the same graph, multiplying the OS time by 10 so I could see both. As you can see below, the negative numbers occur exactly when there is a time change. I also included a graph of just wmi_os_time. I am not sure if this is related to the WMI refresh issue or a clock issue, though the clock is correcting itself every 16-17 minutes, so I wonder if the two are related.

image

image
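
A minimal sketch of one way to make the clock adjustments visible directly, assuming wmi_os_time is a gauge carrying the current system time as a unix timestamp (as this thread suggests):

# wall-clock seconds advanced per second; should hover around 1 and
# deviate whenever the system clock is stepped
deriv(wmi_os_time{instance="$server"}[5m])

deriv() is used rather than irate() here because irate() treats any decrease as a counter reset, so a backwards clock step would show up as a huge spike instead of a dip.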

@mdunc

mdunc commented Sep 28, 2018

I'm running into the exact same thing. The Windows server is an instance hosted in AWS (Windows 2016) and I'm scraping the data every 30 seconds.
image

@Fredrik-Petersson

I have the exact same issues as @mdunc in a newly launched instance of Windows Server 2016 in AWS. I have older instances displaying the same issues, but not to the same extent. The older instances see an increase in scrape time of about 4x (from about 400 milliseconds to 1.6 seconds), while the scrape of the new instance times out completely (>10 seconds).

I noticed in the documentation for setting up NTP and time on Windows Servers that AWS have changed the default behaviour of NTP in AMI instances launched after August 2018:

NTP Server – starting with the August 2018 release, AMIs will now use the AWS time service by default. This time service is accessible from any EC2 region at the 169.254.169.123 endpoint. Additionally, the 0x9 flag indicates that the time service is acting as a client, and to use “SpecialPollInterval” to determine how frequently to check in with the configured time server.

This is interesting:

Special Poll Interval – checks against the configured NTP Server every 900 seconds, or 15 minutes.

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/windows-set-time.html

I am not sure if this is related at all, but the time window of the AWS change and the timing of the "Special Poll Interval" is interesting.

My next step is to launch a fresh empty instance and see if the behaviour is the same.

@carlpett
Collaborator

carlpett commented Oct 2, 2018

Very interesting find @Fredrik-Petersson! If you are able to do some testing (disabling NTP, using another NTP server, etc.), it would be super useful.

@Fredrik-Petersson

Got the fresh AWS-hosted Windows server up and running. It has the same issues as the others. Currently testing different cases: disabled Amazon SSM Agent, disabled Windows Time, other NTP servers, etc.

@DevOps-Dad
Author

I am very interested in what you find. Just a note: the issues I described above are on a Windows 2012R2 server running in a VMware environment.

@Fredrik-Petersson

I have done some testing, but so far have no clue what is causing the time to be adjusted every 16 minutes. I have tried disabling the following services:
Amazon SSM Agent
AWS Lite Guest Agent
Windows Time

Even with these stopped, something is still changing the system time, either forwards or backwards. Here is a graph from the newly created server (the "missed" spike is likely due to the scrape timing being exactly right to not see it):
image

I am scraping every 15 seconds and the servers are on the same AWS network. I will do some digging with Process Explorer tomorrow and see if something is happening at the 16-minute mark.

@Fredrik-Petersson

I have messed around a bit more and from what I can see AWS is likely not causing the issues. I have also tested against servers running locally on VMware and they show the same behaviour.

Back to the drawing board.

I made some more visualisations in Grafana and I noticed that at the same time as the scrape fails, there is an increase in system exception dispatches. Is there a general way to see the system exception dispatches for all processes on Windows, or do you need to look at each individual process?

Here is the graph (showing wmi_system_exceptions, scrape_duration and wmi_os_time) from my clean server:
image
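
If the wmi_system_exceptions series in the graph above comes from the exporter's system collector, it is already a system-wide counter (all processes combined), so a per-second rate of it is the general view; a minimal sketch, with the exact metric name assumed (check the exporter's /metrics output):

# system-wide exception dispatches per second (metric name assumed)
irate(wmi_system_exception_dispatches_total{instance="$server"}[5m])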

@Fredrik-Petersson

I have done some more testing in order to find a workaround for the issues, as most of my monitoring is based on wmi_exporter.

The workaround I have found, which at least makes the issues less prominent, is to start wmi_exporter with --collectors.enabled="cpu,cs,logical_disk,net,os,system", excluding the "services" collector.

This change brings the average scrape duration down from about 1,800 milliseconds to about 200 milliseconds on some of the most loaded systems. The peaks, which earlier caused timeouts with scrape times of more than 10 seconds, now stay below 8 seconds. Still a bit high, but at least not breaking for now.

As a bonus, the number of scraped metrics drops from around 5,500 samples to around 150. As I was not using any service/process monitoring with wmi_exporter, this is no loss for my current setup.
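
A minimal sketch of how to confirm the improvement from the Prometheus side, assuming the scrape job is named wmi_exporter (both metrics are recorded by Prometheus for every target):

# scrape latency and number of samples returned per scrape
scrape_duration_seconds{job="wmi_exporter"}
scrape_samples_scraped{job="wmi_exporter"}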

Here are some graphs from the server with the most issues (the irregular peaks are due to me starting and testing different wmi_exporter configs):
image

Here are some graphs from a server with less load:
image

@leoluk

leoluk commented Oct 7, 2018

Like carlpett said, this is very likely to be the issue discussed in #89 (WMI hanging every 1000 seconds). WMI queries are issued one by one and aren't transactional, so there's no consistent view. This usually doesn't matter, but with the hangs, you're likely to get inconsistent results.

You could check whether you see the same behavior with https://github.com/leoluk/perflib_exporter

@robgryn

robgryn commented Dec 5, 2018

Hi, we encountered the same problem with negative cpu metrics and found this issue.

I gather from this response by @higels in #89:

I was told by an ex-MS employee that this would be unlikely to be fruitful - I should have mentioned that here.

That the issue is unlikely to be fixed in the foreseeable future. Is my understanding correct? If so, we will probably try https://github.com/leoluk/perflib_exporter and let you know whether this issue is also present there.

@Gaibhne

Gaibhne commented Jan 7, 2019

Just to rule out one theory in this issue: I am seeing this exact issue, and I have never used AWS or installed anything related to it on this machine (or any other, since my employer has blacklisted Amazon due to data protection issues). So that is definitely not the culprit.

@JMLX42

JMLX42 commented Feb 27, 2019

Same as @Gaibhne: I can reproduce the issue on a dedicated OVH machine.

@free
Member

free commented May 8, 2019

As far as I can tell, this is a non-issue.

Prometheus does not (nor could it) guarantee that scraping happens precisely at the configured scrape interval. So e.g. if you configure a 10s scrape interval, you may have one scrape interval of 11 s followed by a 9.3 s scrape interval etc. You will even have scrapes that are missed altogether (i.e. an interval that is a multiple of 10 s).

This in and of itself is not an issue for irate() as irate() only looks at the last couple of samples and divides the value difference between them by their timestamp difference. But those timestamps are not exact. I'm not sure how Prometheus assigns a timestamp to a sample (i.e. whether it's the time when the scrape request was started; when it finished; or anything in-between), but you are likely aware of the fact that network requests are not instant. So even if you have 2 Prometheus samples that are exactly 10 seconds apart, the values may have been collected anywhere between 9-11 seconds apart if the /metrics requests took more than 1 second to complete.

E.g. let's say the first CPU sample (let's call it v1) was collected at t0 and Prometheus received the response at t0 + 1s (due to network latency), assigning it a timestamp of t0 (when the request was issued); the second request was issued at t0 + 10s, but it took 1 second to reach the WMI exporter, so the CPU sample (v2) was actually collected at t0 + 11s and "instantly" returned to Prometheus, which assigned it a timestamp of t0 + 10s. Now if the CPU was idle for all of this time, irate(wmi_cpu_time_total{mode="idle"}[5m]) would be (v2 - v1) / (t2 - t1) = 11s / 10s = 1.1. So your CPU was "110% idle over the past 10 seconds". If you then compute 100 - 110 you get -10, which you could interpret as "my CPU was -10% in use over the past 10 seconds", but it really means "my CPU was -10% +/- 1000% in use over the past 10 seconds", with "+/- 1000%" being a generous estimate of just how precise you can expect sampled metrics to be.

If you want a better estimate of how idle or busy your CPU was over the past 0-100+ seconds, you can instead write your query as:

1
  -
  sum by (instance) (irate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m]))
    /
  sum by (instance) (irate(wmi_cpu_time_total{instance="$server"}[5m]))

I.e. divide the idle CPU time by total CPU time, so it always adds up to 100% regardless of how accurate the rate estimate or sample timestamps are.

@free
Member

free commented May 8, 2019

Oh, and also note that even though irate() is the recommended way of graphing spiky data, it can also be hugely misleading at low resolutions. E.g. if you graph a whole day's worth of CPU usage based on samples collected every 10 seconds, irate() will give you 10-second samples that are ~40 seconds apart in Grafana (86400 seconds / ~2k pixels of graph resolution).

So depending on the actual range and/or the minimum resolution configured for your Grafana graph, when faced with regular spikes in CPU usage (as with Prometheus rule evaluation) you may end up seeing only the spikes in CPU usage, or only the low CPU usage periods in between spikes. E.g. I run a Prometheus instance that evaluates rules every 10 seconds. If I scrape that Prometheus instance every 5s and look at irate() with a resolution that's a multiple of 10 seconds, I will only ever see either the spikes (when rules are evaluated) or the troughs, whereas the actual CPU utilization is the average of the two.
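
For long-range dashboard panels, a rate()-based variant of the ratio query above tends to be less misleading than irate(), since rate() averages over the whole window rather than just the last two samples; a minimal sketch:

1
  -
  sum by (instance) (rate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m]))
    /
  sum by (instance) (rate(wmi_cpu_time_total{instance="$server"}[5m]))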

@dhaval512

Hi everyone,
Does anyone know how to fetch Expensive Queries by IO (Top 25) and Expensive Queries by CPU (Top 25) in Grafana and Prometheus?

@for-mat

for-mat commented Apr 21, 2020

Hi everyone,
Does anyone know how to fetch Expensive Queries by IO (Top 25) and Expensive Queries by CPU (Top 25) in Grafana and Prometheus?

topk(25, xxxxxx)
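
If "Expensive Queries" refers to database-level statistics, those come from the database rather than from wmi_exporter and need their own exporter; as a concrete illustration of the topk() pattern with a metric from this thread, a minimal sketch:

# the 25 instances with the highest non-idle CPU rate
topk(25, sum by (instance) (irate(wmi_cpu_time_total{mode!="idle"}[5m])))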


This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions bot added the Stale label Nov 26, 2023
github-actions bot closed this as not planned Dec 28, 2023