
Negative Numbers on wmi_cpu_time_total #259

Closed
DevOps-Dad opened this issue Sep 16, 2018 · 25 comments

Comments

@DevOps-Dad

I have a Windows 2012R2 server that is running the v0.4.3 wmi_exporter.

When I use the query below, I pretty regularly get negative numbers for CPU usage. This is a very underutilized server running on VMware with 4 cores assigned to it.

100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m])) * 100)

image

I appreciate any guidance.

Thanks,
Joe

@carlpett
Collaborator

Hey @DevOps-Dad,
Thanks for reporting! This seems odd. Could you try making a stacked diagram of all modes?
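
For reference, a minimal sketch of one way to build such a stacked per-mode graph in Grafana, assuming the same $server variable as the original query:

avg by (mode) (irate(wmi_cpu_time_total{instance="$server"}[5m]))

Stacked, the per-mode series should sum to roughly 1, since each core accounts for about one second of CPU time per second.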

@avonwyss

After each negative entry there is a spike... how long is the time between such spikes? About 16-17 minutes, by any chance?

@DevOps-Dad
Author

@avonwyss, yes, the graph is just as you said. See below. I will post a stacked graph in a couple of minutes.
image

@DevOps-Dad
Author

@carlpett Below is the stacked graph. No negative numbers here. Very weird.
image

@carlpett
Collaborator

Thanks, although I'm not much wiser... Some other things to check:

  1. It could potentially be related to clock drift in your VM - the CPU accounting is based on consumption per second, and if your clock is being adjusted some of the consumption might be accounted for in the "wrong" time. Could you check if irate(wmi_system_time[5m]) has irregularities around when you have those negative usages?
  2. What is your scrape interval? Do you have any failed scrapes around these times? (See the sketch after this list.)
  3. As @avonwyss mentioned, are the negative spikes periodic and repeat every 16-17 minutes? There is a known Windows issue where WMI does an expensive refresh operation every 1000 seconds. I'm not sure if this could cause this behaviour, but would be interesting to see.
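
Regarding point 2, a minimal sketch for spotting failed scrapes around the negative samples, assuming the scrape job is named wmi_exporter:

# 1 whenever a scrape of the exporter fails, 0 otherwise
up{job="wmi_exporter", instance="$server"} == bool 0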

@DevOps-Dad
Author

Thank you for all the assistance.

  1. I don't seem to have a stat getting scraped called wmi_system_time. The only similar thing I have is uptime, but that had nothing interesting. Time adjustments in the log were 3 or fewer a day.

  2. My scrape interval is every minute.

  3. Yes, I am seeing the spikes periodically, roughly every 16-17 minutes. See the graph below. Assuming this is the issue, is there a way to work around the MS issue?
    image

@carlpett
Collaborator

  1. Oops, wmi_os_time is the name. Sorry.
  2. Ok, good, then we can probably strike that bit (having too few scrapes within the interval can lead to strange effects)
  3. Sadly, we don't have a workaround for it. There are additional details in issue #89 (CPU collector blocks every ~17 minutes on call to wmi.Query). I'm somewhat puzzled that this could lead to the effect you are seeing, though... I guess if the data you get is so delayed that when Prometheus sets the timestamp, it is skewed so much from the reporting time that the interpolation goes wrong, or something like that? What do you get from scrape_duration_seconds{job="wmi_exporter"} at these times?
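
A minimal sketch of one way to line that up with the negative readings, assuming the same job and $server labels used elsewhere in this thread (two queries on one graph):

scrape_duration_seconds{job="wmi_exporter", instance="$server"}
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m])) * 100)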

@DevOps-Dad
Author

Okay, this is kind of interesting. There is a time change at each negative value. I put wmi_os_time and wmi_cpu_time_total on the same graph, multiplying the OS time by 10 so I could see both. As you can see below, the negative numbers occur exactly when there is a time change. I also included a graph of just wmi_os_time. I am not sure if this is related to the WMI refresh issue or a clock issue, though the clock is correcting itself every 16-17 minutes, so I wonder if the two are related.

image

image
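
A minimal sketch of one way to make the clock adjustments visible directly, assuming wmi_os_time is a gauge carrying the current system time as a unix timestamp (as this thread suggests):

# wall-clock seconds advanced per second; should hover around 1 and
# deviate whenever the system clock is stepped
deriv(wmi_os_time{instance="$server"}[5m])

deriv() is used rather than irate() here because irate() treats any decrease as a counter reset, so a backwards clock step would show up as a huge spike instead of a dip.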

@mdunc

mdunc commented Sep 28, 2018

I'm running into the exact same thing. The Windows server is an instance hosted in AWS (Windows 2016) and I'm scraping the data every 30 seconds.
image

@Fredrik-Petersson

I have the exact same issues as @mdunc in a newly launched instance of Windows Server 2016 in AWS. I have older instances displaying the same issues, but not to the same extent. The older instances see an increase in scrape time of about 4x (from about 400 milliseconds to 1.6 seconds), while the scrape of the new instance times out completely (>10 seconds).

I noticed in the documentation for setting up NTP and time on Windows Servers that AWS have changed the default behaviour of NTP in AMI instances launched after August 2018:

NTP Server – starting with the August 2018 release, AMIs will now use the AWS time service by default. This time service is accessible from any EC2 region at the 169.254.169.123 endpoint. Additionally, the 0x9 flag indicates that the time service is acting as a client, and to use “SpecialPollInterval” to determine how frequently to check in with the configured time server.

This is interesting:

Special Poll Interval – checks against the configured NTP Server every 900 seconds, or 15 minutes.

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/windows-set-time.html

I am not sure if this is related at all, but the time window of the AWS change and the timing of the "Special Poll Interval" is interesting.

My next step is to launch a fresh empty instance and see if the behaviour is the same.

@carlpett
Collaborator

carlpett commented Oct 2, 2018

Very interesting find @Fredrik-Petersson! If you are able to do some testing (disabling NTP, using another NTP server, etc.), it would be super useful.

@Fredrik-Petersson

Got the fresh AWS-hosted Windows server up and running. It has the same issues as the others. Currently testing different cases: disabled Amazon SSM Agent, disabled Windows Time, other NTP servers, etc.

@DevOps-Dad
Author

I am very interested in what you find. Just a note: the issues I described above are on a Windows 2012R2 server running in a VMware environment.

@Fredrik-Petersson

I have done some testing, but so far have no clue what is causing the time to be adjusted every 16 minutes. I have tried disabling the following services:
Amazon SSM Agent
AWS Lite Guest Agent
Windows Time

Even with these stopped, something is still changing the system time, either forwards or backwards. Here is a graph from the newly created server (the "missed" spike is likely due to the scrape timing being exactly right to not see it):
image

I am scraping every 15 seconds and the servers are on the same AWS network. I will do some digging with Process Explorer tomorrow and see if something is happening at the 16-minute mark.

@Fredrik-Petersson

I have messed around a bit more and from what I can see AWS is likely not causing the issues. I have also tested against servers running locally on VMware and they show the same behaviour.

Back to the drawing board.

I made some more visualisations in Grafana and I noticed that at the same time as the scrape fails, there is an increase in system exception dispatches. Is there a general way to see the system exception dispatches for all processes on Windows, or do you need to look at each individual process?

Here is the graph (showing wmi_system_exceptions, scrape_duration and wmi_os_time) from my clean server:
image
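
If the wmi_system_exceptions series in the graph above comes from the exporter's system collector, it is already a system-wide counter (all processes combined), so a per-second rate of it is the general view; a minimal sketch, with the exact metric name assumed (check the exporter's /metrics output):

# system-wide exception dispatches per second (metric name assumed)
irate(wmi_system_exception_dispatches_total{instance="$server"}[5m])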

@Fredrik-Petersson

I have done some more testing in order to find a workaround for the issues, as most of my monitoring is based on wmi_exporter.

The workaround I have found, which at least makes the issues less prominent, is to start wmi_exporter with --collectors.enabled="cpu,cs,logical_disk,net,os,system", excluding the "services" collector.

This change brings the average scrape duration down from about 1,800 milliseconds to about 200 milliseconds on some of the most loaded systems. The peaks, which earlier caused timeouts with scrape times of more than 10 seconds, now stay below 8 seconds. Still a bit high, but at least not breaking for now.

As a bonus, the number of scraped metrics drops from around 5,500 samples to around 150. As I was not using any service/process monitoring with wmi_exporter, this is no loss for my current setup.
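
A minimal sketch of how to confirm the improvement from the Prometheus side, assuming the scrape job is named wmi_exporter (both metrics are recorded by Prometheus for every target):

# scrape latency and number of samples returned per scrape
scrape_duration_seconds{job="wmi_exporter"}
scrape_samples_scraped{job="wmi_exporter"}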

Here are some graphs from the server with the most issues (the irregular peaks are due to me starting and testing different wmi_exporter configs):
image

Here are some graphs from a server with less load:
image

@leoluk

leoluk commented Oct 7, 2018

Like carlpett said, this is very likely to be the issue discussed in #89 (WMI hanging every 1000 seconds). WMI queries are issued one by one and aren't transactional, so there's no consistent view. This usually doesn't matter, but with the hangs, you're likely to get inconsistent results.

You could check whether you see the same behavior with https://github.com/leoluk/perflib_exporter

@robgryn

robgryn commented Dec 5, 2018

Hi, we encountered the same problem with negative cpu metrics and found this issue.

I gather from this response by @higels in #89:

I was told by an ex-MS employee that this would be unlikely to be fruitful - I should have mentioned that here.

That the issue is unlikely to be fixed in the foreseeable future. Is my understanding correct? If so, we will probably try https://github.com/leoluk/perflib_exporter and let you know whether this issue is also present there.

@Gaibhne

Gaibhne commented Jan 7, 2019

Just to rule out one theory in this issue: I am seeing this exact issue, and I have never used AWS or installed anything related to it on this machine (or any other, since my employer has blacklisted Amazon due to data protection issues). So that is definitely not the culprit.

@JMLX42

JMLX42 commented Feb 27, 2019

Same as @Gaibhne: I can reproduce the issue on a dedicated OVH machine.

@free
Member

free commented May 8, 2019

As far as I can tell, this is a non-issue.

Prometheus does not (nor could it) guarantee that scraping happens precisely at the configured scrape interval. So e.g. if you configure a 10s scrape interval, you may have one scrape interval of 11 s followed by a 9.3 s scrape interval etc. You will even have scrapes that are missed altogether (i.e. an interval that is a multiple of 10 s).

This in and of itself is not an issue for irate() as irate() only looks at the last couple of samples and divides the value difference between them by their timestamp difference. But those timestamps are not exact. I'm not sure how Prometheus assigns a timestamp to a sample (i.e. whether it's the time when the scrape request was started; when it finished; or anything in-between), but you are likely aware of the fact that network requests are not instant. So even if you have 2 Prometheus samples that are exactly 10 seconds apart, the values may have been collected anywhere between 9-11 seconds apart if the /metrics requests took more than 1 second to complete.

E.g. let's say the first CPU sample (let's call it v1) was collected at t0 and Prometheus received the response at t0 + 1s (due to network latency), assigning it a timestamp of t0 (when the request was issued); the second request was issued at t0 + 10s, but it took 1 second to reach the WMI exporter, so the CPU sample (v2) was actually collected at t0 + 11s and "instantly" returned to Prometheus, which assigned it a timestamp of t0 + 10s. Now if the CPU was idle for all of this time, irate(wmi_cpu_time_total{mode="idle"}[5m]) would be (v2 - v1) / (t2 - t1) = 11s / 10s = 1.1. So your CPU was "110% idle over the past 10 seconds". If you then compute 100 - 110 you get -10, which you could interpret as "my CPU was -10% in use over the past 10 seconds", but it really means "my CPU was -10% +/- 1000% in use over the past 10 seconds", with "+/- 1000%" being a generous estimate of just how precise you can expect sampled metrics to be.

If you want a better estimate of how idle or busy your CPU was over the past 0-100+ seconds, you can instead write your query as:

1
  -
  sum by (instance) (irate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m]))
    /
  sum by (instance) (irate(wmi_cpu_time_total{instance="$server"}[5m]))

I.e. divide the idle CPU time by total CPU time, so it always adds up to 100% regardless of how accurate the rate estimate or sample timestamps are.

@free
Member

free commented May 8, 2019

Oh, and also note that even though irate() is the recommended way of graphing spiky data, it can also be hugely misleading at low resolutions. E.g. if you graph a whole day's worth of CPU usage based on samples collected every 10 seconds, irate() will give you 10-second samples that are ~40 seconds apart in Grafana (86400 seconds / ~2k pixels of graph resolution).

So depending on the actual range and/or the minimum resolution configured for your Grafana graph, when faced with regular spikes in CPU usage (as with Prometheus rule evaluation) you may end up seeing only the spikes in CPU usage, or only the low CPU usage periods in between spikes. E.g. I run a Prometheus instance that evaluates rules every 10 seconds. If I scrape that Prometheus instance every 5s and look at irate() with a resolution that's a multiple of 10 seconds, I will only ever see either the spikes (when rules are evaluated) or the troughs, whereas the actual CPU utilization is the average of the two.
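
For long-range dashboard panels, a rate()-based variant of the ratio query above tends to be less misleading than irate(), since rate() averages over the whole window rather than just the last two samples; a minimal sketch:

1
  -
  sum by (instance) (rate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m]))
    /
  sum by (instance) (rate(wmi_cpu_time_total{instance="$server"}[5m]))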

@dhaval512

Hi everyone,
Does anyone know how to fetch Expensive Queries by IO (Top 25) and Expensive Queries by CPU (Top 25) in Grafana and Prometheus?

@for-mat

for-mat commented Apr 21, 2020

Hi everyone,
Does anyone know how to fetch Expensive Queries by IO (Top 25) and Expensive Queries by CPU (Top 25) in Grafana and Prometheus?

topk(25, xxxxxx)
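
If "Expensive Queries" refers to database-level statistics, those come from the database rather than from wmi_exporter and need their own exporter; as a concrete illustration of the topk() pattern with a metric from this thread, a minimal sketch:

# the 25 instances with the highest non-idle CPU rate
topk(25, sum by (instance) (irate(wmi_cpu_time_total{mode!="idle"}[5m])))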


This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions bot added the Stale label Nov 26, 2023
github-actions bot closed this as not planned Dec 28, 2023