Negative Numbers on wmi_cpu_time_total #259
Comments
Hey @DevOps-Dad, |
After each negative entry there is a spike... how long is the time between such spikes? About 16..17 Minutes by any chance? |
@avonwyss, yes, the graph is just as you said. See below. I will post a stacked graph in a couple of minutes. |
@carlpett Below is the stacked graph. No negative numbers here. Very weird. |
Thanks, although not much wiser... Some other things to check:
|
Okay, this is kind of interesting. There is a time change at each negative value. I plotted wmi_os_time and wmi_cpu_time_total on the same graph, multiplying the OS time by 10 so that both are visible. As you can see below, the negative WMI values occur whenever there is a time change. I also included a graph of just wmi_os_time. I am not sure whether this is related to the WMI refresh issue or is a clock issue, but the clock corrects itself every 16-17 minutes, so I wonder if the two are related. |
I have the exact same issues as @mdunc on a newly launched instance of Windows Server 2016 in AWS. I have older instances displaying the same issues, but not to the same extent: the older instances see scrape time increase about fourfold (from about 400 milliseconds to 1.6 seconds), while the scrape of the new instance times out completely (>10 seconds). I noticed in the documentation for setting up NTP and time on Windows Servers that AWS have changed the default behaviour of NTP in AMI instances launched after August 2018:
This is interesting:
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/windows-set-time.html
I am not sure whether this is related at all, but the time window of the AWS change and the timing of the "Special Poll Interval" are interesting. My next step is to launch a fresh, empty instance and see whether the behaviour is the same. |
Very interesting find @Fredrik-Petersson! If you are able to do some testing with disabling NTP/using another NTP server, etc it would be super useful. |
Got the fresh AWS hosted Windows server up and running. It has the same issues as the others. Currently testing different cases with disabled Amazon SSM Agent, disabled windows time, other NTP servers etc. |
I am very interested in what you find. Just a note, the issues I am having from above are on a Windows 2012R2 server running in a VMware environment. |
Like carlpett said, this is very likely to be the issue discussed in #89 (WMI hanging every 1000 seconds). WMI queries run one by one and aren't transactional, so there is no consistent view across them. This usually doesn't matter, but with the hangs you're likely to get inconsistent results. You could check whether you see the same behavior with https://github.com/leoluk/perflib_exporter |
Hi, we encountered the same problem with negative CPU metrics and found this issue. I gather from @higels' response in #89:
that the issue is unlikely to be fixed in the foreseeable future. Is my understanding correct? If so, we will probably try https://github.com/leoluk/perflib_exporter and let you know whether this issue is also present there. |
Just to erase one theory in this issue, I am seeing this exact issue and I have never used AWS or installed anything related to it on this machine (or any other, since my employer has blacklisted Amazon due to data protection issues). So that is definitely not the culprit. |
Same as @Gaibhne: I can reproduce the issue on a dedicated OVH machine. |
As far as I can tell, this is a non-issue. Prometheus does not (nor could it) guarantee that scraping happens precisely at the configured scrape interval. So e.g. if you configure a This in and of itself is not an issue for E.g. let's say the first CPU sample (let's call it If you want a better estimate of the percentage of time your CPU was idle for over the past 0-100+ s, you can instead write your query as:
I.e. divide the idle CPU time by total CPU time, so it always adds up to 100% regardless of how accurate the rate estimate or sample timestamps are. |
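The arithmetic behind that suggestion can be illustrated with a small Python sketch. All numbers here are made up for illustration: a single core that is 98% idle, a nominal 10 s scrape interval, and the assumption that the second scrape stalls so that the counters are read 0.5 s later than the timestamp stored with the sample. The function names are hypothetical and are not part of Prometheus or wmi_exporter.

```python
# Sketch with made-up numbers: one core, 98% idle, scraped every 10 s.
# Assumption: the second scrape stalls, so its counters are read 0.5 s
# later than the timestamp that gets stored with the sample.


def idle_rate_usage(t0, idle0, t1, idle1):
    """CPU usage in percent as 100 - (idle increase per second) * 100,
    dividing by the difference of the recorded timestamps."""
    return 100.0 - (idle1 - idle0) / (t1 - t0) * 100.0


def idle_ratio_usage(idle0, total0, idle1, total1):
    """CPU usage in percent as 100 - (idle increase / total increase) * 100.
    No timestamps are involved, so timestamp error cannot push it below 0."""
    return 100.0 - (idle1 - idle0) / (total1 - total0) * 100.0


recorded_t0, recorded_t1 = 0.0, 10.0  # timestamps stored with the samples
actual_elapsed = 10.5                 # seconds the counters really cover

idle0, total0 = 0.0, 0.0
idle1 = 0.98 * actual_elapsed         # idle CPU-seconds accumulated
total1 = actual_elapsed               # CPU-seconds summed over all modes

rate_based = idle_rate_usage(recorded_t0, idle0, recorded_t1, idle1)
ratio_based = idle_ratio_usage(idle0, total0, idle1, total1)

print(f"rate-based usage:  {rate_based:.1f}%")   # -2.9%
print(f"ratio-based usage: {ratio_based:.1f}%")  # 2.0%
```

Under these assumptions the rate-based figure goes negative because 10.29 idle seconds are attributed to a 10 s window, while the ratio stays at 2% because the idle and total counters were read at the same moment and share the same timing error.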
Oh, and also note that even though So depending on the actual range and/or the minimum resolution configured for your Grafana graph, when faced with regular spikes in CPU usage (as with Prometheus rule evaluation) you may end up seeing only the spikes in CPU usage. Or the low CPU usage periods in-between spikes. E.g. I run a Prometheus instance that evaluates rules every 10 seconds. If I scrape that Prometheus instance every 5s and look at |
hi everyone , |
topk(25, xxxxxx) |
This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs. |
I have a Windows 2012R2 server that is running the v0.4.3 wmi_exporter.
When I use the query below, I regularly get negative numbers for CPU usage. This is a very underutilized server running on VMware with 4 cores allocated to it.
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle", instance="$server"}[5m])) * 100)
I appreciate any guidance.
Thanks,
Joe