Soundness of the rate() function: is division by ms.Range.Seconds() correct? #3812
Comments
This behaviour is as intended: averaged over the entire 4h, the rate is not 1/s, as it is 0/missing for much of that time. We used to do this differently, and it led to massive over-reporting in similar scenarios.
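To make that concrete, here is a small Go sketch with made-up numbers (a counter increasing by 1/s that only has samples for the last hour of a 4h window) contrasting the two divisions; it ignores the boundary extrapolation the real function also performs:

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers: a counter increasing by 1/s that only has
	// samples for the last hour of a 4h rate() window (e.g. the target
	// was added an hour ago).
	rangeSeconds := 4 * 3600.0 // the [4h] selector range
	sampleSpan := 1 * 3600.0   // span actually covered by samples
	increase := 3600.0         // counter delta over those samples

	// What rate() does (modulo boundary extrapolation): divide by the
	// full selector range, which reports well below the momentary 1/s.
	fmt.Printf("divide by selector range: %.2f/s\n", increase/rangeSeconds) // 0.25/s

	// Dividing by the observed span instead reports 1/s, treating the
	// partial data as if it had been sustained over the whole window.
	fmt.Printf("divide by observed span:  %.2f/s\n", increase/sampleSpan) // 1.00/s
}
```

The second figure illustrates the over-reporting concern: a counter that has existed for only an hour gets reported as if it had run at 1/s for the full window.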
brian-brazil closed this Feb 7, 2018
brian-brazil reopened this Feb 7, 2018
Hmmm... I can't say I fully agree with that reasoning. I understand that dividing by the meaningful length of the window, as opposed to the requested length, may indeed cause us to over-report. But then again, isn't there a full symmetry between these two? Take an example from a medical context: over-reporting and under-reporting someone's heart rate or blood pressure are equally bad. Even in systems administration, one can create an alert to detect that certain types of events don't occur often enough, for example a web service receiving no HTTP requests because its network uplink went away. The algorithm used for computing …
I don't see how this is under-reporting; it is the average rate over the time period in question.
That would lead to under-reporting when a service was started or stopped.
I thought about this a bit more on the way back home. Picking an arbitrary cutoff (50%, as I suggested) doesn't seem wise indeed. For example, in the past I've used … Has there ever been any discussion about introducing some kind of extended range vector selection expression that allows specifying a lower bound? For example, …
There have been discussions that touched on that in the past, but our current rate function is the result. This sounds like it's getting into reporting, for which you're probably going to want to pull the raw data into a script anyway.
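For what it's worth, a minimal sketch of that script approach, assuming a Prometheus server at localhost:9090 and a placeholder counter metric foo: it pulls the raw samples behind foo[4h] from the HTTP API and divides the increase by the span the samples actually cover (no counter-reset handling):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

func main() {
	// Fetch the raw samples behind foo[4h] via the instant-query API.
	q := url.QueryEscape(`foo[4h]`)
	resp, err := http.Get("http://localhost:9090/api/v1/query?query=" + q)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// For a range selector the result type is "matrix": per series, a
	// list of [timestamp, "value"] pairs.
	var body struct {
		Data struct {
			Result []struct {
				Values [][2]interface{} `json:"values"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}

	for _, series := range body.Data.Result {
		vs := series.Values
		if len(vs) < 2 {
			continue
		}
		t0 := vs[0][0].(float64)
		t1 := vs[len(vs)-1][0].(float64)
		v0, _ := strconv.ParseFloat(vs[0][1].(string), 64)
		v1, _ := strconv.ParseFloat(vs[len(vs)-1][1].(string), 64)

		// Divide by the span the samples actually cover, not by the
		// 4h selector range. Note: no counter-reset handling.
		fmt.Printf("rate over observed span: %g/s\n", (v1-v0)/(t1-t0))
	}
}
```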
brian-brazil closed this Feb 14, 2018
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
EdSchouten commented Feb 7, 2018
What did you do?
Set up the SNMP exporter. Monitor some switches. Run this query:
What did you expect to see?
As the uptime increases by one second per second, I would expect to see flat lines.
What did you see instead? Under which circumstances?
If we scrape targets that were recently added or have been down for some time, this line is not flat.
This is due to data being only partially available, not for the full four hours. Though this example may seem contrived, the problem applies to any type of rate computation: we effectively under-report values during the initial time frame.
Environment
Prometheus version:
2.1.0
This is something I've noticed about Prometheus for a pretty long time, but finally decided to file an issue about. Looking at the code, I suspect this may be due to extrapolatedRate() dividing its result by ms.Range.Seconds(), as opposed to using the actual length of the window containing samples.
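For reference, a simplified, self-contained sketch of how I read that logic. This is not the actual source: the real extrapolatedRate() also handles counter resets and guards against extrapolating a counter below zero, and times are plain seconds here for readability.

```go
package main

import "fmt"

// sketchRate is a simplified stand-in for extrapolatedRate(): it
// extrapolates the increase between the first and last sample towards
// the window boundaries, then divides by the full selector range rather
// than by the span the samples cover.
func sketchRate(ts, vs []float64, rangeStart, rangeEnd float64) float64 {
	n := len(ts)
	increase := vs[n-1] - vs[0] // assumes no counter resets

	sampledInterval := ts[n-1] - ts[0]
	avgInterval := sampledInterval / float64(n-1)

	// Extrapolate towards each boundary: all the way if the boundary is
	// close (within ~1.1 average scrape intervals), otherwise by half an
	// average interval.
	extrapolateTo := sampledInterval
	if toStart := ts[0] - rangeStart; toStart < avgInterval*1.1 {
		extrapolateTo += toStart
	} else {
		extrapolateTo += avgInterval / 2
	}
	if toEnd := rangeEnd - ts[n-1]; toEnd < avgInterval*1.1 {
		extrapolateTo += toEnd
	} else {
		extrapolateTo += avgInterval / 2
	}
	increase *= extrapolateTo / sampledInterval

	// The division this issue is about: the full selector range, not
	// the span actually covered by samples (ts[n-1] - ts[0]).
	return increase / (rangeEnd - rangeStart)
}

func main() {
	// A counter increasing by 1/s, scraped every 15s, but only present
	// for the last hour of a 4h window:
	var ts, vs []float64
	for t := 3 * 3600.0; t <= 4*3600.0; t += 15 {
		ts = append(ts, t)
		vs = append(vs, t-3*3600.0)
	}
	fmt.Printf("%.3f/s\n", sketchRate(ts, vs, 0, 4*3600)) // ~0.251, not 1.0
}
```

With these inputs the sketch reports roughly 0.25/s instead of 1/s, which matches the non-flat lines described above.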