Soundness of the rate() function: is division by ms.Range.Seconds() correct? #3812

Closed
EdSchouten opened this Issue Feb 7, 2018 · 6 comments

Comments

@EdSchouten
Contributor

EdSchouten commented Feb 7, 2018

What did you do?

Set up the SNMP exporter. Monitor some switches. Run this query:

rate(sysUpTime[4h])

What did you expect to see?

As uptime increases by one second per second, I would expect to see flat lines.

What did you see instead? Under which circumstances?

If we scrape targets that were recently added or that have been down for some time, the line is not flat.

[screenshot from 2018-02-07 18-22-32: graph showing non-flat rate lines for such targets]

This is because data is only available for part of the window, not for all four hours. Though this example may seem contrived, the problem applies to any kind of rate computation: we effectively under-report values during the initial time frame.
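To make the under-reporting concrete, here is a small worked example in Go. All numbers are hypothetical: a counter that, like sysUpTime, increases by one per second, queried over a [4h] window that only contains one hour of samples.

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers: a counter increasing by 1/s (like sysUpTime),
	// queried as rate(...[4h]), but with samples covering only the last hour.
	const (
		requestedRange  = 4 * 3600.0 // seconds in the [4h] selector
		sampledInterval = 3600.0     // seconds actually covered by samples
		increase        = 3600.0     // counter delta over the sampled interval
	)

	// Dividing by the full requested range: 0.25/s, the non-flat line above.
	fmt.Println(increase / requestedRange)

	// Dividing by the interval actually covered by samples: 1/s, the flat
	// line I expected to see.
	fmt.Println(increase / sampledInterval)
}
```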

Environment

  • Prometheus version: 2.1.0

Details

This is something I've noticed about Prometheus for quite a long time, but have only now decided to file an issue about. Looking at the code, I suspect this may be due to extrapolatedRate() dividing its result by ms.Range.Seconds(), as opposed to using the actual length of the window containing samples.
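For reference, a minimal sketch of the shape of that computation. This is not the actual extrapolatedRate() code; the types are simplified stand-ins and counter resets are ignored.

```go
// Simplified stand-in for a sample; not an actual Prometheus type.
type point struct {
	t float64 // timestamp in seconds
	v float64 // counter value (resets ignored here)
}

// naiveRate illustrates the division in question: the counter delta is divided
// by rangeSeconds (the requested range, e.g. 4h), rather than by the interval
// the samples actually cover (last.t - first.t).
func naiveRate(samples []point, rangeSeconds float64) float64 {
	if len(samples) < 2 {
		return 0 // the real rate() emits no sample in this case
	}
	first, last := samples[0], samples[len(samples)-1]
	return (last.v - first.v) / rangeSeconds // vs. dividing by last.t - first.t
}
```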

@brian-brazil

Member

brian-brazil commented Feb 7, 2018

This behaviour is as intended: averaged over the entire 4h, the rate is not 1/s, as it is 0/missing for much of that time. We used to do this differently, and it led to massive over-reporting in similar scenarios.

brian-brazil reopened this Feb 7, 2018

@EdSchouten

Contributor Author

EdSchouten commented Feb 7, 2018

Hmmm... I can't say I fully agree with that reasoning.

I understand that dividing by the meaningful length of the window, as opposed to the requested length, may indeed cause us to over-report. But then again, isn't there a full symmetry between the two? An example from a medical context: over- and under-reporting someone's heart rate or blood pressure are equally bad. The same holds in systems administration: one can create an alert to detect that certain types of events don't occur often enough, for example a web service receiving no HTTP requests because its network uplink went away.

The algorithm used for computing rate() currently tries to prevent potential over-reporting by always under-reporting, which I find odd. If we're concerned about quality of data, maybe it makes sense to simply omit any results until a certain fraction of time (>=50%) is present?

@brian-brazil

Member

brian-brazil commented Feb 7, 2018

The algorithm used for computing rate() currently tries to prevent potential over-reporting by always under-reporting, which I find odd.

I don't see how this is under-reporting; it is the average rate over the time period in question.

If we're concerned about quality of data, maybe it makes sense to simply omit any results until a certain fraction of time (>=50%) is present?

That would lead to under-reporting when a service was started or stopped.

@EdSchouten

Contributor Author

EdSchouten commented Feb 7, 2018

If we're concerned about quality of data, maybe it makes sense to simply omit any results until a certain fraction of time (>=50%) is present?

That would lead to under-reporting when a service was started or stopped.

I thought about this a bit more on the way back home. Picking an arbitrary cutoff (50%, as I suggested) indeed doesn't seem wise. For example, in the past I've used rate(x[1d]) to get graphs with daily peaks smoothed out. Not emitting anything during the first 12 hours seems a bit pessimistic.

Has there ever been any discussion about introducing some kind of extended range vector selector that allows specifying a lower bound? For example, rate(x[1h..1d]). This would mean: compute the rate over up to one day of data, but only emit a result if at least one hour of data is available.
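To sketch what I have in mind (the function name, the signature, and this interpretation of the syntax are all hypothetical; nothing like this exists in PromQL today):

```go
// boundedRate sketches one possible reading of rate(x[1h..1d]): use up to a
// day of samples, emit nothing unless at least lowerSeconds of the window is
// actually covered, and divide by the covered interval rather than the
// requested range.
func boundedRate(ts, vs []float64, lowerSeconds float64) (rate float64, ok bool) {
	if len(ts) < 2 {
		return 0, false
	}
	covered := ts[len(ts)-1] - ts[0] // seconds actually covered by samples
	if covered < lowerSeconds {
		return 0, false // below the lower bound: emit no result at all
	}
	return (vs[len(vs)-1] - vs[0]) / covered, true
}
```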

@brian-brazil

Member

brian-brazil commented Feb 7, 2018

There have been discussions that touched on that in the past, but our current rate function is the result. This sounds like it's getting into reporting, for which you're probably going to want to pull the raw data into a script anyway.

@lock


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
