Prometheus query: query step is bound by min interval #14209

Closed
peergynt opened this issue Nov 27, 2018 · 11 comments

Comments

@peergynt

What Grafana version are you using?

Grafana v5.3.2

What datasource are you using?

Prometheus

What OS are you running grafana on?

Ubuntu 18.04

What did you do?

In order to see some details on a 1-hour graph, I tried to specify: Min Step=10s and Min time interval=30s

What was the expected result?

The Prometheus query would use step=10 and $__interval=30s

What happened instead?

The query actually has the following values: step=30 and $__interval=30s.
It looks like the step cannot be less than the time interval.
However having a step of 10s with a 30s time interval seems like a reasonable thing to do.

@torkelo
Member

torkelo commented Nov 27, 2018

However having a step of 10s with a 30s time interval seems like a reasonable thing to do.

I don't think that's reasonable. They are connected. Min interval controls the minimum value for step.

Can you describe the reasoning?
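The behavior reported above can be summarized with a tiny model. This is only a sketch of what the issue describes (Min step=10s and Min time interval=30s producing step=30), not Grafana's actual code; the function name `effective_step` is hypothetical, and the real calculation also depends on the time range and panel width.

```python
def effective_step(min_step_s: int, min_interval_s: int) -> int:
    """Simplified model of the reported behavior (an assumption, not Grafana source):
    the step sent to Prometheus never drops below the min time interval."""
    return max(min_step_s, min_interval_s)


# Settings from the issue report: Min step=10s, Min time interval=30s.
print(effective_step(10, 30))
```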

@torkelo added the "needs more info" label Nov 27, 2018
@peergynt
Author

Here is a graph with a query similar to rate(my_metric[$__interval]):

[Image: interval_30s_step_30s]
In the case above, I manually set the query Min step to 10s and the Min time interval to 30s.
But looking at the query inspector, the step used in the Prometheus query is actually 30.

The graph below is for the same query except that I manually plug in the rate interval: rate(my_metric[30s])

[Image: interval_30s_step_10s]
In this case, I did not specify any Min step or Min time interval value. Looking at the query inspector, it is using my data source scrape interval for the step: 10.

I guess that my assumption is that $__interval can be used with rate intervals.
If this is the case, I do not see why it also acts as a lower bound for Min step.
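For reference, in the Prometheus HTTP API the range inside the expression and the `step` of a range query are separate inputs. The sketch below builds a `query_range` URL for the second graph (rate over 30s, step of 10s). The host, timestamps, and the helper name `query_range_url` are placeholders chosen for illustration, not values taken from this issue.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, expr: str, start: int, end: int, step_s: int) -> str:
    """Build a Prometheus /api/v1/query_range URL; step is independent of the rate range."""
    params = {"query": expr, "start": start, "end": end, "step": step_s}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


# Placeholder host and timestamps (assumptions): a 1-hour window at 10s resolution.
url = query_range_url(
    "http://localhost:9090", "rate(my_metric[30s])", 1543348800, 1543352400, 10
)
print(url)
```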

@peergynt
Author

Can you describe the reasoning?

Here is my reasoning:

  • if I use $__interval in my query, I want to set Min time interval to 30s because of my scraping interval of 10s. Asking Prometheus to calculate rates with a range interval lower than 30s does not seem to produce any meaningful result.
  • the step parameter determines how many samples Prometheus will return for each query. It seems to me that using a given interval for rate calculation should not dictate the resolution of the result set (the step).

The 2 graphs above are looking at the same data except that the top one has a resolution of 30s and the bottom one has a resolution of 10s. The top graph is missing quite a bit of data (i.e. between 20:05 and 20:20).

I don't think that's reasonable. They are connected. Min interval controls the minimum value for step.

This is where I am confused. Maybe the time interval and step are connected in some way but I do not see why the step has to be at least equal to the time interval. This seems like an unnecessary rule.

@torkelo
Member

torkelo commented Nov 28, 2018

So effectively you want something like a moving average where each point includes data also included in the last point.

It’s tricky. I guess we could allow a step as low as 1/3 of the interval, but not lower than the scrape interval.

@peergynt
Author

So effectively you want something like a moving average where each point includes data also included in the last point.

I think that when graphing a rate, this is probably ok. Since a rate value is an instant vector, it should be fine if the same point is used in multiple evaluation steps.

When using the Prometheus Expression Browser, if you leave the resolution field empty (i.e. step), it uses the total graph range to calculate the resolution. The rate interval does not affect the query resolution/step at all.

[Image: expression_browser]
For instance, for a 1-hour graph, it will divide 3600s by 250 = 14s.
See https://github.com/prometheus/prometheus/blob/master/web/ui/static/js/graph/index.js#L474
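The arithmetic above can be written out as a short sketch. It assumes the result is rounded down and never goes below 1 second, which matches the 14s figure quoted here; the function name `default_resolution_s` is made up for illustration, and the linked source is the authority on the exact rule.

```python
import math


def default_resolution_s(range_s: float, points: int = 250) -> int:
    """Resolution derived only from the graph range (assumed floor rounding, minimum 1s)."""
    return max(math.floor(range_s / points), 1)


# A 1-hour graph: 3600s / 250 = 14.4, rounded down to 14s.
print(default_resolution_s(3600))
```

Note that the range used inside rate() does not appear anywhere in this calculation.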

In the book Prometheus Up & Running, @brian-brazil recommends the following:
"you should use ranges that are at least one or two scrape intervals larger than the step you are using"
and
"When using range vectors with query_range, you should usually use a range that is longer than your step in order not to skip data."
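A small simulation may help show why a range equal to the step can skip data. It counts, for each pair of consecutive samples, how many evaluation windows contain both of them, since an increase between two samples is only visible to a window that holds both. Everything here is an illustrative assumption: samples every 10s offset by 5s from the evaluation times (so window boundary rules do not matter), and hypothetical helper names.

```python
def visible_increments(sample_times, eval_times, range_s):
    """Count, for each pair of consecutive samples, how many windows contain both."""
    pairs = zip(sample_times, sample_times[1:])
    return {
        (a, b): sum(1 for t in eval_times if t - range_s < a and b <= t)
        for a, b in pairs
    }


def skipped(sample_times, step_s, range_s, end_s):
    """Return the sample pairs that no evaluation window can see."""
    eval_times = range(step_s, end_s + 1, step_s)
    counts = visible_increments(sample_times, eval_times, range_s)
    return [pair for pair, n in counts.items() if n == 0]


samples = list(range(5, 120, 10))  # scrapes at 5s, 15s, ..., 115s

print(skipped(samples, step_s=30, range_s=30, end_s=120))  # range == step
print(skipped(samples, step_s=30, range_s=50, end_s=120))  # range = step + 2 scrapes
print(skipped(samples, step_s=10, range_s=30, end_s=120))  # step below the range
```

With range equal to step, the pairs that straddle a window boundary are never seen by any window. Making the range longer than the step, or the step shorter than the range, makes the windows overlap and closes those gaps.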

@torkelo added the "datasource/Prometheus" and "type/feature-request" labels and removed the "needs more info" label Dec 4, 2018
@free

free commented Feb 2, 2019

My 2c (although no one asked for it): the issue (in my view) is that Prometheus' rate function is broken.

What it should do for rate(foo[30s]) is it should take the value of foo now; subtract the value of foo 30 seconds ago (i.e. foo offset 30s); adjust for any counter resets in-between; divide by 30. This would work perfectly with Grafana's approach of e.g. computing a rate over 30 seconds every 30 seconds (by providing you with $__interval to plug into your query).

What it does instead is it takes the value of foo now; it subtracts the first value of foo after 30 seconds ago; it adjusts for counter resets; it divides that by the time difference between the 2 samples (which is less than 30 seconds). (Plus some magic, which it needs to not seem as broken.) So it artificially limits itself to values falling strictly within the 30 seconds range, then tries to adjust for that by extrapolating.

But if your counter increase happens between the last sample in a 30 second step and the first sample in the next 30 second step, it is simply ignored. And if it happens within a 30 second step, then it has an outsized influence on the rate (e.g. if the actual counter increase is 1 and your samples are 10 seconds apart, then it will return a rate equivalent to 1.5 -- because an increase of 1 over 20 seconds is extrapolated to 1.5 over 30 seconds).
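The two effects described above can be reproduced with a deliberately simplified model of the calculation: take the first and last samples inside the window, then scale the increase from the sampled span up to the full window. This is a sketch only. It ignores counter resets and the additional limits Prometheus applies to extrapolation, and the sample values and the name `extrapolated_increase` are invented for illustration.

```python
def extrapolated_increase(samples, window_start, window_end):
    """Simplified model: increase between the first and last samples inside the
    window, scaled from the sampled span to the full window length.
    No counter-reset handling and no extrapolation limits (assumptions)."""
    inside = [(t, v) for t, v in samples if window_start < t <= window_end]
    if len(inside) < 2:
        return None
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    window = window_end - window_start
    return (v1 - v0) * window / (t1 - t0)


# Hypothetical counter scraped every 10s; it goes up by 1 at 25s and again at 35s.
samples = [(5, 100), (15, 100), (25, 101), (35, 102), (45, 102), (55, 102)]

first = extrapolated_increase(samples, 0, 30)    # sees 100 -> 101 over 20s
second = extrapolated_increase(samples, 30, 60)  # sees 102 -> 102
print(first, second)
```

In this model the increase inside the first window is reported as 1.5 instead of 1, and the increase between the samples at 25s and 35s belongs to neither window, so the two steps add up to 1.5 while the counter actually rose by 2.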

I've filed a feature request -- prometheus/prometheus#3806 -- and wrote the code for it -- prometheus/prometheus#3760 -- but there is no interest on their side to change anything (or to have a second implementation of rate, sort of like irate).

So in order to fix this on the Grafana side you'd need to be able to do some interval arithmetic, rather than arbitrarily allow the step to be 3x (or whatever fixed amount) shorter than $__interval. That's because unless you do this at all resolutions (whether you're looking at 1 hour or 1 month of data), you'll start losing samples as soon as you zoom out far enough for the step to be equal to $__interval and you're back exactly where you started.

So what you'd have to do is (assuming a sample resolution of 10 seconds) something like rate(my_metric[${__interval+10s}]) / ${__interval+10s} * $__interval. I realize this is a totally ridiculous fix, but it's the only thing that will work with rate as is.

Or, you can make use of Prometheus' brand new subquery feature and force it to create a sample every second (or whatever resolution works for you) and then compute the rate on top of that: rate(foo[30s:1s]). It's going to be quite a bit slower, use up more memory (and risk killing your Prometheus instance), but it's going to be closer to what you want.

@roidelapluie
Collaborator

@free unrelated: is it really the

What it does instead is it takes the value of foo now; it subtracts the first value of foo after 30 seconds ago; it adjusts for counter resets; it divides that by the time difference between the 2 samples (which is less than 30 seconds). (Plus some magic, which it needs to not seem as broken.) So it artificially limits itself to values falling strictly within the 30 seconds range, then tries to adjust for that by extrapolating.

Are you sure that Prometheus is not taking the first and last value in the interval? (e.g. with a timestamp different from the value "now")

@free

free commented Mar 5, 2019

@roidelapluie I don't fully understand your question, but yes, Prometheus does take the first and last values within the requested range. However, the first value within the range is not the same as the value of the counter at the beginning of the range (unless that sample falls exactly at the start of the interval, with millisecond precision).

Which is what prevents it from being able to compute a rate over an arbitrary time range. E.g. if your samples are at 10 second resolution, it makes perfect sense to want to know their rate of change with 10 second resolution. Except in PromQL, which requires you to use 20 seconds and then completely fudges the actual rate (because you now only have data for 10 out of the 20 seconds and it needs to somehow make up for that).

@roidelapluie
Collaborator

Okay. From your explanation I was thinking that Prometheus was taking the last value in the range with the end-of-range timestamp.

@davkal
Contributor

davkal commented Jul 1, 2020

#21417 might be interesting. Also, I have trouble understanding why step should ever be smaller than a min interval that's based on the scrape interval. This would try to generate multiple datapoints from the same underlying data. For cleaner peaks I recommend using irate() instead of rate().

@davkal closed this as completed Jul 1, 2020
@trallnag

trallnag commented Aug 6, 2020

@davkal It seems like the range of opinions on the usage of irate() is quite broad. Here is the opinion of a guy over at GitLab:

I strongly suggest you use rate as a matter of habit, and only use irate if you truly know what you're doing and that you need to see instantaneous rates only. Using it where you should be using 'rate' will lead to the original symptoms (sub-sampling), unless your step is 2 times the original sample interval, or less, and there's nothing grafana (or anything else) can do to make that Prometheus query return wider data.

https://www.stroppykitten.com/technical/prometheus-grafana-statistics

Just wanted to share this in case somebody comes across this issue
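To illustrate the sub-sampling mentioned in the quote, here is a simplified model of irate(): it only looks at the last two samples inside the window. The sample data and the name `simple_irate` are invented for illustration, and counter resets are ignored.

```python
def simple_irate(samples, window_start, window_end):
    """Simplified model of irate(): per-second change between the last two
    samples inside the window (no counter-reset handling, an assumption)."""
    inside = [(t, v) for t, v in samples if window_start < t <= window_end]
    if len(inside) < 2:
        return None
    (t0, v0), (t1, v1) = inside[-2], inside[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 10s with one burst between 5s and 15s.
samples = [(5, 100), (15, 160), (25, 160), (35, 160), (45, 160), (55, 160)]

# Evaluated every 30s: neither point sees the burst.
print(simple_irate(samples, 0, 30), simple_irate(samples, 30, 60))

# Evaluated at 20s (as a 10s step would do): the burst is visible.
print(simple_irate(samples, -10, 20))
```

With a 30s step and 10s samples, the burst falls outside the last two samples of every window, so the graph shows nothing; with a step of one or two sample intervals it shows up.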
