
Specify desired step in Prometheus in dashboard panels #9705

Closed
zemek opened this issue Oct 27, 2017 · 61 comments · Fixed by #36422
Labels
area/datasource · datasource/Prometheus · effort/small · onboarding · prio/medium (Important over the long term, but may not be staffed and/or may need multiple releases to complete.) · type/feature-request

Comments

@zemek

zemek commented Oct 27, 2017

We used to be able to specify the step parameter for Prometheus queries; this was changed in #8073.

It would be nice to get this functionality back so that brief spikes can be displayed in graphs.

@bergquist bergquist changed the title [Feature request] Specify desired step in Prometheus Specify desired step in Prometheus Oct 31, 2017
@atonkyra

atonkyra commented Nov 1, 2017

This would be pretty awesome to have back. Right now some fast-moving gauges are impossible to render correctly.

The animation at https://gyazo.com/f83dc32078209a7e6a4a87efe5b4e81b illustrates the problem pretty well.

@MarcMagnin

This would be quite nice to have back. Any idea when this could be done, if it is planned?

@MarcMagnin

MarcMagnin commented Nov 14, 2017

For reference, there is a discussion on the Prometheus GitHub as well: prometheus/prometheus#2364
Basically, without an enforced step there is no way to render a chart consistently on each refresh for a metric that is not a counter (i.e. one that can't use the rate trick) when the time range is longer than 1h.

@seeruk

seeruk commented Nov 30, 2017

I've also just left a comment on that issue explaining what I've found. Realistically I think this is something that Prometheus should handle, but it's something that Grafana may be able to deal with on its own too - I'm still experimenting with it.

Edit: After experimenting a little, Grafana could handle this even if Prometheus doesn't, by ensuring that steps always contain the same points. That is, if the step size is 15s, maybe each step should cover one of these parts of a minute: 0-15s, 15-30s, 30-45s, or 45-60s. So, if "now" were 36 seconds into a minute, Prometheus would be queried up to 45s into that same minute (i.e. into the future). In this case, Prometheus does actually seem to return the latest information for the last point, which means that the last point will change, but once we get past it, it would always remain the same from then on.

This behaves somewhat like pre-calculated buckets: in some cases the value in a bucket might be updated, in others it may be added to over time, but historic buckets would never change, because their bounds wouldn't be moving targets like they are now.
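
For illustration, here is a minimal Go sketch of this alignment idea (not Grafana's actual implementation; the alignEnd helper name and behavior are illustrative only): snap the end of the query range up to the next step boundary, so repeated refreshes evaluate the same timestamps.

```go
package main

import (
	"fmt"
	"time"
)

// alignEnd snaps a timestamp up to the next multiple of step, so a refresh at
// 36s into a minute with a 15s step queries up to the 45s mark of that minute.
// Hypothetical helper, not Grafana code.
func alignEnd(t time.Time, step time.Duration) time.Time {
	aligned := t.Truncate(step)
	if aligned.Before(t) {
		aligned = aligned.Add(step) // may lie slightly in the future
	}
	return aligned
}

func main() {
	step := 15 * time.Second
	now := time.Now()
	end := alignEnd(now, step)
	start := end.Add(-5 * time.Minute) // 5m is a multiple of 15s, so start stays aligned too
	fmt.Printf("start=%d end=%d step=%d\n", start.Unix(), end.Unix(), int(step.Seconds()))
}
```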

@thenayr

thenayr commented Dec 4, 2017

With this change in place I can't get metric resolution finer than 15s intervals (we actually use 10s resolution for all metrics).

I also have all of the issues mentioned above with spiky charts jumping around needlessly.

@dprittie

dprittie commented Dec 5, 2017

I am also unable to get a metric resolution finer than 15s - we have some Prometheus targets that scrape every 500 ms and we are unable to properly observe this data, which is a serious breaking change for us :(

If anyone has figured out a way to get finer than 15s, please share!

@torkelo
Member

torkelo commented Dec 5, 2017

Have you set the min step/interval option on the datasource options page? You can also override it at the per-panel or per-query level.

@atonkyra

atonkyra commented Dec 5, 2017

@torkelo the problem is that it is a min step. We can define the minimum step but not the maximum step, which causes us to end up stepping over high-frequency polled data. (See my picture above for an example; there the min step was set to 1s on data polled at a 10s interval.)

@seeruk

seeruk commented Dec 5, 2017

One option to help with this issue is to use an *_over_time aggregation query and set the time period to $__interval; this will scale with the step size. Unfortunately, it doesn't completely solve the problem.

@torkelo
Member

torkelo commented Dec 5, 2017

Not sure I understand - you want Grafana to query for more data points than there are pixels in the graph? If you set a max step, your query will return too much data.

Prometheus has range selectors and functions that should allow you to get what you want; using the interval variable can help.

@atonkyra

atonkyra commented Dec 5, 2017

@torkelo okay, I'll try to explain the problem (as I understand it).

I have data with a resolution of 1 value per 10 seconds. Now, if I want to show all the data and see the spikes, what would be the logical step in a 5-minute graph? I'd say it's 10 seconds.

Okay, now let's set min step to 10s for the fun of it. The graph keeps changing completely on every reload!? Okay, let's examine what we send to Prometheus:

GET /grafana/api/datasources/proxy/1/api/v1/query_range?query=...&start=1512506717&end=1512507017&step=15

See the step=15? That basically means we are stepping over existing values. Let's look at this on a timeline.

0s  10s  20s  30s  40s  50s  60s
+----+----+----+----+----+----+---- ...
0    1   100   0    1   100   0
^      ^       ^      ^       ^

+ = value interval
^ = where step happens to land at

Now looking at that, we see that the spikes of 100 may be jumped over due to the misaligned step: Prometheus might report 1 or 100 at the 15-second marker depending on which side the step happens to land on. Now what happens if our data is at 1-second resolution? Well, we miss 14 seconds of values, and especially on very unstable gauges the graph looks completely different on each refresh. :)

I hope this explains the problem better.
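
To make the skipping concrete, here is a toy Go simulation of the timeline above (assuming, roughly, that query_range evaluates at start + k·step and at each evaluation timestamp returns the most recent sample at or before it, ignoring lookback/staleness details):

```go
package main

import "fmt"

func main() {
	// One sample every 10s; values follow the timeline above.
	samples := map[int]float64{
		0: 0, 10: 1, 20: 100, 30: 0, 40: 1, 50: 100, 60: 0,
	}

	step := 15
	for t := 0; t <= 60; t += step {
		last := (t / 10) * 10 // most recent 10s-aligned sample at or before t
		fmt.Printf("eval@%2ds -> sample@%2ds value=%v\n", t, last, samples[last])
	}
	// The spikes of 100 at 20s and 50s are never returned: no evaluation
	// timestamp falls in [20s,30s) or [50s,60s).
}
```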

@atonkyra

atonkyra commented Dec 5, 2017

Also, please have a look at the image here. This is a 10-second-interval dataset with min step set to 10 (or 1 - Grafana sends 15 regardless). The two graphs are from the same data; the only difference is that I pressed refresh 3 seconds after taking the left screenshot.

[screenshot: step_broken]

I'd say there are enough pixels to show the other points as well... :P

@bmildren

bmildren commented Dec 5, 2017

@atonkyra just to confirm, did you say you had already adjusted the scrape interval on the data source?

[screenshot of the datasource scrape interval setting]

@atonkyra

atonkyra commented Dec 5, 2017

@bmildren okay, I didn't even know there was such a setting; setting it to the smallest scrape interval of any scrape job you have will fix the problem.

I still think defaulting to anything hard-coded is utterly broken (on the lower bound). At the very least it should be configurable per graph.

@atonkyra

atonkyra commented Dec 5, 2017

If we wanted some automation, one could use the /api/v1/status/config API endpoint on Prometheus. The contents are YAML, so that would need parsing. Grafana could then match on the job key and, failing that, fall back to the global scrape interval in the config.
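
A rough Go sketch of that automation, under the assumption that the endpoint returns the config as a YAML string inside a JSON envelope; the Prometheus URL is a placeholder and the structs model only the keys of interest:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"

	"gopkg.in/yaml.v3"
)

type statusConfig struct {
	Data struct {
		YAML string `json:"yaml"`
	} `json:"data"`
}

type promConfig struct {
	Global struct {
		ScrapeInterval string `yaml:"scrape_interval"`
	} `yaml:"global"`
	ScrapeConfigs []struct {
		JobName        string `yaml:"job_name"`
		ScrapeInterval string `yaml:"scrape_interval"`
	} `yaml:"scrape_configs"`
}

func main() {
	// Placeholder address; error handling kept minimal for brevity.
	resp, err := http.Get("http://prometheus:9090/api/v1/status/config")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var sc statusConfig
	if err := json.NewDecoder(resp.Body).Decode(&sc); err != nil {
		panic(err)
	}

	var cfg promConfig
	if err := yaml.Unmarshal([]byte(sc.Data.YAML), &cfg); err != nil {
		panic(err)
	}

	for _, job := range cfg.ScrapeConfigs {
		interval := job.ScrapeInterval
		if interval == "" {
			interval = cfg.Global.ScrapeInterval // fall back to the global interval
		}
		fmt.Printf("job=%s scrape_interval=%s\n", job.JobName, interval)
	}
}
```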

@torkelo
Member

torkelo commented Dec 6, 2017

I still think defaulting to anything hard-coded is utterly broken (on the lower bound). At the very least it should be configurable on the graph.

Grafana defaults to the same default Prometheus uses, and nothing is hard-coded, as you can change it. However, I can still see a problem here: Grafana should align the step to even intervals of the min step / scrape interval option.

@bergquist
Contributor

FYI, that setting is only available in the nightly builds.

@atonkyra

atonkyra commented Dec 6, 2017

I'd say we still need a per-graph (or per-query?) "scrape interval" field and, when that is absent, default to the global value on the datasource. Let's say I have a Prometheus instance with a default scrape interval of 10s, but I happen to have some data at 1s and some at 60s.

A single global default just doesn't make sense for many Prometheus installations, IMHO.

bergquist added a commit that referenced this issue Dec 6, 2017
This commit makes it possible to set min interval per panel.
Overrides the value configured on the datasource.

ref #9705
@bergquist
Contributor

It's now possible to set min interval per panel in the nightly build
[screenshot of the per-panel min interval setting]

@zemek
Author

zemek commented Dec 8, 2017

FWIW my original issue is actually solved by using max_over_time(my_metric{}[$__interval])

Although you can't take the max_over_time() of a rate() unless you make a recording rule first, which is slightly annoying

@matejzero

This is a showstopper for us. We are trying to migrate to Prometheus from Graphite and not having consistent graphs doesn't work for us:)

It's really hard to debug an issue on the system when graphs are constantly changing.

Will follow this thread and provide info if needed.

@thenayr

thenayr commented Dec 14, 2017

I believe the Grafana 4.6.3 release today addresses this: 8a16163

@matejzero

matejzero commented Dec 14, 2017

I upgraded to 4.6.3 but it doesn't seem to fix my issue.

I have set the scrape interval in the data source to 10s and this is what I get:
[animated screenshot: output_jhx70k]

These are two graphs with a reload time of 1 minute.

@bmildren

In this case, isn't that just an artifact of irate? irate is based on the last two data points in the range vector ( https://prometheus.io/docs/prometheus/latest/querying/functions/#irate() ); here you're looking at the graph 1 minute later, so the last two data points in each of the 5m ranges are going to be different no matter what you set your scrape interval to. 🤔

@matejzero

Could be, now that I think of it (I'm a total Prometheus noob). Should using the rate() function solve the problem? Because I see the same problem with rate().

@atonkyra

atonkyra commented Dec 14, 2017

@matejzero does resolution 1/1 help at all?

I think we still have a problem here, which I believe is that the dashboard refreshes aren't aligned to the step (which @torkelo mentioned earlier). Example:

scrape interval 10s, min step 15s

S = scraped value
! = step

                   S-----S-----S-----S-----S-----S ...
initial            !        !        !        !    ...
reload after 5s       !        !        !        ! ...
reload after 10s         !        !        !       ...

So basically, when we have uneven scrape/step intervals, we ultimately have a situation where the step causes us to hit completely different values on each refresh.
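
A tiny Go illustration of that misalignment (timestamps arbitrary): with step=15s, the evaluation timestamps shift with every refresh unless the start is first snapped down to a multiple of the step.

```go
package main

import "fmt"

// evalTimestamps lists the timestamps a query_range with the given step would evaluate.
func evalTimestamps(start, end, step int64) (ts []int64) {
	for t := start; t <= end; t += step {
		ts = append(ts, t)
	}
	return ts
}

func main() {
	const step = int64(15)
	base := int64(1512506700) // already a multiple of 15s
	for _, offset := range []int64{0, 5, 10} { // seconds between refreshes
		start := base + offset
		aligned := start - start%step // snap down to a step boundary
		fmt.Println("raw:    ", evalTimestamps(start, start+60, step))
		fmt.Println("aligned:", evalTimestamps(aligned, aligned+60, step))
	}
}
```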

@free

free commented Jun 8, 2018

FYI, I've just made a Prometheus 2.3.0 + xrate release, at https://github.com/free/prometheus/releases/tag/xrate_v2.3.0

Prometheus 2.3.0 has significantly improved the performance of range queries (which is where Grafana and xrate come together), so you may want to give it a whirl.

@matejzero

Great! I'm already running it on our Prometheus and so far it looks good:)

@zemek
Author

zemek commented Jul 5, 2018

Is this effectively resolved in Grafana 5.2 with #10434?

(My original request was resolved by using max_over_time(), but it seems this discussion has shifted to how rate/irate graphs end up moving around a lot.)

@gjcarneiro

I have similar problems due to step. My metric is something like:

sum(max_over_time(pricefeed_num_clients[15m]))

Now, if I tick the Instant checkbox, I get a URL that includes time=xxxx, and that's it.

If I untick the Instant checkbox, I get a URL that includes start=xxxx&end=yyy&step=300. Due to the step=300, I actually get a max_over_time, over a period of a few hours, that can be lower than the max_over_time I get in Instant mode - which should be mathematically impossible.

I just want to get rid of the step=300 because it's messing up the calculation. I tried the subquery syntax, sum(max_over_time(pricefeed_num_clients[15m:1m])), but it makes no difference.

@zekth

zekth commented Jan 27, 2020

Still no solution for this?

@davkal
Contributor

davkal commented May 10, 2020

If you want to set a fixed step, you can do this in Explore. But for dashboard panels we have not settled on a solution yet.

@aocenas aocenas changed the title Specify desired step in Prometheus Specify desired step in Prometheus in dashboard panels Jul 1, 2020
@sksingh20

Does this mean Grafana is not the right tool for viewing historical data, as it represents the data wrongly and will completely mislead analysis and reporting?

Changing the step value on the fly changes the full data view, resulting in wrong reporting... Please confirm so that the same can be communicated to CXO forums.

@leoluk

leoluk commented May 27, 2021

Changing the step value on the fly changes the full data view, resulting in wrong reporting... Please confirm so that the same can be communicated to CXO forums.

Prometheus is fine for analytics, but as with any data source, it's important to understand its properties/limitations. For example, you'll want to use an aggregation function like sum_over_time if you need accurate reporting.

The problem discussed here is inherent to any time-series DB that uses sampling; it just happens to be very visible here.

@sksingh20

@leoluk I think you have misread the problem. It appears to me that the query is built dynamically at the Grafana level, not the Prometheus level. So:
(1) How is this a limitation of the data source?
(2) Grafana is adding a step value to the query to reduce data points, which ends up losing key data values!!
(3) Can you help with a sample dashboard to find all power outages in the last 7 days, or any longer duration, on a dashboard?

@leoluk

leoluk commented May 27, 2021

This sounds like a question for the community or commercial support through a company like Robust Perception.

When you query a data source to build a graph, you need either sampling (i.e. only looking at every n-th data point) or aggregation (min/max/quantiles) to reduce the number of data points to something that can be displayed in a graph. Grafana instructs Prometheus to sample data - via the step parameter - depending on the zoom level to render data at an appropriate resolution. If you zoom out, resolution will go down. There's a hardcoded limit on how many data points Prometheus will return in a single query.
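
As a rough sketch of that calculation (not Grafana's exact logic, which also rounds to "nice" intervals): the step is roughly the time range divided by the number of points the panel can display, clamped to the configured minimum interval.

```go
package main

import (
	"fmt"
	"time"
)

// calcStep approximates how a dashboard derives the query step: about one
// point per displayable data point, but never finer than the min interval.
func calcStep(timeRange time.Duration, maxDataPoints int, minInterval time.Duration) time.Duration {
	step := timeRange / time.Duration(maxDataPoints)
	if step < minInterval {
		step = minInterval
	}
	return step
}

func main() {
	// 7 days on a panel with ~1500 data points and a 10s min interval:
	fmt.Println(calcStep(7*24*time.Hour, 1500, 10*time.Second)) // ≈ 6m43s
	// 5 minutes on the same panel: clamped up to the 10s min interval.
	fmt.Println(calcStep(5*time.Minute, 1500, 10*time.Second)) // 10s
}
```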

If your use case is finding, say, power outages of 5m in a 7-day graph, then no, that won't work (with any DB).

This issue is about noisy graphs being unstable since a different set of data points are sampled each time.

@sksingh20

@leoluk This imposes a limitation on the gauge data type in Grafana. A gauge is meant to show all data points; I see the sole objective of a dashboard as showing the actual gauge values.
Even if this is a limitation of Grafana's technical approach, the business case is quite clear. If something can't be done using Grafana, then it's just a matter of confirming and communicating that.

Can we summarize that Grafana has a limitation in showing gauge values properly for data older than 12 hours, and that there is a high probability that it will show errors?

@leoluk

leoluk commented May 28, 2021

Your monitor wouldn't even have enough pixels to show all data points, depending on the resolution. This is not a Grafana limitation; it's a fundamental property of working with time-series data.

@aocenas
Member

aocenas commented Jun 16, 2021

I guess we could do something like a dropdown select with (max|exact|min) [step] so that the user can decide how the step param should be evaluated.
