Simple Grafana graphs change significantly in between refreshes #2364
Comments
I should add that I have 1.3.1 Prometheus servers around too, but they have different data sets. For several other metrics/hosts I've tried on both versions I see some change due to the interpolation each time the boundaries change, but I've not yet found a case for other metrics that I'm confident is wrong/misleading. I can dig more rigorously if it helps.
Can you add the precise PromQL query you are running?
Hi @beorn7, you can see the one used there in the linked Gist. For convenience, here it is unescaped:

If it helps, this metric is gathered from telegraf's CPU input.
So here is my theory:
It's in general not good to use gauges like
@beorn7 thanks for this. I think you are right. It took me quite some thinking to get to the right reasoning about what is really going on. I had two parts wrong in my head:

As an example to help me clarify my thoughts, and potentially help others who might see this: assume we have a gauge that has samples 15 seconds apart. The values are given below, and there are periodic peaks about every 45 seconds.

If we query that period with a step of 60 seconds, starting from t = 0, we get

If we happen to have set the step to 45 seconds it's even worse. At start time = 0 we have

Having straightened that out, I then wondered whether this is a problem with gauges or fundamental to all types. While the reasoning about how samples are selected would be identical with a counter, the semantics are not the same. With a counter it doesn't matter how many samples we step over each step; we'll still see the increase in the next sample, which means that when converted to a rate using

@beorn7 thanks again for taking the time to point out my misunderstanding. I'm going to work on using raw counters for CPU and move the same percentage logic into Prometheus query time, where it will not be affected by graph resolution/alignment.
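The thread doesn't preserve the final queries, but the counter-based approach described above can be sketched roughly as follows; the metric name cpu_time_user and the host/cpu labels are assumptions about how telegraf's raw CPU counters might be exported, not values taken from this issue.

```
# A hedged sketch: per-host "user" CPU percentage computed at query time
# from a cumulative counter of user CPU seconds, rather than from a
# pre-computed gauge. Metric and label names are assumptions.
100 * avg by (host) (
  rate(cpu_time_user{host="$host", cpu!="cpu-total"}[5m])
)
```

Because rate() works off the increase of the counter across the whole range, the result is far less sensitive to exactly which raw samples happen to land on each step boundary.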
banks closed this on Jan 26, 2017
thenayr commented on Feb 13, 2017
You can also set your Grafana step to
MarcMagnin commented on Nov 14, 2017
Hey, sorry to revive the subject, but how can you sort this out when you can't provide a counter?
seeruk commented on Nov 30, 2017
I'm also hitting this issue. I'm using the Go Prometheus client library, and there are metrics I'm trying to keep track of, such as the number of goroutines (one that the library provides by default). However, when trying to create a graph for this in Grafana, it's impossible to visualise it accurately across all time frames and step sizes.

This is how I see the problem currently. Right now we have something like this:

Or maybe you arrived on the page one second later?

Let's say you're scraping some gauge every second in Prometheus, and in Grafana you're looking at a time period that makes the minimum step size 5 seconds. In this scenario, if one person opens Grafana 1 second later than another person, they'll end up with different data spread across those steps. This can sometimes produce vastly different visualisations of the data.

This problem is more pronounced in some other cases, though. Say your scrape interval is 15 seconds and your Grafana is set to refresh every 10 seconds (one of the defaults): every single time the graphs refresh, you'll end up with completely different looking data. (That's what I was trying to visualise above.) In the above illustration, imagine that points

This all comes from one problem, I think: historic steps should always contain the same data points, regardless of when they are viewed or what the start/end times are. To illustrate that like above:

I understand it's not quite as simple as this, but this seems to be the root cause of the problem, and a solution that I can see. In reality, this is something that Grafana would potentially be able to implement by querying differently, but it could also be handled in Prometheus. If this were implemented somehow, I believe it would allow the graphs to remain consistent across any step size.

Edit: Here's what this issue looks like: https://gfycat.com/lamecautiousliger
matejzero commented on Dec 26, 2017
Anything new on this front?
yotamoron commented on Dec 28, 2017
I'm having the same issues. Any news?
matejzero commented on Dec 28, 2017
According to the developers, this is a fundamental problem of sampling (aliasing): https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem

This can be mitigated to some degree by using

I can get it to work nicely with a few specific configurations (like a 6h range with a 30s refresh, or 12h with a 1min refresh), but not 1h/10s or 3h/30s. There is some more info in grafana/grafana#9705.
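The mitigation referred to here is presumably a range aggregation over the Grafana step; a minimal sketch, using the cpu_usage_user gauge from the original report and assuming a Grafana version that provides the $__interval step variable:

```
# Take the highest sample inside each step instead of whichever single raw
# sample happens to sit on the step boundary. The picture still shifts a
# little between refreshes, but spikes are no longer skipped entirely,
# provided $__interval spans at least a couple of scrape intervals.
max_over_time(cpu_usage_user{host="$host"}[$__interval])
```

min_over_time and avg_over_time work the same way if the floor or the typical level is what matters.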
bali-szabo commented on Jan 4, 2018
Hi, I was also facing this problem. One solution, IMO, is to practically bind the frames to epoch 0, i.e. snap the evaluation timestamps to multiples of the interval. This is where the change needs to be made (Line 416 in ec94df4). What I did is basically add 3 lines in front of the for loop and also change the for loop header.

This solution made my graphs stable at any interval or resolution. Don't burn me for code quality: I'm a Python developer, these are the very first lines I have ever written in Go, and there was time pressure to have it fixed ASAP. Consider this solution a working concept. I haven't run the Prometheus unit tests either; I just focused on fixing the graphs.
Prometheus should not be second-guessing the parameters the user has provided to the API. If a user wishes to align their requests somehow, that should happen before the request hits the API.
FYI, here's my analysis of Prometheus' interaction with Grafana and the solution I propose for having them work together nicely: #3746

Essentially the problem is that whatever the value of

My proposal is for Prometheus to use overlapping intervals (by taking the last point before the interval and the last point in the interval). That way you'll still get some jitter if your
sbrudenell commented on May 25, 2018
I had a slightly different solution to this; thought it might be helpful for anyone looking here. I have a graph of power demand, so I really want to be able to see spikes when zoomed out, as well as see the difference between a spike and a high average. So I graphed each of:

Then I have max fill below to avg, and avg fill below to min. (I actually include all 3 series twice, because "fill below to x" seems to force the lines to be hidden.) My second graph is a zoomed-in section of the spiky part of the first graph.
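The exact expressions aren't preserved above, but a min/avg/max band over the Grafana step is commonly built from three queries along these lines; power_demand is a placeholder name for whatever gauge is being graphed, so this is a sketch rather than the configuration actually used here:

```
# Three separate panel queries: the band between min and max shows the
# spread hidden inside each step, while avg tracks the typical level.
min_over_time(power_demand[$__interval])
avg_over_time(power_demand[$__interval])
max_over_time(power_demand[$__interval])
```

Grafana fills in $__interval automatically from the panel's time range and width, which is presumably why no extra variables needed defining below.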
huiminzeng commented on Jun 21, 2018
Hi @sbrudenell, I think your method is really excellent for the evaluation. But could you please tell me how I can define the variables, and in which file? I would appreciate it if you could post your entire configuration.
sbrudenell commented on Jun 21, 2018
@huiminzeng I didn't define anything.
huiminzeng commented on Jun 25, 2018
@sbrudenell, ok, thanks a lot! :)
leszekeljasz referenced this issue on Jul 4, 2018: Align The requested period with Interval if Interval templating is used #3781 (open)
lock bot commented on Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


banks commented on Jan 24, 2017
What did you do?
Looking at simple CPU usage metrics for a host, every time Grafana refreshes, the values on the graph change significantly for the same metric reported by the same Prometheus host.
What did you expect to see?
The metric to always report the same value for a given timestamp
What did you see instead? Under which circumstances?
Here are screenshots of a simple graph panel in Grafana taken a few seconds apart, after refreshing the graphs:

Notice how not just the shape changes but the raw values are pretty different: one indicates a CPU load of about 40%, the other about 80%.
The query above is a relabelling of raw metrics, and Grafana is stacking the graphs, but to rule out either of those as issues, `cpu_usage_user{host="$host"}` (where `$host` is a constant defined in the Grafana template) does the same thing with a single line being displayed.

You can see the source in this case is `prometheus.0`, which is a direct source connecting to one of a pair of Prometheus hosts, to rule out anything odd with our load balancing/failover; I see the same behaviour if I pick the other replica, `prometheus.1`, too.

Hitting the refresh button results in a radical change in the graph shape, even over past data points, in maybe 30% of the times it is hit.
It only seems to happen when looking at a "last X" time range; picking an absolute time range in the past doesn't seem to cause it, so it seems the issue is caused by interpolating the data to start at a specific time. I understand that will cause some differences in interpolation, which may render sharp spikes differently, etc. But the overall difference between ~40% usage over most of 24 hours and ~80% seems too huge a discrepancy to be expected rounding error. It throws into question how reliable any graph is if the value is so sensitive to which time range is selected (i.e. a few seconds' difference changes the whole graph substantially).
I've tried different resolution and step settings in Grafana and they don't seem to make a difference to the fact that refreshes substantially change the graph content (obviously they affect the graph in the expected ways).
This might be a Grafana bug: I've not been able to trace the actual API calls they are making yet, but I've seen the Grafana code before and it seems to be doing the sane thing.
As a quick investigation, I've scripted the following: https://gist.github.com/banks/7d3e1b0f43d88a6fb8ca238d761eb784
I can provide more data if it helps, including full API responses to each query etc.
I'm fairly sure I've seen similar effects on other graphs (not just CPU usage percentage which is a calculated value) in the past and assumed it was to do with processing rate/irate and time interpolation, but this is a raw metric value as stored.
If this is somehow expected/explainable behaviour, it would be great to document how to reason about it: having the exact second at which the dashboard loads make such a huge difference to how the data gets shown seems like a significant issue for a monitoring tool we rely on for understanding our systems.
Environment
System information:
Prometheus version:
I realise this is not latest, we will upgrade soon. I've not seen issues/changelogs about this issue although it seems pretty significant.
Prometheus configuration file: