Simple Grafana graphs change significantly in between refreshes #2364

Closed
banks opened this Issue Jan 24, 2017 · 19 comments

@banks commented Jan 24, 2017

What did you do?

Looking at simple CPU usage metrics for a host, every time Grafana refreshes, the values on the graph change significantly for the same metric reported by the same Prometheus host.

What did you expect to see?

The metric to always report the same value for a given timestamp

What did you see instead? Under which circumstances?

Here is a simple graph panel in Grafana, in screenshots taken a few seconds apart after refreshing the graphs:

[screenshot 1: Grafana graph panel]

[screenshot 2: the same panel a few seconds later, after a refresh]

Notice how not just the shape changes but the raw values are quite different: one indicates CPU load of about 40%, the other about 80%.

The query above involves a relabelling of raw metrics and Grafana stacking the graphs, but to rule out either of those as issues: cpu_usage_user{host="$host"}, where $host is a constant defined in the Grafana template, does the same thing with a single line being displayed.

You can see the source in this case is prometheus.0, which is a direct source connecting to one of a pair of Prometheus hosts, to rule out anything odd with our load balancing/failover. I see the same behaviour if I pick the other replica, prometheus.1, too.

Hitting the refresh button results in a radical change in the graph shape, even over past data points, in maybe 30% of refreshes.

It only seems to happen when looking at a "last X" time range - picking an absolute time range in the past doesn't seem to cause it - so it seems the issue is caused by interpolating the data to start at a specific time. I understand that will cause some differences in interpolation which may render sharp spikes differently, etc. But the overall difference between about 40% usage over most of 24 hours and 80% seems too huge a discrepancy to be expected rounding error. It throws into question how reliable any graph is if the value is so sensitive to which time range is selected (i.e. a few seconds' difference changes the whole graph substantially).

I've tried different resolution and step settings in Grafana and they don't seem to change the fact that refreshes substantially alter the graph content (obviously they affect the graph in expected ways).

This might be a Grafana bug - I've not been able to trace the actual API calls it is making yet. However, I've seen the Grafana code before and it seems to be doing the sane thing.

As a quick investigation, I've scripted the following (a rough sketch of the same kind of probe is shown after the list below): https://gist.github.com/banks/7d3e1b0f43d88a6fb8ca238d761eb784

  • It issues queries to a single Prometheus server for a single metric over a 24 hour period
  • With each query it moves the start and end timestamps forward by 10 seconds (our samples are collected every 15 seconds)
  • It does this over a range of 10 mins, i.e. all of these queries overlap by at least 23 hours and 50 mins.
  • Step size is 2 minutes (I tried different values and could reproduce the general result with any I tried).
  • Each pair of lines in the output shows start and end timestamps, then nested below, the mean, max and min sample values seen.
    • Mean CPU usage percentage across 24 hours varies from ~23% to ~46%, which is a pretty substantial difference. The jumps seem to follow no pattern - occasionally two steps seem to hit the same samples and get an identical result twice in a row, but mostly every new sample that is included/excluded changes the overall data by far more than one would expect a single non-anomalous sample to skew things.
    • Min and max vary quite a bit too
  • I added full raw responses for two consecutive requests that had very different mean values (29 and 45) despite being just 10 seconds different in their start and end times across 24 hours...
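
For reference, here is a rough Go sketch of that kind of probe (the linked gist is the real script; this is an illustrative reimplementation, not its code). It repeatedly asks the standard /api/v1/query_range endpoint for the same metric over 24 hours, shifting start and end forward by 10 seconds each time, and prints the mean/min/max of the values returned. The server address is taken from the scrape config below and the query from the one quoted later in this thread; both are placeholders to adjust as needed.

    // Illustrative probe sketch: not the gist's actual code.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
        "strconv"
        "time"
    )

    // Minimal shape of a query_range response; values come back as [unix_ts, "value"] pairs.
    type rangeResponse struct {
        Data struct {
            Result []struct {
                Values [][2]interface{} `json:"values"`
            } `json:"result"`
        } `json:"data"`
    }

    func main() {
        const prom = "http://prometheus.0.dblayer.com:9090" // placeholder address, adjust as needed
        const query = `cpu_usage_user{host="candidate.57"}`

        base := time.Now()
        for shift := 0; shift <= 600; shift += 10 { // slide the 24h window over 10 minutes
            end := base.Add(time.Duration(shift) * time.Second)
            start := end.Add(-24 * time.Hour)

            u := prom + "/api/v1/query_range?" + url.Values{
                "query": {query},
                "start": {strconv.FormatInt(start.Unix(), 10)},
                "end":   {strconv.FormatInt(end.Unix(), 10)},
                "step":  {"120"},
            }.Encode()

            resp, err := http.Get(u)
            if err != nil {
                panic(err)
            }
            var r rangeResponse
            if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
                panic(err)
            }
            resp.Body.Close()

            // Summarise all returned sample values for this window.
            lo, hi, sum, n := 0.0, 0.0, 0.0, 0
            for _, series := range r.Data.Result {
                for _, pair := range series.Values {
                    s, ok := pair[1].(string) // sample values are encoded as strings
                    if !ok {
                        continue
                    }
                    v, err := strconv.ParseFloat(s, 64)
                    if err != nil {
                        continue
                    }
                    if n == 0 || v < lo {
                        lo = v
                    }
                    if n == 0 || v > hi {
                        hi = v
                    }
                    sum += v
                    n++
                }
            }
            if n > 0 {
                fmt.Printf("%s -> %s  mean=%.1f min=%.1f max=%.1f\n",
                    start.Format(time.RFC3339), end.Format(time.RFC3339),
                    sum/float64(n), lo, hi)
            }
        }
    }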

I can provide more data if it helps, including full API responses to each query etc.

I'm fairly sure I've seen similar effects on other graphs (not just CPU usage percentage which is a calculated value) in the past and assumed it was to do with processing rate/irate and time interpolation, but this is a raw metric value as stored.

If this is somehow expected/explainable behaviour, it would be great to document how to reason about it - having the exact second at which the dashboard loads have such a huge impact on how the data gets shown seems to be a significant issue for a monitoring tool we rely on to understand our systems.

Environment

  • System information:

    paul@prometheus.0 ~ $uname -srm
    Linux 3.13.0-74-generic x86_64
    
  • Prometheus version:

    prometheus, version 1.0.1 (branch: master, revision: be40190)
    build user:       root@e881b289ce76
    build date:       20160722-19:54:46
    go version:       go1.6.2
    

    I realise this is not the latest version; we will upgrade soon. I've not seen anything in issues/changelogs about this, although it seems pretty significant.

  • Prometheus configuration file:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - prometheus.0.dblayer.com:9090
          - prometheus.1.dblayer.com:9090
        labels:
          scrape_host: prometheus.0

  - job_name: 'discovered'
    file_sd_configs:
      - files:
        - /opt/prometheus/conf.d/*.yml

@banks (Author) commented Jan 25, 2017

I should add that I have 1.3.1 Prometheus servers around too, but they have different data sets.

For several other metrics/hosts I've tried on both versions, I see some change due to the interpolation at each boundary change, but I've not yet found a case that I'm confident is wrong/misleading for other metrics. I can dig more rigorously if it helps.

@beorn7 (Member) commented Jan 25, 2017

Can you add the precise PromQL query you are running?

@banks (Author) commented Jan 25, 2017

Hi @beorn7, you can see the one used in the linked Gist. For convenience, here it is unescaped: query=cpu_usage_user{host="candidate.57"}&start=$START&end=$END&step=120.

If it helps, this metric is gathered from telegraf's CPU input.

@beorn7 (Member) commented Jan 26, 2017

So here is my theory:

cpu_usage_user is probably very spiky, and the spikes might be regular (e.g. because something is using the CPU regularly every 10 seconds). That gives you alignment patterns: if the samples that happen to be used by Grafana are at the beginning of that 10s spike period, they are all very high, but if they are shifted by 5s, they are all very low. That explains why you only see it for "now" graphs, as "now" moves all the time.

It's in general not good to use gauges like cpu_usage_user. (You don't even know over which timeframe the percentage is calculated.) It is better to use a counter and then apply a PromQL rate to it, where you can specify the timeframe, e.g. rate(cpu_time_user[2m]) gives you the CPU usage over the last 2m. If that matches the stepping of Grafana, you get meaningful downsampling, and you will never miss spikes; they are all included in your given range.

@banks (Author) commented Jan 26, 2017

@beorn7 thanks for this. I think you are right. It took me quite some thinking to get to the right reasoning about what is really going on. I had two parts wrong in my head:

  • That Prometheus interpolates between samples (it in fact just takes the last data point before the timestamp in question)
  • That a query_range with start, end and step does some sort of averaging of the samples in each step implicitly. With hindsight this is a silly thing to assume. Of course it doesn't; you just get the last value before each requested step timestamp.

As an example to help me clarify my thoughts and potentially help others who might see this:

Assume we have a gauge that has samples 15 seconds apart. The values are given below, and there are periodic peaks about every 45 seconds.

+---+---+----+----+----+----+----+----+-----+-----+-----+-----+
| t | 0 | 15 | 30 | 45 | 60 | 75 | 90 | 105 | 120 | 135 | 150 |
+---+---+----+----+----+----+----+----+-----+-----+-----+-----+
| v | 5 | 50 | 3  | 2  | 60 | 10 | 5  | 70  | 20  | 5   | 50  |
+---+---+----+----+----+----+----+----+-----+-----+-----+-----+

If we query that period with a step of 60 seconds, starting from t = 0, we get 5, 60, 20; if we shift the start by 20 seconds, we see 50, 10, 5, which looks like a totally different series.

If we happen to have set the step to 45 seconds it's even worse: at start time 0 we get 5, 2, 5, 5, but at start time 20 we get 50, 60, 70, 50.
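
To make that selection rule concrete, here is a tiny sketch (my own illustration, not Prometheus code) that applies "take the most recent sample at or before each step timestamp" to the table above and reproduces the two very different-looking series:

    package main

    import "fmt"

    type sample struct {
        t int     // seconds
        v float64 // gauge value
    }

    // lastAtOrBefore returns the most recent sample value at or before t
    // (ignoring Prometheus' staleness/lookback limit for simplicity).
    func lastAtOrBefore(samples []sample, t int) (float64, bool) {
        var v float64
        found := false
        for _, s := range samples {
            if s.t <= t {
                v = s.v
                found = true
            }
        }
        return v, found
    }

    func main() {
        // The series from the table above.
        samples := []sample{
            {0, 5}, {15, 50}, {30, 3}, {45, 2}, {60, 60}, {75, 10},
            {90, 5}, {105, 70}, {120, 20}, {135, 5}, {150, 50},
        }
        const step = 60
        for _, start := range []int{0, 20} {
            fmt.Printf("step=%ds start=%ds:", step, start)
            for t := start; t <= 150; t += step {
                if v, ok := lastAtOrBefore(samples, t); ok {
                    fmt.Printf(" %g", v)
                }
            }
            fmt.Println() // prints "5 60 20" for start=0 and "50 10 5" for start=20
        }
    }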

Having straightened that out, I then wondered if this is a problem with gauges or fundamental to all types. While the reasoning about how samples are selected would be identical with a counter, the semantics are not the same. With a counter it doesn't matter how many samples we step over in each step; we'll still see the increase in the next sample, which means that when converted to a rate using rate or irate, the resulting graph is "correct" at any resolution and not misleading in the same way a spiky gauge is.
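
As a toy illustration of that point (again my own sketch, using a made-up counter whose large jumps mirror the spikes in the gauge table above): however coarse the step and whatever the offset, the per-step increases telescope to "last sampled value minus first sampled value", so no increment between sampled points is silently dropped - at worst it is attributed to a wider step.

    package main

    import "fmt"

    func main() {
        // A made-up counter scraped every 15s; the big jumps play the role
        // of the spikes in the gauge table above.
        times := []int{0, 15, 30, 45, 60, 75, 90, 105, 120, 135, 150}
        counter := []float64{0, 50, 53, 55, 115, 125, 130, 200, 220, 225, 275}

        // sampleAt returns the last counter value at or before t, mirroring
        // how a range query picks samples.
        sampleAt := func(t int) float64 {
            v := counter[0]
            for i, ts := range times {
                if ts <= t {
                    v = counter[i]
                }
            }
            return v
        }

        for _, start := range []int{0, 20} {
            prev := sampleAt(start)
            total := 0.0
            fmt.Printf("start=%ds step=60s per-step increases:", start)
            for t := start + 60; t <= 150; t += 60 {
                cur := sampleAt(t)
                fmt.Printf(" %g", cur-prev)
                total += cur - prev
                prev = cur
            }
            // The total always equals last sampled minus first sampled value,
            // so the spikes' increments are never lost, only smeared.
            fmt.Printf("  (total %g)\n", total)
        }
    }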

@beorn7 thanks again for taking the time to point out my misunderstanding. I'm going to work on using raw counters for CPU and move the same percentage logic into prometheus query time where it will not be affected by graph resolution/alignment.

banks closed this Jan 26, 2017

@thenayr commented Feb 13, 2017

You can also set your Grafana step to 15s and resolution to 1/1; this would ensure that every value is accounted for on a new dashboard "tick". The trouble with doing this is that when you zoom out further than a day or so, the number of queries to Prometheus (and the number of data points in the charts) tends to bog down the CPU and your browser.

@MarcMagnin commented Nov 14, 2017

Hey, sorry to revive the subject, but how can you sort that out when you can't provide a counter?
I'm using Grafana 4.6 and it seems that they have replaced 'step' with 'min step', so we can't enforce it.

@seeruk commented Nov 30, 2017

I'm also hitting this issue. I'm using the Go Prometheus client library, and there are metrics I'm trying to keep track of, such as the number of goroutines (one that the library provides by default). However, when trying to create a graph for this in Grafana, it's impossible to visualise it accurately across all time frames and step sizes.

This is how I see the problem currently. Right now we have something like this:

N | Example 1
O | A         | B         | C
W | 4 5 1 2 3 | 4 5 1 2 3 | 4 5 1 2 3

Or maybe you arrived on the page one second later?

N | Example 2
O | A         | B         | C
W | 3 4 5 1 2 | 3 4 5 1 2 | 3 4 5 1 2

Let's say you're scraping some gauge every second in Prometheus. In Grafana you're looking at a time period that makes the minimum step size 5 seconds. In this scenario, if one person opens Grafana 1 second later than another person, they'll end up with different data spread across those steps. This can sometimes produce vastly different visualisations of the data.

This problem is more pronounced in some other cases though. Say your scrape interval is 15 seconds and your Grafana is set to refresh every 10 seconds (one of the defaults): every single time the graphs refresh, you'll end up with completely different-looking data. (That's what I was trying to visualise above.)

In the above illustration, imagine that points E1A3 and E1B4 are the same as points E2B3 and E2B4, that those 4 points are equal to 100, and that all other points are equal to 3. If you imagine a graph containing those values, the two will look completely different. Other values you can derive, like the maximum, will also be completely different for the two.

This all comes from one problem I think: historic steps should always contain the same data points, regardless of when they are viewed, or what the start / end times are. To illustrate that like above:

N | Example 3
O | A     | B         | C         | D
W | 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5 | 1 2

I understand it's not quite as simple as this, but this seems to be the root cause of the problem, and a solution that I can see. In reality, this is something that Grafana would potentially be able to implement by querying differently, but it could also be handled in Prometheus. If this were implemented somehow, I believe it would allow the graphs to remain consistent across any step size.

Edit: Here's what this issue looks like: https://gfycat.com/lamecautiousliger

@matejzero commented Dec 26, 2017

Anything new on this front?

@yotamoron commented Dec 28, 2017

I'm having the same issues. Any news?

@matejzero commented Dec 28, 2017

According to the developers, this is a fundamental problem of sampling (aliasing): https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem

This can be mitigated to some degree by using the $__interval timeframe (e.g. rate(metric[$__interval])) or the $interval template variable with the auto setting (this is what Percona uses in their dashboards). People report good success with those two solutions, but I'm sadly not one of them.

I can get it to work nicely with a few specific configurations (like a 6h range with 30s refresh, or 12h with 1min refresh), but not 1h/10s or 3h/30s. There is some more info in grafana/grafana#9705.

@bali-szabo commented Jan 4, 2018

Hi,

I was also facing this problem; one solution IMO is to effectively bind the frames to epoch 0.
What I did is basically calculate how many frames have elapsed (with math.Floor) since the beginning (1970-01-01) for the given interval, and start from there. Thanks to the flooring, this keeps my graph aligned to the same start timestamp for as long as the current time stays within frame start + interval. I'm aware it'll cause some distortion of the data in the start and end frames, but I can live with that.

This is where the change needs to be made:

for ts := s.Start; !ts.After(s.End); ts = ts.Add(s.Interval) {

What I did is basically add three lines in front of the for loop, and also change the for loop header:

    ...
    Seriess := map[uint64]Series{}
    // Change starts here: floor the start timestamp down to a multiple of
    // the interval since the Unix epoch, so that queries with slightly
    // different start times evaluate at the same step timestamps.
    intervalSeconds := float64(s.Interval) / float64(time.Second)
    startTs := time.Unix(int64(math.Floor(float64(s.Start.Unix())/intervalSeconds))*int64(intervalSeconds), 0)
    // Extend the end by one interval so the aligned range still covers s.End.
    endTs := s.End.Add(s.Interval)

    for ts := startTs; !ts.After(endTs); ts = ts.Add(s.Interval) {
    // Change ends here
        if err := contextDone(ctx, "range evaluation"); err != nil {
            return nil, err
        }
    ...

This solution made my graphs stable at any interval or resolution. Don't burn me for code quality; I'm a Python developer, these are the very first lines I have ever written in Go, and there was time pressure to have it fixed ASAP. Consider this solution a working concept rather than a finished fix.

I haven't run the Prometheus unit tests either - I just focused on fixing the graphs.

@brian-brazil (Member) commented Jan 4, 2018

Prometheus should not be second-guessing the parameters the user has provided to the API. If a user wishes to align their requests somehow, that should happen before the request hits the API.

@free (Contributor) commented Jan 29, 2018

FYI, here's my analysis of Prometheus' interaction with Grafana and the solution I propose for having them work together nicely: #3746

Essentially the problem is that, whatever the value of $__interval, Prometheus will end up splitting the timeseries into disjoint intervals of the requested length and attempt to compute rates or increases out of each of those intervals, while ignoring the change in the counter between the intervals. It tries to do all sorts of smart extrapolation to account for the information it threw away, but it falls flat on its face. The problem is most visible at high resolution, when your $__interval is twice as long as the interval between the underlying data points: half the information is thrown away and the remaining increase is roughly multiplied by 2 by the extrapolation.

My proposal is for Prometheus to use overlapping intervals (by taking the last point before the interval and the last point in the interval). That way you'll still get some jitter if your $__interval is longer than the resolution of the underlying data and start/end times are not always aligned with a multiple of $__interval (which Grafana doesn't do for now, but could easily do), but that jitter is simply an effect of including different sets of points in every interval rather than throwing away data and replacing it with noise.
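
To illustrate the disjoint-interval effect numerically, here is a toy sketch (my own, not Prometheus code, and it deliberately leaves out Prometheus' extrapolation): with a counter scraped every 15s and a 30s window, the increases computed only from the samples inside each disjoint window add up to half of the real increase - that missing half is what the extrapolation then tries to reconstruct.

    package main

    import "fmt"

    func main() {
        // A counter scraped every 15s, increasing by 10 per scrape.
        times := []int{0, 15, 30, 45, 60, 75, 90, 105, 120}
        values := []float64{0, 10, 20, 30, 40, 50, 60, 70, 80}

        const window = 30 // $__interval of 30s, i.e. twice the scrape interval

        sumOfWindowIncreases := 0.0
        for start := 0; start < 120; start += window {
            var first, last float64
            seen := false
            for i, t := range times {
                if t >= start && t < start+window {
                    if !seen {
                        first = values[i]
                        seen = true
                    }
                    last = values[i]
                }
            }
            if seen {
                // Increase computed only from samples inside this window;
                // the change between this window's last sample and the next
                // window's first sample is never counted.
                sumOfWindowIncreases += last - first
            }
        }

        fmt.Printf("actual increase over the range: %.0f\n", values[len(values)-1]-values[0])
        fmt.Printf("sum of per-window increases:    %.0f\n", sumOfWindowIncreases)
        // Prints 80 vs 40: half of the increase falls between the windows.
    }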

@sbrudenell commented May 25, 2018

I had a slightly different solution to this, thought it might be helpful for anyone looking here.

[screenshot 1: power demand graph with max/avg/min bands]
[screenshot 2: zoomed-in view of the spiky section of the first graph]

I have a graph of power demand, so I really want to be able to see spikes when zoomed out, as well as see the difference between a spike and a high average. So I graphed each of:

  • max_over_time(demand[$__interval])
  • avg_over_time(demand[$__interval])
  • min_over_time(demand[$__interval])

Then I have max fill below to avg, and avg fill below to min. (I actually include all 3 series twice, because "fill below to x" seems to force the lines to be hidden.)

My second graph is a zoomed-in section of the spiky section in the first graph.

@huiminzeng commented Jun 21, 2018

Hi @sbrudenell, I think your method is really excellent for the evaluation. But could you please tell me how I can define the variables? In which file? I would appreciate it if you could post your entire configuration.
Thanks a lot!

@sbrudenell commented Jun 21, 2018

@huiminzeng I didn't define anything. $__interval is a Grafana builtin, {max,avg,min}_over_time are Prometheus builtins. demand is just my name for the metric I'm trying to graph.

@huiminzeng commented Jun 25, 2018

@sbrudenell, OK, thanks a lot! :)

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators Mar 22, 2019
