rate()/increase() extrapolation considered harmful #3746

Closed
free opened this Issue Jan 26, 2018 · 45 comments

@free (Contributor) commented Jan 26, 2018

The problem

[Feel free to skip ahead to the proposed solution, as I'm guessing the problem is widely understood. Or just read the last couple of paragraphs for a real-life example of the problem and my current workaround.]

I've been trying to deal with the errors introduced by the rate() and increase() functions for so long that every few months I go back and spend a couple of hours tweaking Grafana dashboards before remembering that there is literally no way of getting an accurate result out of them, short of using recorded rules for everything and, even in that case, working around and against Prometheus.

To clarify, let's assume I have a monitoring configuration with scrapes every 5 seconds and rule evaluation every 10 seconds. Among other things, it collects a request_count counter that increases by 1 every 20 seconds. Let's also assume that I would really like a Grafana graph that displays an accurate number of requests over time, on a dashboard with an adjustable time range. (I.e. I want to be able to look at 15 minutes, 1 hour and 7 days on the same dashboard, with Grafana automatically adjusting the resolution.) Not a tall order, right?

For the purpose of clarity, here is what the request_count samples would look like (at a 5 second resolution):

[0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, ...]

The naive approach is to add a graph and put in an expression along the lines of:

increase(request_count[$__interval])

and set Min step to 10s (I would prefer a Min step of 5s, since the resolution of the underlying data is 5 seconds, but let's not even go there yet). Ideally the graphed values (with a 10 second resolution) would look something like:

[0, 1, 0, 1, 0, 1, 0, 1, 0, ...]

What I'm getting instead, depending on the exact time I reload the page, is one of:

[0, 2, 0, 2, 0, 2, 0, 2, 0, ...]

or

[0, 0, 0, 0, 0, 0, 0, 0, 0, ...]

This is because increase() ends up calculating the increase between disjoint pairs of values (i.e. (v0, v1), (v2, v3), (v4, v5)), losing half of the data in the process, and extrapolating it to 10 seconds, ending up with double the actual increase. At a 15 second resolution it only throws away a third of the data, and the increases it doesn't throw away get multiplied by 1.5 due to the extrapolation. As the resolution decreases, less data gets thrown away and the multiplier gets closer to 1, but there is, as I said before, literally no way to get an accurate rate()/increase() value in a range query directly from a counter. rate() is not affected by the multiplier issue, but still suffers from thrown-away data and noise.
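
To make the failure mode concrete, here is a small standalone Go sketch (toy code, not Prometheus internals) that reproduces both behaviors on the staircase counter above; depending on how the 10 second windows happen to align with the scrapes, the extrapolated increase comes out as [2, 0, 2, 0, ...] or as all zeros:

package main

import "fmt"

func main() {
	// request_count scraped every 5s, +1 every 20s (bumps at t=5,25,45,...):
	// [0,1,1,1,1,2,2,2,2,3,...] as in the example above.
	ts := make([]float64, 17)
	vs := make([]float64, 17)
	for i := range ts {
		ts[i] = float64(i * 5)
		vs[i] = float64((i + 3) / 4)
	}

	// increase(request_count[10s]) with a 10s step: each (T-10s, T] window
	// holds exactly two samples 5s apart; Prometheus extrapolates their raw
	// difference to the full 10s range, i.e. multiplies it by 10/5 = 2.
	increase := func(end float64) float64 {
		var in []int
		for i := range ts {
			if ts[i] > end-10 && ts[i] <= end {
				in = append(in, i)
			}
		}
		raw := vs[in[len(in)-1]] - vs[in[0]]
		sampled := ts[in[len(in)-1]] - ts[in[0]]
		return raw * 10 / sampled
	}

	// The same query, evaluated at two different page-load alignments.
	for _, phase := range []float64{5, 10} {
		fmt.Printf("eval starting at t=%.0fs: ", phase)
		for end := phase; end <= 75; end += 10 {
			fmt.Printf("%.0f ", increase(end))
		}
		fmt.Println()
	}
}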

Even with less discrete data, such as CPU usage, the graph jumps every 5 seconds between one shape and another. In particular, if I'm looking at Prometheus' own CPU usage, because my rule evaluation happens every 10 seconds but takes less than 5 seconds, the graph keeps jumping between an almost flat 10% (when the pairs picked don't include rule evaluation) and an almost flat 20% (when the pairs do include rule evaluation).

The workaround I've come up with is to compute a recorded rule request_count_increase10s that I can then sum_over_time(), which is actually computed as

    increase(request_count[15s]) / 1.5

so all increases in request_count actually get included in the calculation (because every 15 second range includes 3 data points, i.e. 2 intervals). But even with this cumbersome solution, I still have to fight against the implementation (the / 1.5 part) and the extrapolation will introduce unnecessary noise whenever the underlying timeseries neither increases constantly nor looks like a staircase.
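
For reference, a back-of-the-envelope check of where the / 1.5 comes from (a sketch that approximates increase() as the raw difference scaled by range/sampled-interval, which is what the extrapolation amounts to when the samples sit well inside the range):

package main

import "fmt"

func main() {
	// Three samples 5s apart inside the 15s range, e.g. counter values 7, 8, 8.
	vals := []float64{7, 8, 8}

	rawDiff := vals[2] - vals[0]      // actual increase across the 2 intervals
	sampledInterval := 10.0           // seconds between first and last sample
	rate := rawDiff / sampledInterval // what rate(foo[15s]) effectively computes
	extrapolated := rate * 15         // increase(foo[15s]) scales to the range

	fmt.Println(extrapolated)       // 1.5: half a phantom request
	fmt.Println(extrapolated / 1.5) // 1: the actual increase over 10s
}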

Proposed solution

In an ideal (but by no means unattainable) world, what would happen instead of extrapolation is ... wait for it ... interpolation. I.e. just before the matrix for the underlying samples is evaluated in promql.functions.extrapolatedRate(), the range could be expanded so that it includes one point before the start and one point after the end of the interval. This could be either:

  1. approximated by expanding the interval by a delta: a fixed amount (say 1 minute), a duration proportional to the time range or a combination of the two; or
  2. explicitly provided by MatrixSelector/Engine/SeriesIterator (e.g. as a SeekBefore(t) or Previous() function).

With the points immediately outside the interval now available there are 2 possible approaches:

  1. interpolate the values at the edges of the range and use the difference between the interpolated values (modulo counter resets) to compute the increase/rate; or
  2. (better) use the difference between the last point before the beginning of the range and the last point in the range (again, modulo counter resets) to compute the increase/rate.

The latter has the advantage that there are no weird fractional values where there shouldn't be any, e.g. if you're computing the increase in the number of requests over an interval.

I am willing to put in the legwork if anyone thinks it makes sense. To me it makes a lot of sense to use actual data instead of fabricated (extrapolated) data when computing rates/increases (which, I don't have to point out but I will, is the whole reason for the existence of counters). But it's possible it's only me.

Additional thoughts

If you're not tired by now, there are a couple more things I should add.

For one, I am aware that one could hardcode the increase range into the Grafana dashboard and adjust the result accordingly, but over a high enough time range this will result in even worse data sampling, as you'd now be seeing increase(foo[15s]) with a 30 minute resolution. Sometimes this is acceptable; most of the time it's not.

Second, there is another workaround that only works for increase(foo[10s]), namely using foo - foo offset 10s. But there are a couple of problems with this approach: it doesn't handle counter resets and it's twice as slow because now Prometheus looks up foo twice. (Or, for whatever reason, it's twice as slow as increase(), even with a 5 minute range over samples with 5 second resolution.)

There is yet another Grafana workaround, which is to define a min step (graph resolution) of e.g. 5s and an interval template variable with a minimum value of 10s and try to guesstimate the Step count for the template variable so that it stays at about 2x the graph resolution. But then you can't even apply the 1.5 or 2 factor, because it varies with resolution, not to mention the whole thing is screen resolution dependent.

@brian-brazil (Member) commented Jan 26, 2018

the range could be expanded so it included one point before the start and one point after the end of the interval.

This would be a violation of the PromQL execution model, a range only returns data from its range. Evaluation should also not look into the future.

For one, I am aware that one could hardcode the increase range and adjust the result accordingly into the Grafana dashboard

This is the recommended way to deal with this. This is something to handle in Grafana.

@free (Contributor Author) commented Jan 26, 2018

the range could be expanded so it included one point before the start and one point after the end of the interval.

This would be a violation of the PromQL execution model, a range only returns data from its range. Evaluation should also not look into the future.

I don't know much (if anything) about the PromQL execution model, but whatever it is I don't think it's a good reason for making up data instead of using real data. And if the execution model does in fact trump common sense, there is no good reason for having extrapolation. If my 10 second interval has 2 data points 5 seconds apart, then the increase is the difference between the 2 points and the rate is the difference divided by 5.

For one, I am aware that one could hardcode the increase range and adjust the result accordingly into the Grafana dashboard

This is the recommended way to deal with this. This is something to handle in Grafana.

Could you please explain how I can get a graph of counter increases that works just as well over 15 minutes and 7 days using a fixed increase range? I would be all set if I could write

sum_over_time((increase(foo[10s]) / 2)[$__interval])

but that won't work, will it?

@brian-brazil (Member) commented Jan 26, 2018

Could you please explain how I can get a graph of counter increases that works just as well over 15 minutes and 7 days using a fixed increase range?

You can't, which is why this is something that needs to be handled in the thing sending the requests to Prometheus.

Our current algorithm has been extensively debated and tested; it is very unlikely that any changes will be made.

@free (Contributor Author) commented Jan 26, 2018

You can't, which is why this is something that needs to be handled in the thing sending the requests to Prometheus.

There are a couple of problems with that. One, Grafana or whoever creates the dashboard must know the exact resolution of the underlying data in order to be able to produce an accurate workaround for Prometheus' extrapolation.

And second, what Grafana would have to do to get that accurate value out of Prometheus is (assuming an original resolution of 5 seconds) something along the lines of:

increase(foo[$__interval + 5s]) * ($__interval / ($__interval + 5s))

Does that look like a reasonable solution to anyone?
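
For what it's worth, the arithmetic of that correction does check out for a smoothly increasing counter; here is a toy Go verification (assuming scrapes aligned to 5 second boundaries, so the $__interval + 5s window spans exactly $__interval seconds of raw data):

package main

import "fmt"

func main() {
	const scrape = 5.0 // seconds between samples
	interval := 30.0   // Grafana's $__interval
	T := 300.0         // evaluation time

	// Smooth counter: 2 units/second.
	value := func(t float64) float64 { return 2 * t }

	// increase(foo[interval+5s]): the oldest sample in (T-interval-5s, T]
	// sits at T-interval, so the raw difference spans exactly `interval`
	// seconds and then gets extrapolated to the full interval+5s range.
	rawDiff := value(T) - value(T-interval)
	extrapolated := rawDiff * (interval + scrape) / interval

	// ... * ($__interval / ($__interval + 5s)) undoes the extrapolation.
	corrected := extrapolated * (interval / (interval + scrape))
	fmt.Println(rawDiff, extrapolated, corrected) // 60 70 60
}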

@free (Contributor Author) commented Jan 26, 2018

Also, when you recommend Grafana as the dashboarding solution in your documentation, you don't get to pass the buck to "the thing sending the requests to Prometheus". Sorry.

@brian-brazil (Member) commented Jan 26, 2018

All I can do is recommend watching https://www.youtube.com/watch?v=67Ulrq6DxwA

Several of the things that you consider to be bugs are in fact features, non-integral values are correct for example. If you want exact results you need to use a logs-based monitoring system, not a metrics-based one.

@beorn7 (Member) commented Jan 26, 2018

Our current algorithm has been extensively debated and tested,

For reference: The discussion and implementation can be found in the following PRs:

@free (Contributor Author) commented Jan 26, 2018

I found and went through a couple of those before. But all I can see is a lot of discussion about extrapolating data and one quick dismissal of actually including the last value before the start of the range, on two counts: that it would no longer provide a lower bound (which the current implementation doesn't either) and that with a range that is not a multiple of the scrape interval you'd get a sawtooth pattern.

Fundamentally, it feels inconsistent that we got rid of interpolation for individual sample values, but still do extrapolation for rate and friends. However, for individual sample values, we always take the latest value before evaluation time. So arguably, we should start the range for a rate with the latest value before the range starts, and then take it from there. In that way, we would not systematically under-report anymore, but we would also not return a lower bound, and we would still have the weird step function behavior described in the previous item, now at least centered around the “true” value of the rate. -- From #1161

So I'm asking you: what is the benefit of using extrapolated data when interpolated data is available? Or actual data, without interpolation, shifted by less than the scrape interval?

I mean, if (ignoring counter resets for now) you take the difference between the last point in the range and the last point before the range and compute a rate from that (by dividing by the actual distance between the 2 points), then multiply by the range length you'll get a smooth rate, smooth increase and no aliasing. What the $#@% is wrong with that? How is it worse than guesstimating by extrapolation?

You can still apply limits in special cases (e.g. if the series starts or ends in the middle of the range, then adjust the rate to reflect the ratio of the range that's actually covered by the data) but overall (i) you have fewer special cases, (ii) it's faster to compute and (iii) you're not replacing information with noise. Why, you could even get a graph of increases that you could add up and would match the actual increase in the counter.

@free (Contributor Author) commented Jan 26, 2018

To be more concrete, what specifically is wrong with this particular solution?

func extrapolatedRate(ev *evaluator, arg Expr, isCounter bool, isRate bool) Value {
  ms := arg.(*MatrixSelector)
  // XXX: Also request the last point before the beginning of the range.
  ms.IncludeLastPointBeforeRange = true

  var (
    matrix       = ev.evalMatrix(ms)
    rangeStart   = ev.Timestamp - durationMilliseconds(ms.Range+ms.Offset)
    rangeEnd     = ev.Timestamp - durationMilliseconds(ms.Offset)
    resultVector = make(Vector, 0, len(matrix))
  )

  for _, samples := range matrix {
    if len(samples.Points) < 2 {
      continue
    }

    sampledInterval := float64(samples.Points[len(samples.Points)-1].T-samples.Points[0].T) / 1000
    averageDurationBetweenSamples := sampledInterval / float64(len(samples.Points)-1)
    sampleIntervalThreshold := averageDurationBetweenSamples * 1.1

    // XXX: If the last point before the range is too far from the range start, drop it.
    firstSample := 0
    if float64(rangeStart-samples.Points[0].T)/1000 > sampleIntervalThreshold {
      if len(samples.Points) < 3 {
        continue
      }
      firstSample = 1
      sampledInterval = float64(samples.Points[len(samples.Points)-1].T-samples.Points[1].T) / 1000
    }

    var (
      counterCorrection float64
      lastValue         float64
    )
    // XXX: Unrelated: only need to do the magic dance for counters.
    if isCounter {
      for i := firstSample; i < len(samples.Points); i++ {
        sample := samples.Points[i]
        if sample.V < lastValue {
          counterCorrection += lastValue
        }
        lastValue = sample.V
      }
    }
    resultValue := samples.Points[len(samples.Points)-1].V - samples.Points[firstSample].V + counterCorrection

    // Duration between last sample and boundary of range.
    durationToEnd := float64(rangeEnd-samples.Points[len(samples.Points)-1].T) / 1000

    if samples.Points[firstSample].T < rangeStart && durationToEnd < sampleIntervalThreshold {
      // XXX: Default case, the samples start just before the range start
      // and end just before the range end: the rate is the difference divided
      // by the sampled interval and the increase is the rate times the range length.
      if isRate {
        resultValue = resultValue / sampledInterval
      } else {
        resultValue = resultValue * (ms.Range.Seconds() / sampledInterval)
      }
    } else {
      // XXX: The samples don't cover the whole range: the increase is the
      // difference between the last and first values, the rate is the increase
      // divided by the range length.
      if isRate {
        resultValue = resultValue / ms.Range.Seconds()
      }
    }

    resultVector = append(resultVector, Sample{
      Metric: dropMetricName(samples.Metric),
      Point:  Point{V: resultValue, T: ev.Timestamp},
    })
  }
  return resultVector
}

What it does is request an extra point before the start of the range and (unless that point is more than average interval * 1.1 away from the start of the range) use the actual difference between the last point in the interval and this point as the base resultValue, instead of trying to extrapolate anything.

Then, if the samples approximately cover the whole range (within the same average interval * 1.1 before either end) the rate is computed as resultValue divided by the actual sampled interval and the increase is that times the range.

Else, if the samples only cover part of the range, the increase is exactly the difference between last and first and the rate is that divided by the range.

There is no sawtooth pattern with any smoothly increasing counter; no jitter; no data dropped if you're computing a rate/increase over an interval of X seconds with a resolution of X seconds (which is the only thing Grafana knows how to do reliably). And you can compute a rate/increase over any interval, regardless of the resolution of the underlying data: your graph doesn't suddenly disappear just because you zoomed in too far. BTW, it also doesn't look into the future.

And the only downside AFAICT is that it's "a violation of the PromQL execution model", whatever that means.

Also, I understand if you want to keep the current behavior for rate/increase, for compatibility with rules and dashboards that try hard to work around it. Just add new functions, something along the lines of sensible_rate/sensible_increase. </wink>

And finally, I get quite tired of the "if you want accurate numbers use logs" argument. I'm not asking for accurate metrics in the face of restarts, timeouts or sampling. Just to have PromQL produce an exact difference between 2 data points X seconds away from each other without having to jump through hoops. It feels literally like someone decided that every arithmetic operation should replace the least significant 90% of the result with garbage. Because that's what I get unless I fight Prometheus tooth and nail.

@roidelapluie (Contributor) commented Jan 28, 2018

@free when I want that precision level it often means that I can use irate or idelta. I miss an equivalent increase function, however.

@roidelapluie (Contributor) commented Jan 28, 2018

@free After more thinking: I think that what we do with irate/idelta is really close to this; but you want to change it from "between the last 2 datapoints" to "between the initial value and the last datapoint".

In that sense, besides irate/idelta we could add another prefixed function like xrate/xdelta.

@free (Contributor Author) commented Jan 29, 2018

@free when I want that precision level it often means that I can use irate or idelta. I miss a increase function however.

The reason irate/idelta work better than rate/delta/increase is that they don't do extrapolation, which is what I'm arguing here. That being said, they don't do what I need: a Grafana graph of rate or increase over time that's not tied to a fixed resolution.

My data has 5 second resolution, but if I do irate(foo[5s]) I get a blank graph. If I do irate(foo[$__interval]) (which is Grafana's way of automagically filling in the actual graph resolution in your query) and I set the Min step to anything larger than 5 seconds, two unwanted things happen:

  1. the resulting non-overlapping intervals will end up losing the increase between pairs of points that end up in different intervals; and
  2. because the interval is now larger than 5 seconds irate itself will only look at the last pair of points in the interval and drop everything else.

In other words, for what I need, irate is even worse than rate. And there is no iincrease function, as you point out, and most of the time I'm interested in the increase (i.e. how many requests I got and when).

@free After more thinking: I think really what we do with irate/idelta is close to this ; but you want to change it from "between the last 2 datapoint" to "between the initial value and the last datapoint".

I'd be fine with either replacing the irate/idelta implementations or defining new xrate/xdelta functions, as long as I can get a simple subtraction going. I mean it's right there in my face: I can do clamp_min(foo - foo offset $__interval, 0) but it takes twice as long to compute as increase because it has to look up foo twice, for crying out loud. And I know this will end up dropping the first increase from zero to whatever the counter is the first time we see it after the reset, but that's nothing compared to losing the increase between 2 data points every single interval.
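
To spell out the counter-reset problem with the offset subtraction (toy Go with made-up sample values): a restart makes the naive difference negative, clamp_min then zeroes out the whole window and the real increases are lost, whereas a reset-aware difference of the kind rate()/increase() use internally recovers them:

package main

import "fmt"

func main() {
	// A counter that resets (process restart) mid-window:
	// samples 10, 13, 2, 5, i.e. a reset between 13 and 2.
	samples := []float64{10, 13, 2, 5}

	// foo - foo offset ...: last minus first; the reset makes it negative
	// and clamp_min(..., 0) then throws the whole window away.
	naive := samples[len(samples)-1] - samples[0] // -5
	if naive < 0 {
		naive = 0
	}

	// Reset-aware difference, as rate()/increase() do internally:
	// whenever a sample drops, add the previous value back in.
	var correction, last float64
	for _, v := range samples {
		if v < last {
			correction += last
		}
		last = v
	}
	resetAware := samples[len(samples)-1] - samples[0] + correction

	fmt.Println(naive, resetAware) // 0 8
}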

@free (Contributor Author) commented Jan 29, 2018

Oh, and by the way @roidelapluie , I like your proposed naming, the x could stand for exact. LOL

@roidelapluie (Contributor) commented Jan 29, 2018

Maybe you can take that one step further and make a pull request, so we can continue the discussion around code?

@free (Contributor Author) commented Jan 29, 2018

Sure I can. I just wanted to get some buy-in before I put in any work, as I did on the TSDB, where I tried to work around it pushing data to disk and causing all of Prometheus to hang when disk writes block. I spent a couple of weeks working on that and the PR is still open 4 months later: prometheus/tsdb#149 .

However, if I do any work on this I'll try to avoid going into the selector/iterator and just extend the start of the interval by some fixed amount. Seeing how I have essentially already written the code for it a few comments up, it shouldn't be all that big an effort. It may take me a couple of days though, as I'm quite busy at work these days.

free added a commit to free/prometheus that referenced this issue Jan 30, 2018

@free (Contributor Author) commented Jan 30, 2018

OK, finally got an initial (but quite clean and complete) iteration done: #3760.

There is one difference from my initial proposal (as also described in the PR), namely not to adjust the increase by range / sampledInterval, as it would only add noise. If the xincrease range is not an "exact" multiple of the sampling interval, I believe it's preferable to get exact increase values (which are quite useful for jumpy counters) for the price of a sawtooth pattern for smoothly increasing counters. Rates (xrate and xdelta) are not affected by this, as they are always divided by the actual sampled interval, so they will be perfectly smooth for a perfectly smooth counter. Again, this is only an issue if the requested range is not a multiple of the scrape/eval interval.

The other thing I realized is that while xincrease/xrate/xdelta will work just fine at a resolution higher than that of the underlying data, you will still only get one data point for every underlying data point, and Grafana doesn't handle that very gracefully. E.g. for a timeseries foo with a resolution of 5 seconds, a range query for xrate(foo[1s]) with step=1 will return a timeseries with a 5 second resolution, but Grafana will still attempt to draw a point every 1 second, resulting in disjointed points. If your Grafana graph is configured to show points, that's perfectly fine, but it will not connect them by lines.

Other than that, in my relatively limited testing I haven't found any major issues. You could see some jitter if your start/end time match "exactly" the scrape times (e.g. when scraping times are clustered around whole 5 second multiples and some fall just before, some just after and your start/end times are also multiples of 5 seconds) but that's essentially a given and no issue at all with smoothly increasing counters.

@brian-brazil (Member) commented Jan 30, 2018

I appreciate that you are looking at this issue. Unfortunately, while considering your proposal I see multiple issues with it. We operate under a number of constraints, and as-is your solution cannot be adopted due to its knock-on effects and inconsistencies with the rest of PromQL.

The constraints we are under include:

  • Functions may only use the range vectors they are given
  • A PromQL evaluation at a given time may not look at data in the future relative to that time
  • rate() and friends must be resilient to scrape failures, jitter, and target lifecycle events
  • Multiple mildly different rate() functions would cause user confusion (such as the existing confusion we have with users using incorrectly delta for counters), and increased support load
  • Range vectors ignore stale markers
  • increase() is syntactic sugar for rate()
  • We cannot add breaking changes in Prometheus 2.x, which includes changing range vectors semantics

In addition calling a function "exact" is misleading as exact answers are not possible in metrics, only approximations.

If you have suggestions for how to improve the existing rate() function, keeping in mind the above constraints, I'll be happy to consider them for Prometheus 3.0. I suspect though that you will find it more fruitful to try and tackle this in Grafana.

The recommended way to use rate() is to provide a range that covers several scrape intervals. If, for example, you feel rate() should include data further in the past, you can expand the size of the range when writing your query. You also have the option of pulling the raw data via the API and performing whatever calculations you like. One of the reasons the idelta() function exists is to allow variants of rate() and friends to be implemented via recording rules without adding complexity to PromQL itself, so that's another approach to consider.

@free (Contributor Author) commented Jan 30, 2018

Wow, that was harsh...

Let me address your concerns one by one, then. Although, considering your readiness (or lack thereof) to discuss the validity of my concerns or solution and the fact that you threw everything but the kitchen sink at me, I'm sure it will be in vain.

We operate under a number of constraints, and as-is your solution cannot be adopted due to its knock-on effects and inconsistencies with the rest of PromQL.

I thought that was what code (and design) reviews were for. To bring the code into a shape that is widely agreed upon, not as a rubber stamp for "adopting code as-is".

Functions may only use the range vectors they are given

The only thing I can say to this is: then give functions that take range vectors a range vector that includes the last point before the range. Whichever functions don't need it can skip it. Et voila!

A PromQL evaluation at a given time may not look at data in the future relative to that time

My code never does that. I mentioned the possibility in my initial post, but it made it into neither my first pseudocode proposal nor the PR, and I explicitly and repeatedly pointed that out.

rate() and friends must be resilient to scrape failures, jitter, and target lifecycle events

xrate() and friends behave exactly the same as rate() and friends (minus the extrapolation and possibly plus an extra point in the beginning). So to the extent that rate() is resilient to all of the above, so is xrate().

Multiple mildly different rate() functions would cause user confusion (such as the existing confusion we have with users using incorrectly delta for counters), and increased support load

I don't know what to say to this except to suggest that one could replace the current rate() implementation; or replace irate(), because xrate() fully matches it over ranges of up to 2x the underlying resolution and in your very first reply you were arguing that hardcoding the range based on the underlying resolution is The Right Way™. Or deal with the fact that the current rate() implementation is broken and replacing it should probably not be done in one go.

Range vectors ignore stale markers

I'm not sure what you mean by that, but again, xrate() and friends behave exactly the same as rate() and friends (minus A, plus B etc.). So xrate() deals with stale markers in the exact same way and to the exact same extent that rate() does.

increase() is syntactic sugar for rate()

"Maybe" and "so what"? This is indeed true for the current implementation of rate() and increase() but it doesn't mean it's a good/useful idea (see my comments above) or that this needs to hold for xrate() and xincrease(). Or, if this turns out to be the last standing strawman, I can definitely make it so, even though all it would do is introduce noise.

We cannot add breaking changes in Prometheus 2.x, which includes changing range vectors semantics

My code makes no changes to range vector semantics whatsoever. AFAICT I've extended the range of the iterators underlying range vectors, but the range vectors themselves work the exact same way as before. As proof of that, existing PromQL tests all pass without modifications. I did extend the range vectors used by the x***() functions, but that only applies to those functions (and is explicitly pointed out in the code as a hack that needs a proper solution). No breaking change.

In addition calling a function "exact" is misleading as exact answers are not possible in metrics, only approximations.

I have no strong attachment to the name, I thought it was a good joke. I'd be perfectly fine with lame_rate() and lame_increase() as long as they did the right thing.

If you have suggestions for how to improve the existing rate() function, keeping in mind the above constraints, I'll be happy to consider them for Prometheus 3.0. I suspect though that you will find it more fruitful to try and tackle this in Grafana.

I feel like a broken record here, but the current rate() and particularly increase() functions cannot be fixed. As long as they are artificially limited to look only at points inside the range and attempt to extrapolate the data to the entire range, they are random number generators with the range and resolution of the underlying data as inputs.

Also, when talking about Grafana (or common sense, for that matter) a rate()/increase() range query using the same value for the range and step will throw away data and replace it with estimates. In what world does that make sense? And the only 2 arguments I ever heard for that behavior are:

  1. (Buried in the middle of the 40 minute video you thought was an appropriate response, linked above.) If the last point before the range is 13 minutes before the start of the range, then why should it affect the rate/increase? Well, if your first point inside the range is 13 minutes from the start of the range, how could extrapolating over those 13 minutes ever be expected to provide a sane answer?
  2. This is how we do it and we won't accept breaking changes. Or additions.

As for your suggestion to use idelta() and recorded rules (I'm guessing at the same resolution as scraped series), I'm sorry to break this to you, but you don't need idelta() for that. Or irate() or rate(). Pretty much all of Prometheus' functions can be reproduced with simple recorded rules, but (let me break another one) the whole point is to do it efficiently. Like I said, I can simply use foo - foo offset 5m for my needs but it is twice as slow as rate(foo[5m]) because it retrieves foo twice. Counter resets are the least of my worries when rate()/increase() are consistently off by 10% in the best case (more like 50-100% on average).

@leoluk commented Jan 30, 2018

I don't know enough about the Prometheus internals in order to comment on the particulars, but I do agree that having an "exact rate" function would be a very useful feature to have. [Edit - I misunderstood]

The default behavior of rate and irate is not very intuitive. It took me quite some time to really grasp the evaluation logic. I suggest we improve the documentation for this regardless of the outcome here and address this and the Grafana alignment issue.

I understand that a maintainer has to say "no" every so often, but I'd love to see this straightened out and merged!

@free (Contributor Author) commented Jan 30, 2018

To be completely fair, the "graphs change every time I refresh" problem should be fixed from the Grafana end, by optionally allowing you to align the start and end parameters of range_query to a multiple of $__interval.

However, that won't change the fact that the rate() and increase() implementation introduces significant errors and (given Grafana's model of providing you with $__interval to drop into your query and no support for time arithmetic -- which makes perfect sense, as that way lies madness) one counter increase lost every $__interval seconds.

@leoluk commented Jan 30, 2018

The Graphite issue is due to alignment if the sampling rate is higher than the evaluation rate, and what your merge request adds is the ability to compute "exact" rates without extrapolating, correct?

We definitely need more docs, I'm confused :P

@matejzero commented Jan 30, 2018

I'm in the same boat as @leoluk as far as confusion/discussions go (with the difference that I still haven't grasped the evaluation logic :)).

@free (Contributor Author) commented Jan 30, 2018

The Graphite issue is due to alignment if the sampling rate is higher than the evaluation rate, and what your merge request adds is the ability to compute "exact" rates without extrapolating, correct?

Let's not use the term "exact". As Brian pointed out, it's still an approximation (evaluation is never exactly aligned with scraping, scraping is never precisely spaced etc.), except I would argue it's a more correct approximation because it doesn't do any extrapolation, indeed. Instead it only looks at actual data points and takes the difference between them as the increase and that divided by the distance between the data points as the rate.

The current rate() implementation is way more complex, as it attempts to only look at datapoints that fall inside the specified range, but then it extrapolates them using various heuristics in order to smooth over the fact that it doesn't actually know what the increase over the requested range was (as in foo - foo offset 10m). When adding Grafana into the picture -- because the one thing that Grafana does reliably is to ask for $__interval spaced points while allowing you to insert the value of $__interval into your Prometheus query -- what the current implementation ends up doing is splitting the timeline into $__interval-length chunks and computing a rate for each of them, ignoring the change in counter value between one interval and the next. increase() is even worse: e.g. if each range contains 3 data points, it will compute a rate based on the 2 intervals it has, then multiply it by 3 intervals' worth of time, so you end up with 1.5 times the measured increase instead of the actual increase.

Whenever your rate/increase range is a multiple of the underlying timeseries' resolution, you will get jitter with any implementation unless the start and end times are aligned to multiples of the range, because the set of points that the rate/increase is computed over keeps shifting by one point. But with the proposed xrate()/xincrease() implementation, you don't have the problem of one data point lost for every interval.

In order for your graph to be unchanging, just shifting to the left, you need to either align start and end or (with xrate()/xincrease()) match the resolution of the underlying data. In the latter case you don't need any alignment, as the behavior is basically similar to irate(), except you don't have to use an artificially inflated range.

@brian-brazil (Member) commented Feb 2, 2018

I have explained the constraints that we are under, I'm afraid that I can only work with ideas that are in line with those constraints.

I also believe that the claimed benefits of this proposal are overstated. For example, it is claimed that it fixes aliasing, yet in various other places there are mentions of aliasing issues still arising (that's what that sawtooth is). This is unfortunately not the sort of problem that has a simple solution.

@free (Contributor Author) commented Feb 2, 2018

[ @beorn7, as the other half of the owners of this code, would you please be so kind and express an opinion? I understand you may not have the time, but you can definitely get the gist from the 80 lines of code in #3746 and Brian's comment from 3 days ago. If you have even more time, this one comment should cover my point, no need to go through the rest of the thread. ]

OK, let's go back to the constraints you listed and go through them one by one. I'll leave the one you listed first for last, because I feel that is really the only point of contention.

A PromQL evaluation at a given time may not look at data in the future relative to that time

Check.

rate() and friends must be resilient to scrape failures, jitter, and target lifecycle events

Check.

My aliasing (sawtooth) comment only applied to increase(), not rate(). And it was a conscious choice on my side, because I felt it's more useful for increase() to reflect the underlying data rather than either interpolate or extrapolate. If you feel strongly about it, it can definitely be eliminated; it requires exactly 2 lines of code.

Multiple mildly different rate() functions would cause user confusion (such as the existing confusion we have with users using incorrectly delta for counters), and increased support load

Based on the multitude of issues raised both here and on Grafana, I would argue that any user who took a close look at the rate/increase values produced by Prometheus is way more confused than they would be by 3 rate functions. Or 10, for that matter. See the comment by @matejzero, above. Also, I am not insistent on whether it should be a new function or a replacement for the existing one. It's entirely up to the maintainers.

Range vectors ignore stale markers

Check.

increase() is syntactic sugar for rate()

Can be if you insist, not a problem. Even though my preference is different and I don't really see why this is a constraint (other than that's how increase() is currently documented).

We cannot add breaking changes in Prometheus 2.x, which includes changing range vectors semantics

Check.

As mentioned above, my proposed change passes all existing PromQL tests without modification, meaning (among other things) the range vector semantics are the same. Or it could mean that none of the tests actually verifies vector semantics, which I would find hard to believe.

Functions may only use the range vectors they are given

And we finally get to the one argument that actually merits a discussion. All I'm arguing for is that functions should be given one extra point before the beginning of the range. They may choose to use it or not. In effect, the signature of xrate should really be xrate(foo offset 5m, foo[5m]), with the constraint that the same duration has to be passed to both arguments. It is a rate over one point from 5 minutes ago plus all points since 5 minutes ago. xrate(foo[5m]) is essentially syntactic sugar for xrate(foo offset 5m, foo[5m]). (See what I did there?)

Looking at it from a different angle, the current implementation of rate cannot reliably produce a rate for anything shorter than 2x the resolution of the underlying data, precisely because of this one limitation. But rate(foo[5s]) does have a very precise meaning when foo has 5 second resolution. Additionally, because of this same constraint, computing a rate over a period X with resolution X (which, again, has a very clear meaning outside of the Prometheus bubble) will lose up to half of the information it has. This is readily visible when refreshing a graph. In particular, in the case of Prometheus itself, looking at a graph of CPU usage with the highest resolution available (2x sampling resolution, because why not) will actually show you one of 2 very different graphs depending on when you load it. It will either show you (1) CPU usage from (in my case) the 5 second periods that include rule evaluation or (2) CPU usage from the 5 second periods when it's mostly idle.

I also believe that the claimed benefits of this proposal are overstated.

No, they are not. Let's go through each:

there are no weird fractional values where there shouldn't be any, e.g. if you're computing the increase in the number of requests over an interval.

The code I'm currently proposing never produces fractions for increase() when the underlying counter is effectively an integer. I.e. you never ever get 1.5 requests over any time period.

There is no sawtooth pattern with any smoothly increasing counter;

There isn't. rate() is always perfectly smooth; increase() is perfectly smooth with an evaluation interval that's a multiple (including 1x) of the scraping interval. In the current implementation increase() is not perfectly smooth if the evaluation interval is not a multiple of the scrape interval but (1) I am arguing that's a plus and (2) it can be changed to be perfectly smooth if you value smoothness over accuracy.

no jitter;

There is none. What I understand by jitter is for spikes to suddenly appear or disappear because information that was ignored (the increase between points in 2 different intervals) is now suddenly taken into account. All data is always taken into account. It may move from one interval into the next and the corresponding increase will be subtracted from one and added to the other, but never appear and disappear. Just basic sampling.

And it never happens with a range equal to 1x sampling resolution. Never ever. Not even when sampling failed a few times. It will with the current rate(), though.

no data dropped if you're computing a rate/increase over an interval of X seconds with a resolution of X seconds (which is the only thing Grafana knows how to do reliably).

Check. That was the whole point of the exercise.

And you can compute a rate/increase over any interval, regardless of the resolution of the underlying data: your graph doesn't suddenly disappear just because you zoomed in too far.

Check. It won't produce a value when there are zero points in the range, but that could also be changed. You will get one value for every data point of the underlying. query_range will even helpfully not return any NaN values, just one point for every underlying point. It's quite beautiful, in fact. And, if you have Grafana draw points (in addition to or instead of lines) you will also see them on the graph.

BTW, it also doesn't look into the future.

Check.

For example it is claimed that it fixes aliasing, yet in various other places there are mentions aliasing issues still arising (that's what that sawtooth is).

See my comment above. The sawtooth I mentioned only occurred for increase() when computed over a range that is not a multiple of the scraping interval, and that was by choice. I (and not necessarily everyone else) prefer it to [0, 0, 0, 1.5, 0, 0, 0] for increase(requests[1m]).

This is unfortunately not the sort of problem that has a simple solution.

But it does. That's the whole point. It's not a perfect solution (simply because there are more ways of looking at what should take priority, e.g. accurate vs. smooth increase) but simple it definitely is. Both in terms of concept -- you can cover all increases in an interval with a step equal to the range -- and implementation -- take the value difference between last and first, adjusted for counter resets, and divide by the time difference.

The current rate() implementation OTOH is a tortured, complex beast that can never produce accurate increase() values under any circumstances; and requires the use of recorded rules and express knowledge of the scrape interval in order to produce rate() values that neither overlap nor throw away data. And it will only do either at half resolution.

@beorn7 (Member) commented Feb 5, 2018

@free I'm sorry but I'm missing the time for a qualified response. The problem at hand is not simple. A whole bunch of people have discussed it at length in the past. @brian-brazil and I were the main actors in the end. We came from very different points, but ended up in perfect agreement (which is not always the case, to say the least). Even if you might have gotten the impression that @brian-brazil didn't care, I'm very confident that he spent a good deal of time and thought on your proposal, in good faith. It's highly unlikely that I would arrive at a different conclusion if I managed to make the time to get into this debate with the due diligence.

I also have to admit that the tone of this debate is disgusting. And that's definitely not @brian-brazil 's fault alone (to say the least). While I enjoy a friendly and factual debate simply for its academic sake, it would be no fun for me to participate here, where I sense a lot of exaggeration, arrogance, and mocking. As long as there is no compelling reason to force me into this debate, I will simply do things that are more important to me (or my employer, FWIW).

@free (Contributor Author) commented Feb 5, 2018

I'm sorry but I'm missing the time for a qualified response. The problem at hand is not simple.

It is indeed not simple if you choose to work under the constraints that Prometheus currently chooses to, namely not looking at the point just before the beginning of the range. Once you do consider that option, though, and you essentially shift the actual range back in time by up to one scraping interval, you get:

  1. exact increases (just subtract, accounting for resets);
  2. no data loss in the common case (all differences between data points are always accounted for); and
  3. you can average rates and sum increases over shorter ranges and get the exact same value as the rate/increase over the longer range (see the sketch after this list). I mean it literally all adds up.
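
Here is a toy model of that additivity (hypothetical Go, using the proposed "last sample at or before each boundary" semantics and ignoring counter resets for brevity): adjacent windows telescope, so four 15 second increases sum to exactly the one 60 second increase:

package main

import "fmt"

func main() {
	// A jumpy but monotonic counter sampled every 5s.
	incs := []float64{0, 1, 0, 0, 3, 0, 2, 0, 0, 0, 5, 1, 4}
	type sample struct{ t, v float64 }
	series := make([]sample, len(incs))
	var total float64
	for i, inc := range incs {
		total += inc
		series[i] = sample{float64(i * 5), total}
	}

	// Value of the last sample at or before time t.
	lastAtOrBefore := func(t float64) float64 {
		v := series[0].v
		for _, s := range series {
			if s.t <= t {
				v = s.v
			}
		}
		return v
	}

	// Proposed semantics: increase over (T-R, T], anchored one point back.
	xincrease := func(end, rng float64) float64 {
		return lastAtOrBefore(end) - lastAtOrBefore(end-rng)
	}

	var sum float64
	for end := 15.0; end <= 60; end += 15 {
		sum += xincrease(end, 15)
	}
	fmt.Println(sum, xincrease(60, 60)) // 16 16: it telescopes
}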

And from the point of view of Prometheus users it would make much more sense than the very complicated logic currently used, which by and large no one understands. Just look at how no one (Prometheus developer or otherwise) was even willing to touch my PR with a ten foot pole.

I also have to admit that the tone of this debate is disgusting.

Ouch, again. I admit I might have gotten carried away a couple of times, but "disgusting"? I mean I got 3 (virtually) one-liner answers from Brian and one from you before I sent out the PR and all that was said was:

  1. violation of PromQL execution model (arguable, but things can and should change when appropriate);
  2. Grafana + user should handle the various limitations and noise in the rate implementation (they just can't);
  3. this has already been discussed and we're not accepting changes (what can one possibly say to this?);
  4. a 40 minute video explaining 80 lines of code (I also have an employer, you know); and
  5. links to 3 issues longer than this one, never touching on the alternative I was proposing.

So I went ahead and wrote the code, just to get it rejected outright with "can't accept as-is". Plus a list of "constraints" some of which did not apply ("must be resilient to scrape failures, jitter, and target lifecycle events"), some actually describing problems with the current implementation ("increase is syntactic sugar for rate") and the rest subtle variations on "no breaking changes"/"no new functions" (logic would say these last two could be summarized as "no changes").

The subject of whether this is actually a better implementation from the point of view of the user was never even touched on. It was all a laundry list of implementation details, constraints and how this was not going to be accepted anyway. So, in a way yes, I guess "disgusting" is quite appropriate.

@Perlovka commented Feb 6, 2018

Completely agree with the topic. I've spent the last two days trying to get more or less accurate data out of Grafana + Prometheus and failed. And this is definitely not a Grafana problem. Such a good performance and data collection mechanism and ... no way to get the actual data?! I just can't understand this. And yes, calculating increase() via rate() was just a brilliant idea 🤦‍♂️

@free (Contributor Author) commented Feb 6, 2018

@Perlovka, one workaround I am using requires the use of recorded rules, but once you understand the implementation of rate()/increase(), it allows you to essentially "reverse engineer" a solution.

Namely, assuming a scraping interval of 5 seconds and a rule evaluation interval of 10 seconds you define a rule along the lines of:

instance:foo:increase_10s = increase(foo[15s]) / 1.5

What this ends up doing is:

  1. loads the last 3 values (last 2 intervals); the 15s more or less ensures this;
  2. computes a rate that is essentially the rate over the 2 intervals (it does extrapolation and divides by 15, but it's essentially the same as no extrapolation divided by 10);
  3. multiplies it by 15 to get you an extrapolated increase (which is basically 1.5 times the actual increase); and finally
  4. you divide it by 1.5 and get the actual increase over the last ~10 seconds.

The nice part (if you ignore the weird complexity of it all) is that you can then do sum_over_time(instance:foo:increase_10s[$__interval]) to get an accurate increase over any multiple of 10 seconds in Grafana. (Don't forget to set Min step to 10s, though.) The downside, obviously, is that you must define recorded rules for every counter you intend to graph.

You can similarly do 5 second resolution if you do rule evaluation every 5 seconds:

instance:foo:increase_5s = increase(foo[10s]) / 2

Or rate:

instance:foo:rate_10s = rate(foo[15s])  # no need to divide here

and then use avg_over_time(instance:foo:rate_10s[$__interval]) in Grafana to get an accurate rate over arbitrary multiple of 10s intervals.

However, I don't know of any way of doing rate() or increase() with arbitrary resolution and without data loss in the absence of a recorded rule.

@Perlovka commented Feb 7, 2018

@free , thanks for your efforts, you've done a lot of work. The only thing I do not understand is why the developers are hiding behind "constraints" and "perfect agreements" to do nothing. I find this incompatible with production systems. I just don't have so much time to check every function and its results. A monitoring system that gives you "guessed" results is nonsense IMHO. Well, I can understand that I might not get "exact results" because some piece of data has not arrived or been collected yet. But guessing results from already received data and calling it a "feature"... 🤕
I'd rather spend my time building a monitoring system on top of ClickHouse than fighting against mythic "constraints". Thank you again.

@towolf commented Mar 22, 2018

There's only one thing to be said here: if my dashboards have "dancing data" in them every time they refresh, then trust in this software (and the hype around it) is maybe unjustified and I should go back to Graphite.

@espro commented Apr 2, 2018

So I've come across this thread because I have graphs showing absolutely nothing one second, then peaks everywhere the next, all depending on my viewing resolution and whichever way the wind is blowing. I dislike interacting with people who fall back on "this is how it is" as justification - I cannot believe the patience @free has had in not just discussing the issue, but trying to implement a change to improve this software. I'm still just at the evaluation phase of Prometheus but I mean...I have to look elsewhere now. This is ridiculous.

@matejzero commented Apr 3, 2018

@espro did you try @free's patch? Does that work? If so, free is thinking about maintaining a fork of Prometheus with xrate implemented.

Not an ideal solution, but better than jumping graphs if you still want to use Prometheus.

@roganartu (Contributor) commented Apr 3, 2018

It was only released 5 days ago (so I wouldn't exactly call it stable or prod-ready), but the step normalisation feature of https://github.com/Comcast/trickster looks like it should solve the jumping graphs too. Might be worth watching or experimenting with.

@sandersaares commented Apr 11, 2018

What would be a use case for a non-extrapolating xrate()?

What would be a use case for a non-extrapolating xincrease()?

I can see how they would not fix aliasing but I wonder if some more realistic examples with actual description of the data and queries might highlight the benefits if they exist. It feels potentially useful to me - certainly having "2" or a series of "0" as in the OP seems suboptimal, if the example is true - but I am unsure about the real world cases. Is the example in the OP actually real? Will Prometheus return me a flat 0 line in that case?

What sort of data would actually benefit from this type of analysis?

@haizaar commented Apr 11, 2018

Hi Sanders,

Here is a real-life example of this issue: https://groups.google.com/d/msg/prometheus-users/GLKaSYQN_gI/OGqeuNjhEAAJ
We are gradually instrumenting all of our systems with p8s metrics and I hit a wall here, not being able to get any meaningful graphs out of a couple of slowly changing counters.

@free (Contributor Author) commented Apr 11, 2018

For a use case, pick any counter you may have lying around that doesn't increase smoothly (e.g. an error counter, assuming you only have errors every now and then) and create a graph of its increase or rate of increase. Refresh. Notice how the graph skips around.

To be more specific, assuming you have a counter error_counter that you collect once every minute. Go to your Prometheus instance (or Grafana) and add a graph of

rate(error_counter[2m])

with a 2m (or 120s) resolution. Now if you refresh your graph every minute you'll be swapping back and forth between 2 completely disjoint graphs. None of the spikes you see on one graph will be on the other. It may be that you get two increases in your counter in two successive scrapes and you'll see spikes in the same place on the two graphs, but that's because you see one increase on one graph and the other increase on the other graph. They will be multiplied by 2, so (in this very particular case) it will appear that both increases are taken into account, but in the general case (one single increase, with the counter constant to the left and the right of it) you'll get a spike equivalent to double the actual rate on one graph and zero on the other graph.

As for specific use cases, I've created a second (toned down) issue -- #3806 -- after this one got closed and linked to a number of graphs from there: https://free.github.io/xrate/index.html

@free (Contributor Author) commented Apr 11, 2018

To clarify Zaar's use of closed vs open intervals, the problem is not specifically a point falling at exactly 15:30, which would actually get included in one of the two intervals, so strictly speaking the intervals are closed at both ends.

Rather, the problem is that the counter increase between the points immediately before and after 15:30 will not get included in either interval (unless they happen to be the same point with exactly 15:30 as timestamp). So whenever you're calculating rates or increases over what appear to be adjoining intervals (as you would, for example max_over_time(foo[30m]) with 30m resolution, where every single value gets included), you are actually throwing away one counter increase for every value you are computing.

This is why rate() as currently implemented requires extrapolation, because otherwise it would always underestimate the actual rate by the amount it just threw away (which is proportional to the ratio of the eval range to scrape interval). But this only works with smoothly increasing counters (where it doesn't really make a difference whether you look at 10 successive increases or 9 successive increases plus extrapolation).

As soon as you get any real-world counter, you'll get over- and under-estimates, growing larger the closer the range gets to the scraping interval. Here is where the recommendations to use rate() over ranges at least 2.5x the scrape interval (I've also seen 4-5x in other places) come from. In theory there is no good reason against computing a rate or increase at a resolution equal to the scraping interval.
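
Schematically, the trade-off looks like this (toy Go with a perfectly smooth 1 unit/second counter scraped every 5 seconds): without extrapolation, dividing the in-window difference by the range under-reports by sampledInterval/range, which is exactly the gap that extrapolation papers over:

package main

import "fmt"

func main() {
	// Perfectly smooth counter: 1 unit/second, scraped every 5s,
	// queried as rate(foo[60s]).
	const scrape, rng = 5.0, 60.0
	samples := int(rng / scrape)                   // 12 samples in (T-60s, T]
	sampledInterval := float64(samples-1) * scrape // they span only 55s
	rawDiff := sampledInterval * 1.0               // counter grew 55 units

	// Without extrapolation: divide the in-window difference by the range.
	naive := rawDiff / rng // 55/60 = 0.9166...: systematic under-report

	// With extrapolation: scale the difference up to the full range first,
	// which is the same as dividing by the actually sampled interval.
	extrapolated := rawDiff * (rng / sampledInterval) / rng // 1.0: spot on
	fmt.Println(naive, extrapolated)
}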

@haizaar commented Apr 12, 2018

Thanks Alin,

I reread your original post and I confess my complaint about non-inclusive ranges has absolutely no grounds. Therefore I've edited my previous comment to remove it.

Apologies for pouring oil into this fire.

@free (Contributor Author) commented Apr 12, 2018

I think your interval argument is actually a good analogy, but with regards to increases, not points. I might have used it in one form or another myself. (I've been going on about this thing for so long now that I don't remember all I've said.)

My argument (and others might not agree or at least have not addressed it directly) is that the important thing about a counter is the increase between successive collections. It's only with a gauge that you want the absolute value (e.g. how much RAM was I using at time t). So in the same way you want a sum_over_time(memory_usage[30m]) to include all absolute values within those 30 minutes (and then be able to further aggregate those values across longer periods), you also want an increase(error_counter[30m]) to include all increases within those 30 minutes (and be able to aggregate that further across longer periods). In terms of increases included in the calculation you do want the set of those to be [now-30m, now) rather than (now-30m, now).

So if we take the definition of "last increase" to be the difference between the last and next to last values of a counter (which is kind of obvious, I guess), then the definition of increase over the last 30 minutes has to be the difference between the last value and the last value before now-30m. Or in PromQL terms

error_counter - error_counter offset 30m
@sylr (Contributor) commented Nov 13, 2018

If you want exact results you need to use a logs-based monitoring system, not a metrics-based one.

We want functions that output results that make sense.

In the real world, I find myself explaining every day to my ~200 colleagues, who have their eyes glued to the corporate Grafana and who expect big round integers in the XXXX volume graph, that even with counters, because Prometheus extrapolates everything, you can only have estimations. And believe me, that does not satisfy them at all.

@intelfx commented Nov 29, 2018

We operate under a number of constraints

What good do they do?

If those "constraints" come into a disagreement with real-world usecases, you'd think it's time to reconsider the "constraints".

@free (Contributor Author) commented Dec 4, 2018

FYI, I am actually maintaining a Prometheus fork at free/prometheus that simply adds xrate and xincrease functions, and I am attempting to mirror Prometheus releases. I am using it in a production environment with no known issues, but YMMV.

@sandersaares commented Dec 4, 2018

I found this topic hard to parse before, but the examples above have convinced me. If the fork continues to actively track mainline Prometheus, I plan to make the switch with our next infrastructure upgrade. Thanks for maintaining it!

It would be quite sensible to see this additive enhancement in the mainline product, though. I urge Prometheus maintainers to reconsider this, especially as in its xrate/xincrease form it does not affect any existing non-exact measurement logic.

@jmwilkinson commented Apr 9, 2019

Reading this thread is like if someone gave you an easel and paints and you're like "cool! now i can paint all the stuff" but sometimes you need to paint trees and there's no green, and so you're like "hey, could maybe we add some green, i made a green that works pretty good" and get told "no, green doesn't vibe with our rules about color that we made up please just mix yellow and blue or use a green canvas"

So, yeah, we can do a bunch of stuff to make rates behave in a predictable, expected manner, but really prometheus should just have green.
