
rate()/increase() extrapolation considered harmful #3746

Closed
free opened this issue Jan 26, 2018 · 56 comments

@free
Contributor

free commented Jan 26, 2018

The problem

[Feel free to skip ahead to the proposed solution, as I'm guessing the problem is widely understood. Or just read the last couple of paragraphs for a real-life example of the problem and my current workaround.]

I've been trying to deal with the errors introduced by the rate() and increase() functions for so long that I keep going back every few months and spend a couple of hours tweaking Grafana dashboards before I remember that there is literally no way of getting an accurate result out of them short of using recorded rules for everything and, even in that case, working around and against Prometheus.

To clarify, let's assume I have a monitoring configuration with scrapes every 5 seconds and rule evaluation every 10 seconds. It collects among other things a request_count counter, that increases by 1 every 20 seconds. Let's also assume that I would really like a Grafana graph that displays an accurate number of requests over time, on a dashboard with an adjustable time range. (I.e. I want to be able to look at 15 minutes, 1 hour and 7 days on the same dashboard, with Grafana automatically adjusting the resolution.) Not a tall order, right?

For the purpose of clarity, here is what the request_count samples would look like (at a 5 second resolution):

[0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, ...]

The naive approach is to add a graph and put in an expression along the lines of:

increase(request_count[$__interval])

and set Min step to 10s (I would prefer a Min step of 5s, since the resolution of the underlying data is 5 seconds, but let's not even go there yet). Ideally the graphed values (with a 10 second resolution) would look something like:

[0, 1, 0, 1, 0, 1, 0, 1, 0, ...]

What I'm getting instead, depending on the exact time I reload the page, is one of:

[0, 2, 0, 2, 0, 2, 0, 2, 0, ...]

or

[0, 0, 0, 0, 0, 0, 0, 0, 0, ...]

This is because increase() ends up calculating the increase between disjoint pairs of values (i.e. (v0, v1), (v2, v3), (v4, v5)), losing half of the data in the process, and extrapolating each result to 10 seconds, ending up with double the actual increase. At a 15 second resolution it only throws away a third of the data, and the increases it doesn't throw away get multiplied by 1.5 due to the extrapolation. As the resolution decreases, less data gets thrown away and the multiplier gets closer to 1, but there is, as I said before, literally no way to get an accurate rate()/increase() value in a range query directly from a counter. rate() is not affected by the multiplier issue, but it still suffers from thrown-away data and noise.
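To make the arithmetic concrete, here is a minimal standalone sketch (plain Go, not the actual Prometheus code, just the same math) of what extrapolation does to a 10 second window that happens to contain two samples 5 seconds apart with an increase of 1 between them:

package main

import "fmt"

func main() {
  // Two samples inside the 10s window, 5 seconds apart, counter went up by 1.
  var (
    rawIncrease     = 1.0  // v1 - v0
    sampledInterval = 5.0  // seconds between the first and last sample in the window
    windowRange     = 10.0 // the [10s] range of the query
  )

  // Roughly what the extrapolation does when the samples sit well inside the
  // window: scale the observed increase up to the full window length.
  extrapolated := rawIncrease * (windowRange / sampledInterval)

  fmt.Println(extrapolated) // 2 -- double the real increase of 1
}

The real extrapolatedRate() has additional boundary heuristics, but this scaling is where the factor of 2 (or 1.5 at a 15 second range) comes from.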

Even with less discrete data, such as CPU usage, the graph jumps every 5 seconds between one shape and another. In particular, if I'm looking at Prometheus' own CPU usage, because my rule evaluation happens every 10 seconds but takes less than 5 seconds, the graph keeps jumping between an almost flat 10% (when the pairs picked don't include rule evaluation) and an almost flat 20% (when the pairs do include rule evaluation).

The workaround I've come up with is to compute a recorded rule request_count_increase10s that I can then sum_over_time(), which is actually computed as

    increase(request_count[15s]) / 1.5

so all increases in request_count actually get included in the calculation (because every 15 second range includes 3 data points, i.e. 2 intervals). But even with this cumbersome solution, I still have to fight against the implementation (the / 1.5 part) and the extrapolation will introduce unnecessary noise whenever the underlying timeseries neither increases constantly nor looks like a staircase.

Proposed solution

In an ideal (but by no means unattainable) world, what would happen instead of extrapolation is ... wait for it ... interpolation. I.e. just before the matrix for the underlying samples is evaluated in promql.functions.extrapolatedRate(), the range could be expanded so it included one point before the start and one point after the end of the interval. This could be either:

  1. approximated by expanding the interval by a delta: a fixed amount (say 1 minute), a duration proportional to the time range or a combination of the two; or
  2. explicitly provided by MatrixSelector/Engine/SeriesIterator (e.g. as a SeekBefore(t) or Previous() function).

With the points immediately outside the interval now available there are 2 possible approaches:

  1. interpolate the values at the edges of the range and use the difference between the interpolated values (modulo counter resets) to compute the increase/rate; or
  2. (better) use the difference between the last point before the beginning of the range and the last point in the range (again, modulo counter resets) to compute the increase/rate.

The latter has the advantage that there are no weird fractional values where there shouldn't be any, e.g. if you're computing the increase in the number of requests over an interval.
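As a rough sketch of approach 2 applied to the staircase counter from above (toy code, not the actual promql implementation; the helper name is made up):

package main

import "fmt"

// increaseOverWindow: toy version of approach 2 -- the increase over a window
// is the last sample inside the window minus the last sample at or before the
// window start (counter resets are ignored in this sketch).
func increaseOverWindow(samples []float64, start, width int) float64 {
  return samples[start+width] - samples[start]
}

func main() {
  // request_count at 5 second resolution, +1 every 20 seconds (the staircase from above).
  samples := []float64{0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4}

  // Non-overlapping 10 second windows, i.e. 2 sample intervals per step; each
  // window also gets to see the sample just before its start.
  for i := 0; i+2 < len(samples); i += 2 {
    fmt.Print(increaseOverWindow(samples, i, 2), " ")
  }
  fmt.Println() // 1 0 1 0 1 0 1 0 -- every increase counted exactly once
}

Because every window also sees the sample just before its start, no increase is ever dropped and the per-window values sum to the counter's true total increase.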

I am willing to put in the legwork if anyone thinks it makes sense. To me it makes a lot of sense to use actual data instead of fabricated (extrapolated) data when computing rates/increases (which, I don't have to point out but I will, is the whole reason for the existence of counters). But it's possible it's only me.

Additional thoughts

If you're not tired by now, there are a couple more things I should add.

For one, I am aware that one could hardcode the increase range and adjust the result accordingly into the Grafana dashboard, but over a high enough range this will result in even worse data sampling, as you'd now be seeing increase(foo[15s]) with a 30 minute resolution. Sometimes this is acceptable, most of the time it's not.

Second, there is another workaround that only works for increase(foo[10s]), namely using foo - foo offset 10s. But there are a couple of problems with this approach: it doesn't handle counter resets and it's twice as slow because now Prometheus looks up foo twice. (Or, for whatever reason, it's twice as slow as increase(), even with a 5 minute range over samples with 5 second resolution.)

There is yet another Grafana workaround, which is to define a min step (graph resolution) of e.g. 5s and an interval template variable with a minimum value of 10s and try to guesstimate the Step count for the template variable so that it stays at about 2x the graph resolution. But then you can't even apply the 1.5 or 2 factor, because it varies with resolution, not to mention the whole thing is screen resolution dependent.

@brian-brazil
Contributor

the range could be expanded so it included one point before the start and one point after the end of the interval.

This would be a violation of the PromQL execution model, a range only returns data from its range. Evaluation should also not look into the future.

For one, I am aware that one could hardcode the increase range and adjust the result accordingly into the Grafana dashboard

This is the recommended way to deal with this. This is something to handle in Grafana.

@free
Contributor Author

free commented Jan 26, 2018

the range could be expanded so it included one point before the start and one point after the end of the interval.

This would be a violation of the PromQL execution model, a range only returns data from its range. Evaluation should also not look into the future.

I don't know much (if anything) about the PromQL execution model, but whatever it is I don't think it's a good reason for making up data instead of using real data. And if the execution model does in fact trump common sense, there is no good reason for having extrapolation. If my 10 second interval has 2 data points 5 seconds apart, then the increase is the difference between the 2 points and the rate is the difference divided by 5.

For one, I am aware that one could hardcode the increase range and adjust the result accordingly into the Grafana dashboard

This is the recommended way to deal with this. This is something to handle in Grafana.

Could you please explain how I can get a graph of counter increases that works just as well over 15 minutes and 7 days using a fixed increase range? I would be all set if I could write

sum_over_time((increase(foo[10s]) / 2)[$__interval])

but that won't work, will it?

@brian-brazil
Contributor

Could you please explain how I can get a graph of counter increases that works just as well over 15 minutes and 7 days using a fixed increase range?

You can't, which is why this is something that needs to be handled in the thing sending the requests to Prometheus.

Our current algorithm has been extensively debated and tested, it is very unlikely that any changes will be made.

@free
Contributor Author

free commented Jan 26, 2018

You can't, which is why this is something that needs to be handled in the thing sending the requests to Prometheus.

There are a couple of problems with that. One, Grafana or whoever creates the dashboard must know the exact resolution of the underlying data in order to be able to produce an accurate workaround for Prometheus' extrapolation.

And second, what Grafana would have to do to get that accurate value out of Prometheus is (assuming an original resolution of 5 seconds) something along the lines of:

increase(foo[$__interval + 5s]) * ($__interval / ($__interval + 5s))

Does that look like a reasonable solution to anyone?

@free
Contributor Author

free commented Jan 26, 2018

Also, when you recommend Grafana as the dashboarding solution in your documentation, you don't get to pass the buck to "the thing sending the requests to Prometheus". Sorry.

@brian-brazil
Contributor

All I can do is recommend is watching https://www.youtube.com/watch?v=67Ulrq6DxwA

Several of the things that you consider to be bugs are in fact features, non-integral values are correct for example. If you want exact results you need to use a logs-based monitoring system, not a metrics-based one.

@beorn7
Member

beorn7 commented Jan 26, 2018

Our current algorithm has been extensively debated and tested,

For reference: The discussion and implementation can be found in the following PRs:

@free
Contributor Author

free commented Jan 26, 2018

I found and went through a couple of those before. But all I can see is a lot of discussion about extrapolating data and one quick dismissal of actually including the last value before the start of the range, on two counts: that it wouldn't provide a lower bound anymore (which the current implementation doesn't either) and that with a range that is not a multiple of the scrape interval you'd get a sawtooth pattern.

Fundamentally, it feels inconsistent that we got rid of interpolation for individual sample values, but still do extrapolation for rate and friends. However, for individual sample values, we always take the latest value before evaluation time. So arguably, we should start the range for a rate with the latest value before the range starts, and then take it from there. In that way, we would not systematically under-report anymore, but we would also not return a lower bound, and we would still have the weird step function behavior described in the previous item, now at least centered around the “true” value of the rate. -- From #1161

So I'm asking you: what is the benefit of using extrapolated data when interpolated data is available? Or actual data, without interpolation, shifted by less than the scrape interval?

I mean, if (ignoring counter resets for now) you take the difference between the last point in the range and the last point before the range and compute a rate from that (by dividing by the actual distance between the 2 points), then multiply by the range length you'll get a smooth rate, smooth increase and no aliasing. What the $#@% is wrong with that? How is it worse than guesstimating by extrapolation?

You can still apply limits in special cases (e.g. if the series starts or ends in the middle of the range, then adjust the rate to reflect the ratio of the range that's actually covered by the data) but overall (i) you have fewer special cases, (ii) it's faster to compute and (iii) you're not replacing information with noise. Why, you could even get a graph of increases that you could add up and would match the actual increase in the counter.

@free
Contributor Author

free commented Jan 26, 2018

To be more concrete, what specifically is wrong with this particular solution?

func extrapolatedRate(ev *evaluator, arg Expr, isCounter bool, isRate bool) Value {
  ms := arg.(*MatrixSelector)
  // XXX: Also request the last point before the beginning of the range.
  ms.IncludeLastPointBeforeRange = true

  var (
    matrix       = ev.evalMatrix(ms)
    rangeStart   = ev.Timestamp - durationMilliseconds(ms.Range+ms.Offset)
    rangeEnd     = ev.Timestamp - durationMilliseconds(ms.Offset)
    resultVector = make(Vector, 0, len(matrix))
  )

  for _, samples := range matrix {
    if len(samples.Points) < 2 {
      continue
    }

    sampledInterval := float64(samples.Points[len(samples.Points)-1].T-samples.Points[0].T) / 1000
    averageDurationBetweenSamples := sampledInterval / float64(len(samples.Points)-1)
    sampleIntervalThreshold := averageDurationBetweenSamples * 1.1

    // XXX: If the last point before the range is too far from the range start, drop it.
    firstSample := 0
    if float64(rangeStart-samples.Points[0].T)/1000 > sampleIntervalThreshold {
      if len(samples.Points) < 3 {
        continue
      }
      firstSample = 1
      sampledInterval = float64(samples.Points[len(samples.Points)-1].T-samples.Points[1].T) / 1000
    }

    var (
      counterCorrection float64
      lastValue         float64
    )
    // XXX: Unrelated: only need to do the magic dance for counters.
    if isCounter {
      for i := firstSample; i < len(samples.Points); i++ {
        sample := samples.Points[i]
        if sample.V < lastValue {
          counterCorrection += lastValue
        }
        lastValue = sample.V
      }
    }
    resultValue := samples.Points[len(samples.Points)-1].V - samples.Points[firstSample].V + counterCorrection

    // Duration between last sample and boundary of range.
    durationToEnd := float64(rangeEnd-samples.Points[len(samples.Points)-1].T) / 1000

    if samples.Points[firstSample].T < rangeStart && durationToEnd < sampleIntervalThreshold {
      // XXX: Default case, the samples start just before the range start
      // and end just before the range end: the rate is the difference divided
      // by the sampled interval and the increase is the rate times the range length.
      if isRate {
        resultValue = resultValue / sampledInterval
      } else {
        resultValue = resultValue * (ms.Range.Seconds() / sampledInterval)
      }
    } else {
      // XXX: The samples don't cover the whole range: the increase is the
      // difference between the last and first values, the rate is the increase
      // divided by the range length.
      if isRate {
        resultValue = resultValue / ms.Range.Seconds()
      }
    }

    resultVector = append(resultVector, Sample{
      Metric: dropMetricName(samples.Metric),
      Point:  Point{V: resultValue, T: ev.Timestamp},
    })
  }
  return resultVector
}

What it does is it requests an extra point before the start of the range and (unless it's more than average interval * 1.1 away from the start of the range) uses the actual difference between the last point in the interval and this point as the base resultValue instead of trying to extrapolate anything.

Then, if the samples approximately cover the whole range (within the same average interval * 1.1 before either end) the rate is computed as resultValue divided by the actual sampled interval and the increase is that times the range.

Else, if the samples only cover part of the range, the increase is exactly the difference between last and first and the rate is that divided by the range.

There is no sawtooth pattern with any smoothly increasing counter; no jitter; no data dropped if you're computing a rate/increase over an interval of X seconds with a resolution of X seconds (which is the only thing Grafana knows how to do reliably). And you can compute a rate/increase over any interval, regardless of the resolution of the underlying data: your graph doesn't suddenly disappear just because you zoomed in too far. BTW, it also doesn't look into the future.

And the only downside AFAICT is that it's "a violation of the PromQL execution model", whatever that means.

Also, I understand if you want to keep the current behavior for rate/increase, for compatibility with rules and dashboards that try hard to work around it. Just add new functions, something along the lines of sensible_rate/sensible_increase. </wink>

And finally, I get quite tired of the "if you want accurate numbers use logs" argument. I'm not asking for accurate metrics in the face of restarts, timeouts or sampling. Just to have PromQL produce an exact difference between 2 data points X seconds away from each other without having to jump through hoops. It feels literally like someone decided that every arithmetic operation should replace the least significant 90% of the result with garbage. Because that's what I get unless I fight Prometheus tooth and nail.

@roidelapluie
Member

@free when I want that precision level it often means that I can use irate or idelta. I miss an equivalent increase function, however.

@roidelapluie
Member

roidelapluie commented Jan 28, 2018

@free After more thinking: I think that what we do with irate/idelta is really close to this; but you want to change it from "between the last 2 datapoints" to "between the initial value and the last datapoint".

In that sense, besides irate/idelta we could add another prefixed function like xrate/xdelta.

@free
Contributor Author

free commented Jan 29, 2018

@free when I want that precision level it often means that I can use irate or idelta. I miss a increase function however.

The reason irate/idelta work better than rate/delta/increase is that they don't do extrapolation, which is what I'm arguing here. That being said, they don't do what I need: a Grafana graph of rate or increase over time that's not tied to a fixed resolution.

My data has 5 second resolution, but if I do irate(foo[5s]) I get a blank graph. If I do irate(foo[$__interval]) (which is Grafana's way of automagically filling in the actual graph resolution in your query) and I set the Min step to anything larger than 5 seconds, two unwanted things happen:

  1. the resulting non-overlapping intervals will end up losing the increase between pairs of points that end up in different intervals; and
  2. because the interval is now larger than 5 seconds irate itself will only look at the last pair of points in the interval and drop everything else.

In other words, for what I need, irate is even worse than rate. And there is no iincrease function, as you point out, and most of the time I'm interested in the increase (i.e. how many requests I got and when).

@free After more thinking: I think really what we do with irate/idelta is close to this ; but you want to change it from "between the last 2 datapoint" to "between the initial value and the last datapoint".

I'd be fine with either replacing the irate/idelta implementations or defining new xrate/xdelta functions, as long as I can get a simple subtraction going. I mean it's right there in my face: I can do clamp_min(foo - foo offset $__interval, 0) but it takes twice as long to compute as increase because it has to look up foo twice, for crying out loud. And I know this will end up dropping the first increase from zero to whatever the counter is the first time we see it after the reset, but that's nothing compared to losing the increase between 2 data points every single interval.

@free
Contributor Author

free commented Jan 29, 2018

Oh, and by the way @roidelapluie , I like your proposed naming, the x could stand for exact. LOL

@roidelapluie
Member

Maybe you can take that one step further and make a pull request, so we can continue the discussion around code?

@free
Contributor Author

free commented Jan 29, 2018

Sure I can. I just wanted to get some buy-in before I put in any work, as I did on the TSDB, trying to work around it pushing data to disk and causing all of Prometheus to hang when disk writes block. I spent a couple of weeks working on that and the PR is still open 4 months later: prometheus-junkyard/tsdb#149 .

However, if I do any work on this I'll try to avoid going into the selector/iterator and just extend the start of the interval by some fixed amount. Seeing how I have essentially already written the code for it a few comments up, it shouldn't be all that big an effort. It may take me a couple of days though, as I'm quite busy at work these days.

free added a commit to free/prometheus that referenced this issue Jan 30, 2018
@free
Contributor Author

free commented Jan 30, 2018

OK, finally got an initial (but quite clean and complete) iteration done: #3760.

There is one difference from my initial proposal (as also described in the PR), namely not to adjust the increase by range / sampledInterval, as it would only add noise. If the xincrease range is not an "exact" multiple of the sampling interval, I believe it's preferable to get exact increase values (which are quite useful for jumpy counters) for the price of a sawtooth pattern for smoothly increasing counters. Rates (xrate and xdelta) are not affected by this, as they are always divided by the actual sampled interval, so they will be perfectly smooth for a perfectly smooth counter. Again, this is only an issue if the requested range is not a multiple of the scrape/eval interval.

The other thing I realized is that while xincrease/xrate/xdelta will work just fine at a resolution higher than that of the underlying data, you will still only get one data point for every underlying data point and Grafana doesn't handle that very gracefully. E.g. for a timeseries foo with a resolution of 5 seconds, a range query for xrate(foo[1s]) and step=1 will return a timeseries with a 5 second resolution, but Grafana will still attempt to draw a point every 1 second, resulting in disjointed points. If your Grafana graph is configured to show points, that's perfectly fine, but it will not connect them with lines.

Other than that, in my relatively limited testing I haven't found any major issues. You could see some jitter if your start/end time match "exactly" the scrape times (e.g. when scraping times are clustered around whole 5 second multiples and some fall just before, some just after and your start/end times are also multiples of 5 seconds) but that's essentially a given and no issue at all with smoothly increasing counters.

@brian-brazil
Contributor

I appreciate that you are looking at this issue. Unfortunately, while considering your proposal I see multiple issues with it. We operate under a number of constraints, and as-is your solution cannot be adopted due to its knock-on effects and inconsistencies with the rest of PromQL.

The constraints we are under include:

  • Functions may only use the range vectors they are given
  • A PromQL evaluation at a given time may not look at data in the future relative to that time
  • rate() and friends must be resilient to scrape failures, jitter, and target lifecycle events
  • Multiple mildly different rate() functions would cause user confusion (such as the existing confusion we have with users incorrectly using delta for counters), and increased support load
  • Range vectors ignore stale markers
  • increase() is syntactic sugar for rate()
  • We cannot add breaking changes in Prometheus 2.x, which includes changing range vectors semantics

In addition calling a function "exact" is misleading as exact answers are not possible in metrics, only approximations.

If you have suggestions for how to improve the existing rate() function, keeping in mind the above constraints, I'll be happy to consider them for Prometheus 3.0. I suspect though that you will find it more fruitful to try and tackle this in Grafana.

The recommended way to use rate() is to provide a range that covers several scrape intervals. If, for example, you feel rate() should include data further in the past, you can expand the size of the range when writing your query. You also have the option of pulling the raw data via the API and performing whatever calculations you like. One of the reasons the idelta() function exists is to allow variants of rate() and friends to be implemented via recording rules without adding complexity to PromQL itself, so that's another approach to consider.
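For reference, "pulling the raw data via the API" looks roughly like the sketch below: an instant query with a range selector returns the raw samples, and the reset handling is then a few lines of client code. (The server address and metric name are placeholders, error handling is trimmed, and the JSON layout follows the documented /api/v1/query response format.)

package main

import (
  "encoding/json"
  "fmt"
  "net/http"
  "net/url"
  "strconv"
)

type apiResponse struct {
  Data struct {
    Result []struct {
      Metric map[string]string `json:"metric"`
      Values [][2]interface{}  `json:"values"` // [unix_timestamp, "value"]
    } `json:"result"`
  } `json:"data"`
}

func main() {
  // An instant query with a range selector returns the raw samples of the
  // last 5 minutes ("prom:9090" and request_count are placeholders).
  u := "http://prom:9090/api/v1/query?query=" + url.QueryEscape(`request_count[5m]`)
  resp, err := http.Get(u)
  if err != nil {
    panic(err)
  }
  defer resp.Body.Close()

  var r apiResponse
  if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
    panic(err)
  }

  for _, series := range r.Data.Result {
    var increase, last float64
    for i, v := range series.Values {
      val, _ := strconv.ParseFloat(v[1].(string), 64)
      if i > 0 {
        if val < last {
          increase += val // counter reset: the post-reset value is the increase since the reset
        } else {
          increase += val - last
        }
      }
      last = val
    }
    fmt.Println(series.Metric, increase)
  }
}

This only covers the increase between the first and last raw sample of the window; whether that is acceptable, and how to glue it into a dashboard, is exactly what the rest of this thread is arguing about.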

@free
Contributor Author

free commented Jan 30, 2018

Wow, that was harsh...

Let me address your concerns one by one, then. Even though, considering your readiness to discuss the validity of my concerns or solution and the fact that you threw everything but the kitchen sink at me, I'm sure it will be in vain.

We operate under a number of constraints, and as-is your solution cannot be adopted due to its knock-on effects and inconsistencies with the rest of PromQL.

I thought that was what code (and design) reviews were for. To bring the code into a shape that is widely agreed upon, not as a rubber stamp for "adopting code as-is".

Functions may only use the range vectors they are given

The only thing I can say to this is to then give functions that take range vectors a range vector that includes the last point before the range. Whichever functions don't need it, can skip it. Et voila!

A PromQL evaluation at a given time may not look at data in the future relative to that time

My code never does that. I mentioned that in my initial post, but it was neither in my first pseudocode proposal nor in the PR, and I explicitly and repeatedly pointed that out.

rate() and friends must be resilient to scrape failures, jitter, and target lifecycle events

xrate() and friends behave exactly the same as rate() and friends (minus the extrapolation and possibly plus an extra point in the beginning). So to the extent that rate() is resilient to all of the above, so is xrate().

Multiple mildly different rate() functions would cause user confusion (such as the existing confusion we have with users incorrectly using delta for counters), and increased support load

I don't know what to say to this except to suggest that one could replace the current rate() implementation; or replace irate(), because xrate() fully matches it over ranges of up to 2x the underlying resolution and in your very first reply you were arguing that hardcoding the range based on the underlying resolution is The Right Way™. Or deal with the fact that the current rate() implementation is broken and replacing it should probably not be done in one go.

Range vectors ignore stale markers

I'm not sure what you mean by that, but again, xrate() and friends behave exactly the same as rate() and friends (minus A, plus B etc.). So xrate() deals with stale markers in the exact same way and to the exact same extent that rate() does.

increase() is syntactic sugar for rate()

"Maybe" and "so what"? This is indeed true for the current implementation of rate() and increase() but it doesn't mean it's a good/useful idea (see my comments above) or that this needs to hold for xrate() and xincrease(). Or, if this turns out to be the last standing strawman, I can definitely make it so, even though all it would do is introduce noise.

We cannot add breaking changes in Prometheus 2.x, which includes changing range vectors semantics

My code makes no changes to range vector semantics whatsoever. AFAICT I've extended the range of the iterators underlying range vectors, but the range vectors themselves work the exact same way as before. As proof of that, existing PromQL tests all pass without modifications. I did extend the range vectors used by the x***() functions, but that only applies to those functions (and is explicitly pointed out in the code as a hack that needs a proper solution). No breaking change.

In addition calling a function "exact" is misleading as exact answers are not possible in metrics, only approximations.

I have no strong attachment to the name, I thought it was a good joke. I'd be perfectly fine with lame_rate() and lame_increase() as long as they did the right thing.

If you have suggestions for how to improve the existing rate() function, keeping in mind the above constraints, I'll be happy to consider them for Prometheus 3.0. I suspect though that you will find it more fruitful to try and tackle this in Grafana.

I feel like a broken record here, but the current rate() and particularly increase() functions cannot be fixed. As long as they are artificially limited to look only at points inside the range and attempt to extrapolate the data to the entire range, they are random number generators with the range and resolution of the underlying data as inputs.

Also, when talking about Grafana (or common sense, for that matter) a rate()/increase() range query using the same value for the range and step will throw away data and replace it with estimates. In what world does that make sense? And the only 2 arguments I ever heard for that behavior are:

  1. (Buried in the middle of the 40 minute video you thought was an appropriate response, linked above.) If the last point before the range is 13 minutes before the start of the range, then why should it affect the rate/increase? Well, if your first point inside the range is 13 minutes from the start of the range, how could extrapolating over those 13 minutes ever be expected to provide a sane answer?
  2. This is how we do it and we won't accept breaking changes. Or additions.

As for your suggestion to use idelta() and recorded rules (I'm guessing at the same resolution as scraped series), I'm sorry to break this to you, but you don't need idelta() for that. Or irate() or rate(). Pretty much all of Prometheus' functions can be reproduced with simple recorded rules, but (let me break another one) the whole point is to do it efficiently. Like I said, I can simply use foo - foo offset 5m for my needs but it is twice as slow as rate(foo[5m]) because it retrieves foo twice. Counter resets are the least of my worries when rate()/increase() are consistently off by 10% in the best case (more like 50-100% on average).

@leoluk

leoluk commented Jan 30, 2018

I don't know enough about the Prometheus internals to comment on the particulars, but I do agree that having an "exact rate" function would be a very useful feature to have. [Edit - I misunderstood]

The default behavior of rate and irate is not very intuitive. It took me quite some time to really grasp the evaluation logic. I suggest we improve the documentation for this regardless of the outcome here and address this and the Grafana alignment issue.

I understand that a maintainer has to say "no" every so often, but I'd love to see this straightened out and merged!

@free
Contributor Author

free commented Jan 30, 2018

To be completely fair, the "graphs change every time I refresh" problem should be fixed from the Grafana end, by optionally allowing you to align the start and end parameters of range_query to a multiple of $__interval.
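The alignment itself is trivial to do on the client side; here is a sketch of the idea (the timestamps and the 10 second step are just illustrative):

package main

import "fmt"

// alignRange snaps the start and end of a range query to multiples of the
// step, so a refresh re-evaluates the same windows instead of windows that
// are shifted by a few seconds.
func alignRange(start, end, step int64) (int64, int64) {
  return start - start%step, end - end%step
}

func main() {
  start, end := int64(1517000003), int64(1517003603) // unix seconds
  step := int64(10)                                  // $__interval = 10s
  fmt.Println(alignRange(start, end, step))          // 1517000000 1517003600
}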

However, that won't change the fact that the rate() and increase() implementation introduces significant errors and (given Grafana's model of providing you with $__interval to drop into your query and no support for time arithmetic -- which makes perfect sense, as that way lies madness) one counter increase lost every $__interval seconds.

@leoluk

leoluk commented Jan 30, 2018

The Grafana issue is due to alignment if the sampling rate is higher than the evaluation rate, and what your merge request adds is the ability to compute "exact" rates without extrapolating, correct?

We definitely need more docs, I'm confused :P

@matejzero

I'm in the same boat as @leoluk as far as confusion/discussions go (with the difference that I still haven't grasped the evaluation logic :)).

@free
Contributor Author

free commented Jan 30, 2018

The Grafana issue is due to alignment if the sampling rate is higher than the evaluation rate, and what your merge request adds is the ability to compute "exact" rates without extrapolating, correct?

Let's not use the term "exact". As Brian pointed out, it's still an approximation (evaluation is never exactly aligned with scraping, scraping is never precisely spaced etc.), except I would argue it's a more correct approximation because it doesn't do any extrapolation, indeed. Instead it only looks at actual data points and takes the difference between them as the increase and that divided by the distance between the data points as the rate.

The current rate() implementation is way more complex, as it attempts to only look at datapoints that fall inside the specified range, but then it extrapolates them using various heuristics in order to smooth over the fact that it doesn't actually know what the increase over the requested range was (as in foo - foo offset 10m). When adding Grafana into the picture -- because the one thing that Grafana does reliably is to ask for $__interval spaced points while allowing you to insert the value of $__interval into your Prometheus query -- what the current implementation ends up doing is it splits the timeline into $__interval-length chunks and computes a rate for each of them, ignoring the change in counter value between one interval and the next. increase() is even worse, because e.g. if each range contains 3 data points, it will compute a rate based on the 2 intervals it has and then multiply it by 3 intervals, so you end up with 1.5 times the measured increase instead of the actual increase.

As long as your rate/interval range is a multiple of the underlying timeseries' resolution, you will get jitter with any implementation unless the start and end times are aligned to multiples of the range, because the set of points that the rate/interval is computed over keeps shifting by one point. But with the proposed xrate()/xincrease() implementation, you don't have the problem of one data point lost for every interval.

In order for your graph to be unchanging, just shifting to the left, you need to either align start and end or (with xrate()/xincrease()) match the resolution of the underlying data. In the latter case you don't need any alignment, as the behavior is basically similar to irate(), except you don't have to use an artificially inflated range.

@brian-brazil
Contributor

I have explained the constraints that we are under, I'm afraid that I can only work with ideas that are in line with those constraints.

I also believe that the claimed benefits of this proposal are overstated. For example, it is claimed that it fixes aliasing, yet in various other places there are mentions of aliasing issues still arising (that's what that sawtooth is). This is unfortunately not the sort of problem that has a simple solution.

@free
Contributor Author

free commented Feb 2, 2018

[ @beorn7, as the other half of the owners of this code, would you please be so kind and express an opinion? I understand you may not have the time, but you can definitely get the gist from the 80 lines of code in #3746 and Brian's comment from 3 days ago. If you have even more time, this one comment should cover my point, no need to go through the rest of the thread. ]

OK, let's go back to the constraints you listed and go through them one by one. I'll leave the one you listed first for last, because I feel that is really the only point of contention.

A PromQL evaluation at a given time may not look at data in the future relative to that time

Check.

rate() and friends must be resilient to scrape failures, jitter, and target lifecycle events

Check.

My aliasing (sawtooth) comment only applied to increase(), not rate(). And it was a conscious choice on my side, because I felt it's more useful for increase() to reflect the underlying data rather than either interpolate or extrapolate. If you feel strongly about it, it can definitely be eliminated; it requires exactly 2 lines of code.

Multiple mildly different rate() functions would cause user confusion (such as the existing confusion we have with users incorrectly using delta for counters), and increased support load

Based on the multitude of issues raised both here and on Grafana, I would argue that any user who took a close look at the rate/increase values produced by Prometheus is way more confused than they would be by 3 rate functions. Or 10, for that matter. See the comment by @matejzero, above. Also, I am not insistent on whether it should be a new function or a replacement for the existing one. It's entirely up to the maintainers.

Range vectors ignore stale markers

Check.

increase() is syntactic sugar for rate()

Can be if you insist, not a problem. Even though my preference is different and I don't really see why this is a constraint (other than that's how increase() is currently documented).

We cannot add breaking changes in Prometheus 2.x, which includes changing range vectors semantics

Check.

As mentioned above, my proposed change passes all existing PromQL tests without modification, meaning (among other things) the range vector semantics are the same. Or it could mean that none of the tests actually verifies vector semantics, which I would find hard to believe.

Functions may only use the range vectors they are given

And we finally get to the one argument that actually merits a discussion. All I'm arguing for is that functions should be given one extra point before the beginning of the range of the interval. They may choose to use it or not. In effect, the signature of xrate should really be xrate(foo offset 5m, foo[5m]), with the constraint that the same duration has to be passed to both arguments. It is a rate over one point from 5 minutes ago plus all points since 5 minutes ago. xrate(foo[5m]) is essentially syntactic sugar for xrate(foo offset 5m, foo[5m]). (See what I did there?)

Looking at it from a different angle, the current implementation of rate cannot reliably produce a rate for anything shorter than 2x the resolution of the underlying data, precisely because of this one limitation. But rate(foo[5s]) does have a very precise meaning when foo has 5 second resolution. Additionally, because of this same constraint, computing a rate over a period X with resolution X (which, again, has a very clear meaning outside of the Prometheus bubble) will lose up to half of the information it has. This is readily visible when refreshing a graph. In particular, in the case of Prometheus itself, looking at a graph of CPU usage with the highest resolution available (2x sampling resolution, because why not) will actually show you one of 2 very different graphs depending on when you load it. It will either show you (1) CPU usage from (in my case) the 5 second periods that include rule evaluation or (2) CPU usage from the 5 second periods when it's mostly idle.

I also believe that the claimed benefits of this proposal are overstated.

No, they are not. Let's go through each:

there are no weird fractional values where there shouldn't be any, e.g. if you're computing the increase in the number of requests over an interval.

The code I'm currently proposing never produces fractions for increase() when the underlying counter is effectively an integer. I.e. you never ever get 1.5 requests over any time period.

There is no sawtooth pattern with any smoothly increasing counter;

There isn't. rate() is always perfectly smooth; increase() is perfectly smooth with an evaluation interval that's a multiple (including 1x) of the scraping interval. In the current implementation increase() is not perfectly smooth if the evaluation interval is not a multiple of the scrape interval but (1) I am arguing that's a plus and (2) it can be changed to be perfectly smooth if you value smoothness over accuracy.

no jitter;

There is none. What I understand by jitter is for spikes to suddenly appear or disappear because information that was ignored (the increase between points in 2 different intervals) is now suddenly taken into account. All data is always taken into account. It may move from one interval into the next and the corresponding increase will be subtracted from one and added to the other, but never appear and disappear. Just basic sampling.

And it never happens with a range equal to 1x sampling resolution. Never ever. Not even when sampling failed a few times. It will with the current rate(), though.

no data dropped if you're computing a rate/increase over an interval of X seconds with a resolution of X seconds (which is the only thing Grafana knows how to do reliably).

Check. That was the whole point of the exercise.

And you can compute a rate/increase over any interval, regardless of the resolution of the underlying data: your graph doesn't suddenly disappear just because you zoomed in too far.

Check. It won't produce a value when there are zero points in the range, but that could also be changed. You will get one value for every data point of the underlying. query_range will even helpfully not return any NaN values, just one point for every underlying point. It's quite beautiful, in fact. And, if you have Grafana draw points (in addition to or instead of lines) you will also see them on the graph.

BTW, it also doesn't look into the future.

Check.

For example it is claimed that it fixes aliasing, yet in various other places there are mentions aliasing issues still arising (that's what that sawtooth is).

See my comment above. That sawtooth I mentioned only occurred for increase() when computed over a range that is not a multiple of the scraping interval and that was by choice. I (and not necessarily everyone else) prefer it to [0, 0, 0, 1.5, 0, 0, 0] for increase(requests[1m]).

This is unfortunately not the sort of problem that has a simple solution.

But it does. That's the whole point. It's not a perfect solution (simply because there are more ways of looking at what should take priority, e.g. accurate vs. smooth increase) but simple it definitely is. Both in terms of concept -- you can cover all increases in an interval with a step equal to the range -- and implementation -- take the value difference between last and first, adjusted for counter resets, and divide by the time difference.

The current rate() implementation OTOH is a tortured, complex beast that can never produce accurate increase() values under any circumstances; and requires the use of recorded rules and express knowledge of the scrape interval in order to produce rate() values that neither overlap nor throw away data. And it will only do either at half resolution.

@sylr
Contributor

sylr commented Nov 13, 2018

If you want exact results you need to use a logs-based monitoring system, not a metrics-based one.

We want functions that output results that make sense.

In the real world you have me explaining every day to my ~200 colleagues, who have their eyes glued to the corporate Grafana and expect big round integers in the XXXX volume graph, that even with counters, because Prometheus extrapolates everything, you can only have estimations. And believe me, that does not satisfy them at all.

@intelfx

intelfx commented Nov 29, 2018

We operate under a number of constraints

What good do they do?

If those "constraints" come into a disagreement with real-world usecases, you'd think it's time to reconsider the "constraints".

@free
Contributor Author

free commented Dec 4, 2018

FYI, I am actually maintaining a Prometheus fork at free/prometheus, that simply adds xrate and xincrease functions, and attempting to mirror Prometheus releases. I am using it in a production environment with no known issues, but YMMV.

@sandersaares

sandersaares commented Dec 4, 2018

I found this topic hard to parse before but the examples above have convinced me. If the fork will continue to actively track mainline Prometheus, I plan to make the switch with our next infrastructure upgrade. Thanks for maintaining it!

It would be quite sensible to see this additive enhancement in the mainline product, though. I urge Prometheus maintainers to reconsider this, especially as in its xrate/xincrease form it does not affect any existing non-exact measurement logic.

@jmwilkinson

Reading this thread is like if someone gave you an easel and paints and you're like "cool! now I can paint all the stuff", but sometimes you need to paint trees and there's no green, and so you're like "hey, could we maybe add some green, I made a green that works pretty well" and get told "no, green doesn't vibe with our rules about color that we made up, please just mix yellow and blue or use a green canvas".

So, yeah, we can do a bunch of stuff to make rates behave in a predictable, expected manner, but really prometheus should just have green.

@brandond

Just piling on to say that I'm switching over to the @free fork to address similar issues I've been encountering with historical data 'changing' between refreshes in grafana. I find it pretty disappointing that @brian-brazil and @beorn7 can't seem to bring themselves to accept that there is a legitimate problem that many users have, with an accepted fix that has to be maintained out-of-tree due to what appear to be solely philosophical issues.

How many folks need to switch to using a fork before y'all accept that this is an actual problem that your users want solved?

sylr pushed a commit to sylr/prometheus that referenced this issue May 10, 2019
Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>
@blak3mill3r

Wait... I will have to install a fork to get reasonable rate of change computation? That's inconvenient and I probably will just put up with the current behavior...

It makes good sense that you need N+1 samples to compute rates at N points spaced at the polling interval without losing information.

While it's true that Prometheus is not about drawing graphs, it sounds like the current rate behavior actually prevents me from getting "stable looking" rate graphs (unless I settle for the rather unsavory increase(foo[15s]) / 1.5 workaround).

This design constraint about not including 1 more point outside the range should be reexamined, I think. What exactly does it protect?

@roidelapluie
Member

The main issue is that a Prometheus range only returns samples from inside the range.

We would need another notation to make this happen:

rate(mymetric[5m])

vs

rate(mymetric]5m])

but that feels not good

@sylr
Contributor

sylr commented May 18, 2019

The main issue is that a Prometheus range only returns samples from inside the range.

@free 's patch proves that it does not have to.

@roidelapluie
Member

roidelapluie commented May 18, 2019

That is why that implementation is not correct. Because in its implementation, [5m] is ]5m] and returns values outside the boundaries.

@sylr
Contributor

sylr commented May 18, 2019

That is why that implementation is not correct. Because in its implementation, [5m] is ]5m] and returns values outside the boundaries.

Understood, it still only does that in a context where it's needed.

In any case, this has been an issue a lot of people have complained about for a long time now, and we did not see anyone else proposing something acceptable to the maintainers' standards, let alone implementing it. So you say it's not correct; we say it works and does what we expect it to do.

@roidelapluie
Member

I do not think anyone has proposed the inclusion in PromQL of a syntax like rate(metric]5m]) up to this day.

@sylr
Contributor

sylr commented May 18, 2019

You're right, but since you said that it "feels not good", I thought you had dismissed it yourself.

@sylr
Contributor

sylr commented May 18, 2019

And unless I'm mistaken, if we take the mathematical notation for intervals, the current equivalent of PromQL's [5m] would be ]A,B], which translates to A < x <= B.

So, in addition to looking weird, it's the exact opposite of the mathematical notation.

@sandersaares

The main issue is that a Prometheus range only returns samples from inside the range.
We would need another notation to make this happen:

I do not think the latter necessarily follows from the former. You could make a special case for xrate() functions if this is really the only place it is meaningful.

@free
Contributor Author

free commented May 29, 2019

You could make a special case for xrate() functions if this is really the only place it is meaningful.

I would argue it's not xrate() only, it's counter only.

Intervals, as currently defined, work perfectly for gauges: e.g. if I want the average memory usage over 5 minutes, then all relevant samples for that calculation are contained strictly within that 5 minute interval.

But with counters (as pointed out in all Prometheus documentation, slides, blog posts and videos) the relevant information is not the absolute value, but the change in that value (i.e. I don't care how many seconds the CPU was in use since whenever the process started, I want to know how many seconds it was in use since the previous scrape; or over the previous 5 minutes). But because counters are (understandably) stored by Prometheus just the same as gauges (i.e. the absolute value is stored rather than the increase), then it follows that computation using counters needs to handle the samples differently from computation using gauges. So (to make a parallel to the average memory usage over 5 minutes example above) computing the average CPU usage over the previous 5 minutes requires all counter increases over the previous 5 minutes, not merely the counter's absolute values over those 5 minutes.

And just as you can't possibly compute the last counter increase by looking at the last sample only, you can't compute the first counter increase in a 5 minute range by looking at the first sample in that 5 minute range only. So computing all counter increases over 5 minutes requires all samples in that 5 minute range plus the previous sample.

Which I take to mean that an interval is currently defined consistently in terms of Go code (all samples within [t - range, t]), which is nice for Prometheus developers. But it's defined inconsistently in terms of semantics: it is [t - range, t] for gauges; and [t - range + scrape_interval, t] or [t - range + eval_interval, t] for counters. Which is bad for Prometheus users.

[I am aware that Prometheus does not internally keep track of which timeseries is a gauge and which is a counter. But fortunately Prometheus functions are clearly identified as "gauge functions" and "counter functions" (e.g. delta vs rate). So it should be really obvious from the PromQL context which "implementation" of interval is needed where.]

sylr pushed a commit to sylr/prometheus that referenced this issue May 31, 2019
Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>
sylr pushed a commit to sylr/prometheus that referenced this issue Jul 5, 2019
Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>
sylr pushed a commit to sylr/prometheus that referenced this issue Sep 3, 2019
Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>
sylr pushed a commit to sylr/prometheus that referenced this issue Oct 11, 2019
Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>
sylr pushed a commit to sylr/prometheus that referenced this issue Nov 15, 2019
Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>
lock bot locked and limited conversation to collaborators Nov 25, 2019