
FR: Make rate() / increase() algorithmically more efficient by allowing precomputation of "counters-with-monotonicity-corrections-already-applied" #2655

Closed
virusdave opened this Issue Apr 26, 2017 · 16 comments


virusdave commented Apr 26, 2017

While there are functions specialized for monotonic counters (rate, increase, etc.), they tend to be performance-pessimistic. That is, they require an ad hoc scan of all datapoints throughout the time range of interest to be of use.

It would be great if one could transform a monotonic counter being recorded into a post-corrected counter, perhaps as a recording-time transformation.

Basically, a recording rule like:

monotonic_counter[now] = 
  monotonic_counter[prev] + 
  ( counter[prev] > counter[now] ? counter[now] : (counter[now] - counter[prev]) )

This doesn't seem possible right now, but if this were possible, then expensive calculations like increase() over larger windows would be reduced to a constant-time single subtraction (monotonic - (monotonic offset window)) rather than a linear operation over window (which increase() seems to be now). This is obviously a significant (algorithmic) improvement.
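To illustrate the recording rule above, here is a minimal sketch (Python, hypothetical names, not actual Prometheus code) of folding counter resets into a reset-corrected cumulative series:

```python
# Hypothetical sketch (not Prometheus code): fold counter resets into a
# reset-corrected cumulative series. Once this exists, increase() over any
# window reduces to subtracting two endpoint values.

def reset_corrected(samples):
    """samples: raw counter values in time order.
    Returns a monotonic series with resets folded in."""
    corrected = []
    total = 0.0
    prev = None
    for v in samples:
        if prev is None:
            total = v          # first sample seeds the series
        elif v < prev:
            total += v         # reset: new value is the increase since the reset
        else:
            total += v - prev  # normal case: add the delta
        prev = v
        corrected.append(total)
    return corrected

raw = [0, 5, 12, 3, 9]           # the counter resets between 12 and 3
mono = reset_corrected(raw)      # [0, 5, 12, 15, 21]
assert mono[-1] - mono[0] == 21  # increase over the window: one subtraction
```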

This could potentially also be done as a query-time optimization, completely invisible to the user.

This problem manifests itself most immediately for me when trying to build Grafana dashboards showing percentiles of various distributions (latency, sizes, etc.).

The problem can be partially mitigated (really, slightly masked) by precomputing the percentiles, but this requires choosing a discrete set of window durations ahead of time. It also doesn't actually solve the problem; it just amortizes it across each sample. In fact, since sampling potentially happens more often than a particular dashboard graph is rendered, this might even be substantially more expensive.

The discussion leading to this request (and ideas on how it could be handled) is at https://matrix.to/#/!HaYTjhTxVqshXFkNfu:matrix.org/$1493164205657354HHxeu:matrix.org

juliusv commented Apr 26, 2017

Simple example:

rate(foo[$window])

If $window is large and the graph range is large, then it can be very costly to recompute the long rate again and again at every graph time step, instead of loading all underlying counter data once, "rectifying" it over the entire graph range, and then only applying something akin to delta() at each time step.

Complications:

  • counters lose integer precision at 2^53
  • rate() and increase() have semi-complicated extrapolation behaviors that would need to be preserved; it's not clear whether that would be a problem.
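To make the potential saving concrete, here is a minimal sketch assuming a reset-corrected cumulative series already exists. It deliberately ignores the extrapolation behavior noted above, so it is only an approximation of what rate() actually returns:

```python
import bisect

# Hypothetical sketch: given timestamps and an already reset-corrected
# cumulative series, the rate at each graph step needs only two binary
# searches instead of a scan over every sample in the window. This ignores
# Prometheus's extrapolation behavior, so it only approximates rate().

def rate_at(timestamps, corrected, t, window):
    lo = bisect.bisect_left(timestamps, t - window)  # first sample in window
    hi = bisect.bisect_right(timestamps, t) - 1      # last sample in window
    if hi <= lo:
        return None  # fewer than two samples in the window
    dt = timestamps[hi] - timestamps[lo]
    return (corrected[hi] - corrected[lo]) / dt

ts = [0, 15, 30, 45, 60]
mono = [0.0, 5.0, 12.0, 15.0, 21.0]  # reset-corrected counter values
assert rate_at(ts, mono, 60, 60) == 21.0 / 60  # full-range rate
assert rate_at(ts, mono, 60, 30) == 9.0 / 30   # last-30s rate
```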
brian-brazil commented Apr 26, 2017

This is not something we can solve within Prometheus, at the least this would require perfect long term storage.

Precomputation via recording rules is the recommended way to solve this.

> but this requires choosing a discrete set of window durations ahead of time.

This is something you normally do anyway. You want a single range for all your rate functions inside one Prometheus, preferably across your entire organisation.

virusdave commented Apr 26, 2017

@brian-brazil
Thanks for responding!

I'm not sure that strongly suggesting (or weakly imposing) a necessary choice of fixed windows is a great idea. But regardless, precomputing functions of aggregations across windows of particular sizes requires access to historical data anyway, doesn't it? Given that we already need to support historical data access for that, I'm asking for a slightly more general type of historical data access alongside it, which enables an extremely powerful complexity optimization.

brian-brazil commented Apr 26, 2017

> precomputing functions of aggregations across windows of particular sizes requires access to historical data anyway, doesn't it?

No, it only requires our current imperfect approach. The design you propose requires perfect historical data, which is not within the scope of our design (or possible in general), since it can't handle historical data being briefly unavailable.

virusdave commented Apr 26, 2017

Hmm, ok. What about the thought that @juliusv was leading towards: implementing this idea as an optimization strategy in query execution? The major issue is that once the post-increase()/rate() of a particular timeseries is calculated, the (nearly) same work is often done repeatedly, for instance, to populate dashboard graphs.

Perhaps caching the results of one computation temporarily for shortcutting most of the computation of the inevitable next query (with a slight offset) would be doable? This could be transparent to the user, require no additional syntax, and be done as an optimization strategy by the query engine. It would definitely require the engine to have knowledge of how to make use of partially-precomputed results, but that seems reasonable.

Or perhaps allowing a specification of a temporary precomputed series that would exist for the "duration of a transaction of other queries" would be possible? I can think of lots of uses for something like this!

brian-brazil commented Apr 26, 2017

> The major issue is that once the post-increase()/rate() of a particular timeseries is calculated, the (nearly) same work is often done repeatedly, for instance, to populate dashboard graphs.

This won't work due to our extrapolation behaviour, which Julius already postulated.

> Perhaps caching the results of one computation temporarily for shortcutting most of the computation of the inevitable next query (with a slight offset) would be doable?

This won't work for the same reason, plus you have to worry about cache invalidation.

> Or perhaps allowing a specification of a temporary precomputed series that would exist for the "duration of a transaction of other queries" would be possible? I can think of lots of uses for something like this!

You can use recording rules for this.

virusdave commented Apr 26, 2017

> > Or perhaps allowing a specification of a temporary precomputed series that would exist for the "duration of a transaction of other queries" would be possible? I can think of lots of uses for something like this!
>
> You can use recording rules for this.

This requires preordained choice of what you'd want to do this for, which very much limits the power of it (and could be very expensive if not all possible choices are used often).

juliusv commented Apr 26, 2017

@brian-brazil

> This won't work due to our extrapolation behaviour, which Julius already postulated.

The only part of that which seems like it would currently create a problem if one pre-corrected counters is this: https://github.com/prometheus/prometheus/blob/master/promql/functions.go#L85-L97

I haven't stared at it for long enough to see if there's an easy workaround.

> This won't work for the same reason, plus you have to worry about cache invalidation.

For a range query, we load the entire graph range for each series once from storage and then run the expression on it at different timesteps. Pre-correcting counter resets over the entire loaded series (the expensive part of rate()/increase()) should thus be possible without running into crazy caching problems or such. As mentioned, the extrapolation behavior would need to be preserved.

> You can use recording rules for this.

Recording rules suck (need to be configured, don't backfill history, require you to query differently). I'm just saying, if there's an easy way to optimize this under the hood for rates/increases with long range windows over large graph windows, that would be good.

brian-brazil commented Apr 26, 2017

> For a range query, we load the entire graph range for each series once from storage and then run the expression on it at different timesteps.

With new storage we iterate over data rather than pre-loading it, so there'd be memory implications here.

> if there's an easy way to optimize this under the hood for rates/increases with long range windows over large graph windows, that would be good.

We're getting into fairly advanced query engine optimisation here. There's lower hanging fruit.

juliusv commented Apr 26, 2017

> With new storage we iterate over data rather than pre-loading it, so there'd be memory implications here.

That'll make it more complicated, alright.

In general I'm not saying that we should do this right away, but that it's a useful optimization we might want to consider sometime in the future. The efficiency gains at least for the mentioned use case would be very large.

beorn7 commented Apr 26, 2017

> With new storage we iterate over data rather than pre-loading it,

As far as I can see, there is no difference in this regard. Only in the call at https://github.com/prometheus/prometheus/blob/master/promql/functions.go#L58 do we materialize the data from iterators, and this works the same in the new storage.

The problem is more that we materialize/decode the data again for each data point. With a long range in the range expression, each materialized matrix will have a large overlap, so there would be a large gain from caching. I think this touches on the discussion we had in person some weeks ago with @fabxc about optimizations in the query layer.

brian-brazil commented Mar 20, 2018

Based on work in #3966, I wouldn't expect any benefit from this. rate() is already faster than changes().

free commented May 11, 2018

Here's a simpler alternative: for each counter remember when it was last reset. Past that timestamp you can use subtraction, without needing to iterate over intermediary samples. It doesn't even have to be very precise: a (reliable) pessimistic approximation would give most of the benefits.
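A minimal sketch of this idea (Python, hypothetical names, not actual TSDB code):

```python
# Hypothetical sketch of the "remember the last reset" idea: track a
# pessimistic last-reset timestamp per series at ingestion time. Any query
# range lying entirely after that timestamp can take the fast path
# (endpoint subtraction) instead of scanning every sample for resets.

class ResetTracker:
    def __init__(self):
        self.last_reset = float("-inf")
        self.prev = None

    def observe(self, ts, value):
        if self.prev is not None and value < self.prev:
            self.last_reset = ts  # counter reset detected while ingesting
        self.prev = value

    def fast_path_ok(self, range_start):
        # Pessimistic: only claim "no resets" when the whole range lies
        # strictly after the last observed reset.
        return range_start > self.last_reset

tracker = ResetTracker()
for ts, v in [(0, 10), (15, 20), (30, 2), (45, 8)]:
    tracker.observe(ts, v)
assert not tracker.fast_path_ok(20)  # the reset at ts=30 falls inside the range
assert tracker.fast_path_ok(31)      # range after the reset: subtract endpoints
```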

However, given the current TSDB implementation, it is unlikely to provide a performance improvement unless you're computing rates/increases over 120+ samples (quite the opposite actually, since you may end up decoding the same chunk twice).

But with the addition of a cache in front of the TSDB, it may work quite nicely.

brian-brazil commented May 11, 2018

With OpenMetrics we'll actually gain that information, though how to integrate it sanely with PromQL is unclear and the data may not be 100% reliable due to races. With #3966 we should only be decoding each chunk once.

free commented May 11, 2018

I couldn't find anything regarding that in the OpenMetrics repository, but I'm guessing it has to do with the scrape target providing this information. What I was referring to was for Prometheus (TSDB?) to maintain an in-memory threshold (doesn't even need to be persisted). The equivalent of a cache, if you want.

You can always go back and do the expensive calculation regardless of whether the counter actually got reset or not, but after a few hours of Prometheus uptime you'll probably take the fast path 99.9% of the time (particularly with rules).

But yeah, given #3966, unless you're doing binary search over the decoded data AND the range is large enough for this to be significantly faster than iteration, it's not a significant performance boost.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019
