Sub query support for range selection. #1227
An interesting question is how we choose the interval for this; currently it's not needed, as everything is calculated at a point. We should also look at how to discourage/prevent use of this in rules/alerts on performance grounds. |
👍 |
Currently we can run a query like |
That's incorrect usage, you need to do the rate separately first or you'll get the wrong answer around counter resets. I'd also suggest that the exporter do either total&hit or total&miss, as that's a bit easier to work with. |
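The ordering described above can be sketched in PromQL. The metric names below (`cache_hits_total`, `cache_requests_total`) are illustrative placeholders, not from this thread:

```
# Take the rate of each counter first, then divide; dividing the raw
# counters before rating gives wrong answers around counter resets.
  sum(rate(cache_hits_total[5m]))
/
  sum(rate(cache_requests_total[5m]))
```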
Being able to run range queries on arbitrary expressions is valuable for ad-hoc explorations, either for sanity checking before making recorded rules, or for analyzing past data where you didn't already have recorded rules. |
Will leave one of the references with discussion on this topic https://groups.google.com/forum/#!topic/prometheus-users/JOVfqQsVRl8 |
Is there any ticket tracking this feature request (historical function evaluation)? |
This is the issue for it. I wouldn't be too hopeful of it ever getting implemented. |
OK. Is that due to a philosophical issue, or an engineering-resource-scarcity problem? |
It's a semantic issue, and an operational one. It's unclear how to do this correctly, and even then its use would have to be greatly limited for safety. |
I do not understand why this is not considered important at all. The requirement is simple: we want to generate a range-vector from a function that returns an instant-vector. For example, I want to execute something like:
The internal query returns a time series which has a Y value of 1 when the underlying value is > 0.1 and 0 when the underlying value is <= 0.1. This is how I calculate failure incidents. Then I just want to know how many seconds these incidents lasted in the selected time period. To do that, I need to execute the sum_over_time function, which is not applicable as the internal query is not considered a range vector. In fact, Netflix Atlas allows this kind of operation without any issues. |
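With subqueries, the incident-duration calculation described above could be sketched as follows. The metric name `my_error_ratio` and the 15s resolution are assumptions for illustration; `> bool 0.1` yields 1 when the value exceeds 0.1 and 0 otherwise:

```
# Each 15s-resolution sample is 1 during an incident and 0 otherwise, so the
# sum over the hour, multiplied by the 15s step, approximates seconds of
# incident time in the window.
sum_over_time((my_error_ratio > bool 0.1)[1h:15s]) * 15
```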
That would be an inappropriate usage of this feature, what you want is a recording rule. |
The way this is supported today is through a recording rule. But that's not what I want. I cannot have a recording rule for each step of my query where an instant vector is generated. And this is just for a single query. Also, it is too difficult for everyone who wants to create a dashboard to have to create a recording rule first. It is not that easy. Adding recording rules is an administrative process now, and it is definitely not suitable for people who just want to write more elaborate queries for their dashboards. |
What would it take to implement this? I'm willing to jump through the hoops to make this happen. This makes a lot of |
See https://www.robustperception.io/composing-range-vector-functions-in-promql/ Most of the over_time functions make no sense with counters. |
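The composition pattern from the linked post, expressed with the proposed subquery syntax (the metric name is illustrative):

```
# Largest 5-minute rate observed over the last hour, sampled every minute.
# The inner rate() handles counter semantics; only its gauge-like result
# passes through the outer *_over_time function.
max_over_time(rate(http_requests_total[5m])[1h:1m])
```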
Yeah. I'm also undecided on adding this feature, but let's say we did want it: how would we determine the sub-query evaluation interval? I think we could go two routes here:
|
I was pondering the 2nd, if we ever add this. At the very least we'd want more protections against abusive queries first. |
In https://docs.google.com/document/d/1-C5PycocOZEVIPrmM1hn8fBelShqtqiAmFptoG4yK70/edit# the consensus is to adopt option 1. Some notes: I think it is fine to default the subquery evaluation to the global evaluation, but I would rather still keep the ":" separator:
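With option 1 and the ":" separator kept, the resolution can be given explicitly or left to default. A sketch, with `foo` as a placeholder series:

```
avg_over_time(foo[1h:1m])  # explicit 1m subquery resolution
avg_over_time(foo[1h:])    # resolution omitted: defaults to the global evaluation interval
```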
Also subqueries would solve #3806 if we can do: rate(min without() foo)[5m:1m] |
@brian-brazil I would like to work on this if we are keen on adding it now. |
Sure, one thing is that the generated data will need to be aligned independently of the query parameters. |
This feature is really important for being able to work with empty values. For example, this one is an adjusted exporter availability query:
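A sketch of how such an availability expression might look with subqueries; the job name is a placeholder, and `or vector(0)` fills the gaps where the series is absent:

```
# Average one-day availability, treating missing scrapes as down (0).
avg_over_time((max(up{job="my-exporter"}) or vector(0))[1d:1m])
```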
|
@juliusv catching up and trying to understand option 1. So for the use case from the above-linked discussion (https://groups.google.com/d/msg/prometheus-users/JOVfqQsVRl8/PLA0hGAMCgAJ), would the query in option 1 syntax be this? :
And returning to your example, is it correct that
|
Almost: to preserve the invariant that query_range is syntactic sugar on top of query, the start time should be aligned with the step. |
so in theory at least, this could eventually remove the need for |
No, I'd expect instant queries to still not return range vectors, and subqueries don't support arbitrary start points. |
Instant queries can already return range vectors... |
I see one issue with the proposed implementation: in the same way that I think of Assuming a subquery range equal to the step, a Put another way, rather than "fixing" The only way I can see for fixing this inconsistency would be to drop/ignore the (calculated) sample at the start of each range, unless there happens to be an actual sample with that exact timestamp (which is what ranges do currently). I for one would rather live with the inconsistency than forever give up hope of a decent |
If that were the case it'd be a bug, but looking at the PR and unittests I'm not seeing that issue. Can you comment where you're seeing that over on the PR? |
I'm not seeing it in the PR because the series used in the tests all have samples at 10 second multiples and the resolutions are also a 10 second multiple. But if you used samples that were not aligned with the resolution and the increase between samples was no longer constant, you'd get a different result from E.g. something like this (back of the envelope calculation, not actually tested against the PR):
|
That test passes. That's what I'd expect semantically, though, and the subquery examples here aren't sane.
|
@codesome Can you add the above tests, with just a comment indicating that using subqueries unnecessarily is unwise? |
I'm not sure whether I should feel offended or not. I'll go with the latter and ask you to please elaborate on that statement. Yes, the ranges are rather short and cover only a couple of samples, but that's not my point. (Had I gone for longer ranges with more samples, I would have run out of bits for the Fibonacci numbers.) My point was that all functions that take range vectors will include one extra sample when used in a subquery (as opposed to when used in a regular query). Which was what I was arguing for in #3806 (for the This is most obvious when using |
That's not what I mean (smaller data makes more sense for unittests like this anyway), I meant that the actual queries themselves aren't sane. Subqueries shouldn't be used on instant vector selectors, because you can use a range vector selector instead which is both more efficient, and doesn't lose you timestamps.
That's the same as if you were to use a recording rule via https://www.robustperception.io/composing-range-vector-functions-in-promql. Given that subqueries are to cover that sort of use case, I don't see a big issue here. |
OK, I finally see your point about this being the same as using a range function on top of a recording rule, but no, I don't think it's accurate. It would be accurate if (1) the query step/subquery resolution either matched or was a multiple of the eval interval (which would be a reasonable expectation) and (2) recording rule samples happened to also be aligned with the eval interval (their timestamps are exactly eval interval apart, but not usually aligned with it). Then the samples on the step/resolution boundaries would be included both in the range ending at said boundary and the range beginning there. Something like this:
This is exactly what a subquery now produces. What happens with the recording rule + range vector function in practice though, is that condition (2) doesn't hold (i.e. recording rule samples are not aligned with the eval step) so the resulting ranges look like this:
So if your range function was Furthermore, and apart from whether this does or doesn't exactly match the recording rule + range vector function outputs, I believe that it would make more sense (to the average user) to try and emulate the behavior of regular queries vs. emulating the behavior of the existing (and soon to be forgotten) workaround to the lack of subqueries. As to your observation about the queries not being sane because they use a subquery on an instant vector as opposed to some range vector function, you are technically right, but I was merely trying to come up with an obvious way of comparing queries to subqueries. Ideally I should have written a recording rule on top of the raw data and a range query on top of that and compared it with the equivalent subquery expression, but (1) it would have been overkill for an example and (2) you can't do that with the current query test infrastructure anyway. And now that I look at the comparison above, it suggests a different solution to achieving parity between queries and subqueries: have the subquery results explicitly NOT aligned with the subquery resolution. That would ensure that every range of length N x resolution will contain exactly N, not N+1 samples. And decisively crush my hopes for a better |
@free I tried the tests that you gave above, and they pass. @brian-brazil I'll add |
Yes, please do. The rest I'll need to think about in the morning. |
The time ranges in |
More accurately, it is something that can already happen currently, if the source data happens to be aligned.
It's not going away, it'd be unwise to use subqueries in a recording rule on performance grounds.
The new promtool rule test feature has that, we've yet to switch the promql unittests over to it.
That is an option, but it feels kinda weird. I think the real issue here is that ranges are closed intervals, rather than half-open intervals. I have previously spotted this as a bit odd, but it hasn't caused any real problems thus far. A user could do a slightly shorter range if they wanted to avoid this. |
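The closed-interval behavior and the "slightly shorter range" workaround mentioned above, sketched with illustrative durations and a placeholder series:

```
sum_over_time(foo[60s:10s])  # closed interval: a sample exactly at t-60s is included
sum_over_time(foo[59s:10s])  # a slightly shorter range excludes the boundary sample
```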
That has a 1/5000 probability of happening in my deployment (5s scrape interval), even lower for the average Prometheus deployment. I wouldn't start from this as a baseline when deciding how subqueries should or should not behave.
I haven't actually tested the code, but I am confident the performance overhead is minimal. Most of the CPU load and memory usage will likely go into decoding and buffering the original timeseries plus functions and aggregations, not into sampling outputs (i.e. subqueries). There is going to be some overhead, but I for one will definitely replace all my
I fully agree with your assessment here, but I'll point out again that this hasn't been an issue thus far because both scraped series and recording rule outputs have essentially zero chance of aligning exactly with API query steps. With subqueries, you're going from 0% to 100% in one fell swoop. (Now that I think of it, there exists one situation where there's a 100% chance of queries using one another's output to have their samples align perfectly, and that is when you have 2 recording rules within the same rule group. But in that instance no one's going to take, say a
Yes, if said user was savvy enough. And happened to issue said query from a script (or PromQL supported range arithmetic, as in |
I didn't mean for your queries, but for things like
We have the option to tweak this in 3.0, only breakage it's likely to cause is in promql unittests as in general you should be writing your expressions to be robust in the face of jitter.
That's a standard subquery-equivalent use case, and we've had no complaints around it that I know of. The only one I know of in this general area is one user confused by the new PromQL unittesting stuff. |
To fully clarify my position on closed vs. half-open intervals: I agree that ranges should be half-open, so you can reliably do the recording rule equivalent of

When it comes to counters though, I would argue (and I think there should be no disagreement here) that the relevant information in a counter is the (reset-adjusted) increase between successive samples, not the absolute value of the counter (you've written a few blog posts on the topic yourself). Meaning that, theoretically speaking, a range over a counter would retain all increases recorded within that half-open interval, namely what is included now plus the increase corresponding to the first sample in the range (i.e. the adjusted increase between the last sample right at or before range start, assuming it's not stale, and the first sample strictly within the range).

So while in theory a range over a gauge would be the same as a range over a counter (all values in the half-open interval vs. all increases in the half-open interval), in practice (unless you're going to change the way counters are stored) they would be slightly different (a counter range would include one extra sample to the left). (I realize Prometheus doesn't keep track of which metrics are counters and which are gauges, but that information may be inferred from the function that the range vector is being used in.)

This kind of change in 3.0 would essentially fix all issues with both query vs. subquery behavior and (my favorite topic, sorry)

And I now realize that this kind of differentiation between counters and gauges would further improve the correctness of subqueries: currently, if subquery resolution is lower than the original resolution (or 2 samples simply happen to fall within the same step) there's a good chance that a counter reset will be undervalued (or possibly even missed). 
If counter ranges and gauge ranges would be handled differently (and the evaluator was made aware of which kind of range it was dealing with) it could correctly downsample counters (or expressions expected to return counter-like values), just as it does gauges. |
deriv and rate for counters/increasing gauges must always produce the same result. There is no other way. |
It's incorrect to use deriv on a counter; it only works correctly on gauges. Similarly for rate and gauges.
|
Yes, but on an increasing vector I would expect rate and deriv to give the same result. |
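The point being debated can be illustrated with a placeholder counter; the two agree only when the series increases at a constant rate with no resets inside the window:

```
rate(foo_total[10m])   # counter semantics: reset-adjusted per-second increase
deriv(foo_total[10m])  # gauge semantics: least-squares slope, no reset handling
# For a steadily increasing series with no resets these roughly agree; after
# a counter reset, deriv() can even go negative while rate() stays correct.
```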
|
Currently one can only do range selection on vector selectors and has to use recording rules to range-select expressions.
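The recording-rule workaround the issue refers to might look like this; the rule and metric names are illustrative:

```
# Recording rule (in a rules file):
#   record: job:http_requests:rate5m
#   expr: rate(http_requests_total[5m])
#
# Then range selection works, because the recorded series is a plain selector:
max_over_time(job:http_requests:rate5m[1h])
```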