-
Notifications
You must be signed in to change notification settings - Fork 11.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prometheus data source: Add $__rate_interval
#21417
Comments
In the spirit of conference driven development, I have submitted a talk to Grafanacon to explain how rate queries should be done in Grafana. It would be great if we could inaugurate |
Not thrilled by this idea, seems to add a lot of complexity and confusion to a situation that is already too complex and confusing (step & interval & resolution). Need to rethink this a bit more ,maybe there are ways we can make the default behavior and concepts more widely usable and understandable. Or if there are ways to solve this in Prometheus? Another option would be to always make $__interval = interval set by grafana (taking into account resolution, ds scrape interval, query step, etc) + ds scrape interval. |
I agree that this is a problem we need to solve, but worried that we are making an already complex and messy solution worse. |
I had a long whiteboard session with @davkal and @SuperQ. To reduce the existing complexity and messiness, we needed changes that would make it hard to keep existing dashboards working. (One could do things like keep supporting the old settings but hide them in newly created panels, but @davkal explained to me that those kind of changes are highly problematic in the Grafana context.)
We thought about that, but the problem is that there are a lot of situations where you want
Indeed. That's why the duration added to
Short answer is: For better or worse, Prometheus doesn't know the scrape interval (and neither the evaluation interval, if we talk about rules). Longer answer: While the scrape and evaluation intervals are configured in Prometheus, they may change over time, and that change over time is not persisted in the TSDB (and it wouldn't be an easy change, neither in terms of implementation nor in terms of concepts, to add it to TSDB). On top of that, both scrapes and evaluations can be skipped or get delayed. Thus, PromQL expects the user to be explicit about it. The user needs to know the expected scrape interval and craft range durations accordingly (see the linked RP blog post for best practices). A syntax to "make something one scrape interval longer" would thus boil down to let the user explicitly fill in the expected scrape interval, which then degenerates to just providing a longer duration in the range selector. And that's what this issue suggests: Let's facilitate that in Grafana. Or looking at it from a slightly different angle: The "problem" really only appears because we have a dynamic duration in the range selector. But that dynamic range is a Grafana feature, provided via Another long answer: It has been viciously discussed before if the way Prometheus calculates a |
For reference: The discussion leading to this issue was sparked by #19885 . The meandering discussion there makes it quite clear that there were way more involved changes on the table than merely introducing the |
Ok, thanks for more context, that was very helpful. But what would be nice was more info on how step fits into this (something many prometheus users fail to understand). Today $__interval is & step are the same, and in your first comment you said:
But the resolution will impact step, which will impact $__interval, so having a different behavior for $__xInterval seems confusing/inconsistent. Why would $__interval not be the same as $__interval + scrape with exact same logic on how $__interval is calculated. Maybe we can add support for math in range vectors so one could write vectors like Just brainstorming trying to see if all there is a more user friendly & intutive for new users way to solve this. |
To clarify: |
I'm fairly sure that adding this kind of math on the PromQL level would be very controversial. Math in variable expansions (on the Grafana level) would be an option, as discussed in #19885, but it would be a quite involved change compared to the |
Hi! About two years ago I've opened #11451. Now we live with patched Grafana, which has two extra variables for Prometheus:
Here is the patch we use now (it has some legacy things, but you will get the idea): diff --git a/public/app/plugins/datasource/prometheus/datasource.ts b/public/app/plugins/datasource/prometheus/datasource.ts
index e6c79faa42..55668ac4f7 100644
--- a/public/app/plugins/datasource/prometheus/datasource.ts
+++ b/public/app/plugins/datasource/prometheus/datasource.ts
@@ -356,6 +356,15 @@ export class PrometheusDatasource extends DataSourceApi<PromQuery, PromOptions>
...this.getRangeScopedVars(options.range),
});
}
+ const intervalScrapeX3ForOverTimeVectors = Math.max(kbn.interval_to_seconds(this.interval) * 3, interval);
+ const intervalScrapeX4ForRangeVectors = Math.max(kbn.interval_to_seconds(this.interval) * 4, interval);
+ const intervalForRangeVectors = Math.max(kbn.interval_to_seconds(this.interval) * 2, interval);
+ scopedVars = Object.assign({}, scopedVars, {
+ __interval_rv: { text: intervalForRangeVectors + 's', value: intervalForRangeVectors + 's'},
+ __interval_sx3: { text: intervalScrapeX3ForOverTimeVectors + 's', value: intervalScrapeX3ForOverTimeVectors + 's'},
+ __interval_sx4: { text: intervalScrapeX4ForRangeVectors + 's', value: intervalScrapeX4ForRangeVectors + 's'},
+ });
+
query.step = interval;
let expr = target.expr;
diff --git a/public/app/plugins/datasource/prometheus/promql.ts b/public/app/plugins/datasource/prometheus/promql.ts
index 0bde2da9c2..a63a0ac707 100644
--- a/public/app/plugins/datasource/prometheus/promql.ts
+++ b/public/app/plugins/datasource/prometheus/promql.ts
@@ -2,6 +2,8 @@ import { CompletionItem } from '@grafana/ui';
export const RATE_RANGES: CompletionItem[] = [
{ label: '$__interval', sortText: '$__interval' },
+ { label: '$__interval_sx3', sortText: '$__interval_sx3' },
+ { label: '$__interval_sx4', sortText: '$__interval_sx4' },
{ label: '1m', sortText: '00:01:00' },
{ label: '5m', sortText: '00:05:00' },
{ label: '10m', sortText: '00:10:00' }, |
Thank you for your input @distol . Sounds like something worth investigating. |
Note that @distol 's suggestion addresses a different problem. It's about not having a duration in the range selector that doesn't get you enough points. This issue is about the problem of having "full coverage" of data points, for which you need a duration in the range selector that is equal to the interval plus one scrape interval. Having said that, @distol 's patch also has to face the problem that the scrape interval configured in the data source might not be the scrape interval of the concrete series displayed. A generally applicable solution needed a separate setting for the scrape interval per query (which is somewhat similar to the separate setting we need for the xinterval suggestion). However, once you set the scrape interval separately, it is not much less convenient to set the min-interval separately (which solves @distol 's problem, but not the problem this issue is about). Having some support for arithmetic in the variables would help in both cases. |
Having hundreds Prometheus in operation we end up strongly forcing one scrape interval across all metrics in one Prometheus. This approach simplifies a lot of things and it helped us a lot. |
Yes, that's definitely a good practice and also recommended, but it's not generally true. (Just saying that because we need to find a solution that's easy in the common case but also still works in the special case.) |
Okay, then maybe the general solution would be:
IMHO it solves all the problems. What do you think? This way, if you follow the practice of one-scrape-interval-per-prometheus — you can just use special vars for range vectors, and if you don't — you have to specify scrape interval on a per-query basis. |
So… what I propose in this issue is essentially (2) and (3) from your list. The good thing about it is that it doesn't prevent us from later doing (1). (That's why I'd like to not overload this issue. It might already appear pretty invasive to the core Grafana folks.) (2) and (3) are of higher immediate importance, because you can essentially emulate (1) by setting an appropriate min step (which is per query). The only restriction is that you cannot have higher resolution than that then. But that would lead to "overlapping" ranges, which is not desirable anyway (i.e. in practice you don't want a My gut feeling is that it will be easier/better to establish some kind of arithmetic in the Grafana variable expansion part than introducing more and more variables. Thus, I see two courses of action:
I prefer (2) here because it is relatively easy to implement and hopefully not very controversial. (1) will take longer to get it right. |
Another point in favor of "arithmetic in variable expansions" is if subqueries enter the game. @gouthamve just had a case where he would have liked to write |
After digesting all the discussion, additional thoughts during preparation for my GrafanaCon talk and some feedback received, here a snapshot of my brain state: The "proper" solution that actually nails all use cases is to allow arithmetic of some kind. This could be arithmetic in Grafana variable expansion. It could also be arithmetic in PromQL range durations. The former would also solve problems Grafana users have with non-Prometheus data sources, while the latter would also solve problems Prometheus users have without using Grafana (e.g. with other templating systems that might also lack support for arithmetic). The problem is that both solutions are quite invasive for their respective project (Grafana or Prometheus) and need a lot of thought to get right (and of course buy-in from the respective project itself – this might be a case where each project points fingers at each other as being in charge). Even if things work out really well, it might take a while. Thus, we need an interim solution. However, for an interim solution, we probably don't want to introduce yet another setting, as suggested above (called Scrape interval or Interval extension above). Also, the arithmetic will clearly be a power user feature, and it would be good to have a "zero configuration" setting for the more naïve user. Therefore, I came up with the following idea: Let's still introduce an additional variable, but call it I believe this should "just work" in 99+% of cases. It should be easy to add. It does not bloat the UI (no additional setting). It's just that one more variable name that we have to teach to users, but it can be used in most cases without much thinking (in contrast to the current situation, where you have to juggle with Min interval or Min step for (Minor note: This will also allow the use case where the range used in the rate query is much larger than the actually used step size AKA |
$__xinterval
$__rate_interval
This is exactly how
It would be great to have a list of issues that
FYI, this is already possible with MetricsQL:
|
I would like to add it appears much easier to just use a variable that "magically" does a reasonable thing, instead of having to deal with arithmetic for every range query. Moreover so since the proposed definition of |
Playing with this right now. Detailed result below. tl;dr: It works most of the time as intended, but setting a lower resolution (setting specific to Prometheus queries) throws things mostly off. Also, raising min step above the scrape interval in the data source gives surprising results. I'm using a setup with a default data source, i.e. scrape interval is assumed to be 15s. First test run: Don't set anything else, just use different time ranges to trigger different intervals:
Second test run: Set a min interval of 1m for the panel. This should not modify the assumed scrape interval.
Third test run: Set a min step of 1m in the query settings. This should change the assumed scrape interval to 1m.
Forth test run: Set a min step of 6s in the query settings. This should change the assumed scrape interval to 6s.
Fifth test run: Set a resolution of 1/2.
@zoltanbedi perhaps we can pair next week to got to the bottom of this and create suitable test cases? |
The latest docs already mention this new variable, but it looks like just released 7.1.4 doesn't provide it. |
Good point. This is new functionality and will be released with the next minor release (7.2) in September. Until then it's only available in the master builds. We'll update the docs to reflect this. |
Also note that the variable is it is in head is still buggy. @zoltanbedi is back from vacation next week, and we'll get things fixed up as described above. Hopefully, all will be good for 7.2. |
A bit off-topic, but I have just spent almost two hours (even though I have been following this issue for quite some time 🤦♂️) trying to figure out why I'm not able to graph the rate of response bytes. Already considered that something is wrong with my instrumentation, but no, it comes down to the interval. I have completely neglected the fact that this occurs on all time ranges and not just fairly big ones. Definitely looking forward to this getting released. I think I'll try out the master build. I could simply switch to a fixed interval like 5m and hope that the increase does not occur right at the beginning of a step, but then I would have to accept to ignore a lot of data at time ranges where Grafana's step exceeds 5m... Edit: Nevermind, v7.2.0-e475443dpre (a2a6b94) is really buggy |
@beorn7 I did and |
Great to hear about |
On 27 Aug 04:42, Björn Rabenstein wrote:
Great to hear about `$__rate_interval` working as expected. I don't know a lot about the Grafana release process and how stable master is supposed to be. (Apparently not so stable… :o) Despite my affiliation, I don't really work on Grafana itself a lot. Other have to tell us more here.
Master is usually very stable.
|
@roidelapluie Okay, thanks for the notice. I will try out the master again. Stuff like selecting template variables stopped working for me. And it seems like I'm not the only one (see this question on SO). I have also included a GIF recording that shows another weird behavior #21417 (comment) |
@beorn7, @zoltanbedi could the issue I have opened #27500 be related to this change? I cannot use Grafana interval variables in the min step field starting from Grafana versions > 7.1.5 |
That definitely seems a regression. Not sure if it has to do with this change, @zoltanbedi would be much more qualified to say. |
It is yes. Thanks for reporting it! |
For Prometheus newbies (i.e. most of us), it seems like $__rate_interval should be widely used, but the documentation doesn't help us understand that fact. So I added more about why it's good, and an explicit recommendation that it can be a good default value for use if (like most of us) you don't really understand what interval is good to use. (Let me know if that's not actually the case, but that's what I took away from grafana#21417) Also fixed a few typos in the existing doc.
<!-- Thank you for sending a pull request! Here are some tips: 1. If this is your first time, please read our contribution guide at https://github.com/grafana/grafana/blob/main/CONTRIBUTING.md 2. Ensure you include and run the appropriate tests as part of your Pull Request. 3. In a new feature or configuration option, an update to the documentation is necessary. Everything related to the documentation is under the docs folder in the root of the repository. 4. If the Pull Request is a work in progress, make use of GitHub's "Draft PR" feature and mark it as such. 5. If you can not merge your Pull Request due to a merge conflict, Rebase it. This gets it in sync with the main branch. 6. Name your PR as "<FeatureArea>: Describe your change", e.g. Alerting: Prevent race condition. If it's a fix or feature relevant for the changelog describe the user impact in the title. The PR title is used to auto-generate the changelog for issues marked with the "add to changelog" label. --> **What this PR does / why we need it**: For Prometheus newbies (i.e. most of us), it seems like $__rate_interval should be widely used, but the documentation doesn't tell us that. So I added more about why it's good, and an explicit recommendation that it can be a good default value for use if (like most of us) you don't really understand how to choose an interval. (Let me know if that's not actually the case, but that's what I took away from grafana#21417) Also fixed a few typos in the existing doc. **Which issue(s) this PR fixes**: <!-- - Automatically closes linked issue when the Pull Request is merged. Usage: "Fixes #<issue number>", or "Fixes (paste link of issue)" --> Fixes # **Special notes for your reviewer**:
* Doc: Info about $__rate_interval for naive users <!-- Thank you for sending a pull request! Here are some tips: 1. If this is your first time, please read our contribution guide at https://github.com/grafana/grafana/blob/main/CONTRIBUTING.md 2. Ensure you include and run the appropriate tests as part of your Pull Request. 3. In a new feature or configuration option, an update to the documentation is necessary. Everything related to the documentation is under the docs folder in the root of the repository. 4. If the Pull Request is a work in progress, make use of GitHub's "Draft PR" feature and mark it as such. 5. If you can not merge your Pull Request due to a merge conflict, Rebase it. This gets it in sync with the main branch. 6. Name your PR as "<FeatureArea>: Describe your change", e.g. Alerting: Prevent race condition. If it's a fix or feature relevant for the changelog describe the user impact in the title. The PR title is used to auto-generate the changelog for issues marked with the "add to changelog" label. --> **What this PR does / why we need it**: For Prometheus newbies (i.e. most of us), it seems like $__rate_interval should be widely used, but the documentation doesn't tell us that. So I added more about why it's good, and an explicit recommendation that it can be a good default value for use if (like most of us) you don't really understand how to choose an interval. (Let me know if that's not actually the case, but that's what I took away from #21417) Also fixed a few typos in the existing doc. **Which issue(s) this PR fixes**: <!-- - Automatically closes linked issue when the Pull Request is merged. Usage: "Fixes #<issue number>", or "Fixes (paste link of issue)" --> Fixes # **Special notes for your reviewer**: * Edits based on suggestions from @beorn7 * Update docs/sources/datasources/prometheus.md Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com> Co-authored-by: Ivana Huckova <30407135+ivanahuckova@users.noreply.github.com> Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com> Co-authored-by: Ivana <ivana.huckova@gmail.com>
* Doc: Info about $__rate_interval for naive users <!-- Thank you for sending a pull request! Here are some tips: 1. If this is your first time, please read our contribution guide at https://github.com/grafana/grafana/blob/main/CONTRIBUTING.md 2. Ensure you include and run the appropriate tests as part of your Pull Request. 3. In a new feature or configuration option, an update to the documentation is necessary. Everything related to the documentation is under the docs folder in the root of the repository. 4. If the Pull Request is a work in progress, make use of GitHub's "Draft PR" feature and mark it as such. 5. If you can not merge your Pull Request due to a merge conflict, Rebase it. This gets it in sync with the main branch. 6. Name your PR as "<FeatureArea>: Describe your change", e.g. Alerting: Prevent race condition. If it's a fix or feature relevant for the changelog describe the user impact in the title. The PR title is used to auto-generate the changelog for issues marked with the "add to changelog" label. --> **What this PR does / why we need it**: For Prometheus newbies (i.e. most of us), it seems like $__rate_interval should be widely used, but the documentation doesn't tell us that. So I added more about why it's good, and an explicit recommendation that it can be a good default value for use if (like most of us) you don't really understand how to choose an interval. (Let me know if that's not actually the case, but that's what I took away from #21417) Also fixed a few typos in the existing doc. **Which issue(s) this PR fixes**: <!-- - Automatically closes linked issue when the Pull Request is merged. Usage: "Fixes #<issue number>", or "Fixes (paste link of issue)" --> Fixes # **Special notes for your reviewer**: * Edits based on suggestions from @beorn7 * Update docs/sources/datasources/prometheus.md Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com> Co-authored-by: Ivana Huckova <30407135+ivanahuckova@users.noreply.github.com> Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com> Co-authored-by: Ivana <ivana.huckova@gmail.com> (cherry picked from commit 79747b4)
* Doc: Info about $__rate_interval for naive users <!-- Thank you for sending a pull request! Here are some tips: 1. If this is your first time, please read our contribution guide at https://github.com/grafana/grafana/blob/main/CONTRIBUTING.md 2. Ensure you include and run the appropriate tests as part of your Pull Request. 3. In a new feature or configuration option, an update to the documentation is necessary. Everything related to the documentation is under the docs folder in the root of the repository. 4. If the Pull Request is a work in progress, make use of GitHub's "Draft PR" feature and mark it as such. 5. If you can not merge your Pull Request due to a merge conflict, Rebase it. This gets it in sync with the main branch. 6. Name your PR as "<FeatureArea>: Describe your change", e.g. Alerting: Prevent race condition. If it's a fix or feature relevant for the changelog describe the user impact in the title. The PR title is used to auto-generate the changelog for issues marked with the "add to changelog" label. --> **What this PR does / why we need it**: For Prometheus newbies (i.e. most of us), it seems like $__rate_interval should be widely used, but the documentation doesn't tell us that. So I added more about why it's good, and an explicit recommendation that it can be a good default value for use if (like most of us) you don't really understand how to choose an interval. (Let me know if that's not actually the case, but that's what I took away from #21417) Also fixed a few typos in the existing doc. **Which issue(s) this PR fixes**: <!-- - Automatically closes linked issue when the Pull Request is merged. Usage: "Fixes #<issue number>", or "Fixes (paste link of issue)" --> Fixes # **Special notes for your reviewer**: * Edits based on suggestions from @beorn7 * Update docs/sources/datasources/prometheus.md Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com> Co-authored-by: Ivana Huckova <30407135+ivanahuckova@users.noreply.github.com> Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com> Co-authored-by: Ivana <ivana.huckova@gmail.com> (cherry picked from commit 79747b4) Co-authored-by: axlrosen <alexrosen@alum.mit.edu>
What would you like to be added:
A variable
$__xinterval
(“extended interval”) similar to the already existing$__interval
. It's value is the value of$__interval
plus a configurable duration, which should default to the Scrape interval configured in the Prometheus data source not multiplied by anything like the resolution factor!!!To configure the additional duration explicitly, we need a new per-query input field in the query editor. It's tempting to call it Scrape interval, too, because that's what it will usually be (see below), but perhaps it's better to be more explicit about what it's doing and call it Interval extension or something. I'll leave that to the UX experts.
Why is this needed:
Wider context: this RP blog post.
Key paragraph: “Another consideration is that if you're using query_range, such as in graphing, then the range should be at least the size of the step. Otherwise you'll skip over some data. Grafana's
$__interval
can be useful here.”To actually not skip over data, the range needs to be slightly larger than
$__interval
, ideally to create an overlap of one sample. (This is not true if there are already recording rules in the game, and thenavg_over_time
is used to extend the result to longer time spans, which is also described in the RP blog post. In this case, the current behavior works just fine.)This “not skipping over data at all” was such a deep desire for some, that they proposed to change how PromQL handles range selectors specifically for
rate
expressions. What came out of that is a fork with something calledxrate
. This approach has issues of very different kinds. It would be great if we could solve it in Grafana, which then separates concerns quite nicely.That's where this suggestion is coming from: Add a new variable just for the Prometheus data source
$__xinterval
which is (usually) one scrape interval longer than$__interval
. Since the scrape interval for a given time series cannot be inquired by Grafana, we need to make it configurable per query (with the default being the scrape interval configured in the data source). This also opens the door for some other, more fringe use cases, where people wanted to use range selectors with a somewhat modified$__xinterval
. (We are only offering adding a duration here, but should legitimate use cases come up where a factor is required, we could introduce that without a breaking change.)Note: Other solutions to this problem were discussed, but Grafana has an obligation to not break existing dashboards. That's why the introduction of a separate variable was inevitable.
The text was updated successfully, but these errors were encountered: