
Align queries to prometheus with the step #10434

Merged
merged 6 commits on May 14, 2018

Conversation

craig-miskell-fluxfederation
Contributor

Aligns the start/end of the query sent to prometheus with the step, which ensures PromQL expressions with 'rate' functions get consistent results, and thus avoid graphs jumping around on reload.

Related to some of the later issues discussed in #9705, and repeatedly in various other places.

Works best combined with using $__interval as the rate interval, to avoid sub-sampling (step > sample interval), but has merit on its own. Together, the two changes fully fix the 'my rate-based graphs are inconsistent and change on every reload' problem.
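As a rough sketch of what "aligning with the step" means (a hypothetical `alignRange` helper, illustrative only, not the PR's actual code): snapping both ends of the range to multiples of the step means two queries issued a few seconds apart hit the same evaluation timestamps, so range functions like rate() see the same windows each time.

```typescript
// Hypothetical helper, names are illustrative, not the PR's implementation.
function alignRange(start: number, end: number, step: number): { start: number; end: number } {
  // Round start down and end up to the nearest multiple of step,
  // so the aligned range always covers the requested one.
  const alignedStart = Math.floor(start / step) * step;
  const alignedEnd = Math.ceil(end / step) * step;
  return { start: alignedStart, end: alignedEnd };
}

// With step = 30, a query at t = 1010..1910 and one issued five seconds
// later at t = 1015..1915 both resolve to 990..1920, so Prometheus
// evaluates identical points and the graph stops jumping on reload.
console.log(alignRange(1010, 1910, 30)); // { start: 990, end: 1920 }
console.log(alignRange(1015, 1915, 30)); // { start: 990, end: 1920 }
```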

@CLAassistant

CLAassistant commented Jan 4, 2018

CLA assistant check
All committers have signed the CLA.

@craig-miskell-fluxfederation
Contributor Author

Huh, ok, I've got some tests to fix :)

@craig-miskell-fluxfederation
Contributor Author

Updated to fix the CI failures, which I'm confident will pass now (they did in my dev env).
Mostly just adjusting the expected start/end timestamps, but I had to move the clamping to a separate function and call it in several places, to get the required results in all cases (particularly the query with null results filled in by the datasource).
Also (and I'm less confident about this), removed the trailing 's' from the step for the annotation queries. I couldn't see a reason/need for it, and it would have complicated matters substantially to have to deal with it. Open to feedback.

@codecov-io

codecov-io commented Jan 5, 2018

Codecov Report

Merging #10434 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #10434   +/-   ##
=======================================
  Coverage   49.79%   49.79%           
=======================================
  Files         312      312           
  Lines       22096    22096           
  Branches     1125     1125           
=======================================
  Hits        11003    11003           
  Misses      10452    10452           
  Partials      641      641

@torkelo
Member

torkelo commented Jan 9, 2018

Hm.. maybe this should be done globally for all data sources, in https://github.com/grafana/grafana/blob/master/public/app/core/utils/kbn.ts#L160

@craig-miskell-fluxfederation
Contributor Author

Seems like a reasonable concept. Are you suggesting that calculateInterval should adjust range.from and range.to?

@torkelo
Member

torkelo commented Jan 10, 2018

No, just align the interval to be an even multiple of the min interval.

@craig-miskell-fluxfederation
Contributor Author

I'm afraid I don't understand how that would help. The problem that my patch fixes is not the interval size, it's whether the from/to are integer multiples of the interval.

@bergquist
Contributor

ref #5190 #6930

I think it's time that we actually solve this problem.
But I'm not sure if this should be introduced as configuration per data source or at a global Grafana configuration level.

@bergquist bergquist added the type/discussion Issue to start a discussion label Jan 11, 2018
@free

free commented Feb 5, 2018

My 2c (feel free to ignore): I believe there might be value in making start and end alignment default, but optional. I fully agree with the fact that rate/increase graphs jumping all over the place on refresh is a problem. (I'm actually trying to fix the other end of this, with little to no success: prometheus/prometheus#3746.) That being said, I can think of a couple of reasons why one might not want aligned data.

For one, the most recent value will always reflect a partial result. E.g. if you have a bar graph with 1 hour resolution (to take an extreme example), the last bar will always start at zero and start filling up as the hour goes by. With a line graph (assuming an otherwise constant rate/increase) the line will be horizontal except for the last point, where it will go down, basically reflecting where in the middle of $__interval the graph got loaded. An even worse outcome is (I imagine, haven't actually tested) a status panel that (by definition) only displays the last value: if you forget to mark it as an instant query (which are correctly handled by this PR, good job), the value will keep increasing then dropping to zero in a sawtooth pattern, and at first sight it may be difficult to understand why.

Second (and even more speculative), clamping the start and end points to a multiple of $__interval in the presence of an explicit Min step makes it look as if that is the resolution of the underlying data: an increase between 2 data points never moves from one interval to the next. At that point you shouldn't even bother scraping more often than Min step because you won't see most of that data.

Like I said, feel free to ignore though. It may not be worth the added UI clutter and code complexity.

@free

free commented Feb 5, 2018

Oh, I just realized that I wrote all that comment on the assumption that Prometheus uses my proposed rate/increase implementation.

With the current implementation (which always throws away the increase between adjoining intervals, iff you're requesting rate(foo[$__interval]) with step=$__interval) it is always the same increases that you're never going to see. E.g. with an $__interval of 1 minute, you will never see the increases between the last point in one minute M and the first point in the next. Which, if you're looking for spikes, might be worse than having your graph randomly jump around all the time.

Not Grafana's or this PR's fault, but should probably be taken into consideration as it's more serious than either of my rather philosophical points from the previous comment.
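The mechanism free describes can be sketched with a toy model (illustrative only, not Prometheus's actual engine code; the sample values are invented):

```typescript
// Toy model of increase() over disjoint, aligned 60s windows.
// A counter is sampled every 15s at t = 5, 20, 35, ...; values are
// invented, with a +7 spike occurring between t = 50 and t = 65.
const samples: Array<[number, number]> = [
  [5, 0], [20, 1], [35, 2], [50, 3],       // minute 0
  [65, 10], [80, 11], [95, 12], [110, 13], // minute 1
];

// increase(foo[60s]) evaluated at time t only sees samples inside
// [t-60, t]: it returns last - first and ignores everything outside.
function increase(t: number): number {
  const w = samples.filter(([ts]) => ts >= t - 60 && ts <= t);
  return w[w.length - 1][1] - w[0][1];
}

// With step = 60s aligned to the minute, only t = 60 and t = 120 are
// ever evaluated, so the +7 jump between t=50 and t=65 is never visible:
console.log(increase(60));  // 3 - 0 = 3
console.log(increase(120)); // 13 - 10 = 3
```

The true increase over the two minutes is 13, but the two aligned windows only ever report 3 + 3 = 6; with unaligned (randomly shifted) queries the spike would at least show up on some reloads.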

@bergquist bergquist self-assigned this Mar 12, 2018
@davkal davkal self-assigned this May 8, 2018
@davkal davkal self-requested a review May 8, 2018 10:06
@davkal davkal removed their assignment May 8, 2018
@davkal
Contributor

davkal commented May 8, 2018

I'm cleaning up that merge (I introduced a variable conflict), and a test in datasource_specs.ts is failing.

@torkelo
Member

torkelo commented May 8, 2018

I was just reviewing this. I think this can be merged soon, but I would like to clean up the code a bit. The clampRange function is called twice in the query function, which looks a bit clunky. Maybe this could be done in the createQuery function, with start/stop added to the query object; that would make it possible to reuse them in the response handling code.

Also, I'm not a big fan of how the transformerOptions object was changed by removing the label names. I think that makes it harder to read. https://github.com/grafana/grafana/pull/10434/files#diff-a59431ca1f1f94cdb3ae176c50a585b2L152

@bergquist
Contributor

I think this needs some rebasing.

@free

free commented May 8, 2018

I would like to point out again that while this will indeed produce consistent results when used with rate() (the stated purpose of this PR) it will also make it virtually impossible to see some of the data.

Because of Prometheus' buggy implementation of rate(), graphing rate(foo[$__interval]) essentially causes one data point (or rather the increase between two adjacent data points) to be discarded every $__interval. When combined with aligning the query start/end with $__interval, this means that it is always going to be the same data that gets discarded. E.g. if you're graphing a rate() with 1m resolution you're never going to be able to see any increase between the last data point in minute M and the first data point in minute M+1. (I guess you could alter your query and add an offset 30s to it, but that is far from obvious, even if you understand why your 5 second error spike is missing from the graph.)

Please consider making this optional, as it's not a solution to (Prometheus', not Grafana's) problem, but merely a workaround for the annoyance of graphs jumping around.

@davkal
Contributor

davkal commented May 8, 2018

@free I'm curious where this would hide anything. Could you point me to the relevant lines in promql/engine.go or promql/engine_test.go with a concrete example?

In general, we're committed to showing what's expected when a user wants to see the last n minutes of data of a time series. As a visual frontend we care less about the data that's there, and more about its ease of interpretation in the most frequent use case, even if it means modifying date ranges to accommodate p8s' implementation "quirks". Wouldn't you be able to see the non-clamped data in the original Prometheus UI?

@free

free commented May 8, 2018

There are 2 Prometheus issues -- prometheus/prometheus#3806 and prometheus/prometheus#3746 -- and a Prometheus Developers mailing list thread -- https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/prometheus-developers/B_CMEp40PHE -- on the subject. Unfortunately the Prometheus developers don't agree that it's an important issue to solve, so it doesn't look like it will get fixed anytime soon.

Essentially the problem is that Prometheus only looks at the points falling in the specified range when computing a rate()/increase(), so if you have disjoint 1 minute ranges (as you would with a 1 minute rate() at 1 minute resolution) all the increases between the end of one range and the start of the next are ignored. So even now if you do a rate() over 1 minute with 1 minute resolution you will be missing some of the data (that's why spikes appear and disappear on refresh). But if you enforce alignment to 1 minute, then the same data will be missing regardless of when you run the query, which is arguably worse than the current status quo.

And yes, you would be able to see the data in the Prometheus UI, but most people won't bother (or won't be aware that they can -- e.g. I did not consider that option and I've worked with Prometheus and Grafana quite a bit over the past year).

That's why I would (personally) prefer if this was an option, but I fully understand that not everyone has the same priorities, so I'm merely asking nicely.

const startJitter = start % step;
const endJitter = end % step;
// Shift interval forward on jitter
if (startJitter || endJitter) {

This looks unnecessarily complex.

What should happen (in my opinion) is for end to be rounded up to a multiple of step (so the range always includes end, which is most often the wall time) and start to be a fixed number of steps away from end, so that if end - start is not a multiple of steps you don't end up flipping between N and N+1 data points on the graph. I.e.

const clampedEnd = Math.ceil(end / step) * step;  // Round up
const clampedRange = Math.ceil((end - start) / step) * step; // Also round up the range length
return {
  end: clampedEnd,
  start: clampedEnd - clampedRange,
};


Not sure I follow. Given step = 3, start = 1 and end = 5, Prometheus returns 2 datapoints:
http://localhost:9090/api/v1/query_range?query=1&start=1&end=5&step=3

My code clamps to:

start = 3
end = 6
returns 2 datapoints

Yours clamps to

start = 0
end = 6
returns 3 datapoints

Could you explain the benefits of your approach?


Well, if you want to mirror Prometheus in the number of points, you can always round the range down rather than up (because that's essentially what Prometheus does). I was instead going for covering the whole requested range (focusing particularly on the end of the range, so you don't withhold that information until a whole step has passed).

The deeper problem I noticed was that your code will indeed return 2 data points with step = 3, start = 1 and end = 5; but one second later -- when start = 2 and end = 6 -- it will return 3 data points: 3, 6 and 9, with the former leaving out the data at 2, which was "requested" and the latter unlikely to ever have any data (assuming the wall time is 6).
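The stability of the proposed rounding can be checked with a sketch (the same logic as the snippet above, wrapped in a hypothetical `clampRange` for illustration): a query at start = 1, end = 5 and the same query one second later at start = 2, end = 6 resolve to the identical aligned range, so the point count never flips between N and N+1.

```typescript
// Sketch of the proposed scheme: round end up to a multiple of step and
// keep the range length a fixed number of steps away from it.
function clampRange(start: number, end: number, step: number): { start: number; end: number } {
  const clampedEnd = Math.ceil(end / step) * step;
  const clampedRange = Math.ceil((end - start) / step) * step;
  return { start: clampedEnd - clampedRange, end: clampedEnd };
}

console.log(clampRange(1, 5, 3)); // { start: 0, end: 6 }
console.log(clampRange(2, 6, 3)); // { start: 0, end: 6 } (identical one second later)
```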

@@ -146,8 +145,7 @@ export class PrometheusDatasource {

var allQueryPromise = _.map(queries, query => {
if (!query.instant) {
let range = this.clampRange(start, end, query.step);
return this.performTimeSeriesQuery(query, range.start, range.end);
return this.performTimeSeriesQuery(query, query.start, query.end);

At this point you may simply pass query only as parameter and retrieve start and end as query.start and query.end. Then you also won't need to create the data object in performTimeSeriesQuery(), simply pass through the query object as it already has all the fields it needs.


Yeah, I had that in an earlier version. I quite like making that argument dependency explicit in the function signature.

@davkal
Contributor

davkal commented May 9, 2018

I think this is complete now. I added some tests for clampRange to make the behavior crystal clear. Big thanks to @free for sparring; I really appreciate the constructive manner (it's not said nearly often enough). And thanks to @craig-miskell-fluxfederation, obviously, for PR-ing this in the first place.

We had a discussion about making this optional and decided to hold off and get community feedback. Hopefully there will be enough until the next release is due. I'm removing myself from reviewing since I now added some code.

Lastly, to test the effect, I recommend using a series like rate(scrape_duration_seconds[1m]) over the last hour with a 10 sec refresh to see the difference to master.

@davkal davkal removed their request for review May 9, 2018 12:06

@free free left a comment


I feel it necessary to point out that even though I'm suggesting improvements, I still don't like this change. :o)

@@ -25,7 +25,8 @@
placeholder="{{ctrl.panelCtrl.interval}}" data-min-length=0 data-items=100 ng-model-onblur ng-change="ctrl.refreshMetricData()"
/>
<info-popover mode="right-absolute">
Leave blank for auto handling based on time range and panel width
Leave blank for auto handling based on time range and panel width. Note that the actual dates used in the query might be

s/might be adjusted to fit/will be adjusted to match/

Or better yet, "will be adjusted to a multiple of". It's clearer both on how the adjustment is made and on the fact that it's not optional.

@bergquist bergquist left a comment

👍

Please rebase onto master before merging so all CI steps can pass.

* only increase interval by step if jitter happened
* shift both start and end
* simplified tests by using low epoch numbers
…eries

* origin/master: (21 commits)
  docs: removes notes about beeing introduced in 5.0
  lock caniuse-db version to resolve phantomjs rendering issue
  Update dashboard_permissions.md
  move database-specific code into dialects (#11884)
  refactor: tracing service refactoring (#11907)
  fix typo in getLdapAttrN (#11898)
  docs: update installation instructions targeting v5.1.2 stable
  changelog: add notes about closing #11862, #11656
  Fix dependencies on Node v10
  Update dashboard.md
  changelog: add notes about closing #10338
  Phantom render.js is incorrectly retrieving number of active panels (#11100)
  singlestat: render time of last point based on dashboard timezone (#11425)
  Fix for #10078: symbol "&" is not escaped (#10137)
  Add alpha color channel support for graph bars (#10956)
  interpolate 'field' again in Elasticsearch terms queries (#10026)
  Templating : return __empty__ value when all value return nothing to prevent elasticsearch syntaxe error (#9701)
  http_server: All files in public/build have now a huge max-age (#11536)
  fix: ldap unit test
  decrease length of auth_id column in user_auth table
  ...
@schweikert

This should probably be mentioned in the changelog.

@marefr
Member

marefr commented Jun 1, 2018

@davkal can you add a note about this change to our changelog, with a link to this PR?

@davkal
Contributor

davkal commented Jun 1, 2018

Done.

Labels
area/datasource datasource/Prometheus pr/external This PR is from external contributor type/discussion Issue to start a discussion

10 participants