
Align queries to prometheus with the step #10434

Merged
merged 6 commits on May 14, 2018

Conversation

craig-miskell-fluxfederation
Contributor

Aligns the start/end of the query sent to prometheus with the step, which ensures PromQL expressions with 'rate' functions get consistent results, and thus avoid graphs jumping around on reload.

Related to some of the later issues discussed in #9705, and repeatedly in various other places.

Works best combined with using $__interval as the rate interval, to avoid sub-sampling (step > sample interval), but has merit on its own. Together, the two changes fully fix the 'my rate-based graphs are inconsistent and change on every reload' problem.
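As a rough sketch of what "aligning with the step" means (a hypothetical `alignRange` helper, illustrative only, not the PR's actual code): snapping both ends of the range to multiples of the step means two queries issued a few seconds apart hit the same evaluation timestamps, so range functions like rate() see the same windows each time.

```typescript
// Hypothetical helper, names are illustrative, not the PR's implementation.
function alignRange(start: number, end: number, step: number): { start: number; end: number } {
  // Round start down and end up to the nearest multiple of step,
  // so the aligned range always covers the requested one.
  const alignedStart = Math.floor(start / step) * step;
  const alignedEnd = Math.ceil(end / step) * step;
  return { start: alignedStart, end: alignedEnd };
}

// With step = 30, a query at t = 1010..1910 and one issued five seconds
// later at t = 1015..1915 both resolve to 990..1920, so Prometheus
// evaluates identical points and the graph stops jumping on reload.
console.log(alignRange(1010, 1910, 30)); // { start: 990, end: 1920 }
console.log(alignRange(1015, 1915, 30)); // { start: 990, end: 1920 }
```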

@CLAassistant

CLAassistant commented Jan 4, 2018

CLA assistant check
All committers have signed the CLA.

@craig-miskell-fluxfederation
Contributor Author

Huh, ok, I've got some tests to fix :)

@craig-miskell-fluxfederation
Contributor Author

Updated to fix the CI failures, which I'm confident will pass now (they did in my dev env).
Mostly just adjusting the expected start/end timestamps, but I had to move the clamping to a separate function and call it in several places, to get the required results in all cases (particularly the query with null results filled in by the datasource).
Also (and I'm less confident about this), removed the trailing 's' from the step for the annotation queries. I couldn't see a reason/need for it, and it would have complicated matters substantially to have to deal with it. Open to feedback.

@codecov-io

codecov-io commented Jan 5, 2018

Codecov Report

Merging #10434 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #10434   +/-   ##
=======================================
  Coverage   49.79%   49.79%           
=======================================
  Files         312      312           
  Lines       22096    22096           
  Branches     1125     1125           
=======================================
  Hits        11003    11003           
  Misses      10452    10452           
  Partials      641      641

@torkelo
Member

torkelo commented Jan 9, 2018

Hm.. maybe this should be done globally for all data sources, in https://github.com/grafana/grafana/blob/master/public/app/core/utils/kbn.ts#L160

@craig-miskell-fluxfederation
Contributor Author

Seems like a reasonable concept. Are you suggesting that calculateInterval should adjust range.from and range.to?

@torkelo
Member

torkelo commented Jan 10, 2018

No, just align the interval to be an even multiple of the min interval.

@craig-miskell-fluxfederation
Contributor Author

I'm afraid I don't understand how that would help. The problem that my patch fixes is not the interval size, it's whether the from/to are integer multiples of the interval.

@bergquist
Contributor

ref #5190 #6930

I think it's time that we actually solve this problem.
But I'm not sure if this should be introduced as configuration per data source or at a global Grafana configuration level.

@bergquist bergquist added the type/discussion Issue to start a discussion label Jan 11, 2018
@free

free commented Feb 5, 2018

My 2c (feel free to ignore): I believe there might be value in making start and end alignment default, but optional. I fully agree with the fact that rate/increase graphs jumping all over the place on refresh is a problem. (I'm actually trying to fix the other end of this, with little to no success: prometheus/prometheus#3746.) That being said, I can think of a couple of reasons why one might not want aligned data.

For one, the most recent value will always reflect a partial result. E.g. if you have a bar graph with 1 hour resolution (to take an extreme example), the last bar will always start at zero and start filling up as the hour goes by. With a line graph (assuming an otherwise constant rate/increase) the line will be horizontal except for the last point, where it will go down, basically reflecting where in the middle of $__interval the graph got loaded. An even worse outcome is (I imagine, haven't actually tested) a status panel that (by definition) only displays the last value: if you forget to mark it as an instant query (which are correctly handled by this PR, good job), the value will keep increasing then dropping to zero in a sawtooth pattern, and at first sight it may be difficult to understand why.

Second (and even more speculative), clamping the start and end points to a multiple of $__interval in the presence of an explicit Min step makes it look as if that is the resolution of the underlying data: an increase between 2 data points never moves from one interval to the next. At that point you shouldn't even bother scraping more often than Min step because you won't see most of that data.

Like I said, feel free to ignore though. It may not be worth the added UI clutter and code complexity.

@free

free commented Feb 5, 2018

Oh, I just realized that I wrote all that comment on the assumption that Prometheus uses my proposed rate/increase implementation.

With the current implementation (which always throws away the increase between adjoining intervals, iff you're requesting rate(foo[$__interval]) with step=$__interval) it is always the same increases that you're never going to see. E.g. with an $__interval of 1 minute, you will never see the increases between the last point in one minute M and the first point in the next. Which, if you're looking for spikes, might be worse than having your graph randomly jump around all the time.

Not Grafana's or this PR's fault, but should probably be taken into consideration as it's more serious than either of my rather philosophical points from the previous comment.
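The mechanism free describes can be sketched with a toy model (illustrative only, not Prometheus's actual engine code; the sample values are invented):

```typescript
// Toy model of increase() over disjoint, aligned 60s windows.
// A counter is sampled every 15s at t = 5, 20, 35, ...; values are
// invented, with a +7 spike occurring between t = 50 and t = 65.
const samples: Array<[number, number]> = [
  [5, 0], [20, 1], [35, 2], [50, 3],       // minute 0
  [65, 10], [80, 11], [95, 12], [110, 13], // minute 1
];

// increase(foo[60s]) evaluated at time t only sees samples inside
// [t-60, t]: it returns last - first and ignores everything outside.
function increase(t: number): number {
  const w = samples.filter(([ts]) => ts >= t - 60 && ts <= t);
  return w[w.length - 1][1] - w[0][1];
}

// With step = 60s aligned to the minute, only t = 60 and t = 120 are
// ever evaluated, so the +7 jump between t=50 and t=65 is never visible:
console.log(increase(60));  // 3 - 0 = 3
console.log(increase(120)); // 13 - 10 = 3
```

The true increase over the two minutes is 13, but the two aligned windows only ever report 3 + 3 = 6; with unaligned (randomly shifted) queries the spike would at least show up on some reloads.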

@bergquist bergquist self-assigned this Mar 12, 2018
@davkal davkal self-assigned this May 8, 2018
@davkal davkal self-requested a review May 8, 2018 10:06
@davkal davkal removed their assignment May 8, 2018
@davkal
Contributor

davkal commented May 8, 2018

I'm cleaning up that merge (I introduced a variable conflict), and a test in datasource_specs.ts is failing.

@torkelo
Member

torkelo commented May 8, 2018

I was just reviewing this. I think this can be merged soon, but I would like to clean up the code a bit. The clampRange function is called twice in the query function, which looks a bit clunky. Maybe this could be done in the createQuery function, with start/stop added to the query object; that would make it possible to reuse them in the response handling code.

Also, I'm not a big fan of how the transformerOptions object was changed by removing the label names. I think that makes it harder to read. https://github.com/grafana/grafana/pull/10434/files#diff-a59431ca1f1f94cdb3ae176c50a585b2L152

@bergquist
Contributor

I think this needs some rebasing.

@free

free commented May 8, 2018

I would like to point out again that while this will indeed produce consistent results when used with rate() (the stated purpose of this PR) it will also make it virtually impossible to see some of the data.

Because of Prometheus' buggy implementation of rate(), graphing rate(foo[$__interval]) essentially causes one data point (or rather the increase between two adjacent data points) to be discarded every $__interval. When combined with aligning the query start/end with $__interval, this means that it is always going to be the same data that gets discarded. E.g. if you're graphing a rate() with 1m resolution you're never going to be able to see any increase between the last data point in minute M and the first data point in minute M+1. (I guess you could alter your query and add an offset 30s to it, but that is far from obvious, even if you understand why your 5 second error spike is missing from the graph.)

Please consider making this optional, as it's not a solution to (Prometheus', not Grafana's) problem, but merely a workaround for the annoyance of graphs jumping around.

@davkal
Contributor

davkal commented May 8, 2018

@free I'm curious where this would hide anything. Could you point me to the relevant lines in promql/engine.go or promql/engine_test.go with a concrete example?

In general, we're committed to showing what's expected when a user wants to see the last n minutes of data of a time series. As a visual frontend we care less about the data that's there, and more about its ease of interpretation in the most frequent use case, even if it means modifying date ranges to accommodate p8s' implementation "quirks". Wouldn't you be able to see the non-clamped data in the original Prometheus UI?

@free

free commented May 8, 2018

There are 2 Prometheus issues -- prometheus/prometheus#3806 and prometheus/prometheus#3746 -- and a Prometheus Developers mailing list thread -- https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/prometheus-developers/B_CMEp40PHE -- on the subject. Unfortunately the Prometheus developers don't agree that it's an important issue to solve, so it doesn't look like it will get fixed anytime soon.

Essentially the problem is that Prometheus only looks at the points falling in the specified range when computing a rate()/increase(), so if you have disjoint 1 minute ranges (as you would with a 1 minute rate() at 1 minute resolution) all the increases between the end of one range and the start of the next are ignored. So even now if you do a rate() over 1 minute with 1 minute resolution you will be missing some of the data (that's why spikes appear and disappear on refresh). But if you enforce alignment to 1 minute, then the same data will be missing regardless of when you run the query, which is arguably worse than the current status quo.

And yes, you would be able to see the data in the Prometheus UI, but most people won't bother (or won't be aware that they can -- e.g. I did not consider that option and I've worked with Prometheus and Grafana quite a bit over the past year).

That's why I would (personally) prefer if this was an option, but I fully understand that not everyone has the same priorities, so I'm merely asking nicely.

const startJitter = start % step;
const endJitter = end % step;
// Shift interval forward on jitter
if (startJitter || endJitter) {

This looks unnecessarily complex.

What should happen (in my opinion) is for end to be rounded up to a multiple of step (so the range always includes end, which is most often the wall time) and start to be a fixed number of steps away from end, so that if end - start is not a multiple of steps you don't end up flipping between N and N+1 data points on the graph. I.e.

const clampedEnd = Math.ceil(end / step) * step;  // Round up
const clampedRange = Math.ceil((end - start) / step) * step; // Also round up the range length
return {
  end: clampedEnd,
  start: clampedEnd - clampedRange,
};


Not sure I follow. Given step = 3, start = 1 and end = 5, Prometheus returns 2 datapoints:
http://localhost:9090/api/v1/query_range?query=1&start=1&end=5&step=3

My code clamps to:

start = 3
end = 6
returns 2 datapoints

Yours clamps to

start = 0
end = 6
returns 3 datapoints

Could you explain the benefits of your approach?


Well, if you want to mirror Prometheus in the number of points, you can always round the range down rather than up (because that's essentially what Prometheus does). I was instead going for covering the whole requested range (focusing particularly on the end of the range, so you don't withhold that information until a whole step has passed).

The deeper problem I noticed was that your code will indeed return 2 data points with step = 3, start = 1 and end = 5; but one second later -- when start = 2 and end = 6 -- it will return 3 data points: 3, 6 and 9, with the former leaving out the data at 2, which was "requested" and the latter unlikely to ever have any data (assuming the wall time is 6).
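The stability of the proposed rounding can be checked with a sketch (the same logic as the snippet above, wrapped in a hypothetical `clampRange` for illustration): a query at start = 1, end = 5 and the same query one second later at start = 2, end = 6 resolve to the identical aligned range, so the point count never flips between N and N+1.

```typescript
// Sketch of the proposed scheme: round end up to a multiple of step and
// keep the range length a fixed number of steps away from it.
function clampRange(start: number, end: number, step: number): { start: number; end: number } {
  const clampedEnd = Math.ceil(end / step) * step;
  const clampedRange = Math.ceil((end - start) / step) * step;
  return { start: clampedEnd - clampedRange, end: clampedEnd };
}

console.log(clampRange(1, 5, 3)); // { start: 0, end: 6 }
console.log(clampRange(2, 6, 3)); // { start: 0, end: 6 } (identical one second later)
```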

@@ -146,8 +145,7 @@ export class PrometheusDatasource {

var allQueryPromise = _.map(queries, query => {
if (!query.instant) {
let range = this.clampRange(start, end, query.step);
return this.performTimeSeriesQuery(query, range.start, range.end);
return this.performTimeSeriesQuery(query, query.start, query.end);

At this point you may simply pass query only as parameter and retrieve start and end as query.start and query.end. Then you also won't need to create the data object in performTimeSeriesQuery(), simply pass through the query object as it already has all the fields it needs.


Yeah, I had that in an earlier version. I quite like making that argument dependency explicit in the function signature.

@davkal
Contributor

davkal commented May 9, 2018

I think this is complete now. I added some tests for clampRange to make the behavior crystal clear. Big thanks to @free for sparring; I really appreciate the constructive manner (it's not said nearly often enough). And thanks to @craig-miskell-fluxfederation, obviously, for PR-ing this in the first place.

We had a discussion about making this optional and decided to hold off and get community feedback. Hopefully there will be enough until the next release is due. I'm removing myself from reviewing since I now added some code.

Lastly, to test the effect, I recommend using a series like rate(scrape_duration_seconds[1m]) over the last hour with a 10 sec refresh to see the difference to master.

@davkal davkal removed their request for review May 9, 2018 12:06

@free free left a comment


I feel it necessary to point out that even though I'm suggesting improvements, I still don't like this change. :o)

@@ -25,7 +25,8 @@
placeholder="{{ctrl.panelCtrl.interval}}" data-min-length=0 data-items=100 ng-model-onblur ng-change="ctrl.refreshMetricData()"
/>
<info-popover mode="right-absolute">
Leave blank for auto handling based on time range and panel width
Leave blank for auto handling based on time range and panel width. Note that the actual dates used in the query might be

s/might be adjusted to fit/will be adjusted to match/

Or better yet, "will be adjusted to a multiple of". It's clearer both on how the adjustment is made and on the fact that it's not optional.

@bergquist bergquist left a comment

👍

Please rebase onto master before merging so all CI steps can pass.

* only increase interval by step if jitter happened
* shift both start and end
* simplified tests by using low epoch numbers
…eries

* origin/master: (21 commits)
  docs: removes notes about beeing introduced in 5.0
  lock caniuse-db version to resolve phantomjs rendering issue
  Update dashboard_permissions.md
  move database-specific code into dialects (#11884)
  refactor: tracing service refactoring (#11907)
  fix typo in getLdapAttrN (#11898)
  docs: update installation instructions targeting v5.1.2 stable
  changelog: add notes about closing #11862, #11656
  Fix dependencies on Node v10
  Update dashboard.md
  changelog: add notes about closing #10338
  Phantom render.js is incorrectly retrieving number of active panels (#11100)
  singlestat: render time of last point based on dashboard timezone (#11425)
  Fix for #10078: symbol "&" is not escaped (#10137)
  Add alpha color channel support for graph bars (#10956)
  interpolate 'field' again in Elasticsearch terms queries (#10026)
  Templating : return __empty__ value when all value return nothing to prevent elasticsearch syntaxe error (#9701)
  http_server: All files in public/build have now a huge max-age (#11536)
  fix: ldap unit test
  decrease length of auth_id column in user_auth table
  ...
@schweikert

This should probably be mentioned in the changelog.

@marefr
Member

marefr commented Jun 1, 2018

@davkal can you add a note about this change to our changelog, with a link to this PR?

@davkal
Contributor

davkal commented Jun 1, 2018

Done.

Labels
area/datasource datasource/Prometheus pr/external This PR is from external contributor type/discussion Issue to start a discussion

10 participants