
Should Prometheus pretend to know the future? #4966

Open
beorn7 opened this Issue Dec 6, 2018 · 7 comments


beorn7 commented Dec 6, 2018

So far, the query model is very clear:

For any point in time tn, we process…

  • … for instant vectors the most recent data point with t ≤ tn, as long as the series is not stale and tn − t is smaller than the --query.lookback-delta.
  • … for range vectors the data points within the specified range, where the range ends at tn.

The returned results are time-stamped at tn.
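The instant-vector selection described above can be sketched in a few lines. This is a hypothetical helper for illustration only, not Prometheus source; the function name `select_instant` and the 5m default are assumptions mirroring the documented behaviour:

```python
# Sketch of instant-vector sample selection: for evaluation time t_n,
# take the most recent sample with timestamp t <= t_n, but only if it
# is younger than the lookback delta (300s is the documented default
# of --query.lookback-delta).

LOOKBACK_DELTA = 300  # seconds

def select_instant(samples, t_n, lookback_delta=LOOKBACK_DELTA):
    """samples: list of (timestamp, value) pairs sorted by timestamp."""
    candidates = [(t, v) for t, v in samples if t <= t_n]
    if not candidates:
        return None
    t, v = candidates[-1]
    if t_n - t >= lookback_delta:
        return None  # most recent sample too old: series treated as absent
    return v

samples = [(0, 1.0), (60, 2.0), (120, 3.0)]
select_instant(samples, 130)  # -> 3.0 (sample at t=120 is 10s old)
select_instant(samples, 500)  # -> None (sample at t=120 is 380s old)
```

This also makes the second bullet below concrete: as t_n moves into the future, t_n − t grows for every series, so results disappear one by one as each series' latest sample falls out of the lookback window.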

If tn is in the future, weird things happen, however:

  • Obviously, the query will yield dramatically different results if executed again when tn is not in the future anymore.
  • If tn is in the future by an amount close to --query.lookback-delta, results will start to disappear, which is especially problematic if the query is an aggregation, so that only some of the expected series are included.
  • For range vectors, the number of samples within the range will decrease the farther tn is in the future. Results will become gradually noisier, until rate and friends stop yielding results at all once there is only one data point in the range. In other cases, there will be no result anymore once no data point is in the range. This again is especially problematic with aggregations.
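The range-vector degradation can be illustrated with a toy model. This is an illustrative sketch, not Prometheus code; the left-open window is a simplification of the actual range-selector semantics:

```python
# Sketch of why range-vector results degrade for future t_n: samples
# only exist up to "now", so a range ending in the future covers ever
# fewer real samples, and rate() needs at least two of them.

def samples_in_range(sample_times, t_n, range_s):
    """Samples in the window ending at t_n (left-open here, for simplicity)."""
    return [t for t in sample_times if t_n - range_s < t <= t_n]

now = 600
scrapes = list(range(0, now + 1, 15))  # 15s scrape interval up to "now"

len(samples_in_range(scrapes, now, 60))        # -> 4: rate() works normally
len(samples_in_range(scrapes, now + 45, 60))   # -> 1: rate() yields nothing
len(samples_in_range(scrapes, now + 120, 60))  # -> 0: no result at all
```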

State of the art is, IMHO, that we consider queries from the future an accident resulting from time skew (we even warn about it in the web UI), and we return results as described above.

As you might have noticed, Grafana is now using future query times deliberately. It does so to get consistently aligned query times in range queries, e.g. a range query with a step length of 30s will always query for data points at the full and half minute. So as not to miss the most recent data, Grafana will, in this case, query for up to 30s into the future and then (presumably) still cut the displayed graph to the present. This strategy collides with the implications I have listed above, creating issues.
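Grafana's alignment strategy amounts to snapping range-query timestamps to step boundaries. A minimal sketch of the general idea; the function name and rounding details are assumptions, not Grafana's actual code:

```python
# Hypothetical sketch of step alignment: floor the start and ceil the
# end to multiples of the step, so that a 30s step always queries at
# the full and half minute. The aligned end can land up to one step
# past "now", i.e. in the future.

def align_range(start, end, step):
    aligned_start = (start // step) * step   # floor to step boundary
    aligned_end = -((-end) // step) * step   # ceil to step boundary
    return aligned_start, aligned_end

now = 1544107217  # some "current" unix time, 17s past a 30s boundary
align_range(now - 300, now, 30)
# -> (1544106900, 1544107230): the aligned end is 13s in the future
```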

One might argue that Grafana's approach here is simply wrong, and indeed, the feature is somewhat controversial, see the discussion in the relevant PR.
However, it triggered some thinking on my side about whether we should handle queries from the future in a different way. Even if we disregard Grafana's current use case, there will be more users that ask Prometheus to predict the future, accidentally or deliberately.

Mutually exclusive options that come to mind:

  1. Refuse queries from the future (with a certain grace period in the order of seconds at most).
  2. Don't return results for the future, i.e. no results at all for an instant query (with the same grace period again) and only return results for timestamps in the past for range queries, e.g. if a range query requests timestamps t–40s, t–10s, t+20s, and t+50s, only return results for t–40s and t–10s.
  3. Adjust future timestamps to the present, i.e. an instant query for t+20s will return the result for t+0s, timestamped as such, while a range query will coalesce all timestamps in the future into one at t+0s, for which it will return a result, timestamped as such, e.g. if a range query requests timestamps t–40s, t–10s, t+20s, and t+50s, return results for t–40s, t–10s, and t+0s.

There are certainly many pros and cons. 1. would break current Grafana hard. 2. would directly counteract Grafana's strategy to also get the freshest result. 3. has the advantage that it does what probably most deliberate users of future timestamps want.
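Option 3 could be sketched as follows. This is purely hypothetical, not implemented behaviour, and the function name is made up for illustration:

```python
# Sketch of option 3: clamp evaluation timestamps to "now", coalescing
# all future steps of a range query into a single evaluation at the
# present (which would then be timestamped as "now").

def clamp_timestamps(timestamps, now):
    past = [t for t in timestamps if t <= now]
    if any(t > now for t in timestamps):
        past.append(now)  # all future steps collapse into one step at now
    return past

now = 1000
clamp_timestamps([now - 40, now - 10, now + 20, now + 50], now)
# -> [960, 990, 1000]
```

As noted above, the cost of this convenience is that the returned step timestamps no longer match the requested ones, which is exactly the step-semantics objection raised below.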

Of course, there is the overarching question if this is a breaking change only possible in Prometheus 3, even if we believe it is a change we want.


brian-brazil commented Dec 6, 2018

Even if we disregard Grafana's current use case, there will be more that will ask Prometheus to predict the future, accidentally or deliberately.

It used to do that via interpolation, and that behaviour was removed. Predictions are a different matter, we have predict_linear for that.

I don't think there's anything we can sanely/safely do here, I'd suggest that Grafana not request data from the future.

Of course, there is the overarching question if this is a breaking change only possible in Prometheus 3,

We do accept scrapes containing metrics up to 10 minutes into the future as a safety measure, while allowing for time skew. Any of the proposed changes could break such users, and option 3 violates timestamp and step semantics.


beorn7 commented Dec 6, 2018

To be clear: This is not about actual prediction. The title refers to the fact that Prometheus returns a potentially very problematic result when asked for a sample from the future, without any flagging.


brian-brazil commented Dec 6, 2018

One thing we could maybe do is return the maximum timestamp currently in the TSDB as an additional piece of information in query responses.


beorn7 commented Mar 21, 2019

The Grafana problem has been solved by grafana/grafana#16110 .

I'll leave this issue open nevertheless. I think adding the maximum timestamp currently in the TSDB, as suggested above, is a good first step to at least make the problem of “querying the future” discoverable. I also think we should document the stability of query results over time explicitly: the grace period for samples added with timestamps in the past and the future; ingestion delay (how large can it become in practice, actually? Prom1 had a problem with indexing delay, which shouldn't exist anymore, but I guess there will still be some finite ingestion delay); rule evaluation delay; mutation of the database by the new vertical block merges; …


brian-brazil commented Mar 21, 2019

mutation of the database by the new vertical block merges

Those shouldn't be an issue, as it's either a compaction or a reload.


beorn7 commented Mar 21, 2019

Those shouldn't be an issue, as it's either a compaction or a reload.

I don't understand the relevance.

If I add an overlapping block, I'll “change the past”, which is important to keep in mind for any user of the query API who wants to cache results. No?


brian-brazil commented Mar 21, 2019

Ah right. Deletion and retention also will have that effect.
