Implement Required option for remote_read #5667
Conversation
This allows the user to define whether the data from a remote-read endpoint is required or not. The default maintains the current behavior (where all errors from a remote read are reported as warnings). In addition to the added flexibility, IMO this is a cleaner implementation (as the mergeQuerier no longer has to deal with primary vs. secondary queriers). Fixes prometheus#5662. Signed-off-by: Thomas Jackson <jacksontj.89@gmail.com>
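A minimal sketch of what the proposed configuration could look like, assuming the new per-endpoint field is named `required` (the URLs and endpoint roles are illustrative, not from the PR):

```yaml
remote_read:
  # Long-term storage holding data this Prometheus no longer retains
  # locally: with the proposed option, a read failure here would fail
  # the query instead of being reported as a warning.
  - url: http://longterm-storage.example.com/api/v1/read
    required: true
  # Best-effort endpoint: failures remain warnings (the current default
  # behavior would correspond to required: false).
  - url: http://besteffort.example.com/api/v1/read
    required: false
```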
Per #5662, this doesn't make sense in a sane Prometheus setup.
@jacksontj There was some disagreement at the recent Prometheus dev summit about this topic. The two positions are:
There is still some discussion to be had to settle this. Having it specified per remote-read datasource is an interesting idea, and I have to think about the impact of this. But it sounds like you're suggesting a fix to a similar problem?
Incomplete, not bad. In a standard setup, where a Prometheus is only requesting its own previously written data, data past the retention period will be missing.
@tomwilkie yes, sounds like we are working on similar issues :)
IMO the "required-ness" of data is not an attribute of the query but an attribute of the infrastructure. I covered this in some detail in the linked issue (#5662 (comment)) -- so I'll link instead of copy/pasting :)
This is an unfortunate byproduct of how the warnings were added. IMO these should all be marked as required as that was the previous behavior-- as this effectively is a backwards-incompatible change which can dramatically impact the outcome of queries.
IMO this is the level of granularity you want -- as "correctness" of the data is based on what setup you have defined. Presumably the promql engine is already smart enough to only query storage that has data relevant to the query -- so the operator needs to let us know in config whether this data is unique or not -- and IMO if you added a remote-read endpoint it likely is (which is why I'd prefer it defaulting to required). Alternatively, Prometheus does no remote-read aggregation at all and that is left to tools more suited to the problem (e.g. promxy).
This entirely depends on the configuration of Prometheus. If the remote-read endpoint has different data than the local prom, the data will be a subset of the actual data -- which makes it both incomplete and bad. Incomplete because it doesn't have all the data required, and bad because the value is incorrect.
I'd disagree; it's often going to be what is doing the query and how the end data is being used. If it's a reporting query (e.g. capacity planning over the last N months) it's probably required; anything else (e.g. dashboards) probably not. A reporting query like that could check for the warnings, or we could add an opt-in URL parameter to make warnings errors for things like that.
The previous behaviour was only due to no one getting around to that yet. And we're allowed backwards-incompatible changes in remote storage. Keep in mind also that Prometheus is a monitoring system that happens to contain a database, not a database for which 100% query accuracy is a goal.
In general only for special non-storage remote read endpoints, which doesn't help you, I believe. If you presume that Prometheus always queries all remote read endpoints, you won't be far off.
That's roughly speaking what read_recent is. I'd be dubious of the reliability of a setup that was pulling in data from other Prometheuses via remote read though, whether directly or via a clustered storage system. Similar considerations to the Thanos Ruler apply; only use it if you absolutely have to.
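For context, `read_recent` is an existing per-endpoint remote_read option; roughly (URL illustrative):

```yaml
remote_read:
  - url: http://longterm-storage.example.com/api/v1/read
    # With read_recent: false (the default), Prometheus skips this
    # endpoint for time ranges its local storage fully covers, so the
    # remote store is only consulted for older data.
    read_recent: false
```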
I'd argue that any setup that has a hard dependency on remote data is incorrect. One of the core principles of Prometheus is that it doesn't depend on anything beyond being able to get to its targets and the alertmanager.
If you need that more reliably, then federation is a better and more efficient approach. And even in that case you still need to tolerate missing data. You don't want one datacenter going down to take out all your monitoring. If you explain more about your actual setup it would help, rather than arguing in abstracts.
And I continue to disagree here as well :) As a very simple example, if the query is
Sure, but this means that things like
Which is why I'd be tempted to say that prometheus itself shouldn't be dealing with this at all -- and we just leave that to outside projects to do all the query merging.
If this is the case, why is the remote_read option even there? If prometheus really wants to stick with that hardline then it shouldn't support remote-read at all.
True, but that is completely controllable by your configuration. One DC taking down all of monitoring would be assuming you have a single prom remote-reading from everything and that's the only node for alerts and dashboards :)
This is a monitoring system though, not a metrics system. If this is a problem for you your alerts should take this into account. Combining local and remote series together also seems a bit dubious, as it's not the usual hierarchy approach. Your actual setup would be useful to understand.
It does mean we drop sections of data, for example if a scrape fails we don't attempt to fill the gap. Rate is designed to be resilient to this, and you also need to take it into account in PromQL expressions more generally. What we don't do is ingest partial data for a target, as that'd be virtually impossible to reason about. An entire target or chunk of time is workable though, as that's the same as the target or Prometheus being down which you already need to allow for.
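For instance, `rate()` is designed so a single missed scrape inside its window doesn't zero out the result, though expressions still need to handle a target being absent entirely (metric and job names here are illustrative):

```promql
# A failed scrape leaves a gap; rate() works over the samples it does
# have in the window rather than failing on the missing one.
rate(http_requests_total{job="api"}[5m])

# Absence of a whole target still needs explicit handling, e.g. in alerts:
absent(up{job="api"})
```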
That's one possibility, given Prometheus is going for reliability over accuracy. If you want accuracy over reliability then a different tool may be more appropriate. If it's just checking that there's no warnings it'd seem simpler to do that in the client code, rather than managing a whole new thing.
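Checking for warnings in client code, as suggested, could look roughly like this; a sketch against the documented `/api/v1/query` response envelope, with a hypothetical helper name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiResponse mirrors the documented /api/v1/query response envelope.
type apiResponse struct {
	Status   string          `json:"status"`
	Warnings []string        `json:"warnings,omitempty"`
	Error    string          `json:"error,omitempty"`
	Data     json.RawMessage `json:"data"`
}

// requireNoWarnings is a hypothetical client-side helper: it rejects a
// response that carries warnings (e.g. a failed remote read today).
func requireNoWarnings(body []byte) (json.RawMessage, error) {
	var r apiResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	if r.Status != "success" {
		return nil, fmt.Errorf("query failed: %s", r.Error)
	}
	if len(r.Warnings) > 0 {
		return nil, fmt.Errorf("incomplete data: %v", r.Warnings)
	}
	return r.Data, nil
}

func main() {
	bad := []byte(`{"status":"success","warnings":["remote read failed"],"data":{"result":[]}}`)
	if _, err := requireNoWarnings(bad); err != nil {
		fmt.Println("rejected:", err)
	}
}
```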
It works usually and that can be useful, but that doesn't mean we should take down monitoring when it fails. For example, a non-critical capacity planning alert that uses older data isn't impacted if it's broken for an hour, nor are dashboards temporarily not showing data more than a month back.
Indeed that'd be a bad idea, which is why I'm interested in what the actual use case for this is. By default I presume this, given how often users attempt it in various forms.
Sorry for massive lag.
The code LGTM, but it would be nice to rebase it and see how it looks on the latest code.
(Triaged during our bug scrub session)
@@ -614,6 +614,9 @@ type RemoteReadConfig struct {
	// RequiredMatchers is an optional list of equality matchers which have to
	// be present in a selector to query the remote read endpoint.
	RequiredMatchers model.LabelSet `yaml:"required_matchers,omitempty"`

	// Required makes reads from this remote-read endpoint required for querying
Suggested change:
- // Required makes reads from this remote-read endpoint required for querying
+ // Required makes reads from this remote-read endpoint required for querying.
Hello from the Bug Scrub again! Really sorry for so many delays on this, but it looks like there is no activity. Let's close for now and reopen if needed. Thanks for the work!