
Implement Required option for remote_read #5667

Closed

Conversation

@jacksontj (Contributor)

This allows the user to define whether the data from a remote-read endpoint is required or not. The default maintains the current behavior (where all errors from a remote read are marked as warnings).

In addition to the added flexibility, IMO this is a cleaner implementation, as the merge querier no longer has to deal with primary vs. secondary queriers.

Fixes #5662

Signed-off-by: Thomas Jackson jacksontj.89@gmail.com
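
For illustration, a minimal sketch of how the proposed setting might look in a prometheus.yml remote_read block. The required key name and its false default are assumptions based on this PR's description and diff; the option was not merged.

remote_read:
  - url: "http://longterm-storage.example.com/api/v1/read"
    # Proposed here: a failure from this endpoint fails the whole query
    # instead of only surfacing as a warning.
    required: true
  - url: "http://best-effort-storage.example.com/api/v1/read"
    # Assumed default: keep today's behavior, where failures from this
    # endpoint are reported as query warnings.
    required: false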

@brian-brazil (Contributor)

Per #5662, this doesn't make sense in a sane Prometheus setup.

@tomwilkie (Member)

@jacksontj There was some disagreement at the recent Prometheus dev summit about this topic. The two positions are:

  • Prometheus should error on queries when remote read fails, by default. This should be overridable per query. This would prevent old clients that don't know about remote read from "accidentally" showing bad data and ignoring warnings. New clients that understand warning responses can override this default, either at the user's discretion or all the time.

  • The current code, which will return warnings on queries when remote read fails. "Old" (ie most) clients will ignore these warnings and show bad data.

There is still some discussion to be had to settle this. Having it specified per remote read datasource is an interesting idea, and I have to think about the impact of this. But it sounds like you're suggesting a fix to a similar problem?

@brian-brazil (Contributor)

"Old" (ie most) clients will ignore these warnings and show bad data.

Incomplete, not bad. In a standard setup where a Prometheus is only requesting its own previously written data, data past the retention period will be missing.

@jacksontj (Contributor, Author)

@tomwilkie yes, sounds like we are working on similar issues :)

This should be overridable per query

IMO the "required-ness" of data is not an attribute of the query but an attribute of the infrastructure. I covered this in some detail in the linked issue (#5662 (comment)) -- so I'll link instead of copy/pasting :)

The current code, which will return warnings on queries when remote read fails. "Old" (ie most) clients will ignore these warnings and show bad data.

This is an unfortunate byproduct of how the warnings were added. IMO these should all be marked as required, as that was the previous behavior -- this is effectively a backwards-incompatible change which can dramatically impact the outcome of queries.

Having it specified per remote read datasource is an interesting idea

IMO this is the level of granularity you want -- the "correctness" of the data is based on what setup you have defined. Presumably the PromQL engine is already smart enough to only query storage that has data relevant to the query -- so the operator needs to let us know in config whether this data is unique or not -- and IMO if you added a remote-read endpoint it likely is (which is why I'd prefer it defaulting to required). Alternatively, Prometheus does no remote-read aggregation at all and that is left to tools more suited to the problem (e.g. promxy).

Incomplete, not bad.

This entirely depends on the configuration of Prometheus. If the remote-read endpoint has different data than the local Prometheus, the data returned will be a subset of the actual data -- which makes it both incomplete and bad. Incomplete because it doesn't have all the data required, and bad because the value is incorrect.

@brian-brazil (Contributor) commented Jun 26, 2019

IMO the "required-ness" of data is not an attribute of the query but an attribute of the infrastructure

I'd disagree, it's often going to be what is doing the query and how the end data is being used. If it's a reporting query (e.g. capacity planning over the last N months) it's probably required, anything else (e.g. dashboards) probably not. A reporting query like that could check for the warnings, or we could add an opt-in URL parameter to make warnings errors for things like that.

IMO these should all be marked as required, as that was the previous behavior -- this is effectively a backwards-incompatible change which can dramatically impact the outcome of queries.

The previous behaviour was only due to no one getting around to that yet. And we're allowed backwards-incompatible changes in remote storage.

Keep in mind also that Prometheus is a monitoring system that happens to contain a database, not a database for which 100% query accuracy is a goal.

Presumably the PromQL engine is already smart enough to only query storage that has data relevant to the query

In general only for special non-storage remote read endpoints, which doesn't help you I believe. If you presume that Prometheus always queries all remote read endpoints you won't be far off (read_recent is an optimisation to avoid unnecessary RTTs).

so the operator needs to let us know in config whether this data is unique or not

That's roughly speaking what read_recent is. I'd be dubious of the reliability of a setup that was pulling in data from other Prometheuses via remote read though, whether directly or via a clustered storage system. Similar considerations to the Thanos Ruler apply: only use it if you absolutely have to.
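
For context, a minimal illustration of the existing read_recent setting on a remote_read endpoint (the URL is a placeholder):

remote_read:
  - url: "http://long-term-storage.example.com/api/v1/read"
    # false (the default) skips this endpoint for time ranges that local
    # storage should have complete data for; true queries it regardless.
    read_recent: false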

This entirely depends on the configuration of prometheus.

I'd argue that any setup that has a hard dependency on remote data is incorrect. One of the core principles of Prometheus is that it doesn't depend on anything beyond being able to get to its targets and the alertmanager.

Incomplete, not bad.

If you need that more reliably, then federation is a better and more efficient approach. And even in that case you still need to tolerate missing data. You don't want one datacenter going down to take out all your monitoring.
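
For reference, a minimal sketch of the federation approach mentioned above, with a placeholder target and match[] selector:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'dc1-prometheus.example.com:9090'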

If you explain more about your actual setup it would help, rather than arguing in abstracts.

@jacksontj (Contributor, Author)

I'd disagree, it's often going to be what is doing the query and how the end data is being used.

And I continue to disagree here as well :) As a very simple example, say the query is sum(foo) and there are 2 series, 1 local: foo{a="1"} 2 and 1 remote: foo{a="2"} 2. The result is obviously 4 -- but in the event of the remote call failing the result will be 2, which I argue is both incomplete and bad (since it's wrong). There may be situations where being wrong is acceptable, but I would argue that it is not most of the time -- especially for a metrics system.

Keep in mind also that Prometheus is a monitoring system that happens to contain a database, not a database for which 100% query accuracy is a goal.

Sure, but this means that things like rate etc. aren't 100% accurate, not that we'd just drop sections of data from the response.

That's roughly speaking what read_recent is. I'd be dubious of the reliability of a setup that was pulling in data from other Prometheuses via remote read though, whether directly or via a clustered storage system.

Which is why I'd be tempted to say that prometheus itself shouldn't be dealing with this at all -- and we just leave that to outside projects to do all the query merging.

One of the core principles of Prometheus is that it doesn't depend on anything beyond being able to get to its targets and the alertmanager.

If this is the case, why is the remote_read option even there? If Prometheus really wants to stick with that hard line then it shouldn't support remote read at all.

If you need that more reliably, then federation is a better and more efficient approach. And even in that case you still need to tolerate missing data. You don't want one datacenter going down to take out all your monitoring.

True, but that is completely controllable by your configuration. One DC taking down all of monitoring would assume you have a single Prometheus remote-reading from everything and that's the only node for alerts and dashboards :)

@brian-brazil (Contributor)

I would argue that it is not most of the time -- especially for a metrics system.

This is a monitoring system though, not a metrics system. If this is a problem for you, your alerts should take this into account. Combining local and remote series together also seems a bit dubious, as it's not the usual hierarchy approach. Your actual setup would be useful to understand.

Sure, but this means that things like rate etc. aren't 100% accurate, not that we'd just drop sections of data from the response.

It does mean we drop sections of data; for example, if a scrape fails we don't attempt to fill the gap. Rate is designed to be resilient to this, and you also need to take it into account in PromQL expressions more generally.

What we don't do is ingest partial data for a target, as that'd be virtually impossible to reason about. An entire target or chunk of time is workable though, as that's the same as the target or Prometheus being down which you already need to allow for.

Which is why I'd be tempted to say that prometheus itself shouldn't be dealing with this at all -- and we just leave that to outside projects to do all the query merging.

That's one possibility, given Prometheus is going for reliability over accuracy. If you want accuracy over reliability then a different tool may be more appropriate. If it's just checking that there are no warnings, it'd seem simpler to do that in the client code, rather than managing a whole new thing.

If this is the case, why is the remote_read option even there? If Prometheus really wants to stick with that hard line then it shouldn't support remote read at all.

It usually works and that can be useful, but that doesn't mean we should take down monitoring when it fails. For example, a non-critical capacity planning alert that uses older data isn't impacted if it's broken for an hour, nor are dashboards temporarily not showing data more than a month back.

One DC taking down all of monitoring would assume you have a single Prometheus remote-reading from everything and that's the only node for alerts and dashboards :)

Indeed, that'd be a bad idea, which is why I'm interested in what the actual use case for this is. By default I presume this is the case, given how often users attempt it in various forms.

stale bot added the stale label Feb 19, 2020
Base automatically changed from master to main February 23, 2021 19:36
stale bot removed the stale label Feb 23, 2021
stale bot added the stale label Apr 26, 2021
@bwplotka (Member) left a comment

Sorry for the massive lag.

The code LGTM, but it would be nice to rebase it and see how it looks on the latest code.

(Triaged during our bug scrub session)

@@ -614,6 +614,9 @@ type RemoteReadConfig struct {
// RequiredMatchers is an optional list of equality matchers which have to
// be present in a selector to query the remote read endpoint.
RequiredMatchers model.LabelSet `yaml:"required_matchers,omitempty"`

// Required makes reads from this remote-read endpoint required for querying
Suggested change:
- // Required makes reads from this remote-read endpoint required for querying
+ // Required makes reads from this remote-read endpoint required for querying.

@bwplotka (Member) commented Feb 6, 2024

Hello from the Bug Scrub again!

Really sorry for so many delays on this, but it looks like there is no activity. Let's close for now and reopen if needed. Thanks for the work!

@bwplotka closed this Feb 6, 2024
Successfully merging this pull request may close: Option to make failures of non-primary reads not warnings (#5662)