
Implement Required option for remote_read #5667

Closed

Conversation

@jacksontj (Contributor)

This allows the user to define whether the data from a remote-read endpoint is required or not. The default maintains the current behavior (where all errors from a remote read are marked as warnings).

In addition to the added flexibility, IMO this is a cleaner implementation, as the merge querier no longer has to deal with primary vs. secondary queriers.

Fixes #5662

Signed-off-by: Thomas Jackson jacksontj.89@gmail.com
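
For illustration, a minimal sketch of how the proposed setting might look in a prometheus.yml remote_read block. The required key name and its false default are assumptions based on this PR's description and diff; the option was not merged.

remote_read:
  - url: "http://longterm-storage.example.com/api/v1/read"
    # Proposed here: a failure from this endpoint fails the whole query
    # instead of only surfacing as a warning.
    required: true
  - url: "http://best-effort-storage.example.com/api/v1/read"
    # Assumed default: keep today's behavior, where failures from this
    # endpoint are reported as query warnings.
    required: false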

@brian-brazil (Contributor)

Per #5662, this doesn't make sense in a sane Prometheus setup.

@tomwilkie (Member)

@jacksontj There was some disagreement at the recent Prometheus dev summit about this topic. The two positions are:

  • Prometheus should error on queries when remote read fails, by default. This should be overridable per query. This would prevent old clients that don't know about remote read from "accidentally" showing bad data and ignoring warnings. New clients that understand warning responses can override this default, either at the user's discretion or all the time.

  • The current code, which will return warnings on queries when remote read fails. "Old" (ie most) clients will ignore these warnings and show bad data.

There is still some discussion to be had to settle this. Having it specified per remote read datasource is an interesting idea, and I have to think about the impact of this. But it sounds like you're suggesting a fix to a similar problem?

@brian-brazil (Contributor)

"Old" (ie most) clients will ignore these warnings and show bad data.

Incomplete, not bad. In a standard setup where a Prometheus is only requesting its own previously written data, data past the retention period will be missing.

@jacksontj (Contributor, Author)

@tomwilkie yes, sounds like we are working on similar issues :)

This should be overridable per query

IMO the "required-ness" of data is not an attribute of the query but an attribute of the infrastructure. I covered this in some detail in the linked issue (#5662 (comment)) -- so I'll link instead of copy/pasting :)

The current code, which will return warnings on queries when remote read fails. "Old" (ie most) clients will ignore these warnings and show bad data.

This is an unfortunate byproduct of how the warnings were added. IMO these should all be marked as required, as that was the previous behavior -- this is effectively a backwards-incompatible change which can dramatically impact the outcome of queries.

Having it specified per remote read datasource is an interesting idea

IMO this is the level of granularity you want -- the "correctness" of the data is based on what setup you have defined. Presumably the PromQL engine is already smart enough to only query storage that has data relevant to the query -- so the operator needs to let us know in config whether this data is unique or not -- and IMO if you added a remote-read endpoint it likely is (which is why I'd prefer it defaulting to required). Alternatively, Prometheus does no remote-read aggregation at all and that is left to tools more suited to the problem (e.g. promxy).

Incomplete, not bad.

This entirely depends on the configuration of Prometheus. If the remote-read endpoint has different data than the local Prometheus, the data returned will be a subset of the actual data -- which makes it both incomplete and bad. Incomplete because it doesn't have all the data required, and bad because the value is incorrect.

@brian-brazil (Contributor) commented Jun 26, 2019

IMO the "required-ness" of data is not an attribute of the query but an attribute of the infrastructure

I'd disagree, it's often going to be what is doing the query and how the end data is being used. If it's a reporting query (e.g. capacity planning over the last N months) it's probably required, anything else (e.g. dashboards) probably not. A reporting query like that could check for the warnings, or we could add an opt-in URL parameter to make warnings errors for things like that.

IMO these should all be marked as required, as that was the previous behavior -- this is effectively a backwards-incompatible change which can dramatically impact the outcome of queries.

The previous behaviour was only due to no one getting around to that yet. And we're allowed backwards-incompatible changes in remote storage.

Keep in mind also that Prometheus is a monitoring system that happens to contain a database, not a database for which 100% query accuracy is a goal.

Presumably the PromQL engine is already smart enough to only query storage that has data relevant to the query

In general only for special non-storage remote read endpoints, which doesn't help you I believe. If you presume that Prometheus always queries all remote read endpoints you won't be far off (read_recent is an optimisation to avoid unnecessary RTTs).

so the operator needs to let us know in config whether this data is unique or not

That's roughly speaking what read_recent is. I'd be dubious of the reliability of a setup that was pulling in data from other Prometheuses via remote read though, whether directly or via a clustered storage system. Similar considerations to the Thanos Ruler apply: only use it if you absolutely have to.
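
For context, a minimal illustration of the existing read_recent setting on a remote_read endpoint (the URL is a placeholder):

remote_read:
  - url: "http://long-term-storage.example.com/api/v1/read"
    # false (the default) skips this endpoint for time ranges that local
    # storage should have complete data for; true queries it regardless.
    read_recent: false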

This entirely depends on the configuration of prometheus.

I'd argue that any setup that has a hard dependency on remote data is incorrect. One of the core principles of Prometheus is that it doesn't depend on anything beyond being able to get to its targets and the alertmanager.

Incomplete, not bad.

If you need that more reliably, then federation is a better and more efficient approach. And even in that case you still need to tolerate missing data. You don't want one datacenter going down to take out all your monitoring.
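
For reference, a minimal sketch of the federation approach mentioned above, with a placeholder target and match[] selector:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'dc1-prometheus.example.com:9090'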

If you explain more about your actual setup it would help, rather than arguing in abstracts.

@jacksontj (Contributor, Author)

I'd disagree, it's often going to be what is doing the query and how the end data is being used.

And I continue to disagree here as well :) As a very simple example, say the query is sum(foo) and there are 2 series, 1 local: foo{a="1"} 2 and 1 remote: foo{a="2"} 2. The result is obviously 4 -- but in the event of the remote call failing the result will be 2, which I argue is both incomplete and bad (since it's wrong). There may be situations where being wrong is acceptable, but I would argue that it is not most of the time -- especially for a metrics system.

Keep in mind also that Prometheus is a monitoring system that happens to contain a database, not a database for which 100% query accuracy is a goal.

Sure, but this means that things like rate etc. aren't 100% accurate, not that we'd just drop sections of data from the response.

That's roughly speaking what read_recent is. I'd be dubious of the reliability of a setup that was pulling in data from other Prometheuses via remote read though, whether directly or via a clustered storage system.

Which is why I'd be tempted to say that prometheus itself shouldn't be dealing with this at all -- and we just leave that to outside projects to do all the query merging.

One of the core principles of Prometheus is that it doesn't depend on anything beyond being able to get to its targets and the alertmanager.

If this is the case, why is the remote_read option even there? If Prometheus really wants to stick with that hard line then it shouldn't support remote read at all.

If you need that more reliably, then federation is a better and more efficient approach. And even in that case you still need to tolerate missing data. You don't want one datacenter going down to take out all your monitoring.

True, but that is completely controllable by your configuration. One DC taking down all of monitoring would assume you have a single Prometheus remote-reading from everything and that's the only node for alerts and dashboards :)

@brian-brazil (Contributor)

I would argue that it is not most of the time -- especially for a metrics system.

This is a monitoring system though, not a metrics system. If this is a problem for you, your alerts should take this into account. Combining local and remote series together also seems a bit dubious, as it's not the usual hierarchy approach. Your actual setup would be useful to understand.

Sure, but this means that things like rate etc. aren't 100% accurate, not that we'd just drop sections of data from the response.

It does mean we drop sections of data; for example, if a scrape fails we don't attempt to fill the gap. Rate is designed to be resilient to this, and you also need to take it into account in PromQL expressions more generally.

What we don't do is ingest partial data for a target, as that'd be virtually impossible to reason about. An entire target or chunk of time is workable though, as that's the same as the target or Prometheus being down which you already need to allow for.

Which is why I'd be tempted to say that prometheus itself shouldn't be dealing with this at all -- and we just leave that to outside projects to do all the query merging.

That's one possibility, given Prometheus is going for reliability over accuracy. If you want accuracy over reliability then a different tool may be more appropriate. If it's just checking that there are no warnings, it'd seem simpler to do that in the client code, rather than managing a whole new thing.

If this is the case, why is the remote_read option even there? If Prometheus really wants to stick with that hard line then it shouldn't support remote read at all.

It usually works and that can be useful, but that doesn't mean we should take down monitoring when it fails. For example, a non-critical capacity planning alert that uses older data isn't impacted if it's broken for an hour, nor are dashboards temporarily not showing data more than a month back.

One DC taking down all of monitoring would assume you have a single Prometheus remote-reading from everything and that's the only node for alerts and dashboards :)

Indeed, that'd be a bad idea, which is why I'm interested in what the actual use case for this is. By default I presume this is the case, given how often users attempt it in various forms.

stale bot added the stale label Feb 19, 2020
Base automatically changed from master to main February 23, 2021 19:36
stale bot removed the stale label Feb 23, 2021
stale bot added the stale label Apr 26, 2021
@bwplotka (Member) left a comment

Sorry for the massive lag.

The code LGTM, but it would be nice to rebase it and see how it looks on the latest code.

(Triaged during our bug scrub session)

@@ -614,6 +614,9 @@ type RemoteReadConfig struct {
// RequiredMatchers is an optional list of equality matchers which have to
// be present in a selector to query the remote read endpoint.
RequiredMatchers model.LabelSet `yaml:"required_matchers,omitempty"`

// Required makes reads from this remote-read endpoint required for querying
Suggested change:
- // Required makes reads from this remote-read endpoint required for querying
+ // Required makes reads from this remote-read endpoint required for querying.

@bwplotka (Member) commented Feb 6, 2024

Hello from the Bug Scrub again!

Really sorry for so many delays on this, but it looks like there is no activity. Let's close for now and reopen if needed. Thanks for the work!

@bwplotka closed this Feb 6, 2024
Successfully merging this pull request may close: Option to make failures of non-primary reads not warnings (#5662)