Remote read should handle down or misbehaving backends gracefully #2573
Comments
brian-brazil added the kind/enhancement and component/remote storage labels on Apr 4, 2017
Yes, we have the same problem in Cortex, in that we want to return partial results in the event of partial failures (for instance, if S3 is down we still have recent data in the ingesters). Cortex issue: cortexproject/cortex#309

There doesn't seem to be a particularly appropriate HTTP status code for this - some references imply 207 (http://www.restpatterns.org/HTTP_Status_Codes/207_-_Multi-Status) is appropriate, others that it should be 500 with a body. WDYT?
I'd have a standard 200, and the extra information in a warning field in the response.
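For illustration, a 200-with-warnings response could be shaped roughly like this (a sketch only; the struct and field names are assumptions, not the actual Prometheus API types):

```go
// Sketch of a query API response that still returns HTTP 200 when some
// backends failed, carrying the partial-failure details in a warnings field.
package api

import (
	"encoding/json"
	"net/http"
)

type queryResponse struct {
	Status   string      `json:"status"`             // "success" even if a backend was unreachable
	Data     interface{} `json:"data"`               // whatever results are available
	Warnings []string    `json:"warnings,omitempty"` // e.g. "remote read endpoint X is down"
}

func writeResult(w http.ResponseWriter, data interface{}, warnings []string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK) // 200, not 207/500: the query itself succeeded
	json.NewEncoder(w).Encode(queryResponse{
		Status:   "success",
		Data:     data,
		Warnings: warnings,
	})
}
```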
DIBS
@brian-brazil just to make sure I understand the issue, here is what I did:

This shouldn't happen, as the local TSDB should be able to provide some samples if available, and the HTTP header should be:
The HTTP response code should be 200. A 206 is something different.
krasi-georgiev referenced this issue on Nov 7, 2017: [WIP] return query samples if at least one backend has returned some data #3421 (closed)
brian-brazil referenced this issue on Nov 29, 2017: prometheus server do not work if remote_read server is stop responding #3520 (closed)
gouthamve added the priority/P2 label on Jan 18, 2018
brian-brazil referenced this issue on Feb 6, 2018: Prometheus 2.0: after killing remote storage adapter, data is lost on remote storage #3800 (closed)
This should also handle broken checksums in TSDB.
That's a bit of a different problem, but may share the warning plumbing.
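As a rough illustration of what shared warning plumbing might look like (the Warnings type and Merge helper here are hypothetical, not the actual storage package API):

```go
// Sketch of warning plumbing that queriers could share: non-fatal problems
// are collected alongside the data instead of aborting the query.
package storage

// Warnings collects non-fatal problems encountered while serving a query,
// e.g. an unreachable remote read endpoint or a corrupted TSDB block.
type Warnings []error

// Merge combines warnings from several queriers into one slice.
func (w Warnings) Merge(other Warnings) Warnings {
	return append(w, other...)
}
```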
Hi, after going through all the references, I came to the following conclusions. Please do correct me if my interpretation is wrong.

After going through the code, I came to the following conclusions about its implementation:

Is this interpretation of the issue correct?
@brian-brazil, @tomwilkie isn't this solved by #3963?
@sipian Your understanding is correct.
@krasi-georgiev No, that's a bug in PromQL where it wasn't propagating errors correctly.
brian-brazil referenced this issue on Apr 25, 2018: add ignore_error option for remote read #4111 (closed)
mtanda referenced this issue on Jun 27, 2018: Feature Request: support label completion with required_matchers in Prometheus datasource #12253 (open)
pmb311 commented on Jul 26, 2018
This is a major issue for us as we try to scale out while preserving the user experience; it would be amazing if we could get a fix into a tagged release soon.
mknapphrt referenced this issue on Jul 26, 2018: Return whatever data is available when there is a failed remote read #4426 (closed)
giusyfruity commented on Sep 27, 2018
In order for us to move Prometheus into production, it is critical that this issue be fixed first!
foosinn commented on Nov 8, 2018
Seems like it is being worked on in #4832.
Fixed by #4832.
brian-brazil commented on Apr 4, 2017
I'm playing a bit with remote read, and it looks like any failure on the part of the remote read endpoint (e.g. down, malformed response, wrong number of result series) causes the query as a whole to fail.
When any remote read backend fails, the query should continue to work on the rest of the data available.
We may also, in the future, want to report this out to the user via the query API, metrics, etc.
FYI @tomwilkie as you mentioned this being relevant for Cortex.
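For illustration, here is a minimal sketch of the behavior being asked for, assuming a small wrapper around the remote read client (all names and types below are hypothetical, not the actual Prometheus code): a failing backend contributes a warning instead of failing the whole query, so results can still be served from the remaining data.

```go
// Sketch: tolerate a down or misbehaving remote read backend by converting
// its error into a warning rather than failing the query as a whole.
package remote

import (
	"context"
	"fmt"
)

// Series and Query stand in for the real remote read types.
type Series struct{}
type Query struct{}

type client interface {
	Read(ctx context.Context, q *Query) ([]Series, error)
}

// readTolerant queries the remote backend but never fails the overall query:
// on error it returns no series plus a warning the caller can surface.
func readTolerant(ctx context.Context, c client, q *Query) ([]Series, []error) {
	series, err := c.Read(ctx, q)
	if err != nil {
		// The remote backend is down or misbehaving; keep serving local data.
		return nil, []error{fmt.Errorf("remote read failed: %w", err)}
	}
	return series, nil
}
```

The query layer would then merge these warnings with the local TSDB results and expose them to the user (query API, metrics), as discussed above.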