
Remote read should handle down or misbehaving backends gracefully #2573

Closed

brian-brazil opened this issue Apr 4, 2017 · 14 comments

brian-brazil (Member) commented Apr 4, 2017

I'm playing a bit with remote read, and it looks like any failure on the part of the remote read endpoint (e.g. down, malformed response, wrong number of result series) causes the query as a whole to fail.

When any remote read backend fails, the query should continue to work on the rest of the data available.

We may also, in future, want to report this to the user via the query API, metrics, etc.

FYI @tomwilkie as you mentioned this being relevant for Cortex.

tomwilkie (Member) commented Apr 7, 2017

Yes, we have the same problem in Cortex, in that we want to return partial results in the event of partial failures (for instance, if S3 is down we still have recent data in the ingesters). Cortex issue: cortexproject/cortex#309

There doesn't seem to be a particularly appropriate HTTP status code for this - some references imply 207 (http://www.restpatterns.org/HTTP_Status_Codes/207_-_Multi-Status) is appropriate, others that it should be 500 with a body. WDYT?

brian-brazil (Member, Author) commented Apr 7, 2017

I'd have a standard 200, and the extra information in a warning field in the response.
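
For illustration, under that proposal a query whose remote read backend is down might return something along these lines; the `warnings` field name and the exact shape are assumptions here, not a settled API:

```json
{
  "status": "success",
  "warnings": [
    "remote read: Post http://localhost:9201/read: connection refused"
  ],
  "data": {
    "resultType": "vector",
    "result": []
  }
}
```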

krasi-georgiev (Member) commented Nov 4, 2017

DIBS

krasi-georgiev (Member) commented Nov 6, 2017

@brian-brazil just to make sure I understand the issue, here is what I did:
I configured Prometheus to remote write to and remote read from InfluxDB, then stopped the InfluxDB server. Running any query now returns:

{"status":"error","errorType":"internal","error":"error sending request: Post http://localhost:9201/read: dial tcp [::1]:9201: getsockopt: connection refused"}

This shouldn't happen, as the local TSDB should still be able to provide whatever samples it has, and the HTTP header should be:
HTTP/1.1 206 Partial Content
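
For reference, the setup above corresponds to a remote write/read configuration roughly like the following; localhost:9201 is the adapter address taken from the error message, and the /write path is an assumption:

```yaml
# prometheus.yml (sketch): write to and read from InfluxDB via a remote storage adapter on :9201
remote_write:
  - url: "http://localhost:9201/write"   # assumed adapter write endpoint

remote_read:
  - url: "http://localhost:9201/read"    # matches the failing endpoint in the error above
```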

brian-brazil (Member, Author) commented Nov 6, 2017

The HTTP response code should be 200. A 206 is something different (it's for byte-range requests).

gouthamve (Member) commented Mar 5, 2018

This should also handle broken checksums in TSDB.

brian-brazil (Member, Author) commented Mar 5, 2018

That's a bit of a different problem, but may share the warning plumbing.

sipian (Contributor) commented Mar 15, 2018

@brian-brazil @krasi-georgiev

Hi,
I read about this on the GSoC '18 ideas page and would like to contribute to it as part of GSoC.

After going through all the references, I came to the following conclusions. Please do correct me if my interpretation is wrong.

  1. From this, @brian-brazil proposed to handle this in the fanout storage, using the primary and secondary storages.

After going through the code, I came to the following conclusions about its implementation:

  1. In the fanout Querier, if subsequent Querier requests are to remote storage, then:
    • If the primary Querier fails, report an error.
    • If the primary Querier succeeds but a secondary fails, this is the special case.
      In this special case, return an error along with the data collected so far and notify the user that such a case has occurred. (This case needs to be discussed further.)
    • If both Queriers succeed, it's a success.

Is this interpretation of the issue correct?
And if yes, is the proposed implementation a way to solve this issue?
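
As an illustrative sketch of the behaviour being discussed (secondary failures downgraded to warnings rather than failing the whole query), the fanout logic could look roughly like this; the simplified `Querier` interface and the toy queriers below are assumptions for illustration, not Prometheus' actual storage interfaces:

```go
package main

import (
	"errors"
	"fmt"
)

// Querier is a simplified stand-in for a storage querier. The real interface
// returns a SeriesSet; here we return plain sample values to keep the sketch
// self-contained.
type Querier interface {
	Select(matchers string) ([]float64, error)
}

// fanoutQuerier merges a primary (local TSDB) querier with zero or more
// secondary (remote read) queriers.
type fanoutQuerier struct {
	primary     Querier
	secondaries []Querier
}

// Select fails hard only if the primary fails. A failing secondary is
// downgraded to a warning and the partial result is still returned.
func (f *fanoutQuerier) Select(matchers string) (result []float64, warnings []string, err error) {
	res, err := f.primary.Select(matchers)
	if err != nil {
		return nil, nil, err // primary failure: report an error
	}
	result = append(result, res...)

	for _, q := range f.secondaries {
		res, err := q.Select(matchers)
		if err != nil {
			// secondary failure: keep going, surface a warning instead
			warnings = append(warnings, fmt.Sprintf("remote read failed: %v", err))
			continue
		}
		result = append(result, res...)
	}
	return result, warnings, nil
}

// --- toy queriers for the usage example ---

type staticQuerier struct{ samples []float64 }

func (s staticQuerier) Select(string) ([]float64, error) { return s.samples, nil }

type failingQuerier struct{}

func (failingQuerier) Select(string) ([]float64, error) {
	return nil, errors.New("dial tcp [::1]:9201: connection refused")
}

func main() {
	f := &fanoutQuerier{
		primary:     staticQuerier{samples: []float64{1, 2, 3}}, // local TSDB
		secondaries: []Querier{failingQuerier{}},                // remote read endpoint that is down
	}
	result, warnings, err := f.Select(`up{job="node"}`)
	fmt.Println(result, warnings, err)
	// Prints: [1 2 3] [remote read failed: dial tcp [::1]:9201: connection refused] <nil>
}
```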

krasi-georgiev (Member) commented Mar 15, 2018

@brian-brazil, @tomwilkie isn't this solved by #3963?

brian-brazil (Member, Author) commented Mar 15, 2018

@sipian Your understanding is correct.

@krasi-georgiev No, that's a bug in PromQL where it wasn't propagating errors correctly.

pmb311 commented Jul 26, 2018

This is a major issue for us as we try to scale out while preserving the user experience; it would be amazing if we could get a fix into a tagged release soon.

giusyfruity commented Sep 27, 2018

In order for us to move Prometheus into production, it is critical that this issue be fixed first!

foosinn commented Nov 8, 2018

Seems like it is being worked on in #4832.

brian-brazil (Member, Author) commented Nov 30, 2018

Fixed by #4832
