
storage: automatic retries with exponential backoff for download failures caused by load shed #3040

Closed
ihuh0 opened this issue Oct 16, 2020 · 6 comments

Comments


@ihuh0 ihuh0 commented Oct 16, 2020

Is your feature request related to a problem? Please describe.
Occasionally we get failures to download objects from GCS because too many requests are being made at a time, resulting in a load-shed error. We considered adding retries in our own code, but we were wondering if this could be handled automatically in the storage library instead. Perhaps it could retry automatically with exponential backoff in this case?

Collaborator

@tritone tritone commented Oct 16, 2020

Can you speak more specifically to the error(s) you are seeing and the library methods where they are returned?

Most methods in this library are retried with exponential backoff on code 429, which is the expected error for too many requests. See the library's retry predicate:

func shouldRetry(err error) bool {


Author

@ihuh0 ihuh0 commented Oct 16, 2020

I think the error occurs when calling Read() on the storage.Reader. The error we saw looked like:
http2: server sent GOAWAY and closed the connection; LastStreamID=15, ErrCode=NO_ERROR, debug="load_shed"

We've also seen read: connection reset by peer errors, which look like they should be retried already. How many times is the request retried, and what is the backoff?


Collaborator

@tritone tritone commented Oct 16, 2020

Ah interesting, I've never seen that load_shed error before-- looks like some form of socket-level error rather than an actual 429 response.

Yes, the connection reset by peer error should be retried until the context timeout. We use the standard backoff here: https://pkg.go.dev/github.com/googleapis/gax-go/v2#Backoff

Can you clarify which version of the library you are using? Have you done anything unusual with the underlying http client for your storage client (e.g. subbing out using option.WithHTTPClient)?



@dt dt commented Jun 2, 2021

Hello!

We have also been seeing this GOAWAY error returned from storage.Reader.Read() calls pretty often lately in CockroachDB's nightly integration tests. These tests vary: some open, read, and close thousands of smaller files, while others -- the ones that seem to fail with this error most -- open a handful of multi-GB files and keep them open for intermittent Read calls over a few hours. (For a sense of how often we see it: https://github.com/cockroachdb/cockroach/issues?q=is%3Aissue+GOAWAY)

Like @ihuh0 above, the ones we see are ErrCode=NO_ERROR, but ours typically have debug="server_shutting_down".

I don't think I'm overriding the http client or anything -- the code I'm using to init the client is all open source, right over here: https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/cloud/gcp/gcs_storage.go#L141

We were planning to wrap the returned storage.Reader in our own io.Reader that would inspect any returned errors and automatically re-open the underlying reader (at a tracked offset) to retry on these GOAWAY errors. But before we start doing that at the application level, I wanted to check whether it is expected that the SDK would return these errors, or whether it sounds like something is wrong or we'd misconfigured it somehow.


Collaborator

@tritone tritone commented Jun 7, 2021

Hey @dt, thanks for reporting. I've done some research and I think we likely should add retries for this error specifically for reads.

The best description for what I think is happening here is this: golang/go#18639 (comment) (substitute GCS for ALB here). Your use case (intermittent reads over several hours) seems the most prone to this scenario since the period between when the server sends the headers and when the body read calls occur is drawn out, so there's more opportunity for the server to close the connection in the meantime.

The error type is this: https://github.com/golang/go/blob/master/src/net/http/h2_bundle.go#L8359 . Unfortunately I don't think there is any way of detecting this directly, but we already check for http2 INTERNAL errors here, so I think it makes sense to add GOAWAY to that check as well. There are other errors that might make sense to add too, but I'll stick with GOAWAY for now to be conservative.

Also curious about your use case-- have you considered smaller ranged reads? Or some kind of buffering potentially? Either would probably increase reliability.


tritone added a commit to tritone/google-cloud-go that referenced this issue Jun 7, 2021
This error can occur in calls to reader.Read if the CFE closes the
connection between when the response headers are received and when
the caller reads from the reader.

Fixes googleapis#3040
tritone added a commit that referenced this issue Jun 23, 2021
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2.

Fixes #3040
adityamaru added a commit to cockroachdb/google-cloud-go that referenced this issue Jun 23, 2021
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2.

Fixes googleapis#3040
adityamaru added a commit to cockroachdb/google-cloud-go that referenced this issue Jun 24, 2021
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2.

Fixes googleapis#3040
Collaborator

@tritone tritone commented Jun 28, 2021

The fix to this has been released in storage/v1.16.0.

