storage: automatic retries with exponential backoff for download failures caused by load shed #3040
Is your feature request related to a problem? Please describe.
Can you speak more specifically to the error(s) you are seeing and the library methods where they are returned?
Most methods in this library are retried with exponential backoff on code 429, which is the expected error for too many requests. See
I think the error is when calling Read() with the storage.Reader. The error we saw looked like:
We've also seen
Ah interesting, I've never seen that
Can you clarify which version of the library you are using? Have you done anything unusual with the underlying http client for your storage client (e.g. subbing out using option.WithHTTPClient)?
We have also been seeing this GOAWAY error from
Like @ihuh0 above, the ones we see are
I don't think I'm overriding the http client or anything -- the code I'm using to init the client is all open source, right over here: https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/cloud/gcp/gcs_storage.go#L141
We were planning to wrap the returned storage.Reader in our own io.Reader that inspects any returned errors and automatically re-opens the underlying reader (at a tracked offset) to retry on these GOAWAY errors. Before doing that at the application level, I wanted to check whether the SDK is expected to return these errors, or whether something is wrong or we've misconfigured it somehow.
Hey @dt, thanks for reporting. I've done some research and I think we likely should add retries for this error specifically for reads.
The best description for what I think is happening here is this: golang/go#18639 (comment) (substitute GCS for ALB here). Your use case (intermittent reads over several hours) seems the most prone to this scenario since the period between when the server sends the headers and when the body read calls occur is drawn out, so there's more opportunity for the server to close the connection in the meantime.
The error type is this: https://github.com/golang/go/blob/master/src/net/http/h2_bundle.go#L8359 . Unfortunately I don't think there is any way of detecting this directly, but we already check for http2 INTERNAL errors here so I think it makes sense to add GOAWAY to this as well. Actually, I think there are other errors that might make sense to add as well, but I'll probably stick with GOAWAY for now to be conservative.
Also curious about your use case: have you considered smaller ranged reads, or possibly some kind of buffering? Either would probably increase reliability.