Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(storage): try to reopen for failed Reads #4226

Merged
merged 7 commits into from Jun 23, 2021

Conversation

@tritone
Copy link
Collaborator

@tritone tritone commented Jun 7, 2021

This error can occur in calls to reader.Read if the CFE closes the
connection between when the response headers are received and when
the caller reads from the reader.

Fixes #3040

This error can occur in calls to reader.Read if the CFE closes the
connection between when the response headers are received and when
the caller reads from the reader.

Fixes googleapis#3040
@tritone tritone requested a review from Jun 7, 2021
@tritone tritone requested a review from as a code owner Jun 7, 2021
@google-cla google-cla bot added the cla: yes label Jun 7, 2021
@tritone
Copy link
Collaborator Author

@tritone tritone commented Jun 7, 2021

FYI @neild

Loading

frankyn
frankyn approved these changes Jun 7, 2021
Copy link
Member

@frankyn frankyn left a comment

This LGTM, per offline discussions on GOAWAY being retriable safely with idempotent requests. We have not yet understood how to retry these errors with non-idempotent requests.

Loading

@@ -389,7 +389,17 @@ func shouldRetryRead(err error) bool {
if err == nil {
return false
}
return strings.HasSuffix(err.Error(), "INTERNAL_ERROR") && strings.Contains(reflect.TypeOf(err).String(), "http2")
// Certain HTTP/2 errors are retryable, but the types are not exported.
Copy link

@neild neild Jun 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What errors are not retryable here?

This path (shouldRetryRead) is only consulted when a read has returned a successful response header, but something went wrong while reading the body. Is there a reason not to unconditionally retry the read in this case? (Preferably with backoff.)

Loading

Copy link
Collaborator Author

@tritone tritone Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of this call, "retry" means to call the reader's reopen at the updated offset (which involves a new GET request to the GCS API) and then attempt to continue reading bytes from that point.

I'm a little bit hesitant to retry unconditionally:

  1. If the error from Read is a context timeout/cancellation, we would probably want to return to the caller immediately and not fire off another request.
  2. If there is some error which could indicate problems with the data received, we would definitely want to surface that to the caller (so that they can retry the whole Read from the start of the object if desired). I'm not sure if this is actually something that can occur, though (do you have thoughts?).

I could see having a retry for any http2 error though, if you think it's safe.

Loading

Copy link

@neild neild Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to stop retrying when the context expires/is canceled, of course. (I wouldn't test the error for this, though; test the context directly to see if it is done.)

For any other error--once you've read a response header, I don't think there is any error condition other than variations on "the stream broke". The only point where the remote end can provide a useful error is when sending the response header. I wouldn't limit retries to HTTP/2 errors, because the same applies to HTTP/1 (if someone disables HTTP/2 for some reason) or HTTP/3 (someday).

Loading

Copy link
Collaborator Author

@tritone tritone Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, much appreciated!

I tried to implement this but I'm a bit stuck on what to do with the context. What I'm seeing is that we don't have access to the context via the Read method; it's passed in via NewRangeReader and then embedded in reopen(). Any thoughts on what to do with this? I guess I could have reopen() itself check for context.Done before doing any calls? What do you think?

Loading

Copy link
Collaborator Author

@tritone tritone Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I wrote an implementation doing what I suggested above-- does this seem right to you? I think I should write a test that cancels the context as well.

Loading

Copy link
Collaborator Author

@tritone tritone Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an integration test that makes sure cancelation actually works as expected as well (did this because I wanted to make sure there's no possibility of an infinite loop of calls to reopen at this point)

Loading

Copy link

@neild neild left a comment

LGTM

Loading

@tritone tritone changed the title fix(storage): retry GOAWAY errors for Read fix(storage): try to reopen for failed Reads Jun 22, 2021
Copy link
Member

@codyoss codyoss left a comment

LGTM

Loading

@tritone tritone merged commit 564102b into googleapis:master Jun 23, 2021
4 checks passed
Loading
@tritone tritone deleted the retry-goaway branch Jun 23, 2021
adityamaru added a commit to cockroachdb/google-cloud-go that referenced this issue Jun 23, 2021
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2.

Fixes googleapis#3040
adityamaru added a commit to cockroachdb/vendored that referenced this issue Jun 23, 2021
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jun 23, 2021
This change switches the release-21.1 branch to point to a
fork of v0.45.1 of the google cloud SDK that includes
googleapis/google-cloud-go#4226.

The above change provides better retry on GOAWAY errors which
have become an increasingly common occurrence in our roachtests.

Release note (general change): Switches the release-21.1 branch
to point to a fork of the google cloud SDK at version 0.45.1
with an upstream bug fix
googleapis/google-cloud-go#4226.
adityamaru added a commit to cockroachdb/google-cloud-go that referenced this issue Jun 24, 2021
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2.

Fixes googleapis#3040
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

6 participants