google: retry transient errors in DefaultTokenSource #203

Open · jacobsa opened this issue Oct 20, 2016 · 7 comments

@jacobsa (Contributor) commented Oct 20, 2016

As discussed in GoogleCloudPlatform/gcsfuse#187, the GCE metadata server is unreliable, which can cause DefaultTokenSource to return errors. Supposedly retrying works in these cases. (This is nothing but rumors to me; I haven't experienced this but am responsible for a product whose users have.)

It appears (googleapis/oauth2client#581) that the plan for the Python library is to add retries for at least "connection refused" errors. Perhaps the Go library should do the same?

@rakyll (Contributor) commented Oct 20, 2016

Sounds good. Any suggestions on how many retries, and over what time frame?

@jacobsa (Contributor, Author) commented Oct 20, 2016

The usual correct answer for this sort of thing seems to be randomized exponential backoff forever (capping the sleep time at a few seconds), until the context is cancelled or an error that doesn't appear to be transient is encountered. I don't see any obvious reason not to do that here.
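
For concreteness, a minimal sketch of that shape in Go: retry with randomized ("full jitter") exponential backoff, capped at a few seconds, until the context is cancelled or a non-transient error shows up. `fetchToken` and `isTransient` are hypothetical placeholders for the metadata call and for whatever error classification the library settles on, and the exact bounds are illustrative.

```go
package retry

import (
	"context"
	"math/rand"
	"time"

	"golang.org/x/oauth2"
)

// retryToken keeps calling fetchToken until it succeeds, the error is judged
// permanent, or ctx is cancelled. Sleeps are randomized and the backoff cap
// doubles each attempt, never exceeding maxDelay.
func retryToken(ctx context.Context, fetchToken func() (*oauth2.Token, error), isTransient func(error) bool) (*oauth2.Token, error) {
	const (
		minDelay = time.Millisecond
		maxDelay = 5 * time.Second
	)
	delay := minDelay
	for {
		tok, err := fetchToken()
		if err == nil {
			return tok, nil
		}
		if !isTransient(err) {
			return nil, err // permanent error: give up immediately
		}
		// Sleep a random duration up to the current cap, unless the caller
		// gives up first; the context decides how long "forever" is.
		sleep := time.Duration(rand.Int63n(int64(delay)) + 1)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(sleep):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```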

@jacobsa (Contributor, Author) commented Oct 20, 2016

(Within Google see util/time/backoff.h for an example implementation. If there's not already a standard Go package for this, I'd use that algorithm with parameters of, say, min 1ms and max 5s.)

@rakyll (Contributor) commented Oct 21, 2016

> until the context is cancelled or an error that doesn't appear to be transient is encountered.

We can't retry forever: it has never been documented that the client might block indefinitely if the context isn't cancelled, so that would be a breaking change.

> If there's not already a standard Go package for this

gax has a retrier with backoff: https://godoc.org/github.com/googleapis/gax-go#WithRetry.

@jacobsa (Contributor, Author) commented Oct 21, 2016

I'm not sure I agree it's a breaking change. What's the difference between retrying transient errors for a long time, and an HTTP request to the metadata server that actually just takes a long time? If you decide to put a cap on the maximum time you spend retrying, how do you choose it? 18 hours might as well be forever, but is 30 minutes? At what point does it become compatible with the old behavior? The context is already how the user specifies how long they're willing to wait, and it remains so if you retry until the context is cancelled.

Cool, the gax algorithm looks good (I'd suggest an Initial of 1ms, a Max of something like 5s, and a Multiplier of 1.3, but the exact values aren't super important). But note that it seems to be specific to grpc, which I'm not sure is applicable here?
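
For illustration, here's how those parameters could drive the delays using only gax's `Backoff` type, skipping the grpc-specific `Retryer`/status-code plumbing. This is a sketch under the assumption that `Backoff.Pause` behaves as described in the godoc linked above (a jittered, exponentially growing pause); `attempt` and `isTransient` are hypothetical placeholders.

```go
package retry

import (
	"context"
	"time"

	gax "github.com/googleapis/gax-go"
)

// retryWithGax retries attempt on transient errors, pausing according to a
// gax.Backoff configured with the parameters suggested above, and stopping
// as soon as ctx is done.
func retryWithGax(ctx context.Context, attempt func() error, isTransient func(error) bool) error {
	bo := gax.Backoff{
		Initial:    time.Millisecond, // first pause on the order of 1ms
		Max:        5 * time.Second,  // pauses never exceed ~5s
		Multiplier: 1.3,              // the pause cap grows by ~30% per attempt
	}
	for {
		err := attempt()
		if err == nil || !isTransient(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(bo.Pause()):
		}
	}
}
```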

@rakyll (Contributor) commented Oct 21, 2016

> What's the difference between retrying transient errors for a long time, and an HTTP request to the metadata server that actually just takes a long time?

It matters because the metadata package already has a fixed timeout of 2 seconds; see https://github.com/GoogleCloudPlatform/google-cloud-go/blob/master/compute/metadata/metadata.go#L67. Every metadata request goes through that package, so each attempt is already capped at 2 seconds.
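
(For readers following along, a simplified sketch of what that kind of per-request cap looks like; this is illustrative only, not a copy of the actual compute/metadata transport configuration.)

```go
package retry

import (
	"net"
	"net/http"
	"time"
)

// metadataClient illustrates an HTTP client whose connection setup and
// response headers are each bounded at 2 seconds, so no single attempt
// against the metadata server can hang for long.
var metadataClient = &http.Client{
	Transport: &http.Transport{
		Dial: (&net.Dialer{
			Timeout:   2 * time.Second, // give up on connecting after 2s
			KeepAlive: 30 * time.Second,
		}).Dial,
		ResponseHeaderTimeout: 2 * time.Second, // give up waiting for headers after 2s
	},
}
```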

> But note that it seems to be specific to grpc, which I'm not sure is applicable here?

Will reimplement; I wasn't looking closely and didn't notice that the status codes were grpc-specific.

@jacobsa (Contributor, Author) commented Oct 21, 2016

Thanks, I didn't realize there was a timeout. I still think the context is the one true way for the user to express how long they're willing to wait for the blocking call, but capping the wait time at a small multiple of the per-request timeout would still be very helpful.
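
One way to picture that compromise, as a sketch: derive an overall deadline from a small multiple of the per-request timeout and intersect it with whatever deadline the caller's context already carries. The multiplier of 5 below is illustrative, not a value proposed in this thread; `fetch` stands in for the retry loop sketched earlier.

```go
package retry

import (
	"context"
	"time"

	"golang.org/x/oauth2"
)

// tokenWithCap bounds the total retry time at a small multiple of the
// metadata package's per-request timeout, while still honoring any earlier
// deadline or cancellation on the caller's context.
func tokenWithCap(ctx context.Context, fetch func(context.Context) (*oauth2.Token, error)) (*oauth2.Token, error) {
	const perRequestTimeout = 2 * time.Second // the cap rakyll pointed to in metadata.go
	capCtx, cancel := context.WithTimeout(ctx, 5*perRequestTimeout)
	defer cancel()
	// fetch would run a retry loop like the ones above, stopping when capCtx
	// is done, i.e. whichever of the two deadlines fires first.
	return fetch(capCtx)
}
```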
