google: retry transient errors in DefaultTokenSource #203

Open · jacobsa opened this issue Oct 20, 2016 · 7 comments

@jacobsa (Contributor) commented Oct 20, 2016

As discussed in GoogleCloudPlatform/gcsfuse#187, the GCE metadata server is unreliable, which can cause DefaultTokenSource to return errors. Supposedly retrying works in these cases. (This is nothing but rumors to me; I haven't experienced this but am responsible for a product whose users have.)

It appears (googleapis/oauth2client#581) that the plan for the Python library is to add retries for at least "connection refused" errors. Perhaps the Go library should do the same?

@rakyll (Contributor) commented Oct 20, 2016

Sounds good. Any suggestions on how many retries, and over what time frame?

@jacobsa (Contributor, Author) commented Oct 20, 2016

The usual correct answer for this sort of thing seems to be randomized exponential backoff forever (capping the sleep time at a few seconds), until the context is cancelled or an error that doesn't appear to be transient is encountered. I don't see any obvious reason not to do that here.
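
For concreteness, a minimal sketch of that shape in Go: retry with randomized ("full jitter") exponential backoff, capped at a few seconds, until the context is cancelled or a non-transient error shows up. `fetchToken` and `isTransient` are hypothetical placeholders for the metadata call and for whatever error classification the library settles on, and the exact bounds are illustrative.

```go
package retry

import (
	"context"
	"math/rand"
	"time"

	"golang.org/x/oauth2"
)

// retryToken keeps calling fetchToken until it succeeds, the error is judged
// permanent, or ctx is cancelled. Sleeps are randomized and the backoff cap
// doubles each attempt, never exceeding maxDelay.
func retryToken(ctx context.Context, fetchToken func() (*oauth2.Token, error), isTransient func(error) bool) (*oauth2.Token, error) {
	const (
		minDelay = time.Millisecond
		maxDelay = 5 * time.Second
	)
	delay := minDelay
	for {
		tok, err := fetchToken()
		if err == nil {
			return tok, nil
		}
		if !isTransient(err) {
			return nil, err // permanent error: give up immediately
		}
		// Sleep a random duration up to the current cap, unless the caller
		// gives up first; the context decides how long "forever" is.
		sleep := time.Duration(rand.Int63n(int64(delay)) + 1)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(sleep):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```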

@jacobsa (Contributor, Author) commented Oct 20, 2016

(Within Google see util/time/backoff.h for an example implementation. If there's not already a standard Go package for this, I'd use that algorithm with parameters of, say, min 1ms and max 5s.)

@rakyll (Contributor) commented Oct 21, 2016

> until the context is cancelled or an error that doesn't appear to be transient is encountered.

We can't retry forever: it has never been documented that the client might block indefinitely if the context isn't cancelled, so that would be a breaking change.

> If there's not already a standard Go package for this

gax has a retrier with backoff: https://godoc.org/github.com/googleapis/gax-go#WithRetry.

@jacobsa (Contributor, Author) commented Oct 21, 2016

I'm not sure I agree it's a breaking change. What's the difference between retrying transient errors for a long time, and an HTTP request to the metadata server that actually just takes a long time? If you decide to put a cap on the maximum time you spend retrying, how do you choose it? 18 hours might as well be forever, but is 30 minutes? At what point does it become compatible with the old behavior? The context is already how the user specifies how long they're willing to wait, and it remains so if you retry until the context is cancelled.

Cool, the gax algorithm looks good (I'd suggest an Initial of 1ms, a Max of something like 5s, and a Multiplier of 1.3, but the exact values aren't super important). But note that it seems to be specific to grpc, which I'm not sure is applicable here?
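
For illustration, here's how those parameters could drive the delays using only gax's `Backoff` type, skipping the grpc-specific `Retryer`/status-code plumbing. This is a sketch under the assumption that `Backoff.Pause` behaves as described in the godoc linked above (a jittered, exponentially growing pause); `attempt` and `isTransient` are hypothetical placeholders.

```go
package retry

import (
	"context"
	"time"

	gax "github.com/googleapis/gax-go"
)

// retryWithGax retries attempt on transient errors, pausing according to a
// gax.Backoff configured with the parameters suggested above, and stopping
// as soon as ctx is done.
func retryWithGax(ctx context.Context, attempt func() error, isTransient func(error) bool) error {
	bo := gax.Backoff{
		Initial:    time.Millisecond, // first pause on the order of 1ms
		Max:        5 * time.Second,  // pauses never exceed ~5s
		Multiplier: 1.3,              // the pause cap grows by ~30% per attempt
	}
	for {
		err := attempt()
		if err == nil || !isTransient(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(bo.Pause()):
		}
	}
}
```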

@rakyll (Contributor) commented Oct 21, 2016

> What's the difference between retrying transient errors for a long time, and an HTTP request to the metadata server that actually just takes a long time?

It matters because the metadata package already has a fixed timeout of 2 seconds; see https://github.com/GoogleCloudPlatform/google-cloud-go/blob/master/compute/metadata/metadata.go#L67. Every metadata request goes through that package, so each attempt is already capped at 2 seconds.
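
(For readers following along, a simplified sketch of what that kind of per-request cap looks like; this is illustrative only, not a copy of the actual compute/metadata transport configuration.)

```go
package retry

import (
	"net"
	"net/http"
	"time"
)

// metadataClient illustrates an HTTP client whose connection setup and
// response headers are each bounded at 2 seconds, so no single attempt
// against the metadata server can hang for long.
var metadataClient = &http.Client{
	Transport: &http.Transport{
		Dial: (&net.Dialer{
			Timeout:   2 * time.Second, // give up on connecting after 2s
			KeepAlive: 30 * time.Second,
		}).Dial,
		ResponseHeaderTimeout: 2 * time.Second, // give up waiting for headers after 2s
	},
}
```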

> But note that it seems to be specific to grpc, which I'm not sure is applicable here?

Will reimplement; I wasn't looking closely and didn't notice that the status codes were grpc-specific.

@jacobsa (Contributor, Author) commented Oct 21, 2016

Thanks, I didn't realize there was a timeout. I still think the context is the one true way for the user to express how long they're willing to wait for the blocking call, but capping the wait time at a small multiple of the per-request timeout would still be very helpful.
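
One way to picture that compromise, as a sketch: derive an overall deadline from a small multiple of the per-request timeout and intersect it with whatever deadline the caller's context already carries. The multiplier of 5 below is illustrative, not a value proposed in this thread; `fetch` stands in for the retry loop sketched earlier.

```go
package retry

import (
	"context"
	"time"

	"golang.org/x/oauth2"
)

// tokenWithCap bounds the total retry time at a small multiple of the
// metadata package's per-request timeout, while still honoring any earlier
// deadline or cancellation on the caller's context.
func tokenWithCap(ctx context.Context, fetch func(context.Context) (*oauth2.Token, error)) (*oauth2.Token, error) {
	const perRequestTimeout = 2 * time.Second // the cap rakyll pointed to in metadata.go
	capCtx, cancel := context.WithTimeout(ctx, 5*perRequestTimeout)
	defer cancel()
	// fetch would run a retry loop like the ones above, stopping when capCtx
	// is done, i.e. whichever of the two deadlines fires first.
	return fetch(capCtx)
}
```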
