-
Notifications
You must be signed in to change notification settings - Fork 620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ca]: Fix GetRemoteSignedCertificate retry #2063
[ca]: Fix GetRemoteSignedCertificate retry #2063
Conversation
602714c
to
4a94b21
Compare
Codecov Report
@@ Coverage Diff @@
## master #2063 +/- ##
==========================================
+ Coverage 54.21% 54.26% +0.05%
==========================================
Files 111 111
Lines 19354 19366 +12
==========================================
+ Hits 10492 10509 +17
- Misses 7597 7603 +6
+ Partials 1265 1254 -11 Continue to review full report at Codecov.
|
4a94b21
to
9564d69
Compare
ca/certificates_test.go
Outdated
select { | ||
case err = <-completed: | ||
assert.Equal(t, grpc.Code(err), codes.DeadlineExceeded) | ||
case <-time.After(570 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this time.Second
meant to be time.Millisecond
? And if so, is 70ms of slack enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the comment above says 5 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, yes it was supposed to be time.Millisecond
. I can bump this up to 1 second.
9564d69
to
1c7f89d
Compare
ca/certificates_test.go
Outdated
select { | ||
case err = <-completed: | ||
assert.Equal(t, grpc.Code(err), codes.DeadlineExceeded) | ||
case <-time.After(1250 * time.Millisecond): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd bump the timeout to a few seconds. You can never be too careful. Also please update the comment above, which I don't think is quite right.
ca/certificates_test.go
Outdated
select { | ||
case <-completed: | ||
assert.FailNow(t, "GetRemoteSignedCertificate should wait at least 3 seconds") | ||
case <-time.After(1 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment above says 5 seconds, assertion failure message says 3 seconds...
This is tricky because the client side has no way to distinguish an unresponsive manager from a CA that is taking a long time to issue the certificate. The client needs to decide whether to try reconnecting to another manager, but it doesn't have much information to use for making that decision. If |
ca/certificates.go
Outdated
time.Sleep(expBackoff.Proceed(nil)) | ||
select { | ||
case <-ctx.Done(): | ||
conn.Close(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking the parameter to Close
should be true
, since it's really GetRemoteSignedCertificate
that's timing out, and not necessarily anything wrong with the manager. But one could make an argument for keeping it as false
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair.
ca/certificates.go
Outdated
case err != nil: // this was a deadline exceeded error - we need to figure out which context | ||
select { // the entire `GetRemoteSignedCertificate` call context was cancelled - return the error | ||
case <-ctx.Done(): | ||
conn.Close(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as https://github.com/docker/swarmkit/pull/2063/files#r108485822
fa5f8be
to
3ac8569
Compare
@aaronlehmann I think I've addressed all the comments, and updated |
LGTM |
ping @diogomonica please review |
Not a big deal, but the new unit tests will run twice: Maybe they should exit early if |
@aaronlehmann makes sense, will skip if it's true (because we don't actually need an external CA server, and it's just one more listening port eating up resources :)) |
3ac8569
to
6731b83
Compare
Thanks. Can you do it in #2064 as well? There are probably a few existing |
Yep, will do. |
ca/certificates.go
Outdated
|
||
case err != nil: // this was a deadline exceeded error - we need to figure out which context | ||
select { // the entire `GetRemoteSignedCertificate` call context was cancelled - return the error | ||
case <-ctx.Done(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to check ctx.Done()
in two places? here and 864?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I can simplify this by adding err == nil &&
to the next case.
…in 5 seconds it errors with a context deadline exceeded and does not retry. Update it so that if the node has not been updated within 5 seconds, attempt to get the node status again after an exponential backoff. If NodeCertificateStatus errors with some other error (not context deadline exceeded), GetRemoteSignedCertificate will try again with a different connection. Signed-off-by: cyli <ying.li@docker.com>
6731b83
to
b727e68
Compare
func TestGetRemoteSignedCertificateWithPending(t *testing.T) { | ||
t.Parallel() | ||
if testutils.External { | ||
// we don't actually need an external signing server, since we're faking a CA TestCAServerUpdateRootCA which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the connection to TestCAServerUpdateRootCA
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's weird, I think it was a failed search and replace :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronlehmann discovered an error where if an external CA was unreachable, a new node join would attempt to call
IssueNodeCertificate
on the remote CA 3 times at 5 second intervals, resulting in 3 different node entries. The assumption would be thatGetRemoteSignedCertificate
would be called, which would callIssueNodeCertificate
on the remote CA once, and then just keep callingNodeCertificateStatus
.The issue seems to be that we had a context timeout of 5 seconds on the call to
NodeCertificateStatus
. However, if the call toNodeCertificateStatus
errors (which it would with the timeout), rather than retryNodeCertificateStatus
later,GetRemoteSignedCertificate
just returns.This PR updates
GetRemoteSignedCertificate
so if the call toNodeCertificateStatus
times out after 5 seconds, it retries after an exponential backoff. If it errors due to some other reason (not timeout), it will close the existing connection and open a new one, and try again. This means thatGetRemoteSignedCertificate
could take a very long time, which is I think the expected behavior.