Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[17.03] Fix GetRemoteSignedCertificate retry #2069

Conversation

aaronlehmann
Copy link
Collaborator

Cherry-pick of #2063 to 17.03

@aaronlehmann discovered an error where if an external CA was unreachable, a new node join would attempt to call IssueNodeCertificate on the remote CA 3 times at 5 second intervals, resulting in 3 different node entries. The assumption would be that GetRemoteSignedCertificate would be called, which would call IssueNodeCertificate on the remote CA once, and then just keep calling NodeCertificateStatus.

The issue seems to be that we had a context timeout of 5 seconds on the call to NodeCertificateStatus. However, if the call to NodeCertificateStatus errors (which it would with the timeout), rather than retry NodeCertificateStatus later, GetRemoteSignedCertificate just returns.

This PR updates GetRemoteSignedCertificate so if the call to NodeCertificateStatus times out after 5 seconds, it retries after an exponential backoff. If it errors due to some other reason (not timeout), it will close the existing connection and open a new one, and try again. This means that GetRemoteSignedCertificate could take a very long time, which is I think the expected behavior.

cc @cyli @diogomonica @mavenugo

@codecov
Copy link

codecov bot commented Mar 29, 2017

Codecov Report

Merging #2069 into bump_v17.03.2 will decrease coverage by 0.02%.
The diff coverage is 91.66%.

@@                Coverage Diff                @@
##           bump_v17.03.2    #2069      +/-   ##
=================================================
- Coverage          55.29%   55.26%   -0.03%     
=================================================
  Files                102      102              
  Lines              17142    17154      +12     
=================================================
+ Hits                9479     9481       +2     
- Misses              6498     6518      +20     
+ Partials            1165     1155      -10

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4fcee01...258f3b5. Read the comment docs.

@cyli
Copy link
Contributor

cyli commented Mar 29, 2017

LGTM - thanks for backporting this. I didn't realize the connection broker change and the certificate request config change didn't make it into 17.03.

}
caClient = api.NewNodeCAClient(conn)

case err != nil: // this was a deadline exceeded error - we need to figure out which context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code no longer exists.

…in 5 seconds it errors

with a context deadline exceeded and does not retry.  Update it so that if the node has not
been updated within 5 seconds, attempt to get the node status again after an exponential
backoff.

If NodeCertificateStatus errors with some other error (not context deadline exceeded),
GetRemoteSignedCertificate will try again with a different connection.

Signed-off-by: cyli <ying.li@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
(cherry picked from commit 3ac8569)
@aaronlehmann aaronlehmann force-pushed the fix-node-certificate-reiussue-retry-17.03 branch from 8a533e4 to 258f3b5 Compare March 30, 2017 00:44
@diogomonica diogomonica merged commit 4ff8ef2 into moby:bump_v17.03.2 Mar 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants