-
Notifications
You must be signed in to change notification settings - Fork 39.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubeadm: Controllable timeout for join failures #60983
Conversation
@rosti: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
for _, endpoint := range endpoints { | ||
wg.Add(1) | ||
go func(apiEndpoint string) { | ||
retries := constants.DiscoveryTimeout / constants.DiscoveryRetryInterval | ||
defer wg.Done() | ||
wait.Until(func() { | ||
fmt.Printf("[discovery] Trying to connect to API Server %q\n", apiEndpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Show the retry times here? So that users can get to know the progress.
defer wg.Done() | ||
wait.Until(func() { | ||
fmt.Printf("[discovery] Trying to connect to API Server %q\n", apiEndpoint) | ||
cfg, err := fetchKubeConfigFunc(apiEndpoint) | ||
if err != nil { | ||
fmt.Printf("[discovery] Failed to connect to API Server %q: %v\n", apiEndpoint, err) | ||
retries-- | ||
if retries == 0 { | ||
resultingError = err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to add a customized message to the error, like [discovery] Aborted after retrying xx times: <error msg here>
. WDYT?
timeout, _ := strconv.Atoi(endpoint) | ||
time.Sleep(time.Second * time.Duration(timeout)) | ||
return kubeconfigutil.CreateBasic(endpoint, "foo", "foo", []byte{}), nil | ||
}) | ||
if err != nil { | ||
t.Errorf("failed TestRunForEndpointsAndReturnFirst: unexpected error %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just t.Errorf("unexpected error: %v for test %s", err, idx)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The format follows the message from line 82 of the same file. I can simplify it though (as long as this does not break some test result analysis script somewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go ahead. This won't break the test.
Updated with the suggestions made by @dixudx |
if retryCount > retries { | ||
resultingError = err | ||
close(stopChan) | ||
fmt.Printf("[discovery] Abort connecting to API Server %q after %d retries: %v\n", apiEndpoint, retries, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about resultingError = fmt.Errorf("abort connecting to API Server %q after %d retries: %v\n", apiEndpoint, retries, err)
?
timeout, _ := strconv.Atoi(endpoint) | ||
time.Sleep(time.Second * time.Duration(timeout)) | ||
return kubeconfigutil.CreateBasic(endpoint, "foo", "foo", []byte{}), nil | ||
}) | ||
if err != nil { | ||
t.Errorf("unexpected error: %v for TestRunForEndpointsAndReturnFirst", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about t.Errorf("unexpected error: %v for test %s", err, idx)
?
We should clearly get to know which test case fails.
Again, updated with latest suggestions. |
@@ -206,7 +218,7 @@ func runForEndpointsAndReturnFirst(endpoints []string, fetchKubeConfigFunc func( | |||
}(endpoint) | |||
} | |||
wg.Wait() | |||
return resultingKubeConfig | |||
return resultingKubeConfig, resultingError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resultingError
can be mutated if we've got multiple endpoints, right?
If so, I'd prefer not to return an extra error, which will cause potential issues.
For example, if one of three endpoints is accessible, in this case, we should return an available kubeconfig with nil error, right? Instead of current non-nil resultingError
.
Then what we really need to do is to make sure kubeconfig is not nil when using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are quite right there. Somehow I missed that (probably because I test with a single master, hence a single endpoint)
Rewrote the main part of the PR as the previous implementation did not deal too well with multiple API servers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
ping @luxas @timothysc for another look. Thanks. |
/retest |
Updated the PR by introducing the |
Marking this as a WIP due to a failed test, that requires a bit more thinking to get over. |
[MILESTONENOTIFIER] Milestone Removed From Pull Request @dixudx @luxas @rosti @timothysc Important: This pull request was missing labels required for the v1.11 milestone for more than 3 days: kind: Must specify exactly one of |
/retest |
select { | ||
case <-time.After(timeout): | ||
close(stopChan) | ||
err := fmt.Errorf("Abort connecting to API servers after timeout of %v", timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Abort/abort
@@ -205,8 +210,21 @@ func runForEndpointsAndReturnFirst(endpoints []string, fetchKubeConfigFunc func( | |||
}, constants.DiscoveryRetryInterval, stopChan) | |||
}(endpoint) | |||
} | |||
wg.Wait() | |||
return resultingKubeConfig | |||
timeout := constants.DefaultDiscoveryTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set the default value for field DiscoveryTimeout
in SetDefaults_NodeConfiguration
instead of defining here?
Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
Updated the PR with @xiangpengzhao 's suggestions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could nit on some style things, but lets get this unblocked now that 1.11 is open
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dixudx, rosti, timothysc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue (batch tested with PRs 60983, 62012, 61892, 62051, 62067). If you want to cherry-pick this change to another branch, please follow the instructions here. |
@rosti: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What this PR does / why we need it:
This PR introduces a timeout for
kubeadm join
. During that time kubeadm will try to join as many times as possible. The timeout can be controlled via thediscoveryTimeout
config option. Its default value is 5 minutes.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes kubernetes/kubeadm#677
Special notes for your reviewer:
/cc @kubernetes/sig-cluster-lifecycle-pr-reviews
/area kubeadm
/assign @luxas
/assign @timothysc
Release note: