kubeadm: Controllable timeout for join failures #60983

Merged
merged 1 commit into from
Apr 3, 2018

Conversation

rosti
Contributor

@rosti rosti commented Mar 9, 2018

What this PR does / why we need it:

This PR introduces a timeout for kubeadm join. During that time kubeadm will try to join as many times as possible. The timeout can be controlled via the discoveryTimeout config option. Its default value is 5 minutes.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes kubernetes/kubeadm#677

Special notes for your reviewer:

/cc @kubernetes/sig-cluster-lifecycle-pr-reviews
/area kubeadm
/assign @luxas
/assign @timothysc

Release note:

kubeadm: Introduce join timeout that can be controlled via the discoveryTimeout config option (set to 5 minutes by default).

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. area/kubeadm size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 9, 2018
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 9, 2018
@k8s-ci-robot
Contributor

@rosti: Reiterating the mentions to trigger a notification:
@kubernetes/sig-cluster-lifecycle-pr-reviews

In response to this:

What this PR does / why we need it:

This PR introduces a 5 minute timeout for kubeadm join. During that time kubeadm will try to join as many times as possible (currently at 5 second intervals, thus having 60 retries).

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes kubernetes/kubeadm#677

Special notes for your reviewer:

/cc @kubernetes/sig-cluster-lifecycle-pr-reviews
/area kubeadm
/assign @luxas
/assign @timothysc

Release note:

kubeadm: Introduce join timeout (currently 5 minutes) instead of waiting indefinitely.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 10, 2018
@timothysc timothysc added this to the v1.11 milestone Mar 11, 2018
for _, endpoint := range endpoints {
	wg.Add(1)
	go func(apiEndpoint string) {
		retries := constants.DiscoveryTimeout / constants.DiscoveryRetryInterval
		defer wg.Done()
		wait.Until(func() {
			fmt.Printf("[discovery] Trying to connect to API Server %q\n", apiEndpoint)
Member

Show the retry count here, so that users can see the progress?

defer wg.Done()
wait.Until(func() {
	fmt.Printf("[discovery] Trying to connect to API Server %q\n", apiEndpoint)
	cfg, err := fetchKubeConfigFunc(apiEndpoint)
	if err != nil {
		fmt.Printf("[discovery] Failed to connect to API Server %q: %v\n", apiEndpoint, err)
		retries--
		if retries == 0 {
			resultingError = err
Member

Better to add a customized message to the error, like [discovery] Aborted after retrying xx times: <error msg here>. WDYT?

	timeout, _ := strconv.Atoi(endpoint)
	time.Sleep(time.Second * time.Duration(timeout))
	return kubeconfigutil.CreateBasic(endpoint, "foo", "foo", []byte{}), nil
})
if err != nil {
	t.Errorf("failed TestRunForEndpointsAndReturnFirst: unexpected error %v", err)
Member

Just t.Errorf("unexpected error: %v for test %s", err, idx)?

Contributor Author

The format follows the message from line 82 of the same file. I can simplify it though (as long as this does not break some test result analysis script somewhere).

Member

Go ahead. This won't break the test.

@rosti
Contributor Author

rosti commented Mar 12, 2018

Updated with the suggestions made by @dixudx

if retryCount > retries {
	resultingError = err
	close(stopChan)
	fmt.Printf("[discovery] Abort connecting to API Server %q after %d retries: %v\n", apiEndpoint, retries, err)
Member

How about resultingError = fmt.Errorf("abort connecting to API Server %q after %d retries: %v\n", apiEndpoint, retries, err)?

	timeout, _ := strconv.Atoi(endpoint)
	time.Sleep(time.Second * time.Duration(timeout))
	return kubeconfigutil.CreateBasic(endpoint, "foo", "foo", []byte{}), nil
})
if err != nil {
	t.Errorf("unexpected error: %v for TestRunForEndpointsAndReturnFirst", err)
Member

How about t.Errorf("unexpected error: %v for test %s", err, idx)?

That way we can clearly tell which test case failed.

@rosti
Contributor Author

rosti commented Mar 13, 2018

Again, updated with latest suggestions.

@@ -206,7 +218,7 @@ func runForEndpointsAndReturnFirst(endpoints []string, fetchKubeConfigFunc func(
 		}(endpoint)
 	}
 	wg.Wait()
-	return resultingKubeConfig
+	return resultingKubeConfig, resultingError
Member

@dixudx dixudx Mar 14, 2018

resultingError can be mutated if we've got multiple endpoints, right?

If so, I'd prefer not to return an extra error, which will cause potential issues.
For example, if one of three endpoints is accessible, we should return the available kubeconfig with a nil error rather than the current non-nil resultingError, right?

Then what we really need to do is to make sure kubeconfig is not nil when using it.

Contributor Author

Yes, you are quite right there. Somehow I missed that (probably because I test with a single master, hence a single endpoint).

@rosti
Contributor Author

rosti commented Mar 14, 2018

Rewrote the main part of the PR as the previous implementation did not deal too well with multiple API servers.

Member

@dixudx dixudx left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 15, 2018
@dixudx
Member

dixudx commented Mar 15, 2018

ping @luxas @timothysc for another look. Thanks.

@rosti
Contributor Author

rosti commented Mar 19, 2018

/retest

@rosti rosti changed the title kubeadm: Timeout after 5 minutes of join failures kubeadm: Controllable timeout for join failures Mar 21, 2018
@rosti
Contributor Author

rosti commented Mar 21, 2018

Updated the PR by introducing the discoveryTimeout config option. Also updated the PR name, text and release note in accordance with the change.

@rosti rosti changed the title kubeadm: Controllable timeout for join failures [WIP] kubeadm: Controllable timeout for join failures Mar 21, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2018
@rosti
Contributor Author

rosti commented Mar 21, 2018

Marking this as WIP due to a failed test that requires a bit more thought to resolve.

@rosti rosti changed the title [WIP] kubeadm: Controllable timeout for join failures kubeadm: Controllable timeout for join failures Mar 22, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 22, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Removed From Pull Request

@dixudx @luxas @rosti @timothysc

Important: This pull request was missing labels required for the v1.11 milestone for more than 3 days:

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.
priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

@rosti
Contributor Author

rosti commented Mar 26, 2018

/retest

select {
case <-time.After(timeout):
	close(stopChan)
	err := fmt.Errorf("Abort connecting to API servers after timeout of %v", timeout)
Contributor

s/Abort/abort

@@ -205,8 +210,21 @@ func runForEndpointsAndReturnFirst(endpoints []string, fetchKubeConfigFunc func(
}, constants.DiscoveryRetryInterval, stopChan)
}(endpoint)
}
wg.Wait()
-return resultingKubeConfig
+timeout := constants.DefaultDiscoveryTimeout
Contributor

Set the default value for field DiscoveryTimeout in SetDefaults_NodeConfiguration instead of defining here?

Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
@rosti
Contributor Author

rosti commented Apr 3, 2018

Updated the PR with @xiangpengzhao's suggestions.

Member

@timothysc timothysc left a comment

I could nit on some style things, but let's get this unblocked now that 1.11 is open.
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 3, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dixudx, rosti, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 3, 2018
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 60983, 62012, 61892, 62051, 62067). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 7973c54 into kubernetes:master Apr 3, 2018
@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 3, 2018

@rosti: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce | 230a9c6 | link | /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@rosti rosti deleted the join-timeout branch April 26, 2018 12:59
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/kubeadm
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
milestone/removed
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/cluster-lifecycle - Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
size/M - Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

kubeadm join discovery will loop infinitely upon failure
8 participants