Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubeadm: fix some retry logic in PatchNodeOnce #105343

Merged
merged 1 commit into from
Oct 17, 2021

Conversation

jonyhy96
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

fix the retry logic in PatchNodeOnce

Special notes for your reviewer:

None

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 29, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @jonyhy96. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 29, 2021
@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 29, 2021
@jonyhy96 jonyhy96 changed the title fix: make patch node once support retry when timeout kubeadm: fix some retry logic in PatchNodeOnce Sep 29, 2021
@RA489
Copy link

RA489 commented Sep 29, 2021

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 29, 2021
@neolit123
Copy link
Member

/priority awaiting-more-evidence
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. triage/accepted Indicates an issue or PR is ready to be actively worked on. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 29, 2021
@jonyhy96
Copy link
Contributor Author

Thanks for review! ping @neolit123

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 12, 2021
@jonyhy96
Copy link
Contributor Author

/retest

1 similar comment
@jonyhy96
Copy link
Contributor Author

/retest

@neolit123
Copy link
Member

neolit123 commented Oct 12, 2021

the kind CI failures seem somewhat related to the patch node changes here:

I1012 16:23:28.581402     214 patchnode.go:31] [patchnode] Uploading the CRI Socket information "unix:///run/containerd/containerd.sock" to the Node API object "kind-worker2" as an annotation
I1012 16:23:28.599223     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 17 milliseconds
I1012 16:23:28.612455     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 1 milliseconds
I1012 16:23:28.666911     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 2 milliseconds
I1012 16:23:28.937311     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 2 milliseconds
nodes "kind-worker2" not found
error uploading crisocket
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runKubeletStartJoinPhase
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/kubelet.go:220
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
....

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105343/pull-kubernetes-e2e-kind/1447957425293365248/build-log.txt

we should be careful with similar refactors to not break kubeadm, given the lack of unit tests.
i encourage you to test it locally with kind or with a two nodes (VMs) with bare kubeadm.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 13, 2021
@jonyhy96
Copy link
Contributor Author

jonyhy96 commented Oct 13, 2021

the kind CI failures seem somewhat related to the patch node changes here:

I1012 16:23:28.581402     214 patchnode.go:31] [patchnode] Uploading the CRI Socket information "unix:///run/containerd/containerd.sock" to the Node API object "kind-worker2" as an annotation
I1012 16:23:28.599223     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 17 milliseconds
I1012 16:23:28.612455     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 1 milliseconds
I1012 16:23:28.666911     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 2 milliseconds
I1012 16:23:28.937311     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 2 milliseconds
nodes "kind-worker2" not found
error uploading crisocket
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runKubeletStartJoinPhase
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/kubelet.go:220
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
....

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105343/pull-kubernetes-e2e-kind/1447957425293365248/build-log.txt

we should be careful with similar refactors to not break kubeadm, given the lack of unit tests. i encourage you to test it locally with kind or with a two nodes (VMs) with bare kubeadm.

The reason for this failure is because we didn’t have enough time to retry.

@jonyhy96
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@pacoxu
Copy link
Member

pacoxu commented Oct 14, 2021

/test pull-kubernetes-e2e-kind

Factor: 2.0,
Jitter: 0,
}
err := wait.ExponentialBackoff(backOff, PatchNodeOnce(client, nodeName, patchFn, &lastError))
Copy link
Member

@neolit123 neolit123 Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something is not right here, why did wait.Poll(constants.APICallRetryInterval, constants.PatchNodeTimeout... worked before and not after the change?

these interval / timeout constants have been used for a very long time in kubeadm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait.Poll's retry logic is diffirent from wait.ExponentialBackoff's.

failed message from https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105343/pull-kubernetes-e2e-kind/1447957425293365248/build-log.txt

I1012 16:23:28.581402     214 patchnode.go:31] [patchnode] Uploading the CRI Socket information "unix:///run/containerd/containerd.sock" to the Node API object "kind-worker2" as an annotation
I1012 16:23:28.599223     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 17 milliseconds
I1012 16:23:28.612455     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 1 milliseconds
I1012 16:23:28.666911     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 2 milliseconds
I1012 16:23:28.937311     214 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 2 milliseconds

success message from https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105343/pull-kubernetes-e2e-kind/1448583659891200000/build-log.txt

I1014 09:54:30.367606     213 patchnode.go:31] [patchnode] Uploading the CRI Socket information "unix:///run/containerd/containerd.sock" to the Node API object "kind-worker2" as an annotation
I1014 09:54:30.390173     213 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 22 milliseconds
I1014 09:54:30.895376     213 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 404 Not Found in 4 milliseconds
I1014 09:54:31.900145     213 round_trippers.go:541] GET https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 200 OK in 3 milliseconds
I1014 09:54:31.912029     213 round_trippers.go:541] PATCH https://kind-control-plane:6443/api/v1/nodes/kind-worker2?timeout=10s 200 OK in 6 milliseconds

We can see when use DefaultBackoff in ExponentialBackoff, it steps too fast to wait for kind-worker2 registry on apiserver. When using wait.Poll it will retry in every 500ms and timeout in 2min which gives it enough time to wait.This backOff is similar to the behavior of wait.Poll that it will timeout at 127.5s.

Copy link
Member

@neolit123 neolit123 Oct 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that the functions are different. It seems out of scope of this PR to change the function. We should continue to use Poll with the old constants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, use wait.Poll instead

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, this looks good now.
please squash the commits to 1 and i will LGTM.

/remove-priority awaiting-more-evidence
/priority backlog
/approve

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Oct 16, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonyhy96, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2021
Signed-off-by: haoyun <yun.hao@daocloud.io>
@jonyhy96
Copy link
Contributor Author

/test pull-kubernetes-unit

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 17, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9804a83 into kubernetes:master Oct 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Oct 17, 2021
@jonyhy96 jonyhy96 deleted the fix-patch-node-once branch October 18, 2021 06:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants