Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubeadm: improve getStaticPodSingleHash error messages #108315

Merged

Conversation

Monokaix
Copy link
Member

@Monokaix Monokaix commented Feb 24, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

When using kubeadm upgrade cmd to upgrade cluster, kubeadm will compare static pod'hash when replace with new static pod manifes, err happened when get old static pod detail failed and there is no explicit err message printed, we are not aware of what happened and only get timed out waiting for the condition after timeout(5 mins), which is not user friendly.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm: better surface errors during "kubeadm upgrade" when waiting for the kubelet to restart static pods on control plane nodes

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 24, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @Monokaix. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 24, 2022
@Monokaix Monokaix force-pushed the print-getStaticPodSingleHash-err branch from 9bb2a1b to 1c9a30f Compare February 24, 2022 01:49
@Monokaix
Copy link
Member Author

/assign @SataQiu @pacoxu

@Monokaix Monokaix force-pushed the print-getStaticPodSingleHash-err branch from ea203d9 to 3d1af13 Compare February 24, 2022 09:34
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 24, 2022
@pacoxu
Copy link
Member

pacoxu commented Feb 24, 2022

/cc @neolit123
Do think the adding message is valid?
I would add some troubleshooting methods or possible errs before waiting. And I may print such information when log level is debug(V 4).

Comment on lines 188 to 199
var err error
var lastErr = errors.New("")
mirrorPodHashes := map[string]string{}
for _, component := range kubeadmconstants.ControlPlaneComponents {
staticPodName := fmt.Sprintf("%s-%s", component, nodeName)
err = wait.PollImmediate(kubeadmconstants.APICallRetryInterval, w.timeout, func() (bool, error) {
componentHash, err = getStaticPodSingleHash(w.client, nodeName, component)
componentHash, err = getStaticPodSingleHash(w.client, staticPodName)
if err != nil {
if err.Error() != lastErr.Error() {
fmt.Printf("[apiclient] Get static pod %s hash failed with err: %v\n", staticPodName, err)
lastErr = err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try:

	var err, lastErr error
	mirrorPodHashes := map[string]string{}
	for _, component := range kubeadmconstants.ControlPlaneComponents {
		err = wait.PollImmediate(kubeadmconstants.APICallRetryInterval, w.timeout, func() (bool, error) {
			componentHash, err = getStaticPodSingleHash(w.client, nodeName, component)
			if err != nil {
				lastErr = err
				return false, nil
			}
			return true, nil
		})
		if err != nil {
			if lastErr != nil {
				err = lastErr
			}
			return nil, errors.Wrapf(err, "failed to obtain static Pod hash for component %s on Node %s, component, node)
		}

similar for WaitForStaticPodSingleHash

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! seems good, but a little doubt that this may lead user not be aware err so fast, it will return err message until timeout.

Copy link
Member

@neolit123 neolit123 Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubeadm has a few places where it prints errors during polls iterations with higher verbosity (v >= 4), "etcd member join", "node join" come to mind, but the other places mostly print the last error after the poll.

this particular timeout is long (5 minutes)...:

UpgradeManifestTimeout = 5 * time.Minute

i'd remove this message because it's redundant:

klog.V(1).Infoln("[upgrade/apply] performing upgrade")

and modify this message to include the timeout to let the users know how long it is:

fmt.Printf("[upgrade/apply] Upgrading your Static Pod-hosted control plane to version %q...\n", internalcfg.KubernetesVersion)

to become:

fmt.Printf("[upgrade/apply] Upgrading your Static Pod-hosted control plane to version %q (timeout: %v)...\n",
    internalcfg.KubernetesVersion, waiter.timeout)

the retry interval is 500ms so it will be very spammy if we print each error for 5 minutes.

APICallRetryInterval = 500 * time.Millisecond

we don't do it in other places:
https://github.com/kubernetes/kubernetes/blob/9804a83d8fd9eec4c20e75e73e11ffd4b370c6f0/cmd/kubeadm/app/util/apiclient/idempotency.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try:

	var err, lastErr error
	mirrorPodHashes := map[string]string{}
	for _, component := range kubeadmconstants.ControlPlaneComponents {
		err = wait.PollImmediate(kubeadmconstants.APICallRetryInterval, w.timeout, func() (bool, error) {
			componentHash, err = getStaticPodSingleHash(w.client, nodeName, component)
			if err != nil {
				lastErr = err
				return false, nil
			}
			return true, nil
		})
		if err != nil {
			if lastErr != nil {
				err = lastErr
			}
			return nil, errors.Wrapf(err, "failed to obtain static Pod hash for component %s on Node %s, component, node)
		}

similar for WaitForStaticPodSingleHash

Thanks, and maybe there is no need to check lastErr != nil because when err!=nil it must be timeout err here and lastErr is not nil either, so we can return lastErr directly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you are right and we don't need it in this case.
if err != nil, lastErr will always be != nil as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, some of the parts mentioned here:
#108315 (comment)
are still missing in the diff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's done.
And also WaitForStaticPodHashChange is a little special, we should distinguish two errors because it maybe apiclient getStaticPodSingleHash error or pod hash didn't change err.

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a couple of comments
the lastErr can be handled better and no need to print an error on every poll iteration like @pacoxu mentioned.

@neolit123
Copy link
Member

neolit123 commented Feb 24, 2022

something else you need to do - once the WaitForStaticPodSingleHash function is responsible for formatting the errors, this locations just need to return err and not have additional error formatting:

beforeEtcdPodHash, err := waiter.WaitForStaticPodSingleHash(cfg.NodeRegistration.Name, constants.Etcd)
if err != nil {
return true, errors.Wrap(err, "failed to get etcd pod's hash")

errors.Wrap(err, "failed to get etcd pod's hash") -> err

looks like this call location of WaitForStaticPodControlPlaneHashes already returns just err on the caller side:

beforePodHashMap, err := waiter.WaitForStaticPodControlPlaneHashes(cfg.NodeRegistration.Name)
if err != nil {
return err

check if there are other callers of WaitForStaticPodSingleHash and WaitForStaticPodControlPlaneHashes

@neolit123
Copy link
Member

/triage accepted
but needs updates.

@Monokaix Monokaix force-pushed the print-getStaticPodSingleHash-err branch from 308084d to 4380e33 Compare March 1, 2022 01:54
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 1, 2022
@Monokaix
Copy link
Member Author

Monokaix commented Mar 1, 2022

/test pull-kubernetes-integration

Copy link
Member

@pacoxu pacoxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the nit(which is optional from my aspect.)

Comment on lines 247 to 255
// distinguish getStaticPodSingleHash err and kubelet possible err here
if err != nil {
if lastErr != nil {
return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
}
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// distinguish getStaticPodSingleHash err and kubelet possible err here
if err != nil {
if lastErr != nil {
return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
}
return nil
// If the error is a timeout for unchanged hash, return a more specific error
if lastErr != nil {
return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
if err != nil {
return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
}
return nil

Nit: to make less nested if-else.

Copy link
Member Author

@Monokaix Monokaix Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks, it makes sense.
But we both ignored one point that if getStaticPodSingleHash return an err in the first time and return no err after some retries, lastErr will be still left not nil, if we directly check lastErr it will return wrong err becasue it maybe has already no err or be anoher err. So I add lastErr = nil when getStaticPodSingleHash has returned no err after some retries, which can distinguish different errs.

@Monokaix Monokaix force-pushed the print-getStaticPodSingleHash-err branch 2 times, most recently from 13a1455 to 4dd6def Compare March 1, 2022 08:28
@pacoxu
Copy link
Member

pacoxu commented Mar 1, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2022
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requesting a few more changes, but other than that LGTM.

return false, nil
}
return true, nil
})

if err != nil {
err = errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
return componentPodHash, err
}

// WaitForStaticPodHashChange blocks until it timeouts or notices that the Mirror Pod (for the Static Pod, respectively) has changed
// This implicitly means this function blocks until the kubelet has restarted the Static Pod in question
func (w *KubeWaiter) WaitForStaticPodHashChange(nodeName, component, previousHash string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the changes in this function seem new, so i'm going to review them as well.

cmd/kubeadm/app/util/apiclient/wait.go Outdated Show resolved Hide resolved
return false, nil
}
// Set lastErr to nil which identities `getStaticPodSingleHash` has no err already, and we can just care other err if any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Set lastErr to nil which identities `getStaticPodSingleHash` has no err already, and we can just care other err if any
// Set lastErr to nil to be able to later distinguish between getStaticPodSingleHash() and timeout errors

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's done.

// We should continue polling until the UID changes
if hash == previousHash {
return false, nil
}

return true, nil
})

// distinguish `getStaticPodSingleHash` err and kubelet possible err here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// distinguish `getStaticPodSingleHash` err and kubelet possible err here
// if lastError is not nil, this must be a getStaticPodSingleHash() error, else if err is not nil there was a poll timeout

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's done.

return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
if err != nil {
return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
return errors.Wrapf(err, "static Pod hash for component %s on Node %s did not change after %v",
component, nodeName, w.timeout)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done here.


// distinguish `getStaticPodSingleHash` err and kubelet possible err here
if lastErr != nil {
return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we seem to be repeating this text in a number of places.

instead we should modify the function and just return lastErr in all the places.

// getStaticPodSingleHash computes hashes for a single Static Pod resource
func getStaticPodSingleHash(client clientset.Interface, nodeName string, component string) (string, error) {
	staticPodName := fmt.Sprintf("%s-%s", component, nodeName)
	staticPod, err := client.CoreV1().Pods(metav1.NamespaceSystem).Get(context.TODO(), staticPodName, metav1.GetOptions{})
	if err != nil {
		return "", errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
	}
	staticPodHash := staticPod.Annotations["kubernetes.io/config.hash"]
	return staticPodHash, nil
}

i've also removed the Printf in the function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done here, too.

@Monokaix Monokaix force-pushed the print-getStaticPodSingleHash-err branch from 4dd6def to 7824316 Compare March 2, 2022 01:35
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2022
@Monokaix
Copy link
Member Author

Monokaix commented Mar 2, 2022

/test pull-kubernetes-integration

@Monokaix
Copy link
Member Author

Monokaix commented Mar 2, 2022

@pacoxu @neolit123 @SataQiu some changes have been made according to your reviews, please check this once more.

@neolit123
Copy link
Member

/retitle kubeadm: improve getStaticPodSingleHash error messages

@k8s-ci-robot k8s-ci-robot changed the title Print getStaticPodSingleHash err message kubeadm: improve getStaticPodSingleHash error messages Mar 2, 2022
@neolit123
Copy link
Member

/release-note-edit

kubeadm: better surface errors during "kubeadm upgrade" when waiting for the kubelet to restart start pods on control plane nodes

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 2, 2022
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks
/lgtm
/approve

@neolit123
Copy link
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Mar 2, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix, neolit123, pacoxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2022
@k8s-ci-robot k8s-ci-robot merged commit 4fcfc58 into kubernetes:master Mar 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants