kubeadm: improve getStaticPodSingleHash error messages #108315
Hi @Monokaix. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
force-pushed from 9bb2a1b to 1c9a30f
force-pushed from 1c9a30f to ea203d9
force-pushed from ea203d9 to 3d1af13
/cc @neolit123 |
var err error
var lastErr = errors.New("")
mirrorPodHashes := map[string]string{}
for _, component := range kubeadmconstants.ControlPlaneComponents {
    staticPodName := fmt.Sprintf("%s-%s", component, nodeName)
    err = wait.PollImmediate(kubeadmconstants.APICallRetryInterval, w.timeout, func() (bool, error) {
        componentHash, err = getStaticPodSingleHash(w.client, nodeName, component)
        componentHash, err = getStaticPodSingleHash(w.client, staticPodName)
        if err != nil {
            if err.Error() != lastErr.Error() {
                fmt.Printf("[apiclient] Get static pod %s hash failed with err: %v\n", staticPodName, err)
                lastErr = err
            }
try:
    var err, lastErr error
    mirrorPodHashes := map[string]string{}
    for _, component := range kubeadmconstants.ControlPlaneComponents {
        err = wait.PollImmediate(kubeadmconstants.APICallRetryInterval, w.timeout, func() (bool, error) {
            componentHash, err = getStaticPodSingleHash(w.client, nodeName, component)
            if err != nil {
                lastErr = err
                return false, nil
            }
            return true, nil
        })
        if err != nil {
            if lastErr != nil {
                err = lastErr
            }
            return nil, errors.Wrapf(err, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
        }
similar for WaitForStaticPodSingleHash
Thanks, this looks good. One small concern: users would not learn about the error quickly, because the error message is only returned once the timeout expires.
kubeadm has a few places where it prints errors during poll iterations with higher verbosity (v >= 4), "etcd member join", "node join" come to mind, but the other places mostly print the last error after the poll.
this particular timeout is long (5 minutes)...:
    UpgradeManifestTimeout = 5 * time.Minute
i'd remove this message because it's redundant:
    klog.V(1).Infoln("[upgrade/apply] performing upgrade")
and modify this message to include the timeout to let the users know how long it is:
    fmt.Printf("[upgrade/apply] Upgrading your Static Pod-hosted control plane to version %q...\n", internalcfg.KubernetesVersion)
to become:
    fmt.Printf("[upgrade/apply] Upgrading your Static Pod-hosted control plane to version %q (timeout: %v)...\n",
        internalcfg.KubernetesVersion, waiter.timeout)
the retry interval is 500ms so it will be very spammy if we print each error for 5 minutes.
    APICallRetryInterval = 500 * time.Millisecond
we don't do it in other places:
https://github.com/kubernetes/kubernetes/blob/9804a83d8fd9eec4c20e75e73e11ffd4b370c6f0/cmd/kubeadm/app/util/apiclient/idempotency.go
Thanks. Maybe there is no need to check lastErr != nil: when err != nil here it must be the timeout error, and lastErr is not nil either, so we can return lastErr directly?
i think you are right and we don't need it in this case. if err != nil, lastErr will always be != nil as well.
note, some of the parts mentioned here: #108315 (comment) are still missing in the diff.
Thanks, it's done.
Also, WaitForStaticPodHashChange is a special case: we should distinguish two errors, because the failure may be either a getStaticPodSingleHash error from the apiclient or the pod hash not having changed.
added a couple of comments
the lastErr can be handled better and no need to print an error on every poll iteration like @pacoxu mentioned.
something else you need to do - once the WaitForStaticPodSingleHash function is responsible for formatting the errors, this locations just need to return kubernetes/cmd/kubeadm/app/phases/upgrade/staticpods.go Lines 322 to 324 in cde45fb
looks like this call location of WaitForStaticPodControlPlaneHashes already returns just err on the caller side: kubernetes/cmd/kubeadm/app/phases/upgrade/staticpods.go Lines 417 to 419 in cde45fb
check if there are other callers of WaitForStaticPodSingleHash and WaitForStaticPodControlPlaneHashes |
/triage accepted |
force-pushed from 308084d to 4380e33
/test pull-kubernetes-integration |
LGTM except for the nit (which is optional from my perspective).
// distinguish getStaticPodSingleHash err and kubelet possible err here
if err != nil {
    if lastErr != nil {
        return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
    }
    return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
}
return nil
Suggested change:
// If the error is a timeout for unchanged hash, return a more specific error
if lastErr != nil {
    return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
if err != nil {
    return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
}
return nil
Nit: this makes the if-else less nested.
Yes, thanks, it makes sense.
But we both overlooked one point: if getStaticPodSingleHash returns an error on the first attempt and then succeeds after some retries, lastErr is still left non-nil. If we check lastErr directly, we return the wrong error, because by then there may be no error at all or a different one. So I added lastErr = nil for the case where getStaticPodSingleHash succeeds after some retries, which lets us distinguish the different errors.
force-pushed from 13a1455 to 4dd6def
/lgtm |
requesting a few more changes, but other than that LGTM.
        return false, nil
    }
    return true, nil
})

if err != nil {
    err = errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
return componentPodHash, err
}

// WaitForStaticPodHashChange blocks until it timeouts or notices that the Mirror Pod (for the Static Pod, respectively) has changed
// This implicitly means this function blocks until the kubelet has restarted the Static Pod in question
func (w *KubeWaiter) WaitForStaticPodHashChange(nodeName, component, previousHash string) error {
the changes in this function seem new, so i'm going to review them as well.
    return false, nil
}
// Set lastErr to nil which identities `getStaticPodSingleHash` has no err already, and we can just care other err if any
Suggested change:
// Set lastErr to nil to be able to later distinguish between getStaticPodSingleHash() and timeout errors
it's done.
    // We should continue polling until the UID changes
    if hash == previousHash {
        return false, nil
    }

    return true, nil
})

// distinguish `getStaticPodSingleHash` err and kubelet possible err here
Suggested change:
// if lastError is not nil, this must be a getStaticPodSingleHash() error, else if err is not nil there was a poll timeout
it's done.
    return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
}
if err != nil {
    return errors.Wrapf(err, "static pod hash didn't change for a while, kubelet may fail to restart it")
Suggested change:
    return errors.Wrapf(err, "static Pod hash for component %s on Node %s did not change after %v",
        component, nodeName, w.timeout)
done here.
// distinguish `getStaticPodSingleHash` err and kubelet possible err here
if lastErr != nil {
    return errors.Wrapf(lastErr, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
so we seem to be repeating this text in a number of places.
instead we should modify the function and just return lastErr in all the places.
    // getStaticPodSingleHash computes hashes for a single Static Pod resource
    func getStaticPodSingleHash(client clientset.Interface, nodeName string, component string) (string, error) {
        staticPodName := fmt.Sprintf("%s-%s", component, nodeName)
        staticPod, err := client.CoreV1().Pods(metav1.NamespaceSystem).Get(context.TODO(), staticPodName, metav1.GetOptions{})
        if err != nil {
            // wrap the Get error itself; lastErr only exists in the callers
            return "", errors.Wrapf(err, "failed to obtain static Pod hash for component %s on Node %s", component, nodeName)
        }
        staticPodHash := staticPod.Annotations["kubernetes.io/config.hash"]
        return staticPodHash, nil
    }
i've also removed the Printf in the function
done here, too.
force-pushed from 4dd6def to 7824316
/test pull-kubernetes-integration |
@pacoxu @neolit123 @SataQiu some changes have been made according to your reviews, please check this once more. |
/retitle kubeadm: improve getStaticPodSingleHash error messages |
/release-note-edit
|
thanks
/lgtm
/approve
/priority backlog |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Monokaix, neolit123, pacoxu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
When using the kubeadm upgrade command to upgrade a cluster, kubeadm compares the static Pod hashes when replacing a static Pod manifest with a new one. If fetching the old static Pod fails, no explicit error message is printed: we are not aware of what happened and only get "timed out waiting for the condition" after the timeout (5 minutes), which is not user friendly.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: