Fix rolling update daemonset bug in clock-skew scenario #77208
Conversation
Hi @DaiHao. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
it's a critical bug in our production environment. @zhangxiaoyu-zidif PTAL |
/priority important-soon |
@kubernetes/sig-apps-pr-reviews needs an approval |
/test pull-kubernetes-e2e-gce-device-plugin-gpu |
/lgtm |
ping @janetkuo |
@@ -1198,6 +1198,10 @@ func (dsc *DaemonSetsController) updateDaemonSetStatus(ds *apps.DaemonSet, nodeL
		return fmt.Errorf("error storing status for daemon set %#v: %v", ds, err)
	}

// Resync the DaemonSet after MinReadySeconds as a last line of defense to guard against clock-skew. |
It would help to enhance this comment with a more detailed explanation of the issue. I am still not following how clock skew between the nodes (on which the pods run) and the controller node causes this?
The reason seems to be this line: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1174. As I understand it:
1. The kubelet marks the pod ready and updates LastTransitionTime.
2. The controller checks IsPodAvailable() and finds the pod unavailable, because it diffs LastTransitionTime (set by the kubelet) against time.Now(), which is controller time. Since the controller's clock is behind, minReadySeconds appears unsatisfied and the controller marks the pod unavailable even though minReadySeconds has actually elapsed.

Can one of you confirm whether this is the right understanding of the issue @DaiHao @Kargakis?
Also, in this case, when a pod is marked unavailable, why won't it be requeued?
I think including all of this explanation would be helpful.
see here. #77208 (comment)
Your understanding is right.
The pod's status does not change in the sync loop; the controller marks it unavailable only in the DaemonSet's status, which is equal to the last stored status, so the DaemonSet is not requeued either.
see here.
kubernetes/pkg/controller/daemon/daemon_controller.go
Lines 1107 to 1117 in b219272
func storeDaemonSetStatus(dsClient unversionedapps.DaemonSetInterface, ds *apps.DaemonSet, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable int, updateObservedGen bool) error {
	if int(ds.Status.DesiredNumberScheduled) == desiredNumberScheduled &&
		int(ds.Status.CurrentNumberScheduled) == currentNumberScheduled &&
		int(ds.Status.NumberMisscheduled) == numberMisscheduled &&
		int(ds.Status.NumberReady) == numberReady &&
		int(ds.Status.UpdatedNumberScheduled) == updatedNumberScheduled &&
		int(ds.Status.NumberAvailable) == numberAvailable &&
		int(ds.Status.NumberUnavailable) == numberUnavailable &&
		ds.Status.ObservedGeneration >= ds.Generation {
		return nil
	}
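The early return above is why the DaemonSet never gets requeued: if the freshly computed status equals the stored one, nothing is written, so no update event fires. A minimal sketch of that short-circuit, using a hypothetical pared-down `status` type rather than the real `apps.DaemonSetStatus`:

```go
package main

import "fmt"

// status is a pared-down stand-in for apps.DaemonSetStatus.
type status struct{ Ready, Available, Unavailable int }

// storeStatus mimics the early return in storeDaemonSetStatus: when the
// newly computed status equals the stored one, nothing is written, so no
// update event fires and the DaemonSet is never requeued.
func storeStatus(stored *status, next status) (wrote bool) {
	if *stored == next {
		return false // early return: no API update, no watch event
	}
	*stored = next
	return true
}

func main() {
	cur := status{Ready: 3, Available: 2, Unavailable: 1}
	// Under clock skew the same pod is recomputed as unavailable again,
	// yielding the same status as the last sync; nothing is written.
	fmt.Println(storeStatus(&cur, status{Ready: 3, Available: 2, Unavailable: 1})) // false
	// Only a genuinely changed status triggers a write (and an event).
	fmt.Println(storeStatus(&cur, status{Ready: 3, Available: 3, Unavailable: 0})) // true
}
```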
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: DaiHao, janetkuo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
What this PR does / why we need it:
Related issue #41641
Which issue(s) this PR fixes:
Fixes #77203
Special notes for your reviewer:
Does this PR introduce a user-facing change?: