Fix rolling update daemonset bug in clock-skew scenario #77208

DaiHao · 2019-04-29T12:12:57Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

The pod's Ready condition is managed by the kubelet, so its LastTransitionTime is stamped with the node's clock. When the controller's clock is behind the node's, the IsPodAvailable function judges the pod unavailable, because the controller compares that timestamp against its own clock. The DaemonSet then loses the chance to update its status.
In this commit, the controller resyncs the DaemonSet after MinReadySeconds as a last line of defense against clock skew.
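
The effect of the skew can be illustrated with a small, self-contained sketch. It does not use the real Kubernetes packages: `isAvailable` and `observedAvailable` are hypothetical stand-ins written for this illustration, and the 10-second MinReadySeconds and 30-second skew are assumed values.

```go
package main

import (
	"fmt"
	"time"
)

// isAvailable is a simplified stand-in for the availability check: a ready pod
// counts as available once minReadySeconds have passed since lastTransition,
// measured against the caller's clock (now).
func isAvailable(lastTransition time.Time, minReadySeconds int32, now time.Time) bool {
	if minReadySeconds == 0 {
		return true
	}
	minReady := time.Duration(minReadySeconds) * time.Second
	return !lastTransition.IsZero() && lastTransition.Add(minReady).Before(now)
}

// observedAvailable reports what a controller whose clock runs skewSeconds
// behind the node concludes, elapsedSeconds (node time) after the kubelet
// marked the pod ready.
func observedAvailable(elapsedSeconds, skewSeconds int, minReadySeconds int32) bool {
	readyAt := time.Date(2019, 4, 29, 12, 0, 0, 0, time.UTC) // stamped by the node clock
	nodeNow := readyAt.Add(time.Duration(elapsedSeconds) * time.Second)
	controllerNow := nodeNow.Add(-time.Duration(skewSeconds) * time.Second)
	return isAvailable(readyAt, minReadySeconds, controllerNow)
}

func main() {
	const minReadySeconds = 10 // assumed value for the illustration

	fmt.Println("no skew, 11s after ready: ", observedAvailable(11, 0, minReadySeconds))
	fmt.Println("30s skew, 11s after ready:", observedAvailable(11, 30, minReadySeconds))
	fmt.Println("30s skew, 41s after ready:", observedAvailable(41, 30, minReadySeconds))
}
```

With a skewed controller clock, the pod only starts to look available once MinReadySeconds plus the skew have passed in node time, so a sync that runs before that point sees it as unavailable.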

Related issue #41641

Which issue(s) this PR fixes:

Fixes #77203

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Enhance the DaemonSet sync logic in the clock-skew scenario.

k8s-ci-robot · 2019-04-29T12:13:04Z

Hi @DaiHao. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

DaiHao · 2019-04-29T12:16:22Z

PTAL @resouer @Kargakis

answer1991 · 2019-04-29T13:44:47Z

It's a critical bug in our production environment.

@zhangxiaoyu-zidif PTAL

pkg/controller/daemon/daemon_controller.go

0xmichalis · 2019-04-30T07:34:25Z

/priority important-soon
/kind bug

0xmichalis · 2019-04-30T07:39:24Z

@kubernetes/sig-apps-pr-reviews needs an approval

DaiHao · 2019-04-30T08:32:08Z

/test pull-kubernetes-e2e-gce-device-plugin-gpu

zhangxiaoyu-zidif · 2019-04-30T16:03:53Z

/lgtm

zhangxiaoyu-zidif · 2019-04-30T16:04:50Z

ping @janetkuo

krmayankk · 2019-05-01T04:30:15Z

pkg/controller/daemon/daemon_controller.go

@@ -1198,6 +1198,10 @@ func (dsc *DaemonSetsController) updateDaemonSetStatus(ds *apps.DaemonSet, nodeL
 		return fmt.Errorf("error storing status for daemon set %#v: %v", ds, err)
 	}

+	// Resync the DaemonSet after MinReadySeconds as a last line of defense to guard against clock-skew.


It would help to enhance this comment with a more detailed explanation of the issue. I am still not following how clock skew between the node on which the pod is running and the controller's node causes this.

It seems the cause is this line: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1174 . From what I understand:
1: The kubelet marks the pod ready and sets LastTransitionTime.
2: The controller calls IsPodAvailable() and finds the pod unavailable, because it compares LastTransitionTime (set by the kubelet) with time.Now, which is the controller's time. Since the controller's clock is behind, MinReadySeconds appears not to have elapsed, and the pod is counted as unavailable even though MinReadySeconds has actually passed.

Can one of you confirm whether this is the right understanding of the issue, @DaiHao @Kargakis?
Also, in this case, when a pod is marked unavailable, why won't it be requeued?

I think including all of this explanation would be helpful.

see here. #77208 (comment)

Your understanding is right.
The pod's status does not change during the sync loop. The controller counts the pod as unavailable only in the DaemonSet's status, and because that status is identical to the previously stored one, no update is written and the DaemonSet is not requeued either.
See here:

kubernetes/pkg/controller/daemon/daemon_controller.go

Lines 1107 to 1117 in b219272

	func storeDaemonSetStatus(dsClient unversionedapps.DaemonSetInterface, ds *apps.DaemonSet, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable int, updateObservedGen bool) error {
		if int(ds.Status.DesiredNumberScheduled) == desiredNumberScheduled &&
			int(ds.Status.CurrentNumberScheduled) == currentNumberScheduled &&
			int(ds.Status.NumberMisscheduled) == numberMisscheduled &&
			int(ds.Status.NumberReady) == numberReady &&
			int(ds.Status.UpdatedNumberScheduled) == updatedNumberScheduled &&
			int(ds.Status.NumberAvailable) == numberAvailable &&
			int(ds.Status.NumberUnavailable) == numberUnavailable &&
			ds.Status.ObservedGeneration >= ds.Generation {
			return nil
		}
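
To make the consequence of that early return concrete, here is a minimal sketch under stated assumptions. It is not the code added by this PR: the `status` type, `needsWrite`, and `resyncDelay` are hypothetical names invented for the illustration, and the condition used to decide on a delayed resync (ready count differs from available count while MinReadySeconds is set) is an assumption about one reasonable way to implement the idea described in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// status keeps only the two counters relevant to this discussion
// (a hypothetical, trimmed-down type, not the real DaemonSetStatus).
type status struct {
	NumberReady     int
	NumberAvailable int
}

// needsWrite mirrors the idea of the early return quoted above: if the newly
// computed status equals the stored one, nothing is written, so no update
// event arrives to trigger another sync.
func needsWrite(stored, computed status) bool {
	return stored != computed
}

// resyncDelay decides whether to schedule one more sync (assumed policy for
// this sketch): when some ready pods are not yet counted as available, try
// again after MinReadySeconds instead of waiting for an event that may never come.
func resyncDelay(minReadySeconds int32, s status) (time.Duration, bool) {
	if minReadySeconds > 0 && s.NumberReady != s.NumberAvailable {
		return time.Duration(minReadySeconds) * time.Second, true
	}
	return 0, false
}

func main() {
	stored := status{NumberReady: 3, NumberAvailable: 2}
	// With a skewed clock the third pod still looks unavailable, so the
	// computed status is identical to the stored one.
	computed := status{NumberReady: 3, NumberAvailable: 2}

	fmt.Println("status write needed:", needsWrite(stored, computed))
	if delay, ok := resyncDelay(10, computed); ok {
		fmt.Println("schedule resync after:", delay)
	}
}
```

In a real controller the returned delay would be handed to a delaying work queue; the sketch only shows the decision, which is the part the discussion above turns on.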

janetkuo · 2019-05-04T01:56:56Z

/approve

k8s-ci-robot · 2019-05-04T01:57:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DaiHao, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/daemon/OWNERS~~ [janetkuo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Enqueue controllers after minreadyseconds when all pods are ready

18ed2f6

k8s-ci-robot requested review from erictune and tnozicka April 29, 2019 12:13

0xmichalis reviewed Apr 29, 2019

k8s-ci-robot added the release-note label and removed the do-not-merge/release-note-label-needed label Apr 30, 2019

k8s-ci-robot assigned 0xmichalis Apr 30, 2019

k8s-ci-robot added the ok-to-test and lgtm labels and removed the needs-ok-to-test label Apr 30, 2019

DaiHao changed the title ~~Enqueue controllers after minreadyseconds when all pods are ready~~ Fix rolling update daemonset bug in clock-skew scenario Apr 30, 2019

k8s-ci-robot assigned zhangxiaoyu-zidif Apr 30, 2019

krmayankk reviewed May 1, 2019

janetkuo approved these changes May 4, 2019

k8s-ci-robot added the approved label May 4, 2019

k8s-ci-robot merged commit e871241 into kubernetes:master May 4, 2019
