
fix daemon set rolling update hang #77773

Merged
1 commit merged into kubernetes:master on May 20, 2019

Conversation

DaiHao
Contributor

@DaiHao DaiHao commented May 11, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

DaemonSet rolling update hangs when there exists a not-ready node in the cluster.

Which issue(s) this PR fixes:

Fixes #63465

Special notes for your reviewer:
When the daemon controller deletes a pod on a not-ready node, the pod gets stuck in the terminating state.
The informer then never receives a pod delete event, so the deletion expectation is never satisfied and the daemon controller never executes its manage method.
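
For context, here is a minimal, self-contained sketch (illustrative names and types, not the real Kubernetes implementation) of how deletion expectations gate the controller's sync loop:

package main

import "fmt"

// expectations counts how many pod deletions the controller is still
// waiting to observe for a given DaemonSet key.
type expectations struct {
	pendingDeletes map[string]int
}

func (e *expectations) expectDeletions(key string, n int) { e.pendingDeletes[key] += n }
func (e *expectations) deletionObserved(key string)       { e.pendingDeletes[key]-- }
func (e *expectations) satisfied(key string) bool         { return e.pendingDeletes[key] <= 0 }

func main() {
	e := &expectations{pendingDeletes: map[string]int{}}

	// The controller deletes one pod and records an expectation for it.
	e.expectDeletions("default/my-ds", 1)

	// A pod stuck in terminating never produces a delete event, so
	// deletionObserved is never called and the sync loop keeps skipping
	// the manage step for this DaemonSet.
	if !e.satisfied("default/my-ds") {
		fmt.Println("expectations unsatisfied: manage() is skipped")
	}
}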

Does this PR introduce a user-facing change?:

Fix a bug that causes DaemonSet rolling update to hang when its pod gets stuck at terminating. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2019
@k8s-ci-robot
Contributor

Hi @DaiHao. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2019
@DaiHao
Contributor Author

DaiHao commented May 11, 2019

/sig apps
/kind bug
/priority important-soon

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 11, 2019
@DaiHao
Contributor Author

DaiHao commented May 11, 2019

@krmayankk @janetkuo @Kargakis @k82cn PTAL

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 11, 2019
@draveness
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 12, 2019
@draveness
Contributor

/assign @soltysh

@DaiHao
Contributor Author

DaiHao commented May 12, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

		dsc.deletePod(curPod)
		return
	}


I am confused how this fixes the issue. Did you try this fix, and does it solve the issue?

Contributor Author

The reason the DaemonSet rolling update gets stuck is that expectations are never satisfied. With this fix, we lower the deletion expectation as soon as a pod's DeletionTimestamp is non-nil, which satisfies the expectation and lets the sync loop resume.
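
In other words, the fix makes updatePod treat a newly set DeletionTimestamp as the deletion itself. A sketch of the added block (paraphrased; the surrounding function and the full diff appear later in this thread):

	if curPod.DeletionTimestamp != nil {
		// When a pod is deleted gracefully, its DeletionTimestamp is set
		// first and the object is removed from the store only after the
		// grace period. Treat the timestamp update as the deletion, so the
		// expectation is lowered now instead of waiting for a delete event
		// that a pod stuck in terminating never produces.
		dsc.deletePod(curPod)
		return
	}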

Contributor Author

Before this fix, if a pod got stuck in terminating, the DaemonSet would never satisfy its expectations.


Shouldn't the lowering of the expectation be done using dsc.expectations.DeletionObserved(dsKey)?


@janetkuo could you help us understand this fix? Do we need a test?

Contributor Author

Shouldn't the lowering of the expectation be done using dsc.expectations.DeletionObserved(dsKey)?

Of course, you could just use dsc.expectations.DeletionObserved(dsKey), but this fix reuses the code in deletePod and keeps it consistent with the ReplicaSet controller and the Job controller. See #77773 (comment).
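
For reference, a simplified paraphrase of what deletePod does (tombstone handling elided; resolveControllerRef, controller.KeyFunc, and enqueueDaemonSet are the existing helpers in daemon_controller.go):

func (dsc *DaemonSetsController) deletePod(obj interface{}) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		return // the real method also unwraps informer tombstones here
	}
	// Find the DaemonSet that owns this pod via its controller ref.
	ds := dsc.resolveControllerRef(pod.Namespace, metav1.GetControllerOf(pod))
	if ds == nil {
		return
	}
	dsKey, err := controller.KeyFunc(ds)
	if err != nil {
		return
	}
	dsc.expectations.DeletionObserved(dsKey) // lower the deletion expectation
	dsc.enqueueDaemonSet(ds)                 // requeue so the sync loop resumes
}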

@DaiHao
Contributor Author

DaiHao commented May 15, 2019

ping @soltysh @janetkuo

pkg/controller/daemon/daemon_controller.go (review thread, outdated)
@janetkuo
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DaiHao, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2019
@janetkuo
Member

I just reworded the release note.

@janetkuo
Member

/assign @k82cn

@answer1991
Contributor

awesome work!

@zhangxiaoyu-zidif
Contributor

ping @k82cn for lgtm

@answer1991
Contributor

answer1991 commented May 16, 2019

Is there the same issue in the ReplicaSet controller?

@DaiHao
Contributor Author

DaiHao commented May 16, 2019

Is there the same issue in the ReplicaSet controller?

The Job controller and the ReplicaSet controller already contain this logic:

func (jm *JobController) updatePod(old, cur interface{}) {
	curPod := cur.(*v1.Pod)
	oldPod := old.(*v1.Pod)
	if curPod.ResourceVersion == oldPod.ResourceVersion {
		// Periodic resync will send update events for all known pods.
		// Two different versions of the same pod will always have different RVs.
		return
	}
	if curPod.DeletionTimestamp != nil {
		// when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period,
		// and after such time has passed, the kubelet actually deletes it from the store. We receive an update
		// for modification of the deletion timestamp and expect an job to create more pods asap, not wait
		// until the kubelet actually deletes the pod.
		jm.deletePod(curPod)
		return
	}

func (rsc *ReplicaSetController) updatePod(old, cur interface{}) {
	curPod := cur.(*v1.Pod)
	oldPod := old.(*v1.Pod)
	if curPod.ResourceVersion == oldPod.ResourceVersion {
		// Periodic resync will send update events for all known pods.
		// Two different versions of the same pod will always have different RVs.
		return
	}
	labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)
	if curPod.DeletionTimestamp != nil {
		// when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period,
		// and after such time has passed, the kubelet actually deletes it from the store. We receive an update
		// for modification of the deletion timestamp and expect an rs to create more replicas asap, not wait
		// until the kubelet actually deletes the pod. This is different from the Phase of a pod changing, because
		// an rs never initiates a phase change, and so is never asleep waiting for the same.
		rsc.deletePod(curPod)
		if labelChanged {
			// we don't need to check the oldPod.DeletionTimestamp because DeletionTimestamp cannot be unset.
			rsc.deletePod(oldPod)
		}
		return
	}

@k82cn
Member

k82cn commented May 16, 2019

I'm going to review this PR today :)

@@ -535,6 +535,15 @@ func (dsc *DaemonSetsController) updatePod(old, cur interface{}) {
		return
	}

	if curPod.DeletionTimestamp != nil {
Contributor

if oldPod.DeletionTimestamp == nil && curPod.DeletionTimestamp != nil {
	...
}

Checking oldPod's DeletionTimestamp as well would be better. Same as the ReplicaSet controller.

Contributor Author

Yes, it might reduce unnecessary sync loops.
The only risk I have considered is that if the handler (updatePod or deletePod) returns before lowering the expectation, we also lose the chance to ever satisfy it.
@k82cn please help review this PR, thanks.

Contributor

If my suggestion is acceptable, please create another PR to fix the ReplicaSet controller.

@k82cn
Member

k82cn commented May 20, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2019
@k8s-ci-robot k8s-ci-robot merged commit 6ba13bf into kubernetes:master May 20, 2019
@DaiHao DaiHao deleted the daemon branch May 21, 2019 08:33
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps Categorizes an issue or PR as relevant to SIG Apps.
size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

daemonset rollingupdate hang
9 participants