fix daemon set rolling update hang #77773
Conversation
Hi @DaiHao. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig apps
/ok-to-test
/assign @soltysh
/test pull-kubernetes-kubemark-e2e-gce-big
dsc.deletePod(curPod)
return
}
I am confused about how this fixes the issue. Did you try this fix and confirm that it solves the issue?
The reason the DaemonSet rolling update gets stuck is that its expectations are never satisfied. In this fix, we lower the deletion expectation once the pod's DeletionTimestamp is non-nil, so the expectation is satisfied and the sync loop can resume.
Before this fix, if a pod got stuck in Terminating, the DaemonSet would never satisfy its expectations.
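For context, a minimal standalone sketch of the expectation bookkeeping being described, using ControllerExpectations from k8s.io/kubernetes/pkg/controller (method signatures as of this PR's era); the DaemonSet key and count below are hypothetical:

```go
package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/controller"
)

func main() {
	// ControllerExpectations is the bookkeeping type the DaemonSet controller uses.
	exp := controller.NewControllerExpectations()
	dsKey := "default/my-daemonset" // hypothetical DaemonSet key

	// During a rolling update the controller records how many deletions it expects to observe.
	exp.ExpectDeletions(dsKey, 1)
	fmt.Println(exp.SatisfiedExpectations(dsKey)) // false: the sync skips manage()

	// Normally a pod delete event lowers the expectation. A pod stuck in Terminating
	// never produces that event, so without this fix the call below never happens
	// and the rolling update hangs.
	exp.DeletionObserved(dsKey)
	fmt.Println(exp.SatisfiedExpectations(dsKey)) // true: the sync loop can resume
}
```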
Shouldn't lowering the expectation be done using dsc.expectations.DeletionObserved(dsKey)?
@janetkuo could you help us understand this fix? Do we need a test?
> Shouldn't lowering the expectation be done using dsc.expectations.DeletionObserved(dsKey)?

Of course we could just use dsc.expectations.DeletionObserved(dsKey), but with this fix we can reuse the code in deletePod and keep the code consistent with the ReplicaSet controller and Job controller. See #77773 (comment).
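For clarity, a rough consolidation of the change being discussed, reconstructed from the diff fragments shown in this review (a sketch, not necessarily the exact merged hunk):

```go
// In DaemonSetsController.updatePod, after curPod/oldPod have been extracted:
if curPod.DeletionTimestamp != nil {
	// Option taken by this PR: route the update through the delete handler so the
	// deletion expectation is lowered and the DaemonSet is enqueued, mirroring the
	// ReplicaSet and Job controllers.
	dsc.deletePod(curPod)
	return
}

// Alternative raised above: lower the expectation directly with
//   dsc.expectations.DeletionObserved(dsKey)
// which would also unblock the sync loop, but duplicates what deletePod already does.
```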
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DaiHao, janetkuo

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
I just reworded the release note.
/assign @k82cn
awesome work!
ping @k82cn for lgtm
Is the same issue present in the ReplicaSet controller?
The Job controller and ReplicaSet controller already contain this logic:

kubernetes/pkg/controller/job/job_controller.go lines 233 to 248 in b7bc5dc

kubernetes/pkg/controller/replicaset/replica_set.go lines 298 to 320 in b7bc5dc
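For reference, the ReplicaSet controller's updatePod handler contains roughly the following; this is a simplified paraphrase of the linked lines, not an exact quote:

```go
// Paraphrased from ReplicaSetController.updatePod in pkg/controller/replicaset/replica_set.go
// (simplified; see the linked lines for the real code).
labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)
if curPod.DeletionTimestamp != nil {
	// When a pod is deleted gracefully, its DeletionTimestamp is set first and the object
	// is removed only after the grace period, so the controller reacts to the update that
	// sets the timestamp instead of waiting for the delete event.
	rsc.deletePod(curPod)
	if labelChanged {
		// DeletionTimestamp cannot be unset, so oldPod does not need a separate check here.
		rsc.deletePod(oldPod)
	}
	return
}
```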
I'm going to review this PR today :)
@@ -535,6 +535,15 @@ func (dsc *DaemonSetsController) updatePod(old, cur interface{}) {
return
}

if curPod.DeletionTimestamp != nil {
if oldPod.DeletionTimestamp == nil && curPod.DeletionTimestamp != nil {
...
}
Checking oldPod's DeletionTimestamp would be better, the same as the ReplicaSet controller.
Yes, it might reduce needless sync loops. The only risk I have considered is that if the handler (updatePod or deletePod) returns before lowering the expectation, we also lose the chance to satisfy the expectation.

@k82cn please help review this PR, thanks.
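A sketch of that concern, assuming updatePod keeps its usual early return on an unchanged ResourceVersion (simplified; the exact ordering in the file may differ):

```go
// Hypothetical ordering inside updatePod: any early return that fires before the
// DeletionTimestamp check also skips lowering the expectation for this event.
if curPod.ResourceVersion == oldPod.ResourceVersion {
	// Periodic resyncs deliver identical objects; nothing new was observed.
	return
}
if curPod.DeletionTimestamp != nil {
	dsc.deletePod(curPod) // lowers the deletion expectation and enqueues the DaemonSet
	return
}
```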
If my suggestion is acceptable, please create another PR to fix ReplicaSet-Controller.
/lgtm
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #63465
Special notes for your reviewer:
When the daemon controller deletes a pod on a not-ready node, the pod can get stuck in the Terminating state. The informer then never receives a pod delete event, so the expectation is never satisfied and the daemon controller never runs the manage method.

Does this PR introduce a user-facing change?: