Set reason and message on Pod during nodecontroller eviction #36017
Conversation
d0ee9d9 to c1ce04b (Compare)
One nit, otherwise LGTM.
@@ -70,6 +71,14 @@ func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, n
			continue
		}

		// Set reason and message in the pod object.
It would be nice to have a test case covering this method and that action.
@soltysh Added test
7ebae58 to 7bbcb76 (Compare)
@@ -70,6 +71,14 @@ func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, n
			continue
		}

		// Set reason and message in the pod object.
		if updatedPod, err := setPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
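For context, a helper along these lines might look roughly like the sketch below. This is an illustration based on the diff hunk above, not the PR's exact code; the constant names (NodeUnreachablePodReason, NodeUnreachablePodMessage) and the UpdateStatus call are assumptions.

func setPodTerminationReason(kubeClient clientset.Interface, pod *api.Pod, nodeName string) (*api.Pod, error) {
	// If the reason was already recorded (e.g. on an earlier retry), skip the write.
	if pod.Status.Reason == node.NodeUnreachablePodReason {
		return pod, nil
	}

	// Record why the nodecontroller is evicting the pod so that
	// kubectl get/describe can surface it.
	pod.Status.Reason = node.NodeUnreachablePodReason
	pod.Status.Message = fmt.Sprintf(node.NodeUnreachablePodMessage, nodeName, pod.Name)

	// Persist the status change; the caller decides how to handle a failure.
	updatedPod, err := kubeClient.Core().Pods(pod.Namespace).UpdateStatus(pod)
	if err != nil {
		return nil, err
	}
	return updatedPod, nil
}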
This would block eviction. Do we want to do that?
Maybe we should keep the error and return it after deletion. Alternatively we can just hope we'll retry here (will we retry?)
Yes, when we return false, we retry right away with 0 delay.
@smarterclayton I can't think of a scenario where the update fails when the delete could have proceeded. However, like you said, it would be more defensive to keep the error and not block the eviction, I'll go ahead and do that. At worst, it could lead to multiple delete calls due to the retry loop.
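A minimal sketch of that best-effort ordering, assuming a per-pod loop that aggregates errors (the error aggregation is an assumption, not the PR's literal code):

// Best effort: remember an update failure but still issue the delete, and
// only surface the saved error afterwards so it never blocks the eviction.
var errs []error
if _, err := setPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
	errs = append(errs, err) // keep the error, keep going
}
if err := kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
	errs = append(errs, err)
}
// ...after the loop, return an aggregate of errs so the evictor retries.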
LGTM
85b94b0 to 940565e (Compare)
@smarterclayton Updated. PTAL.
If you think that such 'best effort' behavior is better, then I'm not going to block this PR. I just think it may cause more harm than good, and it might be a source of red herrings when debugging NC issues (so whatever you do, please at least add a retry loop).
The retry loop is the evictor:
Oh, that one, yes. But this is actually bad: if we succeed in deleting, then there's nothing left to update in another round (e.g. if the Node no longer exists and the Pod will be force-deleted by PodGC).
Good point. So either we force the loop to run again if we get a conflict, or… So: for pods { … }
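One way to read that ordering is sketched below; the helper name, loop shape, and client calls are assumptions used for illustration, not the PR's exact code. The idea is to make sure the status update lands (re-reading on conflict) before each delete is issued.

// Illustrative sketch only: record the termination reason, retrying on
// conflict with a fresh read, and only then delete the pod.
func updateThenDeletePods(kubeClient clientset.Interface, pods []api.Pod, nodeName string) error {
	for i := range pods {
		pod := &pods[i]
		for {
			if _, err := setPodTerminationReason(kubeClient, pod, nodeName); err == nil {
				break // the eviction reason/message is now persisted
			} else if !errors.IsConflict(err) {
				return err // unexpected failure; let the evictor retry the node
			}
			// Conflict: someone else updated the pod; re-read it and try again.
			fresh, err := kubeClient.Core().Pods(pod.Namespace).Get(pod.Name)
			if err != nil {
				return err
			}
			pod = fresh
		}
		if err := kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
			return err
		}
	}
	return nil
}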
Is the concern that we wouldn't have updated the status and would have deleted it too soon?
I'm fine with @smarterclayton's proposal.
@smarterclayton @gmarek Do we also want to cover other failure reasons? For example, ServerTimeout seems like something we should account for and retry on, in addition to Conflict.
The client should already retry those for you.
@smarterclayton SG. I've updated the PR with the special retry in case of update conflict.
Oops, there is an issue with it. Will push an update shortly.
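The reworked conflict handling might look roughly like the snippet below inside the per-pod loop. It is a sketch, not the literal diff; updateErrList and the log call are assumptions. A Conflict skips the delete for that pod and records an error so the eviction is retried, while other update failures are tolerated and the delete proceeds best-effort.

if _, err := setPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
	if errors.IsConflict(err) {
		// Someone else changed the pod; don't delete it with a stale status.
		// Record the error so the evictor runs this node again.
		updateErrList = append(updateErrList,
			fmt.Errorf("update status failed for pod %q: %v", pod.Name, err))
		continue
	}
	// Any other failure is logged but doesn't block the eviction.
	glog.Warningf("failed to set termination reason for pod %q: %v", pod.Name, err)
}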
Pods which are evicted by the nodecontroller due to network malfunction or an unresponsive kubelet should be differentiated from terminations initiated by other sources. The reason/message are consumed by kubectl to provide a better summary in get/describe.
@smarterclayton Updated. PTAL.
/lgtm
Bumping priority to P2 as #34825 will need to be rebased afterwards.
Jenkins GCI GCE e2e failed for commit 6d7213d. Full PR test history. The magic incantation to run this job again is @k8s-bot gci gce e2e test this.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
@k8s-bot gci gce e2e test this
Automatic merge from submit-queue
What this PR does / why we need it: Pods which are evicted by the nodecontroller due to a network partition or an unresponsive kubelet should be differentiated from terminations initiated by other sources. The reason/message are consumed by kubectl to provide a better summary in get/describe.
Which issue this PR fixes (optional, in fixes #<issue number>(, #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #35725

Release note: