
Set reason and message on Pod during nodecontroller eviction #36017

Merged
merged 3 commits on Nov 4, 2016

Conversation

foxish (Contributor) commented Nov 1, 2016

What this PR does / why we need it: Pods which are evicted by the node controller due to network partition or an unresponsive kubelet should be differentiated from terminations initiated by other sources. The reason/message are consumed by kubectl to provide a better summary in get/describe.

Which issue this PR fixes (optional, in fixes #<issue number>(, #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #35725

Release note:

Pods that are terminating due to eviction by the node controller (typically due to an unresponsive kubelet or network partition) now surface in `kubectl get` output as being in state "Unknown", along with a longer description in `kubectl describe` output.
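
For context, a minimal sketch of what such a status update could look like; the "NodeLost" reason string, the message text, and the clientset call shape are assumptions about the 1.5-era API, not code copied from this PR:

// Hypothetical sketch: mark a pod as lost before the node controller deletes it,
// so that kubectl get/describe can surface the eviction. The "NodeLost" reason
// string and the message wording are assumptions; kubeClient is assumed to be
// the internal clientset of that era.
func setPodTerminationReason(kubeClient clientset.Interface, pod *api.Pod, nodeName string) (*api.Pod, error) {
    if pod.Status.Reason == "NodeLost" {
        // Already marked; nothing to do.
        return pod, nil
    }
    pod.Status.Reason = "NodeLost"
    pod.Status.Message = fmt.Sprintf("Node %v which was running pod %v is unresponsive", nodeName, pod.Name)

    // Persist the status so it is visible until the pod is actually deleted.
    updatedPod, err := kubeClient.Core().Pods(pod.Namespace).UpdateStatus(pod)
    if err != nil {
        return nil, err
    }
    return updatedPod, nil
}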


@foxish foxish added the release-note and do-not-merge labels Nov 1, 2016
@foxish foxish added this to the v1.5 milestone Nov 1, 2016
@foxish foxish self-assigned this Nov 1, 2016
@k8s-github-robot k8s-github-robot added the size/XXL label Nov 1, 2016
@foxish foxish force-pushed the kubectl-new-2 branch 2 times, most recently from d0ee9d9 to c1ce04b on November 2, 2016 03:12
@k8s-github-robot k8s-github-robot added the size/M label and removed the size/XXL label Nov 2, 2016
@foxish foxish assigned smarterclayton, gmarek and mengqiy and unassigned foxish Nov 2, 2016
foxish (Contributor, Author) commented Nov 2, 2016

/cc @kubernetes/sig-apps @erictune @janetkuo

@foxish foxish removed the do-not-merge label Nov 2, 2016
soltysh (Contributor) left a comment

One nit, otherwise LGTM.

@@ -70,6 +71,14 @@ func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, n
continue
}

// Set reason and message in the pod object.
soltysh (Contributor) commented:
It would be nice to have a test case covering this method and that action.

foxish (Contributor, Author) replied:

@soltysh Added test

@foxish foxish force-pushed the kubectl-new-2 branch 3 times, most recently from 7ebae58 to 7bbcb76 on November 2, 2016 20:00
@k8s-github-robot k8s-github-robot added the size/L label and removed the size/M label Nov 2, 2016
@@ -70,6 +71,14 @@ func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, n
continue
}

// Set reason and message in the pod object.
if updatedPod, err := setPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
Contributor commented:

This would block eviction. Do we want to do that?

smarterclayton (Contributor) commented:

Maybe we should keep the error and return it after deletion. Alternatively we can just hope we'll retry here (will we retry?)

foxish (Contributor, Author) commented Nov 2, 2016:

Yes, when we return false, we retry right away with 0 delay.

foxish (Contributor, Author) commented:

@smarterclayton I can't think of a scenario where the update fails when the delete could have proceeded. However, like you said, it would be more defensive to keep the error and not block the eviction, I'll go ahead and do that. At worst, it could lead to multiple delete calls due to the retry loop.

soltysh (Contributor) left a comment

LGTM

@foxish foxish force-pushed the kubectl-new-2 branch 2 times, most recently from 85b94b0 to 940565e on November 2, 2016 22:00
@googlebot googlebot added cla: no and removed cla: yes labels Nov 2, 2016
foxish (Contributor, Author) commented Nov 2, 2016

@smarterclayton Updated. PTAL.

gmarek (Contributor) commented Nov 3, 2016

If you think that such 'best effort' behavior is better, then I'm not going to block this PR. I just think it may cause more harm than good, and it might be a source of red herrings when debugging NC issues (so whatever you do, please at least add a retry loop).

smarterclayton (Contributor) commented:

The retry loop is the evictor:

remaining, err := deletePods(nc.kubeClient, nc.recorder, value.Value, nodeUid, nc.daemonSetStore)
if err != nil {
    utilruntime.HandleError(fmt.Errorf("unable to evict node %q: %v", value.Value, err))
    return false, 0
}


If we fail with an error, we retry with no additional delay.

gmarek (Contributor) commented Nov 3, 2016

Oh, that one, yes. But this is actually bad - if we succeed in deleting, then there's nothing to update in the next round (e.g. if the Node no longer exists and the Pod will be force-deleted by PodGC).

smarterclayton (Contributor) commented:

Good point. So either we force the loop to run again if we get a conflict, or we don't worry about retry at all. I'd prefer the former - queue the error, don't delete the pod, return all errors, expect to get run again, then succeed. For maximum safety, we could do that only for Conflict errors, and in all other cases just continue with the delete (i.e. if you don't grant the node controller permission to update pod status, we should still delete the pod).

So:

for _, pod := range pods {
    // ...
    if err := updateStatus(); err != nil {
        if errors.IsConflict(err) {
            updateErrs = append(updateErrs, err)
            continue
        }
        utilruntime.HandleError(fmt.Errorf("unable to update pod status to indicate unreachable"))
    }
    deletePod()
}
if len(updateErrs) > 0 {
    return false, utilerrors.NewAggregate(updateErrs)
}


foxish (Contributor, Author) commented Nov 3, 2016

Is the concern that we wouldn't have updated the status and would have deleted the pod too soon?

gmarek (Contributor) commented Nov 3, 2016

I'm fine with @smarterclayton's proposal.

foxish (Contributor, Author) commented Nov 3, 2016

@smarterclayton @gmarek Do we also want to cover other failure reasons? For example, ServerTimeout (StatusReasonServerTimeout StatusReason = "ServerTimeout") seems like something we should account for and retry on, in addition to Conflict.

smarterclayton (Contributor) commented:

The client should already retry those for you.


foxish (Contributor, Author) commented Nov 3, 2016

@smarterclayton SG. I've updated the PR with the special retry in case of update conflict.
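
For readers following along, a rough sketch of that conflict-specific handling inside deletePods; the error-package helpers, the delete call, and the loop structure are assumptions, not the merged code:

// Hypothetical shape of the eviction loop: set the termination reason first,
// then delete. Only a Conflict on the status update defers the delete so that
// the evictor retries; any other update failure is logged and the delete
// proceeds (e.g. if the controller lacks permission to update pod status).
var updateErrList []error
for _, pod := range pods.Items {
    // ... daemon-set and mirror-pod checks elided ...
    if _, err := setPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
        if apierrors.IsConflict(err) {
            updateErrList = append(updateErrList,
                fmt.Errorf("update status failed for pod %q: %v", pod.Name, err))
            continue // skip the delete; the evictor will retry this node
        }
        utilruntime.HandleError(fmt.Errorf("unable to mark pod %q unreachable: %v", pod.Name, err))
    }
    if err := kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
        return false, err
    }
}
if len(updateErrList) > 0 {
    return false, utilerrors.NewAggregate(updateErrList)
}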

foxish (Contributor, Author) commented Nov 3, 2016

Oops, there is an issue with it. Will push an update shortly.

Pods which are evicted by the node controller due to network malfunction or an unresponsive kubelet should be differentiated from terminations initiated by other sources. The reason/message are consumed by kubectl to provide a better summary using get/describe.
foxish (Contributor, Author) commented Nov 3, 2016

@smarterclayton Updated. PTAL.

smarterclayton (Contributor) commented:

/lgtm

@k8s-github-robot k8s-github-robot added the lgtm label Nov 3, 2016
foxish (Contributor, Author) commented Nov 3, 2016

Bumping priority to P2 as #34825 will need to be rebased afterwards.

@foxish foxish added the priority/backlog label Nov 3, 2016
k8s-ci-robot (Contributor) commented:

Jenkins GCI GCE e2e failed for commit 6d7213d. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

k8s-github-robot commented:

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

foxish (Contributor, Author) commented Nov 4, 2016

@k8s-bot gci gce e2e test this

k8s-github-robot commented:

Automatic merge from submit-queue

Labels: lgtm, priority/backlog, release-note, size/L
Projects: None yet
Successfully merging this pull request may close these issues:
Pods that are terminating due to node loss should be surfaced in kubectl output
10 participants