
Node status down -> pod status unknown -> pod cant be deleted, cordon doesnt finish. #51333

Closed
jayunit100 opened this issue Aug 25, 2017 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@jayunit100
Member

jayunit100 commented Aug 25, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

A pod went into a zombie state (Unknown) and could not be deleted after its node went down.

What you expected to happen:

The pod should be deleted promptly.

How to reproduce it (as minimally and precisely as possible):

Not sure, but in this case we had a pod with 3 containers and the node went down (literally, couldn't SSH into it). Running kubectl delete pod repeatedly, the delete operation reports success rather than failing, yet the pod is never actually deleted.

Anything else we need to know?:

Looking at the pod status, it is clearly aware that it lost its node:

  message: Node kubernetes-minion-group-w2v6 which was running pod nginx-webapp-logstash-3874544627-5p63l
    is unresponsive
  phase: Running
  podIP: 10.244.3.41
  qosClass: Burstable
  reason: NodeLost
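
For reference, those reason/message fields can be read straight from the pod status with jsonpath. This is only an illustrative command, not part of the original report; the pod name is taken from the output above:

kubectl get pod nginx-webapp-logstash-3874544627-5p63l \
  -o jsonpath='{.status.reason}: {.status.message}{"\n"}'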

Environment:

  • Kubernetes version (use kubectl version):
    Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T06:43:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • GKE
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 25, 2017
@k8s-github-robot

@jayunit100
There are no sig labels on this issue. Please add a sig label by:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/sig-contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <label>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
The <group-suffix> in method 1 has to be replaced with one of these: bugs, feature-requests, pr-reviews, test-failures, proposals

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 25, 2017
@jayunit100
Member Author

jayunit100 commented Aug 25, 2017

Here's a snippet reproducing the exact behaviour.

jvyas@kubernetes-master ~ $ kubectl delete pod --all 
pod "nginx-webapp-logstash-3874544627-5p63l" deleted
jvyas@kubernetes-master ~ $ kubectl get pods
NAME                                     READY     STATUS    RESTARTS   AGE
nginx-webapp-logstash-3874544627-5p63l   3/3       Unknown   6          8d
jvyas@kubernetes-master ~ $ kubectl get pod nginx-webapp-logstash-3874544627-5p63l
NAME                                     READY     STATUS    RESTARTS   AGE
nginx-webapp-logstash-3874544627-5p63l   3/3       Unknown   6          8d

@jayunit100 jayunit100 changed the title Node status unknown -> pod cant be deleted, cordon doesnt finish. Node status down -> pod status unknown -> pod cant be deleted, cordon doesnt finish. Aug 25, 2017
@cblecker
Member

cblecker commented Sep 4, 2017

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 4, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 4, 2017
@ravisantoshgudimetla
Contributor

@jayunit100 Have you tried forceful deletion?

kubectl delete pod <pod_name> --grace-period=0 --force

@derekwaynecarr
Member

This is expected behavior. Previously, the node controller would have deleted these pods, but the behavior changed in Kubernetes 1.5 to require the admin to force-delete the pod, per the above comment.
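
For anyone landing here, a minimal sketch of that force deletion applied across all namespaces. The awk column positions assume the default kubectl get pods --all-namespaces output (NAMESPACE, NAME, READY, STATUS, ...); adjust if your output differs:

# force-delete every pod left in Unknown state after a node is lost
kubectl get pods --all-namespaces --no-headers | \
  awk '$4 == "Unknown" {print $1, $2}' | \
  while read ns pod; do
    kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
  done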

@lsylvain

lsylvain commented Mar 21, 2018

Regardless of the intent of the 1.5 changes, as things stand now at release 1.9.2, even when using --grace-period=0 --force the pod is not deleted. The status remains Terminated. The web UI shows the status with a "moon" icon whose tooltip reads "This pod is in a pending state", and the status column displays Terminated:ExitCode:$. But this is only the beginning of many potential other woes.

If the pod was part of a daemonset, the pod cannot be replaced. And, BTW, the daemonset cannot be deleted because it references the terminated but pending pod. I hate cliché terms, but the term "zombie pod" someone used in another post out there almost seems appropriate here.

The only way I can see to resolve this at the moment is to clean up the etcd data. Since there are no etcd cleanup tools for K8S, that means the etcd data must be wiped, which really means the K8S cluster must be torn down and stood up again. That is obviously not very elegant and is disruptive, which in most production scenarios is not an acceptable alternative.
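
If tearing the cluster down is off the table, one last-resort alternative (a sketch under assumptions, not something verified in this thread) is to delete only the stuck pod's key directly from etcd with etcdctl. This bypasses the API server entirely and assumes direct access to the etcd instance the API server uses, the etcd v3 API, the default /registry key prefix, and the default namespace; the pod name below is the one from the original report, used purely as a placeholder:

# list pod keys to locate the stuck object
ETCDCTL_API=3 etcdctl get /registry/pods/default/ --prefix --keys-only
# delete the stuck pod's key directly (bypasses the API server; last resort)
ETCDCTL_API=3 etcdctl del /registry/pods/default/nginx-webapp-logstash-3874544627-5p63l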

@moserke

moserke commented Mar 29, 2018

@lsylvain We have the same problem. The workaround we've found for the time being is to create a node object with the name of the one that died, bounce your schedulers & controllers, and then delete the node object. This seems to allow the scheduler to pick a new, good node.
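
A rough sketch of those steps. The node name is taken from the original report here and is an assumption; substitute the node that actually died, and restart the scheduler/controller-manager by whatever mechanism your deployment uses:

# recreate a placeholder node object with the name of the dead node
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Node
metadata:
  name: kubernetes-minion-group-w2v6
EOF
# restart the scheduler and controller-manager, wait for the stuck pods to be
# cleaned up / rescheduled, then remove the placeholder node object again
kubectl delete node kubernetes-minion-group-w2v6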

@kostyrev

kostyrev commented May 21, 2018

@moserke could you please share the commands to achieve that?
UPD.
For those who end up here:
How to create a node object is shown in the docs.
For me it was enough to create the node object and restart one of the master nodes.

@moserke

moserke commented May 29, 2018

Sorry for the slow response @kostyrev. Glad you were able to figure it out and that it worked for you too!

@edwardstudy
Contributor

Hi, I hit the same problem. When a node crashed, we expected the pods to be recreated on other nodes matching their labels, but the pods got stuck.

@finndev

finndev commented Jul 2, 2018

Hi, we are facing the same issue. When a node crashes or is deleted while a pod was running there, we cannot remove the pod from the list:

kubectl get pods -o wide
xxx-77bd6557d-8rvv8          1/1       Unknown   0          7h        10.244.5.3    worker-13
kubectl delete pod xxx-77bd6557d-8rvv8 --force --grace-period=0
pod "xxx-77bd6557d-8rvv8" force deleted

However, it is still there. Not deleted.

Is there any way to remove it from the list?
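
One thing worth checking (an assumption on my part, not confirmed for your cluster): if the object survives a force delete, it may still carry finalizers that keep the API server from dropping it. For example:

# show any finalizers left on the object
kubectl get pod xxx-77bd6557d-8rvv8 -o jsonpath='{.metadata.finalizers}{"\n"}'
# if finalizers are present, clearing them lets the API server remove the object
kubectl patch pod xxx-77bd6557d-8rvv8 -p '{"metadata":{"finalizers":null}}'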
