
Node status down -> pod status unknown -> pod cant be deleted, cordon doesnt finish. #51333

Closed
jayunit100 opened this issue Aug 25, 2017 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@jayunit100
Member

jayunit100 commented Aug 25, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

A pod went into a zombie state (Unknown) and could not be deleted after its node went down.

What you expected to happen:

The pod should be deleted promptly.

How to reproduce it (as minimally and precisely as possible):

Not sure, but in this case we had a pod with 3 containers and the node went down (literally, couldn't SSH into it). Running kubectl delete pod repeatedly, the delete operation reports success rather than failing, yet the pod is never actually deleted.

Anything else we need to know?:

Looking at the pod status, it is clearly aware that it lost its node:

  message: Node kubernetes-minion-group-w2v6 which was running pod nginx-webapp-logstash-3874544627-5p63l
    is unresponsive
  phase: Running
  podIP: 10.244.3.41
  qosClass: Burstable
  reason: NodeLost
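
For reference, those reason/message fields can be read straight from the pod status with jsonpath. This is only an illustrative command, not part of the original report; the pod name is taken from the output above:

kubectl get pod nginx-webapp-logstash-3874544627-5p63l \
  -o jsonpath='{.status.reason}: {.status.message}{"\n"}'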

Environment:

  • Kubernetes version (use kubectl version):
    Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T06:43:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • GKE
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 25, 2017
@k8s-github-robot

@jayunit100
There are no sig labels on this issue. Please add a sig label by:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/sig-contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <label>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
The <group-suffix> in method 1 has to be replaced with one of these: bugs, feature-requests, pr-reviews, test-failures, proposals

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 25, 2017
@jayunit100
Member Author

jayunit100 commented Aug 25, 2017

Here's a snippet reproducing the exact behaviour.

jvyas@kubernetes-master ~ $ kubectl delete pod --all 
pod "nginx-webapp-logstash-3874544627-5p63l" deleted
jvyas@kubernetes-master ~ $ kubectl get pods
NAME                                     READY     STATUS    RESTARTS   AGE
nginx-webapp-logstash-3874544627-5p63l   3/3       Unknown   6          8d
jvyas@kubernetes-master ~ $ kubectl get pod nginx-webapp-logstash-3874544627-5p63l
NAME                                     READY     STATUS    RESTARTS   AGE
nginx-webapp-logstash-3874544627-5p63l   3/3       Unknown   6          8d

@jayunit100 jayunit100 changed the title Node status unknown -> pod cant be deleted, cordon doesnt finish. Node status down -> pod status unknown -> pod cant be deleted, cordon doesnt finish. Aug 25, 2017
@cblecker
Member

cblecker commented Sep 4, 2017

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 4, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 4, 2017
@ravisantoshgudimetla
Contributor

@jayunit100 Have you tried forceful deletion?

kubectl delete pod <pod_name> --grace-period=0 --force

@derekwaynecarr
Member

This is expected behavior. Previously, the node controller would have deleted these pods, but the behavior changed in Kubernetes 1.5 to require the admin to force-delete the pod, per the above comment.
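
For anyone landing here, a minimal sketch of that force deletion applied across all namespaces. The awk column positions assume the default kubectl get pods --all-namespaces output (NAMESPACE, NAME, READY, STATUS, ...); adjust if your output differs:

# force-delete every pod left in Unknown state after a node is lost
kubectl get pods --all-namespaces --no-headers | \
  awk '$4 == "Unknown" {print $1, $2}' | \
  while read ns pod; do
    kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
  done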

@lsylvain

lsylvain commented Mar 21, 2018

Regardless of the intent of the 1.5 changes, as things stand now at release 1.9.2, even when using --grace-period=0 --force the pod is not deleted. The status remains Terminated. The web UI shows the status with a "moon" icon whose tooltip reads "This pod is in a pending state", and the status column displays Terminated:ExitCode:$. But this is only the beginning of many potential other woes.

If the pod was part of a daemonset, the pod cannot be replaced. And, BTW, the daemonset cannot be deleted because it references the terminated but pending pod. I hate cliché terms, but the term "zombie pod" someone used in another post out there almost seems appropriate here.

The only way I can see to resolve this at the moment is to clean up the etcd data. Since there are no etcd cleanup tools for K8S, that means the etcd data must be wiped, which really means the K8S cluster must be torn down and stood up again. That is obviously not very elegant and is disruptive, which in most production scenarios is not an acceptable alternative.
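
If tearing the cluster down is off the table, one last-resort alternative (a sketch under assumptions, not something verified in this thread) is to delete only the stuck pod's key directly from etcd with etcdctl. This bypasses the API server entirely and assumes direct access to the etcd instance the API server uses, the etcd v3 API, the default /registry key prefix, and the default namespace; the pod name below is the one from the original report, used purely as a placeholder:

# list pod keys to locate the stuck object
ETCDCTL_API=3 etcdctl get /registry/pods/default/ --prefix --keys-only
# delete the stuck pod's key directly (bypasses the API server; last resort)
ETCDCTL_API=3 etcdctl del /registry/pods/default/nginx-webapp-logstash-3874544627-5p63l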

@moserke

moserke commented Mar 29, 2018

@lsylvain We have the same problem. The workaround we've found for the time being is to create a node object with the name of the one that died, bounce your schedulers & controllers, and then delete the node object. This seems to allow the scheduler to pick a new, good node.
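
A rough sketch of those steps. The node name is taken from the original report here and is an assumption; substitute the node that actually died, and restart the scheduler/controller-manager by whatever mechanism your deployment uses:

# recreate a placeholder node object with the name of the dead node
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Node
metadata:
  name: kubernetes-minion-group-w2v6
EOF
# restart the scheduler and controller-manager, wait for the stuck pods to be
# cleaned up / rescheduled, then remove the placeholder node object again
kubectl delete node kubernetes-minion-group-w2v6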

@kostyrev

kostyrev commented May 21, 2018

@moserke could you please share the commands to achieve that?
UPD.
For those who end up here:
How to create a node object is shown in the docs.
For me it was enough to create the node object and restart one of the master nodes.

@moserke

moserke commented May 29, 2018

Sorry for the slow response @kostyrev. Glad you were able to figure it out and that it worked for you too!

@edwardstudy
Contributor

Hi, I hit the same problem. When a node crashed, we expected the pods to be recreated on other nodes matching their labels, but the pods got stuck.

@finndev

finndev commented Jul 2, 2018

Hi, we are facing the same issue. When a node crashes or is deleted while a pod was running there, we cannot remove the pod from the list:

kubectl get pods -o wide
xxx-77bd6557d-8rvv8          1/1       Unknown   0          7h        10.244.5.3    worker-13
kubectl delete pod xxx-77bd6557d-8rvv8 --force --grace-period=0
pod "xxx-77bd6557d-8rvv8" force deleted

However, it is still there. Not deleted.

Is there any way to remove it from the list?
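
One thing worth checking (an assumption on my part, not confirmed for your cluster): if the object survives a force delete, it may still carry finalizers that keep the API server from dropping it. For example:

# show any finalizers left on the object
kubectl get pod xxx-77bd6557d-8rvv8 -o jsonpath='{.metadata.finalizers}{"\n"}'
# if finalizers are present, clearing them lets the API server remove the object
kubectl patch pod xxx-77bd6557d-8rvv8 -p '{"metadata":{"finalizers":null}}'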
