Pod from DaemonSet stuck in Unknown state because of NodeLost #44458

Closed
ghost opened this issue Apr 13, 2017 · 5 comments
Labels
area/workload-api/daemonset, sig/apps

Comments

@ghost

ghost commented Apr 13, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): no

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): unknown daemonset nodelost


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1+coreos.0", GitCommit:"9212f77ed8c169a0afa02e58dce87913c6387b3e", GitTreeState:"clean", BuildDate:"2017-04-04T00:32:53Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: bare-metal, 3 masters, 12 workers
  • OS (e.g. from /etc/os-release): Container Linux 1298.7.0
  • Kernel (e.g. uname -a): 4.9.16-coreos-r1
  • Install tools: own scripts
  • Others:

What happened:

# kubectl delete -f manifests/canal-node.yaml
error: error when stopping "manifests/canal-node.yaml": timed out waiting for the condition

canal-node.yaml contains a DaemonSet: https://github.com/projectcalico/canal/blob/master/k8s-install/canal.yaml

# kubectl get pods
...
canal-node-36b0t                            3/3       Unknown            0          6d
...

# kubectl describe pod/canal-node-36b0t
Status:                         Terminating (expires Thu, 13 Apr 2017 19:14:34 +0200)
Termination Grace Period:       1s
Reason:                         NodeLost
Message:                        Node XXX which was running pod canal-node-36b0t is unresponsive

# kubectl delete --now pod/canal-node-36b0t
pod "canal-node-36b0t" deleted

# kubectl get pods
...
canal-node-36b0t                            3/3       Unknown            0          6d
...

I cannot even recreate the DaemonSet: it reports that it needs 16 pods and has 1 running (for interesting definitions of "running", apparently), but it does not start any additional pods.

What you expected to happen:

canal-node-36b0t should have been deleted when its node was lost, or when the DaemonSet was deleted. At the latest, it should have been deleted when I ran kubectl delete --force --now.

How to reproduce it (as minimally and precisely as possible):

Run a DaemonSet, make the node fail, and you are left with a stale pod. (Note that I did not actually try to reproduce this.)
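For reference, a rough sketch of one way to reach this state (untested, and assuming the kubelet runs as a systemd unit named kubelet, as is typical on Container Linux):

# kubectl create -f manifests/canal-node.yaml

On the worker node to be "lost":

# systemctl stop kubelet

Once the node controller marks the node NotReady and the pod eviction timeout (5m by default) passes, the DaemonSet pod on that node should show Unknown with Reason NodeLost:

# kubectl get pods
# kubectl describe pod <canal-node pod on that node>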

Anything else we need to know:

Please ask for anything that might be missing.

@0xmichalis
Contributor

@kubernetes/sig-apps-bugs

@foxish
Contributor

foxish commented Apr 14, 2017

kubectl delete pods <pod> --grace-period=0 --force

The above should work. The behavior was modified in this PR. --now is equivalent to specifying a grace period of 1s, which does not force-delete the pod but instead waits for graceful termination (for the kubelet to report that the pod was killed).
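Applied to the stuck pod from this report, that would look something like the following (pod name taken from the report above; exact output may vary by client version):

# kubectl delete pod canal-node-36b0t --grace-period=0 --force
pod "canal-node-36b0t" deleted

With --grace-period=0 --force, the pod object is removed from the API server immediately, without waiting for the kubelet on the unreachable node to confirm termination.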

Not deleting the unreachable pod was a decision made in 1.5 in the interest of providing safety guarantees. The relevant rationale doc is https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-safety.md

@Raffo
Contributor

Raffo commented Apr 20, 2017

I had a similar issue, as described in #41916: restarting the apiserver on the master node causes the DaemonSet pod running on the node to exit and get stuck in the Terminating state. A force delete "fixes" the problem.

@foxish
Contributor

foxish commented Jun 17, 2017

Closing this issue, because the behavior is as intended. The way around it would be to delete the node from k8s to keep it in sync with the infrastructure.
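For example (the node name here is a placeholder for the lost node):

# kubectl get nodes
# kubectl delete node <lost-node-name>

Deleting the Node object lets the control plane garbage-collect the pods that were bound to it, including the stuck canal-node pod, and keeps the cluster's view consistent with the actual infrastructure.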

@vasicvuk

vasicvuk commented Jul 9, 2019

But shouldn't the node be automatically deleted from the cluster if the kubelet does not respond for some period of time?
