Pod from DaemonSet stuck in Unknown state because of NodeLost #44458

Closed
ghost opened this issue Apr 13, 2017 · 5 comments
Labels
area/workload-api/daemonset, sig/apps

Comments

@ghost

ghost commented Apr 13, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): no

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): unknown daemonset nodelost


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1+coreos.0", GitCommit:"9212f77ed8c169a0afa02e58dce87913c6387b3e", GitTreeState:"clean", BuildDate:"2017-04-04T00:32:53Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: bare-metal, 3 masters, 12 workers
  • OS (e.g. from /etc/os-release): Container Linux 1298.7.0
  • Kernel (e.g. uname -a): 4.9.16-coreos-r1
  • Install tools: own scripts
  • Others:

What happened:

# kubectl delete -f manifests/canal-node.yaml
error: error when stopping "manifests/canal-node.yaml": timed out waiting for the condition

canal-node.yaml contains a DaemonSet: https://github.com/projectcalico/canal/blob/master/k8s-install/canal.yaml

# kubectl get pods
...
canal-node-36b0t                            3/3       Unknown            0          6d
...

# kubectl describe pod/canal-node-36b0t
Status:                         Terminating (expires Thu, 13 Apr 2017 19:14:34 +0200)
Termination Grace Period:       1s
Reason:                         NodeLost
Message:                        Node XXX which was running pod canal-node-36b0t is unresponsive

# kubectl delete --now pod/canal-node-36b0t
pod "canal-node-36b0t" deleted

# kubectl get pods
...
canal-node-36b0t                            3/3       Unknown            0          6d
...

I cannot even recreate the DaemonSet: it reports that it needs 16 pods and has 1 running (for interesting definitions of "running", apparently), but it does not start any additional pods.

What you expected to happen:

canal-node-36b0t should have been deleted when its node was lost, or when the DaemonSet was deleted. At the latest, it should have been deleted when I ran kubectl delete --force --now.

How to reproduce it (as minimally and precisely as possible):

Run a DaemonSet, make the node fail, and you are left with a stale pod. (Note that I did not actually try to reproduce this.)
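For reference, a rough sketch of one way to reach this state (untested, and assuming the kubelet runs as a systemd unit named kubelet, as is typical on Container Linux):

# kubectl create -f manifests/canal-node.yaml

On the worker node to be "lost":

# systemctl stop kubelet

Once the node controller marks the node NotReady and the pod eviction timeout (5m by default) passes, the DaemonSet pod on that node should show Unknown with Reason NodeLost:

# kubectl get pods
# kubectl describe pod <canal-node pod on that node>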

Anything else we need to know:

Please ask for anything that might be missing.

@0xmichalis
Contributor

@kubernetes/sig-apps-bugs

@foxish
Contributor

foxish commented Apr 14, 2017

kubectl delete pods <pod> --grace-period=0 --force

The above should work. The behavior was modified in this PR. --now is equivalent to specifying a grace period of 1s, which does not force-delete the pod but instead waits for graceful termination (for the kubelet to report that the pod was killed).
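Applied to the stuck pod from this report, that would look something like the following (pod name taken from the report above; exact output may vary by client version):

# kubectl delete pod canal-node-36b0t --grace-period=0 --force
pod "canal-node-36b0t" deleted

With --grace-period=0 --force, the pod object is removed from the API server immediately, without waiting for the kubelet on the unreachable node to confirm termination.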

Not deleting the unreachable pod was a decision made in 1.5 in the interest of providing safety guarantees. The relevant rationale doc is https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-safety.md

@Raffo
Contributor

Raffo commented Apr 20, 2017

I had a similar issue, as described in #41916: restarting the apiserver on the master node causes the DaemonSet pod running on the node to exit and get stuck in the Terminating state. A force delete "fixes" the problem.

@foxish
Contributor

foxish commented Jun 17, 2017

Closing this issue, because the behavior is as intended. The way around it would be to delete the node from k8s to keep it in sync with the infrastructure.
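For example (the node name here is a placeholder for the lost node):

# kubectl get nodes
# kubectl delete node <lost-node-name>

Deleting the Node object lets the control plane garbage-collect the pods that were bound to it, including the stuck canal-node pod, and keeps the cluster's view consistent with the actual infrastructure.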

@vasicvuk

vasicvuk commented Jul 9, 2019

But shouldn't the node be automatically deleted from the cluster if the kubelet does not respond for some period of time?
