Pods that fail health checks always restarting on the same minion instead of others? #13385
Comments
There are two things here: first, figure out what was wrong with your node and start detecting it; second, the meta-problem of noticing when something is wrong with a node even when we don't have a detection mechanism for that specific thing.
@lavalamp certainly, it's critical to be able to detect issues with nodes. Is this something on the roadmap that will be built into kubernetes/kubelets? In the meantime, I need some way to detect this internally and either handle it automatically and/or send alerts. What are some ways you'd advise to do this? This issue bit me again over the weekend. I have a simple 3-node cluster in AWS that was provisioned with [1]. Is this another GitHub issue I should create?
@joshm1 We already detect various problems with nodes (disk full, docker down, etc.). It looks like you're running out of file handles, so something is leaking them or you have that system setting too low (we raise it for the master, but I'm not sure about nodes). @dchen1107 Can we detect out-of-FDs and make the node not ready?
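For anyone who wants to watch for this out-of-band until built-in detection lands, here is a minimal Go sketch (not part of Kubernetes; the 90% threshold is an arbitrary assumption for illustration) that reads the kernel's file-handle counters from /proc/sys/fs/file-nr:

```go
// fdcheck: report file-handle pressure on a Linux node.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// /proc/sys/fs/file-nr holds three fields:
	// allocated handles, free handles, and the system-wide maximum.
	data, err := os.ReadFile("/proc/sys/fs/file-nr")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read /proc/sys/fs/file-nr:", err)
		os.Exit(1)
	}
	f := strings.Fields(string(data))
	allocated, _ := strconv.ParseFloat(f[0], 64)
	max, _ := strconv.ParseFloat(f[2], 64)

	usage := allocated / max
	fmt.Printf("file handles: %.0f/%.0f (%.1f%% used)\n", allocated, max, usage*100)
	if usage > 0.9 { // hypothetical "node not ready" threshold
		fmt.Println("WARNING: node is nearly out of file handles")
	}
}
```

Running something like this from a cron job or DaemonSet and wiring it to your alerting is a crude but workable stopgap.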
The minion that ran out of file handles didn't have any running pods on it.
Detecting total number of fds open should be possible. In addition to that …
@dchen1107 Should we rename this issue to "detect out-of-FDs and mark node not ready when it happens"? I thought maybe we already had an issue open for that, but I can't find one.
How should the user work around the error? There is a new report of this issue on Stack Overflow: http://stackoverflow.com/questions/37067434/kubernetes-cant-start-due-to-too-many-open-files-in-system
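As a stopgap, the usual workaround is to raise the system-wide handle limit (`sysctl -w fs.file-max=<N>`, persisted in /etc/sysctl.conf) and to hunt down the leaking process with lsof. A hedged Go sketch of the programmatic equivalent, assuming root and an arbitrary example value:

```go
// raisefdmax: programmatic equivalent of `sysctl -w fs.file-max=2097152`.
// Must run as root; 2097152 is an illustrative value, not a recommendation
// from this thread.
package main

import (
	"fmt"
	"os"
)

func main() {
	if err := os.WriteFile("/proc/sys/fs/file-max", []byte("2097152\n"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "raise fs.file-max:", err)
		os.Exit(1)
	}
	fmt.Println("fs.file-max raised (persist it in /etc/sysctl.conf to survive reboots)")
}
```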
Also reported in issue #26246
Hi all. In addition to this, there can be other problems: in a virtualized environment, the resource sizes we detect might not be the ones we can actually work with. We might in fact be running on swap, so services might not react in time and should therefore be moved to other hosts.
We experience this often. All nodes report healthy, but a pod gets stuck in a restart loop (for whatever reason). However, if I delete the pod, it is recreated just fine (managed by an RC or Deployment). Is there any way to kill a pod after some threshold of restarts?
This seems like a reasonable request, though it's tricky to pick the right policy. @kubernetes/sig-node-feature-requests
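Until something like that exists in-tree, one way to approximate it is a small out-of-cluster controller that deletes pods whose containers exceed a restart threshold, so their RC/Deployment recreates them and the scheduler gets a fresh placement decision. A sketch using client-go; the all-namespaces scope, the threshold of 5, and the in-cluster config are assumptions for illustration:

```go
// restart-reaper: delete pods that have restarted too many times so their
// controller reschedules them, possibly onto a different node.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	const restartThreshold = 5 // hypothetical cutoff

	// List pods in all namespaces ("" = cluster-wide).
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.RestartCount > restartThreshold {
				// Deleting the pod lets its RC/Deployment recreate it,
				// giving the scheduler a chance to pick another node.
				if err := client.CoreV1().Pods(pod.Namespace).Delete(
					context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
					fmt.Println("delete failed:", err)
				}
				break
			}
		}
	}
}
```

Note this only helps for pods owned by a controller; a bare pod deleted this way simply disappears.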
Original thread was in #127. We previously discussed moving anomalously crashlooping pods in the rescheduler (if all pods of a controller are crashlooping, on multiple nodes, then there's no point in moving any).
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
Closing in favor of kubernetes-sigs/descheduler#62 |
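For reference, the descheduler grew a strategy for exactly this case. A hedged example policy, assuming the v1alpha1 DeschedulerPolicy API and an illustrative threshold:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100   # illustrative value
        includingInitContainers: true
```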
Over the weekend the `skydns` container in the `kube-dns` pod died. The exact reason I'm not sure of because I couldn't find much detail in the logs, but watching the etcd and skydns logs suggested the root issue could have been `etcd`. A theory I have is that the /mnt/ephemeral/kubernetes filesystem was full (it's only 3.75GB and holds a few large empty-dir volumes). It was showing 3/4 ready for kube-dns.

This caused all of my application pods across 4 minions to go down. I had to manually delete the `kube-dns` pod, and when it launched on another minion it was fine and everything came back online.

On the same token, I had 1 minion that would never consider any of my pods "ready", even though the other 3 minions did. I didn't find out why and my logs weren't helpful, so I just had to manually terminate that minion (EC2 instance) and auto-scale a new one (which happened to work fine).
For both of these cases, if k8s automatically moved the pods that constantly failed to other minions I think the cluster would've healed itself. Is the fact that failing pods always try to restart on the same minion intentional or something in the works?
I'm sorry I don't have logs to show. I'm not sure how to retrieve them from 2 days ago after so many pods have been restarted.