Broken minion strategy #1923

So I have this minion that is not working properly. This is what the average minion looks like:

[graph: resource usage on a healthy minion]

and this is what the broken one looks like; during the peak in the graph, all its pods were "waiting":

[graph: resource usage on the broken minion, showing the spike]

I was looking through the documentation and I couldn't find anything on this. When something like this happens (a misconfigured machine, or maybe it's haunted), what should one do?

Comments
The first thing to do is take it out of the worker pool. Rule 1 is stop the bleeding; pods controlled by replication controllers will move onto other workers. Then it becomes a question of how much you want to investigate.

Option 1: No investigation.

Option 2: Investigate why things went south. You probably want to partition the machine off from the network somehow, so that it doesn't keep trying to communicate with working pods on other workers: SSH into the machine and drop the cbr0 bridge, or something like that. Then look at logs, poke at things, etc. Once you are done investigating, though, you'll want to fall back to option 1 and re-image/shoot the machine anyway.

Since your pods are mobile and have already moved, and the machine is homogeneous with the other machines, there is little to be gained by trying to repair it. Shoot it and move on. (A sketch of the partitioning step in option 2 follows.)
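A minimal sketch of that partitioning step, assuming a stock GCE minion where the pods hang off the cbr0 bridge; the kubelet service name and the iptables step are assumptions, so adapt them to your setup:

```bash
# Run over SSH (or the GCE serial console) on the broken minion.
# Goal: freeze the machine for inspection while keeping SSH alive.

# Stop the kubelet so it doesn't keep restarting containers while
# you poke around (service name is an assumption for this era).
sudo service kubelet stop

# Drop the container bridge the pods are attached to, cutting
# pod traffic without touching the host's primary interface.
sudo ip link set cbr0 down

# Belt and suspenders: refuse to forward any remaining pod traffic.
sudo iptables -P FORWARD DROP
```

SSH stays reachable because only the bridge and forwarded traffic are touched; once the investigation is done, re-imaging (option 1) still applies.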
(Of course, as a Kubernetes developer I'm curious about what went wrong. Did you happen to catch a 'top' while you were pegging the CPU at 100%?)
Hey @brendandburns, I still have the machine up but am going to kill it soon; if you want, I can shoot over any logs you might want to have a look at. From what I can see there is really nothing in the GCE console logs. It seems to think the disk is full for some reason (guessing that's due to a lack of permissions); there is no way that can be true. Other than that there's a bunch of "Connection reset by peer" type stuff from SSHD, so nothing really out of the ordinary in the GCE logs. I didn't dive into any other logs.
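One quick sanity check for a mysterious "disk is full" report: a filesystem can report full even with free bytes left if it has run out of inodes, and the two cases are easy to tell apart with standard coreutils:

```bash
# Free space, human readable.
df -h /

# Inode usage; IUse% at 100% makes writes fail with
# "no space left on device" even when df -h shows free space.
df -i /
```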
Cool. If you want, attach /var/log/kubelet.log, and then feel free to tear it down. Thanks!
At some point, I hope the minion controller will be able to catch such a condition, move the minion out of the worker pool, and re-image the machine for you (perhaps with some configurable policy, like the options you listed). However, I'm also wondering what's happening in your case. Can the master health-check the minion? (Can you also attach /var/log/apiserver.log, to see if there are any error logs from healthy_registry?) My guess is yes, so the follow-up question is how kubernetes knows that a minion is not working properly. (A sketch of checking this by hand is below.)
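A minimal sketch of checking the master's view of minion health by hand, assuming the v1beta1 API of this era, an insecure apiserver port of 8080, and the default kubelet port of 10250 (all three are assumptions about this particular setup):

```bash
# What the master thinks the minions are:
curl -s http://MASTER_IP:8080/api/v1beta1/minions

# The apiserver's own health endpoint:
curl -s http://MASTER_IP:8080/healthz

# The kubelet's health endpoint on the suspect minion:
curl -s http://MINION_IP:10250/healthz
```

If the apiserver still reports the minion while the kubelet endpoint is unreachable or Docker is down, that gap is exactly the "how does kubernetes know" question above.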
@brendandburns emailed them. Hope it helps.
Please also send over the output of 'docker ps -a', if docker can still run. I bet you hit something like #1823.
Hah, got:

```
2014/10/21 16:04:23 Cannot connect to the Docker daemon. Is 'docker -d' running on this host?
```

and:

```
2014/10/21 16:04:43 docker daemon: 1.2.0 fa7b24f; execdriver: native; graphdriver:
[93abb000] +job serveapi(unix:///var/run/docker.sock)
[info] Listening for HTTP on unix (/var/run/docker.sock)
2014/10/21 16:04:43 Couldn't create Tag store: unexpected end of JSON input
```
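That "Couldn't create Tag store: unexpected end of JSON input" means the daemon's tag-store file is truncated JSON, e.g. from a crash mid-write. A minimal sketch of confirming and clearing it, assuming Docker 1.2's layout where the tag store lives at /var/lib/docker/repositories-GRAPHDRIVER (that path, and the aufs driver below, are assumptions):

```bash
# Locate the tag store for whatever graph driver is in use.
ls /var/lib/docker/repositories-*

# Truncated JSON will fail to parse.
python -m json.tool /var/lib/docker/repositories-aufs

# Moving the corrupt file aside lets the daemon start again, at the
# cost of local image tags (the image layers themselves are kept).
sudo mv /var/lib/docker/repositories-aufs /var/lib/docker/repositories-aufs.bak
sudo docker -d
```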
@sevki, thanks for the output. I didn't see /var/log/kubelet.log attached. I think you hit the same symptom as #1823, but the cause might vary. Can you recall what you had done before you ran into this loop on that minion? I did reproduce the same issue once before.
Nothing in particular, actually.
I know we have made some improvements re: detecting when docker doesn't start up properly. I'm going to close this due to age and the lack of a clear action item. Please reopen, or open new issues, if you don't think that's OK.