Broken minion strategy #1923

Closed
sevki opened this issue Oct 21, 2014 · 11 comments
Labels
area/nodecontroller
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@sevki

sevki commented Oct 21, 2014

So I have this minion that is not working properly. This is what an average minion looks like:

[screenshot: Google Developers Console graph of a healthy minion]

and this is what the broken one looks like; during the peak in the graph, all of its pods were "waiting":

[screenshot: Google Developers Console graph of the broken minion]

I was looking through the documentation and couldn't find anything on this. When something like this happens, a misconfigured machine or maybe it's haunted, what should one do?

  • Take that machine out of the minion pool?
  • Redeploy that machine?
  • Read the logs and try to fix the problem by SSHing into that one machine?
  • Clone a working machine to replace it?
@brendandburns
Contributor

The first thing to do is take it out of the worker pool. Rule 1 is to stop the bleeding. Pods controlled by replication controllers will move onto other workers.

Then, it becomes a question of how much you want to investigate.

Option 1) No investigation.
In this case, just re-image the machine (or, in the cloud VM case, destroy it and re-create it). Re-add it to the worker pool and it will start to get more work.

Option 2) Investigate why things went south. You probably want to partition the machine off from the network somehow, so that it doesn't keep trying to communicate with working pods on other workers: SSH into the machine and drop the cbr0 bridge, or something like that. Then look at the logs, poke at things, etc. Once you are done investigating, you'll want to just do option 1 and re-image/shoot the machine.

Since your pods are mobile and have already moved, and the machine is homogeneous with other machines, there is little to be gained by trying to repair it. Shoot it and move on.
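
For anyone following this recipe with current tooling, here is a minimal sketch of the "stop the bleeding" step using modern kubectl (which did not exist when this thread was written); the node name below is a hypothetical placeholder, not something from this discussion:

```bash
# Sketch only, assuming modern kubectl; "node-1" is a hypothetical node name.

# Stop the scheduler from placing new pods on the suspect node.
kubectl cordon node-1

# Evict its pods; replication-controlled workloads are recreated on healthy nodes.
kubectl drain node-1 --ignore-daemonsets

# Option 1 (no investigation): remove the node, re-image or recreate the VM,
# and let it re-register once the kubelet comes back up.
kubectl delete node node-1

# Option 2 (investigate first): instead of deleting, isolate the node, e.g.
# take the container bridge down before poking around over SSH:
#   ip link set cbr0 down
```

Cordon/drain follows exactly the ordering described above: move the workloads first, do forensics (if any) second.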

@brendandburns
Contributor

(Of course, as a Kubernetes developer I'm curious about what went wrong. Did you happen to catch a 'top' while the CPU was pegged at 100%?)

@sevki
Author

sevki commented Oct 21, 2014

Hey @brendandburns, I still have the machine up and am going to kill it soon, but if you want I can shoot over any logs you might want to have a look at. From what I can see there is really nothing in the GCE console logs. It seems to think the disk is full for some reason (guessing it's due to a lack of permissions), and there is no way that can be true. Other than that, just a bunch of "Connection reset by peer" type stuff from SSHD, so nothing really out of the ordinary in the GCE logs. I didn't dive into any other logs.

@brendandburns
Contributor

Cool, if you wanted to attach /var/log/kubelet.log and /var/log/kube-proxy.log it would be useful.

Then feel free to tear it down.

Thanks!
--brendan
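
(For anyone gathering the same data: pulling those files off the minion is just a copy over SSH; the user and host below are placeholders, not names from this thread.)

```bash
# Placeholder user/host; copy the requested logs off the node for attachment.
scp user@minion-1:/var/log/kubelet.log .
scp user@minion-1:/var/log/kube-proxy.log .
```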


@ddysher
Contributor

ddysher commented Oct 21, 2014

At some point, I hope the minion controller will be able to catch such a condition, move the minion out of the worker pool, and re-image the machine for you (maybe with some configurable policy, like the options you listed). However, I'm also wondering what's happening in your case. Can the master health-check the minion? (Can you also attach /var/log/apiserver.log, to see if there are any error logs from healthy_registry?) My guess is yes, so the follow-up question is how Kubernetes knows a minion is not working properly.
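
(As a rough illustration of what that master-side health check exposes with modern kubectl; "node-1" is a hypothetical name:)

```bash
# Sketch, assuming modern kubectl: the node controller publishes its health
# checks as node conditions, which can be read directly.
kubectl get nodes
kubectl describe node node-1   # inspect the Conditions section (Ready, etc.)
```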

@sevki
Author

sevki commented Oct 21, 2014

@brendandburns emailed them. Hope it helps.

@dchen1107
Member

Please also send over the output of 'docker ps -a' if docker can still run. I bet you hit something like #1823.
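
(A minimal set of commands for collecting that kind of Docker-side state; these are stock Docker CLI commands, nothing specific to this issue:)

```bash
# Basic Docker diagnostics worth attaching to a node bug report.
docker ps -a     # all containers, including exited ones
docker info      # daemon, storage/graphdriver and kernel details
docker version
```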

@dchen1107 added the kind/bug label on Oct 21, 2014
@sevki
Author

sevki commented Oct 21, 2014

Hah, I got:

2014/10/21 16:04:23 Cannot connect to the Docker daemon. Is 'docker -d' running on this host?

and:

2014/10/21 16:04:43 docker daemon: 1.2.0 fa7b24f; execdriver: native; graphdriver:

[93abb000] +job serveapi(unix:///var/run/docker.sock)

[info] Listening for HTTP on unix (/var/run/docker.sock)

2014/10/21 16:04:43 Couldn't create Tag store: unexpected end of JSON input
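
(For what it's worth, that last error generally indicates the daemon's on-disk tag store is corrupt. A hedged recovery sketch for Docker 1.x of that era follows; the exact file name depends on the graphdriver in use, and moving it away discards local image tag metadata.)

```bash
# Hedged sketch for Docker 1.x: let the daemon rebuild a corrupt tag store.
# The file name varies with the graphdriver (repositories-aufs,
# repositories-devicemapper, ...); moving it loses local image tags.
sudo service docker stop
sudo mv /var/lib/docker/repositories-aufs /var/lib/docker/repositories-aufs.bak
sudo service docker start
```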


@dchen1107
Member

@sevki, thanks for the output. I didn't see /var/log/kubelet.log attached. I think you hit the same issue that caused #1823, but the root cause might vary. Can you recall what you had done before you ran into this loop on that minion? I did reproduce the same issue once before.

@dchen1107 added the priority/important-soon label on Oct 21, 2014
@sevki
Author

sevki commented Oct 21, 2014

Nothing in particular, actually.


@lavalamp
Member

lavalamp commented Jan 7, 2015

I know we have made some improvements re: detecting when docker doesn't start up properly. I'm going to close this due to age and lack of a clear action item; please reopen or open a new issue if you don't think that's OK.

@lavalamp closed this as completed on Jan 7, 2015