Broken minion strategy #1923

Closed
sevki opened this issue Oct 21, 2014 · 11 comments
Labels
area/nodecontroller
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@sevki

sevki commented Oct 21, 2014

So I have this minion that is not working properly. This is what an average minion looks like:

[screenshot: Google Developers Console graph of a healthy minion]

and this is what the broken one looks like; during the peak in the graph, all of its pods were "waiting":

[screenshot: Google Developers Console graph of the broken minion]

I was looking through the documentation and couldn't find anything on this. When something like this happens, a misconfigured machine or maybe it's haunted, what should one do?

  • Take that machine out of the minion pool?
  • Redeploy that machine?
  • Read the logs and try to fix the problem by SSHing into that one machine?
  • Clone a working machine to replace it?
@brendandburns
Contributor

The first thing to do is take it out of the worker pool. Rule 1 is to stop the bleeding. Pods controlled by replication controllers will move onto other workers.

Then, it becomes a question of how much you want to investigate.

Option 1) No investigation.
In this case, just re-image the machine (or, in the cloud VM case, destroy it and re-create it). Re-add it to the worker pool and it will start to get more work.

Option 2) Investigate why things went south. You probably want to partition the machine off from the network somehow, so that it doesn't keep trying to communicate with working pods on other workers: SSH into the machine and drop the cbr0 bridge, or something like that. Then look at the logs, poke at things, etc. Once you are done investigating, you'll want to just do option 1 and re-image/shoot the machine.

Since your pods are mobile and have already moved, and the machine is homogeneous with other machines, there is little to be gained by trying to repair it. Shoot it and move on.
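
For anyone following this recipe with current tooling, here is a minimal sketch of the "stop the bleeding" step using modern kubectl (which did not exist when this thread was written); the node name below is a hypothetical placeholder, not something from this discussion:

```bash
# Sketch only, assuming modern kubectl; "node-1" is a hypothetical node name.

# Stop the scheduler from placing new pods on the suspect node.
kubectl cordon node-1

# Evict its pods; replication-controlled workloads are recreated on healthy nodes.
kubectl drain node-1 --ignore-daemonsets

# Option 1 (no investigation): remove the node, re-image or recreate the VM,
# and let it re-register once the kubelet comes back up.
kubectl delete node node-1

# Option 2 (investigate first): instead of deleting, isolate the node, e.g.
# take the container bridge down before poking around over SSH:
#   ip link set cbr0 down
```

Cordon/drain follows exactly the ordering described above: move the workloads first, do forensics (if any) second.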

@brendandburns
Contributor

(Of course, as a Kubernetes developer I'm curious about what went wrong. Did you happen to catch a 'top' while the CPU was pegged at 100%?)

@sevki
Author

sevki commented Oct 21, 2014

Hey @brendandburns, I still have the machine up and am going to kill it soon, but if you want I can shoot over any logs you might want to have a look at. From what I can see there is really nothing in the GCE console logs. It seems to think the disk is full for some reason (guessing it's due to a lack of permissions), and there is no way that can be true. Other than that, just a bunch of "Connection reset by peer" type stuff from SSHD, so nothing really out of the ordinary in the GCE logs. I didn't dive into any other logs.

@brendandburns
Contributor

Cool, if you wanted to attach /var/log/kubelet.log and /var/log/kube-proxy.log it would be useful.

Then feel free to tear it down.

Thanks!
--brendan
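
(For anyone gathering the same data: pulling those files off the minion is just a copy over SSH; the user and host below are placeholders, not names from this thread.)

```bash
# Placeholder user/host; copy the requested logs off the node for attachment.
scp user@minion-1:/var/log/kubelet.log .
scp user@minion-1:/var/log/kube-proxy.log .
```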


@ddysher
Contributor

ddysher commented Oct 21, 2014

At some point, I hope the minion controller will be able to catch such a condition, move the minion out of the worker pool, and re-image the machine for you (maybe with some configurable policy, like the options you listed). However, I'm also wondering what's happening in your case. Can the master health-check the minion? (Can you also attach /var/log/apiserver.log, to see if there are any error logs from healthy_registry?) My guess is yes, so the follow-up question is how Kubernetes knows a minion is not working properly.
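
(As a rough illustration of what that master-side health check exposes with modern kubectl; "node-1" is a hypothetical name:)

```bash
# Sketch, assuming modern kubectl: the node controller publishes its health
# checks as node conditions, which can be read directly.
kubectl get nodes
kubectl describe node node-1   # inspect the Conditions section (Ready, etc.)
```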

@sevki
Author

sevki commented Oct 21, 2014

@brendandburns emailed them. Hope it helps.

@dchen1107
Member

Please also send over the output of 'docker ps -a' if docker can still run. I bet you hit something like #1823.
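
(A minimal set of commands for collecting that kind of Docker-side state; these are stock Docker CLI commands, nothing specific to this issue:)

```bash
# Basic Docker diagnostics worth attaching to a node bug report.
docker ps -a     # all containers, including exited ones
docker info      # daemon, storage/graphdriver and kernel details
docker version
```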

@dchen1107 added the kind/bug label on Oct 21, 2014
@sevki
Author

sevki commented Oct 21, 2014

Hah, I got:

2014/10/21 16:04:23 Cannot connect to the Docker daemon. Is 'docker -d' running on this host?

and:

2014/10/21 16:04:43 docker daemon: 1.2.0 fa7b24f; execdriver: native; graphdriver:

[93abb000] +job serveapi(unix:///var/run/docker.sock)

[info] Listening for HTTP on unix (/var/run/docker.sock)

2014/10/21 16:04:43 Couldn't create Tag store: unexpected end of JSON input
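
(For what it's worth, that last error generally indicates the daemon's on-disk tag store is corrupt. A hedged recovery sketch for Docker 1.x of that era follows; the exact file name depends on the graphdriver in use, and moving it away discards local image tag metadata.)

```bash
# Hedged sketch for Docker 1.x: let the daemon rebuild a corrupt tag store.
# The file name varies with the graphdriver (repositories-aufs,
# repositories-devicemapper, ...); moving it loses local image tags.
sudo service docker stop
sudo mv /var/lib/docker/repositories-aufs /var/lib/docker/repositories-aufs.bak
sudo service docker start
```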


@dchen1107
Member

@sevki, thanks for the output. I didn't see /var/log/kubelet.log attached. I think you hit the same issue that caused #1823, but the root cause might vary. Can you recall what you had done before you ran into this loop on that minion? I did reproduce the same issue once before.

@dchen1107 added the priority/important-soon label on Oct 21, 2014
@sevki
Author

sevki commented Oct 21, 2014

Nothing in particular, actually.


@lavalamp
Member

lavalamp commented Jan 7, 2015

I know we have made some improvements re: detecting when docker doesn't start up properly. I'm going to close this due to age and lack of a clear action item; please reopen or open a new issue if you don't think that's OK.

@lavalamp closed this as completed on Jan 7, 2015