Canonical Distribution of Kubernetes - Cluster does not start after a restart. #357
Comments
@Rody0 Thanks for reporting this. I'm able to reproduce the problem. I did a conjure-up on localhost, followed by a reboot of the host machine. After reboot, I'm not able to kubectl:
From juju status, it looks like etcd is in an error state:
Yep, etcd is dead:
Several other services are dead, too: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy. Looks like they all died with the same error:
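(Not part of the original report, but for anyone triaging the same symptom, the checks above roughly correspond to the commands below. The snap unit names are assumptions based on the services listed; verify the exact names with `systemctl list-units 'snap.*'` on the affected machines.)

```bash
# Confirm the API server is unreachable after the reboot.
kubectl get nodes

# Check the charm-level view; etcd showed an error state here.
juju status

# Inspect the snap-packaged daemons directly on the relevant units
# (unit names are assumptions based on the services listed above).
juju ssh etcd/0 'systemctl status snap.etcd.etcd'
juju ssh kubernetes-master/0 'systemctl --failed'
juju ssh kubernetes-worker/0 'systemctl --failed'
```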
@Rody0 If you need a workaround, you can restart the failed services manually:
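(A rough sketch of that manual restart, assuming the usual CDK snap unit names; the names below are not quoted from the original comment, so double-check them with `systemctl list-units 'snap.*' --all` before running anything.)

```bash
# Restart etcd and the control-plane daemons on the master side.
juju run --application etcd 'sudo systemctl restart snap.etcd.etcd'
juju run --application kubernetes-master \
  'sudo systemctl restart snap.kube-apiserver.daemon snap.kube-controller-manager.daemon snap.kube-scheduler.daemon'

# Restart the node-level daemons on the workers.
juju run --application kubernetes-worker \
  'sudo systemctl restart snap.kubelet.daemon snap.kube-proxy.daemon'
```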
This worked for me, but is only lightly tested so your mileage may vary. You'll probably have to do this after every reboot until we get a proper fix released. Thanks again for the report and sorry for the trouble!
Rody0
commented
Jul 19, 2017
Cynerva, Restarting the failed services manually worked. Please let me know when there is a definitive fix for this. Thanks, Rody0.
@Cynerva I think the first problem we need to look at is that snap error.
I would post on forum.snapcraft.io to see if we can get any more context around this error.
tvansteenburgh
added this to the 2017.08.04 milestone
Jul 24, 2017
Probably related to https://bugs.launchpad.net/snappy/+bug/1687079
Here are the logs of a kubernetes-core deployment on lxd on a clean machine on AWS. If we could have our services start after lxd.service we should be fine. But it is not that straightforward, because for now you cannot specify dependencies and load ordering with snap daemons.
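(For illustration only: below is roughly what the desired ordering would look like as a plain systemd drop-in written by hand on an affected machine. The unit name is an assumption, snapd offers no supported way to declare this for its generated units, and a snap refresh may regenerate them, which is exactly the limitation described above.)

```bash
# Hypothetical manual override: make a snapped daemon wait for LXD at boot.
sudo mkdir -p /etc/systemd/system/snap.kubelet.daemon.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/snap.kubelet.daemon.service.d/10-after-lxd.conf
[Unit]
After=lxd.service
Wants=lxd.service
EOF
sudo systemctl daemon-reload
```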
@stgraber might have a better idea on how to handle this dependency between snapped services and lxd.service. Thank you
stgraber
commented
Aug 4, 2017
Yeah, you'd ideally want to wait for lxd.service... Alternatively you could call "lxd waitready" which will hang until LXD is functional. Not ideal but that may be an option for you.
mach-kernel
commented
Aug 5, 2017
I have this issue too and nearly lost it debugging. Would it be possible to link to this issue in the documentation for bare-metal setups?
Thank you for your input @stgraber. Your suggestion means we will have to patch all our snapped daemons to have a wrapper that would check for lxd and wait for it if present. @mach-kernel could you give us some more info on what your setup looks like? I am asking because you mention bare metal and this issue seems to be lxd related. Do you have a setup where lxd containers run inside machines you provision with MaaS, or do you see the restart issue on MaaS as well? Thanks.
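(A minimal sketch of the kind of wrapper being discussed, under the assumption that each snapped daemon's start command can be fronted by a small script; the entry-point path at the end is hypothetical.)

```bash
#!/bin/sh
# Hypothetical start wrapper for a snapped daemon: if the lxd client is
# available on this machine, block until LXD reports ready before starting.
set -e

if command -v lxd >/dev/null 2>&1; then
    # "lxd waitready" hangs until LXD is functional (per the suggestion above);
    # the timeout keeps a broken LXD from blocking the daemon forever.
    lxd waitready --timeout=300 || true
fi

# Hand over to the real daemon binary (placeholder path, not the actual one).
exec "$SNAP/bin/real-daemon" "$@"
```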
ktsakalozos
referenced this issue
Aug 7, 2017
Closed
Deployment stuck when allow-privileged is set to true #262
mach-kernel
commented
Aug 7, 2017
I run bare metal, provisioned via […]. The issue that is causing all of this reboot madness seems to be […]. NOTE: Here's my […]
I hit this once before too. In my case, the systemd process of the FREEZING containers was in Ds state, hung indefinitely waiting on IO. IIRC, the containers were trying to umount and couldn't (I don't remember why). I did solve it eventually, but I don't remember exactly how. :-( I went strace-ing for clues, and eventually umount-ing a bunch of stuff manually to get the containers unstuck. Don't expect the containers to survive.
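(For anyone chasing the same FREEZING symptom, a rough way to look for the D-state processes and leftover mounts described here; standard procps/util-linux tooling, nothing specific to this bug.)

```bash
# List processes stuck in uninterruptible sleep ("D" state) and the kernel
# function they are waiting in, which usually points at the stuck IO or umount.
ps -eo pid,stat,wchan:32,args | awk 'NR==1 || $2 ~ /^D/'

# See which container-related filesystems are still mounted on the host.
grep -E 'lxd|lxc' /proc/mounts
```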
mach-kernel
commented
Aug 7, 2017
@tvansteenburgh The strange thing is, I've had this work across reboots without any issues. It's the first thing I do after setting everything up. I am able to bring everything back by just restarting the containers, but I have no clue why they are frozen. Thank you for the […]
Yeah I don't think the FREEZING is directly related to the reboots anyway, b/c when I hit it, I hadn't rebooted at all. Sorry I can't recall more details about how I fixed it -- @stgraber might have better advice on how to debug/recover the FREEZING containers.
This problem happened previously because of some of the bigdata charms enabling swap within the container.
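(A quick way to rule that out from inside an affected container, using standard util-linux tools; nothing CDK-specific.)

```bash
# Any output from swapon means swap is active inside the container,
# which has triggered this kind of freeze before per the comment above.
swapon --show
free -m
```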
mach-kernel
commented
Aug 7, 2017
Doing a […]
How else can I be helpful?
EDIT: Something happened shortly after this where the box timed out on me and I had to manually kill my SSH session; it seems to have hung up / become unresponsive.
EDIT 2: Re the above, I tunneled into my home subnet to restart that box -- it was still up and responding to ping but would not respond on KVM/HW tty, so I had to do a hard reset. I'm otherwise happily up and running at this point. I'd love to help contribute a troubleshooting section for your docs if there's a place to do that?
tirithen
commented
Aug 8, 2017
It would be amazing to solve this. As a beginner looking into conjure-up/juju, everything looks fantastic: just run one command to install a Kubernetes solution. In reality I have probably spent about a week trying to debug why Kubernetes does not come up again after a reboot, and I still don't understand everything. It gets messy trying to re-install one part at a time, and for someone new to snap/juju/conjure-up (even with lots of general Linux server experience) it's really hard to understand what goes wrong and why there is no information about this at https://conjure-up.io/ . On that site everything is a sunshine scenario that will not work on a fresh Ubuntu Server 16.04.3; there seems to be a lot of hidden knowledge about prerequisites that has not been documented yet. It would be wonderful to see more detailed documentation about all the steps that are needed to run on bare metal (I have not tried the cloud scenarios yet).
battlemidget
modified the milestones: 2017.08.18, 2017.08.04
Aug 8, 2017
tvansteenburgh
modified the milestones: 2017.09.01, 2017.08.18
Aug 24, 2017
lghinet
commented
Sep 7, 2017
Same problem here: juju run --application kubernetes-worker 'service snap.kubelet.daemon restart' - not working.
lghinet
commented
Sep 7, 2017
Related to […]
This was referenced Sep 13, 2017
ktsakalozos
closed this in juju-solutions/layer-etcd#108
Sep 14, 2017
added a commit to kubernetes/kubernetes that referenced this issue
Sep 23, 2017
pushed a commit to marun/federation that referenced this issue
Oct 13, 2017
This was referenced Oct 26, 2017
evgkarasev
commented
Oct 27, 2017
The workaround is to run "zpool import default" after reboot; then the CDK cluster starts.
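(For reference, that workaround assumes LXD's storage is backed by a ZFS pool named "default"; a sketch of checking and applying it after a reboot, with the pool name taken from the comment above.)

```bash
# If the pool is missing from this list after a reboot, the LXD containers
# backing the cluster cannot start.
sudo zpool list

# Import the pool named "default"; adjust if your LXD storage uses another name.
sudo zpool import default
```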
Totally understand your frustration, and we are working as hard as we can to get more documentation up on conjure-up.io. Our docs live here: https://github.com/canonical-docs/conjure-up-docs and we could really use the help to speed up this process.
Rody0 commented Jul 19, 2017
Hello,
I installed it on a physical machine (localhost) using the instructions in this link:
https://github.com/juju-solutions/bundle-canonical-kubernetes/tree/master/fragments/k8s/cdk
The first time I tried the canonical-kubernetes option in conjure-up; the second time I used kubernetes-core to make it simpler, but I still get similar results. The cluster works after conjure-up finishes installing, but not after a restart.
After a server restart I am getting this message:
~$ sudo kubectl cluster-info dump
The connection to the server 10.217.171.11:6443 was refused - did you specify the right host or port?
I know this file exists, which has the cluster configuration:
~/.kube/config
Same content as ~/.kube/config.conjure-kubernetes-core-ab1
Any idea?
Thanks,
Rody0.