Canonical Distribution of Kubernetes - Cluster does not start after a restart. #357

Closed
Rody0 opened this Issue Jul 19, 2017 · 24 comments

Rody0 commented Jul 19, 2017

Hello,
I installed it on a physical machine (localhost) using the instructions in this link:
https://github.com/juju-solutions/bundle-canonical-kubernetes/tree/master/fragments/k8s/cdk
The first time I tried the canonical-kubernetes option in conjure-up; the second time I used kubernetes-core to keep it simpler, but I still get similar results. The cluster works after conjure-up finishes installing, but not after a restart.
After a server restart I am getting this message:
~$ sudo kubectl cluster-info dump
The connection to the server 10.217.171.11:6443 was refused - did you specify the right host or port?
I know this file exists, which has the cluster configuration:
~/.kube/config
Its content is the same as ~/.kube/config.conjure-kubernetes-core-ab1.

Any idea?

Thanks,

Rody0.

Contributor

Cynerva commented Jul 19, 2017

@Rody0 Thanks for reporting this. I'm able to reproduce the problem.

I did a conjure-up on localhost, followed by a reboot of the host machine. After the reboot, I'm not able to use kubectl:

$ kubectl cluster-info dump
The connection to the server 10.246.170.85:6443 was refused - did you specify the right host or port?

From juju status, it looks like etcd is in an error state:

$ juju status
Model                        Controller                Cloud/Region         Version  SLA
conjure-kubernetes-core-7c7  conjure-up-localhost-614  localhost/localhost  2.2.1    unsupported

App                Version  Status   Scale  Charm              Store       Rev  OS      Notes
easyrsa            3.0.1    active       1  easyrsa            jujucharms   12  ubuntu  
etcd               2.3.8    active       1  etcd               jujucharms   40  ubuntu  
flannel            0.7.0    waiting      2  flannel            jujucharms   20  ubuntu  
kubernetes-master  1.7.0    waiting      1  kubernetes-master  jujucharms   35  ubuntu  exposed
kubernetes-worker  1.7.0    waiting      1  kubernetes-worker  jujucharms   40  ubuntu  exposed

Unit                  Workload  Agent      Machine  Public address  Ports           Message
easyrsa/0*            active    idle       0        10.246.170.87                   Certificate Authority connected.
etcd/0*               active    idle       1        10.246.170.98   2379/tcp        Errored with 0 known peers
kubernetes-master/0*  waiting   executing  2        10.246.170.85   6443/tcp        (update-status) Waiting to retry addon deployment
  flannel/0           waiting   idle                10.246.170.85                   Waiting for Flannel
kubernetes-worker/0*  waiting   idle       3        10.246.170.90   80/tcp,443/tcp  Waiting for kubelet to start.
  flannel/1*          waiting   idle                10.246.170.90                   Waiting for Flannel

Machine  State    DNS            Inst id        Series  AZ  Message
0        started  10.246.170.87  juju-e1edbe-0  xenial      Running
1        started  10.246.170.98  juju-e1edbe-1  xenial      Running
2        started  10.246.170.85  juju-e1edbe-2  xenial      Running
3        started  10.246.170.90  juju-e1edbe-3  xenial      Running

Relation      Provides           Consumes           Type
certificates  easyrsa            etcd               regular
certificates  easyrsa            kubernetes-master  regular
certificates  easyrsa            kubernetes-worker  regular
cluster       etcd               etcd               peer
etcd          etcd               flannel            regular
etcd          etcd               kubernetes-master  regular
cni           flannel            kubernetes-master  regular
cni           flannel            kubernetes-worker  regular
cni           kubernetes-master  flannel            subordinate
kube-control  kubernetes-master  kubernetes-worker  regular
cni           kubernetes-worker  flannel            subordinate

Yep, etcd is dead:

$ juju run --unit etcd/0 'systemctl status snap.etcd.etcd'
● snap.etcd.etcd.service - Service for snap application etcd.etcd
   Loaded: loaded (/etc/systemd/system/snap.etcd.etcd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) (Result: exit-code) since Wed 2017-07-19 22:28:01 UTC; 7min ago
  Process: 1280 ExecStart=/usr/bin/snap run etcd (code=exited, status=1/FAILURE)
 Main PID: 1280 (code=exited, status=1/FAILURE)

Jul 19 22:28:01 juju-e1edbe-1 snap[1280]: cannot change profile for the next exec call: No such file or directory
Jul 19 22:28:01 juju-e1edbe-1 systemd[1]: snap.etcd.etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 19 22:28:01 juju-e1edbe-1 systemd[1]: snap.etcd.etcd.service: Unit entered failed state.
Jul 19 22:28:01 juju-e1edbe-1 systemd[1]: snap.etcd.etcd.service: Failed with result 'exit-code'.
Jul 19 22:28:01 juju-e1edbe-1 systemd[1]: snap.etcd.etcd.service: Service hold-off time over, scheduling restart.
Jul 19 22:28:01 juju-e1edbe-1 systemd[1]: Stopped Service for snap application etcd.etcd.
Jul 19 22:28:01 juju-e1edbe-1 systemd[1]: snap.etcd.etcd.service: Start request repeated too quickly.
Jul 19 22:28:01 juju-e1edbe-1 systemd[1]: Failed to start Service for snap application etcd.etcd.

Several other services are dead, too: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy. Looks like they all died with the same error:

cannot change profile for the next exec call: No such file or directory
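
A quick way to confirm which of those snap services are down after a reboot (a sketch only; the application and service names are the ones appearing in this thread, so adjust to your deployment):

# Check the state of each snap service across the deployed applications.
juju run --application etcd 'systemctl is-active snap.etcd.etcd'
juju run --application kubernetes-master 'systemctl is-active snap.kube-apiserver.daemon snap.kube-controller-manager.daemon snap.kube-scheduler.daemon'
juju run --application kubernetes-worker 'systemctl is-active snap.kubelet.daemon snap.kube-proxy.daemon'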
Contributor

Cynerva commented Jul 19, 2017

@Rody0 If you need a workaround, you can restart the failed services manually:

juju run --application etcd 'service snap.etcd.etcd restart'
juju run --application kubernetes-master 'service snap.kube-apiserver.daemon restart'
juju run --application kubernetes-master 'service snap.kube-controller-manager.daemon restart'
juju run --application kubernetes-master 'service snap.kube-scheduler.daemon restart'
juju run --application kubernetes-worker 'service snap.kubelet.daemon restart'
juju run --application kubernetes-worker 'service snap.kube-proxy.daemon restart'

This worked for me, but it's only lightly tested, so your mileage may vary. You'll probably have to do this after every reboot until we get a proper fix released. Thanks again for the report and sorry for the trouble!
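
If it helps, here is the same workaround collected into a single script (a sketch based only on the commands above; run it after each reboot until a proper fix lands in the charms):

#!/bin/bash
# restart-cdk-services.sh -- restart the snap services that fail to come back
# up after a host reboot, using the juju run commands from the workaround above.
set -e
juju run --application etcd 'service snap.etcd.etcd restart'
for svc in kube-apiserver kube-controller-manager kube-scheduler; do
    juju run --application kubernetes-master "service snap.${svc}.daemon restart"
done
for svc in kubelet kube-proxy; do
    juju run --application kubernetes-worker "service snap.${svc}.daemon restart"
done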

Rody0 commented Jul 19, 2017

Cynerva,

Restarting the failed services manually worked. Please let me know when there is a definitive fix for this.

Thanks,

Rody0.

Member

tvansteenburgh commented Jul 20, 2017

@Cynerva @stokachu @johnsca Do you think this is a conjure-up bundled lxd problem, or a problem in lxd itself?

Contributor

battlemidget commented Jul 20, 2017

@Cynerva I think the first problem we need to look at is that snap error

Jul 19 22:28:01 juju-e1edbe-1 snap[1280]: cannot change profile for the next exec call: No such file or directory

I would post on forum.snapcraft.io to see if we can get any more context around this error

@tvansteenburgh tvansteenburgh added this to the 2017.08.04 milestone Jul 24, 2017

Member

ktsakalozos commented Jul 28, 2017

Here are the logs of a kubernetes-core deployment on LXD on a clean machine on AWS:
results-2017-07-28-11-31-41.tar.gz

If we could have our services start after lxd.service, we should be fine. But it is not that straightforward, because for now you cannot specify dependencies and load ordering for snap daemons.
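
For illustration only: if these were plain systemd units, the desired ordering could be declared with a drop-in along the lines of the sketch below (assuming a machine where both the snapped service and lxd.service exist). Snap daemons currently offer no supported way to declare this, and snap-generated unit files can be rewritten on refresh, so this is not a real fix, just a picture of the missing feature.

# Illustrative sketch: declare "start etcd after LXD" via a systemd drop-in.
sudo mkdir -p /etc/systemd/system/snap.etcd.etcd.service.d
sudo tee /etc/systemd/system/snap.etcd.etcd.service.d/10-after-lxd.conf <<'EOF'
[Unit]
After=lxd.service
Wants=lxd.service
EOF
sudo systemctl daemon-reload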

Member

ktsakalozos commented Aug 4, 2017

@stgraber might have a better idea on how to handle this dependency between snapped services and lxd.service. Thank you

stgraber commented Aug 4, 2017

Yeah, you'd ideally want to wait for lxd.service... Alternatively, you could call "lxd waitready", which will hang until LXD is functional. Not ideal, but that may be an option for you.

I have this issue too and nearly lost it debugging. Would it be possible to link to this issue in the documentation for bare-metal setups?

Member

ktsakalozos commented Aug 7, 2017

Thank you for your input, @stgraber. Your suggestion means we will have to patch all our snapped daemons with a wrapper that checks for lxd and waits for it if present.
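
A minimal sketch of what such a wrapper could look like (hypothetical, not the charms' actual implementation):

#!/bin/bash
# Hypothetical start wrapper for a snapped daemon: if LXD is present on the
# machine, block until it reports ready before exec'ing the real daemon.
if command -v lxd >/dev/null 2>&1; then
    lxd waitready    # hangs until LXD is functional, as suggested above
fi
exec "$@"    # "$@" is the daemon's normal command line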

@mach-kernel could you give us some more info on what your setup looks like? I am asking because you mention bare metal and this issue seems to be lxd-related. Do you have a setup where lxd containers run inside machines you provision with MAAS, or do you see the restart issue on MAAS as well?

Thanks.

mach-kernel commented Aug 7, 2017

I run bare metal, provisioned via conjure-up and the snap-installed LXD. The host machine is an ESXi host with a standard Linux configuration, nothing crazy other than the intel_rapl driver being noisy in dmesg.

The issue causing all of this reboot madness does seem to be lxd-related: my containers are getting stuck in FREEZING state. juju status|models|etc. hangs unless you invoke juju controllers, presumably because the latter command doesn't need to fetch anything? I can't manage to kill the container by getting its PID from lxc, so I don't know what to do next. I tried all of the force flags, but none of them work. Restarting the physical host still shows those containers in the FREEZING state. The snap-bundled LXD also doesn't seem to have the lxc-freeze|unfreeze command-line tools that a lot of documentation online mentions.

NOTE: lxc is invoked as conjure-up.lxc, etc. to reach the provisioned/bundled distribution.

Here's my lxc list output:

○ → conjure-up.lxc list
+---------------+----------+------+------+------------+-----------+
|     NAME      |  STATE   | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+---------------+----------+------+------+------------+-----------+
| juju-73dc62-0 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-0 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-1 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-2 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-3 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-4 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-5 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-6 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-7 | FREEZING |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-8 | FREEZING |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
| juju-a0edf7-9 | STOPPED  |      |      | PERSISTENT | 0         |
+---------------+----------+------+------+------------+-----------+
Member

tvansteenburgh commented Aug 7, 2017

I hit this once before too. In my case, the systemd process of the FREEZING containers was in Ds state (uninterruptible sleep), hung indefinitely waiting on IO. IIRC, the containers were trying to umount something and couldn't (I don't remember why). I did solve it eventually, but I don't remember exactly how. :-( I went strace-ing for clues and eventually umount-ed a bunch of stuff manually to get the containers unstuck. Don't expect the containers to survive.
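
For anyone debugging the same thing, a quick sketch for spotting processes stuck in uninterruptible sleep (D state), which is where the frozen containers' processes ended up:

# List D-state processes along with the kernel function they are blocked in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'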

@tvansteenburgh The strange thing is, I've had this work across reboots without any issues -- it's the first thing I do after setting everything up. I am able to bring everything back by just restarting the containers, but I have no clue why they are frozen. Thank you for the strace tip -- I think I have no other options.

Member

tvansteenburgh commented Aug 7, 2017

Yeah, I don't think the FREEZING is directly related to the reboots anyway, because when I hit it I hadn't rebooted at all. Sorry I can't recall more details about how I fixed it -- @stgraber might have better advice on how to debug/recover the FREEZING containers.

Contributor

battlemidget commented Aug 7, 2017

This problem happened previously because of some of the bigdata charms enabling swap within the container.

mach-kernel commented Aug 7, 2017

ps -ef | grep lxc

root       872     1  0 Aug06 ?        00:00:00 /usr/bin/lxcfs /var/lib/lxcfs/
root       958     1  0 Aug06 ?        00:00:02 /snap/conjure-up/549/bin/lxcfs /var/snap/conjure-up/common/var/lib/lxcfs -p /var/snap/conjure-up/common/lxcfs.pid
root      2153     1  0 Aug06 ?        00:00:03 [lxc monitor] /var/snap/conjure-up/common/lxd/containers juju-a0edf7-7
root      2596     1  0 Aug06 ?        00:00:01 [lxc monitor] /var/snap/conjure-up/common/lxd/containers juju-a0edf7-8
pursuit   8892 32111  0 14:25 pts/0    00:00:00 grep --color=auto lxc
root     11482     1  0 Aug06 ?        00:00:00 /snap/conjure-up/current/libexec/lxc/lxc-monitord /var/snap/conjure-up/common/lxd/containers 29
root     12293 12192  0 Aug06 pts/1    00:00:00 sudo conjure-up.lxc stop juju-a0edf7-7 --force
root     12294 12293  0 Aug06 pts/1    00:00:00 /snap/conjure-up/549/bin/lxc stop juju-a0edf7-7 --force

Doing a kill -9 on the PIDs for the monitors has fixed the issue. The containers are unfrozen.
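
Condensed into commands (a sketch based on the ps output above; the container names are specific to this deployment):

# Find the stuck "[lxc monitor]" processes and kill them to unfreeze the containers.
pgrep -af 'lxc monitor'
sudo pkill -9 -f 'lxc monitor.*juju-a0edf7-7'
sudo pkill -9 -f 'lxc monitor.*juju-a0edf7-8'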

strace for a few seconds yields this loop. This was captured before I killed the PIDs.

How else can I be helpful?

EDIT: Something happened shortly after this where the box timed out on me and I had to manually kill my SSH session; it seems to have hung up / become unresponsive.

EDIT 2: Regarding the above, I tunneled into my home subnet to restart that box -- it was still up and responding to ping but would not respond on the KVM/HW tty, so I had to do a hard reset. /var/log/syslog doesn't seem to indicate a panic, which makes sense considering it responded to ICMP, but otherwise the machine was unusable. Any suggestions for logs?

I'm otherwise happily up and running at this point. I'd love to contribute a troubleshooting section for your docs if there's a place to do that.

tirithen commented Aug 8, 2017

It would be amazing to solve this. As a beginner looking into conjure-up/juju, everything looks fantastic: just run one command to install a Kubernetes solution. In reality, though, I have probably spent about a week trying to debug why Kubernetes does not come up again after a reboot, and I still don't understand everything.

It gets messy trying to re-install one part at a time, and for someone new to snap/juju/conjure-up (even with lots of general Linux server experience) it's really hard to understand what goes wrong and why there is no information about this at https://conjure-up.io/. That site only covers sunshine scenarios that will not work on a fresh Ubuntu Server 16.04.3; there seems to be a lot of hidden knowledge about prerequisites that has not been documented yet. It would be wonderful to see more detailed documentation of all the steps needed to run on bare metal (I have not tried the cloud scenarios yet).

@battlemidget battlemidget modified the milestones: 2017.08.18, 2017.08.04 Aug 8, 2017

@tvansteenburgh tvansteenburgh modified the milestones: 2017.09.01, 2017.08.18 Aug 24, 2017

lghinet commented Sep 7, 2017

Same problem here:
results-2017-09-07-12-09-32.tar.gz

juju run --application kubernetes-worker 'service snap.kubelet.daemon restart' does not work.


totalsoft@kube:~$ juju run --unit kubernetes-worker/0 'systemctl status snap.kubelet.daemon'
● snap.kubelet.daemon.service - Service for snap application kubelet.daemon
   Loaded: loaded (/etc/systemd/system/snap.kubelet.daemon.service; enabled; vendor preset: enabled)
   Active: inactive (dead) (Result: exit-code) since Thu 2017-09-07 09:54:05 UTC; 10min ago
  Process: 2150 ExecStart=/usr/bin/snap run kubelet.daemon (code=exited, status=1/FAILURE)
 Main PID: 2150 (code=exited, status=1/FAILURE)

Sep 07 09:54:05 juju-bfb1d6-9 systemd[1]: snap.kubelet.daemon.service: Unit entered failed state.
Sep 07 09:54:05 juju-bfb1d6-9 systemd[1]: snap.kubelet.daemon.service: Failed with result 'exit-code'.
Sep 07 09:54:05 juju-bfb1d6-9 systemd[1]: snap.kubelet.daemon.service: Service hold-off time over, scheduling restart.
Sep 07 09:54:05 juju-bfb1d6-9 systemd[1]: Stopped Service for snap application kubelet.daemon.
Sep 07 09:54:05 juju-bfb1d6-9 systemd[1]: snap.kubelet.daemon.service: Start request repeated too quickly.
Sep 07 09:54:05 juju-bfb1d6-9 systemd[1]: Failed to start Service for snap application kubelet.daemon.

lghinet commented Sep 7, 2017

I don't see any docker0 interface.


k8s-merge-robot added a commit to kubernetes/kubernetes that referenced this issue Sep 23, 2017

Merge pull request #52445 from Cynerva/gkk/cdk-service-kicker
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Fix kubernetes charms not restarting services properly after host reboot on LXD

**What this PR does / why we need it**:

This fixes an issue when running the Kubernetes charms on LXD where the services don't restart properly after a reboot of the host machine.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: juju-solutions/bundle-canonical-kubernetes#357

**Special notes for your reviewer**:

See https://github.com/juju-solutions/layer-cdk-service-kicker

**Release note**:

```release-note
Fix kubernetes charms not restarting services properly after host reboot on LXD
```

marun pushed a commit to marun/federation that referenced this issue Oct 13, 2017

Merge pull request #52445 from Cynerva/gkk/cdk-service-kicker

The workaround is to run "zpool import default" after a reboot; then the CDK cluster starts.
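
As a sketch of that workaround (assuming the pool really is named "default", as in the comment above):

# After a reboot, import the ZFS pool backing LXD if it is not already imported.
zpool list default >/dev/null 2>&1 || sudo zpool import default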

Contributor

battlemidget commented Oct 27, 2017

> It gets messy trying to re-install one part at a time, and for someone new to snap/juju/conjure-up (even with lots of general Linux server experience) it's really hard to understand what goes wrong and why there is no information about this at https://conjure-up.io/. That site only covers sunshine scenarios that will not work on a fresh Ubuntu Server 16.04.3; there seems to be a lot of hidden knowledge about prerequisites that has not been documented yet. It would be wonderful to see more detailed documentation of all the steps needed to run on bare metal (I have not tried the cloud scenarios yet).

I totally understand your frustration, and we are working as hard as we can to get more documentation up on conjure-up.io. Our docs live at https://github.com/canonical-docs/conjure-up-docs and we could really use help to speed up this process.
