
Node in NotReady status after kernel upgrade and reboot on Ubuntu Vivid (15.04) due to Docker service failure (no aufs) #14162

Closed
romanek-adam opened this issue Sep 18, 2015 · 6 comments
Labels
area/os/ubuntu, priority/backlog, sig/cluster-lifecycle, sig/node

Comments

@romanek-adam

Hi. I'm writing this partly to start a discussion and partly to leave a trail for others who may be affected.

I started my journey with Kubernetes just a few days ago. I downloaded the 1.0.6 release (I now know this is only a pre-release and I should have downloaded the 1.0.4 release, but that doesn't seem to be related to my particular issue).

I brought up a tiny k8s cluster with the following config:

export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=eu-west-1a
export NUM_MINIONS=3
export MINION_SIZE=t2.micro
export AWS_S3_REGION=eu-west-1
export AWS_S3_BUCKET=xxx
export KUBE_AWS_INSTANCE_PREFIX=xxx
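
With these variables exported, the cluster was brought up with the stock provisioning script from the unpacked release (a rough sketch; the exact invocation may differ between releases):

$ ./cluster/kube-up.sh    # picks up KUBERNETES_PROVIDER=aws and the variables above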

My cluster was created properly and after a bit of playing I left it in a good shape.

The next day I wanted to investigate a DNS-related issue on the k8s nodes, so I logged in on one of the minions. I noticed *** System restart required *** in the MOTD. I ignored it for a moment and investigated the DNS matters (some nslookups, nothing dangerous). Then I rebooted the machine.

Once the machine was up again I found out that kubectl reported it as NotReady. I started digging and investigating. It took me some time to find out that the Docker service on the node was not running:

$ sudo docker info
FATA[0001] Get http:///var/run/docker.sock/v1.18/info: read unix /var/run/docker.sock: connection reset by peer. Are you trying to connect to a TLS-enabled daemon without TLS?
$ sudo journalctl -u docker -f
-- Logs begin at Fri 2015-09-18 08:06:53 UTC. --
Sep 18 08:09:49 ip-172-20-0-149 systemd[1]: docker.service failed.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: docker.service holdoff time over, scheduling restart.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: Started Docker Application Container Engine.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: Starting Docker Application Container Engine...
Sep 18 08:09:51 ip-172-20-0-149 docker[3397]: time="2015-09-18T08:09:51Z" level=info msg="+job serveapi(fd://)"
Sep 18 08:09:51 ip-172-20-0-149 docker[3397]: time="2015-09-18T08:09:51Z" level=info msg="Listening for HTTP on fd ()"
Sep 18 08:09:51 ip-172-20-0-149 docker[3397]: time="2015-09-18T08:09:51Z" level=fatal msg="Shutting down daemon due to errors: error intializing graphdriver: driver not supported"
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: Unit docker.service entered failed state.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: docker.service failed.

Further investigation led me to the conclusion that the problem arose because the kernel had been upgraded to 3.19.0-28 before the reboot, and that kernel no longer ships the AUFS module by default, which Docker uses as its storage driver.
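
For anyone hitting this, a quick way to confirm the diagnosis (a sketch, assuming an Ubuntu node like the one above):

$ uname -r                      # confirm the running kernel is the upgraded 3.19.0-28
$ grep aufs /proc/filesystems   # empty output means the running kernel has no aufs support
$ sudo modprobe aufs            # fails if no aufs module is installed for this kernel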

Looks like the solution is to install the linux-image-extra-virtual package, as concluded from moby/moby#10859 and other sources, and reboot the node. After the reboot the node got back to the Ready state.
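
In case it helps others, the commands were along these lines (assuming apt on Ubuntu 15.04; package names may differ on other releases):

$ sudo apt-get update
$ sudo apt-get install -y linux-image-extra-virtual   # meta-package tracking the extra kernel modules (incl. aufs) for the current kernel
$ sudo reboot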

Moreover, I guess an upgrade to a newer Docker version using the official install command curl -sSL https://get.docker.com/ | sh would also solve the problem, as the most recent install script already installs linux-image-extra-virtual as per moby/moby#10860. However, I'm not sure how Docker was installed on the k8s node, so I don't know whether it's safe to upgrade it this way.

So I have a number of questions:

  1. should k8s provide a workaround / fix for this particular issue?
  2. or is it left to a cluster admin as it's an issue of a particular provider/env/configuration?
  3. why is k8s using Docker v1.6.0 which is pretty old?
  4. is it safe to use k8s in production on a setup in which Docker uses AUFS storage driver which seems to be deprecated by the Ubuntu team and thus dropped from the main kernel image?

Note that I haven't done anything unusual, just installed a fresh k8s cluster on AWS using the default Ubuntu release. Hence I believe it's important to fix this somehow in k8s.

Possible duplicate: #9779.

@romanek-adam romanek-adam changed the title Node in NotReady status after kernel upgrade and reboot on Ubuntu Vivid (15.04) due to Docker start failure (no aufs) Node in NotReady status after kernel upgrade and reboot on Ubuntu Vivid (15.04) due to Docker service failure (no aufs) Sep 18, 2015
@mwielgus mwielgus added area/install area/os/ubuntu sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Sep 18, 2015
@thockin thockin added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 18, 2015
@dchen1107 dchen1107 added team/control-plane priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 18, 2015
@dchen1107
Member

Thanks for reporting the issue and kicking off more discussion. You just opened a can of worms here :-)

  1. should k8s provide a workaround / fix for this particular issue?

You can upgrade your Docker version through Salt, or update the Docker flags through Salt to use devicemapper instead.
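
For example (an illustrative sketch, not the exact Salt pillar; it assumes the node reads DOCKER_OPTS from /etc/default/docker):

# /etc/default/docker (illustrative; on a Salt-managed node this file is templated by Salt)
DOCKER_OPTS="--storage-driver=devicemapper"

# then restart the daemon
$ sudo systemctl restart docker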

  2. or is it left to a cluster admin as it's an issue of a particular provider/env/configuration?

We should do a better job of defining a support matrix, specifying minimal system requirements, and running integration / end-to-end tests through something like a version manager that controls each component's version, etc. I haven't listed everything we should do, but you can see there is a bunch of pending work here. This is an open source project, and I believe our community can help with this.

  3. why is k8s using Docker v1.6.0 which is pretty old?

When we cut the Kubernetes v1.0 release, Docker was at 1.7.1 but had several serious unresolved issues, especially around host networking. That is why we decided to cut 1.0 with Docker v1.6.2. But there is always a way for a cluster admin to upgrade to a newer version through Salt.

For the 1.1 release, the plan is to cut it with Docker 1.8.2.

@romanek-adam
Author

It looks like the fix is to change linux-image-extra-$(uname -r) to linux-image-extra-virtual in cluster/aws/templates/format-disks.sh#L175, which will pull in the right linux-image-extra-XXX on each kernel upgrade. I could provide a pull request but my company hasn't signed the CLA yet.
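
Roughly the following change (illustrative only; the exact surrounding lines in format-disks.sh may differ):

-  apt-get install -y linux-image-extra-$(uname -r)
+  apt-get install -y linux-image-extra-virtual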

@paralin
Contributor

paralin commented Oct 21, 2015

Just ran into this issue; submitting a PR to address it.

@four43

four43 commented Oct 28, 2015

+1, running Kubernetes v1.0.6 production release on AWS, provisioning Vivid instances.

@paralin
Contributor

paralin commented Oct 28, 2015

@four43 fixed in the PR; alternatively, install linux-image-virtual on all nodes.

@four43

four43 commented Oct 28, 2015

Yup, I commented over there too :) Thanks @paralin

RichieEscarez pushed a commit to RichieEscarez/kubernetes that referenced this issue Dec 4, 2015
Fixes AWS ubuntu deployment due to extra-$(uname) vs extra-virtual
package being installed. See issue kubernetes#14162

Signed-off-by: Christian Stewart <christian@paral.in>