
Node in NotReady status after kernel upgrade and reboot on Ubuntu Vivid (15.04) due to Docker service failure (no aufs) #14162

Closed
romanek-adam opened this issue Sep 18, 2015 · 6 comments
Labels
area/os/ubuntu, priority/backlog, sig/cluster-lifecycle, sig/node

Comments

@romanek-adam

Hi. I'm writing this partly to start a discussion and partly to leave a trail for others who may be affected.

I started my journey with Kubernetes just a few days ago. I downloaded the 1.0.6 release (I now know this is only a pre-release and I should have downloaded the 1.0.4 release, but that doesn't seem to be related to my particular issue).

I brought up a tiny k8s cluster with the following config:

export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=eu-west-1a
export NUM_MINIONS=3
export MINION_SIZE=t2.micro
export AWS_S3_REGION=eu-west-1
export AWS_S3_BUCKET=xxx
export KUBE_AWS_INSTANCE_PREFIX=xxx
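
With these variables exported, the cluster was brought up with the stock provisioning script from the unpacked release (a rough sketch; the exact invocation may differ between releases):

$ ./cluster/kube-up.sh    # picks up KUBERNETES_PROVIDER=aws and the variables above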

My cluster was created properly and after a bit of playing I left it in a good shape.

The next day I wanted to investigate a DNS-related issue on the k8s nodes, so I logged in on one of the minions. I noticed *** System restart required *** in the MOTD. I ignored it for a moment and investigated the DNS matters (some nslookups, nothing dangerous). Then I rebooted the machine.

Once the machine was up again I found out that kubectl reported it as NotReady. I started digging and investigating. It took me some time to find out that the Docker service on the node was not running:

$ sudo docker info
FATA[0001] Get http:///var/run/docker.sock/v1.18/info: read unix /var/run/docker.sock: connection reset by peer. Are you trying to connect to a TLS-enabled daemon without TLS?
$ sudo journalctl -u docker -f
-- Logs begin at Fri 2015-09-18 08:06:53 UTC. --
Sep 18 08:09:49 ip-172-20-0-149 systemd[1]: docker.service failed.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: docker.service holdoff time over, scheduling restart.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: Started Docker Application Container Engine.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: Starting Docker Application Container Engine...
Sep 18 08:09:51 ip-172-20-0-149 docker[3397]: time="2015-09-18T08:09:51Z" level=info msg="+job serveapi(fd://)"
Sep 18 08:09:51 ip-172-20-0-149 docker[3397]: time="2015-09-18T08:09:51Z" level=info msg="Listening for HTTP on fd ()"
Sep 18 08:09:51 ip-172-20-0-149 docker[3397]: time="2015-09-18T08:09:51Z" level=fatal msg="Shutting down daemon due to errors: error intializing graphdriver: driver not supported"
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: Unit docker.service entered failed state.
Sep 18 08:09:51 ip-172-20-0-149 systemd[1]: docker.service failed.

Further investigation led me to the conclusion that the problem arose because the kernel had been upgraded to 3.19.0-28 before the reboot, and that kernel no longer ships the AUFS module by default, which Docker uses as its storage driver.
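
For anyone hitting this, a quick way to confirm the diagnosis (a sketch, assuming an Ubuntu node like the one above):

$ uname -r                      # confirm the running kernel is the upgraded 3.19.0-28
$ grep aufs /proc/filesystems   # empty output means the running kernel has no aufs support
$ sudo modprobe aufs            # fails if no aufs module is installed for this kernel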

Looks like the solution is to install the linux-image-extra-virtual package, as concluded from moby/moby#10859 and other sources, and reboot the node. After the reboot the node got back to the Ready state.
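
In case it helps others, the commands were along these lines (assuming apt on Ubuntu 15.04; package names may differ on other releases):

$ sudo apt-get update
$ sudo apt-get install -y linux-image-extra-virtual   # meta-package tracking the extra kernel modules (incl. aufs) for the current kernel
$ sudo reboot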

Moreover, I guess an upgrade to a newer Docker version using the official install command curl -sSL https://get.docker.com/ | sh would also solve the problem, as the most recent install script already installs linux-image-extra-virtual as per moby/moby#10860. However, I'm not sure how Docker was installed on the k8s node, so I don't know whether it's safe to upgrade it this way.

So I have a number of questions:

  1. should k8s provide a workaround / fix for this particular issue?
  2. or is it left to a cluster admin as it's an issue of a particular provider/env/configuration?
  3. why is k8s using Docker v1.6.0 which is pretty old?
  4. is it safe to use k8s in production on a setup in which Docker uses AUFS storage driver which seems to be deprecated by the Ubuntu team and thus dropped from the main kernel image?

Note that I haven't done anything unusual, just installed a fresh k8s cluster on AWS using the default Ubuntu release. Hence I believe it's important to fix this somehow in k8s.

Possible duplicate: #9779.

@romanek-adam romanek-adam changed the title Node in NotReady status after kernel upgrade and reboot on Ubuntu Vivid (15.04) due to Docker start failure (no aufs) Node in NotReady status after kernel upgrade and reboot on Ubuntu Vivid (15.04) due to Docker service failure (no aufs) Sep 18, 2015
@mwielgus mwielgus added area/install area/os/ubuntu sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Sep 18, 2015
@thockin thockin added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 18, 2015
@dchen1107 dchen1107 added team/control-plane priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 18, 2015
@dchen1107
Member

Thanks for reporting the issue and kicking off more discussion. You just opened a can of worms here :-)

  1. should k8s provide a workaround / fix for this particular issue?

You can upgrade your Docker version through Salt, or update the Docker flags through Salt to use devicemapper instead.
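
For example (an illustrative sketch, not the exact Salt pillar; it assumes the node reads DOCKER_OPTS from /etc/default/docker):

# /etc/default/docker (illustrative; on a Salt-managed node this file is templated by Salt)
DOCKER_OPTS="--storage-driver=devicemapper"

# then restart the daemon
$ sudo systemctl restart docker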

  2. or is it left to a cluster admin as it's an issue of a particular provider/env/configuration?

We should do a better job of defining a support matrix, specifying minimal system requirements, and running integration / end-to-end tests through something like a version manager that controls each component's version, etc. I haven't listed everything we should do, but you can see there is a bunch of pending work here. This is an open source project, and I believe our community can help with this.

  3. why is k8s using Docker v1.6.0 which is pretty old?

When we cut the Kubernetes v1.0 release, Docker was at 1.7.1 but had several serious unresolved issues, especially around host networking. That is why we decided to cut 1.0 with Docker v1.6.2. But there is always a way for a cluster admin to upgrade to a newer version through Salt.

For the 1.1 release, the plan is to cut it with Docker 1.8.2.

@romanek-adam
Author

It looks like the fix is to change linux-image-extra-$(uname -r) to linux-image-extra-virtual in cluster/aws/templates/format-disks.sh#L175, which will pull in the right linux-image-extra-XXX on each kernel upgrade. I could provide a pull request but my company hasn't signed the CLA yet.
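
Roughly the following change (illustrative only; the exact surrounding lines in format-disks.sh may differ):

-  apt-get install -y linux-image-extra-$(uname -r)
+  apt-get install -y linux-image-extra-virtual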

@paralin
Contributor

paralin commented Oct 21, 2015

Just ran into this issue; submitting a PR to address it.

@four43

four43 commented Oct 28, 2015

+1, running Kubernetes v1.0.6 production release on AWS, provisioning Vivid instances.

@paralin
Contributor

paralin commented Oct 28, 2015

@four43 fixed in the PR; alternatively, install linux-image-virtual on all nodes.

@four43

four43 commented Oct 28, 2015

Yup, I commented over there too :) Thanks @paralin

RichieEscarez pushed a commit to RichieEscarez/kubernetes that referenced this issue Dec 4, 2015
Fixes AWS ubuntu deployment due to extra-$(uname) vs extra-virtual
package being installed. See issue kubernetes#14162

Signed-off-by: Christian Stewart <christian@paral.in>