Docker 1.9 doesn't restart cleanly #20995

Closed
bprashanth opened this issue Feb 10, 2016 · 22 comments
Labels: priority/critical-urgent, sig/node
Milestone: v1.2

@bprashanth (Contributor)

I have a node stuck in NotReady with the symptoms described in: moby/moby#17083. Essentially supervisord keeps restarting docker.

docker logs: https://storage.googleapis.com/devnul/active_endpoints_docker.log
kubelet: https://storage.googleapis.com/devnul/active_endpoints_kubelet.log
kern (though this doesn't look like a kernel issue): https://storage.googleapis.com/devnul/active_endpoints_kern.log

A couple of weird things:

First it gets shut down for some reason:

time="2016-02-10T04:50:09.765992265Z" level=info msg="GET /version" 
time="2016-02-10T04:50:14.773215450Z" level=info msg="GET /version" 
time="2016-02-10T04:50:19.866889346Z" level=info msg="GET /version" 
/usr/bin/docker already running.
time="2016-02-10T04:50:21.214073277Z" level=info msg="Processing signal 'terminated'" 

Then it complains about cbr0, which is probably ok (though still weird):

time="2016-02-10T04:50:29.765361165Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually" 
time="2016-02-10T04:50:37.233355561Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually" 
time="2016-02-10T04:50:52.337432352Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually" 

Then it goes looking in the store:

time="2016-02-10T18:19:25.597635188Z" level=warning msg="Failed deleting endpoint 09230581f955753c5e4af795e3eaff527d1a43ce630c1d8c742d6e015443065c: failed to get endpoint from store during Delete: could not find endpoint 09230581f955753c5e4af795e3eaff527d1a43ce630c1d8c742d6e015443065c: Key not found in store\n" 
time="2016-02-10T18:19:30.665621023Z" level=error msg="getEndpointFromStore for eid d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a failed while trying to build sandbox for cleanup: could not find endpoint d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a: Key not found in store" 
time="2016-02-10T18:19:30.665749581Z" level=warning msg="Failed detaching sandbox d16eed32af089019fe2ab199e5fbe34db834c7cf0ad1a08cc420c17c15ed7872 from endpoint d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a: failed to get endpoint from store during leave: could not find endpoint d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a: Key not found in store\n" 

It's wedged:

time="2016-02-10T19:08:42.686296946Z" level=error msg="Error during graph storage driver.Cleanup(): device or resource busy" 
time="2016-02-10T19:08:42.686453183Z" level=fatal msg="Error starting daemon: Error initializing network controller: could not delete the default bridge network: network bridge has active endpoin

But an rm -rf /var/lib/docker/network/files/ fixes it; I suspect the real culprit was some corrupt state in /var/lib/docker/network/files/local-kv.db.
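A minimal sketch of that workaround, assuming the daemon is managed by an init script or supervisord on the node (adjust the stop/start commands to match):

sudo service docker stop || true             # or: sudo supervisorctl stop docker
sudo rm -rf /var/lib/docker/network/files/   # drops libnetwork's local-kv.db checkpoint
sudo service docker start                    # the daemon rebuilds its network state on start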

@kubernetes/goog-node have we tested restarts with 1.9?

@bprashanth added the sig/node and team/cluster labels on Feb 10, 2016
@yujuhong (Contributor)

I don't think we test "restarts" specifically, so my guess is no.

Based on moby/moby#17083 (comment), docker v1.10 has the same issue.

@dchen1107 (Member)

cc/ @vishh on docker 1.9.X validation

@dchen1107 (Member)

Both @freehan and @ArtfulCoder ran into this problem on their desktops with docker 1.9.1. I helped them fix it by 1) removing their docker storage rootdir, 2) removing .../docker/linkgraph.db, and 3) restarting the docker daemon.
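Roughly, as a sketch (paths are assumptions: the storage rootdir depends on the configured graph driver and docker root, and step 1 wipes all local images and containers):

sudo service docker stop
sudo rm -rf /var/lib/docker/aufs         # 1) storage rootdir for the driver in use (aufs/, overlay/, devicemapper/, ...)
sudo rm -f /var/lib/docker/linkgraph.db  # 2) stale link/checkpoint state
sudo service docker start                # 3) restart the daemon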

@bprashanth (Contributor, Author)

This was a fresh install though, just an e2e cluster-up.

@dchen1107 (Member)

I did test docker "restart" when validating docker releases. We ran into this issue with docker 1.7-rc before. Here is the issue I filed against docker: moby/moby#13850

@bprashanth (Contributor, Author)

I don't think this is about the storage driver. I think it's related to networking state stored in /var/lib/docker/network/files/local-kv.db; the problem went away when I nuked local-kv.db (and didn't touch anything else).

@dchen1107 (Member)

@bprashanth Looks like upon restart, both storage driver state and network state can be corrupted with the docker 1.9 release, and nuking the checkpoint files is the only workaround.

cc/ @andyzheng0831 You already build your image with docker 1.9.1. Have any of your customers reported this problem?

@bprashanth (Contributor, Author)

FWIW, this showed up on one node in a 300-node cluster.

@dchen1107 added the priority/critical-urgent label on Feb 10, 2016
@dchen1107 added this to the v1.2 milestone on Feb 10, 2016
@andyzheng0831

Did you mean whether any user of k8s/GKE has reported this problem to us? No.

@dchen1107 (Member)

@bprashanth I know this is a freshly installed node, but it still involves a docker restart:

  1. Salt upgrades docker to the new version: docker 1.9.1, replacing the docker 1.8.3 that is baked into the containervm image.
  2. Kubelet configures cbr0 and restarts the docker daemon.

I suspect this is only triggered by the docker upgrade, since that involves checkpoint schema changes. I will double-check.

Meanwhile, @vishh is going to test docker restart with 1.9.1 without an upgrade involved.

@timothysc (Member)

I've only seen this issue on initial provisioning; subsequent restarts were all fine.

/cc @ncdc

@dchen1107 (Member)

Talked to @andyzheng0831 offline. They have docker 1.9.1 baked into their image, and there are roughly a couple of thousand instances running that image today. They have never received any report of this docker issue.

If we can rule out pure docker restart, we can still go ahead and release 1.2 with docker 1.9.1, since the issue only happens at the initial upgrade stage. We can document it as a known issue.

@dchen1107 (Member)

#21000

@timothysc (Member)

I should clarify: I've seen the docker behavior mentioned on an upgrade, not a baked setup.

@dchen1107 (Member)

@timothysc Thanks for the quick updates on this issue. In your case, did the initial provisioning involve a docker version upgrade?

@bprashanth (Contributor, Author)

FYI, doing exactly what we do for the kubelet in daemon-restart should work if we need to test docker restart in an e2e: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/daemon_restart.go#L297
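For example, a hypothetical node-level step such a test could run (this is not the actual daemon_restart.go logic; the node name, SSH wrapper, and service manager are all assumptions):

NODE="some-node-name"                                        # hypothetical node under test
gcloud compute ssh "$NODE" -- "sudo service docker restart"  # bounce the daemon the way the test bounces the kubelet
kubectl get node "$NODE" --watch                             # then wait for the node to come back to Ready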

@timothysc (Member)

@dchen1107 - In the environment where this happened, yes. A docker upgrade was part of the process.

@dchen1107 (Member)

The two cases I mentioned at #20995 (comment) also involved a docker version upgrade.

@bprashanth Testing docker restart is a valid test case for the container runtime validation test suite. We should include the upgrade case too, for example along the lines of the sketch below.
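A hypothetical shape for that upgrade case (package names, versions, and the service manager are placeholders; the real suite would drive this through its own harness):

set -e
sudo apt-get install -y docker-engine=1.8.3-0~jessie   # placeholder: start from the old version
sudo service docker start
docker run -d --name canary busybox sleep 3600         # leave a container running across the upgrade
sudo apt-get install -y docker-engine=1.9.1-0~jessie   # placeholder: upgrade in place
sudo service docker restart
docker info > /dev/null && echo "daemon healthy after upgrade + restart"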

@andyzheng0831

FYI, we build 1.9.1 into the image and have thousands of instances running. Instances may be rebooted, but we never allow upgrading or downgrading the docker version. So far we haven't received any reports of this kind of corruption issue.

@dchen1107 (Member)

A small summary:

  • @vishh ran a docker restart test without an upgrade against docker 1.9.1, but 2 of 3 nodes ran into issue #20096 (Mitigate impact of unregister_netdevice kernel race). He is going to kick off another round of testing with hairpin mode disabled.
  • I received one report from a user of the image created by @andyzheng0831's team. After discussing with the user, I found they issued kill -9 <docker_daemon_pid> without sending SIGTERM first. Corrupting docker's checkpoint is expected in that case, so it's a red herring.

We will decide which docker version to ship with the 1.2 release once we receive the final signal from Vishnu's test.

@dchen1107 (Member)

FYI @Amey-D: this is the last remaining issue for validating docker 1.9.1 for the 1.2 release.

@dchen1107 (Member)

Our overnight stress test on docker restart indicates that a clean restart of the docker daemon without an upgrade works fine. I am closing this issue and using #21086 to track all required changes for our containerVM 1.2 release.
