Docker 1.9 doesn't restart cleanly #20995

Closed
bprashanth opened this issue Feb 10, 2016 · 22 comments
Labels: priority/critical-urgent, sig/node
Milestone: v1.2

@bprashanth (Contributor)

I have a node stuck in NotReady with the symptoms described in: moby/moby#17083. Essentially supervisord keeps restarting docker.

docker logs: https://storage.googleapis.com/devnul/active_endpoints_docker.log
kubelet: https://storage.googleapis.com/devnul/active_endpoints_kubelet.log
kern (though this doesn't look like a kernel issue): https://storage.googleapis.com/devnul/active_endpoints_kern.log

A couple of weird things:

First it gets shut down for some reason:

time="2016-02-10T04:50:09.765992265Z" level=info msg="GET /version" 
time="2016-02-10T04:50:14.773215450Z" level=info msg="GET /version" 
time="2016-02-10T04:50:19.866889346Z" level=info msg="GET /version" 
/usr/bin/docker already running.
time="2016-02-10T04:50:21.214073277Z" level=info msg="Processing signal 'terminated'" 

Then it complains about cbr0, which is probably ok (though still weird):

time="2016-02-10T04:50:29.765361165Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually" 
time="2016-02-10T04:50:37.233355561Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually" 
time="2016-02-10T04:50:52.337432352Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: bridge device with non default name cbr0 must be created manually" 

Then it goes looking in the store:

time="2016-02-10T18:19:25.597635188Z" level=warning msg="Failed deleting endpoint 09230581f955753c5e4af795e3eaff527d1a43ce630c1d8c742d6e015443065c: failed to get endpoint from store during Delete: could not find endpoint 09230581f955753c5e4af795e3eaff527d1a43ce630c1d8c742d6e015443065c: Key not found in store\n" 
time="2016-02-10T18:19:30.665621023Z" level=error msg="getEndpointFromStore for eid d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a failed while trying to build sandbox for cleanup: could not find endpoint d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a: Key not found in store" 
time="2016-02-10T18:19:30.665749581Z" level=warning msg="Failed detaching sandbox d16eed32af089019fe2ab199e5fbe34db834c7cf0ad1a08cc420c17c15ed7872 from endpoint d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a: failed to get endpoint from store during leave: could not find endpoint d4f9cbeee5f2b6f2e6931462445ca39f2b7efc37431549cd962e1c587723e62a: Key not found in store\n" 

It's wedged:

time="2016-02-10T19:08:42.686296946Z" level=error msg="Error during graph storage driver.Cleanup(): device or resource busy" 
time="2016-02-10T19:08:42.686453183Z" level=fatal msg="Error starting daemon: Error initializing network controller: could not delete the default bridge network: network bridge has active endpoin

But an rm -rf /var/lib/docker/network/files/ fixes it; I suspect the real culprit was some corrupt state in /var/lib/docker/network/files/local-kv.db.
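A minimal sketch of that workaround, assuming the daemon is managed by an init script or supervisord on the node (adjust the stop/start commands to match):

sudo service docker stop || true             # or: sudo supervisorctl stop docker
sudo rm -rf /var/lib/docker/network/files/   # drops libnetwork's local-kv.db checkpoint
sudo service docker start                    # the daemon rebuilds its network state on start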

@kubernetes/goog-node have we tested restarts with 1.9?

@bprashanth added the sig/node and team/cluster labels on Feb 10, 2016
@yujuhong (Contributor)

I don't think we test "restarts" specifically, so my guess is no.

Based on moby/moby#17083 (comment), docker v1.10 has the same issue.

@dchen1107 (Member)

cc/ @vishh on docker 1.9.X validation

@dchen1107 (Member)

Both @freehan and @ArtfulCoder ran into this problem on their desktops with docker 1.9.1. I helped them fix it by 1) removing their docker storage rootdir, 2) removing .../docker/linkgraph.db, and 3) restarting the docker daemon.
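Roughly, as a sketch (paths are assumptions: the storage rootdir depends on the configured graph driver and docker root, and step 1 wipes all local images and containers):

sudo service docker stop
sudo rm -rf /var/lib/docker/aufs         # 1) storage rootdir for the driver in use (aufs/, overlay/, devicemapper/, ...)
sudo rm -f /var/lib/docker/linkgraph.db  # 2) stale link/checkpoint state
sudo service docker start                # 3) restart the daemon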

@bprashanth (Contributor, Author)

This was a fresh install though, just an e2e cluster-up.

@dchen1107 (Member)

I did test docker "restart" when validating docker releases. We ran into this issue with docker 1.7-rc before. Here is the issue I filed against docker: moby/moby#13850

@bprashanth (Contributor, Author)

I don't think this is about the storage driver. I think it's related to networking state stored in /var/lib/docker/network/files/local-kv.db; the problem went away when I nuked local-kv.db (and didn't touch anything else).

@dchen1107 (Member)

@bprashanth Looks like upon restart, both storage driver state and network state can be corrupted with the docker 1.9 release, and nuking the checkpoint files is the only workaround.

cc/ @andyzheng0831 You already build your image with docker 1.9.1. Have any of your customers reported this problem?

@bprashanth (Contributor, Author)

FWIW, this showed up on one node in a 300-node cluster.

@dchen1107 added the priority/critical-urgent label on Feb 10, 2016
@dchen1107 added this to the v1.2 milestone on Feb 10, 2016
@andyzheng0831

Did you mean whether any user of k8s/GKE has reported this problem to us? No.

@dchen1107 (Member)

@bprashanth I know this is a freshly installed node, but it still involves a docker restart:

  1. Salt upgrades docker to the new version: docker 1.9.1, replacing the docker 1.8.3 that is baked into the containervm image.
  2. Kubelet configures cbr0 and restarts the docker daemon.

I suspect this is only triggered by the docker upgrade, since that involves checkpoint schema changes. I will double-check.

Meanwhile, @vishh is going to test docker restart with 1.9.1 without an upgrade involved.

@timothysc (Member)

I've only seen this issue on initial provisioning; subsequent restarts were all fine.

/cc @ncdc

@dchen1107 (Member)

Talked to @andyzheng0831 offline. They have docker 1.9.1 baked into their image, and there are roughly a couple of thousand instances running that image today. They have never received any report of this docker issue.

If we can rule out pure docker restart, we can still go ahead and release 1.2 with docker 1.9.1, since the issue only happens at the initial upgrade stage. We can document it as a known issue.

@dchen1107 (Member)

#21000

@timothysc (Member)

I should clarify: I've seen the docker behavior mentioned on an upgrade, not a baked setup.

@dchen1107 (Member)

@timothysc Thanks for the quick updates on this issue. In your case, did the initial provisioning involve a docker version upgrade?

@bprashanth (Contributor, Author)

FYI, doing exactly what we do for the kubelet in daemon-restart should work if we need to test docker restart in an e2e: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/daemon_restart.go#L297
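For example, a hypothetical node-level step such a test could run (this is not the actual daemon_restart.go logic; the node name, SSH wrapper, and service manager are all assumptions):

NODE="some-node-name"                                        # hypothetical node under test
gcloud compute ssh "$NODE" -- "sudo service docker restart"  # bounce the daemon the way the test bounces the kubelet
kubectl get node "$NODE" --watch                             # then wait for the node to come back to Ready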

@timothysc (Member)

@dchen1107 - In the environment where this happened, yes. A docker upgrade was part of the process.

@dchen1107 (Member)

The two cases I mentioned at #20995 (comment) also involved a docker version upgrade.

@bprashanth Testing docker restart is a valid test case for the container runtime validation test suite. We should include the upgrade case too, for example along the lines of the sketch below.
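A hypothetical shape for that upgrade case (package names, versions, and the service manager are placeholders; the real suite would drive this through its own harness):

set -e
sudo apt-get install -y docker-engine=1.8.3-0~jessie   # placeholder: start from the old version
sudo service docker start
docker run -d --name canary busybox sleep 3600         # leave a container running across the upgrade
sudo apt-get install -y docker-engine=1.9.1-0~jessie   # placeholder: upgrade in place
sudo service docker restart
docker info > /dev/null && echo "daemon healthy after upgrade + restart"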

@andyzheng0831

FYI, we build 1.9.1 into the image and have thousands of instances running. Instances may be rebooted, but we never allow upgrading or downgrading the docker version. So far we haven't received any reports of this kind of corruption issue.

@dchen1107 (Member)

A small summary:

  • @vishh ran a docker restart test without an upgrade against docker 1.9.1, but 2 of 3 nodes ran into issue #20096 (Mitigate impact of unregister_netdevice kernel race). He is going to kick off another round of testing with hairpin mode disabled.
  • I received one report from a user of the image created by @andyzheng0831's team. After discussing with the user, I found they issued kill -9 <docker_daemon_pid> without sending SIGTERM first. Corrupting docker's checkpoint is expected in that case, so it's a red herring.

We will decide which docker version to ship with the 1.2 release once we receive the final signal from Vishnu's test.

@dchen1107 (Member)

FYI @Amey-D: this is the last remaining issue for validating docker 1.9.1 for the 1.2 release.

@dchen1107 (Member)

Our overnight stress test on docker restart indicates that a clean restart of the docker daemon without an upgrade works fine. I am closing this issue and using #21086 to track all required changes for our containerVM 1.2 release.
