Issue #20312 still open with 1.11.1 #23078

Closed
ascheman opened this issue May 28, 2016 · 9 comments

Comments

@ascheman

I still see the problem reported in #20312 with Docker 1.11.1 occasionally.
The first start leaves a file /var/lib/docker/network/files/local-kv.db behind, and any subsequent attempt to restart Docker then fails with a message like

time="2016-05-28T11:15:14.366122921Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: failed to allocate gateway (172.17.0.1): Address already in use"

If I remove the local-kv.db file, Docker starts without problems.
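For reference, here is a minimal workaround sketch (assuming the default Docker data root /var/lib/docker and the Ubuntu 14.04 init script named docker used in this setup):

#!/bin/bash
# Workaround sketch: stop the daemon if it is still running, remove the
# stale libnetwork store, and start Docker again.
set -e

sudo service docker stop || true   # the daemon may already have died
sudo rm -f /var/lib/docker/network/files/local-kv.db
sudo service docker start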

I have a CI process that runs a Vagrant box (based on VirtualBox, Ubuntu 14.04), installs Docker, and frequently runs into this problem (3 out of the last 10 tries).

So, please, please, PLEASE do not only fix the issue but also set up a CI job that installs Docker development/release candidates over and over (it should run every few minutes) so that such race conditions are detected as early as possible in the future! I would be glad to help set up such a CI environment.
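(For illustration only, a hypothetical cron entry for such a job; the repository path and the run-test.sh wrapper are placeholders, not part of any existing setup:)

# Hypothetical cron entry on a CI host: re-run the Vagrant-based install
# test every 5 minutes; run-test.sh would wrap the destroy/up cycle shown
# below under "Steps to reproduce".
*/5 * * * * cd /opt/ci/minimal4dockerproblem && ./run-test.sh >> /var/log/docker-install-test.log 2>&1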

Output of docker version:

Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.11.1
Storage Driver: devicemapper
 Pool Name: docker-8:1-267052-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 305.7 MB
 Data Space Total: 107.4 GB
 Data Space Available: 40.32 GB
 Metadata Space Used: 729.1 kB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.147 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.77 (2012-10-15)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-85-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 2.939 GiB
Name: devopssquare-full
ID: QS4Q:JKGK:EY35:QZU3:BN4A:LYGN:OKSZ:O7GQ:7VO5:DNA2:IIYA:GRV6
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): true
 File Descriptors: 13
 Goroutines: 27
 System Time: 2016-05-28T11:40:30.503757476Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

  • VirtualBox 5.0.14
  • Vagrant 1.7.4
  • Host: Ubuntu 15.10
  • Guest: Ubuntu 14.04 LTS

Steps to reproduce the issue:
Check out https://github.com/ascheman/minimal4dockerproblem and run something like the following shell loop:

#!/bin/bash

set -e

while true
do
    vagrant destroy -f
    sleep 10 # Short wait
    vagrant up
    sleep 300 # Wait 5 minutes for the next try
done

After a few tries you will run into the problem; the loop then stops automatically and you can log into the Vagrant machine to investigate further.

Describe the results you received:
See above: Docker fails to start during the first installation.

Describe the results you expected:
I expect Docker to start :-)

Additional information you deem important (e.g. issue happens only occasionally):
Happens here in 3 out of 10 tries.

@thaJeztah
Member

Is the local-kv.db file part of the content that's preserved after the Vagrant machine is destroyed? Does vagrant destroy -f do a clean shutdown of the Docker daemon, or is the machine forcibly killed? Wondering if that could be related here.

@ascheman
Author

After vagrant destroy -f everything is deleted, but during the loop that only happens if everything went well. Because of the set -e, the loop is aborted when the problem occurs (since the setup/test of the Vagrant box fails at that point). In that case the Vagrant box is left over; you can then log in via vagrant ssh and find the local-kv.db file there (you will probably need sudo su - to become root and get read access to the file inside the VM).
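For example (run from the directory containing the Vagrantfile of the leftover box; the file path is the one from the report above):

# Log into the leftover Vagrant box
vagrant ssh
# Inside the VM: become root to get read access to the store
sudo su -
ls -l /var/lib/docker/network/files/local-kv.db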

@thaJeztah
Member

/cc @mavenugo @aboch could you have a look at this?

@cyphar
Contributor

cyphar commented May 29, 2016

There are other issues with local-kv.db that we've seen in SUSE. For instance, I'm currently debugging a corrupted database that causes Docker to segfault on startup.

thaJeztah added this to the 1.12.0 milestone May 29, 2016
@ascheman
Author

I made my Jenkins run my 'minimal4dockerproblem' setup every 5 minutes, and it has failed with this error 2 out of ~250 times during the last ~24 hours. In more complex Vagrant setups, which I run less frequently, the problem occurred more often. I cannot see any pattern in it; it still looks like a race condition to me.

@tiborvass
Contributor

@aboch does moby/libnetwork#1130 mean that this can be closed?

@aboch
Contributor

aboch commented Jun 27, 2016

@tiborvass Yes, that change plus others is likely to fix this issue, in the sense that the condition needed for the reported problem to happen should no longer be possible.

@tiborvass
Contributor

Okay thanks.

lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 29, 2016
On a percentage of DC/OS agents (~5%) with Docker 1.11.2, the daemon
will fail to start up with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug around the network controller
initialization, where the controller has allocated an ip pool and
persisted some state but not all of it. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface, if it exists, before
starting the Docker daemon. It will need to be re-evaluated if we want to
enable 1.12+ Docker options such as containerd live-restore, as
discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658
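For illustration, a sketch of such a pre-start cleanup (a hypothetical script, not the actual DC/OS change; it assumes the default bridge name docker0 and must run as root before the daemon is started):

#!/bin/bash
# Sketch: remove a leftover docker0 bridge before the Docker daemon starts,
# so the default bridge network and its gateway (172.17.0.1) can be
# allocated again.
if ip link show docker0 > /dev/null 2>&1; then
    ip link set docker0 down
    ip link delete docker0
fi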
@Amey-D

Amey-D commented Jan 13, 2017

Was this issue fixed in v1.11.2?
