Issue #20312 still open with 1.11.1 #23078

Closed
ascheman opened this issue May 28, 2016 · 9 comments

Comments

@ascheman

I still see the problem reported in #20312 with Docker 1.11.1 occasionally.
The first start leaves a file /var/lib/docker/network/files/local-kv.db behind, and any subsequent attempt to restart Docker then fails with a message like

time="2016-05-28T11:15:14.366122921Z" level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: failed to allocate gateway (172.17.0.1): Address already in use"

If I remove the local-kv.db file, Docker starts without problems.
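For reference, here is a minimal workaround sketch (assuming the default Docker data root /var/lib/docker and the Ubuntu 14.04 init script named docker used in this setup):

#!/bin/bash
# Workaround sketch: stop the daemon if it is still running, remove the
# stale libnetwork store, and start Docker again.
set -e

sudo service docker stop || true   # the daemon may already have died
sudo rm -f /var/lib/docker/network/files/local-kv.db
sudo service docker start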

I have a CI process that runs a Vagrant box (based on VirtualBox, Ubuntu 14.04), installs Docker, and frequently runs into this problem (3 out of the last 10 tries).

So, please, please, PLEASE do not only fix the issue but also set up a CI job that installs Docker development/release candidates over and over (it should run every few minutes) so that such race conditions are detected as early as possible in the future! I would be glad to help set up such a CI environment.
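(For illustration only, a hypothetical cron entry for such a job; the repository path and the run-test.sh wrapper are placeholders, not part of any existing setup:)

# Hypothetical cron entry on a CI host: re-run the Vagrant-based install
# test every 5 minutes; run-test.sh would wrap the destroy/up cycle shown
# below under "Steps to reproduce".
*/5 * * * * cd /opt/ci/minimal4dockerproblem && ./run-test.sh >> /var/log/docker-install-test.log 2>&1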

Output of docker version:

Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.11.1
Storage Driver: devicemapper
 Pool Name: docker-8:1-267052-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 305.7 MB
 Data Space Total: 107.4 GB
 Data Space Available: 40.32 GB
 Metadata Space Used: 729.1 kB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.147 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.77 (2012-10-15)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-85-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 2.939 GiB
Name: devopssquare-full
ID: QS4Q:JKGK:EY35:QZU3:BN4A:LYGN:OKSZ:O7GQ:7VO5:DNA2:IIYA:GRV6
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): true
 File Descriptors: 13
 Goroutines: 27
 System Time: 2016-05-28T11:40:30.503757476Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

  • VirtualBox 5.0.14
  • Vagrant 1.7.4
  • Host: Ubuntu 15.10
  • Guest: Ubuntu 14.04 LTS

Steps to reproduce the issue:
Check out https://github.com/ascheman/minimal4dockerproblem and run something like the following shell loop:

#!/bin/bash

set -e

while true
do
    vagrant destroy -f
    sleep 10 # Short wait
    vagrant up
    sleep 300 # Wait 5 minutes for the next try
done

After a few tries you will run into the problem; the loop then stops automatically and you can log into the Vagrant machine to investigate further.

Describe the results you received:
See above: Docker fails to start during the first installation.

Describe the results you expected:
I expect Docker to start :-)

Additional information you deem important (e.g. issue happens only occasionally):
Happens here in 3 out of 10 tries.

@thaJeztah
Member

Is the local-kv.db file part of the content that's preserved after the Vagrant machine is destroyed? Does vagrant destroy -f do a clean shutdown of the Docker daemon, or is the machine forcibly killed? Wondering if that could be related here.

@ascheman
Author

After vagrant destroy -f everything is deleted, but during the loop that only happens if everything went well. Because of the set -e, the loop is aborted when the problem occurs (since the setup/test of the Vagrant box fails at that point). In that case the Vagrant box is left over; you can then log in via vagrant ssh and find the local-kv.db file there (you will probably need sudo su - to become root and get read access to the file inside the VM).
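For example (run from the directory containing the Vagrantfile of the leftover box; the file path is the one from the report above):

# Log into the leftover Vagrant box
vagrant ssh
# Inside the VM: become root to get read access to the store
sudo su -
ls -l /var/lib/docker/network/files/local-kv.db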

@thaJeztah
Member

/cc @mavenugo @aboch could you have a look at this?

@cyphar
Contributor

cyphar commented May 29, 2016

There are other issues with local-kv.db that we've seen in SUSE. For instance, I'm currently debugging a corrupted database that causes Docker to segfault on startup.

thaJeztah added this to the 1.12.0 milestone May 29, 2016
@ascheman
Author

I made my Jenkins run my 'minimal4dockerproblem' setup every 5 minutes, and it has failed with this error 2 out of ~250 times during the last ~24 hours. In more complex Vagrant setups, which I run less frequently, the problem occurred more often. I cannot see any pattern in it; it still looks like a race condition to me.

@tiborvass
Contributor

@aboch does moby/libnetwork#1130 mean that this can be closed?

@aboch
Contributor

aboch commented Jun 27, 2016

@tiborvass Yes, that change plus others is likely to fix this issue, in the sense that the condition needed for the reported problem to happen should no longer be possible.

@tiborvass
Contributor

Okay thanks.

lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 29, 2016
On a percentage of DC/OS agents (~5%) with Docker 1.11.2, the daemon
will fail to start up with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug around the network controller
initialization, where the controller has allocated an ip pool and
persisted some state but not all of it. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface, if it exists, before
starting the Docker daemon. It will need to be re-evaluated if we want to
enable 1.12+ Docker options such as containerd live-restore, as
discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658
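For illustration, a sketch of such a pre-start cleanup (a hypothetical script, not the actual DC/OS change; it assumes the default bridge name docker0 and must run as root before the daemon is started):

#!/bin/bash
# Sketch: remove a leftover docker0 bridge before the Docker daemon starts,
# so the default bridge network and its gateway (172.17.0.1) can be
# allocated again.
if ip link show docker0 > /dev/null 2>&1; then
    ip link set docker0 down
    ip link delete docker0
fi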
@Amey-D

Amey-D commented Jan 13, 2017

Was this issue fixed in v1.11.2?
