
Docker 1.10.3 fails to create network bridge on start up #20312

Closed
ascheman opened this issue Feb 14, 2016 · 30 comments · Fixed by #21261

@ascheman

I seem to have the #18113 issue with Docker 1.10.1!

But it also occurs mostly when using a virtual machine (VirtualBox) with more than 3 GB of memory. Startup of Docker then fails most of the time and leaves a stale /var/lib/docker/network/files/local-kv.db behind. As long as the file exists, I cannot start Docker. As soon as I remove it, everything is fine. I have some more hints for reproducing the problem below.

I have the following setup:

  • Host:
  • Create a Vagrant virtual machine with VirtualBox:
  • Inside of the VM
    • Install Puppet (3.4.3) with some modules (apt, docker, epel, etckeeper, stdlib)
    • Start Docker via Puppet manifest: latest Puppet-Docker (garethr-docker) sets up Docker version 1.10.1, build 9e83765

I have a Jenkins starting some of those virtual machines. Most of them require only 2048 MB of assigned RAM. But since yesterday I needed some larger instances and set the memory size for one of them to 4096 MB. This one almost always fails to start Docker and leaves the mentioned local-kv.db. Why only "almost", why not always?

I was trying around with some ideas:

  • At first I was running the Puppet-based setup with strace, something like strace -f -o /tmp/docker-problem -ttt puppet apply docker-setup.pp. When run with strace the problem never occurred, though I was also never able to see access to the local-kv.db (I guess this happens in the container?).
  • Then I was trying with different memory sizes:
    • 3072 MB: always works
    • 3584 MB: worked sometimes (1 out of 3)
    • 3800 MB: failed (only one try)
    • 4095 MB (trying to stay slightly below 4 GB): worked (only one try)
    • 4096 MB: always fails

Maybe it's something completely different and has nothing to do with memory sizes?

I attach one of the local-kv.db files (a bzip2 inside a zip) for further investigation:
local-kv.db.zip
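The symptom above can be checked mechanically. A minimal sketch (the path comes from this report; `kv_store_state` is a name of my own, not a Docker tool):

```shell
#!/bin/bash
# Report whether the leftover libnetwork key-value store is present.
# The reporter observes: Docker starts only when this file is absent.
kv_store_state() {
    # default path taken from the report; pass another path for testing
    local kv="${1:-/var/lib/docker/network/files/local-kv.db}"
    if [ -e "$kv" ]; then
        echo "stale"
    else
        echo "clean"
    fi
}
```

On an affected VM, seeing `stale` before `service docker start` matches the failure mode described here; removing the file brings it back to `clean`.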

@ascheman
Author

docker version:

Client:
 Version:      1.10.1
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   9e83765
 Built:        Thu Feb 11 19:27:08 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.1
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   9e83765
 Built:        Thu Feb 11 19:27:08 2016
 OS/Arch:      linux/amd64

docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.10.1
Storage Driver: devicemapper
 Pool Name: docker-8:1-267070-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 305.7 MB
 Data Space Total: 107.4 GB
 Data Space Available: 40.35 GB
 Metadata Space Used: 733.2 kB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.147 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.77 (2012-10-15)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: null host bridge
Kernel Version: 3.13.0-77-generic
Operating System: Ubuntu 14.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.861 GiB
Name: devopssquare-full-strace
ID: 45XZ:LNXX:D4SZ:TNGL:B3IF:NCWI:USNB:WFJF:7JSQ:ZCTP:7C5C:HWTD
WARNING: No swap limit support

@mavenugo
Contributor

@ascheman Just to confirm if this is related to #18113, do you see this error in the daemon log : failed to allocate gateway (172.17.0.1): No available addresses on this pool ?
Thanks for sharing the local-kv.db file. Can you also share the daemon bootup logs ?
FYI there can be other cases (especially when the db is put in an inconsistent state). Hence we cannot assume it is #18113 without knowing all the details.

@ascheman
Author

I added the reference to #18113 because the problem was caused by the leftover local-kv.db file. In fact, the problem occurs during the installation of Docker and the first startup of the daemon through the Puppet module.

During that time I get the following log in /var/log/syslog:

Feb 14 17:21:47 vagrant-ubuntu-trusty-64 kernel: [  109.602706] bio: create slab <bio-1> at 1
Feb 14 17:21:47 vagrant-ubuntu-trusty-64 kernel: [  109.610923] bio: create slab <bio-2> at 2
Feb 14 17:21:48 vagrant-ubuntu-trusty-64 kernel: [  110.780651] Bridge firewalling registered
Feb 14 17:21:48 vagrant-ubuntu-trusty-64 kernel: [  110.789882] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Feb 14 17:21:48 vagrant-ubuntu-trusty-64 kernel: [  110.804419] ip_tables: (C) 2000-2006 Netfilter Core Team
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  110.972998] init: docker main process (12420) terminated with status 2
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.188388] init: docker main process (12642) terminated with status 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.188394] init: docker main process ended, respawning
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.284999] bio: create slab <bio-1> at 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.292451] bio: create slab <bio-2> at 2
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.359650] init: docker main process (12717) terminated with status 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.359656] init: docker main process ended, respawning
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.462584] bio: create slab <bio-1> at 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.481806] bio: create slab <bio-2> at 2
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.539345] init: docker main process (12841) terminated with status 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.539351] init: docker main process ended, respawning
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.611057] bio: create slab <bio-1> at 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.616451] bio: create slab <bio-2> at 2
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.659566] init: docker main process (12924) terminated with status 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.659583] init: docker main process ended, respawning
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.705675] bio: create slab <bio-1> at 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.711428] bio: create slab <bio-2> at 2
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.754647] init: docker main process (13002) terminated with status 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.754654] init: docker main process ended, respawning
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.799799] bio: create slab <bio-1> at 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.805659] bio: create slab <bio-2> at 2
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.849109] init: docker main process (13079) terminated with status 1
Feb 14 17:21:49 vagrant-ubuntu-trusty-64 kernel: [  111.849128] init: docker main process ended, respawning
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  111.899541] bio: create slab <bio-1> at 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  111.904846] bio: create slab <bio-2> at 2
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  111.948785] init: docker main process (13156) terminated with status 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  111.948791] init: docker main process ended, respawning
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  111.996065] bio: create slab <bio-1> at 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.002065] bio: create slab <bio-2> at 2
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.044233] init: docker main process (13233) terminated with status 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.044255] init: docker main process ended, respawning
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.092092] bio: create slab <bio-1> at 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.098328] bio: create slab <bio-2> at 2
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.151326] init: docker main process (13310) terminated with status 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.151333] init: docker main process ended, respawning
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.196786] bio: create slab <bio-1> at 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.202399] bio: create slab <bio-2> at 2
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.245816] init: docker main process (13387) terminated with status 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.245837] init: docker main process ended, respawning
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.293325] bio: create slab <bio-1> at 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.298044] bio: create slab <bio-2> at 2
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.341466] init: docker main process (13464) terminated with status 1
Feb 14 17:21:50 vagrant-ubuntu-trusty-64 kernel: [  112.341472] init: docker respawning too fast, stopped

BTW: If I suspend (save state in VBox) and relaunch the VM (Start) after this problem occurs, everything seems to be OK (Docker is up and running after restart).

@ascheman
Author

Just calling service docker start (before restart of the VM) always shows the mentioned logs and docker aborts.
Starting docker manually with docker daemon prints the following:

WARN[0000] devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section. 
WARN[0000] devmapper: Base device already exists and has filesystem ext4 on it. User specified filesystem  will be ignored. 
INFO[0000] [graphdriver] using prior storage driver "devicemapper" 
INFO[0000] Graph migration to content-addressability took 0.00 seconds 
INFO[0000] Firewalld running: false                     
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address 
FATA[0000] Error starting daemon: Error initializing network controller: Error creating default "bridge" network: failed to allocate gateway (172.17.0.1): Address already in use 

@thaJeztah thaJeztah added this to the 1.10.2 milestone Feb 14, 2016
@ascheman
Author

Last night the variant with 4096 MB for the VM also built successfully once. So my thesis that it has something to do with the VM's memory has been proven wrong ... Maybe it also depends on the load of the hosting server (at that time it was the only build job on the physical server)?

@mavenugo
Contributor

@ascheman yes. could you pls share the full daemon log during bootup ?
Can you also make sure there aren't multiple docker daemons running at the same time, sharing the graph folder (-g) and corrupting the db?
Please read my earlier comment on this topic : #17939 (comment)

Also, can you share the docker daemon configuration ?

@ascheman
Author

@mavenugo I will attach a ZIP with two tars, one of a non-working instance and one of a working one. Both contain the docker.log, some other logs and the docker configuration.

As I pointed out earlier, I am setting up Docker on fresh VM instances. Sometimes the initial installation fails due to the mentioned problem. Then I stop the complete setup of the VM (the state you will get from the one tar in the ZIP). If the installation works well, other steps are performed afterwards (so the logs become much longer, this is the case for the other tar).

docker-differences.zip

@ascheman
Author

BTW: During the mentioned builds the one which was working had a VM RAM size of 4GB while the non-working one had only 2 GB.

@tiborvass tiborvass modified the milestones: 1.10.3, 1.10.2 Feb 19, 2016
@claytono

I've experienced the same problem with Docker 1.10 and 1.10.1. We're doing hourly and triggered CI runs and we see about a 30-40% failure rate due to this issue since upgrading to Docker 1.10.

docker version:

Client:
 Version:      1.10.1
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   9e83765
 Built:        Thu Feb 11 19:27:08 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.1
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   9e83765
 Built:        Thu Feb 11 19:27:08 2016
 OS/Arch:      linux/amd64
docker info:
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.10.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 0
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-67-generic
Operating System: Ubuntu 14.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.548 GiB
Name: dev01-build-001
ID: YGVH:QXXZ:YE7E:BBBO:HMOA:HD6Z:6HJB:DYWN:DGN3:DNLX:5HWA:FNTC
WARNING: No swap limit support

/var/log/upstart/docker.log:

Waiting for /var/run/docker.sock
INFO[0000] Graph migration to content-addressability took 0.00 seconds 
WARN[0000] Running modprobe bridge br_netfilter failed with message: modprobe: WARNING: Module br_netfilter not found.
insmod /lib/modules/3.13.0-67-generic/kernel/net/llc/llc.ko 
insmod /lib/modules/3.13.0-67-generic/kernel/net/802/stp.ko 
insmod /lib/modules/3.13.0-67-generic/kernel/net/bridge/bridge.ko 
, error: exit status 1 
/var/run/docker.sock is up
INFO[0000] Firewalld running: false                     
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address 
INFO[0000] [graphdriver] using prior storage driver "auks" 
INFO[0000] Graph migration to content-addressability took 0.00 seconds 
INFO[0000] Firewalld running: false                     
WARN[0000] Could not find endpoint count key docker/network/v1.0/endpoint_count/3b8db9905f9ebe4fba7733b6b29b2e2a61deb5c990d4314362442c9f9ac5ef50/ for network bridge while listing: Key not found in store 
WARN[0000] could not find endpoint count key docker/network/v1.0/endpoint_count/3b8db9905f9ebe4fba7733b6b29b2e2a61deb5c990d4314362442c9f9ac5ef50/ for network bridge while listing: Key not found in store 
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address 
FATA[0000] Error starting daemon: Error initializing network controller: Error creating default "bridge" network: failed to allocate gateway (172.18.0.1): Address already in use 
INFO[0000] [graphdriver] using prior storage driver "auks" 
INFO[0000] Graph migration to content-addressability took 0.00 seconds 
INFO[0000] Firewalld running: false                     
WARN[0000] Could not find endpoint count key docker/network/v1.0/endpoint_count/3b8db9905f9ebe4fba7733b6b29b2e2a61deb5c990d4314362442c9f9ac5ef50/ for network bridge while listing: Key not found in store 
WARN[0000] could not find endpoint count key docker/network/v1.0/endpoint_count/3b8db9905f9ebe4fba7733b6b29b2e2a61deb5c990d4314362442c9f9ac5ef50/ for network bridge while listing: Key not found in store 
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address 
FATA[0000] Error starting daemon: Error initializing network controller: Error creating default "bridge" network: failed to allocate gateway (172.18.0.1): Address already in use 
INFO[0000] [graphdriver] using prior storage driver "auks" 
INFO[0000] Graph migration to content-addressability took 0.00 seconds 
INFO[0000] Firewalld running: false

@tiborvass
Contributor

@Dvorak Probably unrelated to the issue, but something is really weird with your setup: there is no auks storage driver...

@claytono

That was auto-correct being stupid on cut and paste, it actually says aufs in the logs.

@aboch
Contributor

aboch commented Feb 22, 2016

Thanks @ascheman @Dvorak for the logs, but one request: In order to debug this further, we need the debug level logs (start daemon with the -D option).

@aboch
Contributor

aboch commented Feb 22, 2016

@ascheman
I checked the local-kv.db file you provided. I could not find any anomalies.
The bitsequence structure backing the allocated IP addresses is correct.
The network object referring to the default network has the correct IP information (including network gateway IP).
Therefore on daemon start, when the default network is being deleted, the network gateway and the 172.17.0.0/16 pool should be released and later reallocated.

If you check your debug-level logs on daemon start, you should see a line like:

releasing IPv4 pools from network <name> (<ID>)

before this one is printed:

Allocating IPv4 pools for network <name> (<ID>)

Can you please confirm you see the above two logs. Thanks.
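A quick way to answer this from a captured debug log (a sketch; `count_pool_logs` is my own helper name, and the log location depends on where the `-D` output was captured — on Ubuntu 14.04 with upstart it is typically /var/log/upstart/docker.log):

```shell
#!/bin/bash
# Count the release/allocate pool log lines in a saved daemon debug log.
# On a healthy restart both appear, with the release logged first.
count_pool_logs() {
    local log="$1"
    printf 'release: %s\n' "$(grep -c 'releasing IPv4 pools from network' "$log")"
    printf 'allocate: %s\n' "$(grep -c 'Allocating IPv4 pools for network' "$log")"
}
```

A `release: 0` next to `allocate: 1` would match the missing-release behavior under discussion.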

@claytono

We actually use the default pools 172.16.0.0/16 and 172.17.0.0/17 in our development environments.

Debug startup logs:

DEBU[0000] docker group found. gid: 999
DEBU[0000] Server created for HTTP on unix (/var/run/docker.sock)
DEBU[0000] Using default logging driver json-file
INFO[0000] [graphdriver] using prior storage driver "aufs"
DEBU[0000] Using graph driver aufs
INFO[0000] Graph migration to content-addressability took 0.00 seconds
DEBU[0000] Option DefaultDriver: bridge
DEBU[0000] Option DefaultNetwork: bridge
INFO[0000] Firewalld running: false
DEBU[0000] /sbin/iptables, [--wait -t nat -D PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL ! --dst 127.0.0.0/8 -j DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t nat -D PREROUTING]
DEBU[0000] /sbin/iptables, [--wait -t nat -D OUTPUT]
DEBU[0000] /sbin/iptables, [--wait -t nat -F DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t nat -X DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t filter -F DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t filter -X DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t filter -F DOCKER-ISOLATION]
DEBU[0000] /sbin/iptables, [--wait -t filter -X DOCKER-ISOLATION]
DEBU[0000] /sbin/iptables, [--wait -t nat -n -L DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t nat -N DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t filter -n -L DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t filter -N DOCKER]
DEBU[0000] /sbin/iptables, [--wait -t filter -n -L DOCKER-ISOLATION]
DEBU[0000] /sbin/iptables, [--wait -t filter -N DOCKER-ISOLATION]
DEBU[0000] /sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -j RETURN]
DEBU[0000] /sbin/iptables, [--wait -I DOCKER-ISOLATION -j RETURN]
DEBU[0000] Registering ipam driver: "default"
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address
DEBU[0000] Allocating IPv4 pools for network bridge (52cad9c52af1fe0ee72089bbd2a5fe10055084c14d47d524159c59051312739f)
DEBU[0000] RequestPool(LocalDefault, 172.18.0.0/16, , map[], false)
DEBU[0000] RequestAddress(LocalDefault/172.18.0.0/16, 172.18.0.1, map[RequestAddressType:com.docker.network.gateway])
DEBU[0000] ReleasePool(LocalDefault/172.18.0.0/16)
DEBU[0000] Cleaning up old shm/mqueue mounts: start.
FATA[0000] Error starting daemon: Error initializing network controller: Error creating default "bridge" network: failed to allocate gateway (172.18.0.1): Address already in use

@claytono

Also, if I delete the docker0 bridge (ip link del docker0), the daemon starts up fine afterwards. We have a workaround for the issue, but that doesn't help in CI environments where this is happening unpredictably.

@ascheman
Author

I will look into this as soon as I have access to my test environment again during the next days.

The problem will be that I just run a lot of CI jobs which build up a Vagrant/VirtualBox environment with Ubuntu 14.04 LTS. These boxes set up Docker via a Puppet module (https://github.com/garethr/garethr-docker by @garethr) which simply calls a dpkg install as far as I can see. Making any change to this setup with Docker debugging switched on would require preparing a patched version of the Puppet module and the respective infrastructure to inject it into my provisioning chain. I am not that deep into Puppet and do not have the time to prepare this on short notice. Maybe @garethr can assist here?

In general it seems to be the same problem as @Dvorak pointed out.

Gerd Aschemann --- Veröffentlichen heißt Verändern (Carmen Thomas)
+49/173/3264070 -- gerd@aschemann.net -- http://www.aschemann.net

@ascheman
Author

@Dvorak: As I pointed out, my CI setup is driven by a Puppet module. However, my workaround is the following, maybe it also works in your case?

# Workaround for https://github.com/docker/docker/issues/20312
sleeptime=10
maxtries=5
try=0
echo "Sleeping ${sleeptime}s until Docker is up and running (first)"
sleep $sleeptime
until $dir/test.sh || test $try -ge $maxtries
do  sudo /bin/rm -f /var/lib/docker/network/files/local-kv.db
    sudo service docker start
    try=`expr $try + 1`
    echo "Sleeping ${sleeptime}s until Docker is up and running (try #$try)"
    sleep $sleeptime
done
echo "Sleeping ${sleeptime}s until Docker is up and running (final)"
sleep $sleeptime
$dir/test.sh

The mentioned test.sh in $dir is simply

#!/bin/bash

set -eu

if status docker 2>/dev/null | grep -q 'process '; then
    echo "Docker is up and running"
else
    echo "Docker is not running!" >&2
    exit 1
fi

If it does not work after n (=5) retries I give up and stop the overall CI box setup. Usually this works (sometimes with 3 or 4 retries, but it works).
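The retry pattern in the script above can be factored into a small helper (a sketch; `retry` and its argument order are my own, and on a real host the kv-file removal and service restart would still live in the loop body):

```shell
#!/bin/bash
# Run a command until it succeeds or the attempt budget is spent.
# Usage: retry <max-tries> <sleep-seconds> <command> [args...]
retry() {
    local max="$1" delay="$2"
    shift 2
    local try=0
    until "$@"; do
        try=$((try + 1))
        if [ "$try" -ge "$max" ]; then
            return 1
        fi
        sleep "$delay"
    done
    return 0
}
```

With the script's values, `retry 5 10 "$dir/test.sh"` corresponds to maxtries=5 / sleeptime=10.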


@garethr
Contributor

garethr commented Feb 23, 2016

@ascheman you can pass arbitrary options to the docker daemon on startup using the extra_parameters param on the docker class, mentioned here: https://github.com/garethr/garethr-docker/blob/master/manifests/init.pp#L158-L160. These end up in the docker engine config here: https://github.com/garethr/garethr-docker/blob/master/templates/etc/conf.d/docker.erb#L43. That should allow you to enable debugging with the -D flag.
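For reference, on Ubuntu 14.04 (upstart) those extra parameters end up in /etc/default/docker; a minimal fragment that turns on debug logging would look like this (a sketch of the effect, not the module's exact template output):

```shell
# /etc/default/docker -- read by the upstart job before starting the daemon.
# -D enables the debug-level daemon logs requested in this thread.
DOCKER_OPTS="-D"
```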

@ascheman
Author

ascheman commented Mar 4, 2016

@garethr: Thanks for the hint

@aboch: Added two docker debug logs, one from an instance that failed and one from a working one.
docker-logs-20160304.zip

@tiborvass tiborvass modified the milestones: 1.11.0, 1.10.3 Mar 4, 2016
@aboch
Contributor

aboch commented Mar 5, 2016

Thanks @ascheman for the debug logs.
They confirm you are hitting the same issue reported by @Dvorak.

The fact that the following logs are missing when docker reboots is a hint to the issue:

DEBU[0001] releasing IPv4 pools from network bridge (<id>) 
DEBU[0001] ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.1) 
DEBU[0001] ReleasePool(LocalDefault/172.17.0.0/16)

The default bridge network's gateway IP allocated during the previous daemon instance is not being released, hence the subsequent allocation failure.

Thanks for now,
will let you know when we find something or if we need more info.

@aboch
Contributor

aboch commented Mar 5, 2016

Running docker with the latest libnetwork using @ascheman's local-kv.db file (thanks @ascheman for providing the file and for the extra logs) explains why the daemon does not find the bridge network, and therefore does not delete it at https://github.com/docker/docker/blob/master/daemon/daemon_unix.go#L596

DEBU[0000] Registering ipam driver: "default"           
DEBU[0000] Registering ipam driver: "null"              
WARN[0000] Could not find endpoint count key docker/network/v1.0/endpoint_count/a18d34c84f16467a2555bb3c834f192a567b4a6a14692a11a728e482cd4bdcbf/ for network bridge while listing: Key not found in store 
WARN[0000] could not find endpoint count key docker/network/v1.0/endpoint_count/a18d34c84f16467a2555bb3c834f192a567b4a6a14692a11a728e482cd4bdcbf/ for network bridge while listing: Key not found in store 

(This log was not present in docker/docker 1.10.1 code)

Not sure how we ended up in this state where the network object is present but the corresponding endpoint_count object is not. (Maybe because of a well-timed ungraceful shutdown of the daemon?)

We'll check if some code change can be done to avoid this scenario, or at least limit the chances of getting into it.
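One rough way to compare those two key families in a copy of local-kv.db (an inspection heuristic of my own, assuming the keys are stored as plain byte strings in the BoltDB file so a binary-tolerant grep can find them; not a real parser):

```shell
#!/bin/bash
# Count network vs endpoint_count keys in a copied local-kv.db.
# A network entry without a matching endpoint_count entry reproduces
# the inconsistent state described above.
count_kv_keys() {
    local db="$1" nets counts
    nets=$(grep -a -o 'docker/network/v1.0/network/' "$db" | wc -l)
    counts=$(grep -a -o 'docker/network/v1.0/endpoint_count/' "$db" | wc -l)
    printf 'networks: %d\nendpoint_counts: %d\n' "$nets" "$counts"
}
```

On a healthy store the two counts should match; a surplus of network keys points at the orphaned-network state.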

@claytono

claytono commented Mar 5, 2016

In my case the sequence is such:

  1. Package is installed by puppet.
  2. Docker engine started by packaging
  3. Puppet changes /etc/default/docker
  4. Puppet stops docker engine
  5. Puppet starts docker engine
  6. Start fails.

There are no manual steps for me. This doesn't happen 100% of the time.

@ascheman
Author

ascheman commented Mar 5, 2016

Same here, sometimes it works, sometimes not ...


@mavenugo
Contributor

mavenugo commented Mar 5, 2016

@Dvorak @ascheman for those of us who are not familiar with Puppet, it would be good if you could make it easier by providing the scripts and steps to reproduce it (with the most bare-minimal steps). The above steps are useful, but we don't know everything it takes to reproduce this exact same scenario.

@unclejack unclejack changed the title Re-Open #18113 for Docker 1.10.1 Docker 1.10.1 fails to create network bridge on start up Mar 7, 2016
ascheman added a commit to ascheman/minimal4dockerproblem that referenced this issue Mar 18, 2016
@ascheman
Author

The problem still exists with Docker 1.10.3.
I have provided a minimal setup in https://github.com/ascheman/minimal4dockerproblem

Just run vagrant up (maybe several times to reproduce)

@ascheman ascheman changed the title Docker 1.10.1 fails to create network bridge on start up Docker 1.10.3 fails to create network bridge on start up Mar 18, 2016
@thaJeztah
Member

@ascheman I see a PR was just merged to resolve this issue, and will be included in docker 1.11

@ascheman
Author

Thanks for the info, I was thinking it was resolved with 1.10.3 ... So I will wait for the next minor release.
BTW: is there a release roadmap somewhere?


@thaJeztah
Member

We usually use the wiki for that; https://github.com/docker/docker/wiki. But the project is always moving fast, so sometimes it may be out of date, or not yet include longer term goals

@mavenugo
Contributor

@ascheman could you please try a master build to make sure this is in fact fixed?

@ascheman
Author

It is not a question of running the test once. I have set up a Jenkins job building a Vagrant machine with Docker on Ubuntu every few minutes. ~15% of the builds fail due to this bug.

The install uses the following chain/nested setup: Vagrant->Ubuntu->Puppet->Docker Debian package (see my mentioned minimal Puppet install setup). If you could provide me with a current master build as an Ubuntu/Debian package, I could easily fold it into my regular test build. On the other hand, you could easily run this yourselves if you have some continuous integration set up for Docker. This would allow for permanent testing of the minimal Docker setup as well as more sophisticated tests (compositions of the different Docker tools, etc.). I wonder whether a project like Docker doesn't use CI?


7 participants