Docker 1.10.3 fails to create network bridge on start up #20312
docker version:
docker info:
@ascheman Just to confirm if this is related to #18113, do you see this error in the daemon log:
I was setting the reference to #18113 since the problem was caused by the file local-kv.db which was left over. In fact, the problem occurs during installation of Docker and the first startup of the daemon through the Puppet module. During that time I get the following log in /var/log/syslog:
BTW: If I suspend (save state in VBox) and relaunch the VM (Start) after this problem occurs, everything seems to be OK (Docker is up and running after restart).
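For anyone hitting the same leftover-file state, the #18113-style cleanup discussed here can be sketched as a small shell snippet. The path comes from this thread; the service commands assume Ubuntu 14.04's upstart packaging and are commented out, and the directory is parameterized purely so the sketch stays harmless to run:

```shell
# Workaround sketch: remove the stale libnetwork key-value store so the
# daemon can rebuild its bridge state on the next start.
# On a real host the directory is /var/lib/docker/network/files; it is
# overridable here only for illustration.
DOCKER_NET_DIR="${DOCKER_NET_DIR:-/var/lib/docker/network/files}"

# service docker stop                  # stop the daemon first (needs root)
rm -f "${DOCKER_NET_DIR}/local-kv.db"  # -f: no error if the file is absent
# service docker start                 # then start the daemon again
echo "cleaned ${DOCKER_NET_DIR}/local-kv.db"
```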
Just calling
Last night the variant with 4096 MB for the VM also built successfully once. So my thesis that it has something to do with the memory of the VM has been proven wrong ... Maybe it also depends on the load of the hosting server (at that time it was the only build job on the physical server)?
@ascheman yes. Could you please share the full daemon log during bootup? Also, can you share the docker daemon configuration?
@mavenugo I will attach a ZIP with two tars, one of a non-working instance and one of a working one. Both contain the docker.log, some other logs, and the docker configuration. As I pointed out earlier, I am setting up Docker on fresh VM instances. Sometimes the initial installation fails due to the mentioned problem. Then I stop the complete setup of the VM (that state is what you will get from the one tar in the ZIP). If the installation works well, other steps are performed afterwards (so the logs become much longer; this is the case for the other tar).
BTW: During the mentioned builds, the one which was working had a VM RAM size of 4 GB while the non-working one had only 2 GB.
I've experienced the same problem with Docker 1.10 and 1.10.1. We're doing hourly and triggered CI runs and we see about a 30-40% failure rate due to this issue since upgrading to Docker 1.10. docker version:
/var/log/upstart/docker.log:
@Dvorak Probably unrelated to the issue, but something is really weird with your setup: there is no
That was auto-correct being stupid on cut and paste, it actually says aufs in the logs. |
@ascheman If you check you debug level logs on daemon start, you should see one like:
before this one is printed:
Can you please confirm you see the above two logs? Thanks.
We actually use the default pool 172.16.0.0/16 and 172.17.0.0/17 in our development environments. Debug startup logs:
Also, if I delete the docker0 bridge (ip link del docker0), then the daemon starts up fine afterwards. We have a workaround for the issue, but that doesn't help in CI environments where this happens unpredictably.
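A defensive version of that bridge-deletion workaround might look like this (iproute2 syntax; deleting the link needs root on a real host, and the daemon restart is left commented out since the service command is host-specific):

```shell
# Drop a stale docker0 bridge so the daemon can recreate it, and with it
# the gateway IP allocation, from scratch on the next start.
if ip link show docker0 >/dev/null 2>&1; then
    ip link set docker0 down
    ip link delete docker0
    echo "deleted stale docker0 bridge"
else
    echo "no docker0 bridge present"
fi
# service docker restart    # restart the daemon afterwards
```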
I will look into this as soon as I have access to my test environment again during the next days. The problem is that I just run a lot of CI jobs which build up a Vagrant/VirtualBox environment with Ubuntu 14.04 LTS. These boxes set Docker up via a Puppet module (https://github.com/garethr/garethr-docker by @garethr) which simply calls a dpkg install, as far as I can see. Making any change to this setup with Docker debugging switched on would require preparing a patched version of the Puppet module and the respective infrastructure to inject it into my provisioning chain. I am not that deep into Puppet and do not have the time to prepare this in the short term. Maybe @garethr can assist here? In general it seems to be the same problem as @Dvorak pointed out. On 22.02.2016, at 20:56, aboch notifications@github.com wrote:
Gerd Aschemann --- Publishing means changing (Carmen Thomas)
@Dvorak: As I pointed out, my CI setup is driven by a Puppet module. However, my workaround is the following; maybe it also works in your case?
The mentioned
If it does not work after n (=5) retries, I give up and stop the overall CI box setup. Usually this works (sometimes with 3 or 4 retries, but it works). On 23.02.2016, at 15:43, dvorak notifications@github.com wrote:
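The retry loop described above might look roughly like the POSIX sh sketch below. The `retry` helper and the cleanup path are illustrative, not the author's actual script, and the final call uses a harmless stand-in command instead of a real daemon start so the sketch can run anywhere:

```shell
# Retry a command up to N times, as in the n=5 workaround above, and
# give up after the last failure.
retry() {
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        if "$@"; then
            echo "succeeded on attempt $i"
            return 0
        fi
        # On a real host, clean up between attempts, e.g.:
        # rm -f /var/lib/docker/network/files/local-kv.db
        # service docker restart
        i=$((i + 1))
    done
    echo "giving up after $max attempts"
    return 1
}

# Stand-in demo; replace "true" with the real daemon start + health check.
retry 5 true
```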
@ascheman you can pass arbitrary options to the docker daemon on startup using the
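On Ubuntu 14.04's upstart packaging, daemon options typically go into /etc/default/docker. A minimal config sketch (the --debug flag is what produces the debug-level daemon logs requested earlier in the thread):

```shell
# /etc/default/docker  (sourced by the upstart job on Ubuntu 14.04)
# Extra options handed to the docker daemon on startup:
DOCKER_OPTS="--debug"
```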
@garethr: Thanks for the hint. @aboch: I added two docker debug logs, one from an instance that failed and one from a working one.
Thanks @ascheman for the debug logs. The fact that the following logs are missing when docker reboots points to the issue:
The default bridge network's gateway IP allocated during the previous daemon instance is not being removed, hence the subsequent allocation failure. Thanks for now,
Running docker with latest libnetwork using @ascheman's
(This log was not present in the docker/docker 1.10.1 code.) Not sure how we ended up in this state where the network object is present but not the corresponding endpoint_count object. (Maybe because of a well-timed ungraceful shutdown of the daemon?) We'll check if some code change can be done to avoid, or limit, the chances of getting into this scenario.
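For anyone who wants to inspect a leftover store: local-kv.db is a BoltDB file, so a crude binary-safe grep can at least show whether the record names discussed above appear in it. This is a hypothetical diagnostic helper (the `scan_kv` name is made up, and since BoltDB is binary this is only a heuristic presence check, not a real parser):

```shell
# Heuristic scan of a leftover local-kv.db for the record names discussed
# above (network vs. endpoint_count).
scan_kv() {
    db=$1
    if [ ! -f "$db" ]; then
        echo "$db not found"
        return 1
    fi
    for key in network endpoint_count; do
        if grep -aq "$key" "$db"; then
            echo "$key: found"
        else
            echo "$key: missing"
        fi
    done
}

# Example: point it at the store from this thread ("|| true" only so the
# sketch exits cleanly on machines where the file does not exist).
scan_kv /var/lib/docker/network/files/local-kv.db || true
```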
In my case the sequence is as follows:
There are no manual steps for me. This doesn't happen 100% of the time.
Same here, sometimes it works, sometimes not ... On 05.03.2016, at 13:00, dvorak notifications@github.com wrote:
The problem still exists with docker 1.10.3. Just run
@ascheman I see a PR was just merged to resolve this issue, and it will be included in docker 1.11
Thanks for the info, I was thinking it was resolved with 1.10.3 ... so I will wait for the next minor release. On 18.03.2016, at 12:11, Sebastiaan van Stijn notifications@github.com wrote:
We usually use the wiki for that: https://github.com/docker/docker/wiki. But the project is always moving fast, so sometimes it may be out of date, or not yet include longer-term goals.
@ascheman could you please try a master build to make sure this is in fact fixed?
It is not a question of running the test once. I have set up a Jenkins job building a Vagrant machine with Docker on Ubuntu every few minutes; ~15% of the builds fail due to this bug. The install uses the following chain/nested setup: Vagrant -> Ubuntu -> Puppet -> Docker Debian package (see my mentioned minimal Puppet install setup). If you could provide me with a current master build as an Ubuntu/Debian package, I could easily fold it into my regular test build. On the other hand, you could easily run this on your own if you have some Continuous Integration set up for Docker? This would allow for permanent testing of the minimal Docker setup as well as for more sophisticated tests (compositions of the different Docker tools etc.). I wonder if a project like Docker does not use CI? On 18.03.2016, at 17:41, Madhu Venugopal notifications@github.com wrote:
I seem to have the #18113 issue with Docker 1.10.1!
But it also occurs mostly when using a virtual machine (VirtualBox) with more than 3 GB of memory. Then startup of Docker fails most of the time and leaves a
/var/lib/docker/network/files/local-kv.db
behind. As long as the file exists, I cannot start Docker. As soon as I remove it, everything is fine. I have some more hints to reproduce the problem below. I have the following setup:
Docker version 1.10.1, build 9e83765
I have a Jenkins starting some of those virtual machines. Most of them require only 2048 MB of assigned RAM. But since yesterday I needed some larger instances and set the memory size for one of them to 4096 MB. This one almost always fails to start Docker and leaves the mentioned local-kv.db. Why only "almost", why not always?
I was trying around with some ideas, e.g. running the Puppet install under strace:
strace -f -o /tmp/docker-problem -ttt puppet apply docker-setup.pp
When run with strace the problem never occurred, though I was also never able to see access to the local-kv.db (I guess this happens in the container?). Maybe it's something completely different and has nothing to do with memory sizes?
I attach one of the local-kv.db (a bzip2 in a zip) for further investigation:
local-kv.db.zip