Ungraceful daemon shutdown during startup makes the daemon fail to start next time with "Error initializing network controller" #22834

Closed
coolljt0725 opened this issue May 19, 2016 · 1 comment
Labels
area/networking kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed.

Comments

@coolljt0725
Contributor

If the daemon is shut down ungracefully while it is starting, it may fail to start the next time with this error:
FATA[0001] Error starting daemon: Error initializing network controller: Error creating default "bridge" network: failed to allocate gateway (172.17.0.1): Address already in use
We hit this problem on one of our servers. There, the daemon first failed to start with a timeout, because another process was consuming all of the CPU (400% on a 4-core server).
After several timeout failures, the error changed to Error starting daemon: Error initializing network controller: Error creating default "bridge" network: failed to allocate gateway (172.17.0.1): Address already in use. I investigated and found that this may be caused by an ungraceful shutdown happening between https://github.com/docker/docker/blob/master/vendor/src/github.com/docker/libnetwork/controller.go#L507 and https://github.com/docker/docker/blob/master/vendor/src/github.com/docker/libnetwork/controller.go#L535: the controller has allocated the IP pool and saved the bitmap to the store, but the network did not finish initializing and was never saved to the store. The IP pool therefore doesn't belong to any network, so it can't be cleaned up.
It's hard to reproduce. To reproduce it, you can add

if name == "bridge" {
	log.Infof("after ipamAllocate, please kill the daemon")
	time.Sleep(10 * time.Second)
	log.Infof("after sleep")
}

after https://github.com/docker/docker/blob/master/vendor/src/github.com/docker/libnetwork/controller.go#L506, then kill the daemon when the message "after ipamAllocate, please kill the daemon" appears.
Start the daemon again, and it will fail with FATA[0001] Error starting daemon: Error initializing network controller: Error creating default "bridge" network: failed to allocate gateway (172.17.0.1): Address already in use

To fix this, I think we need either a way to clean up allocated IP pools that are unused, or a way to keep the stored bitmap and the stored network in sync.
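One way to picture the first option (cleaning up unused pools) is a reconciliation pass at startup. The sketch below is illustrative only and does not reflect how libnetwork stores its state: `releaseOrphanPools` is a hypothetical helper, and plain maps stand in for the persisted bitmap and network records.

```go
package main

import (
	"fmt"
	"sort"
)

// releaseOrphanPools removes from allocated every pool that no stored
// network references, and returns the released pools in sorted order.
func releaseOrphanPools(allocated map[string]bool, networks map[string]string) []string {
	// Collect the pools that some stored network still owns.
	owned := make(map[string]bool, len(networks))
	for _, pool := range networks {
		owned[pool] = true
	}
	// Anything allocated but not owned is an orphan.
	var released []string
	for pool := range allocated {
		if !owned[pool] {
			released = append(released, pool)
		}
	}
	for _, pool := range released {
		delete(allocated, pool)
	}
	sort.Strings(released)
	return released
}

func main() {
	allocated := map[string]bool{"172.17.0.0/16": true, "10.0.0.0/24": true}
	networks := map[string]string{"custom": "10.0.0.0/24"}
	fmt.Println("released:", releaseOrphanPools(allocated, networks))
}
```

Running such a pass before the default "bridge" network is created would let the orphaned pool be allocated again. The second option (keeping both writes in sync) would avoid creating the orphan in the first place.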

@coolljt0725 coolljt0725 added the kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. label May 19, 2016
@thaJeztah
Member

@coolljt0725 could you open an issue in the libnetwork repository as well? Looks like this is a libnetwork issue

lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 29, 2016
On a percentage of DC/OS agents (~5%) with Docker 1.11.2, the Daemon
will fail to start up with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug around the network controller
initialization, where the controller has allocated an ip pool and
persisted some state but not all of it. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface if it exists before
starting the Docker daemon. This fix will need to be re-evaluated if we
want to enable the 1.12+ containerd live-restore like Docker options as
discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658
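
The workaround described in this commit message (remove docker0 if it exists before starting the daemon) could look roughly like the following. This is a sketch under assumptions, not the code from the referenced commit: it assumes the `ip` utility from iproute2 is installed, `removeIfPresent` and `deleteArgs` are hypothetical helpers, and the `--apply` argument is invented here so that the program only prints the command unless asked to run it.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
)

// deleteArgs returns the iproute2 command line that removes a link.
func deleteArgs(iface string) []string {
	return []string{"ip", "link", "delete", iface}
}

// removeIfPresent deletes iface when exists reports that it is present.
// exists and run are injected so the logic can be exercised without root.
func removeIfPresent(iface string, exists func(string) bool, run func([]string) error) (bool, error) {
	if !exists(iface) {
		return false, nil
	}
	if err := run(deleteArgs(iface)); err != nil {
		return false, fmt.Errorf("removing %s: %w", iface, err)
	}
	return true, nil
}

func main() {
	exists := func(name string) bool {
		_, err := net.InterfaceByName(name)
		return err == nil
	}
	// Dry run by default: only print the command that would be executed.
	run := func(args []string) error {
		fmt.Println("would run:", args)
		return nil
	}
	if len(os.Args) > 1 && os.Args[1] == "--apply" {
		run = func(args []string) error {
			return exec.Command(args[0], args[1:]...).Run()
		}
	}
	removed, err := removeIfPresent("docker0", exists, run)
	fmt.Println("removed:", removed, "err:", err)
}
```

Deleting the interface requires root privileges, and it must happen before the daemon is started.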
lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 30, 2016
lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 31, 2016
lingmann pushed a commit to lingmann/dcos that referenced this issue Sep 1, 2016
mellenburg pushed a commit to mellenburg/dcos that referenced this issue Sep 2, 2016