Lots of “failed to allocate gateway: Address already in use” in Docker CE 17.09.0 #35204
Comments
Swarm-managed networks are dynamically extended to the nodes where a task is deployed, so that explains why the network was not found on those two nodes. As for the “unable to allocate” error: does that IP range possibly overlap with a physical network on those two nodes?
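(A quick way to run that overlap check — a sketch, reusing the `rms_adjunct` network name that appears later in this thread:)

```bash
# Print the subnet(s) allocated to the overlay network...
docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}} {{end}}' rms_adjunct

# ...and compare them against the host's own routes and interface addresses.
ip route
ip -4 addr show
```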
Thanks for confirming that networks are dynamically extended to nodes, @thaJeztah. And I've confirmed that the IP range doesn't overlap with a physical network on those two nodes. I've been playing around with this some more, and I've narrowed the problem down to networks not being removed from all nodes when stacks are removed. If I make a new deployment with a stack named rms-ci-config and then remove it, networks are left behind on some nodes. Output of `docker stack rm`:
Lingering networks on some nodes 5 minutes after the stack is removed:
The other nodes don't have any leftovers. Redeploying rms-ci-config now shows “unable to allocate” errors (trimmed to a few examples):

Digging into one of the failing tasks:

If I remove the stack, manually remove any leftover networks on the nodes, and redeploy, then the deployment succeeds without any “unable to allocate” errors. I'm really confused as to why this is happening.
I've ended up adding the following cleanup to all my CI deploy jobs after removing the stack:
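(The exact snippet isn't captured above; a minimal sketch of this kind of cleanup step, assuming SSH access to each node — the node list and stack name are examples, not taken from the original job:)

```bash
#!/usr/bin/env bash
# Hypothetical CI cleanup step: after removing a stack, visit each node and
# delete any of the stack's networks that were left behind.
set -euo pipefail

STACK="rms-ci-config"
NODES="itrmsdev01 itrmsdev02 itrmsdev03 itrmsdev04"

docker stack rm "$STACK"
sleep 30  # give the swarm a moment to tear down tasks and networks

for node in $NODES; do
  # Remove any network still carrying the stack's name prefix on this node.
  ssh "$node" "docker network ls --filter name=${STACK}_ -q | xargs -r docker network rm"
done
```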
I'd really like to find out why networks are left lingering around, but this avoids the issue in the meantime so that CI deployments work reliably. One other thing I've noticed is that the larger the stack, the more likely it is that networks get left behind. The smallest stack has 3 services and never seems to run into this; the biggest stack has 43 services and seems to run into this every single time.
Is there a way to exclude a specific subnet or address range? (I know there's a way to configure a new network's subnet explicitly, but that's not what I'm talking about.) I'm seeing this problem on 17.06-ee as well, using stacks with an automatically configured stack network. Why doesn't Docker specifically exclude the host's addresses/subnets from the automatic subnet creation pool? Is this by design, or has it just not been added yet?
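(For later readers: newer Docker releases do let you steer the automatic allocation pool, although host subnets are still not excluded automatically. A sketch — the pool values below are placeholders:)

```bash
# Swarm-scoped (overlay) networks: choose the automatic allocation pool when
# the swarm is created (flags available in Docker 18.09 and later).
docker swarm init \
  --default-addr-pool 10.20.0.0/16 \
  --default-addr-pool-mask-length 24

# Local (bridge) networks: "default-address-pools" in /etc/docker/daemon.json
# in later releases, e.g.:
#   { "default-address-pools": [ { "base": "172.80.0.0/16", "size": 24 } ] }
```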
I've just completed a clean install of Docker 17.09.1 on the same 10-node cluster. The problem still persists: after removing a stack, some of the stack's networks linger on the nodes, which causes anomalies on subsequent deploys of the same stack. For example, after redeploying one stack:

Services that use the affected network:

Here are the details of the two copies of the network:

As before, explicitly visiting each node after removing a stack and removing dangling networks seems to avoid errors on subsequent deployments.
I've tried doing a clean install of Docker CE 17.12.0-rc4 on the 10-node swarm and still get the same behaviour: removing stacks sometimes leaves behind stray networks. I'm really not sure how to debug this. I'll be in San Francisco this Monday to Thursday (Dec 25 to 28), if anyone from Docker wants to see this interactively.
@kinghuang yes, these dangling dynamic networks are the cause of the issue. In order to understand why these networks are left uncleaned, can you please enable daemon debug logs and capture the issue when a network is left behind? Please share the debug logs once you are able to reproduce it.
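(For anyone following along, a minimal way to enable debug logging on a systemd host, assuming the default config path:)

```bash
# Enable daemon debug logging (merge this key into any existing daemon.json
# by hand rather than overwriting the file like this does).
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json

# dockerd re-reads a subset of its options, including "debug", on SIGHUP,
# so a full daemon restart isn't needed.
sudo kill -HUP "$(pidof dockerd)"
```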
@mavenugo I cleared all deployed stacks (including dangling networks) on the swarm, enabled debug logging, then deployed and removed the stack again. Engine and swarm info (from itrmsdev01):
On itrmsdev01, 10 minutes after stack deployed:
On itrmsdev03, after stack removed:
Attached are the daemon logs from itrmsdev01 (where the stack commands were issued) and itrmsdev03 (the only node with a dangling network after the stack was removed): docker-itrmsdev01.ucalgary.ca.log. I can repeat this again, if that helps. I also have the daemon logs from the other 8 nodes available.
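(How the logs were exported isn't shown in the thread; on a systemd host like RHEL 7, one straightforward way is to dump the daemon's journal for the reproduction window — the time range below is an example:)

```bash
# Export the Docker daemon's journal for the repro window to a log file.
journalctl -u docker.service \
  --since "2017-12-20 09:00" --until "2017-12-20 10:00" \
  > "docker-$(hostname -f).log"
```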
I repeated it again: cleaned and restarted all the nodes, then deployed and removed the same stack. This time, there are two leftover networks. On itrmsdev03:

On itrmsdev04:
Logs for nodes 01 (where the stack commands were issued), 03, and 04 are attached: docker2-itrmsdev01.ucalgary.ca.log
For me, I just needed to run:
@juliusakula that cleaned the network, but didn't solve the problem in my case.
I just hit this as well. For me, it turned out to be an […]. I am on […].
Having leftover networks on Docker version 18.01.0-ce as well.
Description
I recently upgraded a 10-node cluster from Docker CE 17.07.0 to 17.09.0. After the upgrade, I'm having a lot of difficulty with services that are unable to start on random nodes (the nodes seem different every time). Typically, the service's tasks will show a status like “failed to allocate gateway (10.0.24.1): Address already in use”.
Steps to reproduce the issue:
Describe the results you received:
Some services fail to start. The tasks report errors along the lines of “failed to allocate gateway (10.0.24.1): Address already in use”.
Describe the results you expected:
I expect all the networks and services defined in the stack to be created and started.
Additional information you deem important (e.g. issue happens only occasionally):
Here's the output of `docker service ps` for a specific service, as an example. The tasks did launch on some nodes (and failed), but towards the end, the service kept getting rejected by nodes 06 and 09 in this swarm.

If I inspect the very last task (lh4wyejdi489), the status block shows the full error. The gateway address 10.0.24.1 corresponds to a network named `rms_adjunct`, which can be found in the task's `NetworkAttachments`.

This task was repeatedly rejected by nodes 06 and 09. If I do a `docker network inspect rms_adjunct` on all 10 nodes, those two nodes return “Error: No such network: rms_adjunct”, while the other nodes return the network.

Q: How do I debug why the “failed to allocate gateway” error occurs?
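(A loop like the following makes that per-node check quick — a sketch assuming SSH access and the node naming used elsewhere in this thread:)

```bash
# Ask each of the 10 nodes whether it has a local copy of rms_adjunct;
# nodes without it print "No such network".
for i in $(seq -w 1 10); do
  echo "== itrmsdev${i} =="
  ssh "itrmsdev${i}" docker network inspect --format '{{.Id}}' rms_adjunct || true
done
```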
Across the 10 nodes, only the 3 controller nodes (01 to 03) consistently have all overlay networks. The remaining 7 nodes all seem to have a different list of networks. I can't remember if it was like this before with Docker CE 17.07.0 or not, but I didn't have this problem before.
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
10-node Docker CE 17.09.0 swarm on RHEL 7.3 VMs.