Swarm service / overlay breakage - starting container failed: Address already in use #34163
Comments
Creation / removal is done via Docker/Swarm API: Creation: https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L180 Removal: https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L132 This works very intermittently. |
This is the
|
@alexellis do you happen to have the debug logs ? |
Here's a diagnostics ID from DfM which should have a debug log. Otherwise is there anything you'd like me to run on the Moby tty? D5A8DFC1-74C9-4986-AF95-439CBAFD67E0 @abhinandanpb does this help? Thanks |
Ref: #32548 |
Same problem here. Let me know if you need some debug information. |
Faced this problem intermittently. Restarting the docker daemon on the worker node resolves the issue |
Restarting the daemon doesn't help and this is reproducible. @cpuguy83 @thaJeztah can you guys think of anyone who can help with this issue? |
I am having the same issue. I'm running |
@alexellis as explained here #31698 (comment), if this is a temporary state, task reconciliation happens and the task gets rescheduled on another node. If it is a permanent state, then I have a possible fix in moby/libnetwork#1853. Let me know your thoughts on this. I will give this a try again today and update the thread. |
@abhinandanpb - a fix would be great. This seems like a very normal use-case for CD. The error I'm getting appears to be permanent unless this is dependent on a restart-policy? https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L189 |
@alexellis it very well could be. Is it possible for you to confirm the theory by increasing the max attempts ? |
This appears to be a temporary workaround which works, but unfortunately also introduces latency and errors. It seems to be able to re-allocate on the 2nd attempt. |
I have the same issue with I didn't try restarting only the Docker daemon, but rebooting the EC2 instances does fix the problem temporarily. The issue occurs again intermittently when updating services. |
So we follow the docker stack deploy CI pipeline to deploy our services to the cluster as opposed to docker service create. Attempt #1 Engine version 17.06-ce Attempt #2 It has been a major roadblock for us. Not really sure how everyone else has been getting their clusters to work. Would be more than happy to share relevant logs. 👍 |
@alexellis we are looking at a more concrete fix in swarmkit to address the issue. Will update the thread once we have that in. |
Any update on this? Running into it quite a lot |
Pretty much the same with us. Abandoned Swarm migration for the time being.
|
@sentinelcross we worked around it on openfaas by setting a higher max attempts value in the restart policy. Around 3-5 seems to work well every time. @abhi do you have any updates? |
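For readers who want to try the restart-policy workaround described above, here is a minimal sketch of the relevant part of a service spec as the Docker Engine API expects it in JSON (`TaskTemplate.RestartPolicy.MaxAttempts`). The struct names, the `buildSpec` helper, the service name and the `any` condition are assumptions made for illustration; this is not the openfaas code, and a real create request also needs a `ContainerSpec` with an image.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RestartPolicy mirrors two fields of the RestartPolicy object in the
// Docker Engine API service spec. Other fields (Delay, Window) are omitted.
type RestartPolicy struct {
	Condition   string `json:"Condition"`   // "none", "on-failure" or "any"
	MaxAttempts uint64 `json:"MaxAttempts"` // 0 means no limit
}

// TaskTemplate holds only the restart policy here; a real spec also
// carries ContainerSpec, Networks and so on.
type TaskTemplate struct {
	RestartPolicy RestartPolicy `json:"RestartPolicy"`
}

// ServiceSpec is a trimmed-down, illustrative view of a service spec.
type ServiceSpec struct {
	Name         string       `json:"Name"`
	TaskTemplate TaskTemplate `json:"TaskTemplate"`
}

// buildSpec is a hypothetical helper that returns a spec whose tasks are
// retried up to maxAttempts times.
func buildSpec(name string, maxAttempts uint64) ServiceSpec {
	return ServiceSpec{
		Name: name,
		TaskTemplate: TaskTemplate{
			RestartPolicy: RestartPolicy{
				Condition:   "any", // assumption: retry regardless of exit reason
				MaxAttempts: maxAttempts,
			},
		},
	}
}

func main() {
	// 5 sits in the 3-5 range reported to work in this thread.
	out, err := json.MarshalIndent(buildSpec("url_ping", 5), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

As noted earlier in the thread, retrying this way can add latency while the task is re-allocated, so it is a workaround and not a fix.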
Finally this workaround in my CD system made the magic happen:
Imho it looks like terminated containers are still connected to the overlay network. When I first noticed the issue and no workaround worked, I used to disconnect all containers from the network and remove the stack, which let me redeploy again. Now I use the excerpt above with each deploy and it seems to work fine.
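The manual cleanup described in this comment (disconnecting every container still attached to the overlay network) could be sketched roughly as follows. This is not the commenter's excerpt, which is not preserved in the thread. It only assumes two things about the Docker Engine API: a network inspect response contains a `Containers` object keyed by container ID, and `POST /networks/{id}/disconnect` takes a body with `Container` and `Force`. The type names, the `disconnectBodies` helper and the sample JSON are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// networkInspect keeps only the part of a network inspect response that is
// needed here: the Containers map, keyed by container ID.
type networkInspect struct {
	Name       string                     `json:"Name"`
	Containers map[string]json.RawMessage `json:"Containers"`
}

// disconnectBody mirrors the JSON body of POST /networks/{id}/disconnect.
type disconnectBody struct {
	Container string `json:"Container"`
	Force     bool   `json:"Force"`
}

// disconnectBodies is a hypothetical helper: given the raw JSON of a network
// inspect response, it returns one forced-disconnect body per attached
// container, sorted by ID so the output is deterministic.
func disconnectBodies(inspectJSON []byte) ([]disconnectBody, error) {
	var n networkInspect
	if err := json.Unmarshal(inspectJSON, &n); err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(n.Containers))
	for id := range n.Containers {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	bodies := make([]disconnectBody, 0, len(ids))
	for _, id := range ids {
		bodies = append(bodies, disconnectBody{Container: id, Force: true})
	}
	return bodies, nil
}

func main() {
	// Made-up sample data standing in for a real inspect response.
	sample := []byte(`{"Name":"example_net","Containers":{"b2":{},"a1":{}}}`)
	bodies, err := disconnectBodies(sample)
	if err != nil {
		panic(err)
	}
	for _, b := range bodies {
		out, _ := json.Marshal(b)
		fmt.Println(string(out))
	}
}
```

The sketch only builds the request bodies; sending them to the daemon and removing the stack afterwards are left out.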
|
@sentinelcross @alexellis @flavioaiello @developius can you try 17.11+ versions? They include a fix for the way IPAM allocation is done, which should address this issue. The issue will remain open until the swarmkit design change that completely solves it is done. |
Any update on this issue? I have to deal with the problem every week. Each time, I remove the service and create it from scratch. |
Description
Swarm service / overlay breakage - starting container failed: Address already in use
Also reported by @nickjj who said it prevented a production roll-out of Swarm.
Possibly related: #31698
Steps to reproduce the issue:
Describe the results you received:
(scroll right)
Describe the results you expected:
1/1 replicas.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Dockerfile:
https://github.com/alexellis/faas-cli/blob/master/template/python/Dockerfile
(watchdog process binds to port 8080)
Python file:
https://github.com/alexellis/faas-cli/blob/master/sample/url_ping/handler.py