
Swarm service / overlay breakage - starting container failed: Address already in use #34163

Open
alexellis opened this issue Jul 18, 2017 · 22 comments

Comments

@alexellis
Contributor

alexellis commented Jul 18, 2017

Description

Swarm service / overlay breakage - starting container failed: Address already in use

Also reported by @nickjj, who said it prevented a production roll-out of Swarm.

Possibly related: #31698

Steps to reproduce the issue:

  1. Deploy a service
  2. Remove service
  3. Deploy same service with same name
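
For reference, a minimal reproduction along those lines might look like the sketch below (network and service names here are illustrative, not the exact ones from the deployment):

# create an overlay network and a service attached to it
docker network create --driver overlay func_net
docker service create --name node_info --network func_net alexellis2/faas-node_info

# remove the service, then immediately recreate it with the same name
docker service rm node_info
docker service create --name node_info --network func_net alexellis2/faas-node_info

# inspect the task state; the failed task shows the "Address already in use" error
docker service ps --no-trunc=true node_info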

Describe the results you received:

"starting container failed: Address already in use"  

(scroll right)

docker service ps --no-trunc=true node_info
ID                          NAME                IMAGE                       NODE                DESIRED STATE       CURRENT STATE           ERROR                                                 PORTS
ig7sqbw8bz554fw2fckqr20gq   node_info.1         alexellis2/faas-node_info   moby                Shutdown            Failed 59 seconds ago   "starting container failed: Address already in use"   

Describe the results you expected:

1/1 replicas.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

$ docker version
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:31:53 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:51:55 2017
 OS/Arch:      linux/amd64
 Experimental: true

Dockerfile:

https://github.com/alexellis/faas-cli/blob/master/template/python/Dockerfile

(watchdog process binds to port 8080)

Python file:

https://github.com/alexellis/faas-cli/blob/master/sample/url_ping/handler.py

@alexellis
Contributor Author

Creation / removal is done via Docker/Swarm API:

Creation:

https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L180

Removal:

https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L132

This works very intermittently.
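
For context, the CLI equivalent of that create/remove cycle would look roughly like the sketch below; the flags mirror the service spec visible in the inspect output in the next comment (one replica, restart condition "none", a function=true container label, an fprocess environment variable, and attachment to the functions overlay network), and the network name here is a placeholder:

docker service create \
  --name url_ping \
  --network func_net \
  --container-label function=true \
  --env fprocess="python index.py" \
  --restart-condition none \
  --restart-max-attempts 1 \
  --replicas 1 \
  alexellis2/faas-urlping

docker service rm url_ping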

@alexellis
Contributor Author

This is the service inspect output:

[
    {
        "ID": "p4ws511b36lkaeedu05mgakbd",
        "Version": {
            "Index": 191816
        },
        "CreatedAt": "2017-07-18T16:40:18.242004154Z",
        "UpdatedAt": "2017-07-18T16:40:18.244244092Z",
        "Spec": {
            "Name": "url_ping",
            "Labels": {},
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "alexellis2/faas-urlping",
                    "Labels": {
                        "function": "true"
                    },
                    "Env": [
                        "fprocess=python index.py"
                    ],
                    "StopGracePeriod": 10000000000,
                    "DNSConfig": {}
                },
                "Resources": {},
                "RestartPolicy": {
                    "Condition": "none",
                    "Delay": 5000000000,
                    "MaxAttempts": 1
                },
                "Placement": {},
                "Networks": [
                    {
                        "Target": "mj0nnp38t65p7csrhgjpovk7d"
                    }
                ],
                "ForceUpdate": 0,
                "Runtime": "container"
            },
            "Mode": {
                "Replicated": {
                    "Replicas": 1
                }
            },
            "UpdateConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            },
            "RollbackConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            }
        },
        "Endpoint": {
            "Spec": {},
            "VirtualIPs": [
                {
                    "NetworkID": "mj0nnp38t65p7csrhgjpovk7d",
                    "Addr": "10.0.0.32/24"
                }
            ]
        }
    }
]

@abhi
Contributor

abhi commented Jul 18, 2017

@alexellis do you happen to have the debug logs ?

@alexellis
Contributor Author

alexellis commented Jul 18, 2017

Here's a diagnostics ID from DfM (Docker for Mac), which should have a debug log. Otherwise, is there anything you'd like me to run on the Moby tty? D5A8DFC1-74C9-4986-AF95-439CBAFD67E0

@abhinandanpb does this help? Thanks

@alexellis
Contributor Author

Ref: #32548

@giovantenne

Same problem here. Let me know if you need some debug information.

@endeepak

endeepak commented Aug 3, 2017

Faced this problem intermittently. Restarting the docker daemon on the worker node resolves the issue.
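
(For anyone trying the same workaround, on a systemd-based host restarting the daemon is typically:)

sudo systemctl restart docker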

@alexellis
Contributor Author

Restarting the daemon doesn't help and this is reproducible. @cpuguy83 @thaJeztah can you guys think of anyone who can help with this issue?

@rorpage

rorpage commented Aug 16, 2017

I am having the same issue. I'm running Docker version 17.06.0-ce, build 02c1d87 on OS X.

@abhi
Contributor

abhi commented Aug 16, 2017

@alexellis as explained in #31698 (comment), if this is a temporary state, task reconciliation happens and the task gets rescheduled on another node. If it is a permanent state, then I have a possible fix in moby/libnetwork#1853. Let me know your thoughts on this. I will give this a try again today and update the thread.

@alexellis
Contributor Author

@abhinandanpb - a fix would be great. This seems like a very normal use case for CD. The error I'm getting appears to be permanent, unless this is dependent on the restart policy?

https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L189

@abhi
Contributor

abhi commented Aug 16, 2017

@alexellis it very well could be. Is it possible for you to confirm the theory by increasing the max attempts?
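
A sketch of that test against the service from the inspect output above (the attempt count is arbitrary, and the restart condition also needs to be something other than "none" for the attempts to apply):

docker service update \
  --restart-condition on-failure \
  --restart-max-attempts 5 \
  url_ping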

@alexellis
Contributor Author

alexellis commented Aug 16, 2017

This appears to be a temporary workaround that works, but it unfortunately also introduces latency and errors. The address seems to be re-allocated successfully on the 2nd attempt.

@puffin

puffin commented Aug 24, 2017

I have the same issue with docker 17.06.0-ce on Ubuntu 16.04.3 LTS, operating a swarm of 3 managers and 3 workers on AWS.

I didn't try only restarting the docker daemon, but rebooting the EC2 instances actually fixes the problem temporarily. The issue occurs again intermittently when updating services.

@sentinelcross

We use docker stack deploy in our CI pipeline to deploy our services to the cluster, as opposed to docker service create.

Attempt #1: Engine version 17.06-ce
We had run into this very problem on a 1 master/1 worker Swarm cluster. We tried "rescuing" the cluster by manually deleting the stale VIP endpoints; that worked well for a while before things went back to how they were.
This cluster had a few services that did not expose their ports. Forcing them to expose their ports did not resolve the problem.

Attempt #2
Upgrading to a 3 master, 2 worker cluster (while also going from a /24 network to a /16 custom overlay network) delayed the problem for a while before it hit us again. This time around the Swarm was loaded with around 48 services. The endpoint mode on the services that did not expose their ports was set to dnsrr (since a few users reported problems with using the default VIP mode).
Didn't help much.
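
(For reference, a CLI sketch of the dnsrr endpoint mode mentioned above; the service, network, and image names are placeholders, and in a v3.3+ stack file the equivalent setting is deploy.endpoint_mode:)

docker service create \
  --name internal_service \
  --network my_overlay \
  --endpoint-mode dnsrr \
  myorg/myimage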

It has been a major roadblock for us. Not really sure how everyone else has been getting their clusters to work.
We rebuilt the cluster with 17.06.1-ce today (we encountered some problems with iptables and node communication breakdown while upgrading sequentially, but that is for another day). Will monitor how it pans out and update here accordingly.

Would be more than happy to share relevant logs. 👍

@abhi
Contributor

abhi commented Aug 31, 2017

@alexellis we are looking at a more concrete fix in swarmkit to address the issue. Will update the thread once we have that in.

@developius

Any update on this? Running into it quite a lot.

@sentinelcross

sentinelcross commented Nov 5, 2017 via email

@alexellis
Contributor Author

@sentinelcross we worked around it on openfaas by setting a higher restart policy; around 3-5 max attempts seems to work well every time. @abhi do you have any updates?

@flavioaiello

flavioaiello commented Nov 8, 2017

@alexellis

  • The first workaround was to scale to 2 and back to 1, but it made things even worse, collecting zombie interfaces in the background.
  • Removing the stack and redeploying didn't work either, due to the zombie interfaces.
  • Raising the restart policy to 10 attempts, as well as increasing the delay to 5s, also didn't work.

Finally, this workaround in my CD system made the magic happen:

echo "*** spin-up all containers ***"
docker stack deploy -c stages/${SCOPE}/${PROJECT}/${STAGE}.yml ${CONTEXT} --with-registry-auth
echo "*** Disconnect terminated containers from network to avoid the address already in use error ***"
eval $(docker stack ps ${CONTEXT} --filter 'desired-state=shutdown' --format 'docker network disconnect --force ${CONTEXT}_default {{.Name}}.{{.ID}};') || true

Imho it looks like terminated containers are still connected to the overlay network. When I first noticed the issue and no workaround worked, I used to disconnect all containers from the network and remove the stack, and was then able to redeploy again. Now I use the excerpt above with each deploy and it seems to work fine.

echo "*** Disconnect all containers from network ***"
eval $(docker network inspect ${CONTEXT}_default --format '{{range .Containers}}docker network disconnect --force ${CONTEXT}_default {{.Name}};{{end}}')
docker stack rm ${CONTEXT}

@abhi
Contributor

abhi commented Dec 15, 2017

@sentinelcross @alexellis @flavioaiello @developius can you try the 17.11+ versions? They include a fix for the way IPAM allocation is done, which should address this issue. The issue will remain open until the swarmkit design change is done to completely solve it.

@xiaochuan-du

Any update on this issue? I have to deal with the problem every week. Each time, I try removing the service and creating it from scratch.
