
Swarm service / overlay breakage - starting container failed: Address already in use #34163

Open
alexellis opened this issue Jul 18, 2017 · 22 comments

Comments

@alexellis
Contributor

alexellis commented Jul 18, 2017

Description

Swarm service / overlay breakage - starting container failed: Address already in use

Also reported by @nickjj, who said it prevented a production roll-out of Swarm.

Possibly related: #31698

Steps to reproduce the issue:

  1. Deploy a service
  2. Remove service
  3. Deploy same service with same name
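
For reference, a minimal reproduction along those lines might look like the sketch below (network and service names here are illustrative, not the exact ones from the deployment):

# create an overlay network and a service attached to it
docker network create --driver overlay func_net
docker service create --name node_info --network func_net alexellis2/faas-node_info

# remove the service, then immediately recreate it with the same name
docker service rm node_info
docker service create --name node_info --network func_net alexellis2/faas-node_info

# inspect the task state; the failed task shows the "Address already in use" error
docker service ps --no-trunc=true node_info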

Describe the results you received:

"starting container failed: Address already in use"  

(scroll right)

docker service ps --no-trunc=true node_info
ID                          NAME                IMAGE                       NODE                DESIRED STATE       CURRENT STATE           ERROR                                                 PORTS
ig7sqbw8bz554fw2fckqr20gq   node_info.1         alexellis2/faas-node_info   moby                Shutdown            Failed 59 seconds ago   "starting container failed: Address already in use"   

Describe the results you expected:

1/1 replicas.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

$ docker version
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:31:53 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:51:55 2017
 OS/Arch:      linux/amd64
 Experimental: true

Dockerfile:

https://github.com/alexellis/faas-cli/blob/master/template/python/Dockerfile

(watchdog process binds to port 8080)

Python file:

https://github.com/alexellis/faas-cli/blob/master/sample/url_ping/handler.py

@alexellis
Contributor Author

Creation / removal is done via Docker/Swarm API:

Creation:

https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L180

Removal:

https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L132

This works very intermittently.
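
For context, the CLI equivalent of that create/remove cycle would look roughly like the sketch below; the flags mirror the service spec visible in the inspect output in the next comment (one replica, restart condition "none", a function=true container label, an fprocess environment variable, and attachment to the functions overlay network), and the network name here is a placeholder:

docker service create \
  --name url_ping \
  --network func_net \
  --container-label function=true \
  --env fprocess="python index.py" \
  --restart-condition none \
  --restart-max-attempts 1 \
  --replicas 1 \
  alexellis2/faas-urlping

docker service rm url_ping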

@alexellis
Contributor Author

This is the service inspect output:

[
    {
        "ID": "p4ws511b36lkaeedu05mgakbd",
        "Version": {
            "Index": 191816
        },
        "CreatedAt": "2017-07-18T16:40:18.242004154Z",
        "UpdatedAt": "2017-07-18T16:40:18.244244092Z",
        "Spec": {
            "Name": "url_ping",
            "Labels": {},
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "alexellis2/faas-urlping",
                    "Labels": {
                        "function": "true"
                    },
                    "Env": [
                        "fprocess=python index.py"
                    ],
                    "StopGracePeriod": 10000000000,
                    "DNSConfig": {}
                },
                "Resources": {},
                "RestartPolicy": {
                    "Condition": "none",
                    "Delay": 5000000000,
                    "MaxAttempts": 1
                },
                "Placement": {},
                "Networks": [
                    {
                        "Target": "mj0nnp38t65p7csrhgjpovk7d"
                    }
                ],
                "ForceUpdate": 0,
                "Runtime": "container"
            },
            "Mode": {
                "Replicated": {
                    "Replicas": 1
                }
            },
            "UpdateConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            },
            "RollbackConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            }
        },
        "Endpoint": {
            "Spec": {},
            "VirtualIPs": [
                {
                    "NetworkID": "mj0nnp38t65p7csrhgjpovk7d",
                    "Addr": "10.0.0.32/24"
                }
            ]
        }
    }
]

@abhi
Contributor

abhi commented Jul 18, 2017

@alexellis do you happen to have the debug logs ?

@alexellis
Contributor Author

alexellis commented Jul 18, 2017

Here's a diagnostics ID from DfM (Docker for Mac), which should have a debug log. Otherwise, is there anything you'd like me to run on the Moby tty? D5A8DFC1-74C9-4986-AF95-439CBAFD67E0

@abhinandanpb does this help? Thanks

@alexellis
Contributor Author

Ref: #32548

@giovantenne

Same problem here. Let me know if you need some debug information.

@endeepak

endeepak commented Aug 3, 2017

Faced this problem intermittently. Restarting the docker daemon on the worker node resolves the issue.
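
(For anyone trying the same workaround, on a systemd-based host restarting the daemon is typically:)

sudo systemctl restart docker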

@alexellis
Contributor Author

Restarting the daemon doesn't help and this is reproducible. @cpuguy83 @thaJeztah can you guys think of anyone who can help with this issue?

@rorpage

rorpage commented Aug 16, 2017

I am having the same issue. I'm running Docker version 17.06.0-ce, build 02c1d87 on OS X.

@abhi
Contributor

abhi commented Aug 16, 2017

@alexellis as explained in #31698 (comment), if this is a temporary state, task reconciliation happens and the task gets rescheduled on another node. If it is a permanent state, then I have a possible fix in moby/libnetwork#1853. Let me know your thoughts on this. I will give this a try again today and update the thread.

@alexellis
Contributor Author

@abhinandanpb - a fix would be great. This seems like a very normal use case for CD. The error I'm getting appears to be permanent, unless this is dependent on the restart policy?

https://github.com/alexellis/faas/blob/master/gateway/handlers/functionshandler.go#L189

@abhi
Contributor

abhi commented Aug 16, 2017

@alexellis it very well could be. Is it possible for you to confirm the theory by increasing the max attempts?
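
A sketch of that test against the service from the inspect output above (the attempt count is arbitrary, and the restart condition also needs to be something other than "none" for the attempts to apply):

docker service update \
  --restart-condition on-failure \
  --restart-max-attempts 5 \
  url_ping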

@alexellis
Contributor Author

alexellis commented Aug 16, 2017

This appears to be a temporary workaround that works, but it unfortunately also introduces latency and errors. The address seems to be re-allocated successfully on the 2nd attempt.

@puffin

puffin commented Aug 24, 2017

I have the same issue with docker 17.06.0-ce on Ubuntu 16.04.3 LTS, operating a swarm of 3 managers and 3 workers on AWS.

I didn't try only restarting the docker daemon, but rebooting the EC2 instances actually fixes the problem temporarily. The issue occurs again intermittently when updating services.

@sentinelcross

We use docker stack deploy in our CI pipeline to deploy our services to the cluster, as opposed to docker service create.

Attempt #1: Engine version 17.06-ce
We had run into this very problem on a 1 master/1 worker Swarm cluster. We tried "rescuing" the cluster by manually deleting the stale VIP endpoints; that worked well for a while before things went back to how they were.
This cluster had a few services that did not expose their ports. Forcing them to expose their ports did not resolve the problem.

Attempt #2
Upgrading to a 3 master, 2 worker cluster (while also going from a /24 network to a /16 custom overlay network) delayed the problem for a while before it hit us again. This time around the Swarm was loaded with around 48 services. The endpoint mode on the services that did not expose their ports was set to dnsrr (since a few users reported problems with using the default VIP mode).
Didn't help much.
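
(For reference, a CLI sketch of the dnsrr endpoint mode mentioned above; the service, network, and image names are placeholders, and in a v3.3+ stack file the equivalent setting is deploy.endpoint_mode:)

docker service create \
  --name internal_service \
  --network my_overlay \
  --endpoint-mode dnsrr \
  myorg/myimage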

It has been a major roadblock for us. Not really sure how everyone else has been getting their clusters to work.
We rebuilt the cluster with 17.06.1-ce today (we encountered some problems with iptables and node communication breakdown while upgrading sequentially, but that is for another day). Will monitor how it pans out and update here accordingly.

Would be more than happy to share relevant logs. 👍

@abhi
Contributor

abhi commented Aug 31, 2017

@alexellis we are looking at a more concrete fix in swarmkit to address the issue. Will update the thread once we have that in.

@developius

Any update on this? Running into it quite a lot.

@sentinelcross

sentinelcross commented Nov 5, 2017 via email

@alexellis
Contributor Author

@sentinelcross we worked around it on openfaas by setting a higher restart policy; around 3-5 max attempts seems to work well every time. @abhi do you have any updates?

@flavioaiello

flavioaiello commented Nov 8, 2017

@alexellis

  • The first workaround was to scale to 2 and back to 1, but it made things even worse, collecting zombie interfaces in the background.
  • Removing the stack and redeploying didn't work either, due to the zombie interfaces.
  • Raising the restart policy to 10 attempts, as well as increasing the delay to 5s, also didn't work.

Finally, this workaround in my CD system made the magic happen:

echo "*** spin-up all containers ***"
docker stack deploy -c stages/${SCOPE}/${PROJECT}/${STAGE}.yml ${CONTEXT} --with-registry-auth
echo "*** Disconnect terminated containers from network to avoid the address already in use error ***"
eval $(docker stack ps ${CONTEXT} --filter 'desired-state=shutdown' --format 'docker network disconnect --force ${CONTEXT}_default {{.Name}}.{{.ID}};') || true

Imho it looks like terminated containers are still connected to the overlay network. When I first noticed the issue and no workaround worked, I used to disconnect all containers from the network and remove the stack, and was then able to redeploy again. Now I use the excerpt above with each deploy and it seems to work fine.

echo "*** Disconnect all containers from network ***"
eval $(docker network inspect ${CONTEXT}_default --format '{{range .Containers}}docker network disconnect --force ${CONTEXT}_default {{.Name}};{{end}}')
docker stack rm ${CONTEXT}

@abhi
Contributor

abhi commented Dec 15, 2017

@sentinelcross @alexellis @flavioaiello @developius can you try the 17.11+ versions? They include a fix for the way IPAM allocation is done, which should address this issue. The issue will remain open until the swarmkit design change is done to completely solve it.

@xiaochuan-du

Any update on this issue? I have to deal with the problem every week. Each time, I try removing the service and creating it from scratch.
