Swarm with multiple ingress networks #2637

Open
Mobe91 opened this issue May 19, 2018 · 10 comments
Comments

@Mobe91

Mobe91 commented May 19, 2018

I noticed that my swarm has 2 ingress networks:

docker network ls
NETWORK ID          NAME                                   DRIVER              SCOPE
1aaeb441a06b        bridge                                 bridge              local
81d942ace568        docker_gwbridge                        bridge              local
03a8cf90b847        host                                   host                local
8c6oqwchdzvf        ingress                                overlay             swarm
mfoezf9fniby        ingress                                overlay             swarm
2918a2ddc532        none                                   null                local

I think that, as a consequence, one of my services fails to start: it remains in the starting state forever, and when I docker inspect the corresponding container, it says:

"State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "network 8c6oqwchdzvfgayqkzsq3be7m not found",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },

So it seems that it fails to find one of the 2 ingress networks.
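A duplicate like the one in the listing above can be spotted mechanically. The following is only a minimal sketch, assuming the default column layout of docker network ls (name in the second column); duplicate_network_names is a hypothetical helper name, not part of Docker.

```shell
# Print every network name that occurs more than once in
# `docker network ls` output read from stdin (the header row is skipped).
# duplicate_network_names is a hypothetical helper name.
duplicate_network_names() {
    awk 'NR > 1 { count[$2]++ }
         END { for (n in count) if (count[n] > 1) print n }' | sort
}

# Usage on a node:
#   docker network ls | duplicate_network_names
```

An empty result means no name is listed twice on that node. It says nothing about which of the two entries is the stale one.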

docker version
Client:
 Version:       18.03.0-ce
 API version:   1.37
 Go version:    go1.9.4
 Git commit:    0520e24
 Built: Wed Mar 21 23:05:35 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.03.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   0520e24
  Built:        Wed Mar 21 23:14:32 2018
  OS/Arch:      linux/amd64
  Experimental: true

1. Is it normal to have multiple ingress networks and, if not, how can this happen?
2. How can I resolve this for my current swarm?

EDIT 1
I checked my other running services and none of them uses ingress network mfoezf9fniby. So I tried docker network rm mfoezf9fniby, but this failed with Error response from daemon: network mfoezf9fnibyov8ps098ngvjy not found. After that, running docker network ls still showed the 2 ingress networks.

EDIT 2
Running docker network ls on a different node only lists 1 ingress network (network mfoezf9fniby is gone). So it seems that the node on which the service task fails has stale data?
Inspecting docker.log on the corrupt node constantly shows the following entries:

May 19 14:28:33 moby root: time="2018-05-19T14:28:33.651593661Z" level=warning msg="error locating sandbox id f3ce58d7eccbbd270959f73e141818b2310ffff199704e7a2a308b42e5903a89: sandbox f3ce58d7eccbbd270959f73e141818b2310ffff199704e7a2a308b42e5903a89 not found"
May 19 14:28:33 moby root: time="2018-05-19T14:28:33.652546451Z" level=error msg="fc016a345607573568b64824f6a40dcc2226b4620641b5cad8613558d92d5809 cleanup: failed to delete container from containerd: no such container"

I tried docker rm -f fc016a345607573568b64824f6a40dcc2226b4620641b5cad8613558d92d5809, which completed successfully. It turns out that this container was the service task that was stuck in the starting state. The service deployment then automatically picked a different node and launched a new service task, but again the service could not be started. I ran docker network ls on the newly picked node and, again, 2 ingress networks were shown (both with the same IDs as on the original node).
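The node-to-node comparison described in EDIT 2 could be sketched as follows. This assumes the output of docker network ls has been saved to one file per node; networks_only_in_first is a hypothetical helper name. Only swarm-scoped rows are compared, because local networks such as bridge have node-specific IDs and would always differ.

```shell
# Print "<id> <name>" for swarm-scoped networks that appear in the first
# saved `docker network ls` capture but not in the second one.
# networks_only_in_first is a hypothetical helper name.
networks_only_in_first() {
    # awk reads the second capture first and remembers its IDs, then
    # prints swarm-scoped rows of the first capture whose ID is unknown.
    awk 'FNR == 1 { next }
         NR == FNR { seen[$1] = 1; next }
         !($1 in seen) && $4 == "swarm" { print $1, $2 }' "$2" "$1"
}

# Usage (file names are placeholders):
#   networks_only_in_first suspect-node.txt healthy-node.txt
```

Both captures are expected to contain at least the header row. An entirely empty second file would make the comparison print nothing.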

I should also mention that I am using docker-for-aws - don't know if that matters.

@thaJeztah
Member

ping @ctelfer

@Mobe91
Author

Mobe91 commented May 19, 2018

I was able to resolve this as follows:

  • I completely removed that stack with the failing service using docker stack rm.
  • I terminated the 2 nodes that showed the 2 ingress networks in the output of docker network ls.
  • I then tried to recreate the stack using docker stack deploy. This time, the service creation failed with Error response from daemon: network <service-name>_default not found.
  • I added a custom network to the service definition to avoid hitting the default network.
  • Now service creation succeeded, but again, the service did not start and 2 ingress networks showed up on the node that ran the service task.
  • Again, I completely removed the corresponding stack
  • I ran docker network prune which apparently deleted an existing network called <service-name>_default
  • I removed the node that showed the 2 ingress networks
  • I recreated the stack and this time everything worked fine

@thaJeztah
Member

I recall there was an issue in the past where nodes upgraded from an old version did not have the "ingress" attribute set on the ingress network; were these existing nodes, and upgraded from an older version of docker (and if so, do you know what version?)

@Mobe91
Author

Mobe91 commented May 19, 2018

@thaJeztah No, I performed a CloudFormation stack update, which means that all old nodes were replaced. Also, the old nodes ran the same docker version as the new ones.

@ctelfer

ctelfer commented May 21, 2018

To answer the first question: no, there should definitely not be two ingress networks present at the same time.

My first thought was that this had something to do with an incomplete restoration of the ingress network after a dockerd restart. My second thought was that, since docker network ls only showed 1 ingress network on other (worker?) nodes, the extra ingress network was one restored from a previous run that swarm had no knowledge of, which led swarm to create a fresh one on the manager node and the other nodes. I would be curious whether both "ingress" networks (on the nodes that have 2 ingress networks) are marked as "Ingress: true" in docker network inspect.

From @Mobe91's last comment, it sounds like something needed to be pruned, whether it was FOO_default or ingress.
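The "Ingress: true" check asked for here could be run for both IDs at once. This is a minimal sketch that assumes the Ingress field reported by docker network inspect; ingress_flags is a hypothetical helper name.

```shell
# Print "<id> <ingress-flag>" for each network ID given as an argument.
# Error text (for example "not found") is shown in place of the flag.
# ingress_flags is a hypothetical helper name.
ingress_flags() {
    for id in "$@"; do
        printf '%s %s\n' "$id" \
            "$(docker network inspect --format '{{.Ingress}}' "$id" 2>&1)"
    done
}

# Usage with the two IDs from the original report:
#   ingress_flags 8c6oqwchdzvf mfoezf9fniby
```

Capturing stderr as well means a network that is listed but cannot be inspected shows up with its error message instead of silently disappearing from the output.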

@Mobe91
Author

Mobe91 commented May 22, 2018

> My second thought was that since docker network ls only showed 1 ingress network on other (worker?) nodes

It was other manager nodes that showed 1 ingress network. I don't think I checked the worker nodes.

> I would be curious whether both "ingress" networks (on the nodes that have 2 ingress networks) are marked as "Ingress: true" in docker network inspect.

Unfortunately, I did not save the output of docker network inspect when this happened...

@nsteinmetz

Hi,

This seems to be the same bug as the one I hit with Docker 18.06.1.

Maybe I should have opened it here instead: docker/for-linux#424

@cypx

cypx commented Jun 5, 2019

I encountered the same problem, and after a few tests it seems to be caused by a network with a duplicated name and local scope.

It can be reproduced on a fresh Ubuntu 18.04 install:

root@server:~# docker --version 
Docker version 18.09.5, build e8ff056
root@server:~# docker swarm init
Swarm initialized: current node (t8qsjoaroynwsxeq9la4f0i5b) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-24qa1rusq46mmgjah41z8pvtyghnlrz9g3u7q49keol2p0r5te-9ek99ggjlgrtnd2iiqy5h44bw 51.15.155.120:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

root@server:~# docker network create --scope=local test_stack_default
02a5d86ad0fa904c268efa3d0debe7efd83d9438eca7288372406c118bce36c4
root@server:~# docker network create --scope=swarm additional
nf74b2kya75mui7zzj5ofdqs8
root@server:~# cat test_stack.yml 
version: '3.4'

services:
  test_service:
    image: "traefik"
    networks:
      - default
      - additional

networks:
  additional:
    external: true
root@server:~# docker stack deploy -c test_stack.yml test_stack
Creating network test_stack_default
Creating service test_stack_test_service
root@server:~# docker network ls
NETWORK ID          NAME                 DRIVER              SCOPE
nf74b2kya75m        additional           bridge              swarm
8d04df932427        bridge               bridge              local
7c3601152798        docker_gwbridge      bridge              local
150cf9c0525f        host                 host                local
xqz9vq824pa1        ingress              overlay             swarm
560fc2bccd2e        none                 null                local
02a5d86ad0fa        test_stack_default   bridge              local
lrp4mt5zjwmf        test_stack_default   overlay             swarm
root@server:~# docker service ps test_stack_test_service --no-trunc
ID                          NAME                        IMAGE                                                                                    NODE                DESIRED STATE       CURRENT STATE          ERROR               PORTS
8je5ervdw0udo880i4ryo0ri9   test_stack_test_service.1   traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8   server              Running             Running 1 second ago                       
root@server:~# service docker restart
root@server:~# docker service ps test_stack_test_service --no-trunc
ID                          NAME                            IMAGE                                                                                    NODE                DESIRED STATE       CURRENT STATE            ERROR                                        PORTS
kltfyg26kiydgoextni5kt7kb   test_stack_test_service.1       traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8   server              Ready               Rejected 4 seconds ago   "network nf74b2kya75mui7zzj5ofdqs8 exists"   
t1ux7w6ulsildjqwut3mysrp1    \_ test_stack_test_service.1   traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8   server              Shutdown            Rejected 9 seconds ago   "network nf74b2kya75mui7zzj5ofdqs8 exists"   
8je5ervdw0udo880i4ryo0ri9    \_ test_stack_test_service.1   traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8   server              Shutdown            Complete 9 seconds ago                                                
root@server:~#
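A name clash like test_stack_default above (one local-scoped and one swarm-scoped network with the same name) could be checked for before deploying a stack. The sketch below assumes the Name and Scope placeholders of docker network ls --format; cross_scope_name_clashes is a hypothetical helper name.

```shell
# Read "<name> <scope>" pairs from stdin and print each network name
# that exists with both local and swarm scope.
# cross_scope_name_clashes is a hypothetical helper name.
cross_scope_name_clashes() {
    awk '{ scopes[$1] = scopes[$1] " " $2 }
         END {
             for (n in scopes)
                 if (scopes[n] ~ /local/ && scopes[n] ~ /swarm/) print n
         }' | sort
}

# Usage:
#   docker network ls --format '{{.Name}} {{.Scope}}' | cross_scope_name_clashes
```

Whether such a clash is the root cause of the duplicated ingress network is not established in this thread; the check only detects the condition used in the reproduction.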

@Mobe91
Author

Mobe91 commented Nov 27, 2019

This just happened to me again.

@ctelfer

> I would be curious whether both "ingress" networks (on the nodes that have 2 ingress networks) are marked as "Ingress: true" in docker network inspect.

I checked this time. Both ingress networks are marked as "Ingress: true" in the output of docker network inspect.

@mvandermade

FYI @cypx: I checked with 19.03.5 and still see the same behavior as in your result.
