Swarm Mode at Scale #30820

Closed
sebgie opened this Issue Feb 8, 2017 · 54 comments

@sebgie

sebgie commented Feb 8, 2017

Description

I'm in the process of migrating an LXC setup to a Docker Swarm environment. The size of this setup is 50 nodes with about 10k services running. The cluster with 50 nodes (32GB RAM each) seems to be working as expected. When I now try to start services, I get stuck at around 500 - 700 services. After that point docker service ls shows entries like:

rxlga525wqjd <service-name> replicated 0/1 <container>

and docker service ps <service-name> shows:

ixoggohaa7kc  <service-name>      <container>  prd-pro-16  Running        Running 1 second ago                                     
p0gziroccv42   \_ <service-name>  <container>  prd-pro-24  Shutdown       Failed 12 seconds ago  "starting container failed: co…"  
j0fvbve416gz   \_ <service-name>  <container>  prd-pro-35  Shutdown       Failed 2 minutes ago   "starting container failed: co…"  
873vsi6c80vx   \_ <service-name>  <container>  prd-pro-24  Shutdown       Failed 4 minutes ago   "starting container failed: co…"  

Every container is connected to 2 networks, proxy (10.1.0.0/16) and db (10.2.0.0/16). Running containers are reachable.

Steps to reproduce the issue:

  1. start Docker Swarm Mode with 50 nodes
  2. start > 500 services and wait till they fail to start up

Describe the results you received:

After about 500 containers, creating a new service no longer works.

Describe the results you expected:

Service creation works within 1 or 2 seconds.

Additional information you deem important (e.g. issue happens only occasionally):

When I log into the node where the container is being started and run docker ps, the command hangs until the container fails. Creating a new container directly on a node that has failed before works as expected, and container startup takes less than a second.

All machines appear to be idle and there are no tremendous peaks in load/memory from what I can see. The average load per machine is 10 - 15 containers with around 100 - 150 MB memory usage each.

I am aware that debugging this is hard and I'm very happy to do screen sharing or provide any logs that might be needed to get to the bottom of this behaviour. I'm grateful for every input!

Output of docker version:

# docker version
Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:50:17 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:50:17 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

# docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 1.13.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 7
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: ssci1al4vgi7nzirptugjrpsl
 Is Manager: true
 ClusterID: qfi82y6p6rpnnfxkarxxfdca4
 Managers: 3
 Nodes: 50
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.133.6.234
 Manager Addresses:
  10.133.6.234:2377
  10.133.8.162:2377
  10.133.8.89:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 4.4.0-53-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: prd-pro-01
ID: 37WG:BKQR:YXQF:LTMK:P3IQ:CDQ5:ZEM5:PI6L:32DR:3KET:QH4Z:GB5N
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: ghostengineering
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
DigitalOcean

@ErisDS

ErisDS commented Feb 8, 2017

Hi there docker folks 👋

Hope you don't mind me swinging by to explain the importance of this. This issue is currently 100% blocking us (Ghost) from shipping a docker-based replacement for our currently lxc-based infrastructure, which is also the prerequisite to every other project we need to ship this year. We're also in the awkward position of being stuck running both the old and new infrastructures until we can get past this roadblock.

I hope anyone reading this can understand why this is a pretty desperate situation to be in 😨.

Ideally, we're looking for someone who knows the docker swarm internals to help us track down whether this is a configuration issue or a bug, and if it's a bug help us get to a swift resolution. We're doing our own investigation as fast as possible, however if there is anyone around able to give us a hand, we'd be extremely grateful.

Thanks for listening👂

@justincormack

Contributor

justincormack commented Feb 8, 2017

Can you get the full logs for the failed containers? "starting container failed: co…" looks like it might have a truncated reason...

@sebgie

sebgie commented Feb 8, 2017

Error: starting container failed: containerd: container did not start before the specified timeout

Full container log from docker inspect <id>. This is a different container, but the behaviour is the same.

# docker inspect m3i5ddx6gjjfmhqcpivbyaya4
[
    {
        "ID": "m3i5ddx6gjjfmhqcpivbyaya4",
        "Version": {
            "Index": 48588
        },
        "CreatedAt": "2017-02-08T01:30:48.744172564Z",
        "UpdatedAt": "2017-02-08T01:33:03.494714651Z",
        "Spec": {
            "ContainerSpec": {
                "Image": "ghost/ghost-moya:v0.11.4@sha256:5aea0d66dd93b4baba5a43011e899c9e74edb58828aa15ad891e5ffe19d3bcbe",
                "Env": [
                    "url=http://example.com",
                    "database__connection__host=db-proxy",
                    "database__connection__port=-",
                    "database__connection__user=23941",
                    "database__connection__password=-",
                    "database__connection__database=23941"
                ],
                "Mounts": [
                    {
                        "Type": "bind",
                        "Source": "/mnt/blog-content/",
                        "Target": "/usr/src/app/content"
                    },
                    {
                        "Type": "bind",
                        "Source": "/mnt/blog-content/",
                        "Target": "/usr/src/app/default"
                    }
                ]
            },
            "Resources": {
                "Limits": {},
                "Reservations": {
                    "MemoryBytes": 136314880
                }
            },
            "RestartPolicy": {
                "Condition": "any",
                "MaxAttempts": 0
            },
            "Placement": {
                "Constraints": [
                    "node.role == worker",
                    "node.labels.myrole == app"
                ]
            },
            "ForceUpdate": 0
        },
        "ServiceID": "ytaz1cus1rhno1qcihgcf4rvi",
        "Slot": 1,
        "NodeID": "txqvuskaesjgegqvg0nenx0qr",
        "Status": {
            "Timestamp": "2017-02-08T01:33:03.19077799Z",
            "State": "failed",
            "Message": "starting",
            "Err": "starting container failed: containerd: container did not start before the specified timeout",
            "ContainerStatus": {
                "ContainerID": "57a500fafc641625ce580c910caf895a1f348acf7fc8636e1aaa13364de62d30",
                "ExitCode": 128
            },
            "PortStatus": {}
        },
        "DesiredState": "shutdown",
        "NetworksAttachments": [
            {
                "Network": {
                    "ID": "d8p88sa6uamoxoi50asel6nme",
                    "Version": {
                        "Index": 523
                    },
                    "CreatedAt": "2017-02-01T14:17:16.571459275Z",
                    "UpdatedAt": "2017-02-01T14:17:16.574686931Z",
                    "Spec": {
                        "Name": "proxy",
                        "DriverConfiguration": {
                            "Name": "overlay",
                            "Options": {
                                "encrypted": ""
                            }
                        },
                        "IPAMOptions": {
                            "Driver": {
                                "Name": "default"
                            },
                            "Configs": [
                                {
                                    "Subnet": "10.1.0.0/16",
                                    "Gateway": "10.1.0.1"
                                }
                            ]
                        }
                    },
                    "DriverState": {
                        "Name": "overlay",
                        "Options": {
                            "com.docker.network.driver.overlay.vxlanid_list": "4097",
                            "encrypted": ""
                        }
                    },
                    "IPAMOptions": {
                        "Driver": {
                            "Name": "default"
                        },
                        "Configs": [
                            {
                                "Subnet": "10.1.0.0/16",
                                "Gateway": "10.1.0.1"
                            }
                        ]
                    }
                },
                "Addresses": [
                    "10.1.1.88/16"
                ]
            },
            {
                "Network": {
                    "ID": "mk30ctlvsy4u6ld5ffyvq7vsi",
                    "Version": {
                        "Index": 3778
                    },
                    "CreatedAt": "2017-02-07T22:02:35.622438079Z",
                    "UpdatedAt": "2017-02-07T22:02:35.633174982Z",
                    "Spec": {
                        "Name": "db",
                        "DriverConfiguration": {
                            "Name": "overlay",
                            "Options": {
                                "encrypted": ""
                            }
                        },
                        "IPAMOptions": {
                            "Driver": {
                                "Name": "default"
                            },
                            "Configs": [
                                {
                                    "Subnet": "10.2.0.0/16",
                                    "Gateway": "10.2.0.1"
                                }
                            ]
                        }
                    },
                    "DriverState": {
                        "Name": "overlay",
                        "Options": {
                            "com.docker.network.driver.overlay.vxlanid_list": "4098",
                            "encrypted": ""
                        }
                    },
                    "IPAMOptions": {
                        "Driver": {
                            "Name": "default"
                        },
                        "Configs": [
                            {
                                "Subnet": "10.2.0.0/16",
                                "Gateway": "10.2.0.1"
                            }
                        ]
                    }
                },
                "Addresses": [
                    "10.2.1.85/16"
                ]
            }
        ]
    }
]
@justincormack

Contributor

justincormack commented Feb 8, 2017

Is there anything in the system logs? eg kernel messages? Or anything else helpful in the Docker logs?

You are saying you cannot create any more services at this point? But containers do run ok?

@sebgie

sebgie commented Feb 8, 2017

Nothing that points to the error above.

I see loads of these messages in /var/log/syslog on the worker:
Feb 8 09:05:35 prd-pro-22 kernel: [28932.052648] IPVS: __ip_vs_del_service: enter

Also loads of these messages in /var/log/upstart/docker.log on the leader:
time="2017-02-08T09:05:11.964983106Z" level=error msg="task unavailable" method="(*Dispatcher).processUpdates" module=dispatcher task.id=0328ss292ojqpvbuihdzsyl0d

I can dig out more of these if it's relevant?

When a new service is created it stays in the 0/1 state and reschedules to another server after it fails. Existing containers respond without a problem.

@justincormack

Contributor

justincormack commented Feb 8, 2017

This looks similar to #22226 (later messages), which itself is possibly related to #5618, although that's unclear; i.e. it may be a kernel issue. If you can try a more recent kernel it might give more clues, e.g. Ubuntu 16.04/16.10 rather than 14.04, Docker for AWS (which has a 4.9 kernel), or a PPA kernel for 14.04; I don't know what would be easiest for you.

@justincormack

Contributor

justincormack commented Feb 8, 2017

Running strace on one of the processes that is timing out might be useful too.
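For example (a sketch; attach to whichever process looks stuck, dockerd here is just the obvious first candidate):

# Follow forks, timestamp each syscall, record time spent in each call, and write the trace to a file
sudo strace -f -tt -T -o /tmp/dockerd.strace -p "$(pidof dockerd)"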

@sebgie

sebgie commented Feb 8, 2017

My current kernel version is 4.4.0-53-generic. I'll try to update and post my results here.

In the meantime I have run another test and tracked the console output to make it clearer what I see.

From what I can see right now, my Swarm has the following startup times:

  • 0 - 100 services: < 10 sec
  • 100 - 200 services: < 40 sec
  • 200 - 300 services: < 80 sec
  • 300+ services: 100 sec or more

Services that take more than 120 seconds will be rescheduled and try to start up again. If I keep adding more services, they are not able to start up before they are rescheduled. Below is one failed startup that managed to finish in time during the second iteration. If I add 100 more services, it will be rescheduled over and over again.

Number of services:

# docker service ls | wc -l
354

Iterating through a failed startup.

# docker service ps blog-32237
ID            NAME          IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR  PORTS
o9o2uqvhys0d  blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Running        Starting about a minute ago         

# docker service ps blog-32237
ID            NAME          IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR  PORTS
o9o2uqvhys0d  blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Running        Starting about a minute ago         

# docker service ps blog-32237
ID            NAME          IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR  PORTS
o9o2uqvhys0d  blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Running        Starting about a minute ago         

# docker service ps blog-32237
ID            NAME          IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR  PORTS
o9o2uqvhys0d  blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Running        Starting about a minute ago         

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                  ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Ready          Ready less than a second ago                                     
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed less than a second ago  "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                    ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting less than a second ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed 5 seconds ago             "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE            ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting 50 seconds ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed 55 seconds ago    "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE            ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting 52 seconds ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed 57 seconds ago    "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE              ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting 55 seconds ago                                      
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago  "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE              ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting 57 seconds ago                                      
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago  "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE              ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting 58 seconds ago                                      
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago  "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE              ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting 59 seconds ago                                      
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago  "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting about a minute ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago    "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting about a minute ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago    "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting about a minute ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago    "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting about a minute ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago    "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Starting about a minute ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago    "starting container failed: co…"  

# docker service ps blog-32237
ID            NAME              IMAGE                     NODE        DESIRED STATE  CURRENT STATE                   ERROR                             PORTS
l42im1185416  blog-32237.1      ghost/ghost-moya:v0.11.4  prd-pro-07  Running        Running less than a second ago                                    
o9o2uqvhys0d   \_ blog-32237.1  ghost/ghost-moya:v0.11.4  prd-pro-20  Shutdown       Failed about a minute ago       "starting container failed: co…"  
@justincormack

Contributor

justincormack commented Feb 8, 2017

Hmm, that just looks like increasingly slow startup time (which is odd), but not a step change. You can increase the overall timeout; I think you may have to start docker-containerd standalone with the increased timeout. That might work around it.
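Roughly, that workaround means running the bundled containerd yourself with a longer timeout and pointing dockerd at its socket. The sketch below assumes the docker-containerd 0.2.x flags shipped with 1.13 (--listen and --start-timeout); verify them with docker-containerd --help before relying on this.

# Assumed flags from the containerd 0.2.x series; check docker-containerd --help on your build
docker-containerd --listen unix:///run/custom-containerd.sock --start-timeout 5m &
# Point the daemon at that socket instead of letting it spawn its own containerd
dockerd --containerd /run/custom-containerd.sock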

@sebgie

sebgie commented Feb 8, 2017

Are there any logs I can provide to narrow down the increasingly slow startup time, or config options to tweak?

I really would like to make this work, but increasing the timeout isn't an option. If I assume an increase of 30 seconds for every 100 services, we are somewhere close to an hour to start the 10,000th service :-(.

@justincormack

Contributor

justincormack commented Feb 8, 2017

As per the instructions under #22226 (comment), some idea of the stack trace and kernel stack trace would be useful. I would probably try a newer kernel first, though...
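For the kernel-side stack traces, one generic approach (a sketch, independent of the exact instructions in that issue) is to use magic SysRq on a node where a container start is currently hanging:

# Dump the kernel stacks of all tasks to the kernel ring buffer, then save them from dmesg
echo 1 | sudo tee /proc/sys/kernel/sysrq
echo t | sudo tee /proc/sysrq-trigger
dmesg | tail -n 1000 > /tmp/kernel-stacks.txt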

@ErisDS

ErisDS commented Feb 8, 2017

Just want to add here that 10k isn't an imaginary number for us. We need to start & run a bare minimum of almost 11k services to match our existing system and switch over.

@sebgie

sebgie commented Feb 8, 2017

I've upgraded to the latest kernel, 4.9.8, using the kernel PPA.

# uname -r
4.9.8-040908-generic

Unfortunately this didn't have the result we hoped for. Creating services seems a little bit faster but this could be related to restarting the servers as well.

When starting services I see that the first ~100 services start quite fast. After that startup slows down and at around 500 services the default timeout sends my services in an endless restart loop.

I also captured an strace as described in the issue above: https://gist.github.com/sebgie/e0cac3dcb9d6747f36d22b9c4cfa487c

@sebgie

sebgie commented Feb 9, 2017

I have run several tests now and have identified the root of the problem: the overlay network.

Test1

Script:

for i in {1..1000}
do
  docker service create --name my-service-$i nginx
  sleep 0.1
done

Result: startup works as expected within 1 - 2 seconds

Test2

Networks:

  • docker network create --opt encrypted --driver overlay --subnet 10.2.0.0/16 db
  • docker network create --opt encrypted --driver overlay --subnet 10.1.0.0/16 proxy

Script:

for i in {1..1000}
do
  docker service create --name my-service-$i --network proxy --network db nginx
  sleep 0.1
done

Result: As described above, container startup becomes slower and slower. The breaking point is around 350 containers, where the first services hit the timeout.

Test3

Networks:

  • docker network create --driver overlay --subnet 10.2.0.0/16 db
  • docker network create --driver overlay --subnet 10.1.0.0/16 proxy

Script:

for i in {1..1000}
do
  docker service create --name my-service-$i --network proxy --network db nginx
  sleep 0.1
done

Result: Similar to Test2. Services start up slower and slower but the breaking point is around 400 services.

My sysctl settings on all nodes are taken from https://github.com/swarmzilla/swarm3k/blob/866a503a08661dc11885144a7e53afa283459fb0/NODE_PREPARATION.md:

net.ipv4.neigh.default.gc_thresh1 = 30000
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
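(To apply and persist these across reboots on every node, a minimal sketch; the file name is arbitrary:)

cat <<'EOF' | sudo tee /etc/sysctl.d/99-swarm-neigh.conf
net.ipv4.neigh.default.gc_thresh1 = 30000
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
EOF
sudo sysctl --system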

This is a major blocker for me! What should our next steps for debugging this be? Are there any alternative network settings I can use?

@ErisDS

ErisDS commented Feb 9, 2017

Our current line of enquiry is to downgrade to Docker 1.12.6; we will post details when that test is done.

Are there any other recommendations? It doesn't seem right that this has been marked as a performance issue. Is it really expected behaviour to only be able to start 500 services?

@mavenugo

Contributor

mavenugo commented Feb 9, 2017

@ErisDS it is in fact a scale/performance issue, since there is a bottleneck in some part of the code which we have to identify. Can you please confirm whether this happens if you start all these services in a single overlay network, or if you split them into multiple networks (say, 100 services per network)?
That will help us narrow down where the bottleneck might be.

@mavenugo

Contributor

mavenugo commented Feb 9, 2017

@sebgie @ErisDS also could you get the stacktrace from the daemon when the slowness is seen (using SIG8HUP signal)? That will help us spot where the daemon is busy... I will also try to reproduce it locally.

@ErisDS

ErisDS commented Feb 9, 2017

Testing 1.12.6 has shown a minor but not significant improvement (we're getting to 700 services, rather than 500). @sebgie will pop more details here shortly.

In the meantime:

also could you get the stacktrace from the daemon when the slowness is seen (using SIG8HUP signal)

Could you link to some documentation on how to do this? It's not something I'm familiar with.

@sebgie

sebgie commented Feb 9, 2017

Test4

Same as Test2 above but with Docker 1.12.6: I can get about 700 services up before the timeout strikes again.

Test5

Only 100 services per network, Docker 1.12.6. This works quite well up to about 1200 services and 12 networks. At that point the docker service ls command on the master is very slow/hangs and I see error messages like time="2017-02-09T17:59:35.350867817Z" level=warning msg="2017/02/09 17:59:35 [WARN] memberlist: Was able to reach prd-pro-16-35e357840abe via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n" appear in the logs.
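For reference, the per-network batching in Test5 amounts to something like the following (a sketch; network names and the batch size of 100 are illustrative):

# One overlay network per batch of 100 services
for i in $(seq 1 1200); do
  net="batch-$(( (i - 1) / 100 + 1 ))"
  docker network inspect "$net" > /dev/null 2>&1 || docker network create --driver overlay "$net"
  docker service create --name "my-service-$i" --network "$net" nginx
  sleep 0.1
done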

Notably, the first services attached to a new network are starting up faster than the later ones.


Another observation is that I can get to 10k containers without a network attached. Unfortunately there is not much use for a web service that can't be accessed.

@mavenugo please let me know how I can get a stacktrace from the daemon, or what you mean by using the SIG8HUP signal?

@mavenugo

Contributor

mavenugo commented Feb 9, 2017

@sebgie thanks for confirming Test5. It confirms that the scale issue here is not the total number of services, but the number of services per network. This matters a lot since we have network-scoped gossip and network-scoped service discovery/load balancing, and it points us to where to fine-tune.

I don't want to mix the service ls command slowing down with the issue of containers failing to launch due to containerd.

Can you please confirm how far you can go with 100 services per network before containerd fails on the timeout? Also, try reducing the number of services per network just to fine-tune the numbers. How many managers and worker nodes do you have?

BTW, I gave the wrong signal... it should be SIGUSR1.
So find the pid of dockerd and run kill -SIGUSR1 {pid}, or just do kill -SIGUSR1 $(pidof dockerd) if your running process is in fact named dockerd.
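For reference, on the Ubuntu 14.04/upstart nodes described above that would look roughly like this (the log path is the one already mentioned earlier in this thread; adjust it if your daemon logs elsewhere):

# Ask the daemon for a goroutine dump, then pull it out of the daemon log
sudo kill -SIGUSR1 "$(pidof dockerd)"
sudo tail -n 2000 /var/log/upstart/docker.log > /tmp/dockerd-stacktrace.txt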

@sebgie

sebgie commented Feb 9, 2017

Can you please confirm how far you can go with 100 services per network before containerd fails on the timeout? Also, try reducing the number of services per network just to fine-tune the numbers. How many managers and worker nodes do you have?

The cluster consists of 3 managers and 45 workers.
What version should I test this with? My Swarm is currently on 1.12.6.
What do you mean by "how far you can go"? The maximum number of services in general, or the maximum number of services per network?

Beyond 1200 services the creation process also gets a bit choppy (slow/hangs); I guess this is due to the network issues between the nodes?

@pascalandy

pascalandy commented Feb 9, 2017

Hey @sebgie thank you for sharing this with such details.

First, about Test 2

docker network create --opt encrypted --driver overlay --subnet 10.2.0.0/16 db

I had an issue with --opt encrypted. I stopped using it and my DB connection has been normal since. #30600 (comment)

Second, are you using a private or a public network when creating your swarm?
I guess it's private, right?

Third, @mavenugo I had issues with the maximum number of instances I could deploy within a given network. Has anything been done to improve the maximum number (252) of containers per network? #26702 (comment)

And finally, I would be happy to test this setup at scale and report the results carefully if someone can give me credits on DigitalOcean. 50 machines with 32GB is serious money.

Cheers!
Pascal | twitter

@pascalandy

pascalandy commented Feb 9, 2017

@chanwit, I'm sure you'll be interested in this conversation :)

@mavenugo

Contributor

mavenugo commented Feb 9, 2017

@pascalandy thanks for sharing your comments. But let us not mix multiple unrelated issues together. For example, #30600 is not related to this conversation because in this issue we are discussing Error: starting container failed: containerd: container did not start before the specified timeout. Also, #26702 is unrelated to this discussion.
If we can focus on the containerd timeout issue first, we can look at the dataplane issues in their own issues so that we can divide and conquer. Hope that is fine with you.

@sebgie

sebgie commented Feb 9, 2017

I have collected a stacktrace from one of the hanging workers (Docker 1.12.6): https://gist.github.com/sebgie/b3164f78284d7abb147c29879a58d2cb.

Could you please clarify what "how far you can go" means? I have had a script running for some time now and it doesn't seem to slow down considerably. I'm at 5k services now.

@mavenugo

Contributor

mavenugo commented Feb 9, 2017

@sebgie thanks. I will look into the stacktrace.

by, "how far you can go", I exactly wanted to know the number of services you can scale before you hit the containerd error. This is with keeping the # of services as a variable to # of networks. (instead of attaching all the services in the same network, which it seems like having a bottleneck, which we will see how to address).

@mavenugo

Contributor

mavenugo commented Feb 14, 2017

@sebgie as I understand it, this was an urgent issue that was blocking Ghost-Engineering, so if you can respond quickly to the questions we can do our best to help you out. Can you please try the above suggestion and let us know if it satisfies your immediate requirement, while we address the scale issue identified here.

@sebgie

sebgie commented Feb 14, 2017

With dnsrr it still takes > 20 minutes to scale the proxy to 5 running services.

  • # docker service create --name proxy --constraint 'node.role == worker' --endpoint-mode dnsrr --network proxy-1 ... more networks ... --network proxy-200 nginx
  • # docker service scale proxy=5

Another observation: When I try to docker service rm proxy and start it again I get an Address already in use error.

@mavenugo

Contributor

mavenugo commented Feb 14, 2017

@sebgie could you capture the traceback (as you did earlier using SIGUSR1) when it takes 20 minutes? That will give us a hint about where the daemon is spending its time while scheduling the proxy service to 5 tasks. (You may have to identify the node where it takes time and get the traceback from that node.)

@sebgie

sebgie commented Feb 14, 2017

Stacktrace during the start of the first service (which also takes a long time): https://gist.github.com/sebgie/291788428e1ce154a37a8d9cfe5fcefb

docker service scale proxy=5

The 4 new services then spend > 2 minutes preparing:

ID                         NAME     IMAGE  NODE        DESIRED STATE  CURRENT STATE            ERROR
8o5yvd6yl4ae8y2l6fjjpjyaf  proxy.1  nginx  prd-pro-06  Running        Running 3 minutes ago    
9qzzl1gy7vw4ymn3lxen4d510  proxy.2  nginx  prd-pro-04  Running        Preparing 2 minutes ago  
4bqehjta1gy9oihm9jbe1nd1g  proxy.3  nginx  prd-pro-05  Running        Preparing 2 minutes ago  
a0br48n8zh6zteg9010464oos  proxy.4  nginx  prd-pro-07  Running        Preparing 2 minutes ago  
efxgq7m4akkaipuxl07t1v9ib  proxy.5  nginx  prd-pro-09  Running        Starting 3 seconds ago

The stacktrace on prd-pro-09 is https://gist.github.com/sebgie/bbd724967da794798ada451ed38afa8b.

@BretFisher

BretFisher commented Mar 17, 2017

OK, so it has been a month since the last update.

@sebgie Do you have any news or discoveries in further testing?

It sounds like where we're at, based on what you and @mavenugo were troubleshooting, is that the issue has been narrowed down to a potential scalability issue in the SwarmKit code when:

  1. Hundreds of Services are attached to the same Overlay network, or
  2. Hundreds of Overlay networks are attached to the same Service

I'm speculating on the cause, but do those two scenarios sound accurate?

FTR @chanwit did see lots of manager-based cpu and network labor when scaling in swarm3k, but that was a scenario around node scaling, and you're doing Service + Overlay scaling on a much smaller set of nodes...

And you said that cpu/memory pressure was fine. What about network? iftop or something? Any info there?

I'm sure you've had lots of time to think up other workarounds/options, so I imagine you're way past my questions at this point, but if it's any consolation I'm happy to get on a zoom/hangout with you next week to throw around Swarm Mode architecture ideas for how to reduce the service-to-overlay-network ratio and still meet your needs until this (apparent/assumed) bottleneck is improved in code. I'm a big fan, user, and customer of Ghost and want to help make this rock.

@mavenugo

Contributor

mavenugo commented Mar 18, 2017

@BretFisher @sebgie apologies for the radio silence on this one.
Yes, as I mentioned earlier in this thread, we have a scale issue with respect to the number of services/tasks in a given network, due to the way the VIP-based load balancer is handled. Hence I suggested that @sebgie scope the services out across multiple networks to ease the load-balancer programming scale issue. That will work for @sebgie's use case since they don't need any-to-any connectivity between all the services, and the north-south connectivity can be achieved using a proxy layer.
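Concretely, that suggestion amounts to a layout roughly like the sketch below (service and network names are hypothetical, and the counts are only illustrative):

# Many small overlay networks instead of one big one; each app service joins exactly one,
# and a single proxy service joins all of them to handle north-south traffic
for n in $(seq 1 5); do
  docker network create --driver overlay "app-net-$n"
done
docker service create --name blog-1 --network app-net-1 ghost/ghost-moya:v0.11.4
docker service create --name proxy --publish 80:80 \
  --network app-net-1 --network app-net-2 --network app-net-3 \
  --network app-net-4 --network app-net-5 nginx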

But the scale issue is real and I am working on addressing it. I couldn't get that patch in for 17.04, but I am trying to get it in before 17.05. Please stay tuned.

@mancubus77

mancubus77 commented Mar 20, 2017

+1 for the issue
I have a similar setup, with several LXC containers and swarm workers/masters inside.

@pascalandy

pascalandy commented Mar 20, 2017

Since I upgraded from 1.13.x to 17.x.x it takes about 2m30 to deploy one Ghost instance (one simple docker service create). I'm running 4 nodes, 3 managers + 1 worker. The cluster is not busy and has plenty of memory.

The main issue is the fact that Swarm keeps the status of the service as "Starting" for 2-3 minutes.

The result of looping docker service ps ghostname (sleep 2 between each check):

Pikwi >  Service status: Starting
Pikwi >  Container for service g99999001-martine-nadal-ghost is not running yet. Retrying in 2 secs... (68)
[  => ] 100%

Pikwi >  Service status: Starting
Pikwi >  Container for service g99999001-martine-nadal-ghost is not running yet. Retrying in 2 secs... (69)
[  => ] 100%

Pikwi >  Service status: Starting
Pikwi >  Container for service g99999001-martine-nadal-ghost is not running yet. Retrying in 2 secs... (70)
[  => ] 100%

Pikwi >  Service status: Running
Running

As deploys became so slow, I even created a function to monitor this behaviour. Here is the result.

GhostUpApp: 3 min and 19 sec elapsed. (2017-03-20_17H00_21 <> 2017-03-20_17H02_57)

Important note: it does this only with Ghost. Portainer, nginx, and caddy run normally.

@thaJeztah

Member

thaJeztah commented Mar 20, 2017

@pascalandy I just tried running a ghost service, and it was active almost instantly (but with just the bare-bones options: docker service create --name ghost -p 8080:2368 ghost); are there any additional options you're passing?

@pascalandy

pascalandy commented Mar 20, 2017

Just had a flash! I think the healthcheck might be interfering... Not sure how to validate this.
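One way to check this (a sketch; <your-ghost-image> and <container-id> are placeholders, and --no-healthcheck disables any healthcheck baked into the image):

# See whether the image defines a HEALTHCHECK and what a running container reports for it
docker inspect --format '{{json .Config.Healthcheck}}' <your-ghost-image>
docker inspect --format '{{json .State.Health}}' <container-id>
# To rule the healthcheck out, create a test service with it disabled
docker service create --name ghost-nohc --no-healthcheck <your-ghost-image>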

@pascalandy

pascalandy commented Mar 20, 2017

If it helps, when I update the image, it's really fast.

ENV_SERVICE=blog100
ENV_IMG=user/ghost:0.11.7-alpine-2
docker service update --image $ENV_IMG $ENV_SERVICE
@pascalandy

pascalandy commented Mar 31, 2017

When I first tried to run 1,000 containers in a service, I hit a limitation that resulted in the Docker swarm becoming unresponsive, with occasional connectivity interruptions. When inspecting this, I found many messages like the following one in the output of dmesg:
neighbour: arp_cache: neighbor table overflow!

It takes about ten minutes to start up all the (1000) containers

via https://blog.codeship.com/running-1000-containers-in-docker-swarm/
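If you hit that message, a quick sanity check is to compare the live neighbour-table size against the gc thresholds (the same sysctls tuned earlier in this thread):

# Current IPv4 neighbour-table size versus the garbage-collection thresholds
ip -4 neigh show | wc -l
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3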

@cirocosta

Contributor

cirocosta commented Aug 28, 2017

Hey, we've been running ~2k swarm services across a 3-manager, 20-worker setup (~1k overlay networks) for some weeks. One of these networks has ~1k services attached to it (we're on 17.07.0-ce-rc1 on most of the nodes, btw). So far we haven't faced any issues regarding connectivity.

We can share some metrics if you wish 👍

@chanwit

chanwit commented Aug 28, 2017

Would love to see it @cirocosta !!

@gurayyildirim

gurayyildirim commented Aug 28, 2017

@cirocosta It would be great to see as many details (how many different locations, provider, OS, manager node specs, how long it takes to create a new service, etc.) and metrics as possible.

@mustafayildirim

mustafayildirim commented Aug 29, 2017

@cirocosta Can you send me some metrics?
We tried 3 masters and 6 nodes with 20 services, but it failed.
Some network issues happened with the docker-12.06 version.

@chanwit

chanwit commented Aug 29, 2017

We tried 3 masters and 6 nodes with 20 services, but it failed.
Some network issues happened with the docker-12.06 version.

@mustafayildirim you meant 17.06?

@notsureifkevin

notsureifkevin commented Nov 17, 2017

This issue was mentioned as "known" in the most recent docker-ee release notes. However, it appears (mostly) inactive. This seems like a fairly major issue; is there any additional visibility into the work being done to address these limitations?

@fcrisciani

Contributor

fcrisciani commented Mar 23, 2018

@kcrawley just an update that work on remediating this limitation has started.
cc @ctelfer

@jaschaio

jaschaio commented May 4, 2018

I am unsure about how to get around this issue using --endpoint-mode dnsrr. The updated docs say:

It’s recommended that users create overlay networks with /24 blocks (the default) of 256 IP addresses when networks are used by services created using VIP-based endpoint-mode (the default). This is because of limitations with Docker Swarm #30820. Users should not work around this by increasing the IP block size. To work around this limitation, either use dnsrr endpoint-mode or use multiple smaller overlay networks.

How does that work in practice? Do I still attach an overlay network to the service, but when I set --endpoint-mode dnsrr I won't encounter the VIP limit of 256 anymore?

I have tried this and still can't scale past 228 replicas and/or single services.

If I inspect (with the --verbose flag) the overlay network that the --endpoint-mode dnsrr service is attached to, I can still see that each task gets its own EndpointIP on the subnet of the overlay network. So the limitation of 256 is still in place, despite dnsrr.

Maybe the docs are just not clear enough: is it safe to increase the IP block size, for example using a /20 subnet, as long as I use dnsrr mode?

@fcrisciani

Contributor

fcrisciani commented May 4, 2018

@jaschaio the limitation is due to the number of iptables/IPVS rules that have to be configured inside the namespace. The recommendation to keep a /24 subnet is a simplification of the problem. On a /24 network there are 256 IPs, of which you can use 253 (2 are the network and broadcast addresses and 1 is the default gateway). For this reason, keeping a /24 effectively prevents you from having more than 253 containers, which bounds the number of rules that need to be configured. Now, dnsrr mode does not configure the load balancer, so you don't have the restriction on the number of rules, but if you keep using a /24 network the number of IPs will clearly remain limited to 253. If you want more IPs you need a bigger subnet.
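Put concretely, that means something like the sketch below for this case (network name and subnet are only examples): with dnsrr no VIP/IPVS rules are programmed, and the larger subnet simply provides more task IPs. Note that dnsrr services cannot publish ports through the ingress routing mesh.

# Larger address space plus DNS round-robin endpoint mode (no VIP / IPVS programming)
docker network create --driver overlay --subnet 10.5.0.0/20 big-net
docker service create --name web --network big-net --endpoint-mode dnsrr --replicas 500 nginx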
