
Dead tasks stuck in overlay networks after container restarts #36940

Open
mddos opened this issue Apr 24, 2018 · 4 comments

mddos commented Apr 24, 2018

Description
The issues seen in issue 35548 appear to have resurfaced in 18.03 and 18.04.

Steps to reproduce the issue:
1. Set up a swarm.
2. Break a node. To force this issue we take rabbit offline, causing its containers to die and repeatedly restart while trying to reconnect.
3. Look for dead tasks in the overlay network (see the inspection sketch below).
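
A minimal sketch for spotting the stale entries (assumes a manager node and the default ingress network; <service-name> is a placeholder):

# --verbose adds the swarm-scoped list of services, tasks and peers
# that the network still believes are attached
docker network inspect --verbose ingress

# Compare against what the orchestrator reports as actually running;
# entries in the verbose output with no matching task here are the
# "dead" tasks described in this issue
docker service ps <service-name>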

Describe the results you received:
The ingress network contains dead tasks.

Describe the results you expected:
No dead tasks in the overlay network.

Additional information you deem important (e.g. issue happens only occasionally):
We have tried Ubuntu 18 and RHEL, but with no change.

Output of docker version:

Docker version 18.04.0-ce, build 3d479c0

Output of docker info:

Containers: 25
 Running: 24
 Paused: 0
 Stopped: 1
Images: 106
Server Version: 18.04.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 409
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: hnax5jop1oflya8zwwh9wenn4
 Is Manager: true
 ClusterID: ymab3474jlamu4f7nrcen43ia
 Managers: 2
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.0.0.233
 Manager Addresses:
  10.0.0.214:2377
  10.0.0.233:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-121-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67GiB
Name: ip-10-0-0-233
ID: YLGA:RMCO:ABZZ:3QJZ:PT74:ZHOF:YAI2:575Z:7GPO:E2VJ:Y4BX:RBH3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
Hosts are AWS m4.xlarge instances.

@thaJeztah
Member

ping @ddebroy

@eduardolundgren

Any updates about this issue? @thaJeztah @ddebroy

@fbuecklers

Hi,

We are still experiencing the same issue in our Docker swarm, version 18.09.9.

Are there any updates?


blucz commented Dec 17, 2019

We have been struggling with this issue on 19.03.5 for about a month.

We think it may have first occurred as a result of OOM on a manager node. We had a second recurrence of the issue before we could address the root cause of the OOM. We have since addressed that.

At the time of both incidents we were running 19 managers. We have since reduced that to 5.

The issue hasn't recurred, but it's difficult to tell whether either or both of our attempted mitigations made a difference, or whether our swarm is still a ticking time bomb.

As of now, there are services and tasks stuck in the ingress overlay network that we can't remove. These services and tasks are not visible in the output of docker service ls, and are not visible as running containers if we run docker container ls on any node in the swarm. Attempting to remove them with docker network disconnect <containerid> results in an error message.

Does anyone know a way to manually clean up the dead tasks stuck in the overlay networks?
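
A sketch of the kind of manual cleanup that might apply here (not a confirmed fix; the endpoint name below is a placeholder taken from the output of docker network inspect --verbose ingress):

# Try to force-remove the stale endpoint; --force tells the daemon to
# drop the endpoint even though the backing container is already gone
docker network disconnect --force ingress <stale-endpoint-name>

# If the daemon refuses, restarting the engine on the affected node is
# a blunt fallback that rebuilds its overlay endpoint state
sudo systemctl restart docker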
