swarm node can't participate in swarm due to grpc size limit #39160

Open
VynDragon opened this issue Apr 30, 2019 · 0 comments
VynDragon commented Apr 30, 2019

Description
A docker node can't participate in the swarm because of gRPC message size errors that followed a large number of task failures caused by a failing network card. I am not able to delete the node or manage the runaway tasks' history. There seems to be no way to recover from this situation.

Steps to reproduce the issue:

  1. have a working swarm node
  2. a large number of task instances fail on the same node (see the sketch below)
  3. gRPC messages are now too large
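
A rough sketch of step 2 (the service name and node hostname are placeholders, not taken from this setup): a service pinned to one node whose task always exits non-zero keeps generating failed task records, which is the kind of history growth described here.

  # run on the manager; <node-hostname> is the node that should accumulate failures
  docker service create \
    --name always-failing \
    --constraint node.hostname==<node-hostname> \
    --restart-condition any \
    --restart-delay 1s \
    busybox false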

Describe the results you received:
gRPC messages are too large; this makes most of docker unusable on older versions and leaves the node unable to participate in the swarm on newer versions.

Describe the results you expected:
The docker node works normally after a large number of task failures, or there is a way to manage the leftover task instances so as to reduce the size of the gRPC message and fix the problem.

Additional information you deem important (e.g. issue happens only occasionally):
I encountered the issue when a large number of tasks failed and 'docker service ls' stopped working with "grpc: received message larger than max (x vs. 4194304)". I fixed that by upgrading docker. Meanwhile, one node couldn't seem to do anything during that time, and it is now repeatedly throwing "level=error msg="agent: session failed" backoff=100ms error="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (6057386 vs. 4194304)"" while trying to participate in the swarm.
Attempting to delete the node gives "Error response from daemon: rpc error: code = Unknown desc = raft: raft message is too large and can't be send".
I tried "docker swarm update --task-history-limit 0" to clear the task instance history, but that didn't work.
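
For reference, a condensed sketch of the commands and errors described above, as run on the manager (the docker node rm target is reconstructed from the node's ID in the docker info output below; output abridged). The 4194304 in the errors is gRPC's default 4 MiB maximum message size.

  docker service ls
  # grpc: received message larger than max (x vs. 4194304)   (before the upgrade)

  docker node rm ron8p90h8f2u4anntfcnucsu9
  # Error response from daemon: rpc error: code = Unknown desc = raft: raft message is too large and can't be send

  docker swarm update --task-history-limit 0
  # accepted, but the node keeps logging the "agent: session failed ... ResourceExhausted" error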

Output of docker version:
Master

Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:44:24 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Node:

Client:
 Version:           18.09.5-ce
 API version:       1.39
 Go version:        go1.12.3
 Git commit:        e8ff056dbc
 Built:             Fri Apr 12 08:22:13 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.5-ce
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.12.3
  Git commit:       e8ff056dbc
  Built:            Fri Apr 12 08:21:24 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:
Master

Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 520
Server Version: 18.09.5
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 618
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: zeenic2wt2gnvnlgpals2llbv
 Is Manager: true
 ClusterID: mmee22mjudztn5ss2ejptr6o9
 Managers: 1
 Nodes: 16
 Default Address Pool: 10.0.0.0/8  
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 2
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.5.4
 Manager Addresses:
  192.168.5.4:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-137-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.8GiB
Name: Brutus
ID: S7GW:Q6CX:GXGM:UPJ4:6P6V:HUZC:5SUG:ISCP:JMRG:VMXG:4AFO:3ZRI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Node:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 18.09.5-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: ron8p90h8f2u4anntfcnucsu9
 Is Manager: false
 Node Address: 192.168.5.203
 Manager Addresses:
  192.168.5.4:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb.m
runc version: 69ae5da6afdcaaf38285a10b36f362e41cb298d6
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 5.0.9-arch1-1-ARCH
Operating System: Arch Linux
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 23.53GiB
Name: DockerHost-Ubuntu1404-x86-64-3
ID: TJZY:HQCG:MLMJ:KRPQ:LI3C:CY4Q:57KA:FN2C:VLHS:EOTU:EELR:RZL7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
All physical; the master is running Ubuntu 16.04 and the node is running an up-to-date Arch Linux.

VynDragon changed the title from "swarm node can't stay in swarm due to grpc size limit" to "swarm node can't participate in swarm due to grpc size limit" on Apr 30, 2019