
docker-17.06 "docker swarm leave -f" command returns "context deadline exceeded" error #35337

Open
isula-xufeng opened this issue Oct 31, 2017 · 11 comments


@isula-xufeng

isula-xufeng commented Oct 31, 2017

Description

I have a test case with one manager node and two worker nodes. Sometimes, after the worker nodes
have left the cluster, running "docker swarm leave -f" on the manager node returns a "context deadline exceeded" error.
This issue is hard to reproduce. Searching the Internet suggests that this error can be caused by a lost quorum, but I have only one manager node, so I don't think that is the case here.

Steps to reproduce the issue:
This is my test case, but it does not reproduce the problem consistently:

  1. use "docker swarm init" and "docker swarm join" to create a cluster with one manager and two workers
  2. use "docker service create --replicas 3 --name servicetest --limit-cpu=2" to create a service with 3 replicas
  3. use "docker service update servicetest --limit-cpu=6" to update the service
  4. use "docker service update servicetest --rollback" to roll the service back
  5. run "docker swarm leave -f" on the worker nodes to leave the cluster
  6. run "docker swarm leave -f" on the manager node to shut down the cluster

Describe the results you received:
Running "docker swarm leave -f" on the manager node returns a "context deadline exceeded" error.

Describe the results you expected:
Running "docker swarm leave -f" on the manager node returns no error.

Additional information you deem important (e.g. issue happens only occasionally):
The issue happens only occasionally and is not easy to reproduce.
These are the docker logs from when the "context deadline exceeded" error happened:
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.542736621-04:00" level=debug msg="Calling POST /v1.30/swarm/leave?force=1"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.542816371-04:00" level=debug msg="(*worker).Assign" len(assignments)=0 module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.542839748-04:00" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=0 module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.542860965-04:00" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=0 module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.542879584-04:00" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=0 module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.543129972-04:00" level=debug msg="Sending kill signal 15 to container 2b7fd13239f41743c04db1b9bf8b1355acc58d705d29c64ac4c10105bad84c6a"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.544004363-04:00" level=debug msg="Sending kill signal 15 to container 63548d2bcab3d5105083a3f77bde1c6c125716bcf1af942ab6493d90904bb45e"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.545029229-04:00" level=debug msg="devmapper: DeleteDevice START(hash=8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6 syncDelete=false)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.545053967-04:00" level=debug msg="devmapper: deactivateDevice START(8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.545069405-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6 task.id=oviw2qlw21i78wh276bdb3dw0
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.545113875-04:00" level=debug msg="devmapper: deactivateDevice END(8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.545523932-04:00" level=debug msg="task status reported" module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.547735861-04:00" level=debug msg="devmapper: unregisterDevice(8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.547976964-04:00" level=debug msg="devmapper: DeleteDevice END(hash=8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6 syncDelete=false)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.548617958-04:00" level=debug msg="devmapper: DeleteDevice START(hash=8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6-init syncDelete=false)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.548639245-04:00" level=debug msg="devmapper: deactivateDevice START(8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6-init)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.548671993-04:00" level=debug msg="devmapper: deactivateDevice END(8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6-init)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.551769300-04:00" level=debug msg="devmapper: unregisterDevice(8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6-init)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.552036514-04:00" level=debug msg="devmapper: DeleteDevice END(hash=8ce100d43e1ffe74220558b760e5dbff3cfe8400a4eae19340d85e14d0271fd6-init syncDelete=false)"
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.631897057-04:00" level=debug msg="task status updated" method="(*Dispatcher).processUpdates" module=dispatcher node.id=lzqtqs59vhxcqhle15z091xo6 state.transition="RUNNING->RUNNING" task.id=oviw2qlw21i78wh276bdb3dw0
Oct 31 06:12:04 swarm1 dockerd[120833]: time="2017-10-31T06:12:04.802769282-04:00" level=debug msg="memberlist: Stream connection from=9.81.254.167:39526"
Oct 31 06:12:08 swarm1 dockerd[120833]: time="2017-10-31T06:12:08.028012513-04:00" level=debug msg="update of service tsm4qskk9cesmqmaq4ml4zw69 complete" module=node node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.173887528-04:00" level=info msg="Container 2b7fd13239f41743c04db1b9bf8b1355acc58d705d29c64ac4c10105bad84c6a failed to exit within 10 seconds of signal 15 - using the force"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.174448528-04:00" level=debug msg="Sending kill signal 9 to container 2b7fd13239f41743c04db1b9bf8b1355acc58d705d29c64ac4c10105bad84c6a"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.194790412-04:00" level=debug msg="containerd: process exited" id=2b7fd13239f41743c04db1b9bf8b1355acc58d705d29c64ac4c10105bad84c6a pid=init status=137 systemPid=41555
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.198843228-04:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:"exit", Id:"2b7fd13239f41743c04db1b9bf8b1355acc58d705d29c64ac4c10105bad84c6a", Status:0x89, Pid:"init", Timestamp:(*timestamp.Timestamp)(0xc420d11470)}"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.199663901-04:00" level=debug msg="Revoking external connectivity on endpoint servicetest.3.13hqvvetm6iypdba97xmtpr5z (ff3e1e4fff85419823cf9a378634d2b5449ba64b068760859a2b4afb97f9effc)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.202445759-04:00" level=debug msg="DeleteConntrackEntries purged ipv4:0, ipv6:0"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.232206587-04:00" level=debug msg="Releasing addresses for endpoint servicetest.3.13hqvvetm6iypdba97xmtpr5z's interface on network bridge"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.232231924-04:00" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.3)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.233356647-04:00" level=debug msg="devmapper: UnmountDevice START(hash=23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.233380592-04:00" level=debug msg="devmapper: Unmount(/var/lib/docker/devicemapper/mnt/23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.237876010-04:00" level=debug msg="devmapper: Unmount done"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.237899319-04:00" level=debug msg="devmapper: deactivateDevice START(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.237977459-04:00" level=debug msg="devicemapper: RemoveDeviceDeferred START(docker-253:0-1837521-23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.256798965-04:00" level=debug msg="devicemapper: RemoveDeviceDeferred END(docker-253:0-1837521-23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.256820878-04:00" level=debug msg="devmapper: deactivateDevice END(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.256829947-04:00" level=debug msg="devmapper: UnmountDevice END(hash=23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.258184589-04:00" level=debug msg="state changed" module="node/agent/taskmanager" node.id=lzqtqs59vhxcqhle15z091xo6 service.id=tsm4qskk9cesmqmaq4ml4zw69 state.desired=SHUTDOWN state.transition="RUNNING->SHUTDOWN" task.id=13hqvvetm6iypdba97xmtpr5z
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.258615938-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6 task.id=13hqvvetm6iypdba97xmtpr5z
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.258722608-04:00" level=info msg="Stopping manager" module=node node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.258762090-04:00" level=error msg="closing session after fatal error" error="rpc error: code = 1 desc = context canceled" module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.258835474-04:00" level=info msg="shutting down certificate renewal routine" module="node/tls" node.id=lzqtqs59vhxcqhle15z091xo6 node.role=swarm-manager
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.258887502-04:00" level=debug msg="(*Agent).run exited" module="node/agent" node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.259106918-04:00" level=error msg="LogBroker exited with an error" error="context canceled" module=node node.id=lzqtqs59vhxcqhle15z091xo6
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.259343516-04:00" level=debug msg="stop transport"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.259383465-04:00" level=debug msg="devmapper: DeleteDevice START(hash=23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1 syncDelete=false)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.259400925-04:00" level=debug msg="devmapper: deactivateDevice START(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.259460959-04:00" level=debug msg="devmapper: deactivateDevice END(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.261862565-04:00" level=debug msg="devmapper: unregisterDevice(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.262129464-04:00" level=debug msg="devmapper: DeleteDevice END(hash=23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1 syncDelete=false)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.262683635-04:00" level=debug msg="devmapper: DeleteDevice START(hash=23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1-init syncDelete=false)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.262703757-04:00" level=debug msg="devmapper: deactivateDevice START(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1-init)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.262754216-04:00" level=debug msg="devmapper: deactivateDevice END(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1-init)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.265574302-04:00" level=debug msg="devmapper: unregisterDevice(23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1-init)"
Oct 31 06:12:13 swarm1 dockerd[120833]: time="2017-10-31T06:12:13.265815833-04:00" level=debug msg="devmapper: DeleteDevice END(hash=23ab655ec855942d26ff185bd51a12ca7faa953a6e27c4383c4c4df0cced57a1-init syncDelete=false)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.544724592-04:00" level=info msg="Container 63548d2bcab3d5105083a3f77bde1c6c125716bcf1af942ab6493d90904bb45e failed to exit within 10 seconds of signal 15 - using the force"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.544764425-04:00" level=debug msg="Sending kill signal 9 to container 63548d2bcab3d5105083a3f77bde1c6c125716bcf1af942ab6493d90904bb45e"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.56701383-04:00" level=debug msg="containerd: process exited" id=63548d2bcab3d5105083a3f77bde1c6c125716bcf1af942ab6493d90904bb45e pid=init status=137 systemPid=41881
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.573463523-04:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:"exit", Id:"63548d2bcab3d5105083a3f77bde1c6c125716bcf1af942ab6493d90904bb45e", Status:0x89, Pid:"init", Timestamp:(*timestamp.Timestamp)(0xc422109330)}"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.574681619-04:00" level=debug msg="Revoking external connectivity on endpoint servicetest.3.oviw2qlw21i78wh276bdb3dw0 (463476893e593efb2dff6c59bc841116eb61192ac6b75260b38519e396d66660)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.578023749-04:00" level=debug msg="DeleteConntrackEntries purged ipv4:0, ipv6:0"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.607376819-04:00" level=debug msg="Releasing addresses for endpoint servicetest.3.oviw2qlw21i78wh276bdb3dw0's interface on network bridge"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.607416690-04:00" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.4)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.609479288-04:00" level=debug msg="devmapper: UnmountDevice START(hash=299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.609813340-04:00" level=debug msg="devmapper: Unmount(/var/lib/docker/devicemapper/mnt/299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.614080649-04:00" level=debug msg="devmapper: Unmount done"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.614104109-04:00" level=debug msg="devmapper: deactivateDevice START(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.614183941-04:00" level=debug msg="devicemapper: RemoveDeviceDeferred START(docker-253:0-1837521-299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.636835871-04:00" level=debug msg="devicemapper: RemoveDeviceDeferred END(docker-253:0-1837521-299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.636861476-04:00" level=debug msg="devmapper: deactivateDevice END(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.636870737-04:00" level=debug msg="devmapper: UnmountDevice END(hash=299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.639441765-04:00" level=debug msg="devmapper: DeleteDevice START(hash=299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41 syncDelete=false)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.639462952-04:00" level=debug msg="devmapper: deactivateDevice START(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.639503002-04:00" level=debug msg="devmapper: deactivateDevice END(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.641750074-04:00" level=debug msg="devmapper: unregisterDevice(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.641963594-04:00" level=debug msg="devmapper: DeleteDevice END(hash=299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41 syncDelete=false)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.642471986-04:00" level=debug msg="devmapper: DeleteDevice START(hash=299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41-init syncDelete=false)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.642491388-04:00" level=debug msg="devmapper: deactivateDevice START(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41-init)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.642525553-04:00" level=debug msg="devmapper: deactivateDevice END(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41-init)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.645277462-04:00" level=debug msg="devmapper: unregisterDevice(299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41-init)"
Oct 31 06:12:14 swarm1 dockerd[120833]: time="2017-10-31T06:12:14.645478268-04:00" level=debug msg="devmapper: DeleteDevice END(hash=299e8b852728aa7451edb14757352e851e0d3d907fa82a165feaba7aaf3ffc41-init syncDelete=false)"
Oct 31 06:12:19 swarm1 dockerd[120833]: time="2017-10-31T06:12:19.542978627-04:00" level=error msg="failed to shut down cluster node: context deadline exceeded"

Output of docker version:


[root@localhost ~]# docker version
Client:
 Version:      17.06.1-ce-rc1
 EulerVersion: 17.06.0.3
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   cdd7c1a
 Built:        Wed Sep 27 22:11:28 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.1-ce-rc1
 EulerVersion: 17.06.0.3
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   cdd7c1a
 Built:        Wed Sep 27 22:11:02 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:




Additional environment details (AWS, VirtualBox, physical, etc.):
The test case runs on a KVM virtual machine.
@isula-xufeng
Author

Not sure if this is helpful, but below is the normal log (i.e., "docker swarm leave -f" ran successfully on the manager node):
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.585022775-04:00" level=debug msg="Calling POST /v1.30/swarm/leave?force=1"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.585086117-04:00" level=debug msg="(*worker).Assign" len(assignments)=0 module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.585104939-04:00" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=0 module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.585117472-04:00" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=0 module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.585128519-04:00" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=0 module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.585494547-04:00" level=debug msg="Sending kill signal 15 to container 517cc961ddb9ae27570ae466af384a35cc29a70a68540f298bdf24e8a5bb7c51"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586319090-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4 task.id=zywyykkjmuzutjc6lrpg75rs8
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586751281-04:00" level=error msg="status reporter failed to report status to agent" error="context canceled" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586772860-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4 task.id=zywyykkjmuzutjc6lrpg75rs8
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586776353-04:00" level=error msg="closing session after fatal error" error="rpc error: code = 1 desc = context canceled" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586785973-04:00" level=error msg="status reporter failed to report status to agent" error="context canceled" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586791793-04:00" level=info msg="Stopping manager" module=node node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586802715-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4 task.id=zywyykkjmuzutjc6lrpg75rs8
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586815806-04:00" level=error msg="status reporter failed to report status to agent" error="context canceled" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586827817-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4 task.id=zywyykkjmuzutjc6lrpg75rs8
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586837710-04:00" level=error msg="status reporter failed to report status to agent" error="context canceled" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586846974-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4 task.id=zywyykkjmuzutjc6lrpg75rs8
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586857500-04:00" level=error msg="status reporter failed to report status to agent" error="context canceled" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586860136-04:00" level=info msg="shutting down certificate renewal routine" module="node/tls" node.id=m88r1xrksaywtpd6c28mvwwv4 node.role=swarm-manager
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586866627-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4 task.id=zywyykkjmuzutjc6lrpg75rs8
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586876612-04:00" level=error msg="status reporter failed to report status to agent" error="context canceled" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586889062-04:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4 task.id=zywyykkjmuzutjc6lrpg75rs8
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.586904888-04:00" level=debug msg="(*Agent).run exited" module="node/agent" node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.587017633-04:00" level=error msg="LogBroker exited with an error" error="context canceled" module=node node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.587194729-04:00" level=debug msg="stop transport"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.587764482-04:00" level=debug msg="devmapper: DeleteDevice START(hash=92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb syncDelete=false)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.587789268-04:00" level=debug msg="devmapper: deactivateDevice START(92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.587888192-04:00" level=debug msg="devmapper: deactivateDevice END(92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.590646965-04:00" level=debug msg="devmapper: unregisterDevice(92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.590866488-04:00" level=debug msg="devmapper: DeleteDevice END(hash=92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb syncDelete=false)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.591451955-04:00" level=debug msg="devmapper: DeleteDevice START(hash=92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb-init syncDelete=false)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.591467814-04:00" level=debug msg="devmapper: deactivateDevice START(92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb-init)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.591501787-04:00" level=debug msg="devmapper: deactivateDevice END(92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb-init)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.595115836-04:00" level=debug msg="devmapper: unregisterDevice(92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb-init)"
Oct 31 06:08:11 swarm1 dockerd[120833]: time="2017-10-31T06:08:11.595380543-04:00" level=debug msg="devmapper: DeleteDevice END(hash=92f39d04480a6552da9eacec78b61be1aa0e70c9a6ec3ea094aba1dcfd0277eb-init syncDelete=false)"
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.587549668-04:00" level=error msg="failed to receive changes from store watch API" error="rpc error: code = 13 desc = transport is closing"
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.588152204-04:00" level=info msg="Manager shut down" module=node node.id=m88r1xrksaywtpd6c28mvwwv4
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.588527861-04:00" level=error msg="cluster exited with error: manager stopped: context canceled"
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.588980128-04:00" level=debug msg="Sending kill signal 9 to container 517cc961ddb9ae27570ae466af384a35cc29a70a68540f298bdf24e8a5bb7c51"
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.590466020-04:00" level=info msg="Node swarm1-a3fee35d01a9/9.81.254.163, left gossip cluster"
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.590543692-04:00" level=info msg="Node swarm1-a3fee35d01a9/9.81.254.163, added to failed nodes list"
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.590660075-04:00" level=debug msg="cleanupServiceBindings for "
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.608461783-04:00" level=debug msg="containerd: process exited" id=517cc961ddb9ae27570ae466af384a35cc29a70a68540f298bdf24e8a5bb7c51 pid=init status=137 systemPid=34126
Oct 31 06:08:19 swarm1 dockerd[120833]: time="2017-10-31T06:08:19.613085254-04:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:"exit", Id:"517cc961ddb9ae27570ae466af384a35cc29a70a68540f298bdf24e8a5bb7c51", Status:0x89, Pid:"init", Timestamp:(*timestamp.Timestamp)(0xc4210a60d0)}"

@isula-xufeng
Author

And after this problem happens on the manager node, running "docker swarm leave -f" again hangs forever.

@isula-xufeng
Author

Could anyone take a look at this issue?
Thanks a lot!

@thaJeztah
Member

I'm not sure this is a bug (but I understand it's inconvenient). First of all, docker swarm leave does not remove a node from the swarm; only managers are allowed to remove a node (using docker node rm). Running docker swarm leave on any node just makes that node disconnect from the swarm, but if you check the output of docker node ls on a manager node, you'll still find the node listed as part of the swarm, with a status of "unreachable" (or some other error status).

To remove a node from the swarm, use docker node rm. If you're trying to remove a manager node, it first needs to be demoted to a worker with docker node demote, then removed with docker node rm. Removing the last manager from a swarm (i.e., dissolving the swarm) will be problematic in most cases. Let me try to explain:

Manager nodes perform two tasks;

  • they are a manager
  • they are a worker node

When you run docker swarm leave -f on the manager, the swarm Raft store is not modified; the node is still registered as being part of the swarm, but is not reachable.

This is where the problem is: the node left the swarm, but it is still registered as a manager in the swarm. The Raft consensus algorithm in swarm mode (see https://docs.docker.com/engine/swarm/raft/) requires a majority of all managers for any change that is made. The moment the manager node "left the swarm", you lose control over the swarm; no changes can be made to it, no nodes can be joined or removed, and no services can be created, removed, or controlled.
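The majority requirement can be made concrete with a small sketch of the arithmetic (my own illustration, not Docker's code): Raft commits a change only once floor(n/2)+1 of the n managers acknowledge it.

```go
package main

import "fmt"

// quorum returns the minimum number of reachable managers that Raft
// needs before it will commit any change to the swarm store:
// floor(n/2) + 1 out of n managers.
func quorum(managers int) int {
	return managers/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("%d manager(s): quorum=%d, tolerated failures=%d\n",
			n, quorum(n), n-quorum(n))
	}
}
```

With one manager the quorum is 1 and the tolerated failure count is 0, so the moment that single manager force-leaves, no change can ever commit again.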

The context deadline exceeded error indicates that no communication was possible with the manager (in this case, because it left the swarm), and that you don't have control. The error is not very informative; more recent versions of Docker improved error messages in a couple of places.

If you're in this situation and want to "destroy" the swarm, you can try stopping the daemon and removing the /var/lib/docker/swarm/ directory. You can also re-initialize the swarm (which may preserve the information that was present in it) by running docker swarm init --force-new-cluster on the manager node.

Please keep in mind that the GitHub issue tracker is not intended as a general support forum,
but for reporting bugs and feature requests. For other types of questions, consider using one of the community support channels.

@isula-xufeng
Author

thaJeztah,
Many thanks for your detailed explanation!
I have a further question: could this problem happen on a single node? That is, could "docker swarm leave -f" lead to a "context deadline exceeded" error if the cluster has only one manager node?

@thaJeztah
Member

Yes, I think the situation is the same; the manager is a bit "schizophrenic" in that situation, because it still follows the same protocol and requests a majority of the managers (in this case: itself) before making changes.

It's not pretty, but I know people have looked into this, and it's a difficult one to tackle. While it would be nice to improve, it wasn't seen as a high-priority issue, given that destroying a swarm is not a standard operation and that the alternative (removing /var/lib/docker/swarm) can be used to work around the issue.

@isula-xufeng
Author

isula-xufeng commented Nov 9, 2017

thaJeztah,

After debugging, we found that this "context deadline exceeded" issue happens easily in a memory-pressured environment.
In order to let swarm exit the cluster state even in this timeout scenario (otherwise, subsequent "docker swarm leave -f" commands hang forever), we made the change below:

--- a/components/engine/daemon/cluster/noderunner.go
+++ b/components/engine/daemon/cluster/noderunner.go
@@ -260,6 +260,8 @@ func (n *nodeRunner) Stop() error {
    defer cancel()
    n.mu.Unlock()
    if err := n.swarmNode.Stop(ctx); err != nil && !strings.Contains(err.Error(), "context canceled") {
+       n.cluster.SendClusterEvent(lncluster.EventNodeLeave)
+       <-n.done
        return err
    }
    n.cluster.SendClusterEvent(lncluster.EventNodeLeave)

After this change, even when the "context deadline exceeded" error happens, we can still run "docker swarm init" successfully without removing /var/lib/docker/swarm. However, we don't know the side effects of this change, and we found that there is still a timer running in a swarm goroutine:

    components/engine/vendor/github.com/docker/swarmkit/agent/task.go:
    168             case exec.ErrTaskRetry:
    169                 // TODO(stevvooe): Add exponential backoff with random jitter
    170                 // here. For now, this backoff is enough to keep the task
    171                 // manager from running away with the CPU.
    172                 time.AfterFunc(time.Second, func() {                           <----- loop in this timer
    173                     errs <- nil // repump this branch, with no err
    174                 })
    175                 continue

We don't know why.
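For what it's worth, the time.AfterFunc in the quoted code re-arms itself: each timer send wakes the loop, and if the task still returns ErrTaskRetry another one-second timer is scheduled, so a pending timer is always visible while the task stays in the retry state. A standalone sketch of that repump pattern (repump is a made-up name for illustration; this is not swarmkit code):

```go
package main

import (
	"fmt"
	"time"
)

// repump mimics the swarmkit retry pattern quoted above: every time the
// task "fails and should be retried", a timer re-sends on errs, which
// wakes the loop again. As long as retries keep happening, a timer is
// always pending, which is why one shows up in a goroutine dump.
func repump(maxRetries int) int {
	errs := make(chan error, 1)
	errs <- nil // prime the loop, like the first task event
	retries := 0
	for retries < maxRetries {
		<-errs
		retries++
		// Each pass schedules the next wake-up; nothing stops the
		// cycle except the task leaving the retry state.
		time.AfterFunc(10*time.Millisecond, func() { errs <- nil })
	}
	return retries
}

func main() {
	fmt.Println("retries:", repump(3))
}
```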

@thaJeztah
Member

After debugging, we found that this "context deadline exceeded" issue happens easily in a memory-pressured environment.

The host running out of resources can definitely result in problems, yes.

but we don't know the side effects of this change, and we found there is still a timer running in a swarm goroutine:

I'd have to check that with @nishanttotla @anshulpundir (PTAL)

@anshulpundir
Contributor

I can't think of any side effects. What do you think, @nishanttotla?

@ricardogil

When I got this issue, I removed the folder /var/lib/docker/swarm and ran docker swarm init again.
For some reason I don't know, my Docker manager thought it was a worker, and after I removed the swarm directory it got fixed.

@snth

snth commented Jul 25, 2019

Removing /var/lib/docker/swarm didn't work for me, but removing /var/lib/docker did. Of course, that means you lose all locally stored images and have to download them again.
