[swarm] corrupted manager not able to leave a cluster with --force #25432

Closed
rogaha opened this Issue Aug 5, 2016 · 67 comments

Comments

@rogaha
Contributor

rogaha commented Aug 5, 2016

Output of docker version:

Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 23:54:00 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 23:54:00 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 1.12.0
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 15
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay host null
Swarm: active
 NodeID: 2jml8zh2ap8gnw3ghchc03g09
 Error: rpc error: code = 2 desc = raft: no elected cluster leader
 Is Manager: true
 ClusterID:
 Managers: 0
 Nodes: 0
 Orchestration:
  Task History Retention Limit: 0
 Raft:
  Snapshot interval: 0
  Heartbeat tick: 0
  Election tick: 0
 Dispatcher:
  Heartbeat period: Less than a second
 CA configuration:
  Expiry duration: Less than a second
 Node Address: 192.168.99.100
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.16-boot2docker
Operating System: Boot2Docker 1.12.0 (TCL 7.2); HEAD : e030bab - Fri Jul 29 00:29:14 UTC 2016
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 995.9 MiB
Name: master1
ID: XFBC:QIPK:ZJH5:MGSZ:D6IA:32XG:TUHL:6E43:HXOQ:FVLW:OY64:HWD4
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 72
 Goroutines: 218
 System Time: 2016-08-05T09:22:12.607462985Z
 EventsListeners: 4
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
5 VirtualBox VMs cluster

Steps to reproduce the issue:

  1. Create a cluster with 2 managers and 3 workers
  2. Turn off the laptop

Describe the results you received:

docker@master1:~$ docker swarm leave
Error response from daemon: You are attempting to leave cluster on a node that is participating as a manager. The only way to restore a cluster that has lost consensus is to reinitialize it with `--force-new-cluster`. Use `--force` to ignore this message.
docker@master1:~$ docker swarm leave --force
Error response from daemon: context deadline exceeded
docker@master1:~$

Manager logs: logs1.txt

It might be related to #25395 (comment)

Describe the results you expected:

docker@master1:~$ docker swarm leave --force
Node left the swarm.
docker@master1:~$

Additional information you deem important (e.g. issue happens only occasionally):

@thaJeztah

Member

thaJeztah commented Aug 5, 2016

I think #25159 will (at least, partly) help in this situation, as it would allow running docker node rm --force <unhealthy-node> from a healthy master.
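
Once that lands, the workflow would look roughly like this (a sketch; the node ID is a placeholder):

# on a healthy manager, find the ID of the unhealthy node
docker node ls
# forcibly remove it from the cluster's membership list
docker node rm --force <unhealthy-node-id>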

The node itself would probably still think it's part of the swarm

/cc @abronan PTAL

@ventz

ventz commented Aug 10, 2016

@thaJeztah I'm having the same problem, but in a weird position:

In my case there were only 2 Docker hosts (node01 and node02), and the error propagated to both. I decided to re-init the cluster, and on node02 I'm hitting the problem above: I can't join an existing cluster, and I can't leave or force-leave. Meanwhile, node01 has already left the cluster...

So there's no healthy node left to kick out the other one. How can I manually clean up node02 so that I can re-join it to a new cluster?

On node02 I'm getting:

# docker swarm leave
Error response from daemon: context deadline exceeded

# docker swarm leave --force
Error response from daemon: context deadline exceeded

and on node01 -- there is no cluster.

@ventz

ventz commented Aug 10, 2016

Follow-up: I ended up removing everything from /var/lib/docker/swarm/* and restarting Docker, on both systems. That seems to have unjoined them. Not sure if this is the "correct" way (if there even is one at this point for this bug? :))
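
For reference, the cleanup was roughly the following (a sketch, assuming a systemd-managed Docker service; it wipes all local swarm state, so keep a backup):

# stop the daemon before touching its state directory
sudo systemctl stop docker
# optional: keep a copy of the old swarm state, just in case
sudo cp -ar /var/lib/docker/swarm /tmp/swarm.bak
# remove the corrupted swarm state
sudo rm -rf /var/lib/docker/swarm
# start the daemon again; the node comes back up outside any swarm
sudo systemctl start docker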

@abronan

Contributor

abronan commented Aug 11, 2016

@rogaha @ventz Just checking, but are the VMs changing IPs somehow upon restart/sleep? Manager IPs must be stable in this release, and I suspect that wasn't the case for your setups. The managers restart with different IPs and lose contact with the other managers, which breaks quorum. The only remaining solution in that case is to restart the cluster using --force-new-cluster on one of the managers and add the other nodes back to the cluster with join.

The administration guide provides information on how to keep a healthy set of managers.
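
For reference, the recovery flow looks roughly like this (a sketch; the advertise address and join token below are placeholders for your own values):

# on the one surviving/chosen manager: rebuild a single-node cluster from its local state
docker swarm init --force-new-cluster --advertise-addr 192.168.99.100

# print fresh join tokens for the other nodes
docker swarm join-token manager
docker swarm join-token worker

# on each of the other nodes: rejoin with the appropriate token
docker swarm join --token <token> 192.168.99.100:2377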

@ventz

ventz commented Aug 11, 2016

@abronan Not in my case.
I do have 2 IPs on the system (one is the primary/routed L3 IP, and the second is an L2 IP used for an NFS share).

I had actually set a systemd override to listen on 0.0.0.0, and later made it specific to just the L3 IP. I did this because at one point, when I joined node02 as a manager or worker, it picked the L2 IP.

Removing everything in /var/lib/docker/swarm and restarting fixed it for me. Obviously it wiped all the cluster info, but it restored the node so I could start again.

One thing that's still weird, however: when I do "docker info" on node02, it shows "Node Address: L2IP" instead of the main routable IP. Even when I join with "--listen-addr L3IP", it still shows the L2 IP in "docker info".

The other question, which might be related: with 2 nodes, if you join both as managers, "docker node ls" shows one as "Leader" and the other as "Reachable". But if you join the second as a worker and then promote it, it never shows "Reachable".

@eungjun-yi

Contributor

eungjun-yi commented Aug 22, 2016

I have the same issue.

After restarting my Docker manager node, which had exited abnormally because the disk was full, docker swarm leave, docker swarm leave --force, and docker swarm init --force-new-cluster all fail with an Error response from daemon: context deadline exceeded error.

The only solution I found was removing /var/lib/docker and restarting the Docker daemon.

@makosblade

makosblade commented Sep 9, 2016

Same symptoms here on 1.12.1 on CentOS 7

@achautha

achautha commented Sep 21, 2016

I am facing the same issue on Amazon EC2 Ubuntu AMI.


ubuntu@vm-swarm-manager:~$ sudo docker swarm leave --force
Error response from daemon: context deadline exceeded

@thaJeztah

Member

thaJeztah commented Sep 21, 2016

ping @aluzzardi PTAL

@Vanuan

Vanuan commented Sep 26, 2016

I have the same issue.

Restarting the manager results in this:

docker node ls
Error response from daemon: rpc error: code = 2 desc = raft: no elected cluster leader

@alexellis

Contributor

alexellis commented Sep 27, 2016

+1 getting the same issue on my RPi.

Error response from daemon: context deadline exceeded

@rogaha

Contributor

rogaha commented Sep 29, 2016

It seems to be fixed on 1.12.2-rc1.

@rogaha rogaha closed this Sep 29, 2016

@rogaha

Contributor

rogaha commented Sep 29, 2016

I wasn't able to reproduce it.

@nsamala

nsamala commented Sep 29, 2016

I'm having the same issue on a rebooted EC2 instance. The manager node can't leave the swarm.

@rogaha

Contributor

rogaha commented Sep 29, 2016

@nsamala are you running the latest version 1.12.2-rc1?

@aluzzardi

Member

aluzzardi commented Sep 29, 2016

@alexellis Would you be able to reproduce this on 1.12.2-rc1 please?

/cc @LK4D4 @aaronlehmann

@rajkumar49

rajkumar49 commented Sep 30, 2016

Hi, how do I upgrade the Docker engine to version 1.12.2-rc1?

@rajkumar49

rajkumar49 commented Sep 30, 2016

I am also facing the same issue in Docker 1.12.1.

@thaJeztah

Member

thaJeztah commented Sep 30, 2016

@rajkumar49 1.12.2-rc1 can be found here: https://github.com/docker/docker/releases/tag/v1.12.2-rc1. Installation instructions depend on what platform / distro you're running. If you're on Linux, you can use the curl -fsSL .... instructions from there. For Docker Toolbox, downloads are here: https://github.com/docker/toolbox/releases/tag/v1.12.2-rc1, and for Docker for Mac/Windows it's included in the beta channel: https://docs.docker.com/docker-for-mac/

@nsamala

nsamala commented Oct 4, 2016

@rogaha Sorry for the late response. No, this was on 1.12.1.

@rajkumar49

rajkumar49 commented Oct 5, 2016

I think we still have this issue in the Docker 1.12.2 RC.


@tonistiigi

Member

tonistiigi commented Oct 6, 2016

Can you send a SIGUSR1 signal to the daemon process and send us the stack trace if context deadline exceeded happens on leave? Master builds already print the stack trace to the logs automatically when this happens.
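
For example (a sketch, assuming the daemon binary is dockerd and it logs to the systemd journal):

# ask the daemon to dump its goroutine stack traces into its log
sudo kill -USR1 $(pidof dockerd)
# then collect the dump from the daemon log
sudo journalctl -u docker.service --no-pager | tail -n 200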

@DiegoGallegos4

DiegoGallegos4 commented May 15, 2017

The problem is still present on Docker version 17.03.0-ce, build 60ccb22. Same symptoms as before:

  • Restarted machine
  • Nodes changed IP
  • Nodes lost communication

@Vanuan

Vanuan commented May 15, 2017

@DiegoGallegos4 This is expected behavior. I was confused too. You are not supposed to restart any manager if you have 2 of them. If you have 3 of them, you can restart 1. If you have 5 managers, you can restart 2. And so on: the number of managers you can restart = ceil(number of managers / 2) - 1.
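
Put differently, this is just the raft quorum rule: quorum = floor(N/2) + 1, so you can lose N - quorum = ceil(N/2) - 1 managers at once. For example:

Managers (N)   Quorum   Managers you can restart/lose at once
1              1        0
2              2        0
3              2        1
5              3        2
7              4        3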

@DiegoGallegos4

DiegoGallegos4 commented May 15, 2017

@Vanuan I have 3 managers and 3 workers.

@Vanuan

Vanuan commented May 15, 2017

So you can only restart one machine at a time. Probably you restarted 2 or 3 of them at the same time.

@DiegoGallegos4

DiegoGallegos4 commented May 15, 2017

I restarted my machine and the IPs changed; you cannot assign static IPs to a docker-machine VM.

@aaronlehmann

Contributor

aaronlehmann commented May 15, 2017

There should not be a problem with restarting multiple managers at once, other than a temporary interruption in service. However, changing IP addresses can cause problems. Neither overlay networking nor the consensus protocol can handle IP address changes. It's something we're looking at improving.

@markserrano915

markserrano915 commented May 17, 2017

I have the same problem. I had to run the following command to force my remaining manager to leave the swarm.

sudo docker swarm leave --force

I am running a swarm of Raspberry Pis with 4 nodes, 2 of which are managers. The other manager crashed, and the remaining manager started giving me the error that's been described here:

Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

So regarding the suggested solutions:

  • Do not restart your managers if there is an even number of managers. However, I didn't restart any managers; the other manager just crashed unexpectedly, so this solution doesn't apply to the scenario I just encountered.
  • Start with an odd number of managers. Let's assume there are 3 managers and one crashes; you're left with two. Then a second one crashes (just like I experienced). Aren't you potentially reproducing the issue again?

@thaJeztah

Member

thaJeztah commented May 17, 2017

@markserrano915 never run with two managers in your swarm; you effectively double the chance you lose control over your cluster (when you have two managers, there's twice as much chance that a manager crashes or is unavailable). Either use 1 or 3, never two; see https://docs.docker.com/engine/swarm/admin_guide/#add-manager-nodes-for-fault-tolerance
and https://docs.docker.com/engine/swarm/raft/

@markserrano915

markserrano915 commented May 17, 2017

@thaJeztah thanks for the links. They are informative.

The only reason I had two was that I accidentally joined the other node as a manager rather than a worker (I was playing with my Pis). So it was a big surprise when this issue bit me; I never expected it would be a big deal.

In hindsight, before I encountered the bug I was setting up a ZooKeeper ensemble and wondered why Docker didn't recommend an odd number of managers as well. It turns out I just hadn't read enough.

@markserrano915

markserrano915 commented May 21, 2017

@ventz Funny enough, I encountered the bug again after I switched to a new router. I have three managers this time, but the second node will neither join nor leave.

Removing the files in /var/lib/docker/swarm and restarting Docker did the trick. Before doing that I tried just clearing state.json and docker-state.json from the directory, which still had the IPs of the old nodes (from before I switched routers), but that didn't work. I had to take out the whole contents of the swarm directory.

@Vanuan

Vanuan commented May 22, 2017

@markserrano915 you can't just make a manager leave; you have to demote it first. I think Docker should print some message when you're trying to make a manager node leave.
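
The supported order of operations looks roughly like this (a sketch; the node name is a placeholder):

# on a healthy manager: demote the manager you want to remove
docker node demote <node-name>
# on that node itself: now it can leave as a worker
docker swarm leave
# back on a manager: remove the stale entry from the node list
docker node rm <node-name>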

@ventz

ventz commented May 22, 2017

@Vanuan Technically you can - it's just that your swarm is gone at that point. I think this is a bit of the confusion around swarm, or at least was, with the initial releases. This is the reason you need the "--force" on the manager, to indicate that you are about to smash it.

@markserrano915 I think at this point all of this should have been (mostly) fixed in the latest release(s). You ran into the swarm raft "split-brain" issue (with the 2 managers). But as you said, the cleanup is still vital when things end up in an error state. I have also seen this happen a few releases ago. I am not sure what exactly caused it, but something crashed, and from that point on I could not establish a new swarm without cleaning up the dir. Sadly, it has become a "normal step" for me at this point: whenever I upgrade or re-create the swarm, I 1) shut down Docker, 2) clean up the swarm dir, and 3) start Docker again. This guarantees a clean state. I remember talking to someone from the dev team that this really was an issue earlier on, but it should be fixed at this point. It is possible that there is an edge case that still produces it.

@Vanuan

Vanuan commented May 22, 2017

@ventz

  • Technically you can
  • the clean up is still vital

If manual cleanup is required, then I think this behavior is not expected.
"Smashing" the swarm is only supported for a single-node swarm:

You can use the --force option on a manager to remove it from the swarm. However, this does not reconfigure the swarm to ensure that there are enough managers to maintain a quorum in the swarm. The safe way to remove a manager from a swarm is to demote it to a worker and then direct it to leave the quorum without using --force. Only use --force in situations where the swarm will no longer be used after the manager leaves, such as in a single-node swarm.

Alas, Docker doesn't enforce this.

@Fabryprog

Fabryprog commented May 23, 2017

I have the same problem. The manager is not running (but the swarm is), and the worker is still inside the swarm!

docker swarm leave --force
Error response from daemon: context deadline exceeded

Server Version: 17.05.0-ce

@cristhianbicca

cristhianbicca commented May 23, 2017

I also had the same problem. I have 3 manager servers and 2 workers; when trying to remove a worker from the cluster I got the error mentioned below.

root@worker03: ~ # docker swarm leave
Error response from daemon: context deadline exceeded

Client:
Version: 17.04.0-ce
API version: 1.28
Go version: go1.7.5
Git commit: 4845c56
Built: Wed Apr 5 19:28:09 2017
OS/Arch: linux/amd64

Server:
Version: 17.04.0-ce
API version: 1.28 (minimum version 1.12)
Go version: go1.7.5
Git commit: 4845c56
Built: Wed Apr 5 19:28:09 2017
OS/Arch: linux/amd64
Experimental: false

@Fabryprog

Fabryprog commented May 23, 2017

I found the following solution ("manual cleanup"):

  1. sudo service docker stop
  2. sudo rm -Rf /var/lib/docker/swarm
  3. sudo service docker start

@cristhianbicca

cristhianbicca commented May 23, 2017

Thanks @Fabryprog

I posted it here just so people can see that more people are hitting this error.

@aaronlehmann

Contributor

aaronlehmann commented May 24, 2017

I think there's a good chance docker/swarmkit#2203 will fix this.

@rogaha

Contributor

rogaha commented May 24, 2017

👍

@emillynge

emillynge commented May 29, 2017

@Fabryprog that solution will wipe any secrets stored in the swarm, along with some other state.
Provided the swarm isn't locked, a more precise cleanup is:

# using systemd
sudo systemctl stop docker
# make sure to make a backup if you delete something wrong
sudo cp -ar /var/lib/docker/swarm/ /tmp/swarm.bak
sudo nano /var/lib/docker/swarm/state.json

state.json will look something like this:

[{"node_id":"nodeidofhealthynode","addr":"123.123.123.123:2377"},
{"node_id":"nodeidofunhealthynode","addr":"123.123.123.124:2377"}]

You want to delete any entries for unhealthy nodes, so that just one healthy manager node is left:

[{"node_id":"nodeidofhealthynode","addr":"123.123.123.123:2377"}]

Lastly, restart Docker:

sudo systemctl start docker

@gaui

gaui commented Jan 8, 2018

I removed a worker node from the manager before leaving the swarm on the worker; then I got this error when I tried docker swarm leave -f on the worker node.
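
For clarity, the sequence that leads to it is roughly this (a sketch; the node name is a placeholder):

# on the manager: remove the worker while it is still a swarm member (--force may be needed if the node is not down)
docker node rm <worker-node>
# on the worker afterwards: this now fails with "context deadline exceeded"
docker swarm leave -f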

@helmesjo

helmesjo commented Jan 22, 2018

@gaui Same procedure here, and (thankfully) the bug is very consistent with these steps.

@ghost

ghost commented Apr 1, 2018

Happening again with 17.12.1-ce

docker swarm leave -f

^C
root@ip-172-16-1-49:~# journalctl -fu docker.service
-- Logs begin at Mon 2018-03-12 03:55:26 UTC. --
Apr 01 03:28:43 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:28:43.284147517Z" level=info msg="Node ecb719557234/172.16.1.23, left gossip cluster"
Apr 01 03:28:43 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:28:43.284195325Z" level=info msg="Node ecb719557234 change state NodeActive --> NodeFailed"
Apr 01 03:28:43 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:28:43.284336494Z" level=info msg="Node ecb719557234/172.16.1.23, added to failed nodes list"
Apr 01 03:28:49 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:28:49.020277949Z" level=info msg="Node 9cff3cda8d33 change state NodeActive --> NodeLeft"
Apr 01 03:28:49 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:28:49.020646579Z" level=info msg="ip-172-16-1-49(188b594110f2): Node leave event for 9cff3cda8d33/172.16.1.176"
Apr 01 03:28:49 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:28:49.420070945Z" level=info msg="Node 9cff3cda8d33/172.16.1.176, left gossip cluster"
Apr 01 03:29:05 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:29:05.203619069Z" level=info msg="NetworkDB stats ip-172-16-1-49(188b594110f2) - netID:4bjz2q0omsoxvqumhdg9xbpgx leaving:false netPeers:1 entries:4 Queue qLen:0 netMsg/s:0"
Apr 01 03:29:05 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:29:05.203655148Z" level=info msg="NetworkDB stats ip-172-16-1-49(188b594110f2) - netID:7png237q3h63gaeupqicq5jeq leaving:false netPeers:1 entries:6 Queue qLen:0 netMsg/s:0"
Apr 01 03:34:05 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:34:05.203706877Z" level=info msg="NetworkDB stats ip-172-16-1-49(188b594110f2) - netID:7png237q3h63gaeupqicq5jeq leaving:false netPeers:1 entries:6 Queue qLen:0 netMsg/s:0"
Apr 01 03:34:05 ip-172-16-1-49 dockerd[1533]: time="2018-04-01T03:34:05.203802024Z" level=info msg="NetworkDB stats ip-172-16-1-49(188b594110f2) - netID:4bjz2q0omsoxvqumhdg9xbpgx leaving:false netPeers:1 entries:4 Queue qLen:0 netMsg/s:0"

@rbucker

rbucker commented Apr 10, 2018

@eungjun-yi deleting /var/lib/docker might not be a good idea on RancherOS.

UPDATE: it might be under /var/lib/system-docker

@matteocng

matteocng commented May 6, 2018

Happened again on a "single-node Swarm", version: 18.03.1-ce.

I can post any logs from the system if useful, but unfortunately I can't reproduce this on purpose.

Context

  • The whole machine hung and had to be rebooted (we're investigating the cause; it may be totally unrelated to Docker).
  • After restarting it, docker stats worked but no containers were being created, with the error: no suitable node (1 node not available for new tasks) (see the sketch after this list).
  • We had to execute docker swarm leave -f multiple times for it to work.
  • We use "restrictions" and "limits".
  • We execute docker service update --image every few minutes on all the services. Since some services don't yet have images available, our sudo journalctl -fu docker.service is full of this: level=error msg="fatal task error" error="No such image:.
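
For anyone hitting the same "no suitable node" error, checking the node's availability is a reasonable first step (a generic sketch, not necessarily the fix for this particular bug; the node ID is a placeholder):

# see whether the single node is reported as Down or Drain
docker node ls
# if it is drained, make it schedulable again
docker node update --availability active <node-id>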

System

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:38 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:45 2018
  OS/Arch:      linux/amd64
  Experimental: false

docker-compose version 1.21.0-rc1, build 1d32980
Kernel: 4.13.0-39-generic

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 17.10
Release:	17.10
Codename:	artful

Docker info

Containers: 387
 Running: 20
 Paused: 0
 Stopped: 367
Images: 46
Server Version: 18.03.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 992
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: jejb6my7n50ulnilktgd2fxof
 Is Manager: true
 ClusterID: 0yus20uq607uzugzqbv9a1vzr
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.2.102
 Manager Addresses:
  192.168.2.102:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-39-generic
Operating System: Ubuntu 17.10
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 23.54GiB
Name: linuxcompany
ID: FLM4:OCTS:BWQR:ZLRQ:HVGG:VBSW:O5NE:IA2W:4Z6T:SS47:4BSE:A6ZT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: daybreakhotels
Registry: https://index.docker.io/v1/
Labels:
 provider=generic
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Other logs

The two services in the logs below are configured to have 2 replicas each; they are visible in docker service ls but are never created because we currently don't have images for them. Probably unrelated, but including it just in case.

-- Logs begin at Mon 2018-04-16 06:59:12 CEST. --
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279477606+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha2.zzzzcqo6e92w9tgiedonjphg1" error="No such container: ourwebapp_webtest_alpha2.zzzzcqo6e92w9tgiedonjphg1" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279481142+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_beta.2.zzy7upc4qtbwls8ni53c086tx" error="No such container: ourwebapp_webtest_beta.2.zzy7upc4qtbwls8ni53c086tx" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279496756+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha1.ja6p9qhvuemnkusgqa9sxdtdr" error="No such container: ourwebapp_webtest_alpha1.ja6p9qhvuemnkusgqa9sxdtdr" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279503004+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha2.zzzsnzby1ew5b1zrz632158q7" error="No such container: ourwebapp_webtest_alpha2.zzzsnzby1ew5b1zrz632158q7" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279551691+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_alpha2.g38yg4rd0ftz464420ehj32cf" error="No such container: ourwebapp_webtest_alpha2.g38yg4rd0ftz464420ehj32cf" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w
May 07 00:09:24 linuxcompany dockerd[4572]: time="2018-05-07T00:09:24.279562380+02:00" level=warning msg="failed to deactivate service binding for container ourwebapp_webtest_beta.2.zzzxuf7vqy4n1b9ni32h13tdi" error="No such container: ourwebapp_webtest_beta.2.zzzxuf7vqy4n1b9ni32h13tdi" module=node/agent node.id=j5rdwstrbw2534pfiq6xomt0w

@alexanderkjeldaas

alexanderkjeldaas commented Jul 21, 2018

Can't leave with --force on 18.06.0-ce.

@blackknight36

blackknight36 commented Aug 3, 2018

@alexanderkjeldaas I'm seeing the same issue. After rebooting a host, docker node ls will show the host as "Down" even though it's online.

[root@docker-srv2 log]# docker node ls
ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
md5wdxhtlqvx7io56puoe4i5h     docker-srv1   Ready     Active         Reachable        18.06.0-ce
mpuw8k0hc7nedmz38owjp7rv4 *   docker-srv2   Down      Active         Reachable        18.06.0-ce
uzto40aetpqoy0jhkm9i0n6n9     docker-srv3   Ready     Active         Leader           18.06.0-ce

The only solution I've found is to leave the cluster and then rejoin.
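
The leave-and-rejoin steps are roughly these (a sketch; IDs, tokens, and addresses are placeholders):

# on the node that shows as "Down": leave the swarm
docker swarm leave --force
# on a healthy manager: remove the stale node entry
docker node rm <old-node-id>
# still on the manager: print a fresh join token ('manager' or 'worker', depending on the node's role)
docker swarm join-token worker
# back on the rebooted node: rejoin the swarm
docker swarm join --token <token> <manager-ip>:2377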
