Added node twice will destroy swarm #34722

Open
Fank opened this issue Sep 4, 2017 · 12 comments

@Fank

Fank commented Sep 4, 2017

Description

Steps to reproduce the issue:

  1. Created a swarm with 3 nodes some months ago
  2. Ran docker node update --availability drain node01
  3. Shut down the node and reinstalled it with the same name and IP
  4. Joined the swarm via docker swarm join --token SWMTKN-1-2x0u0zht9x3us2bcxzr3melvavpzfh82jfbzpxh0sriyqt5sou-54viclw2nj7lywfo8bkcz7suq 192.168.85.102:2377
  5. The joining node reported Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node. and lost network connectivity (this may be my fault, but I don't know at this point)
  6. All other nodes report Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online. when executing swarm commands such as docker node ls

Describe the results you received:
Only receiving Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online. when executing swarm commands.

Describe the results you expected:
The existing "drained" node entry should be overwritten instead of being added again (see the sketch at the end of this report).

Additional information you deem important (e.g. issue happens only occasionally):
Maybe related to #34384

Output of docker version:

Client:
 Version:      17.07.0-ce
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   8784753
 Built:        Tue Aug 29 17:43:00 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.07.0-ce
 API version:  1.31 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   8784753
 Built:        Tue Aug 29 17:41:51 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 27
 Running: 17
 Paused: 0
 Stopped: 10
Images: 92
Server Version: 17.07.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: qtetxgwsavahdzqlkbtw4exoo
 Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
 Is Manager: true
 Node Address: 192.168.85.102
 Manager Addresses:
  192.168.85.101:2377
  192.168.85.101:2377
  192.168.85.102:2377
  192.168.85.103:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.10.0-33-generic
Operating System: Ubuntu 17.04
OSType: linux
Architecture: x86_64
CPUs: 24
Total Memory: 94.41GiB
Name: docker0102
ID: 4YOY:WYKL:5TFO:BYVH:JOGV:TC3J:D6AU:FDMO:7YRB:USLS:5RB5:EESR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
3 physical hosts (HPE Proliant blade)
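
For completeness, this is roughly the procedure I would have expected to need, sketched from memory (node01 is the drained node from the steps above; removing the stale entry first is my assumption of the usual workaround, not something I actually ran):

# on a healthy manager: drop the stale entry for the node being reinstalled
docker node demote node01              # only if the old entry was a manager
docker node rm node01                  # add --force if the daemon refuses because the entry is not "down"
# on the reinstalled host: join again
docker swarm join --token <token from step 4> 192.168.85.102:2377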

@thaJeztah
Member

ping @anshulpundir @nishanttotla

@nishanttotla
Contributor

@Fank is the node that you put into drain mode the leader?

@Fank
Author

Fank commented Sep 21, 2017

I'm not sure, but I don't think so; it was offline for hours, so it should have had the status "down".
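
If it helps, a quick way to check which node holds the leader role (only usable while at least one manager still answers; node01 is the drained node from my report):

docker node ls --filter role=manager        # the MANAGER STATUS column shows "Leader"
docker node inspect node01 --format '{{ .ManagerStatus.Leader }}'   # prints true/false, managers only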

@anshulpundir
Contributor

@Fank Could you provide the daemon logs for all the nodes? Thanks!

@Fank
Author

Fank commented Sep 21, 2017

Sorry, but I think they are gone; logrotate removed them.

@yunghoy

yunghoy commented Oct 25, 2017

Our swarm cluster has been destroyed, either after adding a new node or for no apparent reason at all.

ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS
w8e5ljetmz35ovd09wqa2k05z     APSEO-EG-LINUX4    Ready               Active
lb3lrt6tzjsj2vm7ura4j8vq6     APSEO-EG-LINUX6    Ready               Active
l6caavaqcjb0o57cput7r2upc     APSEO-EG-LINUX8    Ready               Active
nh3q88mmbcfpovk5dzv0hez7r *   APSEO-EG-LINUX10   Ready               Active              Leader
wbydv57elb88x515z144alhh6     APSEO-EG-LINUX11   Ready               Active              Reachable
APSEO-EG-LINUX10:~$ docker node inspect APSEO-EG-LINUX11
[
    {
        "ID": "wbydv57elb88x515z144alhh6",
        "Version": {
            "Index": 890120
        },
        "CreatedAt": "2017-10-10T08:39:44.218703354Z",
        "UpdatedAt": "2017-10-24T04:00:07.876516817Z",
        "Spec": {
            "Labels": {},
            "Role": "manager",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "APSEO-EG-LINUX11",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 6000000000,
                "MemoryBytes": 16821977088
            },
            "Engine": {
                "EngineVersion": "17.09.0-ce",
                "Plugins": [
                    {
                        "Type": "Log",
                        "Name": "awslogs"
                    },
                    {
                        "Type": "Log",
                        "Name": "fluentd"
                    },
                    {
                        "Type": "Log",
                        "Name": "gcplogs"
                    },
                    {
                        "Type": "Log",
                        "Name": "gelf"
                    },
                    {
                        "Type": "Log",
                        "Name": "journald"
                    },
                    {
                        "Type": "Log",
                        "Name": "json-file"
                    },
                    {
                        "Type": "Log",
                        "Name": "logentries"
                    },
                    {
                        "Type": "Log",
                        "Name": "splunk"
                    },
                    {
                        "Type": "Log",
                        "Name": "syslog"
                    },
                    {
                        "Type": "Network",
                        "Name": "bridge"
                    },
                    {
                        "Type": "Network",
                        "Name": "host"
                    },
                    {
                        "Type": "Network",
                        "Name": "macvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "null"
                    },
                    {
                        "Type": "Network",
                        "Name": "overlay"
                    },
                    {
                        "Type": "Volume",
                        "Name": "local"
                    }
                ]
            },
            "TLSInfo": {
                "TrustRoot": "REMOVED",
                "CertIssuerSubject": "REMOVED",
                "CertIssuerPublicKey": "REMOVED"
            }
        },
        "Status": {
            "State": "ready",
            "Addr": "REMOVED"
        },
        "ManagerStatus": {
            "Reachability": "reachable",
            "Addr": "REMOVED:2377"
        }
    }
]

This is the only message I get, even after restarting Docker several times:

APSEO-EG-LINUX11:~$ docker node ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
APSEO-EG-LINUX10:~$ docker swarm init --force-new-cluster
APSEO-EG-LINUX11:~$ docker swarm init --force-new-cluster

The command worked, but the status is now "Down", and I wonder why node10 has lost its manager status:

APSEO-EG-LINUX10:~$ docker node ls
ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS
w8e5ljetmz35ovd09wqa2k05z     APSEO-EG-LINUX4    Ready               Active              
lb3lrt6tzjsj2vm7ura4j8vq6     APSEO-EG-LINUX6    Ready               Active              
l6caavaqcjb0o57cput7r2upc     APSEO-EG-LINUX8    Ready               Active              
nh3q88mmbcfpovk5dzv0hez7r *   APSEO-EG-LINUX10   Down               Active
wbydv57elb88x515z144alhh6     APSEO-EG-LINUX11   Down               Active              Leader

It claims to be working, but the status is still "Down", so it is actually not working:

APSEO-EG-LINUX11:~$ docker node ls
ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS
w8e5ljetmz35ovd09wqa2k05z     APSEO-EG-LINUX4    Ready               Active              
lb3lrt6tzjsj2vm7ura4j8vq6     APSEO-EG-LINUX6    Ready               Active              
l6caavaqcjb0o57cput7r2upc     APSEO-EG-LINUX8    Ready               Active              
nh3q88mmbcfpovk5dzv0hez7r *   APSEO-EG-LINUX10   Down               Active
wbydv57elb88x515z144alhh6     APSEO-EG-LINUX11   Down               Active              Leader
APSEO-EG-LINUX10:~$ sudo service docker restart
APSEO-EG-LINUX11:~$ sudo service docker restart

Now, after the restarts, node10 gets its manager status back:

APSEO-EG-LINUX10:~$ docker node ls
ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS
w8e5ljetmz35ovd09wqa2k05z     APSEO-EG-LINUX4    Ready               Active              
lb3lrt6tzjsj2vm7ura4j8vq6     APSEO-EG-LINUX6    Ready               Active              
l6caavaqcjb0o57cput7r2upc     APSEO-EG-LINUX8    Ready               Active              
nh3q88mmbcfpovk5dzv0hez7r *   APSEO-EG-LINUX10   Ready               Active              Reachable
wbydv57elb88x515z144alhh6     APSEO-EG-LINUX11   Down               Active              Leader

On node11 it still claims to be working, but the status is "Down", so it is actually not working:

APSEO-EG-LINUX11:~$ docker node ls
ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS
w8e5ljetmz35ovd09wqa2k05z     APSEO-EG-LINUX4    Ready               Active              
lb3lrt6tzjsj2vm7ura4j8vq6     APSEO-EG-LINUX6    Ready               Active              
l6caavaqcjb0o57cput7r2upc     APSEO-EG-LINUX8    Ready               Active              
nh3q88mmbcfpovk5dzv0hez7r *   APSEO-EG-LINUX10   Down               Active
wbydv57elb88x515z144alhh6     APSEO-EG-LINUX11   Down               Active              Leader

Demoted node11

APSEO-EG-LINUX11:~$ docker node demote APSEO-EG-LINUX11
APSEO-EG-LINUX11:~$ sudo service docker restart
APSEO-EG-LINUX10:~$ docker node ls
ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS
w8e5ljetmz35ovd09wqa2k05z     APSEO-EG-LINUX4    Ready               Active              
lb3lrt6tzjsj2vm7ura4j8vq6     APSEO-EG-LINUX6    Ready               Active              
l6caavaqcjb0o57cput7r2upc     APSEO-EG-LINUX8    Ready               Active              
nh3q88mmbcfpovk5dzv0hez7r *   APSEO-EG-LINUX10   Ready               Active              Leader
wbydv57elb88x515z144alhh6     APSEO-EG-LINUX11   Ready               Active
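
To double-check the manager roles after a recovery like this, the formatted listing below should be enough (hostnames as in the output above):

docker node ls --format '{{ .Hostname }}: {{ .ManagerStatus }}'
# expected: "Leader" on exactly one node, "Reachable" on the other managers, blank for workers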

@yunghoy

yunghoy commented Oct 25, 2017

Can you tell me how many IT companies are using docker-ce for their live services?
I think this bug has happened about once a quarter since last year. The only way to solve the problem used to be to detach all nodes and configure the cluster again, although I found another workaround, written above.
We don't even know what causes the problem; it just happens.

@shibug

shibug commented Oct 25, 2017

I see you have 2 manager nodes. You need an odd number of managers so that a quorum can elect a leader; at least 3 managers are required to sustain 1 manager node failure.
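
For reference, the quorum arithmetic behind that recommendation (standard raft majority, nothing specific to this setup):

# quorum(N) = floor(N/2) + 1 managers must be reachable
#   1 manager  -> quorum 1 -> tolerates 0 failures
#   2 managers -> quorum 2 -> tolerates 0 failures (no safer than 1)
#   3 managers -> quorum 2 -> tolerates 1 failure
#   5 managers -> quorum 3 -> tolerates 2 failures
docker node ls --filter role=manager    # count how many managers are still reachable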

@yunghoy

yunghoy commented Oct 25, 2017

It happened even when we had 3 manager nodes on Docker 17.06.2-ce.
I will change the number of manager nodes to an odd number (>2) and comment again when it happens. That will take around 3 months, since it happens about once a quarter.

@shibug

shibug commented Oct 25, 2017

We had the same issue a couple of days ago with 3 managers. It happened while I was recycling the nodes to upgrade to docker-ce 17.09. I recycled all the non-manager nodes one by one and then started working on the managers. The problem appeared when the last manager was recycled.

@thaJeztah
Member

@yunghoy what version and configuration are you running? (i.e. at least make sure to post the output of docker version and docker info). Are there any logs that provide more information from around the time you saw the swarm was destroyed?

Also, running --force-new-cluster on both managers likely won't work, as both managers then attempt to re-create the swarm (which could explain the DOWN status). Running a two-manager setup is indeed highly discouraged, as you effectively double the chance of losing control over your cluster (if either manager has problems, you lose control; see Administer and maintain a swarm of Docker Engines in the docs)
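
For reference, the documented disaster-recovery flow is roughly the following sketch (run the first step on one surviving manager only; <manager-ip> and <token> are placeholders):

# 1. on ONE manager, rebuild a single-node cluster from its local raft state
docker swarm init --force-new-cluster --advertise-addr <manager-ip>:2377
# 2. print a fresh join token
docker swarm join-token manager         # or: docker swarm join-token worker
# 3. on every other node, leave the broken swarm and rejoin
docker swarm leave --force
docker swarm join --token <token> <manager-ip>:2377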

@oneumyvakin

oneumyvakin commented Jul 25, 2020

Same issue here. I have only 3 nodes, all of which I want to have both roles, manager and worker.
After I added the third node, it was actually added twice, and there is now a rather strange node count of 7:

# docker info
Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 8
 Server Version: 19.03.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: rcv0rcdm4fefrtj6di4wysqtz
  Is Manager: true
  ClusterID: w50b3smuscm2dcwgr9j1olu7x
  Managers: 4
  Nodes: 7
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.52.241.42
  Manager Addresses:
   10.52.241.42:2377
   10.52.241.42:2377
   10.52.241.8:2377
   10.52.241.9:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-42-generic
 Operating System: Ubuntu 20.04 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 3.843GiB
 Name: ha3
 ID: IRCU:KV5A:5KCI:S6ML:SFBW:T732:WKKD:OGYD:NNNR:W64K:MTRD:673N
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
#
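
In case it helps, a sketch of how I would expect to clean up the duplicated entry (the node ID is a placeholder; only run this against the stale, down entry, not the live one):

docker node ls                            # identify the stale duplicate for the re-added host
docker node demote <stale-node-id>        # only if the stale entry is a manager
docker node rm --force <stale-node-id>    # remove it; the live entry stays in place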
