Swarm unstable if host's address is different from the public IP address #26167

Open
dpp opened this Issue Aug 30, 2016 · 4 comments

dpp commented Aug 30, 2016

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 3
 Running: 0
 Paused: 0
 Stopped: 3
Images: 3
Server Version: 1.12.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 39
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null overlay bridge host
Swarm: active
 NodeID: xxxxxxxx
 Is Manager: false
 Node Address: 172.31.1.100
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-31-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 992.4 MiB
Name: s1
ID: xxxxxx
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

Virtual instance running at Hetzner swarmed with instances running on Digital Ocean, Packet, etc.

Steps to reproduce the issue:

  1. Create a stable docker swarm from machines whose own IP address is the same as their public internet address
  2. Add a node using the output of docker swarm join-token manager on a machine whose IP address is different from its public IP address, and do not use the --advertise-addr option
  3. Try to leave the swarm with docker swarm leave or docker swarm leave --force
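
Whether step 2 triggers the problem depends on whether the joining host has its public address bound to a local interface. The following is a minimal sketch of that check, not part of Docker: `needs_advertise_addr` and `join_command` are hypothetical helpers, `203.0.113.7` and `198.51.100.10` are placeholder documentation addresses, and port 2377 is assumed as the manager port. It only builds the command line and does not run Docker.

```python
import ipaddress


def needs_advertise_addr(local_addrs, public_ip):
    """Return True when public_ip is not bound to any local interface.

    local_addrs: addresses configured on the host (for example from `ip addr`).
    public_ip: the address other swarm nodes would use to reach this host.
    """
    public = ipaddress.ip_address(public_ip)
    return all(ipaddress.ip_address(a) != public for a in local_addrs)


def join_command(token, manager, local_addrs, public_ip):
    """Build (not run) a `docker swarm join` argument list."""
    cmd = ["docker", "swarm", "join", "--token", token]
    if needs_advertise_addr(local_addrs, public_ip):
        # The host only knows its private address, so state the public one.
        cmd += ["--advertise-addr", public_ip]
    cmd.append(manager)
    return cmd


# 172.31.1.100 is the Node Address from the `docker info` output above;
# the token, public address, and manager address are made-up placeholders.
print(join_command("<token>", "198.51.100.10:2377",
                   ["127.0.0.1", "172.31.1.100"], "203.0.113.7"))
```

With the addresses from this report, the sketch adds `--advertise-addr`, because the public address does not appear among the host's own addresses.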

Describe the results you received:

  • The node cannot leave the swarm:
root@s1 /etc # docker swarm leave
Error response from daemon: context deadline exceeded
root@s1 /etc # docker swarm leave --force
Error response from daemon: context deadline exceeded
  • The node must be manually rebuilt (reinstall the OS) in order to be stable again
  • The swarm becomes unstable because it thinks the node is a manager, but the node is shown as neither reachable nor unreachable. If there is a partitioning event, the swarm is unable to elect a new leader.
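
The leader-election symptom is consistent with the majority rule of Raft, which swarm managers use for consensus: a leader needs votes from more than half of the managers in the member list. A small sketch of that arithmetic follows, assuming the half-joined node is counted as a manager but never votes. The function names are illustrative and are not Docker APIs.

```python
def quorum(managers):
    """Votes needed to elect a leader among `managers` Raft members."""
    return managers // 2 + 1


def can_elect(listed_managers, reachable_managers):
    """True when the reachable managers still form a majority."""
    return reachable_managers >= quorum(listed_managers)


# Three healthy managers tolerate the loss of one.
print(can_elect(listed_managers=3, reachable_managers=2))  # True

# Add a fourth manager that is listed but never votes: the quorum rises to 3,
# so losing any one healthy manager now blocks the election.
print(can_elect(listed_managers=4, reachable_managers=2))  # False
```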

Describe the results you expected:

  • If the IP supplied by the "join" is not the same as the address at which the manager sees the node, the node should not be allowed to join and the command should prompt the user to use the --advertise-addr option
  • The docker swarm leave --force command should result in the node being in a stable and disconnected state
  • Leader election should force a node that's neither connected nor disconnected to be left out of the consensus

Additional information you deem important (e.g. issue happens only occasionally):

aluzzardi (Contributor) commented Aug 31, 2016

aaronlehmann (Contributor) commented Aug 31, 2016

Thanks very much for the detailed report. I've seen a few similar issues recently, and I'm eager to find and fix the root cause.

@tonistiigi: This is related to #26038 (comment)

> If the IP supplied by the "join" is not the same as the address at which the manager sees the node, the node should not be allowed to join and the command should prompt the user to use the --advertise-addr option

--advertise-addr shouldn't be necessary when joining an existing cluster. The manager it connects to should detect the new node's IP address and use that automatically. Perhaps this is not working as expected. I'll try to look into it.
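
To illustrate what detecting the address from the manager side means, the sketch below uses plain TCP sockets on the loopback interface: the accepting side reads the peer address of the incoming connection instead of trusting what the joining side believes about itself. This is a toy under stated assumptions, not swarmkit's code, and `observed_peer_address` is a made-up name.

```python
import socket
import threading


def observed_peer_address(listener):
    """Accept one connection and return the IP address the listener sees."""
    conn, (peer_ip, _peer_port) = listener.accept()
    conn.close()
    return peer_ip


# Stand-in for a manager: listen on an ephemeral loopback port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

result = {}
worker = threading.Thread(
    target=lambda: result.update(ip=observed_peer_address(listener)))
worker.start()

# Stand-in for the joining node: connect without announcing any address.
with socket.create_connection(("127.0.0.1", port)):
    pass
worker.join()
listener.close()

print(result["ip"])
```

If the joining node sits behind address translation, the address observed this way would be the translated one, which can differ from the address the node reports for itself.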

> The docker swarm leave --force command should result in the node being in a stable and disconnected state

Agreed. The fact that this is getting stuck appears to be a bug. We'll try to understand and resolve the problem.

aaronlehmann (Contributor) commented Aug 31, 2016

I looked into this and found that remote-side address detection is indeed not working as it should. I've opened #26211 to address this part of the issue. Thanks again for the report.

We are also looking into avoiding the bad state that results from an incorrect advertise address being specified or detected. docker/swarmkit#1440 is related to this. It prevents certain errors from being swallowed, and will avoid the bad state where the node is partially but not completely a manager.

dpp commented Aug 31, 2016

Thank you all for the quick and diligent efforts to resolve these issues!
