Worker node not coming back after reboot #23828

Closed
robbertkl opened this issue Jun 21, 2016 · 8 comments · Fixed by #24237
Labels: area/swarm, priority/P1 (Important: P1 issues are a top priority and a must-have for the next release), version/1.12

Comments

@robbertkl

Output of docker version:

Client:
 Version:      1.12.0-rc2
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   906eacd
 Built:        Fri Jun 17 20:35:33 2016
 OS/Arch:      darwin/amd64
 Experimental: true

Server:
 Version:      1.12.0-rc2
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   906eacd
 Built:        Fri Jun 17 20:45:29 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.12.0-rc2
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 0
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host overlay null
Swarm: active
 NodeID: ar6yxb70u9bvamsz1iinqhw97
 IsManager: Yes
 Managers: 1
 Nodes: 2
 CACertHash: sha256:42745a0afed8dfcbed66473e3e01ff990d74beb02edc0aef5ede6d1c2f9232c2
Runtimes: default
Default Runtime: default
Security Options: seccomp
Kernel Version: 4.4.13-boot2docker
Operating System: Boot2Docker 1.12.0-rc2 (TCL 7.1); HEAD : 52952ef - Fri Jun 17 21:01:09 UTC 2016
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 492.7 MiB
Name: sw1
ID: FI6P:JUNT:2KP4:HLAJ:HTEX:MQ72:ADJ3:53D2:RL33:W4WM:FAI3:BU4Y
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 35
 Goroutines: 109
 System Time: 2016-06-21T20:54:27.884000237Z
 EventsListeners: 0
Username: robbertkl
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

VirtualBox

Steps to reproduce the issue:

  1. docker-machine create -d virtualbox sw1
  2. docker-machine create -d virtualbox sw2
  3. docker $(docker-machine config sw1) swarm init
  4. docker $(docker-machine config sw2) swarm join $(docker-machine ip sw1):2377
  5. docker-machine restart sw2

Describe the results you received:

docker $(docker-machine config sw1) node ls showing sw2 status Down, even after the restart was completed. The node does not come back.

Describe the results you expected:

docker $(docker-machine config sw1) node ls showing sw2 status Down during the restart, but changing back to status Ready soon after the restart completed.

Additional information you deem important (e.g. issue happens only occasionally):

After the restart, manually (re)joining won't fix it:

$ docker $(docker-machine config sw2) swarm join $(docker-machine ip sw1):2377
Error response from daemon: This node is already part of a Swarm cluster. Use "docker swarm leave" to leave this cluster and join another one.
$ docker $(docker-machine config sw2) info | grep Swarm
Swarm: pending

Accepting the node on the manager sw1 doesn't help either.
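A small helper can make the stuck state easier to spot. This is only a sketch: `swarm_state` is a hypothetical function name, and it assumes the `Swarm:` line keeps the format shown in the `docker info` output above.

```shell
# swarm_state: read `docker info` text on stdin and print the value of the
# "Swarm:" line (for example "active" or "pending").
# Hypothetical helper; assumes the "Swarm: <state>" format shown in this report.
swarm_state() {
  awk -F': ' '$1 == "Swarm" { print $2; exit }'
}

# Possible usage against the machines from this report (needs docker and docker-machine):
#   state=$(docker $(docker-machine config sw2) info | swarm_state)
#   [ "$state" = "active" ] || echo "sw2 swarm state is: $state"
```

With the behavior described above, the check would report `pending` for sw2 after the reboot.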

@robbertkl
Author

I just found out it does work when I initialize the swarm with:

docker $(docker-machine config sw1) swarm init --listen-addr $(docker-machine ip sw1):2377

Is that intended behavior? The documentation says:

--listen-addr value   Listen address (default 0.0.0.0:2377)

Which made me believe it would just listen on all addresses. Indeed, joining just worked the first time, using the docker machine 192.168.99. address. However, docker $(docker-machine config sw1) node inspect sw1 shows the ManagerStatus.Addr as 10.0.2.15:2377! It seems because of this, sw2 can't join the swarm again after the reboot, but why then did the manual swarm join work fine?
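The mismatch described here can be checked mechanically. The following is a sketch only: `host_of` and `advertises_reachable_ip` are hypothetical helper names, and the `--format` template in the usage comment assumes `docker node inspect` exposes the `ManagerStatus.Addr` field mentioned above.

```shell
# host_of ADDR: strip a trailing ":PORT" from an IPv4 host:port string.
host_of() {
  printf '%s\n' "${1%:*}"
}

# advertises_reachable_ip ADVERTISED REACHABLE_IP: succeed when the host part
# of the advertised address equals the IP that other nodes can reach.
advertises_reachable_ip() {
  [ "$(host_of "$1")" = "$2" ]
}

# Possible usage (needs docker and docker-machine; the template is an assumption):
#   addr=$(docker $(docker-machine config sw1) node inspect --format '{{ .ManagerStatus.Addr }}' sw1)
#   advertises_reachable_ip "$addr" "$(docker-machine ip sw1)" || echo "sw1 advertises $addr"
```

In the setup from this report, the check would fail for the default `swarm init`, because the advertised 10.0.2.15 is not the address that `docker-machine ip sw1` returns.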

@tonistiigi
Member

@robbertkl 0.0.0.0:2377 means that the server will listen on all addresses, but there is only one address that the server advertises its services on. In the case of 0.0.0.0, this address is found based on your default route (10.0.2.15 in your case). This is because your other node obviously can't connect back to 0.0.0.0, since that would be its own localhost. There is some discussion of improvements on this topic in moby/swarmkit#803

After the worker has established a connection, it maintains a list of all known managers' advertise addresses and uses this list to find a new manager on restart. It doesn't use the address specified on join, because that address is only known to exist at the time the join request was made and may not be active anymore.
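To see which address a default-route lookup yields on a given machine, something like the following may help. It is an approximation of the idea, not swarmkit's actual selection code. `route_src` is a hypothetical helper, and the usage comment assumes an iproute2-style `ip route get` that prints a `src` field.

```shell
# route_src: read `ip route get <dest>` output on stdin and print the source
# address the kernel would use, i.e. the word following "src".
# Hypothetical helper; an approximation, not swarmkit's own logic.
route_src() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Possible usage inside the VM (assumes an iproute2-style `ip` is available):
#   ip route get 8.8.8.8 | route_src
```

On a docker-machine VirtualBox VM, this would be expected to print the eth0 (NAT) address rather than the host-only one, which matches the 10.0.2.15 seen in this issue.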

@nathanleclaire Is there a way to make the defaults work for docker-machine?

@abronan Seems something similar to moby/swarmkit#957 could be useful for workers as well.

@vdemeester added this to the 1.12.0 milestone on Jun 22, 2016
@robbertkl
Author

Thank you, that makes sense. So actually the listen address you set is more like "the advertised listen address", and the default for this is really your "default IP" (based on default route).

The same happens when I promote a worker to manager. It gets the IP from the default route.

The additional thing here is:

  • To prevent this, I have to specify --listen-addr for each docker swarm join as well, or else I can never promote them to a manager later on (without leaving + re-joining + removing the old node)
  • While the joined node is still a worker, I cannot see the listen address anywhere; docker node inspect doesn't show it, but it's still remembered, because:
  • As soon as I promote the worker node, the "hidden" listen address comes into play and becomes visible in the ManagerStatus.Addr
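The workaround from the first bullet can be scripted so that every join passes an explicit address. This is a minimal sketch using only the 1.12.0-rc2 flags shown in this thread. `swarm_join_args` is a hypothetical helper name. Later Docker releases also offer a separate `--advertise-addr` flag, which this sketch does not use.

```shell
# swarm_join_args WORKER_IP MANAGER_IP [PORT]: print the arguments for a
# `docker swarm join` that sets --listen-addr explicitly (port defaults to 2377).
# Hypothetical helper; only builds the argument string, it does not run docker.
swarm_join_args() {
  port="${3:-2377}"
  printf 'swarm join --listen-addr %s:%s %s:%s\n' "$1" "$port" "$2" "$port"
}

# Possible usage with the machines from this report (needs docker and docker-machine):
#   docker $(docker-machine config sw2) \
#     $(swarm_join_args "$(docker-machine ip sw2)" "$(docker-machine ip sw1)")
```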

@tiborvass added the priority/P2, area/swarm, and priority/P1 labels and removed the priority/P2 label on Jun 27, 2016
@tiborvass
Contributor

Related to #23877

@nathanleclaire
Contributor

nathanleclaire commented Jul 7, 2016

@tonistiigi wrote: "@nathanleclaire Is there a way to make the defaults work for docker-machine?"

I don't understand what you want docker-machine to do in this case? Can you explain?

Isn't the issue here that Swarm is trying to divine an IP address to advertise with from eth0 when in this case it should actually be divining it from eth1?

Take a look at ifconfig inside of a docker-machine VM.

docker@sw1:~$ ifconfig
docker0   Link encap:Ethernet  HWaddr 02:42:31:3C:10:2F
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

docker_gwbridge Link encap:Ethernet  HWaddr 02:42:AE:88:2F:BC
          inet addr:172.18.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:aeff:fe88:2fbc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:536 (536.0 B)  TX bytes:648 (648.0 B)

eth0      Link encap:Ethernet  HWaddr 08:00:27:97:61:7F
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe97:617f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:966 errors:0 dropped:0 overruns:0 frame:0
          TX packets:604 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:164634 (160.7 KiB)  TX bytes:171644 (167.6 KiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:FC:64:01
          inet addr:192.168.99.101  Bcast:192.168.99.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fefc:6401/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:471 errors:0 dropped:0 overruns:0 frame:0
          TX packets:358 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:62195 (60.7 KiB)  TX bytes:100518 (98.1 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:219 errors:0 dropped:0 overruns:0 frame:0
          TX packets:219 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:23450 (22.9 KiB)  TX bytes:23450 (22.9 KiB)

veth88fa3ce Link encap:Ethernet  HWaddr 66:2B:A3:3B:74:C8
          inet6 addr: fe80::642b:a3ff:fe3b:74c8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:1206 (1.1 KiB)

eth0 is the connection to the host computer (10.0.2.15, which is falsely advertised as the manager IP above). Really you want eth1, which is the connection to other machines on the host-only network.

I kind of liked how in the libnetwork stuff you could specify an interface to advertise on. We should be careful about making assumptions WRT advertising IP addresses. There's no way to know if eth0 is an internal subnet address, a publicly routable IPv4 address, or something else entirely. As an operator I would not expect --listen-addr to also decide which IP address to advertise; that conflates separate behaviors. You might want to listen on one interface and advertise another.
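Until an interface can be named directly, the address can be picked by interface name on the operator's side. The following is a sketch under stated assumptions: `iface_addr` is a hypothetical helper, and it only understands the legacy `inet addr:` layout of the `ifconfig` output shown above.

```shell
# iface_addr IFACE: read ifconfig-style text on stdin and print the IPv4
# address of the named interface.
# Hypothetical helper; assumes the legacy "inet addr:<ip>" output format.
iface_addr() {
  awk -v ifc="$1" '
    /^[^ \t]/ { cur = $1 }              # a non-indented line starts a new interface stanza
    cur == ifc && /inet addr:/ {
      sub(/.*inet addr:/, "")           # drop everything up to the address
      sub(/ .*/, "")                    # drop everything after it
      print
      exit
    }'
}

# Possible usage inside the VM, feeding the result to the flag used in this thread:
#   addr=$(ifconfig | iface_addr eth1)
#   docker swarm init --listen-addr "$addr:2377"
```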

EDIT: Oops, didn't realize this is a bit on the older side.

@nathanleclaire
Contributor

nathanleclaire commented Jul 7, 2016

#24237 seems to be headed in the right direction 👍

@tonistiigi
Member

@nathanleclaire If we add a daemon command-line argument for the default interface, that would probably work for machine as well?

@nathanleclaire
Contributor

@tonistiigi Sure, we could add a daemon flag for that.
