
Unable to create a swarm using RHEL on Azure #33345

Closed
briantd opened this issue May 22, 2017 · 20 comments
Labels
area/networking area/swarm kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. version/17.03

Comments

briantd commented May 22, 2017

Description
I'm unable to join a node to a freshly initialized swarm. Attempting to use the worker join-token from the swarm leader on a prospective worker yields this:

[bd-test321-vm2 /]$ docker swarm join --token SWMTKN-1-2jvcr4vjnl1hsbageqgrmb62q19x3t1w6j61el1814uc2hxnjg-7bc9rf7kcsctlz1y19o3b20d4 10.0.0.4:2377
Error response from daemon: rpc error: code = 14 desc = grpc: the connection is unavailable

I checked the logs in /var/log/messages:

May 22 18:02:28 bd-test321-vm2 dockerd: time="2017-05-22T18:02:28.618905383Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = 14 desc = grpc: the connection is unavailable" module=node

I also confirmed the clocks are not suffering from skew: the time is identical on both servers.

To confirm that this is not a network issue involving network security groups or equivalent, I spun up an nginx container on port 2378 on the swarm leader node, and attempted to connect.

Output of sudo netstat -plnt:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      5106/sshd
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd
tcp6       0      0 :::22                   :::*                    LISTEN      5106/sshd
tcp6       0      0 :::2377                 :::*                    LISTEN      9643/dockerd
tcp6       0      0 :::2378                 :::*                    LISTEN      37816/docker-proxy
tcp6       0      0 :::7946                 :::*                    LISTEN      9643/dockerd

This shows dockerd listening on port 2377 (swarm), and docker-proxy listening on 2378 (nginx):

root     37816  0.0  0.0 108124  3640 ?        Sl   May20   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 2378 -container-ip 172.17.0.3 -container-port 80

Sure enough, I can connect to nginx from the other node:

[bd-test321-vm2 /]$ telnet 10.0.0.4 2378
Trying 10.0.0.4...
Connected to 10.0.0.4.
Escape character is '^]'.
GET /
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
... <snip> ...
Connection closed by foreign host.

Trying to connect to the swarm port (2377) produces an odd error:

[bd-test321-vm2 /]$ telnet 10.0.0.4 2377
Trying 10.0.0.4...
telnet: connect to address 10.0.0.4: No route to host

The "no route to host" appears to be an artifact of running in Azure: if Azure isn't aware of a service listening on a port, it reports that error rather than refusing the connection.
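For anyone repeating this probe without telnet, here is a small bash sketch (the `check_port` helper is something I'm naming here, not an existing tool; the host/port values are the ones from this thread):

```shell
# check_port: succeed if HOST:PORT accepts a TCP connection within TIMEOUT
# seconds. Uses bash's built-in /dev/tcp, so no telnet or nc is required.
check_port() {
  local host=$1 port=$2 timeout=${3:-3}
  timeout "$timeout" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# e.g. from the worker node:
#   check_port 10.0.0.4 2378 && echo reachable    # nginx port: connects
#   check_port 10.0.0.4 2377 || echo unreachable  # swarm port: fails here
```

Note that /dev/tcp collapses "connection refused", timeouts, and Azure's "no route to host" into a single nonzero exit status, so it tells you reachable/unreachable but not why.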

Thinking this might be a special issue with that port, I removed the swarm leader from the swarm (docker swarm leave --force), and spun up another nginx instance on the leader node, this time on port 2377.

[bd-test321-vm1 ~]$ docker run -d -p2377:80 nginx
04536e1741da1c71476107aa7e7da62f6582d109402be60c4fdb8301319ffd5a
[bd-test321-vm1 ~]$ sudo netstat -plnt | grep 2377
tcp6       0      0 :::2377                 :::*                    LISTEN      43674/docker-proxy

Note: this time it's docker-proxy listening...

Now telnet can connect to port 2377:

[bd-test321-vm2 var]$ telnet 10.0.0.4 2377
Trying 10.0.0.4...
Connected to 10.0.0.4.
Escape character is '^]'.
GET /
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
... <snip> ...
Connection closed by foreign host.

Steps to reproduce the issue:

  1. Provision 2 Azure VMs running RHEL (see (gist))
  2. Install docker following RHEL instructions (see (gist))
  3. Start swarm on node 1 and save the join-token output:
[bd-test321-vm1 ~]$ docker swarm init --advertise-addr eth0
Swarm initialized: current node (rijmtkfct2el98i9n8jmla1cd) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-2y5oceutioq1qbeipo1b59js86y8dzzqzvy2lc03o67teosy9c-b42fs2laexqy8op56zwjniedu \
    10.0.0.4:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
  4. Attempt to join from the second node:
[bd-test321-vm2 var]$ docker swarm join \
>     --token SWMTKN-1-2y5oceutioq1qbeipo1b59js86y8dzzqzvy2lc03o67teosy9c-b42fs2laexqy8op56zwjniedu \
>     10.0.0.4:2377
Error response from daemon: rpc error: code = 14 desc = grpc: the connection is unavailable

Describe the results you received:

Error response from daemon: rpc error: code = 14 desc = grpc: the connection is unavailable

Describe the results you expected:
Node successfully joins

Additional information you deem important (e.g. issue happens only occasionally):
I've tried both 10.0.0.0/16 and 172.17.0.? IP schemes for the underlying VMs; it made no difference.

Output of docker version:

$ docker version
Client:
 Version:      17.03.1-ee-3
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   3fcee33
 Built:        Thu Mar 30 20:03:25 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ee-3
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3fcee33
 Built:        Thu Mar 30 20:03:25 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

$ docker info
Containers: 5
 Running: 1
 Paused: 0
 Stopped: 4
Images: 2
Server Version: 17.03.1-ee-3
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: rijmtkfct2el98i9n8jmla1cd
 Is Manager: true
 ClusterID: zl4tvp5lngium1l33fo9z7110
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.0.4
 Manager Addresses:
  10.0.0.4:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.16.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 6.805 GiB
Name: bd-test321-vm1
ID: MUUP:UOJ4:7KRK:BG5I:UEZG:4ABA:CRNE:CWI3:IATW:3VV7:3PZG:FX2F
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
Azure

briantd commented May 23, 2017

Running the same test using the Ubuntu image works fine (i.e. --image UbuntuLTS):

bd-test321-vm2:~$ docker swarm join \
>     --token SWMTKN-1-0nkadu51dqakgen5om7numpmcmq5685xfxv85q8p7ipp5bd0j9-3ak0p4xryztn5js6dn8jzgkcp \
>     10.0.0.4:2377
This node joined a swarm as a worker.

cc @friism

thaJeztah (Member)

@briantd since this is Docker EE, this should be handled through support; can you open a ticket there?

briantd commented May 24, 2017

CentOS checks out too (i.e. --image CentOS):

[bd-test321-centvm2 ~]$ docker swarm join \
>     --token SWMTKN-1-01vq06wi9b41bz56cycf7z4sronextn2vz7sf95klk2iyc8dj9-a0n1kjjmzmpzkn7fcccdmqmaj \
>     10.0.0.6:2377
This node joined a swarm as a worker.

As CentOS runs the CE edition, I used a different engine version:

[briandonaldson@brian-donaldson-test321-centvm2 ~]$ docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:29 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:29 2017
 OS/Arch:      linux/amd64
 Experimental: false

briantd commented Jun 16, 2017

For posterity, I just re-tested on Azure RHEL (fresh VMs) using the latest CE and got the same swarm error:

$ docker swarm join --token SWMTKN-1-45z4yzft4dpwewocpob6u6yt0spadjbamu65sii8lglrnhagul-evfmvtf59daxez1tyfqu96u9q 10.0.0.4:2377
Error response from daemon: rpc error: code = 14 desc = grpc: the connection is unavailable

$ docker version
Client:
 Version:      17.06.0-ce-rc4
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   29fcd5d
 Built:        Thu Jun 15 17:25:35 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce-rc4
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   29fcd5d
 Built:        Thu Jun 15 17:28:11 2017
 OS/Arch:      linux/amd64
 Experimental: false

briantd commented Jun 17, 2017

@friism ^^

briantd commented Jun 17, 2017

Another data point: confirmed that latest CE (RC4) can start a swarm on UbuntuLTS Azure VMs. RHEL remains the outlier.

friism commented Jun 20, 2017

@thaJeztah do you think this could be the networkmonitor thing?

tiborvass (Contributor)

@briantd out of curiosity, do you have NetworkManager running? systemctl status NetworkManager

briantd commented Jun 21, 2017

@tiborvass

[docker@bd-rhel-swarmtest-vm1 ~]$ systemctl status NetworkManager
● NetworkManager.service - Network Manager
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2017-06-17 00:54:09 UTC; 4 days ago
     Docs: man:NetworkManager(8)
 Main PID: 5249 (NetworkManager)
   Memory: 428.0K
   CGroup: /system.slice/NetworkManager.service
           ├─5249 /usr/sbin/NetworkManager --no-daemon
           └─5472 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03-eth0.lease -cf /var/...
[docker@bd-rhel-swarmtest-vm1 ~]$

briantd commented Jun 21, 2017

@cpuguy83 @friism

Results of explicitly specifying --listen-addr eth0:

[docker@bd-rhel-swarmtest-vm1 ~]$ ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.4  netmask 255.255.0.0  broadcast 10.0.255.255
        inet6 fe80::20d:3aff:fe73:fc05  prefixlen 64  scopeid 0x20<link>
        ether 00:0d:3a:73:fc:05  txqueuelen 1000  (Ethernet)
        RX packets 2970442  bytes 1572414134 (1.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4105874  bytes 773028040 (737.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[docker@bd-rhel-swarmtest-vm1 ~]$ docker swarm init --listen-addr eth0
Swarm initialized: current node (yph50oaoeaq23zqm9bkd422ky) is now a manager.

[docker@bd-rhel-swarmtest-vm2 ~]$ docker swarm join --token SWMTKN-1-2anvp8x7c7cp2vlkc5xiztr4vw1kkgedqz6dbu6pdhjtw9pxyn-4xefjsj2ujhbxvv4jo42wnzie 10.0.0.4:2377
Error response from daemon: rpc error: code = 14 desc = grpc: the connection is unavailable

kolyshkin (Contributor)

@briantd Have you checked iptables rules on the nodes (especially the one running the manager)? The port might just be blocked unless explicitly allowed (via firewalld).

briantd commented Jun 21, 2017

@kolyshkin

I don't see anything obvious in the chain rules (see below). Also, earlier in this ticket I ran an experiment exposing an nginx container on port 2377 (while swarm was not active), and other VMs CAN connect to it, which leads me to conclude it's not a firewall issue. It seems that port 2377 doesn't respond to other VMs when swarm is using it.

iptables output:

[docker@bd-rhel-swarmtest-vm1 ~]$ sudo iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
INPUT_direct  all  --  anywhere             anywhere
INPUT_ZONES_SOURCE  all  --  anywhere             anywhere
INPUT_ZONES  all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere             ctstate INVALID
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
FORWARD_direct  all  --  anywhere             anywhere
FORWARD_IN_ZONES_SOURCE  all  --  anywhere             anywhere
FORWARD_IN_ZONES  all  --  anywhere             anywhere
FORWARD_OUT_ZONES_SOURCE  all  --  anywhere             anywhere
FORWARD_OUT_ZONES  all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere             ctstate INVALID
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited
DROP       all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
OUTPUT_direct  all  --  anywhere             anywhere

Chain DOCKER (2 references)
target     prot opt source               destination

Chain DOCKER-ISOLATION (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Chain FORWARD_IN_ZONES (1 references)
target     prot opt source               destination
FWDI_public  all  --  anywhere             anywhere            [goto]
FWDI_public  all  --  anywhere             anywhere            [goto]

Chain FORWARD_IN_ZONES_SOURCE (1 references)
target     prot opt source               destination

Chain FORWARD_OUT_ZONES (1 references)
target     prot opt source               destination
FWDO_public  all  --  anywhere             anywhere            [goto]
FWDO_public  all  --  anywhere             anywhere            [goto]

Chain FORWARD_OUT_ZONES_SOURCE (1 references)
target     prot opt source               destination

Chain FORWARD_direct (1 references)
target     prot opt source               destination

Chain FWDI_public (2 references)
target     prot opt source               destination
FWDI_public_log  all  --  anywhere             anywhere
FWDI_public_deny  all  --  anywhere             anywhere
FWDI_public_allow  all  --  anywhere             anywhere
ACCEPT     icmp --  anywhere             anywhere

Chain FWDI_public_allow (1 references)
target     prot opt source               destination

Chain FWDI_public_deny (1 references)
target     prot opt source               destination

Chain FWDI_public_log (1 references)
target     prot opt source               destination

Chain FWDO_public (2 references)
target     prot opt source               destination
FWDO_public_log  all  --  anywhere             anywhere
FWDO_public_deny  all  --  anywhere             anywhere
FWDO_public_allow  all  --  anywhere             anywhere

Chain FWDO_public_allow (1 references)
target     prot opt source               destination

Chain FWDO_public_deny (1 references)
target     prot opt source               destination

Chain FWDO_public_log (1 references)
target     prot opt source               destination

Chain INPUT_ZONES (1 references)
target     prot opt source               destination
IN_public  all  --  anywhere             anywhere            [goto]
IN_public  all  --  anywhere             anywhere            [goto]

Chain INPUT_ZONES_SOURCE (1 references)
target     prot opt source               destination

Chain INPUT_direct (1 references)
target     prot opt source               destination

Chain IN_public (2 references)
target     prot opt source               destination
IN_public_log  all  --  anywhere             anywhere
IN_public_deny  all  --  anywhere             anywhere
IN_public_allow  all  --  anywhere             anywhere
ACCEPT     icmp --  anywhere             anywhere

Chain IN_public_allow (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:ssh ctstate NEW

Chain IN_public_deny (1 references)
target     prot opt source               destination

Chain IN_public_log (1 references)
target     prot opt source               destination

Chain OUTPUT_direct (1 references)
target     prot opt source               destination
[docker@bd-rhel-swarmtest-vm1 ~]$

briantd commented Jun 23, 2017

UPDATE: @cpuguy83 found the root cause and a workaround

There is a default rule in the iptables INPUT chain to reject ICMP; this was preventing the join from happening. I deleted the rule and the join worked.
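For anyone hitting this, the workaround looks roughly like the following sketch (the `find_reject_rule` helper is something I'm naming here, not an existing tool, and the rule number will vary). Since firewalld owns these chains on RHEL, the more durable fix is to open the swarm ports through it instead of hand-editing iptables:

```shell
# find_reject_rule: given `iptables -L <chain> --line-numbers` output on
# stdin, print the number of the first catch-all REJECT rule.
find_reject_rule() {
  awk '$2 == "REJECT" && $3 == "all" { print $1; exit }'
}

# Locate and delete the offending rule (requires root):
#   n=$(sudo iptables -L INPUT --line-numbers | find_reject_rule)
#   sudo iptables -D INPUT "$n"

# More durable on RHEL: allow the swarm ports via firewalld instead.
#   sudo firewall-cmd --permanent --add-port=2377/tcp                      # cluster management
#   sudo firewall-cmd --permanent --add-port=7946/tcp --add-port=7946/udp  # node gossip
#   sudo firewall-cmd --permanent --add-port=4789/udp                      # overlay (VXLAN)
#   sudo firewall-cmd --reload
```

Note that a manual `iptables -D` is lost on the next firewalld reload or reboot, which is why the `firewall-cmd --permanent` route is preferable.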

thaJeztah (Member)

ping @sanimej @fcrisciani @ddebroy something that we can change / fix / check for?

@thaJeztah thaJeztah added area/networking kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. labels Jul 14, 2017
sanimej commented Jul 20, 2017

@cpuguy83 How was the ICMP rule affecting the grpc connection? Was it a PMTU issue?

cpuguy83 (Member)

It's not an icmp rule, it blocks everything from everywhere.

ddebroy commented Jul 20, 2017

@briantd Do we know:

  1. Was the rule that blocked everything present during your nginx-on-port-2377 experiment? In other words, is Docker adding the rule on RHEL, or is something else?

  2. Is the rule absent in iptables on the other distros you tested?

If we are trying to figure out who is adding the rule, one suspect is the Azure Linux agent script for RHEL.

cpuguy83 commented Jul 20, 2017 via email

friism commented Jul 20, 2017 via email

thaJeztah (Member)

Let me close this ticket for now, as it looks like it went stale, although we do have some Azure people on this thread now 😅😉

@thaJeztah thaJeztah closed this as not planned Won't fix, can't repro, duplicate, stale Sep 16, 2023
10 participants