
Overlay Network on EC2: no route to host #19697

Closed
tylerFowler opened this issue Jan 26, 2016 · 9 comments

Comments

@tylerFowler

I have 3 Docker hosts running in an etcd-backed cluster on CoreOS, and I want to use a Docker overlay network to connect related containers (potentially across hosts), but I seem to be running into a routing issue.

The Setup

I have 3 containers, each running on a separate host on an overlay network: a PGPool load balancer, a master Postgres db, and a standby replica, all serving traffic on port 5432. Port 5432 on the first server (pgpool) is exposed publicly, while the actual Postgres instances are meant to be reached via their hostnames on the overlay network. Running the containers on the network this way gives me the following iptables entries:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  ip-172-17-0-0.us-west-2.compute.internal/16  ip-172-18-0-0.us-west-2.compute.internal/16
DROP       all  --  ip-172-18-0-0.us-west-2.compute.internal/16  ip-172-17-0-0.us-west-2.compute.internal/16
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain DOCKER (2 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             ip-172-17-0-2.us-west-2.compute.internal  tcp dpt:2375
ACCEPT     tcp  --  anywhere             ip-172-18-0-2.us-west-2.compute.internal  tcp dpt:9999

The Problem

With the master db container named db-master and the pgpool container named db-lb, if I open a shell in either container and ping the other I get the following:

PING db-master(10.0.0.3): 56 data bytes
92 bytes from 3d983a69b284 (10.0.0.2): Destination Host Unreachable
92 bytes from 3d983a69b284 (10.0.0.2): Destination Host Unreachable
92 bytes from 3d983a69b284 (10.0.0.2): Destination Host Unreachable
92 bytes from 3d983a69b284 (10.0.0.2): Destination Host Unreachable
6 packets transmitted, 0 packets received, 100% packet loss

Here 10.0.0.3 is db-master and 10.0.0.2 is db-lb. Running traceroute gives me:

traceroute to analysisdb-master (10.0.0.3), 30 hops max, 60 byte packets
 1  3d983a69b284 (10.0.0.2)  3005.798 ms !H  3005.728 ms !H  3005.715 ms !H

And finally the output of route -n on a host is:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.31.32.1     0.0.0.0         UG    1024   0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker_gwbridge
172.31.32.0     0.0.0.0         255.255.240.0   U     0      0        0 eth0
172.31.32.1     0.0.0.0         255.255.255.255 UH    1024   0        0 eth0

What's interesting is that two containers on the same host have no trouble at all communicating with each other over the network. I tried completely opening up my AWS security group, but no dice. Additionally, fleet and etcd have no problem communicating across the cluster. I also tried the exact same setup on a local VM cluster using Vagrant and had no issues with routing between hosts. Am I missing something, or does AWS require some sort of specific setup?
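For reference, an overlay network backed by an external key-value store also needs TCP/UDP 7946 (network discovery) and UDP 4789 (the VXLAN data plane) open between the hosts, on top of the KV store port. A minimal sketch of allowing those within a single security group, assuming a hypothetical group ID sg-0123abcd shared by all three hosts:

# hypothetical security group; allow the overlay ports between members of the same group
aws ec2 authorize-security-group-ingress --group-id sg-0123abcd --protocol tcp --port 7946 --source-group sg-0123abcd
aws ec2 authorize-security-group-ingress --group-id sg-0123abcd --protocol udp --port 7946 --source-group sg-0123abcd
aws ec2 authorize-security-group-ingress --group-id sg-0123abcd --protocol udp --port 4789 --source-group sg-0123abcd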

Docker Version info:

Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   9894698
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   9894698
 Built:
 OS/Arch:      linux/amd64
@GordonTheTurtle

If you are reporting a new issue, make sure that we do not have any duplicates already open. You can ensure this by searching the issue list for this repository. If there is a duplicate, please close your issue and add a comment to the existing issue instead.

If you suspect your issue is a bug, please edit your issue description to include the BUG REPORT INFORMATION shown below. If you fail to provide this information within 7 days, we cannot debug your issue and will close it. We will, however, reopen it if you later provide the information.

For more information about reporting issues, see CONTRIBUTING.md.

You don't have to include this information if this is a feature request

(This is an automated, informational response)


BUG REPORT INFORMATION

Use the commands below to provide key information from your environment:

docker version:
docker info:

Provide additional environment details (AWS, VirtualBox, physical, etc.):

List the steps to reproduce the issue:
1.
2.
3.

Describe the results you received:

Describe the results you expected:

Provide additional info you think is important:

----------END REPORT ---------

#ENEEDMOREINFO

@tylerFowler
Author

UPDATE:

I was able to get this working by changing my cluster advertise URL from $private_ipv4:2376 to $private_ipv4:2375. This is strange, because my Docker host is listening for outside connections only on 2376 and I'm using TLS everywhere; however, I'm guessing it's because my Swarm agents are all running on 2375 (though not exposed to the outside world).
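This is the daemon's --cluster-advertise flag; a minimal sketch of the 1.9-era invocation it corresponds to (the etcd endpoint is illustrative, and $private_ipv4 is the CoreOS cloud-config substitution variable):

# docker 1.9-era daemon flags; the etcd endpoint shown is illustrative
docker daemon \
  --cluster-store=etcd://127.0.0.1:2379 \
  --cluster-advertise=$private_ipv4:2375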

It might be worth adding something about this in the docs if that is indeed expected behavior.

@clkao
Contributor

clkao commented Feb 26, 2016

Some older docs suggested using eth0:0 as the advertise address; not sure if that still works.

@pilgrim2go

pilgrim2go commented Jul 20, 2016

I have the same issue with EC2 and Docker 1.11.2.

Here is my test.

Create the overlay network:

docker network create --driver overlay --subnet=10.0.9.0/24 my-net

Create nginx on node1:

 docker run -itd --name=web --network=my-net --env="constraint:node==node1" nginx

Run busybox on node2:

docker run -it --rm --net=my-net --env="constraint:node==node2" busybox wget -O- http://web
Connecting to web (10.0.9.2:80)
wget: can't connect to remote host (10.0.9.2): No route to host

Any help??

Docker info

Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 4
Server Version: swarm/1.2.3
Role: replica
Primary: manager1:4000
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 2
 node1: 10.x.x.x:2375
  └ ID: OTBV:YND4:TOEK:V753:DY7P:WFEZ:7GK6:2T2G:EBI5:4FCW:PR56:IECW
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 1.021 GiB
  └ Labels: executiondriver=, kernelversion=4.1.17-22.30.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2016.03, storagedriver=devicemapper
  └ UpdatedAt: 2016-07-20T02:45:17Z
  └ ServerVersion: 1.11.2
 node2: 10.x.x.x:2375
  └ ID: IYB4:74YE:IZ54:YIYH:A62T:BUHX:KOCC:RITK:B3ID:4FQQ:2LOR:XX3B
  └ Status: Healthy
  └ Containers: 1
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 1.021 GiB
  └ Labels: executiondriver=, kernelversion=4.1.17-22.30.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2016.03, storagedriver=devicemapper
  └ UpdatedAt: 2016-07-20T02:45:38Z
  └ ServerVersion: 1.11.2
Plugins:
 Volume:
 Network:
Kernel Version: 4.1.17-22.30.amzn1.x86_64
Operating System: linux
Architecture: amd64
CPUs: 2
Total Memory: 2.042 GiB
Name: 1b1e76de6f70
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support

@pilgrim2go

Update: got it working by changing --cluster-advertise to private_ip:2375; it was eth0:2375 before.

@spangaer

For me that did very little, and I had no need to expose that port anyway, as I'm not using Docker Swarm.
I was still on 1.10.3. Going through the changelog, I see there were plenty of bug fixes in the networking feature. This one caught my attention in particular:

Fix unreliable inter-service communication after scaling down and up #25603

That's a 1.12.1 fix, and upgrading to 1.12.1 does seem to fix the problem for me.

@febbraro

@pilgrim2go Thanks for the pointer, fixing --cluster-advertise fixes networking for me.

@thaJeztah
Member

Looks like this issue is resolved / answered, so I'll close, but ping me if you think there's still something that needs to be addressed

@ghovat

ghovat commented Aug 14, 2017

I have the same issue. On my EC2 instances I have a Docker container for etcd:

- name: etcd
    - image: quay.io/coreos/etcd
    - binds:
      - /etc/ssl:/etc/ssl:ro
    - restart_policy: always
    - binds:
      - /srv/etcd:/data.etcd:rw
    - port_bindings:
      - 2379:2379
      - 2380:2380
    - environment:
      - ETCD_DATA_DIR: /data.etcd
      - ETCD_NAME: {{ grains['id'] }}
      - ETCD_ADVERTISE_CLIENT_URLS: http://{{ grains['private_ip'] }}:2379
      - ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
      - ETCD_INITIAL_ADVERTISE_PEER_URLS: http://{{ grains['private_ip'] }}:2380
      - ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
      - ETCD_INITIAL_CLUSTER_TOKEN: digifit-1
      - ETCD_INITIAL_CLUSTER: {{ pillar['ETCD_INITIAL_CLUSTER'] }}
      - ETCD_INITIAL_CLUSTER_STATE: {{ pillar.get('ETCD_INITIAL_CLUSTER_STATE', 'new') }}
    - require:
      - pkg: 'install ca-certificates'
      - iptables: 'filter etcd client port'
      - iptables: 'filter etcd peer port'
    - __monitoring__:
        - service: container-running
          title: etcd container
          container_name: etcd

And I create a docker network with docker network create -d overlay --subnet 10.0.0.0/16 network

The network is correctly replicated over etcd and is available on all hosts. But if I try to connect from one container to another container that is correctly on the network (per docker network inspect network), it doesn't work.
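A minimal sketch of the cross-host check this describes, assuming the overlay is named network and a peer container named web (both names illustrative):

docker network inspect network   # the same overlay network ID should appear on every host
docker run -it --rm --net=network busybox ping -c 3 web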

Here is my docker info:


root@ip-10-0-127-34:/home/georg.sattler# docker info
Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 10
Server Version: 17.05.0-ce
Storage Driver: devicemapper
 Pool Name: docker-202:1-2050903-pool
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 4.683GB
 Data Space Total: 107.4GB
 Data Space Available: 102.7GB
 Metadata Space Used: 7.115MB
 Metadata Space Total: 2.147GB
 Metadata Space Available: 2.14GB
 Thin Pool Minimum Free Space: 3.221GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1028-aws
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.303GiB
Name: ip-10-0-127-34
ID: JFQO:C56I:VM22:UDKJ:QDKD:HUDT:KL3X:JCSR:WAPG:66JL:S4RM:4ENN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Cluster Store: etcd://127.0.0.1:2379
Cluster Advertise: 10.0.127.34:2376
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
         Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
WARNING: No swap limit support
