
Docker swarm load balancing not working over private network #36689

Open
agrrh opened this Issue Mar 25, 2018 · 5 comments


agrrh commented Mar 25, 2018

Description

The problem is probably similar to #25325: Docker can't reach containers running on hostB when I query hostA's public address.

I'm using Docker swarm with 2 hosts. They are connected via a wireguard tunnel and can reach each other; I'm able to ping each host from the other using the internal (tunnel) addresses.

Then I initialize swarm mode with the --advertise-addr, --data-path-addr and --listen-addr options, passing the internal addresses there. Both hosts are visible via docker node ls and active. No errors in syslog.
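
For reference, a minimal sketch of what I mean, assuming the tunnel addresses 10.0.5.1 (manager) and 10.0.5.2 (worker) from the wireguard configs below; the join token is elided:

# on host1 (manager)
docker swarm init --advertise-addr 10.0.5.1 --data-path-addr 10.0.5.1 --listen-addr 10.0.5.1:2377

# on host2 (worker), with the token printed by the init above
docker swarm join --advertise-addr 10.0.5.2 --data-path-addr 10.0.5.2 --token <paste worker token here> 10.0.5.1:2377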

But when I create a service with 2 replicas, I see strange behavior: accessing the service via one of the public IPs, I'm only able to reach the containers running on that particular node. The other requests fail with a timeout.

Steps to reproduce the issue:

  1. Set up the wireguard tunnel and check that it works fine.
  2. Set up docker in swarm mode.
  3. Run a service. I'm using this one: agrrh/dummy-service-py. It runs an HTTP service on port 80 and answers with the container's hostname plus a random uuid.
  4. Scale the service to at least 2 replicas. (docker service create --name dummy --replicas 2 --publish 8080:80 agrrh/dummy-service-py)
  5. Try to cycle through the replicas by querying hostA's address, e.g. with a loop like the sketch below.
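
A hedged sketch of step 5, assuming the hypothetical hostname host1 resolves to hostA's public address and the service is published on port 8080 as above:

for i in 1 2 3 4 5 6; do curl -m 5 http://host1:8080/; echo; done

With 2 replicas, the routing mesh should alternate between them, so roughly every second request should land on the container on the other node.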

Describe the results you received:

As I said, requests that land on containers running on the other node fail:

$ http host1:port
{ "hostname": "containerA" } # this container is running at host1
$ http host1:port
http: error: Request timed out (30.0s).

$ http host2:port
http: error: Request timed out (30.0s).
$ http host2:port
{ "hostname": "containerB" } # this container is running at host2

Describe the results you expected:

I expect to be able to reach all running containers by querying the public address of any single node.

Additional information you deem important (e.g. issue happens only occasionally):

It seems to me that wireguard/the tunnel itself is not the cause, as I'm still able to send pings between containers. For example, containerB can reach these containerA addresses (see the tcpdump sketch after this list):

  • 10.255.0.4 @lo ~0.050 ms (it looks like this one doesn't actually leave host2)
  • 10.255.0.5 @eth0 ~0.700 ms (I can see this with tcpdump on the other end, it's reachable!)
  • 172.18.0.3 @eth1 ~0.050 ms (this probably doesn't leave host2 either)
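
A minimal sketch of the tcpdump check, assuming wg0 is the tunnel interface from the configs below; swarm's overlay data plane uses VXLAN on UDP port 4789 by default:

# on host2, while querying host1's published port
tcpdump -ni wg0 udp port 4789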

Due to using --advertise-addr, I can see packets flowing between the hosts via the private interface.

I tried installing ntp and syncing the clocks, but this did not help.

I also attempted various fixes (e.g. turning off masquerading, re-creating the default bridge with a lower MTU, setting a default bind IP, etc.), but had no luck.
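
On the MTU idea: wg0's default MTU is 1420 and VXLAN encapsulation adds roughly another 50 bytes, so a variant worth sketching (untested by me; dummy_net is a hypothetical name, and com.docker.network.driver.mtu is the documented overlay driver option) is a user-defined overlay network with an explicitly lowered MTU:

docker network create -d overlay --opt com.docker.network.driver.mtu=1300 dummy_net
docker service create --name dummy --replicas 2 --publish 8080:80 --network dummy_net agrrh/dummy-service-py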

I have already reproduced the issue 3 times with a clean setup and am ready to provide collaborators with access to my test hosts if you would like to investigate on-site.

Output of docker version:

Same on both hosts:

Client:
 Version:	18.03.0-ce
 API version:	1.37
 Go version:	go1.9.4
 Git commit:	0520e24
 Built:	Wed Mar 21 23:10:01 2018
 OS/Arch:	linux/amd64
 Experimental:	false
 Orchestrator:	swarm

Server:
 Engine:
  Version:	18.03.0-ce
  API version:	1.37 (minimum version 1.12)
  Go version:	go1.9.4
  Git commit:	0520e24
  Built:	Wed Mar 21 23:08:31 2018
  OS/Arch:	linux/amd64
  Experimental:	false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: rdwi6u922eb93s3z3cq1vuih1
 Is Manager: true
 ClusterID: g8urrtm78sc68oro86k3wvjzf
 Managers: 1
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.0.5.1
 Manager Addresses:
  10.0.5.1:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-37-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 481.8MiB
Name: test1
ID: IS5W:2U5W:XDAE:UXIF:KXRR:FQSU:PI7K:UXEQ:OOHK:HC4O:TLZR:P4UU
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Wireguard setup guide (assuming you have it installed):

### Server

cd /etc/wireguard
umask 077
wg genkey | tee server_private_key | wg pubkey > server_public_key

# /etc/wireguard/wg0.conf 
[Interface]
Address = 10.0.5.1/32
SaveConfig = true
PrivateKey = <paste server private key here>
ListenPort = 51820

[Peer]
PublicKey = <paste client public key here>
AllowedIPs = 10.0.5.2/32

wg-quick up wg0

### Client

cd /etc/wireguard
umask 077
wg genkey | tee client_private_key | wg pubkey > client_public_key

# /etc/wireguard/wg0.conf 
[Interface]
Address = 10.0.5.2/32
PrivateKey = <paste client private key here>

[Peer]
PublicKey = <paste server public key here>
Endpoint = <paste server IP here>:51820
AllowedIPs = 10.0.5.0/24

wg-quick up wg0

The servers should be reachable via their internal addresses a moment after these steps.
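
A quick verification sketch, using the addresses from the configs above:

# on the server
wg show
ping -c 3 10.0.5.2

# on the client
ping -c 3 10.0.5.1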


cecchisandrone commented Sep 19, 2018

Any update on this? Did you solve the issue in some way?

It also happens to me with a 3-node swarm on docker 18.06.1-ce.


agrrh commented Sep 19, 2018

Not really, I've found a job. 😅

Gonna try to reproduce the issue today and report back.


agrrh commented Sep 20, 2018

Yes, the issue still persists with current wireguard (0.0.20180910-wg1) and docker-ce (18.06.1-ce).

I have 2 nodes; both are active and reachable over the internal addresses, but every second request to the docker service fails.

Sadly, I'm stuck at the same point and could not figure out what blocks requests between the docker nodes.


bagbag commented Nov 16, 2018

The same happens for me with wireguard 0.0.20181018 and docker-ce 18.09.0.
Have you found a solution for this problem?


agrrh commented Nov 18, 2018

JFYI, I found the very same issue: #37985
