
userland-proxy: false does not clean-up NAT rule when switching to userland-proxy: true #44721

Closed
Tracked by #14856
polarathene opened this issue Dec 31, 2022 · 4 comments · Fixed by #44803
Labels
area/networking kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/0-triage version/20.10

Comments

@polarathene
Contributor

polarathene commented Dec 31, 2022

Description

Network behaviour from userland-proxy: false carries over to userland-proxy: true.

Docker networks (at least the default bridge and custom bridges) do not behave deterministically with userland-proxy: true; behaviour differs between these two scenarios:

  • The system and Docker daemon are started with userland-proxy: true (no switch from userland-proxy: false involved).
  • The daemon is started or restarted with userland-proxy: false set, and is then switched to userland-proxy: true.

This is a niche bug: it is only visible when adjusting this configuration without restarting the system, so it is most likely to occur while learning or debugging docker networking with this setting involved (commonly seen in IPv6 config advice / discussions).

I found this confusing to track down, but I am not sure what is going on behind the scenes to cause this.


Reproduction

Steps

Pay attention to the userland-proxy state and to the outputs: value at each step:

  1. echo '{ "userland-proxy": true }' > /etc/docker/daemon.json
  2. systemctl restart docker
  3. docker network create test
  4. docker run --rm -d -p 80:80 --network test --name bug traefik/whoami
  5. curl -s http://192.168.1.42 | grep RemoteAddr (outputs: RemoteAddr: 192.168.1.42)
  6. docker stop bug
  7. echo '{ "userland-proxy": false }' > /etc/docker/daemon.json
  8. systemctl restart docker
  9. docker run --rm -d -p 80:80 --network test --name bug traefik/whoami
  10. curl -s http://192.168.1.42 | grep RemoteAddr (outputs: RemoteAddr: 172.23.0.1)
  11. echo '{ "userland-proxy": true }' > /etc/docker/daemon.json
  12. systemctl restart docker
  13. docker run --rm -d -p 80:80 --network test --name bug traefik/whoami
  14. curl -s http://192.168.1.42 | grep RemoteAddr (outputs: RemoteAddr: 172.23.0.1)

Reproduction notes

NOTE: To reproduce:

  • 192.168.1.42 should be substituted with an IP on the host's local interfaces; I've used the router-assigned IP here, but have also reproduced this with a VPS's public IP (both v4 and v6 addresses).
  • 172.x.0.1 is the gateway IP of the docker network the container is connected to (it does not need to be a custom network; the default bridge network also exhibits this bug).
  • The initial run with userland-proxy: true only showcases that the expected value appears by default. The bug itself is specifically the userland-proxy: false to userland-proxy: true change.
  • You could alternatively toggle with dockerd --userland-proxy=true and dockerd --userland-proxy=false instead of the daemon.json config; the same issue occurs.
  • If you create a new network after all this (while userland-proxy: true) and test it, it will resolve the correct remote address IP, while the earlier network remains affected by the previous userland-proxy: false changes until the system restarts.

Copy & paste shell commands:
# Replace IP 192.168.1.42 with one relevant on your machine:
echo '{ "userland-proxy": true }' > /etc/docker/daemon.json \
  && systemctl restart docker \
  && docker run --rm -d -p 80:80 traefik/whoami \
  && curl -s http://192.168.1.42 | grep RemoteAddr

# Correct output for `userland-proxy: true`:
RemoteAddr: 192.168.1.42


echo '{ "userland-proxy": false }' > /etc/docker/daemon.json \
  && systemctl restart docker \
  && docker run --rm -d -p 80:80 traefik/whoami \
  && curl -s http://192.168.1.42 | grep RemoteAddr

# Correct output for `userland-proxy: false`:
RemoteAddr: 172.23.0.1


echo '{ "userland-proxy": true }' > /etc/docker/daemon.json \
  && systemctl restart docker \
  && docker run --rm -d -p 80:80 traefik/whoami \
  && curl -s http://192.168.1.42 | grep RemoteAddr

# Incorrect output for `userland-proxy: true`:
RemoteAddr: 172.23.0.1

Expected behavior

Switching from userland-proxy: false to userland-proxy: true should not cause existing networks to behave differently from those created with userland-proxy: true (with no prior switch to userland-proxy: false involved). Both networks should behave the same way when userland-proxy: true is applied.

More info

docker version
Client: Docker Engine - Community
 Version:           20.10.22
 API version:       1.41
 Go version:        go1.18.9
 Git commit:        3a2c30b
 Built:             Thu Dec 15 22:28:16 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.22
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.9
  Git commit:       42c8b31
  Built:            Thu Dec 15 22:26:08 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.14
  GitCommit:        9ba4b250366a5ddde94bb7c9d1def331423aa323
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  compose: Docker Compose (Docker Inc., v2.14.1)
  scan: Docker Scan (Docker Inc., v0.23.0)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 2
 Server Version: 20.10.22
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9ba4b250366a5ddde94bb7c9d1def331423aa323
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.19.0-26-generic
 Operating System: Ubuntu 22.10
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 968.9MiB
 Name: vultr-test
 ID: GNPV:BVKP:LWTR:FFLY:4LCT:RPF4:X6A7:M6DC:RWKL:WTEH:LHQS:EEFV
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Observations

When userland-proxy: false:

  • The reported remote address is the container network's gateway IP (expected; the effect is similar to when iptables / ip6tables NAT is disabled).
  • userland-proxy: false also causes curl http://[::1] to time out (curl http://127.0.0.1 still works), while IPv6 addresses associated with interfaces other than loopback could at least connect.

When userland-proxy: true:

  • Despite the primary issue this bug details, curl http://[::1] now connects, and it always returns the docker network gateway IP matching that protocol (the IPv6 gateway if there is an IPv6-enabled docker network with ip6tables + experimental enabled in daemon.json; otherwise the IPv4 gateway IP is returned).
  • It is only the other host NICs that are expected to report back their own IP instead of the gateway of the docker network used (technically --network host would also report the loopback IP instead of the gateway IP, but that's not a concern here).
  • The bug has only been observed when connecting from the host to a container; connecting to the same public IP address of a host NIC from a remote system still correctly reports the expected remote IP (that of the connecting client). A remote client would only see the docker network's gateway IP as the remote address when userland-proxy: false is in effect (the IPv4 gateway, or the IPv6 equivalent gateway when the relevant IPv6 config applies).

/etc/docker/daemon.json used for enabling IPv6 with NAT on the default docker bridge (so that containers report back the expected IPv4 / IPv6 remote address):

{ 
  "ipv6": true,
  "fixed-cidr-v6": "fd00:cafe:babe::/48",
  "ip6tables": true,
  "experimental": true,
  "userland-proxy": true
}

NOTE: I have verified the bug without any of the IPv6 / experimental config present as well. Still affects IPv4.
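
For reference, these are the kinds of loopback checks behind the IPv6 observations above; a minimal sketch, assuming a container publishing port 80 as in the reproduction and the IPv6-enabled daemon.json above:

# Times out with userland-proxy: false, connects with userland-proxy: true:
curl -s --max-time 5 http://[::1] | grep RemoteAddr
# Works with either setting:
curl -s http://127.0.0.1 | grep RemoteAddr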

@polarathene polarathene added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/0-triage labels Dec 31, 2022
@polarathene
Contributor Author

polarathene commented Jan 1, 2023

Identified where the bug is occurring

I have found, by diffing nft list ruleset output, that this bug comes from the following:

table ip nat {
        chain POSTROUTING {
                type nat hook postrouting priority srcnat; policy accept;
                oifname != "docker0" ip saddr 172.17.0.0/16 counter packets 0 bytes 0 masquerade 
                oifname "docker0" fib saddr type local counter packets 0 bytes 0 masquerade 
                meta l4proto tcp ip saddr 172.17.0.2 ip daddr 172.17.0.2 tcp dport 80 counter packets 0 bytes 0 masquerade 
        }

...

table ip6 nat {
        chain POSTROUTING {
                type nat hook postrouting priority srcnat; policy accept;
                oifname != "docker0" ip6 saddr fd00:1111:2222::/48 counter packets 0 bytes 0 masquerade  
                oifname "docker0" fib saddr type local counter packets 0 bytes 0 masquerade  
                meta l4proto tcp ip6 saddr fd00:1111:2222::242:ac11:2 ip6 daddr fd00:1111:2222::242:ac11:2 tcp dport 80 counter packets 0 bytes 0 masquerade  
        }

The following POSTROUTING rule in each nat table has persisted from userland-proxy: false to userland-proxy: true:

oifname "docker0" fib saddr type local counter packets 0 bytes 0 masquerade

Reproduction details

/etc/docker/daemon.json content
{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:1111:2222::/48",
  "ip6tables": true,
  "experimental" : true,
  "userland-proxy": true
}

I updated the userland-proxy bool in the daemon.json config (to false to capture the first ruleset, then back to true to capture the second for comparison), restarted the docker service, then ran docker run --rm -d -p 80:80 traefik/whoami followed by nft list ruleset.
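
Spelled out as commands, that sequence is roughly the following; a minimal sketch, with arbitrary output file names:

# Capture one, with "userland-proxy": false in daemon.json:
systemctl restart docker
docker run --rm -d -p 80:80 traefik/whoami
nft list ruleset > /tmp/ruleset-proxy-false.txt

# Capture two, after switching daemon.json back to "userland-proxy": true:
systemctl restart docker
docker run --rm -d -p 80:80 traefik/whoami
nft list ruleset > /tmp/ruleset-proxy-true.txt

diff -u /tmp/ruleset-proxy-false.txt /tmp/ruleset-proxy-true.txt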


iptables -t nat -L output

With no IPv6-related config, just toggling userland-proxy:

{
  "userland-proxy": true
}

userland-proxy: true:

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DOCKER     all  --  anywhere            !localhost/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
MASQUERADE  all  --  172.17.0.0/16        anywhere            
MASQUERADE  tcp  --  172.17.0.2           172.17.0.2           tcp dpt:http

Chain DOCKER (2 references)
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere            
DNAT       tcp  --  anywhere             anywhere             tcp dpt:http to:172.17.0.2:80

userland-proxy: false:

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
MASQUERADE  all  --  anywhere             anywhere             ADDRTYPE match src-type LOCAL
MASQUERADE  all  --  172.17.0.0/16        anywhere            
MASQUERADE  tcp  --  172.17.0.2           172.17.0.2           tcp dpt:http

Chain DOCKER (2 references)
target     prot opt source               destination         
DNAT       tcp  --  anywhere             anywhere             tcp dpt:http to:172.17.0.2:80

userland-proxy: true (rule persisted from previous false config):

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DOCKER     all  --  anywhere            !localhost/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
MASQUERADE  all  --  172.17.0.0/16        anywhere            
MASQUERADE  all  --  anywhere             anywhere             ADDRTYPE match src-type LOCAL
MASQUERADE  tcp  --  172.17.0.2           172.17.0.2           tcp dpt:http

Chain DOCKER (2 references)
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere            
DNAT       tcp  --  anywhere             anywhere             tcp dpt:http to:172.17.0.2:80

The POSTROUTING rule that shouldn't be there:

MASQUERADE  all  --  anywhere             anywhere             ADDRTYPE match src-type LOCAL
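
To check programmatically whether this leftover rule is present, the nat POSTROUTING chain can be inspected in iptables-save format; a minimal sketch, assuming the default docker0 bridge:

# The leftover rule appears as: -A POSTROUTING -o docker0 -m addrtype --src-type LOCAL -j MASQUERADE
iptables -t nat -S POSTROUTING | grep -- '--src-type LOCAL' \
  && echo 'stale masquerade rule still present' \
  || echo 'no stale rule'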

@polarathene polarathene changed the title Network behaviour from userland-proxy: false carries over to userland-proxy: true userland-proxy: false does not clean-up NAT rule when switching to userland-proxy: true Jan 1, 2023
@polarathene
Contributor Author

polarathene commented Jan 2, 2023

Reproduction with workaround fix

This demonstrates that the leftover rule identified above is the cause of the observed bug.

The workaround is to manually remove the NAT rule from iptables (and from ip6tables as well if IPv6 is used). This avoids needing to restart the system:

  • iptables -t nat -D POSTROUTING -o docker0 -m addrtype --src-type LOCAL -j MASQUERADE
  • ip6tables -t nat -D POSTROUTING -o docker0 -m addrtype --src-type LOCAL -j MASQUERADE

Before starting, change TEST_IP to an IP on one of the host's interfaces; it can be a public IPv4 address too.

TEST_IP=http://192.168.1.42 \
  && echo '{ "userland-proxy": false }' > /etc/docker/daemon.json \
  && systemctl restart docker \
  && docker run --rm -d -p 80:80 traefik/whoami \
  && (curl -s "${TEST_IP}" | grep RemoteAddr) \
  && echo '{ "userland-proxy": true }' > /etc/docker/daemon.json \
  && systemctl restart docker \
  && docker run --rm -d -p 80:80 traefik/whoami \
  && (curl -s "${TEST_IP}" | grep RemoteAddr) \
  && iptables -t nat -D POSTROUTING -o docker0 -m addrtype --src-type LOCAL -j MASQUERADE \
  && (curl -s "${TEST_IP}" | grep RemoteAddr)
  • Your first two returned IPs should be the same internal docker gateway IP, despite the change to userland-proxy: true.
  • The iptables rule that shouldn't be present is then removed, and the expected queried IP is returned as the RemoteAddr value. This is the same value you'd see if you had started with userland-proxy: true (having never set userland-proxy: false since boot; otherwise this iptables / ip6tables rule is already lingering). A scripted version of this cleanup is sketched after this list.
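
The same cleanup can be scripted so the rule is only deleted when it is actually present; a minimal sketch, assuming the default docker0 bridge name and using iptables -C to test for the rule first:

for ipt in iptables ip6tables; do
  # -C exits 0 only if the rule exists, so -D only runs when there is something to remove:
  if "$ipt" -t nat -C POSTROUTING -o docker0 -m addrtype --src-type LOCAL -j MASQUERADE 2>/dev/null; then
    "$ipt" -t nat -D POSTROUTING -o docker0 -m addrtype --src-type LOCAL -j MASQUERADE
  fi
done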

@neersighted
Member

I would say this issue is partially related to #14856:

  • The 'loss of a workaround' should not matter -- the goal with disabling the userland proxy by default is that there is no change in behavior, or said change is as minimal as possible and the differences are well-characterized. We're a long way from that still, however.
  • This should be a blocker for resolving Disable Userland proxy by default #14856 -- we should properly clean up our rules (though I am not sure where in the lifecycle that needs to happen yet) before we turn off the userland proxy by default.

@polarathene
Contributor Author

The 'loss of a workaround' should not matter -- the goal with disabling the userland proxy by default is that there is no change in behavior

Oh I completely agree that not needing a workaround would be preferred!

I just wanted to raise awareness that defaulting userland-proxy to disabled (without resolving this bug) would add more friction / confusion for those encountering this bug 😅


This should be a blocker for resolving Disable Userland proxy by default #14856

Agreed 👍

The rules seem to be properly cleaned up when toggling userland-proxy; it's just this particular rule that I noticed lingered. I haven't looked for where this happens in the source code, but I assume it's co-located with the other rules that do get properly cleaned up?

akerouanton added a commit to akerouanton/docker that referenced this issue Jan 11, 2023
When userland-proxy was turned off and on again, the iptables nat rule
doing hairpinning wasn't properly removed. This fix makes sure that nat
rule is removed whenever the bridge is torn down or hairpinning is
disabled (through setting userland-proxy to true).

Fixes moby#44721.
akerouanton added a commit to akerouanton/docker that referenced this issue Jan 12, 2023
When userland-proxy is turned off and on again, the iptables nat rule
doing hairpinning isn't properly removed. This fix makes sure this nat
rule is removed whenever the bridge is torn down or hairpinning is
disabled (through setting userland-proxy to true).

Unlike for ip masquerading and ICC, the `programChainRule()` call
setting up the "MASQ LOCAL HOST" rule has to be called unconditionally
because the hairpin parameter isn't restored from the driver store, but
always comes from the driver config.

For the "SKIP DNAT" rule, things are a bit different: this rule is
always deleted by `removeIPChains()` when the bridge driver is
initialized.

Fixes moby#44721.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
(cherry picked from commit 566a2e4)
Signed-off-by: Albin Kerouanton <albinker@gmail.com>