Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support hairpin NAT #6810

Merged
merged 1 commit into from Nov 10, 2014
Merged

Support hairpin NAT #6810

merged 1 commit into from Nov 10, 2014

Conversation

@phemmer
Copy link
Contributor

@phemmer phemmer commented Jul 2, 2014

This re-applies commit b39d02b with additional iptables rules to solve the issue with containers routing back into themselves.

The previous issue with this attempt was that the DNAT rule would send traffic back into the container it came from. When this happens you have 2 issues:

  1. reverse path filtering. The container is going to see the traffic coming in from the outside and it's going to have a source address of itself. So reverse path filtering will kick in and drop the packet.
  2. direct return mismatch. Assuming you turned reverse path filtering off, when the packet comes back in, it's goign to have a source address of itself, thus when the reply traffic is sent, it's also going to have a source address of itself. But the original packet was sent to the docker host IP address, so the traffic will be dropped because it's coming from an address which the original traffic was not sent to (and likely with an incorrect port as well).

The solution to this is to masquerade the traffic when it gets routed back into the origin container. However for this to work you need to enable hairpin mode on the bridge port, otherwise the kernel will just drop the traffic.
The hairpin mode set is part of libcontainer, while the MASQ change is part of docker.

The libcontainer change is docker-archive/libcontainer#62

Also, since part of this change is in libcontainer, I wasn't sure how to handle that. Such as whether the files in /vendor should be patched.


Note, with this change, it is almost possible to remove the docker proxy. The only thing left that the proxy handles is connecting to a port mapping via localhost from the docker server (the TestAllocateTCPPortLocalhost test).
This can easily be supported by iptables, but requires 3 changes:

  1. The ! -d 127.0.0.0/8 on the -t nat OUTPUT rule needs to be removed. However from the comment on line 131, it looks like this was deliberately added for some reason.
  2. A MASQUERADE rule needs to be added to masquerade traffic so it's not appearing to come from 127.0.0.1.
  3. net.ipv4.conf.$iface.route_localnet=1 needs to be set on the bridge interface. While route_localnet is there for security reasons (since local network traffic should never be routed), the route_localnet inside a container's namespace will prevent any container from being able to send localnet traffic to another through the bridge. And then the accept_local setting in the container will prevent the traffic from being received as well.

I ran the integration test suite with the proxy disabled, and the only tests that failed were those localhost tests.

ref: #4442 #5133

@vieux
Copy link
Contributor

@vieux vieux commented Jul 8, 2014

@jpetazzo can you take this one please ?

@jpetazzo
Copy link
Contributor

@jpetazzo jpetazzo commented Jul 9, 2014

I'll look as soon as I can (but I have a few big presentations looming right on the corner); meanwhile, I'd like to summon @mpetazzoni because I know he was severely impacted by hairpin NAT!

@mpetazzoni
Copy link

@mpetazzoni mpetazzoni commented Jul 10, 2014

Haven't tested it, but that's basically the change that I need indeed. We also never use localhost to connect to a bound port, so this fix as is is already a great step forward.

It would be great of course in the long run if the proxy could be removed completely, but the changes for that look more complex (and should be looked at by people with more network knowledge than I have).

I'll build this branch tomorrow, give it a spin and report back.

@crosbymichael
Copy link
Contributor

@crosbymichael crosbymichael commented Jul 10, 2014

We would be willing to remove the proxy if you can get this working....

@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Jul 10, 2014

I might, but should probably be a separate PR.

@mpetazzoni
Copy link

@mpetazzoni mpetazzoni commented Jul 10, 2014

+1, :shipit:

Confirmed working, the Docker daemon is no longer hogging CPU being busy moving data around via the userland proxy. Helps a lot!

@crosbymichael
Copy link
Contributor

@crosbymichael crosbymichael commented Jul 10, 2014

Ok but do thinks like mongodb going out via the ext interfaces back in to itself still work?

@crosbymichael
Copy link
Contributor

@crosbymichael crosbymichael commented Jul 10, 2014

i wonder if we can test removing the userland proxy with this.....

@vieux
Copy link
Contributor

@vieux vieux commented Jul 10, 2014

@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Jul 10, 2014

@crosbymichael See second half of the description. It comments on that.

@jpetazzo
Copy link
Contributor

@jpetazzo jpetazzo commented Jul 11, 2014

I don't think we want to remove the userland proxy (do we?) but an option to disable it (daemon-wide) would be nice.

(Rationale: when people want to port Docker to FreeBSD/Solaris, it will be nice to use the userland proxy, right?)

@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Jul 11, 2014

The big problem with the proxy is that you lose client information. The server no longer knows the IP/port of the client.
I think it would be dangerous to have that behavior vary between host OS. If you run your containerized app, and it can see the client IP/host on one host OS, it should be exactly the same on any other host OS.

Addendum: I'm all for making it an option, but the default behavior should be the same everywhere.

@jpetazzo
Copy link
Contributor

@jpetazzo jpetazzo commented Jul 16, 2014

Yes, but then porting Docker will be more difficult, since you have to port the iptables logic.

Also, AFAIR handling NAT connections to localhost requires tweaking some sysctls. I don't know if we want to do that.

@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Jul 16, 2014

if/when docker gets ported to other OSs, iptables will have to be ported anyway as there are numerous other things using iptables besides potential localhost traffic. Plus I think implementing the iptables equivalent in the target OS will be trivial compared to everything else.

And yes, it does require setting a single sysctl param on the bridge interface, net.ipv4.conf.$iface.route_localnet=1. Docker already changes sysctl params in order to enable ip_forward.

@LK4D4
Copy link
Contributor

@LK4D4 LK4D4 commented Jul 16, 2014

@phemmer Hi, I tried to test this yesterday. Would you mind give me exact MASQUERADE rule that needed for "localhost exposing"?

@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Jul 17, 2014

@LK4D4 sorry, missed your message.

I successfully routed localhost by performing the following:

iptables -t nat -A OUTPUT -m addrtype --dst-type LOCAL -j DOCKER
iptables -t nat -D OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
iptables -t nat -A POSTROUTING -m addrtype --src-type LOCAL -o docker0 -j MASQUERADE
sysctl -w net.ipv4.conf.docker0.route_localnet=1

@LK4D4
Copy link
Contributor

@LK4D4 LK4D4 commented Nov 4, 2014

@phemmer Would you mind to rebase? I think we really want to remove userland-proxy now.

This re-applies commit b39d02b with additional iptables rules to solve the issue with containers routing back into themselves.

The previous issue with this attempt was that the DNAT rule would send traffic back into the container it came from. When this happens you have 2 issues.
1) reverse path filtering. The container is going to see the traffic coming in from the outside and it's going to have a source address of itself. So reverse path filtering will kick in and drop the packet.
2) direct return mismatch. Assuming you turned reverse path filtering off, when the packet comes back in, it's goign to have a source address of itself, thus when the reply traffic is sent, it's going to have a source address of itself. But the original packet was sent to the host IP address, so the traffic will be dropped because it's coming from an address which the original traffic was not sent to (and likely with an incorrect port as well).

The solution to this is to masquerade the traffic when it gets routed back into the origin container. However for this to work you need to enable hairpin mode on the bridge port, otherwise the kernel will just drop the traffic.
The hairpin mode set is part of libcontainer, while the MASQ change is part of docker.

This reverts commit 63c303e.

Docker-DCO-1.1-Signed-off-by: Patrick Hemmer <patrick.hemmer@gmail.com> (github: phemmer)
@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Nov 4, 2014

rebased. no conflicts

@LK4D4
Copy link
Contributor

@LK4D4 LK4D4 commented Nov 4, 2014

@jessfraz
Copy link
Contributor

@jessfraz jessfraz commented Nov 4, 2014

RIP userland-proxy

@mrunalp
Copy link
Contributor

@mrunalp mrunalp commented Nov 6, 2014

Does docker0 bridge have to be set to have STP enabled for this?

@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Nov 6, 2014

@mrunalp No. This requires no special features/modes other than that the container's external virtual interface have hairpin mode enabled.

@jessfraz
Copy link
Contributor

@jessfraz jessfraz commented Nov 6, 2014

considering the original commit for this was by @ibuildthecloud, do you have any thoughts/warnings on what happened the first time around.. I would be curious to know since I was not involved then.

@phemmer
Copy link
Contributor Author

@phemmer phemmer commented Nov 6, 2014

Well we have an integration test for the issue so it can't happen again.

@jessfraz
Copy link
Contributor

@jessfraz jessfraz commented Nov 6, 2014

ah well then thats perfect, this LGTM, I tried with routing localhost too and it worked

@LK4D4
Copy link
Contributor

@LK4D4 LK4D4 commented Nov 6, 2014

ping @crosbymichael
Are you OK to merge this? Seems like all okay with this patch, proxy will be removed in another.

@LK4D4
Copy link
Contributor

@LK4D4 LK4D4 commented Nov 10, 2014

I can confirm that without userland proxy and with rules which was written above by @phemmer and route_localnet localhost access working as expected. Also I tested that without these rules - localhost access doesn't working.

@LK4D4
Copy link
Contributor

@LK4D4 LK4D4 commented Nov 10, 2014

ping @icecrime

@icecrime
Copy link
Contributor

@icecrime icecrime commented Nov 10, 2014

LGTM

icecrime pushed a commit that referenced this pull request Nov 10, 2014
@icecrime icecrime merged commit 0f21d9a into moby:master Nov 10, 2014
1 check passed
1 check passed
@aluzzardi
default The build succeeded on drone.io
Details
@icecrime icecrime mentioned this pull request Nov 11, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

10 participants