Add support to preserve source IP using TPROXY
#7089
Comments
As an update, when I worked on the proof of concept I was curious if we can preserve the source IP of the request by doing all packet routing at the NAT level, instead of relying on mangling headers and using the The benefits of doing this at the NAT level:
The main disadvantage is that source NAT can happen when we do destination NAT if (and only if) two connections to the same target end up with the same source IP and source port in the 5-tuple that identifies a TCP connection. This seems pretty improbable to me, though, and in that case everything would still work; we just wouldn't have the source IP preserved. We can surface this case in the proxy through log messages. The main disadvantages of relying on
Would love some opinions on this if you have the time.
Update: We had some discussions off GitHub about this recently. The team decided not to move forward with implementing this; while the research proves this can be done, the trade-off comes at the cost of a more permissive model for the proxy. Put plainly, in order to set the `IP_TRANSPARENT` option, the proxy needs additional privileges (the `NET_ADMIN` capability). This solution might work in other use cases; it would make sense to have it on ingress traffic, but as it stands, it's not feasible for inter-cluster communication. I'll be closing the issue; the research is there if circumstances change in the future.
This issue outlines the concepts involved in adding support for `TPROXY`, and proposes changes that would enable this to work with Linkerd. This was initially requested in #4713, and I know some members of the community would be glad to have it in.

The proposal leverages `TPROXY` and the socket option `IP_TRANSPARENT` to preserve the IP of a client for any TCP connection. A great advantage of doing this at the firewall level is that we do not have to set any special request headers (e.g. `X-FORWARDED-FOR`); this would be application protocol agnostic. To make this work, there are two parts: the firewall rules (`iptables`) and the `IP_TRANSPARENT` socket option.

Note: TCP connections are identified through a 5-tuple: `(src_ip, src_port, protocol, dst_ip, dst_port)`. When relying on NAT at the firewall level to rewrite the destination (to the proxy's port), the source IP will in general be preserved. In the `PREROUTING` chain, it is common to only do DNAT. However, if two connections to the same host share the same source IP and port, it is possible to have SNAT done in the prerouting step. Theoretically, we might be able to do this without tproxy, but tproxy does introduce some guarantees that are good to have despite the complexity.

Update: I created a proof of concept of how this would work in k8s. The proof of concept will preserve the src IP of the client; it can do so at the `nat` level (with the current iptables rules we have in place) or at the `mangle` level using the tproxy target. The proof of concept is linked under "Implementation and scoped work" below.

TPROXY support
Introduction. Tproxy refers to a module that adds transparent proxy support to the kernel[1]. In essence, it allows us to proxy traffic from a client to a server through a local socket; as far as the client is concerned, it connected successfully to the original target. The documentation outlines the steps to make this work[1], including:
- Create a socket with the `IP_TRANSPARENT` option set (this lets the socket bind to a non-local address).
- Use the `TPROXY` target in iptables to have traffic routed through that socket instead of going directly to the original dst.

Essentially, this was designed to intercept traffic on a router: even if the destination is not local, we can "impersonate it" through a local socket that has the `IP_TRANSPARENT` option set. This works in a similar way to the `REDIRECT` target, which we make use of at the moment.

Before we go on, there are some netfilter concepts to revisit.
This is a short catch-up on some iptables concepts. The predefined tables that matter here are raw, mangle, nat, and filter, consulted in that order (there is also a security table, which we can ignore). There is also a set of predefined chains that a packet will go through, and these chains are associated with the tables. The same chain can appear in several tables (`PREROUTING`, for example, exists in raw, mangle, and nat), and each table keeps its own rules for that chain.
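To make the table and chain relationship concrete, here is a minimal sketch that models it as plain data. It only covers the two chains this issue works with (`PREROUTING` and `OUTPUT`); the names `TABLE_ORDER`, `CHAIN_TABLES`, and `tables_for` are made up for illustration and are not part of iptables.

```python
# Tables in the order they are consulted for PREROUTING and OUTPUT.
# (Other chains, e.g. INPUT, use a different order and are left out on purpose.)
TABLE_ORDER = ["raw", "mangle", "nat", "filter"]

# Which tables carry rules for each of the two chains discussed in this issue.
CHAIN_TABLES = {
    "PREROUTING": {"raw", "mangle", "nat"},
    "OUTPUT": {"raw", "mangle", "nat", "filter"},
}


def tables_for(chain):
    """Return the tables a packet visits in `chain`, in traversal order."""
    if chain not in CHAIN_TABLES:
        raise ValueError(f"chain not covered by this sketch: {chain}")
    return [table for table in TABLE_ORDER if table in CHAIN_TABLES[chain]]
```

Calling `tables_for("PREROUTING")` shows that mangle rules run before nat rules on the inbound path, which is the property tproxy relies on.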
For tproxy, we are concerned primarily with the `mangle` table (which will be traversed before `nat`). Mangle, as the name implies, allows us to mangle packets (this generally means changing the IP header), but it has another advantage. The mangle table can use the `MARK` target, which was briefly mentioned in the introduction. Simply put, we can mark certain packets with an arbitrary value (note, though, that the mark is something the kernel tracks internally; it's not ON the packet). Marks are generally used with policy-based routing at the firewall level[2]. We can essentially create our own routes and our own routing table and apply this "policy" only to packets that are marked. This is important to understand for the rest of this to make sense. Setting marks is only supported in the mangle table[2].

Lastly, the mangle table has the same two chains we have been operating on using nat: `PREROUTING` and `OUTPUT`.

Changes required. The set-up itself isn't very complicated: we need to set up new routing rules and intercept packets with a socket that has the `IP_TRANSPARENT` option. Since we are only concerned with this on the inbound side, my feeling is that we can still rely on nat to proxy requests through the outbound side. Inbound, we would have to add a few rules:
- Mark inbound packets with `1`.
- Add a policy routing rule so that marked packets, whatever their destination (`0.0.0.0/0`), will be treated as local.
- Add the `TPROXY` rule: `iptables -t mangle -A PREROUTING -p tcp -j TPROXY --tproxy-mark <0x1, our mark> --on-port <port>`.

TL;DR. Tproxy works like nat, but with some extra steps. Its advantage is that it does not rewrite the destination address, and it is also more reliable in the face of SNAT (which may happen if two connections use the same src port and IP, for whatever reason). To use the `TPROXY` target in iptables, we first need to mark inbound packets and then route them using policy routing; this will ensure all packets are treated as local, instead of being forwarded. I'd argue this is not strictly necessary in k8s, since we will never act as a router (dst will always be local to the proxy), but it seems to be the norm. In essence, all destination addresses will be treated as local (if you're confused, welcome to the club).

By re-using the mark, we can tell iptables to send packets to our port instead of their original destination. Neat. We are missing one step though: we have to actually impersonate the original destination.
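As a rough illustration of the `TPROXY` rule above, the sketch below builds the command as an argument list without running it (running it needs root). The function name `tproxy_rule` is hypothetical, and the port used in the usage note (4143) is only an assumption about where the proxy's inbound listener would be.

```python
def tproxy_rule(on_port, mark=0x1):
    """Build (but do not run) the mangle/PREROUTING TPROXY rule as an argv list."""
    if not 0 < on_port < 65536:
        raise ValueError("on_port must be a valid TCP port")
    mark_hex = f"{mark:#x}"
    return [
        "iptables", "-t", "mangle", "-A", "PREROUTING",
        "-p", "tcp",
        "-j", "TPROXY",
        # value/mask form: set exactly the bits of our mark on matching packets
        "--tproxy-mark", f"{mark_hex}/{mark_hex}",
        # local port of the proxy socket that has IP_TRANSPARENT set
        "--on-port", str(on_port),
    ]
```

For example, `" ".join(tproxy_rule(4143))` gives a command line that could be reviewed or handed to `subprocess.run` by an init step. Keeping the rule as data makes it easy to log or diff before anything touches the firewall.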
If you're confused about routing tables, read on.
These are the two commands that the tproxy docs list: `ip rule add fwmark 1 lookup 100` and `ip route add local 0.0.0.0/0 dev lo table 100`.
Like I said before, a mark is just a field maintained in the kernel and associated with a specific packet. When a packet comes in, after it traverses `PREROUTING`, a routing decision has to be made, and the kernel consults its routing policy database. The first command says: for any packet marked as 1, look up table 100 (the table can have any number or name). The second command adds a new route to that table; it says: the scope of this route is local, it applies to any IPv4 destination, and matching packets have to be handled on the loopback interface, i.e. they never leave the host.
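The two policy-routing commands described above could be generated and applied from an init step along these lines. This is a sketch only: `policy_routing_commands` and `apply` are hypothetical helper names, and the defaults (mark 1, table 100) simply mirror the values discussed in the text.

```python
import subprocess


def policy_routing_commands(mark=1, table=100):
    """Return the `ip` commands that route marked packets to the local host."""
    return [
        # Packets carrying `mark` consult routing table `table` ...
        ["ip", "rule", "add", "fwmark", str(mark), "lookup", str(table)],
        # ... and in that table every IPv4 destination is local, via loopback.
        ["ip", "route", "add", "local", "0.0.0.0/0", "dev", "lo", "table", str(table)],
    ]


def apply(commands, runner=subprocess.run):
    """Run each command with `runner`; pass a stub runner for a dry run."""
    for argv in commands:
        runner(argv, check=True)
```

A dry run can pass a stub such as `lambda argv, check: print(" ".join(argv))` as `runner`; the real call needs root (or `NET_ADMIN`) and `iproute2` installed.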
IP_TRANSPARENT
Introduction. To successfully route traffic using the tproxy target, we also need to set the `IP_TRANSPARENT` option on the socket. The documentation for it is quite confusing[3]; in short, it allows us to bind to a non-local IP address. As we will see, this goes two ways: inbound and outbound.

Inbound. We need to set the option on our server socket (not the app, the proxy's server) so that we may receive traffic routed through tproxy. The `local_addr` on this socket will be the application's IP:PORT! For example, say we have our server bound on `0.0.0.0:5000` and our app on `0.0.0.0:3000`. We set tproxy to redirect all packets to 5000. When a client sends a request to `10.0.0.1:3000`, our proxy will receive it. As far as all parties are concerned, the local address is `10.0.0.1:3000`, but it is in fact `10.0.0.1:5000`.

Outbound. When we open a connection from the proxy to the application, we need to use the `IP_TRANSPARENT` option again; the use case here is perhaps a bit more subtle. We have access to the peer address from our server socket, so we can re-use that and bind before connect. The peer address is not local, but `IP_TRANSPARENT` will let us bind to it anyway.

The actual complication comes from what the application perceives as being the client. It thinks it talks to the client, so any reply packets will be sent to the client. We need some additional rules in place to route replies back through our proxy. This is where the `CONNMARK` target comes into play. When we build the socket, aside from the transparent option, we also set the same mark on it as on all other packets. We then add a rule to each chain: `PREROUTING` and `OUTPUT`. On the `PREROUTING` side, we add a `CONNMARK` rule whereby packets turn their mark into a connection mark. On the `OUTPUT` side, we restore the connection mark back into a packet mark so policy routing can be applied.

Implementation and scoped work
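As a rough sketch of the `CONNMARK` step described in the IP_TRANSPARENT section, the two rules could look like the following. The shape follows the rules published for mmproxy; the helper name `connmark_rules` and the use of `-A` (append) are assumptions for illustration, not the rules from the proof of concept.

```python
def connmark_rules(mark=0x1):
    """Build the save/restore CONNMARK rules as argv lists (nothing is executed)."""
    mark_hex = f"{mark:#x}"
    return [
        # PREROUTING: copy the packet mark onto the connection tracking entry.
        ["iptables", "-t", "mangle", "-A", "PREROUTING",
         "-m", "mark", "--mark", mark_hex,
         "-j", "CONNMARK", "--save-mark"],
        # OUTPUT: copy the connection mark back onto reply packets so the
        # fwmark policy route applies to them as well.
        ["iptables", "-t", "mangle", "-A", "OUTPUT",
         "-m", "connmark", "--mark", mark_hex,
         "-j", "CONNMARK", "--restore-mark"],
    ]
```

The mark passed here has to match both the mark in the policy routing rule and the mark set on the proxy's outbound socket, otherwise replies from the application bypass the proxy.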
I wanted to give a thorough introduction on how everything works conceptually. To illustrate all of this in practice, I created a proof of concept project: https://github.com/mateiidavid/linkerd-tproxy-poc.
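Independently of that repository, the bind-before-connect step from the outbound discussion can be sketched in a few lines. This is not code from the proof of concept: `connect_from` is a hypothetical helper, the numeric fallbacks are the Linux values from `<linux/in.h>` and `<asm-generic/socket.h>`, and setting `IP_TRANSPARENT` or a mark needs `NET_ADMIN`.

```python
import socket

# Fall back to the Linux constant values when Python does not export them.
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)
SO_MARK = getattr(socket, "SO_MARK", 36)


def connect_from(source, target, transparent=True, mark=None, timeout=5.0):
    """Open a TCP connection to `target` whose source address is `source`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if transparent:
            # Lets us bind to an address that is not local (the client's).
            sock.setsockopt(socket.SOL_IP, IP_TRANSPARENT, 1)
        if mark is not None:
            # Same mark as the firewall rules, so replies can be routed back.
            sock.setsockopt(socket.SOL_SOCKET, SO_MARK, mark)
        sock.settimeout(timeout)
        sock.bind(source)     # bind first, to the spoofed client address ...
        sock.connect(target)  # ... then connect to the application
        return sock
    except OSError:
        sock.close()
        raise
```

In the proxy, `source` would be the peer address of the accepted inbound connection and `target` the application's address. With `transparent=False` and a loopback source the same function runs unprivileged, which is handy for trying it out.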
In the poc, we have a server and a "proxy". The proxy will set up iptables rules and intercept traffic from a client (`nc` or `curl`) and then send requests to the server using a spoofed address. This should show how things come together. The proxy talks to the server over localhost for simplicity.

For Linkerd:
- The proxy would no longer need to read `SO_ORIGINAL_DST` on the socket (we can get the address directly from the socket since the packet is not rewritten by nat); this is not compulsory, we can still keep the logic in. We'd have to set different socket options though, most notably `IP_TRANSPARENT`. In the poc I have also used `freebind` and `reuseaddr`. There is a good article on the topic in the references below. Much of this set-up is actually inspired by mmproxy.
- The init container would need `iproute2` to configure policy routing.

The surface area of the change is not excessively large; I think we can get away with only adding inbound rules (and some additional output rules for connmarking) on the iptables side. We can feature flag this in the helm charts with `--set initContainer.tproxy=true` or something similar. Setting this should be reflected in the proxy template partial; my idea is to have an environment variable that lets the proxy know it has to set up the connections differently. The code will make use of this env variable to add/remove socket options.

I'll continue updating this as discussions go on; the next step for me would be to add a checklist to this issue that will track all necessary work. I'd like to first socialize the idea and see what other people think about the proposal. I'll also update this post with more explanations if needed.
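To illustrate the environment-variable idea, here is a minimal sketch of a listener whose socket options depend on a flag. The variable name `EXAMPLE_PROXY_INBOUND_TPROXY` and the helper names are invented for this example; the real proxy is written in Rust and would pick its own names. The numeric fallbacks are the Linux constant values.

```python
import os
import socket

IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)

# Hypothetical variable name, set by the proxy template when the flag is on.
TPROXY_ENV = "EXAMPLE_PROXY_INBOUND_TPROXY"


def tproxy_enabled(environ=os.environ):
    """True when the (hypothetical) feature flag is set to a truthy value."""
    return environ.get(TPROXY_ENV, "").strip().lower() in ("1", "true", "yes")


def listener_options(enabled):
    """Socket options for the inbound listener, as (level, name, value)."""
    options = [(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)]
    if enabled:
        options.append((socket.SOL_IP, IP_FREEBIND, 1))
        options.append((socket.SOL_IP, IP_TRANSPARENT, 1))  # needs NET_ADMIN
    return options


def make_listener(port, environ=os.environ):
    """Create the inbound listener, adding tproxy options only when enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for level, name, value in listener_options(tproxy_enabled(environ)):
        sock.setsockopt(level, name, value)
    sock.bind(("0.0.0.0", port))
    sock.listen()
    return sock
```

With the flag on, the local address of an accepted connection (`conn.getsockname()`) is already the original destination, so the `SO_ORIGINAL_DST` lookup can be skipped; with the flag off, the listener behaves as it does today.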
References
1: tproxy docs
2: Mark target
3: man ip
4: bind-before-connect
Further reading:
Tproxy proof of concept in C
Tproxy proof of concept in Rust, made with Linkerd in mind
Cloudflare blog on tproxy
mmproxy: preserving src IP