use iptables for proxying instead of userspace #3760
Comments
Cool! I think we should definitely get this merged in. On a separate note, I was seeing the proxy eat ~30% of a core under heavy load; I have to believe that iptables will give us better performance than that. |
We have to prioritize this - it's almost a total rewrite of kube-proxy and
On Mon, Jan 26, 2015 at 11:06 AM, Brendan Burns notifications@github.com
|
Maybe implementing it as a parallel option and slowly migrating makes sense?
On Mon, Jan 26, 2015 at 12:01 PM, Tim Hockin notifications@github.com
|
I'm trying to coax someone else who doesn't know this code well to learn it
That said, you also sent (good) email about the massive P1 list - and I
On Mon, Jan 26, 2015 at 1:06 PM, Brendan Burns notifications@github.com
|
Is this a P2? Might it be worth making it a P3 for now? |
I'm hoping to make it work, but we may yet demote it
On Wed, Feb 11, 2015 at 2:49 PM, Satnam Singh notifications@github.com
|
Doesn't "hope" equate to a P3 that we'll get to if we can? |
From discussion with @thockin: This is a requirement in order to support service port ranges, which aren't required for 1.0, but we would like to support eventually. |
@thockin "This has the downside of not being compatible with older iptables and kernels." How 'new' would the kernel have to be? |
Not TOO new, but we have some users who REALLY want iptables from 2012 to
On Mon, Feb 23, 2015 at 2:44 PM, Sidharta Seethana <notifications@github.com
|
@thockin thanks. We are using/testing with RHEL/CentOS 6, for example - so it would be nice if we don't have a hard dependency on recent 3.x kernels. |
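As a rough way to check what a given node supports (the probe chain name below is arbitrary), one can look at the versions and try the matches this approach depends on in a throwaway chain:

```
# kernel and iptables versions; the statistic/recent matches and --wait
# need reasonably recent versions of both
uname -r
iptables --version

# does this iptables understand --wait?
iptables --wait -t nat -L -n >/dev/null 2>&1 && echo "--wait supported"

# probe for the statistic and recent matches, then clean up
iptables -t nat -N PROBE_TEST
iptables -t nat -A PROBE_TEST -m statistic --mode random --probability 0.5 -j RETURN
iptables -t nat -A PROBE_TEST -m recent --name probe --rcheck -j RETURN
iptables -t nat -F PROBE_TEST
iptables -t nat -X PROBE_TEST
```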
@pweil- we were discussing this the other day
|
Well, you do need Docker to run, and at some point we have to cut it off.
On Mon, Feb 23, 2015 at 8:40 PM, Sidharta Seethana <notifications@github.com
|
With @thockin 's help, we tried the same with udp. We created a GCE Kubernetes cluster with 3 sky-dns replication controllers.

iptables -t nat -N TESTSVC
iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC_A -m recent --name hostA --set -j DNAT -p udp --to-destination 10.244.0.5:53

kubernetes-master> nslookup kubernetes.default.kuberenetes.local 10.0.0.10

We get a response back! |
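For reference, a sketch of the full pattern for three backends (the pod IPs other than 10.244.0.5 are hypothetical): each backend gets its own sub-chain, the recent matches give ~1s of per-client affinity, and the statistic probabilities (1/3, then 1/2, then everything left) give a roughly even split:

```
iptables -t nat -N TESTSVC
iptables -t nat -N TESTSVC_A
iptables -t nat -N TESTSVC_B
iptables -t nat -N TESTSVC_C

# clients seen within the last second keep their backend
iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m recent --name hostB --rcheck --seconds 1 --reap -j TESTSVC_B
iptables -t nat -A TESTSVC -m recent --name hostC --rcheck --seconds 1 --reap -j TESTSVC_C

# otherwise pick a backend with roughly equal probability
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.5 -j TESTSVC_B
iptables -t nat -A TESTSVC -j TESTSVC_C

# each sub-chain remembers the client and DNATs to its pod
iptables -t nat -A TESTSVC_A -m recent --name hostA --set -p udp -j DNAT --to-destination 10.244.0.5:53
iptables -t nat -A TESTSVC_B -m recent --name hostB --set -p udp -j DNAT --to-destination 10.244.1.6:53
iptables -t nat -A TESTSVC_C -m recent --name hostC --set -p udp -j DNAT --to-destination 10.244.2.7:53

# send traffic for the (hypothetical) service IP into the chain
iptables -t nat -A PREROUTING -p udp -d 10.0.0.10 --dport 53 -j TESTSVC
iptables -t nat -A OUTPUT -p udp -d 10.0.0.10 --dport 53 -j TESTSVC
```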
Great stuff! Just FYI (confirming from our face-to-face conversation), it's not safe to run multiple concurrent iptables commands in general (different chains sounds like it might be OK). iptables is a wrapper around libiptc; see the comment on iptc_commit: http://www.tldp.org/HOWTO/Querying-libiptc-HOWTO/mfunction.html
This was apparently fixed in 2013, but maybe only if you pass --wait (?): http://git.netfilter.org/iptables/commit/?id=93587a04d0f2511e108bbc4d87a8b9d28a5c5dd8
The root cause is that iptables effectively does an iptables-save / iptables-restore (at least per chain); I've therefore seen a lot of code which just calls iptables-save & iptables-restore rather than doing things through adds and deletes. I may even have some code to do that which I could dig up, if that would be helpful. |
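A sketch of the save/restore style mentioned above (the file path and which chains are "ours" are assumptions): since iptables-restore applies its input in a single commit, the update is atomic and avoids racing concurrent per-rule edits.

```
# dump the current nat table
iptables-save -t nat > /tmp/nat.rules

# ... regenerate only the chains we own inside /tmp/nat.rules, leaving everything else untouched ...

# re-apply in one commit; --noflush keeps chains/rules the file doesn't mention
iptables-restore --noflush < /tmp/nat.rules
```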
It boggles my mind that there's no way to do CAS or LL/SC sorts of ops.
We should add support for --wait, though it is recent enough that GCE's
Maybe we should do our own locking inside our code to at least prevent us
On Thu, Feb 26, 2015 at 1:56 PM, Justin Santa Barbara <
|
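As a sketch of the "do our own locking" idea (the lock file path and the TESTSVC chain are just examples), every writer can take the same file lock around its iptables calls, and pass --wait where the installed iptables supports it:

```
# serialize our own iptables invocations via a shared lock file; --wait also
# makes iptables wait for its own exclusive lock instead of failing
flock /run/kube-iptables.lock \
  iptables --wait -t nat -A TESTSVC -m statistic --mode random --probability 0.5 -j TESTSVC_B
```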
What happens in the case of failures in the middle of creating a bunch of rules? |
Fair question - we should probably think really hard about what it means to
On Thu, Feb 26, 2015 at 8:47 PM, Brian Grant notifications@github.com
|
@thockin From irc today: The
|
It dawns on me (via Abhishek) that even if this works, we STILL have to
On Mon, Jan 18, 2016 at 9:50 PM, Tim Hockin thockin@google.com wrote:
|
That's unfortunate :-( I'm not sure why, by the way. I'll try something with MPLS then; I want to learn it anyway. |
If you have 2 backends for a Service and you want to send more than a single
On Wed, Jan 20, 2016 at 12:24 PM, Mikaël Cluseau notifications@github.com
|
I kind of assumed that for UDP workloads, yes. Going stateless could also be made optional, even for UDP. @qoke any comment on this? |
Also, we could use things like client IP hashing to make the flow more stable while still balanced (I don't know if we can call that "some kind of tracking" :-)). |
@MikaelCluseau we use the default IPVS behaviour, which does some very light-weight UDP "stickyness"...
-- Quoted from http://kb.linuxvirtualserver.org/wiki/IPVS
Of course, this only works if you have many clients talking to a single service, or a single client with varying source ports. If you have a single high-volume client, all sending traffic from the same source port, and you want to load balance this over multiple backends, then you may prefer to use a stateless/spray-and-pray approach.
We load balance a lot of DNS and RADIUS traffic - DNS typically falls into the first category (lots of clients, or clients with lots of source ports), and RADIUS typically falls into the latter category (few clients, lots of packets all from the same IP/port). Rather than using a stateless hash for RADIUS, we instead decided to randomize source ports to get an even spread. |
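For context, a sketch of the IPVS side being described (VIP, port and real-server addresses are hypothetical); the light-weight UDP stickiness comes from IPVS's UDP connection timeout, which can be tuned with ipvsadm --set (the three values are the TCP, TCP-FIN and UDP timeouts in seconds):

```
# round-robin UDP virtual service with two masqueraded real servers
ipvsadm -A -u 10.0.0.20:1812 -s rr
ipvsadm -a -u 10.0.0.20:1812 -r 10.244.0.5:1812 -m
ipvsadm -a -u 10.0.0.20:1812 -r 10.244.1.6:1812 -m

# shorten the UDP "connection" timeout so a quiet client gets rebalanced sooner
ipvsadm --set 900 120 10
```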
After reading the whole thread I still can't figure out whether activating the iptables mode for kube-proxy should fix the problem of external IPs being hidden (#10921) or not. We did enable the iptables mode with v1.1 as suggested here, but we're still seeing the IPs from the cluster, not the real ones from the users. Our cluster is in GCE and we just need a load balancer with HTTPS support before we go live. As GCE doesn't support the v1.2 alpha we cannot use the new Ingress (which AFAIK supports HTTPS load balancers), so the Network Load Balancer is our only option. But obviously we cannot go live without the ability to log the real IPs of our users. Some clarification for new users on this would be appreciated. Supporting HTTPS is mandatory for many of us. Thanks! |
I have been using the iptables proxy on and off for quite some time and can confirm that the external IPs of clients are still hidden (you see cluster IPs instead). We've gotten around this so far by running our frontend HTTP/HTTPS proxy in host network mode so that it sees the source IP address. |
@maclof thanks for the feedback. Could you share more info about your workaround? What do you mean by your HTTP/HTTPS proxy running in host network mode? |
@javiercr we use a pod spec something like this: http://pastie.org/private/zpdelblsob654zif7xus5g Using host network means that the pod runs in the host machine's network, instead of being assigned a cluster IP. That means when our nginx server binds to port 80/443 it will listen on a host IP and will see the source IP addresses. |
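A minimal sketch of that kind of spec (names and image are hypothetical); hostNetwork: true is the field doing the work, putting the pod in the node's network namespace so nginx binds on the host and sees real source IPs:

```
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: frontend-nginx
spec:
  hostNetwork: true   # use the node's network namespace instead of a cluster IP
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    - containerPort: 443
EOF
```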
I'm using kubernetes 1.1,
|
I've learned a lot reading this thread! As an FYI, this doc states that AWS ELB uses round-robin for TCP connections and least connections for http/https: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/how-elb-works.html#request-routing I agree that focusing on getting requests only to nodes that run pods, and trying to serve local pods, is the best way to go about it. A nice side benefit is that there'll be less node-to-node traffic within the cluster, and I suppose a latency improvement from always serving requests from service to pod locally (which I guess is of even more benefit if you have nodes in multiple availability zones in the same cluster). In terms of working with a load balancer that doesn't support weighting, you could solve this with your replication controller by trying to always keep the same number of pods on a node (if there's more than 1 per node) and then distributing evenly between them, even if this means having to move pods off of a node in certain situations and allowing only certain replica counts. e.g. for a 4 node cluster with a service connected to a load balancer, the only acceptable pod replica counts would be 1,2,3,4,6,8,9,12,16,20, etc |
We're also looking to solve for routing traffic to local pods only. I'd be fine with the nodeport going away on a node at times when no pods are present locally for a service. This way a simple load balancer TCP health check would prevent requests from going to those nodes. I think if we can at least solve the iptables/kube-proxy portion of this, then we'll find out what the load balancing implications are when the pods are not balanced across the cluster. I think there are ways to solve that on load balancers without having to set a weight for each node w/ an API call. Load balancers already deal w/ this using other dynamic methods. Also, depending on what the service you're running is actually doing inside that container for each API call, it may not be able to support 2x the traffic when there are 2 pods on a node vs. one anyway. Whether Kubernetes limits are set, and whether maximum levels of usage are being approached on a pod/node, could play into this as well, which adds yet another layer of complexity to trying to find the right weight setting on the external load balancer. I'd say stay away from that level of complexity and don't try to set load balancer weights from Kubernetes. |
@yoshiwaan Can I suggest opening a new issue for the inter-node traffic suggestion, as this issue is now closed. Personally I think a good first step would be to ensure that if a pod is running on the local node, that we route to the local pod. I suspect that this will be sufficient, because you can then scale your RC so that there are pods on every node. |
@justinsb +1, also we're running into a problem now where we need to see client IPs and it's basically impossible with the current setup. |
This could be way too naive, but I was wondering what the difference is between userspace mode and iptables mode? I can't really tell from the user doc. |
Userland mode means kube-proxy handles the connections itself, by receiving the connection request from the client and opening a socket to the server, which (1) consumes much more CPU and memory and (2) is limited to the number of ports a single process can open (<65k). The iptables mode works at a lower level, in the kernel, and uses connection tracking instead, so it's much lighter and handles a lot more connections*.
(edit) (*) As long as you don't SNAT packets going through, which in turn requires a setup where you are sure packets will cross the connection tracking rules associated with them. For instance, using a routed access design allows you to avoid SNAT, which means the endpoint of the service will see the real client's IP. |
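To make the difference concrete, a rough sketch of the kind of rule each mode leans on (chain placement, ports and addresses are illustrative, not the exact rules kube-proxy writes):

```
# userspace mode: nat rules only redirect the service IP to a local port,
# where the kube-proxy process accepts the connection and dials a pod itself
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.10 --dport 80 -j REDIRECT --to-ports 43291

# iptables mode: nat rules DNAT straight to a pod; no userspace hop,
# conntrack handles the reverse translation on the reply path
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.10 --dport 80 -j DNAT --to-destination 10.244.1.6:8080
```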
@MikaelCluseau |
On 04/19/2016 10:51 PM, Emma He wrote:
Yes. |
Sorry but I absolutely missed this earlier.
@MikaelCluseau I was thinking the iptables mode uses SNAT and DNAT, but according to you that is not the case. Could you please clarify this for me? |
On 04/20/2016 01:59 PM, Emma He wrote:
It's the tricky part: (1) using service/external IPs requires DNAT; (2) the reply packets have to come back through the connection-tracking entry that performed the DNAT, otherwise the translation can't be reversed. The condition of (2) is usually OK in routed access network designs. For instance, given
Then,
If the condition of (2) cannot be met, you have to use SNAT/MASQUERADING |
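A sketch of the routed setup meant here, with hypothetical addresses: the node DNATs the service IP to a local pod, and because the upstream router sends the service range through that same node, replies traverse the same conntrack entry and are un-NATed, so no SNAT is needed and the pod sees the real client IP.

```
# on the node hosting the pod: DNAT the (hypothetical) service IP to the local pod
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.10 --dport 80 -j DNAT --to-destination 10.244.1.6:8080

# on the upstream router: route the service range via that node (hypothetical node IP),
# so both directions of the flow cross the node's conntrack table
ip route add 10.0.0.0/24 via 192.168.1.20

# if replies cannot be forced back through the same node, fall back to
# masquerading traffic towards the pods - and lose the client IP
iptables -t nat -A POSTROUTING -d 10.244.0.0/16 -j MASQUERADE
```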
@MikaelCluseau - drop me an email at my github name at google.com - I have
On Tue, Apr 19, 2016 at 8:20 PM, Mikaël Cluseau notifications@github.com
|
@justinsb @yoshiwaan did anyone ever create an issue for this? My search fu is failing me, and I have a similar need.
|
I didn't raise it myself |
Ahhhhh, I think I found it; this appears to be the feature/fix: kubernetes/enhancements#27 It seems to be beta as of 1.5.x. |
I was playing with iptables yesterday, and I prototyped (well, copied from Google hits and mutated) a set of iptables rules that essentially do all the proxying for us without help from userspace. It's not urgent, but I want to file my notes before I lose them.
This has the additional nice side-effect (as far as I can tell) of preserving the source IP and being a large net simplification. Now kube-proxy would just need to sync Services -> iptables. This has the downside of not being compatible with older iptables and kernels. We had a problem with this before - at some point we need to decide just how far back in time we care to support.
This can probably be optimized further, but in basic testing, I see sticky sessions working and if I comment that part out I see ~equal probability of hitting each backend. I was not able to get deterministic round-robin working properly (with --nth instead of --probability) but we could come back to that if we want.
This sets up a service portal with the backends listed below
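A sketch of what such a rule set can look like for a service portal on a hypothetical 10.0.0.10:80 with two backends; the -m recent rules provide the sticky sessions mentioned above and the -m statistic rule the initial split:

```
iptables -t nat -N KUBE-PORTAL-TEST
iptables -t nat -N KUBE-PORTAL-TEST-EP1
iptables -t nat -N KUBE-PORTAL-TEST-EP2

# sticky: a client that recently used an endpoint keeps using it
iptables -t nat -A KUBE-PORTAL-TEST -m recent --name ep1 --rcheck --seconds 180 --reap -j KUBE-PORTAL-TEST-EP1
iptables -t nat -A KUBE-PORTAL-TEST -m recent --name ep2 --rcheck --seconds 180 --reap -j KUBE-PORTAL-TEST-EP2

# otherwise split ~50/50 between the two endpoints
iptables -t nat -A KUBE-PORTAL-TEST -m statistic --mode random --probability 0.5 -j KUBE-PORTAL-TEST-EP1
iptables -t nat -A KUBE-PORTAL-TEST -j KUBE-PORTAL-TEST-EP2

# endpoints: remember the client, then DNAT to the (hypothetical) pod
iptables -t nat -A KUBE-PORTAL-TEST-EP1 -m recent --name ep1 --set -p tcp -j DNAT --to-destination 10.244.1.6:8080
iptables -t nat -A KUBE-PORTAL-TEST-EP2 -m recent --name ep2 --set -p tcp -j DNAT --to-destination 10.244.2.7:8080

# hook the portal IP/port for both locally generated and forwarded traffic
iptables -t nat -A OUTPUT -p tcp -d 10.0.0.10 --dport 80 -j KUBE-PORTAL-TEST
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.10 --dport 80 -j KUBE-PORTAL-TEST
```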