use iptables for proxying instead of userspace #3760

Closed
thockin opened this issue Jan 23, 2015 · 187 comments

@thockin
Member

commented Jan 23, 2015

I was playing with iptables yesterday, and I prototyped (well, copied from Google hits and mutated) a set of iptables rules that essentially do all the proxying for us without help from userspace. It's not urgent, but I want to file my notes before I lose them.

This has the additional nice side-effect (as far as I can tell) of preserving the source IP and being a large net simplification. Now kube-proxy would just need to sync Services -> iptables. This has the downside of not being compatible with older iptables and kernels. We had a problem with this before - at some point we need to decide just how far back in time we care about.

This can probably be optimized further, but in basic testing, I see sticky sessions working and if I comment that part out I see ~equal probability of hitting each backend. I was not able to get deterministic round-robin working properly (with --nth instead of --probability) but we could come back to that if we want.

This sets up a service portal with the backends listed below

iptables -t nat -N TESTSVC
iptables -t nat -F TESTSVC
iptables -t nat -N TESTSVC_A
iptables -t nat -F TESTSVC_A
iptables -t nat -N TESTSVC_B
iptables -t nat -F TESTSVC_B
iptables -t nat -N TESTSVC_C
iptables -t nat -F TESTSVC_C
iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m recent --name hostB --rcheck --seconds 1 --reap -j TESTSVC_B
iptables -t nat -A TESTSVC -m recent --name hostC --rcheck --seconds 1 --reap -j TESTSVC_C
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.500 -j TESTSVC_B
iptables -t nat -A TESTSVC -m statistic --mode random --probability 1.000 -j TESTSVC_C

iptables -t nat -A TESTSVC_A -m recent --name hostA --set -j DNAT -p tcp --to-destination 10.244.4.6:9376
iptables -t nat -A TESTSVC_B -m recent --name hostB --set -j DNAT -p tcp --to-destination 10.244.1.15:9376
iptables -t nat -A TESTSVC_C -m recent --name hostC --set -j DNAT -p tcp --to-destination 10.244.4.7:9376

iptables -t nat -F KUBE-PORTALS-HOST
iptables -t nat -A KUBE-PORTALS-HOST -d 10.0.0.93/32 -m state --state NEW -p tcp -m tcp --dport 80 -j TESTSVC
iptables -t nat -F KUBE-PORTALS-CONTAINER
iptables -t nat -A KUBE-PORTALS-CONTAINER -d 10.0.0.93/32 -m state --state NEW -p tcp -m tcp --dport 80 -j TESTSVC
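
For reference, a minimal sketch of how the cascading `--probability` values above (0.333, 0.500, 1.000) generalize to N backends. Each rule only sees traffic that earlier rules did not match, so rule i (counting from 0) needs probability 1/(N-i) for every backend to end up with an equal 1/N share. The helper name `cascade_probabilities` is hypothetical and not part of kube-proxy.

```shell
#!/bin/sh
# Hypothetical helper (not from kube-proxy): print one --probability value
# per backend so that rules evaluated in order pick each backend with
# equal overall probability 1/N.
# Rule i (0-based) is only reached when rules 0..i-1 did not match, so it
# must match 1/(N-i) of the traffic that is left.
cascade_probabilities() {
    n="$1"
    i=0
    while [ "$i" -lt "$n" ]; do
        awk -v n="$n" -v i="$i" 'BEGIN { printf "%.3f\n", 1 / (n - i) }'
        i=$((i + 1))
    done
}

cascade_probabilities 3
```

With 3 backends this prints 0.333, 0.500 and 1.000, the same values used in the TESTSVC rules: 1/3 of traffic goes to A, half of the remaining 2/3 goes to B, and everything left goes to C.
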
@brendandburns

Contributor

commented Jan 26, 2015

Cool! I think we should definitely get this merged in. On a separate note, I was seeing the proxy eat ~30% of a core under heavy load, I have to believe that iptables will give us better performance than that.

@thockin

Member Author

commented Jan 26, 2015

We have to prioritize this - it's almost a total rewrite of kube-proxy and
all the tests thereof. It also has back-compat problems (will not work on
older kernels or older iptables binaries).


@brendandburns

Contributor

commented Jan 26, 2015

Maybe implementing it as a parallel option and slowly migrating makes sense?


@thockin

Member Author

commented Jan 27, 2015

I'm trying to coax someone else who doesn't know this code well to learn it
and take it on. I really want to tackle it, but it would be better if
someone else learned this space (not you! :)

That said, you also sent (good) email about the massive P1 list - and I
don't think this is on that list yet.


@satnam6502

Contributor

commented Feb 11, 2015

Is this a P2? Might it be worth making it a P3 for now?

@thockin

Member Author

commented Feb 11, 2015

I'm hoping to make it work, but we may yet demote it


@roberthbailey

Member

commented Feb 12, 2015

Doesn't "hope" equate to a P3 that we'll get to if we can?

@bgrant0607

Member

commented Feb 23, 2015

From discussion with @thockin: This is a requirement in order to support service port ranges, which aren't required for 1.0, but we would like to support eventually.

@sidharta-s

Contributor

commented Feb 23, 2015

@thockin "This has the downside of not being compatible with older iptables and kernels." How 'new' would the kernel have to be?

@thockin

Member Author

commented Feb 24, 2015

Not TOO new, but we have some users who REALLY want iptables from 2012 to
work.


@sidharta-s

Contributor

commented Feb 24, 2015

@thockin thanks. We are using/testing with RHEL/CentOS 6, for example - so it would be nice if we don't have a hard dependency on recent 3.x kernels.

@pmorie

Member

commented Feb 24, 2015

@pweil- we were discussing this the other day

@thockin

Member Author

commented Feb 24, 2015

Well, you do need Docker to run, and at some point we have to cut it off.
The back-rev iptables support will not stop me from (eventually) making
this change, and it's going to sting for some people.


@ArtfulCoder

Contributor

commented Feb 26, 2015

With @thockin 's help, we tried the same with udp.

We created a GCE Kubernetes cluster with 3 sky-dns replication controllers.
On the kubernetes-master, we set the following in iptables:
The dns service ip was 10.0.0.10, and the pod endpoints running dns were 10.244.0.5:53, 10.244.3.6:53, 10.244.0.6:53

iptables -t nat -N TESTSVC
iptables -t nat -F TESTSVC
iptables -t nat -N TESTSVC_A
iptables -t nat -F TESTSVC_A
iptables -t nat -N TESTSVC_B
iptables -t nat -F TESTSVC_B
iptables -t nat -N TESTSVC_C
iptables -t nat -F TESTSVC_C
iptables -t nat -N KUBE-PORTALS-HOST
iptables -t nat -F KUBE-PORTALS-HOST

iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m recent --name hostB --rcheck --seconds 1 --reap -j TESTSVC_B
iptables -t nat -A TESTSVC -m recent --name hostC --rcheck --seconds 1 --reap -j TESTSVC_C

iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.5 -j TESTSVC_B
iptables -t nat -A TESTSVC -m statistic --mode random --probability 1.000 -j TESTSVC_C

iptables -t nat -A TESTSVC_A -m recent --name hostA --set -j DNAT -p udp --to-destination 10.244.0.5:53
iptables -t nat -A TESTSVC_B -m recent --name hostB --set -j DNAT -p udp --to-destination 10.244.3.6:53
iptables -t nat -A TESTSVC_C -m recent --name hostC --set -j DNAT -p udp --to-destination 10.244.0.6:53
iptables -t nat -A KUBE-PORTALS-HOST -d 10.0.0.10/32 -p udp -m udp --dport 53 -j TESTSVC
iptables -t nat -A OUTPUT -j KUBE-PORTALS-HOST


kubernetes-master>nslookup kubernetes.default.kuberenetes.local 10.0.0.10

We get a response back!

@justinsb

Member

commented Feb 26, 2015

Great stuff! Just FYI (confirming from our face-to-face conversation), it's not safe to run multiple concurrent iptables commands in general (different chains sounds like it might be OK). iptables is a wrapper around libiptc, and see the comment on iptc_commit: http://www.tldp.org/HOWTO/Querying-libiptc-HOWTO/mfunction.html

This was apparently fixed in 2013, but maybe only if you pass --wait (?): http://git.netfilter.org/iptables/commit/?id=93587a04d0f2511e108bbc4d87a8b9d28a5c5dd8

The root cause of this is that iptables effectively calls iptables-save / iptables-restore (at least per chain); I've seen a lot of code which just therefore calls iptables-save & iptables-restore rather than doing things through adds and deletes. I may even have some code to do that I could dig up if that is helpful.
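
One way to get the iptables-restore behaviour described above is to build the whole rule set as text and hand it over in one call, so that a failure part-way through should leave the old rules in place rather than a half-applied set. A rough sketch follows. The function name `service_payload` is made up for illustration, the chain names and addresses are the test values from earlier in this thread, and it assumes an iptables-restore that supports `--noflush`. To my understanding, declaring a chain with a `:NAME - [0:0]` line under `--noflush` flushes only that chain.

```shell
#!/bin/sh
# Hypothetical sketch: emit an iptables-restore payload for one service so
# that all of its rules are committed together instead of rule by rule.
# Arguments: chain name, service IP, service port, endpoint (ip:port).
service_payload() {
    chain="$1"
    svc_ip="$2"
    svc_port="$3"
    endpoint="$4"
    printf '%s\n' \
        '*nat' \
        ":${chain} - [0:0]" \
        ':KUBE-PORTALS-HOST - [0:0]' \
        "-A ${chain} -p tcp -j DNAT --to-destination ${endpoint}" \
        "-A KUBE-PORTALS-HOST -d ${svc_ip}/32 -p tcp -m tcp --dport ${svc_port} -j ${chain}" \
        'COMMIT'
}

# Only print the payload here. Applying it needs root, for example:
#   service_payload TESTSVC 10.0.0.93 80 10.244.4.6:9376 | iptables-restore --noflush
service_payload TESTSVC 10.0.0.93 80 10.244.4.6:9376
```

Because the payload is plain text, it can be generated, diffed, and checked before anything touches the kernel, which also makes the "what if we fail in the middle" case easier to reason about.
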

@thockin

Member Author

commented Feb 26, 2015

It boggles my mind that there's no way to do CAS or LL/SC sorts of ops.

We should add support for --wait, though it is recent enough that GCE's
debian-backports doesn't have it.

Maybe we should do our own locking inside our code to at least prevent us
from stepping on ourselves.


@bgrant0607

Member

commented Feb 27, 2015

What happens in the case of failures in the middle of creating a bunch of rules?

@thockin

Member Author

commented Feb 27, 2015

Fair question - we should probably think really hard about what it means to
encounter an error in the middle of this


@larsks


commented Feb 27, 2015

@thockin From irc today:

The net.ipv4.conf.all.route_localnet permits 127.0.0.1 to be the target of DNAT rules. From the docs:

route_localnet - BOOLEAN

Do not consider loopback addresses as martian source or destination
while routing. This enables the use of 127/8 for local routing purposes.
default FALSE

@mcluseau

Contributor

commented Jan 18, 2016

I'm trying on my side to get something (with pure netns on my local host
for now).

I'm trying an approach where I assign service IP ranges to hosts to
reduce the number of routing entries:

cli -- elb -- h1 -- c1
        |      `--- c2
        `--- h2 -- c2

h1_ep_ip_ranges=( 10.1.1.0/24 10.1.2.0/24 )
h2_ep_ip_ranges=( 10.1.3.0/24 )

No ping ATM (packets not going through the PREROUTING chain...), and
need to sleep. More on this tomorrow ;)

On 01/18/2016 06:28 PM, Tim Hockin wrote:

I can't make this work on GCE, and I am not sure about AWS - there is a
limited number of static routes available.

I wonder if I could do it by piggybacking 2 IP ranges together in a single
route. It's a lot of IPs to spend, but I guess it only matters for UDP.
I'll have to try it out.

Edit: I tried it out and couldn't make it work, but I'm missing something.
We'll have to add/remove IPs in containers in response to services coming
and going, but I could not make "extra" IPs in a container work (it could
ping but not TCP or UDP, not sure why).

I'll have to try again sometime.

@thockin

Member Author

commented Jan 19, 2016

I got a bit farther, but something I should have predicted happened.

I set up a pod with 10.244.2.8/25 as its main interface and 10.244.2.250/25
as its "in-a-service" interface. I was hoping that I could send UDP to
.250 and detect responses, to SNAT them. But of course, if the client is
not in the same /25 (which it can not be) the default route kicks in, which
comes from the .8 address. tcpdump confirms that responses come from .8
when using UDP.

I am again at a place where I am not sure how to make it work. Will think
more on it.


@thockin

Member Author

commented Jan 20, 2016

It dawns on me (via Abhishek) that even if this works, we STILL have to
track flows somewhere, so it's not stateless in the end anyway.


@mcluseau

Contributor

commented Jan 20, 2016

That's unfortunate :-( not sure why by the way. I'll try something with MPLS then, I want to learn it anyway.

@thockin

Member Author

commented Jan 20, 2016

If you have 2 backends for Service and you want to send more than a single
packet, you need to track flows in some way, don't you? Or are you
assuming it is safe to spray packets at different backends?


@mcluseau

Contributor

commented Jan 20, 2016

I kind of assumed that for UDP workloads, yes. It also can be optional to go stateless even for UDP. @qoke any comment on this?

@mcluseau

Contributor

commented Jan 20, 2016

Also, we could use things like client IP hashing to make the flow more stable while still balanced (I don't know if we can call that "some kind of tracking" :-)).
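
A minimal sketch of the client IP hashing idea, assuming a plain CRC (via `cksum`) is an acceptable hash. `pick_backend` is a hypothetical helper, and the backend addresses are the DNS endpoints used earlier in this thread. No per-flow state is kept: the same client IP always maps to the same backend as long as the backend list does not change.

```shell
#!/bin/sh
# Hypothetical sketch: choose a backend from the client IP alone.
# Usage: pick_backend CLIENT_IP BACKEND [BACKEND...]
pick_backend() {
    client_ip="$1"
    shift
    n="$#"
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$client_ip" | cksum | cut -d ' ' -f 1)
    # Drop (crc mod n) backends from the front; the next one is the pick.
    shift $((crc % n))
    printf '%s\n' "$1"
}

pick_backend 1.0.1.1 10.244.0.5:53 10.244.3.6:53 10.244.0.6:53
```

The trade-off is that adding or removing a backend changes `n` and remaps most clients, so this is stable rather than sticky.
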

@qoke


commented Feb 1, 2016

@MikaelCluseau we use the default IPVS behaviour, which does some very light-weight UDP "stickiness"...

For scheduling UDP datagrams, IPVS load balancer records UDP datagram scheduling with configurable timeout, and the default UDP timeout is 300 seconds. Before UDP connection timeouts, all UDP datagrams from the same socket (protocol, ip address and port) will be directed to the same server.

-- Quoted from http://kb.linuxvirtualserver.org/wiki/IPVS

Of course, this only works if you have many clients talking to a single service, or a single client with varying source ports. If you have a single high-volume client, all sending traffic from the same source port, and you want to load balance this over multiple backends, then you may prefer to use a stateless/spray-and-pray approach.

We load balance a lot of DNS and RADIUS traffic - DNS typically falls into the first category (lots of clients, or clients with lots of source ports), and RADIUS typically falls into the latter category (few clients, lots of packets all from the same IP/port). Rather than using a stateless hash for RADIUS we instead decided to randomize source ports to get an even spread.

@javiercr


commented Feb 18, 2016

After reading the whole thread I still can't figure out whether activating the iptables mode for kube-proxy should fix the problem of external IPs being hidden (#10921) or not. We did enable the iptables mode with v1.1 as suggested here but we're still seeing the IPs from the cluster, not the real ones from the users.

Our cluster is in GCE and we just need a load balancer with HTTPS support before we go live. As GCE doesn't support v.1.2 alpha we cannot use the new Ingress (which AFAIK supports HTTPS Load Balancers), so the Network Load Balancer is our only option. But obviously we cannot go live without the ability of logging real ips from our users.

Some clarification for new users on this would be appreciated. Supporting HTTPS is mandatory for many of us. Thanks!

@maclof

Contributor

commented Feb 19, 2016

I have been using the iptables proxy on and off for quite some time and can confirm that the external IPs of clients are still hidden/show cluster IPs.

We've gotten around this so far by running our frontend HTTP/HTTPS proxy in host network mode so that it sees the source IP address.

@javiercr


commented Feb 19, 2016

@maclof thanks for the feedback. Could you share more info about your workaround? What do you mean by your HTTP/HTTPS proxy running in host network mode?

@maclof

Contributor

commented Feb 19, 2016

@javiercr we use a pod spec something like this: http://pastie.org/private/zpdelblsob654zif7xus5g

Using host network means that the pod runs in the host machine's network, instead of being assigned a cluster IP.

That means when our nginx server binds to port 80/443 it will listen on a host IP and will see the source IP addresses.

@mcluseau

Contributor

commented Feb 21, 2016

I'm using Kubernetes 1.1 with /opt/bin/kube-proxy ... --proxy-mode=iptables --masquerade-all=false, and routing the cluster IP network through a host running kube-proxy. In this setup, my services see the external IP. I use a highly available network namespace which has an external IP and a route to the hosts:

I0221 01:20:32.695440       1 main.go:224] <A6GSXEKN> Connection from 202.22.xxx.yyy:51954 closed.
@yoshiwaan


commented Mar 30, 2016

I've learned a lot reading this thread!

As an FYI this doc states that AWS ELB uses round-robin for TCP connections and least connections for http/https: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/how-elb-works.html#request-routing

I agree that focussing on getting requests only to nodes that run pods and trying to serve local pods is the best way to go about it. The nice side benefit of this is there'll be less node-to-node traffic within the cluster and I suppose a latency improvement by always serving local requests from service to pod (which I guess is of even more benefit if you have nodes in multiple availability zones in the same cluster).

In terms of working with a load balancer that doesn't support weighting then you could solve this with your replication controller by trying to always keep the same number of pods on a node (if there's more than 1 per node) and then distributing evenly between them, even if this means having to move pods off of a node in certain situations and allowing only certain replica counts. e.g. for a 4 node cluster with a service connected to a load balancer the only number of acceptable pod replicas would be 1,2,3,4,6,8,9,12,16,20, etc

@emaildanwilson

Contributor

commented Mar 31, 2016

We're also looking to solve for traffic to route to local pods only. I'd be fine with the nodeport going away on a node at times when no pods are present locally for a service. This way a simple load balancer TCP health check would prevent requests from going to those nodes. I think if we can at least solve for the iptables\kube-proxy portion of this, then we'll find out what the load balancing implications are when the pods are not balanced across the cluster. I think there are ways to solve that on load balancers without having to set a weight for each node w/ an API call.

Load balancers already deal w/ this using other dynamic methods. Also, depending on what the service you're running is actually doing inside that container for each API call, it may not be able to support 2x the traffic when there are 2 pods on a node vs. one anyway. Whether Kubernetes limits are set, and whether a pod\node is approaching its maximum usage, could play into this as well, which adds yet another layer of complexity to trying to find the right weight setting on the external load balancer.

I'd say stay away from that level of complexity and don't try to set load balancer weights from Kubernetes.

@justinsb

Member

commented Apr 1, 2016

@yoshiwaan Can I suggest opening a new issue for the inter-node traffic suggestion, as this issue is now closed. Personally I think a good first step would be to ensure that if a pod is running on the local node, that we route to the local pod. I suspect that this will be sufficient, because you can then scale your RC so that there are pods on every node.

@paralin

Contributor

commented Apr 14, 2016

@justinsb +1, also we're running into a problem now where we need to see client IPs and it's basically impossible with the current setup.

@dalanlan

Contributor

commented Apr 19, 2016

This could be way too naive, yet I was wondering what's the difference between userspace mode and iptables? I cannot really tell from the user doc.

@mcluseau

Contributor

commented Apr 19, 2016

Userland mode means kube-proxy handles the connections itself by receiving the connection request from the client and opening a socket to the server, which (1) consumes much more CPU and memory and (2) is limited to the number of ports a single host can open (<65k). The iptables mode works at a lower level, in the kernel, and uses connection tracking instead, so it's much lighter and handles a lot more connections*.

(edit) (*) As long as you don't SNAT packets going through, which in turn requires a setup where you are sure packets will cross the connection tracking rules associated with them. For instance, using a routed access design allows you to avoid SNAT, which means the endpoint of the service will see the real client's IP.

@dalanlan

Contributor

commented Apr 19, 2016

@MikaelCluseau
meaning kube-proxy is only responsible for setting up and maintaining iptables rules and we no longer get a random local port for each service in iptables mode, right?

@mcluseau

Contributor

commented Apr 19, 2016

Yes.

@dalanlan

Contributor

commented Apr 20, 2016

Sorry but I absolutely missed this earlier.

(edit) (*) As long as you don't SNAT packets going through, which in turn requires a setup where you are sure packets will cross the connection tracking rules associated with them. For instance, using a routed access design allows you to avoid SNAT, which means the endpoint of the service will see the real client's IP.

@MikaelCluseau I was thinking iptables adopts SNAT and DNAT, which is not the case according to you. Could you please clarify this for me?

@mcluseau

Contributor

commented Apr 20, 2016

It's the tricky part.

(1) Using service/external IPs requires DNAT.
(2) If you are sure reply packets will go through the same conntrack
rule (ie, the same network stack or a replicated conntrack table), you
can skip the SNAT part (ie, MASQUERADE rules).

The condition of (2) is usually ok in routed access network designs
(which is the simplest design I can think of).

For instance, given

  • a client 1.0.1.1,
  • a service 1.0.2.1,
  • a pod implementing the service 1.0.3.1.

Then,

  1. Your router/firewall/loadbalancer/host/whatever receives a packet
    for the service so it sees a packet "1.0.1.1 -> 1.0.2.1";
  2. It DNATs it to the endpoint (pod) so the packet will be "1.0.1.1 ->
    1.0.3.1" in the cluster network;
  3. The pod replies with a packet "1.0.3.1 -> 1.0.1.1";
  4. The packet goes through a router/firewall/loadbalancer/host/whatever
    having the conntrack rule, the conntrack system rewrites the packet
    to "1.0.2.1 -> 1.0.1.1" before sending it back to the client.

If the condition of (2) cannot be met, you have to use SNAT/MASQUERADING
to be sure that the packet will go back through the
router/firewall/loadbalancer/host/whatever's conntrack.
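
The four steps above can be modelled with a toy script. This is not real netfilter, just a one-entry stand-in for the conntrack table, and the function names `dnat_request` and `unnat_reply` are invented for illustration. The addresses are the ones from the example.

```shell
#!/bin/sh
# Toy model (not real netfilter): follow one request and its reply through
# a node that DNATs service traffic and remembers the flow.
service_ip=1.0.2.1
pod_ip=1.0.3.1
ct_entry=""   # "client|service|pod" once a flow has been seen
packet=""     # last packet as "src -> dst"

# Request path (steps 1-2): rewrite the destination and record the flow.
# Usage: dnat_request SRC DST
dnat_request() {
    if [ "$2" = "$service_ip" ]; then
        ct_entry="$1|$2|$pod_ip"
        packet="$1 -> $pod_ip"
    else
        packet="$1 -> $2"
    fi
}

# Reply path (steps 3-4): if the reply matches the recorded flow, put the
# service IP back as the source. The client address is never rewritten.
# Usage: unnat_reply SRC DST
unnat_reply() {
    if [ "$ct_entry" = "$2|$service_ip|$1" ]; then
        packet="$service_ip -> $2"
    else
        packet="$1 -> $2"
    fi
}

dnat_request 1.0.1.1 1.0.2.1
echo "request after DNAT: $packet"
unnat_reply 1.0.3.1 1.0.1.1
echo "reply after reverse NAT: $packet"
```

If the reply reaches a node that has no matching entry (`ct_entry` empty), it leaves with source 1.0.3.1 and the client will not recognise it. That is the case where SNAT/MASQUERADE is needed to force the reply back through the node holding the entry.
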

@thockin

Member Author

commented Apr 20, 2016

@MikaelCluseau - drop me an email at my github name at google.com - I have
something for you


@mrkurt


commented Jan 5, 2017

@justinsb @yoshiwaan did anyone ever create an issue for this? My search fu is failing me, and I have a similar need.

Can I suggest opening a new issue for the inter-node traffic suggestion, as this issue is now closed. Personally I think a good first step would be to ensure that if a pod is running on the local node, that we route to the local pod. I suspect that this will be sufficient, because you can then scale your RC so that there are pods on every node.

@yoshiwaan


commented Jan 6, 2017

I didn't raise it myself

@mrkurt


commented Jan 6, 2017

Ahhhhh, I think I found it, this appears to be the feature/fix: kubernetes/enhancements#27

Seems to be beta as of 1.5.x.
