
Implement Iptables based Proxy #9210

Merged
merged 1 commit into kubernetes:master on Aug 12, 2015

Conversation

@BenTheElder (Member) commented Jun 3, 2015

This is an implementation of #3760.

It implements an iptables-restore wrapper, a dummy 'loadbalancer' that serves as a config.EndpointsConfigHandler, and a purely iptables-based version of Proxier, along with a method for determining whether to use the new ProxierIptables or the old Proxier based on the iptables version. kube-proxy now autoselects between the two and uses the new ProxierIptables + DummyLoadBalancerIptables pair when possible.
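As a rough illustration of the restore-based approach (a hedged sketch only, not this PR's actual code; the function and names here are hypothetical), the generated chains and rules can be concatenated into a single iptables-restore payload using a bytes.Buffer:

```go
package main

import (
	"bytes"
	"fmt"
)

// buildRestorePayload assembles a complete *nat table payload for
// iptables-restore: chain declarations first, then rules, then COMMIT,
// so the whole rule set can be swapped in atomically.
func buildRestorePayload(chains, rules []string) string {
	var b bytes.Buffer // bytes.Buffer avoids repeated string concatenation
	b.WriteString("*nat\n")
	for _, c := range chains {
		fmt.Fprintf(&b, ":%s - [0:0]\n", c)
	}
	for _, r := range rules {
		b.WriteString(r)
		b.WriteByte('\n')
	}
	b.WriteString("COMMIT\n")
	return b.String()
}

func main() {
	fmt.Print(buildRestorePayload(
		[]string{"KUBE-PORTALS-CONTAINER"},
		[]string{"-A KUBE-PORTALS-CONTAINER -d 10.247.0.1/32 -p tcp -m tcp --dport 443 -j DNAT --to-destination 10.245.1.2:6443"},
	))
}
```

The payload would then be piped to iptables-restore, which applies the table in one commit.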

TODO:

  • Write new Proxier and LoadBalancer (ProxierIptables and DummyLoadBalancerIptables)
  • Update kube-proxy to select between old Proxier and ProxierIptables
  • Basic iptables rule generation
  • Faster rule generation (switched to using bytes.Buffer[s] for string concatenation)
  • Sticky Sessions
  • Delete old rule chains
  • Testing. We need more tests and this needs to be more thoroughly run through e2e when done. (mostly done now, just need more verification on e2e and the iptables rules)
  • Select iptables minimum version for switching over from Proxier to ProxierIptables (I've selected a version now, but we may want to change it; this probably needs some discussion)
  • Make rules fully deterministic (namely make the chain generation use hashing instead of a counter).
  • Parse iptables-save output to extract existing chains and rules
  • Use iptables-save output to preserve counters

Further TODO:

@googlebot added the cla: yes label Jun 3, 2015

@BenTheElder (Member Author) commented Jun 3, 2015

This has a long way to go before it should be merged, but I wanted to get an issue/PR up tracking its state, since otherwise I've been working on this pretty quietly.

I'd also like to get some eyes on it just to verify that it's headed in the right direction.

Vagrant/e2e haven't been playing very nice (mostly bugging out during validation; occasionally the vbox kext has even crashed my laptop!), so testing this has been very slow.

Edit: It looks like I've also fallen ~300 commits behind master, I'll be rebasing soon to catch up to that.

/cc @thockin

@k8s-bot commented Jun 3, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@BenTheElder (Member Author) commented Jun 5, 2015

I'm not sure what happened to the commit history here, looks like I messed up a rebase or something.
I'll try to clean it up when this is ready.

@k8s-bot commented Jun 10, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@BenTheElder (Member Author) commented Jun 11, 2015

I've pushed some smaller PRs breaking out some of the things that this will eventually rely on. I'll clean it and the commit history up, and try to finish debugging/vetting the iptables rules once #9563 lands.
After that I'll be on track to move on to writing new tests hopefully.

@spikecurtis commented Jun 12, 2015

@BenTheElder Thanks for your comments on #7245.

2 things worth drawing your attention to:

Firstly, in Calico, instead of attaching the container veth to a local bridge, we leave the host end in the root namespace and use Linux routing instead of bridging. We don't assign an IP address to the veth in the root namespace (only the container namespace). In the current userspace proxy, we find that REDIRECT rules don't work correctly to redirect traffic to local ports. I don't think this is an issue in the iptables proxy, since you'll be forwarding directly to the service instance pod, but it's something to keep in mind.

Secondly, it is highly desirable for the traffic that arrives at the service instance pod to have the correct source IP (and source port, although in practice port is less important than IP) from the pod that is connecting to the service. (Rather than, say, an IP associated with the pod host.) This is because Calico uses iptables rules to enforce network policy. Kubernetes doesn't support defining network policy, but will be doing so very soon with the introduction of namespaces, so it's important we get the proxy working in concert with those efforts.

Hopefully that makes sense! I'm happy to discuss further if you want clarification or detail.

@BenTheElder (Member Author) commented Jun 12, 2015

@spikecurtis Thanks.
I am currently only using REDIRECT in one nodePorts case, in the same way as the userspace proxy; otherwise I use DNAT. You can see the rule generation in syncProxyRules here: https://github.com/BenTheElder/kubernetes/blob/iptables_proxy/pkg/proxy/proxieriptables.go#L247
I notice in #7245 that nodePorts were not implemented in proxier when you made those changes; you may want to rebase from upstream and look into that.

I believe the source IP should be preserved correctly, as otherwise I'm only using DNAT (no MASQUERADE, etc.); that is one of the goals of this implementation, so any error in that respect should be fixed and working before this lands.

Other than that, I'm currently adding/maintaining the rules connecting PREROUTING and OUTPUT to the proxy the same way as the userspace version, but the rest of the rules are synced with iptables-restore periodically: the existing Kubernetes chains are flushed and freshly generated rules/chains are restored, for atomicity. Hopefully that shouldn't be a problem, but it does mean trying to add rules to Kubernetes' chains could be problematic. Looking at #7245, it doesn't look like Calico does this though.

@jdef (Contributor) commented Jun 14, 2015

How easy will it be to automate the removal of the new iptables rules if kube-proxy is terminated and k8s is uninstalled from a node?

see mesosphere/kubernetes-mesos#353

@BenTheElder (Member Author) commented Jun 14, 2015

@jdef Well, all of the chains are known or have a known prefix: we have 4 chains that always exist (KUBE-PORTALS-CONTAINER, KUBE-PORTALS-HOST, KUBE-NODEPORT-CONTAINER, KUBE-NODEPORT-HOST) and then a number of KUBE-SERVICE- prefixed chains generated in this version. I track all of the chains internally, so the KUBE-SERVICE-* chains are cleaned up and then regenerated on each update (as well as the other 4). Should kube-proxy be terminated, we could parse iptables-save output for :<CHAIN_NAME> and :KUBE-SERVICE*, then flush and delete each of those chains.
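For illustration, the cleanup pass described above could scan iptables-save output for chain declarations carrying the Kubernetes prefix (a hedged sketch; kubeChainsFromSave is a hypothetical name, not code from this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// kubeChainsFromSave extracts chain names from iptables-save output.
// Chain declarations look like ":CHAIN-NAME - [pkts:bytes]"; we keep the
// ones with a KUBE- prefix so a cleanup pass can flush and delete them.
func kubeChainsFromSave(save string) []string {
	var chains []string
	for _, line := range strings.Split(save, "\n") {
		if !strings.HasPrefix(line, ":") {
			continue
		}
		fields := strings.Fields(line[1:])
		if len(fields) == 0 {
			continue
		}
		if strings.HasPrefix(fields[0], "KUBE-") {
			chains = append(chains, fields[0])
		}
	}
	return chains
}

func main() {
	save := "*nat\n:PREROUTING ACCEPT [0:0]\n:KUBE-PORTALS-HOST - [0:0]\n:KUBE-SERVICE-0 - [0:0]\nCOMMIT"
	fmt.Println(kubeChainsFromSave(save)) // prints [KUBE-PORTALS-HOST KUBE-SERVICE-0]
}
```

Each returned chain would then be flushed (-F) and deleted (-X) in turn.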

@BenTheElder (Member Author) commented Jun 15, 2015

The history should be cleaned up now and unified with the other PRs. I've just finished running a full e2e before pushing this, and I see 13 failures (output: https://gist.github.com/BenTheElder/f8d7f70fa5bd45f9ad06#file-e2e-snippet-L5130); however, as noted in #9681, vagrant + e2e currently has numerous failures on master as well.

@BenTheElder (Member Author) commented Jun 17, 2015

That last commit (54eae42) also fixes an odd bug: while running e2e I saw instances where the port allocator returned 0, which appears to be the default behavior when no port-range is configured(?). iptables of course won't accept 0 as a port, so I added a fallback routine that asks the OS to assign a random port: it opens a socket on port 0, reads back the actual port, closes the socket, and returns that port.
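The fallback described here amounts to letting the kernel pick an ephemeral port; a minimal sketch (fallbackPort is an illustrative name, and note the small race between closing the listener and using the port):

```go
package main

import (
	"fmt"
	"net"
)

// fallbackPort asks the OS for a free port by listening on port 0,
// reading back the port that was actually assigned, and closing the
// listener so the caller can use the port (e.g. in an iptables rule).
func fallbackPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	p, err := fallbackPort()
	fmt.Println(p, err) // a nonzero OS-assigned port
}
```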

@BenTheElder (Member Author) commented Jun 17, 2015

Pretty typical rules on a minion from e2e currently look like:

[vagrant@kubernetes-minion-2 ~]$ sudo iptables-save
# Generated by iptables-save v1.4.21 on Wed Jun 17 05:57:33 2015
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [2:120]
:POSTROUTING ACCEPT [2:120]
:DOCKER - [0:0]
:KUBE-NODEPORT-CONTAINER - [0:0]
:KUBE-NODEPORT-HOST - [0:0]
:KUBE-PORTALS-CONTAINER - [0:0]
:KUBE-PORTALS-HOST - [0:0]
:KUBE-SERVICE-0 - [0:0]
:KUBE-SERVICE-0_0 - [0:0]
:KUBE-SERVICE-1 - [0:0]
:KUBE-SERVICE-1_0 - [0:0]
:KUBE-SERVICE-2 - [0:0]
:KUBE-SERVICE-2_0 - [0:0]
-A PREROUTING -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-CONTAINER
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A PREROUTING -m addrtype --dst-type LOCAL -m comment --comment "handle service NodePorts; NOTE: this must be the last rule in the chain" -j KUBE-NODEPORT-CONTAINER
-A OUTPUT -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-HOST
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m addrtype --dst-type LOCAL -m comment --comment "handle service NodePorts; NOTE: this must be the last rule in the chain" -j KUBE-NODEPORT-HOST
-A POSTROUTING -s 10.246.2.0/24 ! -o cbr0 -j MASQUERADE
-A KUBE-PORTALS-CONTAINER -d 10.247.0.10/32 -p tcp -m comment --comment "portal for default/kube-dns:dns-tcp" -m state --state NEW -m tcp --dport 53 -j KUBE-SERVICE-0
-A KUBE-PORTALS-CONTAINER -d 10.247.0.1/32 -p tcp -m comment --comment "portal for default/kubernetes:" -m state --state NEW -m tcp --dport 443 -j KUBE-SERVICE-1
-A KUBE-PORTALS-CONTAINER -d 10.247.0.10/32 -p udp -m comment --comment "portal for default/kube-dns:dns" -m state --state NEW -m udp --dport 53 -j KUBE-SERVICE-2
-A KUBE-PORTALS-HOST -d 10.247.0.10/32 -p tcp -m comment --comment "portal for default/kube-dns:dns-tcp" -m state --state NEW -m tcp --dport 53 -j KUBE-SERVICE-0
-A KUBE-PORTALS-HOST -d 10.247.0.1/32 -p tcp -m comment --comment "portal for default/kubernetes:" -m state --state NEW -m tcp --dport 443 -j KUBE-SERVICE-1
-A KUBE-PORTALS-HOST -d 10.247.0.10/32 -p udp -m comment --comment "portal for default/kube-dns:dns" -m state --state NEW -m udp --dport 53 -j KUBE-SERVICE-2
-A KUBE-SERVICE-0 -j KUBE-SERVICE-0_0
-A KUBE-SERVICE-0_0 -p tcp -m recent --set --name KUBE-SERVICE-0_0 --mask 255.255.255.255 --rsource -j DNAT --to-destination 10.246.2.81:53
-A KUBE-SERVICE-1 -j KUBE-SERVICE-1_0
-A KUBE-SERVICE-1_0 -p tcp -m recent --set --name KUBE-SERVICE-1_0 --mask 255.255.255.255 --rsource -j DNAT --to-destination 10.245.1.2:6443
-A KUBE-SERVICE-2 -j KUBE-SERVICE-2_0
-A KUBE-SERVICE-2_0 -p udp -m recent --set --name KUBE-SERVICE-2_0 --mask 255.255.255.255 --rsource -j DNAT --to-destination 10.246.2.81:53
COMMIT
@BenTheElder (Member Author) commented Jun 17, 2015

As far as I can tell, the service and endpoint monitoring should be working 100%, the chain tracking works, iptables.Restore(...) works, and logging is fine (though we might want to change some of the log levels). The only things left are ensuring that the rules generated in syncProxyRules() (in proxieriptables.go) function as needed, and then writing some more tests to verify all of the components.

I'll be working on those tests today, but I'd appreciate any help reviewing the rule generation.
In particular, I can't properly test e.g. public IPs on my setup (vagrant) since it isn't a "real" cluster.
cc @lxpollitt

@lxpollitt commented Jun 18, 2015

@BenTheElder The person on my team I would most like to look at this is @Symmetric, but he is on vacation. He's back in the office Tuesday next week (23rd). He should be able to look at it then. CC'ing @fasaxc in case he gets a chance to look at this later today too.

@BenTheElder (Member Author) commented Jun 18, 2015

Thanks!

@statik commented Jun 24, 2015

I'd be happy to help with testing this with my application if that would be useful.

@BenTheElder (Member Author) commented Jun 24, 2015

@statik That would be useful. I'm currently testing in a vagrant test cluster, where a number of tests always fail even on master, so it's been difficult to finish validating this.

@BenTheElder (Member Author) commented Jun 24, 2015

Rebased to current master.

@Symmetric (Contributor) commented Jun 26, 2015

@BenTheElder I've just started giving this another look; I'll apply our patches on top of the latest code here and check that the iptables rules look correct. I should (hopefully) have something tomorrow; I'll let you know how I get on.

@BenTheElder (Member Author) commented Jun 26, 2015

Thanks!


@Symmetric (Contributor) commented Jun 29, 2015

Hi @BenTheElder, I gave your PR a spin, and I'm seeing some unexpected rules; it looks like services (e.g. kube-dns) are now getting the old REDIRECT rules instead of DNAT:

-A KUBE-PORTALS-CONTAINER -d 10.247.0.10/32 -p udp -m comment --comment "default/kube-dns:dns" -m udp --dport 53 -j REDIRECT --to-ports 60906
-A KUBE-PORTALS-CONTAINER -d 10.247.0.10/32 -p tcp -m comment --comment "default/kube-dns:dns-tcp" -m tcp --dport 53 -j REDIRECT --to-ports 42567
-A KUBE-PORTALS-CONTAINER -d 10.247.0.1/32 -p tcp -m comment --comment "default/kubernetes:" -m tcp --dport 443 -j REDIRECT --to-ports 60461
-A KUBE-PORTALS-HOST -d 10.247.0.10/32 -p udp -m comment --comment "default/kube-dns:dns" -m udp --dport 53 -j DNAT --to-destination 10.0.2.15:60906

I'm on iptables 1.4.21. The ShouldUseProxierIptables logic looks incorrect:

return (major >= IPTABLES_MIN_MAJOR) && (minor > IPTABLES_MIN_MINOR) && (patch > IPTABLES_MIN_PATCH)

This will return false for 1.4.21; I think you want >= on the minor version. I checked the previous commit, and the DNAT rules are created as expected there.
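A fixed check would compare the version components in lexicographic order rather than independently; a sketch (the minimum-version constants here are illustrative, not this PR's actual values):

```go
package main

import "fmt"

// Illustrative minimum version; the PR picks its own constants.
const (
	minMajor = 1
	minMinor = 4
	minPatch = 11
)

// shouldUseProxierIptables compares (major, minor, patch) lexicographically
// against the minimum, so e.g. 1.4.21 passes a 1.4.11 minimum, which the
// component-wise check quoted above incorrectly rejects.
func shouldUseProxierIptables(major, minor, patch int) bool {
	if major != minMajor {
		return major > minMajor
	}
	if minor != minMinor {
		return minor > minMinor
	}
	return patch >= minPatch
}

func main() {
	fmt.Println(shouldUseProxierIptables(1, 4, 21)) // true
	fmt.Println(shouldUseProxierIptables(1, 4, 10)) // false
}
```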

I'll test a bit more with my changes merged into your HEAD^ (without the ShouldUseProxierIptables) and let you know how that goes.

@BenTheElder (Member Author) commented Aug 12, 2015

It looks like rebasing against #12568 should fix Shippable.

Edit: I've rebased to current master.

@thockin added lgtm and removed lgtm labels Aug 12, 2015

@thockin (Member) commented Aug 12, 2015

LGTM

@k8s-bot commented Aug 12, 2015

GCE e2e build/test failed for commit aa45ac2.

@BenTheElder (Member Author) commented Aug 12, 2015

Reviewing log...
Edit: DNS seems to be the issue.

Summarizing 2 Failures:

[Fail] DNS [It] should provide DNS for services 
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/dns.go:236

[Fail] DNS [It] should provide DNS for the cluster 
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/dns.go:198

Ran 84 of 142 Specs in 570.360 seconds
FAIL! -- 82 Passed | 2 Failed | 2 Pending | 56 Skipped 

Edit again: this was an older commit anyhow; we should be able to just ignore it.

@thockin (Member) commented Aug 12, 2015

I figured out the LB issue; it's a GCE thing, and we have a fix in the works.

Next: graphs, nodeports, and hairpins.

@thockin (Member) commented Aug 12, 2015

@ArtfulCoder you'll have to patch your external IPs work onto this, too

@BenTheElder (Member Author) commented Aug 12, 2015

Also the bridge-nf module sysctl/modprobe, right?

@thockin (Member) commented Aug 12, 2015

Yes - I think of what's left as the "advanced use cases" :)

@k8s-bot commented Aug 12, 2015

GCE e2e build/test passed for commit fc928c5.

@BenTheElder (Member Author) commented Aug 12, 2015

@thockin, which do you want to tackle first? I'll open a PR in the morning.

Though I'm not sure how we want to handle nodePorts yet (obviously with some iptables rules, but...).
The other two "advanced uses" could just be some init scripts (salt-stack maybe?) to probe/load the module and set hairpin mode on the bridges or something, and the graphs can be from a contrib script/tool.

@k8s-bot commented Aug 12, 2015

GCE e2e build/test passed for commit ae569e2.

@mcluseau (Contributor) commented Aug 12, 2015

@thockin If I may, the hairpin requirement is for running an Apache Kafka cluster without having to set the advertised hostname to 127.0.0.1 in /etc/hosts, which doesn't seem to be an advanced use case to me ;) It will probably surprise some users (here, by literally generating more than 100MBps of "hard-to-understand-why" logs), and I think it should be fixed before integrating the iptables-proxy implementation in a release. Of course, I can understand if it isn't, but I really don't know how many other use cases it may impact. At least, I give this issue a +1 in the priority vote :)

In any case, of course, I'll test with the current setup (with the workaround for when hairpin is disabled).

@BenTheElder (Member Author) commented Aug 12, 2015

I think by "advanced use cases" @thockin is mainly referring to things that need to be handled beyond basic service-portal support. We intend to reach full parity and test everything thoroughly before the new proxy is eventually un-gated, but those can be in later change sets. Right now it's behind an "unofficial" CLI flag that defaults to forcing the userspace proxy. It's not quite intended for real use yet, and all of the "advanced use cases" will be taken care of shortly.

I don't have anything to do with releases of course, but I don't believe it will be integrated until everything listed is complete.

Thanks for testing :)

@mcluseau (Contributor) commented Aug 12, 2015

Better safe than sorry ;-)

@thockin (Member) commented Aug 12, 2015

Don't get me wrong, we know how to solve it, but we wanted to get the core code in first :)


@mcluseau (Contributor) commented Aug 12, 2015


Totally fine, I just wanted to express my view about this "secondary list item" should some time constraint arise :)

cjcullen added a commit that referenced this pull request Aug 12, 2015
Merge pull request #9210 from BenTheElder/iptables_proxy
Implement Iptables based Proxy

@cjcullen merged commit b8dc963 into kubernetes:master Aug 12, 2015

4 checks passed:
  • Jenkins GCE e2e: 142 tests run, 58 skipped, 0 failed.
  • Shippable: builds completed
  • cla/google: all necessary CLAs are signed
  • continuous-integration/travis-ci/pr: the Travis CI build passed
@BenTheElder (Member Author) commented Aug 12, 2015

🎆 😄

@thockin (Member) commented Aug 12, 2015

W00t!!

@thockin (Member) commented Aug 12, 2015

I'm looking at NodePorts. I want you to focus on the test - I want to show a graph of how this is better :)

@thockin (Member) commented Aug 12, 2015

Even if the test is largely manual, it just needs to be reproducible.

@BenTheElder (Member Author) commented Aug 12, 2015

Sounds good. :)

