
Proposal: nftables backend for kube-proxy #62720

Open
nevola opened this issue Apr 17, 2018 · 22 comments

@nevola

commented Apr 17, 2018

/kind feature
/sig network

nftables (the next generation of iptables) is close to feature parity with iptables, but it also offers a much more flexible language that makes it possible to build a very complete load balancer, with higher performance, through only a small extension of the existing infrastructure.

For that reason, we've created nftlb, a small-footprint daemon that optimizes and simplifies rule generation, and which is fully compatible with additional nftables firewalling rules.

The current abilities are:

  • Topologies supported: Destination NAT, Source NAT and Direct Server Return. This enables the use of the load balancer in one-armed and two-armed network architectures.
  • Support for both IPv4 and IPv6 families.
  • Multilayer load balancer: DSR at layer 2, protocol-agnostic IP-based load balancing at layer 3, and load balancing of UDP, TCP and SCTP at layer 4.
  • Multiport support for ranges and lists of ports.
  • Multiple virtual services (or farms) support.
  • Schedulers available: weight, round robin, hash and symmetric hash.
  • Priority support per backend.
  • Live management of virtual services and backends programmatically through a JSON API.
  • Web service authentication with a security key.
  • Automated testbed included.
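As a rough illustration of the flexibility mentioned above (a hand-written sketch with made-up addresses, not the rules nftlb actually generates), a round-robin DNAT service fits in a single nftables rule using `numgen` and a map:

```
table ip lb {
    chain prerouting {
        type nat hook prerouting priority -100;
        # Distribute connections to 10.0.0.100:80 across two backends
        ip daddr 10.0.0.100 tcp dport 80 dnat to numgen inc mod 2 map {
            0 : 192.168.1.10,
            1 : 192.168.1.11
        }
    }
}
```

Loading a ruleset like this with `nft -f` replaces what would be a chain of per-backend statistic-module rules in iptables.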

This approach also solves some known corner cases of the iptables and IPVS backends, and even brings performance improvements. More info here:
https://www.zevenet.com/knowledge-base/nftlb/what-is-nftlb/

Main repository can be found here:
https://github.com/zevenet/nftlb

It'd be great to hear about the possibilities of integrating an nftables backend into kube-proxy, and whether there are any use cases not yet covered by the nftables approach.

Thanks.

@m1093782566

Member

commented Apr 17, 2018

This approach also solves some known corner cases with the iptables and IPVS backends

I am curious to learn what corner cases.

and even performance improvements.

Any test data?

@danwinship

Contributor

commented Apr 17, 2018

I think there are two separate issues:

  1. Do we want an nft-based kube-proxy in-tree? I think we were kind of hoping that IPVS was going to save the day and we wouldn't need another replacement (especially so soon), although the comments about IPVS in the blog post (especially the 10x performance claims) are interesting. (And then there's the port range problem (#23864).)

  2. Do we want to support the possibility of nft-based systems at all? Because as I understand it, even if we reject this code for kubernetes itself, but we want to allow users to run an out-of-tree third-party nft-based kube-proxy, then that still creates problems for kubernetes, because it is effectively required that either all components on the system use iptables, or all components on the system use nft. (Eg, if a network plugin wants to ensure that traffic coming off its internal bridge doesn't get eaten by the firewall, then it needs to know whether the firewall is implemented with iptables rules or nft rules, because adding the exception to the wrong system won't work.)

    • Kubernetes still uses iptables rules for a few things other than kube-proxy (eg, HostPort), and presumably all of that would have to be switched over to use nft. (This isn't a lot of work, but would probably end up requiring duplicate iptables/nft codepaths, and some way for kubelet to know which one to use.)
    • CNI plugins that create their own iptables rules (and that want to support the nft kube-proxy) would probably need to have separate iptables and nft codepaths too. And some way to figure out which one to use.
    • Pods that use hostNetwork and modify iptables rules are also an issue, but that one has to be Somebody Else's Problem. ("If you have pods that do iptables stuff, you can't switch to the nft kube-proxy until you update them.")
    • I think at this point, docker is probably not an issue: if you are using a CNI plugin that doesn't use the docker bridge, then docker's own use of iptables is irrelevant and harmless. (It might add rules that do nothing, but it shouldn't end up adding any rules that would hurt anything.)
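To make the split concrete, here is the same sort of bridge exception in both systems (a sketch; `cbr0` is just an example device name). Each command programs a different kernel ruleset, so a component that adds its exception to the system the firewall isn't actually using gets no effect:

```shell
# Legacy iptables: let traffic from the plugin's bridge through
iptables -I FORWARD -i cbr0 -j ACCEPT

# Native nftables: the equivalent rule, assuming an "inet filter" table exists
nft insert rule inet filter forward iifname "cbr0" accept
```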

I think the answer to question 2 (do we want to support nft-based systems at all) is "yes". Or at least, if it's not "yes" now, it will be eventually. iptables is considered a technological dead end in the kernel and there's more and more work going into improving nft and less and less going into iptables.

Are you actually using nftlb with kubernetes already? Have you run into nft-vs-iptables problems? (Or are you only using it for cluster-ingress routing, not for pod-to-pod routing?)

@m1093782566

Member

commented Apr 17, 2018

@danwinship

(And then there's the port range problem (#23864).)

I proposed fwmark + IPVS to implement port ranges in kubernetes/community#1738; please check the IPVS section.

Any comment?
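For readers following along, the fwmark workaround looks roughly like this (mark value, addresses and port range are illustrative): iptables marks traffic for the whole range, and IPVS balances on the mark rather than on a single virtual port.

```shell
# Mark everything destined for the virtual IP across the port range
iptables -t mangle -A PREROUTING -d 10.0.0.100/32 -p tcp \
  --dport 8000:9000 -j MARK --set-mark 1

# IPVS virtual service keyed on the firewall mark; backend port 0
# keeps the packet's original destination port
ipvsadm -A -f 1 -s rr
ipvsadm -a -f 1 -r 192.168.1.10:0 -m
```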

@nevola

Author

commented Apr 19, 2018

Thank you for your response @danwinship

  1. Do we want an nft-based kube-proxy in-tree?
  2. Do we want to support the possibility of nft-based systems at all?

Those are interesting questions. From my point of view, the adoption of nft will happen sooner or later, and all major distributions are already working on the integration.

The kube-proxy proposal to use nft is just an example, because I consider that the added value in terms of usability and performance is really worth it.

But, as you said, the full picture is bigger than that, and the usage should be extended to the current networking functions so that both are supported. For the list you provided, we can study the "translations" to be done and identify whether anything missing from the current nft infrastructure needs to be included.

Although our research already provides some numbers (here is where we discovered the 10x improvement in DSR mode), we're performing further benchmarks comparing iptables LB, IPVS and nftlb across several scenarios. After that, we'll work on the first PoC of kubernetes with nft, which hopefully will be ready by September.

Or are you only using it for cluster-ingress routing, not for pod-to-pod routing?

Both use cases are worth addressing.

@nevola

Author

commented Apr 19, 2018

Thank you for your response @m1093782566

Exactly, there are some workarounds, but the main idea of this proposal is to discuss whether integrating the nft infrastructure is something to take into account in the roadmap.

@thockin

Member

commented Apr 30, 2018

Here are my feelings:

  • I am open to an nft impl in kube-proxy (and EBPF, maybe) if we can keep it isolated, and if we can keep the feature set consistent. Doing it out of tree is also viable, but if we want to push on that we should do a better job of writing a spec for services impls.

  • We really need to modularize kube-proxy better :)

  • the FWMARK hack for IPVS ranges is a hack and I really dislike using it

  • This adds MORE fuel to the fire for making the net plugin and services plugin more closely coupled

@nevola

Author

commented Jun 28, 2018

Hi, during last week's netfilter workshop we presented some benchmarks that may shed light on the nftables approach for kube-proxy.

In summary, we're seeing more than 50% better performance than iptables in NAT cases (with 3 backends), and the advantage grows as more backends are added, thanks to the constant complexity order of the nftables rules design.

We also presented the penalties caused by the Spectre/Meltdown mitigations: about a 40% penalty for iptables but only 17% for nftables, tested with the same NAT cases with conntrack enabled.
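The constant-complexity point can be sketched outside the kernel (a toy model for intuition only, not one of the benchmarks above): iptables evaluates a linear chain of rules per packet, while an nftables verdict map resolves in a single hash lookup, so per-packet cost stays flat as services and backends grow.

```python
# Toy model: linear rule list (iptables-style) vs. map lookup (nftables-style).

def linear_match(rules, daddr):
    """Scan rules one by one, like an iptables chain: O(n) per packet."""
    for rule_daddr, action in rules:
        if rule_daddr == daddr:
            return action
    return "drop"

def map_match(vmap, daddr):
    """Single hash lookup, like an nftables verdict map: O(1) per packet."""
    return vmap.get(daddr, "drop")

# The same 254 "services", in both representations
rules = [(f"10.0.0.{i}", f"backend-{i}") for i in range(1, 255)]
vmap = dict(rules)

assert linear_match(rules, "10.0.0.200") == "backend-200"
assert map_match(vmap, "10.0.0.200") == "backend-200"
```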

Extended info here:
https://www.zevenet.com/knowledge-base/nftlb/nftlb-benchmarks-and-performance-keys/

@nevola

Author

commented Jun 28, 2018

On the other hand, nftables and eBPF shouldn't be incompatible. There is some work in progress toward this integration: https://www.spinics.net/lists/netfilter-devel/msg53891.html

@thockin

Member

commented Jul 2, 2018

So, @nevola how do you want to proceed? Here's my proposal.

  1. Between you, the IPVS folks, and the iptables folks (myself and others), we write a doc that covers all of the things you need to implement to be a viable Service implementation (node ports, external IPs, etc).

  2. you write a new kube-proxy module

  3. we let users decide which mode they want.

@nevola

Author

commented Jul 3, 2018

Hi @thockin, that sounds good. Although I'm not an expert Go developer, I accept the challenge.

@fejta-bot


commented Oct 1, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@nevola

Author

commented Oct 2, 2018

/remove-lifecycle stale

We're still working on this issue.

@nevola

Author

commented Oct 5, 2018

Hi, we've started on a prototype named kube-nftlb, thanks to the collaboration of @AquoDev:

https://github.com/zevenet/kube-nftlb

Any feedback and guidance for continuing the integration would be appreciated.
Thanks!

@rbtcollins

Contributor

commented Oct 25, 2018

@nevola I'm very interested in the DSR story (just technically / future proofing, we haven't hit a performance point where the distinction would be important to us today). What encap are you using to forward to nonlocal pods? How are you telling k8s that the pod is allowed to send from the destination address etc?

FWIW IPVS supports DSR too, so though I'm totally pro nftables the DSR feature shouldn't really be tied to nftables <-> IPVS (and in fact IPVS should be switching to nftables at some point too, right?)
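For comparison, a classic IPVS direct-routing setup looks like this (a sketch with illustrative addresses; the real server must own the VIP on a non-ARPing interface, here a loopback alias):

```shell
# Director: virtual service in direct-routing (gatewaying) mode
ipvsadm -A -t 10.0.0.100:80 -s rr
ipvsadm -a -t 10.0.0.100:80 -r 192.168.1.10:80 -g

# Real server: hold the VIP locally but never answer ARP for it
ip addr add 10.0.0.100/32 dev lo
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
```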

@nevola

Author

commented Oct 25, 2018

Hi @rbtcollins, the first milestone of this project is to integrate nftlb with the same features that kube-proxy provides. Currently, we're implementing a Go client, separate from the nftlb daemon, to manage the "translation" layer between k8s and the nftlb API.

Forwarding to non-local pods hasn't been addressed yet, but if you have any concerns in that regard, please let us know so we can take them into consideration.

nftables and IPVS have different implementations of DSR, but IPVS relies on the netfilter hooks for features like multiport or tproxy. Once the migration to nftables is done, it shouldn't affect IPVS.

@nevola

Author

commented Nov 26, 2018

Hi everybody, we've recently released nftlb 0.3 with the following new features:

  • Stateless NAT support from ingress
  • Automated DSR configuration from layer 3
  • Flow mark per service and per backend
  • Logging support per virtual service
  • L7 helpers support
  • Support of custom source IP instead of masquerading
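For a rough idea of what the ingress-based features build on (a hand-written sketch, not nftlb's generated output; the address, MAC and device name are made up), nftables can rewrite and forward packets at the netdev ingress hook, before the IP stack and conntrack are involved:

```
table netdev lb {
    chain ingress {
        type filter hook ingress device eth0 priority 0;
        # DSR-style dispatch: swap the destination MAC to a backend's
        # and forward directly, leaving the IP packet untouched
        ip daddr 10.0.0.100 tcp dport 80 ether daddr set 00:11:22:33:44:55 fwd to eth0
    }
}
```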

Currently, kube-nftlb is able to manage the minimal options needed to create a working, scalable service, but now we're in the phase of making the kubernetes and docker rules coexist with the nftlb ones. On a host with the iptables-nftables compatibility layer this is possible.

We had hoped that kube-proxy was the only component that manages rules, but that is not the case. So we're thinking about adding some intelligence to nftlb so that it knows which rules are already active and inserts the nftlb ones accordingly.

Any other ideas would be appreciated.
Thanks!

@fejta-bot


commented Feb 24, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@kvaps

Contributor

commented Feb 25, 2019

/remove-lifecycle stale

@cmluciano

Member

commented Mar 5, 2019

@nevola Has there been anymore progress on this?

@nevola

Author

commented Mar 5, 2019

Hi @cmluciano, sure, we're implementing more features in nftlb (the load balancer core), including security policies per service, persistence and more, but more integration is still required on the kube-nftlb side in order to be fully compatible with the docker rules.

@DanyC97


commented Jun 4, 2019

@nevola a while back (#62720 (comment)) it was suggested that a doc be written. Was that done, and if so, can you please share it? Thanks!

@nevola

Author

commented Jun 4, 2019

Hi @DanyC97, it wasn't arranged, but I'll be available to follow up on the requirements.
