nftables kube-proxy TODO #122572
Comments
I can easily switch kernels in my env (and "nft" for that matter), so I can check that. But what to check, and how precisely? Is it enough to verify basic traffic, or is a full K8s e2e required? Even if I can switch kernels, downloading and building them is a tedious task (especially if there are gcc version problems), so the fewer the better.
Update: K8s e2e with:
runs fine on |
@danwinship Any idea on what to check with "nft --check"? And, is it OK to discuss in this PR? I can run on linux-5.6.10 without problems, but on linux-5.6 and linux-5.6.4 I get errors:
(I use exactly the same kernel-config for linux-5.6.10 and linux-5.6.4) |
@uablrek For v1.29.0, the minimal kernel version is linux-5.6.9 according to previous investigation: #122296 (comment)
After #122296, the minimal version may be relaxed a lot. I guess it's 4.1 but haven't tested it on that version yet. |
I assume this PR is going to get a lot of discussion about a lot of things, and that's fine, but if you're planning to grab one of the items anyway, maybe file a separate issue for it (or an initial/placeholder PR) and move discussion there?
We need to verify that every kind of rule the nftables Proxier can generate works correctly. So somewhere in between; full e2e would do the trick, but in theory you could do something much simpler. One starting point would be the |
Currently we use |
So, per the wiki, if we want to require 5.4, I guess you could try a rule with " |
Also, I haven't investigated the |
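One low-effort way to exercise a given kernel without a full e2e run could be to feed a throwaway ruleset through nft --check, which parses the commands and has the kernel validate them without committing anything. This is only a sketch; the table name and the features probed here are placeholders, not the proxier's actual rules:
# Probe whether the running kernel/nft accept the constructs we plan to use.
# --check (-c) validates the ruleset without applying it; -f - reads from stdin.
nft --check -f - <<'EOF'
table ip kube-proxy-probe {
    map svc-probe {
        type ipv4_addr . inet_proto . inet_service : verdict
    }
    chain probe {
        ip daddr . meta l4proto . th dport vmap @svc-probe
    }
}
EOF
echo "nft --check exit status: $?"   # 0 means the syntax and features were accepted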
It seems
It means that, before 4.18, iptables NAT rules will not take effect when nftables NAT rules are in use. The page describes it. And I have confirmed on kernel 4.15 that after nftables NAT rules are installed, the iptables NAT rules are no longer matched, causing problems when Pods try to access the external network via iptables MASQUERADE. So if a CNI plugin relies on the iptables NAT table, it's not going to work with nftables mode before 4.18. |
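A rough way to see this on a node (my own sketch, with a made-up pod CIDR and table name, not taken from any particular CNI plugin):
# A CNI-style masquerade rule in the iptables nat table:
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE

# Now register an nftables nat hook and add a rule, as the nftables proxy mode does:
nft add table ip demo-nat
nft add chain ip demo-nat postrouting '{ type nat hook postrouting priority 100 ; }'
nft add rule ip demo-nat postrouting counter

# On kernels before 4.18, once the nftables nat hook is in use the iptables
# MASQUERADE rule above is no longer matched, so pod egress traffic stops being
# masqueraded; on 4.18 and later the two coexist.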
"dependency" here is nftables-maintainer-speak. It's warning you that (I got the "5.3" requirement from here but I agree, I can't see any commits between 5.2 and 5.3 that seem relevant.) |
I have tested some versions of
Both work with K8s e2e with FOCUS/SKIP as in #122572 (comment). Can we settle for nft
Update: There is an aspect of dynamic libs that I haven't covered. For instance, I have used |
Yes. I think we'll officially declare that we support kernel 5.4+ and nftables command-line 0.9.2+, and we can use a FTR (and maybe worth documenting in a comment wherever we put the check):
|
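A sketch of what such a check could look like from a shell (the real check will presumably live in kube-proxy itself; only the 5.4 / 0.9.2 minimums come from the comment above, everything else here is illustrative):
# Hypothetical preflight check for the declared minimums (kernel 5.4+, nft 0.9.2+).
kernel_ver=$(uname -r | cut -d- -f1)                     # e.g. "5.15.0"
nft_ver=$(nft --version | awk '{print $2}' | tr -d v)    # e.g. "0.9.8" or "1.0.6"

at_least() {   # succeeds if version $1 >= version $2 (dotted-decimal compare)
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

at_least "$kernel_ver" "5.4"  || echo "kernel $kernel_ver is older than 5.4"
at_least "$nft_ver" "0.9.2"   || echo "nft $nft_ver is older than 0.9.2"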
I'm picking up |
For the perf job, do we need something similar to ci-kubernetes-e2e-gce-scale-performance, configured for nftables with a lower cadence and fewer nodes? |
I'm not sure exactly what we want for perf... @aojea may have thoughts on that, and sig-scalability may have more information on what sort of big tests we're able to run. (I assume we can't spin up a 2000 node cluster just because we're bored...) Maybe open an issue to start the discussion on that? (Antonio also said he was going to start thinking about metrics that the perf job could be measuring.) |
Do we need to implement it only in nftables mode, or in all modes (both iptables and ipvs)? And by the way, I think we should create a new issue to track each task. This issue is not a good place to discuss details. |
Will try to tackle |
Yes, I want to work on it after we confirm some details. |
picking up
Instead of directly linking the services chain to the per-service-port chains (using verdict maps), can we create a new chain (service-ports) which the services chain jumps to when daddr matches any of the {cluster|loadbalancer|external} IPs? We can then add a rule in this new chain that matches on the service-ports verdict map, and simply reject if nothing matches.
// create set of IPs (cluster + loadbalancer + external)
add set ip kube-proxy service-ips { type ipv4_addr }
// populate the service-ips set
add element ip kube-proxy service-ips {172.30.0.45, 5.6.7.8}
// create map for ServicePorts {ip + protocol + port}
add map ip kube-proxy service-ports { type ipv4_addr . inet_proto . inet_service : verdict}
// populate service-ports map
add element ip kube-proxy service-ports { 172.30.0.45 . tcp . 80 : goto service-HVFWP5L3-ns5/svc5/tcp/p80 }
add element ip kube-proxy service-ports { 5.6.7.8 . tcp . 80 : goto external-HVFWP5L3-ns5/svc5/tcp/p80 }
// define service-ports chain
add chain ip kube-proxy service-ports
// match on service-ports map
add rule ip kube-proxy service-ports ip daddr . meta l4proto . th dport vmap @service-ports
// reject everything else
add rule ip kube-proxy service-ports reject
// link services chain to service-port chain
add rule ip kube-proxy services ip daddr @service-ips jump service-ports

[EDIT] We can also add a set of ServiceCIDRs and add a rule to jump to this service-ports chain later. |
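If something along these lines were adopted, the resulting dispatch objects could be inspected on a node with the standard nft list subcommands (nothing here is specific to this proposal):
# Inspect the populated set, the verdict map, and the dispatch chain
nft list set ip kube-proxy service-ips
nft list map ip kube-proxy service-ports
nft list chain ip kube-proxy service-ports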
We're only changing the behavior of nftables mode. All of the TODO items here are for nftables mode only, except for the "Additional metrics for iptables mode" section.
Yes, people should comment here if they want to claim an item, but it's best to split out discussion. You can either open an issue for the specific item, or else just file a Draft PR with some initial work and we can discuss there. (And I'll update the list at the top to link to other issues/PRs as people file them.) |
I will look into making it possible to run multiple instances of kube-proxy. Not the most urgent, but fun 😄 It can be tested as a PoC with Multus, independent of K8s multi-network, but will be really useful with it. Load balancing is not yet considered in
Kpng is already supporting multiple instances, so this is not impossible at all. |
An update on
|
I exec into the kube-proxy pod to run nft in a kind cluster 😅 |
moving versioning discussion to #122743 |
Multiple instances of kube-proxy in #122814 |
@danwinship Is there a reason why we have a check on daddr, port and proto for the individual service mark-for-masq jump?
|
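As a rough illustration of the two rule shapes in question (placeholder chain names, mocked up for this discussion rather than copied from the proxier's output):
# Re-matching daddr/proto/dport inside a per-service chain that the verdict map
# already selected based on exactly those fields...
add rule ip kube-proxy external-EXAMPLE-ns1/svc1/tcp/p80 ip daddr 5.6.7.8 tcp dport 80 jump mark-for-masquerade
# ...versus relying on the vmap dispatch alone and jumping unconditionally:
add rule ip kube-proxy external-EXAMPLE-ns1/svc1/tcp/p80 jump mark-for-masquerade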
How about a Reset method on transactions? We can call tx.Reset in the sync loop. |
For 1.30:
- nft versions (@danwinship, update client/kernel version requirements for nftables kube-proxy #124152)
- --nodeport-addresses behavior to default to "primary node IP(s) only" rather than "all node IPs". (@nayihz, change --nodeport-addresses behavior to default to primary node ip only #122724)
- reject connections on invalid ports of service IPs (@aroradaman, proxy/nftables: reject packets destined for invalid ports of service ips #122692)
- ServiceCIDR objects to learn the full service CIDR(s) and reject connections on all service IPs, not just currently-in-use ones.
- @no-endpoints-services / @no-endpoints-nodeports handling, in which case note that we're currently checking no-endpoints nodeports from more places than we need to be (Document the nftables kube-proxy packet flow #122687).
- UNRESOLVED section of the KEP after this is implemented.
- danwinship/knftables to kubernetes-sigs/knftables (and eventually declare a v0.1.0 API) (@danwinship, REQUEST: Migrate danwinship/knftables to kubernetes-sigs/knftables org#4673, Update knftables, with new sigs.k8s.io module name #122920)
- kubernetes-sigs/knftables (Add scripts for CI kubernetes-sigs/knftables#3)
- ct state invalid drop rule (@aroradaman, pkg/proxy/nftables: drop conntrack state invalid rule #122663)
- --conntrack-tcp-be-liberal) so we should remove the ugly hack from the nftables proxy and push people to use that instead if they need it.
- helpers_test.go.) (@npinaeva, Add ParseDump function to allow using Fake.Dump() output as a test setup. kubernetes-sigs/knftables#2, Split regex for map and set elements to enable elements with colon. kubernetes-sigs/knftables#6, Ensure nftables unit test parity with iptables #123389)
- UNRESOLVED section of the KEP after a decision is made.
- @no-endpoint-services and @no-endpoint-nodeports have comments so that you can see which IP/port goes with which service. But elements of @service-ips and @service-nodeports don't, because you can already figure that out from the names of the chains that they jump to.

Things that depend on us having a perf job (and good metrics) first:
- nft -f - input, like iptables does. This could be done in a few ways. (Have Proxier keep buffers around like iptables does and pass a buffer to nft.Run(); have knftables.realNFTables keep buffers around itself and pass them to the Transaction; have knftables.realNFTables keep Transactions around and reuse them (with the transaction storing its buffer); ...)
- knftables.Transaction.operation arrays somehow?
- knftables.Rule generation wrt knftables.Concat. Some discussion here.

Additional metrics for iptables mode to help users figure out if they'd have trouble migrating:
- --ctstate INVALID -j DROP rule (and should be using --conntrack-tcp-be-liberal instead) (@aroradaman, Metric to track conntrack state invalid packets dropped by iptables #122812)

For 1.31/beta:
- content/en/docs/reference/networking/virtual-ips.md in k/website)
- UNRESOLVED section of the KEP after a decision is made.

For GA:
- UNRESOLVED section of the KEP when this is implemented.

/sig network
/priority important-soon
/triage accepted
cc @aojea @uablrek @aroradaman @tnqn