
nftables kube-proxy TODO #122572

Open · 16 of 27 tasks
danwinship opened this issue Jan 3, 2024 · 26 comments
Labels: priority/important-soon, sig/network, triage/accepted

danwinship (Contributor) commented Jan 3, 2024

For 1.30:

Things that depend on us having a perf job (and good metrics) first:

  • add iptables-style partial syncing (@npinaeva)
  • optimize memory allocation:
    • reuse the buffers used for building the nft -f - input, like iptables does. This could be done in a few ways (have Proxier keep buffers around like iptables does and pass a buffer to nft.Run(); have knftables.realNFTables keep buffers around itself and pass them to the Transaction; have knftables.realNFTables keep Transactions around and reuse them, with the transaction storing its buffer; ...). See the sketch after this list.
    • maybe reuse the knftables.Transaction.operation arrays somehow?
    • consider optimizing knftables.Rule generation wrt knftables.Concat. Some discussion here.
  • rewrite masquerading to use an nftables set rather than using the mark, as discussed in the KEP
  • try out the alternative hairpin rule suggested in the KEP.
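
To make the first buffer-reuse option concrete, here is a minimal sketch of the "Proxier keeps buffers around" approach. The field names are illustrative, and nft.Run() accepting a caller-supplied buffer is the proposed API change above, not the current knftables signature:

package nftables

import "bytes"

// Proxier keeps one long-lived buffer, the way the iptables proxier
// reuses its rule buffers across syncs.
type Proxier struct {
	nftData bytes.Buffer // grows to steady-state size once, then is reused
}

func (p *Proxier) syncProxyRules() {
	p.nftData.Reset() // drop the old contents but keep the allocated capacity
	// ... write this sync's `nft -f -` input into p.nftData ...
	// ... then hand the buffer to the runner, e.g. (hypothetical signature):
	//     p.nft.Run(ctx, &p.nftData)
}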

Additional metrics for iptables mode to help users figure out if they'd have trouble migrating:

  • Metric to indicate if you are using localhost nodeports. This could be done by making sure there is always a separate rule to catch 127.0.0.1:nodeport, rather than letting it be caught by a generic rule, then checking the counters on those rules and exporting some metric with the result; see the sketch after this list. (@aroradaman, Kube-Proxy: Track packets accepted on localhost nodeports #125015)
  • More generally, metric to indicate if you are using non-primary-node-IP nodeports. (Maybe the metric could just indicate every IP/interface that you're making use of nodeports on?)
  • Metric to indicate if you are relying on the --ctstate INVALID -j DROP rule (and should be using --conntrack-tcp-be-liberal instead) (@aroradaman, Metric to track conntrack state invalid packets dropped by iptables #122812)
  • Possibly other metrics if we change other behavior
  • (We could also add metrics to ipvs mode to help ipvs users, but that's not a blocker for eventually changing the default like the iptables metrics are.)
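
As a rough illustration of the localhost-nodeports metric above: the metric name, the chain choice, and the counter-reading helper are all hypothetical, registration is omitted, and kube-proxy's real metrics go through k8s.io/component-base/metrics rather than the plain Prometheus client used here:

package iptables

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric: the current packet count on the dedicated
// 127.0.0.1 nodeport rule. Nonzero means something in the cluster
// relies on localhost nodeports and would have trouble migrating.
var localhostNodePortPackets = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "kubeproxy_iptables_localhost_nodeport_packets", // illustrative name
	Help: "Packets accepted by the dedicated 127.0.0.1 nodeport rule.",
})

// syncLocalhostNodePortMetric copies the iptables rule counter into the
// metric. readRuleCounter is a hypothetical helper that parses the output
// of `iptables -t nat -L KUBE-NODEPORTS -v -x` for the tagged rule.
func syncLocalhostNodePortMetric(readRuleCounter func(chain string) (int, error)) {
	if pkts, err := readRuleCounter("KUBE-NODEPORTS"); err == nil {
		localhostNodePortPackets.Set(float64(pkts))
	}
}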

For 1.31/beta:

For GA:

  • Document the "API guarantees" for components interoperating with nftables kube-proxy.
    • Discussion in KEP
    • Resolve the UNRESOLVED section of the KEP when this is implemented.

/sig network
/priority important-soon
/triage accepted
cc @aojea @uablrek @aroradaman @tnqn

k8s-ci-robot added the sig/network, priority/important-soon, and triage/accepted labels on Jan 3, 2024
uablrek (Contributor) commented Jan 4, 2024

I can easily switch kernels in my env (and "nft" for that matter), so I can check that. But what to check, and how precisely?

Is it enough to verify basic traffic, or is a full K8s e2e required?

Even though I can switch kernels easily, downloading and building them is tedious (especially when there are gcc version problems), so the fewer the better.

Update

K8s e2e with:

FOCUS='\[sig-network\].*ervice'
SKIP='Disruptive|Serial|ESIPP|DNS|GCE|finalizer'

runs fine on linux-6.6.9 (the latest at the time of writing).

uablrek (Contributor) commented Jan 4, 2024

actually, we've pretty much decided on kernel 5.6, though we will probably implement this via an nft --check-based check rather than via a kernel version number check.

@danwinship Any idea on what to check with "nft --check"? And, is it OK to discuss in this PR?

I can run on linux-5.6.10 without problems, but on linux-5.6 and linux-5.6.4 I get errors:

E0104 13:16:35.176593     405 proxier.go:1561] "nftables sync failed" err=<
        /dev/stdin:50:1-116: Error: Could not process rule: Invalid argument
        add set ip6 kube-proxy firewall-allow { type ipv6_addr . inet_proto . inet_service . ipv6_addr ; flags interval ; }
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        /dev/stdin:54:1-116: Error: Could not process rule: No such file or directory
        add rule ip6 kube-proxy firewall-allow-check ip6 daddr . meta l4proto . th dport . ip6 saddr @firewall-allow return
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        /dev/stdin:61:1-40: Error: Could not process rule: No such file or directory
        flush set ip6 kube-proxy firewall-allow
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(I use exactly the same kernel-config for linux-5.6.10 and linux-5.6.4)

tnqn (Member) commented Jan 4, 2024

I can run on linux-5.6.10 without problems, but on linux-5.6 and linux-5.6.4 I get errors:

@uablrek For v1.29.0, the minimum kernel version is linux-5.6.9 according to previous investigation: #122296 (comment)

Yes, I managed to run it on 5.6, but only 5.6.9 and later. That's because the nftables CLI has set the NFT_SET_CONCAT flag for sets with concatenated ranges since 0.9.5 ("Set NFT_SET_CONCAT flag for sets with concatenated ranges"), and kernel support for that flag was added in 5.7 and backported to 5.6.9.

After #122296, the minimum version may be relaxed a lot. I'd guess 4.1, but I haven't tested that version yet.

danwinship (Contributor Author) commented:

And, is it OK to discuss in this PR?

I assume this PR is going to get a lot of discussion about a lot of things, and that's fine, but if you're planning to grab one of the items anyway, maybe file a separate issue for it (or an initial/placeholder PR) and move discussion there?

Is it enough to verify basic traffic, or is a full K8s e2e required?

We need to verify that every kind of rule the nftables Proxier can generate works correctly. So somewhere in between; full e2e would do the trick, but in theory you could do something much simpler.

One starting point would be the expected ruleset from TestOverallNFTablesRules in pkg/proxy/nftables/proxier_test.go. Unfortunately, it looks like it's currently not in a loadable order (e.g., it adds a jump endpoints-check rule before adding the endpoints-check chain). You can hack around that by first running just the add table and add chain lines, and then re-running the entire transaction. (Probably we should fix the value of expected so that it is usable directly...) But also, this was mostly just a port of the corresponding iptables proxy test, and I don't know for sure that it actually includes a copy of every possible rule type. (Though it would obviously be good if it did.) So:

  1. Figure out what additional service types are needed and add them.
  2. Figure out if there are any rule types that can't be generated here because they'd require different global config (e.g. not having a local detector), and add comments about that.
  3. Re-order the rules into a usable order (while still trying to preserve the grouping-by-service as much as possible, for debuggability).
  4. Test against kernels.

danwinship (Contributor Author) commented Jan 4, 2024

After #122296, the minimal version may be relaxed a lot. I guess it's 4.1 but haven't tested it on that version yet.

Currently we use th expressions, which were added in 5.3, though there's a workaround for that if needed. Beyond that, there are also bugfixes to worry about; the list of updates in the wiki includes the really major ones, but I worry that the further back we go, the more minor bugs creep in that could make things not work. (In particular, if we wanted to support earlier than 4.18 we'd have to figure out what "nftables NAT is no longer incompatible with iptables NAT" means.) So I don't think we want to go any earlier than 5.4, which I believe is the oldest kernel in use in any LTS distro that's still in a "fully supported" phase.

danwinship (Contributor Author) commented:

Any idea on what to check with "nft --check"?

So, per the wiki, if we want to require 5.4, I guess you could try a rule with "meta time"... though it seems a little weird to base the check on something we aren't actually using, so maybe th would be better?
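
A minimal sketch of what a th-based probe could look like from Go; the table and chain names are arbitrary, and this is just the general shape, not the actual implementation:

package nftables

import (
	"os/exec"
	"strings"
)

// probeTH returns nil if the installed nft accepts a rule using the th
// expression; --check parses and validates the transaction without
// actually committing anything.
func probeTH() error {
	input := `
add table ip kube-proxy-canary
add chain ip kube-proxy-canary probe
add rule ip kube-proxy-canary probe meta l4proto tcp th dport 12345 return
`
	cmd := exec.Command("nft", "--check", "-f", "-")
	cmd.Stdin = strings.NewReader(input)
	return cmd.Run()
}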

danwinship (Contributor Author) commented:

Also, I haven't investigated the nft version situation at all, and there may be constraints there that require a newer nft than "whatever version added support for the oldest kernel feature we're using". In particular, I'm not sure exactly when --json support was added, and we definitely need that.

tnqn (Member) commented Jan 4, 2024

Currently we use th expressions, which were added in 5.3

It seems th is only an alias, and the change was needed only in the CLI: proto: add pseudo th protocol to match d/sport in generic way. See "As th is an alias for the raw expression, no dependency is generated" in that commit message. And the kernel 5.3 changelog doesn't mention any related change at all. I successfully ran nftables mode and saw a lot of th expressions with kernel 4.15 and nftables v1.0.6.

Especially, if we wanted to support earlier than 4.18 we'd have to figure out what "nftables NAT is no longer incompatible with iptables NAT" means.

It means that before 4.18, iptables NAT rules do not take effect when nftables NAT rules are in use. The page describes it. And I have confirmed on kernel 4.15 that after nftables NAT rules are installed, iptables NAT rules are no longer matched, causing problems when Pods try to access the external network via iptables MASQUERADE. So a CNI plugin that relies on the iptables NAT table is not going to work with nftables mode before 4.18.

danwinship (Contributor Author) commented:

See "As th is an alias for the raw expression, no dependency is generated."

"dependency" here is nftables-maintainer-speak. It's warning you that th doesn't imply {tcp,udp,sctp}, so if you're careless with your rules, a th expression might end up matching bits inside an ICMP packet or something.

(I got the "5.3" requirement from here but I agree, I can't see any commits between 5.2 and 5.3 that seem relevant.)

uablrek (Contributor) commented Jan 5, 2024

I have tested some versions of nft (archives):

nftables v0.9.2 (Scram)                  2019-08-19
nftables v1.0.0 (Fearless Fosdick #2)    2021-08-19

Both work with the K8s e2e using FOCUS/SKIP as in #122572 (comment).

Can we settle for nft ≥ v0.9.2?

Update

There is an aspect of dynamic libraries that I haven't covered. For instance, I used libnftnl 1.2.6 (archives) in all my builds, and some other libs, like libjansson, come from my Ubuntu 22.04.3 LTS PC.

danwinship (Contributor Author) commented:

Can we settle for nft ≥ v0.9.2?

Yes. I think we'll officially declare that we support kernel 5.4+ and nftables command-line 0.9.2+, and we can use a th expression to check that. (Which means really we'll support kernel 5.3, but we won't advertise it that way.)

FTR (and maybe worth documenting in a comment wherever we put the check):

  • If we decide we need concatenations-with-ranges support again later, that would imply 5.6/0.9.4.
  • If we want to use typeof in sets/maps, that would require at least 0.9.4 (and maybe later depending on exactly how we're using it; we'd have to re-check the release notes).
  • Support for comments on tables/chains/sets/maps/etc (anything but rules and elements) didn't come until 5.10/0.9.7, but knftables itself already does a --check for that at construct time, and just drops those comments if they're not supported.

aroradaman (Member) commented Jan 9, 2024

I'm picking up "drop the ct state invalid drop rule": #122663

aroradaman (Member) commented:

For the perf job, do we need something similar to ci-kubernetes-e2e-gce-scale-performance, configured for nftables with a lower cadence and fewer nodes?

danwinship (Contributor Author) commented:

I'm not sure exactly what we want for perf... @aojea may have thoughts on that, and sig-scalability may have more information on what sort of big tests we're able to run. (I assume we can't spin up a 2000 node cluster just because we're bored...)

Maybe open an issue to start the discussion on that?

(Antonio also said he was going to start thinking about metrics that the perf job could be measuring.)

nayihz (Contributor) commented Jan 10, 2024

  • change --nodeport-addresses behavior to default to "primary node IP(s) only" rather than "all node IPs".

Do we need to implement it only in nftables mode, or in all modes (iptables and ipvs too)?

And by the way, I think we should create a new issue to track each task. This issue is not a good place to discuss details.

npinaeva (Member) commented:

Will try to tackle "ensure unit test parity with iptables" and "change --nodeport-addresses behavior". (@nayihz, if you wanted to work on the latter, lmk and I will pick something else.)

nayihz (Contributor) commented Jan 10, 2024

(@nayihz if you wanted to work on the latter, lmk and I will pick something else)

Yes, I want to work on it after we confirm some details.

aroradaman (Member) commented Jan 10, 2024

Picking up "reject connections on invalid ports of service IPs".

Instead of directly linking the services chain to the per-service-port chains (using verdict maps), can we create a new chain (service-ports) and jump from the services chain to it when daddr matches any {cluster|loadbalancer|external} IP? We can then add a rule matching on the service-port verdict map in this new chain, and simply reject if nothing matches.

# create a set of service IPs (cluster + loadbalancer + external)
add set ip kube-proxy service-ips { type ipv4_addr }

# populate the service-ips set
add element ip kube-proxy service-ips { 172.30.0.45, 5.6.7.8 }

# create a map for ServicePorts (ip . protocol . port)
add map ip kube-proxy service-ports { type ipv4_addr . inet_proto . inet_service : verdict }

# populate the service-ports map
add element ip kube-proxy service-ports { 172.30.0.45 . tcp . 80 : goto service-HVFWP5L3-ns5/svc5/tcp/p80 }
add element ip kube-proxy service-ports { 5.6.7.8 . tcp . 80 : goto external-HVFWP5L3-ns5/svc5/tcp/p80 }

# define the service-ports chain
add chain ip kube-proxy service-ports

# match on the service-ports map
add rule ip kube-proxy service-ports ip daddr . meta l4proto . th dport vmap @service-ports

# reject everything else
add rule ip kube-proxy service-ports reject

# link the services chain to the service-ports chain
add rule ip kube-proxy services ip daddr @service-ips jump service-ports

[EDIT]

We can also add a set of ServiceCIDRs and add a rule to jump to this service-ports chain later.

danwinship (Contributor Author) commented:

Do we need to implement it only in nftables mode or in all modes(both in iptables and ipvs)?

We're only changing the behavior of nftables mode. All of the todo items here are for nftables mode only, except for the "Additional metrics for iptables mode" section.

And by the way, I think we should create new issue to track every task. This issue is not a good place to discuss details.

Yes, people should comment here if they want to claim an item, but it's best to split out discussion. You can either open an issue for the specific item, or else just file a Draft PR with some initial work and we can discuss there. (And I'll update the list at the top to link to other issues/PRs as people file them.)

uablrek (Contributor) commented Jan 10, 2024

I will look into making it possible to run multiple instances of kube-proxy. Not the most urgent, but fun 😄

It can be tested as a PoC with Multus, independent of K8s multi-network, but will be really useful with it. Load balancing is not yet considered in K8s multi-network, but I think it must be at some point.

Kpng already supports multiple instances, so this is not impossible at all.

uablrek (Contributor) commented Jan 12, 2024

An update on nft versions:

On KinD, nft v0.9.8 segfaults. On a KinD node:

apt update
apt install -y nftables
nft list ruleset
Segmentation fault (core dumped)

Stack trace:

Core was generated by `nft list ruleset'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007f3691345743 in set_make_key (attr=attr@entry=0x558676fe2f1c)
    at netlink.c:767
#2  0x00007f3691347bba in netlink_delinearize_set (
    ctx=ctx@entry=0x7ffdc4c346a0, nls=0x558676fed450) at netlink.c:841
#3  0x00007f3691348253 in list_set_cb (nls=<optimized out>, arg=0x7ffdc4c346a0)
    at netlink.c:978
#4  0x00007f36910d91d5 in nftnl_set_list_foreach ()
   from /usr/lib/x86_64-linux-gnu/libnftnl.so.11
#5  0x00007f36913482b8 in netlink_list_sets (ctx=ctx@entry=0x7ffdc4c346a0, 
    h=h@entry=0x558676fe1410) at netlink.c:999
#6  0x00007f369132e67d in cache_init_objects (flags=536871039, 
    ctx=0x7ffdc4c346a0) at rule.c:170
#7  cache_init (flags=536871039, ctx=0x7ffdc4c346a0) at rule.c:235
#8  cache_update (nft=nft@entry=0x558676fd72a0, flags=<optimized out>, 
    msgs=msgs@entry=0x7ffdc4c35660) at rule.c:291
#9  0x00007f369135cf12 in nft_evaluate (nft=nft@entry=0x558676fd72a0, 
    msgs=msgs@entry=0x7ffdc4c35660, cmds=cmds@entry=0x7ffdc4c35670)
    at libnftables.c:420
#10 0x00007f369135d8eb in nft_run_cmd_from_buffer (nft=0x558676fd72a0, 
    buf=0x558676fd7b30 "list ruleset") at libnftables.c:465
#11 0x0000558676fa58c6 in ?? ()
#12 0x00007f3691124d0a in __libc_start_main (main=0x558676fa5290, argc=3, 
    argv=0x7ffdc4c35818, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffdc4c35808)
    at ../csu/libc-start.c:308
#13 0x0000558676fa5a8a in ?? ()

For the versions built from the archives, it seems to be:

nftables v0.9.2 (Scram)                  OK
nftables v0.9.3 (Topsy)                  OK
nftables v0.9.4 (Jive at Five)           Segmentation fault
nftables v0.9.5 (Capital Idea)           Segmentation fault
nftables v0.9.6 (Capital Idea #2)        Segmentation fault
nftables v0.9.7 (Anyface)                Segmentation fault
nftables v0.9.8 (E.D.S.)                 Segmentation fault
nftables v0.9.9 (Prudence Pimpleton)     OK
nftables v1.0.0 (Fearless Fosdick #2)    OK
nftables v1.0.1 (Fearless Fosdick #3)    OK
nftables v1.0.2 (Lester Gooch)           OK
nftables v1.0.3 (Lester Gooch #2)        OK
nftables v1.0.4 (Lester Gooch #3)        OK
nftables v1.0.5 (Lester Gooch #4)        OK
nftables v1.0.6 (Lester Gooch #5)        OK
nftables v1.0.7 (Old Doc Yak)            OK
nftables v1.0.8 (Old Doc Yak #2)         OK
nftables v1.0.9 (Old Doc Yak #3)         OK

aroradaman (Member) commented Jan 12, 2024

registry.k8s.io/kube-proxy:v1.29.0 was built with v1.0.6 (Lester Gooch #5).

I exec'd into the kube-proxy pod to run nft in a kind cluster 😅

danwinship (Contributor Author) commented:

Moving the versioning discussion to #122743.

uablrek (Contributor) commented Jan 16, 2024

Multiple instances of kube-proxy in #122814

aroradaman (Member) commented May 7, 2024

@danwinship Is there a reason why we have a check on daddr, port, and proto for the individual service mark-for-masquerade jump?
Packets only reach this chain when they have already matched on daddr, port, and proto.

chain service-JKB34KST-default/test/tcp/http {
	ip daddr 10.96.83.93 tcp dport 80 ip saddr != 10.244.0.0/16 jump mark-for-masquerade
	numgen random ...
}

# maybe this can be simplified to

chain service-JKB34KST-default/test/tcp/http {
	ip saddr != 10.244.0.0/16 jump mark-for-masquerade
	numgen random ...
}

aroradaman (Member) commented:

maybe reuse the knftables.Transaction.operation arrays somehow?

How about a Reset method on transactions? We could call tx.Reset() in the sync loop.
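
Something like this, as a minimal sketch, assuming Transaction stores its queued operations in an internal slice (the struct shown is a simplified, illustrative shape, not the actual knftables internals):

// Transaction, simplified for illustration: it queues operations until Run.
type Transaction struct {
	operations []operation // queued add/flush/delete operations
	err        error       // first error recorded while building
}

// Reset empties the transaction so it can be reused for the next sync,
// keeping the backing array so it isn't reallocated every loop.
func (tx *Transaction) Reset() {
	tx.operations = tx.operations[:0] // drop contents, keep capacity
	tx.err = nil                      // clear any recorded error
}

The proxier would then hold one long-lived Transaction and call tx.Reset() at the top of each sync instead of allocating a fresh one.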
