Minimizing iptables-restore input size #3454

Conversation

@danwinship (Contributor, PR author)

/sig network
/assign @thockin @aojea

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 2, 2022
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 2, 2022
@danwinship danwinship force-pushed the kep-3453-minimize-iptables-restore branch from 651039c to a9796f6 Compare August 24, 2022 16:14
@thockin (Member) left a comment

This should be a lock for 1.26; let's just commit to metrics and so forth.

keps/sig-network/3453-minimize-iptables-restore/README.md (outdated; thread resolved)
that would exist in larger clusters (ie, the rule sets that would
benefit the most from the partial restore feature).

One possible approach for dealing with that would be to run the
Member

I actually really like this idea. Could we generate both the dump and the patch, apply the patch, then read it back and ensure that it matches the full dump? If it fails, set a metric and fall back on full-dump for the future. It would be heavyweight but we only need to do it when the gate is on.

Member

@danwinship Can we pin down the plan and convert words from "possible" to "will"?

@danwinship (Contributor, PR author) — Sep 21, 2022

Having tried to figure out how I would implement this, I'm now pretty convinced this is a bad idea. The "checking" code would be much much much more complicated than the code that it is checking, meaning that if it ever triggered, it would most likely indicate a bug in the checking code, not an actual bug in the syncing code. (And recent events (kubernetes/kubernetes#112477) show that iptables upstream does not consider iptables-save output to be "stable-API-like", so there's yet another source of flakes if we're trying to parse that.)

Also, we have no reason to believe that this failure mode will actually occur. I was just trying to come up with all of the ways that the code could fail, and this is one of them. But it's not an especially plausible one. Much more likely is that if there was a bug, it would involve one of the existing code branches in syncProxyRules (eg, we don't sync correctly if only the firewall rules change) and it would be caught very quickly.

keps/sig-network/3453-minimize-iptables-restore/README.md (outdated; thread resolved)
keps/sig-network/3453-minimize-iptables-restore/README.md (outdated; thread resolved)
and if we find that it is happening (in e2e tests or real clusters),
we can then debug or revert the bad code.

#### Subtle Synchronization Delays
Contributor

Any interest in changing the settings for the bounded frequency runner? Or updating it to account for the fact that many syncs are expected to be much cheaper?

keps/sig-network/3453-minimize-iptables-restore/README.md (outdated; thread resolved)
## Drawbacks

Assuming the code is not buggy, there are no drawbacks. The new code
would be strictly superior to the old code.
Member

I don't think this should block the KEP at all, but if someone is touching these rules outside of kube-proxy we're more likely to clobber and correct them frequently without this change.

@aojea (Member) commented on Sep 11, 2022

@danwinship Have you talked with Phil Sutter about this?
Isn't iptables-restore already doing some kind of caching and diff too?

https://developers.redhat.com/blog/2020/04/27/optimizing-iptables-nft-large-ruleset-performance-in-user-space#max_out_the_receive_buffer

@danwinship danwinship force-pushed the kep-3453-minimize-iptables-restore branch from a9796f6 to 2df8697 Compare September 12, 2022 20:30
@danwinship (Contributor, PR author)

> Isn't iptables-restore already doing some kind of caching and diff too?

It does not do any diff. It fetches all of the existing chains, merges the provided data with the existing data, and then uploads all of the updated data.

iptables-legacy can't do any better than that (given the APIs available to it). In theory, iptables-nft could do something more clever, but that would be a bunch of work, and the iptables command-line tools are deprecated and not really being improved much any more.
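
As background for readers, the partial-sync approach in this KEP leans on the `iptables-restore --noflush` semantics kube-proxy has always relied on: every chain that is declared in the restore input is flushed and rewritten from that input, chains that are not mentioned are left untouched, and the whole update is applied atomically at `COMMIT`. A minimal, illustrative restore input (the service name, chain hashes, and IPs here are invented for this sketch, not taken from the KEP) might look like:

```
*nat
:KUBE-SERVICES - [0:0]
:KUBE-SVC-EXAMPLEAAAAAAAAAA - [0:0]
:KUBE-SEP-EXAMPLEBBBBBBBBBB - [0:0]
-A KUBE-SERVICES -m comment --comment "default/example:http cluster IP" -p tcp -d 10.96.0.10/32 --dport 80 -j KUBE-SVC-EXAMPLEAAAAAAAAAA
-A KUBE-SVC-EXAMPLEAAAAAAAAAA -m comment --comment "default/example:http -> 10.244.1.5:8080" -j KUBE-SEP-EXAMPLEBBBBBBBBBB
-A KUBE-SEP-EXAMPLEBBBBBBBBBB -m comment --comment "default/example:http" -p tcp -j DNAT --to-destination 10.244.1.5:8080
COMMIT
```

Applied with `--noflush`, a sketch like this replaces only the three declared chains; every other KUBE-* chain already present in the nat table keeps its current rules, which is what makes omitting unchanged per-service chains from the input both safe and cheap.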

@aojea (Member) commented on Sep 14, 2022

LGTM

great reading

@thockin (Member) left a comment

@danwinship Can you close out the last comment threads so we can merge?

@thockin thockin added stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status lead-opted-in Denotes that an issue has been opted in to a release labels Sep 21, 2022
@danwinship danwinship force-pushed the kep-3453-minimize-iptables-restore branch from 2df8697 to 0c621e0 Compare September 23, 2022 13:54
`iptables-restore`. (The `KUBE-SERVICES`, `KUBE-EXTERNAL-SERVICES`,
and `KUBE-NODEPORTS` chains are written in their entirety on every
sync.)

Contributor

Is there a situation where one chain update will put the system in an unpredictable state (even for a very short while)? E.g., a KUBE-SVC-* chain written out to iptables while the rest of the chains are yet to be written?

If so (and I fully understand how much effort it entails; I just want to make sure that we have covered all options), wouldn't it be safer to rearrange the rules so that one service update can only update a service-specific chain per table? That way we don't rely on how clever the code is, but rather on the structure of the rules themselves.

Member

I don't think we can do that without multiple writes (or some nasty pre-allocation scheme). At the root of the tree is the KUBE-SERVICES chain, which has a list of conditions that mean "this packet is for service X". For every service add/remove we need to add to that chain.

Contributor Author

> E.g., a KUBE-SVC-* chain written out to iptables while the rest of the chains are yet to be written?

Each iptables-restore is applied atomically. So the only way things would get out of sync would be if we specifically wrote out a set of out-of-sync rules.

> wouldn't it be safer to rearrange the rules so that one service update can only update a service-specific chain per table? That way we don't rely on how clever the code is, but rather on the structure of the rules themselves.

Using iptables-restore to load the rules essentially requires that there be a chain like KUBE-SERVICES that contains all the services. (And not using iptables-restore would absolutely destroy our performance.)
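
To make the structural point concrete: the rules form a tree whose root dispatch chains must name every service, which is why they cannot be updated one service at a time. A heavily simplified sketch of the nat-table layout (service names, chain hashes, and IPs are invented for illustration):

```
# Top-level dispatch chain: one match per service, rewritten in full on every sync.
-A KUBE-SERVICES -m comment --comment "default/frontend cluster IP" -p tcp -d 10.96.0.10/32 --dport 80 -j KUBE-SVC-FRONTENDAAAAAAAA
-A KUBE-SERVICES -m comment --comment "default/backend cluster IP" -p tcp -d 10.96.0.11/32 --dport 443 -j KUBE-SVC-BACKENDBBBBBBBBB
# Per-service and per-endpoint chains: only included in the restore input when that service changes.
-A KUBE-SVC-FRONTENDAAAAAAAA -j KUBE-SEP-FRONTENDEP1AAAAA
-A KUBE-SEP-FRONTENDEP1AAAAA -p tcp -j DNAT --to-destination 10.244.1.5:8080
```

Because every service must have an entry in one of the top-level chains (KUBE-SERVICES, KUBE-EXTERNAL-SERVICES, or KUBE-NODEPORTS), those chains are still written in their entirety on every sync; the savings come from omitting the unchanged KUBE-SVC-* and KUBE-SEP-* chains, which is where the bulk of the rules live in large clusters.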

@thockin (Member) commented on Sep 24, 2022

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2022
@wojtek-t wojtek-t assigned wojtek-t and unassigned thockin, aojea and wojtek-t Sep 27, 2022
@wojtek-t (Member)

> I don't see any obvious way to do that... I may not have permissions?

Maybe - I just added it - item number 42:
https://github.com/orgs/kubernetes/projects/98/views/1

@thockin - FYI

@rhockenbury

Hello @wojtek-t @danwinship,

The enhancement tracking board is for the enhancement tracking issues rather than the PRs. In this case, we'll want to get #3453 properly opted in by adding the lead-opted-in label. I'll go ahead and take care of that.

@rhockenbury rhockenbury removed stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Sep 27, 2022
@thockin (Member) commented on Oct 1, 2022

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 1, 2022
@thockin (Member) commented on Oct 2, 2022

Needs approve from @wojtek-t I guess

@wojtek-t (Member) commented on Oct 3, 2022

> Needs approve from @wojtek-t I guess

@danwinship - can you please take a look at the two remaining comments by me here?

@danwinship danwinship force-pushed the kep-3453-minimize-iptables-restore branch from 05eee79 to 80206a2 Compare October 3, 2022 19:21
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 3, 2022
@thockin (Member) commented on Oct 3, 2022

Still
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2022
@danwinship danwinship force-pushed the kep-3453-minimize-iptables-restore branch from 80206a2 to 228aac0 Compare October 4, 2022 13:25
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2022
@danwinship (Contributor, PR author)

Re-pushed with an UNRESOLVED section about the "partial sync is different from full sync" check, and an associated update to the Beta graduation criteria.

@wojtek-t (Member) left a comment

/lgtm
/approve PRR

I'm approving it because I want to make it ready for feature freeze and I don't want to block alpha on our discussions.
But I would like to continue this discussion somewhere.

@danwinship
I'm holding it for now, but feel free to redirect this discussion and me elsewhere and unhold this PR.


Additionally, kube-proxy will always do a full resync when there are
topology-related changes to Node labels, and it will always do a full
resync at least once every `iptablesSyncPeriod`.
Member

What if we implemented such comparison code but kept it disabled by default (to avoid confusing users), and enabled it in our tests?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2022
@k8s-ci-robot k8s-ci-robot merged commit 1f23d02 into kubernetes:master Oct 4, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Oct 4, 2022
@danwinship danwinship deleted the kep-3453-minimize-iptables-restore branch October 4, 2022 14:11