Minimizing iptables-restore input size #3454
Conversation
Force-pushed from 651039c to a9796f6 (compare)
This should be a lock for 1.26, let's just commit to metrics and so forth.
that would exist in larger clusters (ie, the rule sets that would
benefit the most from the partial restore feature).

One possible approach for dealing with that would be to run the …
I actually really like this idea. Could we generate both the dump and the patch, apply the patch, then read it back and ensure that it matches the full dump? If it fails, set a metric and fall back on full-dump for the future. It would be heavyweight but we only need to do it when the gate is on.
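For concreteness, here is a rough sketch of what such a check could look like at the shell level; the file names are hypothetical and the fallback handling is hand-waved, so this is an illustration of the idea rather than a proposed implementation:

```
# Hypothetical verification pass; file names are invented for illustration.
# full-dump.txt - the complete ruleset a full sync would have generated
# partial.txt   - the minimized input actually fed to iptables-restore
iptables-restore --noflush < partial.txt
iptables-save -t nat > live.txt
# A literal diff is naive: ordering, counters, and formatting differ, so a
# real check would have to normalize both sides before comparing them.
if ! diff -u full-dump.txt live.txt >/dev/null; then
    echo "partial restore diverged from the full dump; fall back to full syncs"
fi
```

The normalization step is where most of the effort would go, which is essentially the complexity concern raised later in this thread.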
@danwinship Can we pin down the plan and convert words from "possible" to "will"?
Having tried to figure out how I would implement this, I'm now pretty convinced this is a bad idea. The "checking" code would be much, much more complicated than the code that it is checking, meaning that if it ever triggered, it would most likely indicate a bug in the checking code, not an actual bug in the syncing code. (And recent events (kubernetes/kubernetes#112477) show that iptables upstream does not consider `iptables-save` output to be "stable-API-like", so there's yet another source of flakes if we're trying to parse that.)
Also, we have no reason to believe that this failure mode will actually occur. I was just trying to come up with all of the ways that the code could fail, and this is one of them. But it's not an especially plausible one. Much more likely is that if there was a bug, it would involve one of the existing code branches in `syncProxyRules` (eg, we don't sync correctly if only the firewall rules change), and it would be caught very quickly.
and if we find that it is happening (in e2e tests or real clusters),
we can then debug or revert the bad code.

#### Subtle Synchronization Delays
Any interest in changing the settings for bounded frequency runner? Or updating it to account for the fact that many syncs are expected to be much cheaper?
## Drawbacks

Assuming the code is not buggy, there are no drawbacks. The new code
would be strictly superior to the old code.
I don't think this should block the KEP at all, but if someone is touching these rules outside of kube-proxy we're more likely to clobber and correct them frequently without this change.
@danwinship have you talked with Phil Sutter about this?
Force-pushed from a9796f6 to 2df8697 (compare)
It does not do any diff. It fetches all of the existing chains, merges the provided data with the existing data, and then uploads all of the updated data.
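To make the `--noflush` behavior concrete, here is a minimal sketch of a partial restore input (the chain names, addresses, and ports are invented; this is not kube-proxy's actual output). Chains declared in the input are flushed and rewritten, while chains that are not mentioned are left alone, even though, as noted above, the restore still operates on the whole table internally:

```
# Sketch of a partial update under --noflush (illustrative names only):
# only the chains declared below are replaced; other KUBE-* chains and any
# non-kube chains in the nat table are left untouched.
iptables-restore --noflush <<'EOF'
*nat
:KUBE-SVC-EXAMPLE - [0:0]
:KUBE-SEP-EXAMPLE - [0:0]
-A KUBE-SVC-EXAMPLE -m comment --comment "example service" -j KUBE-SEP-EXAMPLE
-A KUBE-SEP-EXAMPLE -p tcp -j DNAT --to-destination 10.0.0.1:8080
COMMIT
EOF
```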
LGTM, great reading
@danwinship Can you close out the last comment threads so we can merge?
Force-pushed from 2df8697 to 0c621e0 (compare)
`iptables-restore`. (The `KUBE-SERVICES`, `KUBE-EXTERNAL-SERVICES`,
and `KUBE-NODEPORTS` chains are written in their entirety on every
sync.)
is there a situation where one chain update will put the system in an unpredictable state (even for a very short while)? e.g., chain `KUBE-SVC-*` written out to iptables but the rest of the chains are yet to be written?

if so (and I fully understand how much effort it entails, I just want to make sure that we have covered all options) wouldn't it be safer to rearrange the rules in a way that one service update can - only - update a service-specific chain per table? That way we don't rely on how clever the code is but rather on the structure of the rules themselves.
I don't think we can do that without multiple writes (or some nasty pre-allocation scheme). At the root of the tree is the `KUBE-SERVICES` chain, which has a list of conditions that mean "this packet is service X". For every service add/remove we need to add to that chain.
> e.g., chain `KUBE-SVC-*` written out to iptables but the rest of the chains are yet to be written?

Each `iptables-restore` is applied atomically. So the only way things would get out of sync would be if we specifically wrote out a set of out-of-sync rules.

> wouldn't it be safer to rearrange the rules in a way that one service update can - only - update a service-specific chain per table? that way we don't rely on how clever the code is but rather on the structure of the rules themselves

Using `iptables-restore` to load the rules essentially requires that there be a chain like `KUBE-SERVICES` that contains all the services. (And not using `iptables-restore` would absolutely destroy our performance.)
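For illustration, a hedged sketch of the structure being described (the cluster IPs, ports, endpoint address, and chain suffixes are made up; real `KUBE-SVC-*`/`KUBE-SEP-*` names are hashed): the top-level `KUBE-SERVICES` chain holds one dispatch rule per service, so it has to be rewritten whenever any service is added or removed, while each per-service chain only needs to appear in the restore input when that service actually changes.

```
# Illustrative iptables-save-style fragment of the nat table (invented names):
# KUBE-SERVICES dispatches each cluster IP to its per-service chain, which in
# turn jumps to a per-endpoint chain that does the DNAT.
*nat
:KUBE-SERVICES - [0:0]
:KUBE-SVC-AAAA - [0:0]
:KUBE-SEP-AAAA - [0:0]
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp --dport 80 -m comment --comment "ns/svc-a cluster IP" -j KUBE-SVC-AAAA
-A KUBE-SVC-AAAA -m comment --comment "ns/svc-a -> 10.244.1.5:8080" -j KUBE-SEP-AAAA
-A KUBE-SEP-AAAA -p tcp -j DNAT --to-destination 10.244.1.5:8080
COMMIT
```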
Thanks! /lgtm

Maybe - I just added it - item number 42: @thockin - FYI

Hello @wojtek-t @danwinship, The enhancement tracking board is for the enhancement tracking issues rather than the PRs. In this case, we'll want to get #3453 properly opted in by adding the …

Thanks! /lgtm

Needs approval from @wojtek-t I guess

@danwinship - can you please take a look at the two remaining comments by me here?
Force-pushed from 05eee79 to 80206a2 (compare)
Still
Force-pushed from 80206a2 to 228aac0 (compare)
Repushed with an UNRESOLVED section about the "partial sync is different from full sync" check, and an associated update to the Beta graduation criteria.
/lgtm
/approve PRR
I'm approving it because I want to make it ready for feature freeze and I don't want to block alpha on our discussions.
But I would like to continue this discussion somewhere.
@danwinship
I'm holding it for now, but feel free to redirect this discussion and me elsewhere and unhold this PR.
Additionally, kube-proxy will always do a full resync when there are
topology-related changes to Node labels, and it will always do a full
resync at least once every `iptablesSyncPeriod`.
What if we said that we implement such comparison code but keep it disabled by default (to avoid confusing users), and enable it in our tests?
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, thockin, wojtek-t
/sig network
/assign @thockin @aojea