Bug 1727441: kube-proxy periodic iptables reloads are extremely disruptive in large clusters #23872
Conversation
@danwinship: This pull request references Bugzilla bug 1727441, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.
/retest
/hold
f39e9cb to c7ec8e1
@danwinship what is the next step for this PR or bug?
kubernetes/kubernetes#83498 and kubernetes-sigs/kind#947 appear to be the follow-on work that should also be included. @danwinship is that correct? And will you have time to include these in the backport? Or should we try to find someone else to pick this up?
@danwinship If the above PR includes that, can you pull the hold please?
83498 only applies to kube-proxy and so is needed in the sdn PR but not this one. The kind bug is not related. This PR is ready to go in. Sorry, I had updated it and then waited to see if the tests passed before commenting, but then never commented.
@danwinship, @tbielawa doesn't think this PR is within his domain and Seth is on vacation AFAICT. Is there anyone else who can review and approve this?
/lgtm
Thanks Dan
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: danwinship, knobunc
/retest Please review the full test history for this PR and help us cut down flakes.
@danwinship: All pull requests linked via external trackers have merged. Bugzilla bug 1727441 has been moved to the MODIFIED state.
kubelet and kube-proxy periodically resync all of their rules to iptables, even when nothing has changed, to ensure that things keep working even if the user runs `systemctl restart iptables` and accidentally deletes all of kube's rules. In large clusters, this gets very disruptive and interferes with "real" iptables updates, sometimes causing service updates to take many minutes to appear.

Part of the problem was RHEL 7-specific (we were telling kube-proxy "wait 5 seconds to get the iptables lock" but telling all other iptables users "wait forever to get the iptables lock", meaning that when there was lots of lock contention, kube-proxy would normally end up being the loser). This was fixed upstream by kubernetes/kubernetes#80368 and kubernetes/kubernetes#82602, which are in 1.16 and so already in origin 4.3.
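For illustration, the lock asymmetry looked roughly like this (a sketch of the iptables `-w` lock-wait flag; the rule shown is a placeholder, not kube's actual invocation):

```sh
# Most iptables callers on RHEL 7 waited indefinitely for the xtables lock:
iptables -w -t nat -A POSTROUTING -j RETURN

# kube-proxy waited at most 5 seconds, so under heavy lock contention it
# was usually the caller that timed out and failed:
iptables -w 5 -t nat -A POSTROUTING -j RETURN
```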
The rest of the problem was fixed upstream just-post-1.16 by kubernetes/kubernetes#81517, by changing kubelet and kube-proxy to only reload their rules if the old ones were actually deleted.
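A minimal sketch of that canary-chain idea (the chain name, table, interval, and `resync_all_rules` helper are hypothetical stand-ins, not the actual upstream code, which lives in kube's Go iptables utilities):

```sh
# Create a canary chain once, alongside the real rules:
iptables -w -t nat -N KUBE-CANARY

# Afterwards, poll only the canary; do a full resync only if it has
# disappeared, i.e. someone actually flushed the rules:
while sleep 10; do
    if ! iptables -w -t nat -nL KUBE-CANARY >/dev/null 2>&1; then
        echo "canary chain deleted; iptables was flushed, resyncing"
        iptables -w -t nat -N KUBE-CANARY
        resync_all_rules    # hypothetical helper standing in for the full reload
    fi
done
```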
In OCP 4.3, the fix is to backport 81517 to origin (this PR, for kubelet and openshift-tests) and sdn (openshift/sdn#39, for kube-proxy and openshift-sdn's internal usage). (The origin and sdn PRs complement each other, but are independent, and can merge in either order.)
Then for 4.2 we'll need all of that plus 80368 and 82602; for 4.1 and 3.11, the same thing but entirely in origin.