Bug 2005480: [4.8z] Remove waiting for namespace and namespace lock contention #760
Conversation
@trozet: This pull request references Bugzilla bug 2005480, which is invalid.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
First 3 commits are part of #743 and this change depends on them merging first.
/hold
/retest
The pod handlers and egressgw code would wait for the namespace add event before being able to actually create things. This wait could last up to 10 seconds before failure, and possibly even longer depending on how backed up the namespace handler is. This patch removes waiting where it is unnecessary. Those cases are the following:
- Pod add only cares about the address sets for the namespace being available. We already create the address set if we detect it is nil during pod add, so there is nothing to wait on from the namespace event.
- Egress GW waits for the namespace to see what gateway routes its annotation may carry. There is no reason to wait: we can add the routes we already know about, and let the namespace add event cover any routes that were missing when it arrives.
- Multicast enablement. This is configured via the namespace annotation. If it is missing because we get the pod add event first, the namespace add event will cover enabling multicast for all pods in the namespace.

The patch does not cover one case of waiting for the namespace lock:
- Network policy ACL logging. The ACL logging levels for deny and allow are configured in the namespace. If the network policy does not wait and the namespace event comes later, we have no way to iterate through all affected policies for that namespace and set their correct logging levels.

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit ea5a757)
nsInfo is one of the two main sources of contention during heavy pod add operations. Moving it to an RW mutex greatly improves performance because in most places we only need an RLock on nsInfo, which allows the multiple parallel pod handlers to be blocked less often.

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 98e02b2)
Egress firewall still takes the namespace lock in 4.8z; those calls were updated to acquire the lock appropriately. In pods.go, the exgw calls and hybrid overlay exgw calls also need to take the lock. For hybrid overlay exgw we still use waitForNamespace, because we cannot update a pod's routes from the namespace watcher, so we must wait for the namespace in the pod handler.

Signed-off-by: Tim Rozet <trozet@redhat.com>
/hold cancel
@@ -179,12 +179,12 @@ func (oc *Controller) syncEgressFirewall(egressFirwalls []interface{}) {

 func (oc *Controller) addEgressFirewall(egressFirewall *egressfirewallapi.EgressFirewall) error {
 	klog.Infof("Adding egressFirewall %s in namespace %s", egressFirewall.Name, egressFirewall.Namespace)
-	nsInfo, err := oc.waitForNamespaceLocked(egressFirewall.Namespace)
+	nsInfo, nsUnlock, err := oc.ensureNamespaceLocked(egressFirewall.Namespace, false)
@JacobTanenbaum PTAL at all of the changes in this commit
@trozet looks good to me for the changes to egressFirewall
@dcbw PTAL at the CARRY commit
/retest
/retest-required
/bugzilla refresh
/retest
/bugzilla refresh
/bugzilla refresh
/bugzilla refresh
/bugzilla refresh
/retest-required Please review the full test history for this PR and help us cut down flakes.
19 similar comments
/hold Let's give the CI system a break while someone looks into these test failures.
/test e2e-vsphere-ovn |
@trozet: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
/hold cancel
@trozet: All pull requests linked via external trackers have merged: Bugzilla bug 2005480 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Removes waiting for the namespace add event to show up; namespace state is now ensured on demand when needed. Also changes nsInfo to use a RWMutex so that pod handlers can all take RLock simultaneously and are not blocked against each other when adding many pods in a single namespace.