Bug 1896958: NetworkPolicy performance (pod caching) #226
Conversation
@danwinship: This pull request references Bugzilla bug 1896958, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
```diff
 	np.lock.Lock()
 	defer np.lock.Unlock()

 	delete(np.pods, pod.UID)
-	np.refreshNetworkPolicies(refreshForPods)
+	np.refreshPodNetworkPolicies(pod)
```
Does this need to be synchronous?
do you have any more specific concern than that?
Note that this is necessarily completely asynchronous with respect to CNI pod creation/deletion anyway
Right, but it blocks the pod informer from processing deltas.
oh, I see, refreshPodNetworkPolicies is already asynchronous - didn't quite grok it.
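To make the point above concrete, here is a minimal, illustrative sketch (not the PR's actual code; `fakePlugin` and the channel-based signal are stand-ins for the real plugin, which uses a BoundedFrequencyRunner) of why a handler in this shape does not block informer delta processing:

```go
package npsketch

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// fakePlugin stands in for the real plugin; the real code uses a
// BoundedFrequencyRunner rather than a raw channel to trigger resyncs.
type fakePlugin struct {
	lock   sync.Mutex
	pods   map[types.UID]*corev1.Pod
	resync chan struct{} // buffered, capacity 1
}

// handleDeletePod only updates bookkeeping and sends a non-blocking
// "resync needed" signal; the expensive policy recalculation happens on a
// separate goroutine, so the informer's event loop returns immediately.
func (np *fakePlugin) handleDeletePod(pod *corev1.Pod) {
	np.lock.Lock()
	delete(np.pods, pod.UID)
	np.lock.Unlock()

	select {
	case np.resync <- struct{}{}: // wake the sync goroutine
	default: // a resync is already pending; nothing more to do
	}
}
```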
Looks really good. I have a few nitpicks, but nothing really relevant given the urgency.
```go
		ips = append(ips, pod.Status.PodIP)
	}

	pods, err := np.node.kubeInformers.Core().V1().Pods().Lister().Pods(npns.name).List(sel)
```
Huh, this was bad. I wonder if the namespaceIndexer alone might reduce the processing time enough to prevent the memory increase, or at least make it significantly better.
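For reference, a small self-contained sketch of this kind of cache-only, per-namespace lookup (function and variable names here are illustrative, not the PR's): the generated pod lister serves `Pods(namespace).List(selector)` straight from the shared informer's store, using the factory's default namespace index, so no apiserver round trip is involved.

```go
package npsketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// podIPsForSelector returns the IPs of pods in one namespace that match a
// NetworkPolicy-style label selector, reading only from the informer cache.
func podIPsForSelector(podLister corev1listers.PodLister, namespace string, ls *metav1.LabelSelector) ([]string, error) {
	sel, err := metav1.LabelSelectorAsSelector(ls)
	if err != nil {
		return nil, err
	}
	pods, err := podLister.Pods(namespace).List(sel)
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, pod := range pods {
		if pod.Status.PodIP != "" {
			ips = append(ips, pod.Status.PodIP)
		}
	}
	return ips, nil
}
```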
Rather than invoking the informer handlers directly, use a fake client and actually create/delete objects and let the informers be invoked normally. (In preparation for making use of the informer caches from the handlers.) Additionally, use a dummied-out BoundedFrequencyRunner to verify that syncs occur as expected.
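A hedged sketch of that testing approach (the test, package, and handler names below are illustrative, not the PR's actual test code): create objects through a fake clientset, let real shared informers deliver the events, and assert that the handler fired.

```go
package npsketch_test

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func TestPodEventsViaFakeClient(t *testing.T) {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	seen := make(chan string, 10)
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { seen <- obj.(*corev1.Pod).Name },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	// Creating through the fake client drives the informer just like a real
	// apiserver would, so the handler sees a normal Add event.
	_, err := client.CoreV1().Pods("one").Create(context.TODO(),
		&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "a", Namespace: "one"}},
		metav1.CreateOptions{})
	if err != nil {
		t.Fatal(err)
	}

	select {
	case name := <-seen:
		if name != "a" {
			t.Fatalf("unexpected pod %q", name)
		}
	case <-time.After(time.Second):
		t.Fatal("informer never delivered the Add event")
	}
}
```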
In particular, we were previously copying all of the pods rather than just keeping pointers to the objects in the cache (probably a leftover from very old pre-shared-informer code). This may also fix leaks when pods are deleted and recreated, since informers apparently compress events based on namespace+name, not UID, so a delete+recreate would be compressed to an update, and we'd never get a delete for the old UID.
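A minimal sketch of what keeping UID-keyed pointers can look like, including handling for the compressed delete+recreate case described above (type and field names are assumptions, not the PR's exact code):

```go
package npsketch

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// podCache holds pointers into the shared informer cache, keyed by UID, so
// nothing is copied per pod.
type podCache struct {
	lock sync.Mutex
	pods map[types.UID]*corev1.Pod
}

// updatePod treats an Update whose UID changed as a delete of the old pod
// plus an add of the new one, since a quick delete+recreate can reach the
// handler as a single Update event.
func (c *podCache) updatePod(old, cur *corev1.Pod) {
	c.lock.Lock()
	defer c.lock.Unlock()

	if old != nil && old.UID != cur.UID {
		delete(c.pods, old.UID)
	}
	// Store the informer's object as-is; it is shared, so it must never be
	// mutated by this code.
	c.pods[cur.UID] = cur
}
```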
When syncing multiple namespaces, do them all in a single OVS transaction rather than a transaction per namespace
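A rough sketch of the batching idea (the `ovsTransaction` interface below is a stand-in for the repo's OVS transaction helper, and the other names are illustrative): queue the flows for every namespace that needs resyncing, then commit once.

```go
package npsketch

// ovsTransaction is a stand-in for the repo's OVS transaction helper; only
// the two methods used here are sketched.
type ovsTransaction interface {
	AddFlow(flow string, args ...interface{})
	Commit() error
}

// namespaceFlows pairs a namespace with the OVS flows computed for it.
type namespaceFlows struct {
	name  string
	flows []string
}

// syncChangedNamespaces pushes the flows for every changed namespace in a
// single transaction, so OVS sees one commit instead of one per namespace.
func syncChangedNamespaces(otx ovsTransaction, changed []namespaceFlows) error {
	for _, ns := range changed {
		for _, flow := range ns.flows {
			otx.AddFlow(flow)
		}
	}
	return otx.Commit()
}
```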
…cking

In large clusters, recalculating networkpolicies after pod/namespace changes may take a lot of effort. Additionally, in some cases we may end up unnecessarily recalculating multiple times before pushing changes to OVS. Fix this by moving the recalculation step into the BoundedFrequencyRunner's thread, doing it just before we push the updates to OVS.
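For context, a hedged sketch of how a BoundedFrequencyRunner (from k8s.io/kubernetes/pkg/util/async, which openshift-sdn vendors) is typically wired up; the struct and sync body below are illustrative. Event handlers call Run(), and the runner invokes the sync function on its own goroutine no more often than the configured minimum interval, which is where the recalculation and the OVS push now happen.

```go
package npsketch

import (
	"time"

	utilwait "k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/kubernetes/pkg/util/async"
)

type policySyncer struct {
	runner *async.BoundedFrequencyRunner
}

func newPolicySyncer() *policySyncer {
	s := &policySyncer{}
	// Sync at most once per second, and at least every 30 seconds even if
	// nothing calls Run(); allow one burst run.
	s.runner = async.NewBoundedFrequencyRunner("networkpolicy-sync", s.sync, time.Second, 30*time.Second, 1)
	go s.runner.Loop(utilwait.NeverStop)
	return s
}

// sync runs on the runner's goroutine: recalculate whatever the event
// handlers marked dirty, then push the result to OVS.
func (s *policySyncer) sync() {
	// recalculation + OVS update would go here
}

// podChanged is what an informer event handler would call: it never does the
// expensive work itself, it just asks the runner for a rate-limited sync.
func (s *policySyncer) podChanged() {
	s.runner.Run()
}
```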
Force-pushed from f025d84 to 7aae913.
/test help
CI is a dumpster fire but I brought up a cluster with this PR and ran
@danwinship: This pull request references Bugzilla bug 1896958, which is valid. 3 validation(s) were run on this bug
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: danwinship, juanluisvaladas. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
```diff
@@ -65,7 +68,8 @@ type npNamespace struct {
 type npPolicy struct {
 	policy networkingv1.NetworkPolicy
 	watchesNamespaces bool
-	watchesPods bool
+	watchesAllPods bool
```
Want to leave a docblock indicating what the watchesFoo variables mean?
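One possible shape for such a docblock, inferred from how the flags appear to be used in this PR; the exact wording and semantics would need to be checked against the actual code:

```go
package npsketch

import networkingv1 "k8s.io/api/networking/v1"

// npPolicy tracks one NetworkPolicy plus flags describing which kinds of
// cluster changes can invalidate its computed flows.
type npPolicy struct {
	policy networkingv1.NetworkPolicy

	// watchesNamespaces: the policy has namespaceSelector-based peers, so it
	// must be recalculated when namespaces are added, deleted, or relabeled.
	watchesNamespaces bool
	// watchesAllPods: the policy selects peer pods in other namespaces, so a
	// pod change anywhere in the cluster can require recalculation.
	watchesAllPods bool
}
```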
```diff
 		}
 	}
-	if changed && npns.inUse {
+	if npns.mustRecalculate && npns.inUse {
 		np.syncNamespace(npns)
```
Does it make sense to just set npns.mustSync on all relevant namespaces, and trigger the runner once? I don't see much advantage to "splitting out" the transaction.
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
10 similar comments
@danwinship: All pull requests linked via external trackers have merged: Bugzilla bug 1896958 has been moved to the MODIFIED state.
/cherry-pick 4.6
@cuppett: cannot checkout 4.6: error checking out 4.6: exit status 1. output: error: pathspec '4.6' did not match any file(s) known to git
/cherry-pick release-4.6
/cherry-pick release-4.5
@danwinship: new pull request created: #228
@danwinship: new pull request created: #229
Trying to improve NetworkPolicy performance/memory usage in large clusters with lots of policies and lots of changes...
/cc @squeed @juanluisvaladas @JacobTanenbaum