Bug 1896958: NetworkPolicy performance (pod caching) #226

Merged

Conversation

@danwinship (Contributor) commented Dec 1, 2020:

Trying to improve NetworkPolicy performance/memory usage in large clusters with lots of policies and lots of changes...

/cc @squeed @juanluisvaladas @JacobTanenbaum

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Dec 1, 2020
@openshift-ci-robot (Contributor) commented:

@danwinship: This pull request references Bugzilla bug 1896958, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

WIP Bug 1896958: NetworkPolicy performance (pod caching)


@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 1, 2020
```diff
 np.lock.Lock()
 defer np.lock.Unlock()

 delete(np.pods, pod.UID)
-np.refreshNetworkPolicies(refreshForPods)
+np.refreshPodNetworkPolicies(pod)
```
Contributor:

Does this need to be synchronous?

@danwinship (author):

do you have any more specific concern than that?

Note that this is necessarily completely asynchronous with respect to CNI pod creation/deletion anyway.

Contributor:

Right, but it blocks the pod informer from processing deltas.

Contributor:

oh, I see, refreshPodNetworkPolicies is already asynchronous - didn't quite grok it.
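
Stepping back, the pattern this thread converges on can be shown as a small self-contained sketch (stdlib only; names are illustrative, not the PR's actual code): the handler updates cached state under the lock and signals a background worker, so the informer goroutine never blocks on flow recalculation.

```go
package main

import "sync"

// npController is a stand-in for the plugin: handlers run on the informer
// goroutine and must stay cheap; refresh work happens on a separate worker.
type npController struct {
	lock    sync.Mutex
	pods    map[string]struct{} // keyed by UID in the real code
	refresh chan struct{}       // buffered: a pending signal coalesces bursts
}

func newNPController() *npController {
	c := &npController{
		pods:    make(map[string]struct{}),
		refresh: make(chan struct{}, 1),
	}
	go c.worker()
	return c
}

// handleDeletePod is synchronous but cheap: update the cache, poke the worker.
func (c *npController) handleDeletePod(uid string) {
	c.lock.Lock()
	delete(c.pods, uid)
	c.lock.Unlock()

	select {
	case c.refresh <- struct{}{}: // wake the worker
	default: // a refresh is already pending; coalesce
	}
}

// worker performs the expensive recalculation off the informer goroutine.
func (c *npController) worker() {
	for range c.refresh {
		// recalculate NetworkPolicy flows and push them to OVS here
	}
}

func main() {
	c := newNPController()
	c.handleDeletePod("example-uid")
}
```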

@juanluisvaladas (Contributor) left a comment:

Looks really good; I have a few nitpicks, but nothing really relevant given the urgency.

pkg/network/node/networkpolicy.go (resolved)
```go
	ips = append(ips, pod.Status.PodIP)
}

pods, err := np.node.kubeInformers.Core().V1().Pods().Lister().Pods(npns.name).List(sel)
```
Contributor:

Huh, this was bad. I wonder if the namespaceIndexer alone might reduce the processing time enough to prevent the memory increase, or at least make things significantly better.
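
For reference, here is the shape of that lister call as a self-contained sketch against the standard client-go API (the function name and wrapper are illustrative): Pods(namespace) consults the informer cache's namespace index, so the selector is evaluated only against that namespace's pods, and the results are shared cache pointers rather than copies.

```go
import (
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
)

func podIPsMatching(factory informers.SharedInformerFactory, namespace string, sel labels.Selector) ([]string, error) {
	// Pods(namespace) uses the cache's namespace index; List only filters
	// pods in that one namespace and returns pointers into the shared cache.
	pods, err := factory.Core().V1().Pods().Lister().Pods(namespace).List(sel)
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, pod := range pods {
		if pod.Status.PodIP != "" { // read-only: never mutate cache objects
			ips = append(ips, pod.Status.PodIP)
		}
	}
	return ips, nil
}
```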

pkg/network/node/networkpolicy.go (outdated, resolved)
Rather than invoking the informer handlers directly, use a fake client
and actually create/delete objects and let the informers be invoked
normally. (In preparation for making use of the informer caches from
the handlers.)

Additionally, use a dummied-out BoundedFrequencyRunner to verify that
syncs occur as expected.
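
A minimal sketch of that test strategy using standard client-go fakes (this is the general pattern, not the PR's actual test code, and it does not show the dummied-out BoundedFrequencyRunner): objects are created through a fake clientset, and the informer delivers the resulting events to the same handlers production uses.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// In the real tests these would be the plugin's own handlers.
		AddFunc:    func(obj interface{}) { fmt.Println("add delivered") },
		DeleteFunc: func(obj interface{}) { fmt.Println("delete delivered") },
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)

	// Create and delete through the client; handlers fire via the informer.
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "p", Namespace: "ns"}}
	client.CoreV1().Pods("ns").Create(context.TODO(), pod, metav1.CreateOptions{})
	client.CoreV1().Pods("ns").Delete(context.TODO(), "p", metav1.DeleteOptions{})
	time.Sleep(100 * time.Millisecond) // crude wait; real tests poll or use channels
}
```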
In particular, we were previously copying all of the pods rather than
just keeping pointers to the objects in the cache (probably a leftover
from very old pre-shared-informer code).

This may also fix leaks when pods are deleted and recreated, since
informers apparently compress events based on namespace+name, not UID,
so a delete+recreate would be compressed to an update, and we'd never
get a delete for the old UID.
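
A sketch of the two fixes this commit describes (field and method names are illustrative, not the PR's code): hold pointers into the informer's shared cache, key them by UID, and explicitly drop the old UID when an "update" is really a delete+recreate.

```go
import (
	corev1 "k8s.io/api/core/v1"
	ktypes "k8s.io/apimachinery/pkg/types"
)

type podCache struct {
	// Pointers into the informer cache: no copying, and never mutated.
	pods map[ktypes.UID]*corev1.Pod
}

func (c *podCache) handleUpdatePod(oldPod, newPod *corev1.Pod) {
	if oldPod.UID != newPod.UID {
		// Informers key events by namespace/name, so a delete+recreate can
		// arrive as a single update; without this, the old UID would leak.
		delete(c.pods, oldPod.UID)
	}
	c.pods[newPod.UID] = newPod // store the pointer, not a copy of the Pod
}
```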
When syncing multiple namespaces, do them all in a single OVS
transaction rather than a transaction per namespace.
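
A sketch of that batching. The Transaction interface below is a hypothetical stand-in for openshift-sdn's OVS transaction type in pkg/util/ovs (real method names may differ); the point is structural: one Commit covers every namespace instead of one per namespace.

```go
package main

// Transaction is a stand-in for an OVS flow transaction.
type Transaction interface {
	AddFlow(flow string)
	Commit() error
}

// syncNamespaces batches all namespaces' flow changes into one transaction,
// so OVS is invoked once rather than once per namespace.
func syncNamespaces(otx Transaction, flowsByNamespace map[string][]string) error {
	for _, flows := range flowsByNamespace {
		for _, flow := range flows {
			otx.AddFlow(flow)
		}
	}
	return otx.Commit() // a single OVS transaction for every namespace
}
```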
…cking

In large clusters, recalculating networkpolicies after pod/namespace
changes may take a lot of effort. Additionally, in some cases we may
end up unnecessarily recalculating multiple times before pushing
changes to OVS. Fix this by moving the recalculating step into the
BoundedFrequencyRunner's thread, doing it just before we push the
updates to OVS.
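
A sketch of that wiring, assuming the BoundedFrequencyRunner from k8s.io/kubernetes/pkg/util/async (the same utility kube-proxy uses); the interval values here are made up, not the PR's. Event handlers just mark state dirty and call runner.Run(); syncFlows executes on the runner's own goroutine, recalculating right before pushing to OVS.

```go
import (
	"time"

	"k8s.io/kubernetes/pkg/util/async"
)

func startSyncRunner(syncFlows func(), stopCh <-chan struct{}) *async.BoundedFrequencyRunner {
	// Run syncFlows at most once per second (coalescing bursts of Run()
	// calls) and at least once per hour even if nothing triggers it.
	runner := async.NewBoundedFrequencyRunner("networkpolicy-sync", syncFlows, time.Second, time.Hour, 1)
	go runner.Loop(stopCh)
	return runner
}
```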
@danwinship (author) commented:

/test help

@openshift-ci-robot (Contributor) commented:

@danwinship: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-multitenant
  • /test e2e-aws-upgrade
  • /test e2e-gcp
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps

Use /test all to run the following jobs:

  • pull-ci-openshift-sdn-master-e2e-aws
  • pull-ci-openshift-sdn-master-e2e-aws-upgrade
  • pull-ci-openshift-sdn-master-e2e-gcp
  • pull-ci-openshift-sdn-master-images
  • pull-ci-openshift-sdn-master-unit
  • pull-ci-openshift-sdn-master-verify
  • pull-ci-openshift-sdn-master-verify-deps

In response to this:

/test help


@danwinship danwinship changed the title WIP Bug 1896958: NetworkPolicy performance (pod caching) Bug 1896958: NetworkPolicy performance (pod caching) Dec 1, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2020
@danwinship (author) commented:

CI is a dumpster fire but I brought up a cluster with this PR and ran openshift-tests by hand. All the NetworkPolicy tests passed.

@openshift-ci-robot (Contributor) commented:

@danwinship: This pull request references Bugzilla bug 1896958, which is valid.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1896958: NetworkPolicy performance (pod caching)


@juanluisvaladas (Contributor) commented:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2020
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, juanluisvaladas

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@danwinship (author) commented:

/retest

```diff
@@ -65,7 +68,8 @@ type npNamespace struct {
 type npPolicy struct {
 	policy            networkingv1.NetworkPolicy
 	watchesNamespaces bool
 	watchesPods       bool
+	watchesAllPods    bool
```
Contributor:

Want to leave a docblock indicating what the watchesFoo variables mean?
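
For example, the requested docblock might read roughly as follows; the comment text is guesswork from how the flags appear to be used in this PR, so the real semantics should come from the code.

```go
import networkingv1 "k8s.io/api/networking/v1"

type npPolicy struct {
	policy networkingv1.NetworkPolicy

	// watchesNamespaces: the policy has namespaceSelector-based peers, so
	// namespace (label) changes may require recalculating it.
	watchesNamespaces bool
	// watchesPods: the policy has podSelector-based peers in some known
	// set of namespaces, so pod changes there may require recalculating it.
	watchesPods bool
	// watchesAllPods: the policy's peers can match pods in namespaces not
	// known in advance, so any pod change may require recalculating it.
	watchesAllPods bool
}
```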

```diff
 	}
 }
-	if changed && npns.inUse {
+	if npns.mustRecalculate && npns.inUse {
 		np.syncNamespace(npns)
```
Contributor:

Does it make sense to just set npns.mustSync on all relevant namespaces, and trigger the runner once? I don't see much advantage to "splitting out" the transaction.
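
For concreteness, the reviewer's suggestion might look roughly like this (types and names are illustrative, not the PR's code): mark every relevant namespace, then trigger the runner once so a single sync pass, and a single OVS transaction, covers all of them.

```go
package main

type npNamespace struct {
	inUse           bool
	mustRecalculate bool
	mustSync        bool
}

// syncTrigger is a stand-in for the BoundedFrequencyRunner's Run method.
type syncTrigger interface{ Run() }

// markAndTriggerOnce marks all namespaces needing work, then fires the
// runner a single time; the sync pass can then handle them together
// instead of splitting the work per namespace.
func markAndTriggerOnce(namespaces []*npNamespace, runner syncTrigger) {
	for _, npns := range namespaces {
		if npns.mustRecalculate && npns.inUse {
			npns.mustSync = true
		}
	}
	runner.Run()
}
```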

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

10 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 9075977 into openshift:master Dec 3, 2020
@openshift-ci-robot (Contributor) commented:

@danwinship: All pull requests linked via external trackers have merged:

Bugzilla bug 1896958 has been moved to the MODIFIED state.

In response to this:

Bug 1896958: NetworkPolicy performance (pod caching)


@cuppett (Member) commented Dec 4, 2020:

/cherry-pick 4.6

@openshift-cherrypick-robot

@cuppett: cannot checkout 4.6: error checking out 4.6: exit status 1. output: error: pathspec '4.6' did not match any file(s) known to git

In response to this:

/cherry-pick 4.6


@danwinship danwinship deleted the networkpolicy-perf branch December 4, 2020 14:04
@danwinship (author) commented:

/cherry-pick release-4.6

@danwinship (author) commented:

/cherry-pick release-4.5

@openshift-cherrypick-robot

@danwinship: new pull request created: #228

In response to this:

/cherry-pick release-4.6


@openshift-cherrypick-robot

@danwinship: new pull request created: #229

In response to this:

/cherry-pick release-4.5


Labels: approved, bugzilla/severity-urgent, bugzilla/valid-bug, lgtm

8 participants