
Network Policy: Ensures handler serialized shutdown #2797

Closed
trozet wants to merge 1 commit into ovn-org:master from trozet:remove_np_delete_create

Conversation


@trozet trozet commented Feb 4, 2022

In the code we add several handlers to pod, namespace, and service
informers depending on the definition of the network policy. We track
these via the network policy entry in the namespace cache, adding each
handler to a slice. During network policy deletion, we shut down all the
handlers by requesting that the watch factory detach them. This
operation was non-blocking for two reasons:

  1. Although removing a handler required the informer lock, federated
     handlers did not need the informer read lock to execute.
  2. Handlers were being removed inside a goroutine.

Due to this behavior, when a policy was being deleted and its handlers
were "shut down", there was no guarantee that an in-flight handler on a
federated queue would not finish adding/updating a new object after the
handlers were supposed to be shut down.

To solve this problem, we used np.deleted, a field the handlers could
reference to know whether they should ignore adding an object while
things are shutting down. However, this mechanism requires access to
the np object itself and needs a lock on the np. This is the root cause
of why we cannot hold the np lock while adding handlers at network
policy creation time, and it causes a bunch of headaches with
complicated locking scenarios.

This commit removes this complex locking and ensures network policy
creation can hold the np lock the entire time, while also reducing the
access the handlers need to np fields. It does the following:

  1. Make federated queue event processing take a read lock on the
     informer.
  2. When adding handlers, introduce a new "tag" which can be used to
     group operations across a set of handlers.
  3. Network policy no longer holds handlers in fields of the np object.
     Instead, it creates handlers with a unique tag corresponding to that
     policy.
  4. Handler removal is no longer done in a goroutine; it is now a
     blocking call.
  5. When a network policy is deleted, it calls into the watch factory
     to destroy all handlers by its tag (policyHandlerID).
  6. Remove the need for np.deleted and np.created. Remove access to the
     np object inside the handlers as much as possible.
  7. Handlers that still need to access the np use a new subLock, which
     guards access to the np between handlers.

Signed-off-by: Tim Rozet <trozet@redhat.com>
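
For illustration of items 2, 4, and 5 above, a minimal sketch of tag-grouped handler registration and blocking removal under the informer lock; all type and function names below are assumptions, not the actual watch factory API:

package sketch

import "sync"

// handler stands in for a watch-factory event handler.
type handler struct {
    id  int
    tag string
}

// watchFactory groups handlers by tag so they can be removed together.
type watchFactory struct {
    mu       sync.RWMutex // informer lock: dispatch takes RLock, removal takes Lock
    handlers map[string][]*handler
    nextID   int
}

func newWatchFactory() *watchFactory {
    return &watchFactory{handlers: map[string][]*handler{}}
}

// addHandler registers a handler under a tag (e.g. a per-policy policyHandlerID).
func (wf *watchFactory) addHandler(tag string) *handler {
    wf.mu.Lock()
    defer wf.mu.Unlock()
    h := &handler{id: wf.nextID, tag: tag}
    wf.nextID++
    wf.handlers[tag] = append(wf.handlers[tag], h)
    return h
}

// removeHandlersByTag blocks until no dispatch is in flight (it needs the
// write lock) and then drops every handler registered under the tag.
func (wf *watchFactory) removeHandlersByTag(tag string) {
    wf.mu.Lock()
    defer wf.mu.Unlock()
    delete(wf.handlers, tag)
}

// dispatch models federated-queue event processing holding the informer
// read lock, so it can never overlap with removeHandlersByTag.
func (wf *watchFactory) dispatch(tag string, fn func(*handler)) {
    wf.mu.RLock()
    defer wf.mu.RUnlock()
    for _, h := range wf.handlers[tag] {
        fn(h)
    }
}

The key property in this sketch is that removal takes the write lock, so it cannot return while a dispatch holding the read lock is still running against one of the policy's handlers, which is the serialization the commit message describes.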

@trozet trozet requested a review from dcbw February 4, 2022 23:16

trozet commented Feb 4, 2022

@jcaamano PTAL

// ... on condition that the removed address set was in the 'gress policy
return gress.delNamespaceAddressSet(namespace.Name)
})
},
UpdateFunc: func(oldObj, newObj interface{}) {
},
}, func(i []interface{}) {
// This needs to be a write lock because there's no locking around 'gress policies
trozet (Contributor, Author) commented:

@jcaamano you added this. I didn't really understand what you meant here, but I didn't see a reason we need to lock the network policy here. Can you take a look at this?

jcaamano (Contributor) replied:

Because it makes changes on the gress policy and gress policies are protected with the np lock they belong to.

trozet (Contributor, Author) replied:

wow I see the policyHandler holds a pointer to the same object stored in np :(
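
To make the aliasing concrete, a minimal sketch with hypothetical type and field names (the real ovn-kubernetes structs differ): the handler and the networkPolicy point at the same gress policy, so a handler mutating it must take the lock that protects the np's gress policies.

package sketch

import "sync"

type gressPolicy struct {
    peerAddressSets map[string]bool
}

type networkPolicy struct {
    sync.Mutex // protects the gress policies referenced below
    ingress    []*gressPolicy
}

type policyHandler struct {
    np    *networkPolicy
    gress *gressPolicy // aliases an entry of np.ingress
}

func (h *policyHandler) delNamespaceAddressSet(name string) {
    h.np.Lock() // required: h.gress is shared with np.ingress
    defer h.np.Unlock()
    delete(h.gress.peerAddressSets, name)
}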


coveralls commented Feb 4, 2022

Coverage Status

Coverage decreased (-0.1%) to 50.546% when pulling dbd6afb on trozet:remove_np_delete_create into aa6696f on ovn-org:master.


jcaamano commented Feb 7, 2022

So while I agree that this is an improvement that allows us to reduce or eliminate locking on NPs, I also think that, in general, removing locks can lead to complicated code, and there is a high probability that this or similar locks will need to come back going forward, so it also needs to be considered carefully.

On the specific issue being dealt with in this PR, I was thinking more along the lines that the problem is how our factory's handler-specific code calls the handler callbacks synchronously with respect to the add-handler functions.

Here

// Send existing items to the handler's add function; informers usually
// do this but since we share informers, it's long-since happened so
// we must emulate that here
i.initialAddFunc(handler, existingItems)

And here

if processExisting != nil {
// Process existing items as a set so the caller can clean up
// after a restart or whatever
processExisting(items)
}

Whereas the upstream shared index informer we built on top of is 100% asynchronous in its handler code:
https://github.com/kubernetes/client-go/blob/8f44946f6cbe967fbe2e2548e76987680a89428e/tools/cache/shared_informer.go#L552-L563

I guess we might have reasons not to be 100% asynchronous, or it might be a complication we don't want to deal with right now, but those reasons are unknown to me. So I'm just adding a note here.
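
As an illustration of the alternative described here, a minimal sketch (types and names are assumptions, not the factory's real code) of delivering the initial add asynchronously, roughly mirroring how the upstream shared informer's processorListener pushes notifications from its own goroutine:

package sketch

type handler struct{ id int }

type informer struct {
    initialAddFunc func(h *handler, items []interface{})
}

// addHandlerAsync delivers the pre-existing items to a newly added handler
// from a separate goroutine, so handler callbacks are no longer synchronous
// with respect to the add-handler call.
func (i *informer) addHandlerAsync(h *handler, existingItems []interface{}) {
    items := make([]interface{}, len(existingItems))
    copy(items, existingItems) // snapshot so the caller may reuse its slice
    go func() {
        i.initialAddFunc(h, items) // backlog drained off the caller's goroutine
    }()
}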


trozet commented Feb 7, 2022

@jcaamano I agree we should move to workqueues and asynchronous addition of handlers. This is mainly a stopgap until we can get there. I need something I can backport to stabilize/fix our network policy in past releases.


trozet commented Feb 8, 2022

/retest

@@ -605,16 +598,16 @@ func (oc *Controller) localPodAddDefaultDeny(policy *knet.NetworkPolicy,

// localPodDelDefaultDeny decrements a pod's policy reference count and removes a pod
// from the default-deny portgroups if the reference count for the pod is 0
-func (oc *Controller) localPodDelDefaultDeny(
-	np *networkPolicy, ports ...*lpInfo) (ingressDenyPorts, egressDenyPorts []string) {
+func (oc *Controller) localPodDelDefaultDeny(policyTypes []knet.PolicyType, numEgressRules int,
A contributor commented:

Can't we just be symmetric with localPodAddDefaultDeny, pass policy *knet.NetworkPolicy instead of policyTypes []knet.PolicyType, numEgressRules int, and use policy.Spec.PolicyTypes and len(policy.Spec.Egress) respectively?
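
For illustration, a sketch of the suggested symmetric signature (the reviewer's proposal, not the merged code; the body is elided and the derived locals are only shown to make the substitution explicit):

// Suggested shape: stay symmetric with localPodAddDefaultDeny and derive the
// extra arguments from the policy itself.
func (oc *Controller) localPodDelDefaultDeny(
    policy *knet.NetworkPolicy, ports ...*lpInfo) (ingressDenyPorts, egressDenyPorts []string) {
    policyTypes := policy.Spec.PolicyTypes    // instead of passing []knet.PolicyType
    numEgressRules := len(policy.Spec.Egress) // instead of passing an int
    _, _ = policyTypes, numEgressRules        // body elided in this sketch
    return nil, nil
}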

@@ -899,7 +879,7 @@ func (oc *Controller) createNetworkPolicy(np *networkPolicy, policy *knet.Networ
// addressSet for the peer pods.
for i, ingressJSON := range policy.Spec.Ingress {
klog.V(5).Infof("Network policy ingress is %+v", ingressJSON)

np.subLock.Lock()
A contributor commented:

Do we need this lock while creating the policy?

}

// Go through each egress rule. For each egress rule, create an
// addressSet for the peer pods.
for i, egressJSON := range policy.Spec.Egress {
klog.V(5).Infof("Network policy egress is %+v", egressJSON)

np.subLock.Lock()
A contributor commented:

Do we need this lock while creating the policy?


jcaamano commented Feb 8, 2022

Do we need np.Lock at all? It kind of looks to me (from afar) like we are covered by the namespace lock.
