
Network Policy: Ensures handler serialized shutdown #2797

Closed
trozet wants to merge 1 commit into ovn-org:master from trozet:remove_np_delete_create

Conversation


@trozet trozet commented Feb 4, 2022

In the code we add several handlers to pod, namespace, and service
informers depending on the definition of the network policy. We track
these via the network policy entry in the namespace cache, adding each
handler to a slice. During network policy deletion, we shut down all the
handlers by requesting that the watch factory detach them. This
operation was non-blocking for two reasons:

  1. Although removing a handler required the informer lock, federated
     handlers did not need the informer read lock to execute.
  2. Handlers were being removed inside a goroutine.

Due to this behavior, when a policy was being deleted and its handlers
were "shut down", there was no guarantee that an in-flight handler on a
federated queue would not finish adding/updating a new object after the
handlers were supposed to be shut down.

To solve this problem, we used np.deleted, a field the handlers could
reference to know whether they should ignore adding an object while
things are shutting down. However, this mechanism requires access to
the np object itself and needs a lock on the np. This is the root cause
of why we cannot hold the np lock while adding handlers at network
policy creation time, and it causes a bunch of headaches with
complicated locking scenarios.

This commit removes this complex locking and ensures network policy
creation can hold the np lock the entire time, while also reducing the
access the handlers need to np fields. It does the following:

  1. Make federated queue event processing take a read lock on the
     informer.
  2. When adding handlers, introduce a new "tag" which can be used to
     group operations across a set of handlers.
  3. Network policy no longer holds handlers in fields of the np object.
     Instead, it creates handlers with a unique tag corresponding to that
     policy.
  4. Handler removal is no longer done in a goroutine; it is now a
     blocking call.
  5. When a network policy is deleted, it calls into the watch factory
     to destroy all handlers by its tag (policyHandlerID).
  6. Remove the need for np.deleted and np.created. Remove access to the
     np object inside the handlers as much as possible.
  7. Handlers that still need to access the np use a new subLock, which
     guards access to the np between handlers.

Signed-off-by: Tim Rozet <trozet@redhat.com>
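
For illustration of items 2, 4, and 5 above, a minimal sketch of tag-grouped handler registration and blocking removal under the informer lock; all type and function names below are assumptions, not the actual watch factory API:

package sketch

import "sync"

// handler stands in for a watch-factory event handler.
type handler struct {
    id  int
    tag string
}

// watchFactory groups handlers by tag so they can be removed together.
type watchFactory struct {
    mu       sync.RWMutex // informer lock: dispatch takes RLock, removal takes Lock
    handlers map[string][]*handler
    nextID   int
}

func newWatchFactory() *watchFactory {
    return &watchFactory{handlers: map[string][]*handler{}}
}

// addHandler registers a handler under a tag (e.g. a per-policy policyHandlerID).
func (wf *watchFactory) addHandler(tag string) *handler {
    wf.mu.Lock()
    defer wf.mu.Unlock()
    h := &handler{id: wf.nextID, tag: tag}
    wf.nextID++
    wf.handlers[tag] = append(wf.handlers[tag], h)
    return h
}

// removeHandlersByTag blocks until no dispatch is in flight (it needs the
// write lock) and then drops every handler registered under the tag.
func (wf *watchFactory) removeHandlersByTag(tag string) {
    wf.mu.Lock()
    defer wf.mu.Unlock()
    delete(wf.handlers, tag)
}

// dispatch models federated-queue event processing holding the informer
// read lock, so it can never overlap with removeHandlersByTag.
func (wf *watchFactory) dispatch(tag string, fn func(*handler)) {
    wf.mu.RLock()
    defer wf.mu.RUnlock()
    for _, h := range wf.handlers[tag] {
        fn(h)
    }
}

The key property in this sketch is that removal takes the write lock, so it cannot return while a dispatch holding the read lock is still running against one of the policy's handlers, which is the serialization the commit message describes.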

@trozet trozet requested a review from dcbw February 4, 2022 23:16

trozet commented Feb 4, 2022

@jcaamano PTAL

// ... on condition that the removed address set was in the 'gress policy
return gress.delNamespaceAddressSet(namespace.Name)
})
},
UpdateFunc: func(oldObj, newObj interface{}) {
},
}, func(i []interface{}) {
// This needs to be a write lock because there's no locking around 'gress policies
trozet (Contributor, Author) commented:

@jcaamano you added this. I didn't really understand what you meant here, but I didn't see a reason we need to lock the network policy here. Can you take a look at this?

jcaamano (Contributor) replied:

Because it makes changes on the gress policy and gress policies are protected with the np lock they belong to.

trozet (Contributor, Author) replied:

wow I see the policyHandler holds a pointer to the same object stored in np :(
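
To make the aliasing concrete, a minimal sketch with hypothetical type and field names (the real ovn-kubernetes structs differ): the handler and the networkPolicy point at the same gress policy, so a handler mutating it must take the lock that protects the np's gress policies.

package sketch

import "sync"

type gressPolicy struct {
    peerAddressSets map[string]bool
}

type networkPolicy struct {
    sync.Mutex // protects the gress policies referenced below
    ingress    []*gressPolicy
}

type policyHandler struct {
    np    *networkPolicy
    gress *gressPolicy // aliases an entry of np.ingress
}

func (h *policyHandler) delNamespaceAddressSet(name string) {
    h.np.Lock() // required: h.gress is shared with np.ingress
    defer h.np.Unlock()
    delete(h.gress.peerAddressSets, name)
}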


coveralls commented Feb 4, 2022

Coverage Status

Coverage decreased (-0.1%) to 50.546% when pulling dbd6afb on trozet:remove_np_delete_create into aa6696f on ovn-org:master.


jcaamano commented Feb 7, 2022

So while I agree that this is an improvement that allows us to reduce or eliminate locking on NPs, I also think that, in general, removing locks can lead to complicated code, and there is a high probability that this or similar locks will need to come back going forward, so it also needs to be considered carefully.

On the specific issue being dealt with in this PR, I was thinking more along the lines that the problem is how our factory's handler-specific code calls the handler callbacks synchronously with respect to the add-handler functions.

Here

// Send existing items to the handler's add function; informers usually
// do this but since we share informers, it's long-since happened so
// we must emulate that here
i.initialAddFunc(handler, existingItems)

And here

if processExisting != nil {
// Process existing items as a set so the caller can clean up
// after a restart or whatever
processExisting(items)
}

Whereas the upstream shared index informer we built on top of is 100% asynchronous in its handler code:
https://github.com/kubernetes/client-go/blob/8f44946f6cbe967fbe2e2548e76987680a89428e/tools/cache/shared_informer.go#L552-L563

I guess we might have reasons not to be 100% asynchronous, or it might be a complication we don't want to deal with right now, but those reasons are unknown to me. So I'm just adding a note here.
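
As an illustration of the alternative described here, a minimal sketch (types and names are assumptions, not the factory's real code) of delivering the initial add asynchronously, roughly mirroring how the upstream shared informer's processorListener pushes notifications from its own goroutine:

package sketch

type handler struct{ id int }

type informer struct {
    initialAddFunc func(h *handler, items []interface{})
}

// addHandlerAsync delivers the pre-existing items to a newly added handler
// from a separate goroutine, so handler callbacks are no longer synchronous
// with respect to the add-handler call.
func (i *informer) addHandlerAsync(h *handler, existingItems []interface{}) {
    items := make([]interface{}, len(existingItems))
    copy(items, existingItems) // snapshot so the caller may reuse its slice
    go func() {
        i.initialAddFunc(h, items) // backlog drained off the caller's goroutine
    }()
}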


trozet commented Feb 7, 2022

@jcaamano I agree we should move to workqueues and asynchronous addition of handlers. This is mainly a stopgap until we can get there. I need something I can backport to stabilize/fix our network policy in past releases.


trozet commented Feb 8, 2022

/retest

@@ -605,16 +598,16 @@ func (oc *Controller) localPodAddDefaultDeny(policy *knet.NetworkPolicy,

// localPodDelDefaultDeny decrements a pod's policy reference count and removes a pod
// from the default-deny portgroups if the reference count for the pod is 0
-func (oc *Controller) localPodDelDefaultDeny(
-	np *networkPolicy, ports ...*lpInfo) (ingressDenyPorts, egressDenyPorts []string) {
+func (oc *Controller) localPodDelDefaultDeny(policyTypes []knet.PolicyType, numEgressRules int,
A contributor commented:

Can't we just be symmetric with localPodAddDefaultDeny, pass policy *knet.NetworkPolicy instead of policyTypes []knet.PolicyType, numEgressRules int, and use policy.Spec.PolicyTypes and len(policy.Spec.Egress) respectively?
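
For illustration, a sketch of the suggested symmetric signature (the reviewer's proposal, not the merged code; the body is elided and the derived locals are only shown to make the substitution explicit):

// Suggested shape: stay symmetric with localPodAddDefaultDeny and derive the
// extra arguments from the policy itself.
func (oc *Controller) localPodDelDefaultDeny(
    policy *knet.NetworkPolicy, ports ...*lpInfo) (ingressDenyPorts, egressDenyPorts []string) {
    policyTypes := policy.Spec.PolicyTypes    // instead of passing []knet.PolicyType
    numEgressRules := len(policy.Spec.Egress) // instead of passing an int
    _, _ = policyTypes, numEgressRules        // body elided in this sketch
    return nil, nil
}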

@@ -899,7 +879,7 @@ func (oc *Controller) createNetworkPolicy(np *networkPolicy, policy *knet.Networ
// addressSet for the peer pods.
for i, ingressJSON := range policy.Spec.Ingress {
klog.V(5).Infof("Network policy ingress is %+v", ingressJSON)

np.subLock.Lock()
A contributor commented:

Do we need this lock while creating the policy?

}

// Go through each egress rule. For each egress rule, create an
// addressSet for the peer pods.
for i, egressJSON := range policy.Spec.Egress {
klog.V(5).Infof("Network policy egress is %+v", egressJSON)

np.subLock.Lock()
A contributor commented:

Do we need this lock while creating the policy?


jcaamano commented Feb 8, 2022

Do we need np.Lock at all? It kind of looks to me (from afar) like we are covered by the namespace lock.
