Network Policy: Ensures handler serialized shutdown #2797
Conversation
@jcaamano PTAL

			// ... on condition that the removed address set was in the 'gress policy
			return gress.delNamespaceAddressSet(namespace.Name)
		})
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
	},
}, func(i []interface{}) {
	// This needs to be a write lock because there's no locking around 'gress policies
@jcaamano you added this. I didn't really understand what you meant here, but I didn't see a reason we need to lock the network policy here. Can you take a look at this?
Because it makes changes to the gress policy, and gress policies are protected by the np lock they belong to.
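For context, a minimal sketch of the ownership relationship being described here (illustrative names and stub types, not the actual ovn-kubernetes code): the gress policies live inside the networkPolicy and are guarded by its lock, so a namespace-delete callback that mutates them has to take that lock first.

```go
package policysketch

import "sync"

// gressPolicy stands in for the real type; only the method the handler
// calls is stubbed here.
type gressPolicy struct{}

func (g *gressPolicy) delNamespaceAddressSet(ns string) error { return nil }

// networkPolicy sketches the ownership: the gress policies live inside the
// policy object and are guarded by its mutex.
type networkPolicy struct {
	sync.Mutex
	ingressPolicies []*gressPolicy
	egressPolicies  []*gressPolicy
}

// handleNamespaceDelete shows why the namespace handler needs the np lock:
// delNamespaceAddressSet mutates gress state owned by the policy.
func (np *networkPolicy) handleNamespaceDelete(name string) error {
	np.Lock()
	defer np.Unlock()
	for _, g := range np.ingressPolicies {
		if err := g.delNamespaceAddressSet(name); err != nil {
			return err
		}
	}
	for _, g := range np.egressPolicies {
		if err := g.delNamespaceAddressSet(name); err != nil {
			return err
		}
	}
	return nil
}
```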
wow I see the policyHandler holds a pointer to the same object stored in np :(
So while I agree that this is an improvement that allows us to reduce or eliminate locking on NPs, I also think that removing locks can in general lead to complicated code, and there is a high probability that this or other similar locks may need to come back going forward, so it needs to be considered carefully. On the specific issue being dealt with in this PR, I was thinking more along the lines that the problem is how our factory-handler-specific code calls the handler callbacks synchronously with respect to the add-handler functions. Here:
ovn-kubernetes/go-controller/pkg/factory/handler.go Lines 128 to 131 in 16afeee
And here:
ovn-kubernetes/go-controller/pkg/factory/factory.go Lines 363 to 367 in 16afeee
Whereas the upstream shared index informer we built on top of is 100% asynchronous in its handler code. I guess we might have reasons not to be 100% asynchronous, or it might be a complication we don't want to deal with now, but that's unknown to me. So just adding a note here.
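To illustrate the distinction (a hedged sketch with made-up function names, not the code at the lines referenced above): in the synchronous style, existing objects are replayed to a newly added handler inline on the registering goroutine, under whatever locks that caller holds; the upstream shared index informer instead queues notifications and delivers them from its own goroutine.

```go
package factorysketch

// addHandlerSync mimics the synchronous style: existing objects are replayed
// to the new handler inline, so the add callback runs on the registering
// goroutine, synchronously with handler registration.
func addHandlerSync(existing []interface{}, onAdd func(obj interface{})) {
	for _, obj := range existing {
		onAdd(obj) // runs before addHandlerSync returns
	}
}

// addHandlerAsync mimics the upstream shared-index-informer style:
// notifications are buffered and delivered from a separate goroutine,
// decoupled from registration.
func addHandlerAsync(existing []interface{}, onAdd func(obj interface{})) {
	ch := make(chan interface{}, len(existing))
	for _, obj := range existing {
		ch <- obj
	}
	close(ch)
	go func() {
		for obj := range ch {
			onAdd(obj) // delivered later, on the informer's goroutine
		}
	}()
}
```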
Force-pushed: adc8465 to 2914174
@jcaamano I agree we should move to workqueues and asynchronous addition of handlers. This is mainly a stop gap until we can get there. I need something I can backport to stabilize/fix our network policy in past releases.
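For reference, the workqueue direction being agreed on here usually looks like the standard client-go pattern sketched below (this is not the PR's code; wireHandlers and runWorker are illustrative names): the event handlers only enqueue keys, and a worker goroutine drains the queue.

```go
package workqueuesketch

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// wireHandlers registers handlers that only enqueue object keys; all real
// work happens in the worker, so handler add/remove never blocks on
// business logic.
func wireHandlers(informer cache.SharedIndexInformer) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
	return queue
}

// runWorker drains the queue until it is shut down, retrying failed keys
// with rate-limited backoff.
func runWorker(queue workqueue.RateLimitingInterface, sync func(key string) error) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := sync(key); err != nil {
			queue.AddRateLimited(key)
		} else {
			queue.Forget(key)
		}
		queue.Done(item)
	}
}
```

With this shape, tearing down a handler only has to stop the enqueueing side; any in-flight work is finished by the worker, which makes shutdown straightforward to serialize.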
Force-pushed: 2914174 to 7cc5d2e
In the code we add several handlers to pod, namespace, and service informers depending on the definition of the network policy. We track these via the network policy entry in the namespace cache, adding each handler to a slice. During network policy deletion, we shut down all the handlers by requesting them to be detached by the watch factory. This operation was non-blocking for 2 reasons:
1. Although all handlers required the informer lock to be removed, federated handlers did not need the informer read lock to execute.
2. Handlers were being removed inside of a goroutine.
Due to this behavior, when a policy was being deleted and handlers were "shut down", there was no guarantee that an in-process handler on a federated queue would not finish adding/updating a new object after the handlers were supposed to be shut down.
To solve this problem, we used np.deleted, a field that the handlers could reference to know if they should ignore adding an object in the case where things are shutting down. However, this mechanism requires access to the np object itself and needs a lock on the np. This is the root cause of why we cannot hold the np lock while we are adding handlers during network policy creation time. This causes a bunch of headaches with complicated locking scenarios.
This commit removes this complex locking and ensures network policy creation can hold the np lock the entire time, while also reducing the access required to np fields in the handlers. It does the following:
1. Make federated queue event processing take a read lock on the informer.
2. When adding handlers, introduce a new "tag" which can be used to group operations on a bunch of handlers.
3. Network policy no longer holds handlers in fields of the np object. Instead it creates handlers with a unique tag corresponding to that policy.
4. Handlers are no longer removed in a goroutine; removal is now a blocking function.
5. When a network policy is deleted, it calls the watch factory to destroy all handlers by its tag (policyHandlerID).
6. Remove the need for np.deleted and np.created. Remove access as much as possible to the np object inside of the handlers.
7. For handlers that still need to access the np, they use a new subLock which is used to guard against access to np between handlers.
Signed-off-by: Tim Rozet <trozet@redhat.com>
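A rough sketch of the tag mechanism the commit describes (hypothetical names; the real implementation lives in the watch factory): every handler created for a policy is recorded under that policy's tag, and removal by tag blocks until each handler in the group has been detached.

```go
package watchsketch

import "sync"

// handler stands in for a registered informer handler; detach would remove
// it from the underlying informer and wait for any in-flight event to finish.
type handler struct{}

func (h *handler) detach() {}

// watchFactory tracks handlers grouped by tag.
type watchFactory struct {
	mu       sync.Mutex
	handlers map[string][]*handler
}

func (wf *watchFactory) addHandler(tag string, h *handler) {
	wf.mu.Lock()
	defer wf.mu.Unlock()
	if wf.handlers == nil {
		wf.handlers = make(map[string][]*handler)
	}
	wf.handlers[tag] = append(wf.handlers[tag], h)
}

// removeHandlersByTag detaches every handler registered under the tag and
// only returns once all of them are gone, giving callers (e.g. network
// policy deletion) a serialized, blocking shutdown.
func (wf *watchFactory) removeHandlersByTag(tag string) {
	wf.mu.Lock()
	hs := wf.handlers[tag]
	delete(wf.handlers, tag)
	wf.mu.Unlock()
	for _, h := range hs {
		h.detach() // blocking
	}
}
```

With deletion expressed as a blocking remove-by-tag call (policyHandlerID in the commit's wording), the np.deleted flag and the lock ordering problems around it are no longer needed.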
Force-pushed: 7cc5d2e to dbd6afb
/retest
@@ -605,16 +598,16 @@ func (oc *Controller) localPodAddDefaultDeny(policy *knet.NetworkPolicy,

 // localPodDelDefaultDeny decrements a pod's policy reference count and removes a pod
 // from the default-deny portgroups if the reference count for the pod is 0
-func (oc *Controller) localPodDelDefaultDeny(
-	np *networkPolicy, ports ...*lpInfo) (ingressDenyPorts, egressDenyPorts []string) {
+func (oc *Controller) localPodDelDefaultDeny(policyTypes []knet.PolicyType, numEgressRules int
Can't we just be symmetric to localPodAddDefaultDeny: pass policy *knet.NetworkPolicy instead of policyTypes []knet.PolicyType, numEgressRules int, and use policy.Spec.PolicyTypes and len(policy.Spec.Egress) respectively?
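A sketch of the suggested symmetric signature (hypothetical code; Controller and lpInfo are stand-ins for the project's types, and the function body is elided):

```go
package policysigsketch

import (
	knet "k8s.io/api/networking/v1"
)

// Controller and lpInfo stand in for the real ovn-kubernetes types.
type Controller struct{}
type lpInfo struct{}

// localPodDelDefaultDeny reshaped per the suggestion: take the policy itself
// and derive the values inside, mirroring localPodAddDefaultDeny.
func (oc *Controller) localPodDelDefaultDeny(policy *knet.NetworkPolicy,
	ports ...*lpInfo) (ingressDenyPorts, egressDenyPorts []string) {
	policyTypes := policy.Spec.PolicyTypes    // replaces the []knet.PolicyType argument
	numEgressRules := len(policy.Spec.Egress) // replaces the int argument
	_, _ = policyTypes, numEgressRules
	// ... reference-count / default-deny port-group logic unchanged ...
	return ingressDenyPorts, egressDenyPorts
}
```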
@@ -899,7 +879,7 @@ func (oc *Controller) createNetworkPolicy(np *networkPolicy, policy *knet.Networ

 	// addressSet for the peer pods.
 	for i, ingressJSON := range policy.Spec.Ingress {
 		klog.V(5).Infof("Network policy ingress is %+v", ingressJSON)

+		np.subLock.Lock()
Do we need this lock while creating the policy?
 	}

 	// Go through each egress rule. For each egress rule, create an
 	// addressSet for the peer pods.
 	for i, egressJSON := range policy.Spec.Egress {
 		klog.V(5).Infof("Network policy egress is %+v", egressJSON)

+		np.subLock.Lock()
Do we need this lock while creating the policy?