Adds retry mechanism for Network Policy #2809
b8f70ae to 28f1cf9
Unrelated flake? [Fail] OVN Namespace Operations on startup [It] creates an address set for existing nodes when the host network traffic namespace is created
/retest
Another unrelated flake: [Fail] Informer Event Handler Tests [It] adds existing pod and processes an update event
/retest
Looks like 1 test case failed in dualstack. I'll test it out manually to make sure it was just a flake.
The test case passes locally for me, and there are no obvious errors in the CI log.
oc.addNetworkPolicy(newPolicy)
oc.checkAndSkipRetryPolicy(oldPolicy)
if err := oc.deleteNetworkPolicy(oldPolicy, nil); err != nil {
	oc.initRetryPolicyWithDelete(oldPolicy, nil)
This is a bit complicated, especially since we have both initRetryPolicyWithDelete and initRetryPolicy. I need to go back and forth between functions to see what they do. Should we add a comment to this part to explain what we do, so that 4 months from now we are good?
self-note-attached-for-when-I-come-back-to-this-PR-in-the-future
Nice diagram! Initially I made just a single "initRetryPolicy(old, new)" type of function, but then I thought that was more confusing when reading the code in other places. It also made it unclear what a nil old or new object would mean.
Yeah, for me it was confusing going back and forth between the pod_retry code and the ovn handler code, but in the end I think I grasped the logic. It's genius with all this skip and unskip :)
@dcbw's idea, not mine :)
I don't think I want any of this to be called genius...
go-controller/pkg/ovn/policy.go
Outdated
	np = foundNp
}

if err := oc.destroyNetworkPolicy(np, nsInfo); err != nil {
You can reach here with nsInfo == nil and np != nil. I think your nested ifs broke the logic of the older code. Is that on purpose?
the intention with this change was to be able to delete without nsInfo, because it is possible that the namespace could be gone by the time we want to delete a network policy. Here is a case where that may happen:
// Now do nsinfo operations to set the policy
nsInfo, nsUnlock, err = oc.ensureNamespaceLocked(policy.Namespace, false, nil)
if err != nil {
	// rollback network policy
	if err := oc.deleteNetworkPolicy(policy, np); err != nil {
		// rollback failed, add to retry to cleanup
		oc.addDeleteToRetryPolicy(policy, np)
	}
	return fmt.Errorf("unable to ensure namespace for network policy: %s, namespace: %s, error: %v",
		policy.Name, policy.Namespace, err)
}
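To make the point about deleting without nsInfo concrete, here is a minimal, self-contained sketch. It is not the ovn-kubernetes code: `namespaceInfo`, `policyState`, and `destroyPolicy` are hypothetical stand-ins, and the only idea carried over from the discussion is that teardown should still succeed when the namespace record is already gone.

```go
package main

import (
	"errors"
	"fmt"
)

// All names here are hypothetical stand-ins, not the ovn-kubernetes types.

// namespaceInfo tracks which policies a namespace currently holds.
type namespaceInfo struct {
	policies map[string]bool
}

// policyState is the controller-side record of a programmed policy.
type policyState struct {
	name    string
	deleted bool
}

// destroyPolicy tears down a policy. nsInfo may be nil when the namespace
// has already been removed; in that case only the policy state is cleaned up.
func destroyPolicy(np *policyState, nsInfo *namespaceInfo) error {
	if np == nil {
		return errors.New("nothing to destroy: policy state is nil")
	}
	if nsInfo != nil {
		// The namespace still exists, so unregister the policy from it.
		delete(nsInfo.policies, np.name)
	}
	np.deleted = true
	return nil
}

func main() {
	ns := &namespaceInfo{policies: map[string]bool{"allow-dns": true}}
	fmt.Println(destroyPolicy(&policyState{name: "allow-dns"}, ns), len(ns.policies))

	// Rollback path: the namespace could not be ensured, so nsInfo is nil.
	fmt.Println(destroyPolicy(&policyState{name: "orphan"}, nil))
}
```

The nil check on nsInfo is the only branch that differs between the two paths, which keeps the rollback case from needing a separate delete function.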
klog.Infof("Network Policy Retry: %s retry network policy setup", namespacedName)

// check if we need to delete anything
if npEntry.oldPolicy != nil {
So essentially this is your trigger, right?
- if oldPolicy is set, you know it's a delete retry, versus
- if newPolicy is set (policyToCreate, to be more precise), we know it's an add retry
And if both are set? You make sure the delete is done first and don't retry the add, based on L55's continue?
We don't really care what the state is here. We just know that if a retry entry has a delete, we should do that first; if it fails, we should not try to add the new policy, because the two may conflict.
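The ordering described in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `retryEntry`, `retryOnce`, `errTransient`, and the `del`/`add` callbacks are hypothetical names, and the real retry cache also tracks timestamps and skip flags that are left out here.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types: a retry entry may carry a policy to delete,
// a policy to create, or both.
type policy struct{ name string }

type retryEntry struct {
	oldPolicy *policy // pending delete, if non-nil
	newPolicy *policy // pending create, if non-nil
}

var errTransient = errors.New("transient failure")

// retryOnce walks the cache once, always deleting before adding. It returns
// the keys that fully completed, after removing them from the cache.
func retryOnce(cache map[string]*retryEntry, del, add func(*policy) error) []string {
	var done []string
	for key, e := range cache {
		if e.oldPolicy != nil {
			if err := del(e.oldPolicy); err != nil {
				continue // delete failed: skip the add, the two may conflict
			}
			e.oldPolicy = nil
		}
		if e.newPolicy != nil {
			if err := add(e.newPolicy); err != nil {
				continue // keep the entry so the add is retried next round
			}
			e.newPolicy = nil
		}
		done = append(done, key)
	}
	for _, key := range done {
		delete(cache, key)
	}
	return done
}

func main() {
	cache := map[string]*retryEntry{
		"ns1/allow-dns": {
			oldPolicy: &policy{name: "allow-dns-v1"},
			newPolicy: &policy{name: "allow-dns-v2"},
		},
	}
	failing := func(*policy) error { return errTransient }
	ok := func(*policy) error { return nil }

	// First pass: the delete fails, so the entry stays and the add is not attempted.
	fmt.Println(len(retryOnce(cache, failing, ok)), len(cache))
	// Second pass: the delete succeeds, the add runs, and the entry is dropped.
	fmt.Println(len(retryOnce(cache, ok, ok)), len(cache))
}
```

Clearing oldPolicy after a successful delete means a later pass that only fails on the add will not repeat the delete.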
	continue
}
// successfully cleaned up old policy, remove it from the retry cache
npEntry.newPolicy = nil
we should probably just delete the entry from cache entirely at this point right?
yeah I guess this wasn't really necessary. We remove from the cache a few lines later, I don't think it hurts to have it though.
6ae2298 to da16b77
Leverages the same mechanism used by pods, except it also handles retrying deletion.
Signed-off-by: Tim Rozet <trozet@redhat.com>
da16b77 to 25669a4
I think it LGTM now. Trying a downstream run openshift/ovn-kubernetes#953 just for kicks. Update: downstream seems OK.
/lgtm
we can do the clean ups in later prs...
	npEntry.timeStamp = time.Now()
	continue
}
// successfully cleaned up old policy, remove it from the retry cache
nit:
// successfully cleaned up old policy, remove it from the retry cache
// successfully created new policy, remove it from retry cache.
yeah, the thing with these retries is, I doubt we have any real time e2e's where the creation and/or deletion fails and we keep retrying...
Updating the ACL logging by checking the namespace was added to the network policy handler as part of: ovn-org/ovn-kubernetes#2809
However, the policy was not added to the namespace before updating the logging levels. Therefore the current policy would not be processed during logging level updates in this path.
Reported-by: Andrew Stoycos <astoycos@redhat.com>
Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 8e2b643)
(cherry picked from commit 1cdf58a)
(cherry picked from commit fee98b4)
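The ordering problem described in that follow-up commit can be illustrated with a small sketch. Everything here is assumed for illustration: `namespaceInfo`, `policy`, `updateLogging`, and `addPolicy` are hypothetical names, and the model only captures the idea that a logging update walks the policies registered on the namespace, so a policy has to be registered before the update runs.

```go
package main

import "fmt"

// Hypothetical model of a namespace that propagates its ACL log level
// to the policies registered on it.
type policy struct {
	name     string
	logLevel string
}

type namespaceInfo struct {
	logLevel string
	policies map[string]*policy
}

// updateLogging applies the namespace log level to every registered policy.
// A policy that has not been registered yet is silently missed.
func (ns *namespaceInfo) updateLogging() {
	for _, p := range ns.policies {
		p.logLevel = ns.logLevel
	}
}

// addPolicy registers the policy first and only then refreshes logging,
// so the new policy is included in the update.
func (ns *namespaceInfo) addPolicy(p *policy) {
	ns.policies[p.name] = p
	ns.updateLogging()
}

func main() {
	ns := &namespaceInfo{logLevel: "alert", policies: map[string]*policy{}}
	p := &policy{name: "deny-all"}
	ns.addPolicy(p)
	fmt.Println(p.logLevel) // prints "alert"
}
```

Swapping the two statements in addPolicy reproduces the reported symptom in this model: the new policy keeps an empty log level until some later update happens to run.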