Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds retry mechanism for Network Policy #2809

Merged
merged 1 commit into from Feb 14, 2022

Conversation

trozet
Copy link
Contributor

@trozet trozet commented Feb 8, 2022

Leverages the same mechanism used by pods, except it also handles
retrying deletion.

Signed-off-by: Tim Rozet trozet@redhat.com

@trozet trozet requested review from astoycos and dcbw February 8, 2022 20:45
go-controller/pkg/ovn/ovn.go Outdated Show resolved Hide resolved
@trozet trozet force-pushed the retry_network_policy_on_fail branch from b8f70ae to 28f1cf9 Compare February 8, 2022 22:24
@trozet
Copy link
Contributor Author

trozet commented Feb 8, 2022

Unrelated flake?

[Fail] OVN Namespace Operations on startup [It] creates an address set for existing nodes when the host network traffic namespace is created

@trozet
Copy link
Contributor Author

trozet commented Feb 8, 2022

/retest

@trozet
Copy link
Contributor Author

trozet commented Feb 8, 2022

Another unrelated flake

[Fail] Informer Event Handler Tests [It] adds existing pod and processes an update event
/home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/informer/informer_test.go:276

@trozet
Copy link
Contributor Author

trozet commented Feb 8, 2022

/retest

@coveralls
Copy link

coveralls commented Feb 8, 2022

Coverage Status

Coverage increased (+0.07%) to 50.785% when pulling 25669a4 on trozet:retry_network_policy_on_fail into 5cd51e4 on ovn-org:master.

@trozet
Copy link
Contributor Author

trozet commented Feb 9, 2022

Looks like 1 test case failed in dualstack:
should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector

I'll test it out manually to make sure it was just a flake.

@trozet
Copy link
Contributor Author

trozet commented Feb 9, 2022

test case passes locally for me, and no obvious errors in the CI log

@tssurya tssurya self-requested a review February 9, 2022 23:08
go-controller/pkg/factory/factory.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/policy_retry.go Show resolved Hide resolved
go-controller/pkg/ovn/ovn.go Show resolved Hide resolved
go-controller/pkg/ovn/ovn.go Outdated Show resolved Hide resolved
oc.addNetworkPolicy(newPolicy)
oc.checkAndSkipRetryPolicy(oldPolicy)
if err := oc.deleteNetworkPolicy(oldPolicy, nil); err != nil {
oc.initRetryPolicyWithDelete(oldPolicy, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit complicated. specially since we have initRetryPolicyWithDelete and initRetryPolicy.... need to go back and forth between functions to see what they do...should we add a comment to this part to explain what we do so that 4 months from now we are good?

self-note-attached-for-when-I-come-back-to-this-PR-in-the-future
IMG_6231

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice diagram! Initially I made just a single "initRetryPolicy(old, new)" type of of function, but then I thought that was more confusing when reading the code in other places. Also, it made things weird about implying what a nil old, or new object means.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah for me it was confusing going back and forth between pod_retry code and ovn handler code but in the end I think I grasped the logic, its genius with all this skip and unskip :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcbw idea not me :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I want any of this to be called genius...

go-controller/pkg/ovn/policy.go Show resolved Hide resolved
np = foundNp
}

if err := oc.destroyNetworkPolicy(np, nsInfo); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can reach here with nsInfo==nil and np!=nil case, I think your nested if's broke the logic of the older code. Is that on purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intention with this change was to be able to delete without nsInfo, because it is possible that the namespace could be gone by the time we want to delete a network policy. Here is a case where that may happen:

	// Now do nsinfo operations to set the policy
	nsInfo, nsUnlock, err = oc.ensureNamespaceLocked(policy.Namespace, false, nil)
	if err != nil {
		// rollback network policy
		if err := oc.deleteNetworkPolicy(policy, np); err != nil {
			// rollback failed, add to retry to cleanup
			oc.addDeleteToRetryPolicy(policy, np)
		}
		return fmt.Errorf("unable to ensure namespace for network policy: %s, namespace: %s, error: %v",
			policy.Name, policy.Namespace, err)
	}

go-controller/pkg/ovn/policy_retry.go Outdated Show resolved Hide resolved
klog.Infof("Network Policy Retry: %s retry network policy setup", namespacedName)

// check if we need to delete anything
if npEntry.oldPolicy != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so essentially this is your trigger right?

  • if oldPolicy is set, you know its a delete retry versus
  • if newpolicy is set (policyToCreate to be more precise), we know its a add retry

if both are set then? you make sure delete is done first and don't retry add huh based on L55's continue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't really care what the state is here, we just know if a retry entry has a delete we should remove it first, if that fails we should not try to add the new thing. They may conflict.

go-controller/pkg/ovn/policy_retry.go Show resolved Hide resolved
continue
}
// successfully cleaned up old policy, remove it from the retry cache
npEntry.newPolicy = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably just delete the entry from cache entirely at this point right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I guess this wasn't really necessary. We remove from the cache a few lines later, I don't think it hurts to have it though.

@trozet trozet force-pushed the retry_network_policy_on_fail branch 2 times, most recently from 6ae2298 to da16b77 Compare February 11, 2022 18:56
Leverages the same mechanism used by pods, except it also handles
retrying deletion.

Signed-off-by: Tim Rozet <trozet@redhat.com>
@trozet trozet force-pushed the retry_network_policy_on_fail branch from da16b77 to 25669a4 Compare February 11, 2022 19:00
@dcbw
Copy link
Contributor

dcbw commented Feb 11, 2022

I think it LGTM now. Trying a downstream run openshift/ovn-kubernetes#953 just for kicks.

Update: downstream seems OK.

Copy link
Member

@tssurya tssurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
we can do the clean ups in later prs...

npEntry.timeStamp = time.Now()
continue
}
// successfully cleaned up old policy, remove it from the retry cache
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// successfully cleaned up old policy, remove it from the retry cache
// successfully created new policy, remove it from retry cache.

@tssurya
Copy link
Member

tssurya commented Feb 14, 2022

I think it LGTM now. Trying a downstream run openshift/ovn-kubernetes#953 just for kicks.

Update: downstream seems OK.

yeah, the thing with these retries is, I doubt we have any real time e2e's where the creation and/or deletion fails and we keep retrying...

@trozet trozet merged commit 4aaa7b5 into ovn-org:master Feb 14, 2022
trozet added a commit to trozet/ovn-kubernetes that referenced this pull request Feb 15, 2022
Updating the ACL logging by checking the namespace was added to the
network policy handler as part of:
ovn-org#2809

However, the policy was not added to the namespace before updating the
logging levels. Therefore the current policy would not be processed
during logging level updates in this path.

Reported-by: Andrew Stoycos <astoycos@redhat.com>

Signed-off-by: Tim Rozet <trozet@redhat.com>
trozet added a commit to trozet/ovn-kubernetes-1 that referenced this pull request Feb 25, 2022
Updating the ACL logging by checking the namespace was added to the
network policy handler as part of:
ovn-org/ovn-kubernetes#2809

However, the policy was not added to the namespace before updating the
logging levels. Therefore the current policy would not be processed
during logging level updates in this path.

Reported-by: Andrew Stoycos <astoycos@redhat.com>

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 8e2b643)
trozet added a commit to trozet/ovn-kubernetes-1 that referenced this pull request Mar 7, 2022
Updating the ACL logging by checking the namespace was added to the
network policy handler as part of:
ovn-org/ovn-kubernetes#2809

However, the policy was not added to the namespace before updating the
logging levels. Therefore the current policy would not be processed
during logging level updates in this path.

Reported-by: Andrew Stoycos <astoycos@redhat.com>

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 8e2b643)
(cherry picked from commit 1cdf58a)
trozet added a commit to trozet/ovn-kubernetes-1 that referenced this pull request Mar 17, 2022
Updating the ACL logging by checking the namespace was added to the
network policy handler as part of:
ovn-org/ovn-kubernetes#2809

However, the policy was not added to the namespace before updating the
logging levels. Therefore the current policy would not be processed
during logging level updates in this path.

Reported-by: Andrew Stoycos <astoycos@redhat.com>

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 8e2b643)
(cherry picked from commit 1cdf58a)
(cherry picked from commit fee98b4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants