[4.7] Bug 2073350: Lookup reject ACLs found at startup when removing reject ACLs for services [alternative] #1211
Conversation
@ricky-rav: No Bugzilla bug is referenced in the title of this pull request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ricky-rav. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest-required
/retest-required
/retest-required
@ricky-rav: The following tests failed, say /retest to rerun all failed tests:
```diff
@@ -204,13 +208,17 @@ func (ovn *Controller) syncServices(services []interface{}) {
 				foundSwitches)
 			ovn.removeACLFromNodeSwitches(foundSwitches, uuid)
 		}
+	} else {
+		OVNRejectACLsAtStartup[name] = uuid
```
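For context, here is a hedged sketch of how this recording step might sit inside the startup sync; aside from `OVNRejectACLsAtStartup` and `removeACLFromNodeSwitches`, which appear in the hunk above, every name (`aclIsStale`, `syncRejectACLs`, the function shape) is an assumption, not the actual release-4.7 code:

```go
// Hedged sketch of the sync loop around the hunk above, not the real code.
package ovnsketch

// OVNRejectACLsAtStartup holds the reject ACLs found in OVN at startup that
// are still valid, keyed by ACL name (as in the hunk above).
var OVNRejectACLsAtStartup = map[string]string{}

// Controller is a stub for the ovn-kubernetes master controller.
type Controller struct{}

// removeACLFromNodeSwitches is a stub; in ovn-kubernetes it detaches the ACL
// from every node switch it is applied to.
func (ovn *Controller) removeACLFromNodeSwitches(switches []string, uuid string) {}

// aclIsStale is an assumed predicate: true when the service behind the ACL
// no longer needs it (or no longer exists).
func aclIsStale(name string) bool { return false }

// syncRejectACLs sketches the loop around the hunk: stale ACLs are removed,
// while valid ones are remembered so the deletion path can still find them
// after endpoints appear.
func (ovn *Controller) syncRejectACLs(aclsInOVN map[string]string, foundSwitches []string) {
	for name, uuid := range aclsInOVN {
		if aclIsStale(name) {
			ovn.removeACLFromNodeSwitches(foundSwitches, uuid)
		} else {
			OVNRejectACLsAtStartup[name] = uuid
		}
	}
}
```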
I don't understand how the sync code above would fail to catch a service that has endpoints but a stale ACL. What is different about your corner case?
So the sync code is correct: it removes stale ACLs at startup. What happened in the customer's cluster is that the endpoints were added after the sync code ran and before the service was created, so the code never kept track of the existing reject ACL once syncServices had finished. In particular:

- the reject ACL was created in the previous execution of ovnkube-master;
- after restart, the sync code correctly keeps this reject ACL, since the service still has no endpoints;
- the backend pods for this service are added;
- the service is (re)created along with its endpoints;
- since the service now has endpoints, we do not hit `func (ovn *Controller) createLoadBalancerRejectACL(lb, sourceIP string, sourcePort int32, proto kapi.Protocol) (string, error)`; we try instead to find an existing reject ACL to remove, but `serviceLBMap` contains no reference to the ACL that's actually in OVN, so the ACL is not removed (see the sketch below): https://github.com/openshift/ovn-kubernetes/blob/release-4.7/go-controller/pkg/ovn/loadbalancer.go#L313-L317
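To make the miss concrete, here is a hedged sketch of the cache shape and the lookup described above, in the same illustrative package as the earlier sketch; the field and function names are stand-ins, not the actual identifiers in loadbalancer.go:

```go
package ovnsketch

// lbConf is an illustrative per-VIP entry in the serviceLBMap cache
// described above; the real struct in go-controller/pkg/ovn differs.
type lbConf struct {
	endpoints []string // backend endpoints behind the VIP
	rejectACL string   // UUID of the reject ACL, "" if none was recorded
}

// serviceLBMap: load balancer UUID -> service VIP -> cached config.
var serviceLBMap = map[string]map[string]*lbConf{}

// getACLFromServiceLBMap mimics the lookup linked above: it returns the
// cached reject ACL for a VIP, or "" on a cache miss. In the corner case
// described in this thread it returns "" even though a reject ACL for the
// VIP still exists in OVN: the ACL was created by a previous ovnkube-master
// run and endpoints arrived before the service was re-created, so the ACL
// was never written into serviceLBMap.
func getACLFromServiceLBMap(lb, vip string) string {
	if vips, ok := serviceLBMap[lb]; ok {
		if conf, ok := vips[vip]; ok {
			return conf.rejectACL
		}
	}
	return ""
}
```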
@ricky-rav: This pull request references Bugzilla bug 2073350, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
Requesting review from QA contact.
@ricky-rav: This pull request references Bugzilla bug 2073350, which is valid. 6 validation(s) were run on this bug. Requesting review from QA contact.
/bugzilla refresh
@ricky-rav: This pull request references Bugzilla bug 2073350, which is valid. 6 validation(s) were run on this bug. Requesting review from QA contact.
/retest-failed
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now, please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now, please do so with /close.
/lifecycle rotten
@ricky-rav: This pull request references Bugzilla bug 2073350. The bug has been updated to no longer refer to the pull request using the external bug tracker.
[alternative solution to https://github.com/openshift/ovn-kubernetes/pull/1208, avoiding extra ovn-nbctl calls]
The serviceLBMap cache stores, for each load balancer, a service VIP along with its endpoints and reject ACL (if any). The reject ACL is added to this cache either (1) if it was created in the current execution of ovnkube-master or (2) if it was created in a previous execution of ovnkube-master and the service still needs a reject ACL (because it has no endpoints).

Now, if the reject ACL was created in the previous execution of ovnkube-master, and after restart ovnkube-master first adds the backend pods for this service and only afterwards (re)creates the service, then the existing reject ACL goes unnoticed. Endpoints are added correctly, but traffic to this service is dropped because of this reject ACL.

To fix this corner case, keep a list of the valid reject ACLs found at startup and look up this list too when trying to delete a reject ACL for a service with endpoints.
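Concretely, a minimal sketch of this lookup-with-fallback, reusing the names from the sketches earlier in the thread; `rejectACLName`, the method name, and the signature are assumptions, and the real deletion path lives in go-controller/pkg/ovn/loadbalancer.go:

```go
package ovnsketch

import "fmt"

// rejectACLName is a hypothetical helper deriving the ACL name from the
// load balancer and VIP; it stands in for however the real code keys ACLs.
func rejectACLName(lb, vip string) string {
	return fmt.Sprintf("%s-%s", lb, vip)
}

// deleteRejectACL sketches the fix: try the in-memory cache first and, if it
// misses, fall back to the reject ACLs recorded at startup, so that an ACL
// created by a previous ovnkube-master run is still removed once the service
// has endpoints.
func (ovn *Controller) deleteRejectACL(lb, vip string, switches []string) {
	uuid := getACLFromServiceLBMap(lb, vip)
	if uuid == "" {
		// Fallback added by this PR: the cache misses when the ACL predates
		// this run and endpoints appeared before the service was re-created.
		uuid = OVNRejectACLsAtStartup[rejectACLName(lb, vip)]
	}
	if uuid == "" {
		return // no reject ACL to remove
	}
	ovn.removeACLFromNodeSwitches(switches, uuid)
	delete(OVNRejectACLsAtStartup, rejectACLName(lb, vip))
}
```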
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
closes #2073350