[4.7] Bug 2073350: Lookup reject ACLs found at startup when removing reject ACLs for services [alternative] #1211
Conversation
@ricky-rav: No Bugzilla bug is referenced in the title of this pull request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ricky-rav. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest-required
/retest-required
/retest-required
@ricky-rav: The following tests failed, say /retest to rerun all failed tests:
```diff
@@ -204,13 +208,17 @@ func (ovn *Controller) syncServices(services []interface{}) {
 				foundSwitches)
 			ovn.removeACLFromNodeSwitches(foundSwitches, uuid)
 		}
+	} else {
+		OVNRejectACLsAtStartup[name] = uuid
```
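For context, here is a hedged sketch of how this recording step might sit inside the startup sync; aside from `OVNRejectACLsAtStartup` and `removeACLFromNodeSwitches`, which appear in the hunk above, every name (`aclIsStale`, `syncRejectACLs`, the function shape) is an assumption, not the actual release-4.7 code:

```go
// Hedged sketch of the sync loop around the hunk above, not the real code.
package ovnsketch

// OVNRejectACLsAtStartup holds the reject ACLs found in OVN at startup that
// are still valid, keyed by ACL name (as in the hunk above).
var OVNRejectACLsAtStartup = map[string]string{}

// Controller is a stub for the ovn-kubernetes master controller.
type Controller struct{}

// removeACLFromNodeSwitches is a stub; in ovn-kubernetes it detaches the ACL
// from every node switch it is applied to.
func (ovn *Controller) removeACLFromNodeSwitches(switches []string, uuid string) {}

// aclIsStale is an assumed predicate: true when the service behind the ACL
// no longer needs it (or no longer exists).
func aclIsStale(name string) bool { return false }

// syncRejectACLs sketches the loop around the hunk: stale ACLs are removed,
// while valid ones are remembered so the deletion path can still find them
// after endpoints appear.
func (ovn *Controller) syncRejectACLs(aclsInOVN map[string]string, foundSwitches []string) {
	for name, uuid := range aclsInOVN {
		if aclIsStale(name) {
			ovn.removeACLFromNodeSwitches(foundSwitches, uuid)
		} else {
			OVNRejectACLsAtStartup[name] = uuid
		}
	}
}
```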
I don't understand how the sync code above would fail to catch a service that has endpoints but a stale ACL. What is different about your corner case?
So the sync code is correct: it removes stale ACLs at startup. What happened in the customer's cluster is that the endpoints were added after the sync code ran and before the service was created, so the code never kept track of the existing reject ACL once syncServices had finished. In particular:

- the reject ACL was created in the previous execution of ovnkube-master;
- after restart, the sync code correctly keeps this reject ACL, since the service still has no endpoints;
- the backend pods for this service are added;
- the service is (re)created along with its endpoints;
- since the service now has endpoints, we do not hit `func (ovn *Controller) createLoadBalancerRejectACL(lb, sourceIP string, sourcePort int32, proto kapi.Protocol) (string, error)`; we try instead to find an existing reject ACL to remove, but `serviceLBMap` contains no reference to the ACL that's actually in OVN, so the ACL is not removed (see the sketch below): https://github.com/openshift/ovn-kubernetes/blob/release-4.7/go-controller/pkg/ovn/loadbalancer.go#L313-L317
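To make the miss concrete, here is a hedged sketch of the cache shape and the lookup described above, in the same illustrative package as the earlier sketch; the field and function names are stand-ins, not the actual identifiers in loadbalancer.go:

```go
package ovnsketch

// lbConf is an illustrative per-VIP entry in the serviceLBMap cache
// described above; the real struct in go-controller/pkg/ovn differs.
type lbConf struct {
	endpoints []string // backend endpoints behind the VIP
	rejectACL string   // UUID of the reject ACL, "" if none was recorded
}

// serviceLBMap: load balancer UUID -> service VIP -> cached config.
var serviceLBMap = map[string]map[string]*lbConf{}

// getACLFromServiceLBMap mimics the lookup linked above: it returns the
// cached reject ACL for a VIP, or "" on a cache miss. In the corner case
// described in this thread it returns "" even though a reject ACL for the
// VIP still exists in OVN: the ACL was created by a previous ovnkube-master
// run and endpoints arrived before the service was re-created, so the ACL
// was never written into serviceLBMap.
func getACLFromServiceLBMap(lb, vip string) string {
	if vips, ok := serviceLBMap[lb]; ok {
		if conf, ok := vips[vip]; ok {
			return conf.rejectACL
		}
	}
	return ""
}
```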
@ricky-rav: This pull request references Bugzilla bug 2073350, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
Requesting review from QA contact.
@ricky-rav: This pull request references Bugzilla bug 2073350, which is valid. 6 validation(s) were run on this bug. Requesting review from QA contact.
/bugzilla refresh
@ricky-rav: This pull request references Bugzilla bug 2073350, which is valid. 6 validation(s) were run on this bug. Requesting review from QA contact.
/retest-failed
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now, please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now, please do so with /close.
/lifecycle rotten
@ricky-rav: This pull request references Bugzilla bug 2073350. The bug has been updated to no longer refer to the pull request using the external bug tracker.
[alternative solution to https://github.com/openshift/ovn-kubernetes/pull/1208, avoiding extra ovn-nbctl calls]
The serviceLBMap cache stores, for each load balancer, a service VIP along with its endpoints and reject ACL (if any). The reject ACL is added to this cache either (1) if it was created in the current execution of ovnkube-master or (2) if it was created in a previous execution of ovnkube-master and the service still needs a reject ACL (because it has no endpoints).

Now, if the reject ACL was created in the previous execution of ovnkube-master, and after restart ovnkube-master first adds the backend pods for this service and only afterwards (re)creates the service, then the existing reject ACL goes unnoticed. Endpoints are added correctly, but traffic to this service is dropped because of this reject ACL.

To fix this corner case, keep a list of the valid reject ACLs found at startup and look up this list too when trying to delete a reject ACL for a service with endpoints.
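Concretely, a minimal sketch of this lookup-with-fallback, reusing the names from the sketches earlier in the thread; `rejectACLName`, the method name, and the signature are assumptions, and the real deletion path lives in go-controller/pkg/ovn/loadbalancer.go:

```go
package ovnsketch

import "fmt"

// rejectACLName is a hypothetical helper deriving the ACL name from the
// load balancer and VIP; it stands in for however the real code keys ACLs.
func rejectACLName(lb, vip string) string {
	return fmt.Sprintf("%s-%s", lb, vip)
}

// deleteRejectACL sketches the fix: try the in-memory cache first and, if it
// misses, fall back to the reject ACLs recorded at startup, so that an ACL
// created by a previous ovnkube-master run is still removed once the service
// has endpoints.
func (ovn *Controller) deleteRejectACL(lb, vip string, switches []string) {
	uuid := getACLFromServiceLBMap(lb, vip)
	if uuid == "" {
		// Fallback added by this PR: the cache misses when the ACL predates
		// this run and endpoints appeared before the service was re-created.
		uuid = OVNRejectACLsAtStartup[rejectACLName(lb, vip)]
	}
	if uuid == "" {
		return // no reject ACL to remove
	}
	ovn.removeACLFromNodeSwitches(switches, uuid)
	delete(OVNRejectACLsAtStartup, rejectACLName(lb, vip))
}
```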
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
closes #2073350