New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release-4.10] OCPBUGS-1454: Fix NP upgrade bug: unexpectedly found multiple equivalent ACLs (arp v/s arp||nd) #1269
[release-4.10] OCPBUGS-1454: Fix NP upgrade bug: unexpectedly found multiple equivalent ACLs (arp v/s arp||nd) #1269
Conversation
NOTE: This is 4.10 only commit because we are missing openshift@df19112#diff-ee54e95b45c5a079454546fed94fcef68b13d9dc6cd14e192585fe2465bcaefcL46. Hence we are renaming the existing find acls by predicate to be used globally outside of acls.go while avoiding to bring any other changes from the libovsdb cleanup Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@tssurya: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@tssurya: This pull request references Jira Issue OCPBUGS-1454, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@tssurya: This pull request references Jira Issue OCPBUGS-1454, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
openshift/origin#27429 needs to pass first |
/lgtm |
/jira refresh |
@sdodson: This pull request references Jira Issue OCPBUGS-1454, which is valid. The bug has been moved to the POST state. 6 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold |
/retest |
This PR does two things: 1) Loops through all the ACLs that have `ARPallowPolicy` in their names and have " && arp" in their match expression and deletes it. This is because we did https://github.com/openshift/ovn-kubernetes/pull/1043/files where we changed the match but didn't remove acls on the older match which causes problems like: 2022-06-01T08:17:44.635401164Z E0601 08:17:44.634509 1 ovn.go:753] Failed to create network policy mdh-old/allow-from-other-namespaces, error: failed to create default port groups and acls for policy: mdh-old/allow-from-other-namespaces, error: unexpectedly found multiple equivalent ACLs: [{UUID:3bc36c0e-ee1a-4609-a240-c211f14f379b Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a13985064446031893020_ingressDefaultDeny && arp Meter:0xc003245210 Name:0xc003245220 Options:map[] Priority:1001 Severity:0xc003245230} {UUID:41702278-fb49-4418-abca-6425912d1e63 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a13985064446031893020_ingressDefaultDeny && (arp || nd) Meter:0xc0032452a0 Name:0xc0032452b0 Options:map[] Priority:1001 Severity:0xc0032452c0}] on upgrades 2) When we create the default deny ACLs for a namespace with at least one network policy, we name it "namespace_np-name". This doesn't make sense since the ACL is applicable to all the network policies in that namespace. The policy that get's created first becomes the lucky winner. This PR updates the names of default deny ACLs to "namespace_egressDefaultDeny" OR "namespace_ingressDefaultDeny" so that "we can stop being stupid" :) and stop errors like: 2022-06-09T17:15:00.952174381Z E0609 17:15:00.952154 1 ovn.go:753] Failed to create network policy oit-ssi-fluentd/fluentd-input, error: failed to create default port groups and acls for policy: oit-ssi-fluentd/fluentd-input, error: unexpectedly found multiple equivalent ACLs: [{UUID:5b98f17d-789f-4de1-9beb-36741bfa40d1 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a12933912868060780448_egressDefaultDeny Meter:0xc001388aa0 Name:0xc001388ab0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc001388ac0} {UUID:b9349cb8-99b0-4130-92cf-ab11b4874a11 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a12933912868060780448_egressDefaultDeny Meter:0xc001388c90 Name:0xc001388ca0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc001388cb0}] (NOTE: not sure how the user ended up with two default ACLs in the same namespace, but they did!) Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com> (cherry picked from commit 5fea420) Conflicts in 4.11: test/e2e/acl_logging.go had minor conflicts since openshift@e1f7da0 is missing in 4.11 (cherry picked from commit 2762c4d) Conflicts in 4.10: go-controller/pkg/ovn/policy.go openshift@b4738c7#diff-cc83e19af1c257d5a09b711d5977d8f8c20beb34b7b5d3eb37b2f2c53ded1bf7L116 is missing in 4.10 openshift@174aed7 is missing in 4.10 openshift@df19112#diff-ee54e95b45c5a079454546fed94fcef68b13d9dc6cd14e192585fe2465bcaefcL46 is missing in 4.10
We weren't removing reference of ACL from PG before deleting it. Unfortuntely unit tests don't catch this. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com> (cherry picked from commit 4d446ba) (cherry picked from commit de6258c) Conflicts in 4.10: go-controller/pkg/ovn/policy.go because openshift@df19112#diff-ee54e95b45c5a079454546fed94fcef68b13d9dc6cd14e192585fe2465bcaefcR189 is missing (same for port groups) NOTE: Instead of bringing down openshift@5e78ad5 which had many conflicts, we simply use the DeleteACLFromPortGroupOps present in 4.10 to avoid complications. So that commit is squashed into this commit
We didn't consider this tiny hack that we do https://github.com/openshift/ovn-kubernetes/blob/44ad75466e486cce605e39513a3ecd9e0b306e7d/go-controller/pkg/libovsdbops/acl.go#L60 there when we wrote openshift#1259 and unit tests don't actually scream loudly for longer names. Without this PR we break users who have namespace names longer than 45 characters. NOTE: In 4.11 openshift@44ad754#diff-5b9331449265f93da0d6fac90800eb68b7cb28f72b3f55eb01ad7026fc2b6089R68 is missing so we are directly using libovsdbops.BuildACL. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com> Co-authored-by: Patryk Diak <pdiak@redhat.com> (cherry picked from commit b10ed7b) (cherry picked from commit 54381b0) (cherry picked from commit a039f90) Conflicts in 4.10: go-controller/pkg/ovn/policy.go had to remove passing `ACL.options` into BuildACL because in 4.10 we have: openshift@9b723ec versus in 4.11 we had: openshift@14bf1f2 So the function signature for libovsdbops.BuildACL has changed between versions.
Initially when the commit to cleanup ACLs was done in ovn-org/ovn-kubernetes#3038 I had assumed that ACL names weren't truncated to fit the 63 char length and thought they were like other fields. Based on that assumption there were two places in code where I used a predicate match based on acl names which was plain stupidity. This PR fixes that: 1) When the arp && arp || nd bug was being cleaned up, I was trying to be extra careful in matching on old exp: "arp" and the acl name. I deleted the second part of that logic since there can be cases where the entire suffix is missing like when the namespace is 63 chars. So let us stick to matching on acl.Match alone 2) When the deny default ACL duplicates per namespace were being updated - again I was matching on ACL names to exclude the arp policies. I replaced this with using acl.Match which is more reliable and accurate. In ovn-org/ovn-kubernetes#3181 we realised we had to trim the acls and we fixed that, but didn't fix the above two issues. So let's fix that up and also I added a unit test case with a long named namespace. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com> (cherry picked from commit 6f60497) Conflicts in 4.11: go-controller/pkg/ovn/policy_test.go because we are missing https://github.com/ovn-org/ovn-kubernetes/pull/3161/files#diff-2a5ea421d13f2f639ad217cada1f5c741e9ba6d68abcf431a700c0f68b71e04dR1729 (cherry picked from commit 4d5d7d6) Conflicts in 4.10 since we are not backporting openshift@5e78ad5: go-controller/pkg/ovn/policy.go go-controller/pkg/ovn/policy_test.go
@anuragthehatter For the 4.10 we will need to do all the tests :D
|
/retest |
/hold cancel |
/retest |
/test 4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade |
hmm the stable upgrade fails and everytime its a new alert problem!
|
/test 4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade |
QE testing looks good on all 3 trim ACL cases and |
thanks Anurag! are we also good on the arp versus arp || nd duplicate ACLs bug? |
@tssurya: GitHub didn't allow me to assign the following users: for, the, labels. Note that only openshift members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jcaamano, npinaeva, tssurya The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @anuragthehatter |
/label cherry-pick-approved |
/retest-required |
/retest |
@tssurya: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@tssurya: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-1454 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Cherry-picked from #1259.
Conflicts are outlined in commit message for each commit.