[release-4.10] OCPBUGS-1454: Fix NP upgrade bug: unexpectedly found multiple equivalent ACLs (arp v/s arp||nd) #1269

Merged
merged 5 commits into openshift:release-4.10 on Oct 28, 2022


tssurya
Contributor

@tssurya tssurya commented Sep 19, 2022

Cherry-picked from #1259.
Conflicts are outlined in commit message for each commit.

NOTE: This is a 4.10-only commit because we are missing
openshift@df19112#diff-ee54e95b45c5a079454546fed94fcef68b13d9dc6cd14e192585fe2465bcaefcL46.

Hence we are renaming the existing find-ACLs-by-predicate function
so that it can be used outside of acls.go, without bringing in
any other changes from the libovsdb cleanup.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@openshift-ci-robot openshift-ci-robot added the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Sep 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 19, 2022

@tssurya: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] OCPBUGS-1454: Fix NP upgrade bug: unexpectedly found multiple equivalent ACLs (arp v/s arp||nd)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Sep 19, 2022
@openshift-ci-robot
Contributor

@tssurya: This pull request references Jira Issue OCPBUGS-1454, which is invalid:

  • expected Jira Issue OCPBUGS-1454 to depend on a bug targeting a version in 4.11.0, 4.11.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Cherry-picked from #1259.
Conflicts are outlined in commit message for each commit.


@tssurya
Contributor Author

tssurya commented Sep 19, 2022

/jira refresh

@openshift-ci-robot
Contributor

@tssurya: This pull request references Jira Issue OCPBUGS-1454, which is invalid:

  • expected dependent Jira Issue OCPBUGS-772 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh


@openshift-ci openshift-ci bot requested review from abhat and dcbw September 19, 2022 10:43
@tssurya
Contributor Author

tssurya commented Sep 19, 2022

openshift/origin#27429 needs to pass first

@npinaeva
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 21, 2022
@sdodson
Member

sdodson commented Sep 21, 2022

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 21, 2022
@openshift-ci-robot
Contributor

@sdodson: This pull request references Jira Issue OCPBUGS-1454, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.10.z) matches configured target version for branch (4.10.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-772 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • dependent Jira Issue OCPBUGS-772 targets the "4.11.z" version, which is one of the valid target versions: 4.11.0, 4.11.z
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

/jira refresh


@tssurya
Contributor Author

tssurya commented Sep 26, 2022

/hold
till we fix https://issues.redhat.com/browse/OCPBUGS-1705

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2022
@tssurya
Contributor Author

tssurya commented Sep 26, 2022

/retest

This PR does two things:

1) Loops through all the ACLs that have `ARPallowPolicy`
in their names and " && arp" in their match expression
and deletes them. This is needed because in
https://github.com/openshift/ovn-kubernetes/pull/1043/files
we changed the match but didn't remove the ACLs that used the older match,
which causes problems like:
2022-06-01T08:17:44.635401164Z E0601 08:17:44.634509       1 ovn.go:753]
Failed to create network policy mdh-old/allow-from-other-namespaces,
error: failed to create default port groups and acls for policy:
mdh-old/allow-from-other-namespaces, error: unexpectedly found multiple
equivalent ACLs: [{UUID:3bc36c0e-ee1a-4609-a240-c211f14f379b
Action:allow Direction:to-lport
ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false
Match:outport == @a13985064446031893020_ingressDefaultDeny && arp
Meter:0xc003245210 Name:0xc003245220 Options:map[] Priority:1001
Severity:0xc003245230} {UUID:41702278-fb49-4418-abca-6425912d1e63
Action:allow Direction:to-lport
ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false
Match:outport == @a13985064446031893020_ingressDefaultDeny && (arp ||
nd) Meter:0xc0032452a0 Name:0xc0032452b0 Options:map[] Priority:1001
Severity:0xc0032452c0}]

on upgrades

2) When we create the default deny ACLs for a namespace with at least
one network policy, we name them "namespace_np-name". This doesn't make
sense since the ACLs apply to all the network policies in that
namespace. The policy that gets created first becomes the lucky winner.
This PR updates the names of default deny ACLs to
"namespace_egressDefaultDeny" OR "namespace_ingressDefaultDeny" so that
"we can stop being stupid" :) and stop errors like:

2022-06-09T17:15:00.952174381Z E0609 17:15:00.952154       1 ovn.go:753]
Failed to create network policy oit-ssi-fluentd/fluentd-input, error:
failed to create default port groups and acls for policy:
oit-ssi-fluentd/fluentd-input, error: unexpectedly found multiple
equivalent ACLs: [{UUID:5b98f17d-789f-4de1-9beb-36741bfa40d1 Action:drop
Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress]
Label:0 Log:false Match:inport ==
@a12933912868060780448_egressDefaultDeny Meter:0xc001388aa0
Name:0xc001388ab0 Options:map[apply-after-lb:true] Priority:1000
Severity:0xc001388ac0} {UUID:b9349cb8-99b0-4130-92cf-ab11b4874a11
Action:drop Direction:from-lport
ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false
Match:inport == @a12933912868060780448_egressDefaultDeny
Meter:0xc001388c90 Name:0xc001388ca0 Options:map[apply-after-lb:true]
Priority:1000 Severity:0xc001388cb0}]

(NOTE: not sure how the user ended up with two default ACLs in the same
namespace, but they did!)

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 5fea420)
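
A minimal sketch of the stale-ACL selection described in item 1 above, written against an in-memory slice rather than libovsdb. The `ACL` struct, `findACLsByPredicate`, `isStaleARPAllowACL`, and the sample names and port-group identifiers are all hypothetical stand-ins for illustration; the real code operates on the OVN northbound database and may differ in every detail.

```go
package main

import (
	"fmt"
	"strings"
)

// ACL is a hypothetical, trimmed-down stand-in for an OVN northbound ACL row.
type ACL struct {
	UUID  string
	Name  string
	Match string
}

// findACLsByPredicate returns every ACL for which pred reports true.
// (Illustrative helper; not the signature used in ovn-kubernetes.)
func findACLsByPredicate(acls []ACL, pred func(ACL) bool) []ACL {
	var out []ACL
	for _, a := range acls {
		if pred(a) {
			out = append(out, a)
		}
	}
	return out
}

// isStaleARPAllowACL selects ACLs that carry "ARPallowPolicy" in their name
// and still use the old " && arp" match. The newer " && (arp || nd)" match
// does not contain the substring " && arp", so it is left alone.
func isStaleARPAllowACL(a ACL) bool {
	return strings.Contains(a.Name, "ARPallowPolicy") &&
		strings.Contains(a.Match, " && arp")
}

func main() {
	// Made-up sample rows modelled on the log excerpt above.
	acls := []ACL{
		{UUID: "old", Name: "ns1_ARPallowPolicy", Match: "outport == @pg1_ingressDefaultDeny && arp"},
		{UUID: "new", Name: "ns1_ARPallowPolicy", Match: "outport == @pg1_ingressDefaultDeny && (arp || nd)"},
	}
	for _, a := range findACLsByPredicate(acls, isStaleARPAllowACL) {
		fmt.Println("stale ACL to delete:", a.UUID)
	}
}
```

Keeping the selection logic in a plain predicate makes it easy to unit test without a database, which is the only design point this sketch is meant to illustrate.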

Conflicts in 4.11:
  test/e2e/acl_logging.go had minor conflicts since
openshift@e1f7da0
 is missing in 4.11

(cherry picked from commit 2762c4d)

Conflicts in 4.10:
	go-controller/pkg/ovn/policy.go
 openshift@b4738c7#diff-cc83e19af1c257d5a09b711d5977d8f8c20beb34b7b5d3eb37b2f2c53ded1bf7L116
is missing in 4.10
        openshift@174aed7
is missing in 4.10
        openshift@df19112#diff-ee54e95b45c5a079454546fed94fcef68b13d9dc6cd14e192585fe2465bcaefcL46
is missing in 4.10

We weren't removing the reference to the ACL from the PG before deleting
it. Unfortunately unit tests don't catch this.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 4d446ba)
(cherry picked from commit de6258c)

Conflicts in 4.10:
	go-controller/pkg/ovn/policy.go
        because
openshift@df19112#diff-ee54e95b45c5a079454546fed94fcef68b13d9dc6cd14e192585fe2465bcaefcR189
is missing (same for port groups)

NOTE: Instead of bringing down openshift@5e78ad5
which had many conflicts, we simply use the
DeleteACLFromPortGroupOps present in 4.10 to avoid
complications. So that commit is squashed into this commit

We didn't consider the tiny hack at
https://github.com/openshift/ovn-kubernetes/blob/44ad75466e486cce605e39513a3ecd9e0b306e7d/go-controller/pkg/libovsdbops/acl.go#L60
when we wrote openshift#1259,
and unit tests don't actually scream loudly for longer names.

Without this PR we break users who have namespace names
longer than 45 characters.
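
To illustrate why long namespace names matter, here is a rough sketch that assumes ACL names are built as `<namespace>_<suffix>` and cut to 63 characters by simple prefix truncation. The `aclName` helper and `maxACLNameLen` constant are hypothetical; the actual truncation lives in the linked libovsdbops code and may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// maxACLNameLen is the assumed cap on ACL name length.
const maxACLNameLen = 63

// aclName joins namespace and suffix with "_" and truncates the result
// to maxACLNameLen bytes (hypothetical helper, for illustration only).
func aclName(namespace, suffix string) string {
	name := namespace + "_" + suffix
	if len(name) > maxACLNameLen {
		return name[:maxACLNameLen]
	}
	return name
}

func main() {
	for _, n := range []int{45, 46, 63} {
		ns := strings.Repeat("a", n)
		name := aclName(ns, "egressDefaultDeny")
		fmt.Printf("namespace length %d: name length %d, suffix intact: %t\n",
			n, len(name), strings.HasSuffix(name, "_egressDefaultDeny"))
	}
}
```

Under these assumptions the 18-character `_egressDefaultDeny` suffix leaves 45 characters for the namespace, so anything longer loses part or all of the suffix.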

NOTE: In 4.11
openshift@44ad754#diff-5b9331449265f93da0d6fac90800eb68b7cb28f72b3f55eb01ad7026fc2b6089R68
is missing so we are directly using libovsdbops.BuildACL.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Co-authored-by: Patryk Diak <pdiak@redhat.com>
(cherry picked from commit b10ed7b)
(cherry picked from commit 54381b0)
(cherry picked from commit a039f90)

Conflicts in 4.10:
	go-controller/pkg/ovn/policy.go
had to remove passing `ACL.options` into BuildACL because in 4.10
we have: openshift@9b723ec
versus in 4.11 we had: openshift@14bf1f2
So the function signature for libovsdbops.BuildACL has changed between
versions.

Initially, when the commit to clean up ACLs was done in
ovn-org/ovn-kubernetes#3038, I had assumed
that ACL names weren't truncated to fit the 63-character limit and
thought they were like other fields. Based on that assumption, there
were two places in the code where I used a predicate match based on
ACL names, which was plain stupidity. This PR fixes that:

1) When the arp vs. (arp || nd) bug was being cleaned up, I was
trying to be extra careful by matching on both the old expression "arp" and
the ACL name. I deleted the second part of that logic since
there can be cases where the entire suffix is missing, like
when the namespace is 63 chars long. So let us stick to matching
on acl.Match alone.

2) When the default deny ACL duplicates per namespace were
being updated, I was again matching on ACL names to exclude
the ARP policies. I replaced this with acl.Match, which
is more reliable and accurate.

In ovn-org/ovn-kubernetes#3181 we realised
we had to trim the ACL names and we fixed that, but didn't fix the
above two issues. So let's fix that up; I also added a unit
test case with a long-named namespace.
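
The difference between the two matching strategies can be sketched as follows. This is an illustration only: the `ACL` struct and both predicates are hypothetical, the port-group identifier is made up, and the sample name assumes that a 63-character namespace leaves no room for any suffix after truncation.

```go
package main

import (
	"fmt"
	"strings"
)

// ACL is a hypothetical, trimmed-down stand-in for an OVN northbound ACL row.
type ACL struct {
	Name  string
	Match string
}

// isARPAllowACLByName is the fragile variant: it relies on a name suffix
// that may have been truncated away.
func isARPAllowACLByName(a ACL) bool {
	return strings.Contains(a.Name, "ARPallowPolicy")
}

// isARPAllowACLByMatch looks only at the match expression, which is not
// subject to name truncation. It accepts both the old and the new form.
func isARPAllowACLByMatch(a ACL) bool {
	return strings.HasSuffix(a.Match, " && arp") ||
		strings.HasSuffix(a.Match, " && (arp || nd)")
}

func main() {
	// With a 63-character namespace, the truncated ACL name is just the namespace.
	longNS := strings.Repeat("n", 63)
	acl := ACL{Name: longNS, Match: "outport == @pg1_ingressDefaultDeny && (arp || nd)"}

	fmt.Println("detected by name: ", isARPAllowACLByName(acl))
	fmt.Println("detected by match:", isARPAllowACLByMatch(acl))
}
```

The name-based predicate misses this ACL while the match-based one still finds it, which is the failure mode the commit message describes.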

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 6f60497)

Conflicts in 4.11:
	go-controller/pkg/ovn/policy_test.go
because we are missing https://github.com/ovn-org/ovn-kubernetes/pull/3161/files#diff-2a5ea421d13f2f639ad217cada1f5c741e9ba6d68abcf431a700c0f68b71e04dR1729

(cherry picked from commit 4d5d7d6)

Conflicts in 4.10 since we are not backporting
openshift@5e78ad5:
	go-controller/pkg/ovn/policy.go
	go-controller/pkg/ovn/policy_test.go
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2022
@tssurya
Contributor Author

tssurya commented Oct 11, 2022

@anuragthehatter For the 4.10 we will need to do all the tests :D

  • We need to run the actual bug test for the case where the (arp) && (arp || nd) duplicate was introduced: an upgrade from 4.9.31 to any 4.10 version without this fix should introduce the bug, and then we test whether upgrading to this version with the fix resolves it.
  • Repeat the same to ensure the default ACL names are changed from <ns>_<np-name> to <ns>_<standard-suffix>
  • All the 3 trimACL corner cases we had for > 45 characters

@tssurya
Contributor Author

tssurya commented Oct 11, 2022

/retest

@tssurya
Contributor Author

tssurya commented Oct 12, 2022

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2022
@tssurya
Contributor Author

tssurya commented Oct 12, 2022

/retest

@tssurya
Contributor Author

tssurya commented Oct 13, 2022

/test 4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade

@tssurya
Contributor Author

tssurya commented Oct 13, 2022

Hmm, the stable upgrade fails and every time it's a new alert problem!

  1. last run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1269/pull-ci-openshift-ovn-kubernetes-release-4.10-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1580064435186176000 CsvAbnormalFailedOver2Min
  2. before that: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1269/pull-ci-openshift-ovn-kubernetes-release-4.10-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1579895858344759296

event happened 25 times, something is wrong: ns/openshift-config-operator pod/openshift-config-operator-7bc5884657-7gtj9 node/ip-10-0-155-62.us-west-2.compute.internal - reason/ProbeError Readiness probe error: Get "https://10.128.0.21:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body: 
}
  3. Before before that: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1269/pull-ci-openshift-ovn-kubernetes-release-4.10-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1579811067561775104 :
event happened 35 times, something is wrong: ns/openshift-config-operator pod/openshift-config-operator-7587c4f5f7-djgh8 node/ip-10-0-155-44.ec2.internal - reason/ProbeError Liveness probe error: Get "https://10.129.0.13:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body: 

event happened 25 times, something is wrong: ns/openshift-config-operator pod/openshift-config-operator-7587c4f5f7-djgh8 node/ip-10-0-155-44.ec2.internal - reason/ProbeError Readiness probe error: Get "https://10.129.0.13:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body: 
}

@tssurya
Contributor Author

tssurya commented Oct 14, 2022

/test 4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade

@anuragthehatter

QE testing looks good on all 3 trim ACL cases and outport == @a12181866158835321247_ingressDefaultDeny && (arp || nd) also looks good. Thanks

@tssurya
Contributor Author

tssurya commented Oct 25, 2022

thanks Anurag! are we also good on the arp versus arp || nd duplicate ACLs bug?
/assign @jcaamano for the labels

@openshift-ci
Contributor

openshift-ci bot commented Oct 25, 2022

@tssurya: GitHub didn't allow me to assign the following users: for, the, labels.

Note that only openshift members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

thanks Anurag! are we also good on the arp versus arp || nd duplicate ACLs bug?
/assign @jcaamano for the labels


@jcaamano
Contributor

/lgtm
/approve
/label backport-risk-assessed

@openshift-ci openshift-ci bot added backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. lgtm Indicates that a PR is ready to be merged. labels Oct 26, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 26, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcaamano, npinaeva, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2022
@trozet
Contributor

trozet commented Oct 26, 2022

/assign @anuragthehatter

@anuragthehatter

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Oct 26, 2022
@trozet
Contributor

trozet commented Oct 27, 2022

/retest-required

@tssurya
Contributor Author

tssurya commented Oct 27, 2022

/retest

@openshift-ci
Contributor

openshift-ci bot commented Oct 28, 2022

@tssurya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-windows a0fddff link false /test e2e-vsphere-windows
ci/prow/e2e-openstack-ovn a0fddff link false /test e2e-openstack-ovn
ci/prow/e2e-vsphere-ovn a0fddff link false /test e2e-vsphere-ovn


@openshift-merge-robot openshift-merge-robot merged commit 8f229d6 into openshift:release-4.10 Oct 28, 2022
@openshift-ci-robot
Contributor

@tssurya: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-1454 has been moved to the MODIFIED state.

In response to this:

Cherry-picked from #1259.
Conflicts are outlined in commit message for each commit.


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

8 participants