Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

olm,catalog: only validate the resources we label #3165

Conversation

stevekuznetsov
Copy link
Member

*: don't duplicate owner references

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


olm,catalog: only validate the resources we label

Each controller can see (and, therefore, can label) a separate set of
GVRs. When we start up, detecting if any OLM-related resource needs
labelling means that one controller may start, detect a need for
labelling a resource it cannot itself label, detect that it's labelled
everything it can, and restart. If the other operator is stuck for
whatever reason, this leads the first controller to enter
CrashLoopBackOff and break OCP upgrade.

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Each controller can see (and, therefore, can label) a separate set of
GVRs. When we start up, detecting if any OLM-related resource needs
labelling means that one controller may start, detect a need for
labelling a resource it cannot itself label, detect that it's labelled
everything it can, and restart. If the other operator is stuck for
whatever reason, this leads the first controller to enter
CrashLoopBackOff and break OCP upgrade.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov stevekuznetsov changed the title olm,catalog: only validate the resources we label OCPBUGS-28744: olm,catalog: only validate the resources we label Jan 31, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 31, 2024
@openshift-ci-robot
Copy link
Collaborator

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-28744, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

*: don't duplicate owner references

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


olm,catalog: only validate the resources we label

Each controller can see (and, therefore, can label) a separate set of
GVRs. When we start up, detecting if any OLM-related resource needs
labelling means that one controller may start, detect a need for
labelling a resource it cannot itself label, detect that it's labelled
everything it can, and restart. If the other operator is stuck for
whatever reason, this leads the first controller to enter
CrashLoopBackOff and break OCP upgrade.

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ncdc ncdc changed the title OCPBUGS-28744: olm,catalog: only validate the resources we label olm,catalog: only validate the resources we label Jan 31, 2024
@openshift-ci-robot openshift-ci-robot removed the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 31, 2024
@openshift-ci-robot
Copy link
Collaborator

@stevekuznetsov: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

*: don't duplicate owner references

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


olm,catalog: only validate the resources we label

Each controller can see (and, therefore, can label) a separate set of
GVRs. When we start up, detecting if any OLM-related resource needs
labelling means that one controller may start, detect a need for
labelling a resource it cannot itself label, detect that it's labelled
everything it can, and restart. If the other operator is stuck for
whatever reason, this leads the first controller to enter
CrashLoopBackOff and break OCP upgrade.

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 31, 2024
@ncdc
Copy link
Member

ncdc commented Jan 31, 2024

@stevekuznetsov I removed the downstream jira from the title of this upstream PR

@ncdc
Copy link
Member

ncdc commented Jan 31, 2024

LGTM. Probably need to investigate e2e failures to see if they're flakes or due to this PR.

@joelanford PTAL at the PR - thx

@stevekuznetsov stevekuznetsov force-pushed the skuznets/fix-owners-and-labels branch 2 times, most recently from 6a49102 to 04231c1 Compare January 31, 2024 16:03
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@ncdc
Copy link
Member

ncdc commented Jan 31, 2024

Still LGTM. Joe?

@stevekuznetsov
Copy link
Member Author

 • [FAILED] [73.452 seconds]
Operator Condition [It] OperatorCondition Upgradeable type and overrides
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/operator_condition_e2e_test.go:35

  Captured StdOut/StdErr Output >>
  waiting 104.190706ms for catalog pod operator-conditions-e2e-2j9rz/catalog-2ktfc to be available (for sync) - NO_CONNECTION
  waiting 1.009379511s for catalog pod operator-conditions-e2e-2j9rz/catalog-2ktfc to be available (for sync) - IDLE
  waiting 396.888151ms for catalog pod operator-conditions-e2e-2j9rz/catalog-2ktfc to be available (for sync) - CONNECTING
  waiting 19.600061548s for catalog pod operator-conditions-e2e-2j9rz/catalog-2ktfc to be available (for sync) - TRANSIENT_FAILURE
  waiting 22.600388182s for catalog pod operator-conditions-e2e-2j9rz/catalog-2ktfc to be available (for sync) - READY
  probing catalog catalog-2ktfc pod with address catalog-2ktfc.operator-conditions-e2e-2j9rz.svc:50051
  skipping health check
  waiting 189.477209ms for catalog pod operator-conditions-e2e-2j9rz/catalog-2ktfc to be available (for sync) - READY
  probing catalog catalog-2ktfc pod with address catalog-2ktfc.operator-conditions-e2e-2j9rz.svc:50051
  skipping health check
  waited 188.344521ms for catalog pod catalog-2ktfc to be available (after catalog update) - READY
  waited 1.600263746s for catalog pod catalog-2ktfc to be available (after catalog update) - IDLE
  waited 1.200102082s for catalog pod catalog-2ktfc to be available (after catalog update) - TRANSIENT_FAILURE
  catalog updated
  waiting 199.550634ms for catalog pod operator-conditions-e2e-2j9rz/catalog-2ktfc to be available (for sync) - READY
  probing catalog catalog-2ktfc pod with address catalog-2ktfc.operator-conditions-e2e-2j9rz.svc:50051
  skipping health check
  Using the kubectl kubectl binary
  Using the artifacts/operator-conditions-e2e-2j9rz output directory
  Storing the test artifact output in the artifacts/operator-conditions-e2e-2j9rz directory
  Collecting get catalogsources -o yaml output...
  Collecting get subscriptions -o yaml output...
  Collecting get operatorgroups -o yaml output...
  Collecting get clusterserviceversions -o yaml output...
  Collecting get installplans -o yaml output...
  Collecting get pods -o wide output...
  Collecting get events --sort-by .lastTimestamp output...
  << Captured StdOut/StdErr Output

  Timeline >>
  created the operator-conditions-e2e-2j9rz testing namespace
  created the operator-conditions-e2e-2j9rz/operator-conditions-e2e-2j9rz-operatorgroup operator group
  STEP: This test proves that an operator can upgrade successfully when Upgrade condition type is set in OperatorCondition spec. Plus, an operator chooses not to use OperatorCondition, the upgrade process will proceed as expected. The overrides spec in OperatorCondition can be used to override the conditions spec. The overrides spec will remain in place until they are unset. @ 01/31/24 16:28:03.84
  STEP: Create a catalog for csvA, csvB, and csvD @ 01/31/24 16:28:03.84
  STEP: Create an nginx details deployment @ 01/31/24 16:28:03.84
  STEP: Create an nginx details deployment @ 01/31/24 16:28:03.84
  STEP: Create an nginx details deployment @ 01/31/24 16:28:03.84
  STEP: set a simple default strategy if none given @ 01/31/24 16:28:03.84
  STEP: Populate owned and required @ 01/31/24 16:28:03.84
  STEP: set a simple default strategy if none given @ 01/31/24 16:28:03.84
  STEP: Populate owned and required @ 01/31/24 16:28:03.84
  STEP: set a simple default strategy if none given @ 01/31/24 16:28:03.84
  STEP: Populate owned and required @ 01/31/24 16:28:03.84
  STEP: Create the initial catalogsources @ 01/31/24 16:28:03.84
  Creating catalog source catalog-2ktfc in namespace operator-conditions-e2e-2j9rz...
  Catalog source catalog-2ktfc created
  STEP: Await csvA's success @ 01/31/24 16:28:47.779
  waiting for CSV operator-conditions-e2e-2j9rz/a-rhlbb-stable to reach condition
  error getting csv operator-conditions-e2e-2j9rz/a-rhlbb-stable: clusterserviceversions.operators.coreos.com "a-rhlbb-stable" not found
  waited 2.998616915s for csv operator-conditions-e2e-2j9rz/a-rhlbb-stable - Installing (InstallWaiting): installing: waiting for deployment a-rhlbb-stable to become ready: deployment "a-rhlbb-stable" not available: Deployment does not have minimum availability.
  waited 3m33.783553067s for CSV operator-conditions-e2e-2j9rz/a-rhlbb-stable: to be in phases [Succeeded], in phase Installing (InstallWaiting): installing: waiting for deployment a-rhlbb-stable to become ready: deployment "a-rhlbb-stable" not available: Deployment does not have minimum availability.
  waited 1.600759552s for csv operator-conditions-e2e-2j9rz/a-rhlbb-stable - Succeeded (InstallSucceeded): install strategy completed with no errors
  waited 1.600762296s for CSV operator-conditions-e2e-2j9rz/a-rhlbb-stable: to be in phases [Succeeded], in phase Succeeded (InstallSucceeded): install strategy completed with no errors
  STEP: Get the OperatorCondition for csvA and report that it is not upgradeable @ 01/31/24 16:28:52.378
  STEP: Update the catalogsources @ 01/31/24 16:28:52.388
  STEP: Attempt to get the catalog source before creating install plan(s) @ 01/31/24 16:29:08.378
  STEP: csvB will be in Pending phase due to csvA reports Upgradeable=False condition @ 01/31/24 16:29:08.577
  waiting for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable to reach condition
  waited 201.341252ms for csv operator-conditions-e2e-2j9rz/b-nzcpj-stable - Pending (RequirementsUnknown): requirements not yet checked
  waited 201.349377ms for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable: to have reasons [OperatorConditionNotUpgradeable], in phase Pending (RequirementsUnknown): requirements not yet checked
  waited 408.946205ms for csv operator-conditions-e2e-2j9rz/b-nzcpj-stable - Pending (OperatorConditionNotUpgradeable): operator is not upgradeable: the operator is not upgradeable: test
  waited 408.950893ms for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable: to have reasons [OperatorConditionNotUpgradeable], in phase Pending (OperatorConditionNotUpgradeable): operator is not upgradeable: the operator is not upgradeable: test
  STEP: Get the OperatorCondition for csvA and report that it is upgradeable, unblocking csvB @ 01/31/24 16:29:09.188
  STEP: Await csvB's success @ 01/31/24 16:29:09.2
  waiting for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable to reach condition
  waited 178.484551ms for csv operator-conditions-e2e-2j9rz/b-nzcpj-stable - Pending (OperatorConditionNotUpgradeable): operator is not upgradeable: the operator is not upgradeable: test
  waited 17.00064196s for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable: to be in phases [Succeeded], in phase Pending (OperatorConditionNotUpgradeable): operator is not upgradeable: the operator is not upgradeable: test
  waited 435.567688ms for csv operator-conditions-e2e-2j9rz/b-nzcpj-stable - InstallReady (AllRequirementsMet): all requirements found, attempting install
  waited 435.572839ms for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable: to be in phases [Succeeded], in phase InstallReady (AllRequirementsMet): all requirements found, attempting install
  waited 563.877107ms for csv operator-conditions-e2e-2j9rz/b-nzcpj-stable - Installing (InstallSucceeded): waiting for install components to report healthy
  waited 563.87914ms for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable: to be in phases [Succeeded], in phase Installing (InstallSucceeded): waiting for install components to report healthy
  waited 400.920194ms for csv operator-conditions-e2e-2j9rz/b-nzcpj-stable - Installing (InstallWaiting): installing: waiting for deployment b-nzcpj-stable to become ready: deployment "b-nzcpj-stable" not available: Deployment does not have minimum availability.
  waited 400.922819ms for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable: to be in phases [Succeeded], in phase Installing (InstallWaiting): installing: waiting for deployment b-nzcpj-stable to become ready: deployment "b-nzcpj-stable" not available: Deployment does not have minimum availability.
  waited 1.600221574s for csv operator-conditions-e2e-2j9rz/b-nzcpj-stable - Succeeded (InstallSucceeded): install strategy completed with no errors
  waited 1.600224088s for CSV operator-conditions-e2e-2j9rz/b-nzcpj-stable: to be in phases [Succeeded], in phase Succeeded (InstallSucceeded): install strategy completed with no errors
  STEP: Get the OperatorCondition for csvB and purposedly change ObservedGeneration @ 01/31/24 16:29:12.38
  STEP: to cause mismatch generation situation @ 01/31/24 16:29:12.38
  Cleaning catalog source catalog-2ktfc
  Deleting config map catalog-2ktfc-configmap...
  Deleting catalog source catalog-2ktfc...
  waiting for the catalog source catalog-2ktfc-bjvvk pod to be deleted...
  Done cleaning catalog source catalog-2ktfc
  [FAILED] in [It] - /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/operator_condition_e2e_test.go:164 @ 01/31/24 16:29:13.79
  collecting the operator-conditions-e2e-2j9rz namespace artifacts as the 'OperatorCondition Upgradeable type and overrides' test case failed
  collecting logs in the ./artifacts/ artifacts directory
  tearing down the operator-conditions-e2e-2j9rz namespace
  resetting e2e kube client
  deleting operator-conditions-e2e-2j9rz/operator-conditions-e2e-2j9rz-operatorgroup
  deleting <global>/operator-conditions-e2e-2j9rz
  deleting package-manifest-e2epjzjc/package-manifest-e2epjzjc-operatorgroup
  deleting <global>/package-manifest-e2epjzjc
  deleting package-manifest-e2eq5fzh/package-manifest-e2eq5fzh-operatorgroup
  deleting <global>/package-manifest-e2eq5fzh
  deleting package-manifest-e2elldcm/package-manifest-e2elldcm-operatorgroup
  deleting <global>/package-manifest-e2elldcm
  deleting package-manifest-e2esf77w/package-manifest-e2esf77w-operatorgroup
  deleting <global>/package-manifest-e2esf77w
  deleting package-manifest-e2e88vf4/package-manifest-e2e88vf4-operatorgroup
  deleting <global>/package-manifest-e2e88vf4
  garbage collecting CRDs
  deleting crd a-rhlbbx7ntt.cluster.com
  << Timeline

  [FAILED] Timed out after 0.101s.
  Expected success, but got an error:
      <*errors.StatusError | 0xc0010e5a40>: 
      Operation cannot be fulfilled on operatorconditions.operators.coreos.com "b-nzcpj-stable": the object has been modified; please apply your changes to the latest version and try again
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on operatorconditions.operators.coreos.com \"b-nzcpj-stable\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "b-nzcpj-stable",
                  Group: "operators.coreos.com",
                  Kind: "operatorconditions",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
  In [It] at: /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/operator_condition_e2e_test.go:164 @ 01/31/24 16:29:13.79

Can't grok how that could be our fault

Comment on lines +289 to +291
if ref.Kind == ownerRef.Kind && ref.Name == ownerRef.Name && ref.UID == ownerRef.UID {
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance that we call this function with the same object, but with different blockOwnerDeletion or isController values?

Curious if we find the owner ref, rather than leaving what's there untouched, we should update what's there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe the codebase does any of that today. Generally speaking I would not expect isController to be something that logically could change.

Copy link

openshift-ci bot commented Feb 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2024
@tmshort
Copy link
Contributor

tmshort commented Feb 1, 2024

/retest

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Feb 2, 2024
Merged via the queue into operator-framework:master with commit 1e0324c Feb 2, 2024
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants