NO-ISSUE: Synchronize From Upstream Repositories#1306
Conversation
This commit improves the handling and messaging of bundle unpack failures in OLM to provide better user experience and follows Kubernetes controller best practices. Key improvements: 1. User-friendly error messages - Provide clear, actionable error messages for bundle unpack failures - Include common causes and troubleshooting steps - Add auto-retry information and manual remediation commands 2. Reduced etcd payload bloat - Keep subscription condition messages concise - Emit detailed troubleshooting guidance as Kubernetes Events - Prevents storing large 329-char strings repeatedly in etcd 3. Prevent duplicate guidance in messages - Check if guidance already exists before appending - Avoids message duplication from underlying conditions 4. Fix variable shadowing - Rename inner 'cond' to 'unpackingCond' for clarity - Improves code readability and prevents confusion 5. Prevent queue churn from repeated requeues - Track state transitions with isNewFailure flag - Only schedule delayed requeue on new failures - Prevents repeated AddAfter calls for persistent failures 6. Use exact constant comparison for JobIncompleteReason - Replace strings.Contains() with bundle.JobIncompleteReason - More deterministic and type-safe - Prevents matching unintended substring values 7. Improve test maintainability - Use substring assertions instead of exact message matching - Tests won't break on minor wording/punctuation changes - Verify key components: prefix, reason, error, guidance All tests pass. Changes follow Kubernetes controller best practices. Co-authored-by: Rohit Patil <ropatil@ropatil-thinkpadp16vgen1.bengluru.csb> Upstream-repository: operator-lifecycle-manager Upstream-commit: 4aab00c4e145519e53495801d995d5ec3d6d2b1e
|
@openshift-bot: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughBundle unpack job automation now includes automatic cleanup via TTL (300 seconds), and error/pending diagnostic messages enriched with targeted troubleshooting guidance. Subscription status updates track new failures separately and emit detailed warning events. ChangesBundle Unpack Job TTL and Enhanced Diagnostics
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED Approval requirements bypassed by manually added approval. This pull-request has been approved by: openshift-bot The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
1 similar comment
|
[APPROVALNOTIFIER] This PR is APPROVED Approval requirements bypassed by manually added approval. This pull-request has been approved by: openshift-bot The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
staging/operator-lifecycle-manager/pkg/controller/operators/catalog/subscriptions_test.go (1)
1327-1340: ⚡ Quick winAvoid index-based condition assertions in this test
Using
fetched.Status.Conditions[0]assumes stable ordering. Prefer fetchingSubscriptionBundleUnpackFailedby type, then asserting on that message.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@staging/operator-lifecycle-manager/pkg/controller/operators/catalog/subscriptions_test.go` around lines 1327 - 1340, The test is indexing Status.Conditions[0] which assumes stable ordering; instead, locate the condition with Type "SubscriptionBundleUnpackFailed" on both fetched and the expected s (e.g., iterate over fetched.Status.Conditions to find cond.Type == "SubscriptionBundleUnpackFailed") and copy that condition's Message into the matching expected condition before require.Equal, then run the require.Contains assertions against that found condition's Message. Update all uses of fetched.Status.Conditions[0] and s.Status.Conditions[0] in the "NoStatus/NoCurrentCSV/BundleUnpackFailed" case to use the condition lookup by Type so assertions no longer rely on array index ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@staging/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go`:
- Around line 1401-1410: The current isNewFailure check only schedules retries
when a SubscriptionBundleUnpackFailed condition flips to True, which prevents
further AddAfter retries once the condition stays True; change the logic that
computes isNewFailure to instead detect any subscription that currently has
SubscriptionBundleUnpackFailed == True (iterate subs and if existingCond.Status
== corev1.ConditionTrue set a flag), and call the existing AddAfter scheduling
path for those failing subscriptions even when the condition was already True so
retries continue; keep the rest of the existing AddAfter call/path unchanged
(use the same queue key and delay) and remove the branch that skips scheduling
solely because the failure is not "new".
---
Nitpick comments:
In
`@staging/operator-lifecycle-manager/pkg/controller/operators/catalog/subscriptions_test.go`:
- Around line 1327-1340: The test is indexing Status.Conditions[0] which assumes
stable ordering; instead, locate the condition with Type
"SubscriptionBundleUnpackFailed" on both fetched and the expected s (e.g.,
iterate over fetched.Status.Conditions to find cond.Type ==
"SubscriptionBundleUnpackFailed") and copy that condition's Message into the
matching expected condition before require.Equal, then run the require.Contains
assertions against that found condition's Message. Update all uses of
fetched.Status.Conditions[0] and s.Status.Conditions[0] in the
"NoStatus/NoCurrentCSV/BundleUnpackFailed" case to use the condition lookup by
Type so assertions no longer rely on array index ordering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a1febac9-4c7a-44b1-96ff-4aa178b1abab
⛔ Files ignored due to path filters (2)
vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/bundle/bundle_unpacker.gois excluded by!**/vendor/**,!vendor/**vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.gois excluded by!**/vendor/**,!vendor/**
📒 Files selected for processing (5)
staging/operator-lifecycle-manager/pkg/controller/bundle/bundle_unpacker.gostaging/operator-lifecycle-manager/pkg/controller/bundle/bundle_unpacker_test.gostaging/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.gostaging/operator-lifecycle-manager/pkg/controller/operators/catalog/subscriptions_test.gostaging/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go
| // Check if any subscription is transitioning to failed state (not already failed) | ||
| // to avoid scheduling redundant delayed requeues on every reconcile | ||
| isNewFailure := false | ||
| for _, sub := range subs { | ||
| existingCond := sub.Status.GetCondition(v1alpha1.SubscriptionBundleUnpackFailed) | ||
| if existingCond.Status != corev1.ConditionTrue { | ||
| isNewFailure = true | ||
| break | ||
| } | ||
| } |
There was a problem hiding this comment.
Retry scheduling can stop after subsequent failed attempts
isNewFailure only checks whether SubscriptionBundleUnpackFailed=True already exists. On later failed attempts, that stays true, so Line 1477 skips AddAfter, which can stop automatic retry progression after TTL cleanup.
💡 Suggested fix
- isNewFailure := false
+ isNewFailure := false
for _, sub := range subs {
- existingCond := sub.Status.GetCondition(v1alpha1.SubscriptionBundleUnpackFailed)
- if existingCond.Status != corev1.ConditionTrue {
+ existingFailed := sub.Status.GetCondition(v1alpha1.SubscriptionBundleUnpackFailed)
+ existingUnpacking := sub.Status.GetCondition(v1alpha1.SubscriptionBundleUnpacking)
+ // Treat transition from unpacking->failed as a new failure cycle.
+ if existingFailed.Status != corev1.ConditionTrue || existingUnpacking.Status == corev1.ConditionTrue {
isNewFailure = true
break
}
}Also applies to: 1475-1483
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@staging/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go`
around lines 1401 - 1410, The current isNewFailure check only schedules retries
when a SubscriptionBundleUnpackFailed condition flips to True, which prevents
further AddAfter retries once the condition stays True; change the logic that
computes isNewFailure to instead detect any subscription that currently has
SubscriptionBundleUnpackFailed == True (iterate subs and if existingCond.Status
== corev1.ConditionTrue set a flag), and call the existing AddAfter scheduling
path for those failing subscriptions even when the condition was already True so
retries continue; keep the rest of the existing AddAfter call/path unchanged
(use the same queue key and delay) and remove the branch that skips scheduling
solely because the failure is not "new".
|
@openshift-bot: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/label qe-approved |
|
@bandrade: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
328957c
into
openshift:main
The staging/ and vendor/ directories have been synchronized from the upstream repositories, pulling in the following commits:
This pull request is expected to merge without any human intervention. If tests are failing here, changes must land upstream to fix any issues so that future downstreaming efforts succeed.
/assign @openshift/openshift-team-operator-runtime