NO-JIRA: refactor(oadp): unify backup/restore e2e test with platform auto-detection #7971
Conversation
Pipeline controller notification: this repository is configured in LGTM mode.
@mgencur: This pull request explicitly references no jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Important: Review skipped — auto reviews are limited based on label configuration (only excluded labels are configured).
Note: Reviews paused — this branch is under active development, so CodeRabbit automatically paused this review to avoid an influx of review comments on new commits.
📝 Walkthrough
This pull request changes namespace deletion helpers in test/e2e/v2/backuprestore/cleanup.go to use timeout-based semantics (time.Duration) instead of boolean wait flags, implementing a two-phase delete: an initial short timeout, then finalizer stripping and a retry with a longer timeout when the first attempt times out. Public signatures for deleteNamespace, deleteControlPlaneNamespace, and deleteHostedClusterNamespace were updated. test/e2e/v2/backuprestore/cli.go increases the OIDC timeout from 20 to 30 minutes. test/e2e/v2/tests/backup_restore_test.go adds per-platform backup/restore configuration (excludeWorkloads, postRestoreHook, additionalNamespaces) and runtime platform gating.
Force-pushed 9bfbd1f to 9580326 (compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)
176-205: ⚠️ Potential issue | 🟠 Major — Add `IncludeNamespaces` to the restore options.
The test passes `platformCfg.additionalNamespaces` to both schedule and backup creation, but the restore options omit it. This inconsistency means those namespaces will be backed up but never restored. Update line 248 to include `IncludeNamespaces: platformCfg.additionalNamespaces`. The same fix applies to the second occurrence at lines 248–255.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/tests/backup_restore_test.go` around lines 176 - 205, The restore options are missing IncludeNamespaces so namespaces backed up by OADPSchedule/OADPBackup are not restored; update the OADPRestoreOptions instances used in this test to include IncludeNamespaces: platformCfg.additionalNamespaces (matching how OADPScheduleOptions and OADPBackupOptions are constructed), i.e., add IncludeNamespaces to the restore options near where backupOpts is created and to the second restore-options occurrence referenced in the test so both restores include platformCfg.additionalNamespaces.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/v2/backuprestore/cleanup.go`:
- Around line 76-94: The cleanup retry currently treats any error from
deleteControlPlaneNamespace/deleteHostedClusterNamespace as a timeout; change
the logic to only perform the finalizer-removal + retry when the returned error
is an actual timeout. Update the branches around deleteControlPlaneNamespace and
deleteHostedClusterNamespace to check the error with errors.Is for known timeout
signals (e.g. context.DeadlineExceeded or wait.ErrWaitTimeout) or a specific
sentinel timeout error returned by those functions, and return other errors
immediately. Use the same change for the analogous block later (lines ~302-345):
only call removeNamespaceObjectFinalizers and retry when the error indicates a
timeout, otherwise propagate the original error.
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 270-288: The NodePool Ready=False check is timing-sensitive and
can be missed after the long control-plane readiness waits; to fix it, start
observing the NodePool transition immediately after restore completion instead
of after
WaitForControlPlaneStatefulSetsReadiness/WaitForControlPlaneDeploymentsReadiness:
move the Eventually block that calls getNodePool and internal.ValidateConditions
(checking hyperv1.NodePoolReadyConditionType == metav1.ConditionFalse with
Reason "WaitingForAvailableMachines") to run right after restore completion, or
run it concurrently as a prober (goroutine) that polls with
backuprestore.PollInterval and backuprestore.OIDCTimeout using the same testCtx
so it can record the transient state while the subsequent calls to
internal.WaitForControlPlaneStatefulSetsReadiness and
internal.WaitForControlPlaneDeploymentsReadiness still run.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 988cdf1d-2a4b-4b10-aeba-3bfcba14b798
📒 Files selected for processing (3)
test/e2e/v2/backuprestore/cleanup.go, test/e2e/v2/backuprestore/cli.go, test/e2e/v2/tests/backup_restore_test.go
Force-pushed be896ff to b7a9415 (compare)
♻️ Duplicate comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)
270-288: ⚠️ Potential issue | 🟠 Major — NodePool transient-state check is still timing-sensitive.
At line 270 the test waits for control-plane readiness first, then at line 276 starts looking for `NodePoolReady=False` with reason `WaitingForAvailableMachines`. That state can occur earlier and be missed, causing flaky failures. Start observing this condition immediately after restore completion (or before the long readiness waits), then assert it was observed.
Suggested reorder (minimal change):

```diff
-By("Waiting for control plane statefulsets to be ready")
-err := internal.WaitForControlPlaneStatefulSetsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
-Expect(err).NotTo(HaveOccurred())
-By("Waiting for control plane deployments to be ready")
-err = internal.WaitForControlPlaneDeploymentsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
-Expect(err).NotTo(HaveOccurred())
 By("Waiting for NodePool to reach WaitingForAvailableMachines state")
 Eventually(func(g Gomega) {
 	nodePool, err := getNodePool(testCtx)
 	g.Expect(err).NotTo(HaveOccurred())
 	g.Expect(nodePool).NotTo(BeNil())
 	internal.ValidateConditions(g, nodePool, []util.Condition{
 		{
 			Type:   hyperv1.NodePoolReadyConditionType,
 			Status: metav1.ConditionFalse,
 			Reason: "WaitingForAvailableMachines",
 		},
 	})
 }).WithPolling(backuprestore.PollInterval).WithTimeout(backuprestore.OIDCTimeout).Should(Succeed())
+
+By("Waiting for control plane statefulsets to be ready")
+err := internal.WaitForControlPlaneStatefulSetsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
+Expect(err).NotTo(HaveOccurred())
+By("Waiting for control plane deployments to be ready")
+err = internal.WaitForControlPlaneDeploymentsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
+Expect(err).NotTo(HaveOccurred())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/tests/backup_restore_test.go` around lines 270 - 288, The transient NodePool condition check is placed after long control-plane readiness waits and can be missed; move the Eventually block that calls getNodePool and internal.ValidateConditions (asserting hyperv1.NodePoolReadyConditionType == metav1.ConditionFalse with Reason "WaitingForAvailableMachines") to run immediately after restore completion (i.e., before calling internal.WaitForControlPlaneStatefulSetsReadiness and internal.WaitForControlPlaneDeploymentsReadiness) so the test starts observing the transient state early, then continue with the existing readiness waits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: ea4fd520-2f00-4969-83ac-aaf0ac96d1a1
📒 Files selected for processing (2)
test/e2e/v2/backuprestore/cleanup.go, test/e2e/v2/tests/backup_restore_test.go
…anup Refactor deleteNamespace to accept a timeout duration instead of a boolean wait flag. Implement a two-pass finalizer removal strategy to handle controllers recreating resources during namespace deletion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed ecf5b0a to cd884dd (compare)
🧹 Nitpick comments (1)
test/e2e/v2/backuprestore/cleanup.go (1)
73-88: Consider extracting the duplicated timeout/retry flow into a helper. Step 6 and Step 7 now implement almost the same retry sequence. A shared helper would reduce drift and make future changes safer.
As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 90-102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/cleanup.go` around lines 73 - 88, Extract the duplicated timeout/retry sequence into a helper function (e.g., ensureNamespaceDeleted(ctx, namespace, logger)) that encapsulates: attempt deleteControlPlaneNamespace(ctx, shortTimeout), on DeadlineExceeded or wait.ErrWaitTimeout call removeNamespaceObjectFinalizers(ctx, namespace, logger) and retry deleteControlPlaneNamespace(ctx, DeletionTimeout) returning any error; then replace the two nearly identical blocks in Step 6 and Step 7 with calls to this new helper using testCtx and testCtx.ControlPlaneNamespace so the retry logic (including use of DeletionTimeout) is centralized and avoids drift.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 714ef578-3745-4b4d-ae10-46a6cf5c8928
📒 Files selected for processing (3)
test/e2e/v2/backuprestore/cleanup.go, test/e2e/v2/backuprestore/cli.go, test/e2e/v2/tests/backup_restore_test.go
jparrill
left a comment
Dropped a question. Let me know for tagging.
|
/test e2e-azure-self-managed |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jparrill, mgencur. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Code Review — Automated Analysis
🔴 Overall Verdict: FAIL — 4 blocking issues must be fixed before merge
Files Reviewed
🚨 Required Actions (Blocking): R1 — Wrong error sentinel in retry logic (
|
| # | Severity | Area | Issue |
|---|---|---|---|
| R1 | 🔴 Blocking | Retry logic | errors.Is misses context.Canceled; use wait.Interrupted(err) |
| R2 | 🔴 Blocking | Agent platform | hypershift-agents hardcoded; must read HostedCluster.Spec.Platform.Agent.AgentNamespace |
| R3 | 🔴 Blocking | Test safety | prober.Stop() nil dereference risk in ContextVerifyContinual |
| R4 | 🔴 Blocking | Test validity | CNTRLPLANE-2676 skip removal unverified; link implementing PR |
| I1 | 🟡 Recommended | DRY | Extract duplicated retry block into deleteNamespaceWithRetry |
| I2 | 🟡 Recommended | Style | 1*time.Minute magic literal → InitialDeletionTimeout constant |
| I3 | 🟡 Recommended | AWS correctness | Add cloud-credential-operator to AWS excludeWorkloads |
| I4 | 🟡 Recommended | DRY | Verify IncludeNamespaces applied to all three OADP option structs |
| I5 | 🟡 Recommended | SOLID/ISP | Move NodePool validation flag into backupRestorePlatformConfig |
| I6 | 🟡 Recommended | Robustness | Add IsNotFound tolerance in verifyReconciliationActiveFunction |
| I7 | 🟡 Recommended | AWS correctness | Add file-existence pre-flight check for AWS_GUEST_INFRA_CREDENTIALS_FILE |
| I8 | 🟡 Recommended | Clarity | Document OIDCTimeout decoupling from BackupTimeout |
| I9 | 🟡 Recommended | Clarity | Document gracePeriodSeconds=-1 sentinel |
| I10 | 🟡 Recommended | Maintenance | Add MGMT-23509 ticket link to TODO |
| I11 | 🟡 Recommended | CI | Verify aws Ginkgo label is removed |
Generated by Claude Code automated review · HyperShift profile · 2026-03-18
|
Above is from AI under my direction. Here is mine: retry + fallback is not optional for OADP in HyperShift; it's required for production-grade reliability. Without it, you'll absolutely see flaky backups/restores due to exactly the issues we met (disk, network, control-plane churn).
Fallback:
Restore Policy Fallback: |
|
Fixing R1 - R3.
The pauses were removed as part of openshift/hypershift-oadp-plugin#197 |
…nil guard Replace manual timeout error checks with wait.Interrupted(), resolve Agent namespace from HostedCluster spec at runtime, and guard prober against nil dereference on unsupported platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
For #7971 (comment): I'm not sure exactly whether the retries are desired or how to achieve them; maybe it needs to be a separate feature? There are also Schedules, which periodically do backups, so that reduces the need for retries. It would need wider discussion and might result in a new feature. Would you mind tracking this as a separate JIRA ticket? It sounds like too much for this single PR.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)
116-117: ⚠️ Potential issue | 🟠 Major — Capture `testCtx` in a local variable before passing the closure to `prober.Spawn()`.
Line 116 reassigns `testCtx` in `BeforeEach` for each test spec, while the prober spawns a goroutine at line 170 that repeatedly reads `testCtx` asynchronously. This shared mutable capture creates a data race where the spawned goroutine may read a stale or partially-updated `testCtx` reference.
Fix: capture immutable context for the prober

```diff
 verifyReconciliationActiveFunction := func() error {
+	proberTestCtx := testCtx
 	hostedCluster := &hyperv1.HostedCluster{}
-	err := testCtx.MgmtClient.Get(testCtx.Context, crclient.ObjectKey{
-		Name:      testCtx.ClusterName,
-		Namespace: testCtx.ClusterNamespace,
+	err := proberTestCtx.MgmtClient.Get(proberTestCtx.Context, crclient.ObjectKey{
+		Name:      proberTestCtx.ClusterName,
+		Namespace: proberTestCtx.ClusterNamespace,
 	}, hostedCluster)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/tests/backup_restore_test.go` around lines 116 - 117, Before passing the closure to prober.Spawn() from your BeforeEach, capture the current testCtx into a local immutable variable (e.g., localCtx := testCtx) and have the spawned closure reference that localCtx instead of the outer testCtx; this avoids the shared mutable capture that causes a data race between BeforeEach reassignment of testCtx and the goroutine started by prober.Spawn(), so update the closure(s) passed to prober.Spawn() to use the new localCtx.
🧹 Nitpick comments (1)
test/e2e/v2/backuprestore/cleanup.go (1)
72-101: Extract Step 6/7 retry flow into a shared helper.Both blocks implement the same timeout → finalizer cleanup → retry sequence. Keeping this duplicated makes timeout/error behavior easy to drift later.
♻️ Refactor sketch

```diff
+func retryNamespaceDeleteWithFinalizerCleanup(
+	testCtx *internal.TestContext,
+	logger logr.Logger,
+	ns string,
+	initialTimeout time.Duration,
+	deleteFn func(*internal.TestContext, time.Duration) error,
+) error {
+	if err := deleteFn(testCtx, initialTimeout); err != nil {
+		if !wait.Interrupted(err) {
+			return err
+		}
+		logger.Info("Namespace did not delete within initial timeout, removing finalizers again and retrying", "namespace", ns)
+		if err := removeNamespaceObjectFinalizers(testCtx, ns, logger); err != nil {
+			return fmt.Errorf("failed to remove finalizers on retry for namespace %s: %w", ns, err)
+		}
+		if err := deleteFn(testCtx, DeletionTimeout); err != nil {
+			return fmt.Errorf("failed to delete namespace %s after retry: %w", ns, err)
+		}
+	}
+	return nil
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/cleanup.go` around lines 72 - 101, Extract the duplicated timeout→finalizer-cleanup→retry logic into a shared helper (e.g. deleteWithRetry) and call it from both places instead of duplicating the block around deleteControlPlaneNamespace and deleteHostedClusterNamespace; the helper should accept a context/testCtx, a delete function (like deleteControlPlaneNamespace or deleteHostedClusterNamespace), the namespace identifier (for use with removeNamespaceObjectFinalizers/testCtx.ClusterNamespace or testCtx.ControlPlaneNamespace), a logger, the short initial timeout (1*time.Minute) and the retry timeout (DeletionTimeout), perform the initial delete, check wait.Interrupted(err) to decide whether to remove finalizers via removeNamespaceObjectFinalizers and retry, log the same informational message when retrying, and return the original or retry error wrapped exactly as the current returns do so call sites only replace their blocks with a call to deleteWithRetry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: ee959b3d-2946-438e-85ff-65e311168b07
📒 Files selected for processing (2)
test/e2e/v2/backuprestore/cleanup.go, test/e2e/v2/tests/backup_restore_test.go
That's functionality for the next version.
|
|
/lgtm |
|
Scheduling tests matching the |
Test Results — e2e-aws
Failed tests: 9 total
... and 4 more failed tests
e2e-aks
|
|
/retest |
|
/retest |
|
/test e2e-aws |
|
/test e2e-aws-4-21 |
|
/test e2e-aws |
1 similar comment
|
/test e2e-aws |
|
/test e2e-aws |
|
Need to rebase; e2e-aws was fixed and merged.
|
/test e2e-aws |
|
/verified by @mgencur |
|
@mgencur: This PR has been marked as verified.
|
/test e2e-azure-self-managed |
|
@mgencur: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here; if you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

What this PR does / why we need it:
Tested in openshift/release#76089
Which issue(s) this PR fixes:
Fixes
Special notes for your reviewer:
Checklist:
Note
Medium Risk
Medium risk because it changes e2e test flow and teardown semantics (namespace deletion timeouts/retries), which can affect CI behavior and cluster cleanup reliability across platforms.
Overview
Platform-aware backup/restore e2e test: The `BackupRestore` test is refactored to auto-detect the hosted cluster platform and run on both AWS and Agent, using per-platform config for excluded workloads, optional post-restore hooks (OIDC/IAM fix on AWS), and inclusion of extra namespaces (e.g. the Agent namespace) in OADP backup/schedule/restore requests.
More resilient teardown: `BreakHostedClusterPreservingMachines` now uses timeout-based namespace deletion and, if deletion doesn't complete within 1 minute, re-strips finalizers and retries with the standard `DeletionTimeout` to avoid CI hangs.
Timing tweak: Increases `OIDCTimeout` from 20m to 30m.
Written by Cursor Bugbot for commit b142d77. This will update automatically on new commits. Configure here.
Summary by CodeRabbit