
NO-JIRA: refactor(oadp): unify backup/restore e2e test with platform auto-detection #7971

Merged
openshift-merge-bot[bot] merged 4 commits into openshift:main from mgencur:rerun_finalizer_deletion
Mar 23, 2026

Conversation

@mgencur
Contributor

@mgencur mgencur commented Mar 16, 2026

What this PR does / why we need it:

  • Merge the AWS-only BackupRestore test into a single platform-agnostic test that now works on both the AWS and Agent platforms.
  • Improve namespace deletion resilience during BreakHostedCluster by retrying finalizer removal when the initial 1-minute timeout expires. This is required because namespace deletion was sometimes hanging in CI: some resources were re-created after deletion and stayed stuck because their new finalizers were never removed.
  • Enable the continual ReconciliationActive probing that was previously skipped (CNTRLPLANE-2676).

Tested in openshift/release#76089

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Note

Medium Risk
Medium risk because it changes e2e test flow and teardown semantics (namespace deletion timeouts/retries), which can affect CI behavior and cluster cleanup reliability across platforms.

Overview
Platform-aware backup/restore e2e test: The BackupRestore test is refactored to auto-detect the hosted cluster platform and run on both AWS and Agent, using per-platform config for excluded workloads, optional post-restore hooks (OIDC/IAM fix on AWS), and inclusion of extra namespaces (e.g. Agent namespace) in OADP backup/schedule/restore requests.

More resilient teardown: BreakHostedClusterPreservingMachines now uses timeout-based namespace deletion and, if deletion doesn’t complete within 1 minute, re-strips finalizers and retries with the standard DeletionTimeout to avoid CI hangs.

Timing tweak: Increases OIDCTimeout from 20m to 30m.

Written by Cursor Bugbot for commit b142d77. This will update automatically on new commits.

Summary by CodeRabbit

  • Tests
    • Added platform-aware backup/restore tests with runtime gating (AWS, Agent), platform-specific hooks, additional namespaces, and conditional post-restore verification including nodepool readiness.
  • Bug Fixes
    • Namespace deletion now uses timeout-based waits with retries and finalizer removal to recover stalled deletions.
  • Chores
    • Increased default OIDC operation timeout from 20 to 30 minutes.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 16, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 16, 2026
@openshift-ci-robot

@mgencur: This pull request explicitly references no jira issue.

Details

In response to this:

Depends on #7837; this PR will need a rebase after that merges.

What this PR does / why we need it:

Refactor deleteNamespace to accept a timeout duration instead of a boolean wait flag. Implement a two-pass finalizer removal strategy to handle controllers recreating resources during namespace deletion.
Add NodePool WaitingForAvailableMachines state validation after restore.
This is required as the namespace deletion was sometimes hanging in CI. Some resources were re-created after deletion and kept hanging there because the new finalizers were not removed.

Tested in openshift/release#76089

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8ff52fdb-782e-4cfb-a912-eaf35abb5ee8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

This pull request changes namespace deletion helpers in test/e2e/v2/backuprestore/cleanup.go to use timeout-based semantics (time.Duration) instead of boolean wait flags, implementing a two-phase delete: initial short timeout, then finalizer stripping and a retry with a longer timeout on timeout events. Public signatures for deleteNamespace, deleteControlPlaneNamespace, and deleteHostedClusterNamespace were updated. test/e2e/v2/backuprestore/cli.go increases the OIDC timeout from 20 to 30 minutes. test/e2e/v2/tests/backup_restore_test.go adds per-platform backup/restore configuration (excludeWorkloads, postRestoreHook, additionalNamespaces) and runtime platform gating.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor TestRunner
    participant K8sAPI as Kubernetes API
    participant Controller as Controllers / Finalizers

    TestRunner->>K8sAPI: Request namespace deletion (short timeout)
    alt Deletion succeeds before timeout
        K8sAPI-->>TestRunner: Deleted
    else Deletion times out (context deadline)
        K8sAPI-->>TestRunner: Deletion pending / still exists
        TestRunner->>Controller: Strip namespace finalizers
        Controller-->>K8sAPI: Finalizers removed
        TestRunner->>K8sAPI: Retry namespace deletion (longer timeout)
        alt Deletion succeeds
            K8sAPI-->>TestRunner: Deleted
        else Non-timeout error
            K8sAPI-->>TestRunner: Error (propagate)
        end
    end
```

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from csrwng and jparrill March 16, 2026 07:44
@openshift-ci openshift-ci bot added area/cli Indicates the PR includes changes for CLI area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Mar 16, 2026
@mgencur mgencur force-pushed the rerun_finalizer_deletion branch from 9bfbd1f to 9580326 Compare March 16, 2026 13:03
@mgencur mgencur changed the title [WIP] NO-JIRA: test(e2e): use timeout-based namespace deletion in backup/restore NO-JIRA: test(e2e): use timeout-based namespace deletion in backup/restore Mar 16, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 16, 2026
@mgencur mgencur changed the title NO-JIRA: test(e2e): use timeout-based namespace deletion in backup/restore NO-JIRA: refactor(oadp): unify backup/restore e2e test with platform auto-detection Mar 17, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)

176-205: ⚠️ Potential issue | 🟠 Major

Add IncludeNamespaces to restore options.

The test passes platformCfg.additionalNamespaces to both schedule and backup creation, but the restore options omit it. This inconsistency means those namespaces will be backed up but never restored. Update line 248 to include:

IncludeNamespaces: platformCfg.additionalNamespaces,

Same fix applies to the second occurrence at lines 248–255.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/tests/backup_restore_test.go` around lines 176 - 205, The restore
options are missing IncludeNamespaces so namespaces backed up by
OADPSchedule/OADPBackup are not restored; update the OADPRestoreOptions
instances used in this test to include IncludeNamespaces:
platformCfg.additionalNamespaces (matching how OADPScheduleOptions and
OADPBackupOptions are constructed), i.e., add IncludeNamespaces to the restore
options near where backupOpts is created and to the second restore-options
occurrence referenced in the test so both restores include
platformCfg.additionalNamespaces.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/v2/backuprestore/cleanup.go`:
- Around line 76-94: The cleanup retry currently treats any error from
deleteControlPlaneNamespace/deleteHostedClusterNamespace as a timeout; change
the logic to only perform the finalizer-removal + retry when the returned error
is an actual timeout. Update the branches around deleteControlPlaneNamespace and
deleteHostedClusterNamespace to check the error with errors.Is for known timeout
signals (e.g. context.DeadlineExceeded or wait.ErrWaitTimeout) or a specific
sentinel timeout error returned by those functions, and return other errors
immediately. Use the same change for the analogous block later (lines ~302-345):
only call removeNamespaceObjectFinalizers and retry when the error indicates a
timeout, otherwise propagate the original error.

In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 270-288: The NodePool Ready=False check is timing-sensitive and
can be missed after the long control-plane readiness waits; to fix it, start
observing the NodePool transition immediately after restore completion instead
of after
WaitForControlPlaneStatefulSetsReadiness/WaitForControlPlaneDeploymentsReadiness:
move the Eventually block that calls getNodePool and internal.ValidateConditions
(checking hyperv1.NodePoolReadyConditionType == metav1.ConditionFalse with
Reason "WaitingForAvailableMachines") to run right after restore completion, or
run it concurrently as a prober (goroutine) that polls with
backuprestore.PollInterval and backuprestore.OIDCTimeout using the same testCtx
so it can record the transient state while the subsequent calls to
internal.WaitForControlPlaneStatefulSetsReadiness and
internal.WaitForControlPlaneDeploymentsReadiness still run.

---

Outside diff comments:
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 176-205: The restore options are missing IncludeNamespaces so
namespaces backed up by OADPSchedule/OADPBackup are not restored; update the
OADPRestoreOptions instances used in this test to include IncludeNamespaces:
platformCfg.additionalNamespaces (matching how OADPScheduleOptions and
OADPBackupOptions are constructed), i.e., add IncludeNamespaces to the restore
options near where backupOpts is created and to the second restore-options
occurrence referenced in the test so both restores include
platformCfg.additionalNamespaces.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 988cdf1d-2a4b-4b10-aeba-3bfcba14b798

📥 Commits

Reviewing files that changed from the base of the PR and between 2a03c4f and be896ff.

📒 Files selected for processing (3)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/backuprestore/cli.go
  • test/e2e/v2/tests/backup_restore_test.go

@mgencur mgencur force-pushed the rerun_finalizer_deletion branch from be896ff to b7a9415 Compare March 18, 2026 07:46

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)

270-288: ⚠️ Potential issue | 🟠 Major

NodePool transient-state check is still timing-sensitive

At Line 270, the test waits for control-plane readiness first, then at Line 276 starts looking for NodePoolReady=False with reason WaitingForAvailableMachines. That state can occur earlier and be missed, causing flaky failures.

Start observing this condition immediately after restore completion (or before the long readiness waits), then assert it was observed.

Suggested reorder (minimal change)
-By("Waiting for control plane statefulsets to be ready")
-err := internal.WaitForControlPlaneStatefulSetsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
-Expect(err).NotTo(HaveOccurred())
-By("Waiting for control plane deployments to be ready")
-err = internal.WaitForControlPlaneDeploymentsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
-Expect(err).NotTo(HaveOccurred())
 By("Waiting for NodePool to reach WaitingForAvailableMachines state")
 Eventually(func(g Gomega) {
     nodePool, err := getNodePool(testCtx)
     g.Expect(err).NotTo(HaveOccurred())
     g.Expect(nodePool).NotTo(BeNil())
     internal.ValidateConditions(g, nodePool, []util.Condition{
         {
             Type:   hyperv1.NodePoolReadyConditionType,
             Status: metav1.ConditionFalse,
             Reason: "WaitingForAvailableMachines",
         },
     })
 }).WithPolling(backuprestore.PollInterval).WithTimeout(backuprestore.OIDCTimeout).Should(Succeed())
+
+By("Waiting for control plane statefulsets to be ready")
+err := internal.WaitForControlPlaneStatefulSetsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
+Expect(err).NotTo(HaveOccurred())
+By("Waiting for control plane deployments to be ready")
+err = internal.WaitForControlPlaneDeploymentsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
+Expect(err).NotTo(HaveOccurred())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/tests/backup_restore_test.go` around lines 270 - 288, The
transient NodePool condition check is placed after long control-plane readiness
waits and can be missed; move the Eventually block that calls getNodePool and
internal.ValidateConditions (asserting hyperv1.NodePoolReadyConditionType ==
metav1.ConditionFalse with Reason "WaitingForAvailableMachines") to run
immediately after restore completion (i.e., before calling
internal.WaitForControlPlaneStatefulSetsReadiness and
internal.WaitForControlPlaneDeploymentsReadiness) so the test starts observing
the transient state early, then continue with the existing readiness waits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 270-288: The transient NodePool condition check is placed after
long control-plane readiness waits and can be missed; move the Eventually block
that calls getNodePool and internal.ValidateConditions (asserting
hyperv1.NodePoolReadyConditionType == metav1.ConditionFalse with Reason
"WaitingForAvailableMachines") to run immediately after restore completion
(i.e., before calling internal.WaitForControlPlaneStatefulSetsReadiness and
internal.WaitForControlPlaneDeploymentsReadiness) so the test starts observing
the transient state early, then continue with the existing readiness waits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: ea4fd520-2f00-4969-83ac-aaf0ac96d1a1

📥 Commits

Reviewing files that changed from the base of the PR and between be896ff and b7a9415.

📒 Files selected for processing (2)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/tests/backup_restore_test.go

…anup

Refactor deleteNamespace to accept a timeout duration instead of a
boolean wait flag. Implement a two-pass finalizer removal strategy
to handle controllers recreating resources during namespace deletion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mgencur mgencur force-pushed the rerun_finalizer_deletion branch 2 times, most recently from ecf5b0a to cd884dd Compare March 18, 2026 08:01
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
test/e2e/v2/backuprestore/cleanup.go (1)

73-88: Consider extracting the duplicated timeout/retry flow into a helper.

Step 6 and Step 7 now implement almost the same retry sequence. A shared helper would reduce drift and make future changes safer.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 90-102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/backuprestore/cleanup.go` around lines 73 - 88, Extract the
duplicated timeout/retry sequence into a helper function (e.g.,
ensureNamespaceDeleted(ctx, namespace, logger)) that encapsulates: attempt
deleteControlPlaneNamespace(ctx, shortTimeout), on DeadlineExceeded or
wait.ErrWaitTimeout call removeNamespaceObjectFinalizers(ctx, namespace, logger)
and retry deleteControlPlaneNamespace(ctx, DeletionTimeout) returning any error;
then replace the two nearly identical blocks in Step 6 and Step 7 with calls to
this new helper using testCtx and testCtx.ControlPlaneNamespace so the retry
logic (including use of DeletionTimeout) is centralized and avoids drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/e2e/v2/backuprestore/cleanup.go`:
- Around line 73-88: Extract the duplicated timeout/retry sequence into a helper
function (e.g., ensureNamespaceDeleted(ctx, namespace, logger)) that
encapsulates: attempt deleteControlPlaneNamespace(ctx, shortTimeout), on
DeadlineExceeded or wait.ErrWaitTimeout call
removeNamespaceObjectFinalizers(ctx, namespace, logger) and retry
deleteControlPlaneNamespace(ctx, DeletionTimeout) returning any error; then
replace the two nearly identical blocks in Step 6 and Step 7 with calls to this
new helper using testCtx and testCtx.ControlPlaneNamespace so the retry logic
(including use of DeletionTimeout) is centralized and avoids drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 714ef578-3745-4b4d-ae10-46a6cf5c8928

📥 Commits

Reviewing files that changed from the base of the PR and between b7a9415 and cd884dd.

📒 Files selected for processing (3)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/backuprestore/cli.go
  • test/e2e/v2/tests/backup_restore_test.go

Contributor

@jparrill jparrill left a comment


Dropped a question. Let me know for tagging.

@jparrill
Contributor

/test e2e-azure-self-managed

@jparrill
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented Mar 18, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill, mgencur

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2026
@wangke19
Contributor

wangke19 commented Mar 18, 2026

Code Review — Automated Analysis

Reviewed by Claude Code with HyperShift profile (control-plane-sme, data-plane-sme, api-sme, cloud-provider-sme, hcp-architect-sme)
Focus area: Retry and fallback mechanism (user-requested) + full idiomatic Go + SOLID + DRY + platform correctness


🔴 Overall Verdict: FAIL — 4 blocking issues must be fixed before merge


Files Reviewed

  • test/e2e/v2/backuprestore/cleanup.go: E2E helper (namespace cleanup & retry logic)
  • test/e2e/v2/backuprestore/cli.go: E2E helper (OADP CLI wrapper & constants)
  • test/e2e/v2/tests/backup_restore_test.go: Ginkgo e2e test suite

🚨 Required Actions (Blocking)

R1 — Wrong error sentinel in retry logic (cleanup.go, Steps 6 & 7)

The current check is:

if !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, wait.ErrWaitTimeout) {
    return fmt.Errorf("failed to delete control plane namespace: %w", err)
}

This has two bugs:

  1. context.Canceled is not caught. wait.PollUntilContextTimeout returns ctx.Err() directly, which can be context.Canceled (e.g. when the outer testCtx.Context is cancelled by the test runner). A cancelled context falls through to the hard-error path, skipping the retry entirely — the opposite of the intended behaviour.

  2. wait.ErrWaitTimeout check is dead code. Deprecated since k8s 1.29; PollUntilContextTimeout never returns it (returns context errors directly). The errInterrupted.Is method matches any errInterrupted{}, not specifically timeout.

The vendored k8s.io/apimachinery/pkg/util/wait/error.go documents the intended API for exactly this case:

// Fix — use the API the wait package provides:
if !wait.Interrupted(err) {
    return fmt.Errorf("failed to delete control plane namespace: %w", err)
}

wait.Interrupted(err) checks all three sentinels (context.Canceled, context.DeadlineExceeded, errWaitTimeout) in one call. The comment in the vendored source reads: "Callers should use this method instead of comparing the error value directly to ErrWaitTimeout, as methods that cancel a context may not return that error." — directly applicable here.

This fix must be applied in both Step 6 (control plane namespace) and Step 7 (hosted cluster namespace).


R2 — hypershift-agents namespace is hardcoded — will silently break Agent platform restore (backup_restore_test.go)

hyperv1.AgentPlatform: {
    additionalNamespaces: []string{"hypershift-agents"},
},

AgentNamespace is a user-supplied field at cluster creation time (HostedCluster.Spec.Platform.Agent.AgentNamespace, set via --agent-namespace flag, marked required). Common CI values include assisted-installer, open-cluster-management, and others — there is no canonical value.

If the actual namespace differs from hypershift-agents, Velero will back up a wrong/empty namespace and silently miss all Agent CRs, causing a failed restore with no clear error.

Required fix: Read the namespace at runtime:

hostedCluster.Spec.Platform.Agent.AgentNamespace

and populate additionalNamespaces dynamically.
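
A minimal sketch of that runtime resolution, using trimmed-down stand-ins for the HyperShift types (the struct names here are illustrative; only the field path `Spec.Platform.Agent.AgentNamespace` comes from the review above):

```go
package main

import "fmt"

// Hypothetical trimmed mirrors of the HostedCluster platform types.
type AgentPlatformSpec struct{ AgentNamespace string }
type PlatformSpec struct {
	Type  string
	Agent *AgentPlatformSpec
}

// additionalNamespacesFor resolves the extra namespaces to include in the
// OADP backup at runtime instead of hardcoding "hypershift-agents".
func additionalNamespacesFor(p PlatformSpec) []string {
	if p.Type == "Agent" && p.Agent != nil && p.Agent.AgentNamespace != "" {
		return []string{p.Agent.AgentNamespace}
	}
	return nil
}

func main() {
	spec := PlatformSpec{Type: "Agent", Agent: &AgentPlatformSpec{AgentNamespace: "assisted-installer"}}
	fmt.Println(additionalNamespacesFor(spec))
	fmt.Println(additionalNamespacesFor(PlatformSpec{Type: "AWS"}))
}
```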


R3 — prober.Stop() in ContextVerifyContinual lacks nil guard — potential panic (backup_restore_test.go)

AfterAll correctly guards with if prober != nil, but the It block in ContextVerifyContinual calls prober.Stop() unconditionally. If BeforeAll skips the suite (unsupported platform), prober is nil and Stop() will panic. Add the same guard:

It("should verify continual operations completed successfully", func() {
    if prober == nil {
        Skip("prober not initialized")
    }
    err := prober.Stop()
    Expect(err).NotTo(HaveOccurred())
})

R4 — CNTRLPLANE-2676 skip removal is unverified (backup_restore_test.go)

The PR title says NO-JIRA but the removed Skip("Skipping until [CNTRLPLANE-2676](https://redhat.atlassian.net/browse/CNTRLPLANE-2676) is implemented") references a JIRA. These were always-skipped code paths — enabling them without validation evidence is risky. If ReconciliationActive transiently goes False during backup (which could be legitimate behaviour during reconciliation pausing), the prober will report false failures.

Please either:

  • Link the implementing PR/commit for CNTRLPLANE-2676 and confirm ReconciliationActive remains True throughout backup on both AWS and Agent platforms, or
  • Replace the blanket skip with a per-platform conditional or feature-gate env var.

Recommended Improvements (Non-blocking)

I1 — Extract duplicated retry block into a helper (DRY)

Steps 6 and 7 are structurally identical 10-line blocks. The only differences are the delete function called and the namespace string. Extract:

func deleteNamespaceWithRetry(
    testCtx *internal.TestContext,
    logger logr.Logger,
    namespace string,
    deleteFn func(*internal.TestContext, time.Duration) error,
    logLabel string,
) error {
    if err := deleteFn(testCtx, InitialDeletionTimeout); err != nil {
        if !wait.Interrupted(err) {
            return fmt.Errorf("failed to delete %s namespace: %w", logLabel, err)
        }
        logger.Info("namespace did not delete within initial timeout, removing finalizers and retrying",
            "namespace", namespace)
        if err := removeNamespaceObjectFinalizers(testCtx, namespace, logger); err != nil {
            return fmt.Errorf("failed to remove finalizers on retry for %s: %w", logLabel, err)
        }
        if err := deleteFn(testCtx, DeletionTimeout); err != nil {
            return fmt.Errorf("failed to delete %s namespace after retry: %w", logLabel, err)
        }
    }
    return nil
}

This also makes the wait.Interrupted fix a single-point change.

I2 — Add named constant for the 1-minute initial timeout

const InitialDeletionTimeout = 1 * time.Minute

The magic literal 1*time.Minute appears twice, has no name, and has no obvious relationship to DeletionTimeout. A named constant makes it a visible, tunable configuration point.

I3 — Add cloud-credential-operator to AWS excludeWorkloads

CCO relies on IAM WebIdentity and enters CrashLoopBackOff after restore until RunFixDrOidcIam completes and rotates the OIDC provider. The pre-restore and mid-restore ValidateControlPlaneDeploymentsReadiness calls will fail on CCO. Add cloud-credential-operator to the AWS exclusion list to prevent post-restore health check flakes.

I4 — Verify IncludeNamespaces is applied to all three option structs

Confirm that IncludeNamespaces: platformCfg.additionalNamespaces is set consistently in all three of OADPScheduleOptions, OADPBackupOptions, and OADPRestoreOptions. If applied to backup but not schedule or restore, namespaces will be silently missing from one operation.

I5 — Move NodePool validation flag into backupRestorePlatformConfig

// Current — second platform-dispatch mechanism outside the map:
if testCtx.GetHostedCluster().Spec.Platform.Type != hyperv1.AgentPlatform {
    // validate NodePool conditions
}

// Better — keep all per-platform decisions in the map:
type backupRestorePlatformConfig struct {
    excludeWorkloads     []string
    postRestoreHook      func(testCtx *internal.TestContext) error
    additionalNamespaces []string
    validateNodePools    bool   // add this
}

I6 — Add apierrors.IsNotFound tolerance in verifyReconciliationActiveFunction

If a test node fails after the prober starts but before ContextVerifyContinual stops it, the prober continues running into ContextBreakControlPlane where the HostedCluster is deleted. MgmtClient.Get returns NotFound, the prober records it as a failure, and AfterAll surfaces a confusing error. Treat IsNotFound as a terminal success condition (the cluster was intentionally destroyed).

I7 — Add file-existence check for AWS_GUEST_INFRA_CREDENTIALS_FILE

GetEnvVarValue defaults to ~/.aws/credentials if the env var is not set. If the file doesn't exist, the error surfaces only at RunFixDrOidcIam runtime with a non-obvious message. Add an explicit os.Stat check in the postRestoreHook closure with a clear Fail() message.

I8 — Document OIDCTimeout decoupling from BackupTimeout

Add a comment explaining why OIDCTimeout differs from BackupTimeout (e.g., "30 minutes to account for IAM propagation delays and etcd recovery time after restore"). Without a comment, the divergence will be puzzling to future readers, and the two constants may silently drift further apart.

I9 — Document gracePeriodSeconds=-1 sentinel

// Add:
const useServerDefaultGracePeriod int64 = -1

And update the deleteNamespace doc comment to explain that -1 means "omit the GracePeriodSeconds option and let the server apply its default." The current comment only explains 0.

I10 — Add JIRA ticket reference to the TODO

// TODO(mgencur): Remove this condition once https://redhat.atlassian.net/browse/MGMT-23509 is fixed

Without a ticket link, this TODO is likely to become permanent dead weight.

I11 — Verify the aws Ginkgo label is removed from the Describe block

The diff shows Label("backup-restore") without "aws", which is correct. If the aws label were still present, the test would remain gated to AWS CI lanes at the label-selector level even though the code is now platform-agnostic. Please confirm this is the case in the final diff.


Summary Table

  • R1 (🔴 Blocking · Retry logic): errors.Is misses context.Canceled; use wait.Interrupted(err)
  • R2 (🔴 Blocking · Agent platform): hypershift-agents hardcoded; must read HostedCluster.Spec.Platform.Agent.AgentNamespace
  • R3 (🔴 Blocking · Test safety): prober.Stop() nil dereference risk in ContextVerifyContinual
  • R4 (🔴 Blocking · Test validity): CNTRLPLANE-2676 skip removal unverified; link implementing PR
  • I1 (🟡 Recommended · DRY): extract duplicated retry block into deleteNamespaceWithRetry
  • I2 (🟡 Recommended · Style): 1*time.Minute magic literal → InitialDeletionTimeout constant
  • I3 (🟡 Recommended · AWS correctness): add cloud-credential-operator to AWS excludeWorkloads
  • I4 (🟡 Recommended · DRY): verify IncludeNamespaces applied to all three OADP option structs
  • I5 (🟡 Recommended · SOLID/ISP): move NodePool validation flag into backupRestorePlatformConfig
  • I6 (🟡 Recommended · Robustness): add IsNotFound tolerance in verifyReconciliationActiveFunction
  • I7 (🟡 Recommended · AWS correctness): add file-existence pre-flight check for AWS_GUEST_INFRA_CREDENTIALS_FILE
  • I8 (🟡 Recommended · Clarity): document OIDCTimeout decoupling from BackupTimeout
  • I9 (🟡 Recommended · Clarity): document gracePeriodSeconds=-1 sentinel
  • I10 (🟡 Recommended · Maintenance): add MGMT-23509 ticket link to TODO
  • I11 (🟡 Recommended · CI): verify aws Ginkgo label is removed

Generated by Claude Code automated review · HyperShift profile · 2026-03-18

@wangke19
Contributor

wangke19 commented Mar 18, 2026

The review above is from AI under my direction. Here is my own take: retry + fallback is not optional for OADP in HyperShift; it's required for production-grade reliability. Without it, you'll absolutely see flaky backups/restores due to exactly the issues we encountered (disk, network, control plane churn).
Backup Policy:

  • Retry: 3 attempts (exponential backoff)
  • Timeout: 30 min per backup

Fallback:

  • Retry same storage
  • Switch storage
  • Switch to filesystem backup
  • Reduce scope

Restore Policy:

  • Max retries: 3
  • Pre-check required: ✅
  • Staged restore: ✅
  • Conflict policy: configurable

Fallback:

  • Retry restore
  • Partial restore
  • Manual intervention trigger

@mgencur
Contributor Author

mgencur commented Mar 18, 2026

Fixing R1 - R3.

R4 — CNTRLPLANE-2676 skip removal is unverified (backup_restore_test.go)

The pauses were removed as part of openshift/hypershift-oadp-plugin#197
I already verified that. See my comment on the JIRA if you're interested.

…nil guard

Replace manual timeout error checks with wait.Interrupted(), resolve
Agent namespace from HostedCluster spec at runtime, and guard prober
against nil dereference on unsupported platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mgencur
Contributor Author

mgencur commented Mar 18, 2026

For #7971 (comment): I'm not sure whether the retries are desired or how to achieve them; maybe this needs to be a separate feature. There are also Schedules that periodically run backups, which reduces the need for retries. It would need wider discussion and might result in a new feature. Would you mind filing this as a separate JIRA ticket? It sounds like too much for this single PR.

Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)

116-117: ⚠️ Potential issue | 🟠 Major

Capture testCtx in a local variable before passing the closure to prober.Spawn().

Line 116 reassigns testCtx in BeforeEach for each test spec, while the prober spawns a goroutine at line 170 that repeatedly reads testCtx asynchronously. This shared mutable capture creates a data race where the spawned goroutine may read a stale or partially-updated testCtx reference.

Fix: capture immutable context for the prober
+			proberTestCtx := testCtx
 			verifyReconciliationActiveFunction := func() error {
 				hostedCluster := &hyperv1.HostedCluster{}
-				err := testCtx.MgmtClient.Get(testCtx.Context, crclient.ObjectKey{
-					Name:      testCtx.ClusterName,
-					Namespace: testCtx.ClusterNamespace,
+				err := proberTestCtx.MgmtClient.Get(proberTestCtx.Context, crclient.ObjectKey{
+					Name:      proberTestCtx.ClusterName,
+					Namespace: proberTestCtx.ClusterNamespace,
 				}, hostedCluster)
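The shared-capture problem behind this fix can be reproduced in isolation. The names below (`makeProbe`, `cfg`) are hypothetical; the point is that a closure over a value parameter pins a snapshot at creation time, whereas a closure over the outer variable would observe later reassignments:

```go
package main

import "fmt"

// makeProbe returns a closure over a snapshot of cfg taken at creation time,
// mirroring the "capture testCtx into a local variable" fix: cfg is a copy,
// so later reassignments of the caller's variable do not leak into the probe.
func makeProbe(cfg string) func() string {
	return func() string { return cfg }
}

func main() {
	cfg := "ctx-A"
	probe := makeProbe(cfg) // snapshot taken before any reassignment
	cfg = "ctx-B"           // simulates BeforeEach reassigning testCtx
	_ = cfg
	fmt.Println(probe()) // ctx-A
}
```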
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/tests/backup_restore_test.go` around lines 116 - 117, Before
passing the closure to prober.Spawn() from your BeforeEach, capture the current
testCtx into a local immutable variable (e.g., localCtx := testCtx) and have the
spawned closure reference that localCtx instead of the outer testCtx; this
avoids the shared mutable capture that causes a data race between BeforeEach
reassignment of testCtx and the goroutine started by prober.Spawn(), so update
the closure(s) passed to prober.Spawn() to use the new localCtx.
🧹 Nitpick comments (1)
test/e2e/v2/backuprestore/cleanup.go (1)

72-101: Extract Step 6/7 retry flow into a shared helper.

Both blocks implement the same timeout → finalizer cleanup → retry sequence. Keeping this duplicated makes timeout/error behavior easy to drift later.

♻️ Refactor sketch
+func retryNamespaceDeleteWithFinalizerCleanup(
+	testCtx *internal.TestContext,
+	logger logr.Logger,
+	ns string,
+	initialTimeout time.Duration,
+	deleteFn func(*internal.TestContext, time.Duration) error,
+) error {
+	if err := deleteFn(testCtx, initialTimeout); err != nil {
+		if !wait.Interrupted(err) {
+			return err
+		}
+		logger.Info("Namespace did not delete within initial timeout, removing finalizers again and retrying", "namespace", ns)
+		if err := removeNamespaceObjectFinalizers(testCtx, ns, logger); err != nil {
+			return fmt.Errorf("failed to remove finalizers on retry for namespace %s: %w", ns, err)
+		}
+		if err := deleteFn(testCtx, DeletionTimeout); err != nil {
+			return fmt.Errorf("failed to delete namespace %s after retry: %w", ns, err)
+		}
+	}
+	return nil
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/backuprestore/cleanup.go` around lines 72 - 101, Extract the
duplicated timeout→finalizer-cleanup→retry logic into a shared helper (e.g.
deleteWithRetry) and call it from both places instead of duplicating the block
around deleteControlPlaneNamespace and deleteHostedClusterNamespace; the helper
should accept a context/testCtx, a delete function (like
deleteControlPlaneNamespace or deleteHostedClusterNamespace), the namespace
identifier (for use with
removeNamespaceObjectFinalizers/testCtx.ClusterNamespace or
testCtx.ControlPlaneNamespace), a logger, the short initial timeout
(1*time.Minute) and the retry timeout (DeletionTimeout), perform the initial
delete, check wait.Interrupted(err) to decide whether to remove finalizers via
removeNamespaceObjectFinalizers and retry, log the same informational message
when retrying, and return the original or retry error wrapped exactly as the
current returns do so call sites only replace their blocks with a call to
deleteWithRetry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: ee959b3d-2946-438e-85ff-65e311168b07

📥 Commits

Reviewing files that changed from the base of the PR and between ef65ca9 and b142d77.

📒 Files selected for processing (2)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/tests/backup_restore_test.go

@wangke19
Copy link
Contributor

For #7971 (comment): I'm not sure whether the retries are desired or how to achieve them; maybe this needs to be a separate feature. There are also Schedules that periodically run backups, which reduces the need for retries. It would need wider discussion and might result in a new feature. Would you mind filing this as a separate JIRA ticket? It sounds like too much for this single PR.

Those are features for the next version.

Would you mind having this as a separate JIRA ticket? sounds too much for this single PR.
Sure.

@wangke19
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 18, 2026
@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented Mar 18, 2026

Test Results

e2e-aws

Failed Tests

Total failed tests: 9

  • TestCreateCluster
  • TestCreateCluster/Teardown
  • TestNodePool
  • TestNodePool/HostedCluster0
  • TestNodePool/HostedCluster0/Main

... and 4 more failed tests

e2e-aks

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/retest

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/retest

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/test e2e-aws

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/test e2e-aws-4-21

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/test e2e-aws

1 similar comment
@mgencur
Contributor Author

mgencur commented Mar 20, 2026

/test e2e-aws

@mgencur
Contributor Author

mgencur commented Mar 20, 2026

/test e2e-aws

@wangke19
Contributor

Needs a rebase; e2e-aws was fixed and merged.

@mgencur
Contributor Author

mgencur commented Mar 23, 2026

/test e2e-aws

@mgencur
Contributor Author

mgencur commented Mar 23, 2026

/verified by @mgencur

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 23, 2026
@openshift-ci-robot

@mgencur: This PR has been marked as verified by @mgencur.

Details

In response to this:

/verified by @mgencur

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mgencur
Contributor Author

mgencur commented Mar 23, 2026

/test e2e-azure-self-managed

@openshift-ci
Contributor

openshift-ci bot commented Mar 23, 2026

@mgencur: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 783c4fe into openshift:main Mar 23, 2026
24 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cli Indicates the PR includes changes for CLI area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants