
NO-JIRA: refactor(oadp): unify backup/restore e2e test with platform auto-detection #7971

Merged
openshift-merge-bot[bot] merged 4 commits into openshift:main from mgencur:rerun_finalizer_deletion
Mar 23, 2026

Conversation

@mgencur
Contributor

@mgencur mgencur commented Mar 16, 2026

What this PR does / why we need it:

  • Merge the AWS-only BackupRestore test into a single platform-agnostic test that now works on both the AWS and Agent platforms.
  • Improve namespace deletion resilience during BreakHostedCluster by retrying finalizer removal when the initial 1-minute timeout expires. This is required because namespace deletion was sometimes hanging in CI: some resources were re-created after deletion and stayed stuck because their new finalizers were never removed.
  • Enable the continual ReconciliationActive probing that was previously skipped (CNTRLPLANE-2676).

Tested in openshift/release#76089

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Note

Medium Risk
Medium risk because it changes e2e test flow and teardown semantics (namespace deletion timeouts/retries), which can affect CI behavior and cluster cleanup reliability across platforms.

Overview
Platform-aware backup/restore e2e test: The BackupRestore test is refactored to auto-detect the hosted cluster platform and run on both AWS and Agent, using per-platform config for excluded workloads, optional post-restore hooks (OIDC/IAM fix on AWS), and inclusion of extra namespaces (e.g. Agent namespace) in OADP backup/schedule/restore requests.

More resilient teardown: BreakHostedClusterPreservingMachines now uses timeout-based namespace deletion and, if deletion doesn’t complete within 1 minute, re-strips finalizers and retries with the standard DeletionTimeout to avoid CI hangs.

Timing tweak: Increases OIDCTimeout from 20m to 30m.

Written by Cursor Bugbot for commit b142d77. This will update automatically on new commits.

Summary by CodeRabbit

  • Tests
    • Added platform-aware backup/restore tests with runtime gating (AWS, Agent), platform-specific hooks, additional namespaces, and conditional post-restore verification including nodepool readiness.
  • Bug Fixes
    • Namespace deletion now uses timeout-based waits with retries and finalizer removal to recover stalled deletions.
  • Chores
    • Increased default OIDC operation timeout from 20 to 30 minutes.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 16, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 16, 2026
@openshift-ci-robot

@mgencur: This pull request explicitly references no jira issue.

Details

In response to this:

Depends on #7837; this PR will need a rebase after that merges.

What this PR does / why we need it:

Refactor deleteNamespace to accept a timeout duration instead of a boolean wait flag. Implement a two-pass finalizer removal strategy to handle controllers recreating resources during namespace deletion.
Add NodePool WaitingForAvailableMachines state validation after restore.
This is required as the namespace deletion was sometimes hanging in CI. Some resources were re-created after deletion and kept hanging there because the new finalizers were not removed.

Tested in openshift/release#76089

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8ff52fdb-782e-4cfb-a912-eaf35abb5ee8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

This pull request changes namespace deletion helpers in test/e2e/v2/backuprestore/cleanup.go to use timeout-based semantics (time.Duration) instead of boolean wait flags, implementing a two-phase delete: initial short timeout, then finalizer stripping and a retry with a longer timeout on timeout events. Public signatures for deleteNamespace, deleteControlPlaneNamespace, and deleteHostedClusterNamespace were updated. test/e2e/v2/backuprestore/cli.go increases the OIDC timeout from 20 to 30 minutes. test/e2e/v2/tests/backup_restore_test.go adds per-platform backup/restore configuration (excludeWorkloads, postRestoreHook, additionalNamespaces) and runtime platform gating.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor TestRunner
    participant K8sAPI as Kubernetes API
    participant Controller as Controllers / Finalizers

    TestRunner->>K8sAPI: Request namespace deletion (short timeout)
    alt Deletion succeeds before timeout
        K8sAPI-->>TestRunner: Deleted
    else Deletion times out (context deadline)
        K8sAPI-->>TestRunner: Deletion pending / still exists
        TestRunner->>Controller: Strip namespace finalizers
        Controller-->>K8sAPI: Finalizers removed
        TestRunner->>K8sAPI: Retry namespace deletion (longer timeout)
        alt Deletion succeeds
            K8sAPI-->>TestRunner: Deleted
        else Non-timeout error
            K8sAPI-->>TestRunner: Error (propagate)
        end
    end
```

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from csrwng and jparrill March 16, 2026 07:44
@openshift-ci openshift-ci bot added area/cli Indicates the PR includes changes for CLI area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Mar 16, 2026
@mgencur mgencur force-pushed the rerun_finalizer_deletion branch from 9bfbd1f to 9580326 Compare March 16, 2026 13:03
@mgencur mgencur changed the title [WIP] NO-JIRA: test(e2e): use timeout-based namespace deletion in backup/restore NO-JIRA: test(e2e): use timeout-based namespace deletion in backup/restore Mar 16, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 16, 2026
@mgencur mgencur changed the title NO-JIRA: test(e2e): use timeout-based namespace deletion in backup/restore NO-JIRA: refactor(oadp): unify backup/restore e2e test with platform auto-detection Mar 17, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)

176-205: ⚠️ Potential issue | 🟠 Major

Add IncludeNamespaces to restore options.

The test passes platformCfg.additionalNamespaces to both schedule and backup creation, but the restore options omit it. This inconsistency means those namespaces will be backed up but never restored. Update line 248 to include:

IncludeNamespaces: platformCfg.additionalNamespaces,

Same fix applies to the second occurrence at lines 248–255.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/tests/backup_restore_test.go` around lines 176 - 205, The restore
options are missing IncludeNamespaces so namespaces backed up by
OADPSchedule/OADPBackup are not restored; update the OADPRestoreOptions
instances used in this test to include IncludeNamespaces:
platformCfg.additionalNamespaces (matching how OADPScheduleOptions and
OADPBackupOptions are constructed), i.e., add IncludeNamespaces to the restore
options near where backupOpts is created and to the second restore-options
occurrence referenced in the test so both restores include
platformCfg.additionalNamespaces.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/v2/backuprestore/cleanup.go`:
- Around line 76-94: The cleanup retry currently treats any error from
deleteControlPlaneNamespace/deleteHostedClusterNamespace as a timeout; change
the logic to only perform the finalizer-removal + retry when the returned error
is an actual timeout. Update the branches around deleteControlPlaneNamespace and
deleteHostedClusterNamespace to check the error with errors.Is for known timeout
signals (e.g. context.DeadlineExceeded or wait.ErrWaitTimeout) or a specific
sentinel timeout error returned by those functions, and return other errors
immediately. Use the same change for the analogous block later (lines ~302-345):
only call removeNamespaceObjectFinalizers and retry when the error indicates a
timeout, otherwise propagate the original error.

In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 270-288: The NodePool Ready=False check is timing-sensitive and
can be missed after the long control-plane readiness waits; to fix it, start
observing the NodePool transition immediately after restore completion instead
of after
WaitForControlPlaneStatefulSetsReadiness/WaitForControlPlaneDeploymentsReadiness:
move the Eventually block that calls getNodePool and internal.ValidateConditions
(checking hyperv1.NodePoolReadyConditionType == metav1.ConditionFalse with
Reason "WaitingForAvailableMachines") to run right after restore completion, or
run it concurrently as a prober (goroutine) that polls with
backuprestore.PollInterval and backuprestore.OIDCTimeout using the same testCtx
so it can record the transient state while the subsequent calls to
internal.WaitForControlPlaneStatefulSetsReadiness and
internal.WaitForControlPlaneDeploymentsReadiness still run.

---

Outside diff comments:
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 176-205: The restore options are missing IncludeNamespaces so
namespaces backed up by OADPSchedule/OADPBackup are not restored; update the
OADPRestoreOptions instances used in this test to include IncludeNamespaces:
platformCfg.additionalNamespaces (matching how OADPScheduleOptions and
OADPBackupOptions are constructed), i.e., add IncludeNamespaces to the restore
options near where backupOpts is created and to the second restore-options
occurrence referenced in the test so both restores include
platformCfg.additionalNamespaces.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 988cdf1d-2a4b-4b10-aeba-3bfcba14b798

📥 Commits

Reviewing files that changed from the base of the PR and between 2a03c4f and be896ff.

📒 Files selected for processing (3)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/backuprestore/cli.go
  • test/e2e/v2/tests/backup_restore_test.go

@mgencur mgencur force-pushed the rerun_finalizer_deletion branch from be896ff to b7a9415 Compare March 18, 2026 07:46

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)

270-288: ⚠️ Potential issue | 🟠 Major

NodePool transient-state check is still timing-sensitive

At Line 270, the test waits for control-plane readiness first, then at Line 276 starts looking for NodePoolReady=False with reason WaitingForAvailableMachines. That state can occur earlier and be missed, causing flaky failures.

Start observing this condition immediately after restore completion (or before the long readiness waits), then assert it was observed.

Suggested reorder (minimal change)
-By("Waiting for control plane statefulsets to be ready")
-err := internal.WaitForControlPlaneStatefulSetsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
-Expect(err).NotTo(HaveOccurred())
-By("Waiting for control plane deployments to be ready")
-err = internal.WaitForControlPlaneDeploymentsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
-Expect(err).NotTo(HaveOccurred())
 By("Waiting for NodePool to reach WaitingForAvailableMachines state")
 Eventually(func(g Gomega) {
     nodePool, err := getNodePool(testCtx)
     g.Expect(err).NotTo(HaveOccurred())
     g.Expect(nodePool).NotTo(BeNil())
     internal.ValidateConditions(g, nodePool, []util.Condition{
         {
             Type:   hyperv1.NodePoolReadyConditionType,
             Status: metav1.ConditionFalse,
             Reason: "WaitingForAvailableMachines",
         },
     })
 }).WithPolling(backuprestore.PollInterval).WithTimeout(backuprestore.OIDCTimeout).Should(Succeed())
+
+By("Waiting for control plane statefulsets to be ready")
+err := internal.WaitForControlPlaneStatefulSetsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
+Expect(err).NotTo(HaveOccurred())
+By("Waiting for control plane deployments to be ready")
+err = internal.WaitForControlPlaneDeploymentsReadiness(testCtx, backuprestore.RestoreTimeout, platformCfg.excludeWorkloads)
+Expect(err).NotTo(HaveOccurred())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/tests/backup_restore_test.go` around lines 270 - 288, The
transient NodePool condition check is placed after long control-plane readiness
waits and can be missed; move the Eventually block that calls getNodePool and
internal.ValidateConditions (asserting hyperv1.NodePoolReadyConditionType ==
metav1.ConditionFalse with Reason "WaitingForAvailableMachines") to run
immediately after restore completion (i.e., before calling
internal.WaitForControlPlaneStatefulSetsReadiness and
internal.WaitForControlPlaneDeploymentsReadiness) so the test starts observing
the transient state early, then continue with the existing readiness waits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 270-288: The transient NodePool condition check is placed after
long control-plane readiness waits and can be missed; move the Eventually block
that calls getNodePool and internal.ValidateConditions (asserting
hyperv1.NodePoolReadyConditionType == metav1.ConditionFalse with Reason
"WaitingForAvailableMachines") to run immediately after restore completion
(i.e., before calling internal.WaitForControlPlaneStatefulSetsReadiness and
internal.WaitForControlPlaneDeploymentsReadiness) so the test starts observing
the transient state early, then continue with the existing readiness waits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: ea4fd520-2f00-4969-83ac-aaf0ac96d1a1

📥 Commits

Reviewing files that changed from the base of the PR and between be896ff and b7a9415.

📒 Files selected for processing (2)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/tests/backup_restore_test.go

…anup

Refactor deleteNamespace to accept a timeout duration instead of a
boolean wait flag. Implement a two-pass finalizer removal strategy
to handle controllers recreating resources during namespace deletion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mgencur mgencur force-pushed the rerun_finalizer_deletion branch 2 times, most recently from ecf5b0a to cd884dd Compare March 18, 2026 08:01
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
test/e2e/v2/backuprestore/cleanup.go (1)

73-88: Consider extracting the duplicated timeout/retry flow into a helper.

Step 6 and Step 7 now implement almost the same retry sequence. A shared helper would reduce drift and make future changes safer.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 90-102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/backuprestore/cleanup.go` around lines 73 - 88, Extract the
duplicated timeout/retry sequence into a helper function (e.g.,
ensureNamespaceDeleted(ctx, namespace, logger)) that encapsulates: attempt
deleteControlPlaneNamespace(ctx, shortTimeout), on DeadlineExceeded or
wait.ErrWaitTimeout call removeNamespaceObjectFinalizers(ctx, namespace, logger)
and retry deleteControlPlaneNamespace(ctx, DeletionTimeout) returning any error;
then replace the two nearly identical blocks in Step 6 and Step 7 with calls to
this new helper using testCtx and testCtx.ControlPlaneNamespace so the retry
logic (including use of DeletionTimeout) is centralized and avoids drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/e2e/v2/backuprestore/cleanup.go`:
- Around line 73-88: Extract the duplicated timeout/retry sequence into a helper
function (e.g., ensureNamespaceDeleted(ctx, namespace, logger)) that
encapsulates: attempt deleteControlPlaneNamespace(ctx, shortTimeout), on
DeadlineExceeded or wait.ErrWaitTimeout call
removeNamespaceObjectFinalizers(ctx, namespace, logger) and retry
deleteControlPlaneNamespace(ctx, DeletionTimeout) returning any error; then
replace the two nearly identical blocks in Step 6 and Step 7 with calls to this
new helper using testCtx and testCtx.ControlPlaneNamespace so the retry logic
(including use of DeletionTimeout) is centralized and avoids drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 714ef578-3745-4b4d-ae10-46a6cf5c8928

📥 Commits

Reviewing files that changed from the base of the PR and between b7a9415 and cd884dd.

📒 Files selected for processing (3)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/backuprestore/cli.go
  • test/e2e/v2/tests/backup_restore_test.go

Contributor

@jparrill jparrill left a comment


Dropped a question. Let me know for tagging.

@jparrill
Contributor

/test e2e-azure-self-managed

@jparrill
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented Mar 18, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill, mgencur

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2026
@wangke19
Contributor

wangke19 commented Mar 18, 2026

Code Review — Automated Analysis

Reviewed by Claude Code with HyperShift profile (control-plane-sme, data-plane-sme, api-sme, cloud-provider-sme, hcp-architect-sme)
Focus area: Retry and fallback mechanism (user-requested) + full idiomatic Go + SOLID + DRY + platform correctness


🔴 Overall Verdict: FAIL — 4 blocking issues must be fixed before merge


Files Reviewed

  • test/e2e/v2/backuprestore/cleanup.go: E2E helper (namespace cleanup & retry logic)
  • test/e2e/v2/backuprestore/cli.go: E2E helper (OADP CLI wrapper & constants)
  • test/e2e/v2/tests/backup_restore_test.go: Ginkgo e2e test suite

🚨 Required Actions (Blocking)

R1 — Wrong error sentinel in retry logic (cleanup.go, Steps 6 & 7)

The current check is:

if !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, wait.ErrWaitTimeout) {
    return fmt.Errorf("failed to delete control plane namespace: %w", err)
}

This has two bugs:

  1. context.Canceled is not caught. wait.PollUntilContextTimeout returns ctx.Err() directly, which can be context.Canceled (e.g. when the outer testCtx.Context is cancelled by the test runner). A cancelled context falls through to the hard-error path, skipping the retry entirely — the opposite of the intended behaviour.

  2. wait.ErrWaitTimeout check is dead code. Deprecated since k8s 1.29; PollUntilContextTimeout never returns it (returns context errors directly). The errInterrupted.Is method matches any errInterrupted{}, not specifically timeout.

The vendored k8s.io/apimachinery/pkg/util/wait/error.go documents the intended API for exactly this case:

// Fix — use the API the wait package provides:
if !wait.Interrupted(err) {
    return fmt.Errorf("failed to delete control plane namespace: %w", err)
}

wait.Interrupted(err) checks all three sentinels (context.Canceled, context.DeadlineExceeded, errWaitTimeout) in one call. The comment in the vendored source reads: "Callers should use this method instead of comparing the error value directly to ErrWaitTimeout, as methods that cancel a context may not return that error." — directly applicable here.

This fix must be applied in both Step 6 (control plane namespace) and Step 7 (hosted cluster namespace).


R2 — hypershift-agents namespace is hardcoded — will silently break Agent platform restore (backup_restore_test.go)

hyperv1.AgentPlatform: {
    additionalNamespaces: []string{"hypershift-agents"},
},

AgentNamespace is a user-supplied field at cluster creation time (HostedCluster.Spec.Platform.Agent.AgentNamespace, set via --agent-namespace flag, marked required). Common CI values include assisted-installer, open-cluster-management, and others — there is no canonical value.

If the actual namespace differs from hypershift-agents, Velero will back up a wrong/empty namespace and silently miss all Agent CRs, causing a failed restore with no clear error.

Required fix: Read the namespace at runtime:

hostedCluster.Spec.Platform.Agent.AgentNamespace

and populate additionalNamespaces dynamically.
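
A minimal sketch of that runtime resolution, using trimmed-down stand-ins for the HyperShift types (the struct names here are illustrative; only the field path `Spec.Platform.Agent.AgentNamespace` comes from the review above):

```go
package main

import "fmt"

// Hypothetical trimmed mirrors of the HostedCluster platform types.
type AgentPlatformSpec struct{ AgentNamespace string }
type PlatformSpec struct {
	Type  string
	Agent *AgentPlatformSpec
}

// additionalNamespacesFor resolves the extra namespaces to include in the
// OADP backup at runtime instead of hardcoding "hypershift-agents".
func additionalNamespacesFor(p PlatformSpec) []string {
	if p.Type == "Agent" && p.Agent != nil && p.Agent.AgentNamespace != "" {
		return []string{p.Agent.AgentNamespace}
	}
	return nil
}

func main() {
	spec := PlatformSpec{Type: "Agent", Agent: &AgentPlatformSpec{AgentNamespace: "assisted-installer"}}
	fmt.Println(additionalNamespacesFor(spec))
	fmt.Println(additionalNamespacesFor(PlatformSpec{Type: "AWS"}))
}
```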


R3 — prober.Stop() in ContextVerifyContinual lacks nil guard — potential panic (backup_restore_test.go)

AfterAll correctly guards with if prober != nil, but the It block in ContextVerifyContinual calls prober.Stop() unconditionally. If BeforeAll skips the suite (unsupported platform), prober is nil and Stop() will panic. Add the same guard:

It("should verify continual operations completed successfully", func() {
    if prober == nil {
        Skip("prober not initialized")
    }
    err := prober.Stop()
    Expect(err).NotTo(HaveOccurred())
})

R4 — CNTRLPLANE-2676 skip removal is unverified (backup_restore_test.go)

The PR title says NO-JIRA but the removed Skip("Skipping until [CNTRLPLANE-2676](https://redhat.atlassian.net/browse/CNTRLPLANE-2676) is implemented") references a JIRA. These were always-skipped code paths — enabling them without validation evidence is risky. If ReconciliationActive transiently goes False during backup (which could be legitimate behaviour during reconciliation pausing), the prober will report false failures.

Please either:

  • Link the implementing PR/commit for CNTRLPLANE-2676 and confirm ReconciliationActive remains True throughout backup on both AWS and Agent platforms, or
  • Replace the blanket skip with a per-platform conditional or feature-gate env var.

Recommended Improvements (Non-blocking)

I1 — Extract duplicated retry block into a helper (DRY)

Steps 6 and 7 are structurally identical 10-line blocks. The only differences are the delete function called and the namespace string. Extract:

func deleteNamespaceWithRetry(
    testCtx *internal.TestContext,
    logger logr.Logger,
    namespace string,
    deleteFn func(*internal.TestContext, time.Duration) error,
    logLabel string,
) error {
    if err := deleteFn(testCtx, InitialDeletionTimeout); err != nil {
        if !wait.Interrupted(err) {
            return fmt.Errorf("failed to delete %s namespace: %w", logLabel, err)
        }
        logger.Info("namespace did not delete within initial timeout, removing finalizers and retrying",
            "namespace", namespace)
        if err := removeNamespaceObjectFinalizers(testCtx, namespace, logger); err != nil {
            return fmt.Errorf("failed to remove finalizers on retry for %s: %w", logLabel, err)
        }
        if err := deleteFn(testCtx, DeletionTimeout); err != nil {
            return fmt.Errorf("failed to delete %s namespace after retry: %w", logLabel, err)
        }
    }
    return nil
}

This also makes the wait.Interrupted fix a single-point change.

I2 — Add named constant for the 1-minute initial timeout

const InitialDeletionTimeout = 1 * time.Minute

The magic literal 1*time.Minute appears twice, has no name, and has no obvious relationship to DeletionTimeout. A named constant makes it a visible, tunable configuration point.

I3 — Add cloud-credential-operator to AWS excludeWorkloads

CCO relies on IAM WebIdentity and enters CrashLoopBackOff after restore until RunFixDrOidcIam completes and rotates the OIDC provider. The pre-restore and mid-restore ValidateControlPlaneDeploymentsReadiness calls will fail on CCO. Add cloud-credential-operator to the AWS exclusion list to prevent post-restore health check flakes.

I4 — Verify IncludeNamespaces is applied to all three option structs

Confirm that IncludeNamespaces: platformCfg.additionalNamespaces is set consistently in all three of OADPScheduleOptions, OADPBackupOptions, and OADPRestoreOptions. If applied to backup but not schedule or restore, namespaces will be silently missing from one operation.

I5 — Move NodePool validation flag into backupRestorePlatformConfig

// Current — second platform-dispatch mechanism outside the map:
if testCtx.GetHostedCluster().Spec.Platform.Type != hyperv1.AgentPlatform {
    // validate NodePool conditions
}

// Better — keep all per-platform decisions in the map:
type backupRestorePlatformConfig struct {
    excludeWorkloads     []string
    postRestoreHook      func(testCtx *internal.TestContext) error
    additionalNamespaces []string
    validateNodePools    bool   // add this
}

I6 — Add apierrors.IsNotFound tolerance in verifyReconciliationActiveFunction

If a test node fails after the prober starts but before ContextVerifyContinual stops it, the prober continues running into ContextBreakControlPlane where the HostedCluster is deleted. MgmtClient.Get returns NotFound, the prober records it as a failure, and AfterAll surfaces a confusing error. Treat IsNotFound as a terminal success condition (the cluster was intentionally destroyed).

I7 — Add file-existence check for AWS_GUEST_INFRA_CREDENTIALS_FILE

GetEnvVarValue defaults to ~/.aws/credentials if the env var is not set. If the file doesn't exist, the error surfaces only at RunFixDrOidcIam runtime with a non-obvious message. Add an explicit os.Stat check in the postRestoreHook closure with a clear Fail() message.

I8 — Document OIDCTimeout decoupling from BackupTimeout

Add a comment explaining why OIDCTimeout differs from BackupTimeout (e.g., "30 minutes to account for IAM propagation delays and etcd recovery time after restore"). Without a comment, the divergence will be puzzling to future readers, and the two constants may silently drift further apart.

I9 — Document gracePeriodSeconds=-1 sentinel

// Add:
const useServerDefaultGracePeriod int64 = -1

And update the deleteNamespace doc comment to explain that -1 means "omit the GracePeriodSeconds option and let the server apply its default." The current comment only explains 0.

I10 — Add JIRA ticket reference to the TODO

// TODO(mgencur): Remove this condition once https://redhat.atlassian.net/browse/MGMT-23509 is fixed

Without a ticket link, this TODO is likely to become permanent dead weight.

I11 — Verify the aws Ginkgo label is removed from the Describe block

The diff shows Label("backup-restore") without "aws", which is correct. If the aws label were still present, the test would remain gated to AWS CI lanes at the label-selector level even though the code is now platform-agnostic. Please confirm this is the case in the final diff.


Summary Table

  • R1 (🔴 Blocking · Retry logic): errors.Is misses context.Canceled; use wait.Interrupted(err)
  • R2 (🔴 Blocking · Agent platform): hypershift-agents hardcoded; must read HostedCluster.Spec.Platform.Agent.AgentNamespace
  • R3 (🔴 Blocking · Test safety): prober.Stop() nil dereference risk in ContextVerifyContinual
  • R4 (🔴 Blocking · Test validity): CNTRLPLANE-2676 skip removal unverified; link implementing PR
  • I1 (🟡 Recommended · DRY): extract duplicated retry block into deleteNamespaceWithRetry
  • I2 (🟡 Recommended · Style): 1*time.Minute magic literal → InitialDeletionTimeout constant
  • I3 (🟡 Recommended · AWS correctness): add cloud-credential-operator to AWS excludeWorkloads
  • I4 (🟡 Recommended · DRY): verify IncludeNamespaces applied to all three OADP option structs
  • I5 (🟡 Recommended · SOLID/ISP): move NodePool validation flag into backupRestorePlatformConfig
  • I6 (🟡 Recommended · Robustness): add IsNotFound tolerance in verifyReconciliationActiveFunction
  • I7 (🟡 Recommended · AWS correctness): add file-existence pre-flight check for AWS_GUEST_INFRA_CREDENTIALS_FILE
  • I8 (🟡 Recommended · Clarity): document OIDCTimeout decoupling from BackupTimeout
  • I9 (🟡 Recommended · Clarity): document gracePeriodSeconds=-1 sentinel
  • I10 (🟡 Recommended · Maintenance): add MGMT-23509 ticket link to TODO
  • I11 (🟡 Recommended · CI): verify aws Ginkgo label is removed

Generated by Claude Code automated review · HyperShift profile · 2026-03-18

@wangke19
Contributor

wangke19 commented Mar 18, 2026

The review above is from AI under my direction. Here is my own take: retry + fallback is not optional for OADP in HyperShift; it's required for production-grade reliability. Without it, you'll absolutely see flaky backups/restores due to exactly the issues we encountered (disk, network, control plane churn).
Backup Policy:

  • Retry: 3 attempts (exponential backoff)
  • Timeout: 30 min per backup

Fallback:

  • Retry same storage
  • Switch storage
  • Switch to filesystem backup
  • Reduce scope

Restore Policy:

  • Max retries: 3
  • Pre-check required: ✅
  • Staged restore: ✅
  • Conflict policy: configurable

Fallback:

  • Retry restore
  • Partial restore
  • Manual intervention trigger

@mgencur
Contributor Author

mgencur commented Mar 18, 2026

Fixing R1 - R3.

R4 — CNTRLPLANE-2676 skip removal is unverified (backup_restore_test.go)

The pauses were removed as part of openshift/hypershift-oadp-plugin#197
I already verified that. See my comment on the JIRA if you're interested.

…nil guard

Replace manual timeout error checks with wait.Interrupted(), resolve
Agent namespace from HostedCluster spec at runtime, and guard prober
against nil dereference on unsupported platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mgencur
Contributor Author

mgencur commented Mar 18, 2026

For #7971 (comment): I'm not sure whether the retries are desired or how to achieve them; maybe this needs to be a separate feature. There are also Schedules that periodically run backups, which reduces the need for retries. It would need wider discussion and might result in a new feature. Would you mind filing this as a separate JIRA ticket? It sounds like too much for this single PR.

Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)

116-117: ⚠️ Potential issue | 🟠 Major

Capture testCtx in a local variable before passing the closure to prober.Spawn().

Line 116 reassigns testCtx in BeforeEach for each test spec, while the prober spawns a goroutine at line 170 that repeatedly reads testCtx asynchronously. This shared mutable capture creates a data race where the spawned goroutine may read a stale or partially-updated testCtx reference.

Fix: capture immutable context for the prober
+			proberTestCtx := testCtx
 			verifyReconciliationActiveFunction := func() error {
 				hostedCluster := &hyperv1.HostedCluster{}
-				err := testCtx.MgmtClient.Get(testCtx.Context, crclient.ObjectKey{
-					Name:      testCtx.ClusterName,
-					Namespace: testCtx.ClusterNamespace,
+				err := proberTestCtx.MgmtClient.Get(proberTestCtx.Context, crclient.ObjectKey{
+					Name:      proberTestCtx.ClusterName,
+					Namespace: proberTestCtx.ClusterNamespace,
 				}, hostedCluster)
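The shared-capture problem behind this fix can be reproduced in isolation. The names below (`makeProbe`, `cfg`) are hypothetical; the point is that a closure over a value parameter pins a snapshot at creation time, whereas a closure over the outer variable would observe later reassignments:

```go
package main

import "fmt"

// makeProbe returns a closure over a snapshot of cfg taken at creation time,
// mirroring the "capture testCtx into a local variable" fix: cfg is a copy,
// so later reassignments of the caller's variable do not leak into the probe.
func makeProbe(cfg string) func() string {
	return func() string { return cfg }
}

func main() {
	cfg := "ctx-A"
	probe := makeProbe(cfg) // snapshot taken before any reassignment
	cfg = "ctx-B"           // simulates BeforeEach reassigning testCtx
	_ = cfg
	fmt.Println(probe()) // ctx-A
}
```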
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/tests/backup_restore_test.go` around lines 116 - 117, Before
passing the closure to prober.Spawn() from your BeforeEach, capture the current
testCtx into a local immutable variable (e.g., localCtx := testCtx) and have the
spawned closure reference that localCtx instead of the outer testCtx; this
avoids the shared mutable capture that causes a data race between BeforeEach
reassignment of testCtx and the goroutine started by prober.Spawn(), so update
the closure(s) passed to prober.Spawn() to use the new localCtx.
🧹 Nitpick comments (1)
test/e2e/v2/backuprestore/cleanup.go (1)

72-101: Extract Step 6/7 retry flow into a shared helper.

Both blocks implement the same timeout → finalizer cleanup → retry sequence. Keeping this duplicated makes timeout/error behavior easy to drift later.

♻️ Refactor sketch
+func retryNamespaceDeleteWithFinalizerCleanup(
+	testCtx *internal.TestContext,
+	logger logr.Logger,
+	ns string,
+	initialTimeout time.Duration,
+	deleteFn func(*internal.TestContext, time.Duration) error,
+) error {
+	if err := deleteFn(testCtx, initialTimeout); err != nil {
+		if !wait.Interrupted(err) {
+			return err
+		}
+		logger.Info("Namespace did not delete within initial timeout, removing finalizers again and retrying", "namespace", ns)
+		if err := removeNamespaceObjectFinalizers(testCtx, ns, logger); err != nil {
+			return fmt.Errorf("failed to remove finalizers on retry for namespace %s: %w", ns, err)
+		}
+		if err := deleteFn(testCtx, DeletionTimeout); err != nil {
+			return fmt.Errorf("failed to delete namespace %s after retry: %w", ns, err)
+		}
+	}
+	return nil
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/v2/backuprestore/cleanup.go` around lines 72 - 101, Extract the
duplicated timeout→finalizer-cleanup→retry logic into a shared helper (e.g.
deleteWithRetry) and call it from both places instead of duplicating the block
around deleteControlPlaneNamespace and deleteHostedClusterNamespace; the helper
should accept a context/testCtx, a delete function (like
deleteControlPlaneNamespace or deleteHostedClusterNamespace), the namespace
identifier (for use with
removeNamespaceObjectFinalizers/testCtx.ClusterNamespace or
testCtx.ControlPlaneNamespace), a logger, the short initial timeout
(1*time.Minute) and the retry timeout (DeletionTimeout), perform the initial
delete, check wait.Interrupted(err) to decide whether to remove finalizers via
removeNamespaceObjectFinalizers and retry, log the same informational message
when retrying, and return the original or retry error wrapped exactly as the
current returns do so call sites only replace their blocks with a call to
deleteWithRetry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: ee959b3d-2946-438e-85ff-65e311168b07

📥 Commits

Reviewing files that changed from the base of the PR and between ef65ca9 and b142d77.

📒 Files selected for processing (2)
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/tests/backup_restore_test.go

@wangke19
Copy link
Contributor

For #7971 (comment): I'm not sure whether the retries are desired or how to achieve them; maybe this needs to be a separate feature. There are also Schedules that periodically run backups, which reduces the need for retries. It would need wider discussion and might result in a new feature. Would you mind filing this as a separate JIRA ticket? It sounds like too much for this single PR.

Those are features for the next version.

Would you mind having this as a separate JIRA ticket? sounds too much for this single PR.
Sure.

@wangke19
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 18, 2026
@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented Mar 18, 2026

Test Results

e2e-aws

Failed Tests

Total failed tests: 9

  • TestCreateCluster
  • TestCreateCluster/Teardown
  • TestNodePool
  • TestNodePool/HostedCluster0
  • TestNodePool/HostedCluster0/Main

... and 4 more failed tests

e2e-aks

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/retest

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/retest

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/test e2e-aws

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/test e2e-aws-4-21

@mgencur
Contributor Author

mgencur commented Mar 19, 2026

/test e2e-aws

1 similar comment
@mgencur
Contributor Author

mgencur commented Mar 20, 2026

/test e2e-aws

@mgencur
Contributor Author

mgencur commented Mar 20, 2026

/test e2e-aws

@wangke19
Contributor

Needs a rebase; e2e-aws was fixed and merged.

@mgencur
Contributor Author

mgencur commented Mar 23, 2026

/test e2e-aws

@mgencur
Contributor Author

mgencur commented Mar 23, 2026

/verified by @mgencur

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 23, 2026
@openshift-ci-robot

@mgencur: This PR has been marked as verified by @mgencur.

Details

In response to this:

/verified by @mgencur

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mgencur
Contributor Author

mgencur commented Mar 23, 2026

/test e2e-azure-self-managed

@openshift-ci
Contributor

openshift-ci bot commented Mar 23, 2026

@mgencur: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 783c4fe into openshift:main Mar 23, 2026
24 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cli Indicates the PR includes changes for CLI area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants