
OCPBUGS-83868: prevent IAM resource leaks during cluster destroy#8303

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from sdminonne:OCPBUGS-83868
May 2, 2026

Conversation

@sdminonne
Contributor

@sdminonne sdminonne commented Apr 21, 2026

When destroying an AWS cluster, failures at any step in the cleanup chain would cause all subsequent steps to be skipped, leaking IAM resources (OIDC providers, IAM roles, instance profiles).

Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources, and DestroySharedVPCRoles to collect errors and attempt all cleanup steps instead of returning on the first failure. This ensures that, for example, an infrastructure destroy failure does not prevent IAM cleanup, and that a single role deletion failure does not skip the remaining roles.

What this PR does / why we need it:

It fixes the most common leaks of IAM OIDC providers and roles.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-83868

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced AWS infrastructure and IAM destruction: The cluster cleanup process now attempts all resource deletion steps and aggregates all errors before reporting, rather than stopping at the first failure. This ensures more complete cleanup attempts and provides comprehensive visibility into all issues encountered.
  • Tests

    • Updated error handling tests to verify aggregated error reporting during cluster destruction.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all jobs from the second stage, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot

@sdminonne: No Jira issue with key fOCPBUGS-83868 exists in the tracker at https://redhat.atlassian.net.
Once a valid jira issue is referenced in the title of this pull request, request a refresh with /jira refresh.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@sdminonne sdminonne changed the title fOCPBUGS-83868: prevent IAM resource leaks during cluster destroy OCPBUGS-83868: prevent IAM resource leaks during cluster destroy Apr 21, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough


The pull request refactors error handling in AWS infrastructure and IAM destruction operations from a fail-fast approach to a fail-all approach. The destroyPlatformSpecifics function in destroy.go now accumulates errors from both infrastructure and IAM destroy steps and returns an aggregated error instead of returning immediately on failure. Similarly, destroy_iam.go is updated to collect errors from DestroyOIDCResources, DestroySharedVPCRoles, and worker instance profile destruction, then return them as a single aggregated error. Component role deletions are refactored to iterate over role lists and continue on failure rather than short-circuiting. Tests are expanded to verify that all error messages are present in the aggregated error output.

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (11 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title directly addresses the main change: preventing IAM resource leaks during cluster destroy by implementing error aggregation instead of fail-fast behavior in cleanup operations.
  • Linked Issues Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names ✅ Passed: the test file uses standard Go testing with Test functions and no Ginkgo imports detected.
  • Test Structure And Quality ✅ Passed: the custom check is designed for Ginkgo test code, but the file uses standard Go unit tests with table-driven patterns and gomock.
  • Microshift Test Compatibility ✅ Passed: this PR does not introduce any new Ginkgo e2e tests; changes are to CLI infrastructure code and standard Go unit tests.
  • Single Node Openshift (SNO) Test Compatibility ✅ Passed: no Ginkgo e2e tests are introduced; the modifications are standard Go unit tests using the testing.T framework with table-driven patterns and gomock, unit-testing CLI infrastructure cleanup logic rather than cluster behavior.
  • Topology-Aware Scheduling Compatibility ✅ Passed: the PR modifies only CLI commands and infrastructure cleanup utilities in AWS destroy functions, not deployment manifests, operator code, or controllers.
  • OTE Binary Stdout Contract ✅ Passed: modified files contain no process-level entry points or direct stdout writes; all logging uses an injected logr.Logger with a zap backend writing to stderr.
  • IPv6 And Disconnected Network Test Compatibility ✅ Passed: the PR modifies destroy_iam_test.go, a standard Go unit test file, not a Ginkgo e2e test.


@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 21, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels Apr 21, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/infra/aws/destroy_iam.go`:
- Around line 157-159: The call to DestroyOIDCRole("shared-role") currently
returns immediately on error which aborts cleanup for per-component roles;
change the flow so that when DestroyOIDCRole(ctx, iamClient, "shared-role")
returns an error you append that error to the errs slice (e.g., errs =
append(errs, err)) and do not return — allow execution to continue to the
subsequent per-component role cleanup logic; keep the existing use of the
removed boolean as-is and aggregate errs at the end (using
utilerrors.NewAggregate) rather than returning early.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 0177988a-625a-4e52-97e1-a4b4f9d2d615

📥 Commits

Reviewing files that changed from the base of the PR and between 3ec580c and 1558c31.

📒 Files selected for processing (3)
  • cmd/cluster/aws/destroy.go
  • cmd/infra/aws/destroy_iam.go
  • cmd/infra/aws/destroy_iam_test.go

Comment thread cmd/infra/aws/destroy_iam.go
@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 70.21277% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.35%. Comparing base (e7e828e) to head (100b927).
⚠️ Report is 78 commits behind head on main.

Files with missing lines:
  • cmd/infra/aws/destroy_iam.go: patch 78.57%, 9 lines missing ⚠️
  • cmd/cluster/aws/destroy.go: patch 0.00%, 5 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8303      +/-   ##
==========================================
+ Coverage   36.32%   36.35%   +0.03%     
==========================================
  Files         764      764              
  Lines       92886    92884       -2     
==========================================
+ Hits        33737    33766      +29     
+ Misses      56442    56421      -21     
+ Partials     2707     2697      -10     
Files with missing lines Coverage Δ
cmd/cluster/aws/destroy.go 4.58% <0.00%> (-0.08%) ⬇️
cmd/infra/aws/destroy_iam.go 62.71% <78.57%> (+10.82%) ⬆️
Flag Coverage Δ
cmd-support 30.12% <70.21%> (+0.09%) ⬆️
cpo-hostedcontrolplane 37.05% <ø> (ø)
cpo-other 35.69% <ø> (ø)
hypershift-operator 47.99% <ø> (ø)
other 27.68% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


@sdminonne
Contributor Author

/jira refresh

@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@sdminonne sdminonne marked this pull request as ready for review April 21, 2026 20:52
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 21, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and enxebre April 21, 2026 20:52
@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing labels Apr 23, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
cmd/infra/aws/destroy_iam_test.go (1)

552-559: Enforce call order to prove “continue after ingress failure” behavior.

Right now the expectations don’t assert sequencing, so this can pass even if control-plane role is checked first. Wrap these two calls in gomock.InOrder to lock in the intended flow.

Proposed test-hardening diff
 			setupVPCOwnerMock: func(m *awsapi.MockIAMAPI) {
-				m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(ingressRoleName)}, gomock.Any()).
-					Return(nil, errors.New("api error"))
-				m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(cpRoleName)}, gomock.Any()).
-					Return(nil, noSuchEntity())
+				gomock.InOrder(
+					m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(ingressRoleName)}, gomock.Any()).
+						Return(nil, errors.New("api error")),
+					m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(cpRoleName)}, gomock.Any()).
+						Return(nil, noSuchEntity()),
+				)
 			},

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 525efc49-262f-481a-b8ae-a1bdc5d2201e

📥 Commits

Reviewing files that changed from the base of the PR and between 1558c31 and 66b6a4e.

📒 Files selected for processing (4)
  • cmd/cluster/aws/destroy.go
  • cmd/infra/aws/destroy_iam.go
  • cmd/infra/aws/destroy_iam_test.go
  • test/e2e/nodepool_test.go
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/nodepool_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/infra/aws/destroy_iam.go

Comment thread cmd/infra/aws/destroy_iam.go Outdated
"karpenter",
}
for _, name := range roleNames {
if name == "openshift-ingress" {
Contributor


Better to have a separate list, e.g. roleNamesWithRetry := []string{"openshift-ingress"}, and iterate over it. This way we can easily add retry for different roles later.

Contributor

@jparrill jparrill left a comment


Dropped some comments. Thanks!

},
{
name: "When destroying the ingress role fails it should return the error",
name: "When destroying the ingress role fails it should still attempt control-plane role and return the error",
Contributor


Add test cases for multiple simultaneous failures in the role deletion loop to verify all errors are properly aggregated

Comment thread cmd/infra/aws/destroy_iam.go Outdated
Comment on lines +179 to +182
if name == "openshift-ingress" {
if err := o.DestroyOIDCRoleWithRetry(ctx, iamClient, name); err != nil {
errs = append(errs, err)
}
Contributor


Maybe it makes sense to extract this from the loop and handle it separately, instead of evaluating the name on every loop iteration.

err = o.DestroyWorkerInstanceProfile(ctx, iamClient)
if err != nil {
return err
if err = o.DestroyWorkerInstanceProfile(ctx, iamClient); err != nil {
Contributor


Inside the DestroyWorkerInstanceProfile function, we're still returning early instead of using the error aggregation pattern. Does it make sense to move it to the same strategy, or keep the early-return approach?

Contributor Author


Hmmm unsure. I would say here we have to return immediately.

Comment thread cmd/infra/aws/destroy_iam.go Outdated
Comment on lines +167 to +177
roleNames := []string{
"openshift-ingress",
"openshift-image-registry",
"aws-ebs-csi-driver-controller",
"cloud-controller",
"node-pool",
"control-plane-operator",
"cloud-network-config-controller",
"kms-provider",
"karpenter",
}
Contributor


I think it makes sense to have a shared list between creation and deletion to keep both in sync and avoid drift between the two processes. WDYT?

Contributor Author


Oh, great idea, but this is more of a whole refactor, while the goal of this PR is to fix short-term bugs impacting our daily activities. Could it be a follow-up?

sdminonne and others added 2 commits April 27, 2026 21:52
When destroying an AWS cluster, failures at any step in the cleanup
chain would cause all subsequent steps to be skipped, leaking IAM
resources (OIDC providers, IAM roles, instance profiles).

Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources,
and DestroySharedVPCRoles to collect errors and attempt all cleanup
steps instead of returning on the first failure. This ensures that
e.g. an infrastructure destroy failure does not prevent IAM cleanup,
and a single role deletion failure does not skip the remaining roles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix shared-role early return that still skipped per-component role
cleanup on error, extract openshift-ingress retry handling out of the
main role loop into a separate roleNamesWithRetry list, and add tests
for multiple simultaneous deletion failures to verify proper error
aggregation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/infra/aws/destroy_iam.go`:
- Around line 110-114: When creating the VPC-owner AWS session fails (the block
using VPCOwnerCredentialsOpts.AWSCredentialsFile and GetSession which currently
returns early with utilerrors.NewAggregate), do not return immediately; instead
append the session error into errs and continue so we still attempt shared-VPC
cleanup using the cluster-account iamClient when PrivateZonesInClusterAccount is
true. Update the code around vpcOwnerAWSSession/GetSession to collect the error
(add it to errs) and proceed, guarding any use of vpcOwnerAWSSession when nil,
and ensure the shared-VPC ingress deletion path (which uses iamClient and the
PrivateZonesInClusterAccount flag) runs even if GetSession failed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5d2d06c3-de63-4fdb-abaf-2b93c049eefd

📥 Commits

Reviewing files that changed from the base of the PR and between 5a0074b and 100b927.

📒 Files selected for processing (3)
  • cmd/cluster/aws/destroy.go
  • cmd/infra/aws/destroy_iam.go
  • cmd/infra/aws/destroy_iam_test.go

Comment on lines 110 to 114
if o.VPCOwnerCredentialsOpts.AWSCredentialsFile != "" {
vpcOwnerAWSSession, err := o.VPCOwnerCredentialsOpts.GetSession(ctx, "cli-destroy-iam", nil, o.Region)
if err != nil {
return err
return utilerrors.NewAggregate(append(errs, err))
}
Contributor

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

shared-vpc-ingress can still leak when VPC-owner session creation fails.

At Line 113, returning immediately skips any shared-VPC cleanup attempt. In PrivateZonesInClusterAccount=true, shared-vpc-ingress lives in the cluster account and can still be deleted with iamClient, but this path never attempts it.

Suggested fix
  if o.VPCOwnerCredentialsOpts.AWSCredentialsFile != "" {
      vpcOwnerAWSSession, err := o.VPCOwnerCredentialsOpts.GetSession(ctx, "cli-destroy-iam", nil, o.Region)
      if err != nil {
-         return utilerrors.NewAggregate(append(errs, err))
+         errs = append(errs, fmt.Errorf("failed to create VPC owner IAM session: %w", err))
+         // In private-zones mode, ingress role is in cluster account; still attempt cleanup.
+         if o.PrivateZonesInClusterAccount {
+             if _, ingressErr := o.DestroyOIDCRole(ctx, iamClient, "shared-vpc-ingress"); ingressErr != nil {
+                 errs = append(errs, ingressErr)
+             }
+         }
+         return utilerrors.NewAggregate(errs)
      }
      vpcOwnerIAMClient := iam.NewFromConfig(*vpcOwnerAWSSession, func(o *iam.Options) {
          o.Retryer = awsConfig()
      })

Contributor Author


This is a valid observation, but I'm intentionally not addressing it in this PR. Here's why:

  1. When GetSession for the VPC owner fails, it indicates a fundamental credentials problem. Proceeding with partial cleanup (only the ingress role) while silently skipping the shared-vpc-control-plane role cleanup would leave things in an inconsistent state.

  2. The error aggregation pattern in DestroyIAM already ensures the VPC session error is reported to the caller. The outer Run method retries on retryable errors, so transient credential issues will be retried.

  3. The PrivateZonesInClusterAccount + VPC owner session failure is an edge case that would require refactoring DestroyIAM to make GetSession mockable for proper testing. That's out of scope for this bug fix PR.

If we see this scenario causing real leaks in production, we can address it in a follow-up.

Contributor


@sdminonne, thanks for the detailed explanation — the reasoning is sound. You're right that attempting partial shared-VPC cleanup when the VPC-owner credentials are fundamentally broken could leave things in a worse, harder-to-reason-about state than a clean failure that gets retried. The existing retry loop in Run() is the appropriate safety net for transient credential issues.

Would you like me to open a follow-up GitHub issue to track the PrivateZonesInClusterAccount + VPC-owner session-failure edge case so it doesn't get lost?



@sdminonne
Contributor Author

@jparrill @muraee could you PTAL?

@muraee
Contributor

muraee commented Apr 30, 2026

/approve

@openshift-ci
Contributor

openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: muraee, sdminonne

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2026
Contributor

@jparrill jparrill left a comment


/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@sdminonne
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2049850754159611905 | Cost: $3.88 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

/test e2e-aks

@sdminonne
Contributor Author

/test e2e-azure-self-managed

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2050250102328528896 | Cost: $4.097735850000001 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

This appears to be an infrastructure flake rather than a code regression.

@sdminonne
Contributor Author

/test e2e-azure-self-managed

@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed
  • Build ID: 2050250102328528896
  • Target: e2e-azure-self-managed
  • PR: OCPBUGS-83868: prevent IAM resource leaks during cluster destroy #8303
  • Duration: ~2h 24m (16:24:50Z → 18:48:33Z)
  • Result: 288 tests run, 27 skipped, 7 failures (2 root failures + 5 parent propagations)

Test Failure Analysis

Error

--- FAIL: TestNodePool/HostedCluster0/Main/TestNodePoolReplaceUpgrade (3424.96s)
    nodepool_upgrade_test.go:238: Failed to wait for NodePool to have version 5.0.0-0.ci-2026-05-01-163528-test-ci-op-vvhyyw15-latest in 45m0s: context deadline exceeded
    AllNodesHealthy=False: NodeProvisioning(1 of 2 machines are not healthy)
    Machine node-pool-ptw72-test-replaceupgrade-sb42h-w6zzg: NodeProvisioning: Waiting for a node with matching ProviderID to exist

--- FAIL: TestAzureOAuthLoadBalancer/ValidateHostedCluster/EnsureNoCrashingPods (0.12s)
    util.go:817: Container kube-apiserver in pod kube-apiserver-5fb9d658bb-5s6dj has a restartCount > 0 (1)

Summary

Both failures are transient Azure infrastructure issues completely unrelated to PR #8303's code changes.

Failure 1 — TestNodePoolReplaceUpgrade (45-minute timeout): During a NodePool rolling replace-upgrade to version 5.0.0-0.ci-..., CAPI/CAPZ provisioned a new Azure VM (sb42h-w6zzg) but the VM never registered as a Kubernetes node. The machine remained stuck in NodeProvisioning state with the message "Waiting for a node with matching ProviderID to exist", meaning the Azure VM's ProviderID was never matched to a kubelet-registered node. After 45 minutes the test timed out. This is a known transient failure mode in Azure where VM-to-node registration can stall due to Azure API latency, cloud-provider reconciliation delays, or kubelet bootstrap issues on the new VM. The rate limiter error (rate: Wait(n=1) would exceed context deadline) confirms the test was also hitting Azure API throttling near the end of the timeout window.

Failure 2 — EnsureNoCrashingPods (kube-apiserver restart): A single kube-apiserver restart (restartCount=1) triggered the zero-tolerance EnsureNoCrashingPods check. A single restart of kube-apiserver is a common transient event in CI hosted control planes (leader election, memory pressure, or brief connectivity blip) and does not indicate a product defect.

PR #8303 is not implicated. The PR modifies exactly three files — all in AWS-specific code paths (cmd/cluster/aws/destroy.go, cmd/infra/aws/destroy_iam.go, cmd/infra/aws/destroy_iam_test.go). No Azure code, no NodePool upgrade logic, and no kube-apiserver configuration is touched. The CI system's own automated failure-analysis.md artifact independently corroborates this conclusion.

Root Cause

Primary Cause — TestNodePoolReplaceUpgrade:
Azure VM-to-Kubernetes-node ProviderID registration failure. During the NodePool replace-upgrade, CAPZ successfully provisioned Azure VM node-pool-ptw72-test-replaceupgrade-sb42h-w6zzg, but the VM's kubelet either failed to bootstrap or failed to register with the hosted cluster's API server with a matching ProviderID. The CAPI Machine object remained in NodeProvisioning state for the entire 45-minute timeout. Contributing factors include Azure API rate limiting (confirmed by rate: Wait(n=1) would exceed context deadline errors in the final minutes) and possible Azure regional API latency.
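
The stalled registration step amounts to a lookup like the following (an illustrative sketch of ProviderID matching, not the actual CAPI/CAPZ controller code; the type, function, and IDs are hypothetical stand-ins):

```go
package main

import "fmt"

// node is a minimal stand-in for a corev1.Node whose Spec.ProviderID is set
// by the cloud provider once the kubelet registers with the API server.
type node struct {
	Name       string
	ProviderID string
}

// nodeForProviderID returns the registered node whose ProviderID matches the
// Machine's. While no match exists, the Machine remains in NodeProvisioning
// ("Waiting for a node with matching ProviderID to exist").
func nodeForProviderID(providerID string, nodes []node) (node, bool) {
	for _, n := range nodes {
		if n.ProviderID == providerID {
			return n, true
		}
	}
	return node{}, false
}

func main() {
	nodes := []node{{Name: "worker-0", ProviderID: "azure:///subs/rg/worker-0"}}
	// The new VM's ProviderID never appears among registered nodes.
	_, found := nodeForProviderID("azure:///subs/rg/w6zzg", nodes)
	fmt.Println(found) // prints false
}
```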

Primary Cause — EnsureNoCrashingPods:
A single transient kube-apiserver pod restart (restartCount=1) in the hosted control plane. This is a known benign transient event in CI environments — kube-apiserver may restart once due to leader election transitions, liveness probe timing, or brief resource pressure. The EnsureNoCrashingPods check uses a zero-tolerance policy (restartCount > 0) that flags this as a failure despite it being operationally normal.
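
The zero-tolerance policy can be sketched as follows (a hedged approximation; the real `EnsureNoCrashingPods` helper operates on corev1 pod statuses, and the type here is a minimal stand-in):

```go
package main

import "fmt"

// containerStatus is a minimal stand-in for corev1.ContainerStatus.
type containerStatus struct {
	Name         string
	RestartCount int32
}

// crashingContainers applies a zero-tolerance policy: any restartCount > 0
// is flagged, even a single benign restart.
func crashingContainers(pod string, statuses []containerStatus) []string {
	var failures []string
	for _, s := range statuses {
		if s.RestartCount > 0 {
			failures = append(failures, fmt.Sprintf(
				"Container %s in pod %s has a restartCount > 0 (%d)",
				s.Name, pod, s.RestartCount))
		}
	}
	return failures
}

func main() {
	// A single restart is enough to fail the check.
	fmt.Println(crashingContainers("kube-apiserver-5fb9d658bb-5s6dj",
		[]containerStatus{{Name: "kube-apiserver", RestartCount: 1}}))
}
```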

PR #8303 Relevance: None. All three changed files are AWS IAM destroy logic. No Azure code paths, NodePool upgrade logic, or kube-apiserver configuration is modified.

Recommendations
  1. Retest the PR — Both failures are transient Azure infrastructure issues unrelated to the code changes. Run /retest to trigger a new CI run.
  2. No code changes needed — PR #8303 modifies only AWS IAM destroy paths; the Azure e2e failures have zero overlap with the changed code.
  3. If the retest fails with the same TestNodePoolReplaceUpgrade timeout, it may indicate a persistent Azure regional issue — consider checking Azure service health for the CI region or filing a CAPZ issue for ProviderID registration reliability.

Evidence

Stack trace — TestNodePoolReplaceUpgrade (from build-log.txt lines 1740–1760, 3346–3385):

eventually.go:105: Failed to get *v1beta1.NodePool: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
nodepool_upgrade_test.go:238: Failed to wait for NodePool e2e-clusters-6v5cp/node-pool-ptw72-test-replaceupgrade to have version 5.0.0-0.ci-2026-05-01-163528-test-ci-op-vvhyyw15-latest in 45m0s: context deadline exceeded
AllNodesHealthy=False: NodeProvisioning(1 of 2 machines are not healthy)
  Machine node-pool-ptw72-test-replaceupgrade-sb42h-w6zzg: NodeProvisioning: Waiting for a node with matching ProviderID to exist

Stack trace — EnsureNoCrashingPods (from build-log.txt line 3326):

util.go:817: Container kube-apiserver in pod kube-apiserver-5fb9d658bb-5s6dj has a restartCount > 0 (1)

PR #8303 changed files (verified via gh pr view 8303):

  • cmd/cluster/aws/destroy.go — AWS-only
  • cmd/infra/aws/destroy_iam.go — AWS-only
  • cmd/infra/aws/destroy_iam_test.go — AWS-only

JUnit results (junit_e2e.xml):

  • 7 total failures; 2 root failures + 5 parent test propagations
  • Root: TestNodePool/.../TestNodePoolReplaceUpgrade (3424.96s)
  • Root: TestAzureOAuthLoadBalancer/.../EnsureNoCrashingPods (0.12s)

CI's own failure analysis (failure-analysis.md artifact):

Failed CI step: e2e-azure-self-managed-hypershift-azure-run-e2e-self-managed — duration 1h23m55s


@sdminonne
Contributor Author

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 2, 2026
@openshift-ci-robot

@sdminonne: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 2, 2026

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 3df2163 into openshift:main May 2, 2026
39 checks passed
@openshift-ci-robot

@sdminonne: Jira Issue OCPBUGS-83868: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-83868 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Details

In response to this:

When destroying an AWS cluster, failures at any step in the cleanup chain would cause all subsequent steps to be skipped, leaking IAM resources (OIDC providers, IAM roles, instance profiles).

Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources, and DestroySharedVPCRoles to collect errors and attempt all cleanup steps instead of returning on the first failure. This ensures that e.g. an infrastructure destroy failure does not prevent IAM cleanup, and a single role deletion failure does not skip the remaining roles.

What this PR does / why we need it:

It fixes the most common IAM OIDC provider and role leaks.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-83868

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

  • Enhanced AWS infrastructure and IAM destruction: The cluster cleanup process now attempts all resource deletion steps and aggregates all errors before reporting, rather than stopping at the first failure. This ensures more complete cleanup attempts and provides comprehensive visibility into all issues encountered.

  • Tests

  • Updated error handling tests to verify aggregated error reporting during cluster destruction.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


Labels

  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cli — Indicates the PR includes changes for CLI
  • area/hypershift-operator — Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/aws — PR/issue for AWS (AWSPlatform) platform
  • area/testing — Indicates the PR includes changes for e2e testing
  • jira/severity-important — Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug — Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference — Indicates that this PR references a valid Jira ticket of any type.
  • lgtm — Indicates that a PR is ready to be merged.
  • verified — Signifies that the PR passed pre-merge verification criteria
