
OCPBUGS-83868: prevent IAM resource leaks during cluster destroy#8303

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from sdminonne:OCPBUGS-83868
May 2, 2026

Conversation

@sdminonne
Contributor

@sdminonne sdminonne commented Apr 21, 2026

When destroying an AWS cluster, failures at any step in the cleanup chain would cause all subsequent steps to be skipped, leaking IAM resources (OIDC providers, IAM roles, instance profiles).

Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources, and DestroySharedVPCRoles to collect errors and attempt all cleanup steps instead of returning on the first failure. This ensures that, for example, an infrastructure destroy failure does not prevent IAM cleanup, and that a single role deletion failure does not skip the remaining roles.

What this PR does / why we need it:

It fixes the most common leaks of IAM OIDC providers and roles.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-83868

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced AWS infrastructure and IAM destruction: The cluster cleanup process now attempts all resource deletion steps and aggregates all errors before reporting, rather than stopping at the first failure. This ensures more complete cleanup attempts and provides comprehensive visibility into all issues encountered.
  • Tests

    • Updated error handling tests to verify aggregated error reporting during cluster destruction.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all jobs from the second stage, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot

@sdminonne: No Jira issue with key fOCPBUGS-83868 exists in the tracker at https://redhat.atlassian.net.
Once a valid jira issue is referenced in the title of this pull request, request a refresh with /jira refresh.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@sdminonne sdminonne changed the title fOCPBUGS-83868: prevent IAM resource leaks during cluster destroy OCPBUGS-83868: prevent IAM resource leaks during cluster destroy Apr 21, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough


The pull request refactors error handling in AWS infrastructure and IAM destruction operations from a fail-fast approach to a fail-all approach. The destroyPlatformSpecifics function in destroy.go now accumulates errors from both infrastructure and IAM destroy steps and returns an aggregated error instead of returning immediately on failure. Similarly, destroy_iam.go is updated to collect errors from DestroyOIDCResources, DestroySharedVPCRoles, and worker instance profile destruction, then return them as a single aggregated error. Component role deletions are refactored to iterate over role lists and continue on failure rather than short-circuiting. Tests are expanded to verify that all error messages are present in the aggregated error output.

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (11 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title directly addresses the main change: preventing IAM resource leaks during cluster destroy by implementing error aggregation instead of fail-fast behavior in cleanup operations.
  • Linked Issues Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names ✅ Passed: the test file uses standard Go testing with Test functions and no Ginkgo imports detected.
  • Test Structure And Quality ✅ Passed: the custom check is designed for Ginkgo test code, but the file uses standard Go unit tests with table-driven patterns and gomock.
  • Microshift Test Compatibility ✅ Passed: this PR does not introduce any new Ginkgo e2e tests; changes are to CLI infrastructure code and standard Go unit tests.
  • Single Node Openshift (SNO) Test Compatibility ✅ Passed: no Ginkgo e2e tests are introduced; the modifications are standard Go unit tests using the testing.T framework with table-driven patterns and gomock, unit-testing CLI infrastructure cleanup logic rather than cluster behavior.
  • Topology-Aware Scheduling Compatibility ✅ Passed: the PR modifies only CLI commands and infrastructure cleanup utilities in AWS destroy functions, not deployment manifests, operator code, or controllers.
  • OTE Binary Stdout Contract ✅ Passed: modified files contain no process-level entry points or direct stdout writes; all logging uses an injected logr.Logger with a zap backend writing to stderr.
  • IPv6 And Disconnected Network Test Compatibility ✅ Passed: the PR modifies destroy_iam_test.go, a standard Go unit test file, not a Ginkgo e2e test.


@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 21, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels Apr 21, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/infra/aws/destroy_iam.go`:
- Around line 157-159: The call to DestroyOIDCRole("shared-role") currently
returns immediately on error which aborts cleanup for per-component roles;
change the flow so that when DestroyOIDCRole(ctx, iamClient, "shared-role")
returns an error you append that error to the errs slice (e.g., errs =
append(errs, err)) and do not return — allow execution to continue to the
subsequent per-component role cleanup logic; keep the existing use of the
removed boolean as-is and aggregate errs at the end (using
utilerrors.NewAggregate) rather than returning early.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 0177988a-625a-4e52-97e1-a4b4f9d2d615

📥 Commits

Reviewing files that changed from the base of the PR and between 3ec580c and 1558c31.

📒 Files selected for processing (3)
  • cmd/cluster/aws/destroy.go
  • cmd/infra/aws/destroy_iam.go
  • cmd/infra/aws/destroy_iam_test.go

Comment thread cmd/infra/aws/destroy_iam.go
@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 70.21277% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.35%. Comparing base (e7e828e) to head (100b927).
⚠️ Report is 78 commits behind head on main.

Files with missing lines:
  • cmd/infra/aws/destroy_iam.go: patch 78.57%, 9 lines missing ⚠️
  • cmd/cluster/aws/destroy.go: patch 0.00%, 5 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8303      +/-   ##
==========================================
+ Coverage   36.32%   36.35%   +0.03%     
==========================================
  Files         764      764              
  Lines       92886    92884       -2     
==========================================
+ Hits        33737    33766      +29     
+ Misses      56442    56421      -21     
+ Partials     2707     2697      -10     
Files with missing lines Coverage Δ
cmd/cluster/aws/destroy.go 4.58% <0.00%> (-0.08%) ⬇️
cmd/infra/aws/destroy_iam.go 62.71% <78.57%> (+10.82%) ⬆️
Flag Coverage Δ
cmd-support 30.12% <70.21%> (+0.09%) ⬆️
cpo-hostedcontrolplane 37.05% <ø> (ø)
cpo-other 35.69% <ø> (ø)
hypershift-operator 47.99% <ø> (ø)
other 27.68% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


@sdminonne
Contributor Author

/jira refresh

@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@sdminonne sdminonne marked this pull request as ready for review April 21, 2026 20:52
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 21, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and enxebre April 21, 2026 20:52
@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing labels Apr 23, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
cmd/infra/aws/destroy_iam_test.go (1)

552-559: Enforce call order to prove “continue after ingress failure” behavior.

Right now the expectations don’t assert sequencing, so this can pass even if control-plane role is checked first. Wrap these two calls in gomock.InOrder to lock in the intended flow.

Proposed test-hardening diff
 			setupVPCOwnerMock: func(m *awsapi.MockIAMAPI) {
-				m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(ingressRoleName)}, gomock.Any()).
-					Return(nil, errors.New("api error"))
-				m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(cpRoleName)}, gomock.Any()).
-					Return(nil, noSuchEntity())
+				gomock.InOrder(
+					m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(ingressRoleName)}, gomock.Any()).
+						Return(nil, errors.New("api error")),
+					m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(cpRoleName)}, gomock.Any()).
+						Return(nil, noSuchEntity()),
+				)
 			},

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 525efc49-262f-481a-b8ae-a1bdc5d2201e

📥 Commits

Reviewing files that changed from the base of the PR and between 1558c31 and 66b6a4e.

📒 Files selected for processing (4)
  • cmd/cluster/aws/destroy.go
  • cmd/infra/aws/destroy_iam.go
  • cmd/infra/aws/destroy_iam_test.go
  • test/e2e/nodepool_test.go
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/nodepool_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/infra/aws/destroy_iam.go

Comment thread cmd/infra/aws/destroy_iam.go Outdated
"karpenter",
}
for _, name := range roleNames {
if name == "openshift-ingress" {
Contributor


Better to have a separate list, e.g. roleNamesWithRetry := []string{"openshift-ingress"}, and iterate over it. This way we can easily add retry for different roles later.

Contributor

@jparrill jparrill left a comment


Dropped some comments. Thanks!

},
{
name: "When destroying the ingress role fails it should return the error",
name: "When destroying the ingress role fails it should still attempt control-plane role and return the error",
Contributor


Add test cases for multiple simultaneous failures in the role deletion loop to verify all errors are properly aggregated

Comment thread cmd/infra/aws/destroy_iam.go Outdated
Comment on lines +179 to +182
if name == "openshift-ingress" {
if err := o.DestroyOIDCRoleWithRetry(ctx, iamClient, name); err != nil {
errs = append(errs, err)
}
Contributor


Maybe it makes sense to extract this from the loop and handle it separately, instead of evaluating the name on every loop iteration.

err = o.DestroyWorkerInstanceProfile(ctx, iamClient)
if err != nil {
return err
if err = o.DestroyWorkerInstanceProfile(ctx, iamClient); err != nil {
Contributor


Inside the DestroyWorkerInstanceProfile function, we're still returning early instead of using the error aggregation pattern. Does it make sense to move it to the same strategy, or keep the early-return approach?

Contributor Author


Hmmm unsure. I would say here we have to return immediately.

Comment thread cmd/infra/aws/destroy_iam.go Outdated
Comment on lines +167 to +177
roleNames := []string{
"openshift-ingress",
"openshift-image-registry",
"aws-ebs-csi-driver-controller",
"cloud-controller",
"node-pool",
"control-plane-operator",
"cloud-network-config-controller",
"kms-provider",
"karpenter",
}
Contributor


I think it makes sense to have a shared list between creation and deletion to keep both in sync and avoid drift between the two processes. WDYT?

Contributor Author


Oh, great idea, but this is more of a whole refactor, while the goal of this PR is to fix short-term bugs impacting our daily activities. Could it be a follow-up?

sdminonne and others added 2 commits April 27, 2026 21:52
When destroying an AWS cluster, failures at any step in the cleanup
chain would cause all subsequent steps to be skipped, leaking IAM
resources (OIDC providers, IAM roles, instance profiles).

Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources,
and DestroySharedVPCRoles to collect errors and attempt all cleanup
steps instead of returning on the first failure. This ensures that
e.g. an infrastructure destroy failure does not prevent IAM cleanup,
and a single role deletion failure does not skip the remaining roles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix shared-role early return that still skipped per-component role
cleanup on error, extract openshift-ingress retry handling out of the
main role loop into a separate roleNamesWithRetry list, and add tests
for multiple simultaneous deletion failures to verify proper error
aggregation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/infra/aws/destroy_iam.go`:
- Around line 110-114: When creating the VPC-owner AWS session fails (the block
using VPCOwnerCredentialsOpts.AWSCredentialsFile and GetSession which currently
returns early with utilerrors.NewAggregate), do not return immediately; instead
append the session error into errs and continue so we still attempt shared-VPC
cleanup using the cluster-account iamClient when PrivateZonesInClusterAccount is
true. Update the code around vpcOwnerAWSSession/GetSession to collect the error
(add it to errs) and proceed, guarding any use of vpcOwnerAWSSession when nil,
and ensure the shared-VPC ingress deletion path (which uses iamClient and the
PrivateZonesInClusterAccount flag) runs even if GetSession failed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5d2d06c3-de63-4fdb-abaf-2b93c049eefd

📥 Commits

Reviewing files that changed from the base of the PR and between 5a0074b and 100b927.

📒 Files selected for processing (3)
  • cmd/cluster/aws/destroy.go
  • cmd/infra/aws/destroy_iam.go
  • cmd/infra/aws/destroy_iam_test.go

Comment on lines 110 to 114
if o.VPCOwnerCredentialsOpts.AWSCredentialsFile != "" {
vpcOwnerAWSSession, err := o.VPCOwnerCredentialsOpts.GetSession(ctx, "cli-destroy-iam", nil, o.Region)
if err != nil {
return err
return utilerrors.NewAggregate(append(errs, err))
}
Contributor

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

shared-vpc-ingress can still leak when VPC-owner session creation fails.

At Line 113, returning immediately skips any shared-VPC cleanup attempt. In PrivateZonesInClusterAccount=true, shared-vpc-ingress lives in the cluster account and can still be deleted with iamClient, but this path never attempts it.

Suggested fix
  if o.VPCOwnerCredentialsOpts.AWSCredentialsFile != "" {
      vpcOwnerAWSSession, err := o.VPCOwnerCredentialsOpts.GetSession(ctx, "cli-destroy-iam", nil, o.Region)
      if err != nil {
-         return utilerrors.NewAggregate(append(errs, err))
+         errs = append(errs, fmt.Errorf("failed to create VPC owner IAM session: %w", err))
+         // In private-zones mode, ingress role is in cluster account; still attempt cleanup.
+         if o.PrivateZonesInClusterAccount {
+             if _, ingressErr := o.DestroyOIDCRole(ctx, iamClient, "shared-vpc-ingress"); ingressErr != nil {
+                 errs = append(errs, ingressErr)
+             }
+         }
+         return utilerrors.NewAggregate(errs)
      }
      vpcOwnerIAMClient := iam.NewFromConfig(*vpcOwnerAWSSession, func(o *iam.Options) {
          o.Retryer = awsConfig()
      })

Contributor Author


This is a valid observation, but I'm intentionally not addressing it in this PR. Here's why:

  1. When GetSession for the VPC owner fails, it indicates a fundamental credentials problem. Proceeding with partial cleanup (only the ingress role) while silently skipping the shared-vpc-control-plane role cleanup would leave things in an inconsistent state.

  2. The error aggregation pattern in DestroyIAM already ensures the VPC session error is reported to the caller. The outer Run method retries on retryable errors, so transient credential issues will be retried.

  3. The PrivateZonesInClusterAccount + VPC owner session failure is an edge case that would require refactoring DestroyIAM to make GetSession mockable for proper testing. That's out of scope for this bug fix PR.

If we see this scenario causing real leaks in production, we can address it in a follow-up.

Contributor


@sdminonne, thanks for the detailed explanation — the reasoning is sound. You're right that attempting partial shared-VPC cleanup when the VPC-owner credentials are fundamentally broken could leave things in a worse, harder-to-reason-about state than a clean failure that gets retried. The existing retry loop in Run() is the appropriate safety net for transient credential issues.

Would you like me to open a follow-up GitHub issue to track the PrivateZonesInClusterAccount + VPC-owner session-failure edge case so it doesn't get lost?



@sdminonne
Contributor Author

@jparrill @muraee could you PTAL?

@muraee
Contributor

muraee commented Apr 30, 2026

/approve

@openshift-ci
Contributor

openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: muraee, sdminonne

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2026
Contributor

@jparrill jparrill left a comment


/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@sdminonne
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2049850754159611905 | Cost: $3.88 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

/test e2e-aks

@sdminonne
Contributor Author

/test e2e-azure-self-managed

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2050250102328528896 | Cost: $4.097735850000001 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

This appears to be an infrastructure flake rather than a code regression.

@sdminonne
Contributor Author

/test e2e-azure-self-managed

@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed
  • Build ID: 2050250102328528896
  • Target: e2e-azure-self-managed
  • PR: OCPBUGS-83868: prevent IAM resource leaks during cluster destroy #8303
  • Duration: ~2h 24m (16:24:50Z → 18:48:33Z)
  • Result: 288 tests run, 27 skipped, 7 failures (2 root failures + 5 parent propagations)

Test Failure Analysis

Error

--- FAIL: TestNodePool/HostedCluster0/Main/TestNodePoolReplaceUpgrade (3424.96s)
    nodepool_upgrade_test.go:238: Failed to wait for NodePool to have version 5.0.0-0.ci-2026-05-01-163528-test-ci-op-vvhyyw15-latest in 45m0s: context deadline exceeded
    AllNodesHealthy=False: NodeProvisioning(1 of 2 machines are not healthy)
    Machine node-pool-ptw72-test-replaceupgrade-sb42h-w6zzg: NodeProvisioning: Waiting for a node with matching ProviderID to exist

--- FAIL: TestAzureOAuthLoadBalancer/ValidateHostedCluster/EnsureNoCrashingPods (0.12s)
    util.go:817: Container kube-apiserver in pod kube-apiserver-5fb9d658bb-5s6dj has a restartCount > 0 (1)

Summary

Both failures are transient Azure infrastructure issues completely unrelated to PR #8303's code changes.

Failure 1 — TestNodePoolReplaceUpgrade (45-minute timeout): During a NodePool rolling replace-upgrade to version 5.0.0-0.ci-..., CAPI/CAPZ provisioned a new Azure VM (sb42h-w6zzg) but the VM never registered as a Kubernetes node. The machine remained stuck in NodeProvisioning state with the message "Waiting for a node with matching ProviderID to exist", meaning the Azure VM's ProviderID was never matched to a kubelet-registered node. After 45 minutes the test timed out. This is a known transient failure mode in Azure where VM-to-node registration can stall due to Azure API latency, cloud-provider reconciliation delays, or kubelet bootstrap issues on the new VM. The rate limiter error (rate: Wait(n=1) would exceed context deadline) confirms the test was also hitting Azure API throttling near the end of the timeout window.

Failure 2 — EnsureNoCrashingPods (kube-apiserver restart): A single kube-apiserver restart (restartCount=1) triggered the zero-tolerance EnsureNoCrashingPods check. A single restart of kube-apiserver is a common transient event in CI hosted control planes (leader election, memory pressure, or brief connectivity blip) and does not indicate a product defect.

PR #8303 is not implicated. The PR modifies exactly three files — all in AWS-specific code paths (cmd/cluster/aws/destroy.go, cmd/infra/aws/destroy_iam.go, cmd/infra/aws/destroy_iam_test.go). No Azure code, no NodePool upgrade logic, and no kube-apiserver configuration is touched. The CI system's own automated failure-analysis.md artifact independently corroborates this conclusion.

Root Cause

Primary Cause — TestNodePoolReplaceUpgrade:
Azure VM-to-Kubernetes-node ProviderID registration failure. During the NodePool replace-upgrade, CAPZ successfully provisioned Azure VM node-pool-ptw72-test-replaceupgrade-sb42h-w6zzg, but the VM's kubelet either failed to bootstrap or failed to register with the hosted cluster's API server with a matching ProviderID. The CAPI Machine object remained in NodeProvisioning state for the entire 45-minute timeout. Contributing factors include Azure API rate limiting (confirmed by rate: Wait(n=1) would exceed context deadline errors in the final minutes) and possible Azure regional API latency.
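
The stalled registration step amounts to a lookup like the following (an illustrative sketch of ProviderID matching, not the actual CAPI/CAPZ controller code; the type, function, and IDs are hypothetical stand-ins):

```go
package main

import "fmt"

// node is a minimal stand-in for a corev1.Node whose Spec.ProviderID is set
// by the cloud provider once the kubelet registers with the API server.
type node struct {
	Name       string
	ProviderID string
}

// nodeForProviderID returns the registered node whose ProviderID matches the
// Machine's. While no match exists, the Machine remains in NodeProvisioning
// ("Waiting for a node with matching ProviderID to exist").
func nodeForProviderID(providerID string, nodes []node) (node, bool) {
	for _, n := range nodes {
		if n.ProviderID == providerID {
			return n, true
		}
	}
	return node{}, false
}

func main() {
	nodes := []node{{Name: "worker-0", ProviderID: "azure:///subs/rg/worker-0"}}
	// The new VM's ProviderID never appears among registered nodes.
	_, found := nodeForProviderID("azure:///subs/rg/w6zzg", nodes)
	fmt.Println(found) // prints false
}
```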

Primary Cause — EnsureNoCrashingPods:
A single transient kube-apiserver pod restart (restartCount=1) in the hosted control plane. This is a known benign transient event in CI environments — kube-apiserver may restart once due to leader election transitions, liveness probe timing, or brief resource pressure. The EnsureNoCrashingPods check uses a zero-tolerance policy (restartCount > 0) that flags this as a failure despite it being operationally normal.
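
The zero-tolerance policy can be sketched as follows (a hedged approximation; the real `EnsureNoCrashingPods` helper operates on corev1 pod statuses, and the type here is a minimal stand-in):

```go
package main

import "fmt"

// containerStatus is a minimal stand-in for corev1.ContainerStatus.
type containerStatus struct {
	Name         string
	RestartCount int32
}

// crashingContainers applies a zero-tolerance policy: any restartCount > 0
// is flagged, even a single benign restart.
func crashingContainers(pod string, statuses []containerStatus) []string {
	var failures []string
	for _, s := range statuses {
		if s.RestartCount > 0 {
			failures = append(failures, fmt.Sprintf(
				"Container %s in pod %s has a restartCount > 0 (%d)",
				s.Name, pod, s.RestartCount))
		}
	}
	return failures
}

func main() {
	// A single restart is enough to fail the check.
	fmt.Println(crashingContainers("kube-apiserver-5fb9d658bb-5s6dj",
		[]containerStatus{{Name: "kube-apiserver", RestartCount: 1}}))
}
```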

PR #8303 Relevance: None. All three changed files are AWS IAM destroy logic. No Azure code paths, NodePool upgrade logic, or kube-apiserver configuration is modified.

Recommendations
  1. Retest the PR — Both failures are transient Azure infrastructure issues unrelated to the code changes. Run /retest to trigger a new CI run.
  2. No code changes needed — PR #8303 modifies only AWS IAM destroy paths; the Azure e2e failures have zero overlap with the changed code.
  3. If the retest fails with the same TestNodePoolReplaceUpgrade timeout, it may indicate a persistent Azure regional issue — consider checking Azure service health for the CI region or filing a CAPZ issue for ProviderID registration reliability.

Evidence

Stack trace — TestNodePoolReplaceUpgrade (from build-log.txt lines 1740–1760, 3346–3385):

eventually.go:105: Failed to get *v1beta1.NodePool: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
nodepool_upgrade_test.go:238: Failed to wait for NodePool e2e-clusters-6v5cp/node-pool-ptw72-test-replaceupgrade to have version 5.0.0-0.ci-2026-05-01-163528-test-ci-op-vvhyyw15-latest in 45m0s: context deadline exceeded
AllNodesHealthy=False: NodeProvisioning(1 of 2 machines are not healthy)
  Machine node-pool-ptw72-test-replaceupgrade-sb42h-w6zzg: NodeProvisioning: Waiting for a node with matching ProviderID to exist

Stack trace — EnsureNoCrashingPods (from build-log.txt line 3326):

util.go:817: Container kube-apiserver in pod kube-apiserver-5fb9d658bb-5s6dj has a restartCount > 0 (1)

PR #8303 changed files (verified via gh pr view 8303):

  • cmd/cluster/aws/destroy.go — AWS-only
  • cmd/infra/aws/destroy_iam.go — AWS-only
  • cmd/infra/aws/destroy_iam_test.go — AWS-only

JUnit results (junit_e2e.xml):

  • 7 total failures; 2 root failures + 5 parent test propagations
  • Root: TestNodePool/.../TestNodePoolReplaceUpgrade (3424.96s)
  • Root: TestAzureOAuthLoadBalancer/.../EnsureNoCrashingPods (0.12s)

CI's own failure analysis (failure-analysis.md artifact):

Failed CI step: e2e-azure-self-managed-hypershift-azure-run-e2e-self-managed — duration 1h23m55s


@sdminonne
Contributor Author

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 2, 2026
@openshift-ci-robot

@sdminonne: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 2, 2026

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 3df2163 into openshift:main May 2, 2026
39 checks passed
@openshift-ci-robot

@sdminonne: Jira Issue OCPBUGS-83868: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-83868 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Details

In response to this:

When destroying an AWS cluster, failures at any step in the cleanup chain would cause all subsequent steps to be skipped, leaking IAM resources (OIDC providers, IAM roles, instance profiles).

Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources, and DestroySharedVPCRoles to collect errors and attempt all cleanup steps instead of returning on the first failure. This ensures that e.g. an infrastructure destroy failure does not prevent IAM cleanup, and a single role deletion failure does not skip the remaining roles.

What this PR does / why we need it:

It fixes the most common IAM OIDC provider and role leaks.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-83868

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

  • Enhanced AWS infrastructure and IAM destruction: The cluster cleanup process now attempts all resource deletion steps and aggregates all errors before reporting, rather than stopping at the first failure. This ensures more complete cleanup attempts and provides comprehensive visibility into all issues encountered.

  • Tests

  • Updated error handling tests to verify aggregated error reporting during cluster destruction.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


Labels

  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cli — Indicates the PR includes changes for CLI
  • area/hypershift-operator — Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/aws — PR/issue for AWS (AWSPlatform) platform
  • area/testing — Indicates the PR includes changes for e2e testing
  • jira/severity-important — Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug — Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference — Indicates that this PR references a valid Jira ticket of any type.
  • lgtm — Indicates that a PR is ready to be merged.
  • verified — Signifies that the PR passed pre-merge verification criteria
