OCPBUGS-83868: prevent IAM resource leaks during cluster destroy #8303

openshift-merge-bot[bot] merged 2 commits into openshift:main
Conversation
@sdminonne: No Jira issue with key fOCPBUGS-83868 exists in the tracker at https://redhat.atlassian.net.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
📝 Walkthrough

The pull request refactors error handling in AWS infrastructure and IAM destruction operations from a fail-fast approach to a fail-all approach.

🚥 Pre-merge checks: ✅ 11 passed, ❌ 1 failed (1 warning).
@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

The bug has been updated to refer to the pull request using the external bug tracker.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/infra/aws/destroy_iam.go`:
- Around line 157-159: The call to DestroyOIDCRole("shared-role") currently
returns immediately on error which aborts cleanup for per-component roles;
change the flow so that when DestroyOIDCRole(ctx, iamClient, "shared-role")
returns an error you append that error to the errs slice (e.g., errs =
append(errs, err)) and do not return — allow execution to continue to the
subsequent per-component role cleanup logic; keep the existing use of the
removed boolean as-is and aggregate errs at the end (using
utilerrors.NewAggregate) rather than returning early.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 0177988a-625a-4e52-97e1-a4b4f9d2d615
📒 Files selected for processing (3)

- cmd/cluster/aws/destroy.go
- cmd/infra/aws/destroy_iam.go
- cmd/infra/aws/destroy_iam_test.go
Codecov Report: ❌ Patch coverage is

@@ Coverage Diff @@
##             main    #8303      +/-   ##
==========================================
+ Coverage   36.32%   36.35%   +0.03%
==========================================
  Files         764      764
  Lines       92886    92884       -2
==========================================
+ Hits        33737    33766      +29
+ Misses      56442    56421      -21
+ Partials     2707     2697      -10
/jira refresh
@sdminonne: This pull request references Jira Issue OCPBUGS-83868, which is valid. 3 validation(s) were run on this bug.
Force-push: 1558c31 to 549538c
Force-push: 549538c to 66b6a4e
🧹 Nitpick comments (1)
cmd/infra/aws/destroy_iam_test.go (1)
552-559: Enforce call order to prove "continue after ingress failure" behavior. Right now the expectations don't assert sequencing, so this can pass even if the control-plane role is checked first. Wrap these two calls in gomock.InOrder to lock in the intended flow.

Proposed test-hardening diff:

 setupVPCOwnerMock: func(m *awsapi.MockIAMAPI) {
-	m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(ingressRoleName)}, gomock.Any()).
-		Return(nil, errors.New("api error"))
-	m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(cpRoleName)}, gomock.Any()).
-		Return(nil, noSuchEntity())
+	gomock.InOrder(
+		m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(ingressRoleName)}, gomock.Any()).
+			Return(nil, errors.New("api error")),
+		m.EXPECT().GetRole(gomock.Any(), &iam.GetRoleInput{RoleName: aws.String(cpRoleName)}, gomock.Any()).
+			Return(nil, noSuchEntity()),
+	)
 },

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/aws/destroy_iam_test.go` around lines 552-559: the test currently sets two unordered expectations on MockIAMAPI GetRole for ingressRoleName and cpRoleName inside setupVPCOwnerMock; to assert the code actually checks ingress first and then control-plane, wrap those two m.EXPECT() calls in a gomock.InOrder(...) block so the GetRole call for ingressRoleName (returning api error) must happen before the GetRole call for cpRoleName (returning noSuchEntity()); update setupVPCOwnerMock in destroy_iam_test.go to place the two existing m.EXPECT() calls inside gomock.InOrder to enforce sequencing.
📒 Files selected for processing (4)

- cmd/cluster/aws/destroy.go
- cmd/infra/aws/destroy_iam.go
- cmd/infra/aws/destroy_iam_test.go
- test/e2e/nodepool_test.go
✅ Files skipped from review due to trivial changes (1)
- test/e2e/nodepool_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/infra/aws/destroy_iam.go
	"karpenter",
}
for _, name := range roleNames {
	if name == "openshift-ingress" {
Better to have a separate list, e.g. roleNamesWithRetry := []string{"openshift-ingress"}, and iterate over it. This way we can easily add retry to different roles later.
jparrill left a comment: Dropped some comments. Thanks!
 },
 {
-	name: "When destroying the ingress role fails it should return the error",
+	name: "When destroying the ingress role fails it should still attempt control-plane role and return the error",
Add test cases for multiple simultaneous failures in the role deletion loop to verify all errors are properly aggregated
if name == "openshift-ingress" {
	if err := o.DestroyOIDCRoleWithRetry(ctx, iamClient, name); err != nil {
		errs = append(errs, err)
	}
Maybe it makes sense to extract this from the loop and handle it separately, instead of evaluating the name in every loop cycle.
-err = o.DestroyWorkerInstanceProfile(ctx, iamClient)
-if err != nil {
-	return err
+if err = o.DestroyWorkerInstanceProfile(ctx, iamClient); err != nil {
Inside the DestroyWorkerInstanceProfile function, we're still returning early instead of using the error aggregation pattern. Does it make sense to move it to the same strategy, or keep the return-early approach?

Hmmm, unsure. I would say here we have to return immediately.
roleNames := []string{
	"openshift-ingress",
	"openshift-image-registry",
	"aws-ebs-csi-driver-controller",
	"cloud-controller",
	"node-pool",
	"control-plane-operator",
	"cloud-network-config-controller",
	"kms-provider",
	"karpenter",
}
I think it makes sense to have a shared list between creation and deletion to keep both in sync and avoid drift between the two processes. WDYT?

Oh, great idea, but this is more of a whole refactor, while the goal of this PR is to fix short-term bugs impacting our daily activities. Could it be a follow-up?
When destroying an AWS cluster, failures at any step in the cleanup chain would cause all subsequent steps to be skipped, leaking IAM resources (OIDC providers, IAM roles, instance profiles). Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources, and DestroySharedVPCRoles to collect errors and attempt all cleanup steps instead of returning on the first failure. This ensures that e.g. an infrastructure destroy failure does not prevent IAM cleanup, and a single role deletion failure does not skip the remaining roles. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix shared-role early return that still skipped per-component role cleanup on error, extract openshift-ingress retry handling out of the main role loop into a separate roleNamesWithRetry list, and add tests for multiple simultaneous deletion failures to verify proper error aggregation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-push: 5a0074b to 100b927
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/infra/aws/destroy_iam.go`:
- Around line 110-114: When creating the VPC-owner AWS session fails (the block
using VPCOwnerCredentialsOpts.AWSCredentialsFile and GetSession which currently
returns early with utilerrors.NewAggregate), do not return immediately; instead
append the session error into errs and continue so we still attempt shared-VPC
cleanup using the cluster-account iamClient when PrivateZonesInClusterAccount is
true. Update the code around vpcOwnerAWSSession/GetSession to collect the error
(add it to errs) and proceed, guarding any use of vpcOwnerAWSSession when nil,
and ensure the shared-VPC ingress deletion path (which uses iamClient and the
PrivateZonesInClusterAccount flag) runs even if GetSession failed.
📒 Files selected for processing (3)

- cmd/cluster/aws/destroy.go
- cmd/infra/aws/destroy_iam.go
- cmd/infra/aws/destroy_iam_test.go
 if o.VPCOwnerCredentialsOpts.AWSCredentialsFile != "" {
 	vpcOwnerAWSSession, err := o.VPCOwnerCredentialsOpts.GetSession(ctx, "cli-destroy-iam", nil, o.Region)
 	if err != nil {
-		return err
+		return utilerrors.NewAggregate(append(errs, err))
 	}
shared-vpc-ingress can still leak when VPC-owner session creation fails.
At Line 113, returning immediately skips any shared-VPC cleanup attempt. In PrivateZonesInClusterAccount=true, shared-vpc-ingress lives in the cluster account and can still be deleted with iamClient, but this path never attempts it.
Suggested fix
if o.VPCOwnerCredentialsOpts.AWSCredentialsFile != "" {
vpcOwnerAWSSession, err := o.VPCOwnerCredentialsOpts.GetSession(ctx, "cli-destroy-iam", nil, o.Region)
if err != nil {
- return utilerrors.NewAggregate(append(errs, err))
+ errs = append(errs, fmt.Errorf("failed to create VPC owner IAM session: %w", err))
+ // In private-zones mode, ingress role is in cluster account; still attempt cleanup.
+ if o.PrivateZonesInClusterAccount {
+ if _, ingressErr := o.DestroyOIDCRole(ctx, iamClient, "shared-vpc-ingress"); ingressErr != nil {
+ errs = append(errs, ingressErr)
+ }
+ }
+ return utilerrors.NewAggregate(errs)
}
vpcOwnerIAMClient := iam.NewFromConfig(*vpcOwnerAWSSession, func(o *iam.Options) {
o.Retryer = awsConfig()
})

🤖 Prompt for AI Agents
This is a valid observation, but I'm intentionally not addressing it in this PR. Here's why:

- When GetSession for the VPC owner fails, it indicates a fundamental credentials problem. Proceeding with partial cleanup (only the ingress role) while silently skipping the shared-vpc-control-plane role cleanup would leave things in an inconsistent state.
- The error aggregation pattern in DestroyIAM already ensures the VPC session error is reported to the caller. The outer Run method retries on retryable errors, so transient credential issues will be retried.
- The PrivateZonesInClusterAccount + VPC owner session failure is an edge case that would require refactoring DestroyIAM to make GetSession mockable for proper testing. That's out of scope for this bug fix PR.

If we see this scenario causing real leaks in production, we can address it in a follow-up.
@sdminonne, thanks for the detailed explanation — the reasoning is sound. You're right that attempting partial shared-VPC cleanup when the VPC-owner credentials are fundamentally broken could leave things in a worse, harder-to-reason-about state than a clean failure that gets retried. The existing retry loop in Run() is the appropriate safety net for transient credential issues.
Would you like me to open a follow-up GitHub issue to track the PrivateZonesInClusterAccount + VPC-owner session-failure edge case so it doesn't get lost?
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: muraee, sdminonne. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Scheduling tests matching the

/test e2e-aks

AI Test Failure Analysis. Job: generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6.

/test e2e-aks
/test e2e-azure-self-managed

AI Test Failure Analysis. Job: generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6.

This appears to be an infrastructure flake rather than a code regression.

/test e2e-azure-self-managed
Test Failure Analysis Complete

Job Information

Summary: Both failures are transient Azure infrastructure issues completely unrelated to PR #8303's code changes.

Failure 1 — TestNodePoolReplaceUpgrade (45-minute timeout): During a NodePool rolling replace-upgrade to version

Failure 2 — EnsureNoCrashingPods (kube-apiserver restart): A single kube-apiserver restart (

PR #8303 is not implicated. The PR modifies exactly three files, all in AWS-specific code paths (

Root Cause

Primary Cause — TestNodePoolReplaceUpgrade:

Primary Cause — EnsureNoCrashingPods:

PR #8303 Relevance: None. All three changed files are AWS IAM destroy logic. No Azure code paths, NodePool upgrade logic, or kube-apiserver configuration is modified.

Recommendations

Evidence:
- Stack trace — TestNodePoolReplaceUpgrade
- Stack trace — EnsureNoCrashingPods
- PR #8303 changed files
- JUnit results
- CI's own failure analysis
- Failed CI step
/verified by e2e
@sdminonne: This PR has been marked as verified by
@sdminonne: all tests passed! Full PR test history. Your PR dashboard.
@sdminonne: Jira Issue OCPBUGS-83868: Some pull requests linked via external trackers have merged. The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with

Jira Issue OCPBUGS-83868 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.
When destroying an AWS cluster, failures at any step in the cleanup chain would cause all subsequent steps to be skipped, leaking IAM resources (OIDC providers, IAM roles, instance profiles).
Change destroyPlatformSpecifics, DestroyIAM, DestroyOIDCResources, and DestroySharedVPCRoles to collect errors and attempt all cleanup steps instead of returning on the first failure. This ensures that e.g. an infrastructure destroy failure does not prevent IAM cleanup, and a single role deletion failure does not skip the remaining roles.
What this PR does / why we need it:
It fixes the most common IAM OIDC provider and role leaks.
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/OCPBUGS-83868
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit
Bug Fixes
Tests