OCPBUGS-84251: fix(azure): detect and replace stale role assignments on cluster re-creation#8322
Conversation
Pipeline controller notification — For optional jobs, comment … This repository is configured in LGTM mode.
Skipping CI for Draft Pull Request.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the … setting.
📝 Walkthrough

Adds a …

Sequence Diagram

```mermaid
sequenceDiagram
    participant DestroyCmd as Destroy IAM Command
    participant RBACMgr as RBAC Manager
    participant AzureAPI as Azure API
    participant IdentityMgr as Identity Manager

    DestroyCmd->>RBACMgr: CleanupRoleAssignments(ctx, infraID, rgMain, nsgRG, vnetRG, dnsZoneRG, ...)
    RBACMgr->>AzureAPI: List role assignments for deterministic names
    AzureAPI-->>RBACMgr: Paged assignment lists
    loop for each deterministic assignment name
        RBACMgr->>AzureAPI: Get role assignment by deterministic name
        AzureAPI-->>RBACMgr: Assignment (with PrincipalID) or 404
        alt assignment exists and PrincipalID != expected
            RBACMgr->>AzureAPI: Delete role assignment (ignore 404)
            AzureAPI-->>RBACMgr: Delete result (ok or error)
        else assignment matches expected or 404
            RBACMgr-->>RBACMgr: Skip or treat as no-op
        end
    end
    RBACMgr-->>DestroyCmd: Return aggregated delete errors (if any)
    DestroyCmd->>IdentityMgr: Proceed to delete identities/federated credentials
    IdentityMgr-->>DestroyCmd: Deletion results
```
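The diagrammed cleanup loop can be sketched against a minimal client interface. This is an illustrative sketch, not the PR's actual code: `assignmentClient`, `cleanupStale`, and the fake client are hypothetical names, standing in for the extracted `roleAssignmentClient` interface the PR describes.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// roleAssignment mirrors the one field the cleanup decision needs.
type roleAssignment struct {
	PrincipalID string
}

// assignmentClient is a hypothetical stand-in for the Azure role assignments client.
type assignmentClient interface {
	Get(scope, name string) (*roleAssignment, error)
	Delete(scope, name string) error
}

var errNotFound = errors.New("404 not found")

// cleanupStale walks deterministic assignment names, deletes any whose
// principal no longer matches, treats 404 as a no-op, and aggregates delete
// errors instead of aborting on the first failure.
func cleanupStale(c assignmentClient, scope, expectedPrincipal string, names []string) error {
	var errs []error
	for _, name := range names {
		a, err := c.Get(scope, name)
		if errors.Is(err, errNotFound) {
			continue // already gone: no-op
		}
		if err != nil {
			errs = append(errs, err)
			continue
		}
		if strings.EqualFold(a.PrincipalID, expectedPrincipal) {
			continue // matches the expected principal: skip
		}
		if err := c.Delete(scope, name); err != nil && !errors.Is(err, errNotFound) {
			errs = append(errs, fmt.Errorf("delete %s: %w", name, err))
		}
	}
	return errors.Join(errs...)
}

// fakeClient lets the sketch run without Azure.
type fakeClient struct{ store map[string]string }

func (f *fakeClient) Get(scope, name string) (*roleAssignment, error) {
	p, ok := f.store[name]
	if !ok {
		return nil, errNotFound
	}
	return &roleAssignment{PrincipalID: p}, nil
}

func (f *fakeClient) Delete(scope, name string) error {
	if _, ok := f.store[name]; !ok {
		return errNotFound
	}
	delete(f.store, name)
	return nil
}

func main() {
	c := &fakeClient{store: map[string]string{"ra-a": "old-principal", "ra-b": "NEW-PRINCIPAL"}}
	// Stale ra-a is deleted, matching ra-b is kept, missing ra-c is a no-op.
	err := cleanupStale(c, "/subscriptions/s/resourceGroups/rg", "new-principal", []string{"ra-a", "ra-b", "ra-c"})
	fmt.Println(err, len(c.store))
}
```

The fake client is the same seam the PR's extracted interface provides for its behavior-driven tests.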
🚥 Pre-merge checks: ✅ 11 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bryan-cox. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
@bryan-cox: This pull request references Jira Issue OCPBUGS-84251, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.

Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/infra/azure/rbac.go (1)
Lines 228-263: ⚠️ Potential issue | 🟠 Major — The GET/create fallback can still report success for the wrong assignment.

Line 230 only validates `PrincipalID`, so if the deterministic assignment exists for the same principal but an outdated `RoleDefinitionID`, this path returns `nil` even though the LIST phase already decided the required role is missing. There is a second false-success path after stale deletion: if `Create` returns `RoleAssignmentExists`, the code also returns success without checking whether the deterministic name now points at the expected principal and role.

🛠️ Proposed fix
```diff
+	replacementAttempt := false
 	existing, err := client.Get(ctx, scope, roleAssignmentName, nil)
 	if err == nil {
-		if existing.Properties != nil && existing.Properties.PrincipalID != nil && strings.EqualFold(*existing.Properties.PrincipalID, assigneeID) {
+		if existing.Properties != nil &&
+			existing.Properties.PrincipalID != nil &&
+			existing.Properties.RoleDefinitionID != nil &&
+			strings.EqualFold(*existing.Properties.PrincipalID, assigneeID) &&
+			strings.EqualFold(*existing.Properties.RoleDefinitionID, roleDefinitionID) {
 			log.Log.Info("Skipping role assignment creation, role assignment already exists.", "role", role, "assigneeID", assigneeID, "scope", scope)
 			return nil
 		}
+		replacementAttempt = true
 		if _, err := client.Delete(ctx, scope, roleAssignmentName, nil); err != nil {
 			return fmt.Errorf("failed to delete stale role assignment: %w", err)
 		}
 	} else {
 		// existing error handling...
 	}
 	_, err = client.Create(ctx, scope, roleAssignmentName, roleAssignmentProperties, nil)
 	if err != nil {
 		var respErr *azcore.ResponseError
 		if errors.As(err, &respErr) && (respErr.StatusCode == http.StatusConflict || strings.EqualFold(respErr.ErrorCode, "RoleAssignmentExists")) {
-			log.Log.Info("Failed role assignment creation, role assignment already exists.", "role", role, "assigneeID", assigneeID, "scope", scope)
-			return nil
+			current, getErr := client.Get(ctx, scope, roleAssignmentName, nil)
+			if getErr == nil &&
+				current.Properties != nil &&
+				current.Properties.PrincipalID != nil &&
+				current.Properties.RoleDefinitionID != nil &&
+				strings.EqualFold(*current.Properties.PrincipalID, assigneeID) &&
+				strings.EqualFold(*current.Properties.RoleDefinitionID, roleDefinitionID) {
+				return nil
+			}
+			if replacementAttempt {
+				return fmt.Errorf("role assignment still does not match expected principal/role after replacement: %w", err)
+			}
+			return nil
 		}
 		return fmt.Errorf("failed to create role assignment: %w", err)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/azure/rbac.go` around lines 228 - 263, The code only checks existing.Properties.PrincipalID when deciding an existing assignment is acceptable and likewise treats a Create conflict as success without validating the RoleDefinitionID; update both the pre-create check and the post-Create conflict handling to also compare the existing.Properties.RoleDefinitionID against the desired role (from roleAssignmentProperties.RoleDefinitionID or the variable representing the desired RoleDefinitionID). If they differ, treat the assignment as stale (log stale role, delete or return an error so a fresh assignment can be created) rather than returning success; use client.Get and existing.Properties.RoleDefinitionID and roleAssignmentProperties.RoleDefinitionID (and roleAssignmentName) to locate and validate the assignment in both code paths.
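The double check this comment asks for — principal and role definition must both match, case-insensitively — boils down to a pure predicate. A minimal sketch (the type and function names here are hypothetical, not the PR's code):

```go
package main

import (
	"fmt"
	"strings"
)

// properties mirrors the two pointer fields of the SDK's role assignment
// properties that the review says must both be validated.
type properties struct {
	PrincipalID      *string
	RoleDefinitionID *string
}

// matches reports whether an existing assignment is exactly the one we want:
// non-nil properties and a case-insensitive match on BOTH the principal and
// the role definition. Checking only PrincipalID is the false-success bug.
func matches(p *properties, wantPrincipal, wantRoleDef string) bool {
	return p != nil &&
		p.PrincipalID != nil &&
		p.RoleDefinitionID != nil &&
		strings.EqualFold(*p.PrincipalID, wantPrincipal) &&
		strings.EqualFold(*p.RoleDefinitionID, wantRoleDef)
}

func main() {
	principal, oldRole := "abc-123", "/roles/reader"
	stale := &properties{PrincipalID: &principal, RoleDefinitionID: &oldRole}
	// Same principal, outdated role: must NOT count as already-assigned.
	fmt.Println(matches(stale, "ABC-123", "/roles/contributor")) // false
	fmt.Println(matches(stale, "ABC-123", "/ROLES/READER"))      // true
}
```

The same predicate serves both the pre-create GET check and the post-Create conflict re-check, which is why the proposed fix repeats it in both paths.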
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/infra/azure/destroy_iam.go`:
- Around line 116-118: The RG names for NSG/VNet in destroy_iam.go are being
built as o.Name + "-nsg-" + o.InfraID and o.Name + "-vnet-" + o.InfraID which
does not match the create path; update the nsgRG and vnetRG variables to use the
same resource-group names as create.go (o.Name + "-nsg" and o.Name + "-vnet") so
CleanupRoleAssignments computes the same scopes/deterministic assignment names;
ensure you update the nsgRG and vnetRG assignments in the same function where
CleanupRoleAssignments is invoked to keep naming consistent with networking.go
and create.go.
- Around line 121-123: The call to rbacManager.CleanupRoleAssignments (in
destroy_iam.go) currently logs errors and continues, but since
CleanupRoleAssignments already treats not-found as success any non-nil error
indicates failure to remove assignments and we must abort; change the error
handling at the CleanupRoleAssignments call so that on error you return (or
propagate) an error from the enclosing function instead of just logging (e.g.,
wrap with context like "failed to clean up role assignments" and return it),
preventing subsequent identity deletion code from running when
CleanupRoleAssignments fails.
---
Outside diff comments:
In `@cmd/infra/azure/rbac.go`:
- Around line 228-263: The code only checks existing.Properties.PrincipalID when
deciding an existing assignment is acceptable and likewise treats a Create
conflict as success without validating the RoleDefinitionID; update both the
pre-create check and the post-Create conflict handling to also compare the
existing.Properties.RoleDefinitionID against the desired role (from
roleAssignmentProperties.RoleDefinitionID or the variable representing the
desired RoleDefinitionID). If they differ, treat the assignment as stale (log
stale role, delete or return an error so a fresh assignment can be created)
rather than returning success; use client.Get and
existing.Properties.RoleDefinitionID and
roleAssignmentProperties.RoleDefinitionID (and roleAssignmentName) to locate and
validate the assignment in both code paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a075084a-ffa3-4cd4-a134-a90bde905c4b
📒 Files selected for processing (4)
- cmd/infra/azure/destroy_iam.go
- cmd/infra/azure/rbac.go
- cmd/infra/azure/rbac_test.go
- cmd/infra/azure/types.go
Force-pushed: 9946c42 to bfa986f
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/cluster/azure/destroy.go`:
- Around line 162-164: NewRBACManager and RBACManager currently ignore
non-public Azure clouds causing role assignment clients to use default public
endpoints; update RBACManager (azureinfra.NewRBACManager and the RBACManager
constructor) to accept a cloud configuration or arm.ClientOptions (use
azureutil.GetAzureCloudConfiguration and azureutil.NewARMClientOptions pattern)
and pass those options when creating role assignment clients inside rbac.go so
clients are constructed with the correct cloud endpoints; then update callers
such as the call in destroy.go (where azureinfra.NewRBACManager is invoked) to
supply the cloud configuration/ClientOptions so CleanupRoleAssignments runs
against the configured cloud.
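The shape of the suggested fix — thread client options from the caller down to client construction instead of hard-coding `nil` — can be sketched without the Azure SDK. All types and names below are simplified stand-ins (`cloudConfig` for the SDK's cloud configuration, `clientOptions` for `arm.ClientOptions`), not the repository's actual API:

```go
package main

import "fmt"

// cloudConfig stands in for the SDK's cloud configuration; it carries the
// ARM endpoint that differs between public, Gov, and China clouds.
type cloudConfig struct{ armEndpoint string }

// clientOptions stands in for arm.ClientOptions.
type clientOptions struct{ cloud cloudConfig }

// rbacManager stores options instead of assuming the public cloud.
type rbacManager struct {
	subscriptionID string
	opts           *clientOptions
}

// newRBACManager accepts options from the caller, mirroring the suggested
// NewARMClientOptions/GetAzureCloudConfiguration pattern.
func newRBACManager(subscriptionID string, opts *clientOptions) *rbacManager {
	return &rbacManager{subscriptionID: subscriptionID, opts: opts}
}

// endpoint is what client construction would consume; previously the
// equivalent of opts was always nil, silently defaulting to public cloud.
func (r *rbacManager) endpoint() string {
	if r.opts == nil {
		return "https://management.azure.com" // public-cloud default
	}
	return r.opts.cloud.armEndpoint
}

func main() {
	gov := &clientOptions{cloud: cloudConfig{armEndpoint: "https://management.usgovcloudapi.net"}}
	fmt.Println(newRBACManager("sub-id", nil).endpoint())
	fmt.Println(newRBACManager("sub-id", gov).endpoint())
}
```

The point of the pattern is that the destroy command, which already knows the configured cloud, is the only place that has to decide; every client constructed inside rbac.go just consumes the options it was handed.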
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4d803446-12cd-4c51-a6ed-fc69b3491575
📒 Files selected for processing (3)
- cmd/cluster/azure/destroy.go
- cmd/cluster/core/destroy.go
- product-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (1)
- cmd/cluster/core/destroy.go
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #8322      +/-   ##
==========================================
+ Coverage   36.30%   36.40%   +0.10%
==========================================
  Files         764      764
  Lines       93015    93133     +118
==========================================
+ Hits        33772    33909     +137
+ Misses      56530    56511      -19
  Partials     2713     2713
```

Flags with carried forward coverage won't be shown. Click here to find out more.
/cc @enxebre for review
@bryan-cox: GitHub didn't allow me to request PR reviews from the following users: for, review. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/infra/azure/rbac.go`:
- Around line 234-241: The log call and ptr.Deref access can panic if
existing.Properties is nil; before logging/deleting in the role assignment
cleanup (around existing, existing.Properties, Properties.PrincipalID,
log.Log.Info, ptr.Deref, client.Delete, roleAssignmentName, scope) add a nil
guard: check if existing != nil and existing.Properties != nil and only then
dereference PrincipalID for the "stalePrincipal" field; if Properties is nil,
log a safe placeholder (e.g., "<nil-properties>") or omit that field, and still
proceed to call client.Delete as before, returning the same wrapped error on
failure.
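The nil guard the comment asks for is a standard nil-safe dereference. The real code uses `ptr.Deref` from k8s.io/utils; this sketch inlines an equivalent, with hypothetical type names:

```go
package main

import "fmt"

type properties struct{ PrincipalID *string }
type assignment struct{ Properties *properties }

// deref mimics ptr.Deref: return *p, or def when p is nil.
func deref(p *string, def string) string {
	if p == nil {
		return def
	}
	return *p
}

// stalePrincipal extracts the principal for the "stalePrincipal" log field
// without panicking when the assignment or its Properties block is nil.
func stalePrincipal(a *assignment) string {
	if a == nil || a.Properties == nil {
		return "<nil-properties>" // safe placeholder per the review comment
	}
	return deref(a.Properties.PrincipalID, "<nil-principal>")
}

func main() {
	id := "abc-123"
	fmt.Println(stalePrincipal(&assignment{Properties: &properties{PrincipalID: &id}}))
	fmt.Println(stalePrincipal(&assignment{})) // Properties nil: no panic
	fmt.Println(stalePrincipal(nil))
}
```

Only the log field changes; the delete call still proceeds regardless, as the comment specifies.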
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 183f1d05-eaf7-4185-b2bf-150bbe24578b
📒 Files selected for processing (7)
- cmd/cluster/azure/destroy.go
- cmd/cluster/core/destroy.go
- cmd/infra/azure/destroy_iam.go
- cmd/infra/azure/rbac.go
- cmd/infra/azure/rbac_test.go
- cmd/infra/azure/types.go
- product-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (2)
- cmd/infra/azure/types.go
- cmd/infra/azure/destroy_iam.go
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/cluster/core/destroy.go
- cmd/cluster/azure/destroy.go
- cmd/infra/azure/rbac_test.go
/uncc @enxebre
Force-pushed: bfa986f to 95f2829
🧹 Nitpick comments (1)
cmd/cluster/azure/destroy.go (1)
Lines 152-168: Consider extracting credential setup to avoid duplication. Azure credential setup (lines 154-157) is similar to lines 123-126 in `DestroyCluster`. While they're in different code paths, consider extracting to a helper or restructuring to share the credentials when `o.AzurePlatform.ResourceGroupName` is provided.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/azure/destroy.go` around lines 152 - 168, The Azure credential setup (call to util.SetupAzureCredentials used here) is duplicated with the similar block in DestroyCluster; extract this into a shared helper (e.g., NewAzureCredentials or ensureCredentials) and use it from both places so credentials are reused when o.AzurePlatform.ResourceGroupName is present; update callers (this code and DestroyCluster) to call the new helper and return the same subscriptionID and azureCreds to pass into azureinfra.NewRBACManager and rbacManager.CleanupRoleAssignments, preserving existing error handling and logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cmd/cluster/azure/destroy.go`:
- Around line 152-168: The Azure credential setup (call to
util.SetupAzureCredentials used here) is duplicated with the similar block in
DestroyCluster; extract this into a shared helper (e.g., NewAzureCredentials or
ensureCredentials) and use it from both places so credentials are reused when
o.AzurePlatform.ResourceGroupName is present; update callers (this code and
DestroyCluster) to call the new helper and return the same subscriptionID and
azureCreds to pass into azureinfra.NewRBACManager and
rbacManager.CleanupRoleAssignments, preserving existing error handling and
logging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b5b25127-7957-4e7f-9ab4-660173c3eb20
📒 Files selected for processing (8)
cmd/cluster/azure/destroy.gocmd/cluster/core/destroy.gocmd/infra/azure/destroy_iam.gocmd/infra/azure/rbac.gocmd/infra/azure/rbac_test.gocmd/infra/azure/types.gocmd/util/azure_flag_descriptions.goproduct-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (1)
- cmd/cluster/core/destroy.go
🚧 Files skipped from review as they are similar to previous changes (2)
- product-cli/cmd/cluster/azure/destroy.go
- cmd/infra/azure/destroy_iam.go
Force-pushed: 21002bf to 0d34916
Actionable comments posted: 1
♻️ Duplicate comments (5)
cmd/cluster/azure/destroy.go (2)
Lines 152-160: ⚠️ Potential issue | 🟠 Major — Use the same NSG/VNet resource-group names as the create path.

The cleanup scopes need to match the original assignment scopes exactly. Adding `InfraID` here changes the deterministic assignment name, so the real NSG/VNet-scoped assignments will be missed.

🛠️ Proposed fix

```diff
-	// The resource group names for NSG and VNet follow the convention: {name}-nsg-{infraID} and {name}-vnet-{infraID}.
-	nsgRG := o.Name + "-nsg-" + o.InfraID
-	vnetRG := o.Name + "-vnet-" + o.InfraID
+	// Match the create path resource-group names so deterministic role assignment names line up.
+	nsgRG := o.Name + "-nsg"
+	vnetRG := o.Name + "-vnet"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/azure/destroy.go` around lines 152 - 160, The cleanup computes NSG/VNet resource-group names with an extra "-{InfraID}" suffix so the role-assignment lookup misses the original scopes; change the variables nsgRG and vnetRG in destroy.go to use the exact same names used in the create path (use o.Name + "-nsg" and o.Name + "-vnet" respectively) so the cleanup scopes for the role assignments exactly match the original assignment scopes.
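The mismatch described here is purely a string-construction difference between the create and destroy paths. A minimal, self-contained illustration (function names are hypothetical):

```go
package main

import "fmt"

// createPathRGs builds the resource-group names the way the review says
// create.go does: no InfraID suffix.
func createPathRGs(name string) (nsg, vnet string) {
	return name + "-nsg", name + "-vnet"
}

// buggyDestroyPathRGs shows the flagged destroy-path construction: the
// extra "-<infraID>" suffix makes the cleanup scopes miss the real groups.
func buggyDestroyPathRGs(name, infraID string) (nsg, vnet string) {
	return name + "-nsg-" + infraID, name + "-vnet-" + infraID
}

func main() {
	nsgCreate, _ := createPathRGs("mycluster")
	nsgDestroy, _ := buggyDestroyPathRGs("mycluster", "abc12")
	// Different scope strings mean different deterministic assignment names,
	// so cleanup deletes nothing at the real NSG scope.
	fmt.Println(nsgCreate, nsgDestroy, nsgCreate == nsgDestroy)
}
```

Because the deterministic assignment name is derived from the scope, any divergence in the resource-group string silently turns cleanup into a no-op for that scope.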
Lines 166-168: ⚠️ Potential issue | 🟠 Major — Abort destroy when RBAC cleanup returns a real error.

`CleanupRoleAssignments` already treats not-found as success. Any error here means some assignments were definitely left behind, so continuing into infrastructure deletion defeats the stale-assignment fix.

🛠️ Proposed fix

```diff
-	if err := rbacManager.CleanupRoleAssignments(ctx, o.Log, o.InfraID, o.AzurePlatform.ResourceGroupName, nsgRG, vnetRG, o.AzurePlatform.DNSZoneRGName, false); err != nil {
-		o.Log.Error(err, "Failed to clean up some role assignments, continuing with infrastructure deletion")
-	}
+	if err := rbacManager.CleanupRoleAssignments(ctx, o.Log, o.InfraID, o.AzurePlatform.ResourceGroupName, nsgRG, vnetRG, o.AzurePlatform.DNSZoneRGName, false); err != nil {
+		return fmt.Errorf("failed to clean up role assignments before destroying infrastructure: %w", err)
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/azure/destroy.go` around lines 166 - 168, The call to rbacManager.CleanupRoleAssignments currently logs errors and continues, which lets the destroy proceed despite a real RBAC failure; change this so any non-nil error from CleanupRoleAssignments causes the destroy to abort by returning or propagating the error instead of only logging it. Locate the invocation of CleanupRoleAssignments (the call using rbacManager, o.Log, o.InfraID, o.AzurePlatform.ResourceGroupName, nsgRG, vnetRG, o.AzurePlatform.DNSZoneRGName) and replace the log-only branch with an early return of the error (or wrap and return it) so infrastructure deletion does not proceed when CleanupRoleAssignments fails.

cmd/infra/azure/rbac.go (1)
Lines 112-115: ⚠️ Potential issue | 🟠 Major — Pass cloud-aware ARM client options into `RoleAssignmentsClient`.

These constructors still use `nil` options, so RBAC assign/cleanup will default to public-cloud endpoints. That breaks Gov/China tenants even though the destroy paths already know the configured Azure cloud.

```bash
#!/bin/bash
set -euo pipefail

echo "== RBAC manager definition =="
sed -n '40,62p' cmd/infra/azure/rbac.go
echo
echo "== Role assignment client construction =="
rg -n -C2 'NewRoleAssignmentsClient\(' cmd/infra/azure/rbac.go
echo
echo "== Cloud-aware Azure client setup elsewhere =="
rg -n -C3 'GetAzureCloudConfiguration|NewARMClientOptions|ClientOptions: azcore.ClientOptions' \
  cmd/cluster/azure/destroy.go cmd/infra/azure/destroy_iam.go support/azureutil/azureutil.go
```

Also applies to: 145-148, 279-283
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/azure/rbac.go` around lines 112 - 115, The RoleAssignmentsClient constructors (calls to azureauth.NewRoleAssignmentsClient used to create raClient with r.subscriptionID and r.creds) are passing nil for options and thus defaulting to public-cloud endpoints; update each NewRoleAssignmentsClient invocation (e.g., the call that creates raClient and the other occurrences around the indicated blocks) to supply cloud-aware azcore.ClientOptions/ARM client options obtained from your existing helper (the same options pattern used elsewhere like NewARMClientOptions/GetAzureCloudConfiguration) so the client is created with the configured Azure cloud endpoints rather than nil.

cmd/infra/azure/destroy_iam.go (2)
Lines 115-118: ⚠️ Potential issue | 🟠 Major — Use the same NSG/VNet resource-group names as the create path.

These values feed directly into the deterministic role-assignment name regeneration. With the extra `InfraID` suffix, cleanup targets different scopes and leaves the actual NSG/VNet assignments behind.

🛠️ Proposed fix

```diff
-	// The resource group names for NSG and VNet follow the convention: {name}-nsg-{infraID} and {name}-vnet-{infraID}.
-	nsgRG := o.Name + "-nsg-" + o.InfraID
-	vnetRG := o.Name + "-vnet-" + o.InfraID
+	// Match the create path resource-group names so deterministic role assignment names line up.
+	nsgRG := o.Name + "-nsg"
+	vnetRG := o.Name + "-vnet"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/azure/destroy_iam.go` around lines 115 - 118, The NSG/VNet resource-group names (variables nsgRG and vnetRG) include an extra InfraID suffix causing cleanup to target the wrong groups; update the assignment of nsgRG and vnetRG to match the create path by removing the InfraID suffix (use o.Name + "-nsg" and o.Name + "-vnet" rather than o.Name + "-nsg-" + o.InfraID / o.Name + "-vnet-" + o.InfraID) so the deterministic role-assignment name regeneration cleans the same scopes created earlier.
Lines 124-126: ⚠️ Potential issue | 🟠 Major — Abort IAM deletion when RBAC cleanup fails.
A non-nil return here already means the helper saw a real delete failure, not just a missing assignment. Continuing into identity deletion can leave the exact orphaned assignments this PR is meant to prevent.
🛠️ Proposed fix
```diff
-	if err := rbacManager.CleanupRoleAssignments(ctx, l, o.InfraID, o.ResourceGroupName, nsgRG, vnetRG, o.DNSZoneRG, false); err != nil {
-		l.Error(err, "Failed to clean up some role assignments, continuing with identity deletion")
-	}
+	if err := rbacManager.CleanupRoleAssignments(ctx, l, o.InfraID, o.ResourceGroupName, nsgRG, vnetRG, o.DNSZoneRG, false); err != nil {
+		return fmt.Errorf("failed to clean up role assignments before deleting identities: %w", err)
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/azure/destroy_iam.go` around lines 124 - 126, The call to rbacManager.CleanupRoleAssignments in destroy_iam.go currently logs the error and continues, which can leave orphaned RBAC assignments; update the error handling in the block that calls rbacManager.CleanupRoleAssignments(ctx, l, o.InfraID, o.ResourceGroupName, nsgRG, vnetRG, o.DNSZoneRG, false) to abort the IAM deletion flow on non-nil error by returning the error (or a wrapped error) instead of merely logging via l.Error, so the subsequent identity deletion steps are not executed when cleanup fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/infra/azure/rbac.go`:
- Around line 227-231: The GET branch that skips creating a role assignment only
checks existing.Properties.PrincipalID and can incorrectly reuse a deterministic
assignment whose RoleDefinitionID has changed; update the conditional in the
client.Get handling (where roleAssignmentName and existing are used) to also
validate existing.Properties.RoleDefinitionID (or the exact RoleDefinitionID
field on existing.Properties) matches the expected role definition ID before
returning, and only skip creation when both PrincipalID and RoleDefinitionID are
equal to the desired values so the code will recreate the assignment when the
role definition differs.
---
Duplicate comments:
In `@cmd/cluster/azure/destroy.go`:
- Around line 152-160: The cleanup computes NSG/VNet resource-group names with
an extra "-{InfraID}" suffix so the role-assignment lookup misses the original
scopes; change the variables nsgRG and vnetRG in destroy.go to use the exact
same names used in the create path (use o.Name + "-nsg" and o.Name + "-vnet"
respectively) so the cleanup scopes for the role assignments exactly match the
original assignment scopes.
- Around line 166-168: The call to rbacManager.CleanupRoleAssignments currently
logs errors and continues, which lets a real RBAC failure let the destroy
proceed; change this so any non-nil error from CleanupRoleAssignments causes the
destroy to abort by returning or propagating the error instead of only logging
it. Locate the invocation of CleanupRoleAssignments (the call using rbacManager,
o.Log, o.InfraID, o.AzurePlatform.ResourceGroupName, nsgRG, vnetRG,
o.AzurePlatform.DNSZoneRGName) and replace the log-only branch with an early
return of the error (or wrap and return it) so infrastructure deletion does not
proceed when CleanupRoleAssignments fails.
In `@cmd/infra/azure/destroy_iam.go`:
- Around line 115-118: The NSG/VNet resource-group names (variables nsgRG and
vnetRG) include an extra InfraID suffix causing cleanup to target the wrong
groups; update the assignment of nsgRG and vnetRG to match the create path by
removing the InfraID suffix (use o.Name + "-nsg" and o.Name + "-vnet" rather
than o.Name + "-nsg-" + o.InfraID / o.Name + "-vnet-" + o.InfraID) so the
deterministic role-assignment name regeneration cleans the same scopes created
earlier.
- Around line 124-126: The call to rbacManager.CleanupRoleAssignments in
destroy_iam.go currently logs the error and continues, which can leave orphaned
RBAC assignments; update the error handling in the block that calls
rbacManager.CleanupRoleAssignments(ctx, l, o.InfraID, o.ResourceGroupName,
nsgRG, vnetRG, o.DNSZoneRG, false) to abort the IAM deletion flow on non-nil
error by returning the error (or a wrapped error) instead of merely logging via
l.Error, so the subsequent identity deletion steps are not executed when cleanup
fails.
In `@cmd/infra/azure/rbac.go`:
- Around line 112-115: The RoleAssignmentsClient constructors (calls to
azureauth.NewRoleAssignmentsClient used to create raClient with r.subscriptionID
and r.creds) are passing nil for options and thus defaulting to public-cloud
endpoints; update each NewRoleAssignmentsClient invocation (e.g., the call that
creates raClient and the other occurrences around the indicated blocks) to
supply cloud-aware azcore.ClientOptions/ARM client options obtained from your
existing helper (the same options pattern used elsewhere like
NewARMClientOptions/GetAzureCloudConfiguration) so the client is created with
the configured Azure cloud endpoints rather than nil.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: fe84b027-f1c0-4418-8f12-8fa62f432580
📒 Files selected for processing (8)
- cmd/cluster/azure/destroy.go
- cmd/cluster/core/destroy.go
- cmd/infra/azure/destroy_iam.go
- cmd/infra/azure/rbac.go
- cmd/infra/azure/rbac_test.go
- cmd/infra/azure/types.go
- cmd/util/azure_flag_descriptions.go
- product-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (3)
- cmd/infra/azure/types.go
- cmd/util/azure_flag_descriptions.go
- cmd/cluster/core/destroy.go
🚧 Files skipped from review as they are similar to previous changes (1)
- product-cli/cmd/cluster/azure/destroy.go
Force-pushed: 0d34916 to a70a36a
AI Test Failure Analysis

Job: …

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
/retest
…re-creation

- Extract roleAssignmentClient interface from Azure SDK for testability
- Add LIST-based existence check with atScope() filter before GET fallback
- Verify both PrincipalID and RoleDefinitionID before skipping creation
- Delete stale assignments (mismatched principal) and create fresh ones
- Add CleanupRoleAssignments to both destroy cluster and destroy iam paths
- Make --dns-zone-rg-name required on all destroy commands
- Add comprehensive behavior-driven tests for assignRole, deleteRoleAssignmentByName, and cleanupRoleAssignments

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Update destroy command examples in self-managed Azure cluster, IAM separate workflow, and private cluster docs to include the now-required --dns-zone-rg-name flag. Also update the destroy iam command reference table with all required flags.

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Force-pushed: 3819499 to 7443553
@bryan-cox: This PR has been marked as verified by …

Details: In response to this:
/lgtm |
Scheduling tests matching the …
The e2e-aws test step never started. The failure occurred during the infrastructure setup phase (importing the …).

Test Failure Analysis Complete

Job Information
Test Failure Analysis

Error

Summary: Both …

Root Cause: CI infrastructure pod-name collision in a shared namespace. The ci-operator deterministically maps a PR's source commit to a namespace name. When multiple jobs for the same PR and commit share configuration inputs (same org/repo/branch/commit SHA), they can be assigned the same namespace (…). However, the …
The step graph manifests confirm this: the pod …

This is a known transient CI infrastructure issue with ci-operator's namespace sharing mechanism. It is not caused by any code change in PR #8322 (which modifies Azure role assignment logic, completely unrelated to release image importing).

Recommendations
Evidence
/retest
/override ci/prow/e2e-kubevirt-aws-ovn-reduced

This only touches Azure CLI and docs.
@bryan-cox: Overrode contexts on behalf of bryan-cox: ci/prow/e2e-kubevirt-aws-ovn-reduced

Details: In response to this:
@bryan-cox: all tests passed! Full PR test history. Your PR dashboard.
@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-84251 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details: In response to this:
/jira backport release-4.22
@bryan-cox: The following backport issues have been created:
Queuing cherrypicks to the requested branches to be created after this PR merges.

Details: In response to this:
@openshift-ci-robot: new pull request created: #8359

Details: In response to this:
openshift/hypershift#8322 made --dns-zone-rg-name a required flag on `hypershift destroy cluster azure`. Two destroy chains were not passing it, causing every Azure HyperShift job to fail during cleanup and leak Azure clusters. hypershift-destroy-nested-management-cluster: used by e2e-azure-self-managed jobs. The create step already passes --dns-zone-rg-name=os4-common; adds the same to the destroy step. hypershift-azure-destroy: used by AKS conformance and 12+ cucushift Azure HyperShift workflows. Adds DNS_ZONE_RG_NAME env var (default os4-common) matching the create chain, and passes it to the destroy command. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add --dns-zone-rg-name to Azure self-managed destroy step openshift/hypershift#8322 made --dns-zone-rg-name a required flag on `hypershift destroy cluster azure`. The destroy-management-cluster step in the self-managed workflow was not passing it, causing every e2e-azure-self-managed run to fail during cleanup and leak Azure clusters. The create step already passes --dns-zone-rg-name=os4-common; this adds the same flag to the destroy step. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add --dns-zone-rg-name to Azure destroy steps openshift/hypershift#8322 made --dns-zone-rg-name a required flag on `hypershift destroy cluster azure`. Two destroy chains were not passing it, causing every Azure HyperShift job to fail during cleanup and leak Azure clusters. hypershift-destroy-nested-management-cluster: used by e2e-azure-self-managed jobs. The create step already passes --dns-zone-rg-name=os4-common; adds the same to the destroy step. hypershift-azure-destroy: used by AKS conformance and 12+ cucushift Azure HyperShift workflows. Adds DNS_ZONE_RG_NAME env var (default os4-common) matching the create chain, and passes it to the destroy command. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
What this PR does / why we need it:
When a self-managed Azure cluster is destroyed and re-created with the same infraID, role assignments from the previous cluster persist as orphans because none of `destroy iam`, `destroy infra`, or `destroy cluster` cleans them up. The deterministic UUID naming (based on infraID + component + scope) causes `assignRole` to find the stale assignment by name and skip creation, even though it points to a deleted managed identity. This results in 403 AuthorizationFailed errors for components like the ingress operator that need RBAC roles on external resource groups (e.g., the DNS zone resource group).
Root cause
Two bugs working together:

1. `assignRole`'s GET check doesn't verify the principal: the deterministic role assignment name (`uuid(infraID+component+scope)`) matches a stale/orphaned assignment from a previous cluster. The GET check finds it by name and skips creation without verifying that its `principalID` matches the new identity.
2. No destroy path cleans up role assignments: when you destroy a cluster, the managed identities are deleted but their role assignments persist as orphans in Azure RBAC.
The failure sequence
1. `create infra --infra-id=XXX` → creates role assignment `uuid(XXX-ingress-scope)` pointing to identity A
2. `destroy infra` / `destroy iam` / `destroy cluster` → deletes identity A but leaves the orphaned role assignment
3. `create iam` (new attempt, same name) → creates a NEW identity B (different objectID)
4. `create infra --infra-id=XXX --assign-identity-roles` → GET `uuid(XXX-ingress-scope)` → FINDS the orphan → "already exists" → SKIPS creation, even though the orphan's `principalID != B` (it still points at the deleted identity A)

Fix
`assignRole` now verifies the `principalID` when GET finds an existing assignment. If mismatched (a stale orphan), it deletes the stale assignment and creates a new one.
`destroy iam` now calls `CleanupRoleAssignments` before destroying managed identities, preventing orphaned role assignments from accumulating.
`destroy cluster azure` now calls `CleanupRoleAssignments` before destroying infrastructure, so both destroy paths clean up role assignments.
Extracted a `roleAssignmentClient` interface from the Azure SDK concrete client to enable unit testing of the stale detection logic.
Verified with the `az` CLI: created a test identity, assigned a role with a deterministic UUID name, deleted the identity (leaving the role assignment orphaned), and created a new identity. Confirmed the GET returns the stale `principalId`, which our fix now detects and replaces.

Which issue(s) this PR fixes:
Fixes OCPBUGS-84251
Special notes for your reviewer:
The `--dns-zone-rg-name` flag is optional on all destroy commands; if not provided, DNS-zone-scoped assignments simply get a 404 during cleanup (handled gracefully).
Checklist:
Summary by CodeRabbit
New Features
Bug Fixes
Tests