OCPBUGS-84251: fix(azure): detect and replace stale role assignments on cluster re-creation #8322

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from bryan-cox:fix-stale-role-assignments
Apr 28, 2026

Conversation

@bryan-cox
Member

@bryan-cox bryan-cox commented Apr 23, 2026

What this PR does / why we need it:

When a self-managed Azure cluster is destroyed and re-created with the same infraID, role
assignments from the previous cluster persist as orphans because neither destroy iam,
destroy infra, nor destroy cluster cleans them up. The deterministic UUID naming
(based on infraID+component+scope) causes assignRole to find the stale assignment by name
and skip creation, even though it points to a deleted managed identity. This results in
403 AuthorizationFailed errors for components like the ingress operator that need RBAC
roles on external resource groups (e.g., the DNS zone resource group).

Root cause

Two bugs working together:

  1. assignRole GET check doesn't verify principal: The deterministic role assignment name
    (uuid(infraID+component+scope)) matches a stale/orphaned assignment from a previous cluster.
    The GET check finds it by name and skips creation without verifying the principalID matches
    the new identity.

  2. No destroy path cleans up role assignments: When you destroy a cluster, the managed
    identities are deleted but their role assignments persist as orphans in Azure RBAC.

The failure sequence

  1. create infra --infra-id=XXX → creates role assignment uuid(XXX-ingress-scope) pointing to identity A
  2. destroy infra / destroy iam / destroy cluster → deletes identity A but leaves orphaned role assignment
  3. create iam (new attempt, same name) → creates NEW identity B (different objectID)
  4. create infra --infra-id=XXX --assign-identity-roles
    • LIST check: finds orphan for A, principalID != B → doesn't match → continues
    • GET check: uuid(XXX-ingress-scope) → FINDS orphan → "already exists" → SKIPS
  5. Identity B has NO role on dnsZoneRG → 403 AuthorizationFailed
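
The failure sequence above can be reproduced with a small in-memory model. This is a minimal sketch, not the code from this PR: `fakeRBAC`, `assignmentName`, `assignRoleNameOnly`, and `hasRole` are hypothetical names, and the SHA-256 derivation only stands in for whatever deterministic UUID scheme the real assignRole uses.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// assignmentName derives a stable name from infraID, component, and scope.
// Assumption: this stands in for the real deterministic UUID derivation.
func assignmentName(infraID, component, scope string) string {
	sum := sha256.Sum256([]byte(infraID + "-" + component + "-" + scope))
	return fmt.Sprintf("%x", sum[:16])
}

// fakeRBAC maps assignment name to principal ID, like a tiny RBAC store.
type fakeRBAC map[string]string

// assignRoleNameOnly mimics the buggy GET check: an existing name means
// "already exists", no matter which principal the assignment points at.
func assignRoleNameOnly(store fakeRBAC, name, principalID string) (created bool) {
	if _, exists := store[name]; exists {
		return false // skipped
	}
	store[name] = principalID
	return true
}

// hasRole reports whether principalID actually holds the named assignment.
func hasRole(store fakeRBAC, name, principalID string) bool {
	return store[name] == principalID
}
```

Creating the assignment for identity A, deleting A without removing the entry, and then calling `assignRoleNameOnly` for identity B with the same inputs returns false and leaves `hasRole` false for B, which corresponds to step 5.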

Fix

  1. assignRole now verifies principalID when GET finds an existing assignment. If
    mismatched (stale orphan), it deletes the stale assignment and creates a new one.

  2. destroy iam now calls CleanupRoleAssignments before destroying managed identities,
    preventing orphaned role assignments from accumulating.

  3. destroy cluster azure now calls CleanupRoleAssignments before destroying infrastructure,
    so both destroy paths clean up role assignments.

  4. Extracted roleAssignmentClient interface from the Azure SDK concrete client to enable
    unit testing of the stale detection logic.
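
Fix items 1 and 4 can be sketched together. The interface below is a deliberately simplified, hypothetical stand-in for the extracted roleAssignmentClient (the real one wraps Azure SDK types and takes a context); `fakeClient`, `ensureAssignment`, and `errNotFound` are illustrative names only.

```go
package main

import (
	"errors"
	"strings"
)

var errNotFound = errors.New("role assignment not found")

// roleAssignmentClient is a simplified, hypothetical version of the
// interface extracted from the Azure SDK client.
type roleAssignmentClient interface {
	Get(scope, name string) (principalID string, err error)
	Create(scope, name, principalID string) error
	Delete(scope, name string) error
}

// fakeClient is an in-memory implementation suitable for unit tests.
type fakeClient struct {
	assignments map[string]string // scope+"/"+name -> principal ID
	deletes     int
}

func (f *fakeClient) Get(scope, name string) (string, error) {
	p, ok := f.assignments[scope+"/"+name]
	if !ok {
		return "", errNotFound
	}
	return p, nil
}

func (f *fakeClient) Create(scope, name, principalID string) error {
	f.assignments[scope+"/"+name] = principalID
	return nil
}

func (f *fakeClient) Delete(scope, name string) error {
	delete(f.assignments, scope+"/"+name)
	f.deletes++
	return nil
}

// ensureAssignment keeps a matching assignment, replaces a stale one, and
// creates a missing one.
func ensureAssignment(c roleAssignmentClient, scope, name, principalID string) error {
	existing, err := c.Get(scope, name)
	switch {
	case err == nil && strings.EqualFold(existing, principalID):
		return nil // already points at the expected principal
	case err == nil:
		// Stale orphan: same name, different principal.
		if err := c.Delete(scope, name); err != nil {
			return err
		}
	case !errors.Is(err, errNotFound):
		return err
	}
	return c.Create(scope, name, principalID)
}
```

Because `ensureAssignment` depends only on the interface, a test can seed `fakeClient` with an assignment for identity A, call it for identity B, and assert that exactly one delete happened and the entry now points at B.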

Verified with az CLI

Created a test identity, assigned a role with a deterministic UUID name, deleted the identity
(leaving the role assignment orphaned), created a new identity — confirmed the GET returns
the stale principalId which our fix now detects and replaces.

Which issue(s) this PR fixes:

Fixes OCPBUGS-84251

Special notes for your reviewer:

  • The --dns-zone-rg-name flag is optional on all destroy commands; if not provided, DNS zone
    scoped assignments simply get 404 during cleanup (handled gracefully).

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • CLI flag to specify DNS zone resource group for Azure destroy
    • Automatic pre-cleanup of Azure role assignments across related resource groups
  • Bug Fixes

    • Detects and replaces stale or mismatched Azure role assignments deterministically
    • Cleanup failures are logged and treated as non-fatal so deletion continues
  • Tests

    • New unit tests covering role-assignment creation, detection, deletion, and cleanup behaviors

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label Apr 23, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a --dns-zone-rg-name CLI flag and threads it through Azure destroy flows and option structs. Destroy now computes NSG and VNet resource-group names from cluster name and InfraID, constructs an RBACManager, and calls the new exported CleanupRoleAssignments (including the DNS-zone RG) before identity/federated-credential deletion; cleanup errors are logged and do not abort destroy. Refactors RBAC code to inject a role-assignment client, performs deterministic GET by assignment name after listing, deletes stale assignments whose PrincipalID differs, and creates assignments only when appropriate. New unit tests validate LIST/GET/Create/Delete behaviors, deterministic naming, 404 handling, and aggregated delete errors.

Sequence Diagram

sequenceDiagram
    participant DestroyCmd as Destroy IAM Command
    participant RBACMgr as RBAC Manager
    participant AzureAPI as Azure API
    participant IdentityMgr as Identity Manager

    DestroyCmd->>RBACMgr: CleanupRoleAssignments(ctx, infraID, rgMain, nsgRG, vnetRG, dnsZoneRG, ...)
    RBACMgr->>AzureAPI: List role assignments for deterministic names
    AzureAPI-->>RBACMgr: Paged assignment lists

    loop for each deterministic assignment name
        RBACMgr->>AzureAPI: Get role assignment by deterministic name
        AzureAPI-->>RBACMgr: Assignment (with PrincipalID) or 404
        alt assignment exists and PrincipalID != expected
            RBACMgr->>AzureAPI: Delete role assignment (ignore 404)
            AzureAPI-->>RBACMgr: Delete result (ok or error)
        else assignment matches expected or 404
            RBACMgr-->>RBACMgr: Skip or treat as no-op
        end
    end

    RBACMgr-->>DestroyCmd: Return aggregated delete errors (if any)
    DestroyCmd->>IdentityMgr: Proceed to delete identities/federated credentials
    IdentityMgr-->>DestroyCmd: Deletion results
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately reflects the main change: detecting and replacing stale role assignments during Azure cluster recreation, which is the core purpose of the changeset.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | Test file uses standard Go testing.T framework, not Ginkgo. Check for Ginkgo test naming stability is not applicable.
Test Structure And Quality | ✅ Passed | The test file demonstrates high quality across all assessed dimensions, using standard Go testing with table-driven test cases where each case has single responsibility, appropriate inline mock setup/cleanup, meaningful assertion messages, correct timeout handling, and consistency with existing repository patterns.
Microshift Test Compatibility | ✅ Passed | The pull request adds standard Go unit tests in cmd/infra/azure/rbac_test.go, not Ginkgo e2e tests, so the MicroShift Test Compatibility check is not applicable.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The pull request adds only standard Go unit tests in cmd/infra/azure/rbac_test.go using the testing.T interface with Gomega assertions. No Ginkgo e2e tests are introduced.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR contains only Azure CLI infrastructure management code with no Kubernetes deployment manifests, operators, controllers, or scheduling constraints.
Ote Binary Stdout Contract | ✅ Passed | This pull request does not violate the OTE Binary Stdout Contract. No stdout writes found across all modified files.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The test file added is a standard Go unit test with fully mocked Azure SDK clients, not a Ginkgo e2e test, containing no IPv4 hardcoded addresses and requiring no external connectivity.

@openshift-ci openshift-ci Bot added the area/cli and area/platform/azure labels and removed the do-not-merge/needs-area label Apr 23, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label Apr 23, 2026
@bryan-cox bryan-cox changed the title from "fix(azure): detect and replace stale role assignments on cluster re-creation" to "OCPBUGS-84251: fix(azure): detect and replace stale role assignments on cluster re-creation" Apr 23, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Apr 23, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-84251, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug label Apr 23, 2026
Contributor
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/infra/azure/rbac.go (1)

228-263: ⚠️ Potential issue | 🟠 Major

The GET/create fallback can still report success for the wrong assignment.

Line 230 only validates PrincipalID, so if the deterministic assignment exists for the same principal but an outdated RoleDefinitionID, this path returns nil even though the LIST phase already decided the required role is missing. There is a second false-success path after stale deletion: if Create returns RoleAssignmentExists, the code also returns success without checking whether the deterministic name now points at the expected principal and role.

🛠️ Proposed fix
+ replacementAttempt := false
  existing, err := client.Get(ctx, scope, roleAssignmentName, nil)
  if err == nil {
- 	if existing.Properties != nil && existing.Properties.PrincipalID != nil && strings.EqualFold(*existing.Properties.PrincipalID, assigneeID) {
+ 	if existing.Properties != nil &&
+ 		existing.Properties.PrincipalID != nil &&
+ 		existing.Properties.RoleDefinitionID != nil &&
+ 		strings.EqualFold(*existing.Properties.PrincipalID, assigneeID) &&
+ 		strings.EqualFold(*existing.Properties.RoleDefinitionID, roleDefinitionID) {
  		log.Log.Info("Skipping role assignment creation, role assignment already exists.", "role", role, "assigneeID", assigneeID, "scope", scope)
  		return nil
  	}
+ 	replacementAttempt = true
  	if _, err := client.Delete(ctx, scope, roleAssignmentName, nil); err != nil {
  		return fmt.Errorf("failed to delete stale role assignment: %w", err)
  	}
  } else {
  	// existing error handling...
  }

  _, err = client.Create(ctx, scope, roleAssignmentName, roleAssignmentProperties, nil)
  if err != nil {
  	var respErr *azcore.ResponseError
  	if errors.As(err, &respErr) && (respErr.StatusCode == http.StatusConflict || strings.EqualFold(respErr.ErrorCode, "RoleAssignmentExists")) {
- 		log.Log.Info("Failed role assignment creation, role assignment already exists.", "role", role, "assigneeID", assigneeID, "scope", scope)
- 		return nil
+ 		current, getErr := client.Get(ctx, scope, roleAssignmentName, nil)
+ 		if getErr == nil &&
+ 			current.Properties != nil &&
+ 			current.Properties.PrincipalID != nil &&
+ 			current.Properties.RoleDefinitionID != nil &&
+ 			strings.EqualFold(*current.Properties.PrincipalID, assigneeID) &&
+ 			strings.EqualFold(*current.Properties.RoleDefinitionID, roleDefinitionID) {
+ 			return nil
+ 		}
+ 		if replacementAttempt {
+ 			return fmt.Errorf("role assignment still does not match expected principal/role after replacement: %w", err)
+ 		}
+ 		return nil
  	}
  	return fmt.Errorf("failed to create role assignment: %w", err)
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/azure/rbac.go` around lines 228 - 263, The code only checks
existing.Properties.PrincipalID when deciding an existing assignment is
acceptable and likewise treats a Create conflict as success without validating
the RoleDefinitionID; update both the pre-create check and the post-Create
conflict handling to also compare the existing.Properties.RoleDefinitionID
against the desired role (from roleAssignmentProperties.RoleDefinitionID or the
variable representing the desired RoleDefinitionID). If they differ, treat the
assignment as stale (log stale role, delete or return an error so a fresh
assignment can be created) rather than returning success; use client.Get and
existing.Properties.RoleDefinitionID and
roleAssignmentProperties.RoleDefinitionID (and roleAssignmentName) to locate and
validate the assignment in both code paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/infra/azure/destroy_iam.go`:
- Around line 116-118: The RG names for NSG/VNet in destroy_iam.go are being
built as o.Name + "-nsg-" + o.InfraID and o.Name + "-vnet-" + o.InfraID which
does not match the create path; update the nsgRG and vnetRG variables to use the
same resource-group names as create.go (o.Name + "-nsg" and o.Name + "-vnet") so
CleanupRoleAssignments computes the same scopes/deterministic assignment names;
ensure you update the nsgRG and vnetRG assignments in the same function where
CleanupRoleAssignments is invoked to keep naming consistent with networking.go
and create.go.
- Around line 121-123: The call to rbacManager.CleanupRoleAssignments (in
destroy_iam.go) currently logs errors and continues, but since
CleanupRoleAssignments already treats not-found as success any non-nil error
indicates failure to remove assignments and we must abort; change the error
handling at the CleanupRoleAssignments call so that on error you return (or
propagate) an error from the enclosing function instead of just logging (e.g.,
wrap with context like "failed to clean up role assignments" and return it),
preventing subsequent identity deletion code from running when
CleanupRoleAssignments fails.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a075084a-ffa3-4cd4-a134-a90bde905c4b

📥 Commits

Reviewing files that changed from the base of the PR and between 06193dc and 1127c88.

📒 Files selected for processing (4)
  • cmd/infra/azure/destroy_iam.go
  • cmd/infra/azure/rbac.go
  • cmd/infra/azure/rbac_test.go
  • cmd/infra/azure/types.go

Comment thread cmd/infra/azure/destroy_iam.go Outdated
Comment thread cmd/infra/azure/destroy_iam.go
@bryan-cox bryan-cox force-pushed the fix-stale-role-assignments branch from 9946c42 to bfa986f April 23, 2026 18:38
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/cluster/azure/destroy.go`:
- Around line 162-164: NewRBACManager and RBACManager currently ignore
non-public Azure clouds causing role assignment clients to use default public
endpoints; update RBACManager (azureinfra.NewRBACManager and the RBACManager
constructor) to accept a cloud configuration or arm.ClientOptions (use
azureutil.GetAzureCloudConfiguration and azureutil.NewARMClientOptions pattern)
and pass those options when creating role assignment clients inside rbac.go so
clients are constructed with the correct cloud endpoints; then update callers
such as the call in destroy.go (where azureinfra.NewRBACManager is invoked) to
supply the cloud configuration/ClientOptions so CleanupRoleAssignments runs
against the configured cloud.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4d803446-12cd-4c51-a6ed-fc69b3491575

📥 Commits

Reviewing files that changed from the base of the PR and between 1127c88 and bd5fbe2.

📒 Files selected for processing (3)
  • cmd/cluster/azure/destroy.go
  • cmd/cluster/core/destroy.go
  • product-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/cluster/core/destroy.go

Comment thread cmd/cluster/azure/destroy.go
@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 62.93706% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.40%. Comparing base (222a19f) to head (7443553).
⚠️ Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
cmd/cluster/azure/destroy.go | 0.00% | 18 Missing ⚠️
cmd/infra/azure/rbac.go | 82.85% | 18 Missing ⚠️
cmd/infra/azure/destroy_iam.go | 16.66% | 15 Missing ⚠️
product-cli/cmd/cluster/azure/destroy.go | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8322      +/-   ##
==========================================
+ Coverage   36.30%   36.40%   +0.10%     
==========================================
  Files         764      764              
  Lines       93015    93133     +118     
==========================================
+ Hits        33772    33909     +137     
+ Misses      56530    56511      -19     
  Partials     2713     2713              
Files with missing lines | Coverage Δ
cmd/cluster/core/destroy.go | 8.91% <ø> (ø)
product-cli/cmd/cluster/azure/destroy.go | 0.00% <0.00%> (ø)
cmd/infra/azure/destroy_iam.go | 20.68% <16.66%> (-0.53%) ⬇️
cmd/cluster/azure/destroy.go | 17.74% <0.00%> (-3.02%) ⬇️
cmd/infra/azure/rbac.go | 47.18% <82.85%> (+47.18%) ⬆️

Flag | Coverage Δ
cmd-support | 30.34% <63.82%> (+0.32%) ⬆️
cpo-hostedcontrolplane | 37.05% <ø> (ø)
cpo-other | 35.69% <ø> (ø)
hypershift-operator | 47.89% <ø> (ø)
other | 27.68% <0.00%> (-0.01%) ⬇️


@bryan-cox
Member Author

/cc @enxebre for review

@openshift-ci openshift-ci Bot requested a review from enxebre April 23, 2026 18:40
@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

@bryan-cox: GitHub didn't allow me to request PR reviews from the following users: for, review.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @enxebre for review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/infra/azure/rbac.go`:
- Around line 234-241: The log call and ptr.Deref access can panic if
existing.Properties is nil; before logging/deleting in the role assignment
cleanup (around existing, existing.Properties, Properties.PrincipalID,
log.Log.Info, ptr.Deref, client.Delete, roleAssignmentName, scope) add a nil
guard: check if existing != nil and existing.Properties != nil and only then
dereference PrincipalID for the "stalePrincipal" field; if Properties is nil,
log a safe placeholder (e.g., "<nil-properties>") or omit that field, and still
proceed to call client.Delete as before, returning the same wrapped error on
failure.
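
The nil guard requested here can be sketched with simplified types. `assignment` and `properties` are stand-ins that mirror only the fields involved, not the Azure SDK structs, and the `<nil-principal>` placeholder is an addition for this example (the comment only proposes `<nil-properties>`).

```go
package main

// properties and assignment mirror only the fields this sketch needs; they
// are simplified stand-ins, not the Azure SDK types.
type properties struct {
	PrincipalID *string
}

type assignment struct {
	Properties *properties
}

// stalePrincipal returns a value that is safe to log even when the
// assignment, its Properties, or its PrincipalID is nil.
func stalePrincipal(a *assignment) string {
	if a == nil || a.Properties == nil {
		return "<nil-properties>"
	}
	if a.Properties.PrincipalID == nil {
		return "<nil-principal>" // hypothetical placeholder
	}
	return *a.Properties.PrincipalID
}
```

The caller can pass the returned string as the "stalePrincipal" log field and then proceed to the delete unchanged, so the guard affects logging only.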

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 183f1d05-eaf7-4185-b2bf-150bbe24578b

📥 Commits

Reviewing files that changed from the base of the PR and between bd5fbe2 and bfa986f.

📒 Files selected for processing (7)
  • cmd/cluster/azure/destroy.go
  • cmd/cluster/core/destroy.go
  • cmd/infra/azure/destroy_iam.go
  • cmd/infra/azure/rbac.go
  • cmd/infra/azure/rbac_test.go
  • cmd/infra/azure/types.go
  • product-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (2)
  • cmd/infra/azure/types.go
  • cmd/infra/azure/destroy_iam.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/cluster/core/destroy.go
  • cmd/cluster/azure/destroy.go
  • cmd/infra/azure/rbac_test.go

Comment thread cmd/infra/azure/rbac.go
@bryan-cox
Member Author

/uncc @enxebre

@openshift-ci openshift-ci Bot removed the request for review from enxebre April 23, 2026 18:54
@bryan-cox bryan-cox force-pushed the fix-stale-role-assignments branch from bfa986f to 95f2829 April 23, 2026 18:59
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
cmd/cluster/azure/destroy.go (1)

152-168: Consider extracting credential setup to avoid duplication.

Azure credential setup (lines 154-157) is similar to lines 123-126 in DestroyCluster. While they're in different code paths, consider extracting to a helper or restructuring to share the credentials when o.AzurePlatform.ResourceGroupName is provided.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/azure/destroy.go` around lines 152 - 168, The Azure credential
setup (call to util.SetupAzureCredentials used here) is duplicated with the
similar block in DestroyCluster; extract this into a shared helper (e.g.,
NewAzureCredentials or ensureCredentials) and use it from both places so
credentials are reused when o.AzurePlatform.ResourceGroupName is present; update
callers (this code and DestroyCluster) to call the new helper and return the
same subscriptionID and azureCreds to pass into azureinfra.NewRBACManager and
rbacManager.CleanupRoleAssignments, preserving existing error handling and
logging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b5b25127-7957-4e7f-9ab4-660173c3eb20

📥 Commits

Reviewing files that changed from the base of the PR and between bfa986f and 95f2829.

📒 Files selected for processing (8)
  • cmd/cluster/azure/destroy.go
  • cmd/cluster/core/destroy.go
  • cmd/infra/azure/destroy_iam.go
  • cmd/infra/azure/rbac.go
  • cmd/infra/azure/rbac_test.go
  • cmd/infra/azure/types.go
  • cmd/util/azure_flag_descriptions.go
  • product-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/cluster/core/destroy.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • product-cli/cmd/cluster/azure/destroy.go
  • cmd/infra/azure/destroy_iam.go

@bryan-cox bryan-cox force-pushed the fix-stale-role-assignments branch 2 times, most recently from 21002bf to 0d34916 Compare April 23, 2026 19:37
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
cmd/cluster/azure/destroy.go (2)

152-160: ⚠️ Potential issue | 🟠 Major

Use the same NSG/VNet resource-group names as the create path.

The cleanup scopes need to match the original assignment scopes exactly. Adding InfraID here changes the deterministic assignment name, so the real NSG/VNet-scoped assignments will be missed.

🛠️ Proposed fix
-	// The resource group names for NSG and VNet follow the convention: {name}-nsg-{infraID} and {name}-vnet-{infraID}.
-	nsgRG := o.Name + "-nsg-" + o.InfraID
-	vnetRG := o.Name + "-vnet-" + o.InfraID
+	// Match the create path resource-group names so deterministic role assignment names line up.
+	nsgRG := o.Name + "-nsg"
+	vnetRG := o.Name + "-vnet"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/azure/destroy.go` around lines 152 - 160, The cleanup computes
NSG/VNet resource-group names with an extra "-{InfraID}" suffix so the
role-assignment lookup misses the original scopes; change the variables nsgRG
and vnetRG in destroy.go to use the exact same names used in the create path
(use o.Name + "-nsg" and o.Name + "-vnet" respectively) so the cleanup scopes
for the role assignments exactly match the original assignment scopes.
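To see why the scope string matters, here is a minimal sketch of a deterministic name derived from infraID, component, and scope. The helper names `roleAssignmentName` and `rgScope`, the SHA-1 based derivation, and the sample IDs are assumptions for illustration only; the repository's real derivation may differ. The point is only that any change to the resource-group name in the scope yields a different name, so cleanup would look up an assignment that was never created.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// roleAssignmentName is a hypothetical stand-in for the deterministic naming
// described in this PR: it hashes infraID, component, and scope into a
// UUID-shaped string. The real derivation in the repository may differ.
func roleAssignmentName(infraID, component, scope string) string {
	sum := sha1.Sum([]byte(infraID + "/" + component + "/" + scope))
	b := sum[:16]
	b[6] = (b[6] & 0x0f) | 0x50 // mark as a version 5 (name-based, SHA-1) UUID
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// rgScope builds a resource-group scope string for a subscription.
func rgScope(subscriptionID, resourceGroup string) string {
	return "/subscriptions/" + subscriptionID + "/resourceGroups/" + resourceGroup
}

func main() {
	sub := "00000000-0000-0000-0000-000000000000" // placeholder subscription ID

	created := roleAssignmentName("abc12", "ingress", rgScope(sub, "demo-nsg"))
	matching := roleAssignmentName("abc12", "ingress", rgScope(sub, "demo-nsg"))
	suffixed := roleAssignmentName("abc12", "ingress", rgScope(sub, "demo-nsg-abc12"))

	fmt.Println("same scope gives same name:", created == matching)
	fmt.Println("suffixed scope gives same name:", created == suffixed)
}
```

In this sketch the suffixed resource-group name produces a different assignment name, which is the mismatch the comment above describes.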

166-168: ⚠️ Potential issue | 🟠 Major

Abort destroy when RBAC cleanup returns a real error.

CleanupRoleAssignments already treats not-found as success. Any error here means some assignments were definitely left behind, so continuing into infrastructure deletion defeats the stale-assignment fix.

🛠️ Proposed fix
-	if err := rbacManager.CleanupRoleAssignments(ctx, o.Log, o.InfraID, o.AzurePlatform.ResourceGroupName, nsgRG, vnetRG, o.AzurePlatform.DNSZoneRGName, false); err != nil {
-		o.Log.Error(err, "Failed to clean up some role assignments, continuing with infrastructure deletion")
-	}
+	if err := rbacManager.CleanupRoleAssignments(ctx, o.Log, o.InfraID, o.AzurePlatform.ResourceGroupName, nsgRG, vnetRG, o.AzurePlatform.DNSZoneRGName, false); err != nil {
+		return fmt.Errorf("failed to clean up role assignments before destroying infrastructure: %w", err)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/azure/destroy.go` around lines 166 - 168, The call to
rbacManager.CleanupRoleAssignments currently logs errors and continues, which
lets the destroy proceed after a real RBAC failure; change this so any non-nil
error from CleanupRoleAssignments causes the destroy to abort by returning or
propagating the error instead of only logging it. Locate the invocation of
CleanupRoleAssignments (the call using rbacManager, o.Log, o.InfraID,
o.AzurePlatform.ResourceGroupName, nsgRG, vnetRG, o.AzurePlatform.DNSZoneRGName)
and replace the log-only branch with an early return of the error (or wrap and
return it) so infrastructure deletion does not proceed when
CleanupRoleAssignments fails.
cmd/infra/azure/rbac.go (1)

112-115: ⚠️ Potential issue | 🟠 Major

Pass cloud-aware ARM client options into RoleAssignmentsClient.

These constructors still use nil options, so RBAC assign/cleanup will default to public-cloud endpoints. That breaks Gov/China tenants even though the destroy paths already know the configured Azure cloud.

#!/bin/bash
set -euo pipefail

echo "== RBAC manager definition =="
sed -n '40,62p' cmd/infra/azure/rbac.go

echo
echo "== Role assignment client construction =="
rg -n -C2 'NewRoleAssignmentsClient\(' cmd/infra/azure/rbac.go

echo
echo "== Cloud-aware Azure client setup elsewhere =="
rg -n -C3 'GetAzureCloudConfiguration|NewARMClientOptions|ClientOptions: azcore.ClientOptions' \
  cmd/cluster/azure/destroy.go cmd/infra/azure/destroy_iam.go support/azureutil/azureutil.go

Also applies to: 145-148, 279-283

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/azure/rbac.go` around lines 112 - 115, The RoleAssignmentsClient
constructors (calls to azureauth.NewRoleAssignmentsClient used to create
raClient with r.subscriptionID and r.creds) are passing nil for options and thus
defaulting to public-cloud endpoints; update each NewRoleAssignmentsClient
invocation (e.g., the call that creates raClient and the other occurrences
around the indicated blocks) to supply cloud-aware azcore.ClientOptions/ARM
client options obtained from your existing helper (the same options pattern used
elsewhere like NewARMClientOptions/GetAzureCloudConfiguration) so the client is
created with the configured Azure cloud endpoints rather than nil.
cmd/infra/azure/destroy_iam.go (2)

115-118: ⚠️ Potential issue | 🟠 Major

Use the same NSG/VNet resource-group names as the create path.

These values feed directly into the deterministic role-assignment name regeneration. With the extra InfraID suffix, cleanup targets different scopes and leaves the actual NSG/VNet assignments behind.

🛠️ Proposed fix
-	// The resource group names for NSG and VNet follow the convention: {name}-nsg-{infraID} and {name}-vnet-{infraID}.
-	nsgRG := o.Name + "-nsg-" + o.InfraID
-	vnetRG := o.Name + "-vnet-" + o.InfraID
+	// Match the create path resource-group names so deterministic role assignment names line up.
+	nsgRG := o.Name + "-nsg"
+	vnetRG := o.Name + "-vnet"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/azure/destroy_iam.go` around lines 115 - 118, The NSG/VNet
resource-group names (variables nsgRG and vnetRG) include an extra InfraID
suffix causing cleanup to target the wrong groups; update the assignment of
nsgRG and vnetRG to match the create path by removing the InfraID suffix (use
o.Name + "-nsg" and o.Name + "-vnet" rather than o.Name + "-nsg-" + o.InfraID /
o.Name + "-vnet-" + o.InfraID) so the deterministic role-assignment name
regeneration cleans the same scopes created earlier.

124-126: ⚠️ Potential issue | 🟠 Major

Abort IAM deletion when RBAC cleanup fails.

A non-nil return here already means the helper saw a real delete failure, not just a missing assignment. Continuing into identity deletion can leave the exact orphaned assignments this PR is meant to prevent.

🛠️ Proposed fix
-	if err := rbacManager.CleanupRoleAssignments(ctx, l, o.InfraID, o.ResourceGroupName, nsgRG, vnetRG, o.DNSZoneRG, false); err != nil {
-		l.Error(err, "Failed to clean up some role assignments, continuing with identity deletion")
-	}
+	if err := rbacManager.CleanupRoleAssignments(ctx, l, o.InfraID, o.ResourceGroupName, nsgRG, vnetRG, o.DNSZoneRG, false); err != nil {
+		return fmt.Errorf("failed to clean up role assignments before deleting identities: %w", err)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/azure/destroy_iam.go` around lines 124 - 126, The call to
rbacManager.CleanupRoleAssignments in destroy_iam.go currently logs the error
and continues, which can leave orphaned RBAC assignments; update the error
handling in the block that calls rbacManager.CleanupRoleAssignments(ctx, l,
o.InfraID, o.ResourceGroupName, nsgRG, vnetRG, o.DNSZoneRG, false) to abort the
IAM deletion flow on non-nil error by returning the error (or a wrapped error)
instead of merely logging via l.Error, so the subsequent identity deletion steps
are not executed when cleanup fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/infra/azure/rbac.go`:
- Around line 227-231: The GET branch that skips creating a role assignment only
checks existing.Properties.PrincipalID and can incorrectly reuse a deterministic
assignment whose RoleDefinitionID has changed; update the conditional in the
client.Get handling (where roleAssignmentName and existing are used) to also
validate existing.Properties.RoleDefinitionID (or the exact RoleDefinitionID
field on existing.Properties) matches the expected role definition ID before
returning, and only skip creation when both PrincipalID and RoleDefinitionID are
equal to the desired values so the code will recreate the assignment when the
role definition differs.
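The check requested here can be sketched as a small predicate. The `assignment` struct is a simplified, hypothetical view of a role assignment, not the Azure SDK type, and the case-insensitive comparison is an assumption about how the IDs should be compared.

```go
package main

import (
	"fmt"
	"strings"
)

// assignment is a simplified, hypothetical view of a role assignment.
type assignment struct {
	PrincipalID      string
	RoleDefinitionID string
}

// matches reports whether an existing assignment can be reused: both the
// principal and the role definition must equal the desired values.
// Case-insensitive comparison is an assumption made for this sketch.
func matches(existing *assignment, principalID, roleDefinitionID string) bool {
	if existing == nil {
		return false
	}
	return strings.EqualFold(existing.PrincipalID, principalID) &&
		strings.EqualFold(existing.RoleDefinitionID, roleDefinitionID)
}

func main() {
	existing := &assignment{PrincipalID: "principal-b", RoleDefinitionID: "role-dns"}

	fmt.Println("same principal and role:", matches(existing, "principal-b", "role-dns"))
	fmt.Println("same principal, other role:", matches(existing, "principal-b", "role-network"))
	fmt.Println("other principal, same role:", matches(existing, "principal-a", "role-dns"))
}
```

Only the first case would allow creation to be skipped; the other two would fall through to delete-and-recreate.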


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fe84b027-f1c0-4418-8f12-8fa62f432580

📥 Commits

Reviewing files that changed from the base of the PR and between 95f2829 and 21002bf.

📒 Files selected for processing (8)
  • cmd/cluster/azure/destroy.go
  • cmd/cluster/core/destroy.go
  • cmd/infra/azure/destroy_iam.go
  • cmd/infra/azure/rbac.go
  • cmd/infra/azure/rbac_test.go
  • cmd/infra/azure/types.go
  • cmd/util/azure_flag_descriptions.go
  • product-cli/cmd/cluster/azure/destroy.go
✅ Files skipped from review due to trivial changes (3)
  • cmd/infra/azure/types.go
  • cmd/util/azure_flag_descriptions.go
  • cmd/cluster/core/destroy.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • product-cli/cmd/cluster/azure/destroy.go

Comment thread cmd/infra/azure/rbac.go
@bryan-cox bryan-cox force-pushed the fix-stale-role-assignments branch from 0d34916 to a70a36a Compare April 23, 2026 19:41
@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 222a19f and 2 for PR HEAD 3819499 in total

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2048853730190692352 | Cost: $4.45391675 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

…re-creation

- Extract roleAssignmentClient interface from Azure SDK for testability
- Add LIST-based existence check with atScope() filter before GET fallback
- Verify both PrincipalID and RoleDefinitionID before skipping creation
- Delete stale assignments (mismatched principal) and create fresh ones
- Add CleanupRoleAssignments to both destroy cluster and destroy iam paths
- Make --dns-zone-rg-name required on all destroy commands
- Add comprehensive behavior-driven tests for assignRole, deleteRoleAssignmentByName,
  and cleanupRoleAssignments

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Update destroy command examples in self-managed Azure cluster,
IAM separate workflow, and private cluster docs to include the
now-required --dns-zone-rg-name flag. Also update the destroy iam
command reference table with all required flags.

Signed-off-by: Bryan Cox <brcox@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@bryan-cox bryan-cox force-pushed the fix-stale-role-assignments branch from 3819499 to 7443553 Compare April 28, 2026 00:45
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 28, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2026
@bryan-cox
Member Author

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 28, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by @Nirshal (see https://redhat.atlassian.net/browse/OCPBUGS-84251?focusedCommentId=16793903).

Details

In response to this:

/verified by @Nirshal (see https://redhat.atlassian.net/browse/OCPBUGS-84251?focusedCommentId=16793903)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@clebs
Member

clebs commented Apr 28, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@hypershift-jira-solve-ci

The e2e-aws test step never started. The failure occurred during the infrastructure setup phase (importing the initial release payload), before any test code could run.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

step [release:initial] failed: failed to get CLI image: unable to find the 'cli' image
in the provided release image: unable to delete completed pod: could not delete completed
pod: Operation cannot be fulfilled on Pod "release-images-initial-cli": the UID in the
precondition (cad7d5b1-63df-433d-bf13-5f3027edbf23) does not match the UID in record
(81a4e94d-8f51-43b4-96dd-6d2b34e03496). The object might have been deleted and then recreated

Summary

Both e2e-aws and e2e-aks jobs failed identically due to a CI infrastructure race condition — not a code defect in PR #8322. Multiple ci-operator instances (at least three: e2e-aws, e2e-aks, and e2e-azure-self-managed) were assigned the same shared namespace ci-op-x7rrxp65 on the build01 cluster. When they concurrently tried to import the release:initial payload, they competed for the same pod name release-images-initial-cli, causing a Kubernetes UID precondition conflict. The actual e2e test steps never started — the failure occurred during the CI graph setup phase before any test code from the PR could execute. A simple re-trigger of the jobs should succeed.

Root Cause

CI infrastructure pod-name collision in a shared namespace.

The ci-operator deterministically maps a PR's source commit to a namespace name. When multiple jobs for the same PR and commit share configuration inputs (same org/repo/branch/commit SHA), they can be assigned the same namespace (ci-op-x7rrxp65). This is by design — it allows jobs to share build artifacts (images, source clones) for efficiency.

However, the [release:initial] step creates a pod named release-images-initial-cli to extract the CLI binary from the release image. All three concurrent ci-operator instances (e2e-aws, e2e-aks, e2e-azure-self-managed) attempted to create, run, and delete this identically-named pod simultaneously:

  1. One instance (likely e2e-azure-self-managed) created the pod with UID cad7d5b1-...
  2. Another instance deleted that pod and recreated it with a new UID 81a4e94d-...
  3. The remaining instances attempted to delete the pod using the stale UID cad7d5b1-..., which Kubernetes rejected with a precondition failure ("the UID in the precondition does not match the UID in record")

The step graph manifests confirm this: the pod release-images-initial-cli was labeled as owned by the e2e-azure-self-managed job, yet it was captured in the e2e-aws step graph — proving cross-job resource contention.

This is a known transient CI infrastructure issue with ci-operator's namespace sharing mechanism. It is not caused by any code change in PR #8322 (which modifies Azure role assignment logic, completely unrelated to release image importing).

Recommendations
  1. Re-trigger both jobs — This is a transient race condition. A /retest on the PR should resolve it, as the timing collision is unlikely to recur.
  2. No code changes needed: PR #8322's changes (Azure stale role assignment detection) are entirely unrelated to this failure. The e2e test steps (e2e-aws, e2e-aks) never started; they were blocked waiting on the failed [release:initial] dependency.
  3. If it recurs — File a bug against openshift/ci-tools (ci-operator) for the pod UID race condition in the release image import step when multiple jobs share a namespace.
Evidence

| Evidence | Detail |
| --- | --- |
| Error type | Kubernetes pod UID precondition mismatch (409 Conflict) |
| Failed step | [release:initial] (CI infrastructure step, not a test step) |
| e2e test execution | Never started; e2e-aws and e2e-aks steps have started_at: None |
| Shared namespace | Both jobs assigned ci-op-x7rrxp65 on build01.ci.openshift.org |
| Third job in namespace | pull-ci-openshift-hypershift-main-e2e-azure-self-managed also shared the namespace |
| Conflicting pod | release-images-initial-cli, labeled as owned by e2e-azure-self-managed, contested by all three jobs |
| UID precondition (stale) | cad7d5b1-63df-433d-bf13-5f3027edbf23 |
| UID in record (current) | 81a4e94d-8f51-43b4-96dd-6d2b34e03496 |
| Error identical across jobs | Both e2e-aws and e2e-aks report the exact same UID values and error message |
| Start times | Both jobs started within 1 second of each other (10:17:07Z and 10:17:08Z) |
| Failure time | Both failed at exactly 10:34:28Z |
| Job duration | ~17 minutes (all spent on image builds; failure occurred at release import) |
| ci-operator version | v20260427-77434bf61 |
| Release being imported | registry.ci.openshift.org/ocp/release-5:5.0.0-0.ci-2026-04-27-173259 |
| PR code relevance | None. The PR modifies Azure role assignment logic; the failure is in CI release import infrastructure |

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/override ci/prow/e2e-kubevirt-aws-ovn-reduced

This only touches Azure CLI and docs.

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

@bryan-cox: Overrode contexts on behalf of bryan-cox: ci/prow/e2e-kubevirt-aws-ovn-reduced

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

@openshift-merge-bot openshift-merge-bot Bot merged commit 7a8d6e9 into openshift:main Apr 28, 2026
40 checks passed
@openshift-ci-robot

@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-84251
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-84251 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

What this PR does / why we need it:

When a self-managed Azure cluster is destroyed and re-created with the same infraID, role
assignments from the previous cluster persist as orphans because neither destroy iam,
destroy infra, nor destroy cluster cleans them up. The deterministic UUID naming
(based on infraID+component+scope) causes assignRole to find the stale assignment by name
and skip creation, even though it points to a deleted managed identity. This results in
403 AuthorizationFailed errors for components like the ingress operator that need RBAC
roles on external resource groups (e.g., the DNS zone resource group).

Root cause

Two bugs working together:

  1. assignRole GET check doesn't verify principal: The deterministic role assignment name
    (uuid(infraID+component+scope)) matches a stale/orphaned assignment from a previous cluster.
    The GET check finds it by name and skips creation without verifying the principalID matches
    the new identity.

  2. No destroy path cleans up role assignments: When you destroy a cluster, the managed
    identities are deleted but their role assignments persist as orphans in Azure RBAC.

The failure sequence

  1. create infra --infra-id=XXX → creates role assignment uuid(XXX-ingress-scope) pointing to identity A
  2. destroy infra / destroy iam / destroy cluster → deletes identity A but leaves orphaned role assignment
  3. create iam (new attempt, same name) → creates NEW identity B (different objectID)
  4. create infra --infra-id=XXX --assign-identity-roles
    • LIST check: finds orphan for A, principalID != B → doesn't match → continues
    • GET check: uuid(XXX-ingress-scope) → FINDS orphan → "already exists" → SKIPS
  5. Identity B has NO role on dnsZoneRG → 403 AuthorizationFailed

Fix

  1. assignRole now verifies principalID when GET finds an existing assignment. If
    mismatched (stale orphan), it deletes the stale assignment and creates a new one.

  2. destroy iam now calls CleanupRoleAssignments before destroying managed identities,
    preventing orphaned role assignments from accumulating.

  3. destroy cluster azure now calls CleanupRoleAssignments before destroying infrastructure,
    so both destroy paths clean up role assignments.

  4. Extracted roleAssignmentClient interface from the Azure SDK concrete client to enable
    unit testing of the stale detection logic.
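The detect-and-replace flow in items 1 and 4 can be sketched against an in-memory fake. The interface name `roleAssignmentClient` comes from the PR, but the method set shown here, the `assignment` struct, the `fakeClient`, and the string results returned by `assignRole` are all assumptions for illustration rather than the merged implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found") // hypothetical stand-in for a 404

// assignment is a simplified, hypothetical view of a role assignment.
type assignment struct{ PrincipalID, RoleDefinitionID string }

// roleAssignmentClient borrows its name from the PR; these methods are assumed.
type roleAssignmentClient interface {
	Get(name string) (assignment, error)
	Create(name string, a assignment) error
	Delete(name string) error
}

// fakeClient is an in-memory implementation used in place of the Azure SDK.
type fakeClient struct{ store map[string]assignment }

func (f *fakeClient) Get(name string) (assignment, error) {
	a, ok := f.store[name]
	if !ok {
		return assignment{}, errNotFound
	}
	return a, nil
}

func (f *fakeClient) Create(name string, a assignment) error { f.store[name] = a; return nil }

func (f *fakeClient) Delete(name string) error { delete(f.store, name); return nil }

// assignRole skips an exact match, replaces a stale assignment, and creates
// a missing one. It returns "skipped", "replaced", or "created".
func assignRole(c roleAssignmentClient, name string, want assignment) (string, error) {
	action := "created"
	existing, err := c.Get(name)
	switch {
	case err == nil && existing == want:
		return "skipped", nil
	case err == nil:
		// Same deterministic name, different principal or role: stale orphan.
		if derr := c.Delete(name); derr != nil && !errors.Is(derr, errNotFound) {
			return "", fmt.Errorf("delete stale assignment %s: %w", name, derr)
		}
		action = "replaced"
	case !errors.Is(err, errNotFound):
		return "", fmt.Errorf("get assignment %s: %w", name, err)
	}
	if err := c.Create(name, want); err != nil {
		return "", fmt.Errorf("create assignment %s: %w", name, err)
	}
	return action, nil
}

func main() {
	// The orphan left by the previous cluster still points at identity A.
	client := &fakeClient{store: map[string]assignment{
		"uuid-ingress": {PrincipalID: "identity-A", RoleDefinitionID: "dns-role"},
	}}
	want := assignment{PrincipalID: "identity-B", RoleDefinitionID: "dns-role"}

	for i := 0; i < 2; i++ {
		action, err := assignRole(client, "uuid-ingress", want)
		fmt.Println(action, err)
	}
}
```

Because the logic depends only on the interface, the stale-orphan case can be unit tested without calling Azure.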

Verified with az CLI

Created a test identity, assigned a role with a deterministic UUID name, deleted the identity
(leaving the role assignment orphaned), created a new identity — confirmed the GET returns
the stale principalId which our fix now detects and replaces.

Which issue(s) this PR fixes:

Fixes OCPBUGS-84251

Special notes for your reviewer:

  • The --dns-zone-rg-name flag is optional on all destroy commands; if not provided, DNS zone
    scoped assignments simply get 404 during cleanup (handled gracefully).

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • CLI flag to specify DNS zone resource group for Azure destroy

  • Automatic pre-cleanup of Azure role assignments across related resource groups

  • Bug Fixes

  • Detects and replaces stale or mismatched Azure role assignments deterministically

  • Cleanup failures are logged and treated as non-fatal so deletion continues

  • Tests

  • New unit tests covering role-assignment creation, detection, deletion, and cleanup behaviors


@bryan-cox bryan-cox deleted the fix-stale-role-assignments branch April 28, 2026 15:06
@bryan-cox
Member Author

/jira backport release-4.22

@openshift-ci-robot

@bryan-cox: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.22


@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #8359


bryan-cox added a commit to bryan-cox/release that referenced this pull request Apr 28, 2026
openshift/hypershift#8322 made --dns-zone-rg-name a required flag on
`hypershift destroy cluster azure`. Two destroy chains were not passing
it, causing every Azure HyperShift job to fail during cleanup and leak
Azure clusters.

hypershift-destroy-nested-management-cluster: used by
e2e-azure-self-managed jobs. The create step already passes
--dns-zone-rg-name=os4-common; adds the same to the destroy step.

hypershift-azure-destroy: used by AKS conformance and 12+ cucushift
Azure HyperShift workflows. Adds DNS_ZONE_RG_NAME env var (default
os4-common) matching the create chain, and passes it to the destroy
command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
openshift-merge-bot Bot pushed a commit to openshift/release that referenced this pull request Apr 29, 2026
* fix: add --dns-zone-rg-name to Azure self-managed destroy step

openshift/hypershift#8322 made --dns-zone-rg-name a required flag on
`hypershift destroy cluster azure`. The destroy-management-cluster step
in the self-managed workflow was not passing it, causing every
e2e-azure-self-managed run to fail during cleanup and leak Azure
clusters.

The create step already passes --dns-zone-rg-name=os4-common; this
adds the same flag to the destroy step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --dns-zone-rg-name to Azure destroy steps

openshift/hypershift#8322 made --dns-zone-rg-name a required flag on
`hypershift destroy cluster azure`. Two destroy chains were not passing
it, causing every Azure HyperShift job to fail during cleanup and leak
Azure clusters.

hypershift-destroy-nested-management-cluster: used by
e2e-azure-self-managed jobs. The create step already passes
--dns-zone-rg-name=os4-common; adds the same to the destroy step.

hypershift-azure-destroy: used by AKS conformance and 12+ cucushift
Azure HyperShift workflows. Adds DNS_ZONE_RG_NAME env var (default
os4-common) matching the create chain, and passes it to the destroy
command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

7 participants