
OCPBUGS-84971: Gate AWSDefaultSecurityGroupDeleted condition to AWS platform#8415

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:main from
bryan-cox:fix-aws-sg-condition-platform-gate
May 11, 2026

Conversation

@bryan-cox
Member

@bryan-cox bryan-cox commented May 5, 2026

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Extract the inline condition bubble-up logic into a pure function computeAWSDefaultSGDeletedCondition for testability
  • Add 8 behavior-driven unit tests covering platform gating, condition propagation, and deduplication

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify compilation
  • Unit tests pass (TestComputeAWSDefaultSGDeletedCondition — 8 cases)
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of AWS cluster deletion status so HostedCluster status updates occur only for AWS hosts with a deleting control plane and avoid unnecessary status writes when the condition message is unchanged.
  • Tests
    • Added unit tests covering AWS deletion-status behavior across platforms and control-plane scenarios.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Previously this AWS-specific condition was being set on all platforms (Azure, KubeVirt, etc.) during cluster deletion with Status: Unknown

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block (line 431) had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify no compilation errors
  • Verify existing unit tests pass
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

The bubbling of the AWSDefaultSecurityGroupDeleted condition was refactored into computeAWSDefaultSGDeletedCondition(hcluster, hcp). The controller now calls this helper during reconcile and updates the HostedCluster status only when the platform is AWS, the HostedControlPlane exists and is being deleted, and the helper reports a changed result. The helper derives the condition from the HostedControlPlane’s condition when present, returns an appropriate Unknown/True/False condition, and suppresses updates if the HostedCluster already has the same condition message.

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler as HostedCluster Controller
    participant API as Kubernetes API Server
    participant HCP as HostedControlPlane
    participant HCluster as HostedCluster Status

    Reconciler->>API: Get HostedCluster
    Reconciler->>API: Get HostedControlPlane (HCP)
    Reconciler->>Reconciler: computeAWSDefaultSGDeletedCondition(hcluster, hcp)
    alt platform != AWS or HCP nil or HCP not deleting
        Reconciler-->>Reconciler: no condition computed / no change
    else platform == AWS and HCP deleting
        Reconciler->>HCP: read AWSDefaultSecurityGroupDeleted condition (if present)
        Reconciler-->>Reconciler: compute condition (Unknown/True/False)
        alt message differs from HCluster condition
            Reconciler->>API: Update HostedCluster status (set condition)
            API-->>HCluster: persist status
        else message same
            Reconciler-->>Reconciler: skip status update
        end
    end
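
The skip/update branch at the bottom of the diagram can be sketched in Go as follows. Here setCondition imitates meta.SetStatusCondition from apimachinery, and the types are simplified assumptions rather than the HyperShift API:

```go
package main

import "fmt"

// Simplified condition and status types; the real controller works with
// metav1.Condition on the HostedCluster status.
type Condition struct{ Type, Status, Reason, Message string }

type HostedClusterStatus struct{ Conditions []Condition }

// setCondition upserts a condition by type, in the spirit of
// meta.SetStatusCondition from k8s.io/apimachinery.
func setCondition(status *HostedClusterStatus, c Condition) {
	for i := range status.Conditions {
		if status.Conditions[i].Type == c.Type {
			status.Conditions[i] = c
			return
		}
	}
	status.Conditions = append(status.Conditions, c)
}

// applyIfChanged mirrors the "skip status update" branch: the status write
// is issued only when the helper reported a change.
func applyIfChanged(status *HostedClusterStatus, c Condition, changed bool) bool {
	if !changed {
		return false // message unchanged: avoid an unnecessary API write
	}
	setCondition(status, c)
	return true
}

func main() {
	status := &HostedClusterStatus{Conditions: []Condition{{
		Type: "AWSDefaultSecurityGroupDeleted", Status: "False",
		Reason: "Deleting", Message: "security group still in use",
	}}}
	fresh := Condition{
		Type: "AWSDefaultSecurityGroupDeleted", Status: "True",
		Reason: "Deleted", Message: "Security group deleted",
	}
	wrote := applyIfChanged(status, fresh, fresh.Message != status.Conditions[0].Message)
	fmt.Println("status update issued:", wrote, "->", status.Conditions[0].Status)
}
```

In the actual controller the write is a Kubernetes status update; the suppression matters because redundant status writes generate API churn and watch events.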
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 16.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Test Structure And Quality: ⚠️ Warning. Gomega assertions lack the diagnostic messages used elsewhere in the codebase, e.g. g.Expect(...) and g.Expect(condition).ToNot(BeNil()). Resolution: add messages such as g.Expect(changed).To(Equal(...), "failed: changed value") and g.Expect(condition).ToNot(BeNil(), "expected condition set").
✅ Passed checks (10 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly describes the main change: gating the AWSDefaultSecurityGroupDeleted condition to apply only to the AWS platform during HostedCluster deletion.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names: ✅ Passed. The test uses standard Go table-driven testing (t.Run), not Ginkgo; this custom check is specific to Ginkgo tests.
  • Microshift Test Compatibility: ✅ Passed. No Ginkgo e2e tests were added; the new test is a standard Go unit test using Gomega assertions.
  • Single Node Openshift (Sno) Test Compatibility: ✅ Passed. The PR adds a unit test, not a Ginkgo e2e test; the SNO check applies only to Ginkgo e2e tests that run on live clusters.
  • Topology-Aware Scheduling Compatibility: ✅ Passed. The PR modifies only reconciliation status-condition logic; no deployment manifests, pod specs, or scheduling constraints are introduced.
  • Ote Binary Stdout Contract: ✅ Passed. Both modified files contain no stdout-writing violations: no fmt.Print*, log.Print*, klog, or os.Stdout calls were detected.
  • Ipv6 And Disconnected Network Test Compatibility: ✅ Passed. No Ginkgo e2e tests were added; the PR adds only a unit test using standard Go testing.T, not the Ginkgo framework.


@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 5, 2026
@bryan-cox bryan-cox changed the title NO-JIRA: Gate AWSDefaultSecurityGroupDeleted condition to AWS platform OCPBUGS-84971: Gate AWSDefaultSecurityGroupDeleted condition to AWS platform May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-84971, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Previously this AWS-specific condition was being set on all platforms (Azure, KubeVirt, etc.) during cluster deletion with Status: Unknown

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block (line 431) had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify no compilation errors
  • Verify existing unit tests pass
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
  • Improved reliability of deletion handling for AWS-based clusters by refining platform-specific validation logic.


@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 5, 2026
@bryan-cox
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-84971, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh


@bryan-cox bryan-cox force-pushed the fix-aws-sg-condition-platform-gate branch from 1675693 to bdeba67 on May 5, 2026 02:42
@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.25%. Comparing base (ad88854) to head (273acd0).
⚠️ Report is 100 commits behind head on main.

Files with missing lines:
  • ...trollers/hostedcluster/hostedcluster_controller.go (patch 83.33%, 3 missing and 1 partial ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8415      +/-   ##
==========================================
+ Coverage   37.23%   37.25%   +0.01%     
==========================================
  Files         752      752              
  Lines       91829    91833       +4     
==========================================
+ Hits        34195    34214      +19     
+ Misses      54993    54978      -15     
  Partials     2641     2641              
File coverage:
  • ...trollers/hostedcluster/hostedcluster_controller.go: 43.66% <83.33%> (+0.43%) ⬆️
Flag coverage:
  • cmd-support: 32.06% <ø> (ø)
  • cpo-hostedcontrolplane: 36.50% <ø> (ø)
  • cpo-other: 37.73% <ø> (ø)
  • hypershift-operator: 47.92% <83.33%> (+0.07%) ⬆️
  • other: 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown.


@bryan-cox bryan-cox marked this pull request as ready for review May 5, 2026 02:43
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-84971, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Previously this AWS-specific condition was being set on all platforms (Azure, KubeVirt, etc.) during cluster deletion with Status: Unknown

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block (line 431) had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify no compilation errors
  • Verify existing unit tests pass
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
  • Improved reliability of deletion handling for AWS-based clusters and prevented unnecessary status updates when conditions haven't changed.
  • Tests
  • Added unit coverage to validate AWS deletion-status behavior across platform and control-plane scenarios.


@openshift-ci openshift-ci Bot requested review from devguyio and sjenning May 5, 2026 02:43
The AWSDefaultSecurityGroupDeleted condition was being set on all
HostedClusters during deletion regardless of platform, causing
AWS-specific conditions to appear on Azure/KubeVirt/etc clusters.

Extract the condition computation into a testable function and gate
it behind a platform check matching the existing pattern used for
AWSDefaultSecurityGroupCreated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox bryan-cox force-pushed the fix-aws-sg-condition-platform-gate branch from bdeba67 to 273acd0 on May 5, 2026 02:46
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)

3728-3758: ⚡ Quick win

Add a same-status/different-message case to lock in message propagation behavior.

You already test the no-op path when the message matches. Please add the inverse case (same status/reason, different message) to ensure changed is still true and newer HCP context is propagated.

Proposed test case addition
 		{
 			name: "When HC already has the same condition message, it should not report a change",
 			hcluster: &hyperv1.HostedCluster{
 				Spec: hyperv1.HostedClusterSpec{
 					Platform: hyperv1.PlatformSpec{Type: hyperv1.AWSPlatform},
 				},
 				Status: hyperv1.HostedClusterStatus{
 					Conditions: []metav1.Condition{
 						{
 							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
 							Status:  metav1.ConditionTrue,
 							Reason:  "Deleted",
 							Message: "Security group deleted",
 						},
 					},
 				},
 			},
 			hcp: &hyperv1.HostedControlPlane{
 				ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &deletionTime},
 				Status: hyperv1.HostedControlPlaneStatus{
 					Conditions: []metav1.Condition{
 						{
 							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
 							Status:  metav1.ConditionTrue,
 							Reason:  "Deleted",
 							Message: "Security group deleted",
 						},
 					},
 				},
 			},
 			wantChanged: false,
 		},
+		{
+			name: "When HC has same status but different message, it should report a change",
+			hcluster: &hyperv1.HostedCluster{
+				Spec: hyperv1.HostedClusterSpec{
+					Platform: hyperv1.PlatformSpec{Type: hyperv1.AWSPlatform},
+				},
+				Status: hyperv1.HostedClusterStatus{
+					Conditions: []metav1.Condition{
+						{
+							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
+							Status:  metav1.ConditionTrue,
+							Reason:  "Deleted",
+							Message: "old message",
+						},
+					},
+				},
+			},
+			hcp: &hyperv1.HostedControlPlane{
+				ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &deletionTime},
+				Status: hyperv1.HostedControlPlaneStatus{
+					Conditions: []metav1.Condition{
+						{
+							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
+							Status:  metav1.ConditionTrue,
+							Reason:  "Deleted",
+							Message: "new message",
+						},
+					},
+				},
+			},
+			wantChanged: true,
+			wantStatus:  metav1.ConditionTrue,
+		},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`
around lines 3728 - 3758, Add a new table-driven test case alongside the
existing one that uses the same condition Type
(string(hyperv1.AWSDefaultSecurityGroupDeleted)), Status (metav1.ConditionTrue)
and Reason ("Deleted") but a different Message between
hcluster.Status.Conditions and hcp.Status.Conditions; set hcluster to have the
old message, hcp to have the new message, and assert wantChanged is true and
that the controller logic updates/propagates the message from hcp into the
HostedCluster condition (reference the hcluster, hcp,
AWSDefaultSecurityGroupDeleted and wantChanged identifiers to locate and
implement the case).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5318179b-f0e6-489c-9f4f-9b64ed9cef7e

📥 Commits

Reviewing files that changed from the base of the PR and between bdeba67 and 273acd0.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

@bryan-cox
Member Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2051601018340249600 | Cost: $2.88565985 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051601009939058688 | Cost: $3.952714 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2051601008374583296 | Cost: $3.1239234999999996 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot

cwbotbot commented May 5, 2026

Test Results

e2e-aws

e2e-aks

@bryan-cox
Member Author

/retest

one more time

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2051663286016937984 | Cost: $5.3032791 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051663283286446080 | Cost: $3.9105257499999997 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051697108326551552 | Cost: $3.6653342500000003 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

infra issue


Failed step: Import the release payload "n1minor" from an external source (10s)
failed to get CLI image: unable to extract the 'cli' image from the release image, pod produced no output


@bryan-cox
Member Author

/test e2e-aws

Previous failure was infra issue

{  failed to get CLI image: unable to find the 'cli' image in the provided release image: unable to delete completed pod: could not delete completed pod: Operation cannot be fulfilled on Pod "release-images-initial-cli": the UID in the precondition (20714e7d-f835-428d-b01e-1bfdb716ec48) does not match the UID in record (8ef0c73b-5f6f-4d3e-86b2-2a09b7323d9c). The object might have been deleted and then recreated}

Contributor

@jparrill jparrill left a comment


LGTM — dropped a comment. Thanks!

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 11, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@jparrill jparrill left a comment


Dropped a couple of comments. Thanks!

Please add a test case covering the "condition update" path — where the HC already has the SG condition but the HCP has progressed to a different message (e.g., from "security group still in use" to "Security group deleted"). This exercises the oldCondition.Message != freshCondition.Message branch explicitly and is the most common real-world scenario during active deletion.

Also consider setting hcluster.Generation to a non-zero value in at least one propagation test case and asserting condition.ObservedGeneration matches.
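
As a sketch under simplified types (the real case belongs in the table for TestComputeAWSDefaultSGDeletedCondition with hyperv1 types and Gomega assertions), the requested same-status/different-message behavior looks like:

```go
package main

import "fmt"

// Simplified stand-in; the real test exercises
// computeAWSDefaultSGDeletedCondition on hyperv1 types.
type Condition struct{ Type, Status, Reason, Message string }

// changed reports whether a fresh HCP-derived condition differs from the
// condition already on the HostedCluster, per the dedup rule in the PR:
// identical messages suppress the update.
func changed(old *Condition, fresh Condition) bool {
	return old == nil || old.Message != fresh.Message
}

func main() {
	cases := []struct {
		name        string
		oldMsg      string
		freshMsg    string
		wantChanged bool
	}{
		// Existing no-op case: same status, same message.
		{"same message is a no-op", "Security group deleted", "Security group deleted", false},
		// Requested case: same status/reason, but the HCP progressed from
		// "still in use" to "deleted", so the message must propagate.
		{"different message reports a change", "security group still in use", "Security group deleted", true},
	}
	for _, tc := range cases {
		old := &Condition{Type: "AWSDefaultSecurityGroupDeleted", Status: "True", Reason: "Deleted", Message: tc.oldMsg}
		fresh := Condition{Type: "AWSDefaultSecurityGroupDeleted", Status: "True", Reason: "Deleted", Message: tc.freshMsg}
		fmt.Printf("%s: got=%v want=%v\n", tc.name, changed(old, fresh), tc.wantChanged)
	}
}
```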

@hypershift-jira-solve-ci

The PR only modifies the hosted cluster controller Go files — it has nothing to do with CI infrastructure, release image imports, or the multiarch scheduling gate. This is purely a CI infrastructure flake.


Test Failure Analysis Complete

Error

step [release:latest-421] failed: failed to get CLI image: unable to extract the 'cli' image from the release image, pod produced no output

Summary

This is a CI infrastructure flake unrelated to PR #8415. The job failed during the release payload import phase — specifically while importing the OCP 4.21 release image (4.21.0-0.ci-2026-05-11-040237). The release-images-latest-421-cli pod, which runs cluster-version-operator image cli to extract the CLI image digest, was affected by a race condition involving the multiarch.openshift.io/scheduling-gate webhook. ci-operator read the pod's status while the Kubernetes API still reported the pod as Pending with SchedulingGated, even though the pod's containers had already completed successfully (exitCode=0). Because the pod appeared to be in Pending phase with no containerStatuses, ci-operator found no termination message and reported "pod produced no output". The PR changes (gating AWSDefaultSecurityGroupDeleted condition to AWS platform in hostedcluster_controller.go) were never exercised — the failure occurred before any test steps ran.

Root Cause

The root cause is a race condition between the multiarch scheduling gate webhook and ci-operator's pod status polling on the build01 CI cluster.

Detailed sequence of events:

  1. ci-operator created pod release-images-latest-421-cli at 13:53:35Z to extract the CLI image from the OCP 4.21 release payload.
  2. The pod had a schedulingGates: [{name: "multiarch.openshift.io/scheduling-gate"}] injected by a mutating admission webhook, initially blocking scheduling.
  3. Despite the gate, the pod was actually scheduled, initialized, and its release container completed successfully at 13:53:41Z with exitCode=0 — the container ran cluster-version-operator image cli > /dev/termination-log and wrote the CLI image digest to the termination log.
  4. However, when ci-operator queried the pod status to read the termination message, the Kubernetes API returned a stale view of the pod: phase: Pending, conditions: [{reason: SchedulingGated}], and crucially empty containerStatuses — meaning no termination message was visible.
  5. ci-operator interpreted the empty termination message as "pod produced no output" and failed the [release:latest-421] step.
  6. The ci-operator metrics agent separately observed the pod as Succeeded at 13:53:43Z, confirming the pod did complete — but by that point the step had already been marked as failed.

Why only latest-421 was affected: All six release CLI pods (latest-418 through latest-422 plus initial-422) ran the same command with the same scheduling gate. The other five pods' status was read at a point when the API reflected their completed state. The latest-421 pod happened to be read during the narrow window where the API cache still showed the pre-gate-removal state.

This is not related to PR #8415. The PR modifies only hostedcluster_controller.go and its test file to gate the AWSDefaultSecurityGroupDeleted condition to the AWS platform. No test steps or CI configuration were touched, and the failure occurred during release image import — well before any test code would have executed.

Recommendations
  1. Retest the PR — Run /retest or /test e2e-aks-4-22 on the PR. This is a transient CI infrastructure flake with no relation to the code changes.

  2. No code changes needed — The PR's changes to hostedcluster_controller.go (gating AWSDefaultSecurityGroupDeleted to AWS platform) are entirely unrelated to this failure.

  3. CI infrastructure note — The race condition between the multiarch.openshift.io/scheduling-gate webhook controller and ci-operator's pod status reading on build01 is a known class of CI flakes. If this recurs frequently, the CI team (openshift/ci-tools) may need to add retry logic when reading pod termination messages from pods that have scheduling gates.

Evidence
  • Failed step: [release:latest-421], Import the release payload "latest-421" from an external source
  • Error message: failed to get CLI image: unable to extract the 'cli' image from the release image, pod produced no output
  • Failed pod: release-images-latest-421-cli in namespace ci-op-vmxrwxwb
  • Pod command: /bin/sh -c 'cluster-version-operator image cli > /dev/termination-log'
  • Release image: registry.ci.openshift.org/ocp/release:4.21.0-0.ci-2026-05-11-040237
  • Pod phase in step-graph: Pending with condition SchedulingGated, but the pod actually succeeded per ci-operator logs
  • Scheduling gate: multiarch.openshift.io/scheduling-gate (injected by webhook)
  • Pod succeeded (log): "Pod release-images-latest-421-cli succeeded after 6s" at 13:53:43Z (ci-operator.log line 345)
  • Container completed (log): "Container release in pod release-images-latest-421-cli completed successfully" at 13:53:41Z (line 318)
  • Step-graph snapshot: pod status Pending, containerStatuses [], no termination message (stale API response)
  • Other CLI pods: latest-418, latest-419, latest-420, latest-422, and initial-422 all succeeded with valid termination messages
  • PR files changed: hostedcluster_controller.go and hostedcluster_controller_test.go, unrelated to CI infrastructure
  • Failure phase: release image import (pre-test); no test code from the PR was ever executed
  • Failure reason: executing_graph:step_failed:importing_release

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

E2E Verification Results

All Prow e2e jobs passed. Artifact inspection confirms the fix is working — no AWS conditions leak onto AKS HostedClusters.

AKS (e2e-aks) — PASSED

Job: pull-ci-openshift-hypershift-main-e2e-aks/2053835699978768384

HostedCluster artifact: create-cluster-t2j58.yaml

  • 33 conditions present — zero contain "AWS"
  • AWSDefaultSecurityGroupCreated: not present
  • AWSDefaultSecurityGroupDeleted: not present

AWS (e2e-aws) — PASSED (545 tests, 0 failures)

Job: pull-ci-openshift-hypershift-main-e2e-aws/2053835700075237376

HostedCluster artifact: create-cluster-jmmb7.yaml

  • AWSDefaultSecurityGroupCreated: True — correctly present on AWS
  • AWSDefaultSecurityGroupDeleted — not present (expected: cluster was running, not being deleted; 8 unit tests cover the deletion path)

@bryan-cox
Member Author

/verified by e2e

See #8415 (comment)

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 11, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

See #8415 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit bf4693b into openshift:main May 11, 2026
41 checks passed
@openshift-ci-robot

@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-84971
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-84971 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Extract the inline condition bubble-up logic into a pure function computeAWSDefaultSGDeletedCondition for testability
  • Add 8 behavior-driven unit tests covering platform gating, condition propagation, and deduplication

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify compilation
  • Unit tests pass (TestComputeAWSDefaultSGDeletedCondition — 8 cases)
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
  • Improved handling of AWS cluster deletion status so HostedCluster status updates occur only for AWS hosts with a deleting control plane and avoid unnecessary status writes when the condition message is unchanged.
  • Tests
  • Added unit tests covering AWS deletion-status behavior across platforms and control-plane scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox bryan-cox deleted the fix-aws-sg-condition-platform-gate branch May 11, 2026 19:29

Labels

  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/hypershift-operator — Indicates the PR includes changes for the hypershift operator and API - outside an OCP release.
  • jira/valid-bug — Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference — Indicates that this PR references a valid Jira ticket of any type.
  • lgtm — Indicates that a PR is ready to be merged.
  • verified — Signifies that the PR passed pre-merge verification criteria.
