
OCPBUGS-84971: Gate AWSDefaultSecurityGroupDeleted condition to AWS platform#8415

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:main from
bryan-cox:fix-aws-sg-condition-platform-gate
May 11, 2026

Conversation

@bryan-cox
Member

@bryan-cox bryan-cox commented May 5, 2026

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Extract the inline condition bubble-up logic into a pure function computeAWSDefaultSGDeletedCondition for testability
  • Add 8 behavior-driven unit tests covering platform gating, condition propagation, and deduplication

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify compilation
  • Unit tests pass (TestComputeAWSDefaultSGDeletedCondition — 8 cases)
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of AWS cluster deletion status so HostedCluster status updates occur only for AWS hosts with a deleting control plane and avoid unnecessary status writes when the condition message is unchanged.
  • Tests
    • Added unit tests covering AWS deletion-status behavior across platforms and control-plane scenarios.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Previously this AWS-specific condition was being set on all platforms (Azure, KubeVirt, etc.) during cluster deletion with Status: Unknown

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block (line 431) had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify no compilation errors
  • Verify existing unit tests pass
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

The bubbling of the AWSDefaultSecurityGroupDeleted condition was refactored into computeAWSDefaultSGDeletedCondition(hcluster, hcp). The controller now calls this helper during reconcile and updates the HostedCluster status only when the platform is AWS, the HostedControlPlane exists and is being deleted, and the helper reports a changed result. The helper derives the condition from the HostedControlPlane’s condition when present, returns an appropriate Unknown/True/False condition, and suppresses updates if the HostedCluster already has the same condition message.

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler as HostedCluster Controller
    participant API as Kubernetes API Server
    participant HCP as HostedControlPlane
    participant HCluster as HostedCluster Status

    Reconciler->>API: Get HostedCluster
    Reconciler->>API: Get HostedControlPlane (HCP)
    Reconciler->>Reconciler: computeAWSDefaultSGDeletedCondition(hcluster, hcp)
    alt platform != AWS or HCP nil or HCP not deleting
        Reconciler-->>Reconciler: no condition computed / no change
    else platform == AWS and HCP deleting
        Reconciler->>HCP: read AWSDefaultSecurityGroupDeleted condition (if present)
        Reconciler-->>Reconciler: compute condition (Unknown/True/False)
        alt message differs from HCluster condition
            Reconciler->>API: Update HostedCluster status (set condition)
            API-->>HCluster: persist status
        else message same
            Reconciler-->>Reconciler: skip status update
        end
    end
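
The skip/update branch at the bottom of the diagram can be sketched in Go as follows. Here setCondition imitates meta.SetStatusCondition from apimachinery, and the types are simplified assumptions rather than the HyperShift API:

```go
package main

import "fmt"

// Simplified condition and status types; the real controller works with
// metav1.Condition on the HostedCluster status.
type Condition struct{ Type, Status, Reason, Message string }

type HostedClusterStatus struct{ Conditions []Condition }

// setCondition upserts a condition by type, in the spirit of
// meta.SetStatusCondition from k8s.io/apimachinery.
func setCondition(status *HostedClusterStatus, c Condition) {
	for i := range status.Conditions {
		if status.Conditions[i].Type == c.Type {
			status.Conditions[i] = c
			return
		}
	}
	status.Conditions = append(status.Conditions, c)
}

// applyIfChanged mirrors the "skip status update" branch: the status write
// is issued only when the helper reported a change.
func applyIfChanged(status *HostedClusterStatus, c Condition, changed bool) bool {
	if !changed {
		return false // message unchanged: avoid an unnecessary API write
	}
	setCondition(status, c)
	return true
}

func main() {
	status := &HostedClusterStatus{Conditions: []Condition{{
		Type: "AWSDefaultSecurityGroupDeleted", Status: "False",
		Reason: "Deleting", Message: "security group still in use",
	}}}
	fresh := Condition{
		Type: "AWSDefaultSecurityGroupDeleted", Status: "True",
		Reason: "Deleted", Message: "Security group deleted",
	}
	wrote := applyIfChanged(status, fresh, fresh.Message != status.Conditions[0].Message)
	fmt.Println("status update issued:", wrote, "->", status.Conditions[0].Status)
}
```

In the actual controller the write is a Kubernetes status update; the suppression matters because redundant status writes generate API churn and watch events.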
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 16.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Test Structure And Quality: ⚠️ Warning. Gomega assertions lack the diagnostic messages used elsewhere in the codebase, e.g. g.Expect(...) and g.Expect(condition).ToNot(BeNil()). Resolution: add messages such as g.Expect(changed).To(Equal(...), "failed: changed value") and g.Expect(condition).ToNot(BeNil(), "expected condition set").
✅ Passed checks (10 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly describes the main change: gating the AWSDefaultSecurityGroupDeleted condition to apply only to the AWS platform during HostedCluster deletion.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names: ✅ Passed. The test uses standard Go table-driven testing (t.Run), not Ginkgo; this custom check is specific to Ginkgo tests.
  • Microshift Test Compatibility: ✅ Passed. No Ginkgo e2e tests were added; the new test is a standard Go unit test using Gomega assertions.
  • Single Node Openshift (Sno) Test Compatibility: ✅ Passed. The PR adds a unit test, not a Ginkgo e2e test; the SNO check applies only to Ginkgo e2e tests that run on live clusters.
  • Topology-Aware Scheduling Compatibility: ✅ Passed. The PR modifies only reconciliation status-condition logic; no deployment manifests, pod specs, or scheduling constraints are introduced.
  • Ote Binary Stdout Contract: ✅ Passed. Both modified files contain no stdout-writing violations: no fmt.Print*, log.Print*, klog, or os.Stdout calls were detected.
  • Ipv6 And Disconnected Network Test Compatibility: ✅ Passed. No Ginkgo e2e tests were added; the PR adds only a unit test using standard Go testing.T, not the Ginkgo framework.


@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 5, 2026
@bryan-cox bryan-cox changed the title NO-JIRA: Gate AWSDefaultSecurityGroupDeleted condition to AWS platform OCPBUGS-84971: Gate AWSDefaultSecurityGroupDeleted condition to AWS platform May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-84971, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Previously this AWS-specific condition was being set on all platforms (Azure, KubeVirt, etc.) during cluster deletion with Status: Unknown

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block (line 431) had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify no compilation errors
  • Verify existing unit tests pass
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
  • Improved reliability of deletion handling for AWS-based clusters by refining platform-specific validation logic.


@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 5, 2026
@bryan-cox
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-84971, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh


@bryan-cox bryan-cox force-pushed the fix-aws-sg-condition-platform-gate branch from 1675693 to bdeba67 on May 5, 2026 02:42
@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.25%. Comparing base (ad88854) to head (273acd0).
⚠️ Report is 100 commits behind head on main.

Files with missing lines:
  • ...trollers/hostedcluster/hostedcluster_controller.go (patch 83.33%, 3 missing and 1 partial ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8415      +/-   ##
==========================================
+ Coverage   37.23%   37.25%   +0.01%     
==========================================
  Files         752      752              
  Lines       91829    91833       +4     
==========================================
+ Hits        34195    34214      +19     
+ Misses      54993    54978      -15     
  Partials     2641     2641              
File coverage:
  • ...trollers/hostedcluster/hostedcluster_controller.go: 43.66% <83.33%> (+0.43%) ⬆️
Flag coverage:
  • cmd-support: 32.06% <ø> (ø)
  • cpo-hostedcontrolplane: 36.50% <ø> (ø)
  • cpo-other: 37.73% <ø> (ø)
  • hypershift-operator: 47.92% <83.33%> (+0.07%) ⬆️
  • other: 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown.


@bryan-cox bryan-cox marked this pull request as ready for review May 5, 2026 02:43
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-84971, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Previously this AWS-specific condition was being set on all platforms (Azure, KubeVirt, etc.) during cluster deletion with Status: Unknown

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block (line 431) had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify no compilation errors
  • Verify existing unit tests pass
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
  • Improved reliability of deletion handling for AWS-based clusters and prevented unnecessary status updates when conditions haven't changed.
  • Tests
  • Added unit coverage to validate AWS deletion-status behavior across platform and control-plane scenarios.


@openshift-ci openshift-ci Bot requested review from devguyio and sjenning May 5, 2026 02:43
The AWSDefaultSecurityGroupDeleted condition was being set on all
HostedClusters during deletion regardless of platform, causing
AWS-specific conditions to appear on Azure/KubeVirt/etc clusters.

Extract the condition computation into a testable function and gate
it behind a platform check matching the existing pattern used for
AWSDefaultSecurityGroupCreated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox bryan-cox force-pushed the fix-aws-sg-condition-platform-gate branch from bdeba67 to 273acd0 on May 5, 2026 02:46
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)

3728-3758: ⚡ Quick win

Add a same-status/different-message case to lock in message propagation behavior.

You already test the no-op path when the message matches. Please add the inverse case (same status/reason, different message) to ensure changed is still true and newer HCP context is propagated.

Proposed test case addition
 		{
 			name: "When HC already has the same condition message, it should not report a change",
 			hcluster: &hyperv1.HostedCluster{
 				Spec: hyperv1.HostedClusterSpec{
 					Platform: hyperv1.PlatformSpec{Type: hyperv1.AWSPlatform},
 				},
 				Status: hyperv1.HostedClusterStatus{
 					Conditions: []metav1.Condition{
 						{
 							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
 							Status:  metav1.ConditionTrue,
 							Reason:  "Deleted",
 							Message: "Security group deleted",
 						},
 					},
 				},
 			},
 			hcp: &hyperv1.HostedControlPlane{
 				ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &deletionTime},
 				Status: hyperv1.HostedControlPlaneStatus{
 					Conditions: []metav1.Condition{
 						{
 							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
 							Status:  metav1.ConditionTrue,
 							Reason:  "Deleted",
 							Message: "Security group deleted",
 						},
 					},
 				},
 			},
 			wantChanged: false,
 		},
+		{
+			name: "When HC has same status but different message, it should report a change",
+			hcluster: &hyperv1.HostedCluster{
+				Spec: hyperv1.HostedClusterSpec{
+					Platform: hyperv1.PlatformSpec{Type: hyperv1.AWSPlatform},
+				},
+				Status: hyperv1.HostedClusterStatus{
+					Conditions: []metav1.Condition{
+						{
+							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
+							Status:  metav1.ConditionTrue,
+							Reason:  "Deleted",
+							Message: "old message",
+						},
+					},
+				},
+			},
+			hcp: &hyperv1.HostedControlPlane{
+				ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &deletionTime},
+				Status: hyperv1.HostedControlPlaneStatus{
+					Conditions: []metav1.Condition{
+						{
+							Type:    string(hyperv1.AWSDefaultSecurityGroupDeleted),
+							Status:  metav1.ConditionTrue,
+							Reason:  "Deleted",
+							Message: "new message",
+						},
+					},
+				},
+			},
+			wantChanged: true,
+			wantStatus:  metav1.ConditionTrue,
+		},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`
around lines 3728 - 3758, Add a new table-driven test case alongside the
existing one that uses the same condition Type
(string(hyperv1.AWSDefaultSecurityGroupDeleted)), Status (metav1.ConditionTrue)
and Reason ("Deleted") but a different Message between
hcluster.Status.Conditions and hcp.Status.Conditions; set hcluster to have the
old message, hcp to have the new message, and assert wantChanged is true and
that the controller logic updates/propagates the message from hcp into the
HostedCluster condition (reference the hcluster, hcp,
AWSDefaultSecurityGroupDeleted and wantChanged identifiers to locate and
implement the case).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5318179b-f0e6-489c-9f4f-9b64ed9cef7e

📥 Commits

Reviewing files that changed from the base of the PR and between bdeba67 and 273acd0.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

@bryan-cox
Member Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2051601018340249600 | Cost: $2.88565985 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051601009939058688 | Cost: $3.952714 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2051601008374583296 | Cost: $3.1239234999999996 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot

cwbotbot commented May 5, 2026

Test Results

e2e-aws

e2e-aks

@bryan-cox
Member Author

/retest

one more time

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2051663286016937984 | Cost: $5.3032791 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051663283286446080 | Cost: $3.9105257499999997 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051697108326551552 | Cost: $3.6653342500000003 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

infra issue


Failed step: Import the release payload "n1minor" from an external source (10s)
failed to get CLI image: unable to extract the 'cli' image from the release image, pod produced no output


@bryan-cox
Member Author

/test e2e-aws

Previous failure was infra issue

{  failed to get CLI image: unable to find the 'cli' image in the provided release image: unable to delete completed pod: could not delete completed pod: Operation cannot be fulfilled on Pod "release-images-initial-cli": the UID in the precondition (20714e7d-f835-428d-b01e-1bfdb716ec48) does not match the UID in record (8ef0c73b-5f6f-4d3e-86b2-2a09b7323d9c). The object might have been deleted and then recreated}

Contributor

@jparrill jparrill left a comment


LGTM — dropped a comment. Thanks!

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 11, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@jparrill jparrill left a comment


Dropped a couple of comments. Thanks!

Please add a test case covering the "condition update" path — where the HC already has the SG condition but the HCP has progressed to a different message (e.g., from "security group still in use" to "Security group deleted"). This exercises the oldCondition.Message != freshCondition.Message branch explicitly and is the most common real-world scenario during active deletion.

Also consider setting hcluster.Generation to a non-zero value in at least one propagation test case and asserting condition.ObservedGeneration matches.
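
As a sketch under simplified types (the real case belongs in the table for TestComputeAWSDefaultSGDeletedCondition with hyperv1 types and Gomega assertions), the requested same-status/different-message behavior looks like:

```go
package main

import "fmt"

// Simplified stand-in; the real test exercises
// computeAWSDefaultSGDeletedCondition on hyperv1 types.
type Condition struct{ Type, Status, Reason, Message string }

// changed reports whether a fresh HCP-derived condition differs from the
// condition already on the HostedCluster, per the dedup rule in the PR:
// identical messages suppress the update.
func changed(old *Condition, fresh Condition) bool {
	return old == nil || old.Message != fresh.Message
}

func main() {
	cases := []struct {
		name        string
		oldMsg      string
		freshMsg    string
		wantChanged bool
	}{
		// Existing no-op case: same status, same message.
		{"same message is a no-op", "Security group deleted", "Security group deleted", false},
		// Requested case: same status/reason, but the HCP progressed from
		// "still in use" to "deleted", so the message must propagate.
		{"different message reports a change", "security group still in use", "Security group deleted", true},
	}
	for _, tc := range cases {
		old := &Condition{Type: "AWSDefaultSecurityGroupDeleted", Status: "True", Reason: "Deleted", Message: tc.oldMsg}
		fresh := Condition{Type: "AWSDefaultSecurityGroupDeleted", Status: "True", Reason: "Deleted", Message: tc.freshMsg}
		fmt.Printf("%s: got=%v want=%v\n", tc.name, changed(old, fresh), tc.wantChanged)
	}
}
```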

@hypershift-jira-solve-ci

The PR only modifies the hosted cluster controller Go files — it has nothing to do with CI infrastructure, release image imports, or the multiarch scheduling gate. This is purely a CI infrastructure flake.


Test Failure Analysis Complete

Error

step [release:latest-421] failed: failed to get CLI image: unable to extract the 'cli' image from the release image, pod produced no output

Summary

This is a CI infrastructure flake unrelated to PR #8415. The job failed during the release payload import phase — specifically while importing the OCP 4.21 release image (4.21.0-0.ci-2026-05-11-040237). The release-images-latest-421-cli pod, which runs cluster-version-operator image cli to extract the CLI image digest, was affected by a race condition involving the multiarch.openshift.io/scheduling-gate webhook. ci-operator read the pod's status while the Kubernetes API still reported the pod as Pending with SchedulingGated, even though the pod's containers had already completed successfully (exitCode=0). Because the pod appeared to be in Pending phase with no containerStatuses, ci-operator found no termination message and reported "pod produced no output". The PR changes (gating AWSDefaultSecurityGroupDeleted condition to AWS platform in hostedcluster_controller.go) were never exercised — the failure occurred before any test steps ran.

Root Cause

The root cause is a race condition between the multiarch scheduling gate webhook and ci-operator's pod status polling on the build01 CI cluster.

Detailed sequence of events:

  1. ci-operator created pod release-images-latest-421-cli at 13:53:35Z to extract the CLI image from the OCP 4.21 release payload.
  2. The pod had a schedulingGates: [{name: "multiarch.openshift.io/scheduling-gate"}] injected by a mutating admission webhook, initially blocking scheduling.
  3. Despite the gate, the pod was actually scheduled, initialized, and its release container completed successfully at 13:53:41Z with exitCode=0 — the container ran cluster-version-operator image cli > /dev/termination-log and wrote the CLI image digest to the termination log.
  4. However, when ci-operator queried the pod status to read the termination message, the Kubernetes API returned a stale view of the pod: phase: Pending, conditions: [{reason: SchedulingGated}], and crucially empty containerStatuses — meaning no termination message was visible.
  5. ci-operator interpreted the empty termination message as "pod produced no output" and failed the [release:latest-421] step.
  6. The ci-operator metrics agent separately observed the pod as Succeeded at 13:53:43Z, confirming the pod did complete — but by that point the step had already been marked as failed.

Why only latest-421 was affected: All six release CLI pods (latest-418 through latest-422 plus initial-422) ran the same command with the same scheduling gate. The other five pods' status was read at a point when the API reflected their completed state. The latest-421 pod happened to be read during the narrow window where the API cache still showed the pre-gate-removal state.

This is not related to PR #8415. The PR modifies only hostedcluster_controller.go and its test file to gate the AWSDefaultSecurityGroupDeleted condition to the AWS platform. No test steps or CI configuration were touched, and the failure occurred during release image import — well before any test code would have executed.

Recommendations
  1. Retest the PR — Run /retest or /test e2e-aks-4-22 on the PR. This is a transient CI infrastructure flake with no relation to the code changes.

  2. No code changes needed — The PR's changes to hostedcluster_controller.go (gating AWSDefaultSecurityGroupDeleted to AWS platform) are entirely unrelated to this failure.

  3. CI infrastructure note — The race condition between the multiarch.openshift.io/scheduling-gate webhook controller and ci-operator's pod status reading on build01 is a known class of CI flakes. If this recurs frequently, the CI team (openshift/ci-tools) may need to add retry logic when reading pod termination messages from pods that have scheduling gates.

Evidence
  • Failed step: [release:latest-421], Import the release payload "latest-421" from an external source
  • Error message: failed to get CLI image: unable to extract the 'cli' image from the release image, pod produced no output
  • Failed pod: release-images-latest-421-cli in namespace ci-op-vmxrwxwb
  • Pod command: /bin/sh -c 'cluster-version-operator image cli > /dev/termination-log'
  • Release image: registry.ci.openshift.org/ocp/release:4.21.0-0.ci-2026-05-11-040237
  • Pod phase in step-graph: Pending with condition SchedulingGated, but the pod actually succeeded per ci-operator logs
  • Scheduling gate: multiarch.openshift.io/scheduling-gate (injected by webhook)
  • Pod succeeded (log): "Pod release-images-latest-421-cli succeeded after 6s" at 13:53:43Z (ci-operator.log line 345)
  • Container completed (log): "Container release in pod release-images-latest-421-cli completed successfully" at 13:53:41Z (line 318)
  • Step-graph snapshot: pod status Pending, containerStatuses [], no termination message (stale API response)
  • Other CLI pods: latest-418, latest-419, latest-420, latest-422, and initial-422 all succeeded with valid termination messages
  • PR files changed: hostedcluster_controller.go and hostedcluster_controller_test.go, unrelated to CI infrastructure
  • Failure phase: release image import (pre-test); no test code from the PR was ever executed
  • Failure reason: executing_graph:step_failed:importing_release

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

E2E Verification Results

All Prow e2e jobs passed. Artifact inspection confirms the fix is working — no AWS conditions leak onto AKS HostedClusters.

AKS (e2e-aks) — PASSED

Job: pull-ci-openshift-hypershift-main-e2e-aks/2053835699978768384

HostedCluster artifact: create-cluster-t2j58.yaml

  • 33 conditions present — zero contain "AWS"
  • AWSDefaultSecurityGroupCreated: not present
  • AWSDefaultSecurityGroupDeleted: not present

AWS (e2e-aws) — PASSED (545 tests, 0 failures)

Job: pull-ci-openshift-hypershift-main-e2e-aws/2053835700075237376

HostedCluster artifact: create-cluster-jmmb7.yaml

  • AWSDefaultSecurityGroupCreated: True — correctly present on AWS
  • AWSDefaultSecurityGroupDeleted — not present (expected: cluster was running, not being deleted; 8 unit tests cover the deletion path)

@bryan-cox
Member Author

/verified by e2e

See #8415 (comment)

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 11, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

See #8415 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit bf4693b into openshift:main May 11, 2026
41 checks passed
@openshift-ci-robot

@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-84971
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-84971 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Summary

  • Gate the AWSDefaultSecurityGroupDeleted condition behind an AWS platform check during HostedCluster deletion, matching the existing pattern used for AWSDefaultSecurityGroupCreated
  • Extract the inline condition bubble-up logic into a pure function computeAWSDefaultSGDeletedCondition for testability
  • Add 8 behavior-driven unit tests covering platform gating, condition propagation, and deduplication

Root Cause

The AWSDefaultSecurityGroupCreated condition (line 873) is correctly gated behind hcluster.Spec.Platform.Type == hyperv1.AWSPlatform, but the AWSDefaultSecurityGroupDeleted block had no platform check — it ran for every platform during deletion.

While the HCP-level controller (destroyAWSDefaultSecurityGroup) correctly short-circuits for non-AWS platforms, the HostedCluster controller still created a fresh condition with Status: Unknown and set it on the HostedCluster regardless.

Test plan

  • Verify compilation
  • Unit tests pass (TestComputeAWSDefaultSGDeletedCondition — 8 cases)
  • Confirm ARO HCP HostedCluster deletion no longer shows AWSDefaultSecurityGroupDeleted condition

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
  • Improved handling of AWS cluster deletion status so HostedCluster status updates occur only for AWS hosts with a deleting control plane and avoid unnecessary status writes when the condition message is unchanged.
  • Tests
  • Added unit tests covering AWS deletion-status behavior across platforms and control-plane scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox bryan-cox deleted the fix-aws-sg-condition-platform-gate branch May 11, 2026 19:29

Labels

  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/hypershift-operator — Indicates the PR includes changes for the hypershift operator and API - outside an OCP release.
  • jira/valid-bug — Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference — Indicates that this PR references a valid Jira ticket of any type.
  • lgtm — Indicates that a PR is ready to be merged.
  • verified — Signifies that the PR passed pre-merge verification criteria.
