CNTRLPLANE-3430: Make HA break-glass credentials test infraless by sdminonne · Pull Request #8546 · openshift/hypershift

sdminonne · 2026-05-19T12:57:53Z

What this PR does / why we need it:

Switches TestCreateClusterHABreakGlassCredentials from the cloud platform to NonePlatform
and fixes CPO bugs on non-OpenShift management clusters (AKS) that this change exposed.

This test validates control-plane PKI operations (CSR creation/approval/signing, certificate
revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and an HA control plane (3 KAS
replicas). None of these operations require cloud infrastructure.

The management cluster's cloud controller manager provisions a real LoadBalancer for the
kube-apiserver service regardless of the hosted cluster's platform type, so the guest API
server remains reachable. This follows the same pattern used by TestHAEtcdChaos.

Removing cloud infrastructure provisioning reduces cost, execution time, and flakiness from
cloud API interactions — particularly under resource pressure when multiple hosted clusters
compete for management cluster capacity (as seen in PR #8338's e2e-aks job).

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Special notes for your reviewer:

Why the CPO fixes (commit 1) are needed

TestHAEtcdChaos uses the same NonePlatform + HA + NodePoolReplicas=0 pattern but has
never run on AKS. It is restricted to the e2e-azure-self-managed job, which uses an
OpenShift management cluster and has an explicit test filter
(openshift-hypershift-main.yaml):

CI_TESTS_RUN: "^(TestCreateCluster$|TestAutoscaling|TestNodePool|TestUpgradeControlPlane|TestHAEtcdChaos|TestAzureOAuthLoadBalancer|TestAzurePrivateTopology)$"

TestCreateClusterHABreakGlassCredentials is not in that filter. It runs in e2e-aks,
which has no CI_TESTS_RUN filter — all tests run, on an AKS (non-OpenShift) management
cluster.

On OpenShift management clusters the default router populates route.Spec.Host, so the CPO
bugs never manifest. On AKS, there is no default router — the shared ingress controller
populates Status.Ingress[0].RouterCanonicalHostname instead. The CPO was reading
route.Spec.Host directly, getting an empty string, and blocking infrastructure readiness
indefinitely.

Additionally, unlike TestHAEtcdChaos (which only uses the management client to manipulate
etcd pods), the break-glass test calls WaitForGuestClient, which requires the CPO to
complete infrastructure readiness by resolving a host/port for KAS.

Without the CPO fixes, WaitForGuestClient times out on AKS.

Per-commit breakdown

Commit 1 — fix(control-plane-operator): resolve route host on non-OpenShift management clusters

- Add RouteHost() helper in support/netutil with a fallback chain (Spec.Host →
  Status.Ingress[0].Host → RouterCanonicalHostname) and use it in
  oauth.ReconcileServiceStatus and kas.ReconcileKonnectivityServerServiceStatus
- Guard the shared ingress path in reconcileAPIServerServiceStatus with
  serviceStrategy.Type == Route to prevent empty apiHost when UseSharedIngress()=true
  but strategy is LoadBalancer
- Log all InfrastructureStatus fields when infrastructure is not ready
- Add unit tests for RouteHost, KonnectivityServerServiceStatus fallback, and the shared
  ingress guard with LoadBalancer strategy
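
The fallback chain described for RouteHost() can be sketched as follows. This is a minimal, dependency-free illustration: the Route, RouteSpec, RouteStatus, and RouteIngress types below are trimmed-down stand-ins for the routev1 types (an assumption made so the snippet compiles on its own), and the function body follows the index-0 logic quoted later in this thread rather than the exact file contents.

```go
package main

import "fmt"

// Stand-ins for the routev1 types; only the fields the helper reads are kept.
type RouteIngress struct {
	Host                    string
	RouterCanonicalHostname string
}

type RouteSpec struct{ Host string }

type RouteStatus struct{ Ingress []RouteIngress }

type Route struct {
	Spec   RouteSpec
	Status RouteStatus
}

// RouteHost returns the first non-empty value in the chain
// Spec.Host -> Status.Ingress[0].Host -> Status.Ingress[0].RouterCanonicalHostname,
// or "" when the route is nil or none of the fields is populated.
func RouteHost(route *Route) string {
	if route == nil {
		return ""
	}
	if route.Spec.Host != "" {
		return route.Spec.Host
	}
	// Only the first ingress entry is inspected, matching the PR's intent.
	if len(route.Status.Ingress) > 0 {
		if route.Status.Ingress[0].Host != "" {
			return route.Status.Ingress[0].Host
		}
		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
			return route.Status.Ingress[0].RouterCanonicalHostname
		}
	}
	return ""
}

func main() {
	// AKS-like case: Spec.Host is empty, only the canonical hostname is set.
	aks := &Route{Status: RouteStatus{Ingress: []RouteIngress{
		{RouterCanonicalHostname: "router-canonical.example.com"},
	}}}
	fmt.Println(RouteHost(aks))
}
```

On an OpenShift management cluster the first branch returns immediately, so the two later branches only matter when Spec.Host was never populated.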

Commit 2 — test(e2e): make HA break-glass credentials test infraless

Switch the test to NonePlatform. Same pattern as TestHAEtcdChaos.

The framework handles NonePlatform correctly for this test:

before(): skips ValidatePublicCluster — OK, the test has explicit waits
after(): skips EnsureAdmissionPolicies — OK (OCPBUGS-61291)
RunTestControlPlanePKIOperatorBreakGlassCredentials: only skips KubeVirt and GCP

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

🤖 Generated with Claude Code

openshift-ci-robot · 2026-05-19T12:57:58Z

@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Switches TestCreateClusterHABreakGlassCredentials from the cloud platform to NonePlatform.

This test validates control-plane PKI operations (CSR creation/approval/signing, certificate
revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and an HA control plane (3 KAS
replicas). None of these operations require cloud infrastructure.

The management cluster's cloud controller manager provisions a real LoadBalancer for the
kube-apiserver service regardless of the hosted cluster's platform type, so the guest API
server remains reachable. This follows the same pattern used by TestHAEtcdChaos.

Removing cloud infrastructure provisioning reduces cost, execution time, and flakiness from
cloud API interactions — particularly under resource pressure when multiple hosted clusters
compete for management cluster capacity (as seen in PR #8338's e2e-aks job).

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Special notes for your reviewer:

Single-line change. The NonePlatform pattern is already proven by TestHAEtcdChaos
(test/e2e/chaos_test.go:32-66) which runs an HA control plane on NonePlatform with
NodePoolReplicas=0 and successfully calls WaitForGuestClient.

The framework handles NonePlatform correctly for this test:

before(): skips ValidatePublicCluster — OK, the test has explicit waits

after(): skips EnsureAdmissionPolicies — OK (OCPBUGS-61291)

after(): skips ValidateHostedClusterConditions — OK for NonePlatform

RunTestControlPlanePKIOperatorBreakGlassCredentials: only skips KubeVirt and GCP

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

🤖 Generated with Claude Code

openshift-ci · 2026-05-19T12:57:58Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-merge-bot · 2026-05-19T12:58:05Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-19T12:58:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdminonne
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sdminonne · 2026-05-19T12:59:25Z

/test e2e-aks

coderabbitai · 2026-05-19T13:00:20Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Adds support function netutil.RouteHost that chooses the best hostname from a Route (Spec.Host, then Status.Ingress[0].Host, then RouterCanonicalHostname). Reconciler code in OAuth and KAS now uses this helper to populate service status (setting host and port or returning an error when no host is found). Unit tests for RouteHost and additional route-resolution cases were added. A single e2e test was updated to pass hyperv1.NonePlatform and to wait for hosted control plane bootstrap before running break-glass checks.

Sequence Diagram(s)

sequenceDiagram
  participant Reconciler as ReconcileServiceStatus / ReconcileKonnectivityServerServiceStatus
  participant Route as routev1.Route
  participant RouteHost as netutil.RouteHost
  participant ServiceStatus as ServiceStatusUpdater
  Reconciler->>Route: read route resource
  Reconciler->>RouteHost: call RouteHost(route)
  RouteHost-->>Reconciler: host (or "")
  alt host != ""
    Reconciler->>ServiceStatus: set host and port (443 / RouteExternalPort)
  else host == ""
    Reconciler-->>ServiceStatus: set error message and return
  end

Suggested reviewers

muraee
Nirshal

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Microshift Test Compatibility	⚠️ Warning	Test uses a HighlyAvailable control plane (3 KAS replicas), which violates MicroShift's single-node architecture. No protective [Skipped:MicroShift] or [apigroup:...] labels are present.	Add a [Skipped:MicroShift] label to the test name or guard with an exutil.IsMicroShiftCluster() check, since HighlyAvailable control planes are unsupported on single-node MicroShift.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: converting the HA break-glass credentials test to use NonePlatform instead of cloud platform.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in the PR are stable and deterministic. TestRouteHost (7 cases) and new Route cases in TestReconcileServiceStatus (2 cases) use static, descriptive strings with no dynamic content.
Test Structure And Quality	✅ Passed	Test meets quality standards: single responsibility, proper cleanup, explicit 20-min timeout, meaningful assertion messages, consistent patterns.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. The PR modifies an existing test (TestCreateClusterHABreakGlassCredentials) and adds only a unit test (TestRouteHost), which is outside the scope of this check.
Topology-Aware Scheduling Compatibility	✅ Passed	PR introduces no scheduling constraints. Changes are for route hostname resolution, infrastructure logging, and e2e test platform configuration only.
Ote Binary Stdout Contract	✅ Passed	PR uses logr.Logger for all logging (framework-controlled). No fmt.Print/klog/stdout writes in process-level code. ValidateHostedClusterConditions is called within test bodies.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. New unit tests use domain names, not IPv4 hardcoding, and have no external connectivity requirements.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

codecov · 2026-05-19T13:07:50Z

Codecov Report

❌ Patch coverage is 98.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 40.48%. Comparing base (36dfb1b) to head (1aa9c82).

Files with missing lines	Patch %	Lines
...or/controllers/hostedcontrolplane/oauth/service.go	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8546      +/-   ##
==========================================
+ Coverage   40.40%   40.48%   +0.07%     
==========================================
  Files         755      755              
  Lines       93235    93278      +43     
==========================================
+ Hits        37675    37762      +87     
+ Misses      52858    52812      -46     
- Partials     2702     2704       +2

Files with missing lines	Coverage Δ
...ostedcontrolplane/hostedcontrolplane_controller.go	`45.59% <100.00%> (+0.56%)`	⬆️
...ator/controllers/hostedcontrolplane/infra/infra.go	`53.61% <100.00%> (+3.06%)`	⬆️
...ator/controllers/hostedcontrolplane/kas/service.go	`47.86% <100.00%> (+7.49%)`	⬆️
support/netutil/route.go	`72.00% <100.00%> (+3.81%)`	⬆️
...or/controllers/hostedcontrolplane/oauth/service.go	`78.00% <66.66%> (+3.25%)`	⬆️

Flag	Coverage Δ
cmd-support	`34.47% <100.00%> (+0.03%)`	⬆️
cpo-hostedcontrolplane	`42.13% <97.14%> (+0.36%)`	⬆️
cpo-other	`40.31% <ø> (ø)`
hypershift-operator	`50.72% <ø> (ø)`
other	`31.54% <ø> (ø)`

sdminonne · 2026-05-19T13:21:42Z

@coderabbitai full review

coderabbitai · 2026-05-19T13:21:49Z

✅ Actions performed

Full review triggered.

hypershift-jira-solve-ci · 2026-05-19T15:13:58Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056721417461829632 | Cost: $4.631663149999999 | Failed step: hypershift-azure-run-e2e

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

cwbotbot · 2026-05-19T16:00:03Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-05-21T20:43:41Z

e2e-aks

Status: ✅ PASS
Started: 2026-05-21T20:43:32Z

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/oauth/service.go`:
- Around line 95-98: The code currently calls netutil.RouteHost(route) and then
dereferences route.ObjectMeta.CreationTimestamp.Time which can panic if route is
nil; add an explicit nil check for route before calling netutil.RouteHost (e.g.,
if route == nil set message to a clear "OAuth service route is missing" or
similar including the duration since creation and return), otherwise proceed to
compute routeHost and use route.ObjectMeta.CreationTimestamp.Time; ensure you
update the branch that currently handles empty routeHost to avoid dereferencing
route when route is nil.

In `@support/netutil/route_test.go`:
- Around line 472-559: Add a test case in TestRouteHost that asserts RouteHost
prefers a non-empty host from a later ingress when the first ingress is empty:
create a routev1.Route with Status.Ingress containing two entries where
Ingress[0] is empty and Ingress[1].Host (or RouterCanonicalHostname) is set, and
expect RouteHost(route) to return that second ingress host; ensure the new case
references RouteHost and the routev1.Route.Status.Ingress slice so it guards
multi-ingress fallback behavior.

In `@support/netutil/route.go`:
- Around line 163-169: The current logic only inspects route.Status.Ingress[0]
and returns its Host or RouterCanonicalHostname, which misses later ingress
entries; update the code that references route.Status.Ingress[0] to iterate over
all entries in route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 815d35bd-ddc0-4a15-b370-ec61f7400400

📥 Commits

Reviewing files that changed from the base of the PR and between 880b44f and 4a99314.

📒 Files selected for processing (5)

control-plane-operator/controllers/hostedcontrolplane/kas/service.go
control-plane-operator/controllers/hostedcontrolplane/oauth/service.go
control-plane-operator/controllers/hostedcontrolplane/oauth/service_test.go
support/netutil/route.go
support/netutil/route_test.go

coderabbitai · 2026-05-19T17:21:00Z

+func TestRouteHost(t *testing.T) {
+	testCases := []struct {
+		name         string
+		route        *routev1.Route
+		expectedHost string
+	}{
+		{
+			name:         "When route is nil, it should return empty string",
+			route:        nil,
+			expectedHost: "",
+		},
+		{
+			name: "When Spec.Host is set, it should return Spec.Host",
+			route: &routev1.Route{
+				Spec: routev1.RouteSpec{
+					Host: "oauth.example.com",
+				},
+			},
+			expectedHost: "oauth.example.com",
+		},
+		{
+			name: "When Spec.Host is set and Status.Ingress is also populated, it should prefer Spec.Host",
+			route: &routev1.Route{
+				Spec: routev1.RouteSpec{
+					Host: "oauth.example.com",
+				},
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{
+							Host:                    "ingress.example.com",
+							RouterCanonicalHostname: "canonical.example.com",
+						},
+					},
+				},
+			},
+			expectedHost: "oauth.example.com",
+		},
+		{
+			name: "When Spec.Host is empty and Status.Ingress[0].Host is set, it should return Ingress Host",
+			route: &routev1.Route{
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{
+							Host: "ingress.example.com",
+						},
+					},
+				},
+			},
+			expectedHost: "ingress.example.com",
+		},
+		{
+			name: "When Spec.Host and Ingress Host are empty but RouterCanonicalHostname is set, it should return RouterCanonicalHostname",
+			route: &routev1.Route{
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{
+							RouterCanonicalHostname: "router-canonical.example.com",
+						},
+					},
+				},
+			},
+			expectedHost: "router-canonical.example.com",
+		},
+		{
+			name: "When all host fields are empty, it should return empty string",
+			route: &routev1.Route{
+				Status: routev1.RouteStatus{
+					Ingress: []routev1.RouteIngress{
+						{},
+					},
+				},
+			},
+			expectedHost: "",
+		},
+		{
+			name:         "When Status.Ingress slice is empty, it should return empty string",
+			route:        &routev1.Route{},
+			expectedHost: "",
+		},
+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			g := NewWithT(t)
+			g.Expect(RouteHost(tc.route)).To(Equal(tc.expectedHost))
+		})
+	}
+}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a multi-ingress test case to lock expected fallback behavior.

TestRouteHost should include a case where Status.Ingress[0] is empty and Status.Ingress[1] has Host (or RouterCanonicalHostname). This guards against regressions in real multi-router status layouts.

This follows from the multi-ingress comment — since a route in HyperShift is admitted by exactly one router, there is no real scenario where Ingress[0] is empty but Ingress[1] has the host. Adding a test for that would codify fallback behavior we intentionally do not support and do not want to guarantee.

@sdminonne, that's a clear and valid constraint — since HyperShift routes are admitted by exactly one router, Ingress[0] being empty while Ingress[1] carries the host is not a real scenario, and locking in fallback behavior that isn't intended would be misleading. Happy to drop this suggestion.

✏️ Learnings added

Learnt from: sdminonne Repo: openshift/hypershift PR: 8546 File: support/netutil/route_test.go:472-559 Timestamp: 2026-05-19T18:45:39.602Z Learning: In HyperShift, a Route is admitted by exactly one router, so `Status.Ingress` will never have a meaningful host in a later index while `Ingress[0]` is empty. Multi-ingress fallback behavior in `RouteHost` (support/netutil/route.go) is intentionally not supported and should not be tested or codified.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2026-05-19T17:21:00Z

+	if len(route.Status.Ingress) > 0 {
+		if route.Status.Ingress[0].Host != "" {
+			return route.Status.Ingress[0].Host
+		}
+		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
+			return route.Status.Ingress[0].RouterCanonicalHostname
+		}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle multi-ingress routes instead of only index 0.

Line 163 assumes Status.Ingress[0] is the only useful entry. Routes can have multiple ingress records, so this may return empty even when a later entry has a usable host, which can stall readiness.

🔧 Suggested fix

 func RouteHost(route *routev1.Route) string {
 	if route == nil {
 		return ""
 	}
 	if route.Spec.Host != "" {
 		return route.Spec.Host
 	}
-	if len(route.Status.Ingress) > 0 {
-		if route.Status.Ingress[0].Host != "" {
-			return route.Status.Ingress[0].Host
-		}
-		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
-			return route.Status.Ingress[0].RouterCanonicalHostname
-		}
-	}
-	return ""
+	var canonicalFallback string
+	for _, ingress := range route.Status.Ingress {
+		if ingress.Host != "" {
+			return ingress.Host
+		}
+		if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
+			canonicalFallback = ingress.RouterCanonicalHostname
+		}
+	}
+	return canonicalFallback
 }

In HyperShift, a route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where Ingress[0] is empty but Ingress[1] contains the host. Multiple ingress entries represent different routers admitting the same route, and the first entry is always the relevant one for the controller that owns the route.

Iterating all entries would add complexity for a scenario that does not exist in this codebase and could mask bugs by silently falling through to an unrelated router's entry.

@sdminonne, thank you for the clear explanation of HyperShift's routing model. That's a strong argument — since a route in HyperShift is always admitted by exactly one router, Status.Ingress[0] is definitively the correct (and only) entry to inspect. Iterating further would indeed add unnecessary complexity and could silently mask bugs by falling through to an unrelated router's entry. I'll withdraw the suggestion.

✏️ Learnings added

Learnt from: sdminonne Repo: openshift/hypershift PR: 8546 File: support/netutil/route.go:163-169 Timestamp: 2026-05-19T18:45:30.452Z Learning: In HyperShift, a Route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where `Status.Ingress[0]` is empty but a later entry (e.g. `Status.Ingress[1]`) contains the host. Multiple ingress entries would represent different routers admitting the same route, but that does not occur in HyperShift. Therefore, reading only `Status.Ingress[0]` (as in `netutil.RouteHost` in `support/netutil/route.go`) is intentional and correct; iterating over all entries is unnecessary and could mask bugs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

sdminonne · 2026-05-19T18:46:19Z

/retest

sdminonne · 2026-05-19T18:46:26Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-19T21:13:37Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056808766044639232 | Cost: $6.988275500000003 | Failed step: hypershift-azure-run-e2e

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-20T07:06:44Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-20T08:51:24Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056995254766145536 | Cost: $2.7953644999999994 | Failed step: hypershift-azure-run-e2e

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-20T11:01:32Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-20T13:56:18Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057054221647220736 | Cost: $3.1325262499999997 | Failed step: hypershift-azure-run-e2e

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-20T15:32:37Z

/test e2e-aks

hypershift-jira-solve-ci · 2026-05-20T17:15:17Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057122430584885248 | Cost: $3.4875462500000003 | Failed step: hypershift-azure-run-e2e

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-21T08:30:36Z

/cc @csrwng

sdminonne · 2026-05-21T10:28:08Z

/test e2e-aks

sdminonne · 2026-05-21T10:29:06Z

/test e2e-aws

hypershift-jira-solve-ci · 2026-05-21T11:50:24Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057408115615207424 | Cost: $3.833694400000001 | Failed step: hypershift-azure-run-e2e

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

sdminonne · 2026-05-21T13:12:13Z

/test e2e-aws

sdminonne · 2026-05-21T13:12:24Z

/test e2e-aks

…gement clusters

On non-OpenShift management clusters (e.g. AKS), there is no default router to
populate route.Spec.Host. The shared ingress controller populates
Status.Ingress[0].RouterCanonicalHostname instead.

- Add RouteHost() helper in support/netutil with a fallback chain:
  Spec.Host → Status.Ingress[0].Host → RouterCanonicalHostname
- Use RouteHost() in oauth.ReconcileServiceStatus and
  kas.ReconcileKonnectivityServerServiceStatus
- Guard the shared ingress path in reconcileAPIServerServiceStatus with
  serviceStrategy.Type == Route to prevent empty apiHost when
  UseSharedIngress()=true but strategy is LoadBalancer
- Log all InfrastructureStatus fields when infrastructure is not ready
- Add unit tests for RouteHost, KonnectivityServerServiceStatus fallback,
  and the shared ingress guard with LoadBalancer strategy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Switch TestCreateClusterHABreakGlassCredentials from the cloud platform to
NonePlatform. The test validates control-plane PKI operations with
NodePoolReplicas=0 and an HA control plane, none of which require cloud
infrastructure. This reduces cost, execution time, and flakiness from cloud
API interactions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…econciliation

On NonePlatform clusters the test framework skips the full cluster-readiness
gate, so KAS may be serving before the control-plane-pki-operator has
reconciled the break-glass signer CA into the client-ca trust bundle.

Replace the single SSR call with PollUntilContextTimeout (1s interval, 5min
timeout) that retries on Unauthorized, matching the pattern already used by
validateRevocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sdminonne · 2026-05-21T13:44:15Z

/test e2e-aks

sdminonne · 2026-05-21T13:44:24Z

/test e2e-aws

Previously only Unauthorized errors were retried, while transient connection
errors (DNS resolution, connection refused, TLS handshake) were treated as
fatal and immediately aborted the poll. On NonePlatform clusters the KAS ELB
DNS may not have propagated when the test begins, causing "no such host"
failures that should be retried.

Treat all errors as retryable since the 5-minute PollUntilContextTimeout
already bounds the total retry window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sdminonne · 2026-05-21T20:43:05Z

/test e2e-aks

sdminonne · 2026-05-21T20:43:11Z

/test e2e-aws

openshift-ci · 2026-05-21T22:52:23Z

@sdminonne: all tests passed!

hypershift-jira-solve-ci · 2026-05-22T22:24:35Z

Test Failure Analysis Complete

Job Information

Prow Job: GitHub Actions — envtest-kube and envtest-ocp workflows
Build ID: Run 26247569163 (envtest-kube) and Run 26247569114 (envtest-ocp)
PR: #8546 — CNTRLPLANE-3430: Make HA break-glass credentials test infraless
Head SHA: b5ee079193f76c97b3512bbe125e31e68ed30f2d
Triggered: 2026-05-21T19:13:38Z
Cancelled: 2026-05-22T18:23:40Z (~23 hours later)

Test Failure Analysis

Error

Change detection failed: failure

Summary

Both envtest-kube and envtest-ocp failures are caused by an arc-runner-set (self-hosted Actions Runner Controller) infrastructure outage — not by any code change in PR #8546. All 8 GitHub Actions workflows triggered by commit b5ee079 were queued at 19:13 UTC on May 21 but no runner ever picked up the jobs. After ~23 hours the jobs were cancelled (likely by GitHub's stale-job reaper or an ARC controller reconciliation), and the Conclusion jobs interpreted the cancelled status of the upstream "Detect Changes" job as failure. The previous commit on this PR (1aa9c82) passed all 8 workflows successfully. Re-running the workflows should resolve the failures.

Root Cause

The root cause is a transient arc-runner-set self-hosted runner infrastructure issue that prevented runner pods from being provisioned or assigned to the queued jobs.

Failure chain:

Runner provisioning failure: All 8 workflows for commit b5ee079 were queued at 2026-05-21T19:13:38Z on arc-runner-set runners. No runner was ever assigned — every initial job (Detect Changes, Lint, Verify, etc.) sat in "queued → started" state for 23h10m without executing any steps ("steps": []).
Stale job cancellation: At 2026-05-22T18:23:40Z, all pending jobs were simultaneously cancelled. This is consistent with either GitHub's 24-hour stale check threshold, an ARC controller reconciliation that cleaned up orphaned runner requests, or a manual bulk cancellation.
Conclusion job logic: The envtest-kube and envtest-ocp Conclusion jobs use if: always() and check needs.changes.result. The Detect Changes jobs reported conclusion: "cancelled", which the Conclusion logic treats as a failure (if [ "$changes_result" != "success" ]). The envtest matrix jobs were skipped (never ran) since their needs: changes dependency never succeeded.
Not a code defect: The immediately prior commit (1aa9c82) — pushed just ~6 hours earlier on the same branch — passed all 8 workflows successfully (Lint in 10min, Unit Tests in 11min, envtests in <1min). The code is identical except for the final commit. Other PRs running the same workflows on arc-runner-set around the same period and afterward (May 22) completed successfully, confirming this was a transient infrastructure window.

Recommendations

Re-run all failed workflows: Push an empty commit or use the GitHub Actions "Re-run all jobs" button on each of the 8 failed runs. The arc-runner-set infrastructure has recovered (evidenced by successful runs on other PRs from May 22 onward).
No code changes needed: The code in PR #8546 is not implicated. The previous commit passed all CI, and the failure pattern (all 8 workflows cancelled simultaneously after 23h with zero steps executed) is pathognomonic of runner unavailability.
Consider adding a timeout-minutes to the Detect Changes job: Currently only the envtest-kube/envtest-ocp matrix jobs have timeout-minutes: 15. Adding a 10–15 minute timeout to the Detect Changes and Conclusion jobs would cause faster failure (instead of waiting 23 hours) when runners are unavailable, giving clearer signal.
Consider handling cancelled in the Conclusion job: The Conclusion job could treat changes_result == "cancelled" differently from a true failure, e.g., by providing a more descriptive error message like "Change detection was cancelled (possible runner issue) — please re-run."

Evidence

| Evidence | Detail |
| --- | --- |
| Failure pattern | All 8 GHA workflows for commit `b5ee079` failed identically: queued 23h10m, then cancelled simultaneously |
| Detect Changes job | `conclusion: "cancelled"`, `steps: []`; runner was never assigned, no step ever executed |
| Elapsed time | 23h 10m 01s (19:13:39 → 18:23:40 UTC), consistent with GitHub's stale-job cancellation threshold |
| Conclusion job log | `changes_result="failure"` → `"Change detection failed: failure"` → `exit 1` |
| Previous commit | `1aa9c82`: all 8 workflows passed successfully (Lint: 10min, Unit Tests: 11min, envtests: <1min) |
| Other PRs unaffected | `envtest-kube` runs on other PRs from May 22 (e.g., `ibmcloud-endpoints-field`, `fix-add-startup-probe-konnectivity-sock5-sidecar`) all succeeded |
| Runner type | `arc-runner-set`: self-hosted ARC (Actions Runner Controller) runners, subject to pod scheduling and cluster capacity |
| No newer push | Commit `b5ee079` is still HEAD on the PR branch; no superseding push triggered the cancellation |

sdminonne · 2026-05-23T10:30:29Z

/pipeline-required

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 19, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 19, 2026

openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 19, 2026

openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels May 19, 2026

coderabbitai Bot reviewed May 19, 2026


sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 97d69f9 to 1f3e9bc Compare May 21, 2026 07:37

hypershift-jira-solve-ci Bot mentioned this pull request May 21, 2026

OCPBUGS-85151: Re-enable serviceaccount-pull-secrets controller when registry managementState changes from Removed #8522

Open

4 tasks

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 1f3e9bc to 5a6f1f4 Compare May 21, 2026 08:21

openshift-ci Bot requested a review from csrwng May 21, 2026 08:30

hypershift-jira-solve-ci Bot mentioned this pull request May 21, 2026

OCPBUGS-77268: reconcile HCP when pull secret is unavailable #8352

Open

4 tasks

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 5a6f1f4 to 6b0822e Compare May 21, 2026 09:11

sdminonne marked this pull request as ready for review May 21, 2026 09:25

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026

openshift-ci Bot requested review from Nirshal and muraee May 21, 2026 09:25

sdminonne and others added 3 commits May 21, 2026 15:17

sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 5b2e6af to 1aa9c82 Compare May 21, 2026 13:18

-	if len(route.Status.Ingress) > 0 {
-		if route.Status.Ingress[0].Host != "" {
-			return route.Status.Ingress[0].Host
-		}
-		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
-			return route.Status.Ingress[0].RouterCanonicalHostname
-		}
+func RouteHost(route *routev1.Route) string {
+	if route == nil {
+		return ""
+	}
+	if route.Spec.Host != "" {
+		return route.Spec.Host
+	}
+	var canonicalFallback string
+	for _, ingress := range route.Status.Ingress {
+		if ingress.Host != "" {
+			return ingress.Host
+		}
+		if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
+			canonicalFallback = ingress.RouterCanonicalHostname
+		}
+	}
+	return canonicalFallback
+}
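To make the lookup order in the new `RouteHost` easier to see, here is a self-contained sketch of the same logic. The `Route`, `RouteSpec`, `RouteStatus`, and `RouteIngress` types below are hypothetical, trimmed-down stand-ins for the `routev1` types (only the fields the function reads are modeled), and the hostnames are invented for the example.

```go
package main

import "fmt"

// Hypothetical stand-ins for the routev1 types; only the fields that
// RouteHost reads are modeled here.
type RouteIngress struct {
	Host                    string
	RouterCanonicalHostname string
}

type RouteSpec struct{ Host string }

type RouteStatus struct{ Ingress []RouteIngress }

type Route struct {
	Spec   RouteSpec
	Status RouteStatus
}

// RouteHost follows the order in the diff above:
//  1. Spec.Host, if set;
//  2. the first non-empty Status.Ingress[].Host;
//  3. the first non-empty RouterCanonicalHostname, as a last resort.
func RouteHost(route *Route) string {
	if route == nil {
		return ""
	}
	if route.Spec.Host != "" {
		return route.Spec.Host
	}
	var canonicalFallback string
	for _, ingress := range route.Status.Ingress {
		if ingress.Host != "" {
			return ingress.Host
		}
		if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
			canonicalFallback = ingress.RouterCanonicalHostname
		}
	}
	return canonicalFallback
}

func main() {
	// Spec.Host is empty; the first ingress entry carries only a canonical
	// hostname, the second carries an admitted host.
	r := &Route{Status: RouteStatus{Ingress: []RouteIngress{
		{RouterCanonicalHostname: "router.example.test"},
		{Host: "api.example.test"},
	}}}
	fmt.Println(RouteHost(r)) // prints "api.example.test"
}
```

Unlike the removed code, which inspected only `Status.Ingress[0]`, this version scans every ingress entry, so an admitted host on a later entry takes precedence over a canonical hostname on an earlier one.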
