CNTRLPLANE-3430: Make HA break-glass credentials test infraless#8546

Open
sdminonne wants to merge 4 commits into
openshift:main from
sdminonne:CNTRLPLANE-3430/infraless-ha-break-glass-creds

Conversation

@sdminonne
Contributor

@sdminonne sdminonne commented May 19, 2026

What this PR does / why we need it:

Switches TestCreateClusterHABreakGlassCredentials from the cloud platform to NonePlatform
and fixes CPO bugs on non-OpenShift management clusters (AKS) that this change exposed.

This test validates control-plane PKI operations (CSR creation/approval/signing, certificate
revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and an HA control plane (3 KAS
replicas). None of these operations require cloud infrastructure.

The management cluster's cloud controller manager provisions a real LoadBalancer for the
kube-apiserver service regardless of the hosted cluster's platform type, so the guest API
server remains reachable. This follows the same pattern used by TestHAEtcdChaos.

Removing cloud infrastructure provisioning reduces cost, execution time, and flakiness from
cloud API interactions — particularly under resource pressure when multiple hosted clusters
compete for management cluster capacity (as seen in PR #8338's e2e-aks job).

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Special notes for your reviewer:

Why the CPO fixes (commit 1) are needed

TestHAEtcdChaos uses the same NonePlatform + HA + NodePoolReplicas=0 pattern but has
never run on AKS. It is restricted to the e2e-azure-self-managed job, which uses an
OpenShift management cluster and has an explicit test filter
(openshift-hypershift-main.yaml):

CI_TESTS_RUN: "^(TestCreateCluster$|TestAutoscaling|TestNodePool|TestUpgradeControlPlane|TestHAEtcdChaos|TestAzureOAuthLoadBalancer|TestAzurePrivateTopology)$"

TestCreateClusterHABreakGlassCredentials is not in that filter. It runs in e2e-aks,
which has no CI_TESTS_RUN filter — all tests run, on an AKS (non-OpenShift) management
cluster.
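
To make the filter's effect concrete, here is a small sketch that evaluates the CI_TESTS_RUN pattern quoted above against test names. It assumes the value is applied as an ordinary Go regular expression to the top-level test name; `selectedByFilter` is a hypothetical helper and not part of the CI tooling.

```go
package main

import "regexp"

// ciTestsRun is the CI_TESTS_RUN value quoted above.
const ciTestsRun = `^(TestCreateCluster$|TestAutoscaling|TestNodePool|TestUpgradeControlPlane|TestHAEtcdChaos|TestAzureOAuthLoadBalancer|TestAzurePrivateTopology)$`

var ciTestsRunRE = regexp.MustCompile(ciTestsRun)

// selectedByFilter is a hypothetical helper: it reports whether a top-level
// test name matches the filter.
func selectedByFilter(testName string) bool {
	return ciTestsRunRE.MatchString(testName)
}
```

Under that assumption, `selectedByFilter("TestHAEtcdChaos")` is true, while `selectedByFilter("TestCreateClusterHABreakGlassCredentials")` is false: the `TestCreateCluster$` alternative is anchored at the end, so it does not match longer names that merely start with `TestCreateCluster`.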

On OpenShift management clusters the default router populates route.Spec.Host, so the CPO
bugs never manifest. On AKS, there is no default router — the shared ingress controller
populates Status.Ingress[0].RouterCanonicalHostname instead. The CPO was reading
route.Spec.Host directly, getting an empty string, and blocking infrastructure readiness
indefinitely.
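
A minimal sketch of the fallback order this PR describes (Spec.Host, then Status.Ingress[0].Host, then RouterCanonicalHostname). The types below are reduced stand-ins for the routev1 types so the snippet compiles on its own, and the host names in the usage note are made up for illustration.

```go
package main

// RouteIngress, RouteSpec, RouteStatus and Route are stand-ins for the
// routev1 types, trimmed to the fields discussed here (assumption: the real
// types carry many more fields).
type RouteIngress struct {
	Host                    string
	RouterCanonicalHostname string
}

type RouteSpec struct {
	Host string
}

type RouteStatus struct {
	Ingress []RouteIngress
}

type Route struct {
	Spec   RouteSpec
	Status RouteStatus
}

// RouteHost returns the first non-empty host in the order
// Spec.Host, Status.Ingress[0].Host, Status.Ingress[0].RouterCanonicalHostname.
// It returns "" for a nil route or when none of the fields is set.
func RouteHost(route *Route) string {
	if route == nil {
		return ""
	}
	if route.Spec.Host != "" {
		return route.Spec.Host
	}
	if len(route.Status.Ingress) > 0 {
		first := route.Status.Ingress[0]
		if first.Host != "" {
			return first.Host
		}
		return first.RouterCanonicalHostname
	}
	return ""
}
```

For an AKS-like route where only `Status.Ingress[0].RouterCanonicalHostname` is populated (say, the hypothetical `router.aks.example.com`), `RouteHost` returns that value instead of the empty `Spec.Host`.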

Additionally, unlike TestHAEtcdChaos (which only uses the management client to manipulate
etcd pods), the break-glass test calls WaitForGuestClient, which requires the CPO to
complete infrastructure readiness by resolving a host/port for KAS.

Without the CPO fixes, WaitForGuestClient times out on AKS.

Per-commit breakdown

Commit 1: fix(control-plane-operator): resolve route host on non-OpenShift management clusters

  • Add RouteHost() helper in support/netutil with a fallback chain (Spec.Host →
    Status.Ingress[0].Host → RouterCanonicalHostname) and use it in
    oauth.ReconcileServiceStatus and kas.ReconcileKonnectivityServerServiceStatus
  • Guard the shared ingress path in reconcileAPIServerServiceStatus with
    serviceStrategy.Type == Route to prevent empty apiHost when UseSharedIngress()=true
    but strategy is LoadBalancer
  • Log all InfrastructureStatus fields when infrastructure is not ready
  • Add unit tests for RouteHost, KonnectivityServerServiceStatus fallback, and the shared
    ingress guard with LoadBalancer strategy

Commit 2: test(e2e): make HA break-glass credentials test infraless

Switch the test to NonePlatform. Same pattern as TestHAEtcdChaos.

The framework handles NonePlatform correctly for this test:

  • before(): skips ValidatePublicCluster — OK, the test has explicit waits
  • after(): skips EnsureAdmissionPolicies — OK (OCPBUGS-61291)
  • RunTestControlPlanePKIOperatorBreakGlassCredentials: only skips KubeVirt and GCP

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 19, 2026
@openshift-ci-robot

openshift-ci-robot commented May 19, 2026

@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Switches TestCreateClusterHABreakGlassCredentials from the cloud platform to NonePlatform.

This test validates control-plane PKI operations (CSR creation/approval/signing, certificate
revocation, per-pod KAS cert reload) with NodePoolReplicas=0 and an HA control plane (3 KAS
replicas). None of these operations require cloud infrastructure.

The management cluster's cloud controller manager provisions a real LoadBalancer for the
kube-apiserver service regardless of the hosted cluster's platform type, so the guest API
server remains reachable. This follows the same pattern used by TestHAEtcdChaos.

Removing cloud infrastructure provisioning reduces cost, execution time, and flakiness from
cloud API interactions — particularly under resource pressure when multiple hosted clusters
compete for management cluster capacity (as seen in PR #8338's e2e-aks job).

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3430

Special notes for your reviewer:

Single-line change. The NonePlatform pattern is already proven by TestHAEtcdChaos
(test/e2e/chaos_test.go:32-66) which runs an HA control plane on NonePlatform with
NodePoolReplicas=0 and successfully calls WaitForGuestClient.

The framework handles NonePlatform correctly for this test:

  • before(): skips ValidatePublicCluster — OK, the test has explicit waits
  • after(): skips EnsureAdmissionPolicies — OK (OCPBUGS-61291)
  • after(): skips ValidateHostedClusterConditions — OK for NonePlatform
  • RunTestControlPlanePKIOperatorBreakGlassCredentials: only skips KubeVirt and GCP

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdminonne
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 19, 2026
@sdminonne
Contributor Author

/test e2e-aks

@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds support function netutil.RouteHost that chooses the best hostname from a Route (Spec.Host, then Status.Ingress[0].Host, then RouterCanonicalHostname). Reconciler code in OAuth and KAS now uses this helper to populate service status (setting host and port or returning an error when no host is found). Unit tests for RouteHost and additional route-resolution cases were added. A single e2e test was updated to pass hyperv1.NonePlatform and to wait for hosted control plane bootstrap before running break-glass checks.

Sequence Diagram(s)

sequenceDiagram
  participant Reconciler as ReconcileServiceStatus / ReconcileKonnectivityServerServiceStatus
  participant Route as routev1.Route
  participant RouteHost as netutil.RouteHost
  participant ServiceStatus as ServiceStatusUpdater
  Reconciler->>Route: read route resource
  Reconciler->>RouteHost: call RouteHost(route)
  RouteHost-->>Reconciler: host (or "")
  alt host != ""
    Reconciler->>ServiceStatus: set host and port (443 / RouteExternalPort)
  else host == ""
    Reconciler-->>ServiceStatus: set error message and return
  end
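
The two branches of the diagram can be sketched as a small pure function. This is only an illustration under assumptions: `serviceStatus`, `statusFromRouteHost`, and the wording of the message are hypothetical and do not reproduce the reconcilers' actual signatures.

```go
package main

import (
	"fmt"
	"time"
)

// serviceStatus is a hypothetical result type: either Host and Port are set,
// or Message explains why the host could not be resolved yet.
type serviceStatus struct {
	Host    string
	Port    int32
	Message string
}

// statusFromRouteHost mirrors the diagram's alt/else: a non-empty host yields
// host and port, an empty host yields only a message (wording is an assumption).
func statusFromRouteHost(host string, port int32, routeAge time.Duration) serviceStatus {
	if host == "" {
		return serviceStatus{
			Message: fmt.Sprintf("route has no host yet; %s since creation", routeAge.Round(time.Second)),
		}
	}
	return serviceStatus{Host: host, Port: port}
}
```

A caller would pass the result of `netutil.RouteHost(route)` as `host` and 443 (or the route's external port) as `port`, then treat a non-empty `Message` as "not ready yet".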

Suggested reviewers

  • muraee
  • Nirshal
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Microshift Test Compatibility | ⚠️ Warning | Test uses HighlyAvailable control plane (3 KAS replicas), which violates MicroShift's single-node architecture. No protective [Skipped:MicroShift] or [apigroup:...] labels present. | Add [Skipped:MicroShift] label to test name or guard with exutil.IsMicroShiftCluster() check, since HighlyAvailable control planes are unsupported on single-node MicroShift. |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: converting the HA break-glass credentials test to use NonePlatform instead of cloud platform. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR are stable and deterministic. TestRouteHost (7 cases) and new Route cases in TestReconcileServiceStatus (2 cases) use static, descriptive strings with no dynamic content. |
| Test Structure And Quality | ✅ Passed | Test meets quality standards: single responsibility, proper cleanup, explicit 20-min timeout, meaningful assertion messages, consistent patterns. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added. The PR modifies an existing test (TestCreateClusterHABreakGlassCredentials) and adds only a unit test (TestRouteHost), which is outside the scope of this check. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces no scheduling constraints. Changes are for route hostname resolution, infrastructure logging, and e2e test platform configuration only. |
| Ote Binary Stdout Contract | ✅ Passed | PR uses logr.Logger for all logging (framework-controlled). No fmt.Print/klog/stdout writes in process-level code. ValidateHostedClusterConditions is called within test bodies. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added. New unit tests use domain names, not IPv4 hardcoding, and have no external connectivity requirements. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@codecov

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 98.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 40.48%. Comparing base (36dfb1b) to head (1aa9c82).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...or/controllers/hostedcontrolplane/oauth/service.go | 66.66% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8546      +/-   ##
==========================================
+ Coverage   40.40%   40.48%   +0.07%     
==========================================
  Files         755      755              
  Lines       93235    93278      +43     
==========================================
+ Hits        37675    37762      +87     
+ Misses      52858    52812      -46     
- Partials     2702     2704       +2     
| Files with missing lines | Coverage Δ |
| --- | --- |
| ...ostedcontrolplane/hostedcontrolplane_controller.go | 45.59% <100.00%> (+0.56%) ⬆️ |
| ...ator/controllers/hostedcontrolplane/infra/infra.go | 53.61% <100.00%> (+3.06%) ⬆️ |
| ...ator/controllers/hostedcontrolplane/kas/service.go | 47.86% <100.00%> (+7.49%) ⬆️ |
| support/netutil/route.go | 72.00% <100.00%> (+3.81%) ⬆️ |
| ...or/controllers/hostedcontrolplane/oauth/service.go | 78.00% <66.66%> (+3.25%) ⬆️ |

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 34.47% <100.00%> (+0.03%) ⬆️ |
| cpo-hostedcontrolplane | 42.13% <97.14%> (+0.36%) ⬆️ |
| cpo-other | 40.31% <ø> (ø) |
| hypershift-operator | 50.72% <ø> (ø) |
| other | 31.54% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@sdminonne
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Full review triggered.

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056721417461829632 | Cost: $4.631663149999999 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot

cwbotbot commented May 19, 2026

Test Results

e2e-aws

e2e-aks

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels May 19, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/oauth/service.go`:
- Around line 95-98: The code currently calls netutil.RouteHost(route) and then
dereferences route.ObjectMeta.CreationTimestamp.Time which can panic if route is
nil; add an explicit nil check for route before calling netutil.RouteHost (e.g.,
if route == nil set message to a clear "OAuth service route is missing" or
similar including the duration since creation and return), otherwise proceed to
compute routeHost and use route.ObjectMeta.CreationTimestamp.Time; ensure you
update the branch that currently handles empty routeHost to avoid dereferencing
route when route is nil.

In `@support/netutil/route_test.go`:
- Around line 472-559: Add a test case in TestRouteHost that asserts RouteHost
prefers a non-empty host from a later ingress when the first ingress is empty:
create a routev1.Route with Status.Ingress containing two entries where
Ingress[0] is empty and Ingress[1].Host (or RouterCanonicalHostname) is set, and
expect RouteHost(route) to return that second ingress host; ensure the new case
references RouteHost and the routev1.Route.Status.Ingress slice so it guards
multi-ingress fallback behavior.

In `@support/netutil/route.go`:
- Around line 163-169: The current logic only inspects route.Status.Ingress[0]
and returns its Host or RouterCanonicalHostname, which misses later ingress
entries; update the code that references route.Status.Ingress[0] to iterate over
all entries in route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 815d35bd-ddc0-4a15-b370-ec61f7400400

📥 Commits

Reviewing files that changed from the base of the PR and between 880b44f and 4a99314.

📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/kas/service.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/service.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/service_test.go
  • support/netutil/route.go
  • support/netutil/route_test.go

Comment on lines +472 to +559
func TestRouteHost(t *testing.T) {
	testCases := []struct {
		name         string
		route        *routev1.Route
		expectedHost string
	}{
		{
			name:         "When route is nil, it should return empty string",
			route:        nil,
			expectedHost: "",
		},
		{
			name: "When Spec.Host is set, it should return Spec.Host",
			route: &routev1.Route{
				Spec: routev1.RouteSpec{
					Host: "oauth.example.com",
				},
			},
			expectedHost: "oauth.example.com",
		},
		{
			name: "When Spec.Host is set and Status.Ingress is also populated, it should prefer Spec.Host",
			route: &routev1.Route{
				Spec: routev1.RouteSpec{
					Host: "oauth.example.com",
				},
				Status: routev1.RouteStatus{
					Ingress: []routev1.RouteIngress{
						{
							Host:                    "ingress.example.com",
							RouterCanonicalHostname: "canonical.example.com",
						},
					},
				},
			},
			expectedHost: "oauth.example.com",
		},
		{
			name: "When Spec.Host is empty and Status.Ingress[0].Host is set, it should return Ingress Host",
			route: &routev1.Route{
				Status: routev1.RouteStatus{
					Ingress: []routev1.RouteIngress{
						{
							Host: "ingress.example.com",
						},
					},
				},
			},
			expectedHost: "ingress.example.com",
		},
		{
			name: "When Spec.Host and Ingress Host are empty but RouterCanonicalHostname is set, it should return RouterCanonicalHostname",
			route: &routev1.Route{
				Status: routev1.RouteStatus{
					Ingress: []routev1.RouteIngress{
						{
							RouterCanonicalHostname: "router-canonical.example.com",
						},
					},
				},
			},
			expectedHost: "router-canonical.example.com",
		},
		{
			name: "When all host fields are empty, it should return empty string",
			route: &routev1.Route{
				Status: routev1.RouteStatus{
					Ingress: []routev1.RouteIngress{
						{},
					},
				},
			},
			expectedHost: "",
		},
		{
			name:         "When Status.Ingress slice is empty, it should return empty string",
			route:        &routev1.Route{},
			expectedHost: "",
		},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			g := NewWithT(t)
			g.Expect(RouteHost(tc.route)).To(Equal(tc.expectedHost))
		})
	}
}
Contributor

@coderabbitai coderabbitai Bot May 19, 2026

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a multi-ingress test case to lock expected fallback behavior.

TestRouteHost should include a case where Status.Ingress[0] is empty and Status.Ingress[1] has Host (or RouterCanonicalHostname). This guards against regressions in real multi-router status layouts.

Contributor Author

This follows from the multi-ingress comment — since a route in HyperShift is admitted by exactly one router, there is no real scenario where Ingress[0] is empty but Ingress[1] has the host. Adding a test for that would codify fallback behavior we intentionally do not support and do not want to guarantee.

Contributor

@sdminonne, that's a clear and valid constraint — since HyperShift routes are admitted by exactly one router, Ingress[0] being empty while Ingress[1] carries the host is not a real scenario, and locking in fallback behavior that isn't intended would be misleading. Happy to drop this suggestion.


✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route_test.go:472-559
Timestamp: 2026-05-19T18:45:39.602Z
Learning: In HyperShift, a Route is admitted by exactly one router, so `Status.Ingress` will never have a meaningful host in a later index while `Ingress[0]` is empty. Multi-ingress fallback behavior in `RouteHost` (support/netutil/route.go) is intentionally not supported and should not be tested or codified.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread support/netutil/route.go
Comment on lines +163 to +169
	if len(route.Status.Ingress) > 0 {
		if route.Status.Ingress[0].Host != "" {
			return route.Status.Ingress[0].Host
		}
		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
			return route.Status.Ingress[0].RouterCanonicalHostname
		}
Contributor

@coderabbitai coderabbitai Bot May 19, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle multi-ingress routes instead of only index 0.

Line 163 assumes Status.Ingress[0] is the only useful entry. Routes can have multiple ingress records, so this may return empty even when a later entry has a usable host, which can stall readiness.

🔧 Suggested fix
 func RouteHost(route *routev1.Route) string {
 	if route == nil {
 		return ""
 	}
 	if route.Spec.Host != "" {
 		return route.Spec.Host
 	}
-	if len(route.Status.Ingress) > 0 {
-		if route.Status.Ingress[0].Host != "" {
-			return route.Status.Ingress[0].Host
-		}
-		if route.Status.Ingress[0].RouterCanonicalHostname != "" {
-			return route.Status.Ingress[0].RouterCanonicalHostname
-		}
-	}
-	return ""
+	var canonicalFallback string
+	for _, ingress := range route.Status.Ingress {
+		if ingress.Host != "" {
+			return ingress.Host
+		}
+		if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
+			canonicalFallback = ingress.RouterCanonicalHostname
+		}
+	}
+	return canonicalFallback
 }

Contributor Author

In HyperShift, a route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where Ingress[0] is empty but Ingress[1] contains the host. Multiple ingress entries represent different routers admitting the same route, and the first entry is always the relevant one for the controller that owns the route.

Iterating all entries would add complexity for a scenario that does not exist in this codebase and could mask bugs by silently falling through to an unrelated router's entry.

Contributor

@sdminonne, thank you for the clear explanation of HyperShift's routing model. That's a strong argument — since a route in HyperShift is always admitted by exactly one router, Status.Ingress[0] is definitively the correct (and only) entry to inspect. Iterating further would indeed add unnecessary complexity and could silently mask bugs by falling through to an unrelated router's entry. I'll withdraw the suggestion.


✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route.go:163-169
Timestamp: 2026-05-19T18:45:30.452Z
Learning: In HyperShift, a Route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where `Status.Ingress[0]` is empty but a later entry (e.g. `Status.Ingress[1]`) contains the host. Multiple ingress entries would represent different routers admitting the same route, but that does not occur in HyperShift. Therefore, reading only `Status.Ingress[0]` (as in `netutil.RouteHost` in `support/netutil/route.go`) is intentional and correct; iterating over all entries is unnecessary and could mask bugs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@sdminonne
Contributor Author

/retest

@sdminonne
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056808766044639232 | Cost: $6.988275500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2056995254766145536 | Cost: $2.7953644999999994 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057054221647220736 | Cost: $3.1325262499999997 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057122430584885248 | Cost: $3.4875462500000003 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 1f3e9bc to 5a6f1f4 Compare May 21, 2026 08:21
@sdminonne
Contributor Author

/cc @csrwng

@openshift-ci openshift-ci Bot requested a review from csrwng May 21, 2026 08:30
@sdminonne sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 5a6f1f4 to 6b0822e Compare May 21, 2026 09:11
@sdminonne sdminonne marked this pull request as ready for review May 21, 2026 09:25
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@openshift-ci openshift-ci Bot requested review from Nirshal and muraee May 21, 2026 09:25
@sdminonne
Contributor Author

/test e2e-aks

@sdminonne
Contributor Author

/test e2e-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057408115615207424 | Cost: $3.833694400000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Contributor Author

/test e2e-aws

@sdminonne
Contributor Author

/test e2e-aks

sdminonne and others added 3 commits May 21, 2026 15:17
…gement clusters

On non-OpenShift management clusters (e.g. AKS), there is no default
router to populate route.Spec.Host. The shared ingress controller
populates Status.Ingress[0].RouterCanonicalHostname instead.

- Add RouteHost() helper in support/netutil with a fallback chain:
  Spec.Host → Status.Ingress[0].Host → RouterCanonicalHostname
- Use RouteHost() in oauth.ReconcileServiceStatus and
  kas.ReconcileKonnectivityServerServiceStatus
- Guard the shared ingress path in reconcileAPIServerServiceStatus
  with serviceStrategy.Type == Route to prevent empty apiHost when
  UseSharedIngress()=true but strategy is LoadBalancer
- Log all InfrastructureStatus fields when infrastructure is not ready
- Add unit tests for RouteHost, KonnectivityServerServiceStatus
  fallback, and the shared ingress guard with LoadBalancer strategy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
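
The fallback chain described in this commit message can be illustrated with a minimal sketch. The `Route` and `RouteIngress` types below are simplified stand-ins for the OpenShift Route API types, and the signature and nil handling of the real `RouteHost()` helper in support/netutil are not shown in this thread, so both are assumptions. The hostnames are placeholders.

```go
package main

import "fmt"

// RouteIngress and Route are simplified, hypothetical stand-ins for the
// OpenShift Route API types; only the fields named in the commit message
// are modeled here.
type RouteIngress struct {
	Host                    string
	RouterCanonicalHostname string
}

type Route struct {
	Spec   struct{ Host string }
	Status struct{ Ingress []RouteIngress }
}

// RouteHost returns the first non-empty value in the chain
// Spec.Host -> Status.Ingress[0].Host -> Status.Ingress[0].RouterCanonicalHostname.
// It returns "" when none of them is populated.
func RouteHost(r *Route) string {
	if r == nil {
		return ""
	}
	if r.Spec.Host != "" {
		return r.Spec.Host
	}
	if len(r.Status.Ingress) == 0 {
		return ""
	}
	if h := r.Status.Ingress[0].Host; h != "" {
		return h
	}
	return r.Status.Ingress[0].RouterCanonicalHostname
}

func main() {
	// A route as it might look on a management cluster without a default
	// router: Spec.Host is empty and only the canonical hostname is set.
	aks := &Route{}
	aks.Status.Ingress = []RouteIngress{{RouterCanonicalHostname: "router.example.test"}}
	fmt.Println(RouteHost(aks))
}
```

Returning an empty string rather than an error lets a caller treat "no host yet" as a not-ready condition and retry on the next reconcile, though whether the real helper does this is not confirmed here.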
Switch TestCreateClusterHABreakGlassCredentials from the cloud platform
to NonePlatform. The test validates control-plane PKI operations with
NodePoolReplicas=0 and an HA control plane — none of which require
cloud infrastructure. This reduces cost, execution time, and flakiness
from cloud API interactions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…econciliation

On NonePlatform clusters the test framework skips the full
cluster-readiness gate, so KAS may be serving before the
control-plane-pki-operator has reconciled the break-glass signer CA
into the client-ca trust bundle. Replace the single SSR call with
PollUntilContextTimeout (1s interval, 5min timeout) that retries on
Unauthorized, matching the pattern already used by validateRevocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdminonne sdminonne force-pushed the CNTRLPLANE-3430/infraless-ha-break-glass-creds branch from 5b2e6af to 1aa9c82 Compare May 21, 2026 13:18
@sdminonne
Contributor Author

/test e2e-aks

@sdminonne
Contributor Author

/test e2e-aws

Previously only Unauthorized errors were retried, while transient
connection errors (DNS resolution, connection refused, TLS handshake)
were treated as fatal and immediately aborted the poll. On NonePlatform
clusters the KAS ELB DNS may not have propagated when the test begins,
causing "no such host" failures that should be retried.

Treat all errors as retryable since the 5-minute PollUntilContextTimeout
already bounds the total retry window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdminonne
Contributor Author

/test e2e-aks

@sdminonne
Contributor Author

/test e2e-aws

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@sdminonne: all tests passed!


@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 22, 2026

I now have all the evidence needed for the final report.

Test Failure Analysis Complete

Job Information

  • Prow Job: GitHub Actions — envtest-kube and envtest-ocp workflows
  • Build ID: Run 26247569163 (envtest-kube) and Run 26247569114 (envtest-ocp)
  • PR: #8546 CNTRLPLANE-3430: Make HA break-glass credentials test infraless
  • Head SHA: b5ee079193f76c97b3512bbe125e31e68ed30f2d
  • Triggered: 2026-05-21T19:13:38Z
  • Cancelled: 2026-05-22T18:23:40Z (~23 hours later)

Test Failure Analysis

Error

Change detection failed: failure

Summary

Both envtest-kube and envtest-ocp failures are caused by an arc-runner-set (self-hosted Actions Runner Controller) infrastructure outage — not by any code change in PR #8546. All 8 GitHub Actions workflows triggered by commit b5ee079 were queued at 19:13 UTC on May 21 but no runner ever picked up the jobs. After ~23 hours the jobs were cancelled (likely by GitHub's stale-job reaper or an ARC controller reconciliation), and the Conclusion jobs interpreted the cancelled status of the upstream "Detect Changes" job as failure. The previous commit on this PR (1aa9c82) passed all 8 workflows successfully. Re-running the workflows should resolve the failures.

Root Cause

The root cause is a transient arc-runner-set self-hosted runner infrastructure issue that prevented runner pods from being provisioned or assigned to the queued jobs.

Failure chain:

  1. Runner provisioning failure: All 8 workflows for commit b5ee079 were queued at 2026-05-21T19:13:38Z on arc-runner-set runners. No runner was ever assigned — every initial job (Detect Changes, Lint, Verify, etc.) sat in "queued → started" state for 23h10m without executing any steps ("steps": []).

  2. Stale job cancellation: At 2026-05-22T18:23:40Z, all pending jobs were simultaneously cancelled. This is consistent with either GitHub's 24-hour stale check threshold, an ARC controller reconciliation that cleaned up orphaned runner requests, or a manual bulk cancellation.

  3. Conclusion job logic: The envtest-kube and envtest-ocp Conclusion jobs use if: always() and check needs.changes.result. The Detect Changes jobs reported conclusion: "cancelled", which the Conclusion logic treats as a failure (if [ "$changes_result" != "success" ]). The envtest matrix jobs were skipped (never ran) since their needs: changes dependency never succeeded.

  4. Not a code defect: The immediately prior commit (1aa9c82) — pushed just ~6 hours earlier on the same branch — passed all 8 workflows successfully (Lint in 10min, Unit Tests in 11min, envtests in <1min). The code is identical except for the final commit. Other PRs running the same workflows on arc-runner-set around the same period and afterward (May 22) completed successfully, confirming this was a transient infrastructure window.

Recommendations

  1. Re-run all failed workflows: Push an empty commit or use the GitHub Actions "Re-run all jobs" button on each of the 8 failed runs. The arc-runner-set infrastructure has recovered (evidenced by successful runs on other PRs from May 22 onward).

  2. No code changes needed: PR #8546's code is not implicated. The previous commit passed all CI, and the failure pattern (all 8 workflows cancelled simultaneously after 23h with zero steps executed) is characteristic of runner unavailability.

  3. Consider adding a timeout-minutes to the Detect Changes job: Currently only the envtest-kube/envtest-ocp matrix jobs have timeout-minutes: 15. Adding a 10–15 minute timeout to the Detect Changes and Conclusion jobs would cause faster failure (instead of waiting 23 hours) when runners are unavailable, giving clearer signal.

  4. Consider handling cancelled in the Conclusion job: The Conclusion job could treat changes_result == "cancelled" differently from a true failure, e.g., by providing a more descriptive error message like "Change detection was cancelled (possible runner issue) — please re-run."
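
The distinction suggested in recommendation 4 can be modeled with a small sketch. The actual Conclusion job is a shell step inside a GitHub Actions workflow; the Go function below only models its decision logic, and the name `conclude` and the exact messages are assumptions made for illustration.

```go
package main

import "fmt"

// conclude maps the result of the upstream "changes" job (as exposed by
// needs.changes.result) to a log message and an exit code. A cancelled
// upstream job still fails the check, but with a message that points at a
// likely runner problem instead of a generic failure.
func conclude(changesResult string) (string, int) {
	switch changesResult {
	case "success":
		return "Change detection succeeded", 0
	case "cancelled":
		return "Change detection was cancelled (possible runner issue); please re-run", 1
	default:
		return "Change detection failed: " + changesResult, 1
	}
}

func main() {
	for _, result := range []string{"success", "cancelled", "failure"} {
		msg, code := conclude(result)
		fmt.Printf("%-10s exit=%d %s\n", result, code, msg)
	}
}
```

Whether a cancelled upstream job should still exit non-zero is a policy choice; this sketch keeps the non-zero exit so that a skipped test matrix never reports green.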

Evidence
| Evidence | Detail |
| --- | --- |
| Failure pattern | All 8 GHA workflows for commit b5ee079 failed identically: queued 23h10m, then cancelled simultaneously |
| Detect Changes job | conclusion: "cancelled", steps: []; the runner was never assigned and no step ever executed |
| Elapsed time | 23h 10m 01s (19:13:39 → 18:23:40 UTC), consistent with GitHub's stale-job cancellation threshold |
| Conclusion job log | changes_result="failure", then "Change detection failed: failure", then exit 1 |
| Previous commit | 1aa9c82: all 8 workflows passed successfully (Lint: 10min, Unit Tests: 11min, envtests: <1min) |
| Other PRs unaffected | envtest-kube runs on other PRs from May 22 (e.g., ibmcloud-endpoints-field, fix-add-startup-probe-konnectivity-sock5-sidecar) all succeeded |
| Runner type | arc-runner-set: self-hosted ARC (Actions Runner Controller) runners, subject to pod scheduling and cluster capacity |
| No newer push | Commit b5ee079 is still HEAD on the PR branch; no superseding push triggered the cancellation |

@sdminonne
Contributor Author

/pipeline-required


Labels

area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
area/testing: Indicates the PR includes changes for e2e testing
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

3 participants