CNTRLPLANE-3430: Make HA break-glass credentials test infraless#8546
CNTRLPLANE-3430: Make HA break-glass credentials test infraless#8546sdminonne wants to merge 4 commits into
Conversation
|
@sdminonne: This pull request references CNTRLPLANE-3430 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sdminonne The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test e2e-aks |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds support function netutil.RouteHost that chooses the best hostname from a Route (Spec.Host, then Status.Ingress[0].Host, then RouterCanonicalHostname). Reconciler code in OAuth and KAS now uses this helper to populate service status (setting host and port or returning an error when no host is found). Unit tests for RouteHost and additional route-resolution cases were added. A single e2e test was updated to pass hyperv1.NonePlatform and to wait for hosted control plane bootstrap before running break-glass checks. Sequence Diagram(s)sequenceDiagram
participant Reconciler as ReconcileServiceStatus / ReconcileKonnectivityServerServiceStatus
participant Route as routev1.Route
participant RouteHost as netutil.RouteHost
participant ServiceStatus as ServiceStatusUpdater
Reconciler->>Route: read route resource
Reconciler->>RouteHost: call RouteHost(route)
RouteHost-->>Reconciler: host (or "")
alt host != ""
Reconciler->>ServiceStatus: set host and port (443 / RouteExternalPort)
else host == ""
Reconciler-->>ServiceStatus: set error message and return
end
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8546 +/- ##
==========================================
+ Coverage 40.40% 40.48% +0.07%
==========================================
Files 755 755
Lines 93235 93278 +43
==========================================
+ Hits 37675 37762 +87
+ Misses 52858 52812 -46
- Partials 2702 2704 +2
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
Test Resultse2e-aws
e2e-aks
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/oauth/service.go`:
- Around line 95-98: The code currently calls netutil.RouteHost(route) and then
dereferences route.ObjectMeta.CreationTimestamp.Time which can panic if route is
nil; add an explicit nil check for route before calling netutil.RouteHost (e.g.,
if route == nil set message to a clear "OAuth service route is missing" or
similar including the duration since creation and return), otherwise proceed to
compute routeHost and use route.ObjectMeta.CreationTimestamp.Time; ensure you
update the branch that currently handles empty routeHost to avoid dereferencing
route when route is nil.
In `@support/netutil/route_test.go`:
- Around line 472-559: Add a test case in TestRouteHost that asserts RouteHost
prefers a non-empty host from a later ingress when the first ingress is empty:
create a routev1.Route with Status.Ingress containing two entries where
Ingress[0] is empty and Ingress[1].Host (or RouterCanonicalHostname) is set, and
expect RouteHost(route) to return that second ingress host; ensure the new case
references RouteHost and the routev1.Route.Status.Ingress slice so it guards
multi-ingress fallback behavior.
In `@support/netutil/route.go`:
- Around line 163-169: The current logic only inspects route.Status.Ingress[0]
and returns its Host or RouterCanonicalHostname, which misses later ingress
entries; update the code that references route.Status.Ingress[0] to iterate over
all entries in route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 815d35bd-ddc0-4a15-b370-ec61f7400400
📒 Files selected for processing (5)
control-plane-operator/controllers/hostedcontrolplane/kas/service.gocontrol-plane-operator/controllers/hostedcontrolplane/oauth/service.gocontrol-plane-operator/controllers/hostedcontrolplane/oauth/service_test.gosupport/netutil/route.gosupport/netutil/route_test.go
| func TestRouteHost(t *testing.T) { | ||
| testCases := []struct { | ||
| name string | ||
| route *routev1.Route | ||
| expectedHost string | ||
| }{ | ||
| { | ||
| name: "When route is nil, it should return empty string", | ||
| route: nil, | ||
| expectedHost: "", | ||
| }, | ||
| { | ||
| name: "When Spec.Host is set, it should return Spec.Host", | ||
| route: &routev1.Route{ | ||
| Spec: routev1.RouteSpec{ | ||
| Host: "oauth.example.com", | ||
| }, | ||
| }, | ||
| expectedHost: "oauth.example.com", | ||
| }, | ||
| { | ||
| name: "When Spec.Host is set and Status.Ingress is also populated, it should prefer Spec.Host", | ||
| route: &routev1.Route{ | ||
| Spec: routev1.RouteSpec{ | ||
| Host: "oauth.example.com", | ||
| }, | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| { | ||
| Host: "ingress.example.com", | ||
| RouterCanonicalHostname: "canonical.example.com", | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "oauth.example.com", | ||
| }, | ||
| { | ||
| name: "When Spec.Host is empty and Status.Ingress[0].Host is set, it should return Ingress Host", | ||
| route: &routev1.Route{ | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| { | ||
| Host: "ingress.example.com", | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "ingress.example.com", | ||
| }, | ||
| { | ||
| name: "When Spec.Host and Ingress Host are empty but RouterCanonicalHostname is set, it should return RouterCanonicalHostname", | ||
| route: &routev1.Route{ | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| { | ||
| RouterCanonicalHostname: "router-canonical.example.com", | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "router-canonical.example.com", | ||
| }, | ||
| { | ||
| name: "When all host fields are empty, it should return empty string", | ||
| route: &routev1.Route{ | ||
| Status: routev1.RouteStatus{ | ||
| Ingress: []routev1.RouteIngress{ | ||
| {}, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedHost: "", | ||
| }, | ||
| { | ||
| name: "When Status.Ingress slice is empty, it should return empty string", | ||
| route: &routev1.Route{}, | ||
| expectedHost: "", | ||
| }, | ||
| } | ||
|
|
||
| for _, tc := range testCases { | ||
| t.Run(tc.name, func(t *testing.T) { | ||
| g := NewWithT(t) | ||
| g.Expect(RouteHost(tc.route)).To(Equal(tc.expectedHost)) | ||
| }) | ||
| } | ||
| } |
There was a problem hiding this comment.
Add a multi-ingress test case to lock expected fallback behavior.
TestRouteHost should include a case where Status.Ingress[0] is empty and Status.Ingress[1] has Host (or RouterCanonicalHostname). This guards against regressions in real multi-router status layouts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@support/netutil/route_test.go` around lines 472 - 559, Add a test case in
TestRouteHost that asserts RouteHost prefers a non-empty host from a later
ingress when the first ingress is empty: create a routev1.Route with
Status.Ingress containing two entries where Ingress[0] is empty and
Ingress[1].Host (or RouterCanonicalHostname) is set, and expect RouteHost(route)
to return that second ingress host; ensure the new case references RouteHost and
the routev1.Route.Status.Ingress slice so it guards multi-ingress fallback
behavior.
There was a problem hiding this comment.
This follows from the multi-ingress comment — since a route in HyperShift is admitted by exactly one router, there is no real scenario where Ingress[0] is empty but Ingress[1] has the host. Adding a test for that would codify fallback behavior we intentionally do not support and do not want to guarantee.
There was a problem hiding this comment.
@sdminonne, that's a clear and valid constraint — since HyperShift routes are admitted by exactly one router, Ingress[0] being empty while Ingress[1] carries the host is not a real scenario, and locking in fallback behavior that isn't intended would be misleading. Happy to drop this suggestion.
✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route_test.go:472-559
Timestamp: 2026-05-19T18:45:39.602Z
Learning: In HyperShift, a Route is admitted by exactly one router, so `Status.Ingress` will never have a meaningful host in a later index while `Ingress[0]` is empty. Multi-ingress fallback behavior in `RouteHost` (support/netutil/route.go) is intentionally not supported and should not be tested or codified.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
| if len(route.Status.Ingress) > 0 { | ||
| if route.Status.Ingress[0].Host != "" { | ||
| return route.Status.Ingress[0].Host | ||
| } | ||
| if route.Status.Ingress[0].RouterCanonicalHostname != "" { | ||
| return route.Status.Ingress[0].RouterCanonicalHostname | ||
| } |
There was a problem hiding this comment.
Handle multi-ingress routes instead of only index 0.
Line 163 assumes Status.Ingress[0] is the only useful entry. Routes can have multiple ingress records, so this may return empty even when a later entry has a usable host, which can stall readiness.
🔧 Suggested fix
func RouteHost(route *routev1.Route) string {
if route == nil {
return ""
}
if route.Spec.Host != "" {
return route.Spec.Host
}
- if len(route.Status.Ingress) > 0 {
- if route.Status.Ingress[0].Host != "" {
- return route.Status.Ingress[0].Host
- }
- if route.Status.Ingress[0].RouterCanonicalHostname != "" {
- return route.Status.Ingress[0].RouterCanonicalHostname
- }
- }
- return ""
+ var canonicalFallback string
+ for _, ingress := range route.Status.Ingress {
+ if ingress.Host != "" {
+ return ingress.Host
+ }
+ if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" {
+ canonicalFallback = ingress.RouterCanonicalHostname
+ }
+ }
+ return canonicalFallback
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if len(route.Status.Ingress) > 0 { | |
| if route.Status.Ingress[0].Host != "" { | |
| return route.Status.Ingress[0].Host | |
| } | |
| if route.Status.Ingress[0].RouterCanonicalHostname != "" { | |
| return route.Status.Ingress[0].RouterCanonicalHostname | |
| } | |
| func RouteHost(route *routev1.Route) string { | |
| if route == nil { | |
| return "" | |
| } | |
| if route.Spec.Host != "" { | |
| return route.Spec.Host | |
| } | |
| var canonicalFallback string | |
| for _, ingress := range route.Status.Ingress { | |
| if ingress.Host != "" { | |
| return ingress.Host | |
| } | |
| if canonicalFallback == "" && ingress.RouterCanonicalHostname != "" { | |
| canonicalFallback = ingress.RouterCanonicalHostname | |
| } | |
| } | |
| return canonicalFallback | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@support/netutil/route.go` around lines 163 - 169, The current logic only
inspects route.Status.Ingress[0] and returns its Host or
RouterCanonicalHostname, which misses later ingress entries; update the code
that references route.Status.Ingress[0] to iterate over all entries in
route.Status.Ingress, checking each entry's Host first and then
RouterCanonicalHostname, and return the first non-empty value found (falling
back to empty/no-host if none exist); ensure you reference the same symbols
(route.Status.Ingress, Host, RouterCanonicalHostname) so the loop replaces the
existing index-0 checks.
There was a problem hiding this comment.
In HyperShift, a route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where Ingress[0] is empty but Ingress[1] contains the host. Multiple ingress entries represent different routers admitting the same route, and the first entry is always the relevant one for the controller that owns the route.
Iterating all entries would add complexity for a scenario that does not exist in this codebase and could mask bugs by silently falling through to an unrelated router's entry.
There was a problem hiding this comment.
@sdminonne, thank you for the clear explanation of HyperShift's routing model. That's a strong argument — since a route in HyperShift is always admitted by exactly one router, Status.Ingress[0] is definitively the correct (and only) entry to inspect. Iterating further would indeed add unnecessary complexity and could silently mask bugs by falling through to an unrelated router's entry. I'll withdraw the suggestion.
✏️ Learnings added
Learnt from: sdminonne
Repo: openshift/hypershift PR: 8546
File: support/netutil/route.go:163-169
Timestamp: 2026-05-19T18:45:30.452Z
Learning: In HyperShift, a Route is admitted by exactly one router — either the OpenShift default router or the HyperShift shared ingress controller. There is no scenario where `Status.Ingress[0]` is empty but a later entry (e.g. `Status.Ingress[1]`) contains the host. Multiple ingress entries would represent different routers admitting the same route, but that does not occur in HyperShift. Therefore, reading only `Status.Ingress[0]` (as in `netutil.RouteHost` in `support/netutil/route.go`) is intentional and correct; iterating over all entries is unnecessary and could mask bugs.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
|
/retest |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
97d69f9 to
1f3e9bc
Compare
1f3e9bc to
5a6f1f4
Compare
|
/cc @csrwng |
5a6f1f4 to
6b0822e
Compare
|
/test e2e-aks |
|
/test e2e-aws |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aws |
|
/test e2e-aks |
…gement clusters On non-OpenShift management clusters (e.g. AKS), there is no default router to populate route.Spec.Host. The shared ingress controller populates Status.Ingress[0].RouterCanonicalHostname instead. - Add RouteHost() helper in support/netutil with a fallback chain: Spec.Host → Status.Ingress[0].Host → RouterCanonicalHostname - Use RouteHost() in oauth.ReconcileServiceStatus and kas.ReconcileKonnectivityServerServiceStatus - Guard the shared ingress path in reconcileAPIServerServiceStatus with serviceStrategy.Type == Route to prevent empty apiHost when UseSharedIngress()=true but strategy is LoadBalancer - Log all InfrastructureStatus fields when infrastructure is not ready - Add unit tests for RouteHost, KonnectivityServerServiceStatus fallback, and the shared ingress guard with LoadBalancer strategy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch TestCreateClusterHABreakGlassCredentials from the cloud platform to NonePlatform. The test validates control-plane PKI operations with NodePoolReplicas=0 and an HA control plane — none of which require cloud infrastructure. This reduces cost, execution time, and flakiness from cloud API interactions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…econciliation On NonePlatform clusters the test framework skips the full cluster-readiness gate, so KAS may be serving before the control-plane-pki-operator has reconciled the break-glass signer CA into the client-ca trust bundle. Replace the single SSR call with PollUntilContextTimeout (1s interval, 5min timeout) that retries on Unauthorized, matching the pattern already used by validateRevocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5b2e6af to
1aa9c82
Compare
|
/test e2e-aks |
|
/test e2e-aws |
Previously only Unauthorized errors were retried, while transient connection errors (DNS resolution, connection refused, TLS handshake) were treated as fatal and immediately aborted the poll. On NonePlatform clusters the KAS ELB DNS may not have propagated when the test begins, causing "no such host" failures that should be retried. Treat all errors as retryable since the 5-minute PollUntilContextTimeout already bounds the total retry window. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/test e2e-aks |
|
/test e2e-aws |
|
@sdminonne: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
I now have all the evidence needed for the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth Root CauseThe root cause is a transient Failure chain:
Recommendations
Evidence
|
|
/pipeline-required |
What this PR does / why we need it:
Switches
TestCreateClusterHABreakGlassCredentialsfrom the cloud platform toNonePlatformand fixes CPO bugs on non-OpenShift management clusters (AKS) that this change exposed.
This test validates control-plane PKI operations (CSR creation/approval/signing, certificate
revocation, per-pod KAS cert reload) with
NodePoolReplicas=0and an HA control plane (3 KASreplicas). None of these operations require cloud infrastructure.
The management cluster's cloud controller manager provisions a real LoadBalancer for the
kube-apiserver service regardless of the hosted cluster's platform type, so the guest API
server remains reachable. This follows the same pattern used by
TestHAEtcdChaos.Removing cloud infrastructure provisioning reduces cost, execution time, and flakiness from
cloud API interactions — particularly under resource pressure when multiple hosted clusters
compete for management cluster capacity (as seen in PR #8338's e2e-aks job).
Which issue(s) this PR fixes:
Fixes CNTRLPLANE-3430
Special notes for your reviewer:
Why the CPO fixes (commit 1) are needed
TestHAEtcdChaosuses the sameNonePlatform+ HA +NodePoolReplicas=0pattern but hasnever run on AKS. It is restricted to the
e2e-azure-self-managedjob, which uses anOpenShift management cluster and has an explicit test filter
(openshift-hypershift-main.yaml):
TestCreateClusterHABreakGlassCredentialsis not in that filter. It runs ine2e-aks,which has no
CI_TESTS_RUNfilter — all tests run, on an AKS (non-OpenShift) managementcluster.
On OpenShift management clusters the default router populates
route.Spec.Host, so the CPObugs never manifest. On AKS, there is no default router — the shared ingress controller
populates
Status.Ingress[0].RouterCanonicalHostnameinstead. The CPO was readingroute.Spec.Hostdirectly, getting an empty string, and blocking infrastructure readinessindefinitely.
Additionally, unlike
TestHAEtcdChaos(which only uses the management client to manipulateetcd pods), the break-glass test calls
WaitForGuestClient, which requires the CPO tocomplete infrastructure readiness by resolving a host/port for KAS.
Without the CPO fixes,
WaitForGuestClienttimes out on AKS.Per-commit breakdown
Commit 1 —
fix(control-plane-operator): resolve route host on non-OpenShift management clustersRouteHost()helper insupport/netutilwith a fallback chain (Spec.Host→Status.Ingress[0].Host→RouterCanonicalHostname) and use it inoauth.ReconcileServiceStatusandkas.ReconcileKonnectivityServerServiceStatusreconcileAPIServerServiceStatuswithserviceStrategy.Type == Routeto prevent emptyapiHostwhenUseSharedIngress()=truebut strategy is
LoadBalancerInfrastructureStatusfields when infrastructure is not readyRouteHost,KonnectivityServerServiceStatusfallback, and the sharedingress guard with
LoadBalancerstrategyCommit 2 —
test(e2e): make HA break-glass credentials test infralessSwitch the test to
NonePlatform. Same pattern asTestHAEtcdChaos.The framework handles
NonePlatformcorrectly for this test:before(): skipsValidatePublicCluster— OK, the test has explicit waitsafter(): skipsEnsureAdmissionPolicies— OK (OCPBUGS-61291)RunTestControlPlanePKIOperatorBreakGlassCredentials: only skips KubeVirt and GCPChecklist:
🤖 Generated with Claude Code