CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator #7749

Open
enxebre wants to merge 3 commits into openshift:main from enxebre:fix-CNTRLPLANE-2775

Conversation

@enxebre
Member

@enxebre enxebre commented Feb 19, 2026

Description

Instruments the existing healthCheckKASEndpoint() function in the control-plane-operator to expose two new Prometheus metrics for KAS health monitoring:

| Metric | Type | Description |
|---|---|---|
| hypershift_kube_apiserver_available | Gauge | 1 if /healthz returns HTTP 200, 0 otherwise |
| hypershift_kube_apiserver_request_duration_seconds | Histogram | Latency of the /healthz probe (buckets: 0.01–10s) |

Why

HCP offerings (ROSA HCP, ARO HCP) need to monitor customer API endpoint availability and latency for SLA purposes. ROSA HCP currently relies on an external tool (route-monitor-operator) solely for this. These native metrics eliminate that dependency.

How

  • Metrics are registered with the controller-runtime metrics registry and automatically scraped by the existing PodMonitor for the CPO — no new monitoring infrastructure required
  • Each CPO pod runs in its own HCP namespace, so metrics are naturally scoped per hosted cluster
  • The existing HostedControlPlaneAvailable condition logic is unchanged — metrics are a side-effect, not a replacement
  • Works across all endpoint topologies: private clusters, public with Route, public with LoadBalancer, shared ingress (ARO HCP / ROSA HCP)

Key files

  • control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go — metric definitions and registration
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go — instrumented health check
  • control-plane-operator/main.go — metrics initialization

Testing

  • Unit tests verify gauge and histogram are correctly set for success (200), failure (503), and unreachable scenarios
  • E2e test validates metric presence on the CPO pod using GetMetricsFromPod/ValidateMetricPresence (Karpenter pattern)
  • All existing tests pass (make test exits 0)

Jira

CNTRLPLANE-2775

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-2775](https://issues.redhat.com/browse/CNTRLPLANE-2775)

Summary by CodeRabbit

  • New Features

    • Added KAS health metrics (availability gauge and request-duration histogram) and registered them for collection.
  • Tests

    • Added unit tests for KAS health metrics and end-to-end validation to check metrics are emitted during runs.
  • Chores

    • Metrics instrumentation initialized at startup so controller exposes KAS health metrics for monitoring.

enxebre and others added 2 commits February 19, 2026 13:12
Instrument the existing healthCheckKASEndpoint function in the
control-plane-operator to expose two new Prometheus metrics:

- hypershift_kube_apiserver_available: Gauge reporting 1 when the KAS
  /healthz endpoint returns HTTP 200, 0 otherwise.
- hypershift_kube_apiserver_request_duration_seconds: Histogram tracking
  the latency of the /healthz health check probe.

These metrics are registered with the controller-runtime metrics registry
and are automatically scraped by the existing PodMonitor for the
control-plane-operator. Each CPO pod runs in its own HCP namespace, so
metrics are naturally scoped per hosted cluster. The existing
HostedControlPlaneAvailable condition logic is unchanged — metrics are a
side-effect of the existing health check, not a replacement.

This eliminates the need for external monitoring tooling (e.g.
route-monitor-operator) to track KAS availability and latency for SLA
purposes in HCP offerings.

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add tests for the healthCheckKASEndpoint function that verify metrics
are correctly recorded during health check probes:

- Gauge set to 1 and histogram observed on successful 200 response
- Gauge set to 0 on non-200 response (503)
- Gauge set to 0 on connection error (unreachable endpoint)
- No panic when metrics is nil (backward compatibility)

Also add a basic test for KASHealthMetrics construction and registration.

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 19, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 19, 2026

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 19, 2026

No actionable comments were generated in the recent review. 🎉


Walkthrough

Adds Prometheus metrics for KAS health (availability and request duration), threads an optional KASHealthMetrics through KAS health checks, initializes metrics at startup, and adds unit and E2E tests and helpers to validate metrics exposure. Some test and helper additions are duplicated in the diff.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Controller instrumentation: control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go | Added HostedControlPlaneReconciler.KASHealthMetrics *kas.KASHealthMetrics; changed healthCheckKASEndpoint to accept m *kas.KASHealthMetrics; record request duration and set availability (guarded by nil check). |
| KAS metrics implementation: control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go | New metrics: KASAvailableMetricName, KASRequestDurationMetricName, KASRequestDurationBuckets; KASHealthMetrics struct with Available gauge and RequestDuration histogram; NewKASHealthMetrics() registering metrics. |
| Controller unit tests: control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go | Added TestHealthCheckKASEndpointMetrics with subtests (200, 503, unreachable, nil-metrics), helpers newTestKASHealthMetrics and parseHostPort. Note: test and helpers appear duplicated in the diff; inspect for unintended repeats. |
| KAS metrics unit tests: control-plane-operator/controllers/hostedcontrolplane/kas/metrics_test.go | New test validating metric registration, gauge initial state, gauge update, and histogram observation via a Prometheus registry. |
| Startup wiring: control-plane-operator/main.go | Instantiates kas.NewKASHealthMetrics() and injects it into HostedControlPlaneReconciler during startup/setup. |
| E2E helpers & integration: test/e2e/util/hypershift_framework.go, test/e2e/util/util.go | Adds ValidateCPOMetrics E2E helper and invokes it in after-phase; helper polls control-plane-operator metrics for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName. Note: ValidateCPOMetrics appears duplicated in the diff; verify and dedupe. |

Sequence Diagram(s)

sequenceDiagram
  participant Tests as Tests/E2E
  participant Reconciler as HostedControlPlaneReconciler
  participant KAS as KAS/API
  participant Metrics as Prometheus/Registry

  Tests->>Reconciler: trigger health check
  Reconciler->>KAS: HTTP request to ingress (healthCheckKASEndpoint with m)
  activate KAS
  KAS-->>Reconciler: HTTP response (200/503/timeout)
  deactivate KAS
  alt m != nil
    Reconciler->>Metrics: observe RequestDuration
    Reconciler->>Metrics: set Available = 1 or 0
  end
  Tests->>Metrics: query metrics endpoint to validate metrics present/values

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | PR contains duplicated test implementations in hostedcontrolplane_controller_test.go and util.go, lacks meaningful assertion messages, and has insufficient timeout safeguards in E2E polling operations. | Remove duplicate test functions, add descriptive failure messages to all assertions, and ensure explicit timeout configuration for all polling operations. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main objective: exposing KAS availability and latency metrics from the control-plane-operator, which aligns with the core implementation across all modified files. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR use stable, deterministic naming with no dynamic content, formatted strings, or variable substitution. |


@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Feb 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@enxebre
Member Author

enxebre commented Feb 19, 2026

/cc @muraee @csrwng

@openshift-ci openshift-ci Bot requested review from csrwng and muraee February 19, 2026 12:36
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Feb 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Feb 19, 2026
@enxebre
Member Author

enxebre commented Feb 19, 2026

/auto-cc

@openshift-ci openshift-ci Bot requested review from jparrill and sjenning February 19, 2026 12:41
@typeid
Member

typeid commented Feb 19, 2026

Just to clarify a bit further, while these metrics are great, in ROSA HCP we intend to probe KAS availability externally via RHOBS synthetic monitoring (SREP-333), where Blackbox Exporter runs on RHOBS cells outside the management cluster. This gives us the advantage of testing the actual customer-facing network path (only partially for private API), including DNS resolution, load balancer health, and regional routing, rather than probing from within the MC's own network.

I understand ARO HCP wants to avoid the RMO dependency, and these metrics help with that. However, since the CPO probe originates from within the MC, it's not a full replacement for external synthetic monitoring for SLA purposes IMO.

That said, these CPO-local metrics are definitely useful even for the ROSA side for faster internal detection of control plane issues, for example catching KAS pod crashes or pinpointing network issues to in-cluster networking failures.

LGTM & thanks for the addition!

@typeid
Member

typeid commented Feb 19, 2026

Also cc @dustman9000 as an FYI that this exists and is now being extended with latency as well :)

@muraee
Contributor

muraee commented Feb 19, 2026

lgtm

@enxebre
Member Author

enxebre commented Feb 19, 2026

/test e2e-aws

@enxebre enxebre marked this pull request as ready for review February 19, 2026 13:59
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/util/util.go`:
- Around line 2712-2743: The loop in ValidateCPOMetrics uses
ValidateMetricPresence which expects labeled metrics and therefore never matches
label-less KAS metrics; modify the inner check after GetMetricsFromPod to look
for the MetricFamily by name (from the returned mf MetricFamily map or slice)
for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName instead of
calling ValidateMetricPresence, i.e., verify the MetricFamily exists and has at
least one metric (no label checks) before returning true; keep the surrounding
wait.PollUntilContextTimeout, GetMetricsFromPod, and error handling unchanged.
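
A minimal sketch of the name-based check suggested above, assuming the scrape helper returns families keyed by name. The `metric` and `metricFamily` types are hypothetical stand-ins for the Prometheus client model types, and `missingFamilies` is an illustrative name, not a helper from the repository.

```go
package main

import "fmt"

// metric and metricFamily are hypothetical stand-ins for the Prometheus
// client model types that a metrics scrape helper would return.
type metric struct {
	labels map[string]string // may be empty for label-less metrics
}

type metricFamily struct {
	name    string
	metrics []metric
}

// missingFamilies returns the wanted family names that are absent or have
// no samples. Labels are deliberately ignored, so label-less metrics match.
func missingFamilies(families map[string]*metricFamily, wanted ...string) []string {
	var missing []string
	for _, name := range wanted {
		mf, ok := families[name]
		if !ok || mf == nil || len(mf.metrics) == 0 {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	scraped := map[string]*metricFamily{
		"hypershift_kube_apiserver_available": {
			name:    "hypershift_kube_apiserver_available",
			metrics: []metric{{}}, // one sample, no labels
		},
	}
	fmt.Println(missingFamilies(scraped,
		"hypershift_kube_apiserver_available",
		"hypershift_kube_apiserver_request_duration_seconds",
	))
}
```

A polling condition would then return true only once `missingFamilies` comes back empty.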

Comment thread test/e2e/util/util.go
Add ValidateCPOMetrics function that verifies both
hypershift_kube_apiserver_available and
hypershift_kube_apiserver_request_duration_seconds metrics are present
on the control-plane-operator pod's metrics endpoint (port 8080).

The validation is integrated into the e2e test framework's pre-teardown
phase, running alongside existing hypershift-operator metrics checks.
It follows the established pattern using GetMetricsFromPod and
ValidateMetricPresence with polling (10s interval, 5min timeout).

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@enxebre enxebre force-pushed the fix-CNTRLPLANE-2775 branch from 51eb7d6 to e94ec9e February 19, 2026 17:24
@enxebre
Member Author

enxebre commented Feb 19, 2026

/test e2e-aws
/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 19, 2026
@openshift-ci-robot

@enxebre: This PR has been marked as verified by e2e.


@enxebre enxebre added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Feb 19, 2026
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2026
@openshift-merge-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cwbotbot

Test Results

e2e-aws

Failed Tests

Total failed tests: 4

  • TestCreateClusterRequestServingIsolation
  • TestCreateClusterRequestServingIsolation/Teardown
  • TestCreateClusterRequestServingIsolation/ValidateHostedCluster
  • TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods

@openshift-bot

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 17, 2026
@openshift-bot

Stale PRs rot after 14d of inactivity.

Mark the PR as fresh by commenting /remove-lifecycle rotten.
Rotten PRs close after an additional 7d of inactivity.

If this PR is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2026
@dustman9000
Member

This is a great addition! Having a native KAS health signal per HCP from inside the CPO will be very useful.

One thing I want to flag from the ROSA HCP SRE monitoring side: the probe frequency here is tied to the reconcile loop (1 min when healthy, 15s when unhealthy), and the probe runs from inside the management cluster. We currently have two layers of external synthetic monitoring that validate the customer-facing network path (DNS, LB, ingress):

  • route-monitor-operator: creates Prometheus blackbox exporter probes per HCP endpoint on the management cluster
  • RHOBS synthetics agent: external blackbox probes monitoring HCP API endpoints from outside the MC, feeding into RHOBS for SLO tracking

Both probe at a higher frequency and validate reachability from outside the control plane. For SLO/SLA calculations like KubeAPIErrorBudgetBurn, that external perspective and tighter sampling interval matter.
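To make the sampling-interval point concrete, here is a rough sketch of how the probe interval limits the resolution of an availability ratio. The 5-minute window and the 15s comparison interval are illustrative assumptions, not values taken from route-monitor-operator or the RHOBS agent, and the helper names are hypothetical.

```go
package main

import "fmt"

// samplesPerWindow returns how many probe results fall inside an evaluation
// window for a fixed probe interval. Both arguments are in seconds and the
// interval is assumed to be positive and no longer than the window.
func samplesPerWindow(windowSec, intervalSec int) int {
	return windowSec / intervalSec
}

// availabilityStep returns how far a single failed probe moves the
// availability ratio computed over that window.
func availabilityStep(windowSec, intervalSec int) float64 {
	return 1.0 / float64(samplesPerWindow(windowSec, intervalSec))
}

func main() {
	const window = 300 // hypothetical 5-minute evaluation window

	// Reconcile-driven probe while healthy: one sample per minute.
	fmt.Println(samplesPerWindow(window, 60), availabilityStep(window, 60))

	// Assumed tighter external probe: one sample every 15 seconds.
	fmt.Println(samplesPerWindow(window, 15), availabilityStep(window, 15))
}
```

With one sample per minute, an outage shorter than a minute can fall entirely between probes, and each failed probe moves a 5-minute ratio by a fifth.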

So this is complementary to our synthetic monitoring, not a replacement. If we can, I'd suggest softening the "eliminate that dependency" language in the description to avoid confusion downstream. Something like "these native metrics reduce reliance on external probing for internal health checks" would be more accurate.

@jparrill left a comment
Contributor

Dropped some comments. Thanks!

}),
}

crmetrics.Registry.MustRegister(m.Available, m.RequestDuration)

Consider accepting prometheus.Registerer instead of hard-coding the global registry. This eliminates the 3 duplicate manual constructions in tests and avoids MustRegister panic on double-registration:

func NewKASHealthMetrics(reg prometheus.Registerer) *KASHealthMetrics {
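A stdlib-only sketch of why injecting the registry helps. The `registerer` interface and `fakeRegistry` below are simplified local stand-ins written for this illustration; they are not the Prometheus client types, and the constructor returns an error instead of panicking so the effect is easy to observe.

```go
package main

import (
	"errors"
	"fmt"
)

// registerer is a simplified stand-in for an injected registry interface.
// It is NOT prometheus.Registerer; it only models "register once per name".
type registerer interface {
	Register(name string) error
}

var errAlreadyRegistered = errors.New("metric already registered")

// fakeRegistry rejects a second registration of the same metric name.
type fakeRegistry struct{ names map[string]bool }

func newFakeRegistry() *fakeRegistry {
	return &fakeRegistry{names: map[string]bool{}}
}

func (r *fakeRegistry) Register(name string) error {
	if r.names[name] {
		return fmt.Errorf("%w: %s", errAlreadyRegistered, name)
	}
	r.names[name] = true
	return nil
}

// kasHealthMetrics is a hypothetical placeholder for the real metrics struct.
type kasHealthMetrics struct{ registered []string }

// newKASHealthMetrics registers both metric names with whichever registry
// the caller supplies, instead of reaching for a package-level global.
func newKASHealthMetrics(reg registerer) (*kasHealthMetrics, error) {
	m := &kasHealthMetrics{}
	for _, name := range []string{
		"hypershift_kube_apiserver_available",
		"hypershift_kube_apiserver_request_duration_seconds",
	} {
		if err := reg.Register(name); err != nil {
			return nil, err
		}
		m.registered = append(m.registered, name)
	}
	return m, nil
}

func main() {
	// Each test can pass its own fresh registry, so constructing the
	// metrics twice does not collide.
	_, errA := newKASHealthMetrics(newFakeRegistry())
	_, errB := newKASHealthMetrics(newFakeRegistry())
	fmt.Println(errA, errB)

	// Sharing one registry reproduces the double-registration failure.
	shared := newFakeRegistry()
	_, _ = newKASHealthMetrics(shared)
	_, errC := newKASHealthMetrics(shared)
	fmt.Println(errC)
}
```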


// KASRequestDurationBuckets defines the histogram bucket boundaries for KAS
// health check latency measurements, ranging from 10ms to 10s.
var KASRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

nit: Exported mutable slice — any caller can corrupt the histogram buckets. Consider unexported kasRequestDurationBuckets.
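One possible shape for this, as a minimal sketch: keep the slice unexported and, if other packages need the boundaries, hand out a copy. The accessor name `RequestDurationBuckets` is hypothetical and not part of the PR.

```go
package main

import "fmt"

// kasRequestDurationBuckets is unexported, so other packages can neither
// reassign the variable nor mutate its elements.
var kasRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// RequestDurationBuckets is a hypothetical read-only accessor. It returns a
// copy, so callers cannot alter the boundaries used by the histogram.
func RequestDurationBuckets() []float64 {
	out := make([]float64, len(kasRequestDurationBuckets))
	copy(out, kasRequestDurationBuckets)
	return out
}

func main() {
	b := RequestDurationBuckets()
	b[0] = 999 // mutates only the caller's copy

	fmt.Println(kasRequestDurationBuckets[0]) // still the original 10ms boundary
}
```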

if err != nil {
return err
}
defer resp.Body.Close()

Good fix for the pre-existing resp.Body leak. Placement is non-idiomatic though — resp is used in the if m != nil block before err is checked. Safer:

if resp != nil {
    defer resp.Body.Close()
}

placed right after httpClient.Get(), before the metrics block.
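A self-contained sketch of that ordering, using only the standard library. `probeHealthz` and `probeResult` are illustrative names, not the PR's `healthCheckKASEndpoint()`; the result struct stands in for the values the gauge and histogram would receive.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeResult is a hypothetical stand-in for what the metrics would record.
type probeResult struct {
	Available float64       // 1 if /healthz returned 200, 0 otherwise
	Duration  time.Duration // latency of the probe
}

// probeHealthz issues one GET and reports availability and latency.
func probeHealthz(client *http.Client, url string) (probeResult, error) {
	start := time.Now()
	resp, err := client.Get(url)
	elapsed := time.Since(start)

	// Guarded close sits directly after Get, before resp is read anywhere.
	if resp != nil {
		defer resp.Body.Close()
	}

	result := probeResult{Duration: elapsed}
	if err != nil {
		return result, err // unreachable endpoint: Available stays 0
	}

	// Drain the body so the underlying connection can be reused.
	_, _ = io.Copy(io.Discard, resp.Body)

	if resp.StatusCode == http.StatusOK {
		result.Available = 1
		return result, nil
	}
	return result, fmt.Errorf("healthz returned status %d", resp.StatusCode)
}

// newHealthzServer starts a local test server that answers with status.
func newHealthzServer(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
}

func main() {
	healthy := newHealthzServer(http.StatusOK)
	defer healthy.Close()

	res, err := probeHealthz(healthy.Client(), healthy.URL+"/healthz")
	fmt.Println(res.Available, err)
}
```

The same helper covers the three cases the unit tests describe (200, 503, unreachable) without ever touching a nil response.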

)

func TestNewKASHealthMetrics(t *testing.T) {
t.Run("When creating KAS health metrics, it should register both metrics", func(t *testing.T) {

This never calls NewKASHealthMetrics() — it tests Prometheus primitives, not the constructor. Rename to TestKASHealthMetricsOperations, or rewrite to call the real constructor (easier with the Registerer param suggested above). Also: missing t.Parallel(), uses t.Errorf instead of gomega (project convention).

}

ValidateMetrics(t, context.Background(), h.client, hostedCluster, metricsToValidate, true)
ValidateCPOMetrics(t, context.Background(), h.client, hostedCluster)

Runs in every test's after() — if metrics emission fails on any platform, all E2E tests fail. Suggest starting as a dedicated test, then promoting to after() once stable in CI.

Comment thread test/e2e/util/util.go
hcpNamespace := manifests.HostedControlPlaneNamespace(hc.Namespace, hc.Name)

err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
mf, err := GetMetricsFromPod(ctx, c, "control-plane-operator", "control-plane-operator", hcpNamespace, "8080")

nit: Magic string "8080" — CPO metrics port from main.go:207. Extract to a constant.

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | e94ec9e | link | true | /test e2e-aws |
| ci/prow/unit | e94ec9e | link | true | /test unit |
| ci/prow/e2e-azure-self-managed | e94ec9e | link | true | /test e2e-azure-self-managed |
| ci/prow/verify-workflows | e94ec9e | link | true | /test verify-workflows |
| ci/prow/security | e94ec9e | link | true | /test security |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

I have the full picture. Here is the analysis:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

CONFLICT (content): Merge conflict in control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
CONFLICT (content): Merge conflict in test/e2e/util/util.go
Automatic merge failed; fix conflicts and then commit the result.
# Error: exit status 1
# Final SHA: 
# Total runtime: 0s

Summary

The security job never executed. It failed during the git merge phase before any CI step ran. Prow attempted to merge PR #7749 (commit e94ec9ec) into the main branch (base f16ca0d2, head commit of PR #8478) and encountered merge conflicts in two files. Because the merge could not be completed, no final SHA was produced, and the job exited immediately with status 1. The security scan itself was never reached — this is purely a branch staleness issue.

Root Cause

The PR branch (commit e94ec9eceafc85e369cf33eec52a58ec11eb6540) is out of date with the current main branch (commit f16ca0d22941d58a170aaa4b1471b1456017c7c8, from merged PR #8478 bryan-cox/update-agents-on-webhook).

Specifically, the main branch and the PR branch both modify the same regions of two files:

  1. control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go — Both the PR (which adds KAS availability/latency metrics exposure in the control-plane-operator) and recent main commits modify the hosted control plane controller test file in overlapping sections.

  2. test/e2e/util/util.go — Both the PR and recent main commits modify the e2e test utility file in overlapping sections.

Git's automatic merge strategy cannot resolve these conflicts, so git merge --no-ff fails with exit status 1. Prow records no final SHA and marks the job as failed before any CI container (including the security scanner) starts.

This is not a product bug, test bug, or infrastructure issue. It is expected Prow behavior when a PR cannot be cleanly merged against the current base branch.

Recommendations

  1. Rebase PR #7749 onto current main: the PR author should run git rebase main (or git merge main) against an up-to-date main, resolve the conflicts in the two affected files, and push the updated branch (a force-push is required after a rebase). This lets Prow merge cleanly and proceed to the security scan.

  2. Resolve conflicts in these specific files:

    • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
    • test/e2e/util/util.go

  3. Re-trigger the job: after the rebased branch is pushed, the pull-ci-openshift-hypershift-main-security job will be re-triggered automatically (or can be triggered manually with /retest).

  4. No action needed on the security scan itself: the scan never ran and is not the cause of the failure.

Evidence

| Evidence | Detail |
| --- | --- |
| Failure phase | Git merge (pre-CI, before any test step) |
| Exit code | 1 (from git merge --no-ff) |
| Final SHA | Empty (no merge commit produced) |
| Total runtime | 0s (no CI steps executed) |
| Conflicting file 1 | control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go |
| Conflicting file 2 | test/e2e/util/util.go |
| PR commit | e94ec9eceafc85e369cf33eec52a58ec11eb6540 |
| Base (main) commit | f16ca0d22941d58a170aaa4b1471b1456017c7c8 (PR #8478 bryan-cox/update-agents-on-webhook) |
| Auto-merged (no conflict) | hostedcontrolplane_controller.go, main.go, hypershift_framework.go |
| Security scan executed | No; the job terminated before reaching any CI step |

Labels

  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • verified: Signifies that the PR passed pre-merge verification criteria

9 participants