CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator #7749
enxebre wants to merge 3 commits into …
Conversation
Instrument the existing healthCheckKASEndpoint function in the control-plane-operator to expose two new Prometheus metrics:

- hypershift_kube_apiserver_available: Gauge reporting 1 when the KAS /healthz endpoint returns HTTP 200, 0 otherwise.
- hypershift_kube_apiserver_request_duration_seconds: Histogram tracking the latency of the /healthz health check probe.

These metrics are registered with the controller-runtime metrics registry and are automatically scraped by the existing PodMonitor for the control-plane-operator. Each CPO pod runs in its own HCP namespace, so metrics are naturally scoped per hosted cluster.

The existing HostedControlPlaneAvailable condition logic is unchanged — metrics are a side-effect of the existing health check, not a replacement.

This eliminates the need for external monitoring tooling (e.g. route-monitor-operator) to track KAS availability and latency for SLA purposes in HCP offerings.

Ref: CNTRLPLANE-2775
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
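For reference, a minimal sketch of what this commit describes. The `KASHealthMetrics` struct shape is an assumption pieced together from the review comments further down in this thread; the actual definitions live in `kas/metrics.go`:

```go
package kas

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// KASHealthMetrics holds the two KAS health probe metrics.
type KASHealthMetrics struct {
	Available       prometheus.Gauge
	RequestDuration prometheus.Histogram
}

func NewKASHealthMetrics() *KASHealthMetrics {
	m := &KASHealthMetrics{
		Available: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "hypershift_kube_apiserver_available",
			Help: "1 when the KAS /healthz endpoint returns HTTP 200, 0 otherwise.",
		}),
		RequestDuration: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name:    "hypershift_kube_apiserver_request_duration_seconds",
			Help:    "Latency of the KAS /healthz health check probe.",
			Buckets: []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
		}),
	}
	// Registering with the controller-runtime registry is what makes the
	// existing PodMonitor scrape pick these up automatically.
	crmetrics.Registry.MustRegister(m.Available, m.RequestDuration)
	return m
}
```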
Add tests for the healthCheckKASEndpoint function that verify metrics are correctly recorded during health check probes:

- Gauge set to 1 and histogram observed on successful 200 response
- Gauge set to 0 on non-200 response (503)
- Gauge set to 0 on connection error (unreachable endpoint)
- No panic when metrics is nil (backward compatibility)

Also add a basic test for KASHealthMetrics construction and registration.

Ref: CNTRLPLANE-2775
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline controller notification
For optional jobs, comment …
This repository is configured in: LGTM mode
@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
No actionable comments were generated in the recent review. 🎉

Walkthrough
Adds Prometheus metrics for KAS health (availability and request duration) and threads an optional metrics parameter through the health check.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Tests as Tests/E2E
    participant Reconciler as HostedControlPlaneReconciler
    participant KAS as KAS/API
    participant Metrics as Prometheus/Registry
    Tests->>Reconciler: trigger health check
    Reconciler->>KAS: HTTP request to ingress (healthCheckKASEndpoint with m)
    activate KAS
    KAS-->>Reconciler: HTTP response (200/503/timeout)
    deactivate KAS
    alt m != nil
        Reconciler->>Metrics: observe RequestDuration
        Reconciler->>Metrics: set Available = 1 or 0
    end
    Tests->>Metrics: query metrics endpoint to validate metrics present/values
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/auto-cc |
Just to clarify a bit further, while these metrics are great, in ROSA HCP we intend to probe KAS availability externally via RHOBS synthetic monitoring (SREP-333), where Blackbox Exporter runs on RHOBS cells outside the management cluster. This gives us the advantage of testing the actual customer-facing network path (only partially for private API), including DNS resolution, load balancer health, and regional routing, rather than probing from within the MC's own network.

I understand ARO HCP wants to avoid the RMO dependency, and these metrics help with that. However, since the CPO probe originates from within the MC, it's not a full replacement for external synthetic monitoring for SLA purposes IMO.

That said, these CPO-local metrics are definitely useful even for the ROSA side for faster internal detection of control plane issues, for example catching KAS pod crashes or pinpointing network issues to in-cluster networking failures. LGTM & thanks for the addition!
Also cc @dustman9000 as an FYI that this exists and is now being extended with latency as well :)
lgtm |
/test e2e-aws |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/util/util.go`:
- Around line 2712-2743: The loop in ValidateCPOMetrics uses
ValidateMetricPresence which expects labeled metrics and therefore never matches
label-less KAS metrics; modify the inner check after GetMetricsFromPod to look
for the MetricFamily by name (from the returned mf MetricFamily map or slice)
for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName instead of
calling ValidateMetricPresence, i.e., verify the MetricFamily exists and has at
least one metric (no label checks) before returning true; keep the surrounding
wait.PollUntilContextTimeout, GetMetricsFromPod, and error handling unchanged.
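A sketch of what that inner check could look like, assuming `GetMetricsFromPod` returns the `map[string]*dto.MetricFamily` that `expfmt.TextParser` produces (the exact return type is not visible in this thread):

```go
// Inside the existing wait.PollUntilContextTimeout closure, after the
// GetMetricsFromPod call that yields mf:
for _, name := range []string{kas.KASAvailableMetricName, kas.KASRequestDurationMetricName} {
	family, ok := mf[name]
	if !ok || len(family.GetMetric()) == 0 {
		// Metric family not exposed yet; keep polling. No label checks,
		// since these KAS metrics are label-less.
		return false, nil
	}
}
return true, nil
```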
Add ValidateCPOMetrics function that verifies both hypershift_kube_apiserver_available and hypershift_kube_apiserver_request_duration_seconds metrics are present on the control-plane-operator pod's metrics endpoint (port 8080). The validation is integrated into the e2e test framework's pre-teardown phase, running alongside existing hypershift-operator metrics checks. It follows the established pattern using GetMetricsFromPod and ValidateMetricPresence with polling (10s interval, 5min timeout). Ref: CNTRLPLANE-2775 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 51eb7d6 to e94ec9e (Compare)
/test e2e-aws |
@enxebre: This PR has been marked as verified by …

Details
In response to this:
PR needs rebase.
Test Results
e2e-aws

Failed Tests
Total failed tests: 4
Stale PRs are closed after 21d of inactivity. If this PR is still relevant, comment to refresh it or remove the stale label. If this PR is safe to close now, please do so with /lifecycle stale
Stale PRs rot after 14d of inactivity. Mark the PR as fresh by commenting /remove-lifecycle stale. If this PR is safe to close now, please do so with /lifecycle rotten
This is a great addition! Having a native KAS health signal per HCP from inside the CPO will be very useful. One thing I want to flag from the ROSA HCP SRE monitoring side: the probe frequency here is tied to the reconcile loop (1 min when healthy, 15s when unhealthy), and the probe runs from inside the management cluster. We currently have two layers of external synthetic monitoring that validate the customer-facing network path (DNS, LB, ingress):
Both probe at a higher frequency and validate reachability from outside the control plane. For SLO/SLA calculations like KubeAPIErrorBudgetBurn, that external perspective and tighter sampling interval matter. So this is complementary to our synthetic monitoring, not a replacement.

If we can, I'd suggest softening the "eliminate that dependency" language in the description to avoid confusion downstream. Something like "these native metrics reduce reliance on external probing for internal health checks" would be more accurate.
jparrill left a comment
Dropped some comments. Thanks!
```go
	}),
}

crmetrics.Registry.MustRegister(m.Available, m.RequestDuration)
```
Consider accepting prometheus.Registerer instead of hard-coding the global registry. This eliminates the 3 duplicate manual constructions in tests and avoids MustRegister panic on double-registration:
```go
func NewKASHealthMetrics(reg prometheus.Registerer) *KASHealthMetrics {
```
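A fuller sketch of that shape, folding in the unexported-buckets nit below (the body elision is deliberate; only the signature change is the suggestion):

```go
// kasRequestDurationBuckets stays unexported so callers cannot mutate it.
var kasRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// NewKASHealthMetrics registers the KAS health metrics on the given
// registerer: crmetrics.Registry in production, prometheus.NewRegistry()
// in tests, which sidesteps the MustRegister double-registration panic.
func NewKASHealthMetrics(reg prometheus.Registerer) *KASHealthMetrics {
	m := &KASHealthMetrics{ /* ... same metric definitions as today ... */ }
	reg.MustRegister(m.Available, m.RequestDuration)
	return m
}
```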
```go
// KASRequestDurationBuckets defines the histogram bucket boundaries for KAS
// health check latency measurements, ranging from 10ms to 10s.
var KASRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
```
nit: Exported mutable slice — any caller can corrupt the histogram buckets. Consider unexported kasRequestDurationBuckets.
```go
if err != nil {
	return err
}
defer resp.Body.Close()
```
Good fix for the pre-existing resp.Body leak. Placement is non-idiomatic though — resp is used in the if m != nil block before err is checked. Safer:
```go
if resp != nil {
	defer resp.Body.Close()
}
```

placed right after httpClient.Get(), before the metrics block.
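Putting the thread together, a sketch of the instrumented probe with that placement. The signature, timeout, and error wording are assumptions; only the `resp != nil` guard and the `m != nil` metrics block reflect the actual review points:

```go
// imports assumed: fmt, net/http, time
func healthCheckKASEndpoint(endpoint string, m *KASHealthMetrics) error {
	httpClient := &http.Client{Timeout: 10 * time.Second} // timeout is illustrative
	start := time.Now()
	resp, err := httpClient.Get(endpoint + "/healthz")
	// Close the body whenever one was returned; on redirect errors Get can
	// return both a non-nil error and a non-nil response.
	if resp != nil {
		defer resp.Body.Close()
	}
	if m != nil {
		m.RequestDuration.Observe(time.Since(start).Seconds())
		if err == nil && resp.StatusCode == http.StatusOK {
			m.Available.Set(1)
		} else {
			m.Available.Set(0)
		}
	}
	if err != nil {
		return err
	}
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("KAS /healthz returned status %d", resp.StatusCode)
	}
	return nil
}
```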
```go
)

func TestNewKASHealthMetrics(t *testing.T) {
	t.Run("When creating KAS health metrics, it should register both metrics", func(t *testing.T) {
```
This never calls NewKASHealthMetrics() — it tests Prometheus primitives, not the constructor. Rename to TestKASHealthMetricsOperations, or rewrite to call the real constructor (easier with the Registerer param suggested above). Also: missing t.Parallel(), uses t.Errorf instead of gomega (project convention).
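Combining that with the `Registerer` suggestion above, a sketch of what the rewritten test could look like (gomega per project convention; the constructor parameter is the earlier suggestion, not current code):

```go
func TestNewKASHealthMetrics(t *testing.T) {
	t.Parallel()
	t.Run("When creating KAS health metrics, it should register both metrics", func(t *testing.T) {
		g := NewWithT(t)

		reg := prometheus.NewRegistry() // throwaway registry, no global state
		m := NewKASHealthMetrics(reg)

		g.Expect(m.Available).NotTo(BeNil())
		g.Expect(m.RequestDuration).NotTo(BeNil())

		// Both collectors report even before any Set/Observe call.
		families, err := reg.Gather()
		g.Expect(err).NotTo(HaveOccurred())
		g.Expect(families).To(HaveLen(2))
	})
}
```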
```go
}

ValidateMetrics(t, context.Background(), h.client, hostedCluster, metricsToValidate, true)
ValidateCPOMetrics(t, context.Background(), h.client, hostedCluster)
```
Runs in every test's after() — if metrics emission fails on any platform, all E2E tests fail. Suggest starting as a dedicated test, then promoting to after() once stable in CI.
```go
hcpNamespace := manifests.HostedControlPlaneNamespace(hc.Namespace, hc.Name)

err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
	mf, err := GetMetricsFromPod(ctx, c, "control-plane-operator", "control-plane-operator", hcpNamespace, "8080")
```
nit: Magic string "8080" — CPO metrics port from main.go:207. Extract to a constant.
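e.g. a small constant in the e2e util package (the name is a suggestion):

```go
// cpoMetricsPort is the control-plane-operator metrics listener port; keep
// it in sync with the port configured in control-plane-operator/main.go.
const cpoMetricsPort = "8080"

// usage in the poll closure:
mf, err := GetMetricsFromPod(ctx, c, "control-plane-operator", "control-plane-operator", hcpNamespace, cpoMetricsPort)
```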
@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard.

Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
I have the full picture. Here is the analysis:

Test Failure Analysis Complete

Job Information
Test Failure Analysis

Error
Summary

The Root Cause
The PR branch (commit …) … Specifically, the …

Git's automatic merge strategy cannot resolve these conflicts, so … This is not a product bug, test bug, or infrastructure issue. It is expected Prow behavior when a PR cannot be cleanly merged against the current base branch.

Recommendations
Evidence
Description
Instruments the existing healthCheckKASEndpoint() function in the control-plane-operator to expose two new Prometheus metrics for KAS health monitoring:

- hypershift_kube_apiserver_available: Gauge reporting 1 when /healthz returns HTTP 200, 0 otherwise
- hypershift_kube_apiserver_request_duration_seconds: Histogram tracking the latency of the /healthz probe (buckets: 0.01–10s)

Why
HCP offerings (ROSA HCP, ARO HCP) need to monitor customer API endpoint availability and latency for SLA purposes. ROSA HCP currently relies on an external tool (route-monitor-operator) solely for this. These native metrics eliminate that dependency.
How
- HostedControlPlaneAvailable condition logic is unchanged — metrics are a side-effect, not a replacement

Key files
- control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go — metric definitions and registration
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go — instrumented health check
- control-plane-operator/main.go — metrics initialization

Testing
- E2E validation via GetMetricsFromPod/ValidateMetricPresence (Karpenter pattern)
- Unit tests (make test exits 0)

Jira
CNTRLPLANE-2775
🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-2775](https://issues.redhat.com/browse/CNTRLPLANE-2775)

Summary by CodeRabbit
New Features
Tests
Chores