CNTRLPLANE-2838: feat(CPO): add metrics-proxy control plane component #7752

Open
muraee wants to merge 4 commits into openshift:main from muraee:feat-metrics-proxy

Conversation

@muraee
Contributor

@muraee muraee commented Feb 19, 2026

Summary

  • Adds the metrics-proxy as a new CPOv2 control plane component that exposes hosted control plane
    component metrics externally via an authenticated HTTPS route, as described in the
    hosted control plane metrics exposure enhancement
  • The proxy authenticates requests using bearer tokens validated against the management cluster's
    TokenReview API, discovers component pod targets via the endpoint-resolver, scrapes metrics from
    individual pods in parallel, and applies per-component filtering and labeling
  • Exposes an aggregated Prometheus-compatible /metrics/<component> endpoint over a TLS route
  • Includes dynamic scrape configuration loaded from a ConfigMap, self-signed CA/serving certificate
    management, and comprehensive unit tests
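
The parallel per-pod scraping described in the summary can be sketched roughly as follows; the ScrapeResult type, the scrapeAll name, and the fetch callback are illustrative stand-ins, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// ScrapeResult mirrors the idea of a per-target scrape outcome
// (illustrative; the real types live in metrics-proxy/scraper.go).
type ScrapeResult struct {
	Target string
	Body   string
	Err    error
}

// scrapeAll fans out one goroutine per target and collects results by
// index, so the output order is stable regardless of goroutine timing.
// fetch stands in for the real HTTPS scrape of a pod's /metrics endpoint.
func scrapeAll(targets []string, fetch func(string) (string, error)) []ScrapeResult {
	results := make([]ScrapeResult, len(targets))
	var wg sync.WaitGroup
	for i, t := range targets {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			body, err := fetch(t)
			results[i] = ScrapeResult{Target: t, Body: body, Err: err}
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	targets := []string{"10.0.0.1:8443", "10.0.0.2:8443"}
	results := scrapeAll(targets, func(t string) (string, error) {
		return "up{pod=\"" + t + "\"} 1\n", nil
	})
	for _, r := range results {
		fmt.Print(r.Body)
	}
}
```

Writing each result into its own slice slot avoids a mutex around a shared append and keeps per-target errors attributable.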

Test plan

  • Verify generated test fixtures match expected output across all platform variants (default, AROSwift, GCP, IBMCloud, TechPreviewNoUpgrade)
  • Run go test ./control-plane-operator/metrics-proxy/... to validate proxy unit tests
  • Run go test ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/... to validate component tests
  • Run make test for full unit test suite
  • Run make verify to ensure code generation and formatting are correct

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added metrics-proxy component to enable secure metrics collection and forwarding from hosted control planes
    • Supports dynamic scrape configuration generation from monitoring resources across multiple cloud platforms (AROSwift, GCP, IBMCloud)
    • Implements TLS-based encryption for metrics endpoints with certificate management
    • Enables per-component metrics filtering and label enrichment for improved observability

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot requested review from bryan-cox and enxebre February 19, 2026 13:47
@openshift-ci openshift-ci bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Feb 19, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: muraee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Feb 19, 2026
@muraee
Contributor Author

muraee commented Feb 19, 2026

/cc @enxebre @csrwng

@openshift-ci openshift-ci bot requested a review from csrwng February 19, 2026 13:53
@muraee muraee force-pushed the feat-metrics-proxy branch from e949808 to 4ba7d9f on February 24, 2026 11:13
@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR introduces a new metrics-proxy component for the hosted control plane operator, enabling metrics forwarding from control plane components. It includes v2 component registration, Kubernetes manifests, TLS certificate handling, scrape config generation, endpoint discovery, metric filtering, and an HTTPS proxy server with complete test coverage.

Changes

  • Controller Registration (hostedcontrolplane_controller.go, hostedcontrolplane_controller_test.go): registers the metricsproxyv2 and endpointresolverv2 components; updates test fixtures with dependency injection for the endpoint resolver.
  • v2 Component Implementation (v2/metrics_proxy/component.go, certs.go, deployment.go, route.go, scrape_config.go): implements the metrics-proxy component lifecycle: component registration with predicates, CA/serving certificate adaptation, deployment TLS configuration with monitor discovery, route adaptation for private/external HCP, and scrape config generation from ServiceMonitor/PodMonitor resources.
  • v2 Component Tests (v2/metrics_proxy/deployment_test.go, scrape_config_test.go): extensive test coverage for TLS volume handling, monitor deduplication, ConfigMap merging, scrape config generation, and per-component certificate handling.
  • v2 Assets (v2/assets/metrics-proxy/*): Kubernetes manifests: Secret (CA cert, serving cert), Deployment, Route, Service, ServiceAccount, ConfigMap (scrape config).
  • Test Fixtures (testdata/metrics-proxy/*/zz_fixture_*): YAML fixture files across deployment profiles (AROSwift, GCP, IBMCloud, TechPreviewNoUpgrade, default) defining ControlPlaneComponent, Deployment, Service, Route, ServiceAccount, and Secrets.
  • Runtime Initialization (main.go, metrics-proxy/cmd.go): registers the metrics-proxy subcommand; NewStartCommand wires the startup flow with config loading, TLS setup, endpoint resolution, scraping, filtering, labeling, and the HTTPS server.
  • Configuration and Discovery (metrics-proxy/config_reader.go, config_reader_test.go, types.go): ConfigFileReader loads YAML config with per-component TLS settings; defines public interfaces (ComponentProvider, TargetDiscoverer) and data structures (ComponentConfig, ScrapeTarget).
  • Endpoint Discovery (metrics-proxy/endpoint_resolver_client.go, endpoint_resolver_client_test.go): EndpointResolverClient queries the endpoint-resolver service for pod scrape targets via an HTTP API; handles TLS and error conditions.
  • Metrics Processing (metrics-proxy/scraper.go, filter.go, filter_test.go, labels.go, labels_test.go, tls_util.go): concurrent scraper with per-target TLS config; per-component metric filtering via regex; metric label injection; TLS certificate pool loading; extensive test coverage for caching, concurrency, and label merging.
  • HTTP Proxy Handler (metrics-proxy/proxy.go): ProxyHandler serves Prometheus metrics via the /metrics/ endpoint; orchestrates target discovery, scraping, filtering, and labeling.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant ProxyHandler
    participant TargetDiscoverer
    participant Scraper
    participant Filter
    participant Labeler
    participant HTTPResponse

    Client->>ProxyHandler: GET /metrics/component-name
    ProxyHandler->>ProxyHandler: Validate path, extract component
    ProxyHandler->>TargetDiscoverer: Discover(ctx, serviceName, port)
    TargetDiscoverer-->>ProxyHandler: []ScrapeTarget
    
    loop For each target
        ProxyHandler->>Scraper: scrape(target, tlsConfig)
        Scraper->>Scraper: Build URL, GET with TLS
        Scraper-->>ProxyHandler: ScrapeResult{Families, Err}
    end
    
    ProxyHandler->>Filter: Apply(component, families)
    Filter-->>ProxyHandler: Filtered families
    
    ProxyHandler->>Labeler: Inject(families, target, component)
    Labeler-->>ProxyHandler: Labeled families
    
    ProxyHandler->>HTTPResponse: Encode Prometheus text format
    HTTPResponse-->>Client: 200 OK + metrics
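
The "Validate path, extract component" step at the top of the diagram above could be sketched like this (componentFromPath is an illustrative name, not the PR's actual function):

```go
package main

import (
	"fmt"
	"strings"
)

// componentFromPath validates the request path and extracts the component
// name from /metrics/<component>. Illustrative sketch only.
func componentFromPath(path string) (string, error) {
	const prefix = "/metrics/"
	if !strings.HasPrefix(path, prefix) {
		return "", fmt.Errorf("unexpected path %q", path)
	}
	component := strings.TrimPrefix(path, prefix)
	if component == "" || strings.Contains(component, "/") {
		return "", fmt.Errorf("invalid component in path %q", path)
	}
	return component, nil
}

func main() {
	c, err := componentFromPath("/metrics/etcd")
	fmt.Println(c, err) // etcd <nil>
	_, err = componentFromPath("/metrics/")
	fmt.Println(err != nil) // true
}
```

Rejecting embedded slashes keeps a single path segment as the component key before it is looked up in the scrape configuration.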
sequenceDiagram
    participant Controller
    participant ComponentRegistry
    participant MetricsProxyComponent
    participant DeploymentAdapter
    participant ScrapeConfigAdapter
    
    Controller->>ComponentRegistry: registerComponents()
    ComponentRegistry->>MetricsProxyComponent: NewComponent(defaultIngressDomain)
    MetricsProxyComponent->>ComponentRegistry: Register with manifests & adapters
    
    ComponentRegistry->>MetricsProxyComponent: predicate(cpContext)
    MetricsProxyComponent-->>ComponentRegistry: bool (monitoring enabled & forwarding enabled)
    
    loop Reconciliation cycle
        Controller->>DeploymentAdapter: adaptDeployment(cpContext, deployment)
        DeploymentAdapter->>DeploymentAdapter: Discover ServiceMonitor/PodMonitor TLS refs
        DeploymentAdapter->>DeploymentAdapter: Build cert volumes and mounts
        DeploymentAdapter-->>Controller: Updated deployment
        
        Controller->>ScrapeConfigAdapter: adaptScrapeConfig(cpContext, configMap)
        ScrapeConfigAdapter->>ScrapeConfigAdapter: Resolve Services from monitors
        ScrapeConfigAdapter->>ScrapeConfigAdapter: Build per-component config
        ScrapeConfigAdapter-->>Controller: Updated ConfigMap with YAML
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 24.62%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Stable And Deterministic Test Names — ✅ Passed: test names in the PR are static and deterministic, using fixed descriptive strings without dynamic values like UUIDs, timestamps, or generated pod names.
  • Test Structure And Quality — ✅ Passed: test files demonstrate solid structure with standard Go testing patterns, table-driven tests, parallel execution, and meaningful assertions.
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the PR title clearly and specifically describes the main change: adding a metrics-proxy control plane component to the CPO (Control Plane Operator).


Comment @coderabbitai help to get the list of available commands and usage tips.

@muraee muraee changed the title from "feat(cpo): add metrics-proxy binary for HCP metrics collection" to "feat(CPO): add metrics-proxy control plane component" on Feb 24, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

♻️ Duplicate comments (2)
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml (1)

1-19: Duplicate of AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml.

This file has identical content to the AROSwift-specific copy and is placed at the bare testdata/metrics-proxy/ root (same anomaly as the root-level service fixture). Remove whichever copy is not consumed by the test runner to avoid confusion and false coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml`
around lines 1 - 19, The test fixture ConfigMap (metadata.name:
metrics-proxy-config) contains a duplicate of the AROSwift fixture (same
config.yaml with endpointResolver and components) which causes confusion; locate
the duplicate file (the one in the metrics-proxy testdata root) and remove the
copy that the test runner does not consume, or consolidate so only the consumed
fixture remains; ensure any tests referencing metrics-proxy-config still point
to the remaining fixture and update test references if necessary.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml (1)

14-22: Same weight: null and missing spec.host issues as the IBMCloud route fixture.

Both concerns flagged on the IBMCloud route fixture apply here identically, confirming they originate in the shared route generation code rather than a platform-specific branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`
around lines 14 - 22, The fixture shows a generated Route containing "weight:
null" and a missing "spec.host"—fix the shared route generation logic (e.g., the
function that builds routes such as generateRouteSpec / buildRoute /
NewRouteForService) so it omits the "weight" field when no weight is set (don’t
emit null) and always populates spec.host for routes like "metrics-proxy"
(derive host from the service/cluster config or pass through the provided host
value); update the generator to conditionally add the weight only when non-nil
and to compute/set spec.host before emitting the route fixture.
🟡 Minor comments (8)
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml-1-19 (1)

1-19: ⚠️ Potential issue | 🟡 Minor

Duplicate fixture — identical content exists at the root level.

testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml (File 8) has identical content to this AROSwift/ copy. One of them is likely misplaced. Confirm which is the authoritative location and remove the other to avoid test ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml`
around lines 1 - 19, This fixture is duplicated; remove the misplaced copy so
tests are unambiguous: decide which location is authoritative (the root-level
fixture or the AROSwift/ copy) and delete the other file (the duplicate
ConfigMap with metadata.name "metrics-proxy-config" and namespace
"hcp-namespace"), then update any test references to point to the retained
fixture and ensure ownerReference to HostedControlPlane/name "hcp" remains
correct.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml-1-23 (1)

1-23: ⚠️ Potential issue | 🟡 Minor

Remove or reconcile duplicate service fixture placement.

A service fixture exists at both the root level (testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml) and in all platform-specific subdirectories (AROSwift/, GCP/, IBMCloud/, TechPreviewNoUpgrade/). This duplication pattern extends to all other fixture types in the same directory. Either the root-level file should be removed if it is not platform-agnostic, or platform-specific copies should be removed if the root-level file serves all platforms. Verify the test framework's behavior with duplicate fixtures to prevent shadowing or unexpected test results.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml`
around lines 1 - 23, There are duplicate Service fixtures for the Service named
"metrics-proxy" (metadata.name: metrics-proxy, kind: Service, namespace:
hcp-namespace) placed both at the root-level test fixture and under
platform-specific subdirectories; remove the duplicates by either deleting the
root-level fixture if it's not platform-agnostic or deleting the
platform-specific copies if the root fixture should apply to all platforms, and
then run the test harness to confirm only a single "metrics-proxy" Service
fixture is discovered to avoid shadowing; ensure the remaining fixture's fields
(ownerReferences, namespace, ports) are correct for all platforms that will use
it.
control-plane-operator/endpoint-resolver/server.go-82-85 (1)

82-85: ⚠️ Potential issue | 🟡 Minor

http.Error after json.NewEncoder(w).Encode won't send a 500 status.

If Encode fails partway through, w.Write has already been called (implicitly sending HTTP 200). The subsequent http.Error on line 84 cannot change the status code and will append the error text to the partial body. In practice, encoding these simple structs is unlikely to fail, but for correctness you could marshal to a buffer first.

♻️ Proposed fix: marshal to buffer before writing
+	data, err := json.Marshal(ResolveResponse{Pods: pods})
+	if err != nil {
+		http.Error(w, fmt.Sprintf("failed to encode response: %v", err), http.StatusInternalServerError)
+		return
+	}
 	w.Header().Set("Content-Type", "application/json")
-	if err := json.NewEncoder(w).Encode(ResolveResponse{Pods: pods}); err != nil {
-		http.Error(w, fmt.Sprintf("failed to encode response: %v", err), http.StatusInternalServerError)
-	}
+	w.Write(data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/endpoint-resolver/server.go` around lines 82 - 85, The
response is being encoded directly to the ResponseWriter
(json.NewEncoder(w).Encode(ResolveResponse{Pods: pods})) so if encoding fails
after a partial write the status is already sent and http.Error cannot set 500;
instead marshal/encode into a temporary buffer (e.g., json.Marshal or
json.NewEncoder(&buf).Encode) and on success set Content-Type and write the full
buffer to w, and on marshal/encode error call http.Error with
StatusInternalServerError; update the code around ResolveResponse and the json
encoding call to use the buffer-first approach.
control-plane-operator/endpoint-resolver/server_test.go-131-135 (1)

131-135: ⚠️ Potential issue | 🟡 Minor

Pod comparison is positional — test may be flaky due to non-deterministic lister order.

cache.NewIndexer backs its store with a Go map, so List() order is non-deterministic. The assertion on lines 131-135 compares response.Pods[i] against tt.expectedPods[i] by index, which can fail intermittently when multiple pods are returned (e.g., the first test case with 2 etcd pods).

Sort both slices by Name before comparing, or use an order-independent lookup.

♻️ Proposed fix
+			sort.Slice(response.Pods, func(i, j int) bool {
+				return response.Pods[i].Name < response.Pods[j].Name
+			})
+			sort.Slice(tt.expectedPods, func(i, j int) bool {
+				return tt.expectedPods[i].Name < tt.expectedPods[j].Name
+			})
 			for i, expected := range tt.expectedPods {
 				if response.Pods[i].Name != expected.Name || response.Pods[i].IP != expected.IP {
 					t.Errorf("pod %d: expected %+v, got %+v", i, expected, response.Pods[i])
 				}
 			}

You'll need to add "sort" to the imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/endpoint-resolver/server_test.go` around lines 131 -
135, The test compares response.Pods to tt.expectedPods positionally but
cache.NewIndexer/List() returns non-deterministic order; to fix, sort both
slices by Name before the index-by-index comparison (or convert them to maps
keyed by Name for order-independent lookup). Update the test in server_test.go
to import "sort" and apply sort.Slice on response.Pods and tt.expectedPods
(comparing Pod.Name) prior to the existing loop that checks
response.Pods[i].Name/ IP so assertions are stable.
control-plane-operator/metrics-proxy/config_reader_test.go-98-105 (1)

98-105: ⚠️ Potential issue | 🟡 Minor

Check the Load() error to avoid false positives.
Line 101 ignores the Load() error, so the test could pass even if parsing fails. Use t.Fatalf when Load() returns an error.

✅ Proposed fix
-		_ = r.Load()
+		if err := r.Load(); err != nil {
+			t.Fatalf("Load() returned error: %v", err)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/config_reader_test.go` around lines 98 -
105, The test currently ignores errors from r.Load(), which can produce false
positives; update the test in config_reader_test.go to check the error returned
by ConfigFileReader.Load() (called via r.Load()) and call t.Fatalf with the
error if it is non-nil before calling r.GetComponent("nonexistent"), so the test
fails fast on parse/load errors rather than silently passing.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml-18-22 (1)

18-22: ⚠️ Potential issue | 🟡 Minor

spec.to.weight: null — add omitempty tag or set explicit value.

RouteTargetReference.Weight is defined as *int32 with the tag `json:"weight"` (no omitempty) in the OpenShift API types. When the field is left as a nil pointer in the route generation code (support/util/route.go), it serializes to an explicit weight: null rather than omitting the key. While the API accepts null due to the +optional annotation, an explicit null can cause issues with stricter validators and is not best practice for optional pointer fields. Fix this in support/util/route.go by either:

  • Adding omitempty to the struct field tag (in the OpenShift API types, if possible), or
  • Explicitly setting Weight: pointer.Int32(100) in ReconcileExternalRoute and ReconcileInternalRoute to use the documented default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`
around lines 18 - 22, The generated RouteTargetReference is leaving Weight as a
nil pointer which serializes to `weight: null`; update support/util/route.go so
both ReconcileExternalRoute and ReconcileInternalRoute explicitly set
RouteTargetReference.Weight to a concrete default (e.g. pointer.Int32(100)) when
constructing the target reference instead of leaving it nil, and add the
required import for the pointer helper (or use your project’s equivalent) so the
field is always a non-nil pointer with the documented default.
control-plane-operator/metrics-proxy/proxy_test.go-58-71 (1)

58-71: ⚠️ Potential issue | 🟡 Minor

Bearer tokens with spaces should be treated as invalid per RFC 6750.

RFC 6750 defines Bearer token syntax as "Bearer" 1*SP b64token, where b64token consists of ALPHA / DIGIT / "-" / "." / "_" / "~" / "+" / "/" and excludes whitespace entirely. Spaces are allowed only as the separator between Bearer and the token, not within the token itself. The test cases at lines 58–60 and 68–70 expect tokens containing spaces to be accepted, which violates the spec.

🛠️ Suggested test adjustment
 {
     name:          "When Authorization header has token with spaces, it should return the full token including spaces",
     authorization: "Bearer token with spaces",
-    expected:      "token with spaces",
+    expected:      "",
 },
@@
 {
     name:          "When Authorization header has multiple spaces after Bearer, it should return the spaces",
     authorization: "Bearer   ",
-    expected:      "  ",
+    expected:      "",
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy_test.go` around lines 58 - 71, The
tests in proxy_test.go that expect tokens containing spaces to be accepted
violate RFC 6750; update the table entries for the cases with authorization
"Bearer token with spaces" (test name: "When Authorization header has token with
spaces...") and "Bearer   " (test name: "When Authorization header has multiple
spaces after Bearer...") to expect an empty string instead of a non-empty token,
and update their test descriptions to state they should be rejected; also add
(or ensure presence of) a positive test using a valid RFC6750 b64token (e.g.,
"Bearer abcDEF123+-_/~.") to assert correct extraction, and if the extraction
function (e.g., parseBearerToken / getBearerToken) does not already validate
token characters, update that function to enforce the RFC6750 token regexp (no
whitespace) so the tests pass.
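
An RFC 6750-conformant extraction along the lines the review suggests might look like this sketch (parseBearerToken and the regex are illustrative, not the PR's code; note the sketch also requires the exact "Bearer " scheme spelling):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// b64token per RFC 6750 section 2.1: one or more of
// ALPHA / DIGIT / "-" / "." / "_" / "~" / "+" / "/", then optional "=" padding.
var b64token = regexp.MustCompile(`^[A-Za-z0-9\-._~+/]+=*$`)

// parseBearerToken returns the token from an Authorization header value,
// or "" if it is not a syntactically valid Bearer credential.
func parseBearerToken(authorization string) string {
	const scheme = "Bearer "
	if !strings.HasPrefix(authorization, scheme) {
		return ""
	}
	token := strings.TrimPrefix(authorization, scheme)
	if !b64token.MatchString(token) {
		return "" // rejects embedded spaces, empty tokens, etc.
	}
	return token
}

func main() {
	fmt.Printf("%q\n", parseBearerToken("Bearer abcDEF123+-_/~.")) // "abcDEF123+-_/~."
	fmt.Printf("%q\n", parseBearerToken("Bearer token with spaces")) // ""
	fmt.Printf("%q\n", parseBearerToken("Bearer   "))                // ""
}
```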
control-plane-operator/metrics-proxy/cmd.go-65-65 (1)

65-65: ⚠️ Potential issue | 🟡 Minor

No validation for MY_NAMESPACE environment variable.

If unset, namespace will be an empty string, silently causing incorrect label injection via NewLabeler(""). Consider failing fast if it's empty.

♻️ Proposed fix
 		namespace := os.Getenv("MY_NAMESPACE")
+		if namespace == "" {
+			fmt.Fprintf(os.Stderr, "Error: MY_NAMESPACE environment variable is required\n")
+			os.Exit(1)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` at line 65, The code reads
namespace := os.Getenv("MY_NAMESPACE") but doesn't validate it; add a check
right after reading namespace and if it's empty log a clear error and exit
non-zero (or return an error from main) before calling NewLabeler(""), so we
never construct NewLabeler with an empty namespace; reference the namespace
variable and NewLabeler("") when adding the fail-fast validation and exit path.
🧹 Nitpick comments (10)
control-plane-operator/controllers/hostedcontrolplane/testdata/endpoint-resolver/zz_fixture_TestControlPlaneComponents_endpoint_resolver_ca_secret.yaml (1)

2-4: Use non-empty dummy TLS data to strengthen the fixture.

Empty tls.crt/tls.key can let tests pass even if cert/key generation regresses. Consider using short, valid base64 placeholders to ensure the reconciliation/test path asserts presence of data.

🛠️ Example fixture hardening
 data:
-  tls.crt: ""
-  tls.key: ""
+  tls.crt: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCg=="
+  tls.key: "LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQo="
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/endpoint-resolver/zz_fixture_TestControlPlaneComponents_endpoint_resolver_ca_secret.yaml`
around lines 2 - 4, The fixture currently contains empty tls.crt and tls.key
which lets regressions slip; update the fixture so that data.tls.crt and
data.tls.key contain short, valid base64-encoded placeholders (non-empty dummy
TLS data) instead of empty strings, e.g. simple base64-encoded PEM-like
placeholders for both keys to ensure reconciliation/tests validate presence and
non-empty content of the certificate and private key.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml (1)

22-25: weight: null is non-standard — consider setting an explicit value.

spec.to.weight serializes as null because the Go field is a nil *int32. OpenShift defaults to 100 when omitted, but explicit null is unusual in golden-file fixtures and may cause noise in diff reviews or future schema validation. If the producing code intentionally leaves this nil, this is fine; otherwise consider setting it to 100 at the source.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`
around lines 22 - 25, The fixture currently serializes spec.to.weight as null (a
nil *int32), producing an explicit "weight: null" entry; change the producing
code so spec.to.weight is set to a concrete int32 value (e.g., 100) instead of
left nil so the generated Service "metrics-proxy" entry contains weight: 100;
locate where the route/traffic target is built (the struct that sets
spec.to.weight) and initialize the pointer to int32(100) rather than leaving it
nil.
control-plane-operator/endpoint-resolver/cmd.go (1)

104-117: Graceful shutdown uses unbounded context; dead code on line 118.

Two items:

  1. server.Shutdown(context.Background()) on line 108 waits indefinitely for in-flight requests. Use a timeout (e.g., 30s) to prevent the process from hanging if a connection doesn't drain.
  2. _ = ctx on line 118 is dead code — ctx is already consumed on lines 77-78.
♻️ Proposed fix
 		go func() {
 			sig := <-sigChan
 			l.Info("Received shutdown signal", "signal", sig)
 			cancel()
-			if err := server.Shutdown(context.Background()); err != nil {
+			shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 30*time.Second)
+			defer shutdownCancel()
+			if err := server.Shutdown(shutdownCtx); err != nil {
 				l.Error(err, "Error during server shutdown")
 			}
 		}()

 		l.Info("Serving endpoint-resolver", "port", servingPort, "namespace", namespace)
 		if err := server.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
 			fmt.Fprintf(os.Stderr, "Error serving: %v\n", err)
 			os.Exit(1)
 		}
-		_ = ctx
 		l.Info("Server stopped")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/endpoint-resolver/cmd.go` around lines 104 - 117, The
shutdown goroutine currently calls server.Shutdown(context.Background()) which
can block indefinitely; change it to create a timeout context (e.g., via
context.WithTimeout with a 30s duration) and pass that timeout context to
server.Shutdown, ensuring you cancel the timeout context after Shutdown returns;
also remove the dead `_ = ctx` statement that is no longer needed (the outer ctx
is already used elsewhere) to eliminate dead code. Refer to the shutdown
goroutine and the server.Shutdown call to locate where to apply the timeout
context and to remove the `_ = ctx` line.
control-plane-operator/metrics-proxy/auth.go (1)

42-83: Solid auth implementation; consider rate-limiting or negative caching for failed token reviews.

The cache correctly bounds entries by authorized SA count and uses SHA-256 hashing. One consideration: repeated invalid tokens bypass the cache entirely and each triggers a TokenReview API call. If the endpoint is exposed to untrusted callers, this could become an amplification vector against the hosted KAS. A short-lived negative cache or rate limiter on failed reviews would mitigate this.

Not blocking — flagging as a hardening consideration given the intended use with a small number of authorized SAs behind TLS.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/auth.go` around lines 42 - 83, Add
short-lived negative caching to TokenAuthenticator.Authenticate so repeated
invalid tokens don't trigger TokenReview calls: before creating the review,
check a new negative cache (e.g., a.negativeCache keyed by hash with expiry,
protected by a.mu) and return false if present; after receiving a review where
review.Status.Authenticated is false, insert the token hash into that negative
cache with a small TTL (e.g., failedTokenCacheTTL). Update the
TokenAuthenticator struct to hold the negative cache map and TTL constant, reuse
a.mu for synchronization, and ensure only successful authentications populate
the positive a.cache while failures populate the negative cache.
control-plane-operator/controllers/hostedcontrolplane/testdata/endpoint-resolver/zz_fixture_TestControlPlaneComponents_endpoint_resolver_deployment.yaml (1)

110-111: Consider hardening the container security context.

The security context only sets readOnlyRootFilesystem: true but is missing allowPrivilegeEscalation: false and runAsNonRoot: true. Since this is an auto-generated fixture, the fix should be applied in the component code that produces this deployment spec.

#!/bin/bash
# Find the component code that generates the endpoint-resolver deployment security context
rg -n -C5 'allowPrivilegeEscalation|runAsNonRoot' --type go -g '*endpoint_resolver*'
rg -n -C5 'SecurityContext' --type go -g '*endpoint_resolver*component*'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/endpoint-resolver/zz_fixture_TestControlPlaneComponents_endpoint_resolver_deployment.yaml`
around lines 110 - 111, The generated deployment's container securityContext
only sets readOnlyRootFilesystem; update the component code that constructs the
endpoint-resolver Deployment (e.g., the function that builds the
PodSpec/Container for the endpoint-resolver, such as
buildEndpointResolverDeployment/NewEndpointResolverDeployment/reconcileEndpointResolverDeployment)
to add allowPrivilegeEscalation: false and runAsNonRoot: true into the
container's SecurityContext alongside readOnlyRootFilesystem so the produced
fixture includes all three hardening fields.
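The hardening requested here (and in the other fixture comments on this PR) amounts to a container-level securityContext of roughly this shape; `seccompProfile` is an optional extra beyond what the findings strictly ask for:

```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
```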
control-plane-operator/controllers/hostedcontrolplane/v2/assets/endpoint-resolver/ca-cert.yaml (1)

1-8: LGTM — standard HyperShift PKI asset template. One design note: the CA private key (tls.key) and the CA public cert (tls.crt) share the same Secret. Any workload that needs the CA cert for trust verification also gains read access to the CA key if it can read this Secret. Ensure the RBAC Role for the endpoint-resolver ServiceAccount does not grant broader read access to this Secret than needed, and that components only mounting tls.crt (e.g., metrics-proxy CA trust) mount it from a projected volume or a separate read-only copy rather than the full CA Secret.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/endpoint-resolver/ca-cert.yaml`
around lines 1 - 8, The Secret endpoint-resolver-ca currently contains both
tls.crt and tls.key which risks exposing the CA private key to any pod that can
read the Secret; update RBAC and mounts so only the endpoint-resolver
ServiceAccount can access the full Secret and any consumers that only need the
CA cert (e.g., metrics-proxy) are given a separate Secret or a projected volume
containing only tls.crt, and tighten Roles/RoleBindings to remove wide read
access to endpoint-resolver-ca; search for usages of endpoint-resolver-ca, the
endpoint-resolver ServiceAccount, and metrics-proxy mounts to implement
least-privilege access and create a read-only cert-only Secret or projected
volume for consumers that do not require tls.key.
control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/route.yaml (1)

5-13: Add explicit port.targetPort to the Route for clarity and robustness.

The metrics-proxy Service currently declares a single port named "https". While the Route works correctly today, explicitly specifying the target port makes the routing intention self-documenting and prevents accidental misdirection if the Service gains additional ports in the future.

♻️ Suggested addition
 spec:
   host: ""
+  port:
+    targetPort: https
   tls:
     insecureEdgeTerminationPolicy: None
     termination: passthrough
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/route.yaml`
around lines 5 - 13, The Route should explicitly bind to the Service's "https"
port: add a spec.port block with targetPort set to the Service port name "https"
(e.g., spec.port: { targetPort: "https" }) so the Route's to: Service
(metrics-proxy) unambiguously forwards to the correct port if additional ports
are later added.
control-plane-operator/metrics-proxy/cmd.go (1)

54-54: Using cmd.Run instead of cmd.RunE means errors bypass cobra's error handling.

All error paths use os.Exit(1) directly, which skips deferred cleanup and prevents the parent command from handling errors. Consider using cmd.RunE and returning errors instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` at line 54, Replace the
anonymous handler assigned to cmd.Run with cmd.RunE so that errors are returned
to cobra instead of calling os.Exit(1) directly; update the handler (the
function currently wired as cmd.Run) to return an error value, remove direct
os.Exit calls (e.g., os.Exit(1)), and propagate any failures by returning
fmt.Errorf or wrapped errors so deferred cleanup runs and the parent
command/cobra can handle logging/exit behavior. Ensure you update any call sites
inside that handler (including error branches that currently call os.Exit) to
return the error instead.
control-plane-operator/metrics-proxy/proxy.go (1)

35-114: Stream encoded metrics to reduce memory footprint.

The current buffer can grow large for many targets. Streaming directly to ResponseWriter avoids extra memory without changing behavior.

♻️ Suggested streaming encoder change
@@
-import (
-	"bytes"
-	"fmt"
+import (
+	"fmt"
@@
-	// Merge, filter, and label results
-	format := expfmt.NewFormat(expfmt.TypeTextPlain)
-	var buf bytes.Buffer
-	encoder := expfmt.NewEncoder(&buf, format)
+	// Merge, filter, and label results
+	format := expfmt.NewFormat(expfmt.TypeTextPlain)
+	w.Header().Set("Content-Type", string(format))
+	w.WriteHeader(http.StatusOK)
+	encoder := expfmt.NewEncoder(w, format)
@@
-	w.Header().Set("Content-Type", string(format))
-	w.WriteHeader(http.StatusOK)
-	_, _ = w.Write(buf.Bytes())
+	// response streamed via encoder
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy.go` around lines 35 - 114, The
handler currently accumulates all encoded MetricFamily objects into a
bytes.Buffer (buf) which can grow large; instead create the expfmt.Encoder to
write directly to the http.ResponseWriter to stream results and avoid the
intermediate buffer. In ServeHTTP replace creation of encoder :=
expfmt.NewEncoder(&buf, format) with an encoder that wraps the ResponseWriter
after setting the Content-Type header (w.Header().Set(...)) and calling
w.WriteHeader(http.StatusOK) before encoding; keep the same loop over results,
logging result.Err and encoder.Encode errors (using h.log) but drop the final
w.Write(buf.Bytes()) step and remove buf entirely. Ensure you still set
Content-Type to string(format) and write the 200 status before the first write
from encoder so clients receive headers correctly.
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go (1)

123-144: Sort ConfigMap key items for deterministic reconciliation.

You already sort volume names; ref.items order can still vary based on monitor iteration order. Sorting keys before assigning ConfigMap.Items avoids noisy diffs.

♻️ Suggested ordering for ConfigMap items
@@
-			if len(ref.items) > 0 {
-				vol.VolumeSource.ConfigMap.Items = ref.items
-			}
+			if len(ref.items) > 0 {
+				sort.Slice(ref.items, func(i, j int) bool {
+					return ref.items[i].Key < ref.items[j].Key
+				})
+				vol.VolumeSource.ConfigMap.Items = ref.items
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go`
around lines 123 - 144, The ConfigMap volume items (ref.items) can be
non-deterministic and should be sorted before assignment to avoid noisy
reconciliation diffs; inside the loop over names where you build corev1.Volume
and before setting vol.VolumeSource.ConfigMap.Items = ref.items (the block
guarded by if len(ref.items) > 0), sort the slice (e.g.
sort.SliceStable(ref.items, func(i, j int) bool { return ref.items[i].Key <
ref.items[j].Key })) so items are consistently ordered every reconciliation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-119: The securityContext for the AROSwift metrics-proxy
fixture only sets readOnlyRootFilesystem and is missing the hardening fields
present in the GCP fixture; update the metrics-proxy Deployment's container
securityContext (the block containing readOnlyRootFilesystem) to also include
allowPrivilegeEscalation: false, capabilities.drop: [ALL], and runAsNonRoot:
true so it matches the GCP variant and enforces the same restrictions.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_role.yaml`:
- Around line 14-25: The Role currently grants get/list/watch on all secrets
(rules -> resources: secrets) which is too broad; update the Role to restrict
secret access by either specifying resourceNames for only the ServiceMonitor
tlsConfig secrets referenced (or a single dedicated cert secret name pattern)
instead of allowing all secrets, or replace the secrets permission with access
to a clearly named cert secret used by metrics-proxy; if dynamic access is
required, add an inline comment documenting why wildcard secret access is
necessary and list the ServiceMonitor names that justify it.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-119: The IBMCloud metrics-proxy fixture's container
securityContext is incomplete: update the securityContext block
(securityContext) for the metrics-proxy deployment to match the GCP variant by
adding allowPrivilegeEscalation: false, setting capabilities.drop: [ALL], and
runAsNonRoot: true so the container cannot elevate privileges and runs as
non-root; locate the metrics-proxy deployment YAML entry (securityContext key)
in the IBMCloud fixture and add these fields under that same securityContext.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml`:
- Around line 20-38: The metrics-proxy ControlPlaneComponent's status.resources
list is missing the RBAC objects; update the component so its manifest adapters
and managed resources include the Role and RoleBinding for metrics-proxy.
Specifically, add entries for kind: Role and kind: RoleBinding (matching the
metrics-proxy serviceAccount/name and expected names) to the status.resources
list and ensure the component's manifest adapter includes role.yaml and
rolebinding.yaml alongside the existing resources (e.g.,
metrics-proxy-serving-cert, metrics-proxy-config); keep the Resource names
consistent with the component name "metrics-proxy" so the ControlPlaneComponent
reconcile/cleanup will manage them.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-119: The container securityContext currently only sets
readOnlyRootFilesystem; update the container spec's securityContext (the same
block with readOnlyRootFilesystem) to also set allowPrivilegeEscalation: false
and runAsNonRoot: true and, optionally, a lowered capabilities set (e.g., drop:
["ALL"]) or a seccompProfile to meet baseline PodSecurity and silence
CKV_K8S_20/23; locate the securityContext under the metrics-proxy container in
the Deployment manifest and add these fields alongside readOnlyRootFilesystem.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_role.yaml`:
- Around line 20-21: The Secrets rule in the metrics-proxy RBAC role grants
unnecessary list and watch verbs; update the role definition (the rule that
references "secrets") to remove the "list" and "watch" verbs — either keep only
"get" if you want minimal future Secret reads or remove the Secrets rule
entirely since metrics-proxy reads TLS from mounted files via os.ReadFile() and
never calls the Secrets API. Locate the rule entry referencing "secrets" in the
metrics-proxy role and delete "list" and "watch" (or the whole secrets rule) to
enforce least privilege.
- Around line 15-25: Remove the core Endpoints permission from the metrics-proxy
Role and add the missing assets files under the metrics-proxy assets directory:
create role.yaml (with rules granting get/list/watch on services, secrets,
configmaps; discovery.k8s.io:endpointslices;
monitoring.coreos.com:servicemonitors as shown in the suggestion) and a matching
rolebinding.yaml that binds that Role to the metrics-proxy ServiceAccount;
update any references to the old Role if present. Locate the rationale by
checking endpoint_resolver_client.go (which shows metrics-proxy uses the
endpoint-resolver service) and place the new role.yaml and rolebinding.yaml
alongside other components’ assets (e.g., cluster-autoscaler) so the operator
can load them.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/endpoint-resolver/deployment.yaml`:
- Around line 1-73: The container "endpoint-resolver" in the Deployment manifest
lacks a securityContext; add a container-level securityContext under the
container entry for the endpoint-resolver container to harden it: set
allowPrivilegeEscalation: false, runAsNonRoot: true, drop all capabilities
(e.g., capabilities.drop: ["ALL"]), and enable readOnlyRootFilesystem: true (and
optionally set a non-root runAsUser like 1000) so derived manifests inherit
hardened defaults and address CKV_K8S_20/23.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`:
- Around line 19-76: Add a hardened container securityContext to the
metrics-proxy container (name: metrics-proxy) by setting runAsNonRoot: true,
allowPrivilegeEscalation: false, and dropping all capabilities
(capabilities.drop: ["ALL"]); apply these fields under the container's
securityContext in the Deployment spec and update all test fixtures in
testdata/metrics-proxy/ to match the new manifest so CI/tests remain consistent.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 74-77: The code in scrape_config.go defaults endpoint scheme to
"https" which diverges from Prometheus Operator; update both places where a
local variable is set from ep.Scheme (the blocks that declare scheme := "https"
and later the similar block around lines ~138–141) to default to "http" instead,
i.e. set scheme := "http" when ep.Scheme is nil so that the logic in the
endpoint processing (the ep.Scheme checks) matches Prometheus Operator defaults.
- Around line 121-135: The code only checks PodMetricsEndpoint.Port and skips
when empty, but PodMetricsEndpoint also supports PortNumber and TargetPort with
a defined precedence (Port, then PortNumber, then TargetPort); update the logic
around the ep/portName handling so you: 1) inspect ep.Port first (string
pointer), 2) if nil/empty use ep.PortNumber (int32) and convert it to a string
(or treat as a numeric port when calling resolveDeploymentPort), 3) if still
unset use ep.TargetPort (intstr) and extract either its numeric value or string
name appropriately, and then call resolveDeploymentPort or use the numeric port
directly as needed; adjust the variables referenced (ep, portName,
resolveDeploymentPort, cpContext, namespace, pm.Name) so PodMonitors that
provided portNumber or targetPort are not skipped.
- Around line 216-225: The code currently returns the first Service name when
the ServiceMonitor selector matches multiple Services; change this to error out
instead: after listing services into svcList (via cpContext.Client.List) check
if len(svcList.Items) == 0 and return the existing "no service" error, and
additionally check if len(svcList.Items) > 1 and return a descriptive error
mentioning smName and the count (or the conflicting service names) so the caller
rejects non-unique selectors; update the return path so only when exactly one
item exists you return svcList.Items[0].Name.
- Around line 241-246: The loop over svc.Spec.Ports currently falls back to
p.Port when p.TargetPort.IntValue() == 0 which incorrectly handles named
targetPorts; update the logic in the loop that inspects p (the Port object) to
explicitly detect when p.TargetPort is a named string (i.e., p.TargetPort.Type
== intstr.String) and return an error instead of using the Service
port, or alternatively resolve it from the pod container ports; specifically,
check p.TargetPort.Type (or equivalent) for a string name and return a clear
error mentioning the named targetPort (referencing p.TargetPort, p.Name, and
p.Port in the current loop) so callers don't mistakenly connect to the Service
port when scraping pod IPs.

In `@control-plane-operator/endpoint-resolver/cmd.go`:
- Around line 26-30: The logger variable l in NewStartCommand is created before
the global logger is configured, so calls using l will use the stale default
logger; fix by configuring the global logger first (call
log.SetLogger(zap.New(...))) and then derive l
(log.Log.WithName("endpoint-resolver")) or alternatively replace uses of l with
fresh log.Log.WithName(...) after SetLogger; update NewStartCommand so
log.SetLogger(zap.New(...)) runs before creating l to ensure the JSON encoder is
used.

In `@control-plane-operator/metrics-proxy/cmd.go`:
- Around line 125-132: The shutdown goroutine calls server.Shutdown with
context.Background(), which can block forever; change it to create a context
with a deadline (e.g., using time.After or time.WithTimeout) matching the
operator's terminationGracePeriodSeconds and pass that context to
server.Shutdown, then cancel the context when done; update the goroutine around
sig := <-sigChan / cancel() / server.Shutdown(...) to create ctx, defer
cancel(), call server.Shutdown(ctx) and handle the timed-out shutdown error;
also add "time" to imports.
- Around line 62-63: The created cancellable context is discarded (ctx is set to
_) so cancel() does nothing; change the code to capture the returned context
(e.g., ctx, cancel := context.WithCancel(cmd.Context())), defer cancel(), and
propagate that ctx into any components that should observe shutdown (e.g., pass
ctx into ProxyHandler creation/Start and Scraper initialization/Run or any
HTTP/server handlers) so they can cancel in-flight work when cancel() is invoked
from the signal handler; if you intentionally don't need cancellation, remove
the WithCancel and cancel() call instead.

In `@control-plane-operator/metrics-proxy/proxy_test.go`:
- Around line 73-75: The test uses a JWT-looking string in the table-driven test
case (the test named "When Authorization header contains Bearer token with
special characters, it should return the token") for the fields authorization
and expected; replace both values with a clearly non-JWT placeholder that still
contains special characters (e.g. "token-ABC_123+~/.!@#") so the parsing
behavior is preserved but secret scanners won't flag it. Update the test case in
proxy_test.go (the table entry that sets authorization and expected) to use the
placeholder consistently.
- Around line 79-82: The subtest uses the loop variable `tt` while calling
`t.Parallel()`, which can cause races; capture the loop variable before starting
the parallel subtest by introducing a local copy (e.g., `tc := tt` or `tt :=
tt`) immediately inside the loop and use that copy in the `t.Run` closure and
throughout the subtest (references: the loop over `tests`, the `tt` variable,
`t.Run` and `t.Parallel` in proxy_test.go).

---

Minor comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml`:
- Around line 1-19: This fixture is duplicated; remove the misplaced copy so
tests are unambiguous: decide which location is authoritative (the root-level
fixture or the AROSwift/ copy) and delete the other file (the duplicate
ConfigMap with metadata.name "metrics-proxy-config" and namespace
"hcp-namespace"), then update any test references to point to the retained
fixture and ensure ownerReference to HostedControlPlane/name "hcp" remains
correct.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`:
- Around line 18-22: The generated RouteTargetReference is leaving Weight as a
nil pointer which serializes to `weight: null`; update support/util/route.go so
both ReconcileExternalRoute and ReconcileInternalRoute explicitly set
RouteTargetReference.Weight to a concrete default (e.g. pointer.Int32(100)) when
constructing the target reference instead of leaving it nil, and add the
required import for the pointer helper (or use your project’s equivalent) so the
field is always a non-nil pointer with the documented default.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml`:
- Around line 1-23: There are duplicate Service fixtures for the Service named
"metrics-proxy" (metadata.name: metrics-proxy, kind: Service, namespace:
hcp-namespace) placed both at the root-level test fixture and under
platform-specific subdirectories; remove the duplicates by either deleting the
root-level fixture if it's not platform-agnostic or deleting the
platform-specific copies if the root fixture should apply to all platforms, and
then run the test harness to confirm only a single "metrics-proxy" Service
fixture is discovered to avoid shadowing; ensure the remaining fixture's fields
(ownerReferences, namespace, ports) are correct for all platforms that will use
it.

In `@control-plane-operator/endpoint-resolver/server_test.go`:
- Around line 131-135: The test compares response.Pods to tt.expectedPods
positionally but cache.NewIndexer/List() returns non-deterministic order; to
fix, sort both slices by Name before the index-by-index comparison (or convert
them to maps keyed by Name for order-independent lookup). Update the test in
server_test.go to import "sort" and apply sort.Slice on response.Pods and
tt.expectedPods (comparing Pod.Name) prior to the existing loop that checks
response.Pods[i].Name/ IP so assertions are stable.

In `@control-plane-operator/endpoint-resolver/server.go`:
- Around line 82-85: The response is being encoded directly to the
ResponseWriter (json.NewEncoder(w).Encode(ResolveResponse{Pods: pods})) so if
encoding fails after a partial write the status is already sent and http.Error
cannot set 500; instead marshal/encode into a temporary buffer (e.g.,
json.Marshal or json.NewEncoder(&buf).Encode) and on success set Content-Type
and write the full buffer to w, and on marshal/encode error call http.Error with
StatusInternalServerError; update the code around ResolveResponse and the json
encoding call to use the buffer-first approach.

In `@control-plane-operator/metrics-proxy/cmd.go`:
- Line 65: The code reads namespace := os.Getenv("MY_NAMESPACE") but doesn't
validate it; add a check right after reading namespace and if it's empty log a
clear error and exit non-zero (or return an error from main) before calling
NewLabeler(""), so we never construct NewLabeler with an empty namespace;
reference the namespace variable and NewLabeler("") when adding the fail-fast
validation and exit path.

In `@control-plane-operator/metrics-proxy/config_reader_test.go`:
- Around line 98-105: The test currently ignores errors from r.Load(), which can
produce false positives; update the test in config_reader_test.go to check the
error returned by ConfigFileReader.Load() (called via r.Load()) and call
t.Fatalf with the error if it is non-nil before calling
r.GetComponent("nonexistent"), so the test fails fast on parse/load errors
rather than silently passing.

In `@control-plane-operator/metrics-proxy/proxy_test.go`:
- Around line 58-71: The tests in proxy_test.go that expect tokens containing
spaces to be accepted violate RFC 6750; update the table entries for the cases
with authorization "Bearer token with spaces" (test name: "When Authorization
header has token with spaces...") and "Bearer   " (test name: "When
Authorization header has multiple spaces after Bearer...") to expect an empty
string instead of a non-empty token, and update their test descriptions to state
they should be rejected; also add (or ensure presence of) a positive test using
a valid RFC6750 b64token (e.g., "Bearer abcDEF123+-_/~.") to assert correct
extraction, and if the extraction function (e.g., parseBearerToken /
getBearerToken) does not already validate token characters, update that function
to enforce the RFC6750 token regexp (no whitespace) so the tests pass.

---

Duplicate comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`:
- Around line 14-22: The fixture shows a generated Route containing "weight:
null" and a missing "spec.host"—fix the shared route generation logic (e.g., the
function that builds routes such as generateRouteSpec / buildRoute /
NewRouteForService) so it omits the "weight" field when no weight is set (don’t
emit null) and always populates spec.host for routes like "metrics-proxy"
(derive host from the service/cluster config or pass through the provided host
value); update the generator to conditionally add the weight only when non-nil
and to compute/set spec.host before emitting the route fixture.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml`:
- Around line 1-19: The test fixture ConfigMap (metadata.name:
metrics-proxy-config) contains a duplicate of the AROSwift fixture (same
config.yaml with endpointResolver and components) which causes confusion; locate
the duplicate file (the one in the metrics-proxy testdata root) and remove the
copy that the test runner does not consume, or consolidate so only the consumed
fixture remains; ensure any tests referencing metrics-proxy-config still point
to the remaining fixture and update test references if necessary.

---

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/endpoint-resolver/zz_fixture_TestControlPlaneComponents_endpoint_resolver_ca_secret.yaml`:
- Around line 2-4: The fixture currently contains empty tls.crt and tls.key
which lets regressions slip; update the fixture so that data.tls.crt and
data.tls.key contain short, valid base64-encoded placeholders (non-empty dummy
TLS data) instead of empty strings, e.g. simple base64-encoded PEM-like
placeholders for both keys to ensure reconciliation/tests validate presence and
non-empty content of the certificate and private key.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`:
- Around line 22-25: The fixture currently serializes spec.to.weight as null (a
nil *int32), producing an explicit "weight: null" entry; change the producing
code so spec.to.weight is set to a concrete int32 value (e.g., 100) instead of
left nil so the generated Service "metrics-proxy" entry contains weight: 100;
locate where the route/traffic target is built (the struct that sets
spec.to.weight) and initialize the pointer to int32(100) rather than leaving it
nil.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/endpoint-resolver/ca-cert.yaml`:
- Around line 1-8: The Secret endpoint-resolver-ca currently contains both
tls.crt and tls.key which risks exposing the CA private key to any pod that can
read the Secret; update RBAC and mounts so only the endpoint-resolver
ServiceAccount can access the full Secret and any consumers that only need the
CA cert (e.g., metrics-proxy) are given a separate Secret or a projected volume
containing only tls.crt, and tighten Roles/RoleBindings to remove wide read
access to endpoint-resolver-ca; search for usages of endpoint-resolver-ca, the
endpoint-resolver ServiceAccount, and metrics-proxy mounts to implement
least-privilege access and create a read-only cert-only Secret or projected
volume for consumers that do not require tls.key.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/route.yaml`:
- Around line 5-13: The Route should explicitly bind to the Service's "https"
port: add a spec.port block with targetPort set to the Service port name "https"
(e.g., spec.port: { targetPort: "https" }) so the Route's to: Service
(metrics-proxy) unambiguously forwards to the correct port if additional ports
are later added.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go`:
- Around line 123-144: The ConfigMap volume items (ref.items) can be
non-deterministic and should be sorted before assignment to avoid noisy
reconciliation diffs; inside the loop over names where you build corev1.Volume
and before setting vol.VolumeSource.ConfigMap.Items = ref.items (the block
guarded by if len(ref.items) > 0), sort the slice (e.g.
sort.SliceStable(ref.items, func(i, j int) bool { return ref.items[i].Key <
ref.items[j].Key })) so items are consistently ordered every reconciliation.

In `@control-plane-operator/endpoint-resolver/cmd.go`:
- Around line 104-117: The shutdown goroutine currently calls
server.Shutdown(context.Background()), which can block indefinitely; instead,
create a timeout context (e.g., via context.WithTimeout with a 30s duration),
pass it to server.Shutdown, and cancel it after Shutdown returns. Also remove
the dead `_ = ctx` statement, which is no longer needed (the outer ctx is
already used elsewhere). Refer to the shutdown goroutine and the
server.Shutdown call to locate where to apply the timeout context and where to
remove the `_ = ctx` line.

In `@control-plane-operator/metrics-proxy/auth.go`:
- Around line 42-83: Add short-lived negative caching to
TokenAuthenticator.Authenticate so repeated invalid tokens don't trigger
TokenReview calls: before creating the review, check a new negative cache (e.g.,
a.negativeCache keyed by hash with expiry, protected by a.mu) and return false
if present; after receiving a review where review.Status.Authenticated is false,
insert the token hash into that negative cache with a small TTL (e.g.,
failedTokenCacheTTL). Update the TokenAuthenticator struct to hold the negative
cache map and TTL constant, reuse a.mu for synchronization, and ensure only
successful authentications populate the positive a.cache while failures populate
the negative cache.

In `@control-plane-operator/metrics-proxy/cmd.go`:
- Line 54: Replace the anonymous handler assigned to cmd.Run with cmd.RunE so
that errors are returned to cobra instead of calling os.Exit(1) directly; update
the handler (the function currently wired as cmd.Run) to return an error value,
remove direct os.Exit calls (e.g., os.Exit(1)), and propagate any failures by
returning fmt.Errorf or wrapped errors so deferred cleanup runs and the parent
command/cobra can handle logging/exit behavior. Ensure you update any call sites
inside that handler (including error branches that currently call os.Exit) to
return the error instead.

In `@control-plane-operator/metrics-proxy/proxy.go`:
- Around line 35-114: The handler currently accumulates all encoded MetricFamily
objects into a bytes.Buffer (buf) which can grow large; instead create the
expfmt.Encoder to write directly to the http.ResponseWriter to stream results
and avoid the intermediate buffer. In ServeHTTP replace creation of encoder :=
expfmt.NewEncoder(&buf, format) with an encoder that wraps the ResponseWriter
after setting the Content-Type header (w.Header().Set(...)) and calling
w.WriteHeader(http.StatusOK) before encoding; keep the same loop over results,
logging result.Err and encoder.Encode errors (using h.log) but drop the final
w.Write(buf.Bytes()) step and remove buf entirely. Ensure you still set
Content-Type to string(format) and write the 200 status before the first write
from encoder so clients receive headers correctly.

@muraee muraee force-pushed the feat-metrics-proxy branch from 4ba7d9f to bcf2d81 Compare February 24, 2026 11:46

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (6)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go (4)

74-77: ⚠️ Potential issue | 🟠 Major

Default scheme should be http to match Prometheus Operator defaults.

When scheme is omitted, Prometheus Operator assumes http. Defaulting to https here will misconfigure monitors that rely on that default.

✅ Proposed fix
-		scheme := "https"
+		scheme := "http"
 		if ep.Scheme != nil {
 			scheme = ep.Scheme.String()
 		}
@@
-		scheme := "https"
+		scheme := "http"
 		if ep.Scheme != nil {
 			scheme = ep.Scheme.String()
 		}

Also applies to: 138-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 74 - 77, The code defaults the Prometheus scrape endpoint scheme to
"https" which contradicts Prometheus Operator's default of "http"; update the
default value from "https" to "http" wherever the scheme is derived (e.g., the
local variable named scheme that's assigned from ep.Scheme in scrape_config.go)
and make the same change in the other occurrence mentioned (the block around the
second ep.Scheme usage at the later lines) so that when ep.Scheme is nil the
code uses "http" instead of "https".

121-133: ⚠️ Potential issue | 🟠 Major

PodMonitor port resolution ignores portNumber/targetPort.

PodMetricsEndpoint supports port, portNumber, and targetPort (with defined precedence). Only checking port skips valid monitors or resolves the wrong port.

✅ Proposed fix (honor precedence and numeric ports)
-		portName := ""
-		if ep.Port != nil {
-			portName = *ep.Port
-		}
-		if portName == "" {
-			log.V(4).Info("skipping PodMonitor: no port name", "podMonitor", pm.Name)
-			continue
-		}
-
-		// Resolve the port number from the component's Deployment.
-		// The PodMonitor name matches the component name by CPOv2 convention.
-		port, err := resolveDeploymentPort(cpContext, namespace, pm.Name, portName)
+		var (
+			portName   string
+			portNumber *int32
+		)
+		if ep.Port != nil && *ep.Port != "" {
+			portName = *ep.Port
+		} else if ep.PortNumber != nil {
+			portNumber = ep.PortNumber
+		} else if ep.TargetPort != nil {
+			if ep.TargetPort.IntValue() > 0 {
+				v := int32(ep.TargetPort.IntValue())
+				portNumber = &v
+			} else {
+				portName = ep.TargetPort.String()
+			}
+		}
+		if portName == "" && portNumber == nil {
+			log.V(4).Info("skipping PodMonitor: no port reference", "podMonitor", pm.Name)
+			continue
+		}
+
+		var port int32
+		var err error
+		if portNumber != nil {
+			port = *portNumber
+		} else {
+			// The PodMonitor name matches the component name by CPOv2 convention.
+			port, err = resolveDeploymentPort(cpContext, namespace, pm.Name, portName)
+		}
 		if err != nil {
 			log.V(4).Info("skipping PodMonitor: port not resolvable", "podMonitor", pm.Name, "port", portName, "error", err)
 			continue
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 121 - 133, The current PodMonitor handling only checks ep.Port
(named port) and skips ep.PortNumber/ep.TargetPort; update the selection logic
for PodMetricsEndpoint (ep) to follow Prometheus precedence: prefer ep.Port
(string) if set, else use ep.PortNumber (numeric) if set, else use ep.TargetPort
(which may be IntOrString) and handle string vs numeric forms; pass the resolved
port name or number into resolveDeploymentPort (or extend resolveDeploymentPort
to accept numeric ports) and ensure numeric target ports are treated as numbers
while named target ports are resolved against the Deployment container ports;
keep references to PodMetricsEndpoint ep, variables
portName/portNumber/targetPort and the resolveDeploymentPort call so the change
is localized.

216-225: ⚠️ Potential issue | 🟠 Major

Reject non-unique ServiceMonitor selectors.

Returning the first match is ambiguous and relies on non-guaranteed list ordering. Fail fast when multiple Services match a selector.

✅ Proposed fix
 	if len(svcList.Items) == 0 {
 		return "", fmt.Errorf("no service found for ServiceMonitor %s", smName)
 	}
+	if len(svcList.Items) > 1 {
+		return "", fmt.Errorf("multiple services (%d) match ServiceMonitor %s; selector must be unique", len(svcList.Items), smName)
+	}
 
 	return svcList.Items[0].Name, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 216 - 225, The current lookup in scrape_config.go uses
cpContext.Client.List to populate svcList and then returns svcList.Items[0].Name
which hides ambiguous selectors; change the logic in the block that calls
cpContext.Client.List (using client.MatchingLabelsSelector{Selector: sel}) to
explicitly check for multiple matches: if svcList.Items is empty return the
existing no-service error for ServiceMonitor smName, and if len(svcList.Items) >
1 return a new error indicating a non-unique ServiceMonitor selector (include
smName and the selector info) instead of returning the first item; only return
svcList.Items[0].Name when exactly one Service is found.

241-246: ⚠️ Potential issue | 🟠 Major

Named targetPort resolves to the wrong port for pod scraping.

If targetPort is a named string, falling back to the Service port will point at the wrong pod port when scraping pod IPs. Fail fast or resolve the container port name explicitly.

✅ Proposed fix (fail fast on named targetPort)
+	"k8s.io/apimachinery/pkg/util/intstr"
@@
 	for _, p := range svc.Spec.Ports {
 		if p.Name == portName || p.TargetPort.String() == portName {
+			if p.TargetPort.Type == intstr.String {
+				return 0, fmt.Errorf("service %s port %q uses named targetPort %q; resolve pod port before scraping", serviceName, p.Name, p.TargetPort.String())
+			}
 			if p.TargetPort.IntValue() > 0 {
 				return int32(p.TargetPort.IntValue()), nil
 			}
 			return p.Port, nil
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 241 - 246, The loop over svc.Spec.Ports incorrectly falls back to
Service port when p.TargetPort is a named string, which yields the wrong pod
port for scraping; update the logic in the loop that checks p.TargetPort (inside
the scrape config resolution using portName and svc.Spec.Ports) to detect a
named TargetPort (e.g., check p.TargetPort.Type/StrVal, or that IntValue() == 0
and StrVal != ""), and in that case return an error (fail fast) instead of
returning p.Port; alternatively, resolve the container port name explicitly by
looking up the container port matching the name. Prefer failing fast by
returning an informative error from the function when encountering a named
targetPort.
control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml (1)

19-53: ⚠️ Potential issue | 🟠 Major

Add a hardened container securityContext for metrics-proxy.

Without an explicit securityContext, the container may run as root and allow privilege escalation. Please harden it and update the generated fixtures in testdata/metrics-proxy/ so tests stay consistent.

🔒 Proposed hardening
         image: controlplane-operator
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          capabilities:
+            drop:
+            - "ALL"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`
around lines 19 - 53, Add a hardened pod/container securityContext for the
metrics-proxy container: in the container block for name: metrics-proxy set
securityContext to enforce non-root execution (runAsNonRoot: true and a non-zero
runAsUser), disable privilege escalation (allowPrivilegeEscalation: false), drop
all Linux capabilities, enable a readOnlyRootFilesystem: true, and set
seccompProfile to RuntimeDefault; then regenerate/update the fixtures under
testdata/metrics-proxy/ so tests match the new generated YAML. Ensure you modify
the container named "metrics-proxy" (and any generator code that produces it) so
these fields are present in the produced deployment manifest and fixture files.
control-plane-operator/metrics-proxy/proxy_test.go (1)

73-75: Replace JWT-looking test token to avoid secret scanners.

This token format is commonly flagged by scanners. Please switch to a non‑JWT placeholder with special characters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy_test.go` around lines 73 - 75, The
test case with name "When Authorization header contains Bearer token with
special characters, it should return the token" uses a JWT-looking token in the
authorization and expected fields; replace both values with a non-JWT
placeholder that still includes special characters (e.g., "Bearer
token-with.special_chars+123!@#") so scanners won't flag it but the test still
validates special-character handling for the functions under test (look for the
test case row using the authorization and expected variables in proxy_test.go).
🧹 Nitpick comments (1)
control-plane-operator/metrics-proxy/cmd.go (1)

26-29: Logger acquired before logger is configured — swap the order for clarity.

l is derived from log.Log on line 26, then the underlying logger is set on line 27. It works today because controller-runtime uses a delegating sink, but the ordering is semantically backwards and fragile if any log call were ever inserted between the two lines.

♻️ Suggested reorder
-	l := log.Log.WithName("metrics-proxy")
-	log.SetLogger(zap.New(zap.JSONEncoder(func(o *zapcore.EncoderConfig) {
+	log.SetLogger(zap.New(zap.JSONEncoder(func(o *zapcore.EncoderConfig) {
 		o.EncodeTime = zapcore.RFC3339TimeEncoder
 	})))
+	l := log.Log.WithName("metrics-proxy")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` around lines 26 - 29, Swap the
logger setup and acquisition so you call log.SetLogger(zap.New(...)) before
deriving l from log.Log; specifically, move the zap.New(zap.JSONEncoder(func(o
*zapcore.EncoderConfig) { o.EncodeTime = zapcore.RFC3339TimeEncoder })) call
into log.SetLogger first, then get l := log.Log.WithName("metrics-proxy"), and
ensure no other log calls occur between log.SetLogger and l to avoid using the
delegating sink prematurely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@control-plane-operator/metrics-proxy/auth_test.go`:
- Around line 154-204: The kasCallCount counter in the test has a data race
because it's incremented inside the httptest handler goroutine and read later by
the test goroutine; change kasCallCount to an atomic integer (e.g., int32 or
int64) and replace the direct increment and reads with sync/atomic operations
(use atomic.AddInt32/AddInt64 in the handler and atomic.LoadInt32/LoadInt64
where the test checks kasCallCount) so the handler in the httptest.NewTLSServer
and the test's cache-hit assertions are properly synchronized; alternatively use
a sync.Mutex to protect kasCallCount, but prefer atomic for a simple counter.

In `@control-plane-operator/metrics-proxy/cmd.go`:
- Around line 113-120: The http.Server created in the server variable (Addr
fmt.Sprintf(":%d", servingPort), Handler mux, TLSConfig with tlsCert) has no
timeouts; add ReadHeaderTimeout, ReadTimeout and WriteTimeout fields on that
http.Server with sensible values (e.g. ReadHeaderTimeout ~5s, ReadTimeout ~10s,
WriteTimeout ~10s or tuned for your use) and import the "time" package; update
the same server construction site to include these timeout fields so the exposed
TLS endpoint is protected from slowloris/goroutine exhaustion.
- Line 65: The code reads MY_NAMESPACE into the variable namespace via
os.Getenv("MY_NAMESPACE") but does not validate it; update the code in cmd.go
(where namespace is set in main or the init path) to check if namespace == ""
and either log a clear warning or return/exit with an error so empty namespace
labels cannot be emitted. Refer to the existing namespace variable and the
os.Getenv("MY_NAMESPACE") call, and use the project's logger (or fmt/log.Fatalf)
to emit the message and stop startup when MY_NAMESPACE is missing.
- Line 50: The metricsSet flag is accepted without validation and casting via
metrics.MetricsSet(metricsSet) allows invalid values through to the
*RelabelConfigs functions in support/metrics/sets.go, which then hit the default
DROP behavior; after parsing flags in cmd (where
cmd.Flags().StringVar(&metricsSet,...)) add explicit validation of metricsSet
(e.g., only allow "All", "Telemetry", "SRE" or the exact enum values expected by
metrics.MetricsSet), and if the value is not one of the allowed canonical values
log an error and exit (or return a non-zero status); update the code that calls
metrics.MetricsSet(metricsSet) to only do so after this validation so invalid
inputs are rejected rather than silently dropped.

In `@control-plane-operator/metrics-proxy/filter.go`:
- Around line 54-65: The Apply path currently returns nil (no filter) when
regexp.Compile fails, which widens exposure; instead, on compile error build and
cache a "match-nothing" fallback filter and return that so failing to compile
closes access. Specifically, in the block that calls
getKeepRegexForComponent(componentName, f.metricsSet) and then
regexp.Compile("^(" + regexStr + ")$"), on err create a compiled regexp that
matches nothing (e.g., a pattern that can never match), store that value in
f.cache[componentName] under the same locking (f.mu.Lock()/Unlock()), and return
the cached match-nothing filter from Apply so compilation failures don't return
all metrics.

In `@control-plane-operator/metrics-proxy/proxy.go`:
- Around line 95-114: The current loop over results (variables results and
targets) logs individual scrape errors but still returns 200 with the buffer
even when everything failed; update the handler to track successful encodes
(e.g., increment a success counter each time you successfully encode a metric
family after calling h.filter.Apply and h.labeler.Inject and encoder.Encode) and
after the loop check that counter—if it is zero, write a 502 response (use
w.WriteHeader(http.StatusBadGateway) or http.Error with a short message) and do
not return the empty 200 body; otherwise proceed to set Content-Type and return
the encoded buffer as before.

In `@control-plane-operator/metrics-proxy/scraper.go`:
- Around line 71-77: In Scraper.scrape, ensure metricsPath is normalized to
start with a "/" before building the URL (e.g., check the metricsPath variable
and prefix with "/" when missing) so fmt.Sprintf("%s://%s:%d%s", scheme,
target.PodIP, target.Port, metricsPath) never produces a malformed URL; update
the logic in the scrape method to modify metricsPath (or use a local normalized
variable) before formatting the URL.

---

Duplicate comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`:
- Around line 19-53: Add a hardened pod/container securityContext for the
metrics-proxy container: in the container block for name: metrics-proxy set
securityContext to enforce non-root execution (runAsNonRoot: true and a non-zero
runAsUser), disable privilege escalation (allowPrivilegeEscalation: false), drop
all Linux capabilities, enable a readOnlyRootFilesystem: true, and set
seccompProfile to RuntimeDefault; then regenerate/update the fixtures under
testdata/metrics-proxy/ so tests match the new generated YAML. Ensure you modify
the container named "metrics-proxy" (and any generator code that produces it) so
these fields are present in the produced deployment manifest and fixture files.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 74-77: The code defaults the Prometheus scrape endpoint scheme to
"https" which contradicts Prometheus Operator's default of "http"; update the
default value from "https" to "http" wherever the scheme is derived (e.g., the
local variable named scheme that's assigned from ep.Scheme in scrape_config.go)
and make the same change in the other occurrence mentioned (the block around the
second ep.Scheme usage at the later lines) so that when ep.Scheme is nil the
code uses "http" instead of "https".
- Around line 121-133: The current PodMonitor handling only checks ep.Port
(named port) and skips ep.PortNumber/ep.TargetPort; update the selection logic
for PodMetricsEndpoint (ep) to follow Prometheus precedence: prefer ep.Port
(string) if set, else use ep.PortNumber (numeric) if set, else use ep.TargetPort
(which may be IntOrString) and handle string vs numeric forms; pass the resolved
port name or number into resolveDeploymentPort (or extend resolveDeploymentPort
to accept numeric ports) and ensure numeric target ports are treated as numbers
while named target ports are resolved against the Deployment container ports;
keep references to PodMetricsEndpoint ep, variables
portName/portNumber/targetPort and the resolveDeploymentPort call so the change
is localized.
- Around line 216-225: The current lookup in scrape_config.go uses
cpContext.Client.List to populate svcList and then returns svcList.Items[0].Name
which hides ambiguous selectors; change the logic in the block that calls
cpContext.Client.List (using client.MatchingLabelsSelector{Selector: sel}) to
explicitly check for multiple matches: if svcList.Items is empty return the
existing no-service error for ServiceMonitor smName, and if len(svcList.Items) >
1 return a new error indicating a non-unique ServiceMonitor selector (include
smName and the selector info) instead of returning the first item; only return
svcList.Items[0].Name when exactly one Service is found.
- Around line 241-246: The loop over svc.Spec.Ports incorrectly falls back to
Service port when p.TargetPort is a named string, which yields the wrong pod
port for scraping; update the logic in the loop that checks p.TargetPort (inside
the scrape config resolution using portName and svc.Spec.Ports) to detect a
named TargetPort (e.g., check p.TargetPort.Type/StrVal, or that IntValue() == 0
and StrVal != ""), and in that case return an error (fail fast) instead of
returning p.Port; alternatively, resolve the container port name explicitly by
looking up the container port matching the name. Prefer failing fast by
returning an informative error from the function when encountering a named
targetPort.

In `@control-plane-operator/metrics-proxy/proxy_test.go`:
- Around line 73-75: The test case with name "When Authorization header contains
Bearer token with special characters, it should return the token" uses a
JWT-looking token in the authorization and expected fields; replace both values
with a non-JWT placeholder that still includes special characters (e.g., "Bearer
token-with.special_chars+123!@#") so scanners won't flag it but the test still
validates special-character handling for the functions under test (look for the
test case row using the authorization and expected variables in proxy_test.go).

---

Nitpick comments:
In `@control-plane-operator/metrics-proxy/cmd.go`:
- Around line 26-29: Swap the logger setup and acquisition so you call
log.SetLogger(zap.New(...)) before deriving l from log.Log; specifically, move
the zap.New(zap.JSONEncoder(func(o *zapcore.EncoderConfig) { o.EncodeTime =
zapcore.RFC3339TimeEncoder })) call into log.SetLogger first, then get l :=
log.Log.WithName("metrics-proxy"), and ensure no other log calls occur between
log.SetLogger and l to avoid using the delegating sink prematurely.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4ba7d9f and bcf2d81.

📒 Files selected for processing (37)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/ca-cert.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/scrape-config.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serving-cert.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/certs.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/component.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/route.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/main.go
  • control-plane-operator/metrics-proxy/auth.go
  • control-plane-operator/metrics-proxy/auth_test.go
  • control-plane-operator/metrics-proxy/cmd.go
  • control-plane-operator/metrics-proxy/config_reader.go
  • control-plane-operator/metrics-proxy/config_reader_test.go
  • control-plane-operator/metrics-proxy/endpoint_resolver_client.go
  • control-plane-operator/metrics-proxy/endpoint_resolver_client_test.go
  • control-plane-operator/metrics-proxy/filter.go
  • control-plane-operator/metrics-proxy/filter_test.go
  • control-plane-operator/metrics-proxy/labels.go
  • control-plane-operator/metrics-proxy/labels_test.go
  • control-plane-operator/metrics-proxy/proxy.go
  • control-plane-operator/metrics-proxy/proxy_test.go
  • control-plane-operator/metrics-proxy/scraper.go
  • control-plane-operator/metrics-proxy/tls_util.go
  • control-plane-operator/metrics-proxy/types.go
🚧 Files skipped from review as they are similar to previous changes (19)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/route.yaml
  • control-plane-operator/metrics-proxy/endpoint_resolver_client.go
  • control-plane-operator/metrics-proxy/tls_util.go
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/metrics-proxy/types.go
  • control-plane-operator/main.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/metrics-proxy/config_reader.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/ca-cert.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/metrics-proxy/endpoint_resolver_client_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/scrape-config.yaml
  • control-plane-operator/metrics-proxy/config_reader_test.go
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serving-cert.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serviceaccount.yaml

@muraee muraee force-pushed the feat-metrics-proxy branch from bcf2d81 to f06fba4 Compare February 24, 2026 12:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (13)
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

118-119: Harden container securityContext — same gap as the default and GCP variants.

Same issue as the default fixture: allowPrivilegeEscalation: false, runAsNonRoot: true, and capabilities.drop: [ALL] are missing, while the GCP variant reportedly includes them. This should be fixed in the component's Go code so all generated fixtures are consistent.

Suggested fix
         securityContext:
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop:
+            - ALL
           readOnlyRootFilesystem: true
+          runAsNonRoot: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 118 - 119, The container spec's securityContext in the
metrics-proxy fixture only sets readOnlyRootFilesystem and is missing hardening
fields; update the code that builds the securityContext (the logic that emits
securityContext for the metrics-proxy Deployment) to include
allowPrivilegeEscalation: false, runAsNonRoot: true, and capabilities.drop:
[ALL] alongside readOnlyRootFilesystem so generated fixtures (and the
component's Go generator) produce the same hardened SecurityContext as the GCP
variant; locate the function that constructs the Deployment/securityContext for
metrics-proxy and add these fields to the emitted securityContext structure.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

118-119: Harden container securityContext — missing allowPrivilegeEscalation: false, runAsNonRoot: true, and capability drops.

Checkov CKV_K8S_20/CKV_K8S_23 flag this. The GCP variant of this same fixture reportedly includes these fields, so this default variant should be consistent.

Suggested fix
         securityContext:
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop:
+            - ALL
           readOnlyRootFilesystem: true
+          runAsNonRoot: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 118 - 119, Add the missing hardened container securityContext
fields: set allowPrivilegeEscalation: false, runAsNonRoot: true, and add a
capabilities block that drops all capabilities (capabilities: { drop: ["ALL"] })
under the existing securityContext entry so this fixture matches the GCP
variant; update the securityContext section that currently contains only
readOnlyRootFilesystem to include these three settings.
control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml (1)

19-76: Add hardened container securityContext (runAsNonRoot / no privilege escalation).
Same issue as previously flagged; still missing in the Deployment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`
around lines 19 - 76, The Deployment's metrics-proxy container is missing a
hardened securityContext; update the Pod/Container spec for the container named
"metrics-proxy" to enforce non-root and no privilege escalation by adding a
securityContext with runAsNonRoot: true (and set a non-zero runAsUser if
applicable), set allowPrivilegeEscalation: false in the container
securityContext, and drop all capabilities (capabilities.drop: ["ALL"]); ensure
the Pod-level securityContext complements this (e.g., runAsNonRoot: true) so the
container cannot run as root or escalate privileges.
control-plane-operator/metrics-proxy/proxy_test.go (2)

73-75: ⚠️ Potential issue | 🟠 Major

Replace JWT-shaped test token to avoid secret-scanner failures.

Gitleaks can flag these values as real JWTs and fail CI. Use a non-JWT placeholder that still preserves special-character handling.

🛠️ Suggested fix
-		{
-			name:          "When Authorization header contains Bearer token with special characters, it should return the token",
-			authorization: "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U",
-			expected:      "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U",
-		},
+		{
+			name:          "When Authorization header contains Bearer token with special characters, it should return the token",
+			authorization: "Bearer token-._~+/=special",
+			expected:      "token-._~+/=special",
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy_test.go` around lines 73 - 75, The
test uses a JWT-shaped string for the authorization and expected fields which
triggers secret scanners; replace the JWT-like value with a non-JWT placeholder
that still contains special characters (e.g., include dots, underscores, dashes)
so the test exercising special-character handling remains valid—update the test
case's authorization and expected fields in the "When Authorization header
contains Bearer token with special characters, it should return the token" entry
(the authorization and expected values) to a clearly non-JWT token string.

79-82: ⚠️ Potential issue | 🟠 Major

Capture the loop variable before t.Parallel().

On Go versions before 1.22, parallel subtests can race on the reused range variable, causing flaky runs.

🧪 Suggested fix
 	for _, tt := range tests {
+		tt := tt
 		t.Run(tt.name, func(t *testing.T) {
 			t.Parallel()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy_test.go` around lines 79 - 82, The
subtest loop uses the range variable `tt` directly inside `t.Run` with
`t.Parallel()`, which can cause races and flaky tests; capture the loop variable
by shadowing it before starting the parallel subtest (e.g., add `tt := tt` or
`tc := tt` immediately inside the loop body) so each goroutine gets its own
copy, then call `t.Run(tt.name, func(t *testing.T) { t.Parallel(); ... })` using
the captured variable in the body and assertions.
control-plane-operator/metrics-proxy/auth_test.go (1)

154-203: ⚠️ Potential issue | 🟡 Minor

Synchronize kasCallCount to avoid a data race.

The handler runs in a separate goroutine and increments kasCallCount without synchronization. Use sync/atomic for a simple counter.

🛠️ Suggested fix
 import (
 	"context"
 	"encoding/pem"
 	"net/http"
 	"net/http/httptest"
+	"sync/atomic"
 	"testing"
 	"time"
@@
-			var kasCallCount int
+			var kasCallCount int32
 			kasServer := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-				kasCallCount++
+				atomic.AddInt32(&kasCallCount, 1)
 				w.Header().Set("Content-Type", "application/json")
 				if tt.tokenReviewResponse != "" {
 					if _, err := w.Write([]byte(tt.tokenReviewResponse)); err != nil {
 						t.Errorf("failed to write response: %v", err)
 					}
 				}
 			}))
@@
-			if tt.expectCacheHit && kasCallCount > 0 {
+			if tt.expectCacheHit && atomic.LoadInt32(&kasCallCount) > 0 {
 				t.Error("Expected cache hit (no KAS call) but KAS was called")
 			}
-			if !tt.expectCacheHit && kasCallCount == 0 {
+			if !tt.expectCacheHit && atomic.LoadInt32(&kasCallCount) == 0 {
 				t.Error("Expected KAS call (cache miss) but KAS was not called")
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/auth_test.go` around lines 154 - 203,
kasCallCount is incremented inside the httptest handler goroutine without
synchronization causing a data race; change kasCallCount to an atomic counter
(e.g., uint64 or int32) and replace direct increments in the kasServer handler
with atomic.AddUint64/atomic.AddInt32, and replace the test-side checks
(kasCallCount > 0 and kasCallCount == 0) with atomic.LoadUint64/atomic.LoadInt32
to read the value safely; update any variable declarations and comparisons
around kasCallCount accordingly so the handler and test read/writes use the
sync/atomic functions.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

118-120: ⚠️ Potential issue | 🟠 Major

Harden securityContext to match other fixtures.

This IBMCloud fixture is missing allowPrivilegeEscalation: false, capabilities.drop: [ALL], and runAsNonRoot: true, which are present in other variants and expected by policy tools.

🔒 Suggested fix
         securityContext:
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop:
+            - ALL
           readOnlyRootFilesystem: true
+          runAsNonRoot: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 118 - 120, The securityContext block for the IBMCloud metrics-proxy
pod is too permissive compared to other fixtures; update the securityContext
(next to readOnlyRootFilesystem and terminationMessagePolicy) to include
allowPrivilegeEscalation: false, runAsNonRoot: true and capabilities.drop: [
"ALL" ] so it matches the hardened settings used by other variants and satisfies
policy checks.
control-plane-operator/metrics-proxy/proxy.go (1)

95-113: Return an error when all scrapes fail.

If every scrape errors, the handler still returns 200 with an empty body. Consider a 502 so Prometheus treats it as a failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy.go` around lines 95 - 113, The
handler currently loops over results (for i, result := range results) and always
returns 200 even if every scrape failed; change it to detect when all entries in
results have Err != nil (e.g., count successes while iterating or check after
the loop) and if none succeeded, write a 502 Bad Gateway response
(w.Header().Set("Content-Type", "text/plain");
w.WriteHeader(http.StatusBadGateway)) with a short error body and log via h.log
before returning, otherwise proceed with encoding and the existing 200/response
write using format/buf.
control-plane-operator/metrics-proxy/filter.go (1)

62-65: Fail-closed on regex compilation errors.

Returning nil here causes Apply to return all metrics on compile failure. Consider caching a match-nothing regexp instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/filter.go` around lines 62 - 65, The
regex compilation currently returns nil on error which makes Apply treat it as
match-all; instead create and use a cached match-nothing regexp when compilation
fails. Replace the nil return in the block handling the error from
regexp.Compile (the compiled variable) with a reference to a reused no-match
pattern (e.g., a package-level noMatch regexp) so compiled is never nil and
Apply will return no metrics on compile failure.
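The fail-closed fallback can be sketched as follows; `compileFilter` is a hypothetical wrapper, not the actual filter.go function, and the real code caches compiled regexps rather than recompiling per call.

```go
package main

import "regexp"

// noMatch is a pattern that can never match any input ("$" before "^" cannot
// be satisfied); it serves as the fail-closed fallback for bad patterns.
var noMatch = regexp.MustCompile(`$a^`)

// compileFilter never returns nil: an invalid pattern yields noMatch, so a
// downstream Apply drops all metrics instead of passing them all through.
func compileFilter(pattern string) *regexp.Regexp {
	compiled, err := regexp.Compile(pattern)
	if err != nil {
		return noMatch
	}
	return compiled
}
```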
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go (4)

74-77: Default scheme to "http" instead of "https" to match Prometheus Operator behaviour

When ep.Scheme is nil, Prometheus Operator defaults the endpoint scheme to http, not https. Hard-coding "https" means any monitor that relies on the operator's default will scrape over the wrong scheme and fail. The same pattern repeats at lines 138–141 for PodMonitors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 74 - 77, The code defaults the local variable scheme to "https"
when ep.Scheme is nil, but Prometheus Operator's default is "http"; change the
default assignment so scheme := "http" and only override it when ep.Scheme !=
nil (using ep.Scheme.String()), and apply the same fix to the identical pattern
used for PodMonitor handling (the other block where scheme is set from
ep.Scheme). Ensure you update the variable initialization in both places (the
scheme variable near the top of scrape_config.go and the repeated PodMonitor
block) so nil endpoints default to "http".
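The corrected defaulting can be sketched as below. `schemeOrDefault` is a hypothetical helper, and the `*string` parameter stands in for `ep.Scheme`, whose concrete type in the monitoring API differs slightly.

```go
package main

// schemeOrDefault mirrors the suggested fix: default to "http" as the
// Prometheus Operator does, overriding only when the endpoint sets a scheme.
func schemeOrDefault(epScheme *string) string {
	scheme := "http"
	if epScheme != nil && *epScheme != "" {
		scheme = *epScheme
	}
	return scheme
}
```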

241-246: Named targetPort falls back to the service port, causing scrapes to hit the wrong port

When p.TargetPort is a string name (e.g. "http"), p.TargetPort.IntValue() returns 0 and the code returns p.Port (the Service port). Because the metrics-proxy scrapes pod IPs directly — bypassing the Service — it must use the container port, not the service port. A named targetPort resolves to a container port number that need not equal the service port.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 241 - 246, The code currently treats named TargetPort as zero
(p.TargetPort.IntValue() == 0) and falls back to returning the Service port
(p.Port), which is wrong for pod scrapes; modify the logic that iterates
svc.Spec.Ports (the loop using p.Name, p.TargetPort and portName) to detect when
p.TargetPort is a string name (p.TargetPort.Type == IntOrString.String) and
resolve that name to the actual container port number by locating pods matched
by the Service selector (or Endpoint/EndpointSlice) and finding a container port
with that name, then return that container port; only fall back to p.Port if you
cannot resolve the named targetPort.
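The lookup described above can be sketched as a pure function; `containerPort` and `resolveNamedTargetPort` are illustrative names, and in the real fix the pod's container ports would come from pods matched by the Service selector.

```go
package main

// containerPort is a minimal stand-in for corev1.ContainerPort.
type containerPort struct {
	Name string
	Port int32
}

// resolveNamedTargetPort resolves a named targetPort against a matched pod's
// container ports; it falls back to the service port only when the name
// cannot be resolved.
func resolveNamedTargetPort(name string, podPorts []containerPort, servicePort int32) int32 {
	for _, p := range podPorts {
		if p.Name == name {
			return p.Port
		}
	}
	return servicePort // unresolved: last-resort fallback
}
```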

221-225: Multiple Services matching a ServiceMonitor selector should be an error, not a silent first-pick

When the label selector matches more than one Service the first item from the unordered list is silently returned. Kubernetes LIST ordering is not guaranteed, so this is non-deterministic. The selector should uniquely identify one Service; multiple matches indicate a misconfiguration that should be rejected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 221 - 225, The current logic silently picks the first item from
svcList.Items when multiple Services match the ServiceMonitor selector; change
this to treat multiple matches as an error: in the function that resolves a
Service for a ServiceMonitor (the block using svcList, smName), add a check for
len(svcList.Items) > 1 and return a descriptive fmt.Errorf indicating multiple
Services matched (include smName and either the count or the list of matching
service names) instead of returning svcList.Items[0].Name; keep the existing
no-match error for len == 0.
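The stricter matching can be sketched as below; `resolveService` is a hypothetical helper operating on the matched service names rather than the actual svcList type.

```go
package main

import "fmt"

// resolveService treats both zero and multiple matches as errors; only a
// unique match is returned, making resolution deterministic.
func resolveService(smName string, matches []string) (string, error) {
	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no Service matches ServiceMonitor %q", smName)
	case 1:
		return matches[0], nil
	default:
		return "", fmt.Errorf("%d Services match ServiceMonitor %q: %v", len(matches), smName, matches)
	}
}
```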

121-128: PodMonitor portNumber (int32) and targetPort (IntOrString) fields are silently ignored

Only ep.Port (*string, the named port) is consulted. PodMonitors that specify portNumber or the deprecated targetPort field are unconditionally skipped at line 126–127, even though those fields carry valid port information.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 121 - 128, The code currently only checks ep.Port (the named port)
and skips the PodMonitor when it's empty; instead update the logic in the
PodMonitor processing (around variables portName and ep.Port) to also honor
ep.PortNumber (int32) and the deprecated ep.TargetPort (IntOrString): if ep.Port
is non-empty use it, else if ep.TargetPort is set use its string form, else if
ep.PortNumber != nil use strconv.Itoa(int(*ep.PortNumber)); assign the resulting
string to portName (or a new variable like portRef) and proceed rather than
unconditionally skipping, and ensure any downstream code that expects a named
port can accept numeric string ports produced from portNumber/targetPort.
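The precedence described above can be sketched as a small function; `portRefFor` is a hypothetical name and its parameters stand in for ep.Port, ep.TargetPort (already stringified) and ep.PortNumber.

```go
package main

import "strconv"

// portRefFor applies the suggested precedence for PodMonitor endpoints:
// the named port wins, then the deprecated targetPort, then a numeric
// portNumber; an empty result means the endpoint carries no port info.
func portRefFor(port *string, targetPort string, portNumber *int32) string {
	switch {
	case port != nil && *port != "":
		return *port
	case targetPort != "":
		return targetPort
	case portNumber != nil:
		return strconv.Itoa(int(*portNumber))
	default:
		return ""
	}
}
```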
🧹 Nitpick comments (6)
control-plane-operator/metrics-proxy/tls_util.go (1)

15-18: Consider logging the read error for observability.

The error from os.ReadFile is silently discarded, leaving no diagnostic signal when a CA file exists on disk but is unreadable (e.g., wrong permissions, partial write). While the "volume not yet mounted" case is intentional, a DEBUG/WARN log line would make startup TLS failures much easier to diagnose without changing the nil-return contract.

♻️ Proposed refactor
+import (
+    "crypto/x509"
+    "log"
+    "os"
+)

 func loadCertPool(caFile string) *x509.CertPool {
 	if caFile == "" {
 		return nil
 	}
 	caPEM, err := os.ReadFile(caFile)
 	if err != nil {
+		log.Printf("metrics-proxy: could not read CA file %q: %v", caFile, err)
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/tls_util.go` around lines 15 - 18, In
tls_util.go the os.ReadFile(caFile) error is discarded; update the error path so
that before returning nil you emit a DEBUG/WARN log that includes caFile and the
error (preserving the nil-return contract used to indicate "not yet mounted");
locate the read in the code using caPEM and os.ReadFile(caFile) and add a
concise log line (using the existing logger in this package) that records the
filename and err to aid observability.
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go (3)

57-59: Silently discarding AddToScheme errors can mask setup failures.

If either AddToScheme call returns an error, every downstream subtest will fail with confusing "no kind registered" messages instead of a clear root cause. Fail fast in the test setup.

♻️ Proposed fix
-	_ = corev1.AddToScheme(scheme)
-	_ = prometheusoperatorv1.AddToScheme(scheme)
+	if err := corev1.AddToScheme(scheme); err != nil {
+		t.Fatalf("failed to add corev1 to scheme: %v", err)
+	}
+	if err := prometheusoperatorv1.AddToScheme(scheme); err != nil {
+		t.Fatalf("failed to add prometheusoperatorv1 to scheme: %v", err)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`
around lines 57 - 59, The test is silently discarding errors from
runtime.NewScheme setup; change the setup to check the returned errors from
corev1.AddToScheme and prometheusoperatorv1.AddToScheme and fail fast when they
occur (e.g., use t.Fatalf or a testing assertion) so tests don't continue with
an invalid scheme; locate the runtime.NewScheme call and the corev1.AddToScheme
and prometheusoperatorv1.AddToScheme invocations in deployment_test.go and
validate each error, logging the error details in the failure message.

565-577: TestCertVolumeOptionalFlag tests an external library utility, not production code.

The entire test body calls ptr.To(true) from k8s.io/utils/ptr and asserts the returned pointer is non-nil and holds true. This exercises no code in the metricsproxy package. The leading comment ("Verify certRef is correctly identified for the ptr.To helper") is also misleading — no certRef is touched. This test can be removed or replaced with an assertion against the actual volume-building logic (e.g., call certVolumesFromMonitors and inspect the resulting Secret.Optional or ConfigMap.Optional pointer directly).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`
around lines 565 - 577, Remove or replace the pointless
TestCertVolumeOptionalFlag which only exercises k8s.io/utils/ptr; either delete
the TestCertVolumeOptionalFlag test or rewrite it to call the package logic that
builds volumes (e.g., call certVolumesFromMonitors) and assert the Optional
pointer on returned volume sources (Secret.Optional and ConfigMap.Optional) is
set as expected; update references to ptr.To(true) in the test body to instead
inspect the Optional fields from the certVolumesFromMonitors result and assert
non-nil && *Optional == true (or false) as appropriate.

333-370: Missing mount assertions in the ConfigMap optional=true subtest.

The Secret counterpart (lines 286-331) validates both the volume count and mount count/path. This subtest discards mounts entirely (_), leaving the mount-path wiring for ConfigMap volumes untested.

♻️ Proposed fix
-	volumes, _, err := certVolumesFromMonitors(cpContext, namespace)
+	volumes, mounts, err := certVolumesFromMonitors(cpContext, namespace)
 	if err != nil {
 		t.Fatalf("unexpected error: %v", err)
 	}
 	// ... existing volume/optional assertions ...
+
+	if len(mounts) != 1 {
+		t.Fatalf("expected 1 mount, got %d", len(mounts))
+	}
+	expectedPath := certBasePath + "/test-ca"
+	if mounts[0].MountPath != expectedPath {
+		t.Errorf("expected mount path %q, got %q", expectedPath, mounts[0].MountPath)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`
around lines 333 - 370, The subtest for ConfigMap volumes ignores mounts (using
`_`) so it never verifies mount wiring; update the call to
certVolumesFromMonitors to capture `mounts` (e.g. `volumes, mounts, err :=
certVolumesFromMonitors(...)`) and add the same mount assertions used in the
Secret subtest: assert `len(mounts) == 1` and that `mounts[0].MountPath` equals
the expected mount path used for the component (mirror the value/assertion from
the Secret test), and keep the existing `volumes[0]`/ConfigMap `Optional`
assertion; use the `certVolumesFromMonitors`, `volumes`, and `mounts`
identifiers to locate the change.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml (1)

21-21: weight: null will be serialized — consider omitempty on the RouteTargetReference.Weight field.

The weight: null in the fixture reflects that the Weight field in routev1.RouteTargetReference is defined as a pointer type (*int32) without omitempty in its JSON struct tag. When marshalled, nil pointers become explicit null values in YAML. While the OpenShift Route controller treats a null/absent weight identically (defaults to 100), the explicit null is unconventional for Route manifests.

To fix this, add omitempty to the Weight field's struct tag in the openshift/api package's RouteTargetReference type (vendor/github.com/openshift/api/route/v1/types.go line 339), changing:

Weight *int32 `json:"weight" protobuf:"varint,3,opt,name=weight"`

to:

Weight *int32 `json:"weight,omitempty" protobuf:"varint,3,opt,name=weight"`

This will omit the field from output when nil instead of serializing it as null.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`
at line 21, The Route fixture shows `weight: null` because
RouteTargetReference.Weight is a pointer without `omitempty`; update the struct
tag on the RouteTargetReference.Weight field to include `omitempty` so nil
pointers are omitted when marshaling (change the JSON tag for the Weight field
on type RouteTargetReference in types.go to add `omitempty`), then run
generation/tests to verify YAML no longer contains `weight: null`.
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go (1)

46-51: Only the first endpoint per monitor is processed; additional endpoints are silently dropped

sm.Spec.Endpoints[0] (and its PodMonitor counterpart at line 119) takes only the first element. A ServiceMonitor or PodMonitor with multiple endpoints will have all subsequent endpoints silently ignored. If this is intentional by CPOv2 convention (one scrape target per component), a comment stating that assumption would prevent future confusion; otherwise the loop needs to iterate all endpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 46 - 51, The code currently only processes the first endpoint of
each ServiceMonitor/PodMonitor (using sm.Spec.Endpoints[0]), which drops
additional endpoints; update the logic in the ServiceMonitor handling (iterate
smList.Items and replace the single-index access with a loop over
sm.Spec.Endpoints, e.g. for each endpoint process it the same way you currently
do for ep) and apply the same change to the PodMonitor handling (replace the
single-index access with iterating pm.Spec.Endpoints) so all endpoints are
included; if single-endpoint semantics are intentional instead, add an explicit
comment near the sm.Spec.Endpoints and pm.Spec.Endpoints usage clarifying that
convention.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-120: The container securityContext only sets
readOnlyRootFilesystem; update the manifest to harden the container by adding
runAsNonRoot: true and allowPrivilegeEscalation: false under the same
securityContext (and add a seccompProfile with type: RuntimeDefault if supported
by your platform), then update the upstream asset manifest that produces this
fixture and re-generate the
zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml so the
fixture and source manifest remain in sync; look for the securityContext block
in the Deployment spec to make these changes.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 105-106: The code currently writes both ServiceMonitor and
PodMonitor entries into cfg.Components keyed by sm.Name and pm.Name, allowing a
PodMonitor to silently overwrite a ServiceMonitor when names collide; update the
assignments that set cfg.Components[...] (where sm.Name and pm.Name are used and
comp is the value) to use distinct keys such as "sm/"+sm.Name and "pm/"+pm.Name,
or alternatively check cfg.Components for an existing entry before writing and
emit a warning log mentioning the collision (include the monitor type and name)
to avoid silent overwrites.

In `@control-plane-operator/metrics-proxy/config_reader.go`:
- Around line 99-121: In buildTLSConfigFromFiles, stop silently ignoring client
cert/key read errors and reject partial configurations: if exactly one of
certFile or keyFile is set return an error explaining the missing counterpart;
when reading certFile/keyFile use os.ReadFile and if the returned error is
os.IsNotExist(err) treat it as "no client cert present" and return the TLS
config without client certs, but for any other read error return that error
(wrap with context); keep existing behavior for loadCertPool (use
loadCertPool(caFile) if non-nil). Ensure errors reference certFile/keyFile and
the function name buildTLSConfigFromFiles for traceability.

---

Duplicate comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-119: The container spec's securityContext in the metrics-proxy
fixture only sets readOnlyRootFilesystem and is missing hardening fields; update
the code that builds the securityContext (the logic that emits securityContext
for the metrics-proxy Deployment) to include allowPrivilegeEscalation: false,
runAsNonRoot: true, and capabilities.drop: [ALL] alongside
readOnlyRootFilesystem so generated fixtures (and the component's Go generator)
produce the same hardened SecurityContext as the GCP variant; locate the
function that constructs the Deployment/securityContext for metrics-proxy and
add these fields to the emitted securityContext structure.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-120: The securityContext block for the IBMCloud metrics-proxy
pod is too permissive compared to other fixtures; update the securityContext
(next to readOnlyRootFilesystem and terminationMessagePolicy) to include
allowPrivilegeEscalation: false, runAsNonRoot: true and capabilities.drop: [
"ALL" ] so it matches the hardened settings used by other variants and satisfies
policy checks.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-119: Add the missing hardened container securityContext
fields: set allowPrivilegeEscalation: false, runAsNonRoot: true, and add a
capabilities block that drops all capabilities (capabilities: { drop: ["ALL"] })
under the existing securityContext entry so this fixture matches the GCP
variant; update the securityContext section that currently contains only
readOnlyRootFilesystem to include these three settings.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`:
- Around line 19-76: The Deployment's metrics-proxy container is missing a
hardened securityContext; update the Pod/Container spec for the container named
"metrics-proxy" to enforce non-root and no privilege escalation by adding a
securityContext with runAsNonRoot: true (and set a non-zero runAsUser if
applicable), set allowPrivilegeEscalation: false in the container
securityContext, and drop all capabilities (capabilities.drop: ["ALL"]); ensure
the Pod-level securityContext complements this (e.g., runAsNonRoot: true) so the
container cannot run as root or escalate privileges.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 74-77: The code defaults the local variable scheme to "https" when
ep.Scheme is nil, but Prometheus Operator's default is "http"; change the
default assignment so scheme := "http" and only override it when ep.Scheme !=
nil (using ep.Scheme.String()), and apply the same fix to the identical pattern
used for PodMonitor handling (the other block where scheme is set from
ep.Scheme). Ensure you update the variable initialization in both places (the
scheme variable near the top of scrape_config.go and the repeated PodMonitor
block) so nil endpoints default to "http".
- Around line 241-246: The code currently treats named TargetPort as zero
(p.TargetPort.IntValue() == 0) and falls back to returning the Service port
(p.Port), which is wrong for pod scrapes; modify the logic that iterates
svc.Spec.Ports (the loop using p.Name, p.TargetPort and portName) to detect when
p.TargetPort is a string name (p.TargetPort.Type == IntOrString.String) and
resolve that name to the actual container port number by locating pods matched
by the Service selector (or Endpoint/EndpointSlice) and finding a container port
with that name, then return that container port; only fall back to p.Port if you
cannot resolve the named targetPort.
- Around line 221-225: The current logic silently picks the first item from
svcList.Items when multiple Services match the ServiceMonitor selector; change
this to treat multiple matches as an error: in the function that resolves a
Service for a ServiceMonitor (the block using svcList, smName), add a check for
len(svcList.Items) > 1 and return a descriptive fmt.Errorf indicating multiple
Services matched (include smName and either the count or the list of matching
service names) instead of returning svcList.Items[0].Name; keep the existing
no-match error for len == 0.
- Around line 121-128: The code currently only checks ep.Port (the named port)
and skips the PodMonitor when it's empty; instead update the logic in the
PodMonitor processing (around variables portName and ep.Port) to also honor
ep.PortNumber (int32) and the deprecated ep.TargetPort (IntOrString): if ep.Port
is non-empty use it, else if ep.TargetPort is set use its string form, else if
ep.PortNumber != nil use strconv.Itoa(int(*ep.PortNumber)); assign the resulting
string to portName (or a new variable like portRef) and proceed rather than
unconditionally skipping, and ensure any downstream code that expects a named
port can accept numeric string ports produced from portNumber/targetPort.

In `@control-plane-operator/metrics-proxy/auth_test.go`:
- Around line 154-203: kasCallCount is incremented inside the httptest handler
goroutine without synchronization causing a data race; change kasCallCount to an
atomic counter (e.g., uint64 or int32) and replace direct increments in the
kasServer handler with atomic.AddUint64/atomic.AddInt32, and replace the
test-side checks (kasCallCount > 0 and kasCallCount == 0) with
atomic.LoadUint64/atomic.LoadInt32 to read the value safely; update any variable
declarations and comparisons around kasCallCount accordingly so the handler and
test read/writes use the sync/atomic functions.

In `@control-plane-operator/metrics-proxy/filter.go`:
- Around line 62-65: The regex compilation currently returns nil on error which
makes Apply treat it as match-all; instead create and use a cached match-nothing
regexp when compilation fails. Replace the nil return in the block handling the
error from regexp.Compile (the compiled variable) with a reference to a reused
no-match pattern (e.g., a package-level noMatch regexp) so compiled is never nil
and Apply will return no metrics on compile failure.

In `@control-plane-operator/metrics-proxy/proxy_test.go`:
- Around line 73-75: The test uses a JWT-shaped string for the authorization and
expected fields which triggers secret scanners; replace the JWT-like value with
a non-JWT placeholder that still contains special characters (e.g., include
dots, underscores, dashes) so the test exercising special-character handling
remains valid—update the test case's authorization and expected fields in the
"When Authorization header contains Bearer token with special characters, it
should return the token" entry (the authorization and expected values) to a
clearly non-JWT token string.
- Around line 79-82: The subtest loop uses the range variable `tt` directly
inside `t.Run` with `t.Parallel()`, which can cause races and flaky tests;
capture the loop variable by shadowing it before starting the parallel subtest
(e.g., add `tt := tt` or `tc := tt` immediately inside the loop body) so each
goroutine gets its own copy, then call `t.Run(tt.name, func(t *testing.T) {
t.Parallel(); ... })` using the captured variable in the body and assertions.
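The loop-variable capture reads like this in isolation (`collect` is a hypothetical stand-in for the subtest loop; note that Go 1.22+ scopes range variables per iteration, so the shadow is only required on older toolchains like the one this repo may pin):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect launches one goroutine per item, the way t.Run(..., t.Parallel())
// does for each table entry, and records what each goroutine observed.
func collect(items []string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen []string
	)
	for _, tt := range items {
		tt := tt // capture: each goroutine gets its own copy (required before Go 1.22)
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			seen = append(seen, tt)
			mu.Unlock()
		}()
	}
	wg.Wait()
	sort.Strings(seen)
	return seen
}

func main() {
	fmt.Println(collect([]string{"c", "a", "b"}))
}
```

Without the shadow on an older toolchain, every goroutine could observe the final value of `tt`, which is exactly the flake the comment describes.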

In `@control-plane-operator/metrics-proxy/proxy.go`:
- Around line 95-113: The handler currently loops over results (for i, result :=
range results) and always returns 200 even if every scrape failed; change it to
detect when all entries in results have Err != nil (e.g., count successes while
iterating or check after the loop) and if none succeeded, write a 502 Bad
Gateway response (w.Header().Set("Content-Type", "text/plain");
w.WriteHeader(http.StatusBadGateway)) with a short error body and log via h.log
before returning, otherwise proceed with encoding and the existing 200/response
write using format/buf.

---

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml`:
- Line 21: The Route fixture shows `weight: null` because
RouteTargetReference.Weight is a pointer without `omitempty`; update the struct
tag on the RouteTargetReference.Weight field to include `omitempty` so nil
pointers are omitted when marshaling (change the JSON tag for the Weight field
on type RouteTargetReference in types.go to add `omitempty`), then run
generation/tests to verify YAML no longer contains `weight: null`.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`:
- Around line 57-59: The test is silently discarding errors from
runtime.NewScheme setup; change the setup to check the returned errors from
corev1.AddToScheme and prometheusoperatorv1.AddToScheme and fail fast when they
occur (e.g., use t.Fatalf or a testing assertion) so tests don't continue with
an invalid scheme; locate the runtime.NewScheme call and the corev1.AddToScheme
and prometheusoperatorv1.AddToScheme invocations in deployment_test.go and
validate each error, logging the error details in the failure message.
- Around line 565-577: Remove or replace the pointless
TestCertVolumeOptionalFlag which only exercises k8s.io/utils/ptr; either delete
the TestCertVolumeOptionalFlag test or rewrite it to call the package logic that
builds volumes (e.g., call certVolumesFromMonitors) and assert the Optional
pointer on returned volume sources (Secret.Optional and ConfigMap.Optional) is
set as expected; update references to ptr.To(true) in the test body to instead
inspect the Optional fields from the certVolumesFromMonitors result and assert
non-nil && *Optional == true (or false) as appropriate.
- Around line 333-370: The subtest for ConfigMap volumes ignores mounts (using
`_`) so it never verifies mount wiring; update the call to
certVolumesFromMonitors to capture `mounts` (e.g. `volumes, mounts, err :=
certVolumesFromMonitors(...)`) and add the same mount assertions used in the
Secret subtest: assert `len(mounts) == 1` and that `mounts[0].MountPath` equals
the expected mount path used for the component (mirror the value/assertion from
the Secret test), and keep the existing `volumes[0]`/ConfigMap `Optional`
assertion; use the `certVolumesFromMonitors`, `volumes`, and `mounts`
identifiers to locate the change.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 46-51: The code currently only processes the first endpoint of
each ServiceMonitor/PodMonitor (using sm.Spec.Endpoints[0]), which drops
additional endpoints; update the logic in the ServiceMonitor handling (iterate
smList.Items and replace the single-index access with a loop over
sm.Spec.Endpoints, e.g. for each endpoint process it the same way you currently
do for ep) and apply the same change to the PodMonitor handling (replace the
single-index access with iterating pm.Spec.Endpoints) so all endpoints are
included; if single-endpoint semantics are intentional instead, add an explicit
comment near the sm.Spec.Endpoints and pm.Spec.Endpoints usage clarifying that
convention.
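The endpoint iteration change amounts to replacing an `Endpoints[0]` access with a nested loop; the types below are minimal stand-ins for the ServiceMonitor structures, and `portsFor` is a hypothetical helper:

```go
package main

import "fmt"

// Minimal stand-ins for the prometheus-operator ServiceMonitor shapes.
type endpoint struct{ Port string }
type serviceMonitor struct{ Endpoints []endpoint }

// portsFor gathers every endpoint of every monitor, instead of only
// sm.Spec.Endpoints[0], so no scrape target is silently dropped.
func portsFor(monitors []serviceMonitor) []string {
	var ports []string
	for _, sm := range monitors {
		for _, ep := range sm.Endpoints { // iterate all endpoints
			ports = append(ports, ep.Port)
		}
	}
	return ports
}

func main() {
	sms := []serviceMonitor{
		{Endpoints: []endpoint{{Port: "metrics"}, {Port: "healthz"}}},
	}
	fmt.Println(portsFor(sms))
}
```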

In `@control-plane-operator/metrics-proxy/tls_util.go`:
- Around line 15-18: In tls_util.go the os.ReadFile(caFile) error is discarded;
update the error path so that before returning nil you emit a DEBUG/WARN log
that includes caFile and the error (preserving the nil-return contract used to
indicate "not yet mounted"); locate the read in the code using caPEM and
os.ReadFile(caFile) and add a concise log line (using the existing logger in
this package) that records the filename and err to aid observability.
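A sketch of the logged read path, keeping the nil-return "not yet mounted" contract; `readCAPEM` is a hypothetical name and `fmt` stands in for the package logger:

```go
package main

import (
	"fmt"
	"os"
)

// readCAPEM still returns nil when the CA file cannot be read (the signal
// that the volume is not yet mounted) but logs the filename and error so
// persistent failures are observable.
func readCAPEM(caFile string) []byte {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		fmt.Printf("DEBUG: CA bundle not readable yet file=%s err=%v\n", caFile, err)
		return nil
	}
	return caPEM
}

func main() {
	fmt.Println(readCAPEM("/nonexistent/ca.crt") == nil)
}
```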

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between bcf2d81 and f06fba4.

📒 Files selected for processing (73)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_serving_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_serving_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_serving_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_serving_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_config_configmap.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_serving_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/ca-cert.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/scrape-config.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serving-cert.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/certs.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/component.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/route.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/main.go
  • control-plane-operator/metrics-proxy/auth.go
  • control-plane-operator/metrics-proxy/auth_test.go
  • control-plane-operator/metrics-proxy/cmd.go
  • control-plane-operator/metrics-proxy/config_reader.go
  • control-plane-operator/metrics-proxy/config_reader_test.go
  • control-plane-operator/metrics-proxy/endpoint_resolver_client.go
  • control-plane-operator/metrics-proxy/endpoint_resolver_client_test.go
  • control-plane-operator/metrics-proxy/filter.go
  • control-plane-operator/metrics-proxy/filter_test.go
  • control-plane-operator/metrics-proxy/labels.go
  • control-plane-operator/metrics-proxy/labels_test.go
  • control-plane-operator/metrics-proxy/proxy.go
  • control-plane-operator/metrics-proxy/proxy_test.go
  • control-plane-operator/metrics-proxy/scraper.go
  • control-plane-operator/metrics-proxy/tls_util.go
  • control-plane-operator/metrics-proxy/types.go
✅ Files skipped from review due to trivial changes (10)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_serving_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_serving_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_route.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_ca_cert_secret.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/service.yaml
🚧 Files skipped from review as they are similar to previous changes (14)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/scrape-config.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/route.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serving-cert.yaml
  • control-plane-operator/metrics-proxy/cmd.go
  • control-plane-operator/metrics-proxy/endpoint_resolver_client_test.go
  • control-plane-operator/metrics-proxy/scraper.go
  • control-plane-operator/metrics-proxy/config_reader_test.go
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/main.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_controlplanecomponent.yaml

Generated test fixtures for the metrics-proxy component across
all platform variants (default, AROSwift, GCP, IBMCloud,
TechPreviewNoUpgrade).

Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@muraee muraee force-pushed the feat-metrics-proxy branch from f06fba4 to 543ba25 Compare March 4, 2026 10:08
Implement the metrics-proxy as a new CPOv2 control plane component that
exposes hosted control plane component metrics externally via an
authenticated HTTPS route as described in the hosted control plane
metrics exposure enhancement.

The proxy:
- Authenticates requests using bearer tokens validated against the
  management cluster's TokenReview API
- Discovers component pod targets via the endpoint-resolver
- Scrapes metrics from individual pods in parallel
- Filters and labels metrics per component configuration
- Exposes an aggregated Prometheus-compatible /metrics/<component>
  endpoint over a TLS route

The component includes:
- CPOv2 component definition with assets (deployment, service, route,
  RBAC, scrape config)
- Self-signed CA and serving certificate management
- Dynamic scrape configuration loaded from a ConfigMap
- Comprehensive unit tests for proxy, auth, filter, labeler, scraper,
  config reader, and endpoint resolver client

Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@muraee muraee force-pushed the feat-metrics-proxy branch from 543ba25 to 40e9f45 Compare March 4, 2026 10:10

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (18)
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

118-119: ⚠️ Potential issue | 🟠 Major

Harden container securityContext beyond read-only rootfs.

At Line 118 and Line 119, the container still allows implicit privilege escalation/root execution behavior unless explicitly denied. Please set allowPrivilegeEscalation: false and runAsNonRoot: true (and ideally drop capabilities / set RuntimeDefault seccomp).

Suggested patch
         securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          capabilities:
+            drop:
+            - ALL
+          seccompProfile:
+            type: RuntimeDefault
           readOnlyRootFilesystem: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 118 - 119, Update the container securityContext that currently only
sets readOnlyRootFilesystem to harden it: add allowPrivilegeEscalation: false
and runAsNonRoot: true under the same securityContext, and additionally drop all
capabilities (e.g., capabilities: { drop: ["ALL"] }) and set seccompProfile to
RuntimeDefault (or the appropriate profile) so the metrics-proxy container’s
securityContext is explicit and restrictive; locate the securityContext block
for the metrics-proxy container in the fixture (the block containing
readOnlyRootFilesystem) and apply these additions.
control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml (1)

19-76: ⚠️ Potential issue | 🟠 Major

Harden the container securityContext (still missing).

On this pod template, the metrics-proxy container does not explicitly disable privilege escalation or enforce non-root execution. This was flagged earlier and still applies.

🔒 Suggested manifest hardening
       - args:
         - --tls-cert-file
@@
         image: controlplane-operator
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          capabilities:
+            drop:
+            - ALL
         livenessProbe:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`
around lines 19 - 76, The metrics-proxy container lacks a container-level
securityContext; update the pod template for the container named "metrics-proxy"
to harden it by adding a securityContext with at minimum:
allowPrivilegeEscalation: false, runAsNonRoot: true, runAsUser: (non-zero UID,
e.g. 1000), readOnlyRootFilesystem: true, capabilities: { drop: ["ALL"] }, and a
seccompProfile (e.g. type: RuntimeDefault); also consider adding equivalent
pod-level securityContext (fsGroup/runAsUser) if needed to support mounts so the
container-level changes do not break file access. Ensure the new fields are
added alongside the existing container spec (name: metrics-proxy, ports,
volumeMounts, probes) so the manifest remains valid.
control-plane-operator/metrics-proxy/proxy_test.go (1)

73-75: ⚠️ Potential issue | 🟠 Major

Replace JWT-looking test token with a scanner-safe placeholder.

Line 74 and Line 75 still use a JWT-shaped string, which can trigger secret scanning and block CI. Keep special characters but switch to a clearly non-JWT value in both fields.

🛠️ Proposed fix
 		{
 			name:          "When Authorization header contains Bearer token with special characters, it should return the token",
-			authorization: "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U",
-			expected:      "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U",
+			authorization: "Bearer token-ABC_123+~/.!@#",
+			expected:      "token-ABC_123+~/.!@#",
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy_test.go` around lines 73 - 75, The
test case using a JWT-looking string in the table (the "When Authorization
header contains Bearer token with special characters, it should return the
token" entry) should be updated to use a scanner-safe placeholder: replace the
values assigned to the authorization and expected fields with a clearly non-JWT
placeholder that preserves special characters (e.g., "Bearer
token-with-special_chars+./_-XYZ") so the test still validates parsing of
special characters without containing a real JWT-looking secret; update the
authorization and expected entries in that test case accordingly.
control-plane-operator/metrics-proxy/cmd.go (4)

113-130: ⚠️ Potential issue | 🟠 Major

Bound HTTP server and shutdown with timeouts.

Line 113 defines an externally reachable server without read/write timeouts, and Line 129 uses server.Shutdown(context.Background()), which can block indefinitely during drain.

⏱️ Proposed hardening
 import (
 	"context"
 	"crypto/tls"
 	"fmt"
 	"net/http"
 	"os"
 	"os/signal"
 	"syscall"
+	"time"
@@
 		server := &http.Server{
 			Addr:    fmt.Sprintf(":%d", servingPort),
 			Handler: mux,
 			TLSConfig: &tls.Config{
 				Certificates: []tls.Certificate{tlsCert},
 				MinVersion:   tls.VersionTLS12,
 			},
+			ReadHeaderTimeout: 10 * time.Second,
+			ReadTimeout:       30 * time.Second,
+			WriteTimeout:      60 * time.Second,
+			IdleTimeout:       120 * time.Second,
 		}
@@
-			if err := server.Shutdown(context.Background()); err != nil {
+			shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 10*time.Second)
+			defer shutdownCancel()
+			if err := server.Shutdown(shutdownCtx); err != nil {
 				l.Error(err, "Error during server shutdown")
 			}
 		}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` around lines 113 - 130, The HTTP
server is created without request timeouts and shutdown uses a background
context which can block indefinitely; update the server struct (variable server)
to set ReadTimeout, WriteTimeout and IdleTimeout values appropriate for your
workload and replace the blind server.Shutdown(context.Background()) call in the
goroutine with a context that has a deadline (e.g., context.WithTimeout) so
shutdown will abort after a bounded time; ensure the same goroutine cancels
correctly on signal via sigChan and uses server.Shutdown(ctxWithTimeout) and
handles the timeout error path.

50-50: ⚠️ Potential issue | 🟡 Minor

Validate --metrics-set before converting to metrics.MetricsSet.

Line 50 accepts any string, and Line 97 blindly casts it. Reject unknown values early to avoid unintended filtering behavior on typos/casing mismatches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` at line 50, Validate the
metricsSet flag value (set via cmd.Flags().StringVar(&metricsSet, "metrics-set",
...)) before converting or casting it to metrics.MetricsSet; specifically
normalize casing (e.g., strings.EqualFold comparisons; note strings.Title is deprecated) and
explicitly allow only the known values ("All", "Telemetry", "SRE"), returning a
clear error/exit if the value is unknown so you don't blindly do
metrics.MetricsSet(metricsSet) on typos or unexpected input; update the flag
handling code to perform this validation right after flag parsing and use the
validated/normalized value for the conversion.
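Such a validation could look like the sketch below — the allowed values are assumptions taken from the comment above, and `validateMetricsSet` is a hypothetical helper, not the hypershift metrics package API:

```go
package main

import (
	"fmt"
	"strings"
)

// validMetricsSets lists the values this sketch accepts; the canonical set
// lives in the hypershift metrics package.
var validMetricsSets = []string{"All", "Telemetry", "SRE"}

// validateMetricsSet rejects unknown values and normalizes casing, so a typo
// fails fast instead of silently changing filtering behavior.
func validateMetricsSet(v string) (string, error) {
	for _, known := range validMetricsSets {
		if strings.EqualFold(v, known) {
			return known, nil // return the canonical casing
		}
	}
	return "", fmt.Errorf("unknown --metrics-set %q (valid: %s)",
		v, strings.Join(validMetricsSets, ", "))
}

func main() {
	got, err := validateMetricsSet("telemetry")
	fmt.Println(got, err)
	_, err = validateMetricsSet("bogus")
	fmt.Println(err != nil)
}
```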

62-63: ⚠️ Potential issue | 🟠 Major

Cancellable context is created but discarded.

Line 62 assigns the returned context to _, so cancel() at shutdown does not propagate to any consumer. Either wire that context into components doing request/work loops, or remove WithCancel entirely.

# Verify whether the cancellable context is consumed anywhere in this package.
rg -nP --type=go 'WithCancel\(|cmd\.Context\(\)|context\.Context' control-plane-operator/metrics-proxy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` around lines 62 - 63, The code
creates a cancellable context but discards it by assigning the new context to
`_` while still calling `cancel()`, so cancellation never reaches consumers;
change the call to capture the new context (e.g., `ctx, cancel :=
context.WithCancel(cmd.Context())`), pass `ctx` into any long-lived functions or
servers started in this file (e.g., functions/methods that accept contexts or
the main run/serve routines), and keep `defer cancel()` so shutdown propagates;
if no consumer accepts a context, remove the `WithCancel` call and just use
`cmd.Context()` directly along with removing `cancel()` to avoid a no-op.
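Captured correctly, cancellation propagates to every consumer; `worker` and `runAndCancel` below are hypothetical stand-ins for the proxy's long-lived loops:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// worker blocks until the context is cancelled, standing in for a
// request-serving or config-reload loop that accepts ctx.
func worker(ctx context.Context, done chan<- struct{}) {
	<-ctx.Done()
	close(done)
}

// runAndCancel captures the context (instead of discarding it with `_`),
// hands it to the worker, and reports whether cancel() actually stopped it.
func runAndCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go worker(ctx, done)

	cancel() // shutdown path: now propagates to the worker
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println(runAndCancel())
}
```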

65-65: ⚠️ Potential issue | 🟡 Minor

Fail fast when MY_NAMESPACE is unset.

Line 65 can produce empty namespace labeling, which degrades metric attribution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` at line 65, The code reads
MY_NAMESPACE into namespace using namespace := os.Getenv("MY_NAMESPACE") but
doesn't validate it; update the startup logic (where namespace is set) to fail
fast when namespace == "" by logging a clear error and exiting with a non-zero
status (e.g., use log.Fatalf or fmt.Fprintln(os.Stderr, ...) + os.Exit(1)) so
metrics never run with an empty namespace label.
control-plane-operator/metrics-proxy/auth_test.go (1)

154-204: ⚠️ Potential issue | 🟡 Minor

Use atomic counter for kasCallCount across goroutines.

Line 156 writes in the HTTP handler goroutine while Line 199 and Line 202 read in the test goroutine. Use sync/atomic to make the synchronization explicit.

🧵 Proposed fix
 import (
 	"context"
 	"encoding/pem"
 	"net/http"
 	"net/http/httptest"
+	"sync/atomic"
 	"testing"
 	"time"
@@
-			var kasCallCount int
+			var kasCallCount atomic.Int32
 			kasServer := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-				kasCallCount++
+				kasCallCount.Add(1)
@@
-			if tt.expectCacheHit && kasCallCount > 0 {
+			if tt.expectCacheHit && kasCallCount.Load() > 0 {
 				t.Error("Expected cache hit (no KAS call) but KAS was called")
 			}
-			if !tt.expectCacheHit && kasCallCount == 0 {
+			if !tt.expectCacheHit && kasCallCount.Load() == 0 {
 				t.Error("Expected KAS call (cache miss) but KAS was not called")
 			}
# Verify with race detector
go test -race ./control-plane-operator/metrics-proxy -run TestAuthenticate
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/auth_test.go` around lines 154 - 204,
The test uses a plain int variable kasCallCount that is incremented inside the
httptest server handler goroutine and read from the test goroutine; replace
kasCallCount with an atomic counter (e.g., int32/int64) and update the handler
to increment via atomic.AddInt32/AddInt64, and read checks to use
atomic.LoadInt32/LoadInt64; also add the sync/atomic import. Reference
NewTokenAuthenticator, Authenticate, the kasServer http.HandlerFunc and the
kasCallCount checks so you update the increment in the handler and the two reads
that assert expectCacheHit/!expectCacheHit.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

118-120: ⚠️ Potential issue | 🟠 Major

AROSwift fixture still misses key securityContext hardening fields.

Line 118-Line 119 should include allowPrivilegeEscalation: false and runAsNonRoot: true (and preferably seccompProfile: RuntimeDefault) for parity with restricted posture.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 118 - 120, The AROSwift fixture's pod/deployment securityContext
only sets readOnlyRootFilesystem; update the securityContext block (the same
block containing readOnlyRootFilesystem) to also set allowPrivilegeEscalation:
false and runAsNonRoot: true and, if possible, add seccompProfile:
RuntimeDefault under securityContext to match the restricted posture; ensure
these entries are added alongside readOnlyRootFilesystem in the same
securityContext for the metrics-proxy deployment (refer to the securityContext
and terminationMessagePolicy around that container spec to locate the correct
place).
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

118-120: ⚠️ Potential issue | 🟠 Major

Harden container securityContext beyond read-only root filesystem.

Line 118 currently omits allowPrivilegeEscalation: false and runAsNonRoot: true (and ideally seccompProfile: RuntimeDefault). This weakens baseline pod hardening.

🔒 Suggested hardening
         securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          seccompProfile:
+            type: RuntimeDefault
           readOnlyRootFilesystem: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 118 - 120, The container securityContext only sets
readOnlyRootFilesystem; update the securityContext block to harden the container
by adding allowPrivilegeEscalation: false and runAsNonRoot: true and set
seccompProfile.type: RuntimeDefault (under the same securityContext for the
container spec), ensuring YAML keys are at the same indentation level as
readOnlyRootFilesystem and use the correct mapping (seccompProfile: { type:
RuntimeDefault }) so the pod runtime enforces these restrictions.
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

118-119: ⚠️ Potential issue | 🟠 Major

Incomplete security context — missing hardening fields.

The container's securityContext is missing the allowPrivilegeEscalation: false, capabilities.drop: [ALL], and runAsNonRoot: true fields that other platform variants (e.g., GCP) include. This was flagged by static analysis (CKV_K8S_20, CKV_K8S_23).

🔒 Proposed fix
         securityContext:
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop:
+            - ALL
           readOnlyRootFilesystem: true
+          runAsNonRoot: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 118 - 119, The container securityContext currently only sets
readOnlyRootFilesystem and must be hardened: update the securityContext block
for the deployment fixture (the same block containing readOnlyRootFilesystem) to
add allowPrivilegeEscalation: false, runAsNonRoot: true, and a capabilities.drop
entry that drops ALL (capabilities.drop: [ALL]) so it matches other platform
variants and satisfies CKV_K8S_20/CKV_K8S_23 checks.
control-plane-operator/metrics-proxy/config_reader.go (1)

99-125: ⚠️ Potential issue | 🟠 Major

Fail fast on partial or errored client-cert reads.

The function silently ignores read errors and doesn't validate when only one of certFile/keyFile is provided. This can mask permission issues or misconfiguration, disabling mTLS without surfacing the error.

🔧 Suggested fix
 func buildTLSConfigFromFiles(caFile, certFile, keyFile string) (*tls.Config, error) {
 	config := &tls.Config{MinVersion: tls.VersionTLS12}

 	if caPool := loadCertPool(caFile); caPool != nil {
 		config.RootCAs = caPool
 	}

-	if certFile != "" && keyFile != "" {
+	if certFile != "" || keyFile != "" {
+		if certFile == "" || keyFile == "" {
+			return nil, fmt.Errorf("buildTLSConfigFromFiles: both certFile and keyFile must be set, got certFile=%q keyFile=%q", certFile, keyFile)
+		}
 		certPEM, err := os.ReadFile(certFile)
 		if err != nil {
-			// Client cert may not exist yet; return config without client cert.
-			return config, nil
+			if os.IsNotExist(err) {
+				return config, nil
+			}
+			return nil, fmt.Errorf("buildTLSConfigFromFiles: failed to read certFile %q: %w", certFile, err)
 		}
 		keyPEM, err := os.ReadFile(keyFile)
 		if err != nil {
-			return config, nil
+			if os.IsNotExist(err) {
+				return config, nil
+			}
+			return nil, fmt.Errorf("buildTLSConfigFromFiles: failed to read keyFile %q: %w", keyFile, err)
 		}
 		cert, err := tls.X509KeyPair(certPEM, keyPEM)
 		if err != nil {
 			return nil, fmt.Errorf("failed to parse client certificate: %w", err)
 		}
 		config.Certificates = []tls.Certificate{cert}
 	}

 	return config, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/config_reader.go` around lines 99 - 125,
The buildTLSConfigFromFiles function currently swallows client cert/key read
errors and allows only-one-of cert/key to silently disable mTLS; change it to
fail-fast: validate that certFile and keyFile are either both empty (no client
cert) or both non-empty, and if both are non-empty attempt to read certFile and
keyFile and return a descriptive error on any os.ReadFile failure (include which
path and the underlying error) instead of returning the config; keep the
existing tls.X509KeyPair parsing error behavior but only reach that after
successful reads; leave loadCertPool(caFile) handling unchanged.
control-plane-operator/metrics-proxy/proxy.go (1)

95-114: ⚠️ Potential issue | 🟠 Major

Return an error when all scrapes fail.

If every target fails, the handler still returns HTTP 200 with an empty body. Prometheus interprets this as a successful scrape with no metrics, masking complete failures.

🛠️ Suggested fix
+	successes := 0
 	for i, result := range results {
 		if result.Err != nil {
 			h.log.Error(result.Err, "scrape error", "target", targets[i].PodName)
 			continue
 		}
+		successes++

 		families := h.filter.Apply(componentName, result.Families)
 		families = h.labeler.Inject(families, targets[i], componentName, componentConfig.ServiceName)

 		for _, mf := range families {
 			if err := encoder.Encode(mf); err != nil {
 				h.log.Error(err, "encode error")
 			}
 		}
 	}

+	if successes == 0 && len(targets) > 0 {
+		http.Error(w, "all scrape targets failed", http.StatusBadGateway)
+		return
+	}
+
 	w.Header().Set("Content-Type", string(format))
 	w.WriteHeader(http.StatusOK)
 	_, _ = w.Write(buf.Bytes())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy.go` around lines 95 - 114, The
handler currently always returns HTTP 200 even when every scrape in results
failed; change the logic in the function that iterates results (the loop over
results/targets in proxy.go) to track successful scrapes (e.g., count non-nil
result.Err or increment on encode success) and if the count of successful
families is zero, write an error status (e.g., http.StatusBadGateway or
http.StatusServiceUnavailable) and an explanatory body before returning instead
of writing the normal 200 response; ensure this check happens after processing
results but before setting Content-Type/StatusOK and writing buf.Bytes(), and
keep existing logging via h.log.Error for individual scrape errors.
control-plane-operator/metrics-proxy/scraper.go (1)

71-77: ⚠️ Potential issue | 🟡 Minor

Normalize metricsPath to avoid malformed URLs.

If metricsPath is empty or missing a leading /, the constructed URL will be malformed (e.g., https://10.0.0.1:9443metrics).

🛠️ Suggested fix
 func (s *Scraper) scrape(ctx context.Context, client *http.Client, target ScrapeTarget, metricsPath, scheme string) ScrapeResult {
 	if scheme == "" {
 		scheme = "https"
 	}
+	if metricsPath == "" {
+		metricsPath = "/metrics"
+	} else if !strings.HasPrefix(metricsPath, "/") {
+		metricsPath = "/" + metricsPath
+	}

 	url := fmt.Sprintf("%s://%s:%d%s", scheme, target.PodIP, target.Port, metricsPath)

Add "strings" to imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/scraper.go` around lines 71 - 77, The
URL built in Scraper.scrape is malformed when metricsPath is empty or missing a
leading slash; update the Scraper.scrape function to normalize metricsPath by
defaulting it to a sensible path (e.g., "/metrics") when empty and ensuring it
begins with a "/" if provided without one, then use that normalized value in the
fmt.Sprintf that builds url; also add the "strings" import so you can
test/ensure the leading slash (e.g., with strings.HasPrefix) when normalizing
metricsPath.
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go (4)

74-77: ⚠️ Potential issue | 🟠 Major

Default endpoint scheme should follow Prometheus Operator (http when omitted).

Using "https" as the implicit default diverges from monitor API behavior and can break monitors that intentionally omit scheme.

In Prometheus Operator v0.88.0, what is the default value of ServiceMonitor.spec.endpoints[].scheme and PodMonitor.spec.podMetricsEndpoints[].scheme when omitted?

Also applies to: 138-141
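A minimal sketch of the corrected defaulting, assuming the endpoint scheme is surfaced as a *string (the actual monitor API field type may differ):

```go
package main

import "fmt"

// defaultScheme mirrors Prometheus Operator behavior: when a
// ServiceMonitor/PodMonitor endpoint omits scheme, "http" is assumed.
func defaultScheme(scheme *string) string {
	if scheme == nil || *scheme == "" {
		return "http"
	}
	return *scheme
}

func main() {
	https := "https"
	fmt.Println(defaultScheme(nil))    // omitted scheme falls back to http
	fmt.Println(defaultScheme(&https)) // explicit scheme is preserved
}
```

Applying the same helper at both call sites (lines 74-77 and 138-141) keeps the default consistent.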

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 74 - 77, The code currently defaults the endpoint scheme to "https"
when ep.Scheme is nil; change the implicit default to "http" to match Prometheus
Operator behavior (ServiceMonitor/PodMonitor default to "http" when omitted).
Update the occurrences where scheme is set from ep.Scheme (the local variable
named scheme initialized to "https" near the ep.Scheme check, and the analogous
block around lines 138-141) so they initialize scheme := "http" and otherwise
use ep.Scheme.String(); ensure both places use the same default to keep behavior
consistent.

221-225: ⚠️ Potential issue | 🟠 Major

Reject non-unique ServiceMonitor selector matches.

Returning the first matching Service is ambiguous; this should fail fast when more than one Service matches.

Suggested fix
 	if len(svcList.Items) == 0 {
 		return "", fmt.Errorf("no service found for ServiceMonitor %s", smName)
 	}
+	if len(svcList.Items) > 1 {
+		return "", fmt.Errorf("multiple services (%d) match ServiceMonitor %s; selector must be unique", len(svcList.Items), smName)
+	}
 
 	return svcList.Items[0].Name, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 221 - 225, The current logic returns svcList.Items[0].Name for a
ServiceMonitor match which is ambiguous; update the function that queries
svcList (the code using svcList and smName for ServiceMonitor resolution) to
explicitly reject non-unique matches by returning an error when
len(svcList.Items) > 1 (include smName and the number of matches in the error),
keep the existing error for len == 0, and only return svcList.Items[0].Name when
there is exactly one match.

241-246: ⚠️ Potential issue | 🟠 Major

Named Service targetPort is resolved incorrectly for pod-IP scraping.

When TargetPort is a name, falling back to p.Port can target the wrong port. Fail fast (or explicitly resolve the named pod container port) instead.

Suggested fail-fast fix
 import (
 	"fmt"
 	"path/filepath"
@@
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/util/intstr"
@@
 	for _, p := range svc.Spec.Ports {
 		if p.Name == portName || p.TargetPort.String() == portName {
+			if p.TargetPort.Type == intstr.String {
+				return 0, fmt.Errorf("service %s port %q uses named targetPort %q; resolve pod port before scraping", serviceName, p.Name, p.TargetPort.String())
+			}
 			if p.TargetPort.IntValue() > 0 {
 				return int32(p.TargetPort.IntValue()), nil
 			}
 			return p.Port, nil
 		}
 	}
When Kubernetes Service.spec.ports[].targetPort is a named string, does scraping PodIP:ServicePort correctly reach the container target port, or must the named container port be explicitly resolved first?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 241 - 246, The current loop over svc.Spec.Ports that compares
p.Name/TargetPort.String() against portName and then falls back to p.Port can
mis-target when p.TargetPort is a named string; update the logic in the block
that inspects svc.Spec.Ports (the loop using p.Name, p.TargetPort and portName)
to fail fast when p.TargetPort is a named value instead of an integer — detect
named target ports via p.TargetPort.Type (intstr.String) or by checking
TargetPort.String() != TargetPort.IntValue() and return an explicit error
indicating named targetPort cannot be used for PodIP scraping; alternatively (if
you prefer) implement explicit resolution of the named container port by looking
up container ports in the PodSpec before returning, but do not silently fall
back to p.Port.

121-128: ⚠️ Potential issue | 🟠 Major

PodMonitor endpoint port resolution misses portNumber and targetPort fallback.

Current logic only reads ep.Port, so valid PodMonitors using portNumber or targetPort are skipped.

Suggested fix
-		portName := ""
-		if ep.Port != nil {
-			portName = *ep.Port
-		}
-		if portName == "" {
-			log.V(4).Info("skipping PodMonitor: no port name", "podMonitor", pm.Name)
-			continue
-		}
-
-		// Resolve the port number from the component's Deployment.
-		// The PodMonitor name matches the component name by CPOv2 convention.
-		port, err := resolveDeploymentPort(cpContext, namespace, pm.Name, portName)
+		portName := ""
+		var port int32
+		hasNumericPort := false
+
+		if ep.Port != nil && *ep.Port != "" {
+			portName = *ep.Port
+		} else if ep.PortNumber != nil && *ep.PortNumber > 0 {
+			port = *ep.PortNumber
+			hasNumericPort = true
+		} else if ep.TargetPort != nil {
+			if ep.TargetPort.IntValue() > 0 {
+				port = int32(ep.TargetPort.IntValue())
+				hasNumericPort = true
+			} else {
+				portName = ep.TargetPort.String()
+			}
+		}
+
+		if !hasNumericPort && portName == "" {
+			log.V(4).Info("skipping PodMonitor: no port reference", "podMonitor", pm.Name)
+			continue
+		}
+
+		// Resolve named port from Deployment only when needed.
+		var err error
+		if !hasNumericPort {
+			port, err = resolveDeploymentPort(cpContext, namespace, pm.Name, portName)
+		}
 		if err != nil {
 			log.V(4).Info("skipping PodMonitor: port not resolvable", "podMonitor", pm.Name, "port", portName, "error", err)
 			continue
 		}
For Prometheus Operator v0.88.0 PodMonitor, what is the precedence among podMetricsEndpoints.port, portNumber, and targetPort?

Also applies to: 132-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 121 - 128, The current PodMonitor endpoint resolution in
scrape_config.go only reads ep.Port into portName and therefore skips endpoints
that specify portNumber or targetPort; update the logic handling ep (the
PodMetricsEndpoint) so it selects port in this precedence: ep.Port (name) first,
then ep.PortNumber (numeric), then ep.TargetPort (IntOrString), converting
numeric values to a string (use strconv.Itoa for PortNumber and the
IntOrString.String()/IntVal as appropriate) and assign that to portName; apply
the same fix to both places that currently only check ep.Port (the block using
portName and the similar block at the later lines) so valid PodMonitors using
portNumber or targetPort are not incorrectly skipped.
🧹 Nitpick comments (5)
control-plane-operator/metrics-proxy/endpoint_resolver_client.go (2)

25-41: Consider removing unused error return or adding validation.

NewEndpointResolverClient always returns nil for the error, making the error return value unused. Either add validation that could fail (e.g., URL parsing) or simplify the signature.
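A hedged sketch of the validation option, using an illustrative minimal struct (the real client also carries TLS configuration and an http.Client):

```go
package main

import (
	"fmt"
	"net/url"
)

// EndpointResolverClient is pared down to the one field relevant here.
type EndpointResolverClient struct {
	baseURL string
}

// NewEndpointResolverClient earns its error return by rejecting base
// URLs that would otherwise only fail later, at request time.
func NewEndpointResolverClient(baseURL string) (*EndpointResolverClient, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, fmt.Errorf("invalid endpoint-resolver base URL %q: %w", baseURL, err)
	}
	if u.Scheme == "" || u.Host == "" {
		return nil, fmt.Errorf("endpoint-resolver base URL %q must include scheme and host", baseURL)
	}
	return &EndpointResolverClient{baseURL: baseURL}, nil
}

func main() {
	// Missing scheme/host is caught at construction time.
	if _, err := NewEndpointResolverClient("endpoint-resolver:9443"); err != nil {
		fmt.Println("rejected:", err)
	}
	c, err := NewEndpointResolverClient("https://endpoint-resolver:9443")
	fmt.Println(c != nil, err)
}
```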

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/endpoint_resolver_client.go` around
lines 25 - 41, NewEndpointResolverClient currently always returns nil error;
either remove the unused error from its signature or add real validation and
return errors (prefer the latter). Parse and validate baseURL inside
NewEndpointResolverClient (use url.Parse or similar) and return a descriptive
error if invalid; keep the TLS setup (tlsConfig and loadCertPool) and the
http.Client construction unchanged but return the error when URL
parsing/validation fails so callers receive meaningful failures. Ensure the
function signature and all call sites are updated accordingly if you choose to
remove the error instead of adding validation.

46-52: Consider URL-escaping serviceName for robustness, though practical risk is low.

The serviceName is interpolated directly into the URL without escaping. While ServiceName comes from operator-controlled configuration rather than untrusted input and Kubernetes service names are constrained to DNS-1035 labels (alphanumeric + hyphens), adding url.PathEscape would be defensive programming. This would protect against misconfiguration or future changes to how component names are sourced.

Suggested improvement
+import "net/url"
+
 func (c *EndpointResolverClient) Discover(ctx context.Context, serviceName string, port int32) ([]ScrapeTarget, error) {
-	url := fmt.Sprintf("%s/resolve/%s", c.baseURL, serviceName)
+	resolveURL := fmt.Sprintf("%s/resolve/%s", c.baseURL, url.PathEscape(serviceName))

-	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
+	req, err := http.NewRequestWithContext(ctx, http.MethodGet, resolveURL, nil)
 	if err != nil {
-		return nil, fmt.Errorf("failed to create request for %s: %w", url, err)
+		return nil, fmt.Errorf("failed to create request for %s: %w", resolveURL, err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/endpoint_resolver_client.go` around
lines 46 - 52, In Discover, the serviceName is interpolated directly into the
request URL; escape it defensively by applying url.PathEscape(serviceName)
before building the URL (used where url := fmt.Sprintf("%s/resolve/%s",
c.baseURL, serviceName)); update any log or error messages that reference the
raw URL to use the escaped value so the request is robust against unexpected
characters or future changes to service name sources.
control-plane-operator/metrics-proxy/endpoint_resolver_client_test.go (1)

45-57: Consider verifying both returned targets.

The test verifies only the first target's fields. For completeness, consider also checking targets[1] to ensure all pods are correctly mapped.

✨ Suggested addition
 		if targets[0].Port != 6443 {
 			t.Errorf("expected port 6443, got %d", targets[0].Port)
 		}
+
+		if targets[1].PodName != "kube-apiserver-1" {
+			t.Errorf("expected pod name kube-apiserver-1, got %s", targets[1].PodName)
+		}
+		if targets[1].PodIP != "10.0.1.6" {
+			t.Errorf("expected pod IP 10.0.1.6, got %s", targets[1].PodIP)
+		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/endpoint_resolver_client_test.go` around
lines 45 - 57, The test currently asserts only the first element of the returned
slice "targets", so add assertions that validate the second element as well
(check targets[1].PodName, targets[1].PodIP, and targets[1].Port) to ensure both
pods are correctly mapped; update the test in endpoint_resolver_client_test.go
to explicitly verify targets length remains 2 and then assert the expected
values for targets[1] (mirroring the checks done for targets[0]) or, if order is
not guaranteed, assert both expected pods exist by matching PodName/PodIP/Port
across the slice.
control-plane-operator/metrics-proxy/filter.go (1)

83-107: Consider verifying component coverage is complete.

The switch statement handles 10 known components. If new components are added to the metrics-proxy scrape configuration, they won't have filtering applied (the lookup returns nil, so all metrics pass through).

Consider adding a log or metric for unknown components to help identify gaps during development.
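One stdlib-only sketch of surfacing the gap (a Prometheus counter would work equally well; component and function names here are illustrative stand-ins, not the actual ones in filter.go):

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
)

// unknownComponentTotal records lookups for components with no explicit
// relabel config; it could later be exported as a Prometheus counter.
var unknownComponentTotal atomic.Int64

// relabelConfigFor still returns nil (pass-through) for unknown
// components, but no longer does so silently.
func relabelConfigFor(component string) []string {
	switch component {
	case "kube-apiserver", "etcd": // stand-ins for the 10 known components
		return []string{"keep: apiserver_request_total"}
	default:
		unknownComponentTotal.Add(1)
		log.Printf("metrics-proxy: no relabel config for component %q; metrics pass through unfiltered", component)
		return nil
	}
}

func main() {
	relabelConfigFor("new-component")
	fmt.Println(unknownComponentTotal.Load())
}
```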

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/filter.go` around lines 83 - 107, The
default branch of getRelabelConfigsForComponent silently returns nil for unknown
componentName so new components pass through unfiltered; update the default case
to emit an explicit warning (including componentName) and increment a simple
Prometheus counter (e.g., MissingRelabelConfigTotal) to surface missing relabel
config coverage, then return nil as before; add the counter definition at
package scope and increment it inside the default branch so missing components
are logged and measurable.
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go (1)

565-577: TestCertVolumeOptionalFlag is trivial.

This test only verifies that ptr.To(true) returns a non-nil *bool with value true, which is effectively testing the k8s.io/utils/ptr library rather than your code. Consider removing this test or expanding it to verify actual volume construction behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`
around lines 565 - 577, The TestCertVolumeOptionalFlag test is trivial because
it only asserts ptr.To(true) from k8s.io/utils/ptr; remove this unit test or
replace it with a real behavior test that exercises your volume construction
code (e.g., the function that builds volumes from refs such as
BuildVolumesFromRefs or the deployment/volume builder used elsewhere in this
package) and assert the resulting corev1.Volume/VolumeMount or Secret/ConfigMap
projection has the Optional field set to pointer-to-true; update the test name
to reflect the behavior (e.g., TestBuildCertVolumeOptionalFlag) and assert
against the actual object returned rather than calling ptr.To directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/certs.go`:
- Around line 73-88: The serving certificate call to certs.ReconcileSignedCert
is passing nil for the extUsages parameter (in the invocation that creates the
"metrics-proxy" cert); update that argument to explicitly include
x509.ExtKeyUsageServerAuth by replacing the nil extUsages with
[]x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth} so the generated serving cert has
the correct Extended Key Usage (see certs.ReconcileSignedCert usage and
test/util/pki.go for the pattern).

In `@control-plane-operator/metrics-proxy/config_reader_test.go`:
- Around line 100-102: The test currently ignores the error returned by
r.Load(), which can mask load failures; update the subtest to capture the error
from r.Load() (call r := NewConfigFileReader(...); err := r.Load()), and
immediately fail the test if err != nil (e.g., use t.Fatalf or equivalent)
before calling r.GetComponent("nonexistent"), so the test fails on a load error
rather than a missing component; reference the NewConfigFileReader, Load,
GetComponent, and the r variable when making the change.

---

Duplicate comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-120: The AROSwift fixture's pod/deployment securityContext
only sets readOnlyRootFilesystem; update the securityContext block (the same
block containing readOnlyRootFilesystem) to also set allowPrivilegeEscalation:
false and runAsNonRoot: true and, if possible, add seccompProfile:
RuntimeDefault under securityContext to match the restricted posture; ensure
these entries are added alongside readOnlyRootFilesystem in the same
securityContext for the metrics-proxy deployment (refer to the securityContext
and terminationMessagePolicy around that container spec to locate the correct
place).

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-119: The container securityContext currently only sets
readOnlyRootFilesystem and must be hardened: update the securityContext block
for the deployment fixture (the same block containing readOnlyRootFilesystem) to
add allowPrivilegeEscalation: false, runAsNonRoot: true, and a capabilities.drop
entry that drops ALL (capabilities.drop: [ALL]) so it matches other platform
variants and satisfies CKV_K8S_20/CKV_K8S_23 checks.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-120: The container securityContext only sets
readOnlyRootFilesystem; update the securityContext block to harden the container
by adding allowPrivilegeEscalation: false and runAsNonRoot: true and set
seccompProfile.type: RuntimeDefault (under the same securityContext for the
container spec), ensuring YAML keys are at the same indentation level as
readOnlyRootFilesystem and use the correct mapping (seccompProfile: { type:
RuntimeDefault }) so the pod runtime enforces these restrictions.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`:
- Around line 118-119: Update the container securityContext that currently only
sets readOnlyRootFilesystem to harden it: add allowPrivilegeEscalation: false
and runAsNonRoot: true under the same securityContext, and additionally drop all
capabilities (e.g., capabilities: { drop: ["ALL"] }) and set seccompProfile to
RuntimeDefault (or the appropriate profile) so the metrics-proxy container’s
securityContext is explicit and restrictive; locate the securityContext block
for the metrics-proxy container in the fixture (the block containing
readOnlyRootFilesystem) and apply these additions.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`:
- Around line 19-76: The metrics-proxy container lacks a container-level
securityContext; update the pod template for the container named "metrics-proxy"
to harden it by adding a securityContext with at minimum:
allowPrivilegeEscalation: false, runAsNonRoot: true, runAsUser: (non-zero UID,
e.g. 1000), readOnlyRootFilesystem: true, capabilities: { drop: ["ALL"] }, and a
seccompProfile (e.g. type: RuntimeDefault); also consider adding equivalent
pod-level securityContext (fsGroup/runAsUser) if needed to support mounts so the
container-level changes do not break file access. Ensure the new fields are
added alongside the existing container spec (name: metrics-proxy, ports,
volumeMounts, probes) so the manifest remains valid.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 74-77: The code currently defaults the endpoint scheme to "https"
when ep.Scheme is nil; change the implicit default to "http" to match Prometheus
Operator behavior (ServiceMonitor/PodMonitor default to "http" when omitted).
Update the occurrences where scheme is set from ep.Scheme (the local variable
named scheme initialized to "https" near the ep.Scheme check, and the analogous
block around lines 138-141) so they initialize scheme := "http" and otherwise
use ep.Scheme.String(); ensure both places use the same default to keep behavior
consistent.
- Around line 221-225: The current logic returns svcList.Items[0].Name for a
ServiceMonitor match which is ambiguous; update the function that queries
svcList (the code using svcList and smName for ServiceMonitor resolution) to
explicitly reject non-unique matches by returning an error when
len(svcList.Items) > 1 (include smName and the number of matches in the error),
keep the existing error for len == 0, and only return svcList.Items[0].Name when
there is exactly one match.
- Around line 241-246: The current loop over svc.Spec.Ports that compares
p.Name/TargetPort.String() against portName and then falls back to p.Port can
mis-target when p.TargetPort is a named string; update the logic in the block
that inspects svc.Spec.Ports (the loop using p.Name, p.TargetPort and portName)
to fail fast when p.TargetPort is a named value instead of an integer — detect
named target ports via p.TargetPort.Type (intstr.String) or by checking
TargetPort.String() != TargetPort.IntValue() and return an explicit error
indicating named targetPort cannot be used for PodIP scraping; alternatively (if
you prefer) implement explicit resolution of the named container port by looking
up container ports in the PodSpec before returning, but do not silently fall
back to p.Port.
- Around line 121-128: The current PodMonitor endpoint resolution in
scrape_config.go only reads ep.Port into portName and therefore skips endpoints
that specify portNumber or targetPort; update the logic handling ep (the
PodMetricsEndpoint) so it selects port in this precedence: ep.Port (name) first,
then ep.PortNumber (numeric), then ep.TargetPort (IntOrString), converting
numeric values to a string (use strconv.Itoa for PortNumber and the
IntOrString.String()/IntVal as appropriate) and assign that to portName; apply
the same fix to both places that currently only check ep.Port (the block using
portName and the similar block at the later lines) so valid PodMonitors using
portNumber or targetPort are not incorrectly skipped.

In `@control-plane-operator/metrics-proxy/auth_test.go`:
- Around line 154-204: The test uses a plain int variable kasCallCount that is
incremented inside the httptest server handler goroutine and read from the test
goroutine; replace kasCallCount with an atomic counter (e.g., int32/int64) and
update the handler to increment via atomic.AddInt32/AddInt64, and read checks to
use atomic.LoadInt32/LoadInt64; also add the sync/atomic import. Reference
NewTokenAuthenticator, Authenticate, the kasServer http.HandlerFunc and the
kasCallCount checks so you update the increment in the handler and the two reads
that assert expectCacheHit/!expectCacheHit.

In `@control-plane-operator/metrics-proxy/cmd.go`:
- Around line 113-130: The HTTP server is created without request timeouts and
shutdown uses a background context which can block indefinitely; update the
server struct (variable server) to set ReadTimeout, WriteTimeout and IdleTimeout
values appropriate for your workload and replace the blind
server.Shutdown(context.Background()) call in the goroutine with a context that
has a deadline (e.g., context.WithTimeout) so shutdown will abort after a
bounded time; ensure the same goroutine cancels correctly on signal via sigChan
and uses server.Shutdown(ctxWithTimeout) and handles the timeout error path.
- Line 50: Validate the metricsSet flag value (set via
cmd.Flags().StringVar(&metricsSet, "metrics-set", ...)) before converting or
casting it to metrics.MetricsSet; specifically normalize casing (e.g.,
strings.Title or strings.EqualFold checks) and explicitly allow only the known
values ("All", "Telemetry", "SRE"), returning a clear error/exit if the value is
unknown so you don't blindly do metrics.MetricsSet(metricsSet) on typos or
unexpected input; update the flag handling code to perform this validation right
after flag parsing and use the validated/normalized value for the conversion.
- Around line 62-63: The code creates a cancellable context but discards it by
assigning the new context to `_` while still calling `cancel()`, so cancellation
never reaches consumers; change the call to capture the new context (e.g., `ctx,
cancel := context.WithCancel(cmd.Context())`), pass `ctx` into any long-lived
functions or servers started in this file (e.g., functions/methods that accept
contexts or the main run/serve routines), and keep `defer cancel()` so shutdown
propagates; if no consumer accepts a context, remove the `WithCancel` call and
just use `cmd.Context()` directly along with removing `cancel()` to avoid a
no-op.
- Line 65: The code reads MY_NAMESPACE into namespace using namespace :=
os.Getenv("MY_NAMESPACE") but doesn't validate it; update the startup logic
(where namespace is set) to fail fast when namespace == "" by logging a clear
error and exiting with a non-zero status (e.g., use log.Fatalf or
fmt.Fprintln(os.Stderr, ...) + os.Exit(1)) so metrics never run with an empty
namespace label.

In `@control-plane-operator/metrics-proxy/config_reader.go`:
- Around line 99-125: The buildTLSConfigFromFiles function currently swallows
client cert/key read errors and allows only-one-of cert/key to silently disable
mTLS; change it to fail-fast: validate that certFile and keyFile are either both
empty (no client cert) or both non-empty, and if both are non-empty attempt to
read certFile and keyFile and return a descriptive error on any os.ReadFile
failure (include which path and the underlying error) instead of returning the
config; keep the existing tls.X509KeyPair parsing error behavior but only reach
that after successful reads; leave loadCertPool(caFile) handling unchanged.

In `@control-plane-operator/metrics-proxy/proxy_test.go`:
- Around line 73-75: The test case using a JWT-looking string in the table (the
"When Authorization header contains Bearer token with special characters, it
should return the token" entry) should be updated to use a scanner-safe
placeholder: replace the values assigned to the authorization and expected
fields with a clearly non-JWT placeholder that preserves special characters
(e.g., "Bearer token-with-special_chars+./_-XYZ") so the test still validates
parsing of special characters without containing a real JWT-looking secret;
update the authorization and expected entries in that test case accordingly.

In `@control-plane-operator/metrics-proxy/proxy.go`:
- Around line 95-114: The handler currently always returns HTTP 200 even when
every scrape in results failed; change the logic in the function that iterates
results (the loop over results/targets in proxy.go) to track successful scrapes
(e.g., count non-nil result.Err or increment on encode success) and if the count
of successful families is zero, write an error status (e.g.,
http.StatusBadGateway or http.StatusServiceUnavailable) and an explanatory body
before returning instead of writing the normal 200 response; ensure this check
happens after processing results but before setting Content-Type/StatusOK and
writing buf.Bytes(), and keep existing logging via h.log.Error for individual
scrape errors.

In `@control-plane-operator/metrics-proxy/scraper.go`:
- Around line 71-77: The URL built in Scraper.scrape is malformed when
metricsPath is empty or missing a leading slash; update the Scraper.scrape
function to normalize metricsPath by defaulting it to a sensible path (e.g.,
"/metrics") when empty and ensuring it begins with a "/" if provided without
one, then use that normalized value in the fmt.Sprintf that builds url; also add
the "strings" import so you can test/ensure the leading slash (e.g., with
strings.HasPrefix) when normalizing metricsPath.
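The normalization step is small enough to show in full; the helper name is illustrative and the "/metrics" default is the sensible fallback the prompt suggests:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeMetricsPath defaults an empty path to "/metrics" and guarantees
// a leading slash, so the URL built in Scraper.scrape is never malformed.
func normalizeMetricsPath(p string) string {
	if p == "" {
		return "/metrics"
	}
	if !strings.HasPrefix(p, "/") {
		return "/" + p
	}
	return p
}

func main() {
	for _, p := range []string{"", "metrics", "/metrics", "custom/path"} {
		fmt.Printf("%q -> %q\n", p, normalizeMetricsPath(p))
	}
	// Example of the URL construction using the normalized value.
	fmt.Println(fmt.Sprintf("https://10.0.0.5:8443%s", normalizeMetricsPath("")))
}
```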

---

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`:
- Around line 565-577: The TestCertVolumeOptionalFlag test is trivial because it
only asserts ptr.To(true) from k8s.io/utils/ptr; remove this unit test or
replace it with a real behavior test that exercises your volume construction
code (e.g., the function that builds volumes from refs such as
BuildVolumesFromRefs or the deployment/volume builder used elsewhere in this
package) and assert the resulting corev1.Volume/VolumeMount or Secret/ConfigMap
projection has the Optional field set to pointer-to-true; update the test name
to reflect the behavior (e.g., TestBuildCertVolumeOptionalFlag) and assert
against the actual object returned rather than calling ptr.To directly.

In `@control-plane-operator/metrics-proxy/endpoint_resolver_client_test.go`:
- Around line 45-57: The test currently asserts only the first element of the
returned slice "targets", so add assertions that validate the second element as
well (check targets[1].PodName, targets[1].PodIP, and targets[1].Port) to ensure
both pods are correctly mapped; update the test in
endpoint_resolver_client_test.go to explicitly verify targets length remains 2
and then assert the expected values for targets[1] (mirroring the checks done
for targets[0]) or, if order is not guaranteed, assert both expected pods exist
by matching PodName/PodIP/Port across the slice.

In `@control-plane-operator/metrics-proxy/endpoint_resolver_client.go`:
- Around line 25-41: NewEndpointResolverClient currently always returns nil
error; either remove the unused error from its signature or add real validation
and return errors (prefer the latter). Parse and validate baseURL inside
NewEndpointResolverClient (use url.Parse or similar) and return a descriptive
error if invalid; keep the TLS setup (tlsConfig and loadCertPool) and the
http.Client construction unchanged but return the error when URL
parsing/validation fails so callers receive meaningful failures. Ensure the
function signature and all call sites are updated accordingly if you choose to
remove the error instead of adding validation.
- Around line 46-52: In Discover, the serviceName is interpolated directly into
the request URL; escape it defensively by applying url.PathEscape(serviceName)
before building the URL (used where url := fmt.Sprintf("%s/resolve/%s",
c.baseURL, serviceName)); update any log or error messages that reference the
raw URL to use the escaped value so the request is robust against unexpected
characters or future changes to service name sources.
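Both endpoint_resolver_client.go suggestions, validating baseURL up front and escaping serviceName, can be sketched together. The helper names are illustrative; the real client wraps this in NewEndpointResolverClient and Discover:

```go
package main

import (
	"fmt"
	"net/url"
)

// validateBaseURL returns a descriptive error for malformed base URLs
// instead of the constructor always returning nil.
func validateBaseURL(baseURL string) error {
	u, err := url.Parse(baseURL)
	if err != nil {
		return fmt.Errorf("invalid endpoint-resolver base URL %q: %w", baseURL, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("endpoint-resolver base URL %q must use http or https", baseURL)
	}
	if u.Host == "" {
		return fmt.Errorf("endpoint-resolver base URL %q has no host", baseURL)
	}
	return nil
}

// buildResolveURL path-escapes serviceName defensively before interpolation,
// so unexpected characters cannot alter the request path.
func buildResolveURL(baseURL, serviceName string) string {
	return fmt.Sprintf("%s/resolve/%s", baseURL, url.PathEscape(serviceName))
}

func main() {
	fmt.Println(validateBaseURL("https://endpoint-resolver:9443"))
	fmt.Println(validateBaseURL("://missing-scheme"))
	fmt.Println(buildResolveURL("https://endpoint-resolver:9443", "kube-apiserver"))
}
```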

In `@control-plane-operator/metrics-proxy/filter.go`:
- Around line 83-107: The default branch of getRelabelConfigsForComponent
silently returns nil for unknown componentName so new components pass through
unfiltered; update the default case to emit an explicit warning (including
componentName) and increment a simple Prometheus counter (e.g.,
MissingRelabelConfigTotal) to surface missing relabel config coverage, then
return nil as before; add the counter definition at package scope and increment
it inside the default branch so missing components are logged and measurable.
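A minimal sketch of that default-branch instrumentation follows. To stay dependency-free it uses an atomic counter in place of the suggested Prometheus counter (e.g. MissingRelabelConfigTotal), and the component-to-config mapping is a hypothetical stand-in for getRelabelConfigsForComponent in filter.go:

```go
package main

import (
	"log"
	"sync/atomic"
)

// missingRelabelConfigTotal stands in for the suggested Prometheus counter;
// in the real code this would be a prometheus.Counter at package scope.
var missingRelabelConfigTotal atomic.Int64

// knownRelabelConfigs is illustrative only; the real mapping lives in
// getRelabelConfigsForComponent.
var knownRelabelConfigs = map[string][]string{
	"kube-apiserver": {"drop_high_cardinality"},
}

func relabelConfigsFor(componentName string) []string {
	if cfg, ok := knownRelabelConfigs[componentName]; ok {
		return cfg
	}
	// Unknown components still pass through unfiltered, but the gap is now
	// logged and measurable instead of silent.
	missingRelabelConfigTotal.Add(1)
	log.Printf("no relabel config for component %q; metrics pass through unfiltered", componentName)
	return nil
}

func main() {
	relabelConfigsFor("kube-apiserver")
	relabelConfigsFor("brand-new-component")
	log.Printf("missing relabel configs seen: %d", missingRelabelConfigTotal.Load())
}
```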


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (7)
control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml (1)

116-117: ⚠️ Potential issue | 🟠 Major

Harden container securityContext to enforce non-root/no-escalation.

At Line 116, the container only sets readOnlyRootFilesystem. Please also set allowPrivilegeEscalation: false and runAsNonRoot: true (and ideally drop all capabilities + RuntimeDefault seccomp) to satisfy baseline restricted posture. Since this is fixture data, update the source manifest/template and regenerate fixtures.

Suggested diff
         securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          capabilities:
+            drop:
+            - ALL
+          seccompProfile:
+            type: RuntimeDefault
           readOnlyRootFilesystem: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml`
around lines 116 - 117, The container securityContext currently only sets
readOnlyRootFilesystem; update the metrics-proxy deployment fixture's container
securityContext to enforce non-root and no privilege escalation by adding
allowPrivilegeEscalation: false and runAsNonRoot: true, drop all capabilities
(capabilities: { drop: ["ALL"] }) and set seccompProfile to RuntimeDefault; make
these changes in the source manifest/template that generates the fixture and
then regenerate the fixtures so the test data reflects the hardened
securityContext.
control-plane-operator/metrics-proxy/cmd.go (4)

61-61: ⚠️ Potential issue | 🟡 Minor

Guard against missing MY_NAMESPACE at startup.

If the env var is empty, emitted labels can be malformed (namespace="") and break downstream dashboards/routing.

🛡️ Proposed guard
 		namespace := os.Getenv("MY_NAMESPACE")
+		if namespace == "" {
+			fmt.Fprintln(os.Stderr, "Error: MY_NAMESPACE environment variable is required")
+			os.Exit(1)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` at line 61, The code reads
namespace := os.Getenv("MY_NAMESPACE") without validating it, which can emit
empty/malformed labels; update startup in cmd.go to validate the MY_NAMESPACE
env var (the namespace variable) and fail fast or set a safe default: check if
namespace == "" and either log a clear error via the existing logger and exit
non-zero or assign a known fallback value before any metric labeling; ensure the
validation runs before any metric registration or label emission (where
namespace is used) so downstream dashboards/routing are not broken.

58-59: ⚠️ Potential issue | 🟠 Major

Fix shutdown context wiring: discarded context + unbounded Shutdown.

Line 58 discards the created context, and Line 125 uses context.Background() with no deadline. This can hang shutdown indefinitely.

♻️ Proposed fix
 import (
 	"context"
@@
 	"syscall"
+	"time"
@@
-		_, cancel := context.WithCancel(cmd.Context())
+		ctx, cancel := context.WithCancel(cmd.Context())
 		defer cancel()
@@
 		go func() {
 			sig := <-sigChan
 			l.Info("Received shutdown signal", "signal", sig)
 			cancel()
-			if err := server.Shutdown(context.Background()); err != nil {
+			shutdownCtx, shutdownCancel := context.WithTimeout(ctx, 10*time.Second)
+			defer shutdownCancel()
+			if err := server.Shutdown(shutdownCtx); err != nil {
 				l.Error(err, "Error during server shutdown")
 			}
 		}()

Also applies to: 121-126

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` around lines 58 - 59, The
shutdown context is being discarded (context.WithCancel(cmd.Context()) result
ignored) and Shutdown is called with context.Background(), causing potential
indefinite hang; replace the discard with capturing the context and cancel
(e.g., ctx, cancel := context.WithCancel(cmd.Context())) and pass a bounded
context to the HTTP server shutdown (use context.WithTimeout(ctx,
<reasonableDuration>) before calling server.Shutdown) and ensure cancel() is
deferred; update the code paths around the existing context.WithCancel and the
server.Shutdown call so they use the same bounded ctx to properly propagate
cancellation and enforce a shutdown timeout.

107-116: ⚠️ Potential issue | 🟠 Major

Add HTTP server timeouts on the exposed TLS endpoint.

The server has no read/write/header/idle timeouts, which weakens resilience under slow-client and connection-stall scenarios.

⏱️ Proposed fix
 		server := &http.Server{
 			Addr:    fmt.Sprintf(":%d", servingPort),
 			Handler: mux,
 			TLSConfig: &tls.Config{
 				Certificates: []tls.Certificate{tlsCert},
 				ClientAuth:   tls.RequireAndVerifyClientCert,
 				ClientCAs:    clientCAPool,
 				MinVersion:   tls.VersionTLS12,
 			},
+			ReadHeaderTimeout: 10 * time.Second,
+			ReadTimeout:       30 * time.Second,
+			WriteTimeout:      60 * time.Second,
+			IdleTimeout:       120 * time.Second,
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` around lines 107 - 116, The TLS
http.Server created in the server variable lacks timeouts which can allow
slow-client or stalled connections to tie up resources; update the server
instantiation (the http.Server literal where server is assigned) to set sensible
ReadTimeout, ReadHeaderTimeout, WriteTimeout and IdleTimeout values (e.g.,
seconds appropriate for your app) to harden the exposed TLS endpoint and prevent
resource exhaustion from slow or stalled connections.

47-47: ⚠️ Potential issue | 🟡 Minor

Validate --metrics-set before casting to metrics.MetricsSet.

Invalid values are currently accepted and silently alter filter behavior. Reject unknown values early.

✅ Proposed validation
 		scraper := NewScraper()
-		filter := NewFilter(metrics.MetricsSet(metricsSet))
+		switch metrics.MetricsSet(metricsSet) {
+		case metrics.MetricsSetAll, metrics.MetricsSetTelemetry, metrics.MetricsSetSRE:
+		default:
+			fmt.Fprintf(os.Stderr, "Error: invalid --metrics-set %q (valid: All, Telemetry, SRE)\n", metricsSet)
+			os.Exit(1)
+		}
+		filter := NewFilter(metrics.MetricsSet(metricsSet))
 		labeler := NewLabeler(namespace)

Also applies to: 91-91

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/cmd.go` at line 47, Validate the
metricsSet flag value before converting it to metrics.MetricsSet: after the flag
is parsed (the variable metricsSet set via cmd.Flags().StringVar) check that
metricsSet is one of the allowed strings (e.g. "All", "Telemetry", "SRE"); if
not, return an error / print a clear message and exit non‑zero so unknown values
are rejected early; then perform the cast/convert to metrics.MetricsSet only for
validated values (update the code paths that currently cast metricsSet to
metrics.MetricsSet to rely on this validation).
control-plane-operator/metrics-proxy/proxy.go (1)

78-96: ⚠️ Potential issue | 🟠 Major

Return a non-2xx response when all target scrapes fail.

Right now, all-error scrape cycles still return 200. Prometheus will treat that as success even though no target data was collected.

🧯 Proposed fix
-	for i, result := range results {
+	successes := 0
+	for i, result := range results {
 		if result.Err != nil {
 			h.log.Error(result.Err, "scrape error", "target", targets[i].PodName)
 			continue
 		}
@@
 		for _, mf := range families {
 			if err := encoder.Encode(mf); err != nil {
 				h.log.Error(err, "encode error")
+				continue
 			}
+			successes++
 		}
 	}
 
+	if successes == 0 {
+		http.Error(w, "all scrape targets failed", http.StatusBadGateway)
+		return
+	}
+
 	w.Header().Set("Content-Type", string(format))
 	w.WriteHeader(http.StatusOK)
 	_, _ = w.Write(buf.Bytes())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/metrics-proxy/proxy.go` around lines 78 - 96, The
handler currently always writes a 200 and response body even when every entry in
results has result.Err set; change the logic after the scrape loop (iterating
results) to detect if all scrapes failed (e.g., count successes or check if any
result.Err == nil) and when so write a non-2xx status (recommend
http.StatusBadGateway or http.StatusServiceUnavailable), log a summary via h.log
(include targets or error summary), and write a brief error response instead of
buf.Bytes(); keep the existing success path (Content-Type, StatusOK,
buf.Bytes()) when at least one scrape succeeded. Ensure you reference the same
variables used in the diff (results, targets, buf, encoder, h.log) so the
condition is evaluated in the same handler flow.
control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml (1)

19-67: ⚠️ Potential issue | 🟠 Major

Harden container securityContext in the source asset manifest.

The container currently has no explicit hardening controls. Please set allowPrivilegeEscalation: false and runAsNonRoot: true (and optionally drop all capabilities) in this source manifest, then regenerate testdata/metrics-proxy/* fixtures.

🔒 Proposed hardening
       - args:
         - --tls-cert-file
@@
         image: controlplane-operator
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          readOnlyRootFilesystem: true
+          capabilities:
+            drop:
+            - ALL
         livenessProbe:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml`
around lines 19 - 67, Add a hardened container securityContext to the
metrics-proxy container: under the containers entry for name "metrics-proxy" set
securityContext.allowPrivilegeEscalation=false and
securityContext.runAsNonRoot=true and add securityContext.capabilities.drop:
["ALL"] (or an equivalent drop list); ensure the securityContext sits alongside
the existing ports/volumeMounts/livenessProbe/readinessProbe fields in the same
container spec, then regenerate the testdata/metrics-proxy/* fixtures to reflect
the updated source manifest.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5e702a10-dcc3-4b5c-a34e-d32b9947c62c

📥 Commits

Reviewing files that changed from the base of the PR and between 40e9f45 and 5beac38.

📒 Files selected for processing (9)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/AROSwift/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/IBMCloud/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/metrics-proxy/deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go
  • control-plane-operator/metrics-proxy/cmd.go
  • control-plane-operator/metrics-proxy/proxy.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/metrics-proxy/GCP/zz_fixture_TestControlPlaneComponents_metrics_proxy_deployment.yaml

Comment on lines +66 to +77
	for i := range smList.Items {
		sm := &smList.Items[i]
		if len(sm.Spec.Endpoints) == 0 {
			continue
		}
		ep := sm.Spec.Endpoints[0]
		if ep.TLSConfig == nil {
			continue
		}

		collectSecretOrConfigMapRef(refs, ep.TLSConfig.CA)
		collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "deployment.go" -path "*/metrics_proxy/*" -type f

Repository: openshift/hypershift

Length of output: 150


🏁 Script executed:

wc -l ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go

Repository: openshift/hypershift

Length of output: 154


🏁 Script executed:

sed -n '60,110p' ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go

Repository: openshift/hypershift

Length of output: 1494


🏁 Script executed:

# Check if there are any references to multiple endpoints in the codebase
rg "Endpoints\[" control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/ -A 2 -B 2

Repository: openshift/hypershift

Length of output: 2192


🏁 Script executed:

# Check the scrape_config.go file to see if it has similar patterns
sed -n '1,150p' ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go | head -80

Repository: openshift/hypershift

Length of output: 2426


🏁 Script executed:

# Let's check the Prometheus operator imports and see the type definitions
rg "PodMonitor\|ServiceMonitor" ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/ --type go | head -20

Repository: openshift/hypershift

Length of output: 46


🏁 Script executed:

# Search for import statements to understand the types being used
rg "prometheusoperatorv1" ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go -A 1 -B 1

Repository: openshift/hypershift

Length of output: 747


🏁 Script executed:

# Let's look at the complete function to understand the context better
sed -n '45,130p' ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go

Repository: openshift/hypershift

Length of output: 2651


🏁 Script executed:

# Let's search for how these endpoints are used elsewhere
rg "for.*range.*Endpoints" ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/ -B 2 -A 5

Repository: openshift/hypershift

Length of output: 46


🏁 Script executed:

# Let's check if there are any tests that might indicate expected behavior
find . -path "*/metrics_proxy/*test*" -type f -name "*.go"

Repository: openshift/hypershift

Length of output: 250


🏁 Script executed:

# Let's check the deployment test file
cat ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go

Repository: openshift/hypershift

Length of output: 18582


🏁 Script executed:

# Check scrape_config.go to see if it has similar pattern
sed -n '45,100p' ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go

Repository: openshift/hypershift

Length of output: 1592


Process all monitor endpoints, not just index 0.

Only the first endpoint from each ServiceMonitor and PodMonitor is inspected. ServiceMonitors and PodMonitors can have multiple endpoints, each with distinct TLS configurations. TLS references in additional endpoints are skipped, causing required cert volumes to not be mounted. When the metrics proxy attempts to scrape those endpoints, it will fail due to missing certificates.

Fix by iterating through all endpoints in each monitor instead of selecting only [0].

🛠️ Proposed fix
-	for i := range smList.Items {
-		sm := &smList.Items[i]
-		if len(sm.Spec.Endpoints) == 0 {
-			continue
-		}
-		ep := sm.Spec.Endpoints[0]
-		if ep.TLSConfig == nil {
-			continue
-		}
-
-		collectSecretOrConfigMapRef(refs, ep.TLSConfig.CA)
-		collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert)
-		if ep.TLSConfig.KeySecret != nil {
-			refs[ep.TLSConfig.KeySecret.Name] = &certRef{
-				name:     ep.TLSConfig.KeySecret.Name,
-				isSecret: true,
-			}
-		}
-	}
+	for i := range smList.Items {
+		sm := &smList.Items[i]
+		for _, ep := range sm.Spec.Endpoints {
+			if ep.TLSConfig == nil {
+				continue
+			}
+			collectSecretOrConfigMapRef(refs, ep.TLSConfig.CA)
+			collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert)
+			if ep.TLSConfig.KeySecret != nil {
+				refs[ep.TLSConfig.KeySecret.Name] = &certRef{
+					name:     ep.TLSConfig.KeySecret.Name,
+					isSecret: true,
+				}
+			}
+		}
+	}
@@
-	for i := range pmList.Items {
-		pm := &pmList.Items[i]
-		if len(pm.Spec.PodMetricsEndpoints) == 0 {
-			continue
-		}
-		ep := pm.Spec.PodMetricsEndpoints[0]
-		if ep.TLSConfig == nil {
-			continue
-		}
-
-		collectSecretOrConfigMapRef(refs, ep.TLSConfig.CA)
-		collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert)
-		if ep.TLSConfig.KeySecret != nil {
-			refs[ep.TLSConfig.KeySecret.Name] = &certRef{
-				name:     ep.TLSConfig.KeySecret.Name,
-				isSecret: true,
-			}
-		}
-	}
+	for i := range pmList.Items {
+		pm := &pmList.Items[i]
+		for _, ep := range pm.Spec.PodMetricsEndpoints {
+			if ep.TLSConfig == nil {
+				continue
+			}
+			collectSecretOrConfigMapRef(refs, ep.TLSConfig.CA)
+			collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert)
+			if ep.TLSConfig.KeySecret != nil {
+				refs[ep.TLSConfig.KeySecret.Name] = &certRef{
+					name:     ep.TLSConfig.KeySecret.Name,
+					isSecret: true,
+				}
+			}
+		}
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-	for i := range smList.Items {
-		sm := &smList.Items[i]
-		if len(sm.Spec.Endpoints) == 0 {
-			continue
-		}
-		ep := sm.Spec.Endpoints[0]
-		if ep.TLSConfig == nil {
-			continue
-		}
-
-		collectSecretOrConfigMapRef(refs, ep.TLSConfig.CA)
-		collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert)
+	for i := range smList.Items {
+		sm := &smList.Items[i]
+		for _, ep := range sm.Spec.Endpoints {
+			if ep.TLSConfig == nil {
+				continue
+			}
+			collectSecretOrConfigMapRef(refs, ep.TLSConfig.CA)
+			collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert)
+			if ep.TLSConfig.KeySecret != nil {
+				refs[ep.TLSConfig.KeySecret.Name] = &certRef{
+					name:     ep.TLSConfig.KeySecret.Name,
+					isSecret: true,
+				}
+			}
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment.go`
around lines 66 - 77, The loop currently only inspects the first endpoint (ep :=
sm.Spec.Endpoints[0]) for each monitor so TLS CA/Cert refs on later endpoints
are ignored; change the logic to iterate over all endpoints (for _, ep := range
sm.Spec.Endpoints or similar) and call collectSecretOrConfigMapRef(refs,
ep.TLSConfig.CA) and collectSecretOrConfigMapRef(refs, ep.TLSConfig.Cert) for
each endpoint where ep.TLSConfig != nil; apply the same change for PodMonitor
processing if it uses the same pattern to ensure all endpoint TLS references are
collected.

Replace bearer token + TokenReview authentication with mTLS client
certificate verification in the metrics-proxy component. The proxy now
requires client certificates signed by the cluster-signer-ca CA bundle,
eliminating the dependency on the hosted kube-apiserver for
authentication.

Changes:
- Remove --kas-kubeconfig and --authorized-sa flags, add --client-ca-file
- Configure TLS with RequireAndVerifyClientCert using cluster-signer-ca
- Remove TokenAuthenticator and bearer token extraction from proxy handler
- Mount cluster-signer-ca Secret instead of KAS kubeconfig in deployment
- Regenerate test fixtures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@muraee muraee force-pushed the feat-metrics-proxy branch from 5beac38 to 2223cf1 on March 4, 2026 at 11:17
@muraee muraee changed the title feat(CPO): add metrics-proxy control plane component CNTRLPLANE-2838: feat(CPO): add metrics-proxy control plane component Mar 4, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 4, 2026
@openshift-ci-robot
openshift-ci-robot commented Mar 4, 2026

@muraee: This pull request references CNTRLPLANE-2838 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

  • Adds the metrics-proxy as a new CPOv2 control plane component that exposes hosted control plane
    component metrics externally via an authenticated HTTPS route, as described in the
    hosted control plane metrics exposure enhancement
  • The proxy authenticates requests using bearer tokens validated against the management cluster's
    TokenReview API, discovers component pod targets via the endpoint-resolver, scrapes metrics from
    individual pods in parallel, and applies per-component filtering and labeling
  • Exposes an aggregated Prometheus-compatible /metrics/<component> endpoint over a TLS route
  • Includes dynamic scrape configuration loaded from a ConfigMap, self-signed CA/serving certificate
    management, and comprehensive unit tests

Test plan

  • Verify generated test fixtures match expected output across all platform variants (default, AROSwift, GCP, IBMCloud, TechPreviewNoUpgrade)
  • Run go test ./control-plane-operator/metrics-proxy/... to validate proxy unit tests
  • Run go test ./control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/... to validate component tests
  • Run make test for full unit test suite
  • Run make verify to ensure code generation and formatting are correct

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
  • Added metrics-proxy component to enable secure metrics collection and forwarding from hosted control planes
  • Supports dynamic scrape configuration generation from monitoring resources across multiple cloud platforms (AROSwift, GCP, IBMCloud)
  • Implements TLS-based encryption for metrics endpoints with certificate management
  • Enables per-component metrics filtering and label enrichment for improved observability

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

…ic to metrics-proxy

Add support for annotation-based label values on ServiceMonitor/PodMonitor
resources so that metrics-proxy injects labels matching standalone OCP
Prometheus conventions (e.g. namespace=openshift-etcd instead of
namespace=clusters-my-cluster).

Four new annotations are read from SM/PM objects:
- hypershift.openshift.io/metrics-job
- hypershift.openshift.io/metrics-namespace
- hypershift.openshift.io/metrics-service
- hypershift.openshift.io/metrics-endpoint

These propagate through the scrape config ConfigMap into the labeler,
which falls back to the previous defaults when annotations are absent.

Also add a synthetic metrics_aggregator_pod_scrape_errors_total counter
metric emitted when individual pod scrapes fail, with component and pod
labels identifying the failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
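
The annotation fallback behavior described in this commit can be sketched like this. The helper name and the defaults map are hypothetical; the four annotation keys are the ones listed above.

```go
package main

const annotationPrefix = "hypershift.openshift.io/"

// labelOverrides resolves the label values the labeler should inject for a
// ServiceMonitor/PodMonitor, preferring the hypershift.openshift.io/metrics-*
// annotations and falling back to the supplied defaults when absent.
func labelOverrides(annotations, defaults map[string]string) map[string]string {
	out := map[string]string{}
	for _, key := range []string{"metrics-job", "metrics-namespace", "metrics-service", "metrics-endpoint"} {
		if v, ok := annotations[annotationPrefix+key]; ok && v != "" {
			out[key] = v
		} else {
			out[key] = defaults[key]
		}
	}
	return out
}
```

With this shape, an etcd ServiceMonitor annotated with `hypershift.openshift.io/metrics-namespace: openshift-etcd` yields `namespace=openshift-etcd` on its series, while unannotated monitors keep the previous hosted-namespace defaults.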
@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2026

@muraee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-azure-self-managed
Commit: f0efefc
Details: link
Required: false
Rerun command: /test e2e-azure-self-managed

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@csrwng csrwng left a comment

Just a couple of comments from my first pass. It's looking good.

The internal router needs to take the new metrics-proxy route into consideration when building its config
https://github.com/openshift/hypershift/blob/11b52dac18213e6ebb05b478744b2d409fbfbaf8/control-plane-operator/controllers/hostedcontrolplane/v2/router/config.go

URL: endpointResolverURL,
CAFile: endpointResolverCA,
},
Components: make(map[string]metricsproxybin.ComponentFileConfig),
Contributor

what ensures this is consistently serialized? Would it be better to use a slice that you can sort by component name?
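
The concern above is that Go map iteration order is randomized, so rendering a `map[string]ComponentFileConfig` into a file can produce differently ordered output on each reconcile. A sketch of the suggested fix, using simplified stand-in types rather than the PR's actual structs:

```go
package main

import "sort"

// ComponentFileConfig is a simplified stand-in for the PR's per-component
// config struct.
type ComponentFileConfig struct {
	MetricsPath string `json:"metricsPath"`
}

type componentEntry struct {
	Name   string              `json:"name"`
	Config ComponentFileConfig `json:"config"`
}

// sortedComponents converts the map into a slice ordered by component name,
// giving a stable serialization regardless of map iteration order.
func sortedComponents(m map[string]ComponentFileConfig) []componentEntry {
	out := make([]componentEntry, 0, len(m))
	for name, cfg := range m {
		out = append(out, componentEntry{Name: name, Config: cfg})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}
```

(Note that `encoding/json` does sort map keys when marshaling, so the map is safe if the file is produced purely via `json.Marshal`; the sorted slice makes the ordering explicit and template-friendly either way.)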

@csrwng
Contributor

csrwng commented Mar 4, 2026

We should also add e2e validation that the proxy is working and returning metrics with the right labels, and from all pods, etc.


Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
