
OCPBUGS-83538: fix(metrics-proxy): resolve ports from pods instead of deployments #8221

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from muraee:fix-metrics-proxy-port-resolution on Apr 15, 2026

Conversation

@muraee (Contributor) commented Apr 13, 2026

Summary

  • PodMonitor port resolution: resolveDeploymentPort used the PodMonitor's name to look up a Deployment by convention (pm.Name == deployment name). This broke for the control-plane-operator because its PodMonitor YAML has name: controlplane-operator while the Deployment is named control-plane-operator, causing the CPO metric to be silently omitted from the metrics-proxy-config configmap. Replaced with resolvePodPort which uses the PodMonitor's full label selector (metav1.LabelSelectorAsSelector) to find matching Pods directly, scanning all pods for rollout resilience.
  • ServiceMonitor named targetPort: When a Service's targetPort is a named port (string) rather than a number, resolveServicePort incorrectly fell back to the Service's port field. Now resolves the named targetPort from matching Pods using the Service's selector.
  • HCCO guest resource cleanup: When the EnableMetricsForwarding annotation is removed, the CPO deletes the metrics-proxy Route. The HCCO then gets NotFound on the route lookup and silently returns without cleaning up guest cluster resources (deployment, configmap, serving CA, pod monitor). Added an explicit check so guest resources are deleted when forwarding is not enabled.
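The pod-based lookup described above can be sketched as follows. This is a simplified illustration with local stand-in types rather than the real corev1/metav1 API types and cached client, and the function shape is an assumption for illustration, not the actual HyperShift code:

```go
package main

import "fmt"

// Stand-in types mirroring the shape of corev1.Pod; the real code uses the
// Kubernetes API types and a client List call.
type ContainerPort struct {
	Name string
	Port int32
}

type Container struct {
	Ports []ContainerPort
}

type Pod struct {
	Labels     map[string]string
	Containers []Container
}

// matches stands in for a labels.Selector built from the PodMonitor's
// label selector (matchLabels only, for brevity).
func matches(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

// resolvePodPort scans all matching pods, not just the first, so the lookup
// stays stable while old and new pod revisions coexist during a rollout.
func resolvePodPort(pods []Pod, selector map[string]string, portName string) (int32, error) {
	found := false
	for _, pod := range pods {
		if !matches(selector, pod.Labels) {
			continue
		}
		found = true
		for _, c := range pod.Containers {
			for _, p := range c.Ports {
				if p.Name == portName {
					return p.Port, nil
				}
			}
		}
	}
	if !found {
		return 0, fmt.Errorf("no pods found matching selector %v", selector)
	}
	return 0, fmt.Errorf("port %q not found on pods matching selector %v", portName, selector)
}

func main() {
	// The pod's labels, not the PodMonitor's name, drive the lookup, so the
	// controlplane-operator vs control-plane-operator name mismatch is harmless.
	pods := []Pod{{
		Labels:     map[string]string{"app": "control-plane-operator"},
		Containers: []Container{{Ports: []ContainerPort{{Name: "metrics", Port: 8080}}}},
	}}
	port, err := resolvePodPort(pods, map[string]string{"app": "control-plane-operator"}, "metrics")
	fmt.Println(port, err) // 8080 <nil>
}
```

Because matching is selector-driven, renaming the PodMonitor has no effect on resolution as long as the selector still matches the pods.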

Test plan

  • Existing scrape config unit tests pass
  • New TestAdaptScrapeConfigNamedTargetPort test verifies named targetPort resolution from pods
  • New TestAdaptScrapeConfigPodMonitorNameMismatch regression test for the CPO name mismatch
  • Verify on a running cluster that the control-plane-operator metric appears in the metrics-proxy-config configmap
  • Verify removing EnableMetricsForwarding annotation cleans up guest resources

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Metrics forwarding now requires an explicit cluster annotation; when absent, related monitoring resources are removed.
  • Bug Fixes

    • More reliable port detection for services with named target ports and for monitors that select pods by labels; clearer failure when no matching pods or ports exist.
  • Refactor

    • Port resolution now derives ports from matching pods and selectors rather than previous deployment-template lookup.
  • Tests

    • Added and updated tests for pod-based resolution, named target-port resolution, selector mismatches, deletion behavior, and mixed-monitor scenarios.

@openshift-merge-bot (Contributor)

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai bot (Contributor) commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Port resolution in the metrics proxy scrape-config was changed from Deployment-based lookup to Pod-based lookup. A new resolvePodPort lists Pods matching a selector and finds named container ports; resolveServicePort now accepts a podSelector and uses Pods to resolve named Service targetPorts. adaptScrapeConfig was updated to require/parse pod selectors for ServiceMonitors and PodMonitors and to use resolvePodPort. Tests replace Deployment fixtures with Pod fixtures, add named-targetPort and selector/name-mismatch cases. reconcileMetricsForwarder now requires the hyperv1.EnableMetricsForwarding annotation and deletes forwarder resources when it is absent.
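The named-targetPort path summarized above can be sketched like this. The types are simplified stand-ins (a struct mimicking intstr.IntOrString, a pod with a name-to-number port map), not the actual API or function signatures:

```go
package main

import "fmt"

// TargetPort mimics intstr.IntOrString: a Service targetPort is either a
// number or the name of a container port.
type TargetPort struct {
	IsInt  bool
	IntVal int32
	StrVal string
}

// Pod is a stand-in for corev1.Pod, with container ports flattened into a
// name -> number map.
type Pod struct {
	Labels map[string]string
	Ports  map[string]int32
}

// resolveServicePort: numeric targetPorts are used as-is; named targetPorts
// are resolved from pods matching the Service's selector. (The old behavior
// wrongly fell back to the Service's own port field for named targetPorts.)
func resolveServicePort(tp TargetPort, selector map[string]string, pods []Pod) (int32, error) {
	if tp.IsInt {
		return tp.IntVal, nil
	}
	for _, pod := range pods {
		match := true
		for k, v := range selector {
			if pod.Labels[k] != v {
				match = false
				break
			}
		}
		if !match {
			continue
		}
		if port, ok := pod.Ports[tp.StrVal]; ok {
			return port, nil
		}
	}
	return 0, fmt.Errorf("named targetPort %q not found on pods matching %v", tp.StrVal, selector)
}

func main() {
	// Hypothetical pod and labels for illustration only.
	pods := []Pod{{Labels: map[string]string{"app": "oauth"}, Ports: map[string]int32{"metrics": 6443}}}
	port, _ := resolveServicePort(TargetPort{StrVal: "metrics"}, map[string]string{"app": "oauth"}, pods)
	fmt.Println(port) // 6443
}
```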

Sequence Diagram(s)

sequenceDiagram
  participant Monitor as ServiceMonitor/PodMonitor
  participant Adapter as adaptScrapeConfig
  participant K8sSvc as Kubernetes Service API
  participant K8sPods as Kubernetes Pods API

  Monitor->>Adapter: request scrape config adaptation
  Adapter->>K8sSvc: find Service for monitor (ServiceMonitor)
  K8sSvc-->>Adapter: Service with port and targetPort
  alt targetPort is numeric
    Adapter-->>Monitor: use numeric targetPort for scrape config
  else targetPort is named
    Adapter->>K8sPods: list Pods by provided podSelector
    K8sPods-->>Adapter: Pod list (containers with ports)
    Adapter->>Adapter: resolvePodPort -> find container port matching name -> numeric port
    Adapter-->>Monitor: return resolved numeric port for scrape config
  end

  note over Adapter,K8sPods: For PodMonitor, Adapter parses Pod selector from monitor and directly queries K8sPods using resolvePodPort
🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 18.18%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Test Structure And Quality ⚠️ Warning: the test lacks a PodMonitor in existingObjects and does not verify its deletion, violating the completeness principle. Resolution: add MetricsForwarderPodMonitor() to existingObjects and assert its deletion in the cleanup verification section.
✅ Passed checks (8 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Stable And Deterministic Test Names ✅ Passed: the PR uses standard Go testing with table-driven tests, not the Ginkgo framework, so this Ginkgo-specific check is inapplicable.
  • Microshift Test Compatibility ✅ Passed: the PR contains only standard Go unit tests, not Ginkgo e2e tests, so the custom check is not applicable.
  • Single Node Openshift (SNO) Test Compatibility ✅ Passed: the pull request does not add any new Ginkgo e2e tests; it only modifies unit tests using standard Go testing.T with Gomega assertions.
  • Topology-Aware Scheduling Compatibility ✅ Passed: the PR does not introduce or modify scheduling constraints, affinity rules, replica counts, topology-specific resource definitions, or PodDisruptionBudgets; changes are limited to port resolution logic and annotation-based resource deletion control.
  • OTE Binary Stdout Contract ✅ Passed: the modified files are controller unit tests and implementation files that run within a standard Go test framework, not standalone OTE test binaries.
  • IPv6 And Disconnected Network Test Compatibility ✅ Passed: the PR does not add any Ginkgo e2e tests requiring IPv6 and disconnected-network compatibility checks.
  • Title Check ✅ Passed: the title accurately captures the main change: replacing Deployment-based port resolution with Pod-based port resolution in the metrics-proxy module.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci bot requested review from jparrill and sjenning on April 13, 2026 16:06
@openshift-ci bot (Contributor) commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: muraee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the area/control-plane-operator label (indicates the PR includes changes for the control plane operator in an OCP release) and the approved label (indicates a PR has been approved by an approver from all required OWNERS files), and removed the do-not-merge/needs-area label on Apr 13, 2026
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`:
- Around line 118-123: The helper newPod currently hardcodes the pod label to
{"app": name}, which prevents reproducing a PodMonitor vs pod label
name-mismatch; change newPod (and any callers) to accept an explicit label value
(e.g., newPod(name, namespace, portName, portNum, labelValue) or add a
labelParam) and use that for the "app" label instead of name, then add a
regression test case where the PodMonitor name (or selector) is
"controlplane-operator" and the pod label value is "control-plane-operator" (or
vice versa) so the suite actually exercises the mismatch; update any test code
that constructs pods to pass the intended label value and add the new
mismatched-case test in scrape_config_test.go.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 296-308: The function resolvePodPort currently only checks
podList.Items[0] and can miss a valid port on other matched pods; update
resolvePodPort to iterate over all podList.Items, for each pod iterate its
Spec.Containers and their Ports to find port.Name == portName, return the
matching ContainerPort as soon as found, and only return the error "no pods
found..." or "port not found..." after all pods have been scanned using the
selector and portName values for the final error message.
- Around line 136-137: The code currently passes only
pm.Spec.Selector.MatchLabels into resolvePodPort, losing matchExpressions and
causing wrong pod selection; change the call to build a full selector from the
PodMonitor's metav1.LabelSelector (e.g. use
metav1.LabelSelectorAsSelector(&pm.Spec.Selector) or labels.SelectorFromSet
equivalent) and pass that selector (or its string/label map form) into
resolvePodPort or refactor resolvePodPort to accept a labels.Selector, and
update resolvePodPort to query pods using that selector rather than just
MatchLabels; also modify the pod-port resolution logic in resolvePodPort (or the
caller) to iterate over podList.Items and inspect each pod for the named port
until a match is found instead of only using podList.Items[0], ensuring
deterministic selection across rollouts.
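The gap this comment points at can be seen with a toy matcher: passing only matchLabels drops any matchExpressions clauses, which the selector built by metav1.LabelSelectorAsSelector would preserve. The types below are simplified stand-ins supporting only the "In" operator, not the real metav1 types:

```go
package main

import "fmt"

// Requirement is a stand-in for metav1.LabelSelectorRequirement with the
// "In" operator only.
type Requirement struct {
	Key    string
	Values []string
}

// LabelSelector mirrors metav1.LabelSelector: both clauses must hold.
type LabelSelector struct {
	MatchLabels      map[string]string
	MatchExpressions []Requirement
}

// matchesFull honors matchLabels AND matchExpressions, like the selector
// produced by metav1.LabelSelectorAsSelector.
func matchesFull(sel LabelSelector, podLabels map[string]string) bool {
	for k, v := range sel.MatchLabels {
		if podLabels[k] != v {
			return false
		}
	}
	for _, req := range sel.MatchExpressions {
		ok := false
		for _, v := range req.Values {
			if podLabels[req.Key] == v {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

// matchesLabelsOnly is the buggy variant: an expression-only selector
// degrades to the empty selector, which matches every pod.
func matchesLabelsOnly(sel LabelSelector, podLabels map[string]string) bool {
	for k, v := range sel.MatchLabels {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	sel := LabelSelector{MatchExpressions: []Requirement{{Key: "app", Values: []string{"etcd"}}}}
	unrelated := map[string]string{"app": "oauth"}
	fmt.Println(matchesFull(sel, unrelated), matchesLabelsOnly(sel, unrelated)) // false true
}
```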

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: aa7fc084-ef77-4a94-8880-62c4f7660e6f

📥 Commits

Reviewing files that changed from the base of the PR and between 36ccecd and f33c3fe.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go

@codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 70.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.65%. Comparing base (36ccecd) to head (774946a).
⚠️ Report is 39 commits behind head on main.

Files with missing lines:
  • ...stedcontrolplane/v2/metrics_proxy/scrape_config.go: patch 66.66%; 6 missing, 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8221      +/-   ##
==========================================
+ Coverage   34.63%   34.65%   +0.01%     
==========================================
  Files         767      767              
  Lines       93186    93280      +94     
==========================================
+ Hits        32277    32322      +45     
- Misses      58236    58277      +41     
- Partials     2673     2681       +8     
Files with missing lines (coverage Δ):
  • ...rconfigoperator/controllers/resources/resources.go: 50.52% <100.00%> (-0.13%) ⬇️
  • ...stedcontrolplane/v2/metrics_proxy/scrape_config.go: 70.14% <66.66%> (-1.94%) ⬇️

... and 5 files with indirect coverage changes


@muraee changed the title from "fix(metrics-proxy): resolve ports from pods instead of deployments" to "CNTRLPLANE-2836: fix(metrics-proxy): resolve ports from pods instead of deployments" on Apr 13, 2026
@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Apr 13, 2026
@openshift-ci-robot commented Apr 13, 2026

@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@muraee force-pushed the fix-metrics-proxy-port-resolution branch from f33c3fe to 7fd34be on April 13, 2026 16:30
@openshift-ci-robot commented Apr 13, 2026

@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 265-268: The code currently calls resolvePodPort(cpContext,
namespace, labels.SelectorFromSet(podSelector), p.TargetPort.String()) whenever
p.TargetPort is named, even if podSelector is empty; change this so named
targetPort resolution only runs when podSelector is non-empty (e.g., check that
podSelector has entries / the labels.SelectorFromSet is not empty) and if
podSelector is empty do not call resolvePodPort and instead skip creating the
monitor for that service (avoid picking an unrelated pod). Ensure you update the
logic around p.TargetPort.String(), podSelector, and resolvePodPort to implement
this guard.
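The hazard behind the requested guard is that an empty label selector matches every pod. A minimal illustration, where the matcher is a stand-in for the semantics of labels.SelectorFromSet(...).Matches:

```go
package main

import "fmt"

// matchesSelector mirrors matchLabels semantics: every key/value pair must
// be present on the pod. With zero pairs the loop body never runs, so an
// empty selector matches any pod, which is why named-targetPort resolution
// must be skipped when the Service has no selector.
func matchesSelector(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	unrelated := map[string]string{"app": "something-else"}
	fmt.Println(matchesSelector(map[string]string{}, unrelated))              // empty selector: matches
	fmt.Println(matchesSelector(map[string]string{"app": "etcd"}, unrelated)) // non-empty: no match
}
```

Guarding on a non-empty selector before calling resolvePodPort prevents a named targetPort from being resolved against an unrelated pod.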

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: dc3cb725-f0fa-4be3-94f3-ce61f7278bfd

📥 Commits

Reviewing files that changed from the base of the PR and between f33c3fe and 7fd34be.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go

@muraee force-pushed the fix-metrics-proxy-port-resolution branch from 7fd34be to 3500eef on April 13, 2026 17:48
@openshift-ci-robot commented Apr 13, 2026

@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.


@openshift-ci-robot commented Apr 14, 2026

@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.


@muraee (Contributor, Author) commented Apr 14, 2026

/auto-cc

@openshift-ci-robot commented Apr 14, 2026

@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.


@openshift-ci bot requested review from cblecker and enxebre on April 14, 2026 16:39
@jparrill (Contributor) left a comment


Dropped some comments. Thanks!

Fix two bugs in the metrics-proxy scrape config generation:

1. PodMonitor port resolution used the PodMonitor name to look up a
Deployment by convention. This broke for the control-plane-operator
because its PodMonitor YAML has name: controlplane-operator while the
Deployment is named control-plane-operator. Replace with resolvePodPort
which uses the PodMonitor's full label selector to find matching Pods.

2. ServiceMonitor named targetPort resolution fell back to the Service's
port field when targetPort was a string. Now resolves named targetPorts
from matching Pods via the Service's selector.

Additionally:
- Use metav1.LabelSelectorAsSelector for full selector support including
  matchExpressions.
- Scan all matching pods (not just the first) to handle rollouts where
  different pod revisions coexist.
- Add regression test for PodMonitor name mismatch scenario.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@muraee force-pushed the fix-metrics-proxy-port-resolution branch from 4e38e9b to 9d27131 on April 15, 2026 12:16
@openshift-ci-robot commented Apr 15, 2026

@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go (1)

142-170: ⚠️ Potential issue | 🟠 Major

Don't emit a partial selector for PodMonitors.

Lines 143-148 resolve the port with the full LabelSelector, but Line 170 still serializes only pm.Spec.Selector.MatchLabels into the generated component. A monitor that uses matchExpressions can therefore resolve its port from one pod set and then scrape a different one; expression-only selectors degrade to an empty selector. Please either carry the full selector through ComponentFileConfig/endpoint resolution, or skip these monitors until that format can represent them safely.

Minimal safe guard if full-selector support is out of scope here
 		podSelector, err := metav1.LabelSelectorAsSelector(&pm.Spec.Selector)
 		if err != nil {
 			log.V(4).Info("skipping PodMonitor: invalid selector", "podMonitor", pm.Name, "error", err)
 			continue
 		}
+		if len(pm.Spec.Selector.MatchExpressions) > 0 {
+			log.V(4).Info("skipping PodMonitor: selector uses matchExpressions which are not representable in metrics-proxy config", "podMonitor", pm.Name)
+			continue
+		}
 		port, err := resolvePodPort(cpContext, namespace, podSelector, portName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`
around lines 142 - 170, The generated ComponentFileConfig currently uses only
pm.Spec.Selector.MatchLabels which drops matchExpressions and can cause
resolvePodPort to use a different selector than the emitted config; update the
code so the full LabelSelector is preserved: pass pm.Spec.Selector (not just
MatchLabels) into metricsproxybin.ComponentFileConfig.Selector (or add a new
field that accepts the complete selector) and ensure any downstream code that
serializes Selector handles MatchExpressions, or alternatively skip/continue on
PodMonitors whose pm.Spec.Selector contains MatchExpressions (detect via
pm.Spec.Selector.MatchExpressions != nil && len(...) > 0) so you never emit a
partial/empty selector; adjust usages around resolvePodPort,
ComponentFileConfig, and the endpoint generation to reference the same selector
variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 2977-2981: The test seeds only three resources but
reconcileMetricsForwarder also deletes the PodMonitor; update the test by
including manifests.MetricsForwarderPodMonitor() in the existingObjects seed and
add the corresponding deletion assertion (same pattern used for
MetricsForwarderDeployment/ConfigMap/ServingCA) to verify the PodMonitor was
removed; locate the test in resources_test.go and modify the setup and the
deletion checks that reference reconcileMetricsForwarder to include the
PodMonitor (manifests.MetricsForwarderPodMonitor()) so this deletion path is
covered.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: d6310199-40d4-4b54-9ae1-3ce4eb01168a

📥 Commits

Reviewing files that changed from the base of the PR and between 4e38e9b and 9d27131.

📒 Files selected for processing (4)
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go

@muraee muraee force-pushed the fix-metrics-proxy-port-resolution branch from 9d27131 to 4b9e15a Compare April 15, 2026 12:37
When the EnableMetricsForwarding annotation is removed, the CPO deletes
the metrics-proxy Route from the management cluster. The HCCO then gets
NotFound on the route lookup and silently returns nil without cleaning
up the guest cluster resources (deployment, configmap, serving CA, pod
monitor).

Add an explicit check for the EnableMetricsForwarding annotation so that
guest resources are deleted when forwarding is not enabled, matching the
existing DisableMonitoringServices cleanup path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
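The guard described in this commit message boils down to an explicit annotation check ahead of the Route lookup. A minimal sketch, assuming a hypothetical annotation key (the real constant and delete calls live elsewhere in the repo):

```go
package main

import "fmt"

// Hypothetical annotation key for illustration; the real constant is defined
// in the hypershift API package.
const enableMetricsForwardingAnnotation = "hypershift.openshift.io/enable-metrics-forwarding"

// shouldCleanupGuestResources mirrors the explicit check added here: when the
// annotation is absent (or not "true"), the HCCO deletes the guest-side
// deployment, configmap, serving CA, and pod monitor instead of returning
// early on the Route NotFound.
func shouldCleanupGuestResources(annotations map[string]string) bool {
	return annotations[enableMetricsForwardingAnnotation] != "true"
}

func main() {
	fmt.Println(shouldCleanupGuestResources(map[string]string{}))                                          // true: delete guest resources
	fmt.Println(shouldCleanupGuestResources(map[string]string{enableMetricsForwardingAnnotation: "true"})) // false: reconcile normally
}
```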
@muraee muraee force-pushed the fix-metrics-proxy-port-resolution branch from 4b9e15a to 774946a Compare April 15, 2026 13:35
@jparrill
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2044418504827867136 | Cost: $2.0656315000000003 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2044416087256207360 | Cost: $2.16505395 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@muraee
Contributor Author

muraee commented Apr 15, 2026

/verified by @jiezhao16

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 15, 2026
@openshift-ci-robot

@muraee: This PR has been marked as verified by @jiezhao16.


@muraee
Contributor Author

muraee commented Apr 15, 2026

/retest

@cwbotbot

cwbotbot commented Apr 15, 2026

Test Results

e2e-aws

e2e-aks

@muraee muraee changed the title CNTRLPLANE-2836: fix(metrics-proxy): resolve ports from pods instead of deployments OCPBUGS-83538: fix(metrics-proxy): resolve ports from pods instead of deployments Apr 15, 2026
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Apr 15, 2026
@openshift-ci-robot

@muraee: This pull request references Jira Issue OCPBUGS-83538, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@muraee
Contributor Author

muraee commented Apr 15, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 15, 2026
@openshift-ci-robot

@muraee: This pull request references Jira Issue OCPBUGS-83538, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request.


@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2044412777967128576 | Cost: $2.0702462500000003 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2044447616661458944 | Cost: $2.44414965 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5cdaa57 and 2 for PR HEAD 774946a in total

@hypershift-jira-solve-ci


Test Failure Analysis Complete

Job Information

Job 1: e2e-aws

Job 2: e2e-azure-self-managed

Test Failure Analysis

Error

e2e-aws: TestCreateCluster/Main/EnsureMetricsForwarderWorking hung indefinitely retrying
  "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)"
  until the 2h job timeout was hit, causing TestAutoscaling/Teardown to also fail from
  insufficient time for AWS resource cleanup.

e2e-azure-self-managed: Three independent failures:
  1. TestAzurePrivateTopology/EnsureNoCrashingPods: azure-file-csi-driver-controller containers
     restarting (csi-attacher, csi-provisioner, csi-resizer, csi-snapshotter each restartCount > 0)
  2. TestUpgradeControlPlane/ValidateHostedCluster: HostedCluster conditions invalid after 2979s —
     kube-apiserver unavailable, CVO conditions not found, 31+ components not available
  3. TestNodePool/HostedCluster2/ValidateHostedCluster: controlPlaneVersion state "Partial" after
     20m timeout — cluster never completed initial rollout

Summary

The e2e-aws failure is caused by the EnsureMetricsForwarderWorking test hanging. The test's data path — guest Prometheus → metrics-forwarder → metrics-proxy → endpoint-resolver → kube-apiserver — never fully establishes. The endpoint-resolver deployment initially returns "not found" (expected during first reconciliation), and both endpoint-resolver and metrics-proxy eventually come up, but Prometheus never discovers the kube-apiserver target via the metrics-forwarder scrape pool. This is the test that directly validates the PR's changes (port resolution from pods instead of deployments), and its indefinite retry loop consumed the entire 2h timeout, cascading into a teardown failure.

The e2e-azure-self-managed failures are unrelated to the PR's metrics-proxy changes — EnsureMetricsForwarderWorking actually passed on Azure (167s). The Azure failures are Azure-specific infrastructure issues: CSI driver container restarts, a hosted cluster that never completed its control plane rollout (31+ components unavailable), and a node pool cluster stuck in "Partial" state.

Root Cause

e2e-aws (PR-related — metrics-proxy regression):

The PR changes resolveDeploymentPort() → resolvePodPort() in scrape_config.go, switching from looking up port numbers via Deployment specs to looking them up via Pod specs matching the PodMonitor's label selector. The EnsureMetricsForwarderWorking test validates the full metrics forwarding data path:

guest Prometheus → PodMonitor → metrics-forwarder (HAProxy) → Route → metrics-proxy → endpoint-resolver → kube-apiserver

In the AWS job, the test shows that:

  1. The endpoint-resolver deployment was initially not found (line 1688), which is normal during first reconciliation
  2. The metrics-proxy deployment came up successfully
  3. The metrics-forwarder deployment in the guest cluster came up
  4. Prometheus pod prometheus-k8s-0 was running

However, the kube-apiserver target was never found in Prometheus active targets despite 11+ retries over an extended period. This means the metrics-proxy's scrape configuration — the component modified by this PR — was not correctly generating the kube-apiserver target configuration. The resolvePodPort() function may be failing to match pods or resolve the correct port for kube-apiserver, causing the scrape config to omit this target entirely, which in turn means the metrics-forwarder has no upstream to proxy to.

The test has no timeout of its own — it retries indefinitely until the 2h job-level timeout kills the entire test process. This caused TestAutoscaling/Teardown to also fail because there was insufficient time remaining to clean up 9 AWS resources (volumes and load balancers).

e2e-azure-self-managed (NOT PR-related — infrastructure issues):

Critically, EnsureMetricsForwarderWorking passed on Azure in 167.46s, proving the PR's code changes work correctly on Azure. The three Azure failures are independent infrastructure issues:

  1. TestAzurePrivateTopology: Azure File CSI driver controller had container restarts (restartCount > 0 for csi-attacher, csi-provisioner, csi-resizer, csi-snapshotter). This is a known Azure infrastructure flake unrelated to metrics-proxy.

  2. TestUpgradeControlPlane: The hosted cluster control-plane-upgrade-r5jrp never reached healthy conditions — kube-apiserver had 1 unavailable replica, CVO conditions were not found, and 31+ control plane components were unavailable. This is a cluster rollout issue, not a metrics issue.

  3. TestNodePool/HostedCluster2: The hosted cluster node-pool-j4sv4 was stuck with controlPlaneVersion in "Partial" state after 20m — the cluster never completed its initial rollout.

Recommendations
  1. Investigate the AWS metrics-proxy scrape config generation: The resolvePodPort() function in scrape_config.go may be failing to match kube-apiserver pods or resolve the correct port. Add debug logging to resolvePodPort() to trace: (a) what label selector is being used, (b) which pods are found, (c) what ports are extracted. Check if the kube-apiserver PodMonitor selector actually matches running pods in the HCP namespace.

  2. Add a timeout to EnsureMetricsForwarderWorking: The test currently retries indefinitely with no deadline. Add a reasonable timeout (e.g., 10-15 minutes) so the test fails fast instead of consuming the entire 2h job budget and cascading into teardown failures.

  3. Retest for Azure: The Azure failures are unrelated to the PR changes (metrics-forwarder actually passed on Azure). A /retest for e2e-azure-self-managed is warranted — these are Azure infrastructure flakes.

  4. Verify kube-apiserver PodMonitor selector: With the switch from Deployment-based to Pod-based port resolution, verify that the kube-apiserver PodMonitor's spec.selector correctly matches the kube-apiserver pods. If the selector uses matchLabels that don't match the actual pod labels, resolvePodPort() will find no pods and skip the target.

Evidence

  • e2e-aws, failed test: TestCreateCluster/Main/EnsureMetricsForwarderWorking hung indefinitely, never reached PASS/FAIL
  • e2e-aws, retry loop: "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)", 11+ occurrences
  • e2e-aws, endpoint-resolver: initially "not found" (line 1688), then came up, expected during first reconciliation
  • e2e-aws, metrics-proxy: Deployment came up successfully (line 1691)
  • e2e-aws, job timeout: process killed after 2h0m0s at 2026-04-15T17:23:44Z
  • e2e-aws, cascade failure: TestAutoscaling/Teardown failed, could not clean up 9 AWS resources (context deadline exceeded)
  • e2e-azure, metrics test PASSED: EnsureMetricsForwarderWorking passed in 167.46s, so the PR code works on Azure
  • e2e-azure, CSI flake: azure-file-csi-driver-controller containers had restartCount > 0 (csi-attacher, csi-provisioner, csi-resizer, csi-snapshotter)
  • e2e-azure, upgrade failure: TestUpgradeControlPlane: kube-apiserver 1 unavailable replica, 31+ components unavailable, CVO conditions not found
  • e2e-azure, NodePool failure: TestNodePool/HostedCluster2: controlPlaneVersion stuck in "Partial" state after 20m timeout
  • PR code change: resolveDeploymentPort() → resolvePodPort() in scrape_config.go, switches port resolution from Deployment specs to Pod specs via PodMonitor label selectors

@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2026

@muraee: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot bot merged commit 6618c80 into openshift:main Apr 15, 2026
29 checks passed
@openshift-ci-robot

@muraee: Jira Issue Verification Checks: Jira Issue OCPBUGS-83538
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-83538 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria


4 participants