OCPBUGS-83538: fix(metrics-proxy): resolve ports from pods instead of deployments (#8221)
Conversation

Pipeline controller notification: for optional jobs, comment … This repository is configured in: LGTM mode.

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough

Port resolution in the metrics proxy scrape-config was changed from Deployment-based lookup to Pod-based lookup. A new resolvePodPort lists Pods matching a selector and finds named container ports; resolveServicePort now accepts a podSelector and uses Pods to resolve named Service targetPorts. adaptScrapeConfig was updated to require/parse pod selectors for ServiceMonitors and PodMonitors and to use resolvePodPort. Tests replace Deployment fixtures with Pod fixtures and add named-targetPort and selector/name-mismatch cases. reconcileMetricsForwarder now requires the hyperv1.EnableMetricsForwarding annotation and deletes forwarder resources when it is absent.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Monitor as ServiceMonitor/PodMonitor
    participant Adapter as adaptScrapeConfig
    participant K8sSvc as Kubernetes Service API
    participant K8sPods as Kubernetes Pods API
    Monitor->>Adapter: request scrape config adaptation
    Adapter->>K8sSvc: find Service for monitor (ServiceMonitor)
    K8sSvc-->>Adapter: Service with port and targetPort
    alt targetPort is numeric
        Adapter-->>Monitor: use numeric targetPort for scrape config
    else targetPort is named
        Adapter->>K8sPods: list Pods by provided podSelector
        K8sPods-->>Adapter: Pod list (containers with ports)
        Adapter->>Adapter: resolvePodPort -> find container port matching name -> numeric port
        Adapter-->>Monitor: return resolved numeric port for scrape config
    end
    note over Adapter,K8sPods: For PodMonitor, Adapter parses the Pod selector from the monitor and directly queries K8sPods using resolvePodPort
```
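The pod-based lookup in the diagram can be sketched as follows. This is a simplified, stdlib-only illustration: the struct types stand in for corev1.Pod and friends, and resolveNamedPort is a hypothetical analogue of the PR's resolvePodPort (which takes a controller context and a labels.Selector instead).

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes types involved; the real code
// works with corev1.Pod / corev1.Container / corev1.ContainerPort.
type ContainerPort struct {
	Name          string
	ContainerPort int32
}

type Container struct {
	Ports []ContainerPort
}

type Pod struct {
	Labels     map[string]string
	Containers []Container
}

// resolveNamedPort scans every matched pod (not just the first) for a
// container port with the given name, mirroring the PR's resolvePodPort.
func resolveNamedPort(pods []Pod, selector map[string]string, portName string) (int32, error) {
	matched := false
	for _, pod := range pods {
		if !matches(pod.Labels, selector) {
			continue
		}
		matched = true
		for _, c := range pod.Containers {
			for _, p := range c.Ports {
				if p.Name == portName {
					return p.ContainerPort, nil
				}
			}
		}
	}
	if !matched {
		return 0, fmt.Errorf("no pods found matching selector %v", selector)
	}
	return 0, fmt.Errorf("port %q not found on pods matching %v", portName, selector)
}

// matches is a minimal MatchLabels-style check: every selector key must be
// present on the pod with the same value.
func matches(podLabels, selector map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	pods := []Pod{
		{Labels: map[string]string{"app": "kube-apiserver"},
			Containers: []Container{{Ports: []ContainerPort{{Name: "client", ContainerPort: 6443}}}}},
	}
	port, err := resolveNamedPort(pods, map[string]string{"app": "kube-apiserver"}, "client")
	fmt.Println(port, err) // 6443 <nil>
}
```

Because resolution goes through the monitor's label selector rather than a name-to-Deployment convention, a PodMonitor named controlplane-operator can still find pods of the control-plane-operator Deployment.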
🚥 Pre-merge checks: ✅ 8 passed · ❌ 2 failed (2 warnings)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: muraee. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`:
- Around line 118-123: The helper newPod currently hardcodes the pod label to
{"app": name}, which prevents reproducing a PodMonitor vs pod label
name-mismatch; change newPod (and any callers) to accept an explicit label value
(e.g., newPod(name, namespace, portName, portNum, labelValue) or add a
labelParam) and use that for the "app" label instead of name, then add a
regression test case where the PodMonitor name (or selector) is
"controlplane-operator" and the pod label value is "control-plane-operator" (or
vice versa) so the suite actually exercises the mismatch; update any test code
that constructs pods to pass the intended label value and add the new
mismatched-case test in scrape_config_test.go.
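A hypothetical shape for the suggested newPod change, with the label value as an explicit parameter. Types and names here are simplified stand-ins, not the real test fixtures in scrape_config_test.go.

```go
package main

import "fmt"

// pod is a minimal stand-in for the corev1.Pod fixture the test builds.
type pod struct {
	name, namespace string
	labels          map[string]string
	portName        string
	portNum         int32
}

// newPod takes labelValue explicitly instead of hardcoding labels to
// {"app": name}, so a PodMonitor-vs-pod label mismatch can be reproduced.
func newPod(name, namespace, portName string, portNum int32, labelValue string) pod {
	return pod{
		name:      name,
		namespace: namespace,
		labels:    map[string]string{"app": labelValue},
		portName:  portName,
		portNum:   portNum,
	}
}

func main() {
	// Regression fixture: the monitor would select "controlplane-operator"
	// while the pod is labeled "control-plane-operator" -> no match.
	p := newPod("control-plane-operator", "test-ns", "metrics", 8443, "control-plane-operator")
	fmt.Println(p.labels["app"])
}
```

With the label decoupled from the pod name, the suite can assert both the matching and the mismatching case against the same helper.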
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 296-308: The function resolvePodPort currently only checks
podList.Items[0] and can miss a valid port on other matched pods; update
resolvePodPort to iterate over all podList.Items, for each pod iterate its
Spec.Containers and their Ports to find port.Name == portName, return the
matching ContainerPort as soon as found, and only return the error "no pods
found..." or "port not found..." after all pods have been scanned using the
selector and portName values for the final error message.
- Around line 136-137: The code currently passes only
pm.Spec.Selector.MatchLabels into resolvePodPort, losing matchExpressions and
causing wrong pod selection; change the call to build a full selector from the
PodMonitor's metav1.LabelSelector (e.g. use
metav1.LabelSelectorAsSelector(&pm.Spec.Selector) or labels.SelectorFromSet
equivalent) and pass that selector (or its string/label map form) into
resolvePodPort or refactor resolvePodPort to accept a labels.Selector, and
update resolvePodPort to query pods using that selector rather than just
MatchLabels; also modify the pod-port resolution logic in resolvePodPort (or the
caller) to iterate over podList.Items and inspect each pod for the named port
until a match is found instead of only using podList.Items[0], ensuring
deterministic selection across rollouts.
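The distinction this comment draws (MatchLabels alone vs. the full selector) can be illustrated with a simplified model. The real fix uses metav1.LabelSelectorAsSelector, which compiles both MatchLabels and MatchExpressions into a single labels.Selector; the types below are stdlib-only stand-ins supporting only "In"-style expressions.

```go
package main

import "fmt"

// Requirement models a metav1.LabelSelectorRequirement with operator "In".
type Requirement struct {
	Key    string
	Values []string
}

// LabelSelector models metav1.LabelSelector: both halves must match.
type LabelSelector struct {
	MatchLabels      map[string]string
	MatchExpressions []Requirement
}

// Matches evaluates MatchLabels AND MatchExpressions, unlike a
// MatchLabels-only lookup which silently drops the expressions.
func (s LabelSelector) Matches(podLabels map[string]string) bool {
	for k, v := range s.MatchLabels {
		if podLabels[k] != v {
			return false
		}
	}
	for _, req := range s.MatchExpressions {
		ok := false
		for _, v := range req.Values {
			if podLabels[req.Key] == v {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// Expression-only selector: MatchLabels is empty, so a
	// MatchLabels-only lookup would degrade to "select everything".
	sel := LabelSelector{
		MatchExpressions: []Requirement{{Key: "app", Values: []string{"etcd"}}},
	}
	fmt.Println(sel.Matches(map[string]string{"app": "etcd"}))
	fmt.Println(sel.Matches(map[string]string{"app": "other"}))
}
```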
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: aa7fc084-ef77-4a94-8880-62c4f7660e6f
📒 Files selected for processing (2)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #8221      +/-   ##
==========================================
+ Coverage   34.63%   34.65%   +0.01%
==========================================
  Files         767      767
  Lines       93186    93280      +94
==========================================
+ Hits        32277    32322      +45
- Misses      58236    58277      +41
- Partials     2673     2681       +8
```
@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
f33c3fe to 7fd34be (compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 265-268: The code currently calls resolvePodPort(cpContext,
namespace, labels.SelectorFromSet(podSelector), p.TargetPort.String()) whenever
p.TargetPort is named, even if podSelector is empty; change this so named
targetPort resolution only runs when podSelector is non-empty (e.g., check that
podSelector has entries / the labels.SelectorFromSet is not empty) and if
podSelector is empty do not call resolvePodPort and instead skip creating the
monitor for that service (avoid picking an unrelated pod). Ensure you update the
logic around p.TargetPort.String(), podSelector, and resolvePodPort to implement
this guard.
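A minimal sketch of the suggested guard, assuming simplified types (the real code works with intstr.IntOrString target ports and a corev1.Service selector): named targetPort resolution only runs when the Service actually has a selector, because an empty selector would match every pod in the namespace and could resolve the port from an unrelated one.

```go
package main

import "fmt"

// shouldResolveFromPods reports whether a pod lookup is needed and legal.
// targetPortName is empty when the Service targetPort is numeric.
func shouldResolveFromPods(targetPortName string, podSelector map[string]string) (bool, error) {
	if targetPortName == "" {
		// Numeric targetPort: use it directly, no pod lookup needed.
		return false, nil
	}
	if len(podSelector) == 0 {
		// Named targetPort but no selector: skip this monitor rather
		// than risk picking an unrelated pod.
		return false, fmt.Errorf("named targetPort %q but Service has no selector; skipping monitor", targetPortName)
	}
	return true, nil
}

func main() {
	ok, err := shouldResolveFromPods("metrics", nil)
	fmt.Println(ok, err)
	ok, err = shouldResolveFromPods("metrics", map[string]string{"app": "oauth"})
	fmt.Println(ok, err)
}
```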
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: dc3cb725-f0fa-4be3-94f3-ce61f7278bfd
📒 Files selected for processing (2)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
7fd34be to 3500eef (compare)
/auto-cc
jparrill left a comment:
Dropped some comments. Thanks!
Fix two bugs in the metrics-proxy scrape config generation:

1. PodMonitor port resolution used the PodMonitor name to look up a Deployment by convention. This broke for the control-plane-operator because its PodMonitor YAML has name: controlplane-operator while the Deployment is named control-plane-operator. Replace with resolvePodPort, which uses the PodMonitor's full label selector to find matching Pods.
2. ServiceMonitor named targetPort resolution fell back to the Service's port field when targetPort was a string. Now resolves named targetPorts from matching Pods via the Service's selector.

Additionally:
- Use metav1.LabelSelectorAsSelector for full selector support including matchExpressions.
- Scan all matching pods (not just the first) to handle rollouts where different pod revisions coexist.
- Add a regression test for the PodMonitor name-mismatch scenario.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4e38e9b to 9d27131 (compare)
@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go (1)
142-170: ⚠️ Potential issue | 🟠 Major: Don't emit a partial selector for PodMonitors.

Lines 143-148 resolve the port with the full LabelSelector, but line 170 still serializes only pm.Spec.Selector.MatchLabels into the generated component. A monitor that uses matchExpressions can therefore resolve its port from one pod set and then scrape a different one; expression-only selectors degrade to an empty selector. Please either carry the full selector through ComponentFileConfig/endpoint resolution, or skip these monitors until that format can represent them safely.

Minimal safe guard if full-selector support is out of scope here:

```diff
 podSelector, err := metav1.LabelSelectorAsSelector(&pm.Spec.Selector)
 if err != nil {
 	log.V(4).Info("skipping PodMonitor: invalid selector", "podMonitor", pm.Name, "error", err)
 	continue
 }
+if len(pm.Spec.Selector.MatchExpressions) > 0 {
+	log.V(4).Info("skipping PodMonitor: selector uses matchExpressions which are not representable in metrics-proxy config", "podMonitor", pm.Name)
+	continue
+}
 port, err := resolvePodPort(cpContext, namespace, podSelector, portName)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go` around lines 142 - 170, The generated ComponentFileConfig currently uses only pm.Spec.Selector.MatchLabels which drops matchExpressions and can cause resolvePodPort to use a different selector than the emitted config; update the code so the full LabelSelector is preserved: pass pm.Spec.Selector (not just MatchLabels) into metricsproxybin.ComponentFileConfig.Selector (or add a new field that accepts the complete selector) and ensure any downstream code that serializes Selector handles MatchExpressions, or alternatively skip/continue on PodMonitors whose pm.Spec.Selector contains MatchExpressions (detect via pm.Spec.Selector.MatchExpressions != nil && len(...) > 0) so you never emit a partial/empty selector; adjust usages around resolvePodPort, ComponentFileConfig, and the endpoint generation to reference the same selector variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 2977-2981: The test seeds only three resources but
reconcileMetricsForwarder also deletes the PodMonitor; update the test by
including manifests.MetricsForwarderPodMonitor() in the existingObjects seed and
add the corresponding deletion assertion (same pattern used for
MetricsForwarderDeployment/ConfigMap/ServingCA) to verify the PodMonitor was
removed; locate the test in resources_test.go and modify the setup and the
deletion checks that reference reconcileMetricsForwarder to include the
PodMonitor (manifests.MetricsForwarderPodMonitor()) so this deletion path is
covered.
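The suggested test extension could look roughly like this; the fake client and resource names are illustrative stand-ins, not the real manifests helpers or controller-runtime fake client used in resources_test.go.

```go
package main

import "fmt"

// fakeClient tracks which guest-cluster objects currently exist.
type fakeClient struct{ objects map[string]bool }

func (c *fakeClient) delete(name string) { delete(c.objects, name) }

func (c *fakeClient) exists(name string) bool { return c.objects[name] }

// reconcileDisabled stands in for reconcileMetricsForwarder when metrics
// forwarding is not enabled: every forwarder resource, including the
// PodMonitor, must be deleted.
func reconcileDisabled(c *fakeClient) {
	for _, name := range []string{
		"forwarder-deployment",
		"forwarder-configmap",
		"forwarder-serving-ca",
		"forwarder-podmonitor", // the resource the review says to seed and assert on
	} {
		c.delete(name)
	}
}

func main() {
	c := &fakeClient{objects: map[string]bool{
		"forwarder-deployment": true,
		"forwarder-configmap":  true,
		"forwarder-serving-ca": true,
		"forwarder-podmonitor": true,
	}}
	reconcileDisabled(c)
	fmt.Println(c.exists("forwarder-podmonitor"))
}
```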
---
Outside diff comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 142-170: The generated ComponentFileConfig currently uses only
pm.Spec.Selector.MatchLabels which drops matchExpressions and can cause
resolvePodPort to use a different selector than the emitted config; update the
code so the full LabelSelector is preserved: pass pm.Spec.Selector (not just
MatchLabels) into metricsproxybin.ComponentFileConfig.Selector (or add a new
field that accepts the complete selector) and ensure any downstream code that
serializes Selector handles MatchExpressions, or alternatively skip/continue on
PodMonitors whose pm.Spec.Selector contains MatchExpressions (detect via
pm.Spec.Selector.MatchExpressions != nil && len(...) > 0) so you never emit a
partial/empty selector; adjust usages around resolvePodPort,
ComponentFileConfig, and the endpoint generation to reference the same selector
variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: d6310199-40d4-4b54-9ae1-3ce4eb01168a
📒 Files selected for processing (4)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
9d27131 to 4b9e15a (compare)
When the EnableMetricsForwarding annotation is removed, the CPO deletes the metrics-proxy Route from the management cluster. The HCCO then gets NotFound on the route lookup and silently returns nil without cleaning up the guest cluster resources (deployment, configmap, serving CA, pod monitor).

Add an explicit check for the EnableMetricsForwarding annotation so that guest resources are deleted when forwarding is not enabled, matching the existing DisableMonitoringServices cleanup path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
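A sketch of the gating logic this commit describes. The annotation key string and the returned action names are illustrative; the real code uses the hyperv1.EnableMetricsForwarding constant and deletes the actual guest-cluster objects.

```go
package main

import "fmt"

// Illustrative annotation key; the real constant lives in the hyperv1 package.
const enableMetricsForwarding = "hypershift.openshift.io/enable-metrics-forwarding"

// metricsForwardingEnabled checks the HostedCluster annotations explicitly,
// instead of inferring state from whether the metrics-proxy Route exists.
func metricsForwardingEnabled(annotations map[string]string) bool {
	return annotations[enableMetricsForwarding] == "true"
}

// reconcileMetricsForwarder returns the actions to take. When forwarding is
// disabled, all guest resources are removed (mirroring the existing
// DisableMonitoringServices cleanup path) rather than silently returning.
func reconcileMetricsForwarder(annotations map[string]string) []string {
	if !metricsForwardingEnabled(annotations) {
		return []string{
			"delete deployment",
			"delete configmap",
			"delete serving-ca",
			"delete podmonitor",
		}
	}
	return []string{"ensure forwarder resources"}
}

func main() {
	fmt.Println(reconcileMetricsForwarder(nil))
	fmt.Println(reconcileMetricsForwarder(map[string]string{enableMetricsForwarding: "true"}))
}
```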
4b9e15a to 774946a (compare)
/lgtm
Scheduling tests matching the …
AI Test Failure Analysis. Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6.
/verified by @jiezhao16
@muraee: This PR has been marked as verified by …
/retest
Test Results: e2e-aws, e2e-aks
@muraee: This pull request references Jira Issue OCPBUGS-83538, which is invalid: …
Comment: The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@muraee: This pull request references Jira Issue OCPBUGS-83538, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request.
AI Test Failure Analysis. Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6.
Now I have complete evidence for both jobs. Let me compile the final report:

Test Failure Analysis Complete

Job Information
Job 1: e2e-aws
Job 2: e2e-azure-self-managed

Test Failure Analysis

Error Summary: The e2e-aws failure is caused by the …

Root Cause
e2e-aws (PR-related — metrics-proxy regression): The PR changes … In the AWS job, the test shows that:
However, the kube-apiserver target was never found in Prometheus active targets despite 11+ retries over an extended period. This means the metrics-proxy's scrape configuration — the component modified by this PR — was not correctly generating the kube-apiserver target configuration. The …
The test has no timeout of its own — it retries indefinitely until the 2h job-level timeout kills the entire test process. This caused …

e2e-azure-self-managed (NOT PR-related — infrastructure issues): Critically, …

Recommendations
…

Evidence
…
@muraee: all tests passed! Full PR test history. Your PR dashboard.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@muraee: Jira Issue Verification Checks: Jira Issue OCPBUGS-83538.
Jira Issue OCPBUGS-83538 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Summary

- resolveDeploymentPort used the PodMonitor's name to look up a Deployment by convention (pm.Name == deployment name). This broke for the control-plane-operator because its PodMonitor YAML has name: controlplane-operator while the Deployment is named control-plane-operator, causing the CPO metric to be silently omitted from the metrics-proxy-config configmap. Replaced with resolvePodPort, which uses the PodMonitor's full label selector (metav1.LabelSelectorAsSelector) to find matching Pods directly, scanning all pods for rollout resilience.
- When targetPort is a named port (string) rather than a number, resolveServicePort incorrectly fell back to the Service's port field. Now resolves the named targetPort from matching Pods using the Service's selector.
- When the EnableMetricsForwarding annotation is removed, the CPO deletes the metrics-proxy Route. The HCCO then gets NotFound on the route lookup and silently returns without cleaning up guest cluster resources (deployment, configmap, serving CA, pod monitor). Added an explicit check so guest resources are deleted when forwarding is not enabled.

Test plan
- TestAdaptScrapeConfigNamedTargetPort test verifies named targetPort resolution from pods
- TestAdaptScrapeConfigPodMonitorNameMismatch regression test for the CPO name mismatch
- Test that removing the EnableMetricsForwarding annotation cleans up guest resources

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests