OCPBUGS-83538: fix(metrics-proxy): resolve ports from pods instead of deployments (#8221)
Conversation

Pipeline controller notification: for optional jobs, comment … This repository is configured in: LGTM mode.

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough

Port resolution in the metrics proxy scrape-config was changed from Deployment-based lookup to Pod-based lookup. A new resolvePodPort lists Pods matching a selector and finds named container ports; resolveServicePort now accepts a podSelector and uses Pods to resolve named Service targetPorts. adaptScrapeConfig was updated to require/parse pod selectors for ServiceMonitors and PodMonitors and to use resolvePodPort. Tests replace Deployment fixtures with Pod fixtures and add named-targetPort and selector/name-mismatch cases. reconcileMetricsForwarder now requires the hyperv1.EnableMetricsForwarding annotation and deletes forwarder resources when it is absent.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Monitor as ServiceMonitor/PodMonitor
    participant Adapter as adaptScrapeConfig
    participant K8sSvc as Kubernetes Service API
    participant K8sPods as Kubernetes Pods API
    Monitor->>Adapter: request scrape config adaptation
    Adapter->>K8sSvc: find Service for monitor (ServiceMonitor)
    K8sSvc-->>Adapter: Service with port and targetPort
    alt targetPort is numeric
        Adapter-->>Monitor: use numeric targetPort for scrape config
    else targetPort is named
        Adapter->>K8sPods: list Pods by provided podSelector
        K8sPods-->>Adapter: Pod list (containers with ports)
        Adapter->>Adapter: resolvePodPort -> find container port matching name -> numeric port
        Adapter-->>Monitor: return resolved numeric port for scrape config
    end
    note over Adapter,K8sPods: For PodMonitor, Adapter parses the Pod selector from the monitor and directly queries K8sPods using resolvePodPort
```
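The pod-based lookup in the diagram can be sketched as follows. This is a simplified, stdlib-only illustration: the struct types stand in for corev1.Pod and friends, and resolveNamedPort is a hypothetical analogue of the PR's resolvePodPort (which takes a controller context and a labels.Selector instead).

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes types involved; the real code
// works with corev1.Pod / corev1.Container / corev1.ContainerPort.
type ContainerPort struct {
	Name          string
	ContainerPort int32
}

type Container struct {
	Ports []ContainerPort
}

type Pod struct {
	Labels     map[string]string
	Containers []Container
}

// resolveNamedPort scans every matched pod (not just the first) for a
// container port with the given name, mirroring the PR's resolvePodPort.
func resolveNamedPort(pods []Pod, selector map[string]string, portName string) (int32, error) {
	matched := false
	for _, pod := range pods {
		if !matches(pod.Labels, selector) {
			continue
		}
		matched = true
		for _, c := range pod.Containers {
			for _, p := range c.Ports {
				if p.Name == portName {
					return p.ContainerPort, nil
				}
			}
		}
	}
	if !matched {
		return 0, fmt.Errorf("no pods found matching selector %v", selector)
	}
	return 0, fmt.Errorf("port %q not found on pods matching %v", portName, selector)
}

// matches is a minimal MatchLabels-style check: every selector key must be
// present on the pod with the same value.
func matches(podLabels, selector map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	pods := []Pod{
		{Labels: map[string]string{"app": "kube-apiserver"},
			Containers: []Container{{Ports: []ContainerPort{{Name: "client", ContainerPort: 6443}}}}},
	}
	port, err := resolveNamedPort(pods, map[string]string{"app": "kube-apiserver"}, "client")
	fmt.Println(port, err) // 6443 <nil>
}
```

Because resolution goes through the monitor's label selector rather than a name-to-Deployment convention, a PodMonitor named controlplane-operator can still find pods of the control-plane-operator Deployment.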
🚥 Pre-merge checks: ✅ 8 passed · ❌ 2 failed (2 warnings)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: muraee. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`:
- Around line 118-123: The helper newPod currently hardcodes the pod label to
{"app": name}, which prevents reproducing a PodMonitor vs pod label
name-mismatch; change newPod (and any callers) to accept an explicit label value
(e.g., newPod(name, namespace, portName, portNum, labelValue) or add a
labelParam) and use that for the "app" label instead of name, then add a
regression test case where the PodMonitor name (or selector) is
"controlplane-operator" and the pod label value is "control-plane-operator" (or
vice versa) so the suite actually exercises the mismatch; update any test code
that constructs pods to pass the intended label value and add the new
mismatched-case test in scrape_config_test.go.
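A hypothetical shape for the suggested newPod change, with the label value as an explicit parameter. Types and names here are simplified stand-ins, not the real test fixtures in scrape_config_test.go.

```go
package main

import "fmt"

// pod is a minimal stand-in for the corev1.Pod fixture the test builds.
type pod struct {
	name, namespace string
	labels          map[string]string
	portName        string
	portNum         int32
}

// newPod takes labelValue explicitly instead of hardcoding labels to
// {"app": name}, so a PodMonitor-vs-pod label mismatch can be reproduced.
func newPod(name, namespace, portName string, portNum int32, labelValue string) pod {
	return pod{
		name:      name,
		namespace: namespace,
		labels:    map[string]string{"app": labelValue},
		portName:  portName,
		portNum:   portNum,
	}
}

func main() {
	// Regression fixture: the monitor would select "controlplane-operator"
	// while the pod is labeled "control-plane-operator" -> no match.
	p := newPod("control-plane-operator", "test-ns", "metrics", 8443, "control-plane-operator")
	fmt.Println(p.labels["app"])
}
```

With the label decoupled from the pod name, the suite can assert both the matching and the mismatching case against the same helper.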
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 296-308: The function resolvePodPort currently only checks
podList.Items[0] and can miss a valid port on other matched pods; update
resolvePodPort to iterate over all podList.Items, for each pod iterate its
Spec.Containers and their Ports to find port.Name == portName, return the
matching ContainerPort as soon as found, and only return the error "no pods
found..." or "port not found..." after all pods have been scanned using the
selector and portName values for the final error message.
- Around line 136-137: The code currently passes only
pm.Spec.Selector.MatchLabels into resolvePodPort, losing matchExpressions and
causing wrong pod selection; change the call to build a full selector from the
PodMonitor's metav1.LabelSelector (e.g. use
metav1.LabelSelectorAsSelector(&pm.Spec.Selector) or labels.SelectorFromSet
equivalent) and pass that selector (or its string/label map form) into
resolvePodPort or refactor resolvePodPort to accept a labels.Selector, and
update resolvePodPort to query pods using that selector rather than just
MatchLabels; also modify the pod-port resolution logic in resolvePodPort (or the
caller) to iterate over podList.Items and inspect each pod for the named port
until a match is found instead of only using podList.Items[0], ensuring
deterministic selection across rollouts.
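The distinction this comment draws (MatchLabels alone vs. the full selector) can be illustrated with a simplified model. The real fix uses metav1.LabelSelectorAsSelector, which compiles both MatchLabels and MatchExpressions into a single labels.Selector; the types below are stdlib-only stand-ins supporting only "In"-style expressions.

```go
package main

import "fmt"

// Requirement models a metav1.LabelSelectorRequirement with operator "In".
type Requirement struct {
	Key    string
	Values []string
}

// LabelSelector models metav1.LabelSelector: both halves must match.
type LabelSelector struct {
	MatchLabels      map[string]string
	MatchExpressions []Requirement
}

// Matches evaluates MatchLabels AND MatchExpressions, unlike a
// MatchLabels-only lookup which silently drops the expressions.
func (s LabelSelector) Matches(podLabels map[string]string) bool {
	for k, v := range s.MatchLabels {
		if podLabels[k] != v {
			return false
		}
	}
	for _, req := range s.MatchExpressions {
		ok := false
		for _, v := range req.Values {
			if podLabels[req.Key] == v {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// Expression-only selector: MatchLabels is empty, so a
	// MatchLabels-only lookup would degrade to "select everything".
	sel := LabelSelector{
		MatchExpressions: []Requirement{{Key: "app", Values: []string{"etcd"}}},
	}
	fmt.Println(sel.Matches(map[string]string{"app": "etcd"}))
	fmt.Println(sel.Matches(map[string]string{"app": "other"}))
}
```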
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: aa7fc084-ef77-4a94-8880-62c4f7660e6f
📒 Files selected for processing (2)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #8221      +/-   ##
==========================================
+ Coverage   34.63%   34.65%   +0.01%
==========================================
  Files         767      767
  Lines       93186    93280      +94
==========================================
+ Hits        32277    32322      +45
- Misses      58236    58277      +41
- Partials     2673     2681       +8
```
@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
f33c3fe to 7fd34be (compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 265-268: The code currently calls resolvePodPort(cpContext,
namespace, labels.SelectorFromSet(podSelector), p.TargetPort.String()) whenever
p.TargetPort is named, even if podSelector is empty; change this so named
targetPort resolution only runs when podSelector is non-empty (e.g., check that
podSelector has entries / the labels.SelectorFromSet is not empty) and if
podSelector is empty do not call resolvePodPort and instead skip creating the
monitor for that service (avoid picking an unrelated pod). Ensure you update the
logic around p.TargetPort.String(), podSelector, and resolvePodPort to implement
this guard.
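A minimal sketch of the suggested guard, assuming simplified types (the real code works with intstr.IntOrString target ports and a corev1.Service selector): named targetPort resolution only runs when the Service actually has a selector, because an empty selector would match every pod in the namespace and could resolve the port from an unrelated one.

```go
package main

import "fmt"

// shouldResolveFromPods reports whether a pod lookup is needed and legal.
// targetPortName is empty when the Service targetPort is numeric.
func shouldResolveFromPods(targetPortName string, podSelector map[string]string) (bool, error) {
	if targetPortName == "" {
		// Numeric targetPort: use it directly, no pod lookup needed.
		return false, nil
	}
	if len(podSelector) == 0 {
		// Named targetPort but no selector: skip this monitor rather
		// than risk picking an unrelated pod.
		return false, fmt.Errorf("named targetPort %q but Service has no selector; skipping monitor", targetPortName)
	}
	return true, nil
}

func main() {
	ok, err := shouldResolveFromPods("metrics", nil)
	fmt.Println(ok, err)
	ok, err = shouldResolveFromPods("metrics", map[string]string{"app": "oauth"})
	fmt.Println(ok, err)
}
```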
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: dc3cb725-f0fa-4be3-94f3-ce61f7278bfd
📒 Files selected for processing (2)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
7fd34be to 3500eef (compare)
/auto-cc
jparrill left a comment:
Dropped some comments. Thanks!
Fix two bugs in the metrics-proxy scrape config generation:

1. PodMonitor port resolution used the PodMonitor name to look up a Deployment by convention. This broke for the control-plane-operator because its PodMonitor YAML has name: controlplane-operator while the Deployment is named control-plane-operator. Replace with resolvePodPort, which uses the PodMonitor's full label selector to find matching Pods.
2. ServiceMonitor named targetPort resolution fell back to the Service's port field when targetPort was a string. Now resolves named targetPorts from matching Pods via the Service's selector.

Additionally:
- Use metav1.LabelSelectorAsSelector for full selector support including matchExpressions.
- Scan all matching pods (not just the first) to handle rollouts where different pod revisions coexist.
- Add a regression test for the PodMonitor name-mismatch scenario.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4e38e9b to 9d27131 (compare)
@muraee: This pull request references CNTRLPLANE-2836 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go (1)
142-170: ⚠️ Potential issue | 🟠 Major: Don't emit a partial selector for PodMonitors.

Lines 143-148 resolve the port with the full LabelSelector, but line 170 still serializes only pm.Spec.Selector.MatchLabels into the generated component. A monitor that uses matchExpressions can therefore resolve its port from one pod set and then scrape a different one; expression-only selectors degrade to an empty selector. Please either carry the full selector through ComponentFileConfig/endpoint resolution, or skip these monitors until that format can represent them safely.

Minimal safe guard if full-selector support is out of scope here:

```diff
 podSelector, err := metav1.LabelSelectorAsSelector(&pm.Spec.Selector)
 if err != nil {
 	log.V(4).Info("skipping PodMonitor: invalid selector", "podMonitor", pm.Name, "error", err)
 	continue
 }
+if len(pm.Spec.Selector.MatchExpressions) > 0 {
+	log.V(4).Info("skipping PodMonitor: selector uses matchExpressions which are not representable in metrics-proxy config", "podMonitor", pm.Name)
+	continue
+}
 port, err := resolvePodPort(cpContext, namespace, podSelector, portName)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go` around lines 142 - 170, The generated ComponentFileConfig currently uses only pm.Spec.Selector.MatchLabels which drops matchExpressions and can cause resolvePodPort to use a different selector than the emitted config; update the code so the full LabelSelector is preserved: pass pm.Spec.Selector (not just MatchLabels) into metricsproxybin.ComponentFileConfig.Selector (or add a new field that accepts the complete selector) and ensure any downstream code that serializes Selector handles MatchExpressions, or alternatively skip/continue on PodMonitors whose pm.Spec.Selector contains MatchExpressions (detect via pm.Spec.Selector.MatchExpressions != nil && len(...) > 0) so you never emit a partial/empty selector; adjust usages around resolvePodPort, ComponentFileConfig, and the endpoint generation to reference the same selector variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 2977-2981: The test seeds only three resources but
reconcileMetricsForwarder also deletes the PodMonitor; update the test by
including manifests.MetricsForwarderPodMonitor() in the existingObjects seed and
add the corresponding deletion assertion (same pattern used for
MetricsForwarderDeployment/ConfigMap/ServingCA) to verify the PodMonitor was
removed; locate the test in resources_test.go and modify the setup and the
deletion checks that reference reconcileMetricsForwarder to include the
PodMonitor (manifests.MetricsForwarderPodMonitor()) so this deletion path is
covered.
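The suggested test extension could look roughly like this; the fake client and resource names are illustrative stand-ins, not the real manifests helpers or controller-runtime fake client used in resources_test.go.

```go
package main

import "fmt"

// fakeClient tracks which guest-cluster objects currently exist.
type fakeClient struct{ objects map[string]bool }

func (c *fakeClient) delete(name string) { delete(c.objects, name) }

func (c *fakeClient) exists(name string) bool { return c.objects[name] }

// reconcileDisabled stands in for reconcileMetricsForwarder when metrics
// forwarding is not enabled: every forwarder resource, including the
// PodMonitor, must be deleted.
func reconcileDisabled(c *fakeClient) {
	for _, name := range []string{
		"forwarder-deployment",
		"forwarder-configmap",
		"forwarder-serving-ca",
		"forwarder-podmonitor", // the resource the review says to seed and assert on
	} {
		c.delete(name)
	}
}

func main() {
	c := &fakeClient{objects: map[string]bool{
		"forwarder-deployment": true,
		"forwarder-configmap":  true,
		"forwarder-serving-ca": true,
		"forwarder-podmonitor": true,
	}}
	reconcileDisabled(c)
	fmt.Println(c.exists("forwarder-podmonitor"))
}
```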
---
Outside diff comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go`:
- Around line 142-170: The generated ComponentFileConfig currently uses only
pm.Spec.Selector.MatchLabels which drops matchExpressions and can cause
resolvePodPort to use a different selector than the emitted config; update the
code so the full LabelSelector is preserved: pass pm.Spec.Selector (not just
MatchLabels) into metricsproxybin.ComponentFileConfig.Selector (or add a new
field that accepts the complete selector) and ensure any downstream code that
serializes Selector handles MatchExpressions, or alternatively skip/continue on
PodMonitors whose pm.Spec.Selector contains MatchExpressions (detect via
pm.Spec.Selector.MatchExpressions != nil && len(...) > 0) so you never emit a
partial/empty selector; adjust usages around resolvePodPort,
ComponentFileConfig, and the endpoint generation to reference the same selector
variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: d6310199-40d4-4b54-9ae1-3ce4eb01168a
📒 Files selected for processing (4)
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
9d27131 to 4b9e15a (compare)
When the EnableMetricsForwarding annotation is removed, the CPO deletes the metrics-proxy Route from the management cluster. The HCCO then gets NotFound on the route lookup and silently returns nil without cleaning up the guest cluster resources (deployment, configmap, serving CA, pod monitor).

Add an explicit check for the EnableMetricsForwarding annotation so that guest resources are deleted when forwarding is not enabled, matching the existing DisableMonitoringServices cleanup path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
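A sketch of the gating logic this commit describes. The annotation key string and the returned action names are illustrative; the real code uses the hyperv1.EnableMetricsForwarding constant and deletes the actual guest-cluster objects.

```go
package main

import "fmt"

// Illustrative annotation key; the real constant lives in the hyperv1 package.
const enableMetricsForwarding = "hypershift.openshift.io/enable-metrics-forwarding"

// metricsForwardingEnabled checks the HostedCluster annotations explicitly,
// instead of inferring state from whether the metrics-proxy Route exists.
func metricsForwardingEnabled(annotations map[string]string) bool {
	return annotations[enableMetricsForwarding] == "true"
}

// reconcileMetricsForwarder returns the actions to take. When forwarding is
// disabled, all guest resources are removed (mirroring the existing
// DisableMonitoringServices cleanup path) rather than silently returning.
func reconcileMetricsForwarder(annotations map[string]string) []string {
	if !metricsForwardingEnabled(annotations) {
		return []string{
			"delete deployment",
			"delete configmap",
			"delete serving-ca",
			"delete podmonitor",
		}
	}
	return []string{"ensure forwarder resources"}
}

func main() {
	fmt.Println(reconcileMetricsForwarder(nil))
	fmt.Println(reconcileMetricsForwarder(map[string]string{enableMetricsForwarding: "true"}))
}
```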
4b9e15a to 774946a (compare)
/lgtm
Scheduling tests matching the …
AI Test Failure Analysis. Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6.
/verified by @jiezhao16
@muraee: This PR has been marked as verified by …
/retest
Test Results: e2e-aws, e2e-aks
@muraee: This pull request references Jira Issue OCPBUGS-83538, which is invalid: …
Comment: The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@muraee: This pull request references Jira Issue OCPBUGS-83538, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request.
AI Test Failure Analysis. Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6.
Now I have complete evidence for both jobs. Let me compile the final report:

Test Failure Analysis Complete

Job Information
Job 1: e2e-aws
Job 2: e2e-azure-self-managed

Test Failure Analysis

Error Summary: The e2e-aws failure is caused by the …

Root Cause
e2e-aws (PR-related — metrics-proxy regression): The PR changes … In the AWS job, the test shows that:
However, the kube-apiserver target was never found in Prometheus active targets despite 11+ retries over an extended period. This means the metrics-proxy's scrape configuration — the component modified by this PR — was not correctly generating the kube-apiserver target configuration. The …
The test has no timeout of its own — it retries indefinitely until the 2h job-level timeout kills the entire test process. This caused …

e2e-azure-self-managed (NOT PR-related — infrastructure issues): Critically, …

Recommendations
…

Evidence
…
@muraee: all tests passed! Full PR test history. Your PR dashboard.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@muraee: Jira Issue Verification Checks: Jira Issue OCPBUGS-83538.
Jira Issue OCPBUGS-83538 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Summary

- resolveDeploymentPort used the PodMonitor's name to look up a Deployment by convention (pm.Name == deployment name). This broke for the control-plane-operator because its PodMonitor YAML has name: controlplane-operator while the Deployment is named control-plane-operator, causing the CPO metric to be silently omitted from the metrics-proxy-config configmap. Replaced with resolvePodPort, which uses the PodMonitor's full label selector (metav1.LabelSelectorAsSelector) to find matching Pods directly, scanning all pods for rollout resilience.
- When targetPort is a named port (string) rather than a number, resolveServicePort incorrectly fell back to the Service's port field. Now resolves the named targetPort from matching Pods using the Service's selector.
- When the EnableMetricsForwarding annotation is removed, the CPO deletes the metrics-proxy Route. The HCCO then gets NotFound on the route lookup and silently returns without cleaning up guest cluster resources (deployment, configmap, serving CA, pod monitor). Added an explicit check so guest resources are deleted when forwarding is not enabled.

Test plan
- TestAdaptScrapeConfigNamedTargetPort test verifies named targetPort resolution from pods
- TestAdaptScrapeConfigPodMonitorNameMismatch regression test for the CPO name mismatch
- Test that removing the EnableMetricsForwarding annotation cleans up guest resources

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests