
TRT-2617: Fix CPU metrics monitor node role lookup #31252

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from sosiouxme:20260602-fix-cpu-metrics-monitor
Jun 3, 2026

Conversation

@sosiouxme
Member

@sosiouxme sosiouxme commented Jun 3, 2026

Add a functional test so the collector's behavior can be observed, and populate the node role column, which has been empty.

Summary by CodeRabbit

  • Tests

    • Added a new functional integration test that validates CPU metrics collection using Prometheus and Kubernetes connectivity; skips when required environment variables are unset.
  • Refactor

    • Reorganized CPU metric collection logic for clearer separation of responsibilities and improved node metadata mapping to better handle node identification and maintainability.

nodeInfos were mapped by IP but looked up by instance name.
Transient nodes might not be configured at the time the map is built, so
some nodes can still be missing, but most of the ones we're interested in
will be found. This fixes the empty role column in the CPU data.
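
To make the mismatch concrete, here is a minimal sketch of an index that registers each node under several keys, so a lookup by instance label succeeds whether the label carries a node name, an internal IP, or an IP with the node-exporter port. The `node` and `nodeInfo` types and both function names are hypothetical stand-ins, not the collector's actual code, and the use of `net.JoinHostPort` is an assumption about how the port variant could be built.

```go
package main

import (
	"fmt"
	"net"
)

// nodeInfo is a simplified, hypothetical stand-in for the collector's
// per-node metadata; the real struct may carry more fields.
type nodeInfo struct {
	name string
	role string
}

// node is a minimal stand-in for the fields read from a Kubernetes Node.
type node struct {
	name       string
	internalIP string
	role       string
}

// buildNodeInfoIndex creates one nodeInfo per node and registers it under
// every key an "instance" label might carry: node name, internal IP, and
// internal IP with the node-exporter port (9100).
func buildNodeInfoIndex(nodes []node) map[string]*nodeInfo {
	index := make(map[string]*nodeInfo, len(nodes)*3)
	for _, n := range nodes {
		info := &nodeInfo{name: n.name, role: n.role}
		index[n.name] = info
		if n.internalIP != "" {
			index[n.internalIP] = info
			index[net.JoinHostPort(n.internalIP, "9100")] = info
		}
	}
	return index
}

// roleFor returns the role for an instance label, or "" when the node was
// not known when the index was built (for example, a transient node).
func roleFor(index map[string]*nodeInfo, instance string) string {
	if info, ok := index[instance]; ok {
		return info.role
	}
	return ""
}

func main() {
	index := buildNodeInfoIndex([]node{
		{name: "master-0", internalIP: "10.0.0.10", role: "master"},
		{name: "worker-0", internalIP: "10.0.0.20", role: "worker"},
	})
	for _, instance := range []string{"master-0", "10.0.0.20:9100", "10.0.0.99"} {
		fmt.Printf("%s -> %q\n", instance, roleFor(index, instance))
	}
}
```

An unknown instance yields an empty role rather than an error, which matches the note that transient nodes can still be missing.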
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Walkthrough

Refactors CPU metric collection by extracting Prometheus-client-driven logic into collectCPUMetricsFromPrometheusClient, improves node-to-metadata mapping to include node names, and adds a conditional functional test that runs the end-to-end collection against real Prometheus and Kubernetes contexts.

Changes

CPU Metrics Refactoring and Testing

| Layer / File(s) | Summary |
|---|---|
| Prometheus client helper and callsite (pkg/monitortests/testframework/cpumetriccollector/monitortest.go) | Adds collectCPUMetricsFromPrometheusClient which accepts a prepared Prometheus v1 API client and kube client; removes a now-redundant local logger from the original caller. |
| Node info mapping and fallback updates (pkg/monitortests/testframework/cpumetriccollector/monitortest.go) | buildNodeInfoMap now creates one nodeInfo per node and keys the map by node name plus internal IP and :9100 variants; small comment/formatting edits for node-name fallback paths. |
| Functional test for CPU metric pipeline (pkg/monitortests/testframework/cpumetriccollector/functional_test.go) | Adds TestCollectCPUMetrics_Functional which conditionally reads PROMETHEUS_URL, KUBECONFIG, and CPU_START_TIME, builds kube and Prometheus clients (Prometheus client disables TLS verification), invokes collection with a cpuMetricCollector, logs results, and asserts collected datapoints are present. |
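
The environment gating described for the functional test can be sketched as follows. This is an illustration under stated assumptions, not the test's actual code: the variable names PROMETHEUS_URL, KUBECONFIG, and CPU_START_TIME come from the summary above, while `functionalEnv`, `loadFunctionalEnv`, and the RFC 3339 format for CPU_START_TIME are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// functionalEnv holds the settings the functional test would need.
type functionalEnv struct {
	promURL    string
	kubeconfig string
	startTime  time.Time
}

// loadFunctionalEnv reads the settings through lookup (os.Getenv in
// practice). It returns a non-empty skip reason when a variable is unset,
// and an error when CPU_START_TIME is set but cannot be parsed.
// Assumption: CPU_START_TIME is RFC 3339; the real test may differ.
func loadFunctionalEnv(lookup func(string) string) (functionalEnv, string, error) {
	for _, name := range []string{"PROMETHEUS_URL", "KUBECONFIG", "CPU_START_TIME"} {
		if lookup(name) == "" {
			return functionalEnv{}, name + " is not set", nil
		}
	}
	start, err := time.Parse(time.RFC3339, lookup("CPU_START_TIME"))
	if err != nil {
		return functionalEnv{}, "", fmt.Errorf("parsing CPU_START_TIME: %w", err)
	}
	return functionalEnv{
		promURL:    lookup("PROMETHEUS_URL"),
		kubeconfig: lookup("KUBECONFIG"),
		startTime:  start,
	}, "", nil
}

func main() {
	env, skip, err := loadFunctionalEnv(os.Getenv)
	switch {
	case err != nil:
		fmt.Println("error:", err)
	case skip != "":
		// A real test would call t.Skip(skip) here.
		fmt.Println("skipping:", skip)
	default:
		fmt.Println("would collect from", env.promURL, "since", env.startTime)
	}
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the gating logic testable without touching the process environment.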

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation / Resolution |
|---|---|---|
| Test Structure And Quality | ❓ Inconclusive | Custom check specifies reviewing "Ginkgo test code" with It/BeforeEach patterns, but added test is standard Go testing.T, not Ginkgo. Does check apply to all tests or only Ginkgo? PR contains standard Go functional test, not Ginkgo-based tests. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR uses standard Go testing (not Ginkgo), and test name TestCollectCPUMetrics_Functional is static with no dynamic values, timestamps, or identifiers. |
| Microshift Test Compatibility | ✅ Passed | The new test is a standard Go test (TestCollectCPUMetrics_Functional), not a Ginkgo e2e test. The MicroShift compatibility check applies only to Ginkgo e2e tests, so this is out of scope. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Test is a standard Go test (testing.T), not a Ginkgo e2e test. Custom check applies only to Ginkgo e2e tests. The functional test does not make SNO-incompatible assumptions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds test/monitoring code only (functional test and CPU metrics collector refactor), with no deployment manifests, operator code, or scheduling constraints. Check is not applicable. |
| Ote Binary Stdout Contract | ✅ Passed | No OTE Binary Stdout Contract violations found. The PR contains no process-level stdout writes, top-level side effects, or uncontrolled logging to stdout. Logrus writes to stderr by default. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The test TestCollectCPUMetrics_Functional is a regular Go test (not a Ginkgo e2e test), so the IPv6 check does not apply here. |
| No-Weak-Crypto | ✅ Passed | No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB mode), custom crypto implementations, or non-constant-time secret comparisons detected in the modified files. |
| Container-Privileges | ✅ Passed | PR contains only Go source code with no Kubernetes manifests or container configurations. The privileged container settings check is not applicable to code-only changes. |
| No-Sensitive-Data-In-Logs | ✅ Passed | Code does not contain explicit logging of sensitive data. The functional test logs only numeric counts; monitortest.go logs only counts and node names. |
| Title check | ✅ Passed | The title references 'node role lookup' which aligns with the commit messages describing a nodeInfo lookup fix, but incompletely represents the changes which also add a functional test and refactor the collector. |


@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2026
@openshift-ci openshift-ci Bot requested review from p0lyn0mial and sjenning June 3, 2026 01:26

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
pkg/monitortests/testframework/cpumetriccollector/functional_test.go (3)

128-133: 💤 Low value

Ignored error on os.ReadFile and prefer filepath.Join.

The error from os.ReadFile is dropped; on failure preview is silently empty, masking a real read failure during debugging. Minor, but checking it makes the log trustworthy. Also prefer filepath.Join over fmt.Sprintf for the path.

♻️ Proposed tweak
-		content, _ := os.ReadFile(fmt.Sprintf("%s/%s", storageDir, e.Name()))
+		content, err := os.ReadFile(filepath.Join(storageDir, e.Name()))
+		require.NoError(t, err)
 		preview := string(content)

(add "path/filepath" to imports; "fmt" may become unused.)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` around
lines 128 - 133, The call to os.ReadFile for building preview drops its error
and uses fmt.Sprintf for the path; update the code that reads the file (where
os.ReadFile is called with fmt.Sprintf("%s/%s", storageDir, e.Name())) to use
filepath.Join(storageDir, e.Name()), capture the returned error, and handle it
(e.g. t.Fatalf or t.Errorf with the error) instead of ignoring it so failures
are visible; keep the preview truncation and t.Logf("content:\n%s", preview) but
only after a successful read.

56-56: 💤 Low value

Consider a bounded context for the Prometheus query.

context.Background() with QueryRange has no deadline; if the reconstructed Prometheus is unresponsive the test can hang until the harness times out. A context.WithTimeout would fail fast with a clearer signal.

Also applies to: 94-94

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` at line
56, Replace the unconstrained context.Background() used for Prometheus queries
with a bounded context created via context.WithTimeout (e.g., 5–30s depending on
test environment) and ensure you call the returned cancel() in a defer to avoid
leaks; update both occurrences where ctx := context.Background() is used for
QueryRange calls so the test fails fast if Prometheus is unresponsive and does
not hang the test harness.

75-84: 💤 Low value

InsecureSkipVerify: true — acceptable here, but worth a scoping note.

Static analysis flags Line 79. For this env-var-gated debug test that points at a Prometheus server reconstructed from captured CI job data, skipping verification is reasonable and not an exploitable production path. A short inline comment explaining the intent would help future readers (and silence the scanner via an expected, documented exception).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` around
lines 75 - 84, Add a short inline comment next to the TLS config where
prometheusapi.NewClient is created (the tls.Config{InsecureSkipVerify: true} in
this test) explaining that InsecureSkipVerify is intentionally set for this
env-var-gated debug functional test which connects to a reconstructed Prometheus
instance from CI artifacts, and that this is not used in production; this
documents the exception for static analysis and future readers and prevents the
scanner from flagging this test as an unreviewed security issue.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go`:
- Around line 124-127: The loop over DirEntry entries calls e.Info() and
directly uses info.Size() without checking the returned error; change the
handling in the loop that calls e.Info() so you check the error (e.g., if err !=
nil) before dereferencing info, and fail the test or log and continue (use
t.Fatalf/t.Errorf) when e.Info() returns an error so the test fails cleanly
instead of panicking; update the block using e.Info(), info.Size(), e.Name()
accordingly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 869f3eb5-ec58-49f8-9b2c-6f0f23d9ce22

📥 Commits

Reviewing files that changed from the base of the PR and between 05c80ba and b036cb5.

📒 Files selected for processing (2)
  • pkg/monitortests/testframework/cpumetriccollector/functional_test.go
  • pkg/monitortests/testframework/cpumetriccollector/monitortest.go

Comment thread pkg/monitortests/testframework/cpumetriccollector/functional_test.go Outdated
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

It's useful to be able to point this collector at job data and debug
what it's doing. No change to real functionality.

🤖 Assisted by Claude Code
@sosiouxme sosiouxme changed the title from "Fix cpu metrics monitor" to "Fix CPU metrics monitor node role lookup" Jun 3, 2026
@sosiouxme sosiouxme force-pushed the 20260602-fix-cpu-metrics-monitor branch from b036cb5 to f9c36b9 Compare June 3, 2026 03:20

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go`:
- Around line 62-67: The test currently only ensures cpuDataPoints is non-empty
but doesn't verify node-role resolution; update the test that calls
cpuMetricCollector.collectCPUMetricsFromPrometheusClient to assert that at least
one element in collector.cpuDataPoints has a non-empty/resolved nodeRole (or add
a separate unit test for collectCPUDataPointsFromMetrics that supplies mock node
labels and verifies mapping). Locate cpuMetricCollector,
collectCPUMetricsFromPrometheusClient (and collectCPUDataPointsFromMetrics) and
add an assertion such as checking any(dp.nodeRole != "" || dp.NodeRole != nil)
depending on struct field name, to fail if role lookup regresses.
- Around line 51-58: Remove the insecure default TLS bypass in the Prometheus
client creation: stop using tls.Config{InsecureSkipVerify: true} in
prometheusapi.NewClient and instead rely on system CA validation (or provide a
test-only opt-in flag) so verification is enabled by default; update the client
config in the setup that constructs promClient. Replace context.Background()
with a cancellable context using context.WithTimeout (or WithDeadline) and
thread that ctx into any downstream calls that fetch metrics so the test cannot
hang. Strengthen assertions after running the collector: in addition to checking
collector.cpuDataPoints is non-empty, add an assertion that collector.nodeRole
(or NodeRole) is populated (non-empty) to detect the regression where node role
was omitted. Ensure you update references to promClient construction, the ctx
variable, and the collector instance (collector.cpuDataPoints and
collector.nodeRole/NodeRole) when making these changes.
- Around line 44-64: The test uses context.Background() which can hang on
network calls; replace ctx := context.Background() with a context created via
context.WithTimeout (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>)) and defer
cancel(), then pass that ctx into collectCPUMetricsFromPrometheusClient so the
kubeclient and promAPI operations inside collectCPUMetricsFromPrometheusClient /
collectCPUMetricsFromPrometheus (Namespaces().Get, Nodes().List, Thanos check,
QueryRange) are bounded by the timeout; ensure the same ctx is reused for all
calls and the cancel is deferred to clean up.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d0a82ba3-0df1-496e-9bb0-4f5fbe412c58

📥 Commits

Reviewing files that changed from the base of the PR and between b036cb5 and f9c36b9.

📒 Files selected for processing (2)
  • pkg/monitortests/testframework/cpumetriccollector/functional_test.go
  • pkg/monitortests/testframework/cpumetriccollector/monitortest.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/monitortests/testframework/cpumetriccollector/monitortest.go

Comment on lines +51 to +58
promClient, err := prometheusapi.NewClient(prometheusapi.Config{
	Address: promURL,
	Client: &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	},
})

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="pkg/monitortests/testframework/cpumetriccollector/functional_test.go"

echo "== File metadata =="
wc -l "$file"
echo

echo "== Snippet around lines 35-75 =="
sed -n '35,80p' "$file" | cat -n

echo
echo "== Snippet around test assertions =="
rg -n "collector\.cpuDataPoints|cpuDataPoints" "$file" -n
echo
# show a bit more context around the match
line=$(rg -n "collector\.cpuDataPoints" "$file" -o | head -n1 | cut -d: -f1 || true)
if [[ -n "${line:-}" ]]; then
  start=$((line-15))
  end=$((line+20))
  sed -n "${start},${end}p" "$file" | cat -n
fi

echo
echo "== Check how nodeRole is populated downstream =="
down="pkg/monitortests/testframework/cpumetriccollector/monitortest.go"
wc -l "$down"
rg -n "nodeRole|nodeInfoMap" "$down" || true
# show relevant section
rg -n "nodeRole" "$down" -n | head -n 5 | while read -r lnum rest; do
  s=$((lnum-15)); e=$((lnum+25));
  echo "---- $down around line $lnum ----"
  sed -n "${s},${e}p" "$down" | cat -n
done

echo
echo "== Search for InsecureSkipVerify usage in repo =="
rg -n "InsecureSkipVerify:\s*true" -S . || true

Repository: openshift/origin

Length of output: 4167


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== InsecureSkipVerify occurrences =="
rg -n "InsecureSkipVerify:\s*true" -S . || true

echo
echo "== Inspect getNodeRole + nodeInfoMap construction =="
file="pkg/monitortests/testframework/cpumetriccollector/monitortest.go"
# show around nodeRole/data point wiring
sed -n '300,380p' "$file" | cat -n

echo
echo "== Inspect cpu data point creation where nodeRole is assigned =="
sed -n '130,210p' "$file" | cat -n

Repository: openshift/origin

Length of output: 13615


Remove default insecure TLS verification for the Prometheus client and strengthen the test’s reliability/regression checks.

  • functional_test.go sets tls.Config{InsecureSkipVerify: true} for prometheusapi.NewClient; keep TLS verification by default (trust CA / add a secure default path, allow insecure only via explicit opt-in).
  • The test uses ctx := context.Background() for live cluster/Prometheus calls; switch to context.WithTimeout/WithDeadline and ensure downstream uses it.
  • The test only asserts collector.cpuDataPoints is non-empty; add an assertion that nodeRole/NodeRole is correctly populated (not empty / not just silently missing) to catch the intended regression.
🧰 Tools
🪛 ast-grep (0.43.0)

[warning] 54-54: MinVersion is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. Add 'MinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{InsecureSkipVerify: true}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 OpenGrep (1.22.0)

[ERROR] 55-55: TLS certificate verification is disabled via InsecureSkipVerify. This allows man-in-the-middle attacks. Remove InsecureSkipVerify or set it to false.

(coderabbit.tls.go-insecure-skip-verify)



Comment on lines +62 to +67
collector := &cpuMetricCollector{highCPUThreshold: 95.0}
intervals, err := collector.collectCPUMetricsFromPrometheusClient(ctx, promAPI, kubeClient, startTime)
require.NoError(t, err)

t.Logf("high-cpu intervals: %d, data points: %d", len(intervals), len(collector.cpuDataPoints))
require.NotEmpty(t, collector.cpuDataPoints, "expected some cpu data points")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert the node-role mapping, not just that some samples exist.

The regression this PR fixes is the empty node-role column, but this test only checks cpuDataPoints is non-empty. If role lookup breaks again, collectCPUDataPointsFromMetrics will still append datapoints and this test will still pass. Please assert that at least one collected datapoint has a resolved nodeRole, or add a deterministic unit test around the mapping path.

✅ Suggested assertion
 	t.Logf("high-cpu intervals: %d, data points: %d", len(intervals), len(collector.cpuDataPoints))
 	require.NotEmpty(t, collector.cpuDataPoints, "expected some cpu data points")
+	require.Condition(t, func() bool {
+		for _, dp := range collector.cpuDataPoints {
+			if dp.nodeRole != "" {
+				return true
+			}
+		}
+		return false
+	}, "expected at least one cpu data point with a resolved node role")
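
For a dependency-free variant of this check, a small helper can count how many data points carry a resolved role, which also gives the test something more informative to log. This is a sketch only: `cpuDataPoint` here is a simplified stand-in that borrows the `nodeRole` field name from the suggestion above, and `countResolvedRoles` is a hypothetical helper.

```go
package main

import "fmt"

// cpuDataPoint is a simplified stand-in for the collector's data point;
// only the nodeRole field name is taken from the review suggestion.
type cpuDataPoint struct {
	node     string
	nodeRole string
}

// countResolvedRoles reports how many points have a non-empty node role
// and how many points there are in total.
func countResolvedRoles(points []cpuDataPoint) (resolved, total int) {
	for _, dp := range points {
		if dp.nodeRole != "" {
			resolved++
		}
	}
	return resolved, len(points)
}

func main() {
	points := []cpuDataPoint{
		{node: "master-0", nodeRole: "master"},
		{node: "worker-0", nodeRole: "worker"},
		{node: "transient-1"}, // role unknown when the map was built
	}
	resolved, total := countResolvedRoles(points)
	fmt.Printf("resolved roles: %d of %d\n", resolved, total)
	if resolved == 0 {
		fmt.Println("role lookup appears to have regressed")
	}
}
```

Requiring at least one resolved role, rather than all of them, leaves room for transient nodes that were not present when the node map was built.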

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@sosiouxme sosiouxme changed the title from "Fix CPU metrics monitor node role lookup" to "TRT-2617: Fix CPU metrics monitor node role lookup" Jun 3, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 3, 2026

@sosiouxme: This pull request references TRT-2617 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 3, 2026
@sosiouxme
Member Author

/retest

Member

@petr-muller petr-muller left a comment


/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Jun 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, sosiouxme

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [petr-muller,sosiouxme]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sosiouxme
Member Author

/verified by @sosiouxme

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 3, 2026
@openshift-ci-robot

@sosiouxme: This PR has been marked as verified by @sosiouxme.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 51a9242 and 2 for PR HEAD f9c36b9 in total

@openshift-ci
Contributor

openshift-ci Bot commented Jun 3, 2026

@sosiouxme: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 5c76968 into openshift:main Jun 3, 2026
21 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


3 participants