TRT-2617: Fix CPU metrics monitor node role lookup#31252
TRT-2617: Fix CPU metrics monitor node role lookup#31252openshift-merge-bot[bot] merged 2 commits into
Conversation
nodeInfos were mapped by IP, but looked up by instance name. transient nodes might not be configured at the time the map is made, so there can still be missing nodes, but most that we're interested in will be found; this will fix the empty role column in cpu data.
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughRefactors CPU metric collection by extracting Prometheus-client-driven logic into collectCPUMetricsFromPrometheusClient, improves node-to-metadata mapping to include node names, and adds a conditional functional test that runs the end-to-end collection against real Prometheus and Kubernetes contexts. ChangesCPU Metrics Refactoring and Testing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
pkg/monitortests/testframework/cpumetriccollector/functional_test.go (3)
128-133: 💤 Low valueIgnored error on
os.ReadFileand preferfilepath.Join.The error from
os.ReadFileis dropped; on failurepreviewis silently empty, masking a real read failure during debugging. Minor, but checking it makes the log trustworthy. Also preferfilepath.Joinoverfmt.Sprintffor the path.♻️ Proposed tweak
- content, _ := os.ReadFile(fmt.Sprintf("%s/%s", storageDir, e.Name())) + content, err := os.ReadFile(filepath.Join(storageDir, e.Name())) + require.NoError(t, err) preview := string(content)(add
"path/filepath"to imports;"fmt"may become unused.)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` around lines 128 - 133, The call to os.ReadFile for building preview drops its error and uses fmt.Sprintf for the path; update the code that reads the file (where os.ReadFile is called with fmt.Sprintf("%s/%s", storageDir, e.Name())) to use filepath.Join(storageDir, e.Name()), capture the returned error, and handle it (e.g. t.Fatalf or t.Errorf with the error) instead of ignoring it so failures are visible; keep the preview truncation and t.Logf("content:\n%s", preview) but only after a successful read.
56-56: 💤 Low valueConsider a bounded context for the Prometheus query.
context.Background()withQueryRangehas no deadline; if the reconstructed Prometheus is unresponsive the test can hang until the harness times out. Acontext.WithTimeoutwould fail fast with a clearer signal.Also applies to: 94-94
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` at line 56, Replace the unconstrained context.Background() used for Prometheus queries with a bounded context created via context.WithTimeout (e.g., 5–30s depending on test environment) and ensure you call the returned cancel() in a defer to avoid leaks; update both occurrences where ctx := context.Background() is used for QueryRange calls so the test fails fast if Prometheus is unresponsive and does not hang the test harness.
75-84: 💤 Low value
InsecureSkipVerify: true— acceptable here, but worth a scoping note.Static analysis flags Line 79. For this env-var-gated debug test that points at a Prometheus server reconstructed from captured CI job data, skipping verification is reasonable and not an exploitable production path. A short inline comment explaining the intent would help future readers (and silence the scanner via an expected, documented exception).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` around lines 75 - 84, Add a short inline comment next to the TLS config where prometheusapi.NewClient is created (the tls.Config{InsecureSkipVerify: true} in this test) explaining that InsecureSkipVerify is intentionally set for this env-var-gated debug functional test which connects to a reconstructed Prometheus instance from CI artifacts, and that this is not used in production; this documents the exception for static analysis and future readers and prevents the scanner from flagging this test as an unreviewed security issue.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go`:
- Around line 124-127: The loop over DirEntry entries calls e.Info() and
directly uses info.Size() without checking the returned error; change the
handling in the loop that calls e.Info() so you check the error (e.g., if err !=
nil) before dereferencing info, and fail the test or log and continue (use
t.Fatalf/t.Errorf) when e.Info() returns an error so the test fails cleanly
instead of panicking; update the block using e.Info(), info.Size(), e.Name()
accordingly.
---
Nitpick comments:
In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go`:
- Around line 128-133: The call to os.ReadFile for building preview drops its
error and uses fmt.Sprintf for the path; update the code that reads the file
(where os.ReadFile is called with fmt.Sprintf("%s/%s", storageDir, e.Name())) to
use filepath.Join(storageDir, e.Name()), capture the returned error, and handle
it (e.g. t.Fatalf or t.Errorf with the error) instead of ignoring it so failures
are visible; keep the preview truncation and t.Logf("content:\n%s", preview) but
only after a successful read.
- Line 56: Replace the unconstrained context.Background() used for Prometheus
queries with a bounded context created via context.WithTimeout (e.g., 5–30s
depending on test environment) and ensure you call the returned cancel() in a
defer to avoid leaks; update both occurrences where ctx := context.Background()
is used for QueryRange calls so the test fails fast if Prometheus is
unresponsive and does not hang the test harness.
- Around line 75-84: Add a short inline comment next to the TLS config where
prometheusapi.NewClient is created (the tls.Config{InsecureSkipVerify: true} in
this test) explaining that InsecureSkipVerify is intentionally set for this
env-var-gated debug functional test which connects to a reconstructed Prometheus
instance from CI artifacts, and that this is not used in production; this
documents the exception for static analysis and future readers and prevents the
scanner from flagging this test as an unreviewed security issue.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 869f3eb5-ec58-49f8-9b2c-6f0f23d9ce22
📒 Files selected for processing (2)
pkg/monitortests/testframework/cpumetriccollector/functional_test.gopkg/monitortests/testframework/cpumetriccollector/monitortest.go
|
Scheduling required tests: |
It's useful to be able to point this collector at job data and debug what it's doing. No change to real functionality. 🤖 Assisted by Claude Code
b036cb5 to
f9c36b9
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go`:
- Around line 62-67: The test currently only ensures cpuDataPoints is non-empty
but doesn't verify node-role resolution; update the test that calls
cpuMetricCollector.collectCPUMetricsFromPrometheusClient to assert that at least
one element in collector.cpuDataPoints has a non-empty/resolved nodeRole (or add
a separate unit test for collectCPUDataPointsFromMetrics that supplies mock node
labels and verifies mapping). Locate cpuMetricCollector,
collectCPUMetricsFromPrometheusClient (and collectCPUDataPointsFromMetrics) and
add an assertion such as checking any(dp.nodeRole != "" || dp.NodeRole != nil)
depending on struct field name, to fail if role lookup regresses.
- Around line 51-58: Remove the insecure default TLS bypass in the Prometheus
client creation: stop using tls.Config{InsecureSkipVerify: true} in
prometheusapi.NewClient and instead rely on system CA validation (or provide a
test-only opt-in flag) so verification is enabled by default; update the client
config in the setup that constructs promClient. Replace context.Background()
with a cancellable context using context.WithTimeout (or WithDeadline) and
thread that ctx into any downstream calls that fetch metrics so the test cannot
hang. Strengthen assertions after running the collector: in addition to checking
collector.cpuDataPoints is non-empty, add an assertion that collector.nodeRole
(or NodeRole) is populated (non-empty) to detect the regression where node role
was omitted. Ensure you update references to promClient construction, the ctx
variable, and the collector instance (collector.cpuDataPoints and
collector.nodeRole/NodeRole) when making these changes.
- Around line 44-64: The test uses context.Background() which can hang on
network calls; replace ctx := context.Background() with a context created via
context.WithTimeout (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>)) and defer
cancel(), then pass that ctx into collectCPUMetricsFromPrometheusClient so the
kubeclient and promAPI operations inside collectCPUMetricsFromPrometheusClient /
collectCPUMetricsFromPrometheus (Namespaces().Get, Nodes().List, Thanos check,
QueryRange) are bounded by the timeout; ensure the same ctx is reused for all
calls and the cancel is deferred to clean up.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d0a82ba3-0df1-496e-9bb0-4f5fbe412c58
📒 Files selected for processing (2)
pkg/monitortests/testframework/cpumetriccollector/functional_test.gopkg/monitortests/testframework/cpumetriccollector/monitortest.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/monitortests/testframework/cpumetriccollector/monitortest.go
| promClient, err := prometheusapi.NewClient(prometheusapi.Config{ | ||
| Address: promURL, | ||
| Client: &http.Client{ | ||
| Transport: &http.Transport{ | ||
| TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, | ||
| }, | ||
| }, | ||
| }) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="pkg/monitortests/testframework/cpumetriccollector/functional_test.go"
echo "== File metadata =="
wc -l "$file"
echo
echo "== Snippet around lines 35-75 =="
sed -n '35,80p' "$file" | cat -n
echo
echo "== Snippet around test assertions =="
rg -n "collector\.cpuDataPoints|cpuDataPoints" "$file" -n
echo
# show a bit more context around the match
line=$(rg -n "collector\.cpuDataPoints" "$file" -o | head -n1 | cut -d: -f1 || true)
if [[ -n "${line:-}" ]]; then
start=$((line-15))
end=$((line+20))
sed -n "${start},${end}p" "$file" | cat -n
fi
echo
echo "== Check how nodeRole is populated downstream =="
down="pkg/monitortests/testframework/cpumetriccollector/monitortest.go"
wc -l "$down"
rg -n "nodeRole|nodeInfoMap" "$down" || true
# show relevant section
rg -n "nodeRole" "$down" -n | head -n 5 | while read -r lnum rest; do
s=$((lnum-15)); e=$((lnum+25));
echo "---- $down around line $lnum ----"
sed -n "${s},${e}p" "$down" | cat -n
done
echo
echo "== Search for InsecureSkipVerify usage in repo =="
rg -n "InsecureSkipVerify:\s*true" -S . || trueRepository: openshift/origin
Length of output: 4167
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== InsecureSkipVerify occurrences =="
rg -n "InsecureSkipVerify:\s*true" -S . || true
echo
echo "== Inspect getNodeRole + nodeInfoMap construction =="
file="pkg/monitortests/testframework/cpumetriccollector/monitortest.go"
# show around nodeRole/data point wiring
sed -n '300,380p' "$file" | cat -n
echo
echo "== Inspect cpu data point creation where nodeRole is assigned =="
sed -n '130,210p' "$file" | cat -nRepository: openshift/origin
Length of output: 13615
Remove default insecure TLS verification for the Prometheus client and strengthen the test’s reliability/regression checks.
functional_test.gosetstls.Config{InsecureSkipVerify: true}forprometheusapi.NewClient; keep TLS verification by default (trust CA / add a secure default path, allow insecure only via explicit opt-in).- The test uses
ctx := context.Background()for live cluster/Prometheus calls; switch tocontext.WithTimeout/WithDeadlineand ensure downstream uses it. - The test only asserts
collector.cpuDataPointsis non-empty; add an assertion thatnodeRole/NodeRoleis correctly populated (not empty / not just silently missing) to catch the intended regression.
🧰 Tools
🪛 ast-grep (0.43.0)
[warning] 54-54: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{InsecureSkipVerify: true}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(missing-ssl-minversion-go)
🪛 OpenGrep (1.22.0)
[ERROR] 55-55: TLS certificate verification is disabled via InsecureSkipVerify. This allows man-in-the-middle attacks. Remove InsecureSkipVerify or set it to false.
(coderabbit.tls.go-insecure-skip-verify)
[ERROR] 55-55: TLS certificate verification is disabled via InsecureSkipVerify. This allows man-in-the-middle attacks. Remove InsecureSkipVerify or set it to false.
(coderabbit.tls.go-insecure-skip-verify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` around
lines 51 - 58, Remove the insecure default TLS bypass in the Prometheus client
creation: stop using tls.Config{InsecureSkipVerify: true} in
prometheusapi.NewClient and instead rely on system CA validation (or provide a
test-only opt-in flag) so verification is enabled by default; update the client
config in the setup that constructs promClient. Replace context.Background()
with a cancellable context using context.WithTimeout (or WithDeadline) and
thread that ctx into any downstream calls that fetch metrics so the test cannot
hang. Strengthen assertions after running the collector: in addition to checking
collector.cpuDataPoints is non-empty, add an assertion that collector.nodeRole
(or NodeRole) is populated (non-empty) to detect the regression where node role
was omitted. Ensure you update references to promClient construction, the ctx
variable, and the collector instance (collector.cpuDataPoints and
collector.nodeRole/NodeRole) when making these changes.
| collector := &cpuMetricCollector{highCPUThreshold: 95.0} | ||
| intervals, err := collector.collectCPUMetricsFromPrometheusClient(ctx, promAPI, kubeClient, startTime) | ||
| require.NoError(t, err) | ||
|
|
||
| t.Logf("high-cpu intervals: %d, data points: %d", len(intervals), len(collector.cpuDataPoints)) | ||
| require.NotEmpty(t, collector.cpuDataPoints, "expected some cpu data points") |
There was a problem hiding this comment.
Assert the node-role mapping, not just that some samples exist.
The regression this PR fixes is the empty node-role column, but this test only checks cpuDataPoints is non-empty. If role lookup breaks again, collectCPUDataPointsFromMetrics will still append datapoints and this test will still pass. Please assert that at least one collected datapoint has a resolved nodeRole, or add a deterministic unit test around the mapping path.
✅ Suggested assertion
t.Logf("high-cpu intervals: %d, data points: %d", len(intervals), len(collector.cpuDataPoints))
require.NotEmpty(t, collector.cpuDataPoints, "expected some cpu data points")
+ require.Condition(t, func() bool {
+ for _, dp := range collector.cpuDataPoints {
+ if dp.nodeRole != "" {
+ return true
+ }
+ }
+ return false
+ }, "expected at least one cpu data point with a resolved node role")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| collector := &cpuMetricCollector{highCPUThreshold: 95.0} | |
| intervals, err := collector.collectCPUMetricsFromPrometheusClient(ctx, promAPI, kubeClient, startTime) | |
| require.NoError(t, err) | |
| t.Logf("high-cpu intervals: %d, data points: %d", len(intervals), len(collector.cpuDataPoints)) | |
| require.NotEmpty(t, collector.cpuDataPoints, "expected some cpu data points") | |
| collector := &cpuMetricCollector{highCPUThreshold: 95.0} | |
| intervals, err := collector.collectCPUMetricsFromPrometheusClient(ctx, promAPI, kubeClient, startTime) | |
| require.NoError(t, err) | |
| t.Logf("high-cpu intervals: %d, data points: %d", len(intervals), len(collector.cpuDataPoints)) | |
| require.NotEmpty(t, collector.cpuDataPoints, "expected some cpu data points") | |
| require.Condition(t, func() bool { | |
| for _, dp := range collector.cpuDataPoints { | |
| if dp.nodeRole != "" { | |
| return true | |
| } | |
| } | |
| return false | |
| }, "expected at least one cpu data point with a resolved node role") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitortests/testframework/cpumetriccollector/functional_test.go` around
lines 62 - 67, The test currently only ensures cpuDataPoints is non-empty but
doesn't verify node-role resolution; update the test that calls
cpuMetricCollector.collectCPUMetricsFromPrometheusClient to assert that at least
one element in collector.cpuDataPoints has a non-empty/resolved nodeRole (or add
a separate unit test for collectCPUDataPointsFromMetrics that supplies mock node
labels and verifies mapping). Locate cpuMetricCollector,
collectCPUMetricsFromPrometheusClient (and collectCPUDataPointsFromMetrics) and
add an assertion such as checking any(dp.nodeRole != "" || dp.NodeRole != nil)
depending on struct field name, to fail if role lookup regresses.
|
Scheduling required tests: |
|
@sosiouxme: This pull request references TRT-2617 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: petr-muller, sosiouxme The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by @sosiouxme |
|
@sosiouxme: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sosiouxme: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Add functional test to enable observation, and fill the missing node role column that's been empty.
Summary by CodeRabbit
Tests
Refactor