AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265
maxcao13 wants to merge 1 commit into openshift:main
Conversation
Pipeline controller notification: For optional jobs, comment. This repository is configured in: LGTM mode

Skipping CI for Draft Pull Request.
@maxcao13: This pull request references AUTOSCALE-615, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the relevant setting. Use the following commands to manage reviews:
📝 Walkthrough

Adds vCPU capacity tracking for Karpenter-managed AutoNodes: a new optional

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant KC as Karpenter Controller
    participant NC as NodeClaim Resources
    participant N as corev1.Node List
    participant HCP as HostedCluster (Status.AutoNode)
    participant MC as Metrics Collector
    participant Prom as Prometheus
    Note over KC,NC: NodeClaim watch + vCPU computation
    KC->>NC: Receive create/update/delete events
    NC-->>KC: NodeClaim.Status.Capacity[CPU], Status.NodeName
    KC->>N: Query list of Nodes (verify backing node exists)
    N-->>KC: List of node names
    Note over KC: Sum only NodeClaims with NodeName in Nodes
    KC->>KC: sumNodeClaimVCPUs(nodeClaims, liveNodes)
    Note over KC,HCP: Status update
    KC->>HCP: Update Status.AutoNode.VCPUs with computed sum
    HCP-->>HCP: Persist AutoNode.VCPUs
    Note over MC,Prom: Metrics aggregation
    MC->>HCP: Read Status.AutoNode.VCPUs
    MC->>MC: Seed hclusterData.vCpusCount with AutoNode.VCPUs (if present)
    MC->>MC: Add native NodePool vCPUs (from instance-type lookup)
    MC->>Prom: Emit hypershift_cluster_vcpus and error metrics
    Prom-->>Prom: Store/serve metrics
```
🚥 Pre-merge checks: ✅ 9 passed | ❌ 3 failed

❌ Failed checks (3 warnings)
✅ Passed checks (9 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
/test e2e-aws-techpreview
Actionable comments posted: 1
🧹 Nitpick comments (1)
karpenter-operator/controllers/karpenter/karpenter_controller.go (1)
398-414: Precompute the live-node set before summing vCPUs. This helper is currently quadratic because every `NodeClaim` re-scans the full node list. On larger clusters that turns each Node/NodeClaim event into an avoidable hot path in the billing reconcile loop.

♻️ Suggested refactor

```diff
 func sumNodeClaimVCPUs(nodeClaims []karpenterv1.NodeClaim, allNodes []corev1.Node) int32 {
+	liveNodes := make(map[string]struct{}, len(allNodes))
+	for i := range allNodes {
+		liveNodes[allNodes[i].Name] = struct{}{}
+	}
+
 	var total int64
 	for i := range nodeClaims {
 		nc := &nodeClaims[i]
-		if !slices.ContainsFunc(allNodes, func(n corev1.Node) bool {
-			return n.Name == nc.Status.NodeName
-		}) {
+		if _, ok := liveNodes[nc.Status.NodeName]; !ok {
 			continue
 		}
 		if cpu, ok := nc.Status.Capacity[corev1.ResourceCPU]; ok {
 			total += cpu.Value()
 		}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@karpenter-operator/controllers/karpenter/karpenter_controller.go` around lines 398 - 414, The sumNodeClaimVCPUs helper is O(N*M) because it calls slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups against that map (check nc.Status.NodeName exists) before summing nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in sumNodeClaimVCPUs accordingly to use the precomputed map instead of slices.ContainsFunc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1368-1421: The predicate in e2eutil.EventuallyObject currently
treats a nil hc.Status.AutoNode.VCPUs as zero which can falsely pass; change the
predicate (inside the anonymous func passed to e2eutil.EventuallyObject that
reads hc.Status.AutoNode.VCPUs) to explicitly require the pointer be non-nil
before considering the value (return false if nil), and only succeed when
*hc.Status.AutoNode.VCPUs == expectedVCPUs. Likewise, tighten the metric check
in the wait.PollUntilContextTimeout loop (the loop that reads
npmetrics.VCpusCountByHClusterMetricName) to match the metric series by both the
"name" label and the "namespace" label (require l.GetName()=="name" &&
l.GetValue()==hostedCluster.Name and also require a label "namespace" with value
hostedCluster.Namespace) so the correct series is selected.
---
Nitpick comments:
In `@karpenter-operator/controllers/karpenter/karpenter_controller.go`:
- Around line 398-414: The sumNodeClaimVCPUs helper is O(N*M) because it calls
slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live
node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at
the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups
against that map (check nc.Status.NodeName exists) before summing
nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in
sumNodeClaimVCPUs accordingly to use the precomputed map instead of
slices.ContainsFunc.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 41073511-5238-4fd7-b7a0-bb8022ab9509
⛔ Files ignored due to path filters (12)

- `api/hypershift/v1beta1/zz_generated.deepcopy.go` is excluded by `!**/zz_generated*.go`, `!**/zz_generated*`
- `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
- `api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml` is excluded by `!**/zz_generated.featuregated-crd-manifests/**`
- `client/applyconfiguration/hypershift/v1beta1/autonodestatus.go` is excluded by `!client/**`
- `cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml` is excluded by `!**/zz_generated.crd-manifests/**`, `!cmd/install/assets/**/*.yaml`
- `cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml` is excluded by `!**/zz_generated.crd-manifests/**`, `!cmd/install/assets/**/*.yaml`
- `cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml` is excluded by `!**/zz_generated.crd-manifests/**`, `!cmd/install/assets/**/*.yaml`
- `cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml` is excluded by `!**/zz_generated.crd-manifests/**`, `!cmd/install/assets/**/*.yaml`
- `docs/content/reference/aggregated-docs.md` is excluded by `!docs/content/reference/aggregated-docs.md`
- `docs/content/reference/api.md` is excluded by `!docs/content/reference/api.md`
- `vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go` is excluded by `!vendor/**`, `!**/vendor/**`
- `vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go` is excluded by `!vendor/**`, `!**/vendor/**`, `!**/zz_generated*.go`, `!**/zz_generated*`

📒 Files selected for processing (6)

- `api/hypershift/v1beta1/hostedcluster_types.go`
- `hypershift-operator/controllers/nodepool/metrics/metrics.go`
- `hypershift-operator/controllers/nodepool/metrics/metrics_test.go`
- `karpenter-operator/controllers/karpenter/karpenter_controller.go`
- `karpenter-operator/controllers/karpenter/karpenter_controller_test.go`
- `test/e2e/karpenter_test.go`
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff            @@
##             main    #8265    +/-   ##
========================================
  Coverage   36.04%   36.04%
========================================
  Files         767      767
  Lines       93422    93458      +36
========================================
+ Hits        33674    33689      +15
- Misses      57040    57061      +21
  Partials     2708     2708
```
29df349 to 5cd4832 (Compare)
/test e2e-aws-techpreview
5cd4832 to 3d550fe (Compare)
/test e2e-aws-techpreview
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1171-1178: The baseline vCPU snapshot is taken before AutoNode
convergence so it may include transient Karpenter vCPUs; change the order so you
call waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to
ensure Karpenter vCPUs are zero, then call getVCPUsMetric(t, ctx, mgtClient,
hostedCluster) to set baseline and assert found—update the code around
getVCPUsMetric and waitForAutoNodeStatusVCPUs to reflect this ordering and keep
the baseline variable logic the same.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: ad81a3ff-ac87-4709-8c6f-23548d604839
✅ Files skipped from review due to trivial changes (1)
- karpenter-operator/controllers/karpenter/karpenter_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- hypershift-operator/controllers/nodepool/metrics/metrics.go
- api/hypershift/v1beta1/hostedcluster_types.go
```go
// Read the current metric as baseline (native NodePool vCPUs).
baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)

// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
```
Take the billing baseline after AutoNode has converged to zero.
Line 1172 can still include transient Karpenter vCPUs from earlier provisioning tests, because the zero-vCPU wait only happens on Line 1177. That makes the later baseline+8 / baseline+4 assertions flaky.
Suggested fix

```diff
-    // Read the current metric as baseline (native NodePool vCPUs).
-    baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-    g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-    t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
     // Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
     waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+    // Now the billing metric baseline is native NodePool vCPUs only.
+    baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+    g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+    t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/karpenter_test.go` around lines 1171 - 1178, The baseline vCPU
snapshot is taken before AutoNode convergence so it may include transient
Karpenter vCPUs; change the order so you call waitForAutoNodeStatusVCPUs(t, ctx,
mgtClient, hostedCluster, 0) first to ensure Karpenter vCPUs are zero, then call
getVCPUsMetric(t, ctx, mgtClient, hostedCluster) to set baseline and assert
found—update the code around getVCPUsMetric and waitForAutoNodeStatusVCPUs to
reflect this ordering and keep the baseline variable logic the same.
For HyperShift NodePools we return

/lgtm

Scheduling tests matching the
♻️ Duplicate comments (1)

test/e2e/karpenter_test.go (1)

1249-1256: ⚠️ Potential issue | 🟡 Minor: Take the billing baseline after AutoNode has converged to zero.

The baseline vCPU snapshot is still taken before confirming `AutoNode.VCPUs == 0`. This ordering could include transient Karpenter vCPUs if a previous parallel test didn't fully clean up, making the `baseline+8` / `baseline+4` assertions flaky. Consider swapping the order:

```diff
-    // Read the current metric as baseline (native NodePool vCPUs).
-    baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-    g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-    t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
     // Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
     waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+    // Now the billing metric baseline is native NodePool vCPUs only.
+    baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+    g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+    t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/karpenter_test.go` around lines 1249 - 1256, The baseline vCPU read is taken before verifying AutoNode has converged to zero, which can capture transient Karpenter vCPUs; fix by swapping the calls so waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) runs first to ensure AutoNode.VCPUs == 0, then call getVCPUsMetric(t, ctx, mgtClient, hostedCluster) to set baseline, and use that baseline for subsequent assertions (update references to baseline, found, and the log message accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1249-1256: The baseline vCPU read is taken before verifying
AutoNode has converged to zero, which can capture transient Karpenter vCPUs; fix
by swapping the calls so waitForAutoNodeStatusVCPUs(t, ctx, mgtClient,
hostedCluster, 0) runs first to ensure AutoNode.VCPUs == 0, then call
getVCPUsMetric(t, ctx, mgtClient, hostedCluster) to set baseline, and use that
baseline for subsequent assertions (update references to baseline, found, and
the log message accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 84d04d1b-89bc-4c99-9625-ff88d874440b
🚧 Files skipped from review as they are similar to previous changes (2)
- api/hypershift/v1beta1/hostedcluster_types.go
- karpenter-operator/controllers/karpenter/karpenter_controller_test.go
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

/retest-required
…us billing metric

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

- NodeClaim is the authority for Karpenter ownership (no label dependency)
- e2e tests validate vCPU status + metric at 0, scale-up, and consolidation

Signed-off-by: Max Cao <macao@redhat.com>
Made-with: Cursor
Signed-off-by: Max Cao <macao@redhat.com>
e939a3e to e649cc4 (Compare)
/lgtm

Scheduling tests matching the
🧹 Nitpick comments (1)

test/e2e/karpenter_test.go (1)

1249-1256: Baseline read should happen after AutoNode convergence for robustness.

The baseline is captured before confirming `AutoNode.VCPUs == 0`. In a CI environment where tests may share clusters or where previous reconciliation loops haven't fully settled, the billing metric could still include transient Karpenter vCPUs from earlier operations. Moving the convergence check before the baseline read ensures a clean slate.

🔧 Suggested reordering

```diff
-    // Read the current metric as baseline (native NodePool vCPUs).
-    baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-    g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-    t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
     // Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
     waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+    // Now the billing metric baseline is native NodePool vCPUs only.
+    baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+    g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+    t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/karpenter_test.go` around lines 1249 - 1256, Move the AutoNode convergence check before reading the baseline metric: call waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to ensure AutoNode.VCPUs == 0, then call getVCPUsMetric(t, ctx, mgtClient, hostedCluster) to capture baseline; update any log messages that reference the baseline variable accordingly so the metric read reflects a clean slate without transient Karpenter vCPUs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1249-1256: Move the AutoNode convergence check before reading the
baseline metric: call waitForAutoNodeStatusVCPUs(t, ctx, mgtClient,
hostedCluster, 0) first to ensure AutoNode.VCPUs == 0, then call
getVCPUsMetric(t, ctx, mgtClient, hostedCluster) to capture baseline; update any
log messages that reference the baseline variable accordingly so the metric read
reflects a clean slate without transient Karpenter vCPUs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 9d60dbfe-c533-46e4-a0f9-a30e10fb4d37
🚧 Files skipped from review as they are similar to previous changes (2)
- api/hypershift/v1beta1/hostedcluster_types.go
- karpenter-operator/controllers/karpenter/karpenter_controller.go
/test e2e-aws-4-22
/verified by e2e
@maxcao13: This PR has been marked as verified by

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/unhold

/retest-required
@maxcao13: The following tests failed, say

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
I have all the evidence needed. Let me compile the final analysis. The key findings are:

Neither failure is related to the PR changes (Karpenter vCPU billing metrics).

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error Summary

This job has two independent failure chains, neither related to the PR changes (Karpenter vCPU billing metrics). TestCreateClusterCustomConfig (2 sub-failures) failed because the Azure client certificate used for KMS encryption in the

Root Cause

Primary Root Cause — Expired Azure KMS Client Certificate (TestCreateClusterCustomConfig, 2 failures): The Azure service principal certificate used by the HyperShift CI environment for Azure Key Vault (KMS) operations has expired. The certificate was valid from

This caused the HostedCluster

The

Secondary Root Cause — Pull Secret Propagation Timeout (TestCreateCluster/EnsureGlobalPullSecret, 4 failures): After updating the global pull secret on the HostedCluster, the test creates a pod that references a restricted container image (requiring the updated pull secret to pull). The pod remained in

Neither failure is caused by or related to PR #8265 (Karpenter vCPU billing metrics). The PR modifies

Recommendations

Evidence
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
What this PR does / why we need it:
Add `VCPUs` field to `AutoNodeStatus`, computed by karpenter-operator from `NodeClaim` capacity and cross-referenced against live Node objects. The metrics collector seeds per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and Karpenter NodePools exist directly in the guest cluster, so we cannot source this info directly from the hypershift nodepool metrics package using the existing code/clients, which only has access to the management cluster.
We need this because ROSA HCP uses the `hypershift_cluster_vcpus` metric for billing our customers. This metric does not correctly reflect the vCPUs for Nodes that are provisioned by Karpenter, which results in under-billing of customers that use Karpenter to provision Nodes within their cluster.
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615
Special notes for your reviewer:
Alternatively, we could have karpenter-operator source its own billing metric and ask ROSA to aggregate using this new exposed metric, but considering that we already expose information about nodes and nodeclaims in the `AutoNodeStatus`, I think it makes sense to be consistent here.

Made-with: Cursor
Checklist:
Summary by CodeRabbit
New Features
Tests