
AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265

Open
maxcao13 wants to merge 1 commit into openshift:main from maxcao13:karpenter-node-billing

Conversation

@maxcao13
Member

@maxcao13 maxcao13 commented Apr 16, 2026

What this PR does / why we need it:

Add a VCPUs field to AutoNodeStatus, computed by the karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds the per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and Karpenter NodePools exist directly in the guest cluster, so we cannot source this information from the hypershift nodepool metrics package using the existing code/clients, which only have access to the management cluster.

We need this because ROSA HCP uses the hypershift_cluster_vcpus metric to bill our customers. This metric does not currently reflect the vCPUs of Nodes that are provisioned by Karpenter, which results in under-billing of customers that use Karpenter to provision Nodes within their cluster.
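The collector-side seeding described above can be sketched as follows. This is a minimal illustration, not HyperShift's actual API: seedClusterVCPUs and its parameters are hypothetical names standing in for the metrics collector's aggregation logic.

```go
package main

import "fmt"

// seedClusterVCPUs sketches the aggregation: start from the optional
// AutoNode.VCPUs status field when it is set, then add native NodePool
// vCPUs on top. A nil pointer means the karpenter-operator has not
// reported a value yet, so only native vCPUs are counted.
func seedClusterVCPUs(autoNodeVCPUs *int32, nativeNodePoolVCPUs []int32) int32 {
	var total int32
	if autoNodeVCPUs != nil {
		total = *autoNodeVCPUs // seed with Karpenter-managed vCPUs
	}
	for _, v := range nativeNodePoolVCPUs {
		total += v // accumulate native NodePool vCPUs
	}
	return total
}

func main() {
	karpenter := int32(8)
	fmt.Println(seedClusterVCPUs(&karpenter, []int32{4, 4})) // 16
	fmt.Println(seedClusterVCPUs(nil, []int32{4, 4}))        // 8
}
```

The pointer type matters here: treating nil as zero would make "not reported" indistinguishable from "zero Karpenter nodes" in the billed total.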

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615

Special notes for your reviewer:

Alternatively, we could have the karpenter-operator expose its own billing metric and ask ROSA to aggregate using that new metric, but since we already expose information about Nodes and NodeClaims in AutoNodeStatus, I think it makes sense to be consistent here.

Made-with: Cursor

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • HostedCluster status now includes an optional AutoNode vCPU count; metrics now seed and aggregate AutoNode plus native node vCPUs for improved billing and capacity visibility.
    • Karpenter now sums live NodeClaim CPU capacity when computing AutoNode vCPUs and reacts to NodeClaim changes that affect CPU or assignment.
  • Tests

    • Added unit and e2e tests covering vCPU aggregation, error reporting, metric convergence, and billing/vCPU validation during provisioning and consolidation.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 16, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Apr 16, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds vCPU capacity tracking for Karpenter-managed AutoNodes via a new optional VCPUs *int32 field on HostedCluster.Status.AutoNode. The guest-side Karpenter controller now watches NodeClaim create/delete events, as well as updates that change CPU capacity or Status.NodeName; it filters NodeClaims to those whose Status.NodeName corresponds to a live corev1.Node, sums their CPU capacity, and writes the total to Status.AutoNode.VCPUs. The nodepool metrics collector seeds per-cluster vCPU totals from AutoNode.VCPUs (when set), aggregates native NodePool vCPUs, and emits a computation-error metric on lookup failures. Unit and e2e tests validate summation, nil behavior, lookup errors, metric emission, and billing integration.
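The summation step above can be sketched as a small Go function. This is an illustration under simplified types: NodeClaimInfo and its fields are hypothetical stand-ins for karpenterv1.NodeClaim, whose real helper reads NodeClaim.Status.Capacity[corev1.ResourceCPU] and cross-references corev1.Node objects.

```go
package main

import "fmt"

// NodeClaimInfo is a simplified stand-in for a Karpenter NodeClaim,
// carrying only the two fields the summation needs.
type NodeClaimInfo struct {
	NodeName string // mirrors NodeClaim.Status.NodeName
	CPU      int64  // mirrors Capacity[corev1.ResourceCPU].Value()
}

// sumNodeClaimVCPUs counts vCPUs only for NodeClaims whose NodeName
// matches a live Node, using a precomputed set for O(1) lookups.
func sumNodeClaimVCPUs(nodeClaims []NodeClaimInfo, liveNodeNames []string) int32 {
	live := make(map[string]struct{}, len(liveNodeNames))
	for _, n := range liveNodeNames {
		live[n] = struct{}{}
	}
	var total int64
	for _, nc := range nodeClaims {
		if _, ok := live[nc.NodeName]; !ok {
			continue // no backing Node yet (or already gone); skip
		}
		total += nc.CPU
	}
	return int32(total)
}

func main() {
	claims := []NodeClaimInfo{
		{NodeName: "node-a", CPU: 4},
		{NodeName: "node-b", CPU: 8},
		{NodeName: "", CPU: 16}, // not yet bound to a Node
	}
	fmt.Println(sumNodeClaimVCPUs(claims, []string{"node-a", "node-b"})) // 12
}
```

Filtering against live Nodes keeps the billed total honest during provisioning and consolidation, when NodeClaims can briefly exist without a backing Node.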

Sequence Diagram(s)

sequenceDiagram
    participant KC as Karpenter Controller
    participant NC as NodeClaim Resources
    participant N as corev1.Node List
    participant HCP as HostedCluster (Status.AutoNode)
    participant MC as Metrics Collector
    participant Prom as Prometheus

    Note over KC,NC: NodeClaim watch + vCPU computation
    KC->>NC: Receive create/update/delete events
    NC-->>KC: NodeClaim.Status.Capacity[CPU], Status.NodeName
    KC->>N: Query list of Nodes (verify backing node exists)
    N-->>KC: List of node names

    Note over KC: Sum only NodeClaims with NodeName in Nodes
    KC->>KC: sumNodeClaimVCPUs(nodeClaims, liveNodes)

    Note over KC,HCP: Status update
    KC->>HCP: Update Status.AutoNode.VCPUs with computed sum
    HCP-->>HCP: Persist AutoNode.VCPUs

    Note over MC,Prom: Metrics aggregation
    MC->>HCP: Read Status.AutoNode.VCPUs
    MC->>MC: Seed hclusterData.vCpusCount with AutoNode.VCPUs (if present)
    MC->>MC: Add native NodePool vCPUs (from instance-type lookup)
    MC->>Prom: Emit hypershift_cluster_vcpus and error metrics
    Prom-->>Prom: Store/serve metrics
🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 26.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Single Node OpenShift (SNO) Test Compatibility — ⚠️ Warning: the testBillingConsolidationAndPDB test uses pod anti-affinity requiring different nodes, which is incompatible with Single Node OpenShift deployments that have only one node. Resolution: add the [Skipped:SingleReplicaTopology] label to the test, or guard with runtime topology checks using exutil.IsSingleNode() to skip on SNO clusters.
  • IPv6 And Disconnected Network Test Compatibility — ⚠️ Warning: the new testBillingConsolidationAndPDB e2e test pulls the container image quay.io/openshift/origin-pod:4.22.0 from the public Quay.io registry, which will fail in IPv6-only disconnected CI environments without internet access; the test lacks the [Skipped:Disconnected] annotation. Resolution: add the [Skipped:Disconnected] annotation to the test name to skip it on disconnected clusters, or use an internally available container image instead of quay.io.
✅ Passed checks (9 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and specifically describes the main change, adding Karpenter node vCPUs to the billing metric, which is the core objective of this multi-file changeset.
  • Linked Issues Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: all test names in the pull request are stable and deterministic, using static, descriptive strings with no dynamic information.
  • Test Structure And Quality — ✅ Passed: tests demonstrate solid Ginkgo and Go testing practices with proper timeout specifications, meaningful assertions, and cleanup patterns.
  • MicroShift Test Compatibility — ✅ Passed: the new e2e tests in test/e2e/karpenter_test.go are protected by a platform guard that prevents execution on MicroShift.
  • Topology-Aware Scheduling Compatibility — ✅ Passed: the pull request adds vCPU tracking for Karpenter nodes through API types and metrics without introducing pod scheduling constraints incompatible with SNO, Two-Node, or HyperShift topologies.
  • OTE Binary Stdout Contract — ✅ Passed: the PR's e2e test code uses the crzap logger (which outputs to stderr by default) for all process-level logging, with no direct fmt.Print calls to stdout.


Comment @coderabbitai help to get the list of available commands and usage tips.

@maxcao13
Member Author

/test e2e-aws-techpreview

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 16, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
karpenter-operator/controllers/karpenter/karpenter_controller.go (1)

398-414: Precompute the live-node set before summing vCPUs.

This helper is currently quadratic because every NodeClaim re-scans the full node list. On larger clusters that turns each Node/NodeClaim event into an avoidable hot path in the billing reconcile loop.

♻️ Suggested refactor
 func sumNodeClaimVCPUs(nodeClaims []karpenterv1.NodeClaim, allNodes []corev1.Node) int32 {
+	liveNodes := make(map[string]struct{}, len(allNodes))
+	for i := range allNodes {
+		liveNodes[allNodes[i].Name] = struct{}{}
+	}
+
 	var total int64
 	for i := range nodeClaims {
 		nc := &nodeClaims[i]
-		if !slices.ContainsFunc(allNodes, func(n corev1.Node) bool {
-			return n.Name == nc.Status.NodeName
-		}) {
+		if _, ok := liveNodes[nc.Status.NodeName]; !ok {
 			continue
 		}
 		if cpu, ok := nc.Status.Capacity[corev1.ResourceCPU]; ok {
 			total += cpu.Value()
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@karpenter-operator/controllers/karpenter/karpenter_controller.go` around
lines 398 - 414, The sumNodeClaimVCPUs helper is O(N*M) because it calls
slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live
node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at
the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups
against that map (check nc.Status.NodeName exists) before summing
nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in
sumNodeClaimVCPUs accordingly to use the precomputed map instead of
slices.ContainsFunc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1368-1421: The predicate in e2eutil.EventuallyObject currently
treats a nil hc.Status.AutoNode.VCPUs as zero which can falsely pass; change the
predicate (inside the anonymous func passed to e2eutil.EventuallyObject that
reads hc.Status.AutoNode.VCPUs) to explicitly require the pointer be non-nil
before considering the value (return false if nil), and only succeed when
*hc.Status.AutoNode.VCPUs == expectedVCPUs. Likewise, tighten the metric check
in the wait.PollUntilContextTimeout loop (the loop that reads
npmetrics.VCpusCountByHClusterMetricName) to match the metric series by both the
"name" label and the "namespace" label (require l.GetName()=="name" &&
l.GetValue()==hostedCluster.Name and also require a label "namespace" with value
hostedCluster.Namespace) so the correct series is selected.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 41073511-5238-4fd7-b7a0-bb8022ab9509

📥 Commits

Reviewing files that changed from the base of the PR and between 533aad6 and 29df349.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/metrics/metrics_test.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • test/e2e/karpenter_test.go

Comment thread test/e2e/karpenter_test.go
@codecov

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 43.58974% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.04%. Comparing base (0f75517) to head (e649cc4).

Files with missing lines Patch % Lines
...ator/controllers/karpenter/karpenter_controller.go 31.25% 22 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8265   +/-   ##
=======================================
  Coverage   36.04%   36.04%           
=======================================
  Files         767      767           
  Lines       93422    93458   +36     
=======================================
+ Hits        33674    33689   +15     
- Misses      57040    57061   +21     
  Partials     2708     2708           
Files with missing lines Coverage Δ
...t-operator/controllers/nodepool/metrics/metrics.go 70.20% <100.00%> (+0.43%) ⬆️
...ator/controllers/karpenter/karpenter_controller.go 26.76% <31.25%> (+0.55%) ⬆️

@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from 29df349 to 5cd4832 Compare April 16, 2026 22:22
@maxcao13
Member Author

/test e2e-aws-techpreview

@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


@maxcao13 maxcao13 marked this pull request as ready for review April 17, 2026 16:53
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2026
@openshift-ci openshift-ci Bot requested review from bryan-cox and jparrill April 17, 2026 16:55
@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from 5cd4832 to 3d550fe Compare April 17, 2026 16:59
@maxcao13
Member Author

/test e2e-aws-techpreview

@openshift-ci-robot

openshift-ci-robot commented Apr 17, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1171-1178: The baseline vCPU snapshot is taken before AutoNode
convergence so it may include transient Karpenter vCPUs; change the order so you
call waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to
ensure Karpenter vCPUs are zero, then call getVCPUsMetric(t, ctx, mgtClient,
hostedCluster) to set baseline and assert found—update the code around
getVCPUsMetric and waitForAutoNodeStatusVCPUs to reflect this ordering and keep
the baseline variable logic the same.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ad81a3ff-ac87-4709-8c6f-23548d604839

📥 Commits

Reviewing files that changed from the base of the PR and between 5cd4832 and 3d550fe.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/metrics/metrics_test.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • test/e2e/karpenter_test.go
✅ Files skipped from review due to trivial changes (1)
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • api/hypershift/v1beta1/hostedcluster_types.go

Comment on lines +1171 to +1178
// Read the current metric as baseline (native NodePool vCPUs).
baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)

// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)

Contributor


⚠️ Potential issue | 🟠 Major

Take the billing baseline after AutoNode has converged to zero.

Line 1172 can still include transient Karpenter vCPUs from earlier provisioning tests, because the zero-vCPU wait only happens on Line 1177. That makes the later baseline+8 / baseline+4 assertions flaky.

Suggested fix
-		// Read the current metric as baseline (native NodePool vCPUs).
-		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
 		// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
 		waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+		// Now the billing metric baseline is native NodePool vCPUs only.
+		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 1171 - 1178, The baseline vCPU
snapshot is taken before AutoNode convergence so it may include transient
Karpenter vCPUs; change the order so you call waitForAutoNodeStatusVCPUs(t, ctx,
mgtClient, hostedCluster, 0) first to ensure Karpenter vCPUs are zero, then call
getVCPUsMetric(t, ctx, mgtClient, hostedCluster) to set baseline and assert
found—update the code around getVCPUsMetric and waitForAutoNodeStatusVCPUs to
reflect this ordering and keep the baseline variable logic the same.

@joshbranham
Contributor

For HyperShift NodePools we return -1 if we hit an error calculating a vCPU count. We are not doing that here, as proposed, but I think that is fine since that error/calculation happens downstream in karpenter.

/lgtm
/test e2e-aws-techpreview
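The -1 sentinel mentioned above is a common pattern for surfacing calculation errors in a gauge instead of silently reporting 0. A minimal sketch, with hypothetical function and instance-type names (not the actual HyperShift metrics code):

```go
package main

import (
	"errors"
	"fmt"
)

// lookupInstanceVCPUs is a hypothetical stand-in for the cloud-provider
// instance-type lookup that native NodePool metrics perform.
func lookupInstanceVCPUs(instanceType string) (int32, error) {
	known := map[string]int32{"m5.xlarge": 4, "m5.2xlarge": 8}
	if v, ok := known[instanceType]; ok {
		return v, nil
	}
	return 0, errors.New("unknown instance type: " + instanceType)
}

// nodePoolVCPUs returns -1 when the count cannot be computed, so the
// metric distinguishes "error" from a genuine zero.
func nodePoolVCPUs(instanceType string, replicas int32) int32 {
	perNode, err := lookupInstanceVCPUs(instanceType)
	if err != nil {
		return -1
	}
	return perNode * replicas
}

func main() {
	fmt.Println(nodePoolVCPUs("m5.xlarge", 3))  // 12
	fmt.Println(nodePoolVCPUs("t0.unknown", 3)) // -1
}
```

For Karpenter nodes this sentinel is unnecessary, since the vCPU figure comes from NodeClaim capacity that Karpenter has already resolved downstream.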

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
test/e2e/karpenter_test.go (1)

1249-1256: ⚠️ Potential issue | 🟡 Minor

Take the billing baseline after AutoNode has converged to zero.

The baseline vCPU snapshot is still taken before confirming AutoNode.VCPUs == 0. This ordering could include transient Karpenter vCPUs if a previous parallel test didn't fully clean up, making baseline+8 / baseline+4 assertions flaky.

Consider swapping the order:

-		// Read the current metric as baseline (native NodePool vCPUs).
-		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
 		// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
 		waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+		// Now the billing metric baseline is native NodePool vCPUs only.
+		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 1249 - 1256, The baseline vCPU read
is taken before verifying AutoNode has converged to zero, which can capture
transient Karpenter vCPUs; fix by swapping the calls so
waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) runs first to
ensure AutoNode.VCPUs == 0, then call getVCPUsMetric(t, ctx, mgtClient,
hostedCluster) to set baseline, and use that baseline for subsequent assertions
(update references to baseline, found, and the log message accordingly).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 84d04d1b-89bc-4c99-9625-ff88d874440b

📥 Commits

Reviewing files that changed from the base of the PR and between 3d550fe and e939a3e.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2046977908567707648 | Cost: $2.2540942999999993 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@maxcao13
Member Author

/retest-required

…us billing metric

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from
NodeClaim capacity and cross-referenced against live Node objects.
The metrics collector seeds per-cluster vCPU count from this field
before accumulating native NodePool vCPUs on top.

- NodeClaim is the authority for Karpenter ownership (no label dependency)
- e2e tests validate vCPU status + metric at 0, scale-up, and consolidation

Signed-off-by: Max Cao <macao@redhat.com>
Made-with: Cursor
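The seeding order described in the commit message can be sketched as follows. The types here are simplified, hypothetical stand-ins (the real `AutoNodeStatus` lives in `api/hypershift/v1beta1` and the real collector works with full HostedCluster and NodePool objects):

```go
package main

import "fmt"

// AutoNodeStatus is a simplified stand-in: VCPUs is the count of
// Karpenter-provisioned node vCPUs, written by karpenter-operator.
type AutoNodeStatus struct {
	VCPUs int64
}

// NodePool is a simplified stand-in for a native HyperShift NodePool.
type NodePool struct {
	Name  string
	VCPUs int64
}

// clusterVCPUs seeds the per-cluster total from AutoNodeStatus, then
// accumulates native NodePool vCPUs on top, mirroring the collector
// behavior this PR describes.
func clusterVCPUs(autoNode *AutoNodeStatus, nodePools []NodePool) int64 {
	var total int64
	if autoNode != nil {
		total = autoNode.VCPUs // seed with Karpenter-provisioned capacity
	}
	for _, np := range nodePools {
		total += np.VCPUs
	}
	return total
}

func main() {
	auto := &AutoNodeStatus{VCPUs: 8}
	pools := []NodePool{{Name: "workers", VCPUs: 12}}
	fmt.Println(clusterVCPUs(auto, pools)) // 20
}
```

With a nil or zero AutoNodeStatus the total degrades to the pre-PR behavior of summing native NodePool vCPUs only.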
@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from e939a3e to e649cc4 Compare April 22, 2026 20:20
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2026
@joshbranham
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
test/e2e/karpenter_test.go (1)

1249-1256: Baseline read should happen after AutoNode convergence for robustness.

The baseline is captured before confirming AutoNode.VCPUs == 0. In a CI environment where tests may share clusters or where previous reconciliation loops haven't fully settled, the billing metric could still include transient Karpenter vCPUs from earlier operations. Moving the convergence check before the baseline read ensures a clean slate.

🔧 Suggested reordering
-		// Read the current metric as baseline (native NodePool vCPUs).
-		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
 		// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
 		waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+		// Now the billing metric baseline is native NodePool vCPUs only.
+		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 1249 - 1256, Move the AutoNode
convergence check before reading the baseline metric: call
waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to ensure
AutoNode.VCPUs == 0, then call getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
to capture baseline; update any log messages that reference the baseline
variable accordingly so the metric read reflects a clean slate without transient
Karpenter vCPUs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 9d60dbfe-c533-46e4-a0f9-a30e10fb4d37

📥 Commits

Reviewing files that changed from the base of the PR and between e939a3e and e649cc4.

🚧 Files skipped from review as they are similar to previous changes (2)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go

@maxcao13
Member Author

/test e2e-aws-4-22

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2047048197989208064 | Cost: $2.40282805 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2047048197783687168 | Cost: $2.09886635 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2047048197838213120 | Cost: $2.4782874499999994 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@maxcao13
Member Author

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 23, 2026
@openshift-ci-robot

@maxcao13: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@maxcao13
Member Author

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2026
@maxcao13
Member Author

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2047180009780547584 | Cost: $1.8436635000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2047180009998651392 | Cost: $2.8597483499999994 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2047180010095120384 | Cost: $2.72117325 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

@maxcao13: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-aws-4-22 (commit e649cc4, required): /test e2e-aws-4-22
  • ci/prow/e2e-azure-self-managed (commit e649cc4, required): /test e2e-azure-self-managed
  • ci/prow/e2e-aks (commit e649cc4, required): /test e2e-aks
  • ci/prow/e2e-aws (commit e649cc4, required): /test e2e-aws
  • ci/prow/e2e-aks-4-22 (commit e649cc4, required): /test e2e-aks-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6daa9ce and 2 for PR HEAD e649cc4 in total

@hypershift-jira-solve-ci

Key findings:

  1. TestCreateClusterCustomConfig (and its sub-test ValidateHostedCluster): Failed because an Azure client certificate used for KMS (Key Management Service) encryption expired on 2026-04-22 15:47:58 UTC — one day before the job ran on 2026-04-23. The app ID is a05a2f7a-26e7-44f9-8785-effb6b4acaa9, the Azure error is AADSTS700027. This caused the HostedCluster condition ValidAzureKMSConfig=False, which prevented kube-apiserver from becoming available, which cascaded into total control plane unavailability.

  2. TestCreateCluster/Main/EnsureGlobalPullSecret (and its sub-tests): A pod using a restricted image stayed in Pending phase for 420s after the global pull secret was updated. This is a known flaky test behavior — the pod's image pull requires node-level credential refresh after the pull secret update, which sometimes takes longer than the timeout allows.

Neither failure is related to the PR changes (Karpenter vCPU billing metrics).

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aks-4-22
  • Build ID: 2047216697617682432
  • Target: e2e-aks-4-22
  • State: failure
  • Started: 2026-04-23T07:31:10Z
  • Completed: 2026-04-23T09:22:14Z
  • Result: 305 tests, 38 skipped, 6 failures

Test Failure Analysis

Error

TestCreateClusterCustomConfig: AADSTS700027: The certificate with identifier used to sign
the client assertion is expired on application. [Reason - The key used is expired.,
Found key 'Start=04/22/2025 15:47:58, End=04/22/2026 15:47:58']
App Id: a05a2f7a-26e7-44f9-8785-effb6b4acaa9

TestCreateCluster/Main/EnsureGlobalPullSecret: Timed out after 420.000s.
pod is not running yet, current phase: Pending

Summary

This job has two independent failure chains, neither related to the PR changes (Karpenter vCPU billing metrics). TestCreateClusterCustomConfig (2 sub-failures) failed because the Azure client certificate used for KMS encryption in the hypershift-aks CI credential profile expired on 2026-04-22 — one day before this job ran on 2026-04-23. This caused the ValidAzureKMSConfig condition to be False, preventing kube-apiserver startup and cascading into full control plane unavailability. TestCreateCluster/Main/EnsureGlobalPullSecret (4 sub-failures) failed because a pod using a restricted image remained stuck in Pending phase for 420s after a global pull secret update, a known flaky behavior where node-level credential refresh does not complete within the test timeout.

Root Cause

Primary Root Cause — Expired Azure KMS Client Certificate (TestCreateClusterCustomConfig, 2 failures):

The Azure service principal certificate used by the HyperShift CI environment for Azure Key Vault (KMS) operations has expired. The certificate was valid from 2025-04-22 15:47:58 to 2026-04-22 15:47:58 and expired approximately 16 hours before this job started. The Azure AD error AADSTS700027 confirms this — the client assertion (certificate-based authentication) to Azure AD tenant 520cf09d-78ff-44ed-a731-abd623e73b09 for app a05a2f7a-26e7-44f9-8785-effb6b4acaa9 was rejected.

This caused the HostedCluster custom-config-lkm8l condition ValidAzureKMSConfig=False, which meant the kube-apiserver could not start (because it depends on KMS for secret encryption). With the apiserver unavailable, the cascading failures were:

  • KubeAPIServerAvailable=False
  • Available=False
  • Degraded=True (capi-provider and kube-apiserver deployments had unavailable replicas)
  • IgnitionEndpointAvailable=False
  • All CVO conditions remained Unknown

The ValidateHostedCluster sub-test then failed when trying to connect to the guest API server, receiving repeated TLS handshake timeouts and EOF errors for 10 minutes before timing out.

Secondary Root Cause — Pull Secret Propagation Timeout (TestCreateCluster/EnsureGlobalPullSecret, 4 failures):

After updating the global pull secret on the HostedCluster, the test creates a pod that references a restricted container image (requiring the updated pull secret to pull). The pod remained in Pending phase for the full 420s timeout. Notably, the konnectivity-agent DaemonSet went from 2/2 ready to 0/2 ready during the pull secret update and took several minutes to recover to 2/2. This indicates the pull secret update triggered node-level disruption. The restricted-image pod was likely unable to pull its image because the updated credentials had not been fully propagated to the node's container runtime within the timeout window.

Neither failure is caused by or related to PR #8265 (Karpenter vCPU billing metrics). The PR modifies AutoNodeStatus, metrics collection code, and Karpenter controller logic — none of which interact with Azure KMS configuration or pull secret propagation.

Recommendations
  1. Rotate the expired Azure KMS client certificate immediately. The certificate for app ID a05a2f7a-26e7-44f9-8785-effb6b4acaa9 in tenant 520cf09d-78ff-44ed-a731-abd623e73b09 expired on 2026-04-22. Until renewed, all e2e-aks jobs using TestCreateClusterCustomConfig (which tests Azure KMS encryption) will fail. Update the certificate in the hypershift-aks CI cluster credential profile.

  2. Set up certificate expiry monitoring/alerting for Azure service principal certificates used in CI credential profiles to prevent recurrence. Consider automating rotation or at minimum creating calendar reminders 30 days before expiry.

  3. Re-run the e2e-aks-4-22 job after the certificate is rotated. The TestCreateClusterCustomConfig failure will resolve. The EnsureGlobalPullSecret pod-stuck-in-Pending failure is a known flaky behavior and should pass on retry.

  4. The PR changes can be considered safe to merge from a CI perspective — neither failure is caused by the PR. A /retest after certificate rotation should confirm this.

Evidence

  • Expired Certificate: Azure AD error AADSTS700027; key valid Start=04/22/2025 15:47:58, End=04/22/2026 15:47:58; job ran 2026-04-23
  • Azure App: ID a05a2f7a-26e7-44f9-8785-effb6b4acaa9 in tenant 520cf09d-78ff-44ed-a731-abd623e73b09
  • KMS Condition: ValidAzureKMSConfig=False, AzureError(failed to encrypt data using KMS)
  • Cascading Failures: KubeAPIServerAvailable=False, Available=False, Degraded=True (capi-provider 2 unavailable, kube-apiserver 1 unavailable)
  • API Server Timeout: ValidateHostedCluster received repeated TLS handshake timeout and EOF for 10m, then context deadline exceeded
  • Pod Stuck Pending: global-pull-secret-success-pod in kube-system stayed Pending for 420s after pull secret update
  • Konnectivity Disruption: konnectivity-agent DaemonSet went 0/2 ready after pull secret update, recovered after several minutes
  • PR Relevance: PR #8265 modifies AutoNodeStatus, metrics, and the Karpenter controller; no overlap with Azure KMS or pull secret logic
  • Test Results: 261 passed, 6 failed (2 independent chains), 38 skipped out of 305 tests

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2047216697802231808 | Cost: $1.8541339499999998 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2047216697697374208 | Cost: $3.42728865 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
