
AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265

Open
maxcao13 wants to merge 1 commit into openshift:main from maxcao13:karpenter-node-billing

Conversation

@maxcao13
Member

@maxcao13 maxcao13 commented Apr 16, 2026

What this PR does / why we need it:

Add a VCPUs field to AutoNodeStatus, computed by the karpenter-operator from NodeClaim capacity and cross-referenced against live Node objects. The metrics collector seeds the per-cluster vCPU count from this field before accumulating native NodePool vCPUs on top.

NodeClaims and Karpenter NodePools exist directly in the guest cluster, so we cannot source this information from the hypershift nodepool metrics package using the existing code/clients, which only have access to the management cluster.

We need this because ROSA HCP uses the hypershift_cluster_vcpus metric to bill our customers. This metric does not currently reflect the vCPUs of Nodes that are provisioned by Karpenter, which results in under-billing of customers that use Karpenter to provision Nodes within their cluster.
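The collector-side seeding described above can be sketched as follows. This is a minimal illustration, not HyperShift's actual API: seedClusterVCPUs and its parameters are hypothetical names standing in for the metrics collector's aggregation logic.

```go
package main

import "fmt"

// seedClusterVCPUs sketches the aggregation: start from the optional
// AutoNode.VCPUs status field when it is set, then add native NodePool
// vCPUs on top. A nil pointer means the karpenter-operator has not
// reported a value yet, so only native vCPUs are counted.
func seedClusterVCPUs(autoNodeVCPUs *int32, nativeNodePoolVCPUs []int32) int32 {
	var total int32
	if autoNodeVCPUs != nil {
		total = *autoNodeVCPUs // seed with Karpenter-managed vCPUs
	}
	for _, v := range nativeNodePoolVCPUs {
		total += v // accumulate native NodePool vCPUs
	}
	return total
}

func main() {
	karpenter := int32(8)
	fmt.Println(seedClusterVCPUs(&karpenter, []int32{4, 4})) // 16
	fmt.Println(seedClusterVCPUs(nil, []int32{4, 4}))        // 8
}
```

The pointer type matters here: treating nil as zero would make "not reported" indistinguishable from "zero Karpenter nodes" in the billed total.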

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/AUTOSCALE-615

Special notes for your reviewer:

Alternatively, we could have the karpenter-operator expose its own billing metric and ask ROSA to aggregate using that new metric, but since we already expose information about Nodes and NodeClaims in AutoNodeStatus, I think it makes sense to be consistent here.

Made-with: Cursor

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • HostedCluster status now includes an optional AutoNode vCPU count; metrics now seed and aggregate AutoNode plus native node vCPUs for improved billing and capacity visibility.
    • Karpenter now sums live NodeClaim CPU capacity when computing AutoNode vCPUs and reacts to NodeClaim changes that affect CPU or assignment.
  • Tests

    • Added unit and e2e tests covering vCPU aggregation, error reporting, metric convergence, and billing/vCPU validation during provisioning and consolidation.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 16, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Apr 16, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds vCPU capacity tracking for Karpenter-managed AutoNodes via a new optional VCPUs *int32 field on HostedCluster.Status.AutoNode. The guest-side Karpenter controller now watches NodeClaim create/delete events, as well as updates that change CPU capacity or Status.NodeName; it filters NodeClaims to those whose Status.NodeName corresponds to a live corev1.Node, sums their CPU capacity, and writes the total to Status.AutoNode.VCPUs. The nodepool metrics collector seeds per-cluster vCPU totals from AutoNode.VCPUs (when set), aggregates native NodePool vCPUs, and emits a computation-error metric on lookup failures. Unit and e2e tests validate summation, nil behavior, lookup errors, metric emission, and billing integration.
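The summation step above can be sketched as a small Go function. This is an illustration under simplified types: NodeClaimInfo and its fields are hypothetical stand-ins for karpenterv1.NodeClaim, whose real helper reads NodeClaim.Status.Capacity[corev1.ResourceCPU] and cross-references corev1.Node objects.

```go
package main

import "fmt"

// NodeClaimInfo is a simplified stand-in for a Karpenter NodeClaim,
// carrying only the two fields the summation needs.
type NodeClaimInfo struct {
	NodeName string // mirrors NodeClaim.Status.NodeName
	CPU      int64  // mirrors Capacity[corev1.ResourceCPU].Value()
}

// sumNodeClaimVCPUs counts vCPUs only for NodeClaims whose NodeName
// matches a live Node, using a precomputed set for O(1) lookups.
func sumNodeClaimVCPUs(nodeClaims []NodeClaimInfo, liveNodeNames []string) int32 {
	live := make(map[string]struct{}, len(liveNodeNames))
	for _, n := range liveNodeNames {
		live[n] = struct{}{}
	}
	var total int64
	for _, nc := range nodeClaims {
		if _, ok := live[nc.NodeName]; !ok {
			continue // no backing Node yet (or already gone); skip
		}
		total += nc.CPU
	}
	return int32(total)
}

func main() {
	claims := []NodeClaimInfo{
		{NodeName: "node-a", CPU: 4},
		{NodeName: "node-b", CPU: 8},
		{NodeName: "", CPU: 16}, // not yet bound to a Node
	}
	fmt.Println(sumNodeClaimVCPUs(claims, []string{"node-a", "node-b"})) // 12
}
```

Filtering against live Nodes keeps the billed total honest during provisioning and consolidation, when NodeClaims can briefly exist without a backing Node.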

Sequence Diagram(s)

sequenceDiagram
    participant KC as Karpenter Controller
    participant NC as NodeClaim Resources
    participant N as corev1.Node List
    participant HCP as HostedCluster (Status.AutoNode)
    participant MC as Metrics Collector
    participant Prom as Prometheus

    Note over KC,NC: NodeClaim watch + vCPU computation
    KC->>NC: Receive create/update/delete events
    NC-->>KC: NodeClaim.Status.Capacity[CPU], Status.NodeName
    KC->>N: Query list of Nodes (verify backing node exists)
    N-->>KC: List of node names

    Note over KC: Sum only NodeClaims with NodeName in Nodes
    KC->>KC: sumNodeClaimVCPUs(nodeClaims, liveNodes)

    Note over KC,HCP: Status update
    KC->>HCP: Update Status.AutoNode.VCPUs with computed sum
    HCP-->>HCP: Persist AutoNode.VCPUs

    Note over MC,Prom: Metrics aggregation
    MC->>HCP: Read Status.AutoNode.VCPUs
    MC->>MC: Seed hclusterData.vCpusCount with AutoNode.VCPUs (if present)
    MC->>MC: Add native NodePool vCPUs (from instance-type lookup)
    MC->>Prom: Emit hypershift_cluster_vcpus and error metrics
    Prom-->>Prom: Store/serve metrics
🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 26.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Single Node OpenShift (SNO) Test Compatibility — ⚠️ Warning: the testBillingConsolidationAndPDB test uses pod anti-affinity requiring different nodes, which is incompatible with Single Node OpenShift deployments that have only one node. Resolution: add the [Skipped:SingleReplicaTopology] label to the test, or guard with runtime topology checks using exutil.IsSingleNode() to skip on SNO clusters.
  • IPv6 And Disconnected Network Test Compatibility — ⚠️ Warning: the new testBillingConsolidationAndPDB e2e test pulls the container image quay.io/openshift/origin-pod:4.22.0 from the public Quay.io registry, which will fail in IPv6-only disconnected CI environments without internet access; the test lacks the [Skipped:Disconnected] annotation. Resolution: add the [Skipped:Disconnected] annotation to the test name to skip it on disconnected clusters, or use an internally available container image instead of quay.io.
✅ Passed checks (9 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and specifically describes the main change, adding Karpenter node vCPUs to the billing metric, which is the core objective of this multi-file changeset.
  • Linked Issues Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: all test names in the pull request are stable and deterministic, using static, descriptive strings with no dynamic information.
  • Test Structure And Quality — ✅ Passed: tests demonstrate solid Ginkgo and Go testing practices with proper timeout specifications, meaningful assertions, and cleanup patterns.
  • MicroShift Test Compatibility — ✅ Passed: the new e2e tests in test/e2e/karpenter_test.go are protected by a platform guard that prevents execution on MicroShift.
  • Topology-Aware Scheduling Compatibility — ✅ Passed: the pull request adds vCPU tracking for Karpenter nodes through API types and metrics without introducing pod scheduling constraints incompatible with SNO, Two-Node, or HyperShift topologies.
  • OTE Binary Stdout Contract — ✅ Passed: the PR's e2e test code uses the crzap logger (which outputs to stderr by default) for all process-level logging, with no direct fmt.Print calls to stdout.


Comment @coderabbitai help to get the list of available commands and usage tips.

@maxcao13
Member Author

/test e2e-aws-techpreview

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 16, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
karpenter-operator/controllers/karpenter/karpenter_controller.go (1)

398-414: Precompute the live-node set before summing vCPUs.

This helper is currently quadratic because every NodeClaim re-scans the full node list. On larger clusters that turns each Node/NodeClaim event into an avoidable hot path in the billing reconcile loop.

♻️ Suggested refactor
 func sumNodeClaimVCPUs(nodeClaims []karpenterv1.NodeClaim, allNodes []corev1.Node) int32 {
+	liveNodes := make(map[string]struct{}, len(allNodes))
+	for i := range allNodes {
+		liveNodes[allNodes[i].Name] = struct{}{}
+	}
+
 	var total int64
 	for i := range nodeClaims {
 		nc := &nodeClaims[i]
-		if !slices.ContainsFunc(allNodes, func(n corev1.Node) bool {
-			return n.Name == nc.Status.NodeName
-		}) {
+		if _, ok := liveNodes[nc.Status.NodeName]; !ok {
 			continue
 		}
 		if cpu, ok := nc.Status.Capacity[corev1.ResourceCPU]; ok {
 			total += cpu.Value()
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@karpenter-operator/controllers/karpenter/karpenter_controller.go` around
lines 398 - 414, The sumNodeClaimVCPUs helper is O(N*M) because it calls
slices.ContainsFunc(allNodes, ...) for each NodeClaim; precompute a set of live
node names from allNodes (e.g., map[string]struct{} or map[string]bool) once at
the start of sumNodeClaimVCPUs, then iterate nodeClaims and do O(1) lookups
against that map (check nc.Status.NodeName exists) before summing
nc.Status.Capacity[corev1.ResourceCPU].Value(); update references in
sumNodeClaimVCPUs accordingly to use the precomputed map instead of
slices.ContainsFunc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1368-1421: The predicate in e2eutil.EventuallyObject currently
treats a nil hc.Status.AutoNode.VCPUs as zero which can falsely pass; change the
predicate (inside the anonymous func passed to e2eutil.EventuallyObject that
reads hc.Status.AutoNode.VCPUs) to explicitly require the pointer be non-nil
before considering the value (return false if nil), and only succeed when
*hc.Status.AutoNode.VCPUs == expectedVCPUs. Likewise, tighten the metric check
in the wait.PollUntilContextTimeout loop (the loop that reads
npmetrics.VCpusCountByHClusterMetricName) to match the metric series by both the
"name" label and the "namespace" label (require l.GetName()=="name" &&
l.GetValue()==hostedCluster.Name and also require a label "namespace" with value
hostedCluster.Namespace) so the correct series is selected.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 41073511-5238-4fd7-b7a0-bb8022ab9509

📥 Commits

Reviewing files that changed from the base of the PR and between 533aad6 and 29df349.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/metrics/metrics_test.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • test/e2e/karpenter_test.go

Comment thread test/e2e/karpenter_test.go
@codecov

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 43.58974% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.04%. Comparing base (0f75517) to head (e649cc4).

Files with missing lines Patch % Lines
...ator/controllers/karpenter/karpenter_controller.go 31.25% 22 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8265   +/-   ##
=======================================
  Coverage   36.04%   36.04%           
=======================================
  Files         767      767           
  Lines       93422    93458   +36     
=======================================
+ Hits        33674    33689   +15     
- Misses      57040    57061   +21     
  Partials     2708     2708           
Files with missing lines Coverage Δ
...t-operator/controllers/nodepool/metrics/metrics.go 70.20% <100.00%> (+0.43%) ⬆️
...ator/controllers/karpenter/karpenter_controller.go 26.76% <31.25%> (+0.55%) ⬆️

@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from 29df349 to 5cd4832 Compare April 16, 2026 22:22
@maxcao13
Member Author

/test e2e-aws-techpreview

@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


@maxcao13 maxcao13 marked this pull request as ready for review April 17, 2026 16:53
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2026
@openshift-ci openshift-ci Bot requested review from bryan-cox and jparrill April 17, 2026 16:55
@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from 5cd4832 to 3d550fe Compare April 17, 2026 16:59
@maxcao13
Member Author

/test e2e-aws-techpreview

@openshift-ci-robot

openshift-ci-robot commented Apr 17, 2026

@maxcao13: This pull request references AUTOSCALE-615 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Around line 1171-1178: The baseline vCPU snapshot is taken before AutoNode
convergence so it may include transient Karpenter vCPUs; change the order so you
call waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to
ensure Karpenter vCPUs are zero, then call getVCPUsMetric(t, ctx, mgtClient,
hostedCluster) to set baseline and assert found—update the code around
getVCPUsMetric and waitForAutoNodeStatusVCPUs to reflect this ordering and keep
the baseline variable logic the same.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ad81a3ff-ac87-4709-8c6f-23548d604839

📥 Commits

Reviewing files that changed from the base of the PR and between 5cd4832 and 3d550fe.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/metrics/metrics_test.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • test/e2e/karpenter_test.go
✅ Files skipped from review due to trivial changes (1)
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • api/hypershift/v1beta1/hostedcluster_types.go

Comment on lines +1171 to +1178
// Read the current metric as baseline (native NodePool vCPUs).
baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)

// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)

Contributor


⚠️ Potential issue | 🟠 Major

Take the billing baseline after AutoNode has converged to zero.

Line 1172 can still include transient Karpenter vCPUs from earlier provisioning tests, because the zero-vCPU wait only happens on Line 1177. That makes the later baseline+8 / baseline+4 assertions flaky.

Suggested fix
-		// Read the current metric as baseline (native NodePool vCPUs).
-		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
 		// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
 		waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+		// Now the billing metric baseline is native NodePool vCPUs only.
+		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 1171 - 1178, The baseline vCPU
snapshot is taken before AutoNode convergence so it may include transient
Karpenter vCPUs; change the order so you call waitForAutoNodeStatusVCPUs(t, ctx,
mgtClient, hostedCluster, 0) first to ensure Karpenter vCPUs are zero, then call
getVCPUsMetric(t, ctx, mgtClient, hostedCluster) to set baseline and assert
found—update the code around getVCPUsMetric and waitForAutoNodeStatusVCPUs to
reflect this ordering and keep the baseline variable logic the same.

@joshbranham
Contributor

For HyperShift NodePools we return -1 if we hit an error calculating a vCPU count. We are not doing that here, as proposed, but I think that is fine since that error/calculation happens downstream in karpenter.

/lgtm
/test e2e-aws-techpreview
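The -1 sentinel mentioned above is a common pattern for surfacing calculation errors in a gauge instead of silently reporting 0. A minimal sketch, with hypothetical function and instance-type names (not the actual HyperShift metrics code):

```go
package main

import (
	"errors"
	"fmt"
)

// lookupInstanceVCPUs is a hypothetical stand-in for the cloud-provider
// instance-type lookup that native NodePool metrics perform.
func lookupInstanceVCPUs(instanceType string) (int32, error) {
	known := map[string]int32{"m5.xlarge": 4, "m5.2xlarge": 8}
	if v, ok := known[instanceType]; ok {
		return v, nil
	}
	return 0, errors.New("unknown instance type: " + instanceType)
}

// nodePoolVCPUs returns -1 when the count cannot be computed, so the
// metric distinguishes "error" from a genuine zero.
func nodePoolVCPUs(instanceType string, replicas int32) int32 {
	perNode, err := lookupInstanceVCPUs(instanceType)
	if err != nil {
		return -1
	}
	return perNode * replicas
}

func main() {
	fmt.Println(nodePoolVCPUs("m5.xlarge", 3))  // 12
	fmt.Println(nodePoolVCPUs("t0.unknown", 3)) // -1
}
```

For Karpenter nodes this sentinel is unnecessary, since the vCPU figure comes from NodeClaim capacity that Karpenter has already resolved downstream.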

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
test/e2e/karpenter_test.go (1)

1249-1256: ⚠️ Potential issue | 🟡 Minor

Take the billing baseline after AutoNode has converged to zero.

The baseline vCPU snapshot is still taken before confirming AutoNode.VCPUs == 0. This ordering could include transient Karpenter vCPUs if a previous parallel test didn't fully clean up, making baseline+8 / baseline+4 assertions flaky.

Consider swapping the order:

-		// Read the current metric as baseline (native NodePool vCPUs).
-		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
 		// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
 		waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+		// Now the billing metric baseline is native NodePool vCPUs only.
+		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 1249 - 1256, The baseline vCPU read
is taken before verifying AutoNode has converged to zero, which can capture
transient Karpenter vCPUs; fix by swapping the calls so
waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) runs first to
ensure AutoNode.VCPUs == 0, then call getVCPUsMetric(t, ctx, mgtClient,
hostedCluster) to set baseline, and use that baseline for subsequent assertions
(update references to baseline, found, and the log message accordingly).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 84d04d1b-89bc-4c99-9625-ff88d874440b

📥 Commits

Reviewing files that changed from the base of the PR and between 3d550fe and e939a3e.

⛔ Files ignored due to path filters (12)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedcontrolplanes.hypershift.openshift.io/AutoNodeKarpenter.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/autonodestatus.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedcontrolplanes-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2046977908567707648 | Cost: $2.2540942999999993 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@maxcao13
Member Author

/retest-required

…us billing metric

Add VCPUs field to AutoNodeStatus, computed by karpenter-operator from
NodeClaim capacity and cross-referenced against live Node objects.
The metrics collector seeds per-cluster vCPU count from this field
before accumulating native NodePool vCPUs on top.

- NodeClaim is the authority for Karpenter ownership (no label dependency)
- e2e tests validate vCPU status + metric at 0, scale-up, and consolidation

Signed-off-by: Max Cao <macao@redhat.com>
Made-with: Cursor
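The seeding order described in the commit message can be sketched as follows. The types here are simplified, hypothetical stand-ins (the real `AutoNodeStatus` lives in `api/hypershift/v1beta1` and the real collector works with full HostedCluster and NodePool objects):

```go
package main

import "fmt"

// AutoNodeStatus is a simplified stand-in: VCPUs is the count of
// Karpenter-provisioned node vCPUs, written by karpenter-operator.
type AutoNodeStatus struct {
	VCPUs int64
}

// NodePool is a simplified stand-in for a native HyperShift NodePool.
type NodePool struct {
	Name  string
	VCPUs int64
}

// clusterVCPUs seeds the per-cluster total from AutoNodeStatus, then
// accumulates native NodePool vCPUs on top, mirroring the collector
// behavior this PR describes.
func clusterVCPUs(autoNode *AutoNodeStatus, nodePools []NodePool) int64 {
	var total int64
	if autoNode != nil {
		total = autoNode.VCPUs // seed with Karpenter-provisioned capacity
	}
	for _, np := range nodePools {
		total += np.VCPUs
	}
	return total
}

func main() {
	auto := &AutoNodeStatus{VCPUs: 8}
	pools := []NodePool{{Name: "workers", VCPUs: 12}}
	fmt.Println(clusterVCPUs(auto, pools)) // 20
}
```

With a nil or zero AutoNodeStatus the total degrades to the pre-PR behavior of summing native NodePool vCPUs only.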
@maxcao13 maxcao13 force-pushed the karpenter-node-billing branch from e939a3e to e649cc4 Compare April 22, 2026 20:20
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2026
@joshbranham
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
test/e2e/karpenter_test.go (1)

1249-1256: Baseline read should happen after AutoNode convergence for robustness.

The baseline is captured before confirming AutoNode.VCPUs == 0. In a CI environment where tests may share clusters or where previous reconciliation loops haven't fully settled, the billing metric could still include transient Karpenter vCPUs from earlier operations. Moving the convergence check before the baseline read ensures a clean slate.

🔧 Suggested reordering
-		// Read the current metric as baseline (native NodePool vCPUs).
-		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
-		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
-		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
-
 		// Before any Karpenter nodes are provisioned, Karpenter vCPUs should be 0.
 		waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0)
+
+		// Now the billing metric baseline is native NodePool vCPUs only.
+		baseline, found := getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
+		g.Expect(found).To(BeTrue(), "billing metric should exist before Karpenter nodes are provisioned")
+		t.Logf("Baseline billing metric vCPUs from native NodePools: %d", baseline)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/karpenter_test.go` around lines 1249 - 1256, Move the AutoNode
convergence check before reading the baseline metric: call
waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0) first to ensure
AutoNode.VCPUs == 0, then call getVCPUsMetric(t, ctx, mgtClient, hostedCluster)
to capture baseline; update any log messages that reference the baseline
variable accordingly so the metric read reflects a clean slate without transient
Karpenter vCPUs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 9d60dbfe-c533-46e4-a0f9-a30e10fb4d37

📥 Commits

Reviewing files that changed from the base of the PR and between e939a3e and e649cc4.

🚧 Files skipped from review as they are similar to previous changes (2)
  • api/hypershift/v1beta1/hostedcluster_types.go
  • karpenter-operator/controllers/karpenter/karpenter_controller.go

@maxcao13
Member Author

/test e2e-aws-4-22

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2047048197989208064 | Cost: $2.40282805 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2047048197783687168 | Cost: $2.09886635 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2047048197838213120 | Cost: $2.4782874499999994 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@maxcao13
Member Author

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 23, 2026
@openshift-ci-robot

@maxcao13: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@maxcao13
Member Author

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2026
@maxcao13
Member Author

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2047180009780547584 | Cost: $1.8436635000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2047180009998651392 | Cost: $2.8597483499999994 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2047180010095120384 | Cost: $2.72117325 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

@maxcao13: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-aws-4-22 (commit e649cc4, required): /test e2e-aws-4-22
  • ci/prow/e2e-azure-self-managed (commit e649cc4, required): /test e2e-azure-self-managed
  • ci/prow/e2e-aks (commit e649cc4, required): /test e2e-aks
  • ci/prow/e2e-aws (commit e649cc4, required): /test e2e-aws
  • ci/prow/e2e-aks-4-22 (commit e649cc4, required): /test e2e-aks-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6daa9ce and 2 for PR HEAD e649cc4 in total

@hypershift-jira-solve-ci

Key findings:

  1. TestCreateClusterCustomConfig (and its sub-test ValidateHostedCluster): Failed because an Azure client certificate used for KMS (Key Management Service) encryption expired on 2026-04-22 15:47:58 UTC — one day before the job ran on 2026-04-23. The app ID is a05a2f7a-26e7-44f9-8785-effb6b4acaa9, the Azure error is AADSTS700027. This caused the HostedCluster condition ValidAzureKMSConfig=False, which prevented kube-apiserver from becoming available, which cascaded into total control plane unavailability.

  2. TestCreateCluster/Main/EnsureGlobalPullSecret (and its sub-tests): A pod using a restricted image stayed in Pending phase for 420s after the global pull secret was updated. This is a known flaky test behavior — the pod's image pull requires node-level credential refresh after the pull secret update, which sometimes takes longer than the timeout allows.

Neither failure is related to the PR changes (Karpenter vCPU billing metrics).

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aks-4-22
  • Build ID: 2047216697617682432
  • Target: e2e-aks-4-22
  • State: failure
  • Started: 2026-04-23T07:31:10Z
  • Completed: 2026-04-23T09:22:14Z
  • Result: 305 tests, 38 skipped, 6 failures

Test Failure Analysis

Error

TestCreateClusterCustomConfig: AADSTS700027: The certificate with identifier used to sign
the client assertion is expired on application. [Reason - The key used is expired.,
Found key 'Start=04/22/2025 15:47:58, End=04/22/2026 15:47:58']
App Id: a05a2f7a-26e7-44f9-8785-effb6b4acaa9

TestCreateCluster/Main/EnsureGlobalPullSecret: Timed out after 420.000s.
pod is not running yet, current phase: Pending

Summary

This job has two independent failure chains, neither related to the PR changes (Karpenter vCPU billing metrics). TestCreateClusterCustomConfig (2 sub-failures) failed because the Azure client certificate used for KMS encryption in the hypershift-aks CI credential profile expired on 2026-04-22 — one day before this job ran on 2026-04-23. This caused the ValidAzureKMSConfig condition to be False, preventing kube-apiserver startup and cascading into full control plane unavailability. TestCreateCluster/Main/EnsureGlobalPullSecret (4 sub-failures) failed because a pod using a restricted image remained stuck in Pending phase for 420s after a global pull secret update, a known flaky behavior where node-level credential refresh does not complete within the test timeout.

Root Cause

Primary Root Cause — Expired Azure KMS Client Certificate (TestCreateClusterCustomConfig, 2 failures):

The Azure service principal certificate used by the HyperShift CI environment for Azure Key Vault (KMS) operations has expired. The certificate was valid from 2025-04-22 15:47:58 to 2026-04-22 15:47:58 and expired approximately 16 hours before this job started. The Azure AD error AADSTS700027 confirms this — the client assertion (certificate-based authentication) to Azure AD tenant 520cf09d-78ff-44ed-a731-abd623e73b09 for app a05a2f7a-26e7-44f9-8785-effb6b4acaa9 was rejected.

This caused the HostedCluster custom-config-lkm8l condition ValidAzureKMSConfig=False, which meant the kube-apiserver could not start (because it depends on KMS for secret encryption). With the apiserver unavailable, the cascading failures were:

  • KubeAPIServerAvailable=False
  • Available=False
  • Degraded=True (capi-provider and kube-apiserver deployments had unavailable replicas)
  • IgnitionEndpointAvailable=False
  • All CVO conditions remained Unknown

The ValidateHostedCluster sub-test then failed when trying to connect to the guest API server, receiving repeated TLS handshake timeouts and EOF errors for 10 minutes before timing out.

Secondary Root Cause — Pull Secret Propagation Timeout (TestCreateCluster/EnsureGlobalPullSecret, 4 failures):

After updating the global pull secret on the HostedCluster, the test creates a pod that references a restricted container image (requiring the updated pull secret to pull). The pod remained in Pending phase for the full 420s timeout. Notably, the konnectivity-agent DaemonSet went from 2/2 ready to 0/2 ready during the pull secret update and took several minutes to recover to 2/2. This indicates the pull secret update triggered node-level disruption. The restricted-image pod was likely unable to pull its image because the updated credentials had not been fully propagated to the node's container runtime within the timeout window.

Neither failure is caused by or related to PR #8265 (Karpenter vCPU billing metrics). The PR modifies AutoNodeStatus, metrics collection code, and Karpenter controller logic — none of which interact with Azure KMS configuration or pull secret propagation.

Recommendations
  1. Rotate the expired Azure KMS client certificate immediately. The certificate for app ID a05a2f7a-26e7-44f9-8785-effb6b4acaa9 in tenant 520cf09d-78ff-44ed-a731-abd623e73b09 expired on 2026-04-22. Until renewed, all e2e-aks jobs using TestCreateClusterCustomConfig (which tests Azure KMS encryption) will fail. Update the certificate in the hypershift-aks CI cluster credential profile.

  2. Set up certificate expiry monitoring/alerting for Azure service principal certificates used in CI credential profiles to prevent recurrence. Consider automating rotation or at minimum creating calendar reminders 30 days before expiry.

  3. Re-run the e2e-aks-4-22 job after the certificate is rotated. The TestCreateClusterCustomConfig failure will resolve. The EnsureGlobalPullSecret pod-stuck-in-Pending failure is a known flaky behavior and should pass on retry.

  4. The PR changes can be considered safe to merge from a CI perspective — neither failure is caused by the PR. A /retest after certificate rotation should confirm this.

Evidence

  • Expired Certificate: Azure AD error AADSTS700027; key valid Start=04/22/2025 15:47:58, End=04/22/2026 15:47:58; job ran 2026-04-23
  • Azure App: ID a05a2f7a-26e7-44f9-8785-effb6b4acaa9 in tenant 520cf09d-78ff-44ed-a731-abd623e73b09
  • KMS Condition: ValidAzureKMSConfig=False, AzureError(failed to encrypt data using KMS)
  • Cascading Failures: KubeAPIServerAvailable=False, Available=False, Degraded=True (capi-provider 2 unavailable, kube-apiserver 1 unavailable)
  • API Server Timeout: ValidateHostedCluster received repeated TLS handshake timeout and EOF for 10m, then context deadline exceeded
  • Pod Stuck Pending: global-pull-secret-success-pod in kube-system stayed Pending for 420s after pull secret update
  • Konnectivity Disruption: konnectivity-agent DaemonSet went 0/2 ready after pull secret update, recovered after several minutes
  • PR Relevance: PR #8265 modifies AutoNodeStatus, metrics, and the Karpenter controller; no overlap with Azure KMS or pull secret logic
  • Test Results: 261 passed, 6 failed (2 independent chains), 38 skipped out of 305 tests

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2047216697802231808 | Cost: $1.8541339499999998 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2047216697697374208 | Cost: $3.42728865 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
