
NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled#8404

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from enxebre:enxebre/fix-karpenter-finalizer-disable
May 11, 2026

Conversation

@enxebre
Member

@enxebre enxebre commented May 4, 2026

Summary

When AutoNode is disabled before HostedCluster deletion, the karpenter-operator deployment is removed and cannot process the HCP karpenter finalizer. The resolveKarpenterFinalizer fallback previously early-returned when IsKarpenterEnabled was false, leaving the HCP stuck in terminating indefinitely.

Root cause

The resolveKarpenterFinalizer function had a guard:

if !karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) {
    return nil
}

This skipped the fallback entirely when AutoNode was disabled, even though the HCP still had the finalizer from when Karpenter was previously enabled.

Fix

Remove the early-return guard and instead use IsKarpenterEnabled to decide the cleanup strategy (sketched after this list):

  • AutoNode enabled: defer to the karpenter-operator for graceful cleanup — only force-remove the finalizer once the guest KAS is down
  • AutoNode disabled: remove the finalizer immediately since no controller exists to handle it
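
For reference, a minimal sketch of the reworked branch (mirroring the merged diff quoted in the review below; the surrounding function and the finalizer-removal call are elided):

// Sketch of the new decision logic in resolveKarpenterFinalizer, not verbatim source.
if karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) {
    // AutoNode enabled: the karpenter-operator owns graceful cleanup, so defer
    // to it while the guest KAS is still reachable.
    kasAvailable, err := isKASAvailable(ctx, hcp.Namespace, r.Client)
    if err != nil {
        return fmt.Errorf("failed to check KAS availability: %w", err)
    }
    if kasAvailable {
        return nil
    }
}
// AutoNode disabled, or KAS already down: no controller is left to process the
// finalizer, so fall through and force-remove it to unblock HCP deletion.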

Test plan

  • Existing TestResolveKarpenterFinalizer cases pass (renamed to Gherkin-style names)
  • New test: "When karpenter is disabled and HCP has a stale finalizer it should remove the finalizer immediately"
  • New test: "When karpenter is disabled and KAS is still available it should remove the finalizer immediately"

/cc @maxcao13 @jkyros

Summary by CodeRabbit

  • Bug Fixes
    • Resolved an issue where disabling node auto-scaling could cause HostedCluster deletion to hang indefinitely. The system now properly removes cleanup markers when auto-scaling is disabled, and only defers removal when enabled and the cluster is accessible.

…bled

When AutoNode is disabled before HostedCluster deletion, the
karpenter-operator deployment is removed and cannot process the
HCP karpenter finalizer. The resolveKarpenterFinalizer fallback
previously skipped when IsKarpenterEnabled returned false, leaving
the HCP stuck in terminating indefinitely.

Remove the early-return guard so the fallback always runs. When
AutoNode is still enabled, defer to the karpenter-operator for
graceful cleanup and only force-remove once the guest KAS is down.
When AutoNode is disabled, remove the finalizer immediately since
no controller exists to handle it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot requested review from jkyros and maxcao13 May 4, 2026 09:42
@openshift-ci-robot openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/valid-bug labels May 4, 2026
@openshift-ci-robot

@enxebre: This pull request references Jira Issue OCPBUGS-84368, which is valid.

3 validations were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this: the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

The changes modify the resolveKarpenterFinalizer function to alter when the karpenter finalizer is removed from a HostedCluster resource. Previously, the function returned early without taking action whenever AutoNode was disabled. Now, when AutoNode is disabled, the function immediately force-removes the karpenter finalizer to prevent deletion from getting stuck. When AutoNode is enabled, the function defers finalizer removal until confirming the guest Kubernetes API Server is unavailable. The test file was updated with improved descriptive naming for existing test cases and two new test scenarios validating immediate finalizer removal when Karpenter is disabled.

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Test Structure And Quality (⚠️ Warning): most assertions in the test file lack meaningful failure messages that would help diagnose failures. Resolution: add meaningful failure messages to all Gomega assertions to improve test debuggability.

✅ Passed checks (10 passed)

  • Description Check: skipped; CodeRabbit's high-level summary is enabled.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names: all test names in karpenter_test.go are stable and deterministic, with hardcoded static string literals and no dynamic values.
  • Microshift Test Compatibility: the PR only adds standard Go unit tests using the testing.T interface with table-driven cases and a fake Kubernetes client, not Ginkgo e2e tests.
  • Single Node Openshift (SNO) Test Compatibility: the PR modifies standard Go unit tests with table-driven patterns, not Ginkgo e2e tests; the file uses the testing package and Gomega assertions without Ginkgo DSL functions.
  • Topology-Aware Scheduling Compatibility: the changes only modify Karpenter finalizer removal logic, without affecting pod scheduling, affinity rules, or topology constraints.
  • OTE Binary Stdout Contract: modified files are standard controller and unit-test code without process-level entry points or OTE suite setup that could emit to stdout.
  • IPv6 And Disconnected Network Test Compatibility: tests use the standard Go testing package with mocked Kubernetes objects; no IPv4 assumptions, hardcoded IPs, external connectivity, DNS resolution, or image pulls detected.
  • Title check: the title clearly and specifically describes the main change, fixing karpenter finalizer resolution when AutoNode is disabled.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved and area/hypershift-operator labels and removed the do-not-merge/needs-area label May 4, 2026
@enxebre enxebre changed the title from "OCPBUGS-84368: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" to "fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" May 4, 2026
@openshift-ci-robot openshift-ci-robot removed the jira/severity-important and jira/valid-reference labels May 4, 2026
@openshift-ci-robot

@enxebre: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

Details

In response to this: the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/valid-bug label May 4, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dcbeace7-3390-4b14-9631-1f825a17f9b1

📥 Commits

Reviewing files that changed from the base of the PR and between 29053b7 and 96d72aa.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/karpenter.go
  • hypershift-operator/controllers/hostedcluster/karpenter_test.go

Comment on lines +143 to 156
// When AutoNode is still enabled, defer to the karpenter-operator for graceful
// cleanup — only force-remove the finalizer once the guest KAS is down and the
// operator can no longer reach its watches.
// When AutoNode is disabled, the karpenter-operator deployment is already gone,
// so there is no controller to process the finalizer and we must remove it immediately.
if karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) {
    kasAvailable, err := isKASAvailable(ctx, hcp.Namespace, r.Client)
    if err != nil {
        return fmt.Errorf("failed to check KAS availability: %w", err)
    }
    if kasAvailable {
        return nil
    }
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t treat Spec.AutoNode disabled as proof the operator is already gone.

Disable is asynchronous: reconcileAutoNodeEnabledCondition in this file already models a progressing state while the Karpenter deployments still exist. Force-removing the finalizer here as soon as hc.Spec.AutoNode is disabled can therefore skip the graceful deletion flow in karpenter-operator/controllers/karpenter/karpenter_controller.go:218-295, which is the code that cleans up NodePools/NodeClaims before dropping the finalizer. If a user disables AutoNode and then deletes the HostedCluster before teardown finishes, this can orphan those resources. Please gate the immediate-removal path on the operator actually being gone (or another “disable completed” signal), not on spec state alone.
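
One possible shape for that gate (a hypothetical sketch, not code from this PR; the Deployment name and the helper are assumptions):

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// isKarpenterOperatorGone reports whether the karpenter-operator Deployment has
// been removed from the control-plane namespace, so force-removal can be gated
// on the operator actually being absent rather than on spec state alone.
func isKarpenterOperatorGone(ctx context.Context, c client.Client, namespace string) (bool, error) {
    deploy := &appsv1.Deployment{}
    err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: "karpenter-operator"}, deploy) // name is an assumption
    if apierrors.IsNotFound(err) {
        return true, nil
    }
    if err != nil {
        return false, err
    }
    // Deployment still present; a disable-completed condition check could also go here.
    return false, nil
}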

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/karpenter.go` around lines
143-156: The code currently treats hc.Spec.AutoNode being false as proof the
karpenter operator is gone; instead gate immediate finalizer removal on the
operator actually being absent or on a “disable completed” signal. Update the
branch that uses karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) so that when
it returns false you verify operator termination (e.g., check for absence of
Karpenter deployment/replicasets or a disable-completed condition produced by
reconcileAutoNodeEnabledCondition) before removing the finalizer; reuse or add a
helper similar to isKASAvailable (or check the karpenter operator
Deployment/ReplicaSet status) and only return nil/remove finalizer once that
check confirms the operator is truly gone or the disable progression is marked
complete. Ensure references to hc.Spec.AutoNode,
karpenterutil.IsKarpenterEnabled, isKASAvailable and
reconcileAutoNodeEnabledCondition are used to locate and modify the logic.

@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.22%. Comparing base (29053b7) to head (96d72aa).
⚠️ Report is 97 commits behind head on main.

Files with missing lines | Patch % | Lines
...ft-operator/controllers/hostedcluster/karpenter.go | 62.50% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8404      +/-   ##
==========================================
- Coverage   37.22%   37.22%   -0.01%     
==========================================
  Files         750      750              
  Lines       91789    91787       -2     
==========================================
- Hits        34168    34166       -2     
  Misses      54981    54981              
  Partials     2640     2640              
Files with missing lines | Coverage Δ
...ft-operator/controllers/hostedcluster/karpenter.go | 62.92% <62.50%> (-0.42%) ⬇️

Flag | Coverage Δ
cmd-support | 32.06% <ø> (ø)
cpo-hostedcontrolplane | 36.45% <ø> (ø)
cpo-other | 37.73% <ø> (ø)
hypershift-operator | 47.83% <62.50%> (-0.01%) ⬇️
other | 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown.


@joshbranham
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label May 4, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented May 4, 2026

Test Results

e2e-aws

e2e-aks

@joshbranham
Contributor

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051359739064881152 | Cost: $4.70 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

Key findings from the analysis:

  1. Only one actual test failure: TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB (the 3 failures in the summary are cascading: subtest → parent → grandparent)

  2. Root cause is a race condition in the e2e test, NOT related to PR #8404's code changes:

    • The parallel provisioning subtests created Karpenter nodes (1 node per test, each with 4 vCPUs)
    • After cleanup, the AutoNode enable/disable lifecycle test disabled and re-enabled AutoNode
    • When AutoNode was disabled, HCP.Status.AutoNode (including VCPUs) was cleared to zero
    • When AutoNode was re-enabled (took 33s), the karpenter-operator came back and immediately observed the still-existing node from one of the parallel tests and reported VCPUs=4
    • The billing test then started and expected VCPUs=0 (before any Karpenter nodes provisioned), but found VCPUs=4 from the stale node
  3. PR #8404 changes only affect resolveKarpenterFinalizer, which is only called during HostedCluster deletion — a code path NOT exercised during this test

Test Failure Analysis Complete


Error

eventually.go:225: observed *v1beta1.HostedCluster e2e-clusters-b2qvf/karpenter-vlk49
  invalid at RV 140643 after 1m0s: AutoNode.VCPUs=4, want 0

karpenter_test.go:1245: Failed to wait for HostedCluster
  e2e-clusters-b2qvf/karpenter-vlk49 AutoNode.VCPUs=0 in 1m0s: context deadline exceeded

Summary

The test Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB failed because it expected AutoNode.VCPUs to be 0 before provisioning any Karpenter nodes, but observed VCPUs=4. This is a pre-existing race condition in the e2e test suite, not caused by PR #8404. The PR only modifies resolveKarpenterFinalizer, which is exclusively called during HostedCluster deletion — a code path not exercised during this test. The stale VCPUs value comes from a Karpenter node left over from the preceding parallel provisioning subtests that wasn't fully accounted for after the AutoNode disable/re-enable lifecycle test.

Root Cause

The failure is a timing race between the e2e test ordering and the karpenter-operator's status reconciliation, unrelated to PR #8404's changes.

Sequence of events:

  1. Parallel provisioning tests (e.g., Instance_profile_annotation_propagation, Capacity_reservation_selector_propagation) each create Karpenter NodePools and provision nodes (t3.xlarge = 4 vCPUs each). These tests clean up their nodes by deleting the NodePools, but cleanup is asynchronous.

  2. AutoNode enable/disable lifecycle test runs next:

    • Disables AutoNode → the reconcileKarpenterOperator function clears HCP.Status.AutoNode (VCPUs, NodeCount, etc.) to zero. Takes 3s.
    • Re-enables AutoNode → waits for AutoNodeEnabled=True/AsExpected. Takes 33s.
  3. During re-enablement, the karpenter-operator pod restarts and immediately scans the guest cluster for existing NodeClaims/nodes. If any Karpenter node from the parallel tests still exists (even briefly), the operator reports its vCPUs in HCP.Status.AutoNode.VCPUs. Since t3.xlarge has 4 vCPUs, the status shows VCPUs=4.

  4. Billing_vCPUs test starts immediately after and calls waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0). This polls HostedCluster.Status.AutoNode.VCPUs (propagated from HCP.Status.AutoNode via hcluster.Status.AutoNode = hcp.Status.AutoNode). It finds VCPUs=4 and times out after 60s.

Why PR #8404 is NOT the cause:

  • PR #8404 modifies resolveKarpenterFinalizer in karpenter.go, which removes the early return for disabled AutoNode so the function can clean up stale finalizers when AutoNode is disabled.
  • resolveKarpenterFinalizer is called only from the delete() method of the HostedCluster reconciler — a code path that runs during HostedCluster deletion, not during normal test execution.
  • None of the parallel provisioning tests, the lifecycle test, or the billing test trigger HostedCluster deletion. The HostedCluster remains alive throughout.
Recommendations

  1. This failure is unrelated to PR #8404 — the PR can be safely re-tested (/retest). The changes to resolveKarpenterFinalizer only affect HostedCluster deletion paths and have no bearing on this test failure.

  2. To fix the underlying flake, the e2e test should add an explicit wait between the AutoNode enable/disable lifecycle test and the Billing_vCPUs test to ensure VCPUs converges to 0 after re-enablement — or the lifecycle test should wait for VCPUs=0 before completing (see the sketch after this list). Alternatively, the parallel provisioning tests should use harder synchronization to confirm all Karpenter nodes are fully deprovisioned (not just NodePools deleted) before proceeding.

  3. Consider increasing the timeout in waitForAutoNodeStatusVCPUs from 60s to accommodate the time needed for the karpenter-operator to detect node removal and update the VCPUs status after a disable/re-enable cycle.
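
Recommendation 2 could look roughly like this (a hypothetical sketch; the helper name and the hypershift API import path are assumptions, and the shape of Status.AutoNode is inferred from the analysis above):

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"

    hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1" // assumed path
)

// waitForVCPUsZero blocks until the HostedCluster reports AutoNode.VCPUs == 0,
// giving the karpenter-operator time to converge after a disable/re-enable cycle.
func waitForVCPUsZero(ctx context.Context, c client.Client, key client.ObjectKey, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
        hc := &hyperv1.HostedCluster{}
        if err := c.Get(ctx, key, hc); err != nil {
            return false, err
        }
        // AutoNode assumed to be a pointer that is nil (or zeroed) once converged.
        return hc.Status.AutoNode == nil || hc.Status.AutoNode.VCPUs == 0, nil
    })
}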

Evidence

  • Failed test: TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB
  • Observed value: AutoNode.VCPUs=4 (expected 0) at ResourceVersion 140643
  • Timeout: 60s; the value never converged to 0
  • Preceding test: AutoNode enable/disable lifecycle, which disabled then re-enabled AutoNode in 39s total
  • Re-enable time: 33s for AutoNodeEnabled=True/AsExpected after re-enabling
  • Source of stale VCPUs: t3.xlarge node from the parallel provisioning tests (4 vCPUs per node)
  • PR #8404 scope: only modifies resolveKarpenterFinalizer, called exclusively in HostedClusterReconciler.delete()
  • HostedCluster deletion during test: no; the HostedCluster is alive throughout all subtests
  • All HC conditions: healthy (AutoNodeEnabled=True, Available=True, Degraded=False, all operators stable)
  • Other tests: 521 tests ran; 518 passed, 25 skipped, 3 failures (all from the same test, cascading)

@joshbranham
Contributor

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051416864063492096 | Cost: $3.60 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@joshbranham
Contributor

/retest-required

@joshbranham
Contributor

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified label May 11, 2026
@openshift-ci-robot

@joshbranham: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@joshbranham
Contributor

/retitle NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled

@openshift-ci openshift-ci Bot changed the title from "fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" to "NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" May 11, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label May 11, 2026
@openshift-ci-robot

@enxebre: This pull request explicitly references no jira issue.

Details

In response to this: the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@enxebre: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit deb4722 into openshift:main May 11, 2026
40 checks passed

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria.


4 participants