
NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled#8404

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from enxebre:enxebre/fix-karpenter-finalizer-disable
May 11, 2026

Conversation

@enxebre
Member

@enxebre enxebre commented May 4, 2026

Summary

When AutoNode is disabled before HostedCluster deletion, the karpenter-operator deployment is removed and cannot process the HCP karpenter finalizer. The resolveKarpenterFinalizer fallback previously early-returned when IsKarpenterEnabled was false, leaving the HCP stuck in terminating indefinitely.

Root cause

The resolveKarpenterFinalizer function had a guard:

if !karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) {
    return nil
}

This skipped the fallback entirely when AutoNode was disabled, even though the HCP still had the finalizer from when Karpenter was previously enabled.

Fix

Remove the early-return guard and instead use IsKarpenterEnabled to decide the cleanup strategy (sketched after this list):

  • AutoNode enabled: defer to the karpenter-operator for graceful cleanup — only force-remove the finalizer once the guest KAS is down
  • AutoNode disabled: remove the finalizer immediately since no controller exists to handle it
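
For reference, a minimal sketch of the reworked branch (mirroring the merged diff quoted in the review below; the surrounding function and the finalizer-removal call are elided):

// Sketch of the new decision logic in resolveKarpenterFinalizer, not verbatim source.
if karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) {
    // AutoNode enabled: the karpenter-operator owns graceful cleanup, so defer
    // to it while the guest KAS is still reachable.
    kasAvailable, err := isKASAvailable(ctx, hcp.Namespace, r.Client)
    if err != nil {
        return fmt.Errorf("failed to check KAS availability: %w", err)
    }
    if kasAvailable {
        return nil
    }
}
// AutoNode disabled, or KAS already down: no controller is left to process the
// finalizer, so fall through and force-remove it to unblock HCP deletion.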

Test plan

  • Existing TestResolveKarpenterFinalizer cases pass (renamed to Gherkin-style names)
  • New test: "When karpenter is disabled and HCP has a stale finalizer it should remove the finalizer immediately"
  • New test: "When karpenter is disabled and KAS is still available it should remove the finalizer immediately"

/cc @maxcao13 @jkyros

Summary by CodeRabbit

  • Bug Fixes
    • Resolved an issue where disabling node auto-scaling could cause HostedCluster deletion to hang indefinitely. The system now properly removes cleanup markers when auto-scaling is disabled, and only defers removal when enabled and the cluster is accessible.

…bled

When AutoNode is disabled before HostedCluster deletion, the
karpenter-operator deployment is removed and cannot process the
HCP karpenter finalizer. The resolveKarpenterFinalizer fallback
previously skipped when IsKarpenterEnabled returned false, leaving
the HCP stuck in terminating indefinitely.

Remove the early-return guard so the fallback always runs. When
AutoNode is still enabled, defer to the karpenter-operator for
graceful cleanup and only force-remove once the guest KAS is down.
When AutoNode is disabled, remove the finalizer immediately since
no controller exists to handle it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot requested review from jkyros and maxcao13 May 4, 2026 09:42
@openshift-ci-robot openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/valid-bug labels May 4, 2026
@openshift-ci-robot

@enxebre: This pull request references Jira Issue OCPBUGS-84368, which is valid.

3 validations were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this: the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

The changes modify the resolveKarpenterFinalizer function to alter when the karpenter finalizer is removed from a HostedCluster resource. Previously, the function returned early without taking action whenever AutoNode was disabled. Now, when AutoNode is disabled, the function immediately force-removes the karpenter finalizer to prevent deletion from getting stuck. When AutoNode is enabled, the function defers finalizer removal until confirming the guest Kubernetes API Server is unavailable. The test file was updated with improved descriptive naming for existing test cases and two new test scenarios validating immediate finalizer removal when Karpenter is disabled.

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Test Structure And Quality (⚠️ Warning): most assertions in the test file lack meaningful failure messages that would help diagnose failures. Resolution: add meaningful failure messages to all Gomega assertions to improve test debuggability.

✅ Passed checks (10 passed)

  • Description Check: skipped; CodeRabbit's high-level summary is enabled.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names: all test names in karpenter_test.go are stable and deterministic, with hardcoded static string literals and no dynamic values.
  • Microshift Test Compatibility: the PR only adds standard Go unit tests using the testing.T interface with table-driven cases and a fake Kubernetes client, not Ginkgo e2e tests.
  • Single Node Openshift (SNO) Test Compatibility: the PR modifies standard Go unit tests with table-driven patterns, not Ginkgo e2e tests; the file uses the testing package and Gomega assertions without Ginkgo DSL functions.
  • Topology-Aware Scheduling Compatibility: the changes only modify Karpenter finalizer removal logic, without affecting pod scheduling, affinity rules, or topology constraints.
  • OTE Binary Stdout Contract: modified files are standard controller and unit-test code without process-level entry points or OTE suite setup that could emit to stdout.
  • IPv6 And Disconnected Network Test Compatibility: tests use the standard Go testing package with mocked Kubernetes objects; no IPv4 assumptions, hardcoded IPs, external connectivity, DNS resolution, or image pulls detected.
  • Title check: the title clearly and specifically describes the main change, fixing karpenter finalizer resolution when AutoNode is disabled.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved and area/hypershift-operator labels and removed the do-not-merge/needs-area label May 4, 2026
@enxebre enxebre changed the title from "OCPBUGS-84368: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" to "fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" May 4, 2026
@openshift-ci-robot openshift-ci-robot removed the jira/severity-important and jira/valid-reference labels May 4, 2026
@openshift-ci-robot

@enxebre: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

Details

In response to this: the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/valid-bug label May 4, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dcbeace7-3390-4b14-9631-1f825a17f9b1

📥 Commits

Reviewing files that changed from the base of the PR and between 29053b7 and 96d72aa.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/karpenter.go
  • hypershift-operator/controllers/hostedcluster/karpenter_test.go

Comment on lines +143 to 156
// When AutoNode is still enabled, defer to the karpenter-operator for graceful
// cleanup — only force-remove the finalizer once the guest KAS is down and the
// operator can no longer reach its watches.
// When AutoNode is disabled, the karpenter-operator deployment is already gone,
// so there is no controller to process the finalizer and we must remove it immediately.
if karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) {
    kasAvailable, err := isKASAvailable(ctx, hcp.Namespace, r.Client)
    if err != nil {
        return fmt.Errorf("failed to check KAS availability: %w", err)
    }
    if kasAvailable {
        return nil
    }
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t treat Spec.AutoNode disabled as proof the operator is already gone.

Disable is asynchronous: reconcileAutoNodeEnabledCondition in this file already models a progressing state while the Karpenter deployments still exist. Force-removing the finalizer here as soon as hc.Spec.AutoNode is disabled can therefore skip the graceful deletion flow in karpenter-operator/controllers/karpenter/karpenter_controller.go:218-295, which is the code that cleans up NodePools/NodeClaims before dropping the finalizer. If a user disables AutoNode and then deletes the HostedCluster before teardown finishes, this can orphan those resources. Please gate the immediate-removal path on the operator actually being gone (or another “disable completed” signal), not on spec state alone.
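
One possible shape for that gate (a hypothetical sketch, not code from this PR; the Deployment name and the helper are assumptions):

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// isKarpenterOperatorGone reports whether the karpenter-operator Deployment has
// been removed from the control-plane namespace, so force-removal can be gated
// on the operator actually being absent rather than on spec state alone.
func isKarpenterOperatorGone(ctx context.Context, c client.Client, namespace string) (bool, error) {
    deploy := &appsv1.Deployment{}
    err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: "karpenter-operator"}, deploy) // name is an assumption
    if apierrors.IsNotFound(err) {
        return true, nil
    }
    if err != nil {
        return false, err
    }
    // Deployment still present; a disable-completed condition check could also go here.
    return false, nil
}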

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/karpenter.go` around lines
143-156: The code currently treats hc.Spec.AutoNode being false as proof the
karpenter operator is gone; instead gate immediate finalizer removal on the
operator actually being absent or on a “disable completed” signal. Update the
branch that uses karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) so that when
it returns false you verify operator termination (e.g., check for absence of
Karpenter deployment/replicasets or a disable-completed condition produced by
reconcileAutoNodeEnabledCondition) before removing the finalizer; reuse or add a
helper similar to isKASAvailable (or check the karpenter operator
Deployment/ReplicaSet status) and only return nil/remove finalizer once that
check confirms the operator is truly gone or the disable progression is marked
complete. Ensure references to hc.Spec.AutoNode,
karpenterutil.IsKarpenterEnabled, isKASAvailable and
reconcileAutoNodeEnabledCondition are used to locate and modify the logic.

@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.22%. Comparing base (29053b7) to head (96d72aa).
⚠️ Report is 97 commits behind head on main.

Files with missing lines | Patch % | Lines
...ft-operator/controllers/hostedcluster/karpenter.go | 62.50% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8404      +/-   ##
==========================================
- Coverage   37.22%   37.22%   -0.01%     
==========================================
  Files         750      750              
  Lines       91789    91787       -2     
==========================================
- Hits        34168    34166       -2     
  Misses      54981    54981              
  Partials     2640     2640              
Files with missing lines | Coverage Δ
...ft-operator/controllers/hostedcluster/karpenter.go | 62.92% <62.50%> (-0.42%) ⬇️

Flag | Coverage Δ
cmd-support | 32.06% <ø> (ø)
cpo-hostedcontrolplane | 36.45% <ø> (ø)
cpo-other | 37.73% <ø> (ø)
hypershift-operator | 47.83% <62.50%> (-0.01%) ⬇️
other | 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown.


@joshbranham
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label May 4, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented May 4, 2026

Test Results

e2e-aws

e2e-aks

@joshbranham
Contributor

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051359739064881152 | Cost: $4.70 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

Key findings from the analysis:

  1. Only one actual test failure: TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB (the 3 failures in the summary are cascading: subtest → parent → grandparent)

  2. Root cause is a race condition in the e2e test, NOT related to PR #8404's code changes:

    • The parallel provisioning subtests created Karpenter nodes (1 node per test, each with 4 vCPUs)
    • After cleanup, the AutoNode enable/disable lifecycle test disabled and re-enabled AutoNode
    • When AutoNode was disabled, HCP.Status.AutoNode (including VCPUs) was cleared to zero
    • When AutoNode was re-enabled (took 33s), the karpenter-operator came back and immediately observed the still-existing node from one of the parallel tests and reported VCPUs=4
    • The billing test then started and expected VCPUs=0 (before any Karpenter nodes provisioned), but found VCPUs=4 from the stale node
  3. PR #8404 changes only affect resolveKarpenterFinalizer, which is only called during HostedCluster deletion — a code path NOT exercised during this test

Test Failure Analysis Complete


Error

eventually.go:225: observed *v1beta1.HostedCluster e2e-clusters-b2qvf/karpenter-vlk49
  invalid at RV 140643 after 1m0s: AutoNode.VCPUs=4, want 0

karpenter_test.go:1245: Failed to wait for HostedCluster
  e2e-clusters-b2qvf/karpenter-vlk49 AutoNode.VCPUs=0 in 1m0s: context deadline exceeded

Summary

The test Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB failed because it expected AutoNode.VCPUs to be 0 before provisioning any Karpenter nodes, but observed VCPUs=4. This is a pre-existing race condition in the e2e test suite, not caused by PR #8404. The PR only modifies resolveKarpenterFinalizer, which is exclusively called during HostedCluster deletion — a code path not exercised during this test. The stale VCPUs value comes from a Karpenter node left over from the preceding parallel provisioning subtests that wasn't fully accounted for after the AutoNode disable/re-enable lifecycle test.

Root Cause

The failure is a timing race between the e2e test ordering and the karpenter-operator's status reconciliation, unrelated to PR #8404's changes.

Sequence of events:

  1. Parallel provisioning tests (e.g., Instance_profile_annotation_propagation, Capacity_reservation_selector_propagation) each create Karpenter NodePools and provision nodes (t3.xlarge = 4 vCPUs each). These tests clean up their nodes by deleting the NodePools, but cleanup is asynchronous.

  2. AutoNode enable/disable lifecycle test runs next:

    • Disables AutoNode → the reconcileKarpenterOperator function clears HCP.Status.AutoNode (VCPUs, NodeCount, etc.) to zero. Takes 3s.
    • Re-enables AutoNode → waits for AutoNodeEnabled=True/AsExpected. Takes 33s.
  3. During re-enablement, the karpenter-operator pod restarts and immediately scans the guest cluster for existing NodeClaims/nodes. If any Karpenter node from the parallel tests still exists (even briefly), the operator reports its vCPUs in HCP.Status.AutoNode.VCPUs. Since t3.xlarge has 4 vCPUs, the status shows VCPUs=4.

  4. Billing_vCPUs test starts immediately after and calls waitForAutoNodeStatusVCPUs(t, ctx, mgtClient, hostedCluster, 0). This polls HostedCluster.Status.AutoNode.VCPUs (propagated from HCP.Status.AutoNode via hcluster.Status.AutoNode = hcp.Status.AutoNode). It finds VCPUs=4 and times out after 60s.

Why PR #8404 is NOT the cause:

  • PR #8404 modifies resolveKarpenterFinalizer in karpenter.go, which removes the early return for disabled AutoNode so the function can clean up stale finalizers when AutoNode is disabled.
  • resolveKarpenterFinalizer is called only from the delete() method of the HostedCluster reconciler — a code path that runs during HostedCluster deletion, not during normal test execution.
  • None of the parallel provisioning tests, the lifecycle test, or the billing test trigger HostedCluster deletion. The HostedCluster remains alive throughout.
Recommendations

  1. This failure is unrelated to PR #8404 — the PR can be safely re-tested (/retest). The changes to resolveKarpenterFinalizer only affect HostedCluster deletion paths and have no bearing on this test failure.

  2. To fix the underlying flake, the e2e test should add an explicit wait between the AutoNode enable/disable lifecycle test and the Billing_vCPUs test to ensure VCPUs converges to 0 after re-enablement — or the lifecycle test should wait for VCPUs=0 before completing (see the sketch after this list). Alternatively, the parallel provisioning tests should use harder synchronization to confirm all Karpenter nodes are fully deprovisioned (not just NodePools deleted) before proceeding.

  3. Consider increasing the timeout in waitForAutoNodeStatusVCPUs from 60s to accommodate the time needed for the karpenter-operator to detect node removal and update the VCPUs status after a disable/re-enable cycle.
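
Recommendation 2 could look roughly like this (a hypothetical sketch; the helper name and the hypershift API import path are assumptions, and the shape of Status.AutoNode is inferred from the analysis above):

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/controller-runtime/pkg/client"

    hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1" // assumed path
)

// waitForVCPUsZero blocks until the HostedCluster reports AutoNode.VCPUs == 0,
// giving the karpenter-operator time to converge after a disable/re-enable cycle.
func waitForVCPUsZero(ctx context.Context, c client.Client, key client.ObjectKey, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
        hc := &hyperv1.HostedCluster{}
        if err := c.Get(ctx, key, hc); err != nil {
            return false, err
        }
        // AutoNode assumed to be a pointer that is nil (or zeroed) once converged.
        return hc.Status.AutoNode == nil || hc.Status.AutoNode.VCPUs == 0, nil
    })
}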

Evidence

  • Failed test: TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB
  • Observed value: AutoNode.VCPUs=4 (expected 0) at ResourceVersion 140643
  • Timeout: 60s; the value never converged to 0
  • Preceding test: AutoNode enable/disable lifecycle, which disabled then re-enabled AutoNode in 39s total
  • Re-enable time: 33s for AutoNodeEnabled=True/AsExpected after re-enabling
  • Source of stale VCPUs: t3.xlarge node from the parallel provisioning tests (4 vCPUs per node)
  • PR #8404 scope: only modifies resolveKarpenterFinalizer, called exclusively in HostedClusterReconciler.delete()
  • HostedCluster deletion during test: no; the HostedCluster is alive throughout all subtests
  • All HC conditions: healthy (AutoNodeEnabled=True, Available=True, Degraded=False, all operators stable)
  • Other tests: 521 tests ran; 518 passed, 25 skipped, 3 failures (all from the same test, cascading)

@joshbranham
Contributor

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2051416864063492096 | Cost: $3.60 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@joshbranham
Contributor

/retest-required

@joshbranham
Contributor

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified label May 11, 2026
@openshift-ci-robot

@joshbranham: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@joshbranham
Contributor

/retitle NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled

@openshift-ci openshift-ci Bot changed the title from "fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" to "NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled" May 11, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label May 11, 2026
@openshift-ci-robot

@enxebre: This pull request explicitly references no jira issue.

Details

In response to this: the PR description above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@enxebre: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit deb4722 into openshift:main May 11, 2026
40 checks passed

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria.


4 participants