NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled#8404
Conversation
…bled When AutoNode is disabled before HostedCluster deletion, the karpenter-operator deployment is removed and cannot process the HCP karpenter finalizer. The resolveKarpenterFinalizer fallback previously skipped when IsKarpenterEnabled returned false, leaving the HCP stuck in terminating indefinitely. Remove the early-return guard so the fallback always runs. When AutoNode is still enabled, defer to the karpenter-operator for graceful cleanup and only force-remove once the guest KAS is down. When AutoNode is disabled, remove the finalizer immediately since no controller exists to handle it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@enxebre: This pull request references Jira Issue OCPBUGS-84368, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughThe changes modify the 🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@enxebre: No Jira issue is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hypershift-operator/controllers/hostedcluster/karpenter.go`:
- Around line 143-156: The code currently treats hc.Spec.AutoNode being false as
proof the karpenter operator is gone; instead gate immediate finalizer removal
on the operator actually being absent or on a “disable completed” signal. Update
the branch that uses karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) so that
when it returns false you verify operator termination (e.g., check for absence
of Karpenter deployment/replicasets or a disable-completed condition produced by
reconcileAutoNodeEnabledCondition) before removing the finalizer; reuse or add a
helper similar to isKASAvailable (or check the karpenter operator
Deployment/ReplicaSet status) and only return nil/remove finalizer once that
check confirms the operator is truly gone or the disable progression is marked
complete. Ensure references to hc.Spec.AutoNode,
karpenterutil.IsKarpenterEnabled, isKASAvailable and
reconcileAutoNodeEnabledCondition are used to locate and modify the logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: dcbeace7-3390-4b14-9631-1f825a17f9b1
📒 Files selected for processing (2)
hypershift-operator/controllers/hostedcluster/karpenter.gohypershift-operator/controllers/hostedcluster/karpenter_test.go
| // When AutoNode is still enabled, defer to the karpenter-operator for graceful | ||
| // cleanup — only force-remove the finalizer once the guest KAS is down and the | ||
| // operator can no longer reach its watches. | ||
| // When AutoNode is disabled, the karpenter-operator deployment is already gone, | ||
| // so there is no controller to process the finalizer and we must remove it immediately. | ||
| if karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) { | ||
| kasAvailable, err := isKASAvailable(ctx, hcp.Namespace, r.Client) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to check KAS availability: %w", err) | ||
| } | ||
| if kasAvailable { | ||
| return nil | ||
| } | ||
| } |
There was a problem hiding this comment.
Don’t treat Spec.AutoNode disabled as proof the operator is already gone.
Disable is asynchronous: reconcileAutoNodeEnabledCondition in this file already models a progressing state while the Karpenter deployments still exist. Force-removing the finalizer here as soon as hc.Spec.AutoNode is disabled can therefore skip the graceful deletion flow in karpenter-operator/controllers/karpenter/karpenter_controller.go:218-295, which is the code that cleans up NodePools/NodeClaims before dropping the finalizer. If a user disables AutoNode and then deletes the HostedCluster before teardown finishes, this can orphan those resources. Please gate the immediate-removal path on the operator actually being gone (or another “disable completed” signal), not on spec state alone.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hypershift-operator/controllers/hostedcluster/karpenter.go` around lines 143
- 156, The code currently treats hc.Spec.AutoNode being false as proof the
karpenter operator is gone; instead gate immediate finalizer removal on the
operator actually being absent or on a “disable completed” signal. Update the
branch that uses karpenterutil.IsKarpenterEnabled(hc.Spec.AutoNode) so that when
it returns false you verify operator termination (e.g., check for absence of
Karpenter deployment/replicasets or a disable-completed condition produced by
reconcileAutoNodeEnabledCondition) before removing the finalizer; reuse or add a
helper similar to isKASAvailable (or check the karpenter operator
Deployment/ReplicaSet status) and only return nil/remove finalizer once that
check confirms the operator is truly gone or the disable progression is marked
complete. Ensure references to hc.Spec.AutoNode,
karpenterutil.IsKarpenterEnabled, isKASAvailable and
reconcileAutoNodeEnabledCondition are used to locate and modify the logic.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8404 +/- ##
==========================================
- Coverage 37.22% 37.22% -0.01%
==========================================
Files 750 750
Lines 91789 91787 -2
==========================================
- Hits 34168 34166 -2
Misses 54981 54981
Partials 2640 2640
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
/lgtm |
|
Scheduling tests matching the |
Test Resultse2e-aws
e2e-aks
|
|
/retest-required |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
Now I have everything I need. Let me construct the final analysis. The key findings are:
Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe test Root CauseThe failure is a timing race between the e2e test ordering and the karpenter-operator's status reconciliation, unrelated to PR #8404's changes. Sequence of events:
Why PR #8404 is NOT the cause:
Recommendations
Evidence
|
|
/retest-required |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/retest-required |
|
/verified by e2e |
|
@joshbranham: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retitle NO-JIRA: fix(karpenter): resolve HCP karpenter finalizer when AutoNode is disabled |
|
@enxebre: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@enxebre: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
When AutoNode is disabled before HostedCluster deletion, the karpenter-operator deployment is removed and cannot process the HCP karpenter finalizer. The
resolveKarpenterFinalizerfallback previously early-returned whenIsKarpenterEnabledwas false, leaving the HCP stuck in terminating indefinitely.Root cause
The
resolveKarpenterFinalizerfunction had a guard:This skipped the fallback entirely when AutoNode was disabled, even though the HCP still had the finalizer from when Karpenter was previously enabled.
Fix
Remove the early-return guard and instead use
IsKarpenterEnabledto decide the cleanup strategy:Test plan
TestResolveKarpenterFinalizercases pass (renamed to Gherkin)/cc @maxcao13 @jkyros
Summary by CodeRabbit