Skip to content

CNTRLPLANE-507: Add HCP finalizer to AWSEndpointService reconciler#8499

Draft
hypershift-jira-solve-ci[bot] wants to merge 3 commits into
openshift:mainfrom
hypershift-community:fix-CNTRLPLANE-507
Draft

CNTRLPLANE-507: Add HCP finalizer to AWSEndpointService reconciler#8499
hypershift-jira-solve-ci[bot] wants to merge 3 commits into
openshift:mainfrom
hypershift-community:fix-CNTRLPLANE-507

Conversation

@hypershift-jira-solve-ci
Copy link
Copy Markdown

@hypershift-jira-solve-ci hypershift-jira-solve-ci Bot commented May 13, 2026

What this PR does / why we need it:

Adds a finalizer on the HostedControlPlane resource from the AWSEndpointService reconciler to prevent HCP deletion before AWS PrivateLink resources are cleaned up.

Problem: When the CPO restarts during deletion of a SharedVPC cluster, the clientBuilder is uninitialized and the HCP (with its cross-account role ARNs) may already be deleted. This causes the reconciler to fail creating AWS clients, and after a 10-minute grace period the hypershift-operator force-removes the CPO finalizer — orphaning VPC endpoints, security groups, and DNS records in the shared VPC account.

Solution: The new HCP finalizer (hypershift.openshift.io/aws-private-link-endpoint-cleanup) follows the same pattern used by the Azure PLS controller:

  • Adds the finalizer to the HCP during normal reconciliation
  • When HCP deletion is detected, initializes AWS clients from the still-available HCP
  • Cleans up each AWSEndpointService's AWS resources and removes CR finalizers
  • Removes the HCP finalizer only after all AWSEndpointService CRs are cleaned up
  • Extends the HCP watch handler (enqueueOnHCPChange) to also trigger reconciliation when an HCP is being deleted with the finalizer present

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-507

Special notes for your reviewer:

  • This follows the same finalizer pattern already established by the Azure PLS controller
  • The enqueueOnHCPChange handler (renamed from enqueueOnAccessChange) now triggers on both EndpointAccess changes and HCP deletions with the finalizer
  • AWS client initialization during HCP deletion reuses the existing getAWSClient helper, sourcing credentials from the still-available HCP spec

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Always review AI generated responses prior to use.
Generated with Claude Code via /jira:solve [CNTRLPLANE-507](https://redhat.atlassian.net/browse/CNTRLPLANE-507)


Note: This PR was auto-generated by the jira-agent periodic CI job in response to CNTRLPLANE-507. See the full report for token usage, cost breakdown, and detailed phase output.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced AWS PrivateLink cleanup coordination during HostedControlPlane deletion to ensure proper resource cleanup while credentials remain available.
  • Tests

    • Added comprehensive test coverage for deletion handling and finalizer coordination.

OpenShift CI Bot and others added 3 commits May 13, 2026 02:14
The AWSEndpointService reconciler now adds a finalizer to the
HostedControlPlane resource to ensure HCP credentials remain available
during AWS PrivateLink resource cleanup.

Previously, when the CPO restarted during deletion of a SharedVPC
cluster, the clientBuilder was uninitialized and the HCP (with its
cross-account role ARNs) could already be deleted. This caused the
reconciler to fail creating AWS clients, and after a 10-minute grace
period the hypershift-operator would force-remove the CPO finalizer,
orphaning VPC endpoints, security groups, and DNS records in the
shared VPC account.

The new HCP finalizer follows the same pattern used by the Azure PLS
controller: it blocks HCP deletion until all AWSEndpointService
resources are cleaned up. When HCP deletion is detected, the
reconciler initializes AWS clients from the still-available HCP,
cleans up each AWSEndpointService's AWS resources, removes the CR
finalizers, and finally removes the HCP finalizer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add unit tests for the new HCP finalizer lifecycle:

- TestEnsureHCPFinalizer: verifies the finalizer is added to the HCP
  when not present, and is a no-op when already present.
- TestReconcileHCPDeletion: verifies four scenarios:
  - HCP deletion with our finalizer cleans up AWS resources and
    removes both the CR and HCP finalizers.
  - HCP without our finalizer returns early without cleanup.
  - SharedVPC HCP deletion initializes clients with role ARNs from
    the still-available HCP.
  - Multiple AWSEndpointService CRs: keeps the HCP finalizer until
    all CRs are cleaned up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Trim verbose comment blocks on hcpAWSPrivateLinkFinalizerName constant
  and reconcileHCPDeletion function to keep only the "why" per project
  conventions
- Fix context.Background() usage in enqueueOnHCPChange to use the
  passed-in ctx for proper cancellation propagation
- Add error path tests for ensureHCPFinalizer (Patch conflict and
  non-conflict errors)
- Add error path tests for reconcileHCPDeletion (AWS client failure,
  cleanup failure, incomplete deletion, CR Update failure, List failure,
  HCP Patch conflict)
- Add direct unit test for enqueueOnHCPChange covering HCP deletion
  with finalizer, EndpointAccess change, and no-op scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 13, 2026

@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-507 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds a finalizer on the HostedControlPlane resource from the AWSEndpointService reconciler to prevent HCP deletion before AWS PrivateLink resources are cleaned up.

Problem: When the CPO restarts during deletion of a SharedVPC cluster, the clientBuilder is uninitialized and the HCP (with its cross-account role ARNs) may already be deleted. This causes the reconciler to fail creating AWS clients, and after a 10-minute grace period the hypershift-operator force-removes the CPO finalizer — orphaning VPC endpoints, security groups, and DNS records in the shared VPC account.

Solution: The new HCP finalizer (hypershift.openshift.io/aws-private-link-endpoint-cleanup) follows the same pattern used by the Azure PLS controller:

  • Adds the finalizer to the HCP during normal reconciliation
  • When HCP deletion is detected, initializes AWS clients from the still-available HCP
  • Cleans up each AWSEndpointService's AWS resources and removes CR finalizers
  • Removes the HCP finalizer only after all AWSEndpointService CRs are cleaned up
  • Extends the HCP watch handler (enqueueOnHCPChange) to also trigger reconciliation when an HCP is being deleted with the finalizer present

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-507

Special notes for your reviewer:

  • This follows the same finalizer pattern already established by the Azure PLS controller
  • The enqueueOnHCPChange handler (renamed from enqueueOnAccessChange) now triggers on both EndpointAccess changes and HCP deletions with the finalizer
  • AWS client initialization during HCP deletion reuses the existing getAWSClient helper, sourcing credentials from the still-available HCP spec

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Always review AI generated responses prior to use.
Generated with Claude Code via /jira:solve [CNTRLPLANE-507](https://redhat.atlassian.net/browse/CNTRLPLANE-507)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 13, 2026
@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels May 13, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Walkthrough

This change introduces an HCP-level finalizer (hcpAWSPrivateLinkFinalizerName) to coordinate AWS PrivateLink resource cleanup during HostedControlPlane deletion. The AWSEndpointService reconciler now watches both EndpointAccess changes and HCP deletion events. During reconciliation, an ensureHCPFinalizer helper adds the finalizer before normal processing. When HCP deletion is detected, a new reconcileHCPDeletion flow initializes AWS clients from HCP credentials, cleans up endpoint/security group/Route53 resources, and removes the HCP finalizer only after all AWSEndpointService CRs have cleared their finalizers. This ensures AWS credentials remain available throughout cleanup.

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes API
    participant Rec as AWSEndpointService<br/>Reconciler
    participant HCP as HostedControlPlane
    participant AWS as AWS Services<br/>(EC2/Route53)

    rect rgba(100, 150, 200, 0.5)
    Note over Rec,HCP: Normal Reconciliation Path
    Rec->>K8s: Get HCP
    Rec->>Rec: ensureHCPFinalizer<br/>(add if missing)
    Rec->>K8s: Patch HCP with finalizer
    Rec->>Rec: Continue AWSEndpointService<br/>reconciliation
    end

    rect rgba(200, 100, 100, 0.5)
    Note over Rec,AWS: HCP Deletion Path
    K8s->>Rec: Detect HCP deletion<br/>(DeletionTimestamp set)
    Rec->>Rec: reconcileHCPDeletion
    Rec->>AWS: Initialize AWS client<br/>from HCP credentials
    Rec->>AWS: Delete endpoint service,<br/>security group, Route53 records
    Rec->>K8s: List all AWSEndpointService<br/>CRs in namespace
    alt All CR finalizers cleared
        Rec->>K8s: Remove HCP finalizer
        Rec->>K8s: HCP deletion proceeds
    else CR finalizers still present
        Rec->>Rec: Requeue reconciliation
    end
    end
Loading

Suggested reviewers

  • jparrill
  • jimdaga
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: adding an HCP finalizer to the AWSEndpointService reconciler, which directly matches the core functionality introduced in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All newly added test case names use stable, deterministic strings with no dynamic values like timestamps, UUIDs, or indices. Test titles are descriptive and static as required.
Test Structure And Quality ✅ Passed The custom check targets Ginkgo test code (Describe/It blocks, BeforeEach/AfterEach). The PR adds standard Go tests using testing.T and table-driven patterns, not Ginkgo. Check not applicable.
Microshift Test Compatibility ✅ Passed The PR adds standard Go unit tests using *testing.T pattern with mocked clients, not Ginkgo e2e tests. The MicroShift Test Compatibility check targets only Ginkgo e2e tests, which are not present.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests were added. The PR adds only standard Go unit tests. Custom check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only controller reconciliation logic (finalizer/deletion handling). No deployment manifests, Pod specs, or scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed Not applicable: PR modifies standard Kubernetes controller unit tests, not OTE binary code. No stdout violations detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR adds only standard Go unit tests using testing.T pattern. Custom check applies to Ginkgo e2e tests and is not applicable.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels May 13, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hypershift-jira-solve-ci[bot]
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 82.95455% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.05%. Comparing base (76bbefc) to head (80add24).

Files with missing lines Patch % Lines
...ollers/awsprivatelink/awsprivatelink_controller.go 82.95% 14 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8499      +/-   ##
==========================================
+ Coverage   40.00%   40.05%   +0.05%     
==========================================
  Files         751      751              
  Lines       92838    92915      +77     
==========================================
+ Hits        37137    37216      +79     
+ Misses      53014    53009       -5     
- Partials     2687     2690       +3     
Files with missing lines Coverage Δ
...ollers/awsprivatelink/awsprivatelink_controller.go 39.75% <82.95%> (+5.41%) ⬆️
Flag Coverage Δ
cmd-support 34.09% <ø> (ø)
cpo-hostedcontrolplane 40.56% <ø> (ø)
cpo-other 40.53% <82.95%> (+0.39%) ⬆️
hypershift-operator 50.53% <ø> (ø)
other 31.54% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant