OCPBUGS-77311: handle HostedCluster deletion in secretJanitor to prevent stuck cluster deletions#7779

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from sdminonne:OCM-22350
Mar 1, 2026

Conversation

@sdminonne
Contributor

@sdminonne sdminonne commented Feb 23, 2026

Summary

  • Fix secretJanitor controller getting stuck in an error loop during ROSA HCP cluster deletion when the pull secret is already deleted by OCM before the janitor finishes cleaning up nodepool secrets
  • Add early return paths when the HostedCluster is not found or is being deleted, cleaning up secrets directly (expiration annotation for token secrets, delete for userdata secrets) without requiring pull secret or release image lookups
  • Add unit tests covering both the missing HostedCluster and deleting HostedCluster scenarios

fixes https://issues.redhat.com/browse/OCPBUGS-77311

Test plan

  • Unit tests pass (make test)
  • Verify make verify passes
  • Test on a ROSA HCP cluster: delete a cluster and confirm nodepool secrets are cleaned up without the janitor getting stuck
  • Verify token secrets get expiration annotations when HostedCluster is missing
  • Verify userdata secrets are deleted directly when HostedCluster is being deleted

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Secrets associated with a HostedCluster are now cleaned up when the cluster is missing or being deleted.
    • Token-type secrets are annotated with an expiration timestamp during cleanup; userdata-type secrets are deleted to prevent orphaned resources and reduce error noise.
  • Tests

    • Added tests confirming token expiration annotation and userdata secret deletion when the cluster is absent or terminating.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between cdb2941 and b269196.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/secret_janitor.go
  • hypershift-operator/controllers/nodepool/secret_janitor_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • hypershift-operator/controllers/nodepool/secret_janitor_test.go

Walkthrough

Reconcile now routes secrets for cleanup when the referenced HostedCluster is missing or has a DeletionTimestamp, via a new helper cleanupSecretForDeletion: token secrets are annotated with an expiration timestamp; userdata secrets are deleted.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Secret cleanup implementation: hypershift-operator/controllers/nodepool/secret_janitor.go | Adds cleanupSecretForDeletion(ctx, secret) and calls it from Reconcile when the HostedCluster is NotFound or has a non-zero DeletionTimestamp. Token secrets receive an expiration annotation; userdata secrets are deleted. Adjusts NotFound handling for secret deletions. |
| Secret cleanup tests: hypershift-operator/controllers/nodepool/secret_janitor_test.go | Adds subtests to TestSecretJanitor_Reconcile covering HostedCluster missing/deleting scenarios: token secrets annotated with an expiration timestamp and userdata secrets deleted. Tests run in parallel with isolated fake clients/reconcilers. |

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler as Reconciler
    participant API as K8s API Server
    participant Hosted as HostedCluster
    participant Secret as Secret

    Reconciler->>API: Get NodePool, Secret
    API-->>Reconciler: NodePool, Secret
    Reconciler->>API: Get HostedCluster by name
    API-->>Reconciler: NotFound / DeletionTimestamp set
    alt HostedCluster missing or deleting
        Reconciler->>Secret: cleanupSecretForDeletion(secret)
        alt Secret name has TokenSecretPrefix
            Secret->>API: Patch annotate IgnitionServerTokenExpirationTimestampAnnotation
            API-->>Secret: OK
        else userdata secret
            Secret->>API: Delete secret
            API-->>Secret: Deleted / NotFound
        end
    else HostedCluster present and not deleting
        Reconciler->>API: proceed with normal secret cleanup flow
        API-->>Reconciler: OK
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: handling HostedCluster deletion in secretJanitor to prevent stuck cluster deletions, which directly corresponds to the code changes adding cleanup paths for missing/deleting HostedClusters. |
| Stable And Deterministic Test Names | ✅ Passed | Test names are stable and deterministic with no dynamic information like timestamps, UUIDs, or generated suffixes. |
| Test Structure And Quality | ✅ Passed | All t.Run subtests in secret_janitor_test.go target one specific behavior each, use ephemeral fake clients with no persistent resources, and avoid indefinite waits. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from jparrill and sjenning February 23, 2026 14:08
@openshift-ci openshift-ci bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Feb 23, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
hypershift-operator/controllers/nodepool/secret_janitor_test.go (1)

344-384: LGTM! Good test coverage for HostedCluster being deleted scenario.

The test correctly sets up the deletion scenario with DeletionTimestamp and a finalizer, and verifies that userdata secrets are deleted directly.

Consider adding complementary test cases for full coverage.

The two new tests cover token cleanup when HostedCluster is missing and userdata cleanup when HostedCluster is being deleted. For completeness, you could add:

  • Userdata secret cleanup when HostedCluster is not found
  • Token secret cleanup when HostedCluster is being deleted

Since both paths use the same cleanupSecretForDeletion method, this is optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/secret_janitor_test.go` around lines
344 - 384, Add complementary unit tests to cover the missing paths: 1) a test
that ensures userdata secrets are deleted when the HostedCluster is not found
(simulate no HostedCluster in the fake client and call secretJanitor.Reconcile
or directly cleanupSecretForDeletion with a userdata secret annotated with
nodePoolAnnotation), and 2) a test that ensures token secrets are annotated with
an expiration timestamp when the HostedCluster is being deleted (create a deleting
HostedCluster with DeletionTimestamp and finalizer, create a token secret rather
than userdata, then call secretJanitor.Reconcile or cleanupSecretForDeletion and
assert the token secret carries the expiration annotation). Reference the secretJanitor type,
NodePoolReconciler.Client, cleanupSecretForDeletion and nodePoolAnnotation to
locate relevant logic and mirror the existing test setup pattern.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 160c3b4 and 3ce1c06.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/secret_janitor.go
  • hypershift-operator/controllers/nodepool/secret_janitor_test.go

@celebdor celebdor changed the title fix(OCM-22350): handle HostedCluster deletion in secretJanitor to prevent stuck cluster deletions OCM-22350: handle HostedCluster deletion in secretJanitor to prevent stuck cluster deletions Feb 25, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 25, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 25, 2026

@sdminonne: This pull request references OCM-22350 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@csrwng csrwng changed the title OCM-22350: handle HostedCluster deletion in secretJanitor to prevent stuck cluster deletions OCPBUGS-77311: handle HostedCluster deletion in secretJanitor to prevent stuck cluster deletions Feb 25, 2026
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Feb 25, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77311, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@sdminonne
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 25, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77311, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

@csrwng
Contributor

csrwng commented Feb 25, 2026

/approve

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, sdminonne

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2026
@sdminonne
Contributor Author

@coderabbitai PTAL

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

@sdminonne: I'll review the changes in this pull request right away.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@sdminonne
Contributor Author

Committed on top for clarity; this needs to be squashed.

@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77311, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/nodepool/secret_janitor.go`:
- Around line 211-216: The cleanupSecretForDeletion function should be made
idempotent by ignoring NotFound errors: when calling
setExpirationTimestampOnToken(ctx, r.Client, secret, r.now) or
r.Client.Delete(ctx, secret) detect apierrors.IsNotFound(err) and treat it as
success (return nil) instead of propagating the error; update imports to use
k8s.io/apimachinery/pkg/api/errors (apierrors) and ensure both branches handle
and swallow NotFound while still returning other errors.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 3ce1c06 and 7e2ca45.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/secret_janitor.go
  • hypershift-operator/controllers/nodepool/secret_janitor_test.go

@sdminonne
Contributor Author

@coderabbitai I implemented #7779 (review) and applied the same pattern at line 77. PTAL

@coderabbitai
Contributor

coderabbitai bot commented Feb 26, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the specific PR review comment
gh api /repos/openshift/hypershift/pulls/7779/reviews/3857020186 --jq '.body'

# Also fetch review comments to get more context
gh api /repos/openshift/hypershift/pulls/7779/reviews/3857020186/comments --jq '.[] | {path: .path, line: .line, body: .body}'

Length of output: 4221


🏁 Script executed:

#!/bin/bash
# Read the secret_janitor.go file to see the changes around line 77
cat hypershift-operator/controllers/nodepool/secret_janitor.go | head -100

Length of output: 3290


🏁 Script executed:

#!/bin/bash
# Read the cleanupSecretForDeletion function (around lines 200-220)
sed -n '200,230p' hypershift-operator/controllers/nodepool/secret_janitor.go

Length of output: 609


@sdminonne: Perfect! I can confirm that you've successfully implemented the review suggestion. ✅

The cleanupSecretForDeletion function (lines 211-216) now properly wraps both cleanup paths with client.IgnoreNotFound():

  • Token secrets: client.IgnoreNotFound(setExpirationTimestampOnToken(...))
  • Userdata secrets: client.IgnoreNotFound(r.Client.Delete(...))

And line 77 follows the same pattern: client.IgnoreNotFound(r.Client.Delete(ctx, secret)).

This makes both cleanup operations idempotent and prevents unnecessary requeues if another process removes the secret between the Get and cleanup operations. The implementation looks good! 🎉

@sdminonne
Contributor Author

/test e2e-aks

…vent stuck cluster deletions

During ROSA HCP cluster deletion, the secretJanitor controller could get
stuck in an error loop when the pull secret was already deleted by OCM
before the janitor finished cleaning up nodepool secrets. This blocked
nodepool and cluster deletion indefinitely.

Add early return paths in secretJanitor.Reconcile() when the
HostedCluster is not found or is being deleted. In both cases, clean up
the secret directly (expiration annotation for token secrets, delete for
userdata secrets) without attempting to compute the expected secret name
hash, which requires the pull secret and release image lookups.

fix(OCM-22350): make secretJanitor cleanup idempotent by ignoring NotFound errors

Wrap Delete and setExpirationTimestampOnToken calls in client.IgnoreNotFound
to prevent unnecessary requeues when a secret is removed by another controller
between the initial Get and the cleanup action.

test(OCM-22350): add complementary secretJanitor cleanup tests for full coverage

Add tests for userdata secret cleanup when HostedCluster is not found
and token secret cleanup when HostedCluster is being deleted, completing
the test matrix for cleanupSecretForDeletion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@jparrill jparrill left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026
@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented Feb 27, 2026

Test Results

e2e-aws

e2e-aks

@sdminonne
Contributor Author

/verified by e2e and unit tests

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 1, 2026
@openshift-ci-robot

@sdminonne: This PR has been marked as verified by e2e and unit tests.


@openshift-ci
Contributor

openshift-ci bot commented Mar 1, 2026

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 766855c into openshift:main Mar 1, 2026
22 checks passed
@openshift-ci-robot

@sdminonne: Jira Issue Verification Checks: Jira Issue OCPBUGS-77311
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-77311 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@sdminonne
Contributor Author

thanks @jparrill !

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

5 participants