
fix: refactor tenant decommissioning #175

Merged
minh-nghia merged 2 commits into main from feat/refactor-tenant-decommissioning
Mar 25, 2026

Conversation

@minh-nghia
Contributor

@minh-nghia minh-nghia commented Mar 6, 2026

Change descriptions

  • Split the SystemUnlink job type into SystemUnlink and SystemUnlinkDecommission, instead of using a trigger to distinguish between a normal unlink and a decommission-based unlink.
  • SystemUnlinkDecommission job will:
    • Not create any model.Event entry since no user intervention for cancel/retry is needed at this point. Tenant access is already blocked.
    • Set the system status to DISCONNECTED and reset the key claim regardless of task result (Done or Failed).
    • Set the system status back to FAILED if the job is canceled, so the event is recreated and retried on the next tenant offboarding reconciliation.
    • Not do any unmapping from registry as this will be done separately.
  • Rework KeyDetach event:
    • Add key state DETACHING. When the event is created, the key goes into the DETACHING state. Only upon job termination will it go to DETACHED (regardless of Done or Failed).
    • Key state is set to UNKNOWN after job cancellation.
    • Audit log is only sent when job is Done.
  • Modify TerminateTenant operator:
    • Update the condition for "all systems have been unlinked" -> check that no systems are linked to any key configuration anymore.
    • Detach is only sent for the primary keys of a key configuration.
    • Right after all key detach events are sent (all primary keys in DETACHING/DETACHED states), the next step to unmap all systems from registry is executed.
      • Not as a system job hook since there are other systems not connected but still need to be unmapped from the tenant.
      • Idempotent, will skip if System is not found in registry or has been unmapped before
    • The final step will wait for all primary keys to go into DETACHED state before deleting tenant data.
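Taken together, the steps above amount to a small reconciliation state machine that keeps returning "continue and wait" until every condition holds. The sketch below is illustrative only: the type and function names (TenantSnapshot, nextOffboardingStatus) are invented for the example and are not the identifiers used in internal/manager/tenant.go.

```go
package main

import "fmt"

// OffboardStatus mirrors the semantics of the PR's offboarding results
// (OffboardingContinueAndWait etc.); the names here are illustrative.
type OffboardStatus int

const (
	ContinueAndWait OffboardStatus = iota
	GoToNextStep
	Failed
	Success
)

// TenantSnapshot holds the state the reconciler would observe each pass.
type TenantSnapshot struct {
	LinkedSystems      int  // systems still holding a key configuration
	PrimaryKeysPending int  // primary keys not yet DETACHING/DETACHED
	UnmapDone          bool // all systems unmapped from registry
	KeysDetached       bool // all primary keys reached DETACHED
}

// nextOffboardingStatus walks the sequence described above: unlink all
// systems, detach primary keys, unmap from registry, then wait for every
// primary key to reach DETACHED before tenant data may be deleted.
func nextOffboardingStatus(s TenantSnapshot) OffboardStatus {
	if s.LinkedSystems > 0 {
		return ContinueAndWait // unlink jobs still in flight
	}
	if s.PrimaryKeysPending > 0 {
		return ContinueAndWait // detach events still being sent
	}
	if !s.UnmapDone {
		return ContinueAndWait // registry unmapping in progress
	}
	if !s.KeysDetached {
		return ContinueAndWait // waiting for DETACHED
	}
	return Success
}

func main() {
	fmt.Println("still linked:", nextOffboardingStatus(TenantSnapshot{LinkedSystems: 2}) == ContinueAndWait)
	fmt.Println("complete:", nextOffboardingStatus(TenantSnapshot{UnmapDone: true, KeysDetached: true}) == Success)
}
```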

Summary by CodeRabbit

  • New Features

    • Added a DETACHING state for keys to track decommissioning progress before completion.
    • Enhanced tenant offboarding with explicit checkpoints for system unlinking, registry unmapping, and key detachment verification.
  • Bug Fixes

    • Improved error handling and state transitions during key detach operations.
    • Better handling of intermediate states in tenant offboarding process with retry capabilities.

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch 3 times, most recently from 4fca870 to 6d8b3f0 Compare March 6, 2026 16:12
@minh-nghia minh-nghia marked this pull request as ready for review March 6, 2026 16:13
Comment on lines +373 to +374
system.Status = cmkapi.SystemStatusCONNECTED
err = updateSystem(ctx, h.repo, system)
Contributor


Just double-checking: this is only executed for systems previously in the Connected state, right? Otherwise we might end up with systems marked as connected that aren't really connected.

Contributor Author


The condition was against systems with keyConfig != null, so it would cover both FAILED and CONNECTED. It makes more sense here to mark them as FAILED instead.

Contributor

@hyperspace-insights hyperspace-insights bot left a comment


The PR introduces meaningful improvements to tenant decommissioning but has a critical logic inversion in sendSystemUnlinkEvent where the job type names and factory calls are swapped between the decommission and non-decommission branches — this would route jobs to the wrong handlers entirely. Additionally, there is a test name mismatch for the canceled key detach scenario, a dangling empty comment line in operator.go, and a potentially unscoped system query in unmapAllSystemsFromRegistry that the author should verify is correctly tenant-bounded. The existing reviewer comment about the sendSystemUnlinkEvent inversion should be prioritized for immediate fix.

PR Bot Information

Version: 1.17.99 | 📖 Documentation | 🚨 Create Incident | 💬 Feedback

  • Correlation ID: aa2e50c0-1c66-11f1-9f08-81834e73c7da
  • Event Trigger: issue_comment.created
  • LLM: anthropic--claude-4.6-sonnet

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch 2 times, most recently from e6001a1 to c065b21 Compare March 12, 2026 14:08
@coderabbitai

coderabbitai bot commented Mar 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR introduces a key detach workflow with an intermediate DETACHING state, adds specialized handlers for key detach and system-unlink-decommission jobs, substantially refactors tenant offboarding with explicit step-by-step reconciliation including registry unmapping, and updates event factories, managers, operators, and tests accordingly.

Changes

  • API Schema & Constants (apis/cmk/cmk-ui.yaml, internal/api/cmkapi/cmkapi.go, internal/event-processor/types.go): Added DETACHING key state enum value; updated schema descriptions to reflect detach state transitions; added JobTypeSystemUnlinkDecommission job type constant.
  • Event Factory (internal/event-processor/factory.go, internal/event-processor/factory_test.go): Removed trigger parameter from SystemUnlink; added SystemUnlinkDecommission method that sets system status to PROCESSING before creating the decommission job; updated test call sites.
  • Key Detach Handler (internal/event-processor/handlers_key.go): New KeyDetachJobHandler with task resolution, job confirmation, and event handling for completion (→ DETACHED), failure (→ DETACHED with audit log), and cancellation (→ UNKNOWN) transitions; includes detach audit logging.
  • System Handlers (internal/event-processor/handlers_system.go, internal/event-processor/reconciler.go): Added SystemUnlinkDecommissionJobHandler with a similar event-handling pattern; refactored existing system handlers to wrap mutations in transactions; moved audit-log emission to after successful updates.
  • Job Handler Registration & Tests (internal/event-processor/reconciler.go, internal/event-processor/reconciler_test.go): Updated handler mappings for key detach and system unlink decommission; added comprehensive test cases for key detach state transitions and system unlink decommission termination outcomes.
  • Helper Functions (internal/event-processor/helpers.go, internal/event-processor/repo.go, internal/event-processor/resolvers.go): Added mergeOrbitalTaskErrors helper to extract and join failed task errors; added updateSystem and updateKey repository helpers; updated SystemTaskInfoResolver to map both unlink types to the same task type.
  • Key Manager (internal/manager/key.go): Changed initial detach state from DETACHED to DETACHING; removed inline audit-log emission (moved to handler).
  • System Manager (internal/manager/system.go): Added UnmapSystemFromRegistry method; refactored UnlinkSystemAction to delegate via a new sendSystemUnlinkEvent helper that selects the job type based on the decommission trigger.
  • Tenant Offboarding Refactor (internal/manager/tenant.go, internal/manager/tenant_test.go, internal/manager/export_test.go): Replaced the monolithic unlink/detach flow with a four-step sequence: send unlink events → verify unlink → detach primary keys → unmapping and verification; replaced OffboardingProcessing with OffboardingGoToNextStep/OffboardingContinueAndWait; added gRPC-aware unmapping error classification; updated test wiring with a gRPC server for the registry service; exposed test-access methods.
  • Tenant Operator (internal/operator/operator.go, internal/operator/operator_test.go): Split OIDC mapping removal into a separate helper; removed the OffboardingProcessing case from offboarding status handling; updated test expectations to use OffboardingContinueAndWait.
  • Integration Test & Configuration (test/integration/event-processor/reconciliation_test.go, sonar-project.properties): Updated unlink factory call to match the new signature; excluded cmkapi.go from Sonar analysis.

Sequence Diagrams

sequenceDiagram
    participant Client
    participant KeyMgr as KeyManager
    participant EventFactory
    participant OrbitalMgr as OrbitalManager
    participant KeyHandler as KeyDetachJobHandler
    participant Repo
    participant Auditor

    Client->>KeyMgr: Detach(key)
    KeyMgr->>Repo: Patch key.State→DETACHING
    KeyMgr->>EventFactory: SystemUnlinkDecommission
    KeyMgr->>KeyMgr: sendDetachEvent()
    EventFactory->>OrbitalMgr: CreateJob(KeyDetach)
    OrbitalMgr-->>KeyHandler: Job created
    KeyHandler->>KeyHandler: ResolveTasks()
    KeyHandler->>KeyHandler: HandleJobConfirm()
    Note over KeyHandler: Verify key in DETACHING state
    KeyHandler->>OrbitalMgr: CompleteJobConfirmer / CancelJobConfirmer
    OrbitalMgr->>KeyHandler: JobDoneEvent / JobFailedEvent / JobCanceledEvent
    alt Done Event
        KeyHandler->>Repo: Update key.State→DETACHED
        KeyHandler->>Auditor: SendDetachAuditLog()
        Auditor-->>KeyHandler: success or logged failure
    else Failed Event
        KeyHandler->>Repo: Update key.State→DETACHED
        KeyHandler->>KeyHandler: mergeOrbitalTaskErrors()
    else Canceled Event
        KeyHandler->>Repo: Update key.State→UNKNOWN
    end
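The terminal transitions in the key detach diagram (Done → DETACHED, Failed → DETACHED, Canceled → UNKNOWN) reduce to a small mapping. This is a hypothetical sketch: the state constants match the PR's names, but the function and the outcome strings are invented for illustration.

```go
package main

import "fmt"

// Key states from the PR: DETACHING is entered when the detach event is
// created; the terminal job event decides the final state.
const (
	StateDetaching = "DETACHING"
	StateDetached  = "DETACHED"
	StateUnknown   = "UNKNOWN"
)

// keyStateAfterJob maps a job outcome to the resulting key state. Done and
// Failed both end in DETACHED (only Done also emits the audit log);
// cancellation leaves the key in UNKNOWN.
func keyStateAfterJob(outcome string) string {
	switch outcome {
	case "done", "failed":
		return StateDetached
	case "canceled":
		return StateUnknown
	default:
		return StateDetaching // job not terminated yet
	}
}

func main() {
	for _, o := range []string{"done", "failed", "canceled"} {
		fmt.Printf("%s -> %s\n", o, keyStateAfterJob(o))
	}
}
```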
sequenceDiagram
    participant Client
    participant SysMgr as SystemManager
    participant EventFactory
    participant OrbitalMgr as OrbitalManager
    participant SysHandler as SystemUnlinkDecommissionJobHandler
    participant Repo
    participant Registry
    participant Auditor

    Client->>SysMgr: UnlinkSystemAction(trigger=DECOMMISSION)
    SysMgr->>EventFactory: SystemUnlinkDecommission()
    EventFactory->>Repo: Patch system.Status→PROCESSING
    EventFactory->>OrbitalMgr: CreateJob(SYSTEM_UNLINK_DECOMMISSION)
    OrbitalMgr-->>SysHandler: Job created
    SysHandler->>SysHandler: HandleJobConfirm()
    SysHandler->>OrbitalMgr: CompleteJobConfirmer
    OrbitalMgr->>SysHandler: JobDoneEvent
    SysHandler->>SysHandler: terminate()
    SysHandler->>Repo: Load system & key
    SysHandler->>Repo: Update system→DISCONNECTED
    SysHandler->>Registry: UnlinkL1KeyClaim()
    SysHandler->>Auditor: SendOffboardingAuditLog()
    Auditor-->>SysHandler: success or logged failure
sequenceDiagram
    participant Client
    participant TenantMgr as TenantManager
    participant SysMgr as SystemManager
    participant KeyMgr as KeyManager
    participant Registry
    participant Repo

    Client->>TenantMgr: OffboardTenant()
    loop Step 1-2: Unlink Systems
        TenantMgr->>SysMgr: sendUnlinkForConnectedSystems()
        SysMgr->>SysMgr: Send unlink events
        TenantMgr->>Repo: checkAllSystemsUnlinked()
        alt Not all unlinked
            TenantMgr-->>Client: OffboardingContinueAndWait
        end
    end
    loop Step 3: Detach Primary Keys
        TenantMgr->>KeyMgr: detachPrimaryKeys()
        KeyMgr->>KeyMgr: Send detach events
        TenantMgr->>Repo: checkAllPrimaryKeysProcessed()
        alt Not all in DETACHING
            TenantMgr-->>Client: OffboardingContinueAndWait
        end
    end
    loop Step 4: Unmapping
        TenantMgr->>Registry: unmapAllSystemsFromRegistry()
        alt Retryable error (codes.Internal)
            TenantMgr-->>Client: OffboardingContinueAndWait
        else Permanent error (codes.InvalidArgument)
            TenantMgr-->>Client: OffboardingFailed
        end
    end
    loop Step 5: Verify Detach Complete
        TenantMgr->>Repo: checkAllPrimaryKeysDetached()
        alt Not all DETACHED
            TenantMgr-->>Client: OffboardingContinueAndWait
        end
    end
    TenantMgr-->>Client: OffboardingSuccess

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • bkieling
  • davidbolet
  • apatsap
  • doanhluongsap

Poem

🐰 The detach dance now has more steps,
Keys transition through DETACHING rests,
Systems unlink with decommission flair,
Tenants offboard with registry care,
State machines spin and handlers dance,
A workflow refined with each advance! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 3.85%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title 'fix: refactor tenant decommissioning' clearly describes the main change: a refactoring of the tenant decommissioning workflow by splitting SystemUnlink into separate job types and reworking related state transitions.
  • Description check ✅ Passed: The PR description comprehensively covers all major changes (the SystemUnlink split, SystemUnlinkDecommission behavior, the KeyDetach rework, and the TerminateTenant operator modifications), aligning well with the detailed changeset.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/manager/tenant.go (1)

229-249: ⚠️ Potential issue | 🟠 Major

This unlink seed only covers CONNECTED, but completion waits on all linked systems.

checkAllSystemsUnlinked() returns false until every KeyConfigurationID is cleared, while this helper only dispatches decommission unlinks for Status == CONNECTED. Any linked system already in FAILED or another non-CONNECTED state will never get another unlink event, so offboarding can sit in OffboardingProcessing forever.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant.go` around lines 229 - 249, The current
sendUnlinkForConnectedSystems only queries systems with Status ==
SystemStatusCONNECTED, but checkAllSystemsUnlinked() requires every
KeyConfigurationID to be cleared; change sendUnlinkForConnectedSystems to target
all linked systems (e.g., systems with a non-empty KeyConfigurationID or any
status != UNLINKED) instead of only SystemStatusCONNECTED so every linked system
gets an UnlinkSystemAction; update the repo.NewCompositeKey()/Where(...)
condition used in sendUnlinkForConnectedSystems (keeping the UnlinkSystemAction
call and constants.SystemActionDecommission) to query by KeyConfigurationID
presence (or status not UNLINKED) and leave the loop calling
m.sys.UnlinkSystemAction unchanged.
♻️ Duplicate comments (1)
internal/event-processor/handlers_system.go (1)

354-371: ⚠️ Potential issue | 🔴 Critical

Canceled decommission jobs must restore CONNECTED.

Line 370 sets FAILED, but internal/manager/tenant.go, Lines 229-249 only resubmits decommission unlinks for connected systems. After a cancel, this becomes unrecoverable and blocks tenant offboarding instead of retrying on the next reconciliation. This also disagrees with the new expectation in internal/event-processor/reconciler_test.go, Lines 913-941.

Suggested change
-	system.Status = cmkapi.SystemStatusFAILED
+	system.Status = cmkapi.SystemStatusCONNECTED
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 354 - 371, In
HandleJobCanceledEvent (SystemUnlinkDecommissionJobHandler) restore the system
to CONNECTED instead of FAILED: replace the assignment setting system.Status =
cmkapi.SystemStatusFAILED with system.Status = cmkapi.SystemStatusCONNECTED so
the subsequent updateSystem call persists the restored CONNECTED state and
allows the decommission reconciliation to resubmit the unlink; keep the existing
updateSystem(ctx, h.repo, system) call.
🧹 Nitpick comments (4)
internal/event-processor/export_test.go (1)

23-25: Keep DisableAuditLog() in sync with the new handlers.

Tests can now drive cancellation paths through this exported hook, but the helper still only swaps auditors for link/unlink/switch. SystemUnlinkDecommissionJobHandler and KeyDetachJobHandler can still emit audit logs in tests that expect side effects to be disabled.

Proposed update
 func (c *CryptoReconciler) DisableAuditLog() {
 	c.jobHandlerMap[JobTypeSystemLink].(*SystemLinkJobHandler).cmkAuditor = &auditor.Auditor{}
 	c.jobHandlerMap[JobTypeSystemUnlink].(*SystemUnlinkJobHandler).cmkAuditor = &auditor.Auditor{}
+	c.jobHandlerMap[JobTypeSystemUnlinkDecommission].(*SystemUnlinkDecommissionJobHandler).cmkAuditor = &auditor.Auditor{}
 	c.jobHandlerMap[JobTypeSystemSwitch].(*SystemSwitchJobHandler).cmkAuditor = &auditor.Auditor{}
+	c.jobHandlerMap[JobTypeKeyDetach].(*KeyDetachJobHandler).cmkAuditor = &auditor.Auditor{}
 }

Also applies to: 32-36

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/export_test.go` around lines 23 - 25, Update the
DisableAuditLog test helper so it also suppresses audit logging for the new
exported cancellation hook and the two job handlers: ensure DisableAuditLog
swaps auditors (or stubs) for SystemUnlinkDecommissionJobHandler and
KeyDetachJobHandler and additionally covers JobCanceledFunc (which delegates to
jobCanceledFunc) so cancellation paths in tests do not emit audit logs; modify
the helper to register/replace those handler/auditor symbols the same way it
already does for link/unlink/switch.
internal/manager/tenant_test.go (1)

251-260: Scope this cleanup to the fixture under test.

This loop disconnects every system in the tenant, including state created by other t.Runs. That makes the assertion order-dependent and can hide regressions in the OffboardTenant selection logic. Prefer creating only the systems needed here, or filtering the patch by keyConfigID.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant_test.go` around lines 251 - 260, The test currently
disconnects every system in the tenant by listing all model.System via r.List
and patching each one (setting Status and KeyConfigurationID) which affects
other t.Run fixtures; instead restrict the cleanup to only the systems
created/owned by this test: either create and track the specific systems locally
and patch only those, or apply a scoped query filter (e.g. filter by the
fixture's keyConfigID or another unique identifier) when calling r.List and when
calling r.Patch (use the existing repo.NewQuery().UpdateAll(true) but with the
filter applied) so only the intended systems are modified.
test/integration/event-processor/reconciliation_test.go (1)

207-244: Add a SystemUnlinkDecommission case to this table.

This suite now exercises the shortened SystemUnlink API, but the PR also adds a new unlink-decommission job type and resolver path. A sibling case here would catch regressions where the new job stops producing a SYSTEM_UNLINK task or serializes different payload fields.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/event-processor/reconciliation_test.go` around lines 207 -
244, Add a new test case entry in the testCases table for the
SystemUnlinkDecommission job that mirrors the existing SYSTEM_UNLINK
expectations: set name to something like "SYSTEM_UNLINK_DECOMMISSION creates
task for system's region only", jobType to "SYSTEM_UNLINK_DECOMMISSION",
protoTaskType to eventProto.TaskType_SYSTEM_UNLINK (same produced task),
keyIDFrom to the unlink key (keyIDUnlink), keyIDTo to "", and system to
systemUnlink. Then update the job-creation branch inside the t.Run loop to
handle tc.jobType == "SYSTEM_UNLINK_DECOMMISSION" by calling the new factory
method (e.g. tester.eventFactory.SystemUnlinkDecommission(ctx, tc.system,
tc.keyIDFrom)) so the new case is exercised alongside SYSTEM_UNLINK and
SYSTEM_LINK.
internal/manager/key.go (1)

425-427: Patch only the state field here.

UpdateAll(true) writes the full model.Key snapshot back to the row. This path only needs the DETACHING transition, so a field-scoped update is safer and avoids persisting unrelated stale fields from the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/key.go` around lines 425 - 427, The code sets key.State =
string(cmkapi.KeyStateDETACHING) then calls km.repo.Patch with
repo.NewQuery().UpdateAll(true), which writes the whole model back; change the
Patch invocation to perform a field-scoped update that only persists the state
change (i.e., replace UpdateAll(true) with the repository query option that
updates only the "state" field) so unrelated/stale fields on model.Key are not
written; keep the same key variable and km.repo.Patch call but supply a query
configured to UpdateFields("state") (or the repo's equivalent API) instead of
UpdateAll.
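The trade-off this nitpick flags (UpdateAll(true) versus a field-scoped update) can be illustrated with a toy query builder. The API shown here (newQuery, UpdateFields, columnsToWrite) is a stand-in invented for the example; the real repo package may expose different names.

```go
package main

import "fmt"

// query is a toy stand-in for the repo query builder discussed above;
// UpdateAll/UpdateFields here only record intent, they touch no database.
type query struct {
	all    bool
	fields []string
}

func newQuery() *query { return &query{} }

func (q *query) UpdateAll(v bool) *query { q.all = v; return q }

func (q *query) UpdateFields(f ...string) *query { q.fields = f; return q }

// columnsToWrite shows why field scoping is safer for the DETACHING
// transition: UpdateAll persists every column of the in-memory snapshot
// (including possibly stale ones), while UpdateFields limits the write
// to the listed columns.
func (q *query) columnsToWrite(allColumns []string) []string {
	if q.all {
		return allColumns
	}
	return q.fields
}

func main() {
	cols := []string{"id", "state", "algorithm", "created_at"}
	fmt.Println(newQuery().UpdateAll(true).columnsToWrite(cols))
	fmt.Println(newQuery().UpdateFields("state").columnsToWrite(cols))
}
```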
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/event-processor/factory.go`:
- Around line 131-151: The current SystemUnlinkDecommission sets system.Status
and does a bare repo.Patch inside a transaction, bypassing the guarded/keyed
status-update flow and risking leaving the system in PROCESSING if
createSystemEventJob fails; replace the manual Transaction+Patch logic with the
guarded flow by invoking the existing handleSystemStatus (or the same helper
used elsewhere) so the status update is keyed/guarded and tied to job creation:
prepare SystemActionJobData (SystemID, KeyIDFrom) then call handleSystemStatus
(or the helper wrapper) to perform the status transition and create the job via
createSystemEventJob in its protected closure, ensuring rollback/cleanup if job
creation fails.

In `@internal/event-processor/handlers_key.go`:
- Around line 107-110: The code currently returns an error when
h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()) fails, which can cause
retries even though the key is already persisted as DETACHED; change this so
audit-log delivery is non-fatal: call h.cmkAuditor.SendCmkDetachAuditLog as
before but on error do not return—log the failure with context (include key.ID
and the error) using the handler logger (e.g., h.logger or processLogger) and
continue normal completion; remove the fmt.Errorf return path so the done hook
is not blocked by transient auditor failures.

In `@internal/event-processor/handlers_system.go`:
- Around line 314-319: The handler SystemUnlinkDecommissionJobHandler currently
always returns orbital.CompleteJobConfirmer() from HandleJobConfirm which
accepts late task completions; change it to use the standard confirmer instead
so late results from canceled/superseded jobs are rejected. Update
HandleJobConfirm to return the standard confirmer (e.g.,
orbital.StandardJobConfirmer()) rather than orbital.CompleteJobConfirmer(),
preserving the same error return (nil).

In `@internal/event-processor/repo.go`:
- Around line 64-85: Both updateSystem and updateKey call repo.Repo.Patch but
ignore the boolean result indicating whether a row was actually patched;
updateSystem, updateKey, and the Patch callsite should check the returned
boolean (the first return value from repo.Repo.Patch) and return an error if it
is false (e.g., "no rows patched for system/key <ID>"), so callers don't assume
persistence when nothing changed—locate the Patch invocations in
updateSystem/updateKey (they build a composite key via
repo.NewCompositeKey().Where and repo.NewQuery().Where(...).UpdateAll(true)) and
add a conditional that treats a false patch result as an error before returning
nil.

In `@internal/manager/system.go`:
- Around line 432-444: The UnmapSystemFromRegistry method assumes m.registry is
non-nil and calls m.registry.Mapping(), which can panic because NewSystemManager
permits a nil registry; guard this by checking if m.registry == nil at the top
of UnmapSystemFromRegistry and return a retryable/temporary error instead of
calling into Mapping(); reference the method name (UnmapSystemFromRegistry) and
the field (m.registry) when making the change so callers get a surfaced error
rather than a panic.

In `@internal/manager/tenant.go`:
- Around line 292-321: The call to repo.ProcessInBatch in
TenantManager.unmapAllSystemsFromRegistry currently ignores its error; capture
its returned error (assign the result of repo.ProcessInBatch to err) and, if
non-nil, mark the offboarding as failed (e.g., set hasOffboardingFailed = true
or immediately return (true, OffboardingFailed)) so failures
enumerating/batching systems are propagated instead of treated as success; keep
existing per-system handling using m.sys.UnmapSystemFromRegistry and
m.unmapSystemErrorCanContinue, but ensure the overall function returns a failure
status when repo.ProcessInBatch fails.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b233394-d19e-4cfd-bcdb-3372f59be67f

📥 Commits

Reviewing files that changed from the base of the PR and between 3368ec5 and c065b21.

📒 Files selected for processing (19)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • test/integration/event-processor/reconciliation_test.go

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch from c065b21 to 9c27616 Compare March 12, 2026 15:27

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/operator/operator.go (1)

425-455: ⚠️ Potential issue | 🔴 Critical

Return immediately on any gRPC error before accessing the response.

The current logic only returns early for non-internal errors, allowing codes.Internal to fall through to grpcResp.GetSuccess(), which can panic on a nil response and also violates the intended short-circuit when OIDC cleanup fails. This differs from the safe pattern used in BlockOIDCMapping and UnblockOIDCMapping in the same file. Guard the response dereference with a check on any error.

Suggested fix
 func (o *TenantOperator) removeOIDCMapping(
 	ctx context.Context,
 	resp *orbital.HandlerResponse,
 	tenantID string,
 ) bool {
 	grpcResp, err := o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(
 		ctx,
 		&oidcmappinggrpc.RemoveOIDCMappingRequest{
 			TenantId: tenantID,
 		},
 	)
-	st, ok := status.FromError(err)
-	if !ok {
-		log.Error(ctx, "failed getting info on sessionManager error", err)
-	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
-		log.Error(ctx, "error while removing OIDC mapping", err)
-		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
-		return true
-	}
+	if err != nil {
+		log.Error(ctx, "error while removing OIDC mapping", err, slog.String("grpcCode", status.Code(err).String()))
+		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
+		return true
+	}
 
 	if !grpcResp.GetSuccess() {
 		err = errs.Wrapf(
 			ErrFailedResponse,
 			"session manager could not remove OIDC mapping: "+grpcResp.GetMessage(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator/operator.go` around lines 425 - 455, In removeOIDCMapping,
don't dereference grpcResp when err is non-nil: immediately handle any gRPC
error (including codes.Internal) before accessing grpcResp.GetSuccess();
replicate the safe pattern used in BlockOIDCMapping/UnblockOIDCMapping by
checking if err != nil, logging the error (including internal case), calling
setErrorStateAndContinue or setErrorStateAndFail as appropriate (use
WorkingStateOIDCMappingRemoveFailed and setErrorStateAndContinue for
transient/internal handling per the surrounding logic), and return true so
execution never reaches grpcResp.GetSuccess() when err != nil.
♻️ Duplicate comments (2)
internal/event-processor/handlers_system.go (1)

312-319: ⚠️ Potential issue | 🟠 Major

Don’t auto-confirm decommission jobs.

A canceled decommission job can be requeued during offboarding. With CompleteJobConfirmer(), a late result from the old job is still accepted and can run terminate() on a system that has already moved on. Please gate confirmation on the current system state instead of unconditionally completing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 312 - 319, The
handler currently auto-confirms all decommission jobs in
SystemUnlinkDecommissionJobHandler.HandleJobConfirm by returning
orbital.CompleteJobConfirmer(), which allows stale jobs to run terminate() on
systems that have already moved on; change HandleJobConfirm to look up the
current system state (using the system ID or metadata from the incoming
orbital.Job), and only return orbital.CompleteJobConfirmer() when the system is
in a state that permits decommission (e.g., still assigned/active and not
already offboarded); otherwise return the appropriate non-confirming result
(e.g., cancel/reject/skip confirmer) so late jobs are not allowed to proceed.
Ensure you use the existing job/system lookup helpers in this package and keep
the logic within SystemUnlinkDecommissionJobHandler.HandleJobConfirm.
internal/event-processor/handlers_key.go (1)

107-110: ⚠️ Potential issue | 🟠 Major

Make detach audit logging non-fatal.

The key is already persisted as DETACHED here. Returning an error on auditor failure can poison a logically completed detach and trigger retries or duplicate callbacks for no state change.

Suggested fix
-	err = h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String())
-	if err != nil {
-		return fmt.Errorf("failed to send CMK detach audit log for key %s: %w", key.ID, err)
-	}
+	if err := h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()); err != nil {
+		log.Error(ctx, "Failed to send CMK detach audit log", err, slog.String("keyID", key.ID.String()))
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 107 - 110, The handler
currently returns an error when h.cmkAuditor.SendCmkDetachAuditLog(ctx,
key.ID.String()) fails (using return fmt.Errorf(...)), which makes an
already-persisted DETACHED key retryable; change this to treat the auditor
failure as non-fatal: capture the error from SendCmkDetachAuditLog, log a
warning including key.ID.String() and the error (replace the fmt.Errorf return
with a non-fatal log call using the component logger available in the handler),
and continue normal execution so the detach is not retried or duplicated.
🧹 Nitpick comments (1)
internal/event-processor/factory_test.go (1)

135-245: Add a factory test for SystemUnlinkDecommission.

This table was updated for the SystemUnlink signature change, but the new decommission job path introduced by this PR is still not exercised here. A direct assertion on the created job type and SystemActionJobData would catch regressions in the offboarding path much earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/factory_test.go` around lines 135 - 245, The tests
table in factory_test.go exercises SystemLink/SystemUnlink/SystemSwitch variants
but misses a case for the new SystemUnlinkDecommission path; add a test entry in
the tests slice that calls eventProcessor.SystemUnlinkDecommission (or invokes
SystemUnlink in the way that triggers decommission), supplies tenantID and
keyIDFrom, and asserts the created orbital.Job has type
eventprocessor.JobTypeSystemUnlinkDecommission (or the decommission job type)
and that the unmarshaled eventprocessor.SystemActionJobData.KeyIDFrom matches
the provided keyIDFrom; place the case alongside the other unlink cases and
reuse the assertKeyID helper pattern to validate job type and expStatus
(cmkapi.SystemStatusPROCESSING) so the offboarding path is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/event-processor/handlers_key.go`:
- Around line 91-96: HandleJobConfirm currently always returns
CompleteJobConfirmer which can incorrectly accept confirmations after a
cancellation; update KeyDetachJobHandler.HandleJobConfirm to read the current
key state (same store/API used by HandleJobCanceledEvent and as used in
handleSystemJobConfirm), and only return orbital.CompleteJobConfirmer() when the
key is in the expected DETACHING state; if the key state is UNKNOWN or anything
else return orbital.CancelJobConfirmer() so the late confirmation does not
overwrite the retry/cancelled state.

In `@internal/event-processor/handlers_system.go`:
- Around line 354-376: The cancel handler HandleJobCanceledEvent in
SystemUnlinkDecommissionJobHandler currently sets system.Status =
cmkapi.SystemStatusFAILED; change it to restore the prior/connected state by
setting system.Status = cmkapi.SystemStatusCONNECTED after retrieving the system
(via unmarshalSystemJobData and getSystemByID) and before calling updateSystem;
ensure you only change the status on cancel and preserve the rest of the system
object when calling updateSystem.

---

Outside diff comments:
In `@internal/operator/operator.go`:
- Around line 425-455: In removeOIDCMapping, don't dereference grpcResp when err
is non-nil: immediately handle any gRPC error (including codes.Internal) before
accessing grpcResp.GetSuccess(); replicate the safe pattern used in
BlockOIDCMapping/UnblockOIDCMapping by checking if err != nil, logging the error
(including internal case), calling setErrorStateAndContinue or
setErrorStateAndFail as appropriate (use WorkingStateOIDCMappingRemoveFailed and
setErrorStateAndContinue for transient/internal handling per the surrounding
logic), and return true so execution never reaches grpcResp.GetSuccess() when
err != nil.

---

Duplicate comments:
In `@internal/event-processor/handlers_key.go`:
- Around line 107-110: The handler currently returns an error when
h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()) fails (using return
fmt.Errorf(...)), which makes an already-persisted DETACHED key retryable;
change this to treat the auditor failure as non-fatal: capture the error from
SendCmkDetachAuditLog, log a warning including key.ID.String() and the error
(replace the fmt.Errorf return with a non-fatal log call using the component
logger available in the handler), and continue normal execution so the detach is
not retried or duplicated.

In `@internal/event-processor/handlers_system.go`:
- Around line 312-319: The handler currently auto-confirms all decommission jobs
in SystemUnlinkDecommissionJobHandler.HandleJobConfirm by returning
orbital.CompleteJobConfirmer(), which allows stale jobs to run terminate() on
systems that have already moved on; change HandleJobConfirm to look up the
current system state (using the system ID or metadata from the incoming
orbital.Job), and only return orbital.CompleteJobConfirmer() when the system is
in a state that permits decommission (e.g., still assigned/active and not
already offboarded); otherwise return the appropriate non-confirming result
(e.g., cancel/reject/skip confirmer) so late jobs are not allowed to proceed.
Ensure you use the existing job/system lookup helpers in this package and keep
the logic within SystemUnlinkDecommissionJobHandler.HandleJobConfirm.

---

Nitpick comments:
In `@internal/event-processor/factory_test.go`:
- Around line 135-245: The tests table in factory_test.go exercises
SystemLink/SystemUnlink/SystemSwitch variants but misses a case for the new
SystemUnlinkDecommission path; add a test entry in the tests slice that calls
eventProcessor.SystemUnlinkDecommission (or invokes SystemUnlink in the way that
triggers decommission), supplies tenantID and keyIDFrom, and asserts the created
orbital.Job has type eventprocessor.JobTypeSystemUnlinkDecommission (or the
decommission job type) and that the unmarshaled
eventprocessor.SystemActionJobData.KeyIDFrom matches the provided keyIDFrom;
place the case alongside the other unlink cases and reuse the assertKeyID helper
pattern to validate job type and expStatus (cmkapi.SystemStatusPROCESSING) so
the offboarding path is covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 888f2abb-1da0-4272-9304-54cf61c70077

📥 Commits

Reviewing files that changed from the base of the PR and between c065b21 and 9c27616.

📒 Files selected for processing (19)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • internal/manager/system.go
  • internal/event-processor/reconciler.go
  • internal/api/cmkapi/cmkapi.go
  • apis/cmk/cmk-ui.yaml
  • internal/manager/tenant_test.go
  • test/integration/event-processor/reconciliation_test.go

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch from 9c27616 to 243e8d3 on March 12, 2026 at 16:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/event-processor/factory_test.go (1)

180-205: ⚠️ Potential issue | 🟠 Major

Add coverage for the decommission unlink factory path.

These changes only adapt the existing SystemUnlink tests to the new signature. The PR also introduces SystemUnlinkDecommission, which has intentionally different behavior from a normal unlink, but this file still has no creation/confirmation coverage for that entrypoint. Please add dedicated cases here so regressions in the new factory wiring are caught.

Also applies to: 300-303

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/factory_test.go` around lines 180 - 205, Add test
cases to cover the new SystemUnlinkDecommission factory path: update the
table-driven tests around the SystemUnlink cases to include at least one case
that calls eventProcessor.SystemUnlinkDecommission(ctx, system, from) (mirroring
the existing SystemUnlink cases) and asserts the expected behavior — creation of
a decommission unlink job, correct job type (use
eventprocessor.JobTypeSystemUnlinkDecommission or the string it should produce),
and any status transitions (e.g., system set to PROCESSING) and expected errors;
ensure you include both success and error scenarios (missing tenant, system in
PROCESSING) analogous to the existing SystemUnlink tests and reference the same
test scaffolding (the test table, systemEventFn, systemStatus, tenantID,
keyIDFrom, expType, expStatus, expErr) so the new cases integrate with the
current test loop.
internal/operator/operator.go (1)

430-455: ⚠️ Potential issue | 🔴 Critical

Guard against nil grpcResp on error paths.

When the RPC returns a non-nil error (particularly internal errors), grpcResp is nil per gRPC semantics. The code currently dereferences grpcResp.GetSuccess() at line 449 without checking for this case. When an internal RPC error occurs (st.Code() == codes.Internal), the code logs at line 440 but does not return early, allowing execution to fall through and panic on the nil dereference. This crash can occur during tenant termination before offboarding completes.

The proposed fix correctly guards all grpcResp access by returning early when err != nil.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator/operator.go` around lines 430 - 455, The RemoveOIDCMapping
call can return a non-nil error and a nil grpcResp; after calling
o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping and extracting
st via status.FromError, ensure you return early on any non-nil err (after
logging the internal error case) instead of continuing and dereferencing
grpcResp; update the branch around err != nil to call
setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
and return immediately so grpcResp.GetSuccess() is only accessed when err ==
nil, and preserve the existing setErrorStateAndFail flow for a non-success
grpcResp.
♻️ Duplicate comments (9)
internal/event-processor/repo.go (1)

64-85: ⚠️ Potential issue | 🟠 Major

Check whether the patch actually updated a row.

Both helpers still ignore the bool returned by r.Patch. If the keyed row is missing, they return nil and the caller proceeds as if the new system/key state was persisted. Please fail when updated == false.

🐛 Proposed fix
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}

 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}

 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/repo.go` around lines 64 - 85, Both updateSystem and
updateKey call r.Patch but ignore the boolean "updated" result; change each to
capture the returned values (updated, err := r.Patch(...)), return the wrapped
error on err, and if updated is false return a new error (e.g. fmt.Errorf("no
row updated for system %s", system.ID) / "for key %s") so callers don't assume
the row was persisted; update the error messages in updateSystem and updateKey
accordingly to include this failure case.
internal/event-processor/handlers_system.go (2)

314-319: ⚠️ Potential issue | 🟠 Major

Use the standard system confirmer here too.

This decommission path currently accepts confirmations unconditionally, so late task completions from canceled/superseded jobs can still run the terminal handler. It should use the same handleSystemJobConfirm(...) gate as the other system job handlers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 314 - 319, The
SystemUnlinkDecommissionJobHandler.HandleJobConfirm currently accepts
confirmations unconditionally; replace its body to call and return the shared
gate function handleSystemJobConfirm(...) (same used by other system job
handlers) so late completions from canceled/superseded jobs are rejected; locate
SystemUnlinkDecommissionJobHandler.HandleJobConfirm and change the return to the
result of handleSystemJobConfirm with the same parameters/signature used
elsewhere.

356-376: ⚠️ Potential issue | 🟠 Major

Restore CONNECTED on decommission cancel.

The decommission cancel path is supposed to put the system back into a retryable state so that the next offboarding reconciliation recreates the unlink job. Setting FAILED here breaks that contract and turns the flow into manual recovery.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 356 - 376, In
HandleJobCanceledEvent (SystemUnlinkDecommissionJobHandler) we currently set
system.Status = cmkapi.SystemStatusFAILED which prevents automatic retry; change
the status to the retryable connected state by setting system.Status =
cmkapi.SystemStatusCONNECTED before calling updateSystem(ctx, h.repo, system) so
the next offboarding reconciliation can recreate the unlink job.
internal/event-processor/factory.go (1)

145-155: ⚠️ Potential issue | 🔴 Critical

Keep the PROCESSING transition and job creation atomic.

This path commits the status change before createSystemEventJob(...) runs. If job creation fails, the system is left in PROCESSING with no job/event to drive recovery. Reuse the guarded handleSystemStatus(...) flow here, or create the job inside the same transaction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/factory.go` around lines 145 - 155, The code sets
system.Status to PROCESSING and commits that change inside f.repo.Transaction,
then calls createSystemEventJob(...) outside the transaction, which can leave
the system stuck if job creation fails; modify the logic to make the status
transition and job creation atomic by either (A) moving the createSystemEventJob
call into the same f.repo.Transaction callback so both the Patch and job
creation occur under the same transactional guard, or (B) reuse the existing
handleSystemStatus(...) flow to perform the PROCESSING transition and job
creation together; ensure you reference and update system.Status, use
f.repo.Transaction, and invoke createSystemEventJob (or call handleSystemStatus)
within that transaction callback so failures roll back the status change.
internal/manager/system.go (1)

432-444: ⚠️ Potential issue | 🔴 Critical

Guard m.registry before calling Mapping().

NewSystemManager allows m.registry to be nil, so this method can panic during tenant offboarding instead of returning a recoverable error. Please add the same defensive check used in the other registry-backed paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/system.go` around lines 432 - 444, UnmapSystemFromRegistry
currently calls m.registry.Mapping() without checking m.registry and can panic
when m.registry is nil; update UnmapSystemFromRegistry in SystemManager to guard
m.registry like other registry-backed methods by returning a recoverable error
(e.g., nil registry error) if m.registry == nil before calling
m.registry.Mapping(), and otherwise proceed to call Mapping(). Ensure you
reference the SystemManager receiver, the m.registry field, and the
UnmapSystemFromRegistry method when making the change so behavior matches other
registry-backed paths.
internal/event-processor/reconciler_test.go (1)

913-940: ⚠️ Potential issue | 🟠 Major

Expect CONNECTED after decommission cancellation.

This assertion bakes in the opposite behavior from the PR contract: a canceled SYSTEM_UNLINK_DECOMMISSION should restore the system to CONNECTED so the next offboarding reconciliation can recreate the job. Keeping FAILED here will make the test validate the bug instead of catching it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/reconciler_test.go` around lines 913 - 940, The test
currently asserts sysAfter.Status is cmkapi.SystemStatusFAILED but the contract
requires restoring CONNECTED on a canceled SYSTEM_UNLINK_DECOMMISSION; change
the assertion in the test (in the t.Run "System status on canceled
SYSTEM_UNLINK_DECOMMISSION job termination") to expect
cmkapi.SystemStatusCONNECTED for sysAfter.Status after calling
eventProcessor.JobCanceledFunc (using the SystemActionJobData and
JobTypeSystemUnlinkDecommission flow), so the test verifies cancellation
restores CONNECTED instead of FAILED.
internal/event-processor/handlers_key.go (2)

107-110: ⚠️ Potential issue | 🟠 Major

Don't make a completed detach depend on audit-log delivery.

The key has already been updated to DETACHED before this call. Returning an error here can poison the done hook and cause retries even though the detach itself already succeeded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 107 - 110, The code
currently treats failures from h.cmkAuditor.SendCmkDetachAuditLog as fatal and
returns an error, which can trigger retries even though the key was already set
to DETACHED; change this so audit-log delivery failures do not block completion:
in the handler that calls h.cmkAuditor.SendCmkDetachAuditLog (use the existing
call site and key.ID), do not return an error on failure—instead log the error
(e.g., h.logger.Warnf or h.logger.Error) and continue, or dispatch the
SendCmkDetachAuditLog call asynchronously (goroutine) and swallow/log errors,
ensuring the done hook is not poisoned by audit log delivery failures.

91-95: ⚠️ Potential issue | 🟠 Major

Only confirm detach while the key is still DETACHING.

This unconditional confirmer accepts late results after HandleJobCanceledEvent has already moved the key to UNKNOWN. That lets an old completion overwrite the retry state and incorrectly finalize the key as DETACHED.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 91 - 95, The confirmer
unconditionally completes detach jobs; change
KeyDetachJobHandler.HandleJobConfirm to first fetch the key's current state (via
the handler's store/repo used elsewhere in this handler) and only return
orbital.CompleteJobConfirmer() if the key state is exactly DETACHING; otherwise
return a no-op confirmer (e.g., orbital.NoopJobConfirmer() or the project's
equivalent) so late completions cannot overwrite state after
HandleJobCanceledEvent moved the key to UNKNOWN. Ensure you reference
KeyDetachJobHandler.HandleJobConfirm and the key-state enum/value DETACHING when
locating where to add the guard.
internal/manager/tenant.go (1)

298-327: ⚠️ Potential issue | 🟠 Major

Don't downgrade batch-processing failures to OffboardingProcessing.

If ProcessInBatch fails, the unmap step may have been skipped or only partially executed. Returning OffboardingProcessing here hides that infrastructure failure behind an endless retry loop instead of surfacing it as a real offboarding failure/error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant.go` around lines 298 - 327, The batch-processing
error from repo.ProcessInBatch is being downgraded to OffboardingProcessing
which hides real failures; change the error branch after repo.ProcessInBatch
(the block that currently logs and returns false, OffboardingProcessing) to log
the error and return false, OffboardingFailed so infrastructure failures surface
as offboarding failures; update the code paths around repo.ProcessInBatch,
m.sys.UnmapSystemFromRegistry and m.unmapSystemErrorCanContinue accordingly to
preserve the existing per-system status flags (hasOffboardingFailed /
hasOffboardingProcessing) but treat a non-nil err from ProcessInBatch as an
overall OffboardingFailed result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/event-processor/handlers_system.go`:
- Around line 396-408: The issue is that sendL1KeyClaim(..., false) is called
after clearing KeyConfigurationID and persisting the system, making cleanup
non-retryable if the external call fails; change the order so the external claim
reset is attempted first and only on success update the local system state.
Concretely, call sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
before mutating system.Status and system.KeyConfigurationID and before calling
updateSystem(ctx, h.repo, system); if sendL1KeyClaim returns an error, return
that error without persisting changes (or wrap both operations in a
transactional/compensating flow) so retries can succeed. Ensure you reference
the same system instance used in sendL1KeyClaim and preserve error handling
around updateSystem and sendL1KeyClaim.

In `@internal/event-processor/reconciler_test.go`:
- Around line 1081-1084: The helper terminateNewJob currently swallows
termination-hook errors by logging them; update the error handling to fail the
test immediately by replacing the t.Logf branch with a require.NoError assertion
(use require.NoError(t, err)) so any hook error in terminateNewJob causes the
test to fail and surfaces the real failure in the new state-transition tests.

In `@internal/manager/tenant_test.go`:
- Around line 398-400: Replace the inline errors.New("some error") in the test
with a package-level sentinel error to avoid err113; declare a top-level var
(e.g., var errTestGeneric = errors.New("some error")) in the same package and
use that sentinel in the test call to m.UnmapSystemErrorCanContinue(ctx, err)
(keep the test name and local variables ok, st unchanged).
- Around line 197-201: The test creates new keys with IsPrimary=true via
testutils.NewKey but never updates the shared keyConfig.PrimaryKeyID, so
assertions end up checking the wrong record; update the fixture so that after
creating the key (the testutils.NewKey call) you assign the created key's ID to
keyConfig.PrimaryKeyID (or alternatively change the assertions to reference the
original primary key), and make the same fix for the other occurrences that use
testutils.NewKey with IsPrimary=true so the keyConfig.PrimaryKeyID correctly
points to the inserted primary key when verifying the offboard/detach behavior.

---

Outside diff comments:
In `@internal/event-processor/factory_test.go`:
- Around line 180-205: Add test cases to cover the new SystemUnlinkDecommission
factory path: update the table-driven tests around the SystemUnlink cases to
include at least one case that calls
eventProcessor.SystemUnlinkDecommission(ctx, system, from) (mirroring the
existing SystemUnlink cases) and asserts the expected behavior — creation of a
decommission unlink job, correct job type (use
eventprocessor.JobTypeSystemUnlinkDecommission or the string it should produce),
and any status transitions (e.g., system set to PROCESSING) and expected errors;
ensure you include both success and error scenarios (missing tenant, system in
PROCESSING) analogous to the existing SystemUnlink tests and reference the same
test scaffolding (the test table, systemEventFn, systemStatus, tenantID,
keyIDFrom, expType, expStatus, expErr) so the new cases integrate with the
current test loop.

In `@internal/operator/operator.go`:
- Around line 430-455: The RemoveOIDCMapping call can return a non-nil error and
a nil grpcResp; after calling
o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping and extracting
st via status.FromError, ensure you return early on any non-nil err (after
logging the internal error case) instead of continuing and dereferencing
grpcResp; update the branch around err != nil to call
setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
and return immediately so grpcResp.GetSuccess() is only accessed when err ==
nil, and preserve the existing setErrorStateAndFail flow for a non-success
grpcResp.

---

Duplicate comments:
In `@internal/event-processor/factory.go`:
- Around line 145-155: The code sets system.Status to PROCESSING and commits
that change inside f.repo.Transaction, then calls createSystemEventJob(...)
outside the transaction, which can leave the system stuck if job creation fails;
modify the logic to make the status transition and job creation atomic by either
(A) moving the createSystemEventJob call into the same f.repo.Transaction
callback so both the Patch and job creation occur under the same transactional
guard, or (B) reuse the existing handleSystemStatus(...) flow to perform the
PROCESSING transition and job creation together; ensure you reference and update
system.Status, use f.repo.Transaction, and invoke createSystemEventJob (or call
handleSystemStatus) within that transaction callback so failures roll back the
status change.

In `@internal/event-processor/handlers_key.go`:
- Around line 107-110: The code currently treats failures from
h.cmkAuditor.SendCmkDetachAuditLog as fatal and returns an error, which can
trigger retries even though the key was already set to DETACHED; change this so
audit-log delivery failures do not block completion: in the handler that calls
h.cmkAuditor.SendCmkDetachAuditLog (use the existing call site and key.ID), do
not return an error on failure—instead log the error (e.g., h.logger.Warnf or
h.logger.Error) and continue, or dispatch the SendCmkDetachAuditLog call
asynchronously (goroutine) and swallow/log errors, ensuring the done hook is not
poisoned by audit log delivery failures.
- Around line 91-95: The confirmer unconditionally completes detach jobs; change
KeyDetachJobHandler.HandleJobConfirm to first fetch the key's current state (via
the handler's store/repo used elsewhere in this handler) and only return
orbital.CompleteJobConfirmer() if the key state is exactly DETACHING; otherwise
return a no-op confirmer (e.g., orbital.NoopJobConfirmer() or the project's
equivalent) so late completions cannot overwrite state after
HandleJobCanceledEvent moved the key to UNKNOWN. Ensure you reference
KeyDetachJobHandler.HandleJobConfirm and the key-state enum/value DETACHING when
locating where to add the guard.

In `@internal/event-processor/handlers_system.go`:
- Around line 314-319: The SystemUnlinkDecommissionJobHandler.HandleJobConfirm
currently accepts confirmations unconditionally; replace its body to call and
return the shared gate function handleSystemJobConfirm(...) (same used by other
system job handlers) so late completions from canceled/superseded jobs are
rejected; locate SystemUnlinkDecommissionJobHandler.HandleJobConfirm and change
the return to the result of handleSystemJobConfirm with the same
parameters/signature used elsewhere.
- Around line 356-376: In HandleJobCanceledEvent
(SystemUnlinkDecommissionJobHandler) we currently set system.Status =
cmkapi.SystemStatusFAILED which prevents automatic retry; change the status to
the retryable connected state by setting system.Status =
cmkapi.SystemStatusCONNECTED before calling updateSystem(ctx, h.repo, system) so
the next offboarding reconciliation can recreate the unlink job.

In `@internal/event-processor/reconciler_test.go`:
- Around line 913-940: The test currently asserts sysAfter.Status is
cmkapi.SystemStatusFAILED but the contract requires restoring CONNECTED on a
canceled SYSTEM_UNLINK_DECOMMISSION; change the assertion in the test (in the
t.Run "System status on canceled SYSTEM_UNLINK_DECOMMISSION job termination") to
expect cmkapi.SystemStatusCONNECTED for sysAfter.Status after calling
eventProcessor.JobCanceledFunc (using the SystemActionJobData and
JobTypeSystemUnlinkDecommission flow), so the test verifies cancellation
restores CONNECTED instead of FAILED.

In `@internal/event-processor/repo.go`:
- Around line 64-85: Both updateSystem and updateKey call r.Patch but ignore the
boolean "updated" result; change each to capture the returned values (updated,
err := r.Patch(...)), return the wrapped error on err, and if updated is false
return a new error (e.g. fmt.Errorf("no row updated for system %s", system.ID) /
"for key %s") so callers don't assume the row was persisted; update the error
messages in updateSystem and updateKey accordingly to include this failure case.

In `@internal/manager/system.go`:
- Around line 432-444: UnmapSystemFromRegistry currently calls
m.registry.Mapping() without checking m.registry and can panic when m.registry
is nil; update UnmapSystemFromRegistry in SystemManager to guard m.registry like
other registry-backed methods by returning a recoverable error (e.g., nil
registry error) if m.registry == nil before calling m.registry.Mapping(), and
otherwise proceed to call Mapping(). Ensure you reference the SystemManager
receiver, the m.registry field, and the UnmapSystemFromRegistry method when
making the change so behavior matches other registry-backed paths.

In `@internal/manager/tenant.go`:
- Around line 298-327: The batch-processing error from repo.ProcessInBatch is
being downgraded to OffboardingProcessing which hides real failures; change the
error branch after repo.ProcessInBatch (the block that currently logs and
returns false, OffboardingProcessing) to log the error and return false,
OffboardingFailed so infrastructure failures surface as offboarding failures;
update the code paths around repo.ProcessInBatch, m.sys.UnmapSystemFromRegistry
and m.unmapSystemErrorCanContinue accordingly to preserve the existing
per-system status flags (hasOffboardingFailed / hasOffboardingProcessing) but
treat a non-nil err from ProcessInBatch as an overall OffboardingFailed result.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4bd08724-92b1-4904-a095-3fedfeafb07d

📥 Commits

Reviewing files that changed from the base of the PR and between 9c27616 and 243e8d3.

📒 Files selected for processing (20)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/event-processor/export_test.go
  • internal/api/cmkapi/cmkapi.go
  • internal/manager/key.go

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch from 243e8d3 to dea0ed5 Compare March 12, 2026 16:14

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
internal/event-processor/handlers_key.go (1)

91-96: ⚠️ Potential issue | 🟠 Major

Implement state-based confirmation to prevent late job acceptance.

HandleJobConfirm unconditionally returns CompleteJobConfirmer(), but HandleJobCanceledEvent sets the key to UNKNOWN for retry. If a confirmation message arrives after cancellation, the job will still be accepted, potentially overwriting the retry state.

Follow the pattern used elsewhere: check the current key state before confirming. Return CancelJobConfirmer() if the key is not in the expected DETACHING state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 91 - 96,
HandleJobConfirm currently always returns orbital.CompleteJobConfirmer(), which
can accept late confirmations; update KeyDetachJobHandler.HandleJobConfirm to
read the current key state (the same store/lookup used by
HandleJobCanceledEvent) and only return orbital.CompleteJobConfirmer() when the
key is in the expected DETACHING state, otherwise return
orbital.CancelJobConfirmer() to avoid accepting confirmations after the key was
set to UNKNOWN for retry.
🧹 Nitpick comments (1)
internal/event-processor/helpers.go (1)

82-103: Consider using a different separator for error message aggregation.

The colon (:) separator may cause readability issues since error messages often contain colons themselves (e.g., "context deadline exceeded: timeout"). Consider using a more distinct separator like "; " or newline.

♻️ Proposed change
-	message := strings.Join(taskErrors, ":")
+	message := strings.Join(taskErrors, "; ")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/helpers.go` around lines 82 - 103, The current
mergeOrbitalTaskErrors function concatenates task error strings with ":" which
can collide with colons in messages; update mergeOrbitalTaskErrors (function
name) to join taskErrors using a clearer separator such as "; " or "\n" (e.g.,
strings.Join(taskErrors, "; ")) so aggregated messages remain readable and
unambiguous when returned as message from mergeOrbitalTaskErrors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/manager/tenant.go`:
- Around line 365-367: The log message in the error branch incorrectly mentions
OIDC mapping; update the call to log.Error(ctx, "error while removing OIDC
mapping", err) to accurately describe the failure (e.g., "error while removing
system-to-registry mapping" or "error unmapping system from registry") in the
function that performs system-to-registry unmapping (the branch where log.Error
is called with ctx and err, e.g., inside
UnmapSystemFromRegistry/RemoveSystemRegistryMapping); keep the same log.Error
signature and pass the existing err variable unchanged.

---

Duplicate comments:
In `@internal/event-processor/handlers_key.go`:
- Around line 91-96: HandleJobConfirm currently always returns
orbital.CompleteJobConfirmer(), which can accept late confirmations; update
KeyDetachJobHandler.HandleJobConfirm to read the current key state (the same
store/lookup used by HandleJobCanceledEvent) and only return
orbital.CompleteJobConfirmer() when the key is in the expected DETACHING state,
otherwise return orbital.CancelJobConfirmer() to avoid accepting confirmations
after the key was set to UNKNOWN for retry.

---

Nitpick comments:
In `@internal/event-processor/helpers.go`:
- Around line 82-103: The current mergeOrbitalTaskErrors function concatenates
task error strings with ":" which can collide with colons in messages; update
mergeOrbitalTaskErrors (function name) to join taskErrors using a clearer
separator such as "; " or "\n" (e.g., strings.Join(taskErrors, "; ")) so
aggregated messages remain readable and unambiguous when returned as message
from mergeOrbitalTaskErrors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1958ceb-36ef-4ecf-aa74-6c3d200b5371

📥 Commits

Reviewing files that changed from the base of the PR and between 243e8d3 and dea0ed5.

📒 Files selected for processing (20)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • internal/manager/export_test.go
  • internal/event-processor/resolvers.go
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/repo.go
  • internal/event-processor/factory_test.go
  • apis/cmk/cmk-ui.yaml

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch from dea0ed5 to f0aa47d Compare March 12, 2026 16:49

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (11)
internal/event-processor/handlers_system.go (3)

354-377: ⚠️ Potential issue | 🟠 Major

Restore CONNECTED on cancel per PR specification.

The PR objectives state: "Sets system status back to CONNECTED if the job is canceled (so an event is recreated and retried on next tenant offboarding reconciliation)." The current implementation sets FAILED, which contradicts this and may require manual intervention.

🐛 Proposed fix
-// HandleJobCanceledEvent puts the system to failed state so the system can be processed again during the next
-// decommission reconciliation loop. Avoid failed state as that would require manual intervention to recover.
+// HandleJobCanceledEvent restores the system to CONNECTED state so the event is recreated and retried 
+// on the next tenant offboarding reconciliation.
 func (h *SystemUnlinkDecommissionJobHandler) HandleJobCanceledEvent(ctx context.Context, job orbital.Job) error {
 	ctx = log.InjectSystemEvent(ctx, job.Type)

 	data, err := unmarshalSystemJobData(job)
 	if err != nil {
 		return err
 	}

 	ctx = cmkcontext.CreateTenantContext(ctx, data.TenantID)
 	system, err := getSystemByID(ctx, h.repo, data.SystemID)
 	if err != nil {
 		return err
 	}

-	system.Status = cmkapi.SystemStatusFAILED
+	system.Status = cmkapi.SystemStatusCONNECTED
 	err = updateSystem(ctx, h.repo, system)
 	if err != nil {
 		return err
 	}

 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 354 - 377, The
handler HandleJobCanceledEvent in SystemUnlinkDecommissionJobHandler currently
sets system.Status to cmkapi.SystemStatusFAILED; change this to
cmkapi.SystemStatusCONNECTED so canceled jobs restore the system to CONNECTED,
leaving the existing unmarshalSystemJobData, getSystemByID and updateSystem
calls intact to persist the change; ensure you only modify the assignment of
system.Status and keep the updateSystem(ctx, h.repo, system) call to save the
new CONNECTED state.

396-408: ⚠️ Potential issue | 🟠 Major

Reset external key claim before clearing local linkage.

The current order persists DISCONNECTED with KeyConfigurationID = nil before calling sendL1KeyClaim. If the external call fails, later offboarding passes will consider the system fully unlinked and won't retry the cleanup. Swap the order so the external reset succeeds before dropping local linkage.

🐛 Proposed fix
 func (h *SystemUnlinkDecommissionJobHandler) terminate(
 	ctx context.Context,
 	job orbital.Job,
 ) (*model.System, string, error) {
 	ctx = log.InjectSystemEvent(ctx, job.Type)

 	data, err := unmarshalSystemJobData(job)
 	if err != nil {
 		return nil, "", err
 	}

 	ctx = cmkcontext.CreateTenantContext(ctx, data.TenantID)
 	system, err := getSystemByID(ctx, h.repo, data.SystemID)
 	if err != nil {
 		return nil, "", err
 	}

+	// Reset external key claim first, before clearing local linkage
+	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
+	if err != nil {
+		return nil, "", err
+	}
+
 	// For unlink decommission, we want to mark the system as disconnected and clean up the key claim regardless
 	// of whether the job succeeded or failed.
 	system.Status = cmkapi.SystemStatusDISCONNECTED
 	system.KeyConfigurationID = nil
 	err = updateSystem(ctx, h.repo, system)
 	if err != nil {
 		return nil, "", err
 	}

-	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
-	if err != nil {
-		return nil, "", err
-	}
-
 	return system, data.KeyIDFrom, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 396 - 408, The
external key-claim reset (sendL1KeyClaim) must be performed before clearing
local linkage and persisting DISCONNECTED; move the sendL1KeyClaim(ctx, *system,
h.registry, data.TenantID, false) call to occur before you set system.Status =
cmkapi.SystemStatusDISCONNECTED and system.KeyConfigurationID = nil and before
calling updateSystem(ctx, h.repo, system); on success, then set the fields and
call updateSystem; preserve existing error returns (propagate sendL1KeyClaim
errors so local state isn't cleared if the external reset fails).

312-319: ⚠️ Potential issue | 🟠 Major

Use the standard confirmer to reject late results from canceled jobs.

Always returning CompleteJobConfirmer() accepts late results from canceled or superseded jobs. After a cancel/retry cycle, an old task completion can still run terminate() and disconnect the system unexpectedly. Use handleSystemJobConfirm for consistency.

🐛 Proposed fix
-// HandleJobConfirm for SystemUnlinkDecommissionJobHandler will require no additional validation to minimise
-// job cancellation.
 func (h *SystemUnlinkDecommissionJobHandler) HandleJobConfirm(
-	_ context.Context,
-	_ orbital.Job,
+	ctx context.Context,
+	job orbital.Job,
 ) (orbital.JobConfirmerResult, error) {
-	return orbital.CompleteJobConfirmer(), nil
+	return handleSystemJobConfirm(ctx, h.repo, job)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 312 - 319, The
handler SystemUnlinkDecommissionJobHandler.HandleJobConfirm currently always
returns orbital.CompleteJobConfirmer(), which accepts late results; change it to
use the shared reject-late-results helper by calling handleSystemJobConfirm with
the same context and job parameters and returning its orbital.JobConfirmerResult
and error so canceled/superseded tasks are rejected consistently; update the
function body to delegate to handleSystemJobConfirm(ctx, job) (preserving
parameter names) instead of returning CompleteJobConfirmer().
internal/event-processor/factory.go (1)

145-156: ⚠️ Potential issue | 🟠 Major

Job creation should be inside the transaction to avoid stuck PROCESSING state.

The transaction commits system.Status = PROCESSING before createSystemEventJob runs. If job creation fails, the system remains stuck in PROCESSING with no event record for recovery. Move job creation inside the transaction or implement a compensation mechanism.

🐛 Proposed fix
 func (f *EventFactory) SystemUnlinkDecommission(
 	ctx context.Context,
 	system *model.System,
 	keyID string,
 ) (orbital.Job, error) {
 	if system.Status == cmkapi.SystemStatusPROCESSING {
 		return orbital.Job{}, ErrSystemProcessing
 	}

 	systemUnlinkJobData := SystemActionJobData{
 		SystemID:  system.ID.String(),
 		KeyIDFrom: keyID,
 	}

+	var job orbital.Job
 	err := f.repo.Transaction(ctx, func(ctx context.Context) error {
 		system.Status = cmkapi.SystemStatusPROCESSING
-		_, err := f.repo.Patch(ctx, system, *repo.NewQuery().UpdateAll(true))
-		return err
+		if err := updateSystem(ctx, f.repo, system); err != nil {
+			return err
+		}
+		var err error
+		job, err = f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+		return err
 	})
-
-	if err != nil {
-		return orbital.Job{}, err
-	}
-
-	return f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+	return job, err
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/factory.go` around lines 145 - 156, The code sets
system.Status to PROCESSING and commits that change inside f.repo.Transaction
but calls createSystemEventJob afterwards, risking a stuck PROCESSING state if
job creation fails; move the call to createSystemEventJob (for
JobTypeSystemUnlinkDecommission with systemUnlinkJobData) into the transaction
callback passed to f.repo.Transaction so the transaction only commits when both
the f.repo.Patch and createSystemEventJob succeed, returning any error from
createSystemEventJob from the transaction function; ensure you propagate that
error out of Transaction and return the created orbital.Job only on success.
internal/event-processor/repo.go (1)

64-85: ⚠️ Potential issue | 🟠 Major

Return an error when no row was patched.

Both updateSystem and updateKey ignore the boolean returned by r.Patch. If no row matches the query, callers will assume the update succeeded when nothing was persisted. This can cause silent failures during job termination flows.

🐛 Proposed fix
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}

 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}

 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/repo.go` around lines 64 - 85, The patch functions
updateSystem and updateKey currently ignore the boolean result returned by
r.Patch; change both to inspect the returned "patched" boolean (from the call to
r.Patch(ctx, system|key, *query)) and return a descriptive error if patched is
false (e.g., "no rows patched for system %s" / "no rows patched for key %s"),
keeping the existing fmt.Errorf wrapping for underlying errors; ensure you still
return the error on r.Patch failure and only treat the boolean false as an
additional error case, referencing the existing composite key construction (ck
:= repo.NewCompositeKey().Where(repo.IDField, ...)) and query
(repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)).
internal/event-processor/handlers_key.go (2)

107-110: ⚠️ Potential issue | 🟠 Major

Don't fail the handler on audit log delivery failure.

The key is already persisted as DETACHED before the audit call. Returning an error here causes retries or duplicate callbacks for a logically completed job. Log the failure instead.

🐛 Proposed fix
-	err = h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String())
-	if err != nil {
-		return fmt.Errorf("failed to send CMK detach audit log for key %s: %w", key.ID, err)
-	}
+	if err := h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()); err != nil {
+		log.Error(ctx, "Failed to send CMK detach audit log", err)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 107 - 110, The handler
currently returns an error when h.cmkAuditor.SendCmkDetachAuditLog(ctx,
key.ID.String()) fails, which triggers retries even though the key is already
marked DETACHED; change this to log the failure and continue. Replace the
"return fmt.Errorf(...)" path after calling h.cmkAuditor.SendCmkDetachAuditLog
with a non-fatal log call (e.g., h.logger.Errorf or similar used in this
package) that includes the key ID and the error, and do not return an error from
this block so the handler completes successfully.

91-96: ⚠️ Potential issue | 🟠 Major

Implement state-based confirmation to reject late results.

HandleJobConfirm unconditionally returns CompleteJobConfirmer(). If a confirmation arrives after cancellation (when key is set to UNKNOWN), it will still be accepted. Check the key state before confirming, similar to handleSystemJobConfirm.

🐛 Proposed fix
 func (h *KeyDetachJobHandler) HandleJobConfirm(
-	_ context.Context,
-	_ orbital.Job,
+	ctx context.Context,
+	job orbital.Job,
 ) (orbital.JobConfirmerResult, error) {
+	data, err := unmarshalKeyJobData(job)
+	if err != nil {
+		return orbital.CancelJobConfirmer(fmt.Sprintf("failed to unmarshal job data: %v", err)), err
+	}
+
+	ctx = cmkcontext.CreateTenantContext(ctx, data.TenantID)
+	key, err := getKeyByKeyID(ctx, h.repo, data.KeyID)
+	if err != nil {
+		if errors.Is(err, repo.ErrNotFound) {
+			return orbital.CancelJobConfirmer(fmt.Sprintf("key %s not found", data.KeyID)), nil
+		}
+		return nil, fmt.Errorf("failed to get key by ID %s: %w", data.KeyID, err)
+	}
+
+	if key.State != string(cmkapi.KeyStateDETACHING) {
+		return orbital.CancelJobConfirmer(fmt.Sprintf("key %s is not in DETACHING state", key.ID)), nil
+	}
+
 	return orbital.CompleteJobConfirmer(), nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 91 - 96,
HandleJobConfirm currently always returns orbital.CompleteJobConfirmer(), which
accepts confirmations even after a key was cancelled; update
KeyDetachJobHandler.HandleJobConfirm to first load the key state (like
handleSystemJobConfirm does) and if the key state is UNKNOWN (cancelled) return
the job cancellation confirmer (orbital.CancelJobConfirmer()) instead of
CompleteJobConfirmer(); otherwise proceed to complete. Ensure you reference
KeyDetachJobHandler.HandleJobConfirm, the UNKNOWN key state, and
CompleteJobConfirmer when making the change.
internal/manager/system.go (1)

432-445: ⚠️ Potential issue | 🟠 Major

Guard against nil registry client.

NewSystemManager explicitly allows m.registry to be nil, but UnmapSystemFromRegistry unconditionally calls m.registry.Mapping(). This will panic during tenant offboarding if the manager was created without a registry client.

🛡️ Proposed fix
 func (m *SystemManager) UnmapSystemFromRegistry(ctx context.Context, system *model.System) error {
+	if m.registry == nil {
+		return fmt.Errorf("registry client not available for UnmapSystemFromRegistry")
+	}
+
 	tenant, err := cmkcontext.ExtractTenantID(ctx)
 	if err != nil {
 		return err
 	}

 	_, err = m.registry.Mapping().UnmapSystemFromTenant(ctx, &mappingv1.UnmapSystemFromTenantRequest{
 		ExternalId: system.Identifier,
 		Type:       strings.ToLower(system.Type),
 		TenantId:   tenant,
 	})

 	return err
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/system.go` around lines 432 - 445, UnmapSystemFromRegistry
calls m.registry.Mapping() but NewSystemManager allows m.registry to be nil, so
add a nil-check at the top of UnmapSystemFromRegistry: if m.registry == nil then
return a recoverable error to avoid a panic during tenant offboarding; otherwise proceed
to call m.registry.Mapping().UnmapSystemFromTenant with the existing arguments
(ExternalId: system.Identifier, Type: strings.ToLower(system.Type), TenantId:
tenant). Ensure the function continues to return any error from
UnmapSystemFromTenant when the registry is present.
internal/manager/tenant_test.go (1)

196-202: ⚠️ Potential issue | 🟡 Minor

Keep PrimaryKeyID aligned with the inserted primary key.

Both subtests create a new key with IsPrimary = true, but the shared keyConfig.PrimaryKeyID still points to the original key created at Lines 171-180. That leaves the fixture in an inconsistent state and can make the detach assertions validate the wrong record if any part of the offboarding path resolves primaries via PrimaryKeyID rather than IsPrimary.

key := testutils.NewKey(func(k *model.Key) {
	k.KeyConfigurationID = keyConfig.ID
	k.IsPrimary = true
	k.State = string(cmkapi.KeyStateDETACHING) // or DETACHED/ENABLED per subtest
})
keyConfig.PrimaryKeyID = ptr.PointTo(key.ID)

Also applies to: 244-250

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant_test.go` around lines 196 - 202, The test creates new
primary keys via testutils.NewKey (setting model.Key.IsPrimary = true) but never
updates keyConfig.PrimaryKeyID, leaving keyConfig inconsistent; after each
NewKey that sets IsPrimary = true (the keys used in the subtests), update
keyConfig.PrimaryKeyID to point to that key's ID (e.g., set
keyConfig.PrimaryKeyID = ptr.PointTo(key.ID)) so PrimaryKeyID is aligned with
the inserted primary key for both affected subtests.
internal/manager/tenant.go (2)

365-367: ⚠️ Potential issue | 🟡 Minor

Fix the stale log message.

This branch is handling registry unmap failures, not OIDC cleanup. The current message will send on-call debugging in the wrong direction.

Suggested fix
-		log.Error(ctx, "error while removing OIDC mapping", err)
+		log.Error(ctx, "error while unmapping system from registry", err)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant.go` around lines 365 - 367, The log message in the
error branch that currently calls log.Error(ctx, "error while removing OIDC
mapping", err) is wrong for registry unmap failures; update the message to
accurately reflect a registry unmap/unmapping failure (e.g., "error while
unmapping registry" or similar) in the same error handling block so on-call
debugging points to registry unmap problems — this change should be made where
the code returns false, OffboardingProcessing for the err != nil branch.

114-118: ⚠️ Potential issue | 🟠 Major

Don't downgrade batch/repository failures to ordinary Processing.

repo.ProcessInBatch failing here means enumeration/batching itself broke, not that a single system returned a retryable registry error. Converting that into OffboardingProcessing hides the actual failure from OffboardTenant and can leave reconciliation spinning forever with only a log entry.

Suggested shape
- canContinue, st := m.unmapAllSystemsFromRegistry(ctx)
+ canContinue, st, err := m.unmapAllSystemsFromRegistry(ctx)
+ if err != nil {
+ 	return OffboardingResult{}, err
+ }
  if !canContinue {
  	return OffboardingResult{Status: st}, nil
  }
-func (m *TenantManager) unmapAllSystemsFromRegistry(ctx context.Context) (bool, OffboardingStatus) {
+func (m *TenantManager) unmapAllSystemsFromRegistry(ctx context.Context) (bool, OffboardingStatus, error) {
 	…
 	if err != nil {
-		log.Error(ctx, "error while processing systems in batch to unmap from registry", err)
-		return false, OffboardingProcessing
+		return false, 0, err
 	}
 	…
-		return true, 0
+		return true, 0, nil
 	}
}

Also applies to: 294-328

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant.go` around lines 114 - 118, The current code in
OffboardTenant converts any non-canContinue result from
m.unmapAllSystemsFromRegistry into OffboardingProcessing, which hides
batch/repository failures from callers; modify the handling so that if
unmapAllSystemsFromRegistry returns a terminal or error status (e.g., a status
produced when repo.ProcessInBatch fails), you return that specific
OffboardingResult{Status: st} directly instead of mapping it to
OffboardingProcessing. Locate the call to m.unmapAllSystemsFromRegistry (and the
similar block at ~294-328) and change the conditional to propagate
non-retryable/error statuses as-is, only turning transient registry errors into
OffboardingProcessing when the returned status explicitly indicates a retryable
registry issue. Ensure the logic distinguishes repo/batch failures from
per-system retryable errors before returning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7df641f-9854-4c99-a532-91f00dc74b8b

📥 Commits

Reviewing files that changed from the base of the PR and between dea0ed5 and f0aa47d.

📒 Files selected for processing (20)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • apis/cmk/cmk-ui.yaml
  • internal/event-processor/reconciler_test.go
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/helpers.go
  • internal/event-processor/factory_test.go

@minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch from f0aa47d to 2e7c866 on March 12, 2026, 18:27

@bkieling bkieling left a comment


Nicely done. Just a few comments.

@minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch from 2e7c866 to 1d2c99e on March 13, 2026, 13:58

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/operator/operator.go (1)

437-459: ⚠️ Potential issue | 🔴 Critical

Return on RemoveOIDCMapping errors before touching grpcResp.

The codes.Internal branch only logs and falls through. After an internal remove error, the code can still dereference or trust grpcResp, which can panic on a nil response or let tenant offboarding continue even though OIDC cleanup failed.

Suggested fix
 	st, ok := status.FromError(err)
-	if !ok {
+	if !ok && err != nil {
 		log.Error(ctx, "failed getting info on sessionManager error", err)
 	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
-		log.Error(ctx, "error while removing OIDC mapping", err)
+
+	if err != nil {
+		if ok && st.Code() == codes.Internal {
+			log.Error(ctx, "removeOIDC failed with internal err", err)
+		} else {
+			log.Error(ctx, "error while removing OIDC mapping", err)
+		}
 		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
 		return true
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator/operator.go` around lines 437 - 459, The code currently
logs an Internal error from RemoveOIDCMapping but continues and may dereference
grpcResp; update the Internal branch to treat it like other error branches by
calling setErrorStateAndContinue(ctx, resp, err,
WorkingStateOIDCMappingRemoveFailed) (or setErrorStateAndFail if intended) and
returning true immediately; modify the block handling st.Code() ==
codes.Internal around status.FromError(err) to set the error state and return
before any use of grpcResp, referencing st, err, grpcResp,
setErrorStateAndContinue, and WorkingStateOIDCMappingRemoveFailed.
♻️ Duplicate comments (3)
internal/event-processor/handlers_system.go (1)

400-410: ⚠️ Potential issue | 🟠 Major

Reset the external L1 claim before clearing the local link.

terminate() persists DISCONNECTED and KeyConfigurationID=nil before sendL1KeyClaim(..., false) runs. If that external call fails, later offboarding passes will treat the system as already unlinked and won't enqueue another cleanup attempt, leaving the registry claim stuck.

Suggested ordering
-	system.Status = cmkapi.SystemStatusDISCONNECTED
-	system.KeyConfigurationID = nil
-	err = updateSystem(ctx, h.repo, system)
-	if err != nil {
-		return nil, "", err
-	}
-
 	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
 	if err != nil {
 		return nil, "", err
 	}
+
+	system.Status = cmkapi.SystemStatusDISCONNECTED
+	system.KeyConfigurationID = nil
+	err = updateSystem(ctx, h.repo, system)
+	if err != nil {
+		return nil, "", err
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 400 - 410, The
current flow in terminate() sets system.Status = cmkapi.SystemStatusDISCONNECTED
and system.KeyConfigurationID = nil and persists via updateSystem before calling
sendL1KeyClaim(..., false), which can leave the external registry claim orphaned
on send failure; change the ordering so you call sendL1KeyClaim(ctx, *system,
h.registry, data.TenantID, false) before mutating/persisting system.Status and
system.KeyConfigurationID (or if you prefer to keep the persist-first approach,
roll back the persisted changes when sendL1KeyClaim returns an error), ensuring
updateSystem only runs after sendL1KeyClaim succeeds and referencing
sendL1KeyClaim, updateSystem, system.Status, system.KeyConfigurationID, and
terminate to locate the code to modify.
internal/manager/system.go (1)

432-444: ⚠️ Potential issue | 🟠 Major

Guard UnmapSystemFromRegistry when the registry client is absent.

NewSystemManager explicitly allows m.registry to be unset, but this method unconditionally dereferences m.registry.Mapping(). Any offboarding flow built in that mode will panic instead of returning a retryable error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/system.go` around lines 432 - 444, UnmapSystemFromRegistry
currently dereferences m.registry.Mapping() unconditionally; add a guard at the
start of UnmapSystemFromRegistry to check that m.registry and
m.registry.Mapping() are non-nil and if not return a retryable error indicating
the registry client is not configured/available. Locate the
UnmapSystemFromRegistry method and add that nil-check before calling
m.registry.Mapping().UnmapSystemFromTenant so offboarding flows in manager
instances created without a registry do not panic but return a retryable
failure.
internal/event-processor/handlers_key.go (1)

91-96: ⚠️ Potential issue | 🟠 Major

Cancel late detach confirmations unless the key is still DETACHING.

HandleJobConfirm still completes unconditionally. If HandleJobCanceledEvent already moved the key to UNKNOWN, a delayed confirmation can still run the done/failed hook and overwrite the retry state back to DETACHED. Only confirm while the persisted key state is still DETACHING; otherwise cancel the job.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 91 - 96,
HandleJobConfirm currently always returns orbital.CompleteJobConfirmer(),
causing late confirmations to overwrite retry state; change it to first load the
persisted key state (extract the key ID from the orbital.Job passed in the same
way other handler methods do) using the handler's store access used in
HandleJobCanceledEvent, and only return orbital.CompleteJobConfirmer() if the
persisted state equals "DETACHING" (or the handler's equivalent constant);
otherwise return orbital.CancelJobConfirmer() (and surface any DB error as the
method error).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 79a0a2a5-c437-407e-be82-1e0cadbad852

📥 Commits

Reviewing files that changed from the base of the PR and between dea0ed5 and 835f4b5.

📒 Files selected for processing (22)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • internal/operator/operator_test.go
  • sonar-project.properties
  • test/integration/event-processor/reconciliation_test.go
✅ Files skipped from review due to trivial changes (3)
  • sonar-project.properties
  • internal/event-processor/helpers.go
  • internal/manager/key.go
🚧 Files skipped from review as they are similar to previous changes (11)
  • internal/event-processor/resolvers.go
  • internal/event-processor/export_test.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • apis/cmk/cmk-ui.yaml
  • test/integration/event-processor/reconciliation_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/reconciler.go
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/factory.go


coderabbitai bot commented Mar 20, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/operator/operator.go (1)

431-459: ⚠️ Potential issue | 🔴 Critical

Return on RPC errors before dereferencing st or grpcResp.

On the success path, Line 441 can dereference a nil status from status.FromError(nil). On the error path, the codes.Internal branch falls through to grpcResp.GetSuccess() even though grpcResp is usually nil there.

🐛 Suggested change
 	grpcResp, err := o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(
 		ctx,
 		&oidcmappinggrpc.RemoveOIDCMappingRequest{
 			TenantId: tenantID,
 		},
 	)
-	st, ok := status.FromError(err)
-	if !ok {
-		log.Error(ctx, "failed getting info on sessionManager error", err)
-	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
-		log.Error(ctx, "error while removing OIDC mapping", err)
-		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
-		return true
-	}
+	if err != nil {
+		st, ok := status.FromError(err)
+		if !ok {
+			log.Error(ctx, "failed getting info on sessionManager error", err)
+		}
+		if st != nil && st.Code() == codes.Internal {
+			log.Error(ctx, "removeOIDC failed with internal err", err)
+		}
+
+		log.Error(ctx, "error while removing OIDC mapping", err)
+		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
+		return true
+	}
 
 	if !grpcResp.GetSuccess() {
 		err = errs.Wrapf(
 			ErrFailedResponse,
 			"session manager could not remove OIDC mapping: "+grpcResp.GetMessage(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator/operator.go` around lines 431 - 459, The code dereferences
st and grpcResp before ensuring err is nil; change the RemoveOIDCMapping
handling so you first check if err != nil, log it (including st when non-nil),
handle the codes.Internal case and return (or set error state) before touching
st.Code() or grpcResp; only call status.FromError and use st.Code() after
confirming err != nil (or use st safely), and before calling
grpcResp.GetSuccess() ensure grpcResp != nil (or return on error) in the path
around o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping,
updating the branches that call setErrorStateAndContinue and
setErrorStateAndFail accordingly.
♻️ Duplicate comments (4)
internal/event-processor/repo.go (1)

64-85: ⚠️ Potential issue | 🟠 Major

Handle Patch "not updated" results explicitly.

Both helpers ignore Patch’s boolean result, so a no-op patch is treated as success.

🐛 Proposed fix
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}

 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}

 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/repo.go` around lines 64 - 85, The updateSystem and
updateKey helpers ignore the boolean result from r.Patch so no-op patches are
treated as success; change both functions to capture the first return value
(e.g., updated, err := r.Patch(...)), check if err != nil as before, and also if
updated == false return a descriptive error (for example "no rows updated" or
"patch did not change anything" including system.ID or key.ID) so callers can
detect and handle no-op patch results.
internal/event-processor/reconciler_test.go (1)

1116-1119: ⚠️ Potential issue | 🟡 Minor

Fail this helper immediately on termination-hook errors.

These subtests are asserting terminal side effects, but Line 1118 only logs hook failures. A broken decommission or detach hook can still leave the test green.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/reconciler_test.go` around lines 1116 - 1119, The
termination-hook error currently only logs failures ("Job termination returned
error: %v") which lets subtests pass despite hook errors; change the handler in
reconciler_test.go so that when err != nil it calls t.Fatalf (or t.Fatalf with
the same message) instead of t.Logf, ensuring the test fails immediately on
termination-hook errors; update the error branch where "Job termination returned
error" is logged to use t.Fatalf and preserve the error message.
internal/event-processor/handlers_system.go (1)

400-412: ⚠️ Potential issue | 🟠 Major

Reset the external L1 claim before clearing the local link.

After Line 404, the system is already persisted as DISCONNECTED with KeyConfigurationID = nil. If sendL1KeyClaim(..., false) then fails, later offboarding passes will treat the system as fully unlinked and will not schedule another cleanup attempt.

🐛 Suggested change
 	// For unlink decommission, we want to mark the system as disconnected and clean up the key claim regardless
 	// of whether the job succeeded or failed.
-	system.Status = cmkapi.SystemStatusDISCONNECTED
-	system.KeyConfigurationID = nil
-	err = updateSystem(ctx, h.repo, system)
-	if err != nil {
-		return nil, "", err
-	}
-
 	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
 	if err != nil {
 		return nil, "", err
 	}
+
+	system.Status = cmkapi.SystemStatusDISCONNECTED
+	system.KeyConfigurationID = nil
+	err = updateSystem(ctx, h.repo, system)
+	if err != nil {
+		return nil, "", err
+	}
 
 	return system, data.KeyIDFrom, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 400 - 412, The code
clears system.KeyConfigurationID and persists DISCONNECTED before calling
sendL1KeyClaim, which can leave the external L1 claim uncleared if
sendL1KeyClaim fails; move the revocation call earlier: call sendL1KeyClaim(ctx,
*system, h.registry, data.TenantID, false) while the original KeyConfigurationID
is still present and only set system.KeyConfigurationID = nil and call
updateSystem(ctx, h.repo, system) after sendL1KeyClaim returns successfully; if
sendL1KeyClaim returns an error, return that error without mutating or
persisting the KeyConfigurationID so a later offboarding attempt can retry.
internal/event-processor/factory.go (1)

145-155: ⚠️ Potential issue | 🔴 Critical

Create the decommission job inside the same transaction as the status flip.

After Line 149 commits, any createSystemEventJob failure leaves the system stuck in PROCESSING with no persisted event/job to recover it.

🐛 Suggested change
-	err := f.repo.Transaction(ctx, func(ctx context.Context) error {
+	var job orbital.Job
+	err := f.repo.Transaction(ctx, func(ctx context.Context) error {
 		system.Status = cmkapi.SystemStatusPROCESSING
 		_, err := f.repo.Patch(ctx, system, *repo.NewQuery().UpdateAll(true))
-		return err
+		if err != nil {
+			return err
+		}
+
+		job, err = f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+		return err
 	})
 
 	if err != nil {
 		return orbital.Job{}, err
 	}
 
-	return f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+	return job, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/factory.go` around lines 145 - 155, The status flip
to PROCESSING and the creation of the decommission job must occur inside the
same transaction so failures roll back; move the call to f.createSystemEventJob
(JobTypeSystemUnlinkDecommission with systemUnlinkJobData) into the
f.repo.Transaction closure after setting system.Status and after f.repo.Patch,
and return any error from createSystemEventJob so the transaction will abort on
failure; ensure you call/create the job using repository methods available in
the transaction context so both the status update and job persist succeed
atomically.
🧹 Nitpick comments (1)
internal/event-processor/helpers.go (1)

67-88: Optionally skip empty task error messages before joining.

This avoids delimiter-only artifacts in merged error strings.

♻️ Proposed tweak
 	taskErrors := make([]string, 0, len(tasks))
 	for _, t := range tasks {
-		taskErrors = append(taskErrors, t.ErrorMessage)
+		if t.ErrorMessage == "" {
+			continue
+		}
+		taskErrors = append(taskErrors, t.ErrorMessage)
 	}
 	message := strings.Join(taskErrors, ":")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/helpers.go` around lines 67 - 88,
mergeOrbitalTaskErrors currently includes empty task error messages which
produces delimiter-only artifacts; update the loop in mergeOrbitalTaskErrors to
trim each t.ErrorMessage (use strings.TrimSpace) and only append non-empty
results to taskErrors (or otherwise filter out empty strings) before joining
with ":" so the returned message contains only real error text; reference
function mergeOrbitalTaskErrors and variables taskErrors and t.ErrorMessage when
making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/api/cmkapi/cmkapi.go`:
- Around line 404-405: The OpenAPI description for the State field (State
*KeyState on the Key/Key Version struct in cmkapi.go) incorrectly defines
UNKNOWN only as an authentication-failure state; update the json/tag comment and
any other occurrences (the other State field docs around the file, e.g., the
similar doc strings referenced) to state that UNKNOWN also represents a key
moved back to UNKNOWN when a detach job is canceled (i.e., include both
"authentication failure" and "canceled-detach" transitions), then regenerate the
OpenAPI/spec and clients so generated client docs reflect the new definition of
UNKNOWN.

In `@internal/manager/tenant_test.go`:
- Around line 268-279: The failing unmap assertions are passing prematurely
because the previous subtest leaves a linked system in the repository and
mockSystemManager.UnlinkSystemAction doesn't update repo state, so
checkAllSystemsUnlinked can make OffboardTenant return
OffboardingContinueAndWait without hitting UnmapSystemFromRegistry; fix by
isolating fixtures per subtest: ensure the system repository is reset or
recreate the tenant/system fixture before each unmap-related subtest (the t.Run
blocks that set mockSystemManager with unlinkErr and UnlinkSystemAction), or
explicitly remove/clear linked systems from the repo prior to invoking
m.OffboardTenant in those subtests so UnmapSystemFromRegistry/UnlinkSystemAction
code path is actually exercised (apply same change to the later subtest around
the other Unmap/Unlink assertions).

and t.ErrorMessage when making the change.
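The suggested filtering is a few lines; this sketch uses a simplified task type mirroring the t.ErrorMessage field named in the comment, not the project's real orbital types:

```go
package main

import (
	"fmt"
	"strings"
)

type task struct{ ErrorMessage string }

// mergeTaskErrors joins non-empty task error messages with ":". Trimming and
// filtering first avoids delimiter-only artifacts such as "a::b" when a task
// carries an empty or whitespace-only message.
func mergeTaskErrors(tasks []task) string {
	var msgs []string
	for _, t := range tasks {
		if m := strings.TrimSpace(t.ErrorMessage); m != "" {
			msgs = append(msgs, m)
		}
	}
	return strings.Join(msgs, ":")
}

func main() {
	tasks := []task{{"disk full"}, {"   "}, {""}, {"timeout"}}
	fmt.Println(mergeTaskErrors(tasks)) // disk full:timeout
}
```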

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 977cecbc-2aa3-4ecd-9ee1-75072d2d4f45

📥 Commits

Reviewing files that changed from the base of the PR and between 915f75f and 835f4b5.

📒 Files selected for processing (22)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • internal/operator/operator_test.go
  • sonar-project.properties
  • test/integration/event-processor/reconciliation_test.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
apis/cmk/cmk-ui.yaml (1)

2027-2032: ⚠️ Potential issue | 🟡 Minor

Document the new UNKNOWN detach outcome too.

The updated schema still says UNKNOWN is customer-held-only, but the new detach flow can also move a key to UNKNOWN when a detach job is canceled. That makes the generated contract inaccurate for system-managed/BYOK keys during tenant termination.

📝 Suggested wording
-        the states PENDING_DELETION, DELETED, FORBIDDEN and UNKNOWN are applicable only to customer held keys. Keys and Versions
-        are in UNKNOWN state if the authentication to the customer key fails due to any reason. FORBIDDEN state is for
+        the states PENDING_DELETION, DELETED and FORBIDDEN are applicable only to customer held keys. UNKNOWN is used when
+        authentication to the customer key fails and also when a detach job is canceled. FORBIDDEN state is for
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apis/cmk/cmk-ui.yaml` around lines 2027 - 2032, Update the description block
that begins "Indicates the current state of the Key/Key Version" to document
that the UNKNOWN state can also occur as a detach outcome (e.g., when a detach
job is cancelled), not only for customer-held keys; explicitly mention that
UNKNOWN may apply to system-managed and BYOK keys during tenant
termination/detach flows and ensure the sentence listing DETACHING/DETACHED
outcomes reflects this expanded applicability.
internal/operator/operator.go (1)

431-459: ⚠️ Potential issue | 🔴 Critical

removeOIDCMapping panics when a gRPC error is returned.

The codes.Internal branch (line 441) logs the error but does not return, allowing the code to fall through to line 450 where grpcResp.GetSuccess() is called. When an RPC error occurs, grpcResp is typically nil, causing a panic. Branch on err first and return early for all error cases before accessing the response.

Suggested fix
 	grpcResp, err := o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(
 		ctx,
 		&oidcmappinggrpc.RemoveOIDCMappingRequest{
 			TenantId: tenantID,
 		},
 	)
-	st, ok := status.FromError(err)
-	if !ok {
-		log.Error(ctx, "failed getting info on sessionManager error", err)
-	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
+
+	if err != nil {
+		if status.Code(err) == codes.Internal {
+			log.Error(ctx, "removeOIDC failed with internal err", err)
+			setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
+			return true
+		}
+
 		log.Error(ctx, "error while removing OIDC mapping", err)
 		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
 		return true
 	}
 
 	if !grpcResp.GetSuccess() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator/operator.go` around lines 431 - 459, The code calls
grpcResp.GetSuccess() even when the RPC returned an error, which can panic
because grpcResp may be nil; update the RemoveOIDCMapping handling to check err
first and return early for any error before accessing grpcResp: after calling
o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(...) use
status.FromError(err) only to enrich logs but if err != nil then log the error
(including internal vs other codes), call setErrorStateAndContinue /
setErrorStateAndFail as appropriate (using WorkingStateOIDCMappingRemoveFailed),
and return true; only proceed to inspect grpcResp.GetSuccess() when err is nil
to avoid the nil dereference.
♻️ Duplicate comments (8)
internal/event-processor/repo.go (1)

64-85: ⚠️ Potential issue | 🟠 Major

Fail when the patch matched no row.

Line 68 and Line 80 drop the bool returned by repo.Repo.Patch. If the row is gone or the keyed query matches nothing, these helpers still return nil, so the decommission flow can proceed as if the terminal system/key state was persisted when nothing was written.

🐛 Proposed fix
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}

 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)

-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}

 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/repo.go` around lines 64 - 85, The helpers
updateSystem and updateKey call r.Patch and ignore the boolean result that
indicates whether any row was matched; fix both functions (updateSystem,
updateKey) to capture the returned bool (e.g., ok, err := r.Patch(...)) and if
ok is false return an error (similar to the existing fmt.Errorf messages)
indicating the patch matched no rows so the decommission/termination state was
not persisted; keep existing error wrapping for err as well.
internal/api/cmkapi/cmkapi.go (1)

404-405: ⚠️ Potential issue | 🟡 Minor

Document canceled detach as another UNKNOWN transition.

Line 404, Line 463, and Line 613 still describe UNKNOWN only as an authentication-failure state, but this PR also moves a key back to UNKNOWN when a detach job is canceled. Please update the source OpenAPI description and regenerate the generated artifacts so clients see the full detach state machine.

Also applies to: 463-464, 613-614

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/cmkapi/cmkapi.go` around lines 404 - 405, Update the OpenAPI/Go
struct documentation for the State field so UNKNOWN includes both
authentication-failure and "detach job canceled" transitions: modify the comment
on State (*KeyState `json:"state,omitempty"`) and any other occurrences of the
Key state description (the other doc comments referencing KeyState at the same
locations) to explicitly mention that a canceled detach moves the key/version
back to UNKNOWN; after editing the source docstrings/OpenAPI description,
regenerate the OpenAPI artifacts and client/server code so the updated state
machine appears in generated clients.
internal/event-processor/factory.go (1)

136-155: ⚠️ Potential issue | 🔴 Critical

Keep decommission scheduling atomic with the status transition.

This path still commits system.Status = PROCESSING before createSystemEventJob runs. If tenant extraction or PrepareJob fails, the system is stranded in PROCESSING with no event/job to drive recovery. Reuse the same guarded flow as SystemUnlink (or extract a no-event variant) so the status update is rolled back when job creation fails.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/factory.go` around lines 136 - 155, The status
update to PROCESSING must be done atomically with job creation so failures in
tenant extraction/PrepareJob don't leave the system stuck; modify the flow
around f.repo.Transaction, f.repo.Patch, and createSystemEventJob (and the
SystemActionJobData/JobTypeSystemUnlinkDecommission path) so the PATCH to set
system.Status = cmkapi.SystemStatusPROCESSING and the call that creates/prepares
the job (createSystemEventJob or an extracted no-event variant used by
SystemUnlink) occur inside the same transaction/guarded flow and return an error
to rollback if job creation fails, mirroring the guarded pattern used by
SystemUnlink.
internal/manager/system.go (1)

432-444: ⚠️ Potential issue | 🔴 Critical

Guard m.registry before calling Mapping().

NewSystemManager still allows the registry client to be nil, but this method unconditionally dereferences it. Any offboarding flow built in that mode will panic instead of returning a retryable error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/system.go` around lines 432 - 444, The
UnmapSystemFromRegistry method dereferences m.registry without checking for nil;
update UnmapSystemFromRegistry in SystemManager to guard m.registry (or
m.registry.Mapping()) before calling Mapping(), and if nil return a retryable
error (e.g., context-aware temporary error) rather than panicking; also ensure
NewSystemManager's allowance of a nil registry is honored by consistently
returning that error path when m.registry is nil so offboarding flows fail
safely instead of causing a panic.
internal/event-processor/handlers_key.go (1)

91-96: ⚠️ Potential issue | 🟠 Major

Only confirm detach jobs while the key is still DETACHING.

HandleJobCanceledEvent now moves the key to UNKNOWN so offboarding can retry, but HandleJobConfirm still unconditionally returns CompleteJobConfirmer(). A late confirmation can therefore accept work after cancellation and overwrite that retry state. Mirror the system confirmer logic and cancel when the key is no longer DETACHING.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_key.go` around lines 91 - 96, Only confirm
detach jobs while the key is still DETACHING: in
KeyDetachJobHandler.HandleJobConfirm, read the current key state (using the same
store/client used by HandleJobCanceledEvent in this handler) and if the key's
state is not DETACHING return orbital.CancelJobConfirmer() (nil error); only
return orbital.CompleteJobConfirmer() when the live state is DETACHING. Mirror
the system confirmer behavior by querying the key state before completing the
job to avoid late confirmations overwriting the UNKNOWN retry state.
internal/event-processor/reconciler_test.go (1)

1116-1119: ⚠️ Potential issue | 🟡 Minor

Fail this helper on hook errors.

These tests are asserting termination side effects, but the helper still only logs JobDoneFunc/JobFailedFunc failures. That can leave a broken hook path green for the wrong reason. Use require.NoError here instead of t.Logf.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/reconciler_test.go` around lines 1116 - 1119, The
helper in reconciler_test.go currently swallows errors from the job termination
hook by logging them (the block using t.Logf for errors from
JobDoneFunc/JobFailedFunc); change this to fail the test on hook errors by
replacing the t.Logf check with a require.NoError(t, err) (importing
testify/require if not already present) so any error from the termination hook
causes the test to fail rather than silently pass.
internal/manager/tenant_test.go (1)

268-279: ⚠️ Potential issue | 🟠 Major

This shared fixture can still hide whether unmapping is exercised.

mockSystemManager.UnlinkSystemAction returns an error without changing repo state, so the linked system created here can leak into the next subtest. That lets OffboardTenant stop at the “systems still linked” check and return ContinueAndWait before UnmapSystemFromRegistry is reached. Reset the fixture per subtest or clear linked systems before the unmap assertions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant_test.go` around lines 268 - 279, The test uses a
shared linked-system fixture so OffboardTenant may bail early at the "systems
still linked" check; to ensure UnmapSystemFromRegistry is actually exercised,
reset the fixture per subtest by creating a fresh system and repo state inside
this t.Run (use testutils.NewSystem/testutils.CreateTestEntities within the
subtest) or clear linked systems before calling m.OffboardTenant, and ensure the
mockSystemManager used via m.SetSystemForTests either simulates state changes on
UnlinkSystemAction or is removed (set nil/reset) after the subtest so the linked
system cannot leak into following tests; update the mock or test setup
accordingly for the test that asserts unmap behavior.
internal/manager/tenant.go (1)

345-361: ⚠️ Potential issue | 🟡 Minor

Unreachable default case and incorrect log message.

Two issues in this switch:

  1. Unreachable code: Since err == nil returns early at line 337-339, err is guaranteed non-nil in the switch. The case err != nil: at line 355 will always match if none of the preceding cases do, making the default case unreachable.

  2. Misleading log message: The message "error while removing OIDC mapping" doesn't match this function's purpose (unmapping systems from registry).


📝 Proposed fix
 switch {
 case st.Code() == codes.FailedPrecondition && strings.Contains(st.Message(), "system is not linked to the tenant"):
 	log.Info(ctx, "system is not linked to the tenant in registry, might have been already unlinked")
 	return OffboardingGoToNextStep
 case st.Code() == codes.NotFound && strings.Contains(st.Message(), "system not found"):
 	log.Info(ctx, "system mapping not found in registry, might have been already removed")
 	return OffboardingGoToNextStep
 case st.Code() == codes.InvalidArgument:
 	log.Error(ctx, "invalid argument error while unmapping system from tenant", err)
 	return OffboardingFailed
-case err != nil:
-	log.Error(ctx, "error while removing OIDC mapping", err)
+default:
+	log.Error(ctx, "error while unmapping system from registry", err)
 	return OffboardingContinueAndWait
-default:
-	return OffboardingGoToNextStep
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant.go` around lines 345 - 361, The switch handling gRPC
status after attempting to unmap a system has an always-true `case err != nil`
and a misleading log; remove the `case err != nil:` branch and instead make the
`default` branch the fallback for unexpected non-matching error statuses,
logging a correct message (e.g., "error while unmapping system from registry")
and returning OffboardingContinueAndWait; keep the existing specific cases for
FailedPrecondition, NotFound, and InvalidArgument (referencing variables st, err
and the Offboarding* return constants) so the switch is reachable and semantics
are preserved.
🧹 Nitpick comments (1)
internal/event-processor/handlers_system.go (1)

358-381: Verify cancel behavior aligns with tenant offboarding expectations.

Setting the system to FAILED on cancel is intentional per PR objectives to allow retry on the next decommission reconciliation. However, sendUnlinkForConnectedSystems filters by KeyConfigIDField != null, while terminate() sets KeyConfigurationID = nil on Done/Failed. This means a canceled job will set FAILED but retain KeyConfigurationID, so it will be re-picked in the next loop—which is correct.

Consider adding a brief comment clarifying why FAILED is chosen here (vs CONNECTED in the regular unlink handler) to help future maintainers understand the different semantics.

📝 Suggested clarifying comment
 // HandleJobCanceledEvent puts the system to failed state so the system can be processed again during the next
 // decommission reconciliation loop.
+// Note: Unlike regular SystemUnlinkJobHandler, we intentionally use FAILED (not CONNECTED) because
+// decommission has no user cancel/retry flow—FAILED ensures the system is re-enqueued on the next offboarding pass.
 func (h *SystemUnlinkDecommissionJobHandler) HandleJobCanceledEvent(ctx context.Context, job orbital.Job) error {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 358 - 381, Add a
short clarifying comment in
SystemUnlinkDecommissionJobHandler.HandleJobCanceledEvent right before setting
system.Status = cmkapi.SystemStatusFAILED that explains why FAILED is chosen (so
the system is re-evaluated by the next decommission reconciliation loop) and
note the interaction with sendUnlinkForConnectedSystems and terminate():
sendUnlinkForConnectedSystems filters by KeyConfigurationID != nil while
terminate() clears KeyConfigurationID on Done/Failed, so marking FAILED here
intentionally retains KeyConfigurationID and allows re-pickup on the next loop
(contrast this with the regular unlink handler which sets CONNECTED).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/manager/key.go`:
- Around line 422-424: The current code sets key.State and then calls
km.repo.Patch(ctx, key, *repo.NewQuery().UpdateAll(true)), which overwrites the
entire row and can clobber concurrent changes; change the patch to only update
the state column instead of UpdateAll(true) — e.g., call km.repo.Patch with a
query that targets only the "state" field (use the repository/query API to
Update("state") or equivalent) so only key.State is persisted while leaving
other columns untouched.
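The difference between a full-row patch and a state-only patch can be shown with a map-backed stub; the patch/fields API here is a hypothetical simplification of the repository's query builder (the real fix would use its field-scoped update, e.g. updating only "state" instead of UpdateAll(true)):

```go
package main

import "fmt"

type row map[string]string

type stubRepo struct{ stored row }

// patch applies updates: with no field list it overwrites every column
// (UpdateAll-style); with fields it touches only the named columns.
func (r *stubRepo) patch(updates row, fields ...string) {
	if len(fields) == 0 {
		for k, v := range updates {
			r.stored[k] = v
		}
		return
	}
	for _, f := range fields {
		r.stored[f] = updates[f]
	}
}

func main() {
	r := &stubRepo{stored: row{"state": "DETACHING", "name": "customer key"}}
	// A concurrent writer renamed the key; a state-only patch must not clobber it.
	stale := row{"state": "DETACHED", "name": "stale name"}
	r.patch(stale, "state")
	fmt.Println(r.stored["state"], r.stored["name"]) // DETACHED customer key
}
```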

---

Outside diff comments:
In `@apis/cmk/cmk-ui.yaml`:
- Around line 2027-2032: Update the description block that begins "Indicates the
current state of the Key/Key Version" to document that the UNKNOWN state can
also occur as a detach outcome (e.g., when a detach job is cancelled), not only
for customer-held keys; explicitly mention that UNKNOWN may apply to
system-managed and BYOK keys during tenant termination/detach flows and ensure
the sentence listing DETACHING/DETACHED outcomes reflects this expanded
applicability.

In `@internal/operator/operator.go`:
- Around line 431-459: The code calls grpcResp.GetSuccess() even when the RPC
returned an error, which can panic because grpcResp may be nil; update the
RemoveOIDCMapping handling to check err first and return early for any error
before accessing grpcResp: after calling
o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(...) use
status.FromError(err) only to enrich logs but if err != nil then log the error
(including internal vs other codes), call setErrorStateAndContinue /
setErrorStateAndFail as appropriate (using WorkingStateOIDCMappingRemoveFailed),
and return true; only proceed to inspect grpcResp.GetSuccess() when err is nil
to avoid the nil dereference.

---

Duplicate comments:
In `@internal/api/cmkapi/cmkapi.go`:
- Around line 404-405: Update the OpenAPI/Go struct documentation for the State
field so UNKNOWN includes both authentication-failure and "detach job canceled"
transitions: modify the comment on State (*KeyState `json:"state,omitempty"`)
and any other occurrences of the Key state description (the other doc comments
referencing KeyState at the same locations) to explicitly mention that a
canceled detach moves the key/version back to UNKNOWN; after editing the source
docstrings/OpenAPI description, regenerate the OpenAPI artifacts and
client/server code so the updated state machine appears in generated clients.

In `@internal/event-processor/factory.go`:
- Around line 136-155: The status update to PROCESSING must be done atomically
with job creation so failures in tenant extraction/PrepareJob don't leave the
system stuck; modify the flow around f.repo.Transaction, f.repo.Patch, and
createSystemEventJob (and the
SystemActionJobData/JobTypeSystemUnlinkDecommission path) so the PATCH to set
system.Status = cmkapi.SystemStatusPROCESSING and the call that creates/prepares
the job (createSystemEventJob or an extracted no-event variant used by
SystemUnlink) occur inside the same transaction/guarded flow and return an error
to rollback if job creation fails, mirroring the guarded pattern used by
SystemUnlink.

In `@internal/event-processor/handlers_key.go`:
- Around line 91-96: Only confirm detach jobs while the key is still DETACHING:
in KeyDetachJobHandler.HandleJobConfirm, read the current key state (using the
same store/client used by HandleJobCanceledEvent in this handler) and if the
key's state is not DETACHING return orbital.CancelJobConfirmer() (nil error);
only return orbital.CompleteJobConfirmer() when the live state is DETACHING.
Mirror the system confirmer behavior by querying the key state before completing
the job to avoid late confirmations overwriting the UNKNOWN retry state.

In `@internal/event-processor/reconciler_test.go`:
- Around line 1116-1119: The helper in reconciler_test.go currently swallows
errors from the job termination hook by logging them (the block using t.Logf for
errors from JobDoneFunc/JobFailedFunc); change this to fail the test on hook
errors by replacing the t.Logf check with a require.NoError(t, err) (importing
testify/require if not already present) so any error from the termination hook
causes the test to fail rather than silently pass.

In `@internal/event-processor/repo.go`:
- Around line 64-85: The helpers updateSystem and updateKey call r.Patch and
ignore the boolean result that indicates whether any row was matched; fix both
functions (updateSystem, updateKey) to capture the returned bool (e.g., ok, err
:= r.Patch(...)) and if ok is false return an error (similar to the existing
fmt.Errorf messages) indicating the patch matched no rows so the
decommission/termination state was not persisted; keep existing error wrapping
for err as well.

In `@internal/manager/system.go`:
- Around line 432-444: The UnmapSystemFromRegistry method dereferences
m.registry without checking for nil; update UnmapSystemFromRegistry in
SystemManager to guard m.registry (or m.registry.Mapping()) before calling
Mapping(), and if nil return a retryable error (e.g., context-aware temporary
error) rather than panicking; also ensure NewSystemManager's allowance of a nil
registry is honored by consistently returning that error path when m.registry is
nil so offboarding flows fail safely instead of causing a panic.

In `@internal/manager/tenant_test.go`:
- Around line 268-279: The test uses a shared linked-system fixture so
OffboardTenant may bail early at the "systems still linked" check; to ensure
UnmapSystemFromRegistry is actually exercised, reset the fixture per subtest by
creating a fresh system and repo state inside this t.Run (use
testutils.NewSystem/testutils.CreateTestEntities within the subtest) or clear
linked systems before calling m.OffboardTenant, and ensure the mockSystemManager
used via m.SetSystemForTests either simulates state changes on
UnlinkSystemAction or is removed (set nil/reset) after the subtest so the linked
system cannot leak into following tests; update the mock or test setup
accordingly for the test that asserts unmap behavior.

In `@internal/manager/tenant.go`:
- Around line 345-361: The switch handling gRPC status after attempting to unmap
a system has an always-true `case err != nil` and a misleading log; remove the
`case err != nil:` branch and instead make the `default` branch the fallback for
unexpected non-matching error statuses, logging a correct message (e.g., "error
while unmapping system from registry") and returning OffboardingContinueAndWait;
keep the existing specific cases for FailedPrecondition, NotFound, and
InvalidArgument (referencing variables st, err and the Offboarding* return
constants) so the switch is reachable and semantics are preserved.

---

Nitpick comments:
In `@internal/event-processor/handlers_system.go`:
- Around line 358-381: Add a short clarifying comment in
SystemUnlinkDecommissionJobHandler.HandleJobCanceledEvent right before setting
system.Status = cmkapi.SystemStatusFAILED that explains why FAILED is chosen (so
the system is re-evaluated by the next decommission reconciliation loop) and
note the interaction with sendUnlinkForConnectedSystems and terminate():
sendUnlinkForConnectedSystems filters by KeyConfigurationID != nil while
terminate() clears KeyConfigurationID on Done/Failed, so marking FAILED here
intentionally retains KeyConfigurationID and allows re-pickup on the next loop
(contrast this with the regular unlink handler which sets CONNECTED).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b66573ce-937a-45b7-b950-dec9134af181

📥 Commits

Reviewing files that changed from the base of the PR and between 915f75f and 835f4b5.

📒 Files selected for processing (22)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • internal/operator/operator_test.go
  • sonar-project.properties
  • test/integration/event-processor/reconciliation_test.go

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch 6 times, most recently from a7629a1 to fee5d34 on March 23, 2026 at 22:13
@minh-nghia
Contributor Author

@coderabbitai resolve

@coderabbitai

coderabbitai bot commented Mar 23, 2026

✅ Actions performed

Comments resolved. Auto-approval is disabled; enable reviews.request_changes_workflow to approve automatically.

@minh-nghia
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/operator/operator.go (1)

441-461: ⚠️ Potential issue | 🟠 Major

Clarify error handling for codes.Internal - current flow causes unintended task failure.

When err != nil and st.Code() == codes.Internal, the code logs the error but falls through to line 454. Since grpcResp is nil when an error occurs, grpcResp.GetSuccess() returns false (proto getter safety), which triggers setErrorStateAndFail.

If Internal errors should fail the task, the special logging at line 446 is misleading. If Internal errors should be handled differently (e.g., proceed with termination), the function should return early after logging.

🔧 Option A: If Internal errors should fail the task
 	st, ok := status.FromError(err)
 	if !ok {
 		log.Error(ctx, "failed getting info on sessionManager error", err)
 	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
+	if err != nil {
 		log.Error(ctx, "error while removing OIDC mapping", err)
 		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
 		return true
 	}
🔧 Option B: If Internal errors should allow proceeding (e.g., mapping already removed)
 	st, ok := status.FromError(err)
 	if !ok {
 		log.Error(ctx, "failed getting info on sessionManager error", err)
 	}
 	if st.Code() == codes.Internal {
 		log.Error(ctx, "removeOIDC failed with internal err", err)
+		// Internal error is acceptable (e.g., mapping already removed), proceed with termination
+		return false
 	}
 	if err != nil && st.Code() != codes.Internal {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator/operator.go` around lines 441 - 461, The current handling
of status.FromError(err)/st.Code() == codes.Internal is ambiguous and lets
execution fall through to grpcResp.GetSuccess() (which is nil) causing
setErrorStateAndFail; change the flow so that after detecting an Internal error
you explicitly handle it: either (A) treat it as a hard failure by calling
setErrorStateAndFail(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed) and
return true, or (B) treat it as non-fatal by logging a clear message and
returning (true or false as appropriate for your loop semantics) to avoid
touching grpcResp; also update the log message around removeOIDC/internal error
to include the error and not imply a different outcome — locate the handling
around status.FromError(err), st.Code(), codes.Internal, grpcResp.GetSuccess(),
setErrorStateAndFail and setErrorStateAndContinue to implement this early-return
behavior.
internal/event-processor/handlers_system.go (1)

457-463: ⚠️ Potential issue | 🔴 Critical

Avoid a panic in the not-found branch.

When getSystemByID returns repo.ErrNotFound, system is nil here. Logging system.Identifier turns an already-gone system into a handler crash.

🐛 Minimal fix
 	if errors.Is(err, repo.ErrNotFound) {
 		log.Warn(ctx, "System not found when handling job termination, skipping system update",
-			slog.String("systemID", system.Identifier))
+			slog.String("systemID", data.SystemID))
 		return nil
 	}

Also applies to: 502-507

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 457 - 463, The
not-found error branch in the handler uses system.Identifier after getSystemByID
returns an error, which can be nil and panic; update the repo.ErrNotFound
handling in the block that follows getSystemByID (and the similar block around
lines 502-507) to avoid dereferencing system: log using data.SystemID (or
another non-nil identifier) instead of system.Identifier, or check system != nil
before accessing Identifier, and return nil afterward; ensure the log message
remains descriptive (e.g., "System not found when handling job termination,
skipping system update") and include the safe identifier.
♻️ Duplicate comments (5)
internal/api/cmkapi/cmkapi.go (1)

404-405: ⚠️ Potential issue | 🟡 Minor

Update UNKNOWN-state wording to “canceled detach,” not “failed detach.”

These descriptions currently say UNKNOWN applies when key detach previously failed, but the PR behavior defines UNKNOWN for canceled detach jobs while failed jobs terminate as DETACHED. Please correct the source OpenAPI descriptions and regenerate this file.

Also applies to: 463-464, 613-614

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/cmkapi/cmkapi.go` around lines 404 - 405, Update the OpenAPI
description for the Key/Key Version state so that references to UNKNOWN indicate
"canceled detach" instead of "failed detach": locate the source description used
to generate the State field comment (which produced the comment above the State
*KeyState `json:"state,omitempty"` field and the similar comments that generated
the other duplicated blocks), replace the phrase "detach previously failed" (or
similar wording) with "canceled detach", regenerate the OpenAPI client/code so
the comment above the State field and the other duplicated occurrences reflect
"canceled detach" (ensure you run the same generation step that produced
internal/api/cmkapi/cmkapi.go).
internal/manager/tenant_test.go (1)

194-199: ⚠️ Potential issue | 🟠 Major

These fixtures still don't point the key configuration at the key under test.

Both subtests create a new IsPrimary key, but keyConfig.PrimaryKeyID still points at the original fixture key. Offboarding can therefore read or update the wrong record, so these assertions may not be exercising the intended primary-key path.

Also applies to: 242-248

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/tenant_test.go` around lines 194 - 199, The fixture keys
created via testutils.NewKey in the subtests set IsPrimary on the new key but do
not update keyConfig.PrimaryKeyID, so offboarding still references the original
fixture key; update the test setup (after creating the new primary key via
testutils.NewKey) to set keyConfig.PrimaryKeyID = newKey.ID (and persist/update
keyConfig if needed) so the key configuration actually points at the newly
created primary key; do this for both occurrences (the block using
testutils.NewKey at the shown lines and the similar block around lines 242-248)
so the tests exercise the primary-key path correctly.
internal/event-processor/repo.go (1)

69-76: ⚠️ Potential issue | 🟠 Major

Don't swallow updated == false here.

If Patch matches no row, these helpers still return nil. Their callers then continue with event cleanup and audit logging as if the state transition was persisted.

Also applies to: 85-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/repo.go` around lines 69 - 76, The current handling
after calling r.Patch(ctx, system, *query) ignores the case where updated ==
false and merely logs a warning; instead treat a false return as a failure to
persist the state transition: after the call in repo.go (the r.Patch(ctx,
system, *query) invocation and the second similar block around lines 85-92),
check if updated is false and return a non-nil error (e.g., fmt.Errorf("failed
to update system %s: no rows affected") or similar) so callers do not proceed
with cleanup/audit as if the update succeeded; update both places that call
r.Patch to return this error rather than just logging.
internal/event-processor/handlers_system.go (1)

400-410: ⚠️ Potential issue | 🟠 Major

Reset the external key claim before persisting the local unlink.

terminate clears KeyConfigurationID and marks the system DISCONNECTED before sendL1KeyClaim(...). If that external call fails after the patch, offboarding will see the system as fully unlinked and won't enqueue another cleanup attempt for the stale claim.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event-processor/handlers_system.go` around lines 400 - 410, The code
clears system.KeyConfigurationID and sets system.Status to DISCONNECTED before
calling sendL1KeyClaim, which can leave an external claim stale if
sendL1KeyClaim fails; change the flow so you call sendL1KeyClaim(ctx, *system,
h.registry, data.TenantID, false) while the system still has its
KeyConfigurationID, and only after sendL1KeyClaim succeeds then set
system.KeyConfigurationID = nil, set system.Status =
cmkapi.SystemStatusDISCONNECTED and persist via updateSystem(ctx, h.repo,
system); ensure error handling preserves the original state if sendL1KeyClaim
returns an error.
internal/manager/system.go (1)

438-443: ⚠️ Potential issue | 🟠 Major

Don't treat a missing registry client as a successful unmap.

NewSystemManager still allows m.registry == nil. Returning nil here makes tenant offboarding advance as if the registry mapping was removed, so deletion can continue with stale external state.

🐛 Minimal fix
 	if m.registry == nil {
 		log.Warn(
 			ctx, "Could not perform system unmap from registry service - registry client is nil",
 		)
 
-		return nil
+		return status.Error(codes.Unavailable, "registry client is not configured")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/system.go` around lines 438 - 443, The code treats a missing
registry client (m.registry == nil) as a successful unmap by returning nil;
change this to return an explicit error so tenant offboarding does not proceed
with stale external state. In the method that performs the system unmap (the
function containing the shown snippet), replace the current return nil with a
descriptive error (e.g., fmt.Errorf("registry client is nil: cannot unmap
system")), and ensure callers handle this error; additionally, audit
NewSystemManager to either require/validate a non-nil registry client at
construction or document that callers must handle a possible nil registry and
the resulting error from the unmap call (use m.registry and NewSystemManager as
reference points).
🧹 Nitpick comments (1)
internal/operator/operator.go (1)

391-397: Minor: Inconsistent variable naming tenantId vs tenantID.

Go convention prefers tenantID (with uppercase ID acronym). This is inconsistent with the parameter naming on line 433 (tenantID) in the same file.

🔧 Suggested fix
-	tenantId := tenantProto.GetId()
-	if tenantId == "" {
+	tenantID := tenantProto.GetId()
+	if tenantID == "" {
 		setErrorStateAndFail(ctx, resp, ErrInvalidTenantID, WorkingStateInvalidTaskData)
 		return
 	}

-	ctx = slogctx.With(ctx, "tenantId", tenantId)
+	ctx = slogctx.With(ctx, "tenantId", tenantID)

-	hasErr := o.removeOIDCMapping(ctx, resp, tenantId)
+	hasErr := o.removeOIDCMapping(ctx, resp, tenantID)

(Also update line 406 accordingly)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator/operator.go` around lines 391 - 397, Rename the local
variable tenantId to tenantID to follow Go convention and match existing
parameter naming; replace the assignment from tenantProto.GetId() to tenantID,
update the empty check and the setErrorStateAndFail call that uses it, and
update the ctx = slogctx.With(ctx, "tenantId", tenantId) to use "tenantID" and
tenantID; also adjust any other nearby/local occurrences (including the other
local variable referenced in this block) so all uses and the context key are
consistently tenantID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/manager/system.go`:
- Around line 730-741: The code calls keyConfig.PrimaryKeyID.String() unguarded
in the event creation branches (eventFactory.SystemUnlinkDecommission and
eventFactory.SystemUnlink), which will panic if PrimaryKeyID is nil; update both
branches to check ptr.IsNotNilUUID(keyConfig.PrimaryKeyID) before dereferencing
and pass a safe value (e.g., an empty string or handle the nil case
appropriately) into SystemUnlinkDecommission and SystemUnlink so the .String()
method is only invoked when PrimaryKeyID is non-nil, matching the existing guard
pattern used elsewhere in this file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0605c116-f9d8-4a66-9fe1-37f2342004ad

📥 Commits

Reviewing files that changed from the base of the PR and between 835f4b5 and fee5d34.

📒 Files selected for processing (21)
  • apis/cmk/cmk-ui.yaml
  • internal/api/cmkapi/cmkapi.go
  • internal/event-processor/factory.go
  • internal/event-processor/factory_test.go
  • internal/event-processor/handlers_key.go
  • internal/event-processor/handlers_system.go
  • internal/event-processor/helpers.go
  • internal/event-processor/reconciler.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/repo.go
  • internal/event-processor/resolvers.go
  • internal/event-processor/types.go
  • internal/manager/export_test.go
  • internal/manager/key.go
  • internal/manager/system.go
  • internal/manager/tenant.go
  • internal/manager/tenant_test.go
  • internal/operator/operator.go
  • internal/operator/operator_test.go
  • sonar-project.properties
  • test/integration/event-processor/reconciliation_test.go
✅ Files skipped from review due to trivial changes (3)
  • internal/operator/operator_test.go
  • sonar-project.properties
  • internal/event-processor/helpers.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • internal/event-processor/types.go
  • test/integration/event-processor/reconciliation_test.go
  • internal/event-processor/resolvers.go
  • apis/cmk/cmk-ui.yaml
  • internal/manager/export_test.go
  • internal/event-processor/factory.go
  • internal/event-processor/reconciler_test.go
  • internal/event-processor/handlers_key.go
  • internal/manager/tenant.go
  • internal/manager/key.go

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch 2 times, most recently from b9d5a96 to 0b5aa06 on March 24, 2026 14:00
bkieling previously approved these changes Mar 24, 2026
Comment on lines +70 to +72
if err != nil {
return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
}
Contributor

I wouldn't do this here, as this updateSystem doesn't mention any job.

Comment on lines 1 to 4
sonar.exclusions=**/api/cmkapi/cmkapi.go
sonar.test.inclusions=**/*_test.go
sonar.coverage.exclusions=**/api/**,**/noop/**,**/wrapper/**,**/main.go,**/*_test.go,**/*_config.go,**/*_dto.go,**/*_model.go, **.sql,internal/api/cmkapi/cmkapi.go
sonar.cpd.exclusions=**/api/**,**/noop/**,**/wrapper/**,**/main.go,**/*_test.go,**/mocks/**,**/proto/**,**/*_generated.go,internal/api/cmkapi/cmkapi.go
Contributor

Seems like the file is already in coverage.exclusions and cpd.exclusions. It can be removed from there.

Contributor Author

I had trouble with code smells and couldn't find another configuration to disable it specifically. Or do you mean it should be removed from coverage.exclusions and cpd.exclusions?

Contributor

I mean that if it's in sonar.exclusions it doesn't need to be in cpd.exclusions and coverage.exclusions, as sonar.exclusions already covers those.

@minh-nghia minh-nghia force-pushed the feat/refactor-tenant-decommissioning branch from 0b5aa06 to 6f63e74 on March 24, 2026 15:51
case st.Code() == codes.InvalidArgument:
log.Error(ctx, "invalid argument error while unmapping system from tenant", err)
return OffboardingFailed
case err != nil:
Contributor

How do we know this case is due to an error removing the OIDC mapping?

Contributor Author

It doesn't need to be. The point is that, when it is a generic or unknown error, we assume it is something intermittent and reconcile the operator/task until it reaches the reconciliation limit.

slog.String("errorMessage", taskErrorMessage),
)

_, err = h.terminate(ctx, job, cmkapi.KeyStateDETACHED)
Contributor

On job failed, is DETACHED a correct state?

Contributor

Or should it be UNKNOWN?

Contributor Author

It should be DETACHED. For tenant decommissioning events (unlink and detach) we do not care about the task operator's response, as long as we know we have informed them, so COMPLETE and FAILED are treated the same.
Task cancellation goes back to UNKNOWN because we haven't sent out the task yet, so we haven't done our part.

Contributor

@doanhluongsap doanhluongsap left a comment

LGTM! Thanks a lot

@minh-nghia minh-nghia merged commit ba4f90d into main Mar 25, 2026
7 checks passed
@minh-nghia minh-nghia deleted the feat/refactor-tenant-decommissioning branch March 25, 2026 12:02