Conversation
```go
system.Status = cmkapi.SystemStatusCONNECTED
err = updateSystem(ctx, h.repo, system)
```
Just double-checking: this is only executed for systems previously in the Connected state, right? Otherwise we might end up with systems marked as connected that aren't really connected.
The condition was against systems with `keyConfig != null`, so it would cover both FAILED and CONNECTED. It makes more sense here to mark them as FAILED instead.
The PR introduces meaningful improvements to tenant decommissioning but has a critical logic inversion in sendSystemUnlinkEvent where the job type names and factory calls are swapped between the decommission and non-decommission branches — this would route jobs to the wrong handlers entirely. Additionally, there is a test name mismatch for the canceled key detach scenario, a dangling empty comment line in operator.go, and a potentially unscoped system query in unmapAllSystemsFromRegistry that the author should verify is correctly tenant-bounded. The existing reviewer comment about the sendSystemUnlinkEvent inversion should be prioritized for immediate fix.
PR Bot Information
- Version: 1.17.99
- Correlation ID: aa2e50c0-1c66-11f1-9f08-81834e73c7da
- Event Trigger: issue_comment.created
- LLM: anthropic--claude-4.6-sonnet
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough

This PR introduces a key detach workflow with an intermediate `DETACHING` state.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Client
    participant KeyMgr as KeyManager
    participant EventFactory
    participant OrbitalMgr as OrbitalManager
    participant KeyHandler as KeyDetachJobHandler
    participant Repo
    participant Auditor
    Client->>KeyMgr: Detach(key)
    KeyMgr->>Repo: Patch key.State→DETACHING
    KeyMgr->>EventFactory: SystemUnlinkDecommission
    KeyMgr->>KeyMgr: sendDetachEvent()
    EventFactory->>OrbitalMgr: CreateJob(KeyDetach)
    OrbitalMgr-->>KeyHandler: Job created
    KeyHandler->>KeyHandler: ResolveTasks()
    KeyHandler->>KeyHandler: HandleJobConfirm()
    Note over KeyHandler: Verify key in DETACHING state
    KeyHandler->>OrbitalMgr: CompleteJobConfirmer / CancelJobConfirmer
    OrbitalMgr->>KeyHandler: JobDoneEvent / JobFailedEvent / JobCanceledEvent
    alt Done Event
        KeyHandler->>Repo: Update key.State→DETACHED
        KeyHandler->>Auditor: SendDetachAuditLog()
        Auditor-->>KeyHandler: success or logged failure
    else Failed Event
        KeyHandler->>Repo: Update key.State→DETACHED
        KeyHandler->>KeyHandler: mergeOrbitalTaskErrors()
    else Canceled Event
        KeyHandler->>Repo: Update key.State→UNKNOWN
    end
```

```mermaid
sequenceDiagram
    participant Client
    participant SysMgr as SystemManager
    participant EventFactory
    participant OrbitalMgr as OrbitalManager
    participant SysHandler as SystemUnlinkDecommissionJobHandler
    participant Repo
    participant Registry
    participant Auditor
    Client->>SysMgr: UnlinkSystemAction(trigger=DECOMMISSION)
    SysMgr->>EventFactory: SystemUnlinkDecommission()
    EventFactory->>Repo: Patch system.Status→PROCESSING
    EventFactory->>OrbitalMgr: CreateJob(SYSTEM_UNLINK_DECOMMISSION)
    OrbitalMgr-->>SysHandler: Job created
    SysHandler->>SysHandler: HandleJobConfirm()
    SysHandler->>OrbitalMgr: CompleteJobConfirmer
    OrbitalMgr->>SysHandler: JobDoneEvent
    SysHandler->>SysHandler: terminate()
    SysHandler->>Repo: Load system & key
    SysHandler->>Repo: Update system→DISCONNECTED
    SysHandler->>Registry: UnlinkL1KeyClaim()
    SysHandler->>Auditor: SendOffboardingAuditLog()
    Auditor-->>SysHandler: success or logged failure
```

```mermaid
sequenceDiagram
    participant Client
    participant TenantMgr as TenantManager
    participant SysMgr as SystemManager
    participant KeyMgr as KeyManager
    participant Registry
    participant Repo
    Client->>TenantMgr: OffboardTenant()
    loop Step 1-2: Unlink Systems
        TenantMgr->>SysMgr: sendUnlinkForConnectedSystems()
        SysMgr->>SysMgr: Send unlink events
        TenantMgr->>Repo: checkAllSystemsUnlinked()
        alt Not all unlinked
            TenantMgr-->>Client: OffboardingContinueAndWait
        end
    end
    loop Step 3: Detach Primary Keys
        TenantMgr->>KeyMgr: detachPrimaryKeys()
        KeyMgr->>KeyMgr: Send detach events
        TenantMgr->>Repo: checkAllPrimaryKeysProcessed()
        alt Not all in DETACHING
            TenantMgr-->>Client: OffboardingContinueAndWait
        end
    end
    loop Step 4: Unmapping
        TenantMgr->>Registry: unmapAllSystemsFromRegistry()
        alt Retryable error (codes.Internal)
            TenantMgr-->>Client: OffboardingContinueAndWait
        else Permanent error (codes.InvalidArgument)
            TenantMgr-->>Client: OffboardingFailed
        end
    end
    loop Step 5: Verify Detach Complete
        TenantMgr->>Repo: checkAllPrimaryKeysDetached()
        alt Not all DETACHED
            TenantMgr-->>Client: OffboardingContinueAndWait
        end
    end
    TenantMgr-->>Client: OffboardingSuccess
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/manager/tenant.go (1)
229-249: ⚠️ Potential issue | 🟠 Major: This unlink seed only covers `CONNECTED`, but completion waits on all linked systems.

`checkAllSystemsUnlinked()` returns false until every `KeyConfigurationID` is cleared, while this helper only dispatches decommission unlinks for `Status == CONNECTED`. Any linked system already in `FAILED` or another non-`CONNECTED` state will never get another unlink event, so offboarding can sit in `OffboardingProcessing` forever.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant.go` around lines 229 - 249, The current sendUnlinkForConnectedSystems only queries systems with Status == SystemStatusCONNECTED, but checkAllSystemsUnlinked() requires every KeyConfigurationID to be cleared; change sendUnlinkForConnectedSystems to target all linked systems (e.g., systems with a non-empty KeyConfigurationID or any status != UNLINKED) instead of only SystemStatusCONNECTED so every linked system gets an UnlinkSystemAction; update the repo.NewCompositeKey()/Where(...) condition used in sendUnlinkForConnectedSystems (keeping the UnlinkSystemAction call and constants.SystemActionDecommission) to query by KeyConfigurationID presence (or status not UNLINKED) and leave the loop calling m.sys.UnlinkSystemAction unchanged.
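The broadened selection described above can be sketched with simplified stand-in types. This is a minimal illustration, not project code: the real `model.System` and status constants live in the project's internal packages, and the field names here are taken from the review comment as assumptions.

```go
package main

import "fmt"

// System is a simplified stand-in for model.System; field names follow the
// review comment and are assumptions, not the project's real definitions.
type System struct {
	ID                 string
	Status             string
	KeyConfigurationID string
}

// systemsNeedingUnlink mirrors the suggested query change: seed a decommission
// unlink for every system that still holds a KeyConfigurationID, regardless of
// status, so FAILED systems are retried instead of stalling offboarding.
func systemsNeedingUnlink(all []System) []System {
	var linked []System
	for _, s := range all {
		if s.KeyConfigurationID != "" {
			linked = append(linked, s)
		}
	}
	return linked
}

func main() {
	all := []System{
		{ID: "sys-a", Status: "CONNECTED", KeyConfigurationID: "kc-1"},
		{ID: "sys-b", Status: "FAILED", KeyConfigurationID: "kc-1"},
		{ID: "sys-c", Status: "DISCONNECTED", KeyConfigurationID: ""},
	}
	for _, s := range systemsNeedingUnlink(all) {
		// sys-a and sys-b are selected; sys-c is already unlinked
		fmt.Println(s.ID)
	}
}
```

Filtering by `KeyConfigurationID` presence makes the seed condition match the completion condition, which is the invariant `checkAllSystemsUnlinked()` actually waits on.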
♻️ Duplicate comments (1)
internal/event-processor/handlers_system.go (1)
354-371: ⚠️ Potential issue | 🔴 Critical: Canceled decommission jobs must restore `CONNECTED`.

Line 370 sets `FAILED`, but `internal/manager/tenant.go` (lines 229-249) only resubmits decommission unlinks for connected systems. After a cancel, this becomes unrecoverable and blocks tenant offboarding instead of retrying on the next reconciliation. This also disagrees with the new expectation in `internal/event-processor/reconciler_test.go` (lines 913-941).

Suggested change:

```diff
- system.Status = cmkapi.SystemStatusFAILED
+ system.Status = cmkapi.SystemStatusCONNECTED
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 354 - 371, In HandleJobCanceledEvent (SystemUnlinkDecommissionJobHandler) restore the system to CONNECTED instead of FAILED: replace the assignment setting system.Status = cmkapi.SystemStatusFAILED with system.Status = cmkapi.SystemStatusCONNECTED so the subsequent updateSystem call persists the restored CONNECTED state and allows the decommission reconciliation to resubmit the unlink; keep the existing updateSystem(ctx, h.repo, system) call.
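A tiny state-machine simulation shows why the cancel path must restore `CONNECTED`: the reseeding step only picks up connected systems, so a system left in `FAILED` never re-enters the loop. All names here are illustrative stand-ins, not the real handler code.

```go
package main

import "fmt"

// reseed models sendUnlinkForConnectedSystems: only CONNECTED systems receive
// a new decommission unlink job on the next reconciliation cycle.
func reseed(status string) bool { return status == "CONNECTED" }

// onCancel models HandleJobCanceledEvent for the decommission handler, with a
// flag selecting the current behavior (FAILED) or the suggested fix (CONNECTED).
func onCancel(restoreConnected bool) string {
	if restoreConnected {
		return "CONNECTED" // suggested fix: job is resubmitted next cycle
	}
	return "FAILED" // current code: reseed() never matches again
}

func main() {
	for _, restore := range []bool{false, true} {
		status := onCancel(restore)
		fmt.Printf("restoreConnected=%v -> status=%s, retried=%v\n",
			restore, status, reseed(status))
	}
}
```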
🧹 Nitpick comments (4)
internal/event-processor/export_test.go (1)
23-25: Keep `DisableAuditLog()` in sync with the new handlers.

Tests can now drive cancellation paths through this exported hook, but the helper still only swaps auditors for link/unlink/switch. `SystemUnlinkDecommissionJobHandler` and `KeyDetachJobHandler` can still emit audit logs in tests that expect side effects to be disabled.

Proposed update:

```diff
 func (c *CryptoReconciler) DisableAuditLog() {
 	c.jobHandlerMap[JobTypeSystemLink].(*SystemLinkJobHandler).cmkAuditor = &auditor.Auditor{}
 	c.jobHandlerMap[JobTypeSystemUnlink].(*SystemUnlinkJobHandler).cmkAuditor = &auditor.Auditor{}
+	c.jobHandlerMap[JobTypeSystemUnlinkDecommission].(*SystemUnlinkDecommissionJobHandler).cmkAuditor = &auditor.Auditor{}
 	c.jobHandlerMap[JobTypeSystemSwitch].(*SystemSwitchJobHandler).cmkAuditor = &auditor.Auditor{}
+	c.jobHandlerMap[JobTypeKeyDetach].(*KeyDetachJobHandler).cmkAuditor = &auditor.Auditor{}
 }
```

Also applies to: 32-36
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/export_test.go` around lines 23 - 25, Update the DisableAuditLog test helper so it also suppresses audit logging for the new exported cancellation hook and the two job handlers: ensure DisableAuditLog swaps auditors (or stubs) for SystemUnlinkDecommissionJobHandler and KeyDetachJobHandler and additionally covers JobCanceledFunc (which delegates to jobCanceledFunc) so cancellation paths in tests do not emit audit logs; modify the helper to register/replace those handler/auditor symbols the same way it already does for link/unlink/switch.

internal/manager/tenant_test.go (1)
251-260: Scope this cleanup to the fixture under test.

This loop disconnects every system in the tenant, including state created by other `t.Run`s. That makes the assertion order-dependent and can hide regressions in the `OffboardTenant` selection logic. Prefer creating only the systems needed here, or filtering the patch by `keyConfigID`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant_test.go` around lines 251 - 260, The test currently disconnects every system in the tenant by listing all model.System via r.List and patching each one (setting Status and KeyConfigurationID) which affects other t.Run fixtures; instead restrict the cleanup to only the systems created/owned by this test: either create and track the specific systems locally and patch only those, or apply a scoped query filter (e.g. filter by the fixture's keyConfigID or another unique identifier) when calling r.List and when calling r.Patch (use the existing repo.NewQuery().UpdateAll(true) but with the filter applied) so only the intended systems are modified.

test/integration/event-processor/reconciliation_test.go (1)
207-244: Add a `SystemUnlinkDecommission` case to this table.

This suite now exercises the shortened `SystemUnlink` API, but the PR also adds a new unlink-decommission job type and resolver path. A sibling case here would catch regressions where the new job stops producing a `SYSTEM_UNLINK` task or serializes different payload fields.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/event-processor/reconciliation_test.go` around lines 207 - 244, Add a new test case entry in the testCases table for the SystemUnlinkDecommission job that mirrors the existing SYSTEM_UNLINK expectations: set name to something like "SYSTEM_UNLINK_DECOMMISSION creates task for system's region only", jobType to "SYSTEM_UNLINK_DECOMMISSION", protoTaskType to eventProto.TaskType_SYSTEM_UNLINK (same produced task), keyIDFrom to the unlink key (keyIDUnlink), keyIDTo to "", and system to systemUnlink. Then update the job-creation branch inside the t.Run loop to handle tc.jobType == "SYSTEM_UNLINK_DECOMMISSION" by calling the new factory method (e.g. tester.eventFactory.SystemUnlinkDecommission(ctx, tc.system, tc.keyIDFrom)) so the new case is exercised alongside SYSTEM_UNLINK and SYSTEM_LINK.

internal/manager/key.go (1)
425-427: Patch only the state field here.

`UpdateAll(true)` writes the full `model.Key` snapshot back to the row. This path only needs the `DETACHING` transition, so a field-scoped update is safer and avoids persisting unrelated stale fields from the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/key.go` around lines 425 - 427, The code sets key.State = string(cmkapi.KeyStateDETACHING) then calls km.repo.Patch with repo.NewQuery().UpdateAll(true), which writes the whole model back; change the Patch invocation to perform a field-scoped update that only persists the state change (i.e., replace UpdateAll(true) with the repository query option that updates only the "state" field) so unrelated/stale fields on model.Key are not written; keep the same key variable and km.repo.Patch call but supply a query configured to UpdateFields("state") (or the repo's equivalent API) instead of UpdateAll.
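The hazard behind this nitpick can be shown with a toy row store. The helpers below are hypothetical (the real repo offers `UpdateAll` and, per the prompt's assumption, a field-scoped variant); they only model why writing a full stale snapshot loses concurrent changes.

```go
package main

import "fmt"

// keyRow is a toy stand-in for a persisted model.Key row.
type keyRow map[string]string

// patchAll models UpdateAll(true): every field of the in-memory snapshot is
// written back, clobbering concurrent changes to other columns.
func patchAll(row, snapshot keyRow) {
	for k, v := range snapshot {
		row[k] = v
	}
}

// patchState models a field-scoped update: only the state column changes.
func patchState(row keyRow, state string) {
	row["state"] = state
}

func main() {
	row := keyRow{"state": "ATTACHED", "label": "renamed-by-another-writer"}

	// The caller holds a snapshot taken before the concurrent rename.
	stale := keyRow{"state": "DETACHING", "label": "old-label"}
	patchAll(row, stale)
	fmt.Println(row["label"]) // the concurrent rename was lost

	row["label"] = "renamed-by-another-writer"
	patchState(row, "DETACHING")
	fmt.Println(row["label"]) // the field-scoped patch preserves it
}
```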
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/event-processor/factory.go`:
- Around line 131-151: The current SystemUnlinkDecommission sets system.Status
and does a bare repo.Patch inside a transaction, bypassing the guarded/keyed
status-update flow and risking leaving the system in PROCESSING if
createSystemEventJob fails; replace the manual Transaction+Patch logic with the
guarded flow by invoking the existing handleSystemStatus (or the same helper
used elsewhere) so the status update is keyed/guarded and tied to job creation:
prepare SystemActionJobData (SystemID, KeyIDFrom) then call handleSystemStatus
(or the helper wrapper) to perform the status transition and create the job via
createSystemEventJob in its protected closure, ensuring rollback/cleanup if job
creation fails.
In `@internal/event-processor/handlers_key.go`:
- Around line 107-110: The code currently returns an error when
h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()) fails, which can cause
retries even though the key is already persisted as DETACHED; change this so
audit-log delivery is non-fatal: call h.cmkAuditor.SendCmkDetachAuditLog as
before but on error do not return—log the failure with context (include key.ID
and the error) using the handler logger (e.g., h.logger or processLogger) and
continue normal completion; remove the fmt.Errorf return path so the done hook
is not blocked by transient auditor failures.
In `@internal/event-processor/handlers_system.go`:
- Around line 314-319: The handler SystemUnlinkDecommissionJobHandler currently
always returns orbital.CompleteJobConfirmer() from HandleJobConfirm which
accepts late task completions; change it to use the standard confirmer instead
so late results from canceled/superseded jobs are rejected. Update
HandleJobConfirm to return the standard confirmer (e.g.,
orbital.StandardJobConfirmer()) rather than orbital.CompleteJobConfirmer(),
preserving the same error return (nil).
In `@internal/event-processor/repo.go`:
- Around line 64-85: Both updateSystem and updateKey call repo.Repo.Patch but
ignore the boolean result indicating whether a row was actually patched;
updateSystem, updateKey, and the Patch callsite should check the returned
boolean (the first return value from repo.Repo.Patch) and return an error if it
is false (e.g., "no rows patched for system/key <ID>"), so callers don't assume
persistence when nothing changed—locate the Patch invocations in
updateSystem/updateKey (they build a composite key via
repo.NewCompositeKey().Where and repo.NewQuery().Where(...).UpdateAll(true)) and
add a conditional that treats a false patch result as an error before returning
nil.
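The repo.go fix amounts to a small wrapper. The sketch below assumes a `Patch` that returns `(bool, error)` as the prompt describes; the function and parameter names are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// patchFn stands in for repo.Repo.Patch, which reports whether any row changed.
type patchFn func() (bool, error)

// patchOrError surfaces silent no-ops: a false result with a nil error means
// nothing was persisted, which callers previously mistook for success.
func patchOrError(patch patchFn, entity, id string) error {
	patched, err := patch()
	if err != nil {
		return err
	}
	if !patched {
		return fmt.Errorf("no rows patched for %s %s", entity, id)
	}
	return nil
}

func main() {
	ok := func() (bool, error) { return true, nil }
	noop := func() (bool, error) { return false, nil }
	fail := func() (bool, error) { return false, errors.New("db down") }

	fmt.Println(patchOrError(ok, "system", "sys-1"))
	fmt.Println(patchOrError(noop, "system", "sys-1"))
	fmt.Println(patchOrError(fail, "key", "key-1"))
}
```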
In `@internal/manager/system.go`:
- Around line 432-444: The UnmapSystemFromRegistry method assumes m.registry is
non-nil and calls m.registry.Mapping(), which can panic because NewSystemManager
permits a nil registry; guard this by checking if m.registry == nil at the top
of UnmapSystemFromRegistry and return a retryable/temporary error instead of
calling into Mapping(); reference the method name (UnmapSystemFromRegistry) and
the field (m.registry) when making the change so callers get a surfaced error
rather than a panic.
In `@internal/manager/tenant.go`:
- Around line 292-321: The call to repo.ProcessInBatch in
TenantManager.unmapAllSystemsFromRegistry currently ignores its error; capture
its returned error (assign the result of repo.ProcessInBatch to err) and, if
non-nil, mark the offboarding as failed (e.g., set hasOffboardingFailed = true
or immediately return (true, OffboardingFailed)) so failures
enumerating/batching systems are propagated instead of treated as success; keep
existing per-system handling using m.sys.UnmapSystemFromRegistry and
m.unmapSystemErrorCanContinue, but ensure the overall function returns a failure
status when repo.ProcessInBatch fails.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0b233394-d19e-4cfd-bcdb-3372f59be67f
📒 Files selected for processing (19)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- test/integration/event-processor/reconciliation_test.go
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/operator/operator.go (1)
425-455: ⚠️ Potential issue | 🔴 Critical: Return immediately on any gRPC error before accessing the response.

The current logic only returns early for non-internal errors, allowing `codes.Internal` to fall through to `grpcResp.GetSuccess()`, which can panic on a nil response and also violates the intended short-circuit when OIDC cleanup fails. This differs from the safe pattern used in `BlockOIDCMapping` and `UnblockOIDCMapping` in the same file. Guard the response dereference with a check on any error.

Suggested fix:

```diff
 func (o *TenantOperator) removeOIDCMapping(
 	ctx context.Context, resp *orbital.HandlerResponse, tenantID string,
 ) bool {
 	grpcResp, err := o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(
 		ctx,
 		&oidcmappinggrpc.RemoveOIDCMappingRequest{
 			TenantId: tenantID,
 		},
 	)
-	st, ok := status.FromError(err)
-	if !ok {
-		log.Error(ctx, "failed getting info on sessionManager error", err)
-	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
-		log.Error(ctx, "error while removing OIDC mapping", err)
-		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
-		return true
-	}
+	if err != nil {
+		log.Error(ctx, "error while removing OIDC mapping", err, slog.String("grpcCode", status.Code(err).String()))
+		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
+		return true
+	}
 	if !grpcResp.GetSuccess() {
 		err = errs.Wrapf(
 			ErrFailedResponse,
 			"session manager could not remove OIDC mapping: "+grpcResp.GetMessage(),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operator/operator.go` around lines 425 - 455, In removeOIDCMapping, don't dereference grpcResp when err is non-nil: immediately handle any gRPC error (including codes.Internal) before accessing grpcResp.GetSuccess(); replicate the safe pattern used in BlockOIDCMapping/UnblockOIDCMapping by checking if err != nil, logging the error (including internal case), calling setErrorStateAndContinue or setErrorStateAndFail as appropriate (use WorkingStateOIDCMappingRemoveFailed and setErrorStateAndContinue for transient/internal handling per the surrounding logic), and return true so execution never reaches grpcResp.GetSuccess() when err != nil.
♻️ Duplicate comments (2)
internal/event-processor/handlers_system.go (1)
312-319: ⚠️ Potential issue | 🟠 Major: Don't auto-confirm decommission jobs.

A canceled decommission job can be requeued during offboarding. With `CompleteJobConfirmer()`, a late result from the old job is still accepted and can run `terminate()` on a system that has already moved on. Please gate confirmation on current system state instead of unconditionally completing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 312 - 319, The handler currently auto-confirms all decommission jobs in SystemUnlinkDecommissionJobHandler.HandleJobConfirm by returning orbital.CompleteJobConfirmer(), which allows stale jobs to run terminate() on systems that have already moved on; change HandleJobConfirm to look up the current system state (using the system ID or metadata from the incoming orbital.Job), and only return orbital.CompleteJobConfirmer() when the system is in a state that permits decommission (e.g., still assigned/active and not already offboarded); otherwise return the appropriate non-confirming result (e.g., cancel/reject/skip confirmer) so late jobs are not allowed to proceed. Ensure you use the existing job/system lookup helpers in this package and keep the logic within SystemUnlinkDecommissionJobHandler.HandleJobConfirm.

internal/event-processor/handlers_key.go (1)
107-110: ⚠️ Potential issue | 🟠 Major: Make detach audit logging non-fatal.

The key is already persisted as `DETACHED` here. Returning an error on auditor failure can poison a logically completed detach and trigger retries or duplicate callbacks for no state change.

Suggested fix:

```diff
-	err = h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String())
-	if err != nil {
-		return fmt.Errorf("failed to send CMK detach audit log for key %s: %w", key.ID, err)
-	}
+	if err := h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()); err != nil {
+		log.Error(ctx, "Failed to send CMK detach audit log", err, slog.String("keyID", key.ID.String()))
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 107 - 110, The handler currently returns an error when h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()) fails (using return fmt.Errorf(...)), which makes an already-persisted DETACHED key retryable; change this to treat the auditor failure as non-fatal: capture the error from SendCmkDetachAuditLog, log a warning including key.ID.String() and the error (replace the fmt.Errorf return with a non-fatal log call using the component logger available in the handler), and continue normal execution so the detach is not retried or duplicated.
🧹 Nitpick comments (1)
internal/event-processor/factory_test.go (1)
135-245: Add a factory test for `SystemUnlinkDecommission`.

This table was updated for the `SystemUnlink` signature change, but the new decommission job path introduced by this PR is still not exercised here. A direct assertion on the created job type and `SystemActionJobData` would catch regressions in the offboarding path much earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/factory_test.go` around lines 135 - 245, The tests table in factory_test.go exercises SystemLink/SystemUnlink/SystemSwitch variants but misses a case for the new SystemUnlinkDecommission path; add a test entry in the tests slice that calls eventProcessor.SystemUnlinkDecommission (or invokes SystemUnlink in the way that triggers decommission), supplies tenantID and keyIDFrom, and asserts the created orbital.Job has type eventprocessor.JobTypeSystemUnlinkDecommission (or the decommission job type) and that the unmarshaled eventprocessor.SystemActionJobData.KeyIDFrom matches the provided keyIDFrom; place the case alongside the other unlink cases and reuse the assertKeyID helper pattern to validate job type and expStatus (cmkapi.SystemStatusPROCESSING) so the offboarding path is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/event-processor/handlers_key.go`:
- Around line 91-96: HandleJobConfirm currently always returns
CompleteJobConfirmer which can incorrectly accept confirmations after a
cancellation; update KeyDetachJobHandler.HandleJobConfirm to read the current
key state (same store/API used by HandleJobCanceledEvent and as used in
handleSystemJobConfirm), and only return orbital.CompleteJobConfirmer() when the
key is in the expected DETACHING state; if the key state is UNKNOWN or anything
else return orbital.CancelJobConfirmer() so the late confirmation does not
overwrite the retry/cancelled state.
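The gated confirmation can be sketched with a stand-in confirmer type. Orbital's real API is not reproduced here; the decision values and function name are illustrative.

```go
package main

import "fmt"

type confirmer string

const (
	completeConfirmer confirmer = "complete" // accept and proceed
	cancelConfirmer   confirmer = "cancel"   // reject late or stale confirmations
)

// confirmKeyDetach mirrors the suggested HandleJobConfirm: only a key still in
// DETACHING may confirm; a canceled job has already moved the key to UNKNOWN,
// and a late confirmation must not overwrite that.
func confirmKeyDetach(keyState string) confirmer {
	if keyState == "DETACHING" {
		return completeConfirmer
	}
	return cancelConfirmer
}

func main() {
	for _, st := range []string{"DETACHING", "UNKNOWN", "DETACHED"} {
		fmt.Printf("state=%s -> %s\n", st, confirmKeyDetach(st))
	}
}
```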
In `@internal/event-processor/handlers_system.go`:
- Around line 354-376: The cancel handler HandleJobCanceledEvent in
SystemUnlinkDecommissionJobHandler currently sets system.Status =
cmkapi.SystemStatusFAILED; change it to restore the prior/connected state by
setting system.Status = cmkapi.SystemStatusCONNECTED after retrieving the system
(via unmarshalSystemJobData and getSystemByID) and before calling updateSystem;
ensure you only change the status on cancel and preserve the rest of the system
object when calling updateSystem.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 888f2abb-1da0-4272-9304-54cf61c70077
📒 Files selected for processing (19)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- internal/manager/system.go
- internal/event-processor/reconciler.go
- internal/api/cmkapi/cmkapi.go
- apis/cmk/cmk-ui.yaml
- internal/manager/tenant_test.go
- test/integration/event-processor/reconciliation_test.go
9c27616 to
243e8d3
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/event-processor/factory_test.go (1)
180-205: ⚠️ Potential issue | 🟠 Major

Add coverage for the decommission unlink factory path.
These changes only adapt the existing
`SystemUnlink` tests to the new signature. The PR also introduces `SystemUnlinkDecommission`, which has intentionally different behavior from a normal unlink, but this file still has no creation/confirmation coverage for that entrypoint. Please add dedicated cases here so regressions in the new factory wiring are caught.

Also applies to: 300-303
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/factory_test.go` around lines 180 - 205, Add test cases to cover the new SystemUnlinkDecommission factory path: update the table-driven tests around the SystemUnlink cases to include at least one case that calls eventProcessor.SystemUnlinkDecommission(ctx, system, from) (mirroring the existing SystemUnlink cases) and asserts the expected behavior — creation of a decommission unlink job, correct job type (use eventprocessor.JobTypeSystemUnlinkDecommission or the string it should produce), and any status transitions (e.g., system set to PROCESSING) and expected errors; ensure you include both success and error scenarios (missing tenant, system in PROCESSING) analogous to the existing SystemUnlink tests and reference the same test scaffolding (the test table, systemEventFn, systemStatus, tenantID, keyIDFrom, expType, expStatus, expErr) so the new cases integrate with the current test loop.

internal/operator/operator.go (1)
430-455: ⚠️ Potential issue | 🔴 Critical

Guard against nil
`grpcResp` on error paths.

When the RPC returns a non-nil error (particularly internal errors),
`grpcResp` is nil per gRPC semantics. The code currently dereferences `grpcResp.GetSuccess()` at line 449 without checking for this case. When an internal RPC error occurs (`st.Code() == codes.Internal`), the code logs at line 440 but does not return early, allowing execution to fall through and panic on the nil dereference. This crash can occur during tenant termination before offboarding completes.

The proposed fix correctly guards all
`grpcResp` access by returning early when `err != nil`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operator/operator.go` around lines 430 - 455, The RemoveOIDCMapping call can return a non-nil error and a nil grpcResp; after calling o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping and extracting st via status.FromError, ensure you return early on any non-nil err (after logging the internal error case) instead of continuing and dereferencing grpcResp; update the branch around err != nil to call setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed) and return immediately so grpcResp.GetSuccess() is only accessed when err == nil, and preserve the existing setErrorStateAndFail flow for a non-success grpcResp.
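The guard this finding asks for follows a general gRPC client rule: when a unary call returns a non-nil error, the response message must not be used. A minimal sketch of that shape with stand-in types rather than the project's generated client (`removeResp`, `callRemoveMapping`, and `removeOIDCMapping` are all hypothetical names):

```go
package main

import (
	"errors"
	"fmt"
)

// removeResp stands in for the generated gRPC response type.
type removeResp struct{ success bool }

func (r *removeResp) GetSuccess() bool { return r.success }

// callRemoveMapping simulates a unary RPC: on failure it returns a nil
// response alongside the error, matching gRPC client semantics.
func callRemoveMapping(fail bool) (*removeResp, error) {
	if fail {
		return nil, errors.New("rpc error: code = Internal")
	}
	return &removeResp{success: true}, nil
}

// removeOIDCMapping returns early on any RPC error, so the response is
// only dereferenced when err == nil.
func removeOIDCMapping(fail bool) error {
	resp, err := callRemoveMapping(fail)
	if err != nil {
		// the real handler would log and set its error state here
		return fmt.Errorf("remove OIDC mapping: %w", err)
	}
	if !resp.GetSuccess() {
		return errors.New("remove OIDC mapping: backend reported failure")
	}
	return nil
}
```

Generated protobuf getters are typically nil-tolerant for scalar fields, but relying on that while ignoring `err` still hides the failure; returning early keeps the error path explicit.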
♻️ Duplicate comments (9)
internal/event-processor/repo.go (1)
64-85: ⚠️ Potential issue | 🟠 Major

Check whether the patch actually updated a row.
Both helpers still ignore the
`bool` returned by `r.Patch`. If the keyed row is missing, they return `nil` and the caller proceeds as if the new system/key state was persisted. Please fail when `updated == false`.

🐛 Proposed fix
```diff
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}
 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/repo.go` around lines 64 - 85, Both updateSystem and updateKey call r.Patch but ignore the boolean "updated" result; change each to capture the returned values (updated, err := r.Patch(...)), return the wrapped error on err, and if updated is false return a new error (e.g. fmt.Errorf("no row updated for system %s", system.ID) / "for key %s") so callers don't assume the row was persisted; update the error messages in updateSystem and updateKey accordingly to include this failure case.

internal/event-processor/handlers_system.go (2)
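The zero-rows-affected pattern the comment describes is generic: a patch that matches no row is not a storage error, so the caller must check the reported match count itself. A sketch with a hypothetical in-memory store (`Store`, `Patch`, and `updateSystem` here are stand-ins, not the repo's real API):

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a stand-in for the project's repo; Patch reports whether a row matched.
type Store struct {
	rows map[string]string
}

// Patch mirrors the (updated bool, err error) result shape described in the comment.
func (s *Store) Patch(id, value string) (bool, error) {
	if _, ok := s.rows[id]; !ok {
		return false, nil // no matching row: not an error at the storage layer
	}
	s.rows[id] = value
	return true, nil
}

var errNotFound = errors.New("not found")

// updateSystem fails loudly when the keyed row is missing instead of
// returning nil as if the new state had been persisted.
func updateSystem(s *Store, id, status string) error {
	updated, err := s.Patch(id, status)
	if err != nil {
		return fmt.Errorf("failed to update system %s: %w", id, err)
	}
	if !updated {
		return fmt.Errorf("system %s upon job termination: %w", id, errNotFound)
	}
	return nil
}
```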
314-319: ⚠️ Potential issue | 🟠 Major

Use the standard system confirmer here too.
This decommission path currently accepts confirmations unconditionally, so late task completions from canceled/superseded jobs can still run the terminal handler. It should use the same
`handleSystemJobConfirm(...)` gate as the other system job handlers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 314 - 319, The SystemUnlinkDecommissionJobHandler.HandleJobConfirm currently accepts confirmations unconditionally; replace its body to call and return the shared gate function handleSystemJobConfirm(...) (same used by other system job handlers) so late completions from canceled/superseded jobs are rejected; locate SystemUnlinkDecommissionJobHandler.HandleJobConfirm and change the return to the result of handleSystemJobConfirm with the same parameters/signature used elsewhere.
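The gating idea behind a shared confirmer can be sketched without the orbital types: confirmation consults the current persisted state, and anything other than the expected in-flight state rejects the late result. All names below are illustrative stand-ins, not the project's API:

```go
package main

// confirmResult stands in for the orbital confirmer outcomes.
type confirmResult int

const (
	confirmComplete confirmResult = iota
	confirmCancel
)

// systemStatus stands in for the cmkapi system status enum.
type systemStatus string

const (
	statusProcessing systemStatus = "PROCESSING"
	statusConnected  systemStatus = "CONNECTED"
)

// confirmSystemJob only accepts a completion while the system is still in
// the in-flight state the job was created for; late results from canceled
// or superseded jobs are rejected instead of running the terminal handler.
func confirmSystemJob(current systemStatus) confirmResult {
	if current != statusProcessing {
		return confirmCancel
	}
	return confirmComplete
}
```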
356-376: ⚠️ Potential issue | 🟠 Major

Restore
`CONNECTED` on decommission cancel.

The decommission cancel path is supposed to put the system back into a retryable state so the next offboarding reconciliation recreates the unlink job. Setting
`FAILED` here breaks that contract and changes the flow into manual recovery.
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 356 - 376, In HandleJobCanceledEvent (SystemUnlinkDecommissionJobHandler) we currently set system.Status = cmkapi.SystemStatusFAILED which prevents automatic retry; change the status to the retryable connected state by setting system.Status = cmkapi.SystemStatusCONNECTED before calling updateSystem(ctx, h.repo, system) so the next offboarding reconciliation can recreate the unlink job.internal/event-processor/factory.go (1)
145-155: ⚠️ Potential issue | 🔴 Critical

Keep the
`PROCESSING` transition and job creation atomic.

This path commits the status change before
`createSystemEventJob(...)` runs. If job creation fails, the system is left in `PROCESSING` with no job/event to drive recovery. Reuse the guarded `handleSystemStatus(...)` flow here, or create the job inside the same transaction.
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/factory.go` around lines 145 - 155, The code sets system.Status to PROCESSING and commits that change inside f.repo.Transaction, then calls createSystemEventJob(...) outside the transaction, which can leave the system stuck if job creation fails; modify the logic to make the status transition and job creation atomic by either (A) moving the createSystemEventJob call into the same f.repo.Transaction callback so both the Patch and job creation occur under the same transactional guard, or (B) reuse the existing handleSystemStatus(...) flow to perform the PROCESSING transition and job creation together; ensure you reference and update system.Status, use f.repo.Transaction, and invoke createSystemEventJob (or call handleSystemStatus) within that transaction callback so failures roll back the status change.internal/manager/system.go (1)
432-444: ⚠️ Potential issue | 🔴 Critical

Guard
`m.registry` before calling `Mapping()`.
`NewSystemManager` allows `m.registry` to be nil, so this method can panic during tenant offboarding instead of returning a recoverable error. Please add the same defensive check used in the other registry-backed paths.
Verify each finding against the current code and only fix it if needed. In `@internal/manager/system.go` around lines 432 - 444, UnmapSystemFromRegistry currently calls m.registry.Mapping() without checking m.registry and can panic when m.registry is nil; update UnmapSystemFromRegistry in SystemManager to guard m.registry like other registry-backed methods by returning a recoverable error (e.g., nil registry error) if m.registry == nil before calling m.registry.Mapping(), and otherwise proceed to call Mapping(). Ensure you reference the SystemManager receiver, the m.registry field, and the UnmapSystemFromRegistry method when making the change so behavior matches other registry-backed paths.internal/event-processor/reconciler_test.go (1)
913-940: ⚠️ Potential issue | 🟠 Major

Expect
`CONNECTED` after decommission cancellation.

This assertion bakes in the opposite behavior from the PR contract: a canceled
`SYSTEM_UNLINK_DECOMMISSION` should restore the system to `CONNECTED` so the next offboarding reconciliation can recreate the job. Keeping `FAILED` here will make the test validate the bug instead of catching it.
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/reconciler_test.go` around lines 913 - 940, The test currently asserts sysAfter.Status is cmkapi.SystemStatusFAILED but the contract requires restoring CONNECTED on a canceled SYSTEM_UNLINK_DECOMMISSION; change the assertion in the test (in the t.Run "System status on canceled SYSTEM_UNLINK_DECOMMISSION job termination") to expect cmkapi.SystemStatusCONNECTED for sysAfter.Status after calling eventProcessor.JobCanceledFunc (using the SystemActionJobData and JobTypeSystemUnlinkDecommission flow), so the test verifies cancellation restores CONNECTED instead of FAILED.internal/event-processor/handlers_key.go (2)
107-110: ⚠️ Potential issue | 🟠 Major

Don't make a completed detach depend on audit-log delivery.
The key has already been updated to
`DETACHED` before this call. Returning an error here can poison the done hook and cause retries even though the detach itself already succeeded.
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 107 - 110, The code currently treats failures from h.cmkAuditor.SendCmkDetachAuditLog as fatal and returns an error, which can trigger retries even though the key was already set to DETACHED; change this so audit-log delivery failures do not block completion: in the handler that calls h.cmkAuditor.SendCmkDetachAuditLog (use the existing call site and key.ID), do not return an error on failure—instead log the error (e.g., h.logger.Warnf or h.logger.Error) and continue, or dispatch the SendCmkDetachAuditLog call asynchronously (goroutine) and swallow/log errors, ensuring the done hook is not poisoned by audit log delivery failures.
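Treating the audit log as best-effort after the state change has already been persisted can be sketched as log-and-continue. The `auditor` type and `finishDetach` function below are stand-ins under that assumption, not the handler's real API:

```go
package main

import (
	"fmt"
	"log"
)

// auditor is a stand-in for the CMK auditor client.
type auditor struct{ fail bool }

func (a *auditor) SendDetachAuditLog(keyID string) error {
	if a.fail {
		return fmt.Errorf("audit backend unavailable")
	}
	return nil
}

// finishDetach treats the audit log as best-effort: the key is already
// persisted as DETACHED, so an auditor failure is logged but never
// returned, which would otherwise re-trigger the completed detach.
func finishDetach(a *auditor, keyID string) error {
	if err := a.SendDetachAuditLog(keyID); err != nil {
		log.Printf("warn: detach audit log for key %s failed: %v", keyID, err)
	}
	return nil // the detach itself already succeeded
}
```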
91-95: ⚠️ Potential issue | 🟠 Major

Only confirm detach while the key is still
DETACHING.This unconditional confirmer accepts late results after
HandleJobCanceledEventhas already moved the key toUNKNOWN. That lets an old completion overwrite the retry state and incorrectly finalize the key asDETACHED.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 91 - 95, The confirmer unconditionally completes detach jobs; change KeyDetachJobHandler.HandleJobConfirm to first fetch the key's current state (via the handler's store/repo used elsewhere in this handler) and only return orbital.CompleteJobConfirmer() if the key state is exactly DETACHING; otherwise return a no-op confirmer (e.g., orbital.NoopJobConfirmer() or the project's equivalent) so late completions cannot overwrite state after HandleJobCanceledEvent moved the key to UNKNOWN. Ensure you reference KeyDetachJobHandler.HandleJobConfirm and the key-state enum/value DETACHING when locating where to add the guard.internal/manager/tenant.go (1)
298-327: ⚠️ Potential issue | 🟠 Major

Don't downgrade batch-processing failures to
OffboardingProcessing.If
`ProcessInBatch` fails, the unmap step may have been skipped or only partially executed. Returning `OffboardingProcessing` here hides that infrastructure failure behind an endless retry loop instead of surfacing it as a real offboarding failure/error.
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant.go` around lines 298 - 327, The batch-processing error from repo.ProcessInBatch is being downgraded to OffboardingProcessing which hides real failures; change the error branch after repo.ProcessInBatch (the block that currently logs and returns false, OffboardingProcessing) to log the error and return false, OffboardingFailed so infrastructure failures surface as offboarding failures; update the code paths around repo.ProcessInBatch, m.sys.UnmapSystemFromRegistry and m.unmapSystemErrorCanContinue accordingly to preserve the existing per-system status flags (hasOffboardingFailed / hasOffboardingProcessing) but treat a non-nil err from ProcessInBatch as an overall OffboardingFailed result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/event-processor/handlers_system.go`:
- Around line 396-408: The issue is that sendL1KeyClaim(..., false) is called
after clearing KeyConfigurationID and persisting the system, making cleanup
non-retryable if the external call fails; change the order so the external claim
reset is attempted first and only on success update the local system state.
Concretely, call sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
before mutating system.Status and system.KeyConfigurationID and before calling
updateSystem(ctx, h.repo, system); if sendL1KeyClaim returns an error, return
that error without persisting changes (or wrap both operations in a
transactional/compensating flow) so retries can succeed. Ensure you reference
the same system instance used in sendL1KeyClaim and preserve error handling
around updateSystem and sendL1KeyClaim.
In `@internal/event-processor/reconciler_test.go`:
- Around line 1081-1084: The helper terminateNewJob currently swallows
termination-hook errors by logging them; update the error handling to fail the
test immediately by replacing the t.Logf branch with a require.NoError assertion
(use require.NoError(t, err)) so any hook error in terminateNewJob causes the
test to fail and surfaces the real failure in the new state-transition tests.
In `@internal/manager/tenant_test.go`:
- Around line 398-400: Replace the inline errors.New("some error") in the test
with a package-level sentinel error to avoid err113; declare a top-level var
(e.g., var errTestGeneric = errors.New("some error")) in the same package and
use that sentinel in the test call to m.UnmapSystemErrorCanContinue(ctx, err)
(keep the test name and local variables ok, st unchanged).
- Around line 197-201: The test creates new keys with IsPrimary=true via
testutils.NewKey but never updates the shared keyConfig.PrimaryKeyID, so
assertions end up checking the wrong record; update the fixture so that after
creating the key (the testutils.NewKey call) you assign the created key's ID to
keyConfig.PrimaryKeyID (or alternatively change the assertions to reference the
original primary key), and make the same fix for the other occurrences that use
testutils.NewKey with IsPrimary=true so the keyConfig.PrimaryKeyID correctly
points to the inserted primary key when verifying the offboard/detach behavior.
---
Outside diff comments:
In `@internal/event-processor/factory_test.go`:
- Around line 180-205: Add test cases to cover the new SystemUnlinkDecommission
factory path: update the table-driven tests around the SystemUnlink cases to
include at least one case that calls
eventProcessor.SystemUnlinkDecommission(ctx, system, from) (mirroring the
existing SystemUnlink cases) and asserts the expected behavior — creation of a
decommission unlink job, correct job type (use
eventprocessor.JobTypeSystemUnlinkDecommission or the string it should produce),
and any status transitions (e.g., system set to PROCESSING) and expected errors;
ensure you include both success and error scenarios (missing tenant, system in
PROCESSING) analogous to the existing SystemUnlink tests and reference the same
test scaffolding (the test table, systemEventFn, systemStatus, tenantID,
keyIDFrom, expType, expStatus, expErr) so the new cases integrate with the
current test loop.
In `@internal/operator/operator.go`:
- Around line 430-455: The RemoveOIDCMapping call can return a non-nil error and
a nil grpcResp; after calling
o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping and extracting
st via status.FromError, ensure you return early on any non-nil err (after
logging the internal error case) instead of continuing and dereferencing
grpcResp; update the branch around err != nil to call
setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
and return immediately so grpcResp.GetSuccess() is only accessed when err ==
nil, and preserve the existing setErrorStateAndFail flow for a non-success
grpcResp.
---
Duplicate comments:
In `@internal/event-processor/factory.go`:
- Around line 145-155: The code sets system.Status to PROCESSING and commits
that change inside f.repo.Transaction, then calls createSystemEventJob(...)
outside the transaction, which can leave the system stuck if job creation fails;
modify the logic to make the status transition and job creation atomic by either
(A) moving the createSystemEventJob call into the same f.repo.Transaction
callback so both the Patch and job creation occur under the same transactional
guard, or (B) reuse the existing handleSystemStatus(...) flow to perform the
PROCESSING transition and job creation together; ensure you reference and update
system.Status, use f.repo.Transaction, and invoke createSystemEventJob (or call
handleSystemStatus) within that transaction callback so failures roll back the
status change.
In `@internal/event-processor/handlers_key.go`:
- Around line 107-110: The code currently treats failures from
h.cmkAuditor.SendCmkDetachAuditLog as fatal and returns an error, which can
trigger retries even though the key was already set to DETACHED; change this so
audit-log delivery failures do not block completion: in the handler that calls
h.cmkAuditor.SendCmkDetachAuditLog (use the existing call site and key.ID), do
not return an error on failure—instead log the error (e.g., h.logger.Warnf or
h.logger.Error) and continue, or dispatch the SendCmkDetachAuditLog call
asynchronously (goroutine) and swallow/log errors, ensuring the done hook is not
poisoned by audit log delivery failures.
- Around line 91-95: The confirmer unconditionally completes detach jobs; change
KeyDetachJobHandler.HandleJobConfirm to first fetch the key's current state (via
the handler's store/repo used elsewhere in this handler) and only return
orbital.CompleteJobConfirmer() if the key state is exactly DETACHING; otherwise
return a no-op confirmer (e.g., orbital.NoopJobConfirmer() or the project's
equivalent) so late completions cannot overwrite state after
HandleJobCanceledEvent moved the key to UNKNOWN. Ensure you reference
KeyDetachJobHandler.HandleJobConfirm and the key-state enum/value DETACHING when
locating where to add the guard.
In `@internal/event-processor/handlers_system.go`:
- Around line 314-319: The SystemUnlinkDecommissionJobHandler.HandleJobConfirm
currently accepts confirmations unconditionally; replace its body to call and
return the shared gate function handleSystemJobConfirm(...) (same used by other
system job handlers) so late completions from canceled/superseded jobs are
rejected; locate SystemUnlinkDecommissionJobHandler.HandleJobConfirm and change
the return to the result of handleSystemJobConfirm with the same
parameters/signature used elsewhere.
- Around line 356-376: In HandleJobCanceledEvent
(SystemUnlinkDecommissionJobHandler) we currently set system.Status =
cmkapi.SystemStatusFAILED which prevents automatic retry; change the status to
the retryable connected state by setting system.Status =
cmkapi.SystemStatusCONNECTED before calling updateSystem(ctx, h.repo, system) so
the next offboarding reconciliation can recreate the unlink job.
In `@internal/event-processor/reconciler_test.go`:
- Around line 913-940: The test currently asserts sysAfter.Status is
cmkapi.SystemStatusFAILED but the contract requires restoring CONNECTED on a
canceled SYSTEM_UNLINK_DECOMMISSION; change the assertion in the test (in the
t.Run "System status on canceled SYSTEM_UNLINK_DECOMMISSION job termination") to
expect cmkapi.SystemStatusCONNECTED for sysAfter.Status after calling
eventProcessor.JobCanceledFunc (using the SystemActionJobData and
JobTypeSystemUnlinkDecommission flow), so the test verifies cancellation
restores CONNECTED instead of FAILED.
In `@internal/event-processor/repo.go`:
- Around line 64-85: Both updateSystem and updateKey call r.Patch but ignore the
boolean "updated" result; change each to capture the returned values (updated,
err := r.Patch(...)), return the wrapped error on err, and if updated is false
return a new error (e.g. fmt.Errorf("no row updated for system %s", system.ID) /
"for key %s") so callers don't assume the row was persisted; update the error
messages in updateSystem and updateKey accordingly to include this failure case.
In `@internal/manager/system.go`:
- Around line 432-444: UnmapSystemFromRegistry currently calls
m.registry.Mapping() without checking m.registry and can panic when m.registry
is nil; update UnmapSystemFromRegistry in SystemManager to guard m.registry like
other registry-backed methods by returning a recoverable error (e.g., nil
registry error) if m.registry == nil before calling m.registry.Mapping(), and
otherwise proceed to call Mapping(). Ensure you reference the SystemManager
receiver, the m.registry field, and the UnmapSystemFromRegistry method when
making the change so behavior matches other registry-backed paths.
In `@internal/manager/tenant.go`:
- Around line 298-327: The batch-processing error from repo.ProcessInBatch is
being downgraded to OffboardingProcessing which hides real failures; change the
error branch after repo.ProcessInBatch (the block that currently logs and
returns false, OffboardingProcessing) to log the error and return false,
OffboardingFailed so infrastructure failures surface as offboarding failures;
update the code paths around repo.ProcessInBatch, m.sys.UnmapSystemFromRegistry
and m.unmapSystemErrorCanContinue accordingly to preserve the existing
per-system status flags (hasOffboardingFailed / hasOffboardingProcessing) but
treat a non-nil err from ProcessInBatch as an overall OffboardingFailed result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4bd08724-92b1-4904-a095-3fedfeafb07d
📒 Files selected for processing (20)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/export_test.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/event-processor/export_test.go
- internal/api/cmkapi/cmkapi.go
- internal/manager/key.go
243e8d3 to
dea0ed5
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/event-processor/handlers_key.go (1)
91-96: ⚠️ Potential issue | 🟠 Major

Implement state-based confirmation to prevent late job acceptance.
`HandleJobConfirm` unconditionally returns `CompleteJobConfirmer()`, but `HandleJobCanceledEvent` sets the key to `UNKNOWN` for retry. If a confirmation message arrives after cancellation, the job will still be accepted, potentially overwriting the retry state.

Follow the pattern used elsewhere: check the current key state before confirming. Return
`CancelJobConfirmer()` if the key is not in the expected `DETACHING` state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 91 - 96, HandleJobConfirm currently always returns orbital.CompleteJobConfirmer(), which can accept late confirmations; update KeyDetachJobHandler.HandleJobConfirm to read the current key state (the same store/lookup used by HandleJobCanceledEvent) and only return orbital.CompleteJobConfirmer() when the key is in the expected DETACHING state, otherwise return orbital.CancelJobConfirmer() to avoid accepting confirmations after the key was set to UNKNOWN for retry.
🧹 Nitpick comments (1)
internal/event-processor/helpers.go (1)
82-103: Consider using a different separator for error message aggregation.

The colon (
`:`) separator may cause readability issues since error messages often contain colons themselves (e.g., "context deadline exceeded: timeout"). Consider using a more distinct separator like `"; "` or newline.

♻️ Proposed change
```diff
- message := strings.Join(taskErrors, ":")
+ message := strings.Join(taskErrors, "; ")
```
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/helpers.go` around lines 82 - 103, The current mergeOrbitalTaskErrors function concatenates task error strings with ":" which can collide with colons in messages; update mergeOrbitalTaskErrors (function name) to join taskErrors using a clearer separator such as "; " or "\n" (e.g., strings.Join(taskErrors, "; ")) so aggregated messages remain readable and unambiguous when returned as message from mergeOrbitalTaskErrors.
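The nitpick generalizes: when aggregating error strings that may themselves contain the separator, pick one that rarely appears inside messages. A small sketch of the suggested change (`mergeTaskErrors` is an illustrative name, not the helper's real signature):

```go
package main

import "strings"

// mergeTaskErrors joins per-task error messages with "; ", which keeps
// message boundaries visible even when individual errors contain colons.
func mergeTaskErrors(taskErrors []string) string {
	return strings.Join(taskErrors, "; ")
}
```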
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/manager/tenant.go`:
- Around line 365-367: The log message in the error branch incorrectly mentions
OIDC mapping; update the call to log.Error(ctx, "error while removing OIDC
mapping", err) to accurately describe the failure (e.g., "error while removing
system-to-registry mapping" or "error unmapping system from registry") in the
function that performs system-to-registry unmapping (the branch where log.Error
is called with ctx and err, e.g., inside
UnmapSystemFromRegistry/RemoveSystemRegistryMapping); keep the same log.Error
signature and pass the existing err variable unchanged.
---
Duplicate comments:
In `@internal/event-processor/handlers_key.go`:
- Around line 91-96: HandleJobConfirm currently always returns
orbital.CompleteJobConfirmer(), which can accept late confirmations; update
KeyDetachJobHandler.HandleJobConfirm to read the current key state (the same
store/lookup used by HandleJobCanceledEvent) and only return
orbital.CompleteJobConfirmer() when the key is in the expected DETACHING state,
otherwise return orbital.CancelJobConfirmer() to avoid accepting confirmations
after the key was set to UNKNOWN for retry.
---
Nitpick comments:
In `@internal/event-processor/helpers.go`:
- Around line 82-103: The current mergeOrbitalTaskErrors function concatenates
task error strings with ":" which can collide with colons in messages; update
mergeOrbitalTaskErrors (function name) to join taskErrors using a clearer
separator such as "; " or "\n" (e.g., strings.Join(taskErrors, "; ")) so
aggregated messages remain readable and unambiguous when returned as message
from mergeOrbitalTaskErrors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f1958ceb-36ef-4ecf-aa74-6c3d200b5371
📒 Files selected for processing (20)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/export_test.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- internal/manager/export_test.go
- internal/event-processor/resolvers.go
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/repo.go
- internal/event-processor/factory_test.go
- apis/cmk/cmk-ui.yaml
dea0ed5 to f0aa47d
♻️ Duplicate comments (11)
internal/event-processor/handlers_system.go (3)
354-377: ⚠️ Potential issue | 🟠 Major
Restore CONNECTED on cancel per PR specification.
The PR objectives state: "Sets system status back to CONNECTED if the job is canceled (so an event is recreated and retried on next tenant offboarding reconciliation)." The current implementation sets FAILED, which contradicts this and may require manual intervention.
🐛 Proposed fix
```diff
-// HandleJobCanceledEvent puts the system to failed state so the system can be processed again during the next
-// decommission reconciliation loop. Avoid failed state as that would require manual intervention to recover.
+// HandleJobCanceledEvent restores the system to CONNECTED state so the event is recreated and retried
+// on the next tenant offboarding reconciliation.
 func (h *SystemUnlinkDecommissionJobHandler) HandleJobCanceledEvent(ctx context.Context, job orbital.Job) error {
 	ctx = log.InjectSystemEvent(ctx, job.Type)
 	data, err := unmarshalSystemJobData(job)
 	if err != nil {
 		return err
 	}
 	ctx = cmkcontext.CreateTenantContext(ctx, data.TenantID)
 	system, err := getSystemByID(ctx, h.repo, data.SystemID)
 	if err != nil {
 		return err
 	}
-	system.Status = cmkapi.SystemStatusFAILED
+	system.Status = cmkapi.SystemStatusCONNECTED
 	err = updateSystem(ctx, h.repo, system)
 	if err != nil {
 		return err
 	}
 	return nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 354 - 377, The handler HandleJobCanceledEvent in SystemUnlinkDecommissionJobHandler currently sets system.Status to cmkapi.SystemStatusFAILED; change this to cmkapi.SystemStatusCONNECTED so canceled jobs restore the system to CONNECTED, leaving the existing unmarshalSystemJobData, getSystemByID and updateSystem calls intact to persist the change; ensure you only modify the assignment of system.Status and keep the updateSystem(ctx, h.repo, system) call to save the new CONNECTED state.
396-408: ⚠️ Potential issue | 🟠 Major
Reset external key claim before clearing local linkage.
The current order persists DISCONNECTED with KeyConfigurationID = nil before calling sendL1KeyClaim. If the external call fails, later offboarding passes will consider the system fully unlinked and won't retry the cleanup. Swap the order so the external reset succeeds before dropping local linkage.
🐛 Proposed fix
```diff
 func (h *SystemUnlinkDecommissionJobHandler) terminate(
 	ctx context.Context,
 	job orbital.Job,
 ) (*model.System, string, error) {
 	ctx = log.InjectSystemEvent(ctx, job.Type)
 	data, err := unmarshalSystemJobData(job)
 	if err != nil {
 		return nil, "", err
 	}
 	ctx = cmkcontext.CreateTenantContext(ctx, data.TenantID)
 	system, err := getSystemByID(ctx, h.repo, data.SystemID)
 	if err != nil {
 		return nil, "", err
 	}
+	// Reset external key claim first, before clearing local linkage
+	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
+	if err != nil {
+		return nil, "", err
+	}
+
 	// For unlink decommission, we want to mark the system as disconnected and clean up the key claim regardless
 	// of whether the job succeeded or failed.
 	system.Status = cmkapi.SystemStatusDISCONNECTED
 	system.KeyConfigurationID = nil
 	err = updateSystem(ctx, h.repo, system)
 	if err != nil {
 		return nil, "", err
 	}
-	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
-	if err != nil {
-		return nil, "", err
-	}
-
 	return system, data.KeyIDFrom, nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 396 - 408, The external key-claim reset (sendL1KeyClaim) must be performed before clearing local linkage and persisting DISCONNECTED; move the sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) call to occur before you set system.Status = cmkapi.SystemStatusDISCONNECTED and system.KeyConfigurationID = nil and before calling updateSystem(ctx, h.repo, system); on success, then set the fields and call updateSystem; preserve existing error returns (propagate sendL1KeyClaim errors so local state isn't cleared if the external reset fails).
312-319: ⚠️ Potential issue | 🟠 Major
Use the standard confirmer to reject late results from canceled jobs.
Always returning CompleteJobConfirmer() accepts late results from canceled or superseded jobs. After a cancel/retry cycle, an old task completion can still run terminate() and disconnect the system unexpectedly. Use handleSystemJobConfirm for consistency.
🐛 Proposed fix
```diff
-// HandleJobConfirm for SystemUnlinkDecommissionJobHandler will require no additional validation to minimise
-// job cancellation.
 func (h *SystemUnlinkDecommissionJobHandler) HandleJobConfirm(
-	_ context.Context,
-	_ orbital.Job,
+	ctx context.Context,
+	job orbital.Job,
 ) (orbital.JobConfirmerResult, error) {
-	return orbital.CompleteJobConfirmer(), nil
+	return handleSystemJobConfirm(ctx, h.repo, job)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 312 - 319, The handler SystemUnlinkDecommissionJobHandler.HandleJobConfirm currently always returns orbital.CompleteJobConfirmer(), which accepts late results; change it to use the shared reject-late-results helper by calling handleSystemJobConfirm with the same context and job parameters and returning its orbital.JobConfirmerResult and error so canceled/superseded tasks are rejected consistently; update the function body to delegate to handleSystemJobConfirm(ctx, job) (preserving parameter names) instead of returning CompleteJobConfirmer().
internal/event-processor/factory.go (1)
145-156: ⚠️ Potential issue | 🟠 Major
Job creation should be inside the transaction to avoid stuck PROCESSING state.
The transaction commits system.Status = PROCESSING before createSystemEventJob runs. If job creation fails, the system remains stuck in PROCESSING with no event record for recovery. Move job creation inside the transaction or implement a compensation mechanism.
🐛 Proposed fix
```diff
 func (f *EventFactory) SystemUnlinkDecommission(
 	ctx context.Context, system *model.System, keyID string,
 ) (orbital.Job, error) {
 	if system.Status == cmkapi.SystemStatusPROCESSING {
 		return orbital.Job{}, ErrSystemProcessing
 	}
 	systemUnlinkJobData := SystemActionJobData{
 		SystemID:  system.ID.String(),
 		KeyIDFrom: keyID,
 	}
+	var job orbital.Job
 	err := f.repo.Transaction(ctx, func(ctx context.Context) error {
 		system.Status = cmkapi.SystemStatusPROCESSING
-		_, err := f.repo.Patch(ctx, system, *repo.NewQuery().UpdateAll(true))
-		return err
+		if err := updateSystem(ctx, f.repo, system); err != nil {
+			return err
+		}
+		var err error
+		job, err = f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+		return err
 	})
-
-	if err != nil {
-		return orbital.Job{}, err
-	}
-
-	return f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+	return job, err
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/factory.go` around lines 145 - 156, The code sets system.Status to PROCESSING and commits that change inside f.repo.Transaction but calls createSystemEventJob afterwards, risking a stuck PROCESSING state if job creation fails; move the call to createSystemEventJob (for JobTypeSystemUnlinkDecommission with systemUnlinkJobData) into the transaction callback passed to f.repo.Transaction so the transaction only commits when both the f.repo.Patch and createSystemEventJob succeed, returning any error from createSystemEventJob from the transaction function; ensure you propagate that error out of Transaction and return the created orbital.Job only on success.
internal/event-processor/repo.go (1)
64-85: ⚠️ Potential issue | 🟠 Major
Return an error when no row was patched.
Both updateSystem and updateKey ignore the boolean returned by r.Patch. If no row matches the query, callers will assume the update succeeded when nothing was persisted. This can cause silent failures during job termination flows.
🐛 Proposed fix
```diff
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}
 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}
 	return nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/repo.go` around lines 64 - 85, The patch functions updateSystem and updateKey currently ignore the boolean result returned by r.Patch; change both to inspect the returned "patched" boolean (from the call to r.Patch(ctx, system|key, *query)) and return a descriptive error if patched is false (e.g., "no rows patched for system %s" / "no rows patched for key %s"), keeping the existing fmt.Errorf wrapping for underlying errors; ensure you still return the error on r.Patch failure and only treat the boolean false as an additional error case, referencing the existing composite key construction (ck := repo.NewCompositeKey().Where(repo.IDField, ...)) and query (repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)).
internal/event-processor/handlers_key.go (2)
107-110: ⚠️ Potential issue | 🟠 Major
Don't fail the handler on audit log delivery failure.
The key is already persisted as DETACHED before the audit call. Returning an error here causes retries or duplicate callbacks for a logically completed job. Log the failure instead.
🐛 Proposed fix
```diff
-	err = h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String())
-	if err != nil {
-		return fmt.Errorf("failed to send CMK detach audit log for key %s: %w", key.ID, err)
-	}
+	if err := h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()); err != nil {
+		log.Error(ctx, "Failed to send CMK detach audit log", err)
+	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 107 - 110, The handler currently returns an error when h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()) fails, which triggers retries even though the key is already marked DETACHED; change this to log the failure and continue. Replace the "return fmt.Errorf(...)" path after calling h.cmkAuditor.SendCmkDetachAuditLog with a non-fatal log call (e.g., h.logger.Errorf or similar used in this package) that includes the key ID and the error, and do not return an error from this block so the handler completes successfully.
91-96: ⚠️ Potential issue | 🟠 Major
Implement state-based confirmation to reject late results.
HandleJobConfirm unconditionally returns CompleteJobConfirmer(). If a confirmation arrives after cancellation (when the key is set to UNKNOWN), it will still be accepted. Check the key state before confirming, similar to handleSystemJobConfirm.
🐛 Proposed fix
```diff
 func (h *KeyDetachJobHandler) HandleJobConfirm(
-	_ context.Context,
-	_ orbital.Job,
+	ctx context.Context,
+	job orbital.Job,
 ) (orbital.JobConfirmerResult, error) {
+	data, err := unmarshalKeyJobData(job)
+	if err != nil {
+		return orbital.CancelJobConfirmer(fmt.Sprintf("failed to unmarshal job data: %v", err)), err
+	}
+
+	ctx = cmkcontext.CreateTenantContext(ctx, data.TenantID)
+	key, err := getKeyByKeyID(ctx, h.repo, data.KeyID)
+	if err != nil {
+		if errors.Is(err, repo.ErrNotFound) {
+			return orbital.CancelJobConfirmer(fmt.Sprintf("key %s not found", data.KeyID)), nil
+		}
+		return nil, fmt.Errorf("failed to get key by ID %s: %w", data.KeyID, err)
+	}
+
+	if key.State != string(cmkapi.KeyStateDETACHING) {
+		return orbital.CancelJobConfirmer(fmt.Sprintf("key %s is not in DETACHING state", key.ID)), nil
+	}
+
 	return orbital.CompleteJobConfirmer(), nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 91 - 96, HandleJobConfirm currently always returns orbital.CompleteJobConfirmer(), which accepts confirmations even after a key was cancelled; update KeyDetachJobHandler.HandleJobConfirm to first load the key state (like handleSystemJobConfirm does) and if the key state is UNKNOWN (cancelled) return the job rejection confirmer (e.g., orbital.RejectJobConfirmer()) instead of CompleteJobConfirmer(); otherwise proceed to complete. Ensure you reference KeyDetachJobHandler.HandleJobConfirm, the UNKNOWN key state, and CompleteJobConfirmer when making the change.
internal/manager/system.go (1)
432-445: ⚠️ Potential issue | 🟠 Major
Guard against nil registry client.
NewSystemManager explicitly allows m.registry to be nil, but UnmapSystemFromRegistry unconditionally calls m.registry.Mapping(). This will panic during tenant offboarding if the manager was created without a registry client.
🛡️ Proposed fix
```diff
 func (m *SystemManager) UnmapSystemFromRegistry(ctx context.Context, system *model.System) error {
+	if m.registry == nil {
+		return fmt.Errorf("registry client not available for UnmapSystemFromRegistry")
+	}
+
 	tenant, err := cmkcontext.ExtractTenantID(ctx)
 	if err != nil {
 		return err
 	}

 	_, err = m.registry.Mapping().UnmapSystemFromTenant(ctx, &mappingv1.UnmapSystemFromTenantRequest{
 		ExternalId: system.Identifier,
 		Type:       strings.ToLower(system.Type),
 		TenantId:   tenant,
 	})

 	return err
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/system.go` around lines 432 - 445, UnmapSystemFromRegistry calls m.registry.Mapping() but NewSystemManager allows m.registry to be nil, so add a nil-check at the top of UnmapSystemFromRegistry: if m.registry == nil then return nil (no-op) to avoid a panic during tenant offboarding; otherwise proceed to call m.registry.Mapping().UnmapSystemFromTenant with the existing arguments (ExternalId: system.Identifier, Type: strings.ToLower(system.Type), TenantId: tenant). Ensure the function continues to return any error from UnmapSystemFromTenant when the registry is present.
internal/manager/tenant_test.go (1)
196-202: ⚠️ Potential issue | 🟡 Minor
Keep PrimaryKeyID aligned with the inserted primary key.
Both subtests create a new key with IsPrimary = true, but the shared keyConfig.PrimaryKeyID still points to the original key created at Lines 171-180. That leaves the fixture in an inconsistent state and can make the detach assertions validate the wrong record if any part of the offboarding path resolves primaries via PrimaryKeyID rather than IsPrimary.
```go
key := testutils.NewKey(func(k *model.Key) {
	k.KeyConfigurationID = keyConfig.ID
	k.IsPrimary = true
	k.State = string(cmkapi.KeyStateDETACHING) // or DETACHED/ENABLED per subtest
})
keyConfig.PrimaryKeyID = ptr.PointTo(key.ID)
```
Also applies to: 244-250
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant_test.go` around lines 196 - 202, The test creates new primary keys via testutils.NewKey (setting model.Key.IsPrimary = true) but never updates keyConfig.PrimaryKeyID, leaving keyConfig inconsistent; after each NewKey that sets IsPrimary = true (the keys used in the subtests), update keyConfig.PrimaryKeyID to point to that key's ID (e.g., set keyConfig.PrimaryKeyID = ptr.PointTo(key.ID)) so PrimaryKeyID is aligned with the inserted primary key for both affected subtests.
internal/manager/tenant.go (2)
365-367: ⚠️ Potential issue | 🟡 Minor
Fix the stale log message.
This branch is handling registry unmap failures, not OIDC cleanup. The current message will send on-call debugging in the wrong direction.
Suggested fix
```diff
-	log.Error(ctx, "error while removing OIDC mapping", err)
+	log.Error(ctx, "error while unmapping system from registry", err)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant.go` around lines 365 - 367, The log message in the error branch that currently calls log.Error(ctx, "error while removing OIDC mapping", err) is wrong for registry unmap failures; update the message to accurately reflect a registry unmap/unmapping failure (e.g., "error while unmapping registry" or similar) in the same error handling block so on-call debugging points to registry unmap problems — this change should be made where the code returns false, OffboardingProcessing for the err != nil branch.
114-118: ⚠️ Potential issue | 🟠 Major
Don't downgrade batch/repository failures to ordinary Processing.
repo.ProcessInBatch failing here means enumeration/batching itself broke, not that a single system returned a retryable registry error. Converting that into OffboardingProcessing hides the actual failure from OffboardTenant and can leave reconciliation spinning forever with only a log entry.
Suggested shape
```diff
-	canContinue, st := m.unmapAllSystemsFromRegistry(ctx)
+	canContinue, st, err := m.unmapAllSystemsFromRegistry(ctx)
+	if err != nil {
+		return OffboardingResult{}, err
+	}
 	if !canContinue {
 		return OffboardingResult{Status: st}, nil
 	}

-func (m *TenantManager) unmapAllSystemsFromRegistry(ctx context.Context) (bool, OffboardingStatus) {
+func (m *TenantManager) unmapAllSystemsFromRegistry(ctx context.Context) (bool, OffboardingStatus, error) {
 	…
 	if err != nil {
-		log.Error(ctx, "error while processing systems in batch to unmap from registry", err)
-		return false, OffboardingProcessing
+		return false, 0, err
 	}
 	…
-	return true, 0
+	return true, 0, nil
 }
```
Also applies to: 294-328
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant.go` around lines 114 - 118, The current code in OffboardTenant converts any non-canContinue result from m.unmapAllSystemsFromRegistry into OffboardingProcessing, which hides batch/repository failures from callers; modify the handling so that if unmapAllSystemsFromRegistry returns a terminal or error status (e.g., a status produced when repo.ProcessInBatch fails), you return that specific OffboardingResult{Status: st} directly instead of mapping it to OffboardingProcessing. Locate the call to m.unmapAllSystemsFromRegistry (and the similar block at ~294-328) and change the conditional to propagate non-retryable/error statuses as-is, only turning transient registry errors into OffboardingProcessing when the returned status explicitly indicates a retryable registry issue. Ensure the logic distinguishes repo/batch failures from per-system retryable errors before returning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/event-processor/factory.go`:
- Around line 145-156: The code sets system.Status to PROCESSING and commits
that change inside f.repo.Transaction but calls createSystemEventJob afterwards,
risking a stuck PROCESSING state if job creation fails; move the call to
createSystemEventJob (for JobTypeSystemUnlinkDecommission with
systemUnlinkJobData) into the transaction callback passed to f.repo.Transaction
so the transaction only commits when both the f.repo.Patch and
createSystemEventJob succeed, returning any error from createSystemEventJob from
the transaction function; ensure you propagate that error out of Transaction and
return the created orbital.Job only on success.
In `@internal/event-processor/handlers_key.go`:
- Around line 107-110: The handler currently returns an error when
h.cmkAuditor.SendCmkDetachAuditLog(ctx, key.ID.String()) fails, which triggers
retries even though the key is already marked DETACHED; change this to log the
failure and continue. Replace the "return fmt.Errorf(...)" path after calling
h.cmkAuditor.SendCmkDetachAuditLog with a non-fatal log call (e.g.,
h.logger.Errorf or similar used in this package) that includes the key ID and
the error, and do not return an error from this block so the handler completes
successfully.
- Around line 91-96: HandleJobConfirm currently always returns
orbital.CompleteJobConfirmer(), which accepts confirmations even after a key was
cancelled; update KeyDetachJobHandler.HandleJobConfirm to first load the key
state (like handleSystemJobConfirm does) and if the key state is UNKNOWN
(cancelled) return the job rejection confirmer (e.g.,
orbital.RejectJobConfirmer()) instead of CompleteJobConfirmer(); otherwise
proceed to complete. Ensure you reference KeyDetachJobHandler.HandleJobConfirm,
the UNKNOWN key state, and CompleteJobConfirmer when making the change.
In `@internal/event-processor/handlers_system.go`:
- Around line 354-377: The handler HandleJobCanceledEvent in
SystemUnlinkDecommissionJobHandler currently sets system.Status to
cmkapi.SystemStatusFAILED; change this to cmkapi.SystemStatusCONNECTED so
canceled jobs restore the system to CONNECTED, leaving the existing
unmarshalSystemJobData, getSystemByID and updateSystem calls intact to persist
the change; ensure you only modify the assignment of system.Status and keep the
updateSystem(ctx, h.repo, system) call to save the new CONNECTED state.
- Around line 396-408: The external key-claim reset (sendL1KeyClaim) must be
performed before clearing local linkage and persisting DISCONNECTED; move the
sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) call to occur
before you set system.Status = cmkapi.SystemStatusDISCONNECTED and
system.KeyConfigurationID = nil and before calling updateSystem(ctx, h.repo,
system); on success, then set the fields and call updateSystem; preserve
existing error returns (propagate sendL1KeyClaim errors so local state isn't
cleared if the external reset fails).
- Around line 312-319: The handler
SystemUnlinkDecommissionJobHandler.HandleJobConfirm currently always returns
orbital.CompleteJobConfirmer(), which accepts late results; change it to use the
shared reject-late-results helper by calling handleSystemJobConfirm with the
same context and job parameters and returning its orbital.JobConfirmerResult and
error so canceled/superseded tasks are rejected consistently; update the
function body to delegate to handleSystemJobConfirm(ctx, job) (preserving
parameter names) instead of returning CompleteJobConfirmer().
In `@internal/event-processor/repo.go`:
- Around line 64-85: The patch functions updateSystem and updateKey currently
ignore the boolean result returned by r.Patch; change both to inspect the
returned "patched" boolean (from the call to r.Patch(ctx, system|key, *query))
and return a descriptive error if patched is false (e.g., "no rows patched for
system %s" / "no rows patched for key %s"), keeping the existing fmt.Errorf
wrapping for underlying errors; ensure you still return the error on r.Patch
failure and only treat the boolean false as an additional error case,
referencing the existing composite key construction (ck :=
repo.NewCompositeKey().Where(repo.IDField, ...)) and query
(repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)).
In `@internal/manager/system.go`:
- Around line 432-445: UnmapSystemFromRegistry calls m.registry.Mapping() but
NewSystemManager allows m.registry to be nil, so add a nil-check at the top of
UnmapSystemFromRegistry: if m.registry == nil then return nil (no-op) to avoid a
panic during tenant offboarding; otherwise proceed to call
m.registry.Mapping().UnmapSystemFromTenant with the existing arguments
(ExternalId: system.Identifier, Type: strings.ToLower(system.Type), TenantId:
tenant). Ensure the function continues to return any error from
UnmapSystemFromTenant when the registry is present.
In `@internal/manager/tenant_test.go`:
- Around line 196-202: The test creates new primary keys via testutils.NewKey
(setting model.Key.IsPrimary = true) but never updates keyConfig.PrimaryKeyID,
leaving keyConfig inconsistent; after each NewKey that sets IsPrimary = true
(the keys used in the subtests), update keyConfig.PrimaryKeyID to point to that
key's ID (e.g., set keyConfig.PrimaryKeyID = ptr.PointTo(key.ID)) so
PrimaryKeyID is aligned with the inserted primary key for both affected
subtests.
In `@internal/manager/tenant.go`:
- Around line 365-367: The log message in the error branch that currently calls
log.Error(ctx, "error while removing OIDC mapping", err) is wrong for registry
unmap failures; update the message to accurately reflect a registry
unmap/unmapping failure (e.g., "error while unmapping registry" or similar) in
the same error handling block so on-call debugging points to registry unmap
problems — this change should be made where the code returns false,
OffboardingProcessing for the err != nil branch.
- Around line 114-118: The current code in OffboardTenant converts any
non-canContinue result from m.unmapAllSystemsFromRegistry into
OffboardingProcessing, which hides batch/repository failures from callers;
modify the handling so that if unmapAllSystemsFromRegistry returns a terminal or
error status (e.g., a status produced when repo.ProcessInBatch fails), you
return that specific OffboardingResult{Status: st} directly instead of mapping
it to OffboardingProcessing. Locate the call to m.unmapAllSystemsFromRegistry
(and the similar block at ~294-328) and change the conditional to propagate
non-retryable/error statuses as-is, only turning transient registry errors into
OffboardingProcessing when the returned status explicitly indicates a retryable
registry issue. Ensure the logic distinguishes repo/batch failures from
per-system retryable errors before returning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f7df641f-9854-4c99-a532-91f00dc74b8b
📒 Files selected for processing (20)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/export_test.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- test/integration/event-processor/reconciliation_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
- apis/cmk/cmk-ui.yaml
- internal/event-processor/reconciler_test.go
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/event-processor/reconciler.go
- internal/event-processor/helpers.go
- internal/event-processor/factory_test.go
f0aa47d to 2e7c866
bkieling left a comment
Nicely done. Just a few comments.
2e7c866 to 1d2c99e
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/operator/operator.go (1)
437-459: ⚠️ Potential issue | 🔴 Critical
Return on RemoveOIDCMapping errors before touching grpcResp.
The codes.Internal branch only logs and falls through. That means an internal remove error can still dereference or trust grpcResp, which can panic on a nil response or let tenant offboarding continue even though OIDC cleanup failed.
Suggested fix
```diff
 	st, ok := status.FromError(err)
-	if !ok {
+	if !ok && err != nil {
 		log.Error(ctx, "failed getting info on sessionManager error", err)
 	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
-		log.Error(ctx, "error while removing OIDC mapping", err)
+
+	if err != nil {
+		if ok && st.Code() == codes.Internal {
+			log.Error(ctx, "removeOIDC failed with internal err", err)
+		} else {
+			log.Error(ctx, "error while removing OIDC mapping", err)
+		}
 		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
 		return true
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operator/operator.go` around lines 437 - 459, The code currently logs an Internal error from RemoveOIDCMapping but continues and may dereference grpcResp; update the Internal branch to treat it like other error branches by calling setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed) (or setErrorStateAndFail if intended) and returning true immediately; modify the block handling st.Code() == codes.Internal around status.FromError(err) to set the error state and return before any use of grpcResp, referencing st, err, grpcResp, setErrorStateAndContinue, and WorkingStateOIDCMappingRemoveFailed.
♻️ Duplicate comments (3)
internal/event-processor/handlers_system.go (1)
400-410: ⚠️ Potential issue | 🟠 Major
Reset the external L1 claim before clearing the local link.
terminate() persists DISCONNECTED and KeyConfigurationID = nil before sendL1KeyClaim(..., false) runs. If that external call fails, later offboarding passes will treat the system as already unlinked and won't enqueue another cleanup attempt, leaving the registry claim stuck.
Suggested ordering
```diff
-	system.Status = cmkapi.SystemStatusDISCONNECTED
-	system.KeyConfigurationID = nil
-	err = updateSystem(ctx, h.repo, system)
-	if err != nil {
-		return nil, "", err
-	}
-
 	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
 	if err != nil {
 		return nil, "", err
 	}
+
+	system.Status = cmkapi.SystemStatusDISCONNECTED
+	system.KeyConfigurationID = nil
+	err = updateSystem(ctx, h.repo, system)
+	if err != nil {
+		return nil, "", err
+	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 400 - 410, The current flow in terminate() sets system.Status = cmkapi.SystemStatusDISCONNECTED and system.KeyConfigurationID = nil and persists via updateSystem before calling sendL1KeyClaim(..., false), which can leave the external registry claim orphaned on send failure; change the ordering so you call sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) before mutating/persisting system.Status and system.KeyConfigurationID (or if you prefer to keep the persist-first approach, roll back the persisted changes when sendL1KeyClaim returns an error), ensuring updateSystem only runs after sendL1KeyClaim succeeds and referencing sendL1KeyClaim, updateSystem, system.Status, system.KeyConfigurationID, and terminate to locate the code to modify.
internal/manager/system.go (1)
432-444: ⚠️ Potential issue | 🟠 Major
Guard UnmapSystemFromRegistry when the registry client is absent.
NewSystemManager explicitly allows m.registry to be unset, but this method unconditionally dereferences m.registry.Mapping(). Any offboarding flow built in that mode will panic instead of returning a retryable error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/system.go` around lines 432 - 444, UnmapSystemFromRegistry currently dereferences m.registry.Mapping() unconditionally; add a guard at the start of UnmapSystemFromRegistry to check that m.registry and m.registry.Mapping() are non-nil and if not return a retryable error indicating the registry client is not configured/available. Locate the UnmapSystemFromRegistry method and add that nil-check before calling m.registry.Mapping().UnmapSystemFromTenant so offboarding flows in manager instances created without a registry do not panic but return a retryable failure.
internal/event-processor/handlers_key.go (1)
91-96: ⚠️ Potential issue | 🟠 Major

Cancel late detach confirmations unless the key is still `DETACHING`.

`HandleJobConfirm` still completes unconditionally. If `HandleJobCanceledEvent` already moved the key to `UNKNOWN`, a delayed confirmation can still run the done/failed hook and overwrite the retry state back to `DETACHED`. Only confirm while the persisted key state is still `DETACHING`; otherwise cancel the job.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 91 - 96, HandleJobConfirm currently always returns orbital.CompleteJobConfirmer(), causing late confirmations to overwrite retry state; change it to first load the persisted key state (extract the key ID from the orbital.Job passed in the same way other handler methods do) using the handler's store access used in HandleJobCanceledEvent, and only return orbital.CompleteJobConfirmer() if the persisted state equals "DETACHING" (or the handler's equivalent constant); otherwise return orbital.CancelJobConfirmer() (and surface any DB error as the method error).
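To illustrate the gating this comment asks for, here is a minimal, self-contained Go sketch. The state and confirmer names are hypothetical stand-ins for the real orbital types and repository lookup, not the project's actual API:

```go
package main

import "fmt"

// Hypothetical stand-ins for the key state machine and the confirmer
// result; the real handler works with orbital.Job and the repo layer.
type keyState string

const (
	stateDetaching keyState = "DETACHING"
	stateUnknown   keyState = "UNKNOWN"
)

type confirmResult string

const (
	confirmComplete confirmResult = "complete"
	confirmCancel   confirmResult = "cancel"
)

// confirmDetachJob only completes while the persisted key state is still
// DETACHING; a key already moved back to UNKNOWN by a canceled-job event
// is cancelled so the retry state is not overwritten.
func confirmDetachJob(persisted keyState) confirmResult {
	if persisted != stateDetaching {
		return confirmCancel
	}
	return confirmComplete
}

func main() {
	fmt.Println(confirmDetachJob(stateDetaching)) // complete
	fmt.Println(confirmDetachJob(stateUnknown))   // cancel
}
```

The key point is that the persisted state, not the job payload, decides whether a late confirmation may proceed.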
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/operator/operator.go`:
- Around line 437-459: The code currently logs an Internal error from
RemoveOIDCMapping but continues and may dereference grpcResp; update the
Internal branch to treat it like other error branches by calling
setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
(or setErrorStateAndFail if intended) and returning true immediately; modify the
block handling st.Code() == codes.Internal around status.FromError(err) to set
the error state and return before any use of grpcResp, referencing st, err,
grpcResp, setErrorStateAndContinue, and WorkingStateOIDCMappingRemoveFailed.
---
Duplicate comments:
In `@internal/event-processor/handlers_key.go`:
- Around line 91-96: HandleJobConfirm currently always returns
orbital.CompleteJobConfirmer(), causing late confirmations to overwrite retry
state; change it to first load the persisted key state (extract the key ID from
the orbital.Job passed in the same way other handler methods do) using the
handler's store access used in HandleJobCanceledEvent, and only return
orbital.CompleteJobConfirmer() if the persisted state equals "DETACHING" (or the
handler's equivalent constant); otherwise return orbital.CancelJobConfirmer()
(and surface any DB error as the method error).
In `@internal/event-processor/handlers_system.go`:
- Around line 400-410: The current flow in terminate() sets system.Status =
cmkapi.SystemStatusDISCONNECTED and system.KeyConfigurationID = nil and persists
via updateSystem before calling sendL1KeyClaim(..., false), which can leave the
external registry claim orphaned on send failure; change the ordering so you
call sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) before
mutating/persisting system.Status and system.KeyConfigurationID (or if you
prefer to keep the persist-first approach, roll back the persisted changes when
sendL1KeyClaim returns an error), ensuring updateSystem only runs after
sendL1KeyClaim succeeds and referencing sendL1KeyClaim, updateSystem,
system.Status, system.KeyConfigurationID, and terminate to locate the code to
modify.
In `@internal/manager/system.go`:
- Around line 432-444: UnmapSystemFromRegistry currently dereferences
m.registry.Mapping() unconditionally; add a guard at the start of
UnmapSystemFromRegistry to check that m.registry and m.registry.Mapping() are
non-nil and if not return a retryable error indicating the registry client is
not configured/available. Locate the UnmapSystemFromRegistry method and add that
nil-check before calling m.registry.Mapping().UnmapSystemFromTenant so
offboarding flows in manager instances created without a registry do not panic
but return a retryable failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 79a0a2a5-c437-407e-be82-1e0cadbad852
📒 Files selected for processing (22)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/export_test.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- internal/operator/operator_test.go
- sonar-project.properties
- test/integration/event-processor/reconciliation_test.go
✅ Files skipped from review due to trivial changes (3)
- sonar-project.properties
- internal/event-processor/helpers.go
- internal/manager/key.go
🚧 Files skipped from review as they are similar to previous changes (11)
- internal/event-processor/resolvers.go
- internal/event-processor/export_test.go
- internal/event-processor/types.go
- internal/manager/export_test.go
- apis/cmk/cmk-ui.yaml
- test/integration/event-processor/reconciliation_test.go
- internal/event-processor/repo.go
- internal/event-processor/factory_test.go
- internal/event-processor/reconciler.go
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/factory.go
✅ Actions performed

Full review triggered.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/operator/operator.go (1)
431-459: ⚠️ Potential issue | 🔴 Critical

Return on RPC errors before dereferencing `st` or `grpcResp`.

On the success path, Line 441 can dereference a nil status from `status.FromError(nil)`. On the error path, the `codes.Internal` branch falls through to `grpcResp.GetSuccess()` even though `grpcResp` is usually nil there.

🐛 Suggested change
status.FromError(nil). On the error path, thecodes.Internalbranch falls through togrpcResp.GetSuccess()even thoughgrpcRespis usually nil there.In google.golang.org/grpc/status, what does status.FromError(nil) return? Is the returned *Status nil?🐛 Suggested change
grpcResp, err := o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping( ctx, &oidcmappinggrpc.RemoveOIDCMappingRequest{ TenantId: tenantID, }, ) - st, ok := status.FromError(err) - if !ok { - log.Error(ctx, "failed getting info on sessionManager error", err) - } - if st.Code() == codes.Internal { - log.Error(ctx, "removeOIDC failed with internal err", err) - } - if err != nil && st.Code() != codes.Internal { - log.Error(ctx, "error while removing OIDC mapping", err) - setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed) - return true - } + if err != nil { + st, ok := status.FromError(err) + if !ok { + log.Error(ctx, "failed getting info on sessionManager error", err) + } + if st != nil && st.Code() == codes.Internal { + log.Error(ctx, "removeOIDC failed with internal err", err) + } + + log.Error(ctx, "error while removing OIDC mapping", err) + setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed) + return true + } if !grpcResp.GetSuccess() { err = errs.Wrapf( ErrFailedResponse, "session manager could not remove OIDC mapping: "+grpcResp.GetMessage(),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operator/operator.go` around lines 431 - 459, The code dereferences st and grpcResp before ensuring err is nil; change the RemoveOIDCMapping handling so you first check if err != nil, log it (including st when non-nil), handle the codes.Internal case and return (or set error state) before touching st.Code() or grpcResp; only call status.FromError and use st.Code() after confirming err != nil (or use st safely), and before calling grpcResp.GetSuccess() ensure grpcResp != nil (or return on error) in the path around o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping, updating the branches that call setErrorStateAndContinue and setErrorStateAndFail accordingly.
♻️ Duplicate comments (4)
internal/event-processor/repo.go (1)
64-85: ⚠️ Potential issue | 🟠 Major

Handle `Patch` "not updated" results explicitly.

Both helpers ignore `Patch`’s boolean result, so a no-op patch is treated as success.

🐛 Proposed fix
```diff
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}
 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/repo.go` around lines 64 - 85, The updateSystem and updateKey helpers ignore the boolean result from r.Patch so no-op patches are treated as success; change both functions to capture the first return value (e.g., updated, err := r.Patch(...)), check if err != nil as before, and also if updated == false return a descriptive error (for example "no rows updated" or "patch did not change anything" including system.ID or key.ID) so callers can detect and handle no-op patch results.

internal/event-processor/reconciler_test.go (1)
1116-1119: ⚠️ Potential issue | 🟡 Minor

Fail this helper immediately on termination-hook errors.
These subtests are asserting terminal side effects, but Line 1118 only logs hook failures. A broken decommission or detach hook can still leave the test green.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/reconciler_test.go` around lines 1116 - 1119, The termination-hook error currently only logs failures ("Job termination returned error: %v") which lets subtests pass despite hook errors; change the handler in reconciler_test.go so that when err != nil it calls t.Fatalf (or t.Fatalf with the same message) instead of t.Logf, ensuring the test fails immediately on termination-hook errors; update the error branch where "Job termination returned error" is logged to use t.Fatalf and preserve the error message.

internal/event-processor/handlers_system.go (1)
400-412: ⚠️ Potential issue | 🟠 Major

Reset the external L1 claim before clearing the local link.

After Line 404, the system is already persisted as `DISCONNECTED` with `KeyConfigurationID = nil`. If `sendL1KeyClaim(..., false)` then fails, later offboarding passes will treat the system as fully unlinked and will not schedule another cleanup attempt.

🐛 Suggested change
```diff
 	// For unlink decommission, we want to mark the system as disconnected and clean up the key claim regardless
 	// of whether the job succeeded or failed.
-	system.Status = cmkapi.SystemStatusDISCONNECTED
-	system.KeyConfigurationID = nil
-	err = updateSystem(ctx, h.repo, system)
-	if err != nil {
-		return nil, "", err
-	}
-
 	err = sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false)
 	if err != nil {
 		return nil, "", err
 	}
+
+	system.Status = cmkapi.SystemStatusDISCONNECTED
+	system.KeyConfigurationID = nil
+	err = updateSystem(ctx, h.repo, system)
+	if err != nil {
+		return nil, "", err
+	}

 	return system, data.KeyIDFrom, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 400 - 412, The code clears system.KeyConfigurationID and persists DISCONNECTED before calling sendL1KeyClaim, which can leave the external L1 claim uncleared if sendL1KeyClaim fails; move the revocation call earlier: call sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) while the original KeyConfigurationID is still present and only set system.KeyConfigurationID = nil and call updateSystem(ctx, h.repo, system) after sendL1KeyClaim returns successfully; if sendL1KeyClaim returns an error, return that error without mutating or persisting the KeyConfigurationID so a later offboarding attempt can retry.

internal/event-processor/factory.go (1)
145-155: ⚠️ Potential issue | 🔴 Critical

Create the decommission job inside the same transaction as the status flip.

After Line 149 commits, any `createSystemEventJob` failure leaves the system stuck in `PROCESSING` with no persisted event/job to recover it.

🐛 Suggested change
```diff
-	err := f.repo.Transaction(ctx, func(ctx context.Context) error {
+	var job orbital.Job
+	err := f.repo.Transaction(ctx, func(ctx context.Context) error {
 		system.Status = cmkapi.SystemStatusPROCESSING
 		_, err := f.repo.Patch(ctx, system, *repo.NewQuery().UpdateAll(true))
-		return err
+		if err != nil {
+			return err
+		}
+
+		job, err = f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+		return err
 	})
 	if err != nil {
 		return orbital.Job{}, err
 	}

-	return f.createSystemEventJob(ctx, JobTypeSystemUnlinkDecommission, systemUnlinkJobData)
+	return job, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/factory.go` around lines 145 - 155, The status flip to PROCESSING and the creation of the decommission job must occur inside the same transaction so failures roll back; move the call to f.createSystemEventJob (JobTypeSystemUnlinkDecommission with systemUnlinkJobData) into the f.repo.Transaction closure after setting system.Status and after f.repo.Patch, and return any error from createSystemEventJob so the transaction will abort on failure; ensure you call/create the job using repository methods available in the transaction context so both the status update and job persist succeed atomically.
🧹 Nitpick comments (1)
internal/event-processor/helpers.go (1)
67-88: Optionally skip empty task error messages before joining.

This avoids delimiter-only artifacts in merged error strings.
♻️ Proposed tweak
```diff
 	taskErrors := make([]string, 0, len(tasks))
 	for _, t := range tasks {
-		taskErrors = append(taskErrors, t.ErrorMessage)
+		if t.ErrorMessage == "" {
+			continue
+		}
+		taskErrors = append(taskErrors, t.ErrorMessage)
 	}
 	message := strings.Join(taskErrors, ":")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/helpers.go` around lines 67 - 88, mergeOrbitalTaskErrors currently includes empty task error messages which produces delimiter-only artifacts; update the loop in mergeOrbitalTaskErrors to trim each t.ErrorMessage (use strings.TrimSpace) and only append non-empty results to taskErrors (or otherwise filter out empty strings) before joining with ":" so the returned message contains only real error text; reference function mergeOrbitalTaskErrors and variables taskErrors and t.ErrorMessage when making the change.
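A self-contained sketch of the filtering the nitpick and its prompt describe; the `task` type is a hypothetical stand-in for the orbital task shape:

```go
package main

import (
	"fmt"
	"strings"
)

// task is a hypothetical minimal stand-in for the orbital task type.
type task struct{ ErrorMessage string }

// mergeTaskErrors joins trimmed, non-empty task error messages, skipping
// blanks so the result contains no delimiter-only artifacts like "::".
func mergeTaskErrors(tasks []task) string {
	msgs := make([]string, 0, len(tasks))
	for _, t := range tasks {
		m := strings.TrimSpace(t.ErrorMessage)
		if m == "" {
			continue
		}
		msgs = append(msgs, m)
	}
	return strings.Join(msgs, ":")
}

func main() {
	tasks := []task{{"boom"}, {""}, {"  "}, {"timeout"}}
	fmt.Println(mergeTaskErrors(tasks)) // boom:timeout
}
```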
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/api/cmkapi/cmkapi.go`:
- Around line 404-405: The OpenAPI description for the State field (State
*KeyState on the Key/Key Version struct in cmkapi.go) incorrectly defines
UNKNOWN only as an authentication-failure state; update the json/tag comment and
any other occurrences (the other State field docs around the file, e.g., the
similar doc strings referenced) to state that UNKNOWN also represents a key
moved back to UNKNOWN when a detach job is canceled (i.e., include both
"authentication failure" and "canceled-detach" transitions), then regenerate the
OpenAPI/spec and clients so generated client docs reflect the new definition of
UNKNOWN.
In `@internal/manager/tenant_test.go`:
- Around line 268-279: The failing unmap assertions are passing prematurely
because the previous subtest leaves a linked system in the repository and
mockSystemManager.UnlinkSystemAction doesn't update repo state, so
checkAllSystemsUnlinked can make OffboardTenant return
OffboardingContinueAndWait without hitting UnmapSystemFromRegistry; fix by
isolating fixtures per subtest: ensure the system repository is reset or
recreate the tenant/system fixture before each unmap-related subtest (the t.Run
blocks that set mockSystemManager with unlinkErr and UnlinkSystemAction), or
explicitly remove/clear linked systems from the repo prior to invoking
m.OffboardTenant in those subtests so UnmapSystemFromRegistry/UnlinkSystemAction
code path is actually exercised (apply same change to the later subtest around
the other Unmap/Unlink assertions).
---
Outside diff comments:
In `@internal/operator/operator.go`:
- Around line 431-459: The code dereferences st and grpcResp before ensuring err
is nil; change the RemoveOIDCMapping handling so you first check if err != nil,
log it (including st when non-nil), handle the codes.Internal case and return
(or set error state) before touching st.Code() or grpcResp; only call
status.FromError and use st.Code() after confirming err != nil (or use st
safely), and before calling grpcResp.GetSuccess() ensure grpcResp != nil (or
return on error) in the path around
o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping, updating the
branches that call setErrorStateAndContinue and setErrorStateAndFail
accordingly.
---
Duplicate comments:
In `@internal/event-processor/factory.go`:
- Around line 145-155: The status flip to PROCESSING and the creation of the
decommission job must occur inside the same transaction so failures roll back;
move the call to f.createSystemEventJob (JobTypeSystemUnlinkDecommission with
systemUnlinkJobData) into the f.repo.Transaction closure after setting
system.Status and after f.repo.Patch, and return any error from
createSystemEventJob so the transaction will abort on failure; ensure you
call/create the job using repository methods available in the transaction
context so both the status update and job persist succeed atomically.
In `@internal/event-processor/handlers_system.go`:
- Around line 400-412: The code clears system.KeyConfigurationID and persists
DISCONNECTED before calling sendL1KeyClaim, which can leave the external L1
claim uncleared if sendL1KeyClaim fails; move the revocation call earlier: call
sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) while the
original KeyConfigurationID is still present and only set
system.KeyConfigurationID = nil and call updateSystem(ctx, h.repo, system) after
sendL1KeyClaim returns successfully; if sendL1KeyClaim returns an error, return
that error without mutating or persisting the KeyConfigurationID so a later
offboarding attempt can retry.
In `@internal/event-processor/reconciler_test.go`:
- Around line 1116-1119: The termination-hook error currently only logs failures
("Job termination returned error: %v") which lets subtests pass despite hook
errors; change the handler in reconciler_test.go so that when err != nil it
calls t.Fatalf (or t.Fatalf with the same message) instead of t.Logf, ensuring
the test fails immediately on termination-hook errors; update the error branch
where "Job termination returned error" is logged to use t.Fatalf and preserve
the error message.
In `@internal/event-processor/repo.go`:
- Around line 64-85: The updateSystem and updateKey helpers ignore the boolean
result from r.Patch so no-op patches are treated as success; change both
functions to capture the first return value (e.g., updated, err :=
r.Patch(...)), check if err != nil as before, and also if updated == false
return a descriptive error (for example "no rows updated" or "patch did not
change anything" including system.ID or key.ID) so callers can detect and handle
no-op patch results.
---
Nitpick comments:
In `@internal/event-processor/helpers.go`:
- Around line 67-88: mergeOrbitalTaskErrors currently includes empty task error
messages which produces delimiter-only artifacts; update the loop in
mergeOrbitalTaskErrors to trim each t.ErrorMessage (use strings.TrimSpace) and
only append non-empty results to taskErrors (or otherwise filter out empty
strings) before joining with ":" so the returned message contains only real
error text; reference function mergeOrbitalTaskErrors and variables taskErrors
and t.ErrorMessage when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 977cecbc-2aa3-4ecd-9ee1-75072d2d4f45
📒 Files selected for processing (22)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/export_test.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- internal/operator/operator_test.go
- sonar-project.properties
- test/integration/event-processor/reconciliation_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apis/cmk/cmk-ui.yaml (1)
2027-2032: ⚠️ Potential issue | 🟡 Minor

Document the new `UNKNOWN` detach outcome too.

The updated schema still says `UNKNOWN` is customer-held-only, but the new detach flow can also move a key to `UNKNOWN` when a detach job is canceled. That makes the generated contract inaccurate for system-managed/BYOK keys during tenant termination.

📝 Suggested wording
```diff
-        the states PENDING_DELETION, DELETED, FORBIDDEN and UNKNOWN are applicable only to customer held keys. Keys and Versions
-        are in UNKNOWN state if the authentication to the customer key fails due to any reason. FORBIDDEN state is for
+        the states PENDING_DELETION, DELETED and FORBIDDEN are applicable only to customer held keys. UNKNOWN is used when
+        authentication to the customer key fails and also when a detach job is canceled. FORBIDDEN state is for
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apis/cmk/cmk-ui.yaml` around lines 2027 - 2032, Update the description block that begins "Indicates the current state of the Key/Key Version" to document that the UNKNOWN state can also occur as a detach outcome (e.g., when a detach job is cancelled), not only for customer-held keys; explicitly mention that UNKNOWN may apply to system-managed and BYOK keys during tenant termination/detach flows and ensure the sentence listing DETACHING/DETACHED outcomes reflects this expanded applicability.internal/operator/operator.go (1)
431-459: ⚠️ Potential issue | 🔴 Critical

`removeOIDCMapping` panics when a gRPC error is returned.

The `codes.Internal` branch (line 441) logs the error but does not return, allowing the code to fall through to line 450 where `grpcResp.GetSuccess()` is called. When an RPC error occurs, `grpcResp` is typically nil, causing a panic. Branch on `err` first and return early for all error cases before accessing the response.

Suggested fix
```diff
 	grpcResp, err := o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(
 		ctx,
 		&oidcmappinggrpc.RemoveOIDCMappingRequest{
 			TenantId: tenantID,
 		},
 	)
-	st, ok := status.FromError(err)
-	if !ok {
-		log.Error(ctx, "failed getting info on sessionManager error", err)
-	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
+
+	if err != nil {
+		if status.Code(err) == codes.Internal {
+			log.Error(ctx, "removeOIDC failed with internal err", err)
+			setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
+			return true
+		}
+
 		log.Error(ctx, "error while removing OIDC mapping", err)
 		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
 		return true
 	}
 	if !grpcResp.GetSuccess() {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operator/operator.go` around lines 431 - 459, The code calls grpcResp.GetSuccess() even when the RPC returned an error, which can panic because grpcResp may be nil; update the RemoveOIDCMapping handling to check err first and return early for any error before accessing grpcResp: after calling o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(...) use status.FromError(err) only to enrich logs but if err != nil then log the error (including internal vs other codes), call setErrorStateAndContinue / setErrorStateAndFail as appropriate (using WorkingStateOIDCMappingRemoveFailed), and return true; only proceed to inspect grpcResp.GetSuccess() when err is nil to avoid the nil dereference.
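The early-return shape the fix describes can be shown with stdlib-only stand-ins; `removeResp` and `removeMapping` below are hypothetical substitutes for the generated gRPC response and the session-manager call:

```go
package main

import (
	"errors"
	"fmt"
)

// removeResp is a hypothetical stand-in for the gRPC response type.
type removeResp struct{ success bool }

func (r *removeResp) GetSuccess() bool { return r != nil && r.success }

// removeMapping simulates the RPC: on failure it returns a nil response,
// which is exactly why the caller must branch on err before using it.
func removeMapping(fail bool) (*removeResp, error) {
	if fail {
		return nil, errors.New("rpc error: internal")
	}
	return &removeResp{success: true}, nil
}

// removeOIDCMapping returns early on any RPC error and never touches the
// response on that path, mirroring the suggested control flow.
func removeOIDCMapping(fail bool) string {
	resp, err := removeMapping(fail)
	if err != nil {
		// Log / set error state here; resp may be nil on this path.
		return "error-state"
	}
	if !resp.GetSuccess() {
		return "failed-response"
	}
	return "ok"
}

func main() {
	fmt.Println(removeOIDCMapping(true))  // error-state
	fmt.Println(removeOIDCMapping(false)) // ok
}
```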
♻️ Duplicate comments (8)
internal/event-processor/repo.go (1)
64-85: ⚠️ Potential issue | 🟠 Major

Fail when the patch matched no row.

Line 68 and Line 80 drop the `bool` returned by `repo.Repo.Patch`. If the row is gone or the keyed query matches nothing, these helpers still return `nil`, so the decommission flow can proceed as if the terminal system/key state was persisted when nothing was written.

🐛 Proposed fix
```diff
 func updateSystem(ctx context.Context, r repo.Repo, system *model.System) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, system.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, system, *query)
+	updated, err := r.Patch(ctx, system, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("system %s not found upon job termination", system.ID)
+	}
 	return nil
 }

 func updateKey(ctx context.Context, r repo.Repo, key *model.Key) error {
 	ck := repo.NewCompositeKey().Where(repo.IDField, key.ID)
 	query := repo.NewQuery().Where(repo.NewCompositeKeyGroup(ck)).UpdateAll(true)
-	_, err := r.Patch(ctx, key, *query)
+	updated, err := r.Patch(ctx, key, *query)
 	if err != nil {
 		return fmt.Errorf("failed to update key %s upon job termination: %w", key.ID, err)
 	}
+	if !updated {
+		return fmt.Errorf("key %s not found upon job termination", key.ID)
+	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/repo.go` around lines 64 - 85, The helpers updateSystem and updateKey call r.Patch and ignore the boolean result that indicates whether any row was matched; fix both functions (updateSystem, updateKey) to capture the returned bool (e.g., ok, err := r.Patch(...)) and if ok is false return an error (similar to the existing fmt.Errorf messages) indicating the patch matched no rows so the decommission/termination state was not persisted; keep existing error wrapping for err as well.

internal/api/cmkapi/cmkapi.go (1)
404-405: ⚠️ Potential issue | 🟡 Minor

Document canceled detach as another `UNKNOWN` transition.

Line 404, Line 463, and Line 613 still describe `UNKNOWN` only as an authentication-failure state, but this PR also moves a key back to `UNKNOWN` when a detach job is canceled. Please update the source OpenAPI description and regenerate the generated artifacts so clients see the full detach state machine.

Also applies to: 463-464, 613-614
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/cmkapi/cmkapi.go` around lines 404 - 405, Update the OpenAPI/Go struct documentation for the State field so UNKNOWN includes both authentication-failure and "detach job canceled" transitions: modify the comment on State (*KeyState `json:"state,omitempty"`) and any other occurrences of the Key state description (the other doc comments referencing KeyState at the same locations) to explicitly mention that a canceled detach moves the key/version back to UNKNOWN; after editing the source docstrings/OpenAPI description, regenerate the OpenAPI artifacts and client/server code so the updated state machine appears in generated clients.

internal/event-processor/factory.go (1)
136-155: ⚠️ Potential issue | 🔴 Critical

Keep decommission scheduling atomic with the status transition.

This path still commits `system.Status = PROCESSING` before `createSystemEventJob` runs. If tenant extraction or `PrepareJob` fails, the system is stranded in `PROCESSING` with no event/job to drive recovery. Reuse the same guarded flow as `SystemUnlink` (or extract a no-event variant) so the status update is rolled back when job creation fails.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/factory.go` around lines 136 - 155, The status update to PROCESSING must be done atomically with job creation so failures in tenant extraction/PrepareJob don't leave the system stuck; modify the flow around f.repo.Transaction, f.repo.Patch, and createSystemEventJob (and the SystemActionJobData/JobTypeSystemUnlinkDecommission path) so the PATCH to set system.Status = cmkapi.SystemStatusPROCESSING and the call that creates/prepares the job (createSystemEventJob or an extracted no-event variant used by SystemUnlink) occur inside the same transaction/guarded flow and return an error to rollback if job creation fails, mirroring the guarded pattern used by SystemUnlink.

internal/manager/system.go (1)
432-444: ⚠️ Potential issue | 🔴 Critical

Guard `m.registry` before calling `Mapping()`.

`NewSystemManager` still allows the registry client to be nil, but this method unconditionally dereferences it. Any offboarding flow built in that mode will panic instead of returning a retryable error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/system.go` around lines 432 - 444, The UnmapSystemFromRegistry method dereferences m.registry without checking for nil; update UnmapSystemFromRegistry in SystemManager to guard m.registry (or m.registry.Mapping()) before calling Mapping(), and if nil return a retryable error (e.g., context-aware temporary error) rather than panicking; also ensure NewSystemManager's allowance of a nil registry is honored by consistently returning that error path when m.registry is nil so offboarding flows fail safely instead of causing a panic.

internal/event-processor/handlers_key.go (1)
91-96: ⚠️ Potential issue | 🟠 Major

Only confirm detach jobs while the key is still `DETACHING`.

`HandleJobCanceledEvent` now moves the key to `UNKNOWN` so offboarding can retry, but `HandleJobConfirm` still unconditionally returns `CompleteJobConfirmer()`. A late confirmation can therefore accept work after cancellation and overwrite that retry state. Mirror the system confirmer logic and cancel when the key is no longer `DETACHING`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_key.go` around lines 91 - 96, Only confirm detach jobs while the key is still DETACHING: in KeyDetachJobHandler.HandleJobConfirm, read the current key state (using the same store/client used by HandleJobCanceledEvent in this handler) and if the key's state is not DETACHING return orbital.CancelJobConfirmer() (nil error); only return orbital.CompleteJobConfirmer() when the live state is DETACHING. Mirror the system confirmer behavior by querying the key state before completing the job to avoid late confirmations overwriting the UNKNOWN retry state.

internal/event-processor/reconciler_test.go (1)
1116-1119: ⚠️ Potential issue | 🟡 Minor
Fail this helper on hook errors.
These tests are asserting termination side effects, but the helper still only logs `JobDoneFunc`/`JobFailedFunc` failures. That can leave a broken hook path green for the wrong reason. Use `require.NoError` here instead of `t.Logf`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/reconciler_test.go` around lines 1116 - 1119, The helper in reconciler_test.go currently swallows errors from the job termination hook by logging them (the block using t.Logf for errors from JobDoneFunc/JobFailedFunc); change this to fail the test on hook errors by replacing the t.Logf check with a require.NoError(t, err) (importing testify/require if not already present) so any error from the termination hook causes the test to fail rather than silently pass.
internal/manager/tenant_test.go (1)
268-279: ⚠️ Potential issue | 🟠 Major
This shared fixture can still hide whether unmapping is exercised.
`mockSystemManager.UnlinkSystemAction` returns an error without changing repo state, so the linked system created here can leak into the next subtest. That lets `OffboardTenant` stop at the “systems still linked” check and return `ContinueAndWait` before `UnmapSystemFromRegistry` is reached. Reset the fixture per subtest or clear linked systems before the unmap assertions.
🤖 Prompt for AI Agents
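A toy illustration of why the fixture must be reset per subtest. The `repoState`, `newFixture`, and `offboardStep` names are hypothetical simplifications of the real test repo and the `OffboardTenant` early-exit:

```go
package main

import "fmt"

// repoState is a toy stand-in for the test repository; the real tests build
// state via testutils.CreateTestEntities and a mock SystemManager.
type repoState struct {
	linkedSystems int
}

// newFixture builds a fresh fixture so one subtest's leftover linked system
// cannot leak into the next, mirroring the "reset per subtest" suggestion.
func newFixture() *repoState {
	return &repoState{linkedSystems: 1} // one linked system under test
}

// offboardStep returns "ContinueAndWait" while systems are still linked,
// mimicking the early exit in OffboardTenant that can hide the unmap path.
func offboardStep(r *repoState) string {
	if r.linkedSystems > 0 {
		return "ContinueAndWait"
	}
	return "GoToNextStep"
}

func main() {
	// Subtest 1: unlink fails, so the system stays linked in this fixture only.
	f1 := newFixture()
	fmt.Println(offboardStep(f1)) // ContinueAndWait

	// Subtest 2: a fresh fixture with the system unlinked reaches the unmap path.
	f2 := newFixture()
	f2.linkedSystems = 0
	fmt.Println(offboardStep(f2)) // GoToNextStep
}
```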
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant_test.go` around lines 268 - 279, The test uses a shared linked-system fixture so OffboardTenant may bail early at the "systems still linked" check; to ensure UnmapSystemFromRegistry is actually exercised, reset the fixture per subtest by creating a fresh system and repo state inside this t.Run (use testutils.NewSystem/testutils.CreateTestEntities within the subtest) or clear linked systems before calling m.OffboardTenant, and ensure the mockSystemManager used via m.SetSystemForTests either simulates state changes on UnlinkSystemAction or is removed (set nil/reset) after the subtest so the linked system cannot leak into following tests; update the mock or test setup accordingly for the test that asserts unmap behavior.
internal/manager/tenant.go (1)
345-361: ⚠️ Potential issue | 🟡 Minor
Unreachable default case and incorrect log message.
Two issues in this switch:
Unreachable code: Since `err == nil` returns early at line 337-339, `err` is guaranteed non-nil in the switch. The `case err != nil:` at line 355 will always match if none of the preceding cases do, making the `default` case unreachable.
Misleading log message: The message "error while removing OIDC mapping" doesn't match this function's purpose (unmapping systems from registry).
📝 Proposed fix
```diff
 	switch {
 	case st.Code() == codes.FailedPrecondition && strings.Contains(st.Message(), "system is not linked to the tenant"):
 		log.Info(ctx, "system is not linked to the tenant in registry, might have been already unlinked")
 		return OffboardingGoToNextStep
 	case st.Code() == codes.NotFound && strings.Contains(st.Message(), "system not found"):
 		log.Info(ctx, "system mapping not found in registry, might have been already removed")
 		return OffboardingGoToNextStep
 	case st.Code() == codes.InvalidArgument:
 		log.Error(ctx, "invalid argument error while unmapping system from tenant", err)
 		return OffboardingFailed
-	case err != nil:
-		log.Error(ctx, "error while removing OIDC mapping", err)
+	default:
+		log.Error(ctx, "error while unmapping system from registry", err)
 		return OffboardingContinueAndWait
-	default:
-		return OffboardingGoToNextStep
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant.go` around lines 345 - 361, The switch handling gRPC status after attempting to unmap a system has an always-true `case err != nil` and a misleading log; remove the `case err != nil:` branch and instead make the `default` branch the fallback for unexpected non-matching error statuses, logging a correct message (e.g., "error while unmapping system from registry") and returning OffboardingContinueAndWait; keep the existing specific cases for FailedPrecondition, NotFound, and InvalidArgument (referencing variables st, err and the Offboarding* return constants) so the switch is reachable and semantics are preserved.
🧹 Nitpick comments (1)
internal/event-processor/handlers_system.go (1)
358-381: Verify cancel behavior aligns with tenant offboarding expectations.
Setting the system to `FAILED` on cancel is intentional per PR objectives to allow retry on the next decommission reconciliation. However, `sendUnlinkForConnectedSystems` filters by `KeyConfigIDField != null`, while `terminate()` sets `KeyConfigurationID = nil` on Done/Failed. This means a canceled job will set `FAILED` but retain `KeyConfigurationID`, so it will be re-picked in the next loop, which is correct.
Consider adding a brief comment clarifying why FAILED is chosen here (vs CONNECTED in the regular unlink handler) to help future maintainers understand the different semantics.
📝 Suggested clarifying comment
```diff
 // HandleJobCanceledEvent puts the system to failed state so the system can be processed again during the next
 // decommission reconciliation loop.
+// Note: Unlike regular SystemUnlinkJobHandler, we intentionally use FAILED (not CONNECTED) because
+// decommission has no user cancel/retry flow—FAILED ensures the system is re-enqueued on the next offboarding pass.
 func (h *SystemUnlinkDecommissionJobHandler) HandleJobCanceledEvent(ctx context.Context, job orbital.Job) error {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 358 - 381, Add a short clarifying comment in SystemUnlinkDecommissionJobHandler.HandleJobCanceledEvent right before setting system.Status = cmkapi.SystemStatusFAILED that explains why FAILED is chosen (so the system is re-evaluated by the next decommission reconciliation loop) and note the interaction with sendUnlinkForConnectedSystems and terminate(): sendUnlinkForConnectedSystems filters by KeyConfigurationID != nil while terminate() clears KeyConfigurationID on Done/Failed, so marking FAILED here intentionally retains KeyConfigurationID and allows re-pickup on the next loop (contrast this with the regular unlink handler which sets CONNECTED).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/manager/key.go`:
- Around line 422-424: The current code sets key.State and then calls
km.repo.Patch(ctx, key, *repo.NewQuery().UpdateAll(true)), which overwrites the
entire row and can clobber concurrent changes; change the patch to only update
the state column instead of UpdateAll(true) — e.g., call km.repo.Patch with a
query that targets only the "state" field (use the repository/query API to
Update("state") or equivalent) so only key.State is persisted while leaving
other columns untouched.
---
Outside diff comments:
In `@apis/cmk/cmk-ui.yaml`:
- Around line 2027-2032: Update the description block that begins "Indicates the
current state of the Key/Key Version" to document that the UNKNOWN state can
also occur as a detach outcome (e.g., when a detach job is cancelled), not only
for customer-held keys; explicitly mention that UNKNOWN may apply to
system-managed and BYOK keys during tenant termination/detach flows and ensure
the sentence listing DETACHING/DETACHED outcomes reflects this expanded
applicability.
In `@internal/operator/operator.go`:
- Around line 431-459: The code calls grpcResp.GetSuccess() even when the RPC
returned an error, which can panic because grpcResp may be nil; update the
RemoveOIDCMapping handling to check err first and return early for any error
before accessing grpcResp: after calling
o.clientsFactory.SessionManager().OIDCMapping().RemoveOIDCMapping(...) use
status.FromError(err) only to enrich logs but if err != nil then log the error
(including internal vs other codes), call setErrorStateAndContinue /
setErrorStateAndFail as appropriate (using WorkingStateOIDCMappingRemoveFailed),
and return true; only proceed to inspect grpcResp.GetSuccess() when err is nil
to avoid the nil dereference.
---
Duplicate comments:
In `@internal/api/cmkapi/cmkapi.go`:
- Around line 404-405: Update the OpenAPI/Go struct documentation for the State
field so UNKNOWN includes both authentication-failure and "detach job canceled"
transitions: modify the comment on State (*KeyState `json:"state,omitempty"`)
and any other occurrences of the Key state description (the other doc comments
referencing KeyState at the same locations) to explicitly mention that a
canceled detach moves the key/version back to UNKNOWN; after editing the source
docstrings/OpenAPI description, regenerate the OpenAPI artifacts and
client/server code so the updated state machine appears in generated clients.
In `@internal/event-processor/factory.go`:
- Around line 136-155: The status update to PROCESSING must be done atomically
with job creation so failures in tenant extraction/PrepareJob don't leave the
system stuck; modify the flow around f.repo.Transaction, f.repo.Patch, and
createSystemEventJob (and the
SystemActionJobData/JobTypeSystemUnlinkDecommission path) so the PATCH to set
system.Status = cmkapi.SystemStatusPROCESSING and the call that creates/prepares
the job (createSystemEventJob or an extracted no-event variant used by
SystemUnlink) occur inside the same transaction/guarded flow and return an error
to rollback if job creation fails, mirroring the guarded pattern used by
SystemUnlink.
In `@internal/event-processor/handlers_key.go`:
- Around line 91-96: Only confirm detach jobs while the key is still DETACHING:
in KeyDetachJobHandler.HandleJobConfirm, read the current key state (using the
same store/client used by HandleJobCanceledEvent in this handler) and if the
key's state is not DETACHING return orbital.CancelJobConfirmer() (nil error);
only return orbital.CompleteJobConfirmer() when the live state is DETACHING.
Mirror the system confirmer behavior by querying the key state before completing
the job to avoid late confirmations overwriting the UNKNOWN retry state.
In `@internal/event-processor/reconciler_test.go`:
- Around line 1116-1119: The helper in reconciler_test.go currently swallows
errors from the job termination hook by logging them (the block using t.Logf for
errors from JobDoneFunc/JobFailedFunc); change this to fail the test on hook
errors by replacing the t.Logf check with a require.NoError(t, err) (importing
testify/require if not already present) so any error from the termination hook
causes the test to fail rather than silently pass.
In `@internal/event-processor/repo.go`:
- Around line 64-85: The helpers updateSystem and updateKey call r.Patch and
ignore the boolean result that indicates whether any row was matched; fix both
functions (updateSystem, updateKey) to capture the returned bool (e.g., ok, err
:= r.Patch(...)) and if ok is false return an error (similar to the existing
fmt.Errorf messages) indicating the patch matched no rows so the
decommission/termination state was not persisted; keep existing error wrapping
for err as well.
In `@internal/manager/system.go`:
- Around line 432-444: The UnmapSystemFromRegistry method dereferences
m.registry without checking for nil; update UnmapSystemFromRegistry in
SystemManager to guard m.registry (or m.registry.Mapping()) before calling
Mapping(), and if nil return a retryable error (e.g., context-aware temporary
error) rather than panicking; also ensure NewSystemManager's allowance of a nil
registry is honored by consistently returning that error path when m.registry is
nil so offboarding flows fail safely instead of causing a panic.
In `@internal/manager/tenant_test.go`:
- Around line 268-279: The test uses a shared linked-system fixture so
OffboardTenant may bail early at the "systems still linked" check; to ensure
UnmapSystemFromRegistry is actually exercised, reset the fixture per subtest by
creating a fresh system and repo state inside this t.Run (use
testutils.NewSystem/testutils.CreateTestEntities within the subtest) or clear
linked systems before calling m.OffboardTenant, and ensure the mockSystemManager
used via m.SetSystemForTests either simulates state changes on
UnlinkSystemAction or is removed (set nil/reset) after the subtest so the linked
system cannot leak into following tests; update the mock or test setup
accordingly for the test that asserts unmap behavior.
In `@internal/manager/tenant.go`:
- Around line 345-361: The switch handling gRPC status after attempting to unmap
a system has an always-true `case err != nil` and a misleading log; remove the
`case err != nil:` branch and instead make the `default` branch the fallback for
unexpected non-matching error statuses, logging a correct message (e.g., "error
while unmapping system from registry") and returning OffboardingContinueAndWait;
keep the existing specific cases for FailedPrecondition, NotFound, and
InvalidArgument (referencing variables st, err and the Offboarding* return
constants) so the switch is reachable and semantics are preserved.
---
Nitpick comments:
In `@internal/event-processor/handlers_system.go`:
- Around line 358-381: Add a short clarifying comment in
SystemUnlinkDecommissionJobHandler.HandleJobCanceledEvent right before setting
system.Status = cmkapi.SystemStatusFAILED that explains why FAILED is chosen (so
the system is re-evaluated by the next decommission reconciliation loop) and
note the interaction with sendUnlinkForConnectedSystems and terminate():
sendUnlinkForConnectedSystems filters by KeyConfigurationID != nil while
terminate() clears KeyConfigurationID on Done/Failed, so marking FAILED here
intentionally retains KeyConfigurationID and allows re-pickup on the next loop
(contrast this with the regular unlink handler which sets CONNECTED).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b66573ce-937a-45b7-b950-dec9134af181
📒 Files selected for processing (22)
apis/cmk/cmk-ui.yaml
internal/api/cmkapi/cmkapi.go
internal/event-processor/export_test.go
internal/event-processor/factory.go
internal/event-processor/factory_test.go
internal/event-processor/handlers_key.go
internal/event-processor/handlers_system.go
internal/event-processor/helpers.go
internal/event-processor/reconciler.go
internal/event-processor/reconciler_test.go
internal/event-processor/repo.go
internal/event-processor/resolvers.go
internal/event-processor/types.go
internal/manager/export_test.go
internal/manager/key.go
internal/manager/system.go
internal/manager/tenant.go
internal/manager/tenant_test.go
internal/operator/operator.go
internal/operator/operator_test.go
sonar-project.properties
test/integration/event-processor/reconciliation_test.go
a7629a1 to
fee5d34
Compare
|
@coderabbitai resolve |
✅ Actions performed
Comments resolved. Auto-approval is disabled; enable |
|
@coderabbitai review |
✅ Actions performed
Review triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/operator/operator.go (1)
441-461: ⚠️ Potential issue | 🟠 Major
Clarify error handling for `codes.Internal`: the current flow causes unintended task failure.
When `err != nil` and `st.Code() == codes.Internal`, the code logs the error but falls through to line 454. Since `grpcResp` is nil when an error occurs, `grpcResp.GetSuccess()` returns `false` (proto getter safety), which triggers `setErrorStateAndFail`.
If Internal errors should fail the task, the special logging at line 446 is misleading. If Internal errors should be handled differently (e.g., proceed with termination), the function should return early after logging.
🔧 Option A: If Internal errors should fail the task
```diff
 	st, ok := status.FromError(err)
 	if !ok {
 		log.Error(ctx, "failed getting info on sessionManager error", err)
 	}
-	if st.Code() == codes.Internal {
-		log.Error(ctx, "removeOIDC failed with internal err", err)
-	}
-	if err != nil && st.Code() != codes.Internal {
+	if err != nil {
 		log.Error(ctx, "error while removing OIDC mapping", err)
 		setErrorStateAndContinue(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed)
 		return true
 	}
```

🔧 Option B: If Internal errors should allow proceeding (e.g., mapping already removed)
```diff
 	st, ok := status.FromError(err)
 	if !ok {
 		log.Error(ctx, "failed getting info on sessionManager error", err)
 	}
 	if st.Code() == codes.Internal {
 		log.Error(ctx, "removeOIDC failed with internal err", err)
+		// Internal error is acceptable (e.g., mapping already removed), proceed with termination
+		return false
 	}
 	if err != nil && st.Code() != codes.Internal {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operator/operator.go` around lines 441 - 461, The current handling of status.FromError(err)/st.Code() == codes.Internal is ambiguous and lets execution fall through to grpcResp.GetSuccess() (which is nil) causing setErrorStateAndFail; change the flow so that after detecting an Internal error you explicitly handle it: either (A) treat it as a hard failure by calling setErrorStateAndFail(ctx, resp, err, WorkingStateOIDCMappingRemoveFailed) and return true, or (B) treat it as non-fatal by logging a clear message and returning (true or false as appropriate for your loop semantics) to avoid touching grpcResp; also update the log message around removeOIDC/internal error to include the error and not imply a different outcome — locate the handling around status.FromError(err), st.Code(), codes.Internal, grpcResp.GetSuccess(), setErrorStateAndFail and setErrorStateAndContinue to implement this early-return behavior.
internal/event-processor/handlers_system.go (1)
457-463: ⚠️ Potential issue | 🔴 Critical
Avoid a panic in the not-found branch.
When `getSystemByID` returns `repo.ErrNotFound`, `system` is nil here. Logging `system.Identifier` turns an already-gone system into a handler crash.
🐛 Minimal fix
```diff
 	if errors.Is(err, repo.ErrNotFound) {
 		log.Warn(ctx, "System not found when handling job termination, skipping system update",
-			slog.String("systemID", system.Identifier))
+			slog.String("systemID", data.SystemID))
 		return nil
 	}
```

Also applies to: 502-507
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 457 - 463, The not-found error branch in the handler uses system.Identifier after getSystemByID returns an error, which can be nil and panic; update the repo.ErrNotFound handling in the block that follows getSystemByID (and the similar block around lines 502-507) to avoid dereferencing system: log using data.SystemID (or another non-nil identifier) instead of system.Identifier, or check system != nil before accessing Identifier, and return nil afterward; ensure the log message remains descriptive (e.g., "System not found when handling job termination, skipping system update") and include the safe identifier.
♻️ Duplicate comments (5)
internal/api/cmkapi/cmkapi.go (1)
404-405: ⚠️ Potential issue | 🟡 Minor
Update UNKNOWN-state wording to “canceled detach,” not “failed detach.”
These descriptions currently say UNKNOWN applies when key detach previously failed, but the PR behavior defines UNKNOWN for canceled detach jobs while failed jobs terminate as DETACHED. Please correct the source OpenAPI descriptions and regenerate this file.
Also applies to: 463-464, 613-614
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/cmkapi/cmkapi.go` around lines 404 - 405, Update the OpenAPI description for the Key/Key Version state so that references to UNKNOWN indicate "canceled detach" instead of "failed detach": locate the source description used to generate the State field comment (which produced the comment above the State *KeyState `json:"state,omitempty"` field and the similar comments that generated the other duplicated blocks), replace the phrase "detach previously failed" (or similar wording) with "canceled detach", regenerate the OpenAPI client/code so the comment above the State field and the other duplicated occurrences reflect "canceled detach" (ensure you run the same generation step that produced internal/api/cmkapi/cmkapi.go).
internal/manager/tenant_test.go (1)
194-199: ⚠️ Potential issue | 🟠 Major
These fixtures still don't point the key configuration at the key under test.
Both subtests create a new `IsPrimary` key, but `keyConfig.PrimaryKeyID` still points at the original fixture key. Offboarding can therefore read or update the wrong record, so these assertions may not be exercising the intended primary-key path.
Also applies to: 242-248
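The suggested fixture fix amounts to re-pointing the configuration at the freshly created key. A toy sketch, with `key` and `keyConfig` as simplified stand-ins for the real entities created by testutils:

```go
package main

import "fmt"

// Toy stand-ins for the fixture entities; field names follow the review
// comment, but the types are illustrative only.
type key struct {
	ID        string
	IsPrimary bool
}

type keyConfig struct {
	PrimaryKeyID string
}

// pointConfigAtKey applies the suggested fix: after creating a new primary
// key, the key configuration must reference it, otherwise offboarding keeps
// reading the original fixture key.
func pointConfigAtKey(cfg *keyConfig, k *key) {
	k.IsPrimary = true
	cfg.PrimaryKeyID = k.ID
}

func main() {
	cfg := &keyConfig{PrimaryKeyID: "old-fixture-key"}
	fresh := &key{ID: "key-under-test"}
	pointConfigAtKey(cfg, fresh)
	fmt.Println(cfg.PrimaryKeyID) // key-under-test
}
```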
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/tenant_test.go` around lines 194 - 199, The fixture keys created via testutils.NewKey in the subtests set IsPrimary on the new key but do not update keyConfig.PrimaryKeyID, so offboarding still references the original fixture key; update the test setup (after creating the new primary key via testutils.NewKey) to set keyConfig.PrimaryKeyID = newKey.ID (and persist/update keyConfig if needed) so the key configuration actually points at the newly created primary key; do this for both occurrences (the block using testutils.NewKey at the shown lines and the similar block around lines 242-248) so the tests exercise the primary-key path correctly.
internal/event-processor/repo.go (1)
69-76: ⚠️ Potential issue | 🟠 Major
Don't swallow `updated == false` here.
If `Patch` matches no row, these helpers still return `nil`. Their callers then continue with event cleanup and audit logging as if the state transition was persisted.
Also applies to: 85-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/repo.go` around lines 69 - 76, The current handling after calling r.Patch(ctx, system, *query) ignores the case where updated == false and merely logs a warning; instead treat a false return as a failure to persist the state transition: after the call in repo.go (the r.Patch(ctx, system, *query) invocation and the second similar block around lines 85-92), check if updated is false and return a non-nil error (e.g., fmt.Errorf("failed to update system %s: no rows affected") or similar) so callers do not proceed with cleanup/audit as if the update succeeded; update both places that call r.Patch to return this error rather than just logging.
internal/event-processor/handlers_system.go (1)
400-410: ⚠️ Potential issue | 🟠 Major
Reset the external key claim before persisting the local unlink.
`terminate` clears `KeyConfigurationID` and marks the system `DISCONNECTED` before `sendL1KeyClaim(...)`. If that external call fails after the patch, offboarding will see the system as fully unlinked and won't enqueue another cleanup attempt for the stale claim.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event-processor/handlers_system.go` around lines 400 - 410, The code clears system.KeyConfigurationID and sets system.Status to DISCONNECTED before calling sendL1KeyClaim, which can leave an external claim stale if sendL1KeyClaim fails; change the flow so you call sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) while the system still has its KeyConfigurationID, and only after sendL1KeyClaim succeeds then set system.KeyConfigurationID = nil, set system.Status = cmkapi.SystemStatusDISCONNECTED and persist via updateSystem(ctx, h.repo, system); ensure error handling preserves the original state if sendL1KeyClaim returns an error.
internal/manager/system.go (1)
438-443: ⚠️ Potential issue | 🟠 Major
Don't treat a missing registry client as a successful unmap.
`NewSystemManager` still allows `m.registry == nil`. Returning `nil` here makes tenant offboarding advance as if the registry mapping was removed, so deletion can continue with stale external state.
🐛 Minimal fix
```diff
 	if m.registry == nil {
 		log.Warn(
 			ctx,
 			"Could not perform system unmap from registry service - registry client is nil",
 		)
-		return nil
+		return status.Error(codes.Unavailable, "registry client is not configured")
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/system.go` around lines 438 - 443, The code treats a missing registry client (m.registry == nil) as a successful unmap by returning nil; change this to return an explicit error so tenant offboarding does not proceed with stale external state. In the method that performs the system unmap (the function containing the shown snippet), replace the current return nil with a descriptive error (e.g., fmt.Errorf("registry client is nil: cannot unmap system")), and ensure callers handle this error; additionally, audit NewSystemManager to either require/validate a non-nil registry client at construction or document that callers must handle a possible nil registry and the resulting error from the unmap call (use m.registry and NewSystemManager as reference points).
🧹 Nitpick comments (1)
internal/operator/operator.go (1)
391-397: Minor: Inconsistent variable naming `tenantId` vs `tenantID`.
Go convention prefers `tenantID` (with uppercase `ID` acronym). This is inconsistent with the parameter naming on line 433 (`tenantID`) in the same file.
🔧 Suggested fix
```diff
-	tenantId := tenantProto.GetId()
-	if tenantId == "" {
+	tenantID := tenantProto.GetId()
+	if tenantID == "" {
 		setErrorStateAndFail(ctx, resp, ErrInvalidTenantID, WorkingStateInvalidTaskData)
 		return
 	}
-	ctx = slogctx.With(ctx, "tenantId", tenantId)
+	ctx = slogctx.With(ctx, "tenantId", tenantID)

-	hasErr := o.removeOIDCMapping(ctx, resp, tenantId)
+	hasErr := o.removeOIDCMapping(ctx, resp, tenantID)
```

(Also update line 406 accordingly)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operator/operator.go` around lines 391 - 397, Rename the local variable tenantId to tenantID to follow Go convention and match existing parameter naming; replace the assignment from tenantProto.GetId() to tenantID, update the empty check and the setErrorStateAndFail call that uses it, and update the ctx = slogctx.With(ctx, "tenantId", tenantId) to use "tenantID" and tenantID; also adjust any other nearby/local occurrences (including the other local variable referenced in this block) so all uses and the context key are consistently tenantID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/manager/system.go`:
- Around line 730-741: The code calls keyConfig.PrimaryKeyID.String() unguarded
in the event creation branches (eventFactory.SystemUnlinkDecommission and
eventFactory.SystemUnlink), which will panic if PrimaryKeyID is nil; update both
branches to check ptr.IsNotNilUUID(keyConfig.PrimaryKeyID) before dereferencing
and pass a safe value (e.g., an empty string or handle the nil case
appropriately) into SystemUnlinkDecommission and SystemUnlink so the .String()
method is only invoked when PrimaryKeyID is non-nil, matching the existing guard
pattern used elsewhere in this file.
---
Outside diff comments:
In `@internal/event-processor/handlers_system.go`:
- Around line 457-463: The not-found error branch in the handler uses
system.Identifier after getSystemByID returns an error, which can be nil and
panic; update the repo.ErrNotFound handling in the block that follows
getSystemByID (and the similar block around lines 502-507) to avoid
dereferencing system: log using data.SystemID (or another non-nil identifier)
instead of system.Identifier, or check system != nil before accessing
Identifier, and return nil afterward; ensure the log message remains descriptive
(e.g., "System not found when handling job termination, skipping system update")
and include the safe identifier.
In `@internal/operator/operator.go`:
- Around line 441-461: The current handling of status.FromError(err)/st.Code()
== codes.Internal is ambiguous and lets execution fall through to
grpcResp.GetSuccess() (which is nil) causing setErrorStateAndFail; change the
flow so that after detecting an Internal error you explicitly handle it: either
(A) treat it as a hard failure by calling setErrorStateAndFail(ctx, resp, err,
WorkingStateOIDCMappingRemoveFailed) and return true, or (B) treat it as
non-fatal by logging a clear message and returning (true or false as appropriate
for your loop semantics) to avoid touching grpcResp; also update the log message
around removeOIDC/internal error to include the error and not imply a different
outcome — locate the handling around status.FromError(err), st.Code(),
codes.Internal, grpcResp.GetSuccess(), setErrorStateAndFail and
setErrorStateAndContinue to implement this early-return behavior.
---
Duplicate comments:
In `@internal/api/cmkapi/cmkapi.go`:
- Around line 404-405: Update the OpenAPI description for the Key/Key Version
state so that references to UNKNOWN indicate "canceled detach" instead of
"failed detach": locate the source description used to generate the State field
comment (which produced the comment above the State *KeyState
`json:"state,omitempty"` field and the similar comments that generated the other
duplicated blocks), replace the phrase "detach previously failed" (or similar
wording) with "canceled detach", regenerate the OpenAPI client/code so the
comment above the State field and the other duplicated occurrences reflect
"canceled detach" (ensure you run the same generation step that produced
internal/api/cmkapi/cmkapi.go).
In `@internal/event-processor/handlers_system.go`:
- Around line 400-410: The code clears system.KeyConfigurationID and sets
system.Status to DISCONNECTED before calling sendL1KeyClaim, which can leave an
external claim stale if sendL1KeyClaim fails; change the flow so you call
sendL1KeyClaim(ctx, *system, h.registry, data.TenantID, false) while the system
still has its KeyConfigurationID, and only after sendL1KeyClaim succeeds then
set system.KeyConfigurationID = nil, set system.Status =
cmkapi.SystemStatusDISCONNECTED and persist via updateSystem(ctx, h.repo,
system); ensure error handling preserves the original state if sendL1KeyClaim
returns an error.
In `@internal/event-processor/repo.go`:
- Around line 69-76: The current handling after calling r.Patch(ctx, system,
*query) ignores the case where updated == false and merely logs a warning;
instead treat a false return as a failure to persist the state transition: after
the call in repo.go (the r.Patch(ctx, system, *query) invocation and the second
similar block around lines 85-92), check if updated is false and return a
non-nil error (e.g., fmt.Errorf("failed to update system %s: no rows affected")
or similar) so callers do not proceed with cleanup/audit as if the update
succeeded; update both places that call r.Patch to return this error rather than
just logging.
In `@internal/manager/system.go`:
- Around line 438-443: The code treats a missing registry client (m.registry ==
nil) as a successful unmap by returning nil; change this to return an explicit
error so tenant offboarding does not proceed with stale external state. In the
method that performs the system unmap (the function containing the shown
snippet), replace the current return nil with a descriptive error (e.g.,
fmt.Errorf("registry client is nil: cannot unmap system")), and ensure callers
handle this error; additionally, audit NewSystemManager to either
require/validate a non-nil registry client at construction or document that
callers must handle a possible nil registry and the resulting error from the
unmap call (use m.registry and NewSystemManager as reference points).
In `@internal/manager/tenant_test.go`:
- Around lines 194-199: The fixture keys created via testutils.NewKey in the
subtests set IsPrimary on the new key but do not update
keyConfig.PrimaryKeyID, so offboarding still references the original fixture
key. After creating the new primary key via testutils.NewKey, set
keyConfig.PrimaryKeyID = newKey.ID (and persist the updated keyConfig if
needed) so the key configuration actually points at the newly created primary
key. Do this for both occurrences (the block using testutils.NewKey at the
shown lines and the similar block around lines 242-248) so the tests exercise
the primary-key path correctly.
---
Nitpick comments:
In `@internal/operator/operator.go`:
- Around lines 391-397: Rename the local variable tenantId to tenantID to
follow Go convention and match the existing parameter naming. Update the
assignment from tenantProto.GetId(), the empty check, and the
setErrorStateAndFail call that use it; change ctx = slogctx.With(ctx,
"tenantId", tenantId) to use "tenantID" and tenantID; and adjust any other
nearby occurrences (including the other local variable referenced in this
block) so all uses and the context key are consistently tenantID.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0605c116-f9d8-4a66-9fe1-37f2342004ad
📒 Files selected for processing (21)
- apis/cmk/cmk-ui.yaml
- internal/api/cmkapi/cmkapi.go
- internal/event-processor/factory.go
- internal/event-processor/factory_test.go
- internal/event-processor/handlers_key.go
- internal/event-processor/handlers_system.go
- internal/event-processor/helpers.go
- internal/event-processor/reconciler.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/repo.go
- internal/event-processor/resolvers.go
- internal/event-processor/types.go
- internal/manager/export_test.go
- internal/manager/key.go
- internal/manager/system.go
- internal/manager/tenant.go
- internal/manager/tenant_test.go
- internal/operator/operator.go
- internal/operator/operator_test.go
- sonar-project.properties
- test/integration/event-processor/reconciliation_test.go
✅ Files skipped from review due to trivial changes (3)
- internal/operator/operator_test.go
- sonar-project.properties
- internal/event-processor/helpers.go
🚧 Files skipped from review as they are similar to previous changes (10)
- internal/event-processor/types.go
- test/integration/event-processor/reconciliation_test.go
- internal/event-processor/resolvers.go
- apis/cmk/cmk-ui.yaml
- internal/manager/export_test.go
- internal/event-processor/factory.go
- internal/event-processor/reconciler_test.go
- internal/event-processor/handlers_key.go
- internal/manager/tenant.go
- internal/manager/key.go
b9d5a96 to 0b5aa06
```go
if err != nil {
	return fmt.Errorf("failed to update system %s upon job termination: %w", system.ID, err)
}
```
I wouldn't do this here as this updateSystem doesn't mention any job
sonar-project.properties
Outdated
```properties
sonar.exclusions=**/api/cmkapi/cmkapi.go
sonar.test.inclusions=**/*_test.go
sonar.coverage.exclusions=**/api/**,**/noop/**,**/wrapper/**,**/main.go,**/*_test.go,**/*_config.go,**/*_dto.go,**/*_model.go, **.sql,internal/api/cmkapi/cmkapi.go
sonar.cpd.exclusions=**/api/**,**/noop/**,**/wrapper/**,**/main.go,**/*_test.go,**/mocks/**,**/proto/**,**/*_generated.go,internal/api/cmkapi/cmkapi.go
```
Seems like the file already is in coverage.exclusions and cpd.exclusions. It can be removed from there
I had trouble with code smells and couldn't find another configuration to disable it specifically. Or do you mean it should be removed from coverage.exclusions and cpd.exclusions?
I mean that if the file is in sonar.exclusions it doesn't need to be in cpd.exclusions and coverage.exclusions, since the general exclusion already covers it.
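A sketch of the deduplicated configuration the reviewer is describing, assuming that sonar.exclusions removes the file from analysis entirely so it need not be repeated in the coverage and duplication exclusions:

```properties
sonar.exclusions=**/api/cmkapi/cmkapi.go
sonar.test.inclusions=**/*_test.go
sonar.coverage.exclusions=**/api/**,**/noop/**,**/wrapper/**,**/main.go,**/*_test.go,**/*_config.go,**/*_dto.go,**/*_model.go,**.sql
sonar.cpd.exclusions=**/api/**,**/noop/**,**/wrapper/**,**/main.go,**/*_test.go,**/mocks/**,**/proto/**,**/*_generated.go
```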
0b5aa06 to 6f63e74
```go
case st.Code() == codes.InvalidArgument:
	log.Error(ctx, "invalid argument error while unmapping system from tenant", err)
	return OffboardingFailed
case err != nil:
```
How do we know this case is due to an error removing the OIDC mapping?
It doesn't need to be. The point is that when it is a generic or unknown error, we assume it is something intermittent and reconcile the operator/task until it reaches the reconciliation limit.
```go
	slog.String("errorMessage", taskErrorMessage),
)

_, err = h.terminate(ctx, job, cmkapi.KeyStateDETACHED)
```
On job failed, is DETACHED a correct state?
Or should it be UNKNOWN?
It should be DETACHED. For tenant decommissioning events (unlink and detach) we do not care about the response from the task operator, as long as we know we have informed them. So COMPLETE and FAILED would be the same.
Task cancellation goes back to UNKNOWN because we haven't sent out the task yet, so we haven't done our part.
doanhluongsap
left a comment
There was a problem hiding this comment.
LGTM! Thanks a lot
Change descriptions
- Split the `SystemUnlink` job type into `SystemUnlink` and `SystemUnlinkDecommission`, instead of using `trigger` to distinguish between a normal unlink and a decommission-based unlink.
- A `SystemUnlinkDecommission` job will:
  - skip the `model.Event` entry, since no user intervention for cancel/retry is needed at this point. Tenant access is already blocked.
  - set the system to `DISCONNECTED` and reset the key claim regardless of task result (Done or Failed).
  - mark the system `FAILED` if the job is canceled, so the event is recreated and retried on the next tenant offboarding reconciliation.
- `KeyDetach` event:
  - New key state `DETACHING`. When the event is created, the key goes into the `DETACHING` state. Only upon job termination will it go to `DETACHED` (regardless of Done or Failed).
  - The key returns to `UNKNOWN` after job cancellation.
- `TerminateTenant` operator:

Summary by CodeRabbit
New Features
- `DETACHING` state for keys to track decommissioning progress before completion.

Bug Fixes