TL;DR
Three independent bugs compose to produce a state where a user-cancelled scan reappears in running status after a Healarr restart, and subsequent cancel clicks become silent no-ops. The scan loop keeps burning files in the background past the original cancel because the in-memory cancellation signal never fires.
Diagnosed in production on 2026-06-01 during the v1.3.4 deployment (Osiris → Sokaris migration triggered the restart cycle that exposed it).
Reproducing the chain
Given a long-running scan (e.g. /media/Movies/HD/ with 7,000+ files):
1. User clicks cancel on the scan.
   - CancelScan(scanID) is called with the DB integer ID as a string (e.g. "3").
   - s.activeScans["3"] misses, because the map is keyed by the in-memory UUID, not the DB id. The in-memory ctx.cancel() never fires. (Bug 1)
   - MarkCancelled(3, ...) runs and the DB row is updated to status='cancelled', completed_at=<now>, error_message='cancelled by user'. From the user's perspective the cancel succeeded.
   - But the scan loop is still running. It keeps iterating files and emitting progress events. The deferred Finalize at end-of-loop would eventually overwrite the cancelled row, but if the loop is long enough you can hit step 2 first.
2. Healarr restarts (graceful docker stop, OOM kill, host reboot, scheduled restart, anything).
   - The shutdown handler runs MarkInterrupted(id=3, file_index=N) → UPDATE scans SET status='interrupted', current_file_index=N WHERE id=3. This SQL has no completed_at guard, so it overwrites the cancelled status. The row is now status='interrupted', completed_at=<earlier>, error_message='cancelled by user'.
3. Healarr starts back up.
   - ResumeInterruptedScans() queries WHERE status='interrupted' AND file_list IS NOT NULL. No filter on completed_at. (Bug 2)
   - resumeScan runs SetStatus(3, 'running'). Row is now status='running', completed_at=<earlier>, error_message='cancelled by user'. The scanner is alive again, processing files past the user's original intent.
4. User clicks cancel again (because the UI is showing an active scan they thought they cancelled).
   - MarkCancelled SQL: UPDATE … WHERE id=3 AND completed_at IS NULL. completed_at is already set, so 0 rows match. (Bug 3)
   - The cancel is a no-op. The UI optimistically hides the row, but a refresh re-fetches it as running and it reappears.
   - MarkOrphansCancelled (the startup orphan-reconciler) has the same completed_at IS NULL guard, so it can't dig the row out either.
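The row transitions in the chain above can be replayed against a small in-memory model. This is only a sketch: scanRow, markCancelled, markInterrupted, resumable, and replay are hypothetical stand-ins for the SQL statements quoted in this issue, not Healarr's actual repository code, and an empty string models a NULL completed_at.

```go
package main

import "fmt"

// scanRow is a hypothetical in-memory stand-in for one row of the scans table.
type scanRow struct {
	ID          int64
	Status      string
	CompletedAt string // empty string models NULL
	HasFileList bool
}

// markCancelled models the guarded UPDATE (WHERE completed_at IS NULL).
// It returns the number of rows affected.
func markCancelled(r *scanRow, now string) int {
	if r.CompletedAt != "" {
		return 0
	}
	r.Status = "cancelled"
	r.CompletedAt = now
	return 1
}

// markInterrupted models the shutdown UPDATE, which has no completed_at guard.
func markInterrupted(r *scanRow) {
	r.Status = "interrupted"
}

// resumable models the ListInterrupted WHERE clause (no completed_at filter).
func resumable(r *scanRow) bool {
	return r.Status == "interrupted" && r.HasFileList
}

// replay walks one row through the four steps and returns the final row
// plus the number of rows the second cancel affected.
func replay() (scanRow, int) {
	row := scanRow{ID: 3, Status: "running", HasFileList: true}
	markCancelled(&row, "t1") // step 1: user cancels, DB row is updated
	markInterrupted(&row)     // step 2: shutdown overwrites the status
	if resumable(&row) {      // step 3: startup resurrects the scan
		row.Status = "running"
	}
	affected := markCancelled(&row, "t2") // step 4: second cancel
	return row, affected
}

func main() {
	row, affected := replay()
	fmt.Printf("status=%s completed_at=%s second cancel affected %d rows\n",
		row.Status, row.CompletedAt, affected)
}
```

In this model the row ends up as running with the completed_at from the first cancel, and the second cancel matches zero rows, which is the state described in step 4.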
The three bugs
Bug 1 — CancelScan lookup by DB id misses the UUID-keyed map
Location: internal/services/scanner.go:1953.
activeScans is keyed by the UUID generated in recordScanStart (and the resume paths). The HTTP handler passes the DB integer id as a string. Lookup always misses → ctx.cancel() is never invoked → the scan goroutine keeps running.
The DB persistence half (MarkCancelled) still runs, which is why the bug was invisible for short-running scans (the loop would exit naturally before the user noticed).
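The key mismatch can be shown in isolation. The sketch below assumes a map from UUID string to an entry holding a context.CancelFunc; the activeScan type, the cancelByKey helper, and the "example-uuid" key are illustrative names, not the real scanner.go definitions.

```go
package main

import (
	"context"
	"fmt"
)

// activeScan is a hypothetical stand-in for an entry in activeScans.
type activeScan struct {
	DBID   int64
	cancel context.CancelFunc
}

// cancelByKey models the lookup described above: the cancel func is only
// invoked when the key is present. It reports whether the signal fired.
func cancelByKey(active map[string]*activeScan, key string) bool {
	scan, ok := active[key]
	if !ok {
		return false
	}
	scan.cancel()
	return true
}

// demoMiss looks up the DB id "3" in a map keyed by UUID and reports
// whether the signal fired and whether the scan context was cancelled.
func demoMiss() (fired bool, cancelled bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	active := map[string]*activeScan{
		"example-uuid": {DBID: 3, cancel: cancel},
	}
	fired = cancelByKey(active, "3")
	return fired, ctx.Err() != nil
}

func main() {
	fired, cancelled := demoMiss()
	fmt.Println("signal fired:", fired, "context cancelled:", cancelled)
}
```

Both values come back false: the entry for DB id 3 exists, but it is only reachable through its UUID key, so the goroutine watching the context never sees a cancellation.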
Bug 2 - ListInterrupted doesn't filter completed_at
internal/repository/scan.go:357-361:
SELECT id, path_id, … FROM scans
WHERE status = 'interrupted' AND file_list IS NOT NULL
ORDER BY started_at DESC
A scan that was cancelled (so completed_at is set) but later marked interrupted by a graceful shutdown is considered resumable. On the next startup it gets resurrected via resumeScan → SetStatus(scanDBID, 'running').
Bug 3 — MarkCancelled and MarkOrphansCancelled over-trust completed_at IS NULL
internal/repository/scan.go:293-297:
UPDATE scans SET status = 'cancelled', completed_at = datetime('now'), error_message = ?
WHERE id = ? AND completed_at IS NULL
MarkOrphansCancelled at lines 317-323 has the same guard. Once a row has completed_at set from any path (cancel, finalize, or the now-fixed Bug 2 zombification), neither user-cancel nor orphan-reconcile can touch it again, even when the row is in a clearly inconsistent state like status='running' AND completed_at IS NOT NULL.
The completed_at IS NULL guard was intended to prevent a benign race between cancel-and-completion (don't clobber a real terminal state). But it's the wrong invariant: the right one is "status is not already terminal."
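The difference between the two invariants can be checked with a pair of predicates. This is a sketch under assumptions: the row type and both function names are hypothetical, and the terminal status set ('cancelled', 'completed', 'aborted') is taken from the fix sketch in this issue rather than from the Healarr schema.

```go
package main

import "fmt"

// row is a hypothetical, minimal view of a scans row.
type row struct {
	Status         string
	CompletedAtSet bool
}

// guardCompletedAt models the current guard: WHERE completed_at IS NULL.
func guardCompletedAt(r row) bool {
	return !r.CompletedAtSet
}

// guardStatus models the proposed guard: status is not already terminal.
// The terminal set is assumed from the fix sketch.
func guardStatus(r row) bool {
	switch r.Status {
	case "cancelled", "completed", "aborted":
		return false
	}
	return true
}

func main() {
	cases := []row{
		{Status: "running", CompletedAtSet: false},  // healthy active scan
		{Status: "completed", CompletedAtSet: true}, // real terminal state
		{Status: "running", CompletedAtSet: true},   // zombie row
	}
	for _, r := range cases {
		fmt.Printf("status=%s completed_at_set=%t current_guard=%t status_guard=%t\n",
			r.Status, r.CompletedAtSet, guardCompletedAt(r), guardStatus(r))
	}
}
```

The two guards agree on the healthy and the genuinely terminal rows. They differ only on the zombie row, which the current guard refuses to update and the status-based guard allows.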
Fix sketch
Bug 1: make CancelScan find the activeScans entry by DB id. Cheapest: iterate the map and match progress.DBID == parsedScanID (the map is small — at most a handful of concurrent scans). Alternative: maintain a secondary map[int64]string index from DBID to UUID.
Bug 2: add AND completed_at IS NULL to ListInterrupted's WHERE clause.
Bug 3: replace WHERE … completed_at IS NULL with WHERE … status NOT IN ('cancelled', 'completed', 'aborted') in both MarkCancelled and MarkOrphansCancelled. Semantically equivalent for the race the original guard was protecting against, and additionally catches inconsistent-active-with-completed_at rows.
All three fixes are independent, low-risk, and individually testable.
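For Bug 1, the "iterate and match" option might look roughly like the following. It is a minimal sketch, not the actual patch: activeScan, cancelByDBID, and demoCancel are hypothetical names, the DBID field is assumed from the progress.DBID reference above, and real code would presumably hold whatever lock protects activeScans while iterating.

```go
package main

import (
	"context"
	"fmt"
	"strconv"
)

// activeScan is a hypothetical stand-in for an entry in activeScans.
type activeScan struct {
	DBID   int64
	cancel context.CancelFunc
}

// cancelByDBID parses the id string passed by the HTTP handler and scans
// the UUID-keyed map for the entry whose DBID matches. It reports whether
// a matching scan was found and cancelled.
func cancelByDBID(active map[string]*activeScan, scanID string) (bool, error) {
	id, err := strconv.ParseInt(scanID, 10, 64)
	if err != nil {
		return false, fmt.Errorf("invalid scan id %q: %w", scanID, err)
	}
	for _, scan := range active {
		if scan.DBID == id {
			scan.cancel()
			return true, nil
		}
	}
	return false, nil
}

// demoCancel cancels DB id "3" in a map keyed by UUID and reports whether
// the entry was found and whether the scan context was cancelled.
func demoCancel() (found bool, cancelled bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	active := map[string]*activeScan{
		"example-uuid": {DBID: 3, cancel: cancel},
	}
	found, _ = cancelByDBID(active, "3")
	return found, ctx.Err() != nil
}

func main() {
	found, cancelled := demoCancel()
	fmt.Println("found:", found, "context cancelled:", cancelled)
}
```

The linear scan keeps the map's existing UUID keys untouched, which is why it is the cheaper of the two options. The secondary-index alternative trades that simplicity for constant-time lookup and a second structure to keep in sync.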
Existing live-data damage assessment
For installs running v1.3.3 or later (when MarkInterrupted was added) that have ever experienced both a user-cancel and a restart on the same scan, this query finds the affected rows:
SELECT id, path, status, datetime(started_at), datetime(completed_at), error_message
FROM scans
WHERE completed_at IS NOT NULL
  AND status IN ('running', 'enumerating', 'scanning', 'interrupted', 'paused');
Any rows that match are zombie/inconsistent and would have been resurrected on every restart until the fix lands. A one-shot manual UPDATE scans SET status='cancelled' WHERE id IN (…) cleans them up.
Why this surfaced now
The migration to Sokaris forced a docker stop + cold start cycle. The original scan had been cancelled hours earlier (Bug 1, scan loop kept running). The graceful stop wrote status='interrupted' (the existing scan loop was healthy enough to be tracked at that point). The new container's ResumeInterruptedScans then resurrected it. None of this is migration-specific — any restart on a previously-cancelled-but-still-looping scan would have done the same.
Target
Fix in v1.3.5.