Zombie cancelled scans: 3 composing bugs let a cancelled scan come back as 'running' after a restart #274

@mescon

TL;DR

Three independent bugs compose to produce a state where a user-cancelled scan reappears in running status after a Healarr restart, and subsequent cancel clicks become silent no-ops. The scan loop keeps processing files in the background after the original cancel because the in-memory cancellation signal never fires.

Diagnosed in production on 2026-06-01 during the v1.3.4 deployment (Osiris → Sokaris migration triggered the restart cycle that exposed it).

Reproducing the chain

Given a long-running scan (e.g. /media/Movies/HD/ with 7,000+ files):

  1. User clicks cancel on the scan.

    • CancelScan(scanID) is called with the DB integer ID as a string (e.g. "3").
    • s.activeScans["3"] → miss, because the map is keyed by the in-memory UUID, not the DB id. The in-memory ctx.cancel() never fires. (Bug 1)
    • MarkCancelled(3, ...) runs and the DB row is updated to status='cancelled', completed_at=<now>, error_message='cancelled by user'. From the user's perspective the cancel succeeded.
    • But the scan loop is still running. It keeps iterating files and emitting progress events. The deferred Finalize at end-of-loop would eventually overwrite the cancelled row, but if the loop is long enough you can hit step 2 first.
  2. Healarr restarts (graceful docker stop, OOM kill, host reboot, scheduled restart, anything).

    • The shutdown handler runs MarkInterrupted(id=3, file_index=N) → UPDATE scans SET status='interrupted', current_file_index=N WHERE id=3. This SQL has no completed_at guard, so it overwrites the cancelled status. The row is now status='interrupted', completed_at=<earlier>, error_message='cancelled by user'.
  3. Healarr starts back up.

    • ResumeInterruptedScans() queries WHERE status='interrupted' AND file_list IS NOT NULL. No filter on completed_at. (Bug 2)
    • The previously-cancelled scan #3 is selected for resume.
    • resumeScan runs SetStatus(3, 'running'). Row is now status='running', completed_at=<earlier>, error_message='cancelled by user'. The scanner is alive again, processing files past the user's original intent.
  4. User clicks cancel again (because the UI is showing an active scan they thought they cancelled).

    • MarkCancelled SQL: UPDATE … WHERE id=3 AND completed_at IS NULL. completed_at is already set, so 0 rows match. (Bug 3)
    • The cancel is a no-op. The UI optimistically hides the row, but a refresh re-fetches it as running and it reappears.

MarkOrphansCancelled (the startup orphan-reconciler) has the same completed_at IS NULL guard, so it can't dig the row out either.
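
The four steps above can be replayed against a single in-memory row. This is a minimal sketch, not Healarr's code: scanRow and the three helpers are hypothetical stand-ins that only mirror the WHERE clauses described in this issue, and an empty CompletedAt models SQL NULL.

```go
package main

import "fmt"

// scanRow is a hypothetical stand-in for one row of the scans table.
// An empty CompletedAt models SQL NULL.
type scanRow struct {
	Status      string
	CompletedAt string
	HasFileList bool
}

// markCancelled mirrors the guarded UPDATE (… AND completed_at IS NULL).
// It reports whether the row was updated.
func markCancelled(r *scanRow, now string) bool {
	if r.CompletedAt != "" {
		return false // guard fails: 0 rows match
	}
	r.Status = "cancelled"
	r.CompletedAt = now
	return true
}

// markInterrupted mirrors the shutdown UPDATE, which has no completed_at guard.
func markInterrupted(r *scanRow) {
	r.Status = "interrupted"
}

// resumeIfInterrupted mirrors the startup path: the row is selected on
// status and file_list only, then set back to 'running'.
func resumeIfInterrupted(r *scanRow) bool {
	if r.Status == "interrupted" && r.HasFileList {
		r.Status = "running"
		return true
	}
	return false
}

func main() {
	row := &scanRow{Status: "running", HasFileList: true}

	fmt.Println("1. cancel applied:", markCancelled(row, "t1"), "status:", row.Status)
	markInterrupted(row)
	fmt.Println("2. after shutdown, status:", row.Status)
	fmt.Println("3. resumed:", resumeIfInterrupted(row), "status:", row.Status)
	fmt.Println("4. cancel applied:", markCancelled(row, "t2"), "status:", row.Status)
}
```

In this model the second cancel is rejected and the row ends in status 'running' with completed_at still set from the first cancel, which is the zombie state described above.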

The three bugs

Bug 1 — CancelScan lookup by DB id misses the UUID-keyed map

internal/services/scanner.go:1953:

func (s *ScannerService) CancelScan(scanID string) error {
    s.mu.Lock()
    scan, exists := s.activeScans[scanID]  // scanID = "3", but keys are UUIDs
    if exists && scan.cancel != nil {
        scan.cancel()
    }
    s.mu.Unlock()
    …
}

activeScans is keyed by the UUID generated in recordScanStart (and the resume paths). The HTTP handler passes the DB integer id as a string. Lookup always misses → ctx.cancel() is never invoked → the scan goroutine keeps running.

The DB persistence half (MarkCancelled) still runs, which is why the bug was invisible for short-running scans (the loop would exit naturally before the user noticed).

Bug 2 — ListInterrupted doesn't filter completed_at

internal/repository/scan.go:357-361:

SELECT id, path_id, … FROM scans
WHERE status = 'interrupted' AND file_list IS NOT NULL
ORDER BY started_at DESC

A scan that was cancelled (so completed_at is set) but later marked interrupted by a graceful shutdown is considered resumable. On the next startup it gets resurrected via resumeScan → SetStatus(scanDBID, 'running').

Bug 3 — MarkCancelled and MarkOrphansCancelled over-trust completed_at IS NULL

internal/repository/scan.go:293-297:

UPDATE scans SET status = 'cancelled', completed_at = datetime('now'), error_message = ?
WHERE id = ? AND completed_at IS NULL

MarkOrphansCancelled at lines 317-323 has the same guard. Once a row has completed_at set from any path (cancel, finalize, or the now-fixed Bug 2 zombification), neither user-cancel nor orphan-reconcile can touch it again, even when the row is in a clearly inconsistent state like status='running' AND completed_at IS NOT NULL.

The completed_at IS NULL guard was intended to prevent a benign race between cancel-and-completion (don't clobber a real terminal state). But it's the wrong invariant: the right one is "status is not already terminal."
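
To make the difference between the two invariants concrete, the sketch below evaluates both guards over a few row states. It is an illustration only: scanRow, oldGuard, and newGuard are hypothetical names, and the terminal status list ('cancelled', 'completed', 'aborted') is the one proposed in the fix sketch in this issue.

```go
package main

import "fmt"

// scanRow is a hypothetical, reduced view of a scans row.
type scanRow struct {
	Status    string
	Completed bool // true when completed_at IS NOT NULL
}

// oldGuard models the current condition: completed_at IS NULL.
func oldGuard(r scanRow) bool {
	return !r.Completed
}

// newGuard models the proposed condition:
// status NOT IN ('cancelled', 'completed', 'aborted').
func newGuard(r scanRow) bool {
	switch r.Status {
	case "cancelled", "completed", "aborted":
		return false
	}
	return true
}

func main() {
	rows := []scanRow{
		{Status: "running", Completed: false},    // healthy active scan
		{Status: "completed", Completed: true},   // real terminal state
		{Status: "running", Completed: true},     // zombie after resume
		{Status: "interrupted", Completed: true}, // cancelled, then shut down
	}
	for _, r := range rows {
		fmt.Printf("status=%-12s completed_at set=%-5t old guard allows cancel=%-5t new guard allows cancel=%t\n",
			r.Status, r.Completed, oldGuard(r), newGuard(r))
	}
}
```

Both guards agree on the healthy and the genuinely completed rows. They differ only on the inconsistent rows, where the status-based guard still lets a cancel through.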

Fix sketch

  1. Bug 1: make CancelScan find the activeScans entry by DB id. Cheapest: iterate the map and match progress.DBID == parsedScanID (the map is small — at most a handful of concurrent scans). Alternative: maintain a secondary map[int64]string index from DBID to UUID.
  2. Bug 2: add AND completed_at IS NULL to ListInterrupted's WHERE clause.
  3. Bug 3: replace WHERE … completed_at IS NULL with WHERE … status NOT IN ('cancelled', 'completed', 'aborted') in both MarkCancelled and MarkOrphansCancelled. Semantically equivalent for the race the original guard was protecting against, and additionally catches inconsistent-active-with-completed_at rows.

All three fixes are independent, low-risk, and individually testable.

Existing live-data damage assessment

For installs running v1.3.3 or later (when MarkInterrupted was added) that have ever experienced both a user-cancel and a restart on the same scan:

SELECT id, path, status, datetime(started_at), datetime(completed_at), error_message
FROM scans
WHERE completed_at IS NOT NULL
  AND status IN ('running', 'enumerating', 'scanning', 'interrupted', 'paused');

Any rows that match are zombie/inconsistent and would have been resurrected on every restart until the fix lands. A one-shot manual UPDATE scans SET status='cancelled' WHERE id IN (…) cleans them up.

Why this surfaced now

The migration to Sokaris forced a docker stop + cold start cycle. The original scan had been cancelled hours earlier (Bug 1, scan loop kept running). The graceful stop wrote status='interrupted' (the existing scan loop was healthy enough to be tracked at that point). The new container's ResumeInterruptedScans then resurrected it. None of this is migration-specific — any restart on a previously-cancelled-but-still-looping scan would have done the same.

Target

Fix in v1.3.5.
