Zombie cancelled scans: 3 composing bugs let a cancelled scan come back as 'running' after a restart

## TL;DR

Three independent bugs compose to produce a state where a user-cancelled scan reappears in `running` status after a Healarr restart, **and subsequent cancel clicks become silent no-ops**. The scan loop keeps burning through files in the background past the original cancel because the in-memory cancellation signal never fires.

Diagnosed in production on `2026-06-01` during the v1.3.4 deployment (Osiris → Sokaris migration triggered the restart cycle that exposed it).

## Reproducing the chain

Given a long-running scan (e.g. `/media/Movies/HD/` with 7,000+ files):

1. **User clicks cancel on the scan.**
   - `CancelScan(scanID)` is called with the DB integer ID as a string (e.g. `"3"`).
   - `s.activeScans["3"]` — **miss**, because the map is keyed by the in-memory UUID, not the DB id. The in-memory `ctx.cancel()` never fires. **(Bug 1)**
   - `MarkCancelled(3, ...)` runs and the DB row is updated to `status='cancelled', completed_at=<now>, error_message='cancelled by user'`. From the user's perspective the cancel succeeded.
   - **But the scan loop is still running.** It keeps iterating files and emitting progress events. The deferred `Finalize` at end-of-loop would eventually overwrite the cancelled row, but if the loop is long enough you can hit step 2 first.

2. **Healarr restarts** (graceful `docker stop`, OOM kill, host reboot, scheduled restart, anything).
   - The shutdown handler runs `MarkInterrupted(id=3, file_index=N)` → `UPDATE scans SET status='interrupted', current_file_index=N WHERE id=3`. This SQL has no `completed_at` guard, so it overwrites the cancelled status. The row is now `status='interrupted', completed_at=<earlier>, error_message='cancelled by user'`.

3. **Healarr starts back up.**
   - `ResumeInterruptedScans()` queries `WHERE status='interrupted' AND file_list IS NOT NULL`. No filter on `completed_at`. **(Bug 2)**
   - The previously-cancelled scan #3 is selected for resume.
   - `resumeScan` runs `SetStatus(3, 'running')`. Row is now `status='running', completed_at=<earlier>, error_message='cancelled by user'`. The scanner is alive again, processing files past the user's original intent.

4. **User clicks cancel again** (because the UI is showing an active scan they thought they cancelled).
   - `MarkCancelled` SQL: `UPDATE … WHERE id=3 AND completed_at IS NULL`. **`completed_at` is already set**, so 0 rows match. **(Bug 3)**
   - The cancel is a no-op. The UI optimistically hides the row, but a refresh re-fetches it as `running` and it reappears.

`MarkOrphansCancelled` (the startup orphan-reconciler) has the same `completed_at IS NULL` guard, so it can't dig the row out either.
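The four steps above can be replayed against a tiny in-memory model. This is only a sketch: `scanRow`, `markCancelled`, `markInterrupted`, `isResumable`, and `replayChain` are hypothetical stand-ins that mirror the SQL guards described in this issue, not the real repository code.

```go
package main

import "fmt"

// scanRow is a hypothetical stand-in for one row of the scans table.
type scanRow struct {
	Status      string
	CompletedAt bool // true once completed_at is non-NULL
	HasFileList bool // true when file_list IS NOT NULL
}

// markCancelled mirrors the guarded UPDATE: it only matches rows whose
// completed_at is still NULL. It returns the number of rows affected.
func markCancelled(r *scanRow) int {
	if r.CompletedAt {
		return 0
	}
	r.Status, r.CompletedAt = "cancelled", true
	return 1
}

// markInterrupted mirrors the unguarded shutdown UPDATE.
func markInterrupted(r *scanRow) { r.Status = "interrupted" }

// isResumable mirrors the ListInterrupted WHERE clause, which has no
// completed_at filter.
func isResumable(r *scanRow) bool {
	return r.Status == "interrupted" && r.HasFileList
}

// replayChain walks one row through steps 1-4 and returns the number of
// rows the second cancel affected.
func replayChain(r *scanRow) int {
	markCancelled(r)    // step 1: user cancels, DB row updated
	markInterrupted(r)  // step 2: shutdown overwrites the status
	if isResumable(r) { // step 3: startup resumes the row
		r.Status = "running"
	}
	return markCancelled(r) // step 4: second cancel
}

func main() {
	row := &scanRow{Status: "running", HasFileList: true}
	affected := replayChain(row)
	// Prints: status=running completedAt=true secondCancelRows=0
	fmt.Printf("status=%s completedAt=%v secondCancelRows=%d\n",
		row.Status, row.CompletedAt, affected)
}
```

The model ends in exactly the inconsistent state described above: `running` with `completed_at` set, and a second cancel that matches zero rows.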

## The three bugs

### Bug 1 — `CancelScan` lookup by DB id misses the UUID-keyed map

`internal/services/scanner.go:1953`:
```go
func (s *ScannerService) CancelScan(scanID string) error {
    s.mu.Lock()
    scan, exists := s.activeScans[scanID]  // scanID = "3", but keys are UUIDs
    if exists && scan.cancel != nil {
        scan.cancel()
    }
    s.mu.Unlock()
    …
}
```

`activeScans` is keyed by the UUID generated in `recordScanStart` (and the resume paths). The HTTP handler passes the DB integer id as a string. Lookup always misses → `ctx.cancel()` is never invoked → the scan goroutine keeps running.

The DB persistence half (`MarkCancelled`) still runs, which is why the bug was invisible for short-running scans (the loop would exit naturally before the user noticed).

### Bug 2 — `ListInterrupted` doesn't filter `completed_at`

`internal/repository/scan.go:357-361`:
```sql
SELECT id, path_id, … FROM scans
WHERE status = 'interrupted' AND file_list IS NOT NULL
ORDER BY started_at DESC
```

A scan that was cancelled (so `completed_at` is set) but later marked `interrupted` by a graceful shutdown is considered resumable. On the next startup it gets resurrected via `resumeScan` → `SetStatus(scanDBID, 'running')`.

### Bug 3 — `MarkCancelled` and `MarkOrphansCancelled` over-trust `completed_at IS NULL`

`internal/repository/scan.go:293-297`:
```sql
UPDATE scans SET status = 'cancelled', completed_at = datetime('now'), error_message = ?
WHERE id = ? AND completed_at IS NULL
```

`MarkOrphansCancelled` at lines 317-323 has the same guard. Once a row has `completed_at` set from any path (cancel, finalize, or the now-fixed Bug 2 zombification), neither user-cancel nor orphan-reconcile can touch it again, even when the row is in a clearly inconsistent state like `status='running' AND completed_at IS NOT NULL`.

The `completed_at IS NULL` guard was intended to prevent a benign race between cancel-and-completion (don't clobber a real terminal state). But it's the wrong invariant: the right one is "status is not already terminal."
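The difference between the two invariants can be expressed as plain predicates. This is an illustrative sketch only: `scanRow`, `oldGuard`, and `newGuard` are hypothetical names, and the terminal status set is the one proposed in the fix sketch of this issue.

```go
package main

import "fmt"

// scanRow is a hypothetical stand-in for the two columns the guards look at.
type scanRow struct {
	Status      string
	CompletedAt bool // true once completed_at is non-NULL
}

// oldGuard models `completed_at IS NULL`.
func oldGuard(r scanRow) bool { return !r.CompletedAt }

// newGuard models `status NOT IN ('cancelled', 'completed', 'aborted')`.
func newGuard(r scanRow) bool {
	switch r.Status {
	case "cancelled", "completed", "aborted":
		return false
	}
	return true
}

func main() {
	rows := []scanRow{
		{Status: "running", CompletedAt: false},  // healthy active scan
		{Status: "completed", CompletedAt: true}, // real terminal state
		{Status: "running", CompletedAt: true},   // zombie row
	}
	for _, r := range rows {
		fmt.Printf("%-10s completedAt=%-5v old=%-5v new=%v\n",
			r.Status, r.CompletedAt, oldGuard(r), newGuard(r))
	}
}
```

Both guards agree on the healthy and the genuinely terminal row; only the zombie row differs, where the status-based guard still allows the cancel.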

## Fix sketch

1. **Bug 1:** make `CancelScan` find the activeScans entry by DB id. Cheapest: iterate the map and match `progress.DBID == parsedScanID` (the map is small — at most a handful of concurrent scans). Alternative: maintain a secondary `map[int64]string` index from DBID to UUID.
2. **Bug 2:** add `AND completed_at IS NULL` to `ListInterrupted`'s WHERE clause.
3. **Bug 3:** replace `WHERE … completed_at IS NULL` with `WHERE … status NOT IN ('cancelled', 'completed', 'aborted')` in both `MarkCancelled` and `MarkOrphansCancelled`. Semantically equivalent for the race the original guard was protecting against, and additionally catches inconsistent-active-with-completed_at rows.

All three fixes are independent, low-risk, and individually testable.

## Existing live-data damage assessment

For installs running v1.3.3 or later (when `MarkInterrupted` was added) that have ever experienced both a user-cancel **and** a restart on the same scan, this query lists the affected rows:

```sql
SELECT id, path, status, datetime(started_at), datetime(completed_at), error_message
FROM scans
WHERE completed_at IS NOT NULL
  AND status IN ('running', 'enumerating', 'scanning', 'interrupted', 'paused');
```

Any rows that match are zombie/inconsistent and would have been resurrected on every restart until the fix lands. A one-shot manual `UPDATE scans SET status='cancelled' WHERE id IN (…)` cleans them up.

## Why this surfaced now

The migration to Sokaris forced a `docker stop` + cold start cycle. The original scan had been cancelled hours earlier (Bug 1, scan loop kept running). The graceful stop wrote `status='interrupted'` (the existing scan loop was healthy enough to be tracked at that point). The new container's `ResumeInterruptedScans` then resurrected it. None of this is migration-specific — any restart on a previously-cancelled-but-still-looping scan would have done the same.

## Target

Fix in v1.3.5.
