fix(controller): checkpoint the library DB WAL and shut down cleanly#829

Merged
perminder-klair merged 1 commit into develop from fix/786-library-wal-checkpoint
Jul 4, 2026

@perminder-klair
Owner

Fixes #786 — slow admin UI on the Unraid AIO caused by an unbounded library.db-wal (730MB on a 303MB DB) plus SQLite-over-FUSE on /mnt/user appdata paths.

Root cause

library-db.ts set journal_mode = WAL at open and nothing else, ever:

  • No journal_size_limit, so once a bulk write pass (acoustic analysis writes fat *_json blobs per track) ballooned the WAL, it stayed at its high-water mark forever.
  • The bulk CLIs (npm run tag / npm run analyze) exit via process.exit() without closing the DB, so the pass that grew the WAL never folded it back.
  • The AIO supervisor's stop trap was kill -TERM 0; exit 0 — as PID 1, its immediate exit tears the namespace down and SIGKILLs node before any shutdown work; and server.ts had no SIGTERM handler anyway.
  • With better-sqlite3 being synchronous, every query walking that giant WAL blocked the whole event loop — sluggish admin pages at ~0% CPU, exactly as reported.

Separately, server.ts had no unhandledRejection handler, and Node's default since v15 is to crash on an unhandled rejection. A stray rejection therefore killed the controller and the supervisor bounced it, which explains the random 502s. Subsonic fetches also had no timeout, while /dj/recent fanned out ~21 parallel Navidrome calls, which explains the "recent failed (500)" error.

Changes

WAL management

  • journal_size_limit = 64 MiB at open, so once a checkpoint resets the WAL, SQLite truncates the sidecar back to at most that size.
  • close() runs a best-effort wal_checkpoint(TRUNCATE) first (on close, SQLite only checkpoints automatically when the last connection goes away, and controller/tagger/analyzer can hold the DB concurrently).
  • New checkpointWal() helper; the scheduler's hourly cleanup job now calls it, so the WAL can't balloon unbounded even mid-run.
  • Both bulk CLIs register a process.on('exit') hook that closes the DB on every exit path (better-sqlite3 close is synchronous, so this is safe).
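
For reviewers who want the shape of the WAL changes without opening the diff, here is a minimal sketch. It is not the code from library-db.ts: `WalDb`, `configureWal` and `closeLibraryDb` are hypothetical names, and `WalDb` models only the `pragma()` and `close()` calls that a better-sqlite3 `Database` exposes. Only `checkpointWal` is a name taken from this PR, and its body here is assumed.

```typescript
// Minimal slice of a better-sqlite3 Database (assumption: the real code holds
// a full Database instance; only these two calls are needed for the sketch).
interface WalDb {
  pragma(source: string): unknown;
  close(): void;
}

const WAL_SIZE_LIMIT_BYTES = 64 * 1024 * 1024; // 64 MiB, as in this PR

// Apply the WAL settings right after the database is opened.
function configureWal(db: WalDb): void {
  db.pragma("journal_mode = WAL");
  db.pragma(`journal_size_limit = ${WAL_SIZE_LIMIT_BYTES}`);
}

// Best-effort TRUNCATE checkpoint: errors are swallowed and reported as
// `false`, so a failed checkpoint never breaks the scheduler job or close().
function checkpointWal(db: WalDb): boolean {
  try {
    db.pragma("wal_checkpoint(TRUNCATE)");
    return true;
  } catch {
    return false;
  }
}

// Checkpoint first, then close; close() still runs if the checkpoint fails.
function closeLibraryDb(db: WalDb): void {
  checkpointWal(db);
  db.close();
}
```

In this sketch the hourly cleanup job would call `checkpointWal(db)`, and the bulk CLIs would call `closeLibraryDb(db)` from their exit hook.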

Clean shutdown

  • server.ts: SIGTERM/SIGINT handler closes the library DB before exiting.
  • AIO supervisor: after kill -TERM 0, wait for the supervise loops + a 2s grace for their reparented children before PID 1 exits, so node's handler actually runs. Docker's stop timeout still hard-caps it.
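
The server-side half of the clean shutdown can be sketched roughly as follows. This is an illustration under assumptions, not the handler from server.ts: `ProcessLike` and `installShutdownHandlers` are hypothetical names, and `closeDb` stands in for whatever the library facade's shutdown() does.

```typescript
// Minimal slice of Node's `process` used here, so the sketch can be exercised
// with a fake object. Node's real `process` is expected to fit this shape.
interface ProcessLike {
  on(event: string, listener: () => void): unknown;
  exit(code?: number): void;
}

// Registers SIGTERM/SIGINT handlers that run `closeDb` once and then exit.
function installShutdownHandlers(proc: ProcessLike, closeDb: () => void): void {
  let shuttingDown = false;
  const shutdown = (): void => {
    if (shuttingDown) return; // ignore a second signal while closing
    shuttingDown = true;
    try {
      closeDb(); // synchronous with better-sqlite3, so it finishes before exit
    } catch {
      // best effort: a failed close must not block the exit
    }
    proc.exit(0);
  };
  proc.on("SIGTERM", shutdown);
  proc.on("SIGINT", shutdown);
}

// Assumed wiring in an entry point:
//   installShutdownHandlers(process, () => library.shutdown());
```

The handler only helps if the process is still alive to receive the signal, which is why the supervisor change above has PID 1 wait before it exits.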

Resilience

  • unhandledRejection now logs loudly and continues instead of killing the controller.
  • Subsonic API calls get a 30s timeout (NAVIDROME_TIMEOUT_MS to override) with a readable error.
  • /dj/recent's album expansion is bounded to 5 concurrent Navidrome calls via the existing mapPool util.
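
The timeout and the bounded fan-out can be illustrated with the sketch below. Both functions are hypothetical: `withTimeout` only races a promise against a timer (the real change may abort the underlying fetch instead), and `mapPoolSketch` is a stand-in for the repo's existing mapPool util, whose actual signature is not shown in this PR.

```typescript
// Rejects with a readable error if `work` does not settle within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Maps `items` through `fn` with at most `limit` calls in flight,
// preserving input order in the result array.
async function mapPoolSketch<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++; // claim an index before awaiting
      results[index] = await fn(items[index] as T, index);
    }
  };
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With a limit of 5, the ~21 album lookups behind /dj/recent would run as five workers draining a shared queue, each call wrapped in a 30s timeout (or the NAVIDROME_TIMEOUT_MS value).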

Docs / template (FUSE)

  • docs/unraid.md, the /setup/unraid page, and the CA template description now recommend the direct pool path (/mnt/cache/appdata/subwave) over /mnt/user/... — SQLite WAL over Unraid's shfs/FUSE layer is a documented-bad combo. Template default stays /mnt/user since /mnt/cache only exists when a pool is named cache.

Verified

  • npm run lint clean in both controller/ and web/.
  • Smoke-tested the checkpoint path against the installed better-sqlite3: a 4.1MB WAL truncates to 0 bytes on wal_checkpoint(TRUNCATE) and the sidecar is removed on close.

The library.db WAL sidecar could grow unbounded (730MB on a 303MB DB in
#786): nothing ever ran a TRUNCATE checkpoint, the bulk tag/analyze CLIs
exited via process.exit() without closing the DB, and the AIO supervisor
tore the container down before node's (nonexistent) SIGTERM handler could
run. Every later query walked that giant WAL synchronously on the main
thread (better-sqlite3 is synchronous), stalling the whole event loop:
sluggish admin pages at ~0% CPU.

- library-db: set journal_size_limit (64 MiB) so checkpoints shrink the
  sidecar; TRUNCATE-checkpoint in close(); new checkpointWal() helper
- library facade: checkpoint() + shutdown() passthroughs
- scheduler: hourly best-effort WAL checkpoint in the cleanup job
- tag-library/analyze-library CLIs: close the DB on every exit path
- server: SIGTERM/SIGINT handler closes the DB; unhandledRejection now
  logs instead of crashing (the supervisor-restart 502s)
- AIO supervisor: wait for children after kill -TERM 0 so node's shutdown
  handler actually runs before the namespace is torn down
- subsonic: 30s per-request timeout (NAVIDROME_TIMEOUT_MS); /dj/recent
  album fan-out bounded to 5 concurrent calls
- docs/template: recommend the direct pool path (/mnt/cache/...) over
  /mnt/user/... on Unraid — SQLite WAL over shfs/FUSE is a known-bad combo

Fixes #786
@perminder-klair perminder-klair marked this pull request as ready for review July 4, 2026 11:39
@perminder-klair perminder-klair merged commit fae6855 into develop Jul 4, 2026
2 checks passed
@perminder-klair perminder-klair deleted the fix/786-library-wal-checkpoint branch July 4, 2026 11:40

Development

Successfully merging this pull request may close these issues.

Slow admin WebUI performance of AIO container on Unraid host
