fix(controller): checkpoint the library DB WAL and shut down cleanly by perminder-klair · Pull Request #829 · perminder-klair/subwave

perminder-klair · 2026-07-04T11:36:21Z

Fixes #786 — slow admin UI on the Unraid AIO caused by an unbounded library.db-wal (730MB on a 303MB DB) plus SQLite-over-FUSE on /mnt/user appdata paths.

Root cause

library-db.ts set journal_mode = WAL at open and nothing else, ever:

No journal_size_limit, so once a bulk write pass (acoustic analysis writes fat *_json blobs per track) ballooned the WAL, it stayed at its high-water mark forever.
The bulk CLIs (npm run tag / npm run analyze) exit via process.exit() without closing the DB, so the pass that grew the WAL never folded it back.
The AIO supervisor's stop trap was kill -TERM 0; exit 0 — as PID 1, its immediate exit tears the namespace down and SIGKILLs node before any shutdown work; and server.ts had no SIGTERM handler anyway.
With better-sqlite3 being synchronous, every query walking that giant WAL blocked the whole event loop — sluggish admin pages at ~0% CPU, exactly as reported.

Separately, server.ts had no unhandledRejection handler (Node's default since v15 is to crash), so a stray rejection killed the controller and the supervisor bounced it — the random 502s. And Subsonic fetches had no timeout while /dj/recent fanned out ~21 parallel Navidrome calls — the "recent failed (500)".

Changes

WAL management

journal_size_limit = 64 MiB at open, so once a checkpoint resets the WAL, SQLite truncates the sidecar back to at most 64 MiB instead of leaving it at its high-water mark.
close() runs a best-effort wal_checkpoint(TRUNCATE) first (on close, SQLite checkpoints automatically only when the last connection goes away, and controller/tagger/analyzer can hold the DB concurrently).
New checkpointWal() helper; the scheduler's hourly cleanup job now calls it, so the WAL can't balloon unbounded even mid-run.
Both bulk CLIs register a process.on('exit') hook that closes the DB on every exit path (better-sqlite3 close is synchronous, so this is safe).
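
The WAL management items above can be sketched roughly as follows. This is a minimal illustration, not the code in library-db.ts: the WalDb interface and the configureWal and makeCloser names are hypothetical, chosen so the sketch runs without the real database. Only checkpointWal, the pragmas, and the 64 MiB limit come from this PR. A better-sqlite3 Database instance should satisfy WalDb, since it exposes pragma() and close().

```typescript
// Hypothetical minimal surface for this sketch; better-sqlite3's Database
// provides both methods, so a real connection can be passed in.
interface WalDb {
  pragma(source: string): unknown;
  close(): void;
}

const WAL_SIZE_LIMIT_BYTES = 64 * 1024 * 1024; // 64 MiB, the limit named in the PR

// Open-time settings: WAL mode plus a cap on the WAL file left on disk.
function configureWal(db: WalDb): void {
  db.pragma('journal_mode = WAL');
  db.pragma(`journal_size_limit = ${WAL_SIZE_LIMIT_BYTES}`);
}

// Best-effort checkpoint: ask SQLite to fold the WAL into the main DB and
// truncate the file. Returns false instead of throwing if the pragma fails
// (for example, because the connection is already closed).
function checkpointWal(db: WalDb): boolean {
  try {
    db.pragma('wal_checkpoint(TRUNCATE)');
    return true;
  } catch {
    return false;
  }
}

// Build an idempotent closer: checkpoint first, then close, at most once.
function makeCloser(db: WalDb): () => void {
  let closed = false;
  return () => {
    if (closed) return;
    closed = true;
    checkpointWal(db);
    db.close();
  };
}
```

Under these assumptions a bulk CLI could register process.on('exit', makeCloser(db)) right after opening the database. Because the closer is synchronous and idempotent, it is safe to run from an exit hook and from an explicit shutdown path alike.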

Clean shutdown

server.ts: SIGTERM/SIGINT handler closes the library DB before exiting.
AIO supervisor: after kill -TERM 0, wait for the supervise loops + a 2s grace for their reparented children before PID 1 exits, so node's handler actually runs. Docker's stop timeout still hard-caps it.
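
The node side of the clean shutdown can be sketched like this. It is an assumption-laden illustration rather than the handler in server.ts: ProcessLike and installShutdownHandlers are hypothetical names, and the process object is injected so the sketch can be exercised without sending real signals. Node's global process should satisfy ProcessLike.

```typescript
// Hypothetical minimal shape of the process object this sketch needs.
interface ProcessLike {
  on(event: string, listener: (...args: unknown[]) => void): unknown;
  exit(code?: number): void;
}

// Register SIGTERM/SIGINT handlers that close the DB before exiting.
function installShutdownHandlers(
  proc: ProcessLike,
  closeDb: () => void,
  log: (msg: string) => void = console.log,
): void {
  let shuttingDown = false;

  const shutdown = (signal: string): void => {
    if (shuttingDown) return; // a repeated signal must not close twice
    shuttingDown = true;
    log(`received ${signal}, closing library DB`);
    let code = 0;
    try {
      closeDb(); // synchronous, so it completes before exit() is called
    } catch (err) {
      log(`closing library DB failed: ${String(err)}`);
      code = 1;
    }
    proc.exit(code);
  };

  proc.on('SIGTERM', () => shutdown('SIGTERM'));
  proc.on('SIGINT', () => shutdown('SIGINT'));
}
```

A server entry point would call something like installShutdownHandlers(process, closeLibraryDb) once at startup, where closeLibraryDb is a placeholder for whatever closes the library DB. The handler only helps if PID 1 stays alive long enough for it to run, which is what the supervisor change above provides.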

Resilience

unhandledRejection now logs loudly and continues instead of killing the controller.
Subsonic API calls get a 30s timeout (NAVIDROME_TIMEOUT_MS to override) with a readable error.
/dj/recent's album expansion is bounded to 5 concurrent Navidrome calls via the existing mapPool util.
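
The timeout and bounded fan-out can be illustrated with the sketch below. Everything here is a stand-in: the repository's real mapPool signature is not shown in this PR, so this mapPool is a hypothetical equivalent, and withTimeout and resolveTimeoutMs are illustrative names. Only the 30s default, the NAVIDROME_TIMEOUT_MS override, and the limit of 5 come from the description above.

```typescript
// Parse an override such as NAVIDROME_TIMEOUT_MS, falling back to 30s.
function resolveTimeoutMs(raw: string | undefined, fallbackMs = 30_000): number {
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

// Run a request with an abort signal and turn a timeout into a readable error.
// The callback must pass the signal on (e.g. to fetch) for the abort to take effect.
async function withTimeout<R>(
  label: string,
  timeoutMs: number,
  run: (signal: AbortSignal) => Promise<R>,
): Promise<R> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } catch (err) {
    if (controller.signal.aborted) {
      throw new Error(`${label} timed out after ${timeoutMs}ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical stand-in for the repo's mapPool util: map over items with at
// most `limit` calls in flight, preserving input order in the results.
async function mapPool<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      results[i] = await fn(items[i] as T, i);
    }
  };
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Under these assumptions, the album expansion would look roughly like mapPool(albumIds, 5, (id) => withTimeout('getAlbum', timeoutMs, (signal) => fetch(urlFor(id), { signal }))), with albumIds and urlFor as placeholders. The pool keeps ~21 albums from turning into ~21 simultaneous Navidrome requests, and the timeout keeps one hung request from stalling the whole response.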

Docs / template (FUSE)

docs/unraid.md, the /setup/unraid page, and the CA template description now recommend the direct pool path (/mnt/cache/appdata/subwave) over /mnt/user/... — SQLite WAL over Unraid's shfs/FUSE layer is a documented-bad combo. Template default stays /mnt/user since /mnt/cache only exists when a pool is named cache.

Verified

npm run lint clean in both controller/ and web/.
Smoke-tested the checkpoint path against the installed better-sqlite3: a 4.1MB WAL truncates to 0 bytes on wal_checkpoint(TRUNCATE) and the sidecar is removed on close.

The library.db WAL sidecar could grow unbounded (730MB on a 303MB DB in #786): nothing ever ran a TRUNCATE checkpoint, the bulk tag/analyze CLIs exited via process.exit() without closing the DB, and the AIO supervisor tore the container down before node's (nonexistent) SIGTERM handler could run. Every later query walked that giant WAL on better-sqlite3's synchronous thread, stalling the whole event loop: sluggish admin pages at ~0% CPU.

- library-db: set journal_size_limit (64 MiB) so checkpoints shrink the sidecar; TRUNCATE-checkpoint in close(); new checkpointWal() helper
- library facade: checkpoint() + shutdown() passthroughs
- scheduler: hourly best-effort WAL checkpoint in the cleanup job
- tag-library/analyze-library CLIs: close the DB on every exit path
- server: SIGTERM/SIGINT handler closes the DB; unhandledRejection now logs instead of crashing (the supervisor-restart 502s)
- AIO supervisor: wait for children after kill -TERM 0 so node's shutdown handler actually runs before the namespace is torn down
- subsonic: 30s per-request timeout (NAVIDROME_TIMEOUT_MS); /dj/recent album fan-out bounded to 5 concurrent calls
- docs/template: recommend the direct pool path (/mnt/cache/...) over /mnt/user/... on Unraid; SQLite WAL over shfs/FUSE is a known-bad combo

Fixes #786

perminder-klair marked this pull request as ready for review July 4, 2026 11:39

perminder-klair mentioned this pull request Jul 4, 2026

Slow admin WebUI performance of AIO container on Unraid host #786

Closed

perminder-klair merged commit fae6855 into develop Jul 4, 2026
2 checks passed

perminder-klair deleted the fix/786-library-wal-checkpoint branch July 4, 2026 11:40

This was referenced Jul 5, 2026

release: Liquidsoap 2.4 upgrade, community skills, DJ audio polish #854

Merged

chore(main): release 0.36.0 #855

Merged

perminder-klair merged 1 commit into develop from fix/786-library-wal-checkpoint

perminder-klair commented Jul 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

perminder-klair commented Jul 4, 2026

Root cause

Changes

Verified

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant