Fix backend bugs/reliability bucket from audit (#531) by nitrobass24 · Pull Request #537 · nitrobass24/seedsync

nitrobass24 · 2026-06-02T13:42:14Z

Fixes the entire 🐛 Backend bugs / reliability bucket of tracking issue #531 (10 issues) in one branch, as requested. Each fix has its own commit and tests.

Note

Intentional grouping. This bundles 10 issues, deviating from CLAUDE.md's "one concern per branch" rule — done deliberately at the maintainer's request for a single grouped review. Commits are one-per-issue (plus review-polish commits) to keep it reviewable.

How this was built

A multi-agent workflow: parallel investigation (one agent per issue, re-verifying each finding against current code post-#534), sequential implementation (one commit per issue, ordered for the command_pipeline.py overlap between #510/#511), then parallel adversarial review of every landed fix. The review surfaced acceptance-criteria (AC) gaps, which were then addressed (see below).

Issues fixed

| # | Fix | Commit |
|---|---|---|
| #507 | Roll back config on any persist failure, not just `OSError` (the `%`-interpolation half already shipped in #534) | `6e98254` |
| #508 | `Controller.exit()` best-effort so a hung lftp / raising worker reap can't leak processes + queue FDs | `24712b3`, `779a898` |
| #509 | Lock-guard `AutoQueuePersist` patterns + listener `new_patterns` set (mirrors `PathPairsConfig`) | `61b3301`, `66d19da` |
| #510 | Surface staging→final move failures via a result queue; discard key on failure; cross-device size + `ENOTEMPTY` fixes | `ea0cc8c` |
| #511 | Isolate extract/validate worker faults from the controller loop; surface a dead worker at ERROR | `35bf824`, `6bdf327` |
| #512 | Harden periodic + shutdown persist against transient write failures (ERROR + continue; classified shutdown) | `8448ae6`, `4761406` |
| #517 | Job-status parser robustness: anchored echo-strip, digit-required ETA regex, queue-header ordering | `22c31fd` |
| #518 | Roll back auto-queue add/remove on persist failure (clears listener `new_patterns`) | `7ccae90` |
| #519 | Expand `~` in remote validate `find`/hash commands; derive keys from the find base | `6221513`, `01ae955` |
| #526 | Bound controller action `Event.wait()` with a timeout → HTTP 504 instead of hanging request threads | `7ff125c` |
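To illustrate the #526 row, here is a minimal sketch of bounding an `Event.wait()` so a request thread gets a 504 rather than hanging. The names `run_action`, `enqueue`, and `ACTION_TIMEOUT_S`, and the `(status, body)` return shape, are hypothetical and not taken from seedsync's handler code. The only behavior relied on is that `threading.Event.wait(timeout)` returns `False` when the timeout elapses first.

```python
import threading

# Hypothetical value; the timeout actually used in the PR is not stated here.
ACTION_TIMEOUT_S = 0.05


def run_action(enqueue, timeout=ACTION_TIMEOUT_S):
    """Queue a controller action and wait a bounded time for it to finish.

    `enqueue` is a hypothetical callable that hands `done` to the controller,
    which is expected to call done.set() once the action completes.
    """
    done = threading.Event()
    enqueue(done)
    # Event.wait returns False if the timeout elapsed before set() was called.
    if not done.wait(timeout):
        return 504, "controller did not respond in time"
    return 200, "ok"
```

Calling `run_action(lambda ev: ev.set())` models a controller that completes immediately, while `run_action(lambda ev: None)` models one that never responds and so takes the 504 path after the timeout.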

Review findings → resolution

Adversarial review flagged 3 AC gaps + weak tests on 2 fixes. All addressed in this PR except two deeper reliability features intentionally split into follow-ups:

Controller.exit() leaks worker subprocesses + queue FDs when lftp shutdown errors #508 (AC gap): worker terminate/join/close_queues was unguarded. Every teardown call now goes through a best-effort helper, so all phases always run and nothing propagates. Test added.
AutoQueue / AutoQueuePersist thread-safety: cross-thread set & list races #509 (weak tests): the two concurrency tests were ineffective (one tested the already-safe snapshot; the other only checked JSON validity, which never breaks under the GIL). Replaced with deterministic lock-contract tests (hold the lock, assert each op blocks until release — fails on unguarded code, no GIL-timing reliance).
Periodic persist() failure misclassified as clean shutdown (logged only at INFO) #512 (untested ACs): extracted _log_shutdown_cause + _final_persist from run() and added direct tests for the INFO-vs-ERROR shutdown classification and the guarded final persist.
Worker subprocess crash (Validate/Extract) tears down the whole download engine #511 (blocking, remainder deferred to follow-up #535): the shipped fix isolates the fault (downloads continue) and now reports a dead worker at ERROR. Full worker recreation (the preferred AC1) needs a supervisor shared across Controller/CommandPipeline/ModelUpdater, because a dead multiprocessing.Process can't be restarted in place. That work is tracked in Recreate dead extract/validate worker processes (follow-up to #511) #535.
Move pipeline: silent data loss — failed moves reported as success #510 (AC gap, remainder deferred to follow-up #536): the core data-loss bugs are fixed and tested. Routing failures into a visible model state plus in-session retry touches the model state machine, so that work is tracked in Surface failed staging→final moves as a model state + in-session retry (follow-up to #510) #536.
Backend: directory validation breaks on tilde/relative remote_path #519 (minor): the leaf-name key anchor has a narrow false mismatch when a nested dir shares the leaf name. No name-based anchor is robust without the expanded find root, so the limitation is documented in the docstring and the more robust `cd; pwd; find .` approach is noted for a follow-up.
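The lock-contract idea described for #509 (hold the lock, assert the operation blocks until release) can be sketched as follows. `Patterns`, `UnguardedPatterns`, and `blocks_while_locked` are hypothetical stand-ins written for illustration; they are not the `AutoQueuePersist` classes or the tests shipped in this PR.

```python
import threading


class Patterns:
    """Hypothetical lock-guarded pattern store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = set()

    def add(self, pattern):
        with self._lock:
            self._items.add(pattern)


class UnguardedPatterns(Patterns):
    """Same store, but add() skips the lock (the bug being tested for)."""

    def add(self, pattern):
        self._items.add(pattern)


def blocks_while_locked(store, op, grace=0.1):
    """True if op stalls while store._lock is held and finishes after release."""
    finished = threading.Event()

    def target():
        op()
        finished.set()

    with store._lock:
        worker = threading.Thread(target=target, daemon=True)
        worker.start()
        # A guarded op cannot complete while this thread holds the lock.
        completed_early = finished.wait(grace)
    # Once the lock is released the op should finish promptly.
    completed_after = finished.wait(2.0)
    worker.join(2.0)
    return (not completed_early) and completed_after
```

For guarded code the check passes regardless of scheduling, because the operation cannot complete while the lock is held. For unguarded code the operation finishes during the grace window and the check fails, which is the property the replaced GIL-dependent tests could not show.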

Test plan

✅ Full unit suite: 839 passed, 59 skipped. (The one local failure, test_scan_file_with_latin_chars, is a pre-existing macOS-only Errno 92; it also fails on develop and passes on CI's Linux runner.)
✅ Web integration suite: 151 passed.
✅ ruff check, ruff format --check, and ruff check --select C901 (complexity ≤ 12) clean on all changed files.
New tests accompany every fix (happy path + bug-reproduction + error/edge paths).

Closes #507. Closes #508. Closes #509. Closes #510. Closes #511. Closes #512. Closes #517. Closes #518. Closes #519. Closes #526.

#510 and #511 close here because their core fixes ship in this PR; the remaining enhancement scope continues in dedicated follow-ups:

Move pipeline: silent data loss — failed moves reported as success #510 → Surface failed staging→final moves as a model state + in-session retry (follow-up to #510) #536 — surface failed moves as a model/UI state + in-session retry
Worker subprocess crash (Validate/Extract) tears down the whole download engine #511 → Recreate dead extract/validate worker processes (follow-up to #511) #535 — automatic recreation of a dead extract/validate worker

🤖 Generated with Claude Code