Fix backend bugs/reliability bucket from audit (#531)#537
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#512) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AutoQueuePersist.__patterns and AutoQueuePersistListener.new_patterns were mutated on the web handler thread (add_pattern/remove_pattern) while read on the controller thread (patterns/to_str and __filter_candidates iteration). A concurrent add/remove during process() raised 'RuntimeError: Set changed size during iteration', killing the ControllerJob. Mirror common/path_pairs_config.py: guard the patterns list, to_str, and listener registration with a threading.RLock (notifying listeners outside the lock), and guard new_patterns with snapshot_new_patterns/clear_new_patterns helpers that AutoQueue uses for atomic reads on the controller thread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MoveProcess now reports failures (vanished source, size mismatch) on a failed-result queue instead of silently returning, mirroring ValidateProcess. Size verification checks only the copied subtree so merging into a pre-existing non-empty destination no longer false-mismatches, and rename ENOTEMPTY/EEXIST falls back to copy+delete instead of re-raising. CommandPipeline.cleanup drains pop_failed and discards the moved key on failure so force_scan retries the move. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s/FDs (#508) Guard each pc.lftp.exit() in a best-effort loop (extracted into __shutdown_lftp_best_effort) so a hung/dead lftp PTY no longer aborts teardown. This ensures the worker terminate/join, __mp_logger.stop(), and close_queues() phases always run, reaping subprocesses and releasing multiprocessing.Queue FDs on every restart. Adds tests/unittests/test_controller/test_controller_exit.py covering the hung-lftp regression, multi-pair isolation, happy path, not-started no-op, and repeated-exit no-leak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review of #508 found AC2 only partially met: lftp shutdown was guarded, but the worker terminate/join/close_queues sequence was a flat unguarded run, so a raising Process.terminate()/join() could still skip the FD-releasing close_queues phase (and propagate out of exit()). Route every teardown call through a best-effort helper so all phases always run and nothing propagates. Adds a test where a worker terminate()/join() raises and close_queues still runs for every worker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew) The #511 fix isolates a worker fault so downloads continue, but propagate_exception() consumes the queued fault only once, so after the first cycle a dead worker went silent — extract/validate stayed disabled with no operator signal. Report the dead state once at ERROR (sticky per worker). Full automatic worker recreation (the issue's preferred AC1) is deferred to a follow-up; this commit only makes the degraded-until-restart state visible. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review found the two concurrency tests were ineffective: the new_patterns test iterated the already-safe snapshot (so it never caught the live-set crash), and the to_str test only checked JSON validity, which never breaks under the GIL (it passed even against the unguarded code). Replace both with deterministic lock-contract tests: while the shared lock is held, each mutator/reader must block until release — failing on unguarded code, passing on the locked code, with no reliance on GIL timing. Add a snapshot-independence test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review found AC2 (INFO-vs-ERROR shutdown classification) and AC3 (guarded final persist) shipped untested because they were inline in run(), which can't be unit-exercised. Extract them into _log_shutdown_cause() and _final_persist() (mirroring the already-tested _persist_periodic) and add direct tests: an intentional ServiceExit/ServiceRestart logs at INFO, a genuine crash logs at ERROR with traceback, and a failing final persist is swallowed (so it can't mask the original in-flight exception). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review noted the leaf-name anchor introduces a narrow false (non-checksum) mismatch when a nested directory shares the validated leaf's name. No name-based anchor is fully robust without the shell-expanded find root; document the limitation and the robust follow-up approach (cd; pwd; find .) in the docstring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These workflow-authored test files passed ruff check but not ruff format --check (which CI runs). Format-only change; no test behavior altered. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds cross-cutting synchronization for auto-queue, process-safe move failure reporting, guarded controller teardown, persistence resiliency, request timeouts, remote-path escaping, parser fixes, and matching unit/integration tests. ChangesAuto-queue thread-safety and handler atomicity
Move-process failure reporting and pipeline recovery
Controller and pipeline robustness
Persistence resilience and request timeout guarding
Remote-path escaping and job-status parsing stability
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/python/web/handler/auto_queue.py (1)
50-64:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMove the duplicate/missing checks under
__write_lock.These membership checks are outside the critical section, so two concurrent identical requests can both pass the precheck; the second request then no-ops under the lock, persists unchanged state, and still returns 200 instead of the documented 400. The existence test needs to be atomic with the mutation/persist block.
Also applies to: 100-109
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/python/web/handler/auto_queue.py` around lines 50 - 64, Move the membership/existence checks for aqp against self.__auto_queue_persist.patterns inside the critical section guarded by self.__write_lock so the test-and-set is atomic; specifically, perform the "if aqp in self.__auto_queue_persist.patterns: return HTTPResponse(... 400 ...)" check while holding self.__write_lock before calling add_pattern and persisting, ensuring the mutate → persist → rollback sequence (and the AutoQueuePersistListener pattern_added/new_patterns behavior and controller replay) cannot be raced by concurrent requests; apply the same change to the other duplicate/missing checks around the add/remove flows (the block referenced at lines 100-109).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/python/controller/auto_queue.py`:
- Line 262: The current process() flow snapshots new_patterns and then later
calls __persist_listener.clear_new_patterns() separately, which can drop
patterns added by concurrent add_pattern(); instead modify process() to, under
the same __persist_listener lock, atomically drain the listener into a local
snapshot (e.g. call a single “drain” operation or call get-and-clear while
holding the listener's lock), then reuse that drained snapshot for all
subsequent queue/extract/delete filtering logic; remove the separate
clear_new_patterns() call and ensure add_pattern() still appends to the listener
normally, so no patterns added after the drain are lost for this cycle.
In `@src/python/controller/command_pipeline.py`:
- Around line 421-422: The current loop forces a scan for every PairContext
(self._pair_contexts) after a move finishes; instead, locate the owning
PairContext by matching PairContext.pair_id to move_process.pair_id and call
that context's local_scan_process.force_scan() only for that pair, and only if
that lookup fails fall back to iterating self._pair_contexts and calling
local_scan_process.force_scan() on each; update the code paths around
move_process and _pair_contexts (reference functions/objects:
move_process.pair_id, PairContext, self._pair_contexts,
local_scan_process.force_scan()) to perform the targeted rescan.
- Around line 398-399: The fix: before removing a finished MoveProcess from
Controller.active_move_processes, ensure you reap it by calling
MoveProcess.close_queues() and join() after _finalize_move_process(move_process)
so queues and the process are released; update the loop that builds
still_active_moves (and the same logic covering lines 401-423) to call
move_process.close_queues() and move_process.join() for processes that are no
longer active/finalized, then exclude them from still_active_moves so
Controller.exit() won’t need to clean them up later.
In `@src/python/controller/controller.py`:
- Around line 385-386: The current logic returns early from exit() when
self.__started is False, which skips teardown after a partial failure in start()
and leaks started children; either set self.__started = True before launching
child.start() calls and on any exception call exit() to teardown and then reset
__started = False and re-raise, or make exit() best-effort by removing the
early-return so it always attempts to stop/cleanup any children (iterate over
started children and call their stop/teardown methods) regardless of __started;
update start(), exit(), and the child.start()/child.stop() handling to implement
one of these fixes and ensure consistent state after exceptions.
In `@src/python/tests/unittests/test_controller/test_auto_queue.py`:
- Around line 286-294: The helper starts a probe thread with
threading.Thread(target=run) that can remain alive if op() hangs; mark that
probe thread as a daemon before starting it (i.e., set thread.daemon = True on
the Thread object created in the block using variables thread/run) so the test
process won't hang if the worker never returns; keep the rest of the logic
(lock, started.wait, done.wait, thread.join, assertions) unchanged.
In `@src/python/web/handler/auto_queue.py`:
- Around line 74-81: The except block that currently only catches OSError around
self.__auto_queue_persist.to_file(self.__persist_path) can miss other failures
(e.g. exceptions raised from Persist.to_str()), leaving __patterns and listener
state mutated without rollback; change the handlers around
__auto_queue_persist.to_file(...) (both the add-path and the similar block at
the second occurrence) to catch Exception (or a broader base that includes
to_str errors) so that on any failure you call
self.__auto_queue_persist.remove_pattern(aqp) and then log via
self.__logger.exception("Failed to persist auto-queue after adding pattern %r",
pattern) to ensure rollback runs for all to_file() failures.
In `@src/python/web/handler/controller.py`:
- Around line 118-120: The code enqueues the command via
self.__controller.queue_command(command) before calling callback.wait(), so
timing out after enqueue incorrectly returns a 504 while the non-idempotent
commands (DELETE_REMOTE, DELETE_LOCAL, EXTRACT, QUEUE) may still run; either
perform the timeout check before calling queue_command (i.e., call
callback.wait() first) or, if you must enqueue first, change the timeout branch
to return a 202 Accepted (or another acknowledgement) instead of 504 and
document that execution may continue asynchronously; adjust logic around
callback.wait() and queue_command to ensure responses reflect whether the
request was merely accepted or truly failed.
---
Outside diff comments:
In `@src/python/web/handler/auto_queue.py`:
- Around line 50-64: Move the membership/existence checks for aqp against
self.__auto_queue_persist.patterns inside the critical section guarded by
self.__write_lock so the test-and-set is atomic; specifically, perform the "if
aqp in self.__auto_queue_persist.patterns: return HTTPResponse(... 400 ...)"
check while holding self.__write_lock before calling add_pattern and persisting,
ensuring the mutate → persist → rollback sequence (and the
AutoQueuePersistListener pattern_added/new_patterns behavior and controller
replay) cannot be raced by concurrent requests; apply the same change to the
other duplicate/missing checks around the add/remove flows (the block referenced
at lines 100-109).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 27417b73-1fba-4d7a-b0d9-feb054163b23
📒 Files selected for processing (21)
src/python/controller/auto_queue.pysrc/python/controller/command_pipeline.pysrc/python/controller/controller.pysrc/python/controller/move/__init__.pysrc/python/controller/move/move_process.pysrc/python/controller/validate/validate_process.pysrc/python/lftp/job_status_parser.pysrc/python/seedsync.pysrc/python/tests/integration/test_web/test_handler/test_auto_queue.pysrc/python/tests/integration/test_web/test_handler/test_config.pysrc/python/tests/integration/test_web/test_handler/test_controller.pysrc/python/tests/unittests/test_controller/test_auto_queue.pysrc/python/tests/unittests/test_controller/test_command_pipeline.pysrc/python/tests/unittests/test_controller/test_controller_exit.pysrc/python/tests/unittests/test_controller/test_move_process.pysrc/python/tests/unittests/test_controller/test_validate/test_validate_process.pysrc/python/tests/unittests/test_lftp/test_job_status_parser.pysrc/python/tests/unittests/test_seedsync.pysrc/python/web/handler/auto_queue.pysrc/python/web/handler/config.pysrc/python/web/handler/controller.py
Pyright (strict) flagged the new __iter_worker_processes/__safe_teardown (#508) and __report_if_worker_dead (#511) helpers for missing parameter and return types. Annotate them (Iterator[AppProcess], Callable[[], object], worker: AppProcess) so the Python Type Check job passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review found a race: process() snapshotted new_patterns per-gather and then called clear_new_patterns() separately at the end, so a pattern added concurrently between the snapshot and the clear was wiped without ever being applied retroactively. Replace snapshot_new_patterns + clear_new_patterns with a single atomic drain_new_patterns() (copy + clear under the listener lock), drained once per cycle and threaded through the gather/filter calls; a pattern added after the drain is retained for the next cycle. Tests updated + a regression test for the add-after-drain case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Move the "already exists"/"doesn't exist" membership check inside __write_lock so the test-and-set is atomic against concurrent add/remove requests. - Broaden the persist-failure rollback from `except OSError` to `except Exception` (both add and remove) so a to_str()/serialization failure also rolls back __patterns and the listener side effect (mirrors the #507 config handler). Adds a non-OSError rollback test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Reap a finished MoveProcess in cleanup() (join + close_queues) so its FDs are released immediately. It's dropped from active_move_processes, so Controller.exit() never sees it and would otherwise leak its queue FDs until shutdown (#510 added a per-move failed-result queue). - Rescan only the move's owning pair (matched by pair_id) instead of every pair; fall back to all pairs if the owner can't be located. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
review) If a child.start() raised partway through start(), __started stayed False and exit() early-returned, leaking the already-started processes and their queue FDs. Set __started = True before launching children so the job's cleanup path (best-effort exit()) tears down whatever started. Adds a partial-start test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/python/web/handler/auto_queue.py (1)
50-85:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThese handlers still expose uncommitted pattern state to the controller.
add_pattern()/remove_pattern()publish the change toAutoQueuePersist.__patternsand fire the listener callbacks beforeto_file()succeeds.AutoQueue.process()runs on another thread and can readpersist.patternsor drainnew_patternsduring that window, so a failed add can still queue/extract/delete work after the request returns 500, and a failed remove can miss one cycle of matching work before rollback restores the pattern. To make the rollback guarantee real, you need to keep controller reads out of this mutate→persist→rollback window or stage/persist a copy and only publish the live mutation after the write succeeds.Also applies to: 104-121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/python/web/handler/auto_queue.py` around lines 50 - 85, The handler currently calls __auto_queue_persist.add_pattern/remove_pattern which publish changes (fire listeners and update AutoQueuePersist.__patterns/new_patterns) before to_file() succeeds, allowing AutoQueue.process() to observe uncommitted state; change the flow so mutations are not published until after a successful persist: either (A) add staged methods on AutoQueuePersist (e.g. stage_add_pattern(stage_aqp) and stage_remove_pattern(stage_aqp) that only update an internal staged buffer and to_file_from_stage(path) that writes the staged state, followed by a publish_staged() that updates __patterns and fires listeners), or (B) add optional publish=False to add_pattern/remove_pattern so the handler can call add_pattern(aqp, publish=False); then call to_file(...) and only on success call add_pattern(aqp, publish=True) (or a dedicated publish() method); ensure AutoQueue.process() reads only the live __patterns/new_patterns that are updated by the publish path and keep using __write_lock to protect the mutate→persist→publish sequence.src/python/controller/controller.py (1)
396-397:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftBound teardown
join()calls. Insrc/python/controller/controller.py(Controller.exit()lines 396-397), teardown uses a blockingp.join()with no timeout. Even thoughAppProcess.terminate()force-terminates after ~1s, a worker that still doesn’t exit (e.g., stuck/uninterruptible I/O) can keepexit()hung here, preventingclose_queues()and stalling shutdown / leaking queue FDs across restarts. Usep.join(timeout=...)and, ifp.is_alive()afterward, log and continue teardown without callingclose_queues()for still-alive processes (or force-kill + bounded join).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/python/controller/controller.py` around lines 396 - 397, In Controller.exit(), avoid blocking forever on process joins: replace the current unbounded self.__safe_teardown("join", p.join) calls with a bounded join pattern—call p.join(timeout=JOIN_TIMEOUT) (pick a sensible constant, e.g., 1-2s), then if p.is_alive() log a warning including the process id/name and continue teardown without calling close_queues() for that still-alive process (or attempt a forceful termination via AppProcess.terminate() followed by another bounded p.join()); ensure __safe_teardown or the exit loop reflects this timeout-and-check flow so shutdown proceeds even when a worker is stuck.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/python/controller/command_pipeline.py`:
- Around line 399-404: spawn_move_process currently appends MoveProcess to
self.active_move_processes and adds move_key to self.moved_file_keys before
calling process.start(), so if start() raises cleanup() will treat the process
as not started and call move_process.join()/move_process.close_queues() (which
can raise AssertionError) and will not run _finalize_move_process() to remove
the stale move_key. Fix by starting the process before publishing bookkeeping
(call move_process.start() prior to adding move_key to moved_file_keys and
appending to active_move_processes) or, if you prefer to keep the current order,
wrap cleanup’s join/close with a guard that checks move_process.is_alive() or
tracks a started flag and on start() exception remove the move_key from
moved_file_keys and avoid calling join()/close_queues(); ensure
_finalize_move_process() still runs or that the stale key is removed on start
failure.
---
Outside diff comments:
In `@src/python/controller/controller.py`:
- Around line 396-397: In Controller.exit(), avoid blocking forever on process
joins: replace the current unbounded self.__safe_teardown("join", p.join) calls
with a bounded join pattern—call p.join(timeout=JOIN_TIMEOUT) (pick a sensible
constant, e.g., 1-2s), then if p.is_alive() log a warning including the process
id/name and continue teardown without calling close_queues() for that
still-alive process (or attempt a forceful termination via
AppProcess.terminate() followed by another bounded p.join()); ensure
__safe_teardown or the exit loop reflects this timeout-and-check flow so
shutdown proceeds even when a worker is stuck.
In `@src/python/web/handler/auto_queue.py`:
- Around line 50-85: The handler currently calls
__auto_queue_persist.add_pattern/remove_pattern which publish changes (fire
listeners and update AutoQueuePersist.__patterns/new_patterns) before to_file()
succeeds, allowing AutoQueue.process() to observe uncommitted state; change the
flow so mutations are not published until after a successful persist: either (A)
add staged methods on AutoQueuePersist (e.g. stage_add_pattern(stage_aqp) and
stage_remove_pattern(stage_aqp) that only update an internal staged buffer and
to_file_from_stage(path) that writes the staged state, followed by a
publish_staged() that updates __patterns and fires listeners), or (B) add
optional publish=False to add_pattern/remove_pattern so the handler can call
add_pattern(aqp, publish=False); then call to_file(...) and only on success call
add_pattern(aqp, publish=True) (or a dedicated publish() method); ensure
AutoQueue.process() reads only the live __patterns/new_patterns that are updated
by the publish path and keep using __write_lock to protect the
mutate→persist→publish sequence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c9d32357-fe1c-4de9-a6f2-8175a91935c4
📒 Files selected for processing (8)
src/python/controller/auto_queue.pysrc/python/controller/command_pipeline.pysrc/python/controller/controller.pysrc/python/tests/integration/test_web/test_handler/test_auto_queue.pysrc/python/tests/unittests/test_controller/test_auto_queue.pysrc/python/tests/unittests/test_controller/test_command_pipeline.pysrc/python/tests/unittests/test_controller/test_controller_exit.pysrc/python/web/handler/auto_queue.py
spawn_move_process added move_key and appended to active_move_processes before process.start(). If start() raised, the next cleanup() saw a never-started process (is_alive() False) and called join()/close_queues() on it (raising AssertionError), while the stale move_key blocked every future retry. Start the process first, then publish bookkeeping only on success. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
exit() joined each worker with no timeout, so a worker stuck in uninterruptible I/O (e.g. a dead NAS mount ignoring the SIGTERM from terminate()) would hang join() — and exit() — forever, wedging ServiceRestart. Join with a bounded timeout and, if a worker is still alive, log a warning and continue teardown. Adds a stuck-worker test (timeout-guarded). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/python/tests/unittests/test_controller/test_controller_exit.py`:
- Around line 192-210: The test
test_partial_start_failure_leaves_started_true_so_exit_cleans_up currently
asserts terminate and close_queues were called on each pair process but omits
verifying join was invoked; update the teardown assertions after calling
self.controller.exit() (in the same test) to also assert that for each proc in
self._all_pair_processes() proc.join.assert_called_once() so the test matches
the full-teardown expectations and mirrors _assert_full_teardown() behavior;
locate uses of self.extract_process.start, self.controller.start,
self.controller.exit and self._all_pair_processes() to add the join assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9fdfeebe-9807-49ad-ab20-1fea5cf234ca
📒 Files selected for processing (8)
src/python/controller/auto_queue.pysrc/python/controller/command_pipeline.pysrc/python/controller/controller.pysrc/python/tests/integration/test_web/test_handler/test_auto_queue.pysrc/python/tests/unittests/test_controller/test_auto_queue.pysrc/python/tests/unittests/test_controller/test_command_pipeline.pysrc/python/tests/unittests/test_controller/test_controller_exit.pysrc/python/web/handler/auto_queue.py
The partial-start test verified terminate + close_queues per pair process but omitted join. exit() bounded-joins each worker, so assert join too, matching _assert_full_teardown's full-teardown expectations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes the entire 🐛 Backend bugs / reliability bucket of tracking issue #531 (10 issues) in one branch, as requested. Each fix has its own commit and tests.
Note
Intentional grouping. This bundles 10 issues, deviating from CLAUDE.md's "one concern per branch" rule — done deliberately at the maintainer's request for a single grouped review. Commits are one-per-issue (plus review-polish commits) to keep it reviewable.
How this was built
A multi-agent workflow: parallel investigation (one agent per issue, re-verifying each finding against current code post-#534), sequential implementation (one commit per issue, ordered for the
command_pipeline.pyoverlap between #510/#511), then parallel adversarial review of every landed fix. The review surfaced AC gaps, which were then addressed (see below).Issues fixed
OSError(the%-interpolation half already shipped in #534)6e98254Controller.exit()best-effort so a hung lftp / raising worker reap can't leak processes + queue FDs24712b3,779a898AutoQueuePersistpatterns + listenernew_patternsset (mirrorsPathPairsConfig)61b3301,66d19daENOTEMPTYfixesea0cc8c35bf824,6bdf3278448ae6,476140622c31fdnew_patterns)7ccae90~in remote validatefind/hash commands; derive keys from the find base6221513,01ae955Event.wait()with a timeout → HTTP 504 instead of hanging request threads7ff125cReview findings → resolution
Adversarial review flagged 3 AC gaps + weak tests on 2 fixes. All addressed in this PR except two deeper reliability features intentionally split into follow-ups:
acMetgap): workerterminate/join/close_queueswas unguarded → routed every teardown call through a best-effort helper so all phases always run and nothing propagates. + test._log_shutdown_cause+_final_persistfromrun()and added direct tests for the INFO-vs-ERROR shutdown classification and the guarded final persist.Controller/CommandPipeline/ModelUpdater(a deadmultiprocessing.Processcan't restart in place) → Recreate dead extract/validate worker processes (follow-up to #511) #535.acMetgap → follow-up Surface failed staging→final moves as a model state + in-session retry (follow-up to #510) #536): core data-loss bugs are fixed and tested. Routing failures into a visible model state + in-session retry touches the model state machine → Surface failed staging→final moves as a model state + in-session retry (follow-up to #510) #536.cd; pwd; find .approach noted for a follow-up.Test plan
test_scan_file_with_latin_chars, is a pre-existing macOS-onlyErrno 92— it exists ondevelopand passes on CI's Linux runner.)ruff check,ruff format --check, andruff check --select C901(complexity ≤ 12) clean on all changed files.Closes #507. Closes #508. Closes #509. Closes #510. Closes #511. Closes #512. Closes #517. Closes #518. Closes #519. Closes #526.
#510 and #511 close here because their core fixes ship in this PR; the remaining enhancement scope continues in dedicated follow-ups:
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests