Skip to content

Fix backend bugs/reliability bucket from audit (#531)#537

Merged
nitrobass24 merged 24 commits into
developfrom
fix/backend-bugs-bucket-531
Jun 2, 2026
Merged

Fix backend bugs/reliability bucket from audit (#531)#537
nitrobass24 merged 24 commits into
developfrom
fix/backend-bugs-bucket-531

Conversation

@nitrobass24
Copy link
Copy Markdown
Owner

@nitrobass24 nitrobass24 commented Jun 2, 2026

Fixes the entire 🐛 Backend bugs / reliability bucket of tracking issue #531 (10 issues) in one branch, as requested. Each fix has its own commit and tests.

Note

Intentional grouping. This bundles 10 issues, deviating from CLAUDE.md's "one concern per branch" rule — done deliberately at the maintainer's request for a single grouped review. Commits are one-per-issue (plus review-polish commits) to keep it reviewable.

How this was built

A multi-agent workflow: parallel investigation (one agent per issue, re-verifying each finding against current code post-#534), sequential implementation (one commit per issue, ordered for the command_pipeline.py overlap between #510/#511), then parallel adversarial review of every landed fix. The review surfaced AC gaps, which were then addressed (see below).

Issues fixed

# Fix Commit
#507 Roll back config on any persist failure, not just OSError (the %-interpolation half already shipped in #534) 6e98254
#508 Controller.exit() best-effort so a hung lftp / raising worker reap can't leak processes + queue FDs 24712b3, 779a898
#509 Lock-guard AutoQueuePersist patterns + listener new_patterns set (mirrors PathPairsConfig) 61b3301, 66d19da
#510 Surface staging→final move failures via a result queue; discard key on failure; cross-device size + ENOTEMPTY fixes ea0cc8c
#511 Isolate extract/validate worker faults from the controller loop; surface a dead worker at ERROR 35bf824, 6bdf327
#512 Harden periodic + shutdown persist against transient write failures (ERROR + continue; classified shutdown) 8448ae6, 4761406
#517 Job-status parser robustness: anchored echo-strip, digit-required ETA regex, queue-header ordering 22c31fd
#518 Roll back auto-queue add/remove on persist failure (clears listener new_patterns) 7ccae90
#519 Expand ~ in remote validate find/hash commands; derive keys from the find base 6221513, 01ae955
#526 Bound controller action Event.wait() with a timeout → HTTP 504 instead of hanging request threads 7ff125c

Review findings → resolution

Adversarial review flagged 3 AC gaps + weak tests on 2 fixes. All addressed in this PR except two deeper reliability features intentionally split into follow-ups:

Test plan

  • ✅ Full unit suite: 839 passed, 59 skipped. (The one failure, test_scan_file_with_latin_chars, is a pre-existing macOS-only Errno 92 — it exists on develop and passes on CI's Linux runner.)
  • ✅ Web integration suite: 151 passed.
  • ruff check, ruff format --check, and ruff check --select C901 (complexity ≤ 12) clean on all changed files.
  • New tests accompany every fix (happy path + bug-reproduction + error/edge paths).

Closes #507. Closes #508. Closes #509. Closes #510. Closes #511. Closes #512. Closes #517. Closes #518. Closes #519. Closes #526.

#510 and #511 close here because their core fixes ship in this PR; the remaining enhancement scope continues in dedicated follow-ups:

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Controller action requests now time out with HTTP 504 after 30s
  • Bug Fixes

    • Atomic mutate→persist for config and auto-queue with full rollback on failure; concurrency and thread-safety hardened
    • Improved move failure reporting and safer cross-device fallback behavior
    • Better remote-path escaping/tilde handling for validations
    • More robust LFTP output parsing and ETA handling
    • More resilient shutdown and persistence handling to avoid teardown/data-loss
  • Tests

    • Expanded unit and integration coverage across concurrency, move/validate, controller teardown, persistence, and web handlers

nitrobass24 and others added 16 commits June 1, 2026 21:19
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#512)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AutoQueuePersist.__patterns and AutoQueuePersistListener.new_patterns were
mutated on the web handler thread (add_pattern/remove_pattern) while read on
the controller thread (patterns/to_str and __filter_candidates iteration).
A concurrent add/remove during process() raised
'RuntimeError: Set changed size during iteration', killing the ControllerJob.

Mirror common/path_pairs_config.py: guard the patterns list, to_str, and
listener registration with a threading.RLock (notifying listeners outside the
lock), and guard new_patterns with snapshot_new_patterns/clear_new_patterns
helpers that AutoQueue uses for atomic reads on the controller thread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MoveProcess now reports failures (vanished source, size mismatch) on a
failed-result queue instead of silently returning, mirroring ValidateProcess.
Size verification checks only the copied subtree so merging into a pre-existing
non-empty destination no longer false-mismatches, and rename ENOTEMPTY/EEXIST
falls back to copy+delete instead of re-raising. CommandPipeline.cleanup drains
pop_failed and discards the moved key on failure so force_scan retries the move.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s/FDs (#508)

Guard each pc.lftp.exit() in a best-effort loop (extracted into __shutdown_lftp_best_effort) so a hung/dead lftp PTY no longer aborts teardown. This ensures the worker terminate/join, __mp_logger.stop(), and close_queues() phases always run, reaping subprocesses and releasing multiprocessing.Queue FDs on every restart. Adds tests/unittests/test_controller/test_controller_exit.py covering the hung-lftp regression, multi-pair isolation, happy path, not-started no-op, and repeated-exit no-leak.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review of #508 found AC2 only partially met: lftp shutdown was guarded, but
the worker terminate/join/close_queues sequence was a flat unguarded run, so a
raising Process.terminate()/join() could still skip the FD-releasing
close_queues phase (and propagate out of exit()). Route every teardown call
through a best-effort helper so all phases always run and nothing propagates.
Adds a test where a worker terminate()/join() raises and close_queues still
runs for every worker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew)

The #511 fix isolates a worker fault so downloads continue, but
propagate_exception() consumes the queued fault only once, so after the first
cycle a dead worker went silent — extract/validate stayed disabled with no
operator signal. Report the dead state once at ERROR (sticky per worker).

Full automatic worker recreation (the issue's preferred AC1) is deferred to a
follow-up; this commit only makes the degraded-until-restart state visible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review found the two concurrency tests were ineffective: the new_patterns test
iterated the already-safe snapshot (so it never caught the live-set crash), and
the to_str test only checked JSON validity, which never breaks under the GIL
(it passed even against the unguarded code). Replace both with deterministic
lock-contract tests: while the shared lock is held, each mutator/reader must
block until release — failing on unguarded code, passing on the locked code,
with no reliance on GIL timing. Add a snapshot-independence test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review found AC2 (INFO-vs-ERROR shutdown classification) and AC3 (guarded final
persist) shipped untested because they were inline in run(), which can't be
unit-exercised. Extract them into _log_shutdown_cause() and _final_persist()
(mirroring the already-tested _persist_periodic) and add direct tests: an
intentional ServiceExit/ServiceRestart logs at INFO, a genuine crash logs at
ERROR with traceback, and a failing final persist is swallowed (so it can't
mask the original in-flight exception).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review noted the leaf-name anchor introduces a narrow false (non-checksum)
mismatch when a nested directory shares the validated leaf's name. No name-based
anchor is fully robust without the shell-expanded find root; document the
limitation and the robust follow-up approach (cd; pwd; find .) in the docstring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These workflow-authored test files passed ruff check but not ruff format
--check (which CI runs). Format-only change; no test behavior altered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 29b5f5cc-225c-4744-8f64-5540c9a43786

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds cross-cutting synchronization for auto-queue, process-safe move failure reporting, guarded controller teardown, persistence resiliency, request timeouts, remote-path escaping, parser fixes, and matching unit/integration tests.

Changes

Auto-queue thread-safety and handler atomicity

Layer / File(s) Summary
Auto-queue persistence and listener synchronization
src/python/controller/auto_queue.py
Adds RLock to AutoQueuePersist; patterns() returns a lock-protected snapshot; listeners mutated under lock and notified after release; AutoQueuePersistListener stores new_patterns under a lock and exposes drain_new_patterns().
Handler endpoint synchronization with rollback
src/python/web/handler/auto_queue.py
Per-handler __write_lock serializes existence-check → mutate → to_file → rollback flows; persistence failures (any Exception) trigger rollback and HTTP 500.
Auto-queue serialization tests
src/python/tests/integration/test_web/test_handler/test_auto_queue.py, src/python/tests/unittests/test_controller/test_auto_queue.py
Integration tests verify rollback of in-memory patterns and listener side-effects on persistence failure; unit tests exercise locking, blocking semantics, visibility, and drain behavior.

Move-process failure reporting and pipeline recovery

Layer / File(s) Summary
Move failure payload and queue
src/python/controller/move/move_process.py, src/python/controller/move/__init__.py
Adds MoveFailedResult, a process-safe failure queue, _report_failure(), pop_failed(), and close_queues() override; re-exports MoveFailedResult.
Cross-device fallback and size verification
src/python/controller/move/move_process.py
ENOTEMPTY/EEXIST now fallback to copy+delete; _get_copied_size() limits size verification to bytes copied by this op; on mismatch _report_failure() is used and source retained.
Move process and pipeline tests
src/python/tests/unittests/test_controller/test_move_process.py, src/python/tests/unittests/test_controller/test_command_pipeline.py
Adds SyncQueue stub and helpers; tests added for failure queuing, size mismatch, ENOTEMPTY fallback, spawn/start failure cleanup, command-pipeline finalization and dead-worker reporting/isolation.

Controller and pipeline robustness

Layer / File(s) Summary
Controller start/exit hardening
src/python/controller/controller.py
Controller.start sets __started early; Controller.exit performs per-pair best-effort lftp.exit(), deterministic worker iteration, exception-guarded terminate/join/close_queues, bounded joins, and explicit active_scanner closure.
CommandPipeline move finalization and worker isolation
src/python/controller/command_pipeline.py
Adds __reported_dead_workers, isolates propagate_exceptions per-worker with logging, adds _finalize_move_process() to surface failures, drop moved-file keys on failure, and trigger rescans; reorders spawn bookkeeping.
Controller and pipeline tests
src/python/tests/unittests/test_controller/test_controller_exit.py, src/python/tests/unittests/test_controller/test_command_pipeline.py
Unit tests assert exit() completes across lftp/worker failures, bounded joins/timeouts, repeated exits without leaks, and per-worker exception isolation and finalization semantics.

Persistence resilience and request timeout guarding

Layer / File(s) Summary
Seedsync persistence helpers
src/python/seedsync.py
Adds _persist_periodic(), _log_shutdown_cause(e), and _final_persist() so periodic/final persists are guarded and shutdown logging distinguishes intentional vs unexpected causes.
Config and controller handler changes
src/python/web/handler/config.py, src/python/web/handler/controller.py
ConfigHandler now rolls back on any persist Exception. Controller handler adds _ACTION_TIMEOUT_IN_SECS and makes WebResponseActionCallback.wait() return bool so handlers return HTTP 504 when controller callbacks time out.
Persistence and timeout tests
src/python/tests/integration/test_web/test_handler/test_config.py, src/python/tests/integration/test_web/test_handler/test_controller.py, src/python/tests/unittests/test_seedsync.py
Tests added to verify rollback on non-OSError persistence failures, 504 on stalled controller callbacks, and Seedsync persistence/shutdown logging resilience.

Remote-path escaping and job-status parsing stability

Layer / File(s) Summary
ValidateProcess remote escaping and keys
src/python/controller/validate/validate_process.py
Introduces _escape_remote() and _remote_key() to build/interpret remote find/hash commands robustly (handles ~ expansion via $HOME correctly and anchors keys on final name component).
LFTP job-status parser fixes
src/python/lftp/job_status_parser.py
Tightens ETA day regex, restricts jobs -v echo stripping to line boundaries, and peeks queue lines to avoid consuming unrelated lines.
Validate and parser tests
src/python/tests/unittests/test_controller/test_validate/test_validate_process.py, src/python/tests/unittests/test_lftp/test_job_status_parser.py
Adds tests for tilde/absolute remote-path command generation, hash command expansion, digit-less ETA handling, literal 'jobs -v' preservation, malformed ETA tolerance, and queued-command parsing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix backend bugs/reliability bucket from audit (#531)' directly refers to the PR's purpose: addressing tracked issues from audit #531. It accurately summarizes the main objective of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/backend-bugs-bucket-531

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/python/web/handler/auto_queue.py (1)

50-64: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move the duplicate/missing checks under __write_lock.

These membership checks are outside the critical section, so two concurrent identical requests can both pass the precheck; the second request then no-ops under the lock, persists unchanged state, and still returns 200 instead of the documented 400. The existence test needs to be atomic with the mutation/persist block.

Also applies to: 100-109

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/python/web/handler/auto_queue.py` around lines 50 - 64, Move the
membership/existence checks for aqp against self.__auto_queue_persist.patterns
inside the critical section guarded by self.__write_lock so the test-and-set is
atomic; specifically, perform the "if aqp in self.__auto_queue_persist.patterns:
return HTTPResponse(... 400 ...)" check while holding self.__write_lock before
calling add_pattern and persisting, ensuring the mutate → persist → rollback
sequence (and the AutoQueuePersistListener pattern_added/new_patterns behavior
and controller replay) cannot be raced by concurrent requests; apply the same
change to the other duplicate/missing checks around the add/remove flows (the
block referenced at lines 100-109).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/python/controller/auto_queue.py`:
- Line 262: The current process() flow snapshots new_patterns and then later
calls __persist_listener.clear_new_patterns() separately, which can drop
patterns added by concurrent add_pattern(); instead modify process() to, under
the same __persist_listener lock, atomically drain the listener into a local
snapshot (e.g. call a single “drain” operation or call get-and-clear while
holding the listener's lock), then reuse that drained snapshot for all
subsequent queue/extract/delete filtering logic; remove the separate
clear_new_patterns() call and ensure add_pattern() still appends to the listener
normally, so no patterns added after the drain are lost for this cycle.

In `@src/python/controller/command_pipeline.py`:
- Around line 421-422: The current loop forces a scan for every PairContext
(self._pair_contexts) after a move finishes; instead, locate the owning
PairContext by matching PairContext.pair_id to move_process.pair_id and call
that context's local_scan_process.force_scan() only for that pair, and only if
that lookup fails fall back to iterating self._pair_contexts and calling
local_scan_process.force_scan() on each; update the code paths around
move_process and _pair_contexts (reference functions/objects:
move_process.pair_id, PairContext, self._pair_contexts,
local_scan_process.force_scan()) to perform the targeted rescan.
- Around line 398-399: The fix: before removing a finished MoveProcess from
Controller.active_move_processes, ensure you reap it by calling
MoveProcess.close_queues() and join() after _finalize_move_process(move_process)
so queues and the process are released; update the loop that builds
still_active_moves (and the same logic covering lines 401-423) to call
move_process.close_queues() and move_process.join() for processes that are no
longer active/finalized, then exclude them from still_active_moves so
Controller.exit() won’t need to clean them up later.

In `@src/python/controller/controller.py`:
- Around line 385-386: The current logic returns early from exit() when
self.__started is False, which skips teardown after a partial failure in start()
and leaks started children; either set self.__started = True before launching
child.start() calls and on any exception call exit() to teardown and then reset
__started = False and re-raise, or make exit() best-effort by removing the
early-return so it always attempts to stop/cleanup any children (iterate over
started children and call their stop/teardown methods) regardless of __started;
update start(), exit(), and the child.start()/child.stop() handling to implement
one of these fixes and ensure consistent state after exceptions.

In `@src/python/tests/unittests/test_controller/test_auto_queue.py`:
- Around line 286-294: The helper starts a probe thread with
threading.Thread(target=run) that can remain alive if op() hangs; mark that
probe thread as a daemon before starting it (i.e., set thread.daemon = True on
the Thread object created in the block using variables thread/run) so the test
process won't hang if the worker never returns; keep the rest of the logic
(lock, started.wait, done.wait, thread.join, assertions) unchanged.

In `@src/python/web/handler/auto_queue.py`:
- Around line 74-81: The except block that currently only catches OSError around
self.__auto_queue_persist.to_file(self.__persist_path) can miss other failures
(e.g. exceptions raised from Persist.to_str()), leaving __patterns and listener
state mutated without rollback; change the handlers around
__auto_queue_persist.to_file(...) (both the add-path and the similar block at
the second occurrence) to catch Exception (or a broader base that includes
to_str errors) so that on any failure you call
self.__auto_queue_persist.remove_pattern(aqp) and then log via
self.__logger.exception("Failed to persist auto-queue after adding pattern %r",
pattern) to ensure rollback runs for all to_file() failures.

In `@src/python/web/handler/controller.py`:
- Around line 118-120: The code enqueues the command via
self.__controller.queue_command(command) before calling callback.wait(), so
timing out after enqueue incorrectly returns a 504 while the non-idempotent
commands (DELETE_REMOTE, DELETE_LOCAL, EXTRACT, QUEUE) may still run; either
perform the timeout check before calling queue_command (i.e., call
callback.wait() first) or, if you must enqueue first, change the timeout branch
to return a 202 Accepted (or another acknowledgement) instead of 504 and
document that execution may continue asynchronously; adjust logic around
callback.wait() and queue_command to ensure responses reflect whether the
request was merely accepted or truly failed.

---

Outside diff comments:
In `@src/python/web/handler/auto_queue.py`:
- Around line 50-64: Move the membership/existence checks for aqp against
self.__auto_queue_persist.patterns inside the critical section guarded by
self.__write_lock so the test-and-set is atomic; specifically, perform the "if
aqp in self.__auto_queue_persist.patterns: return HTTPResponse(... 400 ...)"
check while holding self.__write_lock before calling add_pattern and persisting,
ensuring the mutate → persist → rollback sequence (and the
AutoQueuePersistListener pattern_added/new_patterns behavior and controller
replay) cannot be raced by concurrent requests; apply the same change to the
other duplicate/missing checks around the add/remove flows (the block referenced
at lines 100-109).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 27417b73-1fba-4d7a-b0d9-feb054163b23

📥 Commits

Reviewing files that changed from the base of the PR and between af53304 and 6e00cad.

📒 Files selected for processing (21)
  • src/python/controller/auto_queue.py
  • src/python/controller/command_pipeline.py
  • src/python/controller/controller.py
  • src/python/controller/move/__init__.py
  • src/python/controller/move/move_process.py
  • src/python/controller/validate/validate_process.py
  • src/python/lftp/job_status_parser.py
  • src/python/seedsync.py
  • src/python/tests/integration/test_web/test_handler/test_auto_queue.py
  • src/python/tests/integration/test_web/test_handler/test_config.py
  • src/python/tests/integration/test_web/test_handler/test_controller.py
  • src/python/tests/unittests/test_controller/test_auto_queue.py
  • src/python/tests/unittests/test_controller/test_command_pipeline.py
  • src/python/tests/unittests/test_controller/test_controller_exit.py
  • src/python/tests/unittests/test_controller/test_move_process.py
  • src/python/tests/unittests/test_controller/test_validate/test_validate_process.py
  • src/python/tests/unittests/test_lftp/test_job_status_parser.py
  • src/python/tests/unittests/test_seedsync.py
  • src/python/web/handler/auto_queue.py
  • src/python/web/handler/config.py
  • src/python/web/handler/controller.py

Comment thread src/python/controller/auto_queue.py Outdated
Comment thread src/python/controller/command_pipeline.py
Comment thread src/python/controller/command_pipeline.py Outdated
Comment thread src/python/controller/controller.py
Comment thread src/python/tests/unittests/test_controller/test_auto_queue.py
Comment thread src/python/web/handler/auto_queue.py
Comment thread src/python/web/handler/controller.py
nitrobass24 and others added 5 commits June 2, 2026 08:58
Pyright (strict) flagged the new __iter_worker_processes/__safe_teardown
(#508) and __report_if_worker_dead (#511) helpers for missing parameter and
return types. Annotate them (Iterator[AppProcess], Callable[[], object],
worker: AppProcess) so the Python Type Check job passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review found a race: process() snapshotted new_patterns per-gather and then
called clear_new_patterns() separately at the end, so a pattern added
concurrently between the snapshot and the clear was wiped without ever being
applied retroactively. Replace snapshot_new_patterns + clear_new_patterns with
a single atomic drain_new_patterns() (copy + clear under the listener lock),
drained once per cycle and threaded through the gather/filter calls; a pattern
added after the drain is retained for the next cycle. Tests updated + a
regression test for the add-after-drain case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Move the "already exists"/"doesn't exist" membership check inside __write_lock
  so the test-and-set is atomic against concurrent add/remove requests.
- Broaden the persist-failure rollback from `except OSError` to `except
  Exception` (both add and remove) so a to_str()/serialization failure also
  rolls back __patterns and the listener side effect (mirrors the #507 config
  handler). Adds a non-OSError rollback test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Reap a finished MoveProcess in cleanup() (join + close_queues) so its FDs are
  released immediately. It's dropped from active_move_processes, so
  Controller.exit() never sees it and would otherwise leak its queue FDs until
  shutdown (#510 added a per-move failed-result queue).
- Rescan only the move's owning pair (matched by pair_id) instead of every
  pair; fall back to all pairs if the owner can't be located.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
 review)

If a child.start() raised partway through start(), __started stayed False and
exit() early-returned, leaking the already-started processes and their queue
FDs. Set __started = True before launching children so the job's cleanup path
(best-effort exit()) tears down whatever started. Adds a partial-start test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/python/web/handler/auto_queue.py (1)

50-85: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

These handlers still expose uncommitted pattern state to the controller.

add_pattern() / remove_pattern() publish the change to AutoQueuePersist.__patterns and fire the listener callbacks before to_file() succeeds. AutoQueue.process() runs on another thread and can read persist.patterns or drain new_patterns during that window, so a failed add can still queue/extract/delete work after the request returns 500, and a failed remove can miss one cycle of matching work before rollback restores the pattern. To make the rollback guarantee real, you need to keep controller reads out of this mutate→persist→rollback window or stage/persist a copy and only publish the live mutation after the write succeeds.

Also applies to: 104-121

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/python/web/handler/auto_queue.py` around lines 50 - 85, The handler
currently calls __auto_queue_persist.add_pattern/remove_pattern which publish
changes (fire listeners and update AutoQueuePersist.__patterns/new_patterns)
before to_file() succeeds, allowing AutoQueue.process() to observe uncommitted
state; change the flow so mutations are not published until after a successful
persist: either (A) add staged methods on AutoQueuePersist (e.g.
stage_add_pattern(stage_aqp) and stage_remove_pattern(stage_aqp) that only
update an internal staged buffer and to_file_from_stage(path) that writes the
staged state, followed by a publish_staged() that updates __patterns and fires
listeners), or (B) add optional publish=False to add_pattern/remove_pattern so
the handler can call add_pattern(aqp, publish=False); then call to_file(...) and
only on success call add_pattern(aqp, publish=True) (or a dedicated publish()
method); ensure AutoQueue.process() reads only the live __patterns/new_patterns
that are updated by the publish path and keep using __write_lock to protect the
mutate→persist→publish sequence.
src/python/controller/controller.py (1)

396-397: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Bound teardown join() calls. In src/python/controller/controller.py (Controller.exit() lines 396-397), teardown uses a blocking p.join() with no timeout. Even though AppProcess.terminate() force-terminates after ~1s, a worker that still doesn’t exit (e.g., stuck/uninterruptible I/O) can keep exit() hung here, preventing close_queues() and stalling shutdown / leaking queue FDs across restarts. Use p.join(timeout=...) and, if p.is_alive() afterward, log and continue teardown without calling close_queues() for still-alive processes (or force-kill + bounded join).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/python/controller/controller.py` around lines 396 - 397, In
Controller.exit(), avoid blocking forever on process joins: replace the current
unbounded self.__safe_teardown("join", p.join) calls with a bounded join
pattern—call p.join(timeout=JOIN_TIMEOUT) (pick a sensible constant, e.g.,
1-2s), then if p.is_alive() log a warning including the process id/name and
continue teardown without calling close_queues() for that still-alive process
(or attempt a forceful termination via AppProcess.terminate() followed by
another bounded p.join()); ensure __safe_teardown or the exit loop reflects this
timeout-and-check flow so shutdown proceeds even when a worker is stuck.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/python/controller/command_pipeline.py`:
- Around line 399-404: spawn_move_process currently appends MoveProcess to
self.active_move_processes and adds move_key to self.moved_file_keys before
calling process.start(), so if start() raises cleanup() will treat the process
as not started and call move_process.join()/move_process.close_queues() (which
can raise AssertionError) and will not run _finalize_move_process() to remove
the stale move_key. Fix by starting the process before publishing bookkeeping
(call move_process.start() prior to adding move_key to moved_file_keys and
appending to active_move_processes) or, if you prefer to keep the current order,
wrap cleanup’s join/close with a guard that checks move_process.is_alive() or
tracks a started flag and on start() exception remove the move_key from
moved_file_keys and avoid calling join()/close_queues(); ensure
_finalize_move_process() still runs or that the stale key is removed on start
failure.

---

Outside diff comments:
In `@src/python/controller/controller.py`:
- Around line 396-397: In Controller.exit(), avoid blocking forever on process
joins: replace the current unbounded self.__safe_teardown("join", p.join) calls
with a bounded join pattern—call p.join(timeout=JOIN_TIMEOUT) (pick a sensible
constant, e.g., 1-2s), then if p.is_alive() log a warning including the process
id/name and continue teardown without calling close_queues() for that
still-alive process (or attempt a forceful termination via
AppProcess.terminate() followed by another bounded p.join()); ensure
__safe_teardown or the exit loop reflects this timeout-and-check flow so
shutdown proceeds even when a worker is stuck.

In `@src/python/web/handler/auto_queue.py`:
- Around line 50-85: The handler currently calls
__auto_queue_persist.add_pattern/remove_pattern which publish changes (fire
listeners and update AutoQueuePersist.__patterns/new_patterns) before to_file()
succeeds, allowing AutoQueue.process() to observe uncommitted state; change the
flow so mutations are not published until after a successful persist: either (A)
add staged methods on AutoQueuePersist (e.g. stage_add_pattern(stage_aqp) and
stage_remove_pattern(stage_aqp) that only update an internal staged buffer and
to_file_from_stage(path) that writes the staged state, followed by a
publish_staged() that updates __patterns and fires listeners), or (B) add
optional publish=False to add_pattern/remove_pattern so the handler can call
add_pattern(aqp, publish=False); then call to_file(...) and only on success call
add_pattern(aqp, publish=True) (or a dedicated publish() method); ensure
AutoQueue.process() reads only the live __patterns/new_patterns that are updated
by the publish path and keep using __write_lock to protect the
mutate→persist→publish sequence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c9d32357-fe1c-4de9-a6f2-8175a91935c4

📥 Commits

Reviewing files that changed from the base of the PR and between 6e00cad and 10dea66.

📒 Files selected for processing (8)
  • src/python/controller/auto_queue.py
  • src/python/controller/command_pipeline.py
  • src/python/controller/controller.py
  • src/python/tests/integration/test_web/test_handler/test_auto_queue.py
  • src/python/tests/unittests/test_controller/test_auto_queue.py
  • src/python/tests/unittests/test_controller/test_command_pipeline.py
  • src/python/tests/unittests/test_controller/test_controller_exit.py
  • src/python/web/handler/auto_queue.py

Comment thread src/python/controller/command_pipeline.py
nitrobass24 and others added 2 commits June 2, 2026 10:34
spawn_move_process added move_key and appended to active_move_processes before
process.start(). If start() raised, the next cleanup() saw a never-started
process (is_alive() False) and called join()/close_queues() on it (raising
AssertionError), while the stale move_key blocked every future retry. Start the
process first, then publish bookkeeping only on success.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
exit() joined each worker with no timeout, so a worker stuck in uninterruptible
I/O (e.g. a dead NAS mount ignoring the SIGTERM from terminate()) would hang
join() — and exit() — forever, wedging ServiceRestart. Join with a bounded
timeout and, if a worker is still alive, log a warning and continue teardown.
Adds a stuck-worker test (timeout-guarded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/python/tests/unittests/test_controller/test_controller_exit.py`:
- Around line 192-210: The test
test_partial_start_failure_leaves_started_true_so_exit_cleans_up currently
asserts terminate and close_queues were called on each pair process but omits
verifying join was invoked; update the teardown assertions after calling
self.controller.exit() (in the same test) to also assert that for each proc in
self._all_pair_processes() proc.join.assert_called_once() so the test matches
the full-teardown expectations and mirrors _assert_full_teardown() behavior;
locate uses of self.extract_process.start, self.controller.start,
self.controller.exit and self._all_pair_processes() to add the join assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9fdfeebe-9807-49ad-ab20-1fea5cf234ca

📥 Commits

Reviewing files that changed from the base of the PR and between 6e00cad and 0043d5e.

📒 Files selected for processing (8)
  • src/python/controller/auto_queue.py
  • src/python/controller/command_pipeline.py
  • src/python/controller/controller.py
  • src/python/tests/integration/test_web/test_handler/test_auto_queue.py
  • src/python/tests/unittests/test_controller/test_auto_queue.py
  • src/python/tests/unittests/test_controller/test_command_pipeline.py
  • src/python/tests/unittests/test_controller/test_controller_exit.py
  • src/python/web/handler/auto_queue.py

Comment thread src/python/tests/unittests/test_controller/test_controller_exit.py
The partial-start test verified terminate + close_queues per pair process but
omitted join. exit() bounded-joins each worker, so assert join too, matching
_assert_full_teardown's full-teardown expectations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nitrobass24 nitrobass24 merged commit 416047c into develop Jun 2, 2026
18 checks passed
@nitrobass24 nitrobass24 deleted the fix/backend-bugs-bucket-531 branch June 2, 2026 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment