feat: v0.1.81#1073
Conversation
- FailoverConfig dataclass with enabled=False default (full back-compat) - FailoverRouter with primary/secondary region routing per backend - Auto-failover on CB OPEN with health probe before committing - TOCTOU-safe probe via _failover_pending flag (prevents concurrent double-probe) - Graceful recovery on CB CLOSED after configurable min_failover_duration - Opt-in: LLMCircuitBreaker gains failover=None kwarg (default preserves all existing behavior; no changes to call sites that omit it) - Zero new dependencies (pure Python) - 32 tests (unit + adversarial + concurrency), all CB tests pass (220 total) Phase 6 of circuit breaker roadmap (after #1065 Phase 5 adaptive thresholds).
Six rounds of adversarial review (3 reviewers x 6 rounds = 18 review agents) on the original Phase 6 implementation found and fixed: Round 1 (3 BLOCKING + 5 minor): - reset() vs in-flight probe race: probe commit could overwrite admin reset. Added _failover_pending guard at commit; added _closed_during_probe latch. - BaseException probe re-open did not notify FailoverRouter. Added failover.on_cb_state_change call in both store and in-memory paths with try/except wrapper. - Whitespace-only region/backend names accepted. Added .strip() check. - reset() accepting unknown backend, missing failover docstring, test category counts. Round 2 (2 MEDIUM): - CLOSED dropped during probe -> stuck on secondary. Added _closed_during_probe latch consumed by commit guard. - Deferred recovery stranding: get_active_region now applies lazy recovery inline. Round 3 (2 MEDIUM): - snapshot() and is_failed_over() did not apply lazy recovery. Extracted _try_lazy_recovery_unlocked helper, called from all 3 observation methods. - _failover_pending stuck window after commit. Pending now cleared atomically with commit. Round 4 (1 MEDIUM): - Finally block race: a committed probe's finally unconditionally cleared _failover_pending, wiping a subsequent concurrent probe's pending claim. Guarded finally with `if not committed:`. Round 5: CLEAN (1st) Round 6: CLEAN (2nd consecutive) Tests: 32 -> 45 (+13). Includes adversarial coverage for: - TOCTOU CLOSED-during-probe abort - Reset during in-flight probe - BaseException probe-reopen notification - Lazy recovery on get_active_region/is_failed_over/snapshot - Negative elapsed (clock skew backward) does not trigger recovery - min_failover_duration_seconds=0 immediate recovery semantics - Whitespace region/backend rejection Diff: +480 / -82 lines across cb_failover.py, llm_circuit_breaker.py, and test_cb_failover.py. All 240 circuit_breaker tests pass; pre-commit clean.
Actionable comments: - Remove unused cb_registry parameter from FailoverRouter.__init__ (was stored but never read; speculative API). - Enforce health_check_timeout_seconds via daemon thread + join(timeout): a hanging health probe must not block the calling thread or leave _failover_pending stuck True. Daemon thread leaks on timeout but does not delay process shutdown. Probe authors should still use HTTP client timeouts to avoid the leak in practice. Nitpicks: - Extract _notify_failover(prev_state, new_state) helper that wraps the router callback in try/except. Replaces 11 bare on_cb_state_change call sites in record_failure, record_success, force_open, reset, and the BaseException probe-reopen path. Catches Exception and other BaseException subclasses (asyncio.CancelledError, GeneratorExit) so an observer bug cannot break CB state mutation, but explicitly re-raises KeyboardInterrupt and SystemExit so user/process control signals are never silently swallowed. - Add docstrings to test classes, fixtures, and PostCommitHookLock to push docstring coverage from 70.79 percent to 81.45 percent above the CodeRabbit pre-merge threshold. - Restructure _handle_open so a try/finally wraps the whole probe-slot claim block. A new pending_claimed local flag is set inside the lock before _failover_pending is mutated, so a BaseException arriving in the slot-claim window cannot leave _failover_pending stuck True. - Drop dead `if region_list:` guard in reset() that follows an earlier membership check. Tests added (4): - test_hanging_probe_times_out_per_health_check_timeout_seconds - test_notify_failover_propagates_keyboard_interrupt - test_notify_failover_propagates_system_exit - test_notify_failover_swallows_generic_exception Verification: - 244 circuit_breaker tests pass (49 failover + 195 prior). - pre-commit clean. - Docstring coverage 81.45 percent above the 80 percent threshold. - 4-round F.R.I.D.A.Y. 3-body review with 2-consecutive CLEAN at R3+R4. Deferred to follow-up: - Probe loop early-abort between candidates (CodeRabbit nitpick N2, acknowledged as not a correctness bug). - User-facing YAML schema and design docs (Phase 5 PR #1065 precedent shipped without; will follow up in a separate docs PR).
Outside-diff actionable:
- Add monotonic transition seq counter (LLMCircuitBreaker._transition_seq)
captured under self._lock at every transition site (in-memory paths
capture inline; store paths use _next_seq helper). Pass via
_notify_failover(prev, new, seq=...) to FailoverRouter.on_cb_state_change.
Router maintains _last_seq[backend] and drops notifications whose seq
is less than or equal to the last applied value, fixing the
out-of-order delivery race that could leave the router stuck on the
secondary while the CB is actually closed.
- on_cb_state_change seq parameter is required (kwarg-only, no default)
so a forgotten / sentinel-zero call cannot bypass the drop check.
Nitpicks:
- Narrow _probe_runner worker thread catch from BaseException to
Exception so KeyboardInterrupt and SystemExit are not silently
absorbed by the daemon thread, matching the policy in
_notify_failover (re-raise KI/SE).
- Log a one-shot WARNING when FailoverRouter is constructed with
config.enabled=True and no explicit health_probe, since the
default lambda returns True for all regions and is not production
safe.
- Strengthen probe_call_count assertion in the 8-thread concurrency
test from <= 1 to == 1 to catch a regression where a commit bypasses
the probe path.
- Expand _notify_failover docstring to a Google-style Args section
and accurately describe the BaseException-except-KI/SE catch scope
including asyncio.CancelledError, GeneratorExit, MemoryError.
- on_cb_state_change validates new_state in ("open", "closed") BEFORE
advancing _last_seq, so a stray notify with new_state="half_open"
cannot consume a seq slot and silently block subsequent legitimate
notifications.
- reset() now clears _last_seq[backend] alongside the other tracking
dicts so a hot-swapped CB instance whose _transition_seq starts at
0 can resume notifying without having its early notifies dropped
as stale.
Tests added (5):
- test_stale_seq_notifications_dropped_by_router
- test_default_probe_warning_when_enabled
- test_default_probe_no_warning_when_disabled
- test_unknown_new_state_does_not_consume_seq_slot
- test_reset_clears_last_seq_for_subsequent_lower_seq_notifies
Plus 50+ existing test call sites updated to pass explicit seq=.
Verification:
- 249 circuit_breaker tests pass (54 failover + 195 prior).
- pre-commit clean.
- 6-round F.R.I.D.A.Y. 3-body review with 2-consecutive CLEAN at R5+R6.
Duplicate / outside-diff: - Add per-claim probe generation token (FailoverRouter._probe_gen) to close the reset+concurrent-OPEN race. _handle_open captures my_gen alongside the pending claim. The probe-success commit guard and the finally cleanup both refuse to proceed when the stored generation has moved beyond my_gen, so a stale probe can no longer (a) commit under a fresh owner's pending claim, nor (b) wipe the fresh claim from a trailing finally. reset() bumps the generation under the same lock as the other state clears so any in-flight probe sees the mismatch. Nitpicks: - FailoverConfig is now @DataClass(frozen=True). __post_init__ rebuilds regions as types.MappingProxyType({backend: tuple(regions)}) so neither the outer mapping nor the inner per-backend sequence can be mutated after construction (validation cannot be bypassed). - regions field annotation switched from dict[str, list[str]] to Mapping[str, Sequence[str]] so type checkers reject mutation attempts (cfg.regions["x"] = ..., cfg.regions["x"].append(...)) statically as well as at runtime. Internal _handle_open / _handle_closed / _try_lazy_recovery_unlocked parameter annotations also changed to Sequence[str]. - __post_init__ materializes each region sequence into a tuple BEFORE iterating so iterator inputs are no longer consumed by the validation loop and reported as "must be a non-empty list". - test_no_secondary_configured_logs_warning_no_crash now uses caplog and asserts the WARNING message, matching the test name. - Hanging-probe wall-clock bound relaxed from < 1.0s to < 2.0s for slow CI runners (still > 6x the theoretical max). - _next_seq() docstring expanded with the store-path ordering caveat. Tests added (2): - test_post_construction_mutation_rejected: covers all 4 mutation paths (frozen field assign, inner tuple append, outer MappingProxyType setitem, outer MappingProxyType delitem). - test_reset_then_concurrent_new_open_does_not_lose_new_failover: exercises the race the generation token closes -- T1 in-flight probe, admin reset(), T2 fresh OPEN, T1 probe completes late. T1 must not commit under T2's slot or wipe T2's claim. Verification: - 251 circuit_breaker tests pass. - pre-commit clean. - 4-round F.R.I.D.A.Y. 3-body review with 2-consecutive CLEAN at R3+R4.
Actionable + nitpick findings, plus internal review-fix loop additions: - Concurrency test diversifies seq per worker (was all seq=1, dropped 7/8 via dedup) so the pending-claim race is actually exercised. - Post-commit window test injects CLOSED with a strictly-greater seq so _handle_closed actually runs (was being silently dropped as stale). - Probe loop early-abort: per-iteration lock + generation + state check so we stop probing remaining secondaries once reset() or CLOSED makes the commit unreachable. - Timeout branch now `continue`s, suppressing the duplicate "trying next" warning. - f-string concat collapsed to single literals in __post_init__. - Args sections added to _try_lazy_recovery_unlocked, _handle_open, _handle_closed (locking contract documented). - pytest.raises tightened to AttributeError + raw regex string. Cross-thread state correctness: - on_cb_state_change releases self._lock between seq update and dispatch, which let low-seq OPEN + high-seq CLOSED dispatched concurrently produce a state mismatch (CB CLOSED, router on secondary). Pass seq through to _handle_open/_handle_closed and abort at top of each when _last_seq has advanced past us. - New `_last_state[backend]` tracks the latest CB state notified; the probe-commit guard reads _last_state instead of _closed_during_probe so OPEN(1)+CLOSED(2)+OPEN(3) interleaved with an in-flight probe correctly commits when the latest state is OPEN. - _closed_during_probe field removed (now dead state). - reset() bumps _probe_gen and clears _last_state alongside the other per-backend trackers. Probe runner hardening: - async health_probe rejected at construction via a richer check that catches `async def`, async generator functions, and callable instances whose __call__ is async (extends from `asyncio.iscoroutinefunction`). - Probe runner inspects the return value and rejects coroutines, awaitables, async generators, and sync generators (closing them best- effort to suppress "never awaited" / ResourceWarning) -- bool(...) is always True for these and would silently report healthy. FailoverConfig hardened: - @DataClass(frozen=True) + types.MappingProxyType outer + tuple inner for `regions`. Mapping[str, Sequence[str]] field annotation rejects mutation statically. __post_init__ materializes each region sequence into tuple BEFORE iteration so iterator inputs do not mis-error. Tests added (8 new, 60 total, 1277 lines): - test_post_construction_mutation_rejected (4 mutation paths) - test_async_health_probe_rejected_at_construction (3 async forms) - test_lambda_returning_coroutine_treated_as_probe_failure - test_sync_callable_returning_async_generator_treated_as_probe_failure - test_sync_callable_returning_sync_generator_treated_as_probe_failure - test_reset_then_concurrent_new_open_does_not_lose_new_failover - test_closed_can_race_between_seq_update_and_open_probe_claim - test_open_close_open_during_probe_commits_for_latest_open - test_open_close_during_probe_does_not_commit Verification: - 258 circuit_breaker tests pass. - pre-commit clean. - 8-round F.R.I.D.A.Y. 3-body review with 2-consecutive CLEAN at R7+R8.
All 3 findings are non-blocking nits: - _next_seq() docstring: add WARNING that in-memory transition paths must NOT call this helper (they capture self._transition_seq inline under the existing mutation lock). - FailoverConfig.__post_init__: O(n^2) duplicate detection via regions_tuple.count(r) replaced with O(n) collections.Counter scan. Behavior unchanged (still raises ValueError listing duplicates); the new list is sorted for stable error messages. - _probe_runner: best-effort close() on a coroutine/awaitable/generator return value now logs at DEBUG when close() raises, instead of silently swallowing. Verification: - 258 circuit_breaker tests pass. - pre-commit clean.
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (12)
💤 Files with no reviewable changes (1)
✅ Files skipped from review due to trivial changes (6)
👮 Files not reviewed due to content moderation or server errors (5)
📝 WalkthroughWalkthroughAdds a new FailoverRouter (and FailoverConfig) for multi-region LLM routing, integrates it into LLMCircuitBreaker with per-transition sequencing and notifications, provides extensive failover tests, and bumps the package version from 0.1.80 to 0.1.81. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CB as LLMCircuitBreaker
participant Router as FailoverRouter
participant Probe as HealthProbe
participant Clock as Clock
CB->>CB: detect state transition (seq++)
CB->>Router: on_cb_state_change(backend, old, new, seq)
alt new == "open" and Router.enabled
Router->>Router: attempt claim (check _failover_pending, generation)
Router->>Probe: probe secondary region (daemon thread)
Probe-->>Router: healthy / unhealthy / exception / timeout
alt healthy
Router->>Router: commit active_region, set failover_at, clear pending
else unhealthy/exception/timeout
Router->>Router: try next region or clear pending on exhaustion
end
else new == "closed"
Router->>Clock: read time
Router->>Router: if min_failover_duration met -> restore primary else defer (lazy checks)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 5❌ Failed checks (3 warnings, 2 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
feat: add multi-region failover for circuit breaker (Phase 6)
There was a problem hiding this comment.
🧹 Nitpick comments (1)
massgen/backend/cb_failover.py (1)
504-517: Best-effort cleanup is a no-op for async generators.For async generators,
getattr(result, "close", None)returnsNonebecause async generators exposeaclose()(async) rather thanclose()(sync), so the cleanup branch is silently skipped in that case. It works as intended for coroutines and sync generators, which both implement synchronousclose(). Since the docstring already documents this as best-effort, and the regression behavior is correct (region rejected viaTypeError), this is a minor inefficiency — the unnecessarygetattrcall adds noise without benefit for that case.For fully clean shutdown, consider skipping the close attempt for async generators (they will be closed on garbage collection):
♻️ Optional cleanup tightening
- if inspect.isawaitable(result) or inspect.isasyncgen(result) or inspect.isgenerator(result): - close = getattr(result, "close", None) - if callable(close): - try: - close() - except Exception as close_exc: # noqa: BLE001 -- best-effort cleanup - logger.debug( - "Best-effort close() on probe return value raised %s; ignoring.", - type(close_exc).__name__, - ) + if inspect.isawaitable(result) or inspect.isasyncgen(result) or inspect.isgenerator(result): + # Sync generators and coroutines support close(); async + # generators only expose async aclose(), which we cannot + # await from this sync worker -- rely on GC for those. + if not inspect.isasyncgen(result): + close = getattr(result, "close", None) + if callable(close): + try: + close() + except Exception as close_exc: # noqa: BLE001 -- best-effort cleanup + logger.debug( + "Best-effort close() on probe return value raised %s; ignoring.", + type(close_exc).__name__, + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/cb_failover.py` around lines 504 - 517, The cleanup branch is attempting to call a synchronous close() on async generators (result) which have only aclose(), so the getattr(close) call is a no-op for async generators; update the logic in the block handling inspect.isawaitable(result) or inspect.isasyncgen(result) or inspect.isgenerator(result) to skip the synchronous close() attempt when inspect.isasyncgen(result) is true (only try close() for coroutines or sync generators), keep setting sink[0] to the TypeError as-is, and leave the TypeError/return behavior unchanged; refer to the local symbols result, sink[0], and the inspect.isasyncgen/inspect.isawaitable checks to locate where to adjust the conditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@massgen/backend/cb_failover.py`:
- Around line 504-517: The cleanup branch is attempting to call a synchronous
close() on async generators (result) which have only aclose(), so the
getattr(close) call is a no-op for async generators; update the logic in the
block handling inspect.isawaitable(result) or inspect.isasyncgen(result) or
inspect.isgenerator(result) to skip the synchronous close() attempt when
inspect.isasyncgen(result) is true (only try close() for coroutines or sync
generators), keep setting sink[0] to the TypeError as-is, and leave the
TypeError/return behavior unchanged; refer to the local symbols result, sink[0],
and the inspect.isasyncgen/inspect.isawaitable checks to locate where to adjust
the conditional.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d9bd9a6-3a98-4859-86ef-baf8c859ecb6
📒 Files selected for processing (3)
massgen/backend/cb_failover.pymassgen/backend/llm_circuit_breaker.pymassgen/tests/test_cb_failover.py
docs: docs for v0.1.81
PR Title Format
Your PR title must follow the format:
<type>: <brief description>Valid types:
fix:- Bug fixesfeat:- New featuresbreaking:- Breaking changesdocs:- Documentation updatesrefactor:- Code refactoringtest:- Test additions/modificationschore:- Maintenance tasksperf:- Performance improvementsstyle:- Code style changesci:- CI/CD configuration changesExamples:
fix: resolve memory leak in data processingfeat: add export to CSV functionalitybreaking: change API response formatdocs: update installation guideDescription
Brief description of the changes in this PR
Type of change
fix:) - Non-breaking change which fixes an issuefeat:) - Non-breaking change which adds functionalitybreaking:) - Fix or feature that would cause existing functionality to not work as expecteddocs:) - Documentation updatesrefactor:) - Code changes that neither fix a bug nor add a featuretest:) - Adding missing tests or correcting existing testschore:) - Maintenance tasks, dependency updates, etc.perf:) - Code changes that improve performancestyle:) - Changes that do not affect the meaning of the code (formatting, missing semi-colons, etc.)ci:) - Changes to CI/CD configuration files and scriptsChecklist
Pre-commit status
How to Test
Add test method for this PR.
Test CLI Command
Write down the test bash command. If there is pre-requests, please emphasize.
Expected Results
Description/screenshots of expected results.
Additional context
Add any other context about the PR here.
Summary by CodeRabbit