feat: add circuit breaker observability module (Phase 3)#1056
feat: add circuit breaker observability module (Phase 3)#1056Henry-811 merged 5 commits intomassgen:dev/v0.1.76from
Conversation
- New massgen/observability/ package with CircuitBreakerMetrics class - Optional prometheus_client dependency with lazy import and no-op fallback - Per-instance CollectorRegistry to avoid global metric collision - Metrics: cb_state_transitions_total, cb_requests_total, cb_request_latency_seconds (p50/p95/p99 buckets for LLM calls), cb_current_state gauge - Grafana dashboard template for CB state, request rates, and latency - LLMCircuitBreaker gains optional metrics= constructor parameter - All state transitions (CLOSED/OPEN/HALF_OPEN) emit metrics including abnormal HALF_OPEN probe reopen via BaseException handler - reset() emits state transition metric when state changes - Full backward compat: metrics=None preserves all existing behavior - 19 happy-path + no-op tests, 21 adversarial tests (3+ categories), 66 regression tests all passing (107 total)
- Add _safe_emit() helper to prevent metrics exceptions from crashing CB calls - Emit per-attempt failure metrics before each retry continue in CAP/WAIT/retryable paths - Add prometheus-client>=0.20 to observability and all extras in pyproject.toml - Clear partial state on Histogram/Gauge init failure in _ensure_metrics() - Add module-level enablement docstring to prometheus.py - Add 21 new tests: per-attempt latency, exception safety, partial construction, concurrent get_registry, None/empty label boundaries, OPEN->HALF_OPEN metric, reentrancy under RLock, and label cardinality contract documentation
…utoflake) - black: reformat llm_circuit_breaker.py, test_cb_observability*.py (arg line breaks) - add-trailing-comma: prometheus.py, test_cb_observability.py - isort: sort imports in test_cb_observability*.py - autoflake: remove unused imports in test_cb_observability*.py
📝 WalkthroughWalkthroughAdds Prometheus-based observability for the LLM circuit breaker: new CircuitBreakerMetrics class, metrics emissions from LLMCircuitBreaker (state transitions, per-attempt request outcomes and latencies), a Grafana dashboard JSON, tests (including adversarial cases), and an optional dependency entry for prometheus-client. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant CB as LLMCircuitBreaker
participant Metrics as CircuitBreakerMetrics
participant Prom as Prometheus_Registry
Client->>CB: call_with_retry()
activate CB
CB->>CB: should_block()
alt CLOSED or allowed
CB->>CB: perform attempt loop (measure start/end)
CB->>Metrics: record_request(outcome, latency)
activate Metrics
Metrics->>Prom: inc Counters / observe Histogram / set Gauge
deactivate Metrics
alt failure triggers state change
CB->>Metrics: record_state_transition(prev, OPEN)
else success closes
CB->>Metrics: record_state_transition(prev, CLOSED)
end
else REJECTED (OPEN/HALF_OPEN)
CB->>Metrics: record_request(rejected_open/rejected_half_open, 0.0)
activate Metrics
Metrics->>Prom: inc Counter
deactivate Metrics
end
CB-->>Client: result / exception
deactivate CB
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)
434-445:⚠️ Potential issue | 🟠 MajorTrack HALF_OPEN probe ownership dynamically, not once per call.
_probe_was_half_openis snapshotted before the retry loop, but this call can become the probe later when Line 439 flipsOPEN -> HALF_OPENon a subsequent attempt. If that later probe exits terminally, the cleanup block skips the reopen/reset path and can leave_half_open_probe_active=TrueinHALF_OPEN, effectively wedging the breaker. Please promote probe ownership when an attempt entersHALF_OPEN, and add a regression test for that mid-retry transition.Also applies to: 560-577
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/llm_circuit_breaker.py` around lines 434 - 445, The code snapshots _probe_was_half_open once before the retry loop which fails to transfer probe ownership if the circuit transitions to HALF_OPEN mid-retry; update the probe ownership check inside the retry loop: on each attempt re-evaluate self.state and when it becomes CircuitState.HALF_OPEN, set and claim the probe by setting self._half_open_probe_active accordingly (and record probe ownership so cleanup knows to reopen/reset), ensure the cleanup/exit path looks at the dynamically-set self._half_open_probe_active (not the original _probe_was_half_open) before deciding to reset/reopen the breaker and emit metrics via _safe_emit/_metrics.record_request, and add a regression test that simulates OPEN -> HALF_OPEN during retries to verify the probe flag is cleared and the breaker is not wedged; reference symbols: _probe_was_half_open, self.state, should_block(), _half_open_probe_active, _safe_emit, _metrics.record_request, CircuitBreakerOpenError.
🧹 Nitpick comments (1)
massgen/observability/prometheus.py (1)
52-59: Add Google-style docstrings to the remaining new methods.
__init__(),get_registry(),_ensure_metrics(), and_state_value()are new here, but they’re still missing the repo’s required docstring format. As per coding guidelines, "**/*.py: For new or changed functions, include Google-style docstrings`".Also applies to: 117-125, 186-188
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/observability/prometheus.py` around lines 52 - 59, The listed new methods (__init__, get_registry, _ensure_metrics, and _state_value) are missing the project's required Google-style docstrings; update each of these functions to include a Google-style docstring describing the purpose, Args (if any), Returns (if any), and Raises (if any) following the repo convention—place the docstring immediately under the def line for __init__, get_registry, _ensure_metrics, and _state_value in massgen/observability/prometheus.py and mirror the same pattern used elsewhere in the module (see other documented functions around the file for exact phrasing and sections).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 243-252: The code currently calls
self._safe_emit(self._metrics.record_state_transition, ...) while holding
self._lock; instead, capture the transition parameters (e.g., backend_name and
the from/to state labels) while under the lock and defer calling
self._safe_emit(self._metrics.record_state_transition, ...) until after
releasing self._lock to avoid blocking or re-entrancy into methods like state()
or reset(); apply the same pattern for every occurrence (the blocks invoking
_safe_emit with _metrics.record_state_transition around lines referenced, e.g.,
the instances near 243-252, 292-299, 309-315, 340-346, 368-374, 385-391,
570-576) so you always assemble the payload under lock and perform the actual
_safe_emit call only after the lock is released.
In `@pyproject.toml`:
- Around line 83-86: MANIFEST.in currently lacks rules to include JSON dashboard
assets, so add an inclusion rule to ensure
massgen/observability/dashboards/*.json are packaged (e.g., add a line like
recursive-include massgen/observability/dashboards *.json) so the new
massgen/observability/dashboards/circuit_breaker.json is shipped with
sdist/wheel; also ensure any other observability JSONs referenced around the
same change are covered by the same rule.
---
Outside diff comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 434-445: The code snapshots _probe_was_half_open once before the
retry loop which fails to transfer probe ownership if the circuit transitions to
HALF_OPEN mid-retry; update the probe ownership check inside the retry loop: on
each attempt re-evaluate self.state and when it becomes CircuitState.HALF_OPEN,
set and claim the probe by setting self._half_open_probe_active accordingly (and
record probe ownership so cleanup knows to reopen/reset), ensure the
cleanup/exit path looks at the dynamically-set self._half_open_probe_active (not
the original _probe_was_half_open) before deciding to reset/reopen the breaker
and emit metrics via _safe_emit/_metrics.record_request, and add a regression
test that simulates OPEN -> HALF_OPEN during retries to verify the probe flag is
cleared and the breaker is not wedged; reference symbols: _probe_was_half_open,
self.state, should_block(), _half_open_probe_active, _safe_emit,
_metrics.record_request, CircuitBreakerOpenError.
---
Nitpick comments:
In `@massgen/observability/prometheus.py`:
- Around line 52-59: The listed new methods (__init__, get_registry,
_ensure_metrics, and _state_value) are missing the project's required
Google-style docstrings; update each of these functions to include a
Google-style docstring describing the purpose, Args (if any), Returns (if any),
and Raises (if any) following the repo convention—place the docstring
immediately under the def line for __init__, get_registry, _ensure_metrics, and
_state_value in massgen/observability/prometheus.py and mirror the same pattern
used elsewhere in the module (see other documented functions around the file for
exact phrasing and sections).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8025c7df-0fe6-4c0b-9164-6ed7bc116e3e
📒 Files selected for processing (7)
massgen/backend/llm_circuit_breaker.pymassgen/observability/__init__.pymassgen/observability/dashboards/circuit_breaker.jsonmassgen/observability/prometheus.pymassgen/tests/test_cb_observability.pymassgen/tests/test_cb_observability_adversarial.pypyproject.toml
…cstrings + manifest) - Track HALF_OPEN probe ownership dynamically inside retry loop (_owns_probe replaces stale _probe_was_half_open snapshot; mid-retry ownership reacquired with explicit guard block after should_block() returns False) - Release self._lock before _safe_emit to prevent re-entrancy: 7 call sites in should_block, record_failure, record_success, force_open, reset, and call_with_retry except-BaseException now capture locals inside lock and invoke _safe_emit after the lock is released - Add Google-style docstrings to prometheus.py: __init__, get_registry, _ensure_metrics, _state_value - Package dashboards JSON via MANIFEST.in recursive-include - Add regression tests TestAdversarialMidRetryProbeTransfer (2 async tests) covering successful and failed probe scenarios after mid-retry OPEN->HALF_OPEN
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)
190-199:⚠️ Potential issue | 🟡 MinorAdd Google-style docstrings to the new methods.
__init__()now exposesmetrics, and_safe_emit()is new, but neither docstring follows the repo’s Google-style format. Please addArgs:/Returns:sections where applicable so the new API surface stays consistent.As per coding guidelines,
**/*.py: For new or changed functions, include Google-style docstrings.Also applies to: 592-600
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@massgen/backend/llm_circuit_breaker.py` around lines 190 - 199, The new methods __init__ and _safe_emit lack Google-style docstrings; add docstrings for both using the repo's Google style including an "Args:" section (for __init__: config: LLMCircuitBreakerConfig | None, backend_name: str, metrics: CircuitBreakerMetrics | None) and for _safe_emit include "Args:" for parameters it accepts and a "Returns:" if it returns a value (or "Raises:" if it can raise), plus a one-line summary and any relevant behavior notes; place these docstrings immediately above the definitions of __init__ and _safe_emit in llm_circuit_breaker.py to match existing docstring conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 232-258: The HALF_OPEN probe must be owned by a specific caller:
modify the probe grant logic (the block in should_block / the section setting
self._half_open_probe_active and _emit_transition) to allocate a unique probe
token/owner (e.g. self._half_open_probe_owner = uuid or incrementing id) and
return or record that token with the caller; then change record_failure() and
record_success() to check that the finishing call's token matches
self._half_open_probe_owner before performing any state transitions or emitting
transition metrics; alternatively, move all HALF_OPEN transition handling into
call_with_retry() so only the caller that was granted the probe (identified by
the token) can mutate self._state, and ensure _half_open_probe_active and
_half_open_probe_owner are cleared only by the probe owner.
In `@massgen/tests/test_cb_observability_adversarial.py`:
- Around line 628-690: The tests currently expire cb._open_until before the
first call so the OPEN->HALF_OPEN transition happens before retries; change each
test so the first attempt runs while the breaker remains OPEN and a subsequent
retry triggers the HALF_OPEN probe branch: set cb._open_until to a future (or
leave it expired only after the first attempt), use a counter/closure in the
coroutine factories (succeed_on_first/always_fail) so the first invocation
either fails (or raises) while OPEN, then on that first invocation update
cb._open_until = time.monotonic() - 1.0 to allow the next retry to transition to
HALF_OPEN and exercise the retry-ownership branch in call_with_retry.
---
Outside diff comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 190-199: The new methods __init__ and _safe_emit lack Google-style
docstrings; add docstrings for both using the repo's Google style including an
"Args:" section (for __init__: config: LLMCircuitBreakerConfig | None,
backend_name: str, metrics: CircuitBreakerMetrics | None) and for _safe_emit
include "Args:" for parameters it accepts and a "Returns:" if it returns a value
(or "Raises:" if it can raise), plus a one-line summary and any relevant
behavior notes; place these docstrings immediately above the definitions of
__init__ and _safe_emit in llm_circuit_breaker.py to match existing docstring
conventions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dda08acb-6d58-43cc-a7e6-05f6efc0d547
📒 Files selected for processing (4)
MANIFEST.inmassgen/backend/llm_circuit_breaker.pymassgen/observability/prometheus.pymassgen/tests/test_cb_observability_adversarial.py
✅ Files skipped from review due to trivial changes (1)
- MANIFEST.in
| _emit_transition: tuple[str, str] | None = None | ||
| with self._lock: | ||
| if self._state == CircuitState.CLOSED: | ||
| return False | ||
| should_block = False | ||
|
|
||
| if self._state == CircuitState.OPEN: | ||
| elif self._state == CircuitState.OPEN: | ||
| now = time.monotonic() | ||
| if now >= self._open_until: | ||
| # Transition to HALF_OPEN -- allow one probe | ||
| self._state = CircuitState.HALF_OPEN | ||
| self._half_open_probe_active = True | ||
| self._log("Circuit breaker half-open, allowing probe request") | ||
| return False | ||
| return True | ||
|
|
||
| # HALF_OPEN | ||
| if self._half_open_probe_active: | ||
| # Probe already dispatched; block additional requests | ||
| return True | ||
| # No probe active -- allow one | ||
| self._half_open_probe_active = True | ||
| return False | ||
| _emit_transition = ("open", "half_open") | ||
| should_block = False | ||
| else: | ||
| should_block = True | ||
|
|
||
| else: | ||
| # HALF_OPEN | ||
| if self._half_open_probe_active: | ||
| # Probe already dispatched; block additional requests | ||
| should_block = True | ||
| else: | ||
| # No probe active -- allow one | ||
| self._half_open_probe_active = True | ||
| should_block = False | ||
|
|
There was a problem hiding this comment.
Bind HALF_OPEN transitions to the actual probe owner.
should_block() still models probe ownership with a single global flag, but record_failure() and record_success() transition on self._state == HALF_OPEN without checking whether the finishing call is the one that acquired that probe. A long-running request that started earlier in CLOSED can therefore finish after another call has moved the breaker to HALF_OPEN and incorrectly reopen/close the breaker here, while also emitting the wrong transition metrics. Please carry an explicit probe token/owner through the success/failure path, or centralize all HALF_OPEN transitions inside call_with_retry() so only the granted probe can mutate them.
Also applies to: 280-321, 328-347, 439-458, 570-588
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@massgen/backend/llm_circuit_breaker.py` around lines 232 - 258, The HALF_OPEN
probe must be owned by a specific caller: modify the probe grant logic (the
block in should_block / the section setting self._half_open_probe_active and
_emit_transition) to allocate a unique probe token/owner (e.g.
self._half_open_probe_owner = uuid or incrementing id) and return or record that
token with the caller; then change record_failure() and record_success() to
check that the finishing call's token matches self._half_open_probe_owner before
performing any state transitions or emitting transition metrics; alternatively,
move all HALF_OPEN transition handling into call_with_retry() so only the caller
that was granted the probe (identified by the token) can mutate self._state, and
ensure _half_open_probe_active and _half_open_probe_owner are cleared only by
the probe owner.
| @pytest.mark.asyncio | ||
| async def test_probe_flag_cleared_after_mid_retry_half_open(self) -> None: | ||
| """Probe flag is cleared and breaker closes when OPEN->HALF_OPEN transition occurs mid-retry.""" | ||
| import time as _time | ||
|
|
||
| cb = LLMCircuitBreaker( | ||
| backend_name="test_mid_retry", | ||
| config=LLMCircuitBreakerConfig( | ||
| max_failures=1, | ||
| reset_time_seconds=999.0, | ||
| enabled=True, | ||
| ), | ||
| ) | ||
|
|
||
| # Trip the breaker to OPEN | ||
| cb.record_failure(error_type="test") | ||
| assert cb.state == CircuitState.OPEN | ||
|
|
||
| # Expire the open window so the NEXT should_block() call transitions to HALF_OPEN | ||
| # (simulating time passage between circuit open and call attempt) | ||
| cb._open_until = _time.monotonic() - 1.0 | ||
|
|
||
| # coro_factory: first call succeeds (probe success -> CLOSED) | ||
| # The circuit is OPEN at call_with_retry entry, but should_block() will | ||
| # transition it to HALF_OPEN and allow this call through as the probe. | ||
| async def succeed_on_first(): | ||
| return "ok" | ||
|
|
||
| result = await cb.call_with_retry(succeed_on_first, max_retries=1) | ||
| assert result == "ok" | ||
| assert cb._half_open_probe_active is False, "_half_open_probe_active must be False after successful probe" | ||
| assert cb.state == CircuitState.CLOSED, "Breaker must be CLOSED after successful probe" | ||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_probe_flag_cleared_on_failed_mid_retry_half_open(self) -> None: | ||
| """Probe flag is cleared and breaker is OPEN when probe acquired mid-retry then fails.""" | ||
| import time as _time | ||
|
|
||
| cb = LLMCircuitBreaker( | ||
| backend_name="test_mid_retry_fail", | ||
| config=LLMCircuitBreakerConfig( | ||
| max_failures=1, | ||
| reset_time_seconds=999.0, | ||
| enabled=True, | ||
| ), | ||
| ) | ||
|
|
||
| # Trip the breaker to OPEN | ||
| cb.record_failure(error_type="test") | ||
| assert cb.state == CircuitState.OPEN | ||
|
|
||
| # Expire the window so should_block() transitions to HALF_OPEN | ||
| cb._open_until = _time.monotonic() - 1.0 | ||
|
|
||
| async def always_fail(): | ||
| raise ValueError("probe failure") | ||
|
|
||
| with pytest.raises(ValueError, match="probe failure"): | ||
| await cb.call_with_retry(always_fail, max_retries=1) | ||
|
|
||
| # Probe failed -- breaker must be re-opened, flag must be cleared | ||
| assert cb._half_open_probe_active is False, "_half_open_probe_active must be False after failed probe" | ||
| assert cb.state == CircuitState.OPEN, "Breaker must be OPEN after failed probe" |
There was a problem hiding this comment.
These “mid-retry” regressions never enter the retry-ownership branch.
Both tests expire cb._open_until before the first call and use max_retries=1, so OPEN -> HALF_OPEN happens in the initial gate and Lines 454-458 are never executed. As written, these would still pass if _owns_probe were only snapshotted once before the loop. Please make attempt 1 fail while the breaker is still OPEN, then let a later retry acquire HALF_OPEN so this regression actually covers the stale-owner case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@massgen/tests/test_cb_observability_adversarial.py` around lines 628 - 690,
The tests currently expire cb._open_until before the first call so the
OPEN->HALF_OPEN transition happens before retries; change each test so the first
attempt runs while the breaker remains OPEN and a subsequent retry triggers the
HALF_OPEN probe branch: set cb._open_until to a future (or leave it expired only
after the first attempt), use a counter/closure in the coroutine factories
(succeed_on_first/always_fail) so the first invocation either fails (or raises)
while OPEN, then on that first invocation update cb._open_until =
time.monotonic() - 1.0 to allow the next retry to transition to HALF_OPEN and
exercise the retry-ownership branch in call_with_retry.
|
Pushed llm_circuit_breaker.py
prometheus.py Google-style docstrings added to MANIFEST.in
|
Summary
Follow-up to #1038 (circuit breaker core). Adds an optional Prometheus metrics module and a Grafana 9+ dashboard. No behavior changes to callers that don't opt in.
What's in
massgen/observability/prometheus.py--CircuitBreakerMetricsclass, lazyprometheus_clientimport, per-instanceCollectorRegistryto avoid collisions with callers' default registrymassgen/observability/dashboards/circuit_breaker.json-- Grafana 9+ dashboard (state gauge, request rate, latency histogram, transition counter)massgen/backend/llm_circuit_breaker.py-- optionalmetrics=parameter onLLMCircuitBreaker.__init__; 88 lines changed, all behindif self._metrics is not Noneguardspyproject.toml--[observability]extra:prometheus-client>=0.20,logfire>=3.0.0test_cb_observability.py, 24 intest_cb_observability_adversarial.pyDesign notes
prometheus_clientis optional.pip install massgen[observability]enables it. Without it, everyCircuitBreakerMetricsmethod is a no-op;get_registry()returnsNone. The guard is a lazyImportErrorcatch on first call -- zero overhead until metrics are actually used.Per-instance
CollectorRegistryrather than the default global one. Avoids duplicate-registration errors when callers have their own Prometheus setup or run multiple CB instances in tests.Latency is per-attempt, not retry-total. Matches how CB dashboards are typically read -- you want to see individual call durations, not the sum across retries with sleep. This is documented in the docstring.
In practice most installs already have
prometheus_clientviapydocketas a transitive dep, so the no-op path is really a safety net for minimal / custom builds.Tests
TestCircuitBreakerMetricsHappyPath,TestCircuitBreakerMetricsNoOp,TestCircuitBreakerIntegration,TestRound5Additions)TestAdversarialCorruptedInput,TestAdversarialConcurrentAccesswith 100 threads,TestAdversarialFailureInjection,TestAdversarialHalfOpenEdgeCases,TestAdversarialRound5)test_llm_circuit_breaker.pytests pass unchanged -- no regressionsAdversarial categories covered: corrupted/empty label values, concurrent emit under load,
prometheus_clientmissing mid-run, partial construction (registry created but Counter raises), duplicate registration, HALF_OPEN edge cases.Backward compat
metrics=defaults toNone. All emit paths are behindif self._metrics is not None. Existing callers that don't passmetrics=see zero code path changes. Verified by running the fulltest_llm_circuit_breaker.pysuite against this branch.Next
Phase 4 (distributed store backend, Redis) is scoped but not started. Happy to hold that pending feedback here, or open a tracking issue if that helps.
Summary by CodeRabbit
New Features
Tests
Chore