feat: add circuit breaker observability module (Phase 3)#1056

Merged
Henry-811 merged 5 commits into massgen:dev/v0.1.76 from
amabito:feat/cb-observability-phase3
Apr 13, 2026

Conversation

@amabito
Contributor

@amabito amabito commented Apr 11, 2026

Summary

Follow-up to #1038 (circuit breaker core). Adds an optional Prometheus metrics module and a Grafana 9+ dashboard. No behavior changes to callers that don't opt in.

What's in

  • massgen/observability/prometheus.py -- CircuitBreakerMetrics class, lazy prometheus_client import, per-instance CollectorRegistry to avoid collisions with callers' default registry
  • massgen/observability/dashboards/circuit_breaker.json -- Grafana 9+ dashboard (state gauge, request rate, latency histogram, transition counter)
  • massgen/backend/llm_circuit_breaker.py -- optional metrics= parameter on LLMCircuitBreaker.__init__; 88 lines changed, all behind if self._metrics is not None guards
  • pyproject.toml -- [observability] extra: prometheus-client>=0.20, logfire>=3.0.0
  • 54 new tests: 30 in test_cb_observability.py, 24 in test_cb_observability_adversarial.py

Design notes

prometheus_client is optional. pip install massgen[observability] enables it. Without it, every CircuitBreakerMetrics method is a no-op; get_registry() returns None. The guard is a lazy ImportError catch on first call -- zero overhead until metrics are actually used.

Per-instance CollectorRegistry rather than the default global one. Avoids duplicate-registration errors when callers have their own Prometheus setup or run multiple CB instances in tests.
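
A minimal sketch of the two ideas above (a lazy optional import with a no-op fallback, and a per-instance registry). This is not the code in `prometheus.py`: the class name `LazyMetrics`, its `module_name` parameter, and the method bodies are assumptions made for illustration. Only `CollectorRegistry` and `Counter` are real `prometheus_client` names.

```python
import importlib
import threading
from typing import Any, Optional


class LazyMetrics:
    """Hypothetical stand-in for a metrics wrapper with an optional dependency."""

    def __init__(self, module_name: str = "prometheus_client") -> None:
        # module_name is injectable only so the no-op path is easy to exercise.
        self._module_name = module_name
        self._lock = threading.Lock()
        self._initialised = False
        self._registry: Optional[Any] = None
        self._requests: Optional[Any] = None

    def _ensure_metrics(self) -> bool:
        """Import the client on first use; return False if it is unavailable."""
        with self._lock:
            if not self._initialised:
                self._initialised = True
                try:
                    prom = importlib.import_module(self._module_name)
                except ImportError:
                    return False
                # Per-instance registry: a second LazyMetrics never collides
                # with this one or with the caller's default registry.
                registry = prom.CollectorRegistry()
                self._requests = prom.Counter(
                    "cb_requests_total",
                    "Circuit breaker requests by outcome",
                    ["backend", "outcome"],
                    registry=registry,
                )
                self._registry = registry
            return self._registry is not None

    def record_request(self, backend: str, outcome: str) -> None:
        """Increment the request counter, or do nothing without the client."""
        if self._ensure_metrics():
            self._requests.labels(backend=backend, outcome=outcome).inc()

    def get_registry(self) -> Optional[Any]:
        """Return the private registry, or None when the client is missing."""
        return self._registry if self._ensure_metrics() else None
```

With a module name that cannot be imported, `record_request()` is silently ignored and `get_registry()` returns `None`, which mirrors the no-op behavior described above.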

Latency is per-attempt, not retry-total. Matches how CB dashboards are typically read -- you want to see individual call durations, not the sum across retries with sleep. This is documented in the docstring.
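
To make the per-attempt choice concrete, here is a small sketch of a retry loop that records one latency sample per attempt and keeps the backoff sleep out of every sample. The function name, the `record` callback, and the `sleep` parameter are hypothetical; the real `call_with_retry` is async and has a different signature.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    record: Callable[[str, float], None],
    max_retries: int = 3,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry func, recording one latency sample per attempt."""
    for attempt in range(1, max_retries + 1):
        start = time.perf_counter()
        try:
            result = func()
        except Exception:
            # The sample covers this attempt only; the backoff sleep below
            # happens after the sample has been recorded.
            record("failure", time.perf_counter() - start)
            if attempt == max_retries:
                raise
            sleep(backoff_seconds)
            continue
        record("success", time.perf_counter() - start)
        return result
    raise RuntimeError("max_retries must be at least 1")
```

A call that fails twice and then succeeds produces three samples (two `failure`, one `success`), none of which includes the time spent sleeping between attempts.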

In practice most installs already have prometheus_client via pydocket as a transitive dep, so the no-op path is really a safety net for minimal / custom builds.

Tests

  • 30 happy-path / integration tests (TestCircuitBreakerMetricsHappyPath, TestCircuitBreakerMetricsNoOp, TestCircuitBreakerIntegration, TestRound5Additions)
  • 24 adversarial tests (TestAdversarialCorruptedInput, TestAdversarialConcurrentAccess with 100 threads, TestAdversarialFailureInjection, TestAdversarialHalfOpenEdgeCases, TestAdversarialRound5)
  • 63 existing test_llm_circuit_breaker.py tests pass unchanged -- no regressions
  • Total: 107 passing

Adversarial categories covered: corrupted/empty label values, concurrent emit under load, prometheus_client missing mid-run, partial construction (registry created but Counter raises), duplicate registration, HALF_OPEN edge cases.

Backward compat

metrics= defaults to None. All emit paths are behind if self._metrics is not None. Existing callers that don't pass metrics= see zero code path changes. Verified by running the full test_llm_circuit_breaker.py suite against this branch.
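
The guard pattern can be sketched as below. `Breaker`, `MetricsSink`, and `ListSink` are illustrative names, not the classes in this PR; the point is only that the core state change runs identically whether or not a sink is supplied.

```python
from typing import List, Optional, Protocol, Tuple


class MetricsSink(Protocol):
    def record_request(self, backend: str, outcome: str) -> None: ...


class ListSink:
    """Hypothetical in-memory sink, used here in place of Prometheus."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []

    def record_request(self, backend: str, outcome: str) -> None:
        self.events.append((backend, outcome))


class Breaker:
    """Hypothetical breaker with an opt-in metrics sink."""

    def __init__(self, backend_name: str, metrics: Optional[MetricsSink] = None) -> None:
        self.backend_name = backend_name
        self.failures = 0
        self._metrics = metrics  # None means no metrics code is ever reached

    def record_failure(self) -> None:
        self.failures += 1  # core behavior, identical with or without metrics
        if self._metrics is not None:
            self._metrics.record_request(self.backend_name, "failure")
```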

Next

Phase 4 (distributed store backend, Redis) is scoped but not started. Happy to hold that pending feedback here, or open a tracking issue if that helps.

Summary by CodeRabbit

  • New Features

    • Prometheus-based observability for the LLM circuit breaker: records state transitions, request outcomes, and per-attempt latencies.
    • Grafana dashboard for circuit breaker state, request rates, latency quantiles, and state transitions.
    • Optional Prometheus monitoring can be enabled via extras.
  • Tests

    • Extensive unit and adversarial tests covering metrics, concurrency, failure modes, and breaker-edge cases.
  • Chore

    • Packaging updated to include dashboard JSON and optional dependency.

amabito added 4 commits April 11, 2026 19:53
- New massgen/observability/ package with CircuitBreakerMetrics class
- Optional prometheus_client dependency with lazy import and no-op fallback
- Per-instance CollectorRegistry to avoid global metric collision
- Metrics: cb_state_transitions_total, cb_requests_total,
  cb_request_latency_seconds (p50/p95/p99 buckets for LLM calls),
  cb_current_state gauge
- Grafana dashboard template for CB state, request rates, and latency
- LLMCircuitBreaker gains optional metrics= constructor parameter
- All state transitions (CLOSED/OPEN/HALF_OPEN) emit metrics including
  abnormal HALF_OPEN probe reopen via BaseException handler
- reset() emits state transition metric when state changes
- Full backward compat: metrics=None preserves all existing behavior
- 19 happy-path + no-op tests, 21 adversarial tests (3+ categories),
  66 regression tests all passing (107 total)
- Add _safe_emit() helper to prevent metrics exceptions from crashing CB calls
- Emit per-attempt failure metrics before each retry continue in CAP/WAIT/retryable paths
- Add prometheus-client>=0.20 to observability and all extras in pyproject.toml
- Clear partial state on Histogram/Gauge init failure in _ensure_metrics()
- Add module-level enablement docstring to prometheus.py
- Add 21 new tests: per-attempt latency, exception safety, partial construction,
  concurrent get_registry, None/empty label boundaries, OPEN->HALF_OPEN metric,
  reentrancy under RLock, and label cardinality contract documentation
…utoflake)

- black: reformat llm_circuit_breaker.py, test_cb_observability*.py (arg line breaks)
- add-trailing-comma: prometheus.py, test_cb_observability.py
- isort: sort imports in test_cb_observability*.py
- autoflake: remove unused imports in test_cb_observability*.py
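
The `_safe_emit()` helper mentioned in the commit notes above is not shown in this conversation. A rough sketch of the idea, assuming a free function with a hypothetical signature and a boolean return value added for illustration:

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def safe_emit(emit: Callable[..., None], *args: Any, **kwargs: Any) -> bool:
    """Call a metrics callback, swallowing any Exception it raises.

    Args:
        emit: The metrics callback to invoke.
        *args: Positional arguments forwarded to the callback.
        **kwargs: Keyword arguments forwarded to the callback.

    Returns:
        True if the callback completed, False if it raised.
    """
    try:
        emit(*args, **kwargs)
    except Exception:
        # A broken metrics backend must never fail the guarded call.
        logger.debug("metrics emit failed", exc_info=True)
        return False
    return True
```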
@coderabbitai

coderabbitai bot commented Apr 11, 2026

📝 Walkthrough

Walkthrough

Adds Prometheus-based observability for the LLM circuit breaker: new CircuitBreakerMetrics class, metrics emissions from LLMCircuitBreaker (state transitions, per-attempt request outcomes and latencies), a Grafana dashboard JSON, tests (including adversarial cases), and an optional dependency entry for prometheus-client.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Observability library: `massgen/observability/prometheus.py`, `massgen/observability/__init__.py` | New CircuitBreakerMetrics class with lazy, thread-safe Prometheus instrument initialization, registry management, and no-op behavior when prometheus client is unavailable; exported via package init. |
| Circuit breaker integration: `massgen/backend/llm_circuit_breaker.py` | Constructor gains optional `metrics: CircuitBreakerMetrics` … |
| Grafana dashboard: `massgen/observability/dashboards/circuit_breaker.json` | New dashboard for LLM circuit breaker: current state gauge, request-rate by outcome, latency quantiles, and state transition rate; templated by backend. |
| Tests (unit & adversarial): `massgen/tests/test_cb_observability.py`, `massgen/tests/test_cb_observability_adversarial.py` | Extensive new tests validating metric recording, lazy import behavior, thread-safety, boundary inputs, half-open probe edge cases, retry-path metric emissions, and resilience to metric-side exceptions. |
| Packaging / manifest: `pyproject.toml`, `MANIFEST.in` | Added prometheus-client>=0.20 to observability and all optional groups; included dashboards JSON in packaging. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CB as LLMCircuitBreaker
    participant Metrics as CircuitBreakerMetrics
    participant Prom as Prometheus_Registry

    Client->>CB: call_with_retry()
    activate CB

    CB->>CB: should_block()
    alt CLOSED or allowed
        CB->>CB: perform attempt loop (measure start/end)
        CB->>Metrics: record_request(outcome, latency)
        activate Metrics
        Metrics->>Prom: inc Counters / observe Histogram / set Gauge
        deactivate Metrics

        alt failure triggers state change
            CB->>Metrics: record_state_transition(prev, OPEN)
        else success closes
            CB->>Metrics: record_state_transition(prev, CLOSED)
        end
    else REJECTED (OPEN/HALF_OPEN)
        CB->>Metrics: record_request(rejected_open/rejected_half_open, 0.0)
        activate Metrics
        Metrics->>Prom: inc Counter
        deactivate Metrics
    end

    CB-->>Client: result / exception
    deactivate CB

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #1024: Base LLMCircuitBreaker implementation; this change augments it with metrics instrumentation and state-transition emissions.
  • PR #1038: Integrations that call LLMCircuitBreaker APIs; constructor signature and metric emissions in this PR directly affect those integrations.

Suggested reviewers

  • a5507203
  • ncrispino
🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.35%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Documentation Updated | ⚠️ Warning | CircuitBreakerMetrics class and its public methods lack docstrings; the `LLMCircuitBreaker.__init__` docstring does not document the new metrics parameter; no design documentation exists for the observability feature. | Add Google-style docstrings to CircuitBreakerMetrics and its methods; document the metrics parameter in `LLMCircuitBreaker.__init__`; create design documentation at docs/dev_notes/circuit_breaker_observability.md. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: adding a circuit breaker observability module as Phase 3 of a multi-phase effort. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering summary, design notes, tests, backward compatibility, and next steps. It provides clear context about the observability module addition. |
| Capabilities Registry Check | ✅ Passed | Custom check for backend/model changes is not applicable. PR only adds observability and metrics to circuit breaker, with no changes to backends, models, or capabilities. |
| Config Parameter Sync | ✅ Passed | PR does not add new YAML parameters; metrics parameter is internal Python API, not YAML config. Check condition not met. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)

434-445: ⚠️ Potential issue | 🟠 Major

Track HALF_OPEN probe ownership dynamically, not once per call.

_probe_was_half_open is snapshotted before the retry loop, but this call can become the probe later when Line 439 flips OPEN -> HALF_OPEN on a subsequent attempt. If that later probe exits terminally, the cleanup block skips the reopen/reset path and can leave _half_open_probe_active=True in HALF_OPEN, effectively wedging the breaker. Please promote probe ownership when an attempt enters HALF_OPEN, and add a regression test for that mid-retry transition.

Also applies to: 560-577
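
The failure mode described in this finding can be illustrated with a toy model. Everything here is hypothetical (`ToyBreaker`, `try_acquire`, `finish_probe`, the `cooldown_over` flag standing in for the open-until timer, and a synchronous retry loop that only retries blocked attempts). The sketch shows ownership being decided on each attempt rather than snapshotted before the loop.

```python
import threading
from enum import Enum
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class ToyBreaker:
    """Hypothetical, heavily simplified breaker (no failure counting)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.state = State.CLOSED
        self.probe_active = False
        self.cooldown_over = False  # stands in for the real open-until timer

    def try_acquire(self) -> Tuple[bool, bool]:
        """Return (allowed, is_probe) for a single attempt."""
        with self._lock:
            if self.state is State.CLOSED:
                return True, False
            if self.state is State.OPEN and self.cooldown_over:
                self.state = State.HALF_OPEN
            if self.state is State.HALF_OPEN and not self.probe_active:
                self.probe_active = True
                return True, True
            return False, False

    def finish_probe(self, success: bool) -> None:
        """Clear the probe flag and close or reopen the breaker."""
        with self._lock:
            self.probe_active = False
            self.cooldown_over = False
            self.state = State.CLOSED if success else State.OPEN


def call_with_retry(
    breaker: ToyBreaker,
    func: Callable[[], T],
    max_attempts: int = 3,
    on_blocked: Callable[[], None] = lambda: None,
) -> T:
    owns_probe = False
    try:
        for _ in range(max_attempts):
            allowed, is_probe = breaker.try_acquire()
            if not allowed:
                on_blocked()  # e.g. wait for the cooldown, then retry
                continue
            owns_probe = is_probe  # decided on this attempt, not before the loop
            result = func()
            if owns_probe:
                breaker.finish_probe(success=True)
                owns_probe = False
            return result
        raise RuntimeError("circuit breaker is open")
    except BaseException:
        if owns_probe:
            breaker.finish_probe(success=False)  # never leave the flag set
        raise
```

If ownership were read once before the loop while the breaker was still OPEN, the cleanup branch would be skipped for a probe acquired on a later attempt, and `probe_active` would stay `True`.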

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/llm_circuit_breaker.py` around lines 434 - 445, The code
snapshots _probe_was_half_open once before the retry loop which fails to
transfer probe ownership if the circuit transitions to HALF_OPEN mid-retry;
update the probe ownership check inside the retry loop: on each attempt
re-evaluate self.state and when it becomes CircuitState.HALF_OPEN, set and claim
the probe by setting self._half_open_probe_active accordingly (and record probe
ownership so cleanup knows to reopen/reset), ensure the cleanup/exit path looks
at the dynamically-set self._half_open_probe_active (not the original
_probe_was_half_open) before deciding to reset/reopen the breaker and emit
metrics via _safe_emit/_metrics.record_request, and add a regression test that
simulates OPEN -> HALF_OPEN during retries to verify the probe flag is cleared
and the breaker is not wedged; reference symbols: _probe_was_half_open,
self.state, should_block(), _half_open_probe_active, _safe_emit,
_metrics.record_request, CircuitBreakerOpenError.
🧹 Nitpick comments (1)
massgen/observability/prometheus.py (1)

52-59: Add Google-style docstrings to the remaining new methods.

__init__(), get_registry(), _ensure_metrics(), and _state_value() are new here, but they’re still missing the repo’s required docstring format. As per coding guidelines, "**/*.py: For new or changed functions, include Google-style docstrings".

Also applies to: 117-125, 186-188
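
For reference, a Google-style docstring has the shape sketched below. The function is a hypothetical stand-in for `_state_value()`: its signature and the numeric encoding (closed=0, open=1, half_open=2) are assumptions, not taken from this PR.

```python
def state_value(state: str) -> int:
    """Map a circuit state label to a numeric gauge value.

    Args:
        state: One of "closed", "open", or "half_open".

    Returns:
        The integer to store in the state gauge.

    Raises:
        ValueError: If the label is not a known circuit state.
    """
    mapping = {"closed": 0, "open": 1, "half_open": 2}  # assumed encoding
    try:
        return mapping[state]
    except KeyError:
        raise ValueError(f"unknown circuit state: {state!r}") from None
```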

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/observability/prometheus.py` around lines 52 - 59, The listed new
methods (__init__, get_registry, _ensure_metrics, and _state_value) are missing
the project's required Google-style docstrings; update each of these functions
to include a Google-style docstring describing the purpose, Args (if any),
Returns (if any), and Raises (if any) following the repo convention—place the
docstring immediately under the def line for __init__, get_registry,
_ensure_metrics, and _state_value in massgen/observability/prometheus.py and
mirror the same pattern used elsewhere in the module (see other documented
functions around the file for exact phrasing and sections).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 243-252: The code currently calls
self._safe_emit(self._metrics.record_state_transition, ...) while holding
self._lock; instead, capture the transition parameters (e.g., backend_name and
the from/to state labels) while under the lock and defer calling
self._safe_emit(self._metrics.record_state_transition, ...) until after
releasing self._lock to avoid blocking or re-entrancy into methods like state()
or reset(); apply the same pattern for every occurrence (the blocks invoking
_safe_emit with _metrics.record_state_transition around lines referenced, e.g.,
the instances near 243-252, 292-299, 309-315, 340-346, 368-374, 385-391,
570-576) so you always assemble the payload under lock and perform the actual
_safe_emit call only after the lock is released.

In `@pyproject.toml`:
- Around line 83-86: MANIFEST.in currently lacks rules to include JSON dashboard
assets, so add an inclusion rule to ensure
massgen/observability/dashboards/*.json are packaged (e.g., add a line like
recursive-include massgen/observability/dashboards *.json) so the new
massgen/observability/dashboards/circuit_breaker.json is shipped with
sdist/wheel; also ensure any other observability JSONs referenced around the
same change are covered by the same rule.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8025c7df-0fe6-4c0b-9164-6ed7bc116e3e

📥 Commits

Reviewing files that changed from the base of the PR and between c8c9148 and 653c7bd.

📒 Files selected for processing (7)
  • massgen/backend/llm_circuit_breaker.py
  • massgen/observability/__init__.py
  • massgen/observability/dashboards/circuit_breaker.json
  • massgen/observability/prometheus.py
  • massgen/tests/test_cb_observability.py
  • massgen/tests/test_cb_observability_adversarial.py
  • pyproject.toml

…cstrings + manifest)

- Track HALF_OPEN probe ownership dynamically inside retry loop (_owns_probe
  replaces stale _probe_was_half_open snapshot; mid-retry ownership reacquired
  with explicit guard block after should_block() returns False)
- Release self._lock before _safe_emit to prevent re-entrancy: 7 call sites
  in should_block, record_failure, record_success, force_open, reset, and
  call_with_retry except-BaseException now capture locals inside lock and
  invoke _safe_emit after the lock is released
- Add Google-style docstrings to prometheus.py: __init__, get_registry,
  _ensure_metrics, _state_value
- Package dashboards JSON via MANIFEST.in recursive-include
- Add regression tests TestAdversarialMidRetryProbeTransfer (2 async tests)
  covering successful and failed probe scenarios after mid-retry OPEN->HALF_OPEN

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
massgen/backend/llm_circuit_breaker.py (1)

190-199: ⚠️ Potential issue | 🟡 Minor

Add Google-style docstrings to the new methods.

__init__() now exposes metrics, and _safe_emit() is new, but neither docstring follows the repo’s Google-style format. Please add Args: / Returns: sections where applicable so the new API surface stays consistent.

As per coding guidelines, **/*.py: For new or changed functions, include Google-style docstrings.

Also applies to: 592-600

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/llm_circuit_breaker.py` around lines 190 - 199, The new
methods __init__ and _safe_emit lack Google-style docstrings; add docstrings for
both using the repo's Google style including an "Args:" section (for __init__:
config: LLMCircuitBreakerConfig | None, backend_name: str, metrics:
CircuitBreakerMetrics | None) and for _safe_emit include "Args:" for parameters
it accepts and a "Returns:" if it returns a value (or "Raises:" if it can
raise), plus a one-line summary and any relevant behavior notes; place these
docstrings immediately above the definitions of __init__ and _safe_emit in
llm_circuit_breaker.py to match existing docstring conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@massgen/backend/llm_circuit_breaker.py`:
- Around line 232-258: The HALF_OPEN probe must be owned by a specific caller:
modify the probe grant logic (the block in should_block / the section setting
self._half_open_probe_active and _emit_transition) to allocate a unique probe
token/owner (e.g. self._half_open_probe_owner = uuid or incrementing id) and
return or record that token with the caller; then change record_failure() and
record_success() to check that the finishing call's token matches
self._half_open_probe_owner before performing any state transitions or emitting
transition metrics; alternatively, move all HALF_OPEN transition handling into
call_with_retry() so only the caller that was granted the probe (identified by
the token) can mutate self._state, and ensure _half_open_probe_active and
_half_open_probe_owner are cleared only by the probe owner.

In `@massgen/tests/test_cb_observability_adversarial.py`:
- Around line 628-690: The tests currently expire cb._open_until before the
first call so the OPEN->HALF_OPEN transition happens before retries; change each
test so the first attempt runs while the breaker remains OPEN and a subsequent
retry triggers the HALF_OPEN probe branch: set cb._open_until to a future (or
leave it expired only after the first attempt), use a counter/closure in the
coroutine factories (succeed_on_first/always_fail) so the first invocation
either fails (or raises) while OPEN, then on that first invocation update
cb._open_until = time.monotonic() - 1.0 to allow the next retry to transition to
HALF_OPEN and exercise the retry-ownership branch in call_with_retry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dda08acb-6d58-43cc-a7e6-05f6efc0d547

📥 Commits

Reviewing files that changed from the base of the PR and between 653c7bd and 847a9fb.

📒 Files selected for processing (4)
  • MANIFEST.in
  • massgen/backend/llm_circuit_breaker.py
  • massgen/observability/prometheus.py
  • massgen/tests/test_cb_observability_adversarial.py
✅ Files skipped from review due to trivial changes (1)
  • MANIFEST.in

Comment on lines +232 to +258
+        _emit_transition: tuple[str, str] | None = None
         with self._lock:
             if self._state == CircuitState.CLOSED:
-                return False
+                should_block = False

-            if self._state == CircuitState.OPEN:
+            elif self._state == CircuitState.OPEN:
                 now = time.monotonic()
                 if now >= self._open_until:
                     # Transition to HALF_OPEN -- allow one probe
                     self._state = CircuitState.HALF_OPEN
                     self._half_open_probe_active = True
                     self._log("Circuit breaker half-open, allowing probe request")
-                    return False
-                return True
-
-            # HALF_OPEN
-            if self._half_open_probe_active:
-                # Probe already dispatched; block additional requests
-                return True
-            # No probe active -- allow one
-            self._half_open_probe_active = True
-            return False
+                    _emit_transition = ("open", "half_open")
+                    should_block = False
+                else:
+                    should_block = True
+
+            else:
+                # HALF_OPEN
+                if self._half_open_probe_active:
+                    # Probe already dispatched; block additional requests
+                    should_block = True
+                else:
+                    # No probe active -- allow one
+                    self._half_open_probe_active = True
+                    should_block = False


⚠️ Potential issue | 🔴 Critical

Bind HALF_OPEN transitions to the actual probe owner.

should_block() still models probe ownership with a single global flag, but record_failure() and record_success() transition on self._state == HALF_OPEN without checking whether the finishing call is the one that acquired that probe. A long-running request that started earlier in CLOSED can therefore finish after another call has moved the breaker to HALF_OPEN and incorrectly reopen/close the breaker here, while also emitting the wrong transition metrics. Please carry an explicit probe token/owner through the success/failure path, or centralize all HALF_OPEN transitions inside call_with_retry() so only the granted probe can mutate them.

Also applies to: 280-321, 328-347, 439-458, 570-588
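
One way to picture the token-based option suggested here is the sketch below. It is hypothetical throughout: `TokenBreaker`, `begin_call`, and the token arguments on `record_success`/`record_failure` are illustrative names, and blocking of rejected calls and CLOSED-state failure counting are omitted.

```python
import itertools
import threading
from enum import Enum
from typing import Optional


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class TokenBreaker:
    """Hypothetical breaker where only the probe owner can leave HALF_OPEN."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._ids = itertools.count(1)
        self.state = State.CLOSED
        self._probe_owner: Optional[int] = None

    def begin_call(self) -> Optional[int]:
        """Return a probe token if this call is granted the probe, else None."""
        with self._lock:
            if self.state is State.HALF_OPEN and self._probe_owner is None:
                self._probe_owner = next(self._ids)
                return self._probe_owner
            return None

    def _finish(self, token: Optional[int], new_state: State) -> None:
        with self._lock:
            if self.state is not State.HALF_OPEN:
                return  # CLOSED/OPEN bookkeeping omitted from this sketch
            if token is None or token != self._probe_owner:
                return  # stale call that started before the probe: ignore
            self._probe_owner = None
            self.state = new_state

    def record_success(self, token: Optional[int] = None) -> None:
        self._finish(token, State.CLOSED)

    def record_failure(self, token: Optional[int] = None) -> None:
        self._finish(token, State.OPEN)
```

A long-running call that began while the breaker was CLOSED holds no token, so its late success or failure cannot close or reopen a breaker that has since moved to HALF_OPEN.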


Comment on lines +628 to +690
    @pytest.mark.asyncio
    async def test_probe_flag_cleared_after_mid_retry_half_open(self) -> None:
        """Probe flag is cleared and breaker closes when OPEN->HALF_OPEN transition occurs mid-retry."""
        import time as _time

        cb = LLMCircuitBreaker(
            backend_name="test_mid_retry",
            config=LLMCircuitBreakerConfig(
                max_failures=1,
                reset_time_seconds=999.0,
                enabled=True,
            ),
        )

        # Trip the breaker to OPEN
        cb.record_failure(error_type="test")
        assert cb.state == CircuitState.OPEN

        # Expire the open window so the NEXT should_block() call transitions to HALF_OPEN
        # (simulating time passage between circuit open and call attempt)
        cb._open_until = _time.monotonic() - 1.0

        # coro_factory: first call succeeds (probe success -> CLOSED)
        # The circuit is OPEN at call_with_retry entry, but should_block() will
        # transition it to HALF_OPEN and allow this call through as the probe.
        async def succeed_on_first():
            return "ok"

        result = await cb.call_with_retry(succeed_on_first, max_retries=1)
        assert result == "ok"
        assert cb._half_open_probe_active is False, "_half_open_probe_active must be False after successful probe"
        assert cb.state == CircuitState.CLOSED, "Breaker must be CLOSED after successful probe"

    @pytest.mark.asyncio
    async def test_probe_flag_cleared_on_failed_mid_retry_half_open(self) -> None:
        """Probe flag is cleared and breaker is OPEN when probe acquired mid-retry then fails."""
        import time as _time

        cb = LLMCircuitBreaker(
            backend_name="test_mid_retry_fail",
            config=LLMCircuitBreakerConfig(
                max_failures=1,
                reset_time_seconds=999.0,
                enabled=True,
            ),
        )

        # Trip the breaker to OPEN
        cb.record_failure(error_type="test")
        assert cb.state == CircuitState.OPEN

        # Expire the window so should_block() transitions to HALF_OPEN
        cb._open_until = _time.monotonic() - 1.0

        async def always_fail():
            raise ValueError("probe failure")

        with pytest.raises(ValueError, match="probe failure"):
            await cb.call_with_retry(always_fail, max_retries=1)

        # Probe failed -- breaker must be re-opened, flag must be cleared
        assert cb._half_open_probe_active is False, "_half_open_probe_active must be False after failed probe"
        assert cb.state == CircuitState.OPEN, "Breaker must be OPEN after failed probe"

⚠️ Potential issue | 🟡 Minor

These “mid-retry” regressions never enter the retry-ownership branch.

Both tests expire cb._open_until before the first call and use max_retries=1, so OPEN -> HALF_OPEN happens in the initial gate and Lines 454-458 are never executed. As written, these would still pass if _owns_probe were only snapshotted once before the loop. Please make attempt 1 fail while the breaker is still OPEN, then let a later retry acquire HALF_OPEN so this regression actually covers the stale-owner case.


@amabito
Contributor Author

amabito commented Apr 11, 2026

Pushed 847a9fb addressing all four findings.

llm_circuit_breaker.py

  • _probe_was_half_open was a stale pre-loop snapshot. Replaced with _owns_probe, set under the lock on each attempt that finds the breaker in HALF_OPEN. The except BaseException cleanup now reads the dynamic flag, so a mid-retry OPEN -> HALF_OPEN transition no longer leaks _half_open_probe_active=True. Regression covered by TestAdversarialMidRetryProbeTransfer (two cases, asyncio.Event-driven timeline).
  • Seven _safe_emit(_metrics.record_state_transition, ...) and record_request call sites no longer fire under self._lock. Each captures a _transition_args tuple inside the with self._lock: block, exits the lock, then emits. State mutation stays under the lock; only the metrics callback is moved out.
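
The capture-then-emit pattern from the second bullet, as a minimal sketch. The class and callback names are illustrative, and a plain non-reentrant `threading.Lock` is used on purpose to make the hazard visible; the lock type here is an assumption, not a statement about the real breaker.

```python
import threading
from typing import Callable, Optional, Tuple

Transition = Tuple[str, str]


class Breaker:
    """Hypothetical breaker that emits transitions after releasing its lock."""

    def __init__(self, on_transition: Callable[[str, str], None]) -> None:
        self._lock = threading.Lock()  # deliberately non-reentrant
        self._state = "closed"
        self._on_transition = on_transition

    @property
    def state(self) -> str:
        with self._lock:
            return self._state

    def force_open(self) -> None:
        transition: Optional[Transition] = None
        with self._lock:
            if self._state != "open":
                transition = (self._state, "open")  # capture under the lock
                self._state = "open"                # mutate under the lock
        if transition is not None:
            # Emit only after the lock is released, so a callback that reads
            # breaker.state (or calls reset()) cannot deadlock or re-enter.
            self._on_transition(*transition)
```

A callback that reads `breaker.state` would deadlock here if it were invoked inside the `with self._lock:` block; invoked after the block, it sees the already-updated state.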

prometheus.py

Google-style docstrings added to __init__, get_registry, _ensure_metrics, _state_value.

MANIFEST.in

recursive-include massgen/observability/dashboards *.json so the Grafana JSON ships in the sdist/wheel.

pytest massgen/tests/test_cb_observability* -- 56 passed. pre-commit run clean.

@Henry-811 Henry-811 changed the base branch from main to dev/v0.1.76 April 13, 2026 16:05
@Henry-811 Henry-811 merged commit 3a025df into massgen:dev/v0.1.76 Apr 13, 2026
11 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Apr 13, 2026
18 tasks