v0.6.0 — Crash recovery for stale grants
    pip install "agent-coherence[langgraph]==0.6.0"

When an agent crashes (OOM-kill, segfault) or livelocks holding a write grant, every other agent is blocked from writing the same artifact. v0.6 reclaims those grants automatically: heartbeats piggyback on every read/write, an enforce_stable_grant_timeouts sweep on the coordinator reclaims stale holders, and a recover() primitive on every adapter flushes stale local cache after a process restart.
Crash recovery is behind a feature flag for now: CrashRecoveryConfig(enabled=False) is the default. Flipping the default to on is gated on dogfood validation and tracked as a separate release.
Highlights
Crash recovery (opt-in via CrashRecoveryConfig(enabled=True, ...))
- Coordinator: record_heartbeat RPC, enforce_stable_grant_timeouts sweep, granted_at_tick lifecycle, reclamation-slot bookkeeping, composition fail-fast (max_hold_ticks > lease_ttl_ticks)
- Adapters: every framework adapter (CCSStore, LangGraphAdapter, CrewAIAdapter, AutoGenAdapter, CoherenceAdapterCore) accepts crash_recovery= and exposes heartbeat() / recover()
- Agent runtime: invalidate_all primitive for post-restart cache flush
- Simulation: kill / busy / restore failure injection, sweep call-site, heartbeat emission, combined validation scenario
- State log: two new triggers, reclaim_heartbeat and reclaim_max_hold, surface reclamation events so production incidents leave a trail
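To make the two reclamation triggers concrete, here is a minimal toy model of a sweep, written from the description above rather than from the library's source. ToyCoordinator, Grant, grant_write, sweep, and the shape of the state-log tuples are hypothetical names for illustration. Only the trigger names, record_heartbeat, granted_at_tick, and the two timeout fields come from these notes, and the real coordinator may order or evaluate the checks differently.

```python
from dataclasses import dataclass, field


@dataclass
class Grant:
    holder: str
    granted_at_tick: int
    last_heartbeat_tick: int


@dataclass
class ToyCoordinator:
    """Illustrative stand-in, NOT the agent-coherence coordinator."""
    heartbeat_timeout_ticks: int
    max_hold_ticks: int
    grants: dict[str, Grant] = field(default_factory=dict)  # artifact -> grant
    state_log: list[tuple[int, str, str, str]] = field(default_factory=list)

    def grant_write(self, artifact: str, holder: str, now_tick: int) -> None:
        if artifact in self.grants:
            raise RuntimeError(f"{artifact} is held by {self.grants[artifact].holder}")
        self.grants[artifact] = Grant(holder, now_tick, now_tick)

    def record_heartbeat(self, holder: str, now_tick: int) -> None:
        # Refresh liveness for every grant this holder owns.
        for grant in self.grants.values():
            if grant.holder == holder:
                grant.last_heartbeat_tick = now_tick

    def sweep(self, now_tick: int) -> list[str]:
        """Reclaim stale grants; each reclamation logs exactly one trigger."""
        reclaimed = []
        for artifact, grant in list(self.grants.items()):
            if now_tick - grant.last_heartbeat_tick > self.heartbeat_timeout_ticks:
                trigger = "reclaim_heartbeat"   # holder went silent (crash)
            elif now_tick - grant.granted_at_tick > self.max_hold_ticks:
                trigger = "reclaim_max_hold"    # holder is alive but stuck
            else:
                continue
            del self.grants[artifact]
            self.state_log.append((now_tick, artifact, grant.holder, trigger))
            reclaimed.append(artifact)
        return reclaimed


coord = ToyCoordinator(heartbeat_timeout_ticks=10, max_hold_ticks=1000)
coord.grant_write("plan.md", "planner", now_tick=0)   # planner crashes after this
coord.grant_write("notes.md", "writer", now_tick=0)   # writer livelocks but heartbeats

coord.record_heartbeat("writer", now_tick=10)
first_sweep = coord.sweep(now_tick=11)     # only the silent planner is reclaimed

coord.record_heartbeat("writer", now_tick=1000)
second_sweep = coord.sweep(now_tick=1001)  # writer exceeds the hold limit
```

In this toy, the heartbeat timeout catches a holder that has gone silent, while the hold limit catches one that keeps heartbeating but never releases; that is one plausible reading of why both triggers exist.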
Formal verification (Gate 1 closed)
- formal/tla/MESI.tla (I1 SingleWriter, I2 MonotonicVersion)
- formal/tla/CrashRecovery.tla (I3 SweepExclusivity, I4 TriggerExclusivity, I5 TickMonotonicity, I6 SlotSurvival)
- make tla-check runs TLC on every push and PR
- Unblocks backlog item H (OCC write API); Gate 2 also closed
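For readers unfamiliar with TLA+ invariants, the sketch below paraphrases what I1 and I2 might assert, as plain Python predicates over a snapshot of per-agent cache states. This is an interpretation of the invariant names only: the actual definitions live in formal/tla/MESI.tla and may differ, and the function names and state encoding here are assumptions.

```python
def single_writer(states: dict[str, str]) -> bool:
    """Assumed reading of I1: at most one agent is in a write-capable
    state (M or E), and while it is, no other agent holds a valid copy."""
    writers = [agent for agent, s in states.items() if s in ("M", "E")]
    if not writers:
        return True
    if len(writers) > 1:
        return False
    return all(s == "I" for agent, s in states.items() if agent != writers[0])


def monotonic_version(versions: list[int]) -> bool:
    """Assumed reading of I2: an artifact's version never decreases
    across successive observed states."""
    return all(later >= earlier for earlier, later in zip(versions, versions[1:]))
```

A model checker such as TLC evaluates invariants like these in every reachable state of the specification, which is a much stronger guarantee than spot-checking them in tests.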
Repo hygiene
- Bump __version__ 0.5.0 → 0.6.0
- README rewritten end-user-first; crash-recovery quickstart in §Quick start
- guide.md: CrashRecoveryConfig field reference table; Disabling / rollback subsection so the full enablement story (config → behavior → rollback) lives in one place
- Internal specs untracked from the public repo; only guide.md, why-coherence-matters.md, and agent-coherence-approach.md are tracked under docs/. A .gitignore whitelist enforces this on a fresh clone.
Quick start
    from ccs.adapters import CCSStore
    from ccs.coordinator.service import CrashRecoveryConfig

    store = CCSStore(
        strategy="lazy",
        crash_recovery=CrashRecoveryConfig(
            enabled=True,
            heartbeat_timeout_ticks=10,
            max_hold_ticks=1000,
        ),
    )

    # Heartbeats piggyback on every read/write/batch automatically.
    # After a process restart, call recover() to flush stale cache:
    store.recover(agent_name="planner", now_tick=current_tick)

The same crash_recovery= kwarg works on LangGraphAdapter, CrewAIAdapter, AutoGenAdapter, and CoherenceAdapterCore.
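Why a post-restart flush matters can be shown with a small toy cache. This is an illustrative sketch, not the library's agent runtime: ToyAgentCache, its fetch callback, and the authoritative dict are hypothetical, and the cache object is kept alive across the simulated restart purely for demonstration. Only the invalidate_all name comes from these notes.

```python
class ToyAgentCache:
    """Illustrative read-through cache, NOT the agent-coherence runtime."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: artifact -> (version, value)
        self._entries = {}

    def read(self, artifact):
        # Serve from the local cache, falling back to the authoritative copy.
        if artifact not in self._entries:
            self._entries[artifact] = self._fetch(artifact)
        return self._entries[artifact]

    def invalidate_all(self):
        """Drop every cached entry and report how many were dropped."""
        dropped = len(self._entries)
        self._entries.clear()
        return dropped


authoritative = {"plan.md": (1, "draft")}
cache = ToyAgentCache(fetch=lambda artifact: authoritative[artifact])

before_crash = cache.read("plan.md")

# While the agent is down, its grant is reclaimed and another agent writes.
authoritative["plan.md"] = (2, "final")

stale_read = cache.read("plan.md")      # still the old entry
dropped = cache.invalidate_all()        # what a recover() step would trigger here
fresh_read = cache.read("plan.md")      # refetched from the authoritative copy
```

Without the flush, the restarted agent would keep acting on version 1 even though the coordinator has moved on.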
What's NOT in this release
CrashRecoveryConfig(enabled=False) remains the default. Flipping it is the next deliberate release after dogfood validation under opt-in enabled=True.
Compatibility
- Python 3.11+
- The protocol is byte-identical to v0.5 when crash_recovery= is not passed (or enabled=False). Existing v0.5 users see no behavior change.
Rollback
Pin to v0.5.0:
    pip install "agent-coherence[langgraph]==0.5.0"

Or omit crash_recovery= to keep using v0.6 with crash recovery disabled.
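Because omitting the kwarg is enough to disable the feature, one way to keep rollback a deployment-time decision is to build the adapter kwargs conditionally. The helper below is a sketch under assumptions: crash_recovery_kwargs and the CCS_CRASH_RECOVERY environment variable are hypothetical names and are not part of agent-coherence.

```python
import os


def crash_recovery_kwargs(make_config, environ=os.environ):
    """Return adapter kwargs for crash recovery.

    An empty dict means crash_recovery= is never passed, which these
    notes describe as identical to v0.5 behavior.
    CCS_CRASH_RECOVERY is a hypothetical variable name for this sketch.
    """
    if environ.get("CCS_CRASH_RECOVERY", "0") != "1":
        return {}
    # Build the config lazily so a disabled deployment never constructs it.
    return {"crash_recovery": make_config()}
```

It could then be spliced into the constructor call, for example CCSStore(strategy="lazy", **crash_recovery_kwargs(lambda: CrashRecoveryConfig(enabled=True, heartbeat_timeout_ticks=10, max_hold_ticks=1000))), so turning the feature off needs no code change.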
Full Changelog: v0.5.0...v0.6.0