v0.6.0 — Crash recovery for stale grants

@github-actions github-actions released this 09 May 18:21
· 254 commits to main since this release
76e2699

pip install "agent-coherence[langgraph]==0.6.0"

When an agent crashes (OOM-kill, segfault) or livelocks holding a write grant, every other agent is blocked from writing the same artifact. v0.6 reclaims those grants automatically: heartbeats piggyback on every read/write, an enforce_stable_grant_timeouts sweep on the coordinator reclaims stale holders, and a recover() primitive on every adapter flushes stale local cache after a process restart.

Crash recovery ships behind a feature flag for now: CrashRecoveryConfig(enabled=False) is the default. Turning it on by default is gated on dogfood validation and is tracked as a separate release.

Highlights

Crash recovery (opt-in via CrashRecoveryConfig(enabled=True, ...))

  • Coordinator: record_heartbeat RPC, enforce_stable_grant_timeouts sweep, granted_at_tick lifecycle, reclamation-slot bookkeeping, composition fail-fast (max_hold_ticks > lease_ttl_ticks)
  • Adapters: every framework adapter (CCSStore, LangGraphAdapter, CrewAIAdapter, AutoGenAdapter, CoherenceAdapterCore) accepts crash_recovery= and exposes heartbeat() / recover()
  • Agent runtime: invalidate_all primitive for post-restart cache flush
  • Simulation: kill / busy / restore failure injection, sweep call-site, heartbeat emission, combined validation scenario
  • State log: two new triggers — reclaim_heartbeat and reclaim_max_hold — surface reclamation events so production incidents leave a trail
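
To make the sweep concrete, here is a minimal toy model of how a coordinator might decide that a grant is stale. It is a sketch under stated assumptions, not the library's implementation: the Grant class, the reclaim_trigger and sweep functions, the strict ">" comparisons, and the order in which the two checks run are all hypothetical. Only the trigger names (reclaim_heartbeat, reclaim_max_hold) and the config field names come from these notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Grant:
    """Hypothetical record of one write grant (not the library's type)."""
    holder: str
    granted_at_tick: int
    last_heartbeat_tick: int


def reclaim_trigger(
    grant: Grant,
    now_tick: int,
    heartbeat_timeout_ticks: int,
    max_hold_ticks: int,
) -> Optional[str]:
    """Return the trigger name if the grant looks stale, else None.

    Assumption: a missed heartbeat is checked before the hold limit,
    so each reclamation reports exactly one trigger.
    """
    if now_tick - grant.last_heartbeat_tick > heartbeat_timeout_ticks:
        return "reclaim_heartbeat"
    if now_tick - grant.granted_at_tick > max_hold_ticks:
        return "reclaim_max_hold"
    return None


def sweep(
    grants: dict[str, Grant],
    now_tick: int,
    heartbeat_timeout_ticks: int = 10,
    max_hold_ticks: int = 1000,
) -> list[tuple[str, str, str]]:
    """Remove stale grants in place; return (artifact, holder, trigger) events."""
    events = []
    for artifact in list(grants):  # copy keys: we mutate while iterating
        trigger = reclaim_trigger(
            grants[artifact], now_tick, heartbeat_timeout_ticks, max_hold_ticks
        )
        if trigger is not None:
            holder = grants.pop(artifact).holder
            events.append((artifact, holder, trigger))
    return events
```

In this model a crashed holder stops sending heartbeats and is reclaimed via reclaim_heartbeat, while a livelocked holder that keeps heartbeating is eventually reclaimed via reclaim_max_hold.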

Formal verification (Gate 1 closed)

  • formal/tla/MESI.tla (I1 SingleWriter, I2 MonotonicVersion)
  • formal/tla/CrashRecovery.tla (I3 SweepExclusivity, I4 TriggerExclusivity, I5 TickMonotonicity, I6 SlotSurvival)
  • make tla-check runs TLC on every push and PR
  • Unblocks backlog item H (OCC write API) — Gate 2 also closed
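
For readers unfamiliar with these invariants, the sketch below gives an informal Python reading of the two MESI properties, based only on their names. The TLA+ specs are not reproduced in these notes, so the function names, data shapes, and exact definitions here are assumptions for illustration rather than a translation of formal/tla/MESI.tla.

```python
def check_single_writer(write_holders: dict[str, set[str]]) -> bool:
    """Informal I1: each artifact has at most one agent holding a write grant.

    write_holders maps an artifact name to the set of agents that
    currently hold a write grant on it (hypothetical data shape).
    """
    return all(len(holders) <= 1 for holders in write_holders.values())


def check_monotonic_version(versions: list[int]) -> bool:
    """Informal I2: the observed versions of an artifact never decrease."""
    return all(a <= b for a, b in zip(versions, versions[1:]))
```

A model checker such as TLC explores every reachable state and checks properties like these in each one; the functions above only check a single snapshot or a single observed history.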

Repo hygiene

  • Bump __version__ 0.5.0 → 0.6.0
  • README rewritten end-user-first; crash-recovery quickstart in §Quick start
  • guide.md: CrashRecoveryConfig field reference table; Disabling / rollback subsection so the full enablement story (config → behavior → rollback) lives in one place
  • Internal specs untracked from the public repo; only guide.md, why-coherence-matters.md, and agent-coherence-approach.md are tracked under docs/. .gitignore whitelist enforces this on a fresh clone.

Quick start

from ccs.adapters import CCSStore
from ccs.coordinator.service import CrashRecoveryConfig

store = CCSStore(
    strategy="lazy",
    crash_recovery=CrashRecoveryConfig(
        enabled=True,
        heartbeat_timeout_ticks=10,
        max_hold_ticks=1000,
    ),
)

# Heartbeats piggyback on every read/write/batch automatically.
# After a process restart, call recover() to flush stale cache.
# current_tick is the caller's current logical tick; it is not defined in this snippet.
store.recover(agent_name="planner", now_tick=current_tick)

The same crash_recovery= kwarg works on LangGraphAdapter, CrewAIAdapter, AutoGenAdapter, and CoherenceAdapterCore.
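
If a process owns several adapters, a small startup helper can call recover() on each of them after a restart. The sketch below is hypothetical: the Recoverable protocol and recover_all function are not part of the package, and it assumes every adapter accepts the same agent_name= and now_tick= keyword arguments shown for CCSStore above.

```python
from typing import Iterable, Protocol


class Recoverable(Protocol):
    """Anything with a recover() method shaped like the CCSStore call above."""

    def recover(self, *, agent_name: str, now_tick: int) -> object: ...


def recover_all(
    adapters: Iterable[Recoverable], agent_name: str, now_tick: int
) -> int:
    """Call recover() on each adapter; return how many were recovered."""
    count = 0
    for adapter in adapters:
        adapter.recover(agent_name=agent_name, now_tick=now_tick)
        count += 1
    return count
```

Calling it once at process start, before the agent issues any reads or writes, keeps the flush in one place instead of scattering recover() calls across framework-specific setup code.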

What's NOT in this release

CrashRecoveryConfig(enabled=False) remains the default. Changing the default to enabled=True is planned as its own release, after dogfood validation with the feature explicitly opted in.

Compatibility

  • Python 3.11+
  • The protocol is byte-identical to v0.5 when crash_recovery= is not passed (or enabled=False). Existing v0.5 users see no behavior change.

Rollback

Pin to v0.5.0:

pip install "agent-coherence[langgraph]==0.5.0"

Or omit crash_recovery= to keep using v0.6 with crash recovery disabled.


Full Changelog: v0.5.0...v0.6.0