Sharding rebalance hardening + sharded-daemon failover

**Depends on #34** (Multi-Node-Spec).  Without that harness, the failover paths can't be tested deterministically.

The sharding code (ShardCoordinator, ShardRegion, AllocationStrategy) exists and works under happy paths.  What's missing is explicit coverage of the unpleasant ones — coordinator dies mid-rebalance, region dies during handoff, network partition between coordinator and region, ShardedDaemonProcess loses its anchor and needs to be respawned elsewhere.

**Three parallel strands of work:**

**a) Failure-injection tests** — using the multi-node harness from #34 with hooks that deterministically introduce crashes / partitions:

- Crash coordinator after BeginHandoff, before HandoffAcked → the new coordinator must clean up the half-done handoff.
- Crash region after EntityStarted, before first command dispatch → coordinator detects down + reallocates.
- Partition between coordinator and region → coordinator marks region unreachable, allocates away, region recovers when partition heals.

**b) Idempotency audit** in the coordinator code: every message (GetShardHome, AllocateShard, BeginHandoff, etc.) must robustly handle 'already seen / already processed'.  Today this is implicit; we run a pass through the code and stamp it explicitly.

**c) Daemon failover** specific to ShardedDaemonProcess — it has a 'fixed N workers' guarantee.  When a node with an active worker dies, the coordinator must start a new worker on another node within the downing window.  Means adding a heartbeat / liveness check that watches the workers.

**Components:**

| File | Task |
|---|---|
| src/cluster/sharding/ShardCoordinator.ts | idempotency audit + recovery-after-restart logic |
| src/cluster/sharding/ShardRegion.ts | dito + handoff-mid-stream cleanup |
| src/cluster/sharding/ShardedDaemonProcess.ts | liveness heartbeat |
| tests/multi-node/sharding-failover.test.ts (new) | 5-7 failure-injection scenarios |

**Estimate:** 5-8 days.  Recommended order: directly after #34 (same test stack).

**Verification:**

- Before: tests stay green under normal conditions.
- After: tests stay green under all 5-7 failure scenarios with realistic recovery times (coordinator restart < 5 s).

See [the roadmap plan](../blob/main/.claude-plan-snapshot.md) for full context (item 3 of 5).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sharding rebalance hardening + sharded-daemon failover #35

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

File	Task
src/cluster/sharding/ShardCoordinator.ts	idempotency audit + recovery-after-restart logic
src/cluster/sharding/ShardRegion.ts	dito + handoff-mid-stream cleanup
src/cluster/sharding/ShardedDaemonProcess.ts	liveness heartbeat
tests/multi-node/sharding-failover.test.ts (new)	5-7 failure-injection scenarios

Sharding rebalance hardening + sharded-daemon failover #35

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions