Depends on #34 (Multi-Node-Spec). Without that harness, the failover paths can't be tested deterministically.
The sharding code (ShardCoordinator, ShardRegion, AllocationStrategy) exists and works under happy paths. What's missing is explicit coverage of the unpleasant ones — coordinator dies mid-rebalance, region dies during handoff, network partition between coordinator and region, ShardedDaemonProcess loses its anchor and needs to be respawned elsewhere.
Three parallel strands of work:
a) Failure-injection tests — using the multi-node harness from #34 with hooks that deterministically introduce crashes / partitions:
- Crash coordinator after BeginHandoff, before HandoffAcked → the new coordinator must clean up the half-done handoff.
- Crash region after EntityStarted, before first command dispatch → coordinator detects down + reallocates.
- Partition between coordinator and region → coordinator marks region unreachable, allocates away, region recovers when partition heals.
b) Idempotency audit in the coordinator code: every message (GetShardHome, AllocateShard, BeginHandoff, etc.) must robustly handle 'already seen / already processed'. Today this is implicit; we run a pass through the code and stamp it explicitly.
c) Daemon failover specific to ShardedDaemonProcess — it has a 'fixed N workers' guarantee. When a node with an active worker dies, the coordinator must start a new worker on another node within the downing window. Means adding a heartbeat / liveness check that watches the workers.
Components:
| File |
Task |
| src/cluster/sharding/ShardCoordinator.ts |
idempotency audit + recovery-after-restart logic |
| src/cluster/sharding/ShardRegion.ts |
dito + handoff-mid-stream cleanup |
| src/cluster/sharding/ShardedDaemonProcess.ts |
liveness heartbeat |
| tests/multi-node/sharding-failover.test.ts (new) |
5-7 failure-injection scenarios |
Estimate: 5-8 days. Recommended order: directly after #34 (same test stack).
Verification:
- Before: tests stay green under normal conditions.
- After: tests stay green under all 5-7 failure scenarios with realistic recovery times (coordinator restart < 5 s).
See the roadmap plan for full context (item 3 of 5).
Depends on #34 (Multi-Node-Spec). Without that harness, the failover paths can't be tested deterministically.
The sharding code (ShardCoordinator, ShardRegion, AllocationStrategy) exists and works under happy paths. What's missing is explicit coverage of the unpleasant ones — coordinator dies mid-rebalance, region dies during handoff, network partition between coordinator and region, ShardedDaemonProcess loses its anchor and needs to be respawned elsewhere.
Three parallel strands of work:
a) Failure-injection tests — using the multi-node harness from #34 with hooks that deterministically introduce crashes / partitions:
b) Idempotency audit in the coordinator code: every message (GetShardHome, AllocateShard, BeginHandoff, etc.) must robustly handle 'already seen / already processed'. Today this is implicit; we run a pass through the code and stamp it explicitly.
c) Daemon failover specific to ShardedDaemonProcess — it has a 'fixed N workers' guarantee. When a node with an active worker dies, the coordinator must start a new worker on another node within the downing window. Means adding a heartbeat / liveness check that watches the workers.
Components:
Estimate: 5-8 days. Recommended order: directly after #34 (same test stack).
Verification:
See the roadmap plan for full context (item 3 of 5).