Persistent ShardCoordinator state via DistributedData

**Follow-up to #35.**  Today the `ShardCoordinator` rebuilds its allocation map (`shardHome`, `regions`, `entitiesPerShard`) entirely from `RegisterRegion` gossip on every `LeaderChanged` event.  For small clusters this is fine — a few hundred shards re-register in well under a second.  For clusters with thousands of shards plus `rememberEntities`, leader failover triggers a brief reallocation storm: every region re-registers, the new leader's coordinator runs `tryAllocate` for every shard from scratch, and entities are briefly unreachable.

**Strategy:** persist the coordinator's authoritative state in `DistributedData` (which now exists, see #37) under a well-known key.  On leader-change the new leader reads the snapshot, applies pending re-registrations as deltas, and skips full re-allocation.  When DD is unavailable (single-node dev, no extension started) we fall back to the current rebuild-from-gossip path.

**Components:**

| File | Task |
|---|---|
| \`src/cluster/sharding/ShardCoordinator.ts\` | Persist \`{regions, shardHome, entitiesPerShard}\` to DD on every meaningful state transition; load on \`onLeaderChanged\` if \`isLeader\`. |
| \`src/cluster/sharding/CoordinatorState.ts\` (new) | Serialisable shape + a CRDT-friendly merge (per-shard LWW on \`shardHome\`, per-region LWW on \`regions\`). |
| \`tests/multi-node/sharding-coordinator-recovery.test.ts\` (new) | 3-node test: leader crashes mid-allocation, new leader recovers state in < 500 ms, entities reachable without rebalance. |

**Estimate:** 3-5 days.

**Verification:**
- Existing sharding tests stay green.
- New multi-node test confirms allocations survive a leader crash without re-running \`tryAllocate\` for every shard.

**Out of scope:**
- Cross-cluster federation (DD is intra-cluster only).
- Coordinator state for ShardedDaemonProcess specifically (covered by the existing rememberEntities path; DD-backing comes for free).

File	Task
`src/cluster/sharding/ShardCoordinator.ts`	Persist `{regions, shardHome, entitiesPerShard}` to DD on every meaningful state transition; load on `onLeaderChanged` if `isLeader`.
`src/cluster/sharding/CoordinatorState.ts` (new)	Serialisable shape + a CRDT-friendly merge (per-shard LWW on `shardHome`, per-region LWW on `regions`).
`tests/multi-node/sharding-coordinator-recovery.test.ts` (new)	3-node test: leader crashes mid-allocation, new leader recovers state in < 500 ms, entities reachable without rebalance.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Persistent ShardCoordinator state via DistributedData #39

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Persistent ShardCoordinator state via DistributedData #39

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions