Skip to content

Persistent ShardCoordinator state via DistributedData #39

@pathosDev

Description

@pathosDev

Follow-up to #35. Today the ShardCoordinator rebuilds its allocation map (shardHome, regions, entitiesPerShard) entirely from RegisterRegion gossip on every LeaderChanged event. For small clusters this is fine — a few hundred shards re-register in well under a second. For clusters with thousands of shards plus rememberEntities, leader failover triggers a brief reallocation storm: every region re-registers, the new leader's coordinator runs tryAllocate for every shard from scratch, and entities are briefly unreachable.

Strategy: persist the coordinator's authoritative state in DistributedData (which now exists, see #37) under a well-known key. On leader-change the new leader reads the snapshot, applies pending re-registrations as deltas, and skips full re-allocation. When DD is unavailable (single-node dev, no extension started) we fall back to the current rebuild-from-gossip path.

Components:

File Task
`src/cluster/sharding/ShardCoordinator.ts` Persist `{regions, shardHome, entitiesPerShard}` to DD on every meaningful state transition; load on `onLeaderChanged` if `isLeader`.
`src/cluster/sharding/CoordinatorState.ts` (new) Serialisable shape + a CRDT-friendly merge (per-shard LWW on `shardHome`, per-region LWW on `regions`).
`tests/multi-node/sharding-coordinator-recovery.test.ts` (new) 3-node test: leader crashes mid-allocation, new leader recovers state in < 500 ms, entities reachable without rebalance.

Estimate: 3-5 days.

Verification:

  • Existing sharding tests stay green.
  • New multi-node test confirms allocations survive a leader crash without re-running `tryAllocate` for every shard.

Out of scope:

  • Cross-cluster federation (DD is intra-cluster only).
  • Coordinator state for ShardedDaemonProcess specifically (covered by the existing rememberEntities path; DD-backing comes for free).

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions