Skip to content

Sharding rebalance hardening + sharded-daemon failover #35

@pathosDev

Description

@pathosDev

Depends on #34 (Multi-Node-Spec). Without that harness, the failover paths can't be tested deterministically.

The sharding code (ShardCoordinator, ShardRegion, AllocationStrategy) exists and works under happy paths. What's missing is explicit coverage of the unpleasant ones — coordinator dies mid-rebalance, region dies during handoff, network partition between coordinator and region, ShardedDaemonProcess loses its anchor and needs to be respawned elsewhere.

Three parallel strands of work:

a) Failure-injection tests — using the multi-node harness from #34 with hooks that deterministically introduce crashes / partitions:

  • Crash coordinator after BeginHandoff, before HandoffAcked → the new coordinator must clean up the half-done handoff.
  • Crash region after EntityStarted, before first command dispatch → coordinator detects down + reallocates.
  • Partition between coordinator and region → coordinator marks region unreachable, allocates away, region recovers when partition heals.

b) Idempotency audit in the coordinator code: every message (GetShardHome, AllocateShard, BeginHandoff, etc.) must robustly handle 'already seen / already processed'. Today this is implicit; we run a pass through the code and stamp it explicitly.

c) Daemon failover specific to ShardedDaemonProcess — it has a 'fixed N workers' guarantee. When a node with an active worker dies, the coordinator must start a new worker on another node within the downing window. Means adding a heartbeat / liveness check that watches the workers.

Components:

File Task
src/cluster/sharding/ShardCoordinator.ts idempotency audit + recovery-after-restart logic
src/cluster/sharding/ShardRegion.ts dito + handoff-mid-stream cleanup
src/cluster/sharding/ShardedDaemonProcess.ts liveness heartbeat
tests/multi-node/sharding-failover.test.ts (new) 5-7 failure-injection scenarios

Estimate: 5-8 days. Recommended order: directly after #34 (same test stack).

Verification:

  • Before: tests stay green under normal conditions.
  • After: tests stay green under all 5-7 failure scenarios with realistic recovery times (coordinator restart < 5 s).

See the roadmap plan for full context (item 3 of 5).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions