ShardCoordinator: optional Lease for split-brain-safe coordinator handover #60

@pathosDev

Spawned out of #38 (ClusterSingleton + Lease). The same split-brain reasoning applies to the `ShardCoordinator`: it is effectively a cluster-wide singleton, where only the leader's instance is active and every other node's coordinator runs as a warm standby. Under a network partition, gossip can converge on a different leader on each side, so two coordinators are briefly active and each issues `HandOff` / `AllocateShard` directives. That's worse than singleton split-brain because it can corrupt shard ownership state across the cluster.

Strategy: mirror the #38 pattern. Add an optional `lease?: Lease` to `ClusterSharding.start(...)` settings. When provided, only the lease holder's coordinator processes incoming messages — the others stay passive even if they think they're the leader.

```ts
const lease = new KubernetesLease({
  name: 'app-sharding-coordinator',
  namespace: 'default',
  owner: process.env.HOSTNAME,
  ttlMs: 30_000,
});

ClusterSharding.get(system, cluster).start({
  typeName: 'entity',
  entityProps: Props.create(() => new MyEntity()),
  extractEntityId: (m) => m.id,
  numShards: 64,
  lease, // ← only the lease holder processes shard messages
});
```

Behaviour:

  • The coordinator's existing `isLeader()` check stays — local cheap predicate.
  • A NEW `isActive()` predicate ANDs `isLeader()` with `lease.checkAlive()`. All message handling gates on `isActive()`.
  • On becoming leader, the coordinator calls `lease.acquire()` before serving requests. While acquire is pending, queries are buffered (already partially the case).
  • On `lease.onLost`, the coordinator becomes passive immediately — buffered queries get retried next time `isActive()` flips back true.
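
The gating rules above can be sketched in isolation. This is a minimal illustration, not the actual `ShardCoordinator` code: the `Lease` interface shape (promise-returning `acquire()` / `release()`, a synchronous `checkAlive()`, and `onLost` as a listener registration) is an assumption inferred from the calls named in this issue, and `CoordinatorGate`, `leaderUp`, `leaderDown`, and `receive` are hypothetical names.

```typescript
// Assumed Lease shape, inferred from the calls named in this issue.
// The real interface introduced in #38 may differ.
interface Lease {
  acquire(): Promise<boolean>;
  release(): Promise<void>;
  checkAlive(): boolean;
  onLost(listener: () => void): void;
}

// Hypothetical stand-in that models only the gating logic of the coordinator.
class CoordinatorGate<M> {
  private leader = false;
  private holdsLease = false;
  private readonly buffered: M[] = [];

  constructor(
    private readonly handle: (msg: M) => void,
    private readonly lease?: Lease,
  ) {
    // Losing the lease makes the gate passive immediately.
    this.lease?.onLost(() => {
      this.holdsLease = false;
    });
  }

  // Cheap local predicate, unchanged by the lease option.
  isLeader(): boolean {
    return this.leader;
  }

  // Without a lease this reduces to isLeader(), so the existing path is unchanged.
  isActive(): boolean {
    if (!this.leader) return false;
    if (!this.lease) return true;
    return this.holdsLease && this.lease.checkAlive();
  }

  // Called on becoming leader: acquire first, then serve whatever was buffered.
  async leaderUp(): Promise<void> {
    this.leader = true;
    if (this.lease) {
      this.holdsLease = await this.lease.acquire();
    }
    this.drain();
  }

  // Called on losing leadership: go passive, then give the lease back.
  async leaderDown(): Promise<void> {
    this.leader = false;
    if (this.lease && this.holdsLease) {
      this.holdsLease = false;
      await this.lease.release();
    }
  }

  // Every message goes through the buffer so ordering is preserved.
  receive(msg: M): void {
    this.buffered.push(msg);
    this.drain();
  }

  private drain(): void {
    while (this.isActive() && this.buffered.length > 0) {
      this.handle(this.buffered.shift()!);
    }
  }
}
```

In this sketch a message that arrives while the gate is passive stays in the buffer and is handled on the next successful `leaderUp()`, which is one way to read "retried next time `isActive()` flips back true".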

Components:

| File | Task |
| --- | --- |
| `src/cluster/sharding/ShardCoordinator.ts` | Accept `lease?: Lease` in settings. `isActive()` gate. Acquire on leader-up; release on leader-down. |
| `src/cluster/sharding/ClusterSharding.ts` | Pass-through plumbing for the new option. |
| `tests/unit/cluster/sharding/ShardingLease.test.ts` (new) | acquire-blocks-coordinator-activity, lost-lease-stops-activity, release-on-leave. |
| `tests/multi-node/sharding-lease-split-brain.test.ts` (new) | Multi-node: simulate split → only one side acquires the lease → only one side processes shards. |

Estimate: 2-3 days.

Verification:

  • Without lease: existing tests stay green (path unchanged).
  • With lease + InMemoryLease: only the holder's coordinator dispatches `AllocateShard` and friends.
  • Multi-node: forced split-brain via `MultiNodeSpec.partition(...)` produces ONE active coordinator, not two.
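
As a rough picture of what the split-brain check is meant to establish, here is a toy model in which both sides of a partition believe they are leader but share one lease backing store. Everything here is hypothetical (`LeaseStore`, `SideCoordinator`, the node names); it does not use `InMemoryLease` or `MultiNodeSpec`, whose APIs are not shown in this issue.

```typescript
// Hypothetical single-holder lease store that both sides can still reach.
class LeaseStore {
  private holder: string | undefined;

  tryAcquire(owner: string): boolean {
    if (this.holder === undefined || this.holder === owner) {
      this.holder = owner;
      return true;
    }
    return false;
  }

  release(owner: string): void {
    if (this.holder === owner) this.holder = undefined;
  }

  isHeldBy(owner: string): boolean {
    return this.holder === owner;
  }
}

// Hypothetical coordinator on one side of the partition.
class SideCoordinator {
  readonly dispatched: string[] = [];
  private leader = false;

  constructor(
    readonly owner: string,
    private readonly store: LeaseStore,
  ) {}

  // Each side's local view says "I am leader"; the lease decides who acts.
  becomeLeader(): void {
    this.leader = true;
    this.store.tryAcquire(this.owner);
  }

  isActive(): boolean {
    return this.leader && this.store.isHeldBy(this.owner);
  }

  // Dispatches only when active; returns whether the directive was issued.
  allocate(shardId: string): boolean {
    if (!this.isActive()) return false;
    this.dispatched.push(`AllocateShard(${shardId})`);
    return true;
  }
}

// Simulated split: both sides elect themselves, only one wins the lease.
const store = new LeaseStore();
const sideA = new SideCoordinator('node-a', store);
const sideB = new SideCoordinator('node-b', store);
sideA.becomeLeader();
sideB.becomeLeader();
sideA.allocate('7');
sideB.allocate('7');
```

After this runs, only `sideA` has recorded a dispatch, which is the "ONE active coordinator, not two" property the multi-node test is expected to assert against the real implementation.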

Out of scope:

Depends on: #38 (we re-use the state-machine pattern from `ClusterSingletonManager`).
