Spawned out of #38 (ClusterSingleton + Lease). The same split-brain reasoning applies to the `ShardCoordinator`: it is effectively a cluster-wide singleton, where only the leader's instance is active and every other node's coordinator runs as a warm standby. Under a network partition, each side of the cluster can converge on its own leader view, so two coordinators are briefly active and each issues `HandOff` / `AllocateShard` directives. That's worse than singleton split-brain because it can corrupt shard ownership state across the cluster.
Strategy: mirror the #38 pattern. Add an optional `lease?: Lease` to `ClusterSharding.start(...)` settings. When provided, only the lease holder's coordinator processes incoming messages — the others stay passive even if they think they're the leader.
```ts
const lease = new KubernetesLease({
  name: 'app-sharding-coordinator',
  namespace: 'default',
  owner: process.env.HOSTNAME,
  ttlMs: 30_000,
});

ClusterSharding.get(system, cluster).start({
  typeName: 'entity',
  entityProps: Props.create(() => new MyEntity()),
  extractEntityId: (m) => m.id,
  numShards: 64,
  lease, // only the lease holder processes shard messages
});
```
Behaviour:
- The coordinator's existing `isLeader()` check stays as a cheap, local predicate.
- A new `isActive()` predicate ANDs `isLeader()` with `lease.checkAlive()`. All message handling gates on `isActive()`.
- On becoming leader, the coordinator calls `lease.acquire()` before serving requests. While the acquire is pending, queries are buffered (this is already partially the case).
- On `lease.onLost`, the coordinator becomes passive immediately; buffered queries are retried the next time `isActive()` flips back to true.
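The gating behaviour above could be sketched roughly as follows. This is a minimal illustration, not the actual `ShardCoordinator` code: the `Lease` shape, the `LeaseGatedCoordinator` class, and the `becomeLeader` / `leaderDown` / `receive` method names are all hypothetical, and the real `Lease` interface from #38 may differ. The sketch leaves out the `onLost` callback and instead re-checks `checkAlive()` for every message, which has the same "passive immediately" effect.

```ts
// Hypothetical Lease shape (assumption); the real interface comes from #38.
interface Lease {
  acquire(): Promise<boolean>;
  release(): Promise<void>;
  checkAlive(): boolean;
}

type Handler<M> = (msg: M) => void;

// Hypothetical stand-in for the coordinator's message gate.
class LeaseGatedCoordinator<M> {
  private leader = false;
  private buffered: M[] = [];

  constructor(
    private readonly handle: Handler<M>,
    private readonly lease?: Lease,
  ) {}

  // Cheap local predicate, unchanged by this proposal.
  isLeader(): boolean {
    return this.leader;
  }

  // Leader AND lease alive. Without a lease this reduces to isLeader().
  isActive(): boolean {
    return this.leader && (this.lease === undefined || this.lease.checkAlive());
  }

  // Leader-up: acquire the lease first, then serve anything buffered meanwhile.
  async becomeLeader(): Promise<void> {
    this.leader = true;
    if (this.lease) await this.lease.acquire();
    this.drain();
  }

  // Leader-down: stop serving and give the lease back.
  async leaderDown(): Promise<void> {
    this.leader = false;
    if (this.lease) await this.lease.release();
  }

  // Every message goes through the buffer so FIFO order is preserved.
  receive(msg: M): void {
    this.buffered.push(msg);
    this.drain();
  }

  // Process buffered messages for as long as this instance stays active.
  private drain(): void {
    while (this.isActive() && this.buffered.length > 0) {
      this.handle(this.buffered.shift()!);
    }
  }
}
```

Routing every message through the buffer keeps ordering intact when the lease is lost and later regained: messages that arrived while passive are handled before newer ones.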
Components:
| File | Task |
| --- | --- |
| `src/cluster/sharding/ShardCoordinator.ts` | Accept `lease?: Lease` in settings. `isActive()` gate. Acquire on leader-up; release on leader-down. |
| `src/cluster/sharding/ClusterSharding.ts` | Pass-through plumbing for the new option. |
| `tests/unit/cluster/sharding/ShardingLease.test.ts` (new) | acquire-blocks-coordinator-activity, lost-lease-stops-activity, release-on-leave. |
| `tests/multi-node/sharding-lease-split-brain.test.ts` (new) | Multi-node: simulate split → only one side acquires the lease → only one side processes shards. |
Estimate: 2-3 days.
Verification:
- Without lease: existing tests stay green (path unchanged).
- With lease + InMemoryLease: only the holder's coordinator dispatches `AllocateShard` and friends.
- Multi-node: forced split-brain via `MultiNodeSpec.partition(...)` produces ONE active coordinator, not two.
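For the in-memory checks, the property that matters is mutual exclusion: two owners contending for the same lease name must never both hold it. The sketch below shows one way an in-memory lease could enforce that. `SketchLeaseStore` and `SketchInMemoryLease` are hypothetical names invented for this illustration; the repo's actual `InMemoryLease` may be structured differently.

```ts
// Hypothetical shared registry (assumption): lease name -> current owner.
class SketchLeaseStore {
  private holders = new Map<string, string>();

  // Succeeds if the lease is free or already held by the same owner.
  tryAcquire(name: string, owner: string): boolean {
    const current = this.holders.get(name);
    if (current !== undefined && current !== owner) return false;
    this.holders.set(name, owner);
    return true;
  }

  // Only the current holder can release.
  release(name: string, owner: string): void {
    if (this.holders.get(name) === owner) this.holders.delete(name);
  }

  holder(name: string): string | undefined {
    return this.holders.get(name);
  }
}

// Hypothetical stand-in for an in-memory lease used by the unit tests.
class SketchInMemoryLease {
  constructor(
    private readonly store: SketchLeaseStore,
    private readonly name: string,
    private readonly owner: string,
  ) {}

  async acquire(): Promise<boolean> {
    return this.store.tryAcquire(this.name, this.owner);
  }

  async release(): Promise<void> {
    this.store.release(this.name, this.owner);
  }

  checkAlive(): boolean {
    return this.store.holder(this.name) === this.owner;
  }
}
```

In a split-brain unit test, both simulated sides would share one store, so only the first side to call `acquire()` sees `checkAlive()` return true.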
Out of scope:
Depends on: #38 (we re-use the state-machine pattern from `ClusterSingletonManager`).