ClusterSingleton: accept a Lease for split-brain protection #38

@pathosDev

Today ClusterSingleton.start({ typeName, entityProps }) runs the singleton on whichever node membership considers oldest. Under a network partition, each side can converge on its own view of which node is oldest, so the singleton briefly runs on two nodes at once: the classic split-brain failure mode.
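To make the failure mode concrete, here is a minimal sketch of how a partition yields two "oldest" nodes. It assumes the election rule is "smallest join timestamp among the members a node can currently see"; the Member type, the oldest helper, and the node names are illustrative and not taken from the library.

```typescript
// Hypothetical member record; field names are illustrative only.
interface Member {
  id: string;
  joinedAt: number;
}

// Assumed election rule: the oldest member is the visible one that joined first.
function oldest(visible: Member[]): Member {
  return visible.reduce((a, b) => (b.joinedAt < a.joinedAt ? b : a));
}

const n1: Member = { id: "n1", joinedAt: 1 };
const n2: Member = { id: "n2", joinedAt: 2 };
const n3: Member = { id: "n3", joinedAt: 3 };
const n4: Member = { id: "n4", joinedAt: 4 };

// Healthy cluster: every node shares one view, so there is one oldest node.
const hostsBefore = Array.from(new Set([oldest([n1, n2, n3, n4]).id]));

// Partition {n1, n2} | {n3, n4}: each side only sees its own members,
// so each side elects its own oldest node and starts the singleton there.
const hostsDuring = Array.from(
  new Set([oldest([n1, n2]).id, oldest([n3, n4]).id]),
);

console.log(hostsBefore); // one host: n1
console.log(hostsDuring); // two hosts: n1 and n3
```

With a lease in front of the spawn, only one of those two hosts can hold the lease at a time, which is the gap this issue proposes to close.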

Adding an optional lease parameter that wraps the entity-spawn in an acquire-and-guard pattern fixes this:

const lease = new KubernetesLease({
  name: 'app-cron-singleton',
  namespace: 'default',
  owner: process.env.HOSTNAME ?? 'local',
  ttlMs: 30_000,
});

singleton.start({
  typeName: 'cron',
  entityProps: Props.create(() => new CronActor()),
  lease,                                  // ← only the lease holder spawns
});

Behaviour change:

  • On 'I am the elected oldest' notification, the manager calls lease.acquire() before spawning. If acquire returns false, another node currently holds the lease, so the manager leaves the slot empty and retries at a configurable interval.
  • The manager subscribes to lease.onLost(reason) and stops the entity if the lease is revoked mid-flight.
  • On graceful handoff (older node becoming reachable), the current holder calls lease.release() before the membership transition completes.
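
The three behaviours above could be sketched roughly as follows. This is not the actual ClusterSingletonManager: the Lease interface shape (acquire returning Promise<boolean>, release, onLost), the LeaseGuardedSlot class, the retryMs default, and the FakeLease test double are all assumptions made for illustration, and the real interface from #33 may differ.

```typescript
// Assumed shape of the lease; the real Lease interface from #33 may differ.
interface Lease {
  acquire(): Promise<boolean>;
  release(): Promise<void>;
  onLost(listener: (reason: string) => void): void;
}

// Illustrative stand-in for the manager's lease handling (hypothetical class).
class LeaseGuardedSlot {
  private readonly lease: Lease;
  private readonly spawn: () => void;
  private readonly stop: () => void;
  private readonly retryMs: number;
  private running = false;
  private acquiring = false;
  private retryTimer: ReturnType<typeof setTimeout> | undefined = undefined;

  constructor(lease: Lease, spawn: () => void, stop: () => void, retryMs = 5_000) {
    this.lease = lease;
    this.spawn = spawn;
    this.stop = stop;
    this.retryMs = retryMs; // assumed default; the issue only says "configurable"
    // Stop the entity as soon as the lease is revoked mid-flight.
    this.lease.onLost(() => this.stopEntity());
  }

  // Called on the "I am the elected oldest" notification.
  async onElectedOldest(): Promise<void> {
    if (this.running || this.acquiring) return;
    this.acquiring = true;
    let acquired = false;
    try {
      acquired = await this.lease.acquire();
    } finally {
      this.acquiring = false;
    }
    if (acquired) {
      this.running = true;
      this.spawn();
    } else {
      // Another node holds the lease: leave the slot empty and retry later.
      this.retryTimer = setTimeout(() => void this.onElectedOldest(), this.retryMs);
    }
  }

  // Called on graceful handoff, before the membership transition completes.
  async onHandOff(): Promise<void> {
    if (this.retryTimer !== undefined) clearTimeout(this.retryTimer);
    this.retryTimer = undefined;
    const heldLease = this.running;
    this.stopEntity();
    if (heldLease) await this.lease.release();
  }

  isRunning(): boolean {
    return this.running;
  }

  private stopEntity(): void {
    if (!this.running) return;
    this.running = false;
    this.stop();
  }
}

// In-memory test double (hypothetical); `table` stands in for the shared lease record.
class FakeLease implements Lease {
  private readonly owner: string;
  private readonly table: { holder: string | null };
  private readonly listeners: Array<(reason: string) => void> = [];

  constructor(owner: string, table: { holder: string | null }) {
    this.owner = owner;
    this.table = table;
  }

  async acquire(): Promise<boolean> {
    if (this.table.holder !== null && this.table.holder !== this.owner) return false;
    this.table.holder = this.owner;
    return true;
  }

  async release(): Promise<void> {
    if (this.table.holder === this.owner) this.table.holder = null;
  }

  onLost(listener: (reason: string) => void): void {
    this.listeners.push(listener);
  }

  // Simulates the backend revoking the lease (for example, a missed renewal).
  revoke(reason: string): void {
    if (this.table.holder === this.owner) this.table.holder = null;
    this.listeners.forEach((listener) => listener(reason));
  }
}
```

The same three paths map onto the proposed test names: a second slot sharing the lease record stays empty while the first holds it (acquire-blocks-spawn), calling revoke stops the running entity (onLost-stops-entity), and onHandOff clears the holder (release-on-leave).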

Out of scope:

  • The Lease object itself — it's already provided by #33 (KubernetesLease, just landed) and the existing InMemoryLease.
  • Lease integration on ShardCoordinator (the cluster's other singleton-shaped thing) — would warrant its own issue if needed.

Components:

| File | Task |
| --- | --- |
| src/cluster/singleton/ClusterSingleton.ts | accept lease?: Lease in start options |
| src/cluster/singleton/ClusterSingletonManager.ts | wrap entity spawn in lease.acquire(); subscribe to onLost; release on shutdown |
| tests/unit/cluster/singleton/ClusterSingletonLease.test.ts (new) | acquire-blocks-spawn, onLost-stops-entity, release-on-leave |
| examples/coordination/k8s-lease-singleton.ts | rewrite using the new option (much shorter than the manual acquire-and-guard loop the example currently shows) |

Estimate: 1-2 days. Surfaced as a follow-up to #33 — the example for KubernetesLease today reaches into a manual acquire/onLost/release loop because this option doesn't exist yet.

Labels: enhancement (new feature or request), priority: high (top priority: high impact, plan next)
