Wire DowningProvider into Cluster failure-detection pipeline #61

@pathosDev

Description

Discovered while planning #51. The four existing downing strategies (`KeepMajority`, `KeepOldest`, `KeepReferee`, `StaticQuorum`) live in `src/cluster/downing/` and are tested as pure functions, but they are NOT invoked anywhere in `Cluster.ts`. `ClusterSettings` has no `downing` field; `failureDetectionTick` doesn't ask any provider for a decision. The strategies are decorative until this is fixed.

This is a real correctness gap: users who configure `new KeepMajority()` and expect partition handling get nothing. The cluster's only failure response today is the failure detector's elapsed-time-based `unreachable → down → removed` cascade, which ignores cluster topology and can leave both sides of a partition alive.

Strategy:

  • Add `downing?: DowningProvider` to `ClusterSettings`.
  • In `failureDetectionTick`, when a member transitions to unreachable, build a `ClusterPartitionView` and invoke `provider.decide(view)`.
  • Apply the returned decision via the same `down → removed` path that exists today (members listed in the decision get force-downed regardless of failure-detector state).
  • Re-evaluate on every change to the unreachable set (debounce so we don't spam the provider on every tick).
  • Without a provider configured: existing behaviour (purely heartbeat-driven downing). Backwards-compat preserved.
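The steps above could be wired roughly as follows. This is a minimal sketch, not the actual `Cluster.ts` code: the field layouts of `ClusterPartitionView` and `DowningDecision`, the `DowningWiring` class, and the `forceDown` callback are all assumptions made for illustration. It also substitutes simple change detection on the unreachable set for a timed debounce.

```typescript
// Hypothetical shapes. Only the names DowningProvider, ClusterPartitionView,
// ClusterSettings.downing and decide(view) come from this issue.
type MemberId = string;

interface ClusterPartitionView {
  readonly reachable: ReadonlySet<MemberId>;
  readonly unreachable: ReadonlySet<MemberId>;
}

interface DowningDecision {
  readonly down: readonly MemberId[]; // members to force-down (assumed shape)
}

interface DowningProvider {
  decide(view: ClusterPartitionView): DowningDecision;
}

interface ClusterSettings {
  downing?: DowningProvider;
}

class DowningWiring {
  private readonly settings: ClusterSettings;
  private readonly forceDown: (id: MemberId) => void;
  private lastKey: string | null = null;

  constructor(settings: ClusterSettings, forceDown: (id: MemberId) => void) {
    this.settings = settings;
    this.forceDown = forceDown;
  }

  // Intended to be called from the failure-detection tick.
  onTick(members: readonly MemberId[], unreachable: ReadonlySet<MemberId>): void {
    const provider = this.settings.downing;
    if (!provider) return; // no provider: heartbeat-driven downing only

    // Skip the provider unless the unreachable set actually changed.
    const key = Array.from(unreachable).sort().join(",");
    if (key === this.lastKey) return;
    this.lastKey = key;
    if (unreachable.size === 0) return; // nothing to decide

    const view: ClusterPartitionView = {
      reachable: new Set(members.filter((m) => !unreachable.has(m))),
      unreachable,
    };
    for (const id of provider.decide(view).down) {
      this.forceDown(id); // applied regardless of failure-detector state
    }
  }
}
```

Keeping the provider optional means the early return preserves today's behaviour when `downing` is not set.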

Components:

| File | Task |
| --- | --- |
| `src/cluster/Cluster.ts` | Accept `downing?: DowningProvider`; build `ClusterPartitionView` on partition changes; apply the decision. |
| `tests/unit/cluster/downing/DowningWiring.test.ts` (new) | Single-node + 2-node smoke tests of the wiring with each existing strategy. |
| `tests/multi-node/downing-keep-majority.test.ts` (new) | 5-node cluster, 3/2 partition; minority side downs itself end-to-end. |
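To make the expected outcome of the 3/2 partition test concrete, here is a small sketch of a keep-majority style rule evaluated from each side of the split. `keepMajoritySketch`, `SideView`, and `viewFrom` are hypothetical stand-ins, not the project's `KeepMajority` class, and the sketch does no tie-breaking for even splits.

```typescript
// Illustrative stand-in for a keep-majority rule; NOT the real KeepMajority.
type NodeId = string;

interface SideView {
  reachable: NodeId[];   // members this side can still see, including itself
  unreachable: NodeId[]; // members this side has lost contact with
}

// Returns the members that the observing side should down.
function keepMajoritySketch(view: SideView): NodeId[] {
  const total = view.reachable.length + view.unreachable.length;
  const inMajority = view.reachable.length * 2 > total;
  // Majority side downs the unreachable members; minority side downs itself.
  return inMajority ? view.unreachable.slice() : view.reachable.slice();
}

const allNodes: NodeId[] = ["n1", "n2", "n3", "n4", "n5"];

// Builds the view that one side of the partition would observe.
function viewFrom(side: NodeId[]): SideView {
  return {
    reachable: side,
    unreachable: allNodes.filter((n) => side.indexOf(n) === -1),
  };
}

const majorityDecision = keepMajoritySketch(viewFrom(["n1", "n2", "n3"]));
const minorityDecision = keepMajoritySketch(viewFrom(["n4", "n5"]));
```

Under these assumptions both sides arrive at the same set to down (`n4` and `n5`), which is the agreement the end-to-end test would need to observe.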

Estimate: 1-2 days.

Verification:

  • All four existing strategies pass their existing pure-function tests AND a new end-to-end test that drives them through real cluster failure detection.
  • Without `downing` configured: existing tests stay green (no behavioural change).

Required by: #51 (LeaseMajority needs working downing infrastructure).

Out of scope:

Metadata

Labels: bug, priority: high
