
hashicorp/hyparview

What’s This?

  • Partial membership for gossip clusters
  • The HyParView paper, in production in Partisan
  • A 10k-node cluster requires only 5 active and 30 passive peers in each node's views (see the sketch after this list)
  • Side goal: demonstrate that separating state changes from async and mutex behavior improves testing options
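The 5/30 figure follows the paper's sizing rule: the active view holds log10(n) + c peers and the passive view holds k times that, with c = 1 and k = 6 in the paper's evaluation. A minimal Go sketch of that rule (not this library's API; names here are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// viewSizes applies the HyParView paper's rule: an active view of size
// log10(n) + c and a passive view k times larger, with c = 1 and k = 6
// as in the paper's evaluation.
func viewSizes(n int) (active, passive int) {
	const c, k = 1, 6
	active = int(math.Log10(float64(n))) + c
	passive = k * active
	return active, passive
}

func main() {
	a, p := viewSizes(10000)
	fmt.Printf("active=%d passive=%d\n", a, p) // prints: active=5 passive=30
}
```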

Running The Simulation

brew install gnuplot

make simulation

See data for the input data. Plotting output is written to plot.

Known bugs

  • The simulation consistently shows a number of asymmetric links; these would actually be repaired in a stable system that continued to shuffle. The simulation should be extended to demonstrate that.
  • When running with only a single send attempt, failure recovery causes a stack overflow. The simulator send plugin should use an external stack that doesn't overflow so easily; a sketch follows this list.
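A minimal sketch of the external-stack idea, assuming hypothetical Message and trySend stand-ins rather than the simulator's real types: pending sends live on an explicit slice-backed stack, so failure recovery grows the slice instead of the call stack.

```go
package sendstack

// Message is a hypothetical stand-in for the simulator's message type.
type Message struct{ To, Payload string }

// trySend is a placeholder for the simulator's transport hook; it would
// attempt a single delivery and report success.
func trySend(m Message) bool { return true }

// drain delivers messages using an explicit slice-backed stack. On failure,
// recovery pushes any follow-up messages back onto the slice rather than
// recursing, so deep failure chains can't overflow the call stack.
func drain(initial []Message, onFail func(Message) []Message) {
	stack := append([]Message(nil), initial...)
	for len(stack) > 0 {
		m := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if !trySend(m) {
			stack = append(stack, onFail(m)...)
		}
	}
}
```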

Related Topics for Improving the Approach

  • Thicket (Thicket PDF, also by Leitão and Rodrigues) describes a multiple-spanning-tree approach to sending application messages, which reduces wasted messages while remaining resilient to churn. The simulator currently implements very simple epidemic gossip transmission, where the active view size is the degree of fanout (novel messages are forwarded to all active peers; a sketch follows this list). Waste and path length are plotted.
  • In the Peer Sample paper (which uses only the passive view), entries are tagged with the timestamp of the last direct active contact, and the shuffle algorithm is biased to penalize old entries (also sketched after this list). This lets a dead node fall out of the whole network's passive views more quickly, but it may be detrimental to recovery if the network is partitioned.
  • Partition recovery in general needs more investigation: there is a risk that partitioned networks would repair their respective active views and interconnect tightly, so that healing the partition would leave two well-connected networks joined by only a few nodes.
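For reference, a sketch of the simple epidemic transmission mentioned in the first bullet, with hypothetical types: a node forwards each novel message to every active peer except the sender, and counts duplicate deliveries as waste.

```go
package gossip

// Node holds the minimal state needed for epidemic forwarding; these are
// illustrative types, not this library's actual ones.
type Node struct {
	Active []string        // active view peer IDs; fanout equals its size
	seen   map[string]bool // message IDs already delivered here
	Waste  int             // duplicate deliveries observed
}

// Receive returns the peers a message should be forwarded to: all active
// peers except the sender for a novel message, nothing for a duplicate.
func (n *Node) Receive(msgID, from string) []string {
	if n.seen == nil {
		n.seen = map[string]bool{}
	}
	if n.seen[msgID] {
		n.Waste++ // duplicate: counted as waste, not forwarded
		return nil
	}
	n.seen[msgID] = true
	out := make([]string, 0, len(n.Active))
	for _, p := range n.Active {
		if p != from {
			out = append(out, p)
		}
	}
	return out
}
```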
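And a sketch of the age-biased selection from the Peer Sample bullet, again with hypothetical types: passive entries carry the time of the last direct active contact, and the oldest entries are chosen for the shuffle exchange so stale nodes are replaced first.

```go
package shuffle

import (
	"sort"
	"time"
)

// PassiveEntry tags a passive-view address with its last direct active
// contact, per the Peer Sample idea; illustrative, not this repo's types.
type PassiveEntry struct {
	Addr        string
	LastContact time.Time
}

// shuffleCandidates returns up to k entries, oldest first, so that entries
// for dead nodes are preferentially shuffled out of the passive view.
func shuffleCandidates(passive []PassiveEntry, k int) []PassiveEntry {
	sorted := append([]PassiveEntry(nil), passive...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].LastContact.Before(sorted[j].LastContact)
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}
```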

Improvements to the Code

  • Isolated nodes in simulation runs have asymmetric active views. Refactor the simulator to support periodic messages, such as active view keepalive and shuffle, from the perspective of each node.
  • Incorporate failActive directly into the code. Because it relies on synchronous low-priority Neighbor requests, it requires a sort of iterator interface (or some other refactoring); see the sketch after this list.
  • Minor, and dependent on research results: the view constructor takes the expected number of peers as an argument, and should use that to set the appropriate default configuration, once its factors are known beyond the values given in the paper.
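One possible shape for that iterator, purely hypothetical and not this repo's API: failActive yields one Neighbor request at a time, and the caller re-enters it when a peer refuses, so the state machine itself never blocks on the network.

```go
package repair

// NeighborRequest describes one attempt to promote a passive peer.
type NeighborRequest struct {
	Peer     string
	Priority bool // high priority when the active view is empty
}

// FailActive walks passive-view candidates for active view repair.
type FailActive struct {
	candidates []string // passive peers to try, in order
	highPrio   bool     // true when the active view is empty
}

// Next yields the next Neighbor request to send, or nil when no candidates
// remain. On a refused request the caller simply calls Next again.
func (f *FailActive) Next() *NeighborRequest {
	if len(f.candidates) == 0 {
		return nil
	}
	peer := f.candidates[0]
	f.candidates = f.candidates[1:]
	return &NeighborRequest{Peer: peer, Priority: f.highPrio}
}
```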

Live Test

There's an example secure gRPC application built on this library at hyparview-example. It will be used for a live test on hardware.