Timing related bug in view change protocol #1488

Open
skhoroshavin opened this issue Jul 23, 2020 · 0 comments
skhoroshavin commented Jul 23, 2020

It looks like Plenum has a timing-related bug in its view change protocol.

Potential steps to reproduce

  • create a test pool with 4 nodes
  • pause 2 nodes, neither of which is the primary. If using a docker environment:
    • use the docker pause command, so the nodes are frozen and no explicit disconnection events happen
    • pause Node3 and Node4 - they are guaranteed not to be primaries initially
  • wait for 30 minutes; during that time:
    • the master primary will send a freshness batch (probably a couple of times)
    • the working nodes will receive and store these batches, but won't be able to order them for lack of consensus
    • after about 10 minutes the working nodes (including the primary) should realize that consensus is lost and start sending votes for a view change (INSTANCE_CHANGE messages); however, with only 2 of the 4 nodes voting there are not enough votes, and the view change won't start (see the quorum sketch after this list)
  • after 30 minutes, unpause the paused nodes
    • they will realize that consensus has been lost for too long and will also vote for a view change
    • the view change will start and a NEW_VIEW message containing the previously unordered freshness batches will be created, but ordering will fail, complaining about an incorrect batch time
    • so the next view change will happen, with the same result
    • the pool will thus enter a perpetual view change cycle even though all nodes are up and healthy
  • restarting all nodes at once should break the cycle and put the pool back into a healthy state
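
For reference, here is a minimal sketch of the quorum arithmetic behind the stalled view change, assuming standard BFT-style quorums (the n - f threshold and the helper name are assumptions for illustration, not Plenum's actual identifiers):

```python
# Hedged sketch: a node starts a view change only after collecting
# INSTANCE_CHANGE votes for the same view from n - f distinct nodes,
# where f = (n - 1) // 3 is the tolerated number of faulty nodes.
def view_change_quorum(n: int) -> int:
    f = (n - 1) // 3
    return n - f

votes = {"Node1", "Node2"}                  # only the working nodes vote
assert len(votes) < view_change_quorum(4)   # 2 < 3: view change cannot start

votes |= {"Node3", "Node4"}                 # unpaused nodes add their votes
assert len(votes) >= view_change_quorum(4)  # quorum reached: view change begins
```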

The actual steps when I caught this were longer, but based on my preliminary analysis these should suffice as well.
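
A minimal driver for these steps might look like the following sketch, assuming a local docker-based test pool whose containers are named Node1 through Node4 (the container names and exact wait time are assumptions about the test environment, not part of Plenum itself):

```python
import subprocess
import time

PAUSED = ["Node3", "Node4"]  # guaranteed not to be primaries initially

for node in PAUSED:
    # Freeze the node's processes; no explicit disconnection event is emitted.
    subprocess.run(["docker", "pause", node], check=True)

# Wait out several freshness batches and the ~10 minute consensus-lost
# timeout on the working nodes, with a wide margin.
time.sleep(30 * 60)

for node in PAUSED:
    # All 4 nodes are up again, yet the pool is expected to enter
    # a perpetual view change cycle.
    subprocess.run(["docker", "unpause", node], check=True)
```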

Cause and potential fix

  • there is indeed a safeguard on batch time during normal ordering, so that a malicious primary cannot create batches dated far in the future or in the past
  • however, this safeguard also applies to batches that are reordered during a view change; if for whatever reason the view change takes longer than the safeguard window, the batches cannot be reordered, since their timestamps cannot be altered, and so the view change will never be able to finish
  • a potential fix should include either different time safeguard logic for the reordering phase or disabling the safeguard during reordering (though before doing that, a thorough analysis of the safety of such a change should be performed)
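
To illustrate the first option (the identifier names and window size below are hypothetical, not Plenum's actual ones): during normal ordering the batch timestamp must stay close to the validator's clock, while during reordering it could instead be checked only for monotonicity against the last ordered batch, since old timestamps cannot be changed:

```python
import time

ACCEPTABLE_DEVIATION = 300  # seconds; placeholder for the configured window

def is_batch_time_acceptable(batch_ts, now=None):
    """Normal ordering: reject batches dated too far in the past or future,
    so a malicious primary cannot forge timestamps."""
    now = time.time() if now is None else now
    return abs(now - batch_ts) <= ACCEPTABLE_DEVIATION

def is_batch_time_acceptable_for_reordering(batch_ts, last_ordered_ts, now=None):
    """Possible reordering-phase rule: only require that batch time does not
    go backwards relative to the last ordered batch and is not dated in the
    future, regardless of how long the view change itself took."""
    now = time.time() if now is None else now
    return last_ordered_ts <= batch_ts <= now + ACCEPTABLE_DEVIATION
```

Under the relaxed rule a freshness batch created 30 minutes before the view change completes would still be reorderable, while a batch dated in the future would still be rejected.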