Timing related bug in view change protocol #1488

Open
skhoroshavin opened this issue Jul 23, 2020 · 0 comments
skhoroshavin commented Jul 23, 2020

It looks like Plenum has a timing-related bug in its view change protocol.

Potential steps to reproduce

  • create a test pool with 4 nodes
  • pause 2 nodes, neither of which is the primary. If using a docker environment:
    • use the docker pause command, so the nodes are frozen and no explicit disconnection events happen
    • pause Node3 and Node4 - they are guaranteed not to be primaries initially
  • wait for 30 minutes; during that time:
    • the master primary will send a freshness batch (probably a couple of times)
    • the working nodes will receive and store these batches, but won't be able to order them for lack of consensus
    • after about 10 minutes the working nodes (including the primary) should realize that consensus is lost and start sending votes for a view change (INSTANCE_CHANGE messages); however, with only 2 of the 4 nodes voting there are not enough votes, and the view change won't start (see the quorum sketch after this list)
  • after 30 minutes, unpause the paused nodes
    • they will realize that consensus has been lost for too long and will also vote for a view change
    • the view change will start and a NEW_VIEW message containing the previously unordered freshness batches will be created, but ordering will fail, complaining about an incorrect batch time
    • so the next view change will happen, with the same result
    • the pool will thus enter a perpetual view change cycle even though all nodes are up and healthy
  • restarting all nodes at once should break the cycle and put the pool back into a healthy state
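
For reference, here is a minimal sketch of the quorum arithmetic behind the stalled view change, assuming standard BFT-style quorums (the n - f threshold and the helper name are assumptions for illustration, not Plenum's actual identifiers):

```python
# Hedged sketch: a node starts a view change only after collecting
# INSTANCE_CHANGE votes for the same view from n - f distinct nodes,
# where f = (n - 1) // 3 is the tolerated number of faulty nodes.
def view_change_quorum(n: int) -> int:
    f = (n - 1) // 3
    return n - f

votes = {"Node1", "Node2"}                  # only the working nodes vote
assert len(votes) < view_change_quorum(4)   # 2 < 3: view change cannot start

votes |= {"Node3", "Node4"}                 # unpaused nodes add their votes
assert len(votes) >= view_change_quorum(4)  # quorum reached: view change begins
```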

The actual steps when I caught this were longer, but based on my preliminary analysis these should suffice as well.
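
A minimal driver for these steps might look like the following sketch, assuming a local docker-based test pool whose containers are named Node1 through Node4 (the container names and exact wait time are assumptions about the test environment, not part of Plenum itself):

```python
import subprocess
import time

PAUSED = ["Node3", "Node4"]  # guaranteed not to be primaries initially

for node in PAUSED:
    # Freeze the node's processes; no explicit disconnection event is emitted.
    subprocess.run(["docker", "pause", node], check=True)

# Wait out several freshness batches and the ~10 minute consensus-lost
# timeout on the working nodes, with a wide margin.
time.sleep(30 * 60)

for node in PAUSED:
    # All 4 nodes are up again, yet the pool is expected to enter
    # a perpetual view change cycle.
    subprocess.run(["docker", "unpause", node], check=True)
```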

Cause and potential fix

  • there is indeed a safeguard on batch time during normal ordering, so that a malicious primary cannot create batches dated far in the future or in the past
  • however, this safeguard also applies to batches that are reordered during a view change; if for whatever reason the view change takes longer than the safeguard window, the batches cannot be reordered, since their timestamps cannot be altered, and so the view change will never be able to finish
  • a potential fix should include either different time safeguard logic for the reordering phase or disabling the safeguard during reordering (though before doing that, a thorough analysis of the safety of such a change should be performed)
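
To illustrate the first option (the identifier names and window size below are hypothetical, not Plenum's actual ones): during normal ordering the batch timestamp must stay close to the validator's clock, while during reordering it could instead be checked only for monotonicity against the last ordered batch, since old timestamps cannot be changed:

```python
import time

ACCEPTABLE_DEVIATION = 300  # seconds; placeholder for the configured window

def is_batch_time_acceptable(batch_ts, now=None):
    """Normal ordering: reject batches dated too far in the past or future,
    so a malicious primary cannot forge timestamps."""
    now = time.time() if now is None else now
    return abs(now - batch_ts) <= ACCEPTABLE_DEVIATION

def is_batch_time_acceptable_for_reordering(batch_ts, last_ordered_ts, now=None):
    """Possible reordering-phase rule: only require that batch time does not
    go backwards relative to the last ordered batch and is not dated in the
    future, regardless of how long the view change itself took."""
    now = time.time() if now is None else now
    return last_ordered_ts <= batch_ts <= now + ACCEPTABLE_DEVIATION
```

Under the relaxed rule a freshness batch created 30 minutes before the view change completes would still be reorderable, while a batch dated in the future would still be rejected.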