
INDY-1911: Send INSTANCE_CHANGE when state signatures are not fresh enough #1078

Merged
merged 11 commits into hyperledger:master from the indy-1911 branch on Feb 11, 2019

Conversation

skhoroshavin
Member

No description provided.

Sergey Khoroshavin added 4 commits February 6, 2019 15:40
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
@@ -714,7 +713,21 @@ def get_msgs_for_lagged_nodes(self) -> List[ViewChangeDone]:
format(self, self.view_no))
return messages

def propose_view_change(self, suspicion=Suspicions.PRIMARY_DEGRADED):
proposed_view_no = self.view_no
if not self.view_change_in_progress or suspicion == Suspicions.INSTANCE_CHANGE_TIMEOUT:
Contributor
But we don't call this method for INSTANCE_CHANGE_TIMEOUT, do we? See, for example, on_view_change_not_completed_in_time

Member Author
Now we do
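
For illustration, a minimal sketch of what the new call path could look like, assuming (as the reply above says) that the timeout handler now routes through propose_view_change; the method body here is illustrative, not the exact PR code:

    def on_view_change_not_completed_in_time(self):
        # Illustrative sketch: re-propose the view change with the
        # INSTANCE_CHANGE_TIMEOUT suspicion, which makes the
        # "or suspicion == Suspicions.INSTANCE_CHANGE_TIMEOUT" branch
        # in propose_view_change reachable even while a view change
        # is already in progress.
        self.propose_view_change(Suspicions.INSTANCE_CHANGE_TIMEOUT)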

sdk_ensure_pool_functional(looper, txnPoolNodeSet, sdk_wallet_client, sdk_pool_handle)


def test_view_change_happens_if_ordering_is_halted(looper, tconf, txnPoolNodeSet,
Contributor
Do we have a test where ordering is stopped because the Primary doesn't send a PrePrepare in time?

Member Author
Actually, this test emulates the behaviour you described by blocking delivery of PrePrepares to all nodes.
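
A minimal sketch of that approach, assuming plenum's standard test helpers (ppDelay, delay_rules, nodeIbStasher, waitForViewChange); the test name and the exact waiting logic are illustrative:

    from plenum.test.delayers import ppDelay
    from plenum.test.stasher import delay_rules
    from plenum.test.helper import waitForViewChange


    def test_sketch_view_change_when_preprepares_are_lost(looper, txnPoolNodeSet):
        # Block PRE-PREPARE delivery on every node: the primary keeps sending them,
        # but from the point of view of the other nodes ordering is halted.
        stashers = [node.nodeIbStasher for node in txnPoolNodeSet]
        old_view_no = txnPoolNodeSet[0].viewNo
        with delay_rules(stashers, ppDelay(1000)):
            # The freshness check should notice that state is not being updated,
            # nodes send INSTANCE_CHANGE, and a view change eventually starts.
            waitForViewChange(looper, txnPoolNodeSet, expectedViewNo=old_view_no + 1)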

Contributor
Yes, I understand. Knowing how it's implemented, it's clear that the two cases are the same and the difference doesn't matter.
But from a black-box point of view they can be different cases: in the first case the Primary is not malicious and does send PrePrepares, but they are lost; in the second case the nodes are ready to order, but the Primary is malicious and doesn't send PrePrepares, or sends them too infrequently.


for n in some_nodes:
n.view_changer.on_master_degradation()
looper.runFor(0.2)
Contributor
Why do we need this run? Should we instead check that the other nodes received 2 IC messages?

Member Author
Yes, a check for received IC messages would be cleaner; the current implementation is just a quick hack.

Member Author
Fixed
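
For reference, a sketch of the kind of check that could replace the fixed looper.runFor(0.2). The helper name, the fixture names and the instance_changes accessor are assumptions about the plenum internals of that time, not guaranteed names:

    from stp_core.loop.eventually import eventually


    def check_instance_changes_received(nodes, expected_senders, proposed_view_no):
        # Hypothetical helper: every remaining node should have registered an
        # INSTANCE_CHANGE for the proposed view from each node that sent one.
        for node in nodes:
            for sender in expected_senders:
                assert node.view_changer.instance_changes.has_inst_chng_from(
                    proposed_view_no, sender.name)


    # Instead of looper.runFor(0.2), something like:
    # looper.run(eventually(check_instance_changes_received,
    #                       other_nodes, some_nodes, old_view_no + 1))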

1. some_nodes (Beta and Gamma) send InstanceChange for all nodes.
2. Restart other_nodes (Gamma and Delta).
3. last_node (Delta) sends InstanceChange for all nodes.
4. Ensure elections are done and the pool is functional.
Contributor
If I understand correctly, in this test we have the following situation after step 3:

  • Alpha (Primary) has 3 IC and started VC
  • Beta has 3 IC and started VC
  • Gamma has 1 IC
  • Delta has 1 IC

Since the Primary stopped ordering, freshness checks lead to a VC being started on all nodes.
So it would make sense to name the test something like test_vc_finished_when_less_than_quorum_started_including_primary.

But what if a different combination of nodes is restarted, so that the Primary doesn't have a quorum of IC?
For example:

  • Alpha and Beta send IC
  • Beta is restarted (wait until it's restarted before going to the next step)
  • Alpha is restarted (wait until it's restarted before going to the next step)
  • Gamma is restarted (wait until it's restarted before going to the next step)
  • Delta sends IC

  =>
  Alpha - 1 IC
  Beta - 1 IC
  Gamma - 1 IC
  Delta - 3 IC => starts VC

As a result, Delta will be stuck in an infinite VC.

I think the freshness check cannot solve the situation above, so persistence of IC will be needed.
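
A quick illustration of the quorum arithmetic behind this scenario, assuming plenum's Quorums helper (the exact attribute names are from memory and may differ):

    from plenum.server.quorums import Quorums

    quorums = Quorums(4)                   # 4-node pool => f = 1
    assert quorums.view_change.value == 3  # n - f INSTANCE_CHANGEs needed to start a VC

    # After the restarts described above, only Delta holds 3 INSTANCE_CHANGEs;
    # Alpha, Beta and Gamma hold 1 each, so no other node can reach the quorum
    # and Delta stays in the view change until the lost votes are re-sent or persisted.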

Contributor
Another test: all nodes start VC, but half of the nodes restart during the VC.
Sub-tests:

  • half of the nodes are restarted immediately
  • half of the nodes are restarted after more than INSTANCE_CHANGE_TIMEOUT
  • half of the nodes are restarted after more than 3*INSTANCE_CHANGE_TIMEOUT

Contributor
Nice to have:
Maybe we should implement something like unit/property-based tests to check that we can recover from any state.
Take ViewChanger instances, put them into different views, and send IC and VC messages to them in different combinations. The expectation is that all ViewChanger instances eventually come to the same view.
We can do it with our new property-based model as well, but if we could do it with the real ViewChanger class, we would have more confidence.
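
A toy sketch of what such a property-based test could look like (using hypothesis). ToyViewChanger below is a deliberately simplified stand-in that only counts INSTANCE_CHANGE votes, not the real plenum ViewChanger:

    from hypothesis import given, strategies as st

    NODE_NAMES = ['Alpha', 'Beta', 'Gamma', 'Delta']


    class ToyViewChanger:
        """Simplified model: switch to a proposed view on a quorum of n - f votes."""

        def __init__(self, n):
            self.view_no = 0
            self.quorum = n - (n - 1) // 3   # n - f
            self.votes = {}                  # proposed view -> set of voters

        def on_instance_change(self, frm, proposed_view):
            if proposed_view <= self.view_no:
                return
            self.votes.setdefault(proposed_view, set()).add(frm)
            if len(self.votes[proposed_view]) >= self.quorum:
                self.view_no = proposed_view


    @given(senders=st.sets(st.sampled_from(NODE_NAMES)))
    def test_all_view_changers_end_up_in_the_same_view(senders):
        nodes = {name: ToyViewChanger(len(NODE_NAMES)) for name in NODE_NAMES}
        # Every sender broadcasts an INSTANCE_CHANGE for view 1 to all nodes
        for frm in senders:
            for node in nodes.values():
                node.on_instance_change(frm, proposed_view=1)
        views = {node.view_no for node in nodes.values()}
        # Either everyone moved to view 1 (quorum reached) or nobody did
        expected = {1} if len(senders) >= next(iter(nodes.values())).quorum else {0}
        assert views == expected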

Member Author
Actually I tried to implement a test where only a minority of nodes enters view change, but there are some problems with this:

  • it is possible to do so on a 7-node pool, but since the master primary also enters view change and ordering is stopped, this case is no different from the one already implemented
  • preventing the primary from collecting an instance change quorum by restarting it will lead to primary disconnection events, and a view change will happen anyway
  • fun fact: in INDY-1903 the primary was not restarted, but it didn't receive instance change messages during network outages, which is why it didn't enter view change. However, it should have received them due to queueing (and I fail to see how to reproduce this correctly in tests), so there was probably yet another bug?

Member Author
Implementing property-based tests for the view changer would be extremely helpful; however, the view changer depends a lot on the node and uses the real communication layer, not a simulated one. There are multiple implementation options, but they all have drawbacks:

  • just implement a full-blown integration property-based test - this can be done pretty quickly, but such a test would take ages to run, and I have concerns about the reproducibility of its results
  • use monkeypatching and mocking extensively to isolate the view changer without touching it too much - this can be done in a moderate amount of time, but I'm afraid such tests would be very fragile, as they would depend a lot on implementation details
  • refactor the view changer so that it can easily be used in isolation - this is the cleanest solution and can serve as a foundation for similar changes in other parts of the code, but the question is whether we are okay with the amount of work it would take

2. Restart nodes_to_restart (Beta, Gamma).
3. Wait for OUTDATED_INSTANCE_CHANGES_CHECK_INTERVAL seconds.
4. nodes_to_restart send InstanceChanges for all nodes.
5. Ensure elections are done.
Contributor
Please mention that the VC didn't happen because the IC from the panic node was discarded by timeout.

Member Author
Done
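
A small illustrative sketch of the discard-by-timeout idea the docstring now mentions; the function and the interval value are hypothetical, only the concept matches the PR:

    import time

    OUTDATED_INSTANCE_CHANGES_CHECK_INTERVAL = 300   # seconds, illustrative value


    def drop_outdated_instance_changes(votes, now=None):
        # votes: {voter_name: timestamp of its last INSTANCE_CHANGE}
        # Votes older than the interval are discarded, so a single "panic" node
        # that sent an INSTANCE_CHANGE long ago can no longer complete a quorum
        # together with freshly restarted nodes.
        now = time.time() if now is None else now
        return {voter: ts for voter, ts in votes.items()
                if now - ts <= OUTDATED_INSTANCE_CHANGES_CHECK_INTERVAL}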

Sergey Khoroshavin added 6 commits February 8, 2019 12:49
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
…_lost_due_to_restarts

Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
@ashcherbakov ashcherbakov merged commit 5ccf4a3 into hyperledger:master Feb 11, 2019
@skhoroshavin skhoroshavin deleted the indy-1911 branch February 11, 2019 15:41