INDY-1911: Send INSTANCE_CHANGE when state signatures are not fresh enough #1078
Signed-off-by: Sergey Khoroshavin <sergey.khoroshavin@dsr-corporation.com>
@@ -714,7 +713,21 @@ def get_msgs_for_lagged_nodes(self) -> List[ViewChangeDone]:
                   format(self, self.view_no))
        return messages

    def propose_view_change(self, suspicion=Suspicions.PRIMARY_DEGRADED):
        proposed_view_no = self.view_no
        if not self.view_change_in_progress or suspicion == Suspicions.INSTANCE_CHANGE_TIMEOUT:
But we don't call this method for INSTANCE_CHANGE_TIMEOUT, do we? See, for example, on_view_change_not_completed_in_time.
Now we do
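A minimal sketch of the logic under discussion, assuming the method bumps the proposed view number both when no view change is running and when the current one timed out. ViewChangerSketch and the enum values are simplified stand-ins, not the real plenum classes:

```python
from enum import Enum

class Suspicions(Enum):
    # Illustrative values; the real plenum Suspicions carry codes and reasons.
    PRIMARY_DEGRADED = 1
    INSTANCE_CHANGE_TIMEOUT = 2

class ViewChangerSketch:
    """Simplified stand-in for the real view changer."""
    def __init__(self, view_no=0, view_change_in_progress=False):
        self.view_no = view_no
        self.view_change_in_progress = view_change_in_progress

    def propose_view_change(self, suspicion=Suspicions.PRIMARY_DEGRADED):
        proposed_view_no = self.view_no
        # Bump the proposed view when no view change is running, or when the
        # running one timed out -- so nodes can escape a stuck view change.
        if not self.view_change_in_progress or \
                suspicion == Suspicions.INSTANCE_CHANGE_TIMEOUT:
            proposed_view_no += 1
        return proposed_view_no
```

Under this reading, INSTANCE_CHANGE_TIMEOUT is the only suspicion that re-proposes a higher view while a view change is already in progress.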
    sdk_ensure_pool_functional(looper, txnPoolNodeSet, sdk_wallet_client, sdk_pool_handle)


def test_view_change_happens_if_ordering_is_halted(looper, tconf, txnPoolNodeSet,
Do we have a test where ordering is stopped because the Primary doesn't send a PrePrepare in time?
Actually this test emulates the behaviour you described by blocking delivery of PrePrepares to all nodes.
Yes, I understand that. Knowing how it's implemented, it's clear that this is the same case and the distinction doesn't matter.
But from a black-box point of view these can be different cases: in the first case the Primary sends PrePrepares that are not malicious, but they are simply lost; in the second case the nodes are ready to order, but the Primary is malicious and doesn't send PrePrepares, or sends them too infrequently.
    for n in some_nodes:
        n.view_changer.on_master_degradation()
    looper.runFor(0.2)
Why do we need this run? Should we check instead that other nodes received 2 IC messages?
Yes, checking for received IC messages would be cleaner; the current implementation is just a quick hack.
Fixed
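The check for received IC messages could be sketched as a small polling helper. This is a hypothetical sketch: in a real test `get_ic_count` would wrap a spy on the node's InstanceChange handler, which is not shown here:

```python
import time

def wait_for_instance_changes(get_ic_count, expected, timeout=5.0, poll=0.05):
    """Poll until get_ic_count() reports at least `expected` InstanceChange
    messages; raise TimeoutError if that never happens within `timeout`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        count = get_ic_count()
        if count >= expected:
            return count
        time.sleep(poll)
    raise TimeoutError("saw fewer than %d InstanceChange messages" % expected)
```

Unlike a bare `looper.runFor(0.2)`, this fails loudly when the messages never arrive and returns as soon as they do.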
1. some_nodes (Beta and Gamma) send InstanceChange for all nodes.
2. Restart other_nodes (Gamma and Delta).
3. last_node (Delta) sends InstanceChange for all nodes.
4. Ensure elections are done and the pool is functional.
If I understand correctly, in this test we have the following situation after step 3:
- Alpha (Primary) has 3 IC and started VC
- Beta has 3 IC and started VC
- Gamma has 1 IC
- Delta has 1 IC
Since Primary stopped ordering, freshness checks lead to start of a VC on all nodes.
So it makes sense to name the test something like test_vc_finished_when_less_than_quorum_started_including_primary.
But what if a different combination of nodes restarted, so that Primary doesn't have a quorum of IC?
For example,
- Alpha and Beta send IC
- Beta is restarted (wait until it's restarted before going to the next step)
- Alpha is restarted (wait until it's restarted before going to the next step)
- Gamma is restarted (wait until it's restarted before going to the next step)
- Delta sends IC
=>
- Alpha: 1 IC
- Beta: 1 IC
- Gamma: 1 IC
- Delta: 3 IC => starts VC.
As a result, Delta will be in infinite VC.
I think the freshness check cannot solve the situation above, so persistence of IC will be needed.
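For reference, the quorum arithmetic behind this scenario, assuming plenum's usual view-change quorum of n - f with f = (n - 1) // 3 (a sketch of the math, not code from the PR):

```python
def f_value(n):
    # Max Byzantine faults tolerated by an n-node BFT pool.
    return (n - 1) // 3

def view_change_quorum(n):
    # InstanceChange votes needed before a node actually starts a view change.
    return n - f_value(n)

# The 4-node scenario above: Alpha/Beta/Gamma each hold 1 IC, Delta holds 3.
ic_counts = {"Alpha": 1, "Beta": 1, "Gamma": 1, "Delta": 3}
started = [name for name, c in ic_counts.items()
           if c >= view_change_quorum(len(ic_counts))]
# Only Delta reaches the quorum of 3, so it starts a view change alone
# and can never complete it.
```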
Another test: all nodes start VC, but 1/2 of the nodes restart during the VC.
Sub-tests:
- 1/2 nodes are restarted immediately
- 1/2 nodes are restarted in > than INSTANCE_CHANGE_TIMEOUT
- 1/2 nodes are restarted in > than 3*INSTANCE_CHANGE_TIMEOUT
Nice to have:
Maybe we should implement something like unit/property-based tests for checking that we can recover from any state.
Take ViewChanger instances, put them into different views, and send IC and VC messages to them in different combinations. It's expected that all ViewChanger instances should eventually come to the same view.
We can do it with our new property-based model as well.
But if we could do it with a real View Changer class, we would have more confidence.
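The convergence property described above can be prototyped against a toy model. Here plain random shuffling stands in for a property-based framework like Hypothesis, and ToyViewChanger is a hypothetical simplification, not the real plenum class:

```python
import random

class ToyViewChanger:
    """Toy model: a node collects InstanceChange votes per proposed view and
    moves to that view once it sees a quorum (n - f) of votes for it."""
    def __init__(self, n=4):
        self.n = n
        self.f = (n - 1) // 3
        self.view_no = 0
        self._votes = {}  # proposed_view -> set of voter ids

    def on_instance_change(self, voter, proposed_view):
        if proposed_view <= self.view_no:
            return
        self._votes.setdefault(proposed_view, set()).add(voter)
        if len(self._votes[proposed_view]) >= self.n - self.f:
            self.view_no = proposed_view

def converges(seed, n=4):
    """Property: delivering every node's IC to every node, in any order,
    leaves all nodes in the same view."""
    rng = random.Random(seed)
    nodes = [ToyViewChanger(n) for _ in range(n)]
    msgs = [(voter, 1, node) for voter in range(n) for node in nodes]
    rng.shuffle(msgs)
    for voter, view, node in msgs:
        node.on_instance_change(voter, view)
    return len({nd.view_no for nd in nodes}) == 1
```

A real property-based test would also drop and duplicate messages and start nodes in different views; this toy only demonstrates the shape of the property.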
Actually I tried to implement a test where only a minority of nodes enters view change, however there are some problems with this:
- it is possible to do so on a 7-node pool; however, since the master primary also enters view change and ordering is stopped, this case is no different from the one already implemented
- preventing the primary from collecting an instance change quorum by restarting it will lead to primary disconnection events, and view change will happen anyway
- fun fact: in INDY-1903 the primary was not restarted, but it didn't receive instance change messages during network outages, which is why it didn't enter view change. However, it should have received them due to queueing (and I fail to see how to reproduce this correctly in tests), so probably there was yet another bug?
Implementing property-based tests for the view changer would be extremely helpful; however, the view changer depends a lot on the node and uses the real communication layer, not a simulated one. There are multiple implementation options, but they all have drawbacks:
- just implement a full-blown integration property-based test: this can be done pretty quickly, however such a test would take ages to run, and I have concerns about the reproducibility of its results
- use monkeypatching and mocking extensively to isolate the view changer without touching it too much: this can be done in a moderate amount of time, however I'm afraid such tests would be very fragile, as they would depend a lot on implementation details
- refactor the view changer so that it can be easily used in isolation: this is the cleanest solution, and it can be used as a foundation for similar changes in other parts of the code, but the question is whether we are okay with the amount of work it would take
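A rough sketch of the refactoring option: a view changer core whose only ties to the node are injected callables, so it could run against a simulated network. All names here are hypothetical, not the real plenum API:

```python
class ViewChangerCore:
    """Hypothetical isolated view changer: instead of reaching into the node,
    it receives `send` (broadcast a message) and `schedule` (arm a timer)
    as constructor arguments, so tests can substitute simulated versions."""
    def __init__(self, node_id, n, send, schedule):
        self.node_id = node_id
        self.n = n
        self.f = (n - 1) // 3
        self.view_no = 0
        self.send = send          # send(msg): broadcast to the pool
        self.schedule = schedule  # schedule(delay, callback): timer service
        self._votes = {}          # proposed_view -> set of senders

    def propose(self, proposed_view):
        # Broadcast our own InstanceChange vote.
        self.send(("INSTANCE_CHANGE", self.node_id, proposed_view))

    def on_instance_change(self, frm, proposed_view):
        self._votes.setdefault(proposed_view, set()).add(frm)
        if (len(self._votes[proposed_view]) >= self.n - self.f
                and proposed_view > self.view_no):
            self.view_no = proposed_view
```

With this shape, a property-based test only needs to wire several cores together through an in-memory message queue, no real transport involved.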
2. Restart nodes_to_restart (Beta, Gamma).
3. Wait OUTDATED_INSTANCE_CHANGES_CHECK_INTERVAL sec.
4. nodes_to_restart send InstanceChanges for all nodes.
5. Ensure elections are done.
Please mention that the VC didn't happen since the IC from the panicked node was discarded by timeout.
Done
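The timeout-based discarding of stale InstanceChange votes could look roughly like this. This is a hedged sketch: InstanceChangeCache and its API are hypothetical, and only the OUTDATED_INSTANCE_CHANGES_CHECK_INTERVAL name comes from the test above (its value here is an illustrative assumption):

```python
import time

OUTDATED_INSTANCE_CHANGES_CHECK_INTERVAL = 300  # seconds; assumed value

class InstanceChangeCache:
    """Keep only recent InstanceChange votes, so a vote from a node that
    panicked long ago is discarded instead of feeding a stale quorum."""
    def __init__(self, ttl=OUTDATED_INSTANCE_CHANGES_CHECK_INTERVAL,
                 clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock        # injectable for deterministic tests
        self._votes = {}          # (view_no, sender) -> arrival timestamp

    def add(self, view_no, sender):
        self._votes[(view_no, sender)] = self.clock()

    def votes_for(self, view_no):
        now = self.clock()
        # Drop votes older than ttl before counting the rest.
        self._votes = {k: t for k, t in self._votes.items()
                       if now - t <= self.ttl}
        return [s for (v, s) in self._votes if v == view_no]
```

Injecting the clock makes the discard behaviour testable without real waiting.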
Force-pushed from bdcae11 to dc1a7d9.