INDY-2231: fixing checkpoint stabilization after view change on nodes lagging behind #1359
Conversation
ashcherbakov
commented
Oct 3, 2019
- do not stabilize checkpoints from NewView during view change on nodes lagging behind (that is if a node doesn't have this checkpoint)
- change quorum (from weak to strong certificate) when calculating a checkpoint for NewView. This is needed to make sure that view change is finished and nodes can order without catchup before processing NewView. Now the checkpoint can be lower, so more re-ordering may be needed.
- added simulation tests with random seeds
- fixes and improvements in tests
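The weak-to-strong quorum change in the second bullet can be sketched as follows. The class names and formulas here are assumed (standard BFT thresholds for n = 3f + 1 nodes), not copied from the Plenum codebase:

```python
class Quorum:
    """A vote-count threshold."""

    def __init__(self, value):
        self.value = value

    def is_reached(self, count):
        return count >= self.value


class Quorums:
    """Assumed standard BFT thresholds for n = 3f + 1 nodes."""

    def __init__(self, n):
        f = (n - 1) // 3
        # weak: f + 1 votes guarantee at least one honest node among them
        self.weak = Quorum(f + 1)
        # strong: n - f votes guarantee the checkpoint is stable enough
        # for nodes to finish the view change and order without catchup
        self.strong = Quorum(n - f)


q = Quorums(n=4)                   # f = 1
assert q.weak.is_reached(2)        # weak quorum is f + 1 = 2
assert not q.strong.is_reached(2)  # strong quorum needs n - f = 3
assert q.strong.is_reached(3)
```

Requiring the strong certificate picks a checkpoint that enough nodes actually hold, at the cost of sometimes choosing a lower one (hence the extra re-ordering mentioned above).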
…lica doesn't have this checkpoint
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
```diff
@@ -330,7 +330,7 @@ def calc_checkpoint(self, vcs: List[ViewChange]) -> Optional[Checkpoint]:
     # Don't add checkpoint to pretending ones if not enough nodes have it
     have_checkpoint = [vc for vc in vcs if cur_cp in vc.checkpoints]
-    if not self._data.quorums.weak.is_reached(len(have_checkpoint)):
+    if not self._data.quorums.strong.is_reached(len(have_checkpoint)):
```
I think adding a comment here stating that this is a divergence from the PBFT paper, with a short description of the reasons for it, would be nice.
```diff
-@pytest.mark.parametrize("seed", range(100))
+@pytest.mark.parametrize("seed", Random().sample(range(1000000), 100))
 def test_view_change_while_ordering_with_real_msgs(seed):
```
I think it would be nice to:
- move this into a helper function
- use it instead of `range(xxx)` in all the places where `seed` is parametrized

All in all, I like this solution: it is very minimalistic, yet satisfies all the requirements (especially the one that the used seeds should be logged).
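One possible shape for that helper (a sketch only; `random_seeds` and its signature are made up here to illustrate the suggestion, they are not in the PR):

```python
import random

import pytest


def random_seeds(count=100, upper=1000000):
    """Hypothetical helper: return `count` distinct random seeds in [0, upper).

    Because each seed becomes a pytest parameter, it appears in the test id,
    so a failing seed is logged by pytest and can be replayed directly.
    """
    return random.Random().sample(range(upper), count)


@pytest.mark.parametrize("seed", random_seeds())
def test_view_change_while_ordering_with_real_msgs(seed):
    ...  # simulation body omitted
```

A failing case then shows up as e.g. `test_view_change_while_ordering_with_real_msgs[734512]`, which can be re-run with `-k 734512`.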
```diff
-@pytest.mark.parametrize("seed", range(10))
-def test_ordering_with_real_msgs(seed):
+@pytest.mark.parametrize("seed", range(100))
+def test_ordering_with_real_msgs_default_seed(seed):
```
I don't think we really need tests with fixed seeds when we already have ones with random seeds. The rationale: for easily reproduced problems (>5% rate) it doesn't matter whether the seeds are all in the 0..100 range or not; only the number of them really matters.
test this please
```python
view_no = self._get_view_no_from_audit(audit_txn)
digest = self._get_digest_from_audit(audit_ledger, audit_txn_seq_no)
return Checkpoint(instId=self._data.inst_id,
                  viewNo=view_no,
```
Actually I think we can fill viewNo with just 0 or None, or even drop it together with seqNoStart: we don't need these fields anymore. Given that the next upgrade will most probably have to be forced, I think we have a good chance to do some protocol cleanup as well.
Yeah, agreed. I would prefer to do it in a separate PR, since we have already fixed more than expected in this one.
```diff
@@ -283,7 +283,7 @@ def _finish_view_change_if_needed(self):
     def _finish_view_change(self):
         # Update shared data
         self._data.waiting_for_new_view = False
-        self._data.prev_view_prepare_cert = self._new_view.batches[-1].pp_seq_no if self._new_view.batches else None
+        self._data.prev_view_prepare_cert = self._new_view.batches[-1].pp_seq_no if self._new_view.batches else 0
```
Given that we don't reset pp_seq_no after a view change anymore, this looks very suspicious. `None` actually looks more explicit (we don't know the last cert) than 0. Even better: why not default it to the last ordered pp_seq_no?
Please note that ppSeqNo starts at 1, so 0 here can be read as None or "don't know" (agreed that we need a comment about this).
The assumption "don't know == 0" is safer from the code point of view, since we do comparisons (>=, <=) on this value (and may do more in the future), so we don't have to check for None every time.
There is another non-obvious benefit of setting it to 0 rather than last_ordered: we do allow re-ordering for everything between the low watermark and the prepared certificate, so setting it to last_ordered could flood the pool with unnecessary re-ordering.
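A small sketch of the first point (the function names below are made up for illustration, not from the codebase): because real ppSeqNo values start at 1, a 0 sentinel lets the ordinary comparison cover the "unknown" case, while a None sentinel forces a guard at every comparison site.

```python
def covered_by_prev_view_cert(pp_seq_no, prev_view_prepare_cert):
    # 0 sentinel meaning "no prepare certificate known": since ppSeqNo
    # starts at 1, every real batch compares as > 0, so nothing is
    # treated as covered when the certificate is unknown.
    return pp_seq_no <= prev_view_prepare_cert


def covered_by_prev_view_cert_none(pp_seq_no, prev_view_prepare_cert):
    # None sentinel: an explicit guard is needed before each comparison,
    # and forgetting one raises a TypeError at runtime.
    if prev_view_prepare_cert is None:
        return False
    return pp_seq_no <= prev_view_prepare_cert
```

Both behave identically, but the 0 variant keeps every call site a one-line comparison.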
```diff
@@ -139,13 +139,7 @@ def _start_catchup_if_needed(self, key: CheckpointKey):
                                       caught_up_till_3pc=key_3pc))
         self.caught_up_till_3pc(key_3pc)

-    def gc_before_new_view(self):
```
🎉 🎉 🎉
```python
    waitNodeDataEquality(looper, *txnPoolNodeSet, customTimeout=120,
                         exclude_from_check=['check_last_ordered_3pc_backup'])
    assert len(log_re_ask) - old_re_ask_count == 2  # for audit and domain ledgers
    waitNodeDataEquality(looper, *txnPoolNodeSet, customTimeout=120,
```
Why move out of consistency proof delayer?
If we don't move it out, then we cannot finish the view change, can we?