Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INDY-2231: fixing checkpoint stabilization after view change on nodes lagging behind #1359

Merged
merged 14 commits into from
Oct 7, 2019

Conversation

ashcherbakov
Copy link
Contributor

  • do not stabilize checkpoints from NewView during view change on nodes lagging behind (that is if a node doesn't have this checkpoint)
  • change quorum (from weak to strong certificate) when calculating a checkpoint for NewView. This is needed to make sure that view change is finished and nodes can order without catchup before processing NewView. Now the checkpoint can be lower, so more re-ordering may be needed.
  • added simulation tests with random seeds
  • fixes and improvements in tests

…lica doesn't have this checkpoint

Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
skhoroshavin
skhoroshavin previously approved these changes Oct 3, 2019
@@ -330,7 +330,7 @@ def calc_checkpoint(self, vcs: List[ViewChange]) -> Optional[Checkpoint]:

# Don't add checkpoint to pretending ones if not enough nodes have it
have_checkpoint = [vc for vc in vcs if cur_cp in vc.checkpoints]
if not self._data.quorums.weak.is_reached(len(have_checkpoint)):
if not self._data.quorums.strong.is_reached(len(have_checkpoint)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding comment here stating that this is a divergence from PBFT paper and describing reasons for doing it (in short) would be nice



@pytest.mark.parametrize("seed", range(100))
def test_view_change_while_ordering_with_real_msgs(seed):
@pytest.mark.parametrize("seed", Random().sample(range(1000000), 100))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to:

  • move this into helper functions
  • use it instead of range(xxx) in all places where seed is parametrized

All in all, I like this solution - it is very minimalistic, yet satisfies all requirements (especially one that used seeds should be logged).

@pytest.mark.parametrize("seed", range(10))
def test_ordering_with_real_msgs(seed):
@pytest.mark.parametrize("seed", range(100))
def test_ordering_with_real_msgs_default_seed(seed):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we really need tests with fixed seeds, when we already have ones with random. Rationale is that when we do have some easily (>5%) reproduced problems it doesn't matter whether seeds are all in 0..100 range or not - only number of them really matters.

Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
@ashcherbakov
Copy link
Contributor Author

test this please

view_no = self._get_view_no_from_audit(audit_txn)
digest = self._get_digest_from_audit(audit_ledger, audit_txn_seq_no)
return Checkpoint(instId=self._data.inst_id,
viewNo=view_no,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think we can fill viewNo with just 0 or None or even drop it together with seqNoStart - we don't need these fields anymore. Given that next upgrade most probably should be forced I think we have a good chance to do some protocol cleanup as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree. I would prefer to do it in a separate PR since we already fixed more than expected in this one.

@@ -283,7 +283,7 @@ def _finish_view_change_if_needed(self):
def _finish_view_change(self):
# Update shared data
self._data.waiting_for_new_view = False
self._data.prev_view_prepare_cert = self._new_view.batches[-1].pp_seq_no if self._new_view.batches else None
self._data.prev_view_prepare_cert = self._new_view.batches[-1].pp_seq_no if self._new_view.batches else 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we don't reset pp_seq_no after view change anymore this looks very suspicious. None actually looks more explicit (we don't know last cert) than 0. Even better - why not default it just to last ordered pp seq no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that ppSeqNo starts with 1, so 0 here can be considered as None or don't know (agree that we need a comment about this).
The assumption 'don't know == 0' is more safe from the code point of view since we are doing (and may do more in future) comparisons (>=, <=), so we don't have to check for None every time.
There is another non-obvious benefit of setting it to 0 and not last_ordered: we do allow re-ordering for everything in between low watermark and prepared certificate. So setting it to last_ordered allows to potentially flood the pool with unnecessary re-ordering.

@@ -139,13 +139,7 @@ def _start_catchup_if_needed(self, key: CheckpointKey):
caught_up_till_3pc=key_3pc))
self.caught_up_till_3pc(key_3pc)

def gc_before_new_view(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 🎉 🎉

waitNodeDataEquality(looper, *txnPoolNodeSet, customTimeout=120,
exclude_from_check=['check_last_ordered_3pc_backup'])
assert len(log_re_ask) - old_re_ask_count == 2 # for audit and domain ledgers
waitNodeDataEquality(looper, *txnPoolNodeSet, customTimeout=120,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move out of consistency proof delayer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't move out, then we can not finish the view change, can't we?

@skhoroshavin skhoroshavin merged commit 9dd894d into hyperledger:master Oct 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants