INDY-2231: fixing checkpoint stabilization after view change on nodes lagging behind #1359
Conversation
ashcherbakov
commented
Oct 3, 2019
- do not stabilize checkpoints from NewView during view change on nodes lagging behind (that is if a node doesn't have this checkpoint)
- change quorum (from weak to strong certificate) when calculating a checkpoint for NewView. This is needed to make sure that view change is finished and nodes can order without catchup before processing NewView. Now the checkpoint can be lower, so more re-ordering may be needed.
- added simulation tests with random seeds
- fixes and improvements in tests
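The weak-to-strong quorum change in the second bullet can be sketched as follows. The class names and formulas here are assumed (standard BFT thresholds for n = 3f + 1 nodes), not copied from the Plenum codebase:

```python
class Quorum:
    """A vote-count threshold."""

    def __init__(self, value):
        self.value = value

    def is_reached(self, count):
        return count >= self.value


class Quorums:
    """Assumed standard BFT thresholds for n = 3f + 1 nodes."""

    def __init__(self, n):
        f = (n - 1) // 3
        # weak: f + 1 votes guarantee at least one honest node among them
        self.weak = Quorum(f + 1)
        # strong: n - f votes guarantee the checkpoint is stable enough
        # for nodes to finish the view change and order without catchup
        self.strong = Quorum(n - f)


q = Quorums(n=4)                   # f = 1
assert q.weak.is_reached(2)        # weak quorum is f + 1 = 2
assert not q.strong.is_reached(2)  # strong quorum needs n - f = 3
assert q.strong.is_reached(3)
```

Requiring the strong certificate picks a checkpoint that enough nodes actually hold, at the cost of sometimes choosing a lower one (hence the extra re-ordering mentioned above).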
…lica doesn't have this checkpoint
Signed-off-by: ashcherbakov <alexander.sherbakov@dsr-corporation.com>
```diff
@@ -330,7 +330,7 @@ def calc_checkpoint(self, vcs: List[ViewChange]) -> Optional[Checkpoint]:
     # Don't add checkpoint to pretending ones if not enough nodes have it
     have_checkpoint = [vc for vc in vcs if cur_cp in vc.checkpoints]
-    if not self._data.quorums.weak.is_reached(len(have_checkpoint)):
+    if not self._data.quorums.strong.is_reached(len(have_checkpoint)):
```
I think adding a comment here stating that this is a divergence from the PBFT paper, with a short description of the reasons for it, would be nice.
```diff
-@pytest.mark.parametrize("seed", range(100))
+@pytest.mark.parametrize("seed", Random().sample(range(1000000), 100))
 def test_view_change_while_ordering_with_real_msgs(seed):
```
I think it would be nice to:
- move this into a helper function
- use it instead of `range(xxx)` in all the places where `seed` is parametrized

All in all, I like this solution: it is very minimalistic, yet satisfies all the requirements (especially the one that the used seeds should be logged).
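One possible shape for that helper (a sketch only; `random_seeds` and its signature are made up here to illustrate the suggestion, they are not in the PR):

```python
import random

import pytest


def random_seeds(count=100, upper=1000000):
    """Hypothetical helper: return `count` distinct random seeds in [0, upper).

    Because each seed becomes a pytest parameter, it appears in the test id,
    so a failing seed is logged by pytest and can be replayed directly.
    """
    return random.Random().sample(range(upper), count)


@pytest.mark.parametrize("seed", random_seeds())
def test_view_change_while_ordering_with_real_msgs(seed):
    ...  # simulation body omitted
```

A failing case then shows up as e.g. `test_view_change_while_ordering_with_real_msgs[734512]`, which can be re-run with `-k 734512`.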
```diff
-@pytest.mark.parametrize("seed", range(10))
-def test_ordering_with_real_msgs(seed):
+@pytest.mark.parametrize("seed", range(100))
+def test_ordering_with_real_msgs_default_seed(seed):
```
I don't think we really need tests with fixed seeds when we already have ones with random seeds. The rationale: for easily reproduced problems (>5% rate) it doesn't matter whether the seeds are all in the 0..100 range or not; only the number of them really matters.
test this please
```python
view_no = self._get_view_no_from_audit(audit_txn)
digest = self._get_digest_from_audit(audit_ledger, audit_txn_seq_no)
return Checkpoint(instId=self._data.inst_id,
                  viewNo=view_no,
```
Actually I think we can fill viewNo with just 0 or None, or even drop it together with seqNoStart: we don't need these fields anymore. Given that the next upgrade will most probably have to be forced, I think we have a good chance to do some protocol cleanup as well.
Yeah, agreed. I would prefer to do it in a separate PR, since we have already fixed more than expected in this one.
```diff
@@ -283,7 +283,7 @@ def _finish_view_change_if_needed(self):
     def _finish_view_change(self):
         # Update shared data
         self._data.waiting_for_new_view = False
-        self._data.prev_view_prepare_cert = self._new_view.batches[-1].pp_seq_no if self._new_view.batches else None
+        self._data.prev_view_prepare_cert = self._new_view.batches[-1].pp_seq_no if self._new_view.batches else 0
```
Given that we don't reset pp_seq_no after a view change anymore, this looks very suspicious. `None` actually looks more explicit (we don't know the last cert) than 0. Even better: why not default it to the last ordered pp_seq_no?
Please note that ppSeqNo starts at 1, so 0 here can be read as None or "don't know" (agreed that we need a comment about this).
The assumption "don't know == 0" is safer from the code point of view, since we do comparisons (>=, <=) on this value (and may do more in the future), so we don't have to check for None every time.
There is another non-obvious benefit of setting it to 0 rather than last_ordered: we do allow re-ordering for everything between the low watermark and the prepared certificate, so setting it to last_ordered could flood the pool with unnecessary re-ordering.
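A small sketch of the first point (the function names below are made up for illustration, not from the codebase): because real ppSeqNo values start at 1, a 0 sentinel lets the ordinary comparison cover the "unknown" case, while a None sentinel forces a guard at every comparison site.

```python
def covered_by_prev_view_cert(pp_seq_no, prev_view_prepare_cert):
    # 0 sentinel meaning "no prepare certificate known": since ppSeqNo
    # starts at 1, every real batch compares as > 0, so nothing is
    # treated as covered when the certificate is unknown.
    return pp_seq_no <= prev_view_prepare_cert


def covered_by_prev_view_cert_none(pp_seq_no, prev_view_prepare_cert):
    # None sentinel: an explicit guard is needed before each comparison,
    # and forgetting one raises a TypeError at runtime.
    if prev_view_prepare_cert is None:
        return False
    return pp_seq_no <= prev_view_prepare_cert
```

Both behave identically, but the 0 variant keeps every call site a one-line comparison.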
```diff
@@ -139,13 +139,7 @@ def _start_catchup_if_needed(self, key: CheckpointKey):
                                       caught_up_till_3pc=key_3pc))
         self.caught_up_till_3pc(key_3pc)

-    def gc_before_new_view(self):
```
🎉 🎉 🎉
```python
    waitNodeDataEquality(looper, *txnPoolNodeSet, customTimeout=120,
                         exclude_from_check=['check_last_ordered_3pc_backup'])
    assert len(log_re_ask) - old_re_ask_count == 2  # for audit and domain ledgers
    waitNodeDataEquality(looper, *txnPoolNodeSet, customTimeout=120,
```
Why move out of consistency proof delayer?
If we don't move it out, then we cannot finish the view change, can we?