Doomslug #1991

SkidanovAlex · 2020-01-21T00:46:44Z

This PR consists of three commits: one reverts proof-of-stake-time, one completely removes the concept of weight from the codebase, and the final one contains the implementation of Doomslug. It might be easier to review the last commit without the first two, since the first two are very mechanical, but create enormous number of changes.

Since Github doesn't by default pick up the description of the multi-commit PR, pasting here the description of Doomslug commit:

Implementation and integration of Doomslug (see https://near.ai/doomslug).

Doomslug itself is in chain/doomslug.rs, and doesn't depend on chain or any
other code. It also always takes the current time as an argument. It
significantly simplifies testing.

Doomslug is stored on the client, not chain, because both block
production and approvals processing happens on client. Instantiating on
chain would require a slightly more complex interface (e.g. it would be
impossible to pass me).

Doomslug needs to know about the latest Tip. Instead of intercepting the
lowest level tip-updating routine (which is in storage), I update the
Tip in the client when we accept the new head. It could miss head
changes related to syncing and challenges. To be safe I always update
the head whenever I interact with Doomslug, so the head is guaranteed to
be accurate whenever we do anything related to Doomslug, but it might
get sent to Doomslug with a slight delay.

This change also solves a problem that approvals before were in a cache
hash map, therefore an adversary could have spammed a node with lots of
invalid approvals and remove all valid approvals from the cache,
resulting in a node not having enough approvals when producing
the block. New logic is more complex, see
DoomslugApprovalsTrackersAtHeight class.

Test plan

Sanity tests (basic invariants for set_tip and approval processing)
A fuzzy test that tests safety and liveness under different times to
GST and deltas.
test_catchup_sanity_blocks_produced_doomslug verifies that heights
are properly skipped.
Also added checking doomslug invariant into cross_shard_tx, which
is known to evoke all kinds of weird structures (but cross_shard_tx
operates without the requirement to have 50% approvals on the block,
thus still causing lots of forkfullness).
A new version of cross_shard_tx that enables doomslug, but disables
tampering with the finality gadget. Thus the vanilla version tests heavy
forkfulness and tampers with FG, while the new doomslug version has
practically no forfulness due to doomslug, and doesn't stress the FG as
much, but does test block production with doomslug.

codecov · 2020-01-21T00:56:03Z

Codecov Report

Merging #1991 into staging will increase coverage by 0.25%.
The diff coverage is 95.57%.

@@             Coverage Diff             @@
##           staging    #1991      +/-   ##
===========================================
+ Coverage    87.13%   87.38%   +0.25%     
===========================================
  Files          170      170              
  Lines        33544    33845     +301     
===========================================
+ Hits         29227    29575     +348     
+ Misses        4317     4270      -47

Impacted Files	Coverage Δ
chain/network/tests/runner/mod.rs	`96.14% <ø> (+0.01%)`	⬆️
chain/chain/src/lib.rs	`100% <ø> (ø)`	⬆️
near/src/config.rs	`95.2% <ø> (-0.02%)`	⬇️
chain/client/tests/bug_repros.rs	`92.59% <ø> (-0.1%)`	⬇️
chain/jsonrpc/tests/rpc_query.rs	`98.18% <ø> (-0.04%)`	⬇️
chain/network/src/test_utils.rs	`91.75% <ø> (-0.09%)`	⬇️
chain/chain/src/error.rs	`75.92% <ø> (ø)`	⬆️
near/src/lib.rs	`98.21% <ø> (ø)`	⬆️
chain/network/tests/stress_network.rs	`0% <0%> (ø)`	⬆️
core/primitives/src/test_utils.rs	`92.59% <100%> (-0.57%)`	⬇️
... and 96 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea8dc46...edcc305. Read the comment docs.

MaksymZavershynskyi

LGTM for runtime

chain/chain/src/doomslug.rs

ilblackdragon · 2020-01-21T01:22:23Z

chain/chain/tests/finality.rs

@@ -165,7 +168,14 @@ fn create_block(
 }

 fn apr(account_id: AccountId, reference_hash: CryptoHash, parent_hash: CryptoHash) -> Approval {


rename to approve?

ilblackdragon · 2020-01-21T01:27:05Z

chain/chain/src/doomslug.rs

+                self.tip.reference_hash,
+                target_height,
+                is_endorsement,
+                &**self.signer.as_ref().unwrap(),


this looks like something evil.

if let Some((me, signer)) = (self.me.as_ref(), self.signer.as_ref()) { ... &*signer }

also might be easier to pack me + signer into one object - BlockSigner.
We also need BlockSigner to get HSM signing going (because we will need to pass non-hashed data to HSM for signatures anyway ,so we need API that has specific sign_block, sign_approval APIs).
/\ this all sounds like a separate PR if you are not me :D

I don't want to change anything related to HSM yet, it will be a separate effort.

Your approach still requires &**signer, because it's a reference to an Arc.

ilblackdragon · 2020-01-21T01:45:16Z

chain/client/src/client.rs

@@ -97,15 +103,27 @@ impl Client {
        let num_block_producer_seats = config.num_block_producer_seats as usize;
        let data_parts = runtime_adapter.num_data_parts();
        let parity_parts = runtime_adapter.num_total_parts() - data_parts;
+
+        let doomslug = Doomslug::new(
+            block_producer.as_ref().map(|x| x.account_id.clone()),


may be let's move BlockProducer to primitives (and rename to BlockSigner) and use it in Doomslug instead of splitting here?

chain/chain/src/doomslug.rs

bowenwang1996 · 2020-01-22T03:56:37Z

chain/chain/src/doomslug.rs

+/// `None`
+/// `PassedThreshold(when)` - after processing this approval the block has passed the threshold set by
+///                     `threshold_mode` (either one half of the total stake, or a single approval).
+///                     Once the threshold is hit, we want for `T(h - h_final) / 2` before producing


want -> wait. Also why do we wait for T(h - h_final) / 2?

The paper has been updated to reflect as to why we sleep

chain/chain/src/doomslug.rs

bowenwang1996 · 2020-01-22T21:24:45Z

chain/chain/src/doomslug.rs

+    pub fn new(
+        me: Option<AccountId>,
+        largest_previously_skipped_height: BlockHeight,
+        largest_previously_endorsed_height: BlockHeight,


why are these two arguments passed in instead of being initialized as 0?

chain/chain/src/doomslug.rs

bowenwang1996 · 2020-01-23T00:23:08Z

chain/chain/src/doomslug.rs

+        approved_stake > threshold
+    }
+
+    pub fn remove_witness(


why do we need to remove witness?

We remove it when we consume it to produce a block

chain/chain/src/doomslug.rs

bowenwang1996 · 2020-01-23T00:27:20Z

chain/chain/src/doomslug.rs

+        self.timer.started = now;
+
+        self.approval_tracking
+            .retain(|h, _| *h > height && *h <= height + MAX_HEIGHTS_AHEAD_TO_STORE_APPROVALS);


This seems inefficient

It's 10k iterations per each update of the head, which is acceptable.

Opened #2022 to track it.

frol

RPC changes are fine and I approve them, but I have a few suggestions in other places.

frol · 2020-01-28T13:21:54Z

chain/chain/src/doomslug.rs

+            if self.time_passed_threshold == None {
+                self.time_passed_threshold = Some(now);
+            }
+            DoomslugBlockProductionReadiness::ReadyToProduce(self.time_passed_threshold.unwrap())


I suggest to avoid .unwrap(), .e.g.:

let time_passed_threshold = if let Some(threshold) = self.time_passed_threshold { threshold } else { self.time_passed_threshold = Some(now); now }; DoomslugBlockProductionReadiness::ReadyToProduce(time_passed_threshold)

frol · 2020-01-28T13:28:38Z

chain/chain/src/doomslug.rs

+    pub fn create_approval(
+        &self,
+        target_height: BlockHeight,
+        is_endorsement: bool,


I just want to bring this to our attention, that bool-based APIs are usually hard to follow when you read/maintain the code, e.g. the use looks like self.create_approval(self.timer.height + 1, false) (it is hard to tell what that false implies): http://blakesmith.me/2019/05/07/rust-patterns-enums-instead-of-booleans.html / https://www.reddit.com/r/programming/comments/ebxzp8/dont_use_booleans/

frol · 2020-01-28T13:37:38Z

core/primitives/src/block.rs

+    pub reference_hash: Option<CryptoHash>,
+    pub target_height: BlockHeight,
+    pub is_endorsement: bool,


If I understand the code correctly, Approval can be an enum, which would express the type in a safer way.

frol · 2020-01-28T13:42:37Z

core/primitives/src/views.rs

    pub next_bp_hash: CryptoHash,
-    pub approvals: Vec<(AccountId, CryptoHash, CryptoHash, Signature)>,
+    pub approvals: Vec<(AccountId, CryptoHash, Option<CryptoHash>, BlockHeight, bool, Signature)>,


How do you feel about making this tuple a dedicated type (is it Approval structure?) where the arguments are named?

bowenwang1996 · 2020-01-28T21:34:34Z

chain/chain/src/doomslug.rs

+        } else if self.approved_stake > self.total_stake / 2
+            || self.threshold_mode == DoomslugThresholdMode::NoApprovals
+        {
+            if self.time_passed_threshold == None {


Suggested change

if self.time_passed_threshold == None {

if self.time_passed_threshold.is_none() {

Well, what I suggested above is to actually avoid this check and the unwrapping a few lines below in the first place.

bowenwang1996 · 2020-01-29T13:54:18Z

chain/chain/tests/doomslug.rs

+        let mut is_done = false;
+        while !is_done {
+            now = now + Duration::from_millis(25);
+            /*println!("{:?}, {}, {}", now, approval_queue.len(), block_queue.len());


should this be removed?

bowenwang1996 · 2020-01-29T14:23:14Z

chain/client/src/client.rs

@@ -291,10 +265,22 @@ impl Client {
        )?
        .clone();

+        let account_id_to_stake = self


duplicate code here?

bowenwang1996 · 2020-01-29T15:20:32Z

chain/client/src/client.rs

+                    &approval,
+                )?;
+
+                self.collect_block_approval(&approval, true);


looks like this is not necessary

This reverts commit 2c7bf9c.

In preparation for implementing Doomslug, removing the concept of `weight` and replacing the fork choice rule to choose the chain with the highest score, and then height. Replacing score with the height of the last pre-voted block rather than the weight.

Implementation and integration of Doomslug (see https://near.ai/doomslug). Doomslug itself is in `chain/doomslug.rs`, and doesn't depend on chain or any other code. It also always takes the current time as an argument. It significantly simplifies testing. Doomslug is stored on the client, not chain, because both block production and approvals processing happens on client. Instantiating on chain would require a slightly more complex interface (e.g. it would be impossible to pass `me`). Doomslug needs to know about the latest Tip. Instead of intercepting the lowest level tip-updating routine (which is in storage), I update the Tip in the client when we accept the new head. It could miss head changes related to syncing and challenges. To be safe I always update the head whenever I interact with Doomslug, so the head is guaranteed to be accurate whenever we do anything related to Doomslug, but it might get sent to Doomslug with a slight delay. This change also solves a problem that approvals before were in a cache hash map, therefore an adversary could have spammed a node with lots of invalid approvals and remove all valid approvals from the cache, resulting in a node not having enough approvals when producing the block. New logic is more complex, see `DoomslugApprovalsTrackersAtHeight` class. Test plan --------- 1. Sanity tests (basic invariants for `set_tip` and approval processing) 2. A fuzzy test that tests safety and liveness under different times to GST and deltas. 3. `test_catchup_sanity_blocks_produced_doomslug` verifies that heights are properly skipped. 4. Also added checking doomslug invariant into `cross_shard_tx`, which is known to evoke all kinds of weird structures (but `cross_shard_tx` operates without the requirement to have 50% approvals on the block, thus still causing lots of forkfullness). 5. A new version of `cross_shard_tx` that enables doomslug, but disables tampering with the finality gadget. Thus the vanilla version tests heavy forkfulness and tampers with FG, while the new doomslug version has practically no forfulness due to doomslug, and doesn't stress the FG as much, but does test block production with doomslug.

SkidanovAlex requested review from bowenwang1996, evgenykuzyakov and MaksymZavershynskyi as code owners January 21, 2020 00:46

SkidanovAlex force-pushed the doomslug branch from fe367d8 to f88ba21 Compare January 21, 2020 00:58

MaksymZavershynskyi approved these changes Jan 21, 2020

View reviewed changes

ilblackdragon reviewed Jan 21, 2020

View reviewed changes

SkidanovAlex force-pushed the doomslug branch from f88ba21 to 2e3e7c4 Compare January 21, 2020 19:11

evgenykuzyakov reviewed Jan 21, 2020

View reviewed changes

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved

bowenwang1996 reviewed Jan 23, 2020

View reviewed changes

weekly-digest bot mentioned this pull request Jan 24, 2020

Weekly Digest (17 January, 2020 - 24 January, 2020) #2018

Closed

SkidanovAlex force-pushed the doomslug branch from 2e3e7c4 to 61ca21e Compare January 25, 2020 18:44

SkidanovAlex requested a review from frol as a code owner January 25, 2020 18:44

frol approved these changes Jan 28, 2020

View reviewed changes

bowenwang1996 approved these changes Jan 29, 2020

View reviewed changes

SkidanovAlex added 2 commits January 29, 2020 09:58

Revert "Proof-of-stake-time (#1848)"

e10dc0e

This reverts commit 2c7bf9c.

SkidanovAlex force-pushed the doomslug branch 2 times, most recently from 8d6bb3e to 3cd4e80 Compare January 29, 2020 18:37

SkidanovAlex force-pushed the doomslug branch from 3cd4e80 to edcc305 Compare January 29, 2020 18:42

SkidanovAlex merged commit 4c9ebd7 into staging Jan 29, 2020

SkidanovAlex deleted the doomslug branch January 29, 2020 20:38

weekly-digest bot mentioned this pull request Jan 31, 2020

Weekly Digest (24 January, 2020 - 31 January, 2020) #2051

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doomslug #1991

Doomslug #1991

SkidanovAlex commented Jan 21, 2020

codecov bot commented Jan 21, 2020 •

edited

Loading

MaksymZavershynskyi left a comment

ilblackdragon Jan 21, 2020

ilblackdragon Jan 21, 2020

SkidanovAlex Jan 21, 2020

ilblackdragon Jan 21, 2020

bowenwang1996 Jan 22, 2020

SkidanovAlex Jan 25, 2020

bowenwang1996 Jan 22, 2020

bowenwang1996 Jan 23, 2020

SkidanovAlex Jan 25, 2020

bowenwang1996 Jan 23, 2020

SkidanovAlex Jan 25, 2020

frol left a comment

frol Jan 28, 2020

frol Jan 28, 2020

frol Jan 28, 2020

frol Jan 28, 2020

bowenwang1996 Jan 28, 2020

frol Jan 29, 2020

bowenwang1996 Jan 29, 2020

bowenwang1996 Jan 29, 2020

bowenwang1996 Jan 29, 2020

		@@ -165,7 +168,14 @@ fn create_block(
		}

		fn apr(account_id: AccountId, reference_hash: CryptoHash, parent_hash: CryptoHash) -> Approval {

	if self.time_passed_threshold == None {
	if self.time_passed_threshold.is_none() {

Doomslug #1991

Doomslug #1991

Conversation

SkidanovAlex commented Jan 21, 2020

Test plan

codecov bot commented Jan 21, 2020 • edited Loading

Codecov Report

MaksymZavershynskyi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

frol left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Jan 21, 2020 •

edited

Loading