Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doomslug #1991

Merged
merged 3 commits into from Jan 29, 2020
Merged

Doomslug #1991

merged 3 commits into from Jan 29, 2020

Conversation

@SkidanovAlex
Copy link
Member

SkidanovAlex commented Jan 21, 2020

This PR consists of three commits: one reverts proof-of-stake-time, one completely removes the concept of weight from the codebase, and the final one contains the implementation of Doomslug. It might be easier to review the last commit without the first two, since the first two are very mechanical, but create enormous number of changes.

Since Github doesn't by default pick up the description of the multi-commit PR, pasting here the description of Doomslug commit:

Implementation and integration of Doomslug (see https://near.ai/doomslug).

Doomslug itself is in chain/doomslug.rs, and doesn't depend on chain or any
other code. It also always takes the current time as an argument. It
significantly simplifies testing.

Doomslug is stored on the client, not chain, because both block
production and approvals processing happens on client. Instantiating on
chain would require a slightly more complex interface (e.g. it would be
impossible to pass me).

Doomslug needs to know about the latest Tip. Instead of intercepting the
lowest level tip-updating routine (which is in storage), I update the
Tip in the client when we accept the new head. It could miss head
changes related to syncing and challenges. To be safe I always update
the head whenever I interact with Doomslug, so the head is guaranteed to
be accurate whenever we do anything related to Doomslug, but it might
get sent to Doomslug with a slight delay.

This change also solves a problem that approvals before were in a cache
hash map, therefore an adversary could have spammed a node with lots of
invalid approvals and remove all valid approvals from the cache,
resulting in a node not having enough approvals when producing
the block. New logic is more complex, see
DoomslugApprovalsTrackersAtHeight class.

Test plan

  1. Sanity tests (basic invariants for set_tip and approval processing)
  2. A fuzzy test that tests safety and liveness under different times to
    GST and deltas.
  3. test_catchup_sanity_blocks_produced_doomslug verifies that heights
    are properly skipped.
  4. Also added checking doomslug invariant into cross_shard_tx, which
    is known to evoke all kinds of weird structures (but cross_shard_tx
    operates without the requirement to have 50% approvals on the block,
    thus still causing lots of forkfullness).
  5. A new version of cross_shard_tx that enables doomslug, but disables
    tampering with the finality gadget. Thus the vanilla version tests heavy
    forkfulness and tampers with FG, while the new doomslug version has
    practically no forfulness due to doomslug, and doesn't stress the FG as
    much, but does test block production with doomslug.
@codecov
Copy link

codecov bot commented Jan 21, 2020

Codecov Report

Merging #1991 into staging will increase coverage by 0.25%.
The diff coverage is 95.57%.

Impacted file tree graph

@@             Coverage Diff             @@
##           staging    #1991      +/-   ##
===========================================
+ Coverage    87.13%   87.38%   +0.25%     
===========================================
  Files          170      170              
  Lines        33544    33845     +301     
===========================================
+ Hits         29227    29575     +348     
+ Misses        4317     4270      -47
Impacted Files Coverage Δ
chain/network/tests/runner/mod.rs 96.14% <ø> (+0.01%) ⬆️
chain/chain/src/lib.rs 100% <ø> (ø) ⬆️
near/src/config.rs 95.2% <ø> (-0.02%) ⬇️
chain/client/tests/bug_repros.rs 92.59% <ø> (-0.1%) ⬇️
chain/jsonrpc/tests/rpc_query.rs 98.18% <ø> (-0.04%) ⬇️
chain/network/src/test_utils.rs 91.75% <ø> (-0.09%) ⬇️
chain/chain/src/error.rs 75.92% <ø> (ø) ⬆️
near/src/lib.rs 98.21% <ø> (ø) ⬆️
chain/network/tests/stress_network.rs 0% <0%> (ø) ⬆️
core/primitives/src/test_utils.rs 92.59% <100%> (-0.57%) ⬇️
... and 96 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea8dc46...edcc305. Read the comment docs.

@SkidanovAlex SkidanovAlex force-pushed the doomslug branch from fe367d8 to f88ba21 Jan 21, 2020
Copy link
Collaborator

nearmax left a comment

LGTM for runtime

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
@@ -165,7 +168,14 @@ fn create_block(
}

fn apr(account_id: AccountId, reference_hash: CryptoHash, parent_hash: CryptoHash) -> Approval {

This comment has been minimized.

Copy link
@ilblackdragon

ilblackdragon Jan 21, 2020

Member

rename to approve?

self.tip.reference_hash,
target_height,
is_endorsement,
&**self.signer.as_ref().unwrap(),

This comment has been minimized.

Copy link
@ilblackdragon

ilblackdragon Jan 21, 2020

Member

this looks like something evil.

if let Some((me, signer)) = (self.me.as_ref(), self.signer.as_ref()) {
   ... 
    &*signer
}

also might be easier to pack me + signer into one object - BlockSigner.
We also need BlockSigner to get HSM signing going (because we will need to pass non-hashed data to HSM for signatures anyway ,so we need API that has specific sign_block, sign_approval APIs).
/\ this all sounds like a separate PR if you are not me :D

This comment has been minimized.

Copy link
@SkidanovAlex

SkidanovAlex Jan 21, 2020

Author Member

I don't want to change anything related to HSM yet, it will be a separate effort.

Your approach still requires &**signer, because it's a reference to an Arc.

@@ -97,15 +103,27 @@ impl Client {
let num_block_producer_seats = config.num_block_producer_seats as usize;
let data_parts = runtime_adapter.num_data_parts();
let parity_parts = runtime_adapter.num_total_parts() - data_parts;

let doomslug = Doomslug::new(
block_producer.as_ref().map(|x| x.account_id.clone()),

This comment has been minimized.

Copy link
@ilblackdragon

ilblackdragon Jan 21, 2020

Member

may be let's move BlockProducer to primitives (and rename to BlockSigner) and use it in Doomslug instead of splitting here?

@SkidanovAlex SkidanovAlex force-pushed the doomslug branch from f88ba21 to 2e3e7c4 Jan 21, 2020
/// `None`
/// `PassedThreshold(when)` - after processing this approval the block has passed the threshold set by
/// `threshold_mode` (either one half of the total stake, or a single approval).
/// Once the threshold is hit, we want for `T(h - h_final) / 2` before producing

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 23, 2020

Member

want -> wait. Also why do we wait for T(h - h_final) / 2?

This comment has been minimized.

Copy link
@SkidanovAlex

SkidanovAlex Jan 25, 2020

Author Member

The paper has been updated to reflect as to why we sleep

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
pub fn new(
me: Option<AccountId>,
largest_previously_skipped_height: BlockHeight,
largest_previously_endorsed_height: BlockHeight,

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 23, 2020

Member

why are these two arguments passed in instead of being initialized as 0?

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
approved_stake > threshold
}

pub fn remove_witness(

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 23, 2020

Member

why do we need to remove witness?

This comment has been minimized.

Copy link
@SkidanovAlex

SkidanovAlex Jan 25, 2020

Author Member

We remove it when we consume it to produce a block

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
self.timer.started = now;

self.approval_tracking
.retain(|h, _| *h > height && *h <= height + MAX_HEIGHTS_AHEAD_TO_STORE_APPROVALS);

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 23, 2020

Member

This seems inefficient

This comment has been minimized.

Copy link
@SkidanovAlex

SkidanovAlex Jan 25, 2020

Author Member

It's 10k iterations per each update of the head, which is acceptable.

Opened #2022 to track it.

@frol
frol approved these changes Jan 28, 2020
Copy link
Member

frol left a comment

RPC changes are fine and I approve them, but I have a few suggestions in other places.

if self.time_passed_threshold == None {
self.time_passed_threshold = Some(now);
}
DoomslugBlockProductionReadiness::ReadyToProduce(self.time_passed_threshold.unwrap())
Comment on lines +232 to +235

This comment has been minimized.

Copy link
@frol

frol Jan 28, 2020

Member

I suggest to avoid .unwrap(), .e.g.:

let time_passed_threshold = if let Some(threshold) = self.time_passed_threshold {
    threshold
} else {
    self.time_passed_threshold = Some(now);
    now
};
DoomslugBlockProductionReadiness::ReadyToProduce(time_passed_threshold)
pub fn create_approval(
&self,
target_height: BlockHeight,
is_endorsement: bool,

This comment has been minimized.

Copy link
@frol

frol Jan 28, 2020

Member

I just want to bring this to our attention, that bool-based APIs are usually hard to follow when you read/maintain the code, e.g. the use looks like self.create_approval(self.timer.height + 1, false) (it is hard to tell what that false implies): http://blakesmith.me/2019/05/07/rust-patterns-enums-instead-of-booleans.html / https://www.reddit.com/r/programming/comments/ebxzp8/dont_use_booleans/

pub reference_hash: Option<CryptoHash>,
pub target_height: BlockHeight,
pub is_endorsement: bool,
Comment on lines +151 to +153

This comment has been minimized.

Copy link
@frol

frol Jan 28, 2020

Member

If I understand the code correctly, Approval can be an enum, which would express the type in a safer way.

pub next_bp_hash: CryptoHash,
pub approvals: Vec<(AccountId, CryptoHash, CryptoHash, Signature)>,
pub approvals: Vec<(AccountId, CryptoHash, Option<CryptoHash>, BlockHeight, bool, Signature)>,

This comment has been minimized.

Copy link
@frol

frol Jan 28, 2020

Member

How do you feel about making this tuple a dedicated type (is it Approval structure?) where the arguments are named?

} else if self.approved_stake > self.total_stake / 2
|| self.threshold_mode == DoomslugThresholdMode::NoApprovals
{
if self.time_passed_threshold == None {

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 29, 2020

Member
Suggested change
if self.time_passed_threshold == None {
if self.time_passed_threshold.is_none() {

This comment has been minimized.

Copy link
@frol

frol Jan 29, 2020

Member

Well, what I suggested above is to actually avoid this check and the unwrapping a few lines below in the first place.

let mut is_done = false;
while !is_done {
now = now + Duration::from_millis(25);
/*println!("{:?}, {}, {}", now, approval_queue.len(), block_queue.len());

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 29, 2020

Member

should this be removed?

@@ -291,10 +265,22 @@ impl Client {
)?
.clone();

let account_id_to_stake = self

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 29, 2020

Member

duplicate code here?

&approval,
)?;

self.collect_block_approval(&approval, true);

This comment has been minimized.

Copy link
@bowenwang1996

bowenwang1996 Jan 29, 2020

Member

looks like this is not necessary

SkidanovAlex added 2 commits Jan 4, 2020
In preparation for implementing Doomslug, removing the concept of
`weight` and replacing the fork choice rule to choose the chain with the
highest score, and then height. Replacing score with the height of the
last pre-voted block rather than the weight.
@SkidanovAlex SkidanovAlex force-pushed the doomslug branch 2 times, most recently from 8d6bb3e to 3cd4e80 Jan 29, 2020
Implementation and integration of Doomslug (see https://near.ai/doomslug).

Doomslug itself is in `chain/doomslug.rs`, and doesn't depend on chain or any
other code. It also always takes the current time as an argument. It
significantly simplifies testing.

Doomslug is stored on the client, not chain, because both block
production and approvals processing happens on client. Instantiating on
chain would require a slightly more complex interface (e.g. it would be
impossible to pass `me`).

Doomslug needs to know about the latest Tip. Instead of intercepting the
lowest level tip-updating routine (which is in storage), I update the
Tip in the client when we accept the new head. It could miss head
changes related to syncing and challenges. To be safe I always update
the head whenever I interact with Doomslug, so the head is guaranteed to
be accurate whenever we do anything related to Doomslug, but it might
get sent to Doomslug with a slight delay.

This change also solves a problem that approvals before were in a cache
hash map, therefore an adversary could have spammed a node with lots of
invalid approvals and remove all valid approvals from the cache,
resulting in a node not having enough approvals when producing
the block. New logic is more complex, see
`DoomslugApprovalsTrackersAtHeight` class.

Test plan
---------
1. Sanity tests (basic invariants for `set_tip` and approval processing)
2. A fuzzy test that tests safety and liveness under different times to
GST and deltas.
3. `test_catchup_sanity_blocks_produced_doomslug` verifies that heights
are properly skipped.
4. Also added checking doomslug invariant into `cross_shard_tx`, which
is known to evoke all kinds of weird structures (but `cross_shard_tx`
operates without the requirement to have 50% approvals on the block,
thus still causing lots of forkfullness).
5. A new version of `cross_shard_tx` that enables doomslug, but disables
tampering with the finality gadget. Thus the vanilla version tests heavy
forkfulness and tampers with FG, while the new doomslug version has
practically no forfulness due to doomslug, and doesn't stress the FG as
much, but does test block production with doomslug.
@SkidanovAlex SkidanovAlex force-pushed the doomslug branch from 3cd4e80 to edcc305 Jan 29, 2020
@SkidanovAlex SkidanovAlex merged commit 4c9ebd7 into staging Jan 29, 2020
3 checks passed
3 checks passed
codecov/patch 95.57% of diff hit (target 87.13%)
Details
codecov/project 87.38% (+0.25%) compared to ea8dc46
Details
gitlab-ci
Details
@SkidanovAlex SkidanovAlex deleted the doomslug branch Jan 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.