Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doomslug #1991

Merged
merged 3 commits into from Jan 29, 2020
Merged

Doomslug #1991

merged 3 commits into from Jan 29, 2020

Conversation

@SkidanovAlex
Copy link
Collaborator

@SkidanovAlex SkidanovAlex commented Jan 21, 2020

This PR consists of three commits: one reverts proof-of-stake-time, one completely removes the concept of weight from the codebase, and the final one contains the implementation of Doomslug. It might be easier to review the last commit without the first two, since the first two are very mechanical, but create enormous number of changes.

Since Github doesn't by default pick up the description of the multi-commit PR, pasting here the description of Doomslug commit:

Implementation and integration of Doomslug (see https://near.ai/doomslug).

Doomslug itself is in chain/doomslug.rs, and doesn't depend on chain or any
other code. It also always takes the current time as an argument. It
significantly simplifies testing.

Doomslug is stored on the client, not chain, because both block
production and approvals processing happens on client. Instantiating on
chain would require a slightly more complex interface (e.g. it would be
impossible to pass me).

Doomslug needs to know about the latest Tip. Instead of intercepting the
lowest level tip-updating routine (which is in storage), I update the
Tip in the client when we accept the new head. It could miss head
changes related to syncing and challenges. To be safe I always update
the head whenever I interact with Doomslug, so the head is guaranteed to
be accurate whenever we do anything related to Doomslug, but it might
get sent to Doomslug with a slight delay.

This change also solves a problem that approvals before were in a cache
hash map, therefore an adversary could have spammed a node with lots of
invalid approvals and remove all valid approvals from the cache,
resulting in a node not having enough approvals when producing
the block. New logic is more complex, see
DoomslugApprovalsTrackersAtHeight class.

Test plan

  1. Sanity tests (basic invariants for set_tip and approval processing)
  2. A fuzzy test that tests safety and liveness under different times to
    GST and deltas.
  3. test_catchup_sanity_blocks_produced_doomslug verifies that heights
    are properly skipped.
  4. Also added checking doomslug invariant into cross_shard_tx, which
    is known to evoke all kinds of weird structures (but cross_shard_tx
    operates without the requirement to have 50% approvals on the block,
    thus still causing lots of forkfullness).
  5. A new version of cross_shard_tx that enables doomslug, but disables
    tampering with the finality gadget. Thus the vanilla version tests heavy
    forkfulness and tampers with FG, while the new doomslug version has
    practically no forfulness due to doomslug, and doesn't stress the FG as
    much, but does test block production with doomslug.
@codecov
Copy link

@codecov codecov bot commented Jan 21, 2020

Codecov Report

Merging #1991 into staging will increase coverage by 0.25%.
The diff coverage is 95.57%.

Impacted file tree graph

@@             Coverage Diff             @@
##           staging    #1991      +/-   ##
===========================================
+ Coverage    87.13%   87.38%   +0.25%     
===========================================
  Files          170      170              
  Lines        33544    33845     +301     
===========================================
+ Hits         29227    29575     +348     
+ Misses        4317     4270      -47
Impacted Files Coverage Δ
chain/network/tests/runner/mod.rs 96.14% <ø> (+0.01%) ⬆️
chain/chain/src/lib.rs 100% <ø> (ø) ⬆️
near/src/config.rs 95.2% <ø> (-0.02%) ⬇️
chain/client/tests/bug_repros.rs 92.59% <ø> (-0.1%) ⬇️
chain/jsonrpc/tests/rpc_query.rs 98.18% <ø> (-0.04%) ⬇️
chain/network/src/test_utils.rs 91.75% <ø> (-0.09%) ⬇️
chain/chain/src/error.rs 75.92% <ø> (ø) ⬆️
near/src/lib.rs 98.21% <ø> (ø) ⬆️
chain/network/tests/stress_network.rs 0% <0%> (ø) ⬆️
core/primitives/src/test_utils.rs 92.59% <100%> (-0.57%) ⬇️
... and 96 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea8dc46...edcc305. Read the comment docs.

Copy link
Member

@nearmax nearmax left a comment

LGTM for runtime

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
@@ -165,7 +168,14 @@ fn create_block(
}

fn apr(account_id: AccountId, reference_hash: CryptoHash, parent_hash: CryptoHash) -> Approval {
Copy link
Member

@ilblackdragon ilblackdragon Jan 21, 2020

rename to approve?

self.tip.reference_hash,
target_height,
is_endorsement,
&**self.signer.as_ref().unwrap(),
Copy link
Member

@ilblackdragon ilblackdragon Jan 21, 2020

this looks like something evil.

if let Some((me, signer)) = (self.me.as_ref(), self.signer.as_ref()) {
   ... 
    &*signer
}

also might be easier to pack me + signer into one object - BlockSigner.
We also need BlockSigner to get HSM signing going (because we will need to pass non-hashed data to HSM for signatures anyway ,so we need API that has specific sign_block, sign_approval APIs).
/\ this all sounds like a separate PR if you are not me :D

Copy link
Collaborator Author

@SkidanovAlex SkidanovAlex Jan 21, 2020

I don't want to change anything related to HSM yet, it will be a separate effort.

Your approach still requires &**signer, because it's a reference to an Arc.

@@ -97,15 +103,27 @@ impl Client {
let num_block_producer_seats = config.num_block_producer_seats as usize;
let data_parts = runtime_adapter.num_data_parts();
let parity_parts = runtime_adapter.num_total_parts() - data_parts;

let doomslug = Doomslug::new(
block_producer.as_ref().map(|x| x.account_id.clone()),
Copy link
Member

@ilblackdragon ilblackdragon Jan 21, 2020

may be let's move BlockProducer to primitives (and rename to BlockSigner) and use it in Doomslug instead of splitting here?

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
/// `None`
/// `PassedThreshold(when)` - after processing this approval the block has passed the threshold set by
/// `threshold_mode` (either one half of the total stake, or a single approval).
/// Once the threshold is hit, we want for `T(h - h_final) / 2` before producing
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 23, 2020

want -> wait. Also why do we wait for T(h - h_final) / 2?

Copy link
Collaborator Author

@SkidanovAlex SkidanovAlex Jan 25, 2020

The paper has been updated to reflect as to why we sleep

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
pub fn new(
me: Option<AccountId>,
largest_previously_skipped_height: BlockHeight,
largest_previously_endorsed_height: BlockHeight,
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 23, 2020

why are these two arguments passed in instead of being initialized as 0?

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
approved_stake > threshold
}

pub fn remove_witness(
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 23, 2020

why do we need to remove witness?

Copy link
Collaborator Author

@SkidanovAlex SkidanovAlex Jan 25, 2020

We remove it when we consume it to produce a block

chain/chain/src/doomslug.rs Outdated Show resolved Hide resolved
self.timer.started = now;

self.approval_tracking
.retain(|h, _| *h > height && *h <= height + MAX_HEIGHTS_AHEAD_TO_STORE_APPROVALS);
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 23, 2020

This seems inefficient

Copy link
Collaborator Author

@SkidanovAlex SkidanovAlex Jan 25, 2020

It's 10k iterations per each update of the head, which is acceptable.

Opened #2022 to track it.

frol
frol approved these changes Jan 28, 2020
Copy link
Member

@frol frol left a comment

RPC changes are fine and I approve them, but I have a few suggestions in other places.

if self.time_passed_threshold == None {
self.time_passed_threshold = Some(now);
}
DoomslugBlockProductionReadiness::ReadyToProduce(self.time_passed_threshold.unwrap())
Copy link
Member

@frol frol Jan 28, 2020

I suggest to avoid .unwrap(), .e.g.:

let time_passed_threshold = if let Some(threshold) = self.time_passed_threshold {
    threshold
} else {
    self.time_passed_threshold = Some(now);
    now
};
DoomslugBlockProductionReadiness::ReadyToProduce(time_passed_threshold)

pub fn create_approval(
&self,
target_height: BlockHeight,
is_endorsement: bool,
Copy link
Member

@frol frol Jan 28, 2020

I just want to bring this to our attention, that bool-based APIs are usually hard to follow when you read/maintain the code, e.g. the use looks like self.create_approval(self.timer.height + 1, false) (it is hard to tell what that false implies): http://blakesmith.me/2019/05/07/rust-patterns-enums-instead-of-booleans.html / https://www.reddit.com/r/programming/comments/ebxzp8/dont_use_booleans/

pub reference_hash: Option<CryptoHash>,
pub target_height: BlockHeight,
pub is_endorsement: bool,
Copy link
Member

@frol frol Jan 28, 2020

If I understand the code correctly, Approval can be an enum, which would express the type in a safer way.

pub next_bp_hash: CryptoHash,
pub approvals: Vec<(AccountId, CryptoHash, CryptoHash, Signature)>,
pub approvals: Vec<(AccountId, CryptoHash, Option<CryptoHash>, BlockHeight, bool, Signature)>,
Copy link
Member

@frol frol Jan 28, 2020

How do you feel about making this tuple a dedicated type (is it Approval structure?) where the arguments are named?

} else if self.approved_stake > self.total_stake / 2
|| self.threshold_mode == DoomslugThresholdMode::NoApprovals
{
if self.time_passed_threshold == None {
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 29, 2020

Suggested change
if self.time_passed_threshold == None {
if self.time_passed_threshold.is_none() {

Copy link
Member

@frol frol Jan 29, 2020

Well, what I suggested above is to actually avoid this check and the unwrapping a few lines below in the first place.

let mut is_done = false;
while !is_done {
now = now + Duration::from_millis(25);
/*println!("{:?}, {}, {}", now, approval_queue.len(), block_queue.len());
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 29, 2020

should this be removed?

@@ -291,10 +265,22 @@ impl Client {
)?
.clone();

let account_id_to_stake = self
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 29, 2020

duplicate code here?

&approval,
)?;

self.collect_block_approval(&approval, true);
Copy link
Collaborator

@bowenwang1996 bowenwang1996 Jan 29, 2020

looks like this is not necessary

In preparation for implementing Doomslug, removing the concept of
`weight` and replacing the fork choice rule to choose the chain with the
highest score, and then height. Replacing score with the height of the
last pre-voted block rather than the weight.
@SkidanovAlex SkidanovAlex force-pushed the doomslug branch 2 times, most recently from 8d6bb3e to 3cd4e80 Jan 29, 2020
Implementation and integration of Doomslug (see https://near.ai/doomslug).

Doomslug itself is in `chain/doomslug.rs`, and doesn't depend on chain or any
other code. It also always takes the current time as an argument. It
significantly simplifies testing.

Doomslug is stored on the client, not chain, because both block
production and approvals processing happens on client. Instantiating on
chain would require a slightly more complex interface (e.g. it would be
impossible to pass `me`).

Doomslug needs to know about the latest Tip. Instead of intercepting the
lowest level tip-updating routine (which is in storage), I update the
Tip in the client when we accept the new head. It could miss head
changes related to syncing and challenges. To be safe I always update
the head whenever I interact with Doomslug, so the head is guaranteed to
be accurate whenever we do anything related to Doomslug, but it might
get sent to Doomslug with a slight delay.

This change also solves a problem that approvals before were in a cache
hash map, therefore an adversary could have spammed a node with lots of
invalid approvals and remove all valid approvals from the cache,
resulting in a node not having enough approvals when producing
the block. New logic is more complex, see
`DoomslugApprovalsTrackersAtHeight` class.

Test plan
---------
1. Sanity tests (basic invariants for `set_tip` and approval processing)
2. A fuzzy test that tests safety and liveness under different times to
GST and deltas.
3. `test_catchup_sanity_blocks_produced_doomslug` verifies that heights
are properly skipped.
4. Also added checking doomslug invariant into `cross_shard_tx`, which
is known to evoke all kinds of weird structures (but `cross_shard_tx`
operates without the requirement to have 50% approvals on the block,
thus still causing lots of forkfullness).
5. A new version of `cross_shard_tx` that enables doomslug, but disables
tampering with the finality gadget. Thus the vanilla version tests heavy
forkfulness and tampers with FG, while the new doomslug version has
practically no forfulness due to doomslug, and doesn't stress the FG as
much, but does test block production with doomslug.
@SkidanovAlex SkidanovAlex merged commit 4c9ebd7 into staging Jan 29, 2020
3 checks passed
@SkidanovAlex SkidanovAlex deleted the doomslug branch Jan 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants