This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Do not set best block when there already exists a best block with a higher number #544

Merged
merged 2 commits into from
Jul 21, 2021

Conversation

bkchr
Member

@bkchr bkchr commented Jul 21, 2021

Closes: #534

client/consensus/common/src/parachain_consensus.rs Outdated
client/consensus/common/src/tests.rs Outdated
Co-authored-by: André Silva <123550+andresilva@users.noreply.github.com>
@bkchr bkchr merged commit 3462a4b into master Jul 21, 2021
@bkchr bkchr deleted the bkchr-do-not-set-best-block-to-lower-number branch July 21, 2021 14:21
bkchr added a commit that referenced this pull request Jul 21, 2021
…higher number (#544)

* Do not set best block when there already exists a best block with an higher number

* Apply suggestions from code review

Co-authored-by: André Silva <123550+andresilva@users.noreply.github.com>

Co-authored-by: André Silva <123550+andresilva@users.noreply.github.com>
@@ -346,6 +346,18 @@ where
P: UsageProvider<Block> + Send + Sync + BlockBackend<Block>,
for<'a> &'a P: BlockImport<Block>,
{
let best_number = parachain.usage_info().chain.best_number;
if *header.number() < best_number {
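The truncated hunk above can be sketched as follows. This is a hedged, self-contained illustration, not the actual Substrate code: `Header` and `Client` here are hypothetical stand-ins for the real `Block::Header` and `UsageProvider` types, but the guard mirrors the `if *header.number() < best_number` check the PR adds.

```rust
// Hypothetical stand-in for the parachain block header.
#[derive(Debug, Clone, Copy)]
struct Header {
    number: u64,
}

// Hypothetical stand-in for the client tracking the best block.
struct Client {
    best_number: u64,
}

impl Client {
    /// Returns true if `header` was accepted as the new best block.
    /// Mirrors the PR's guard: headers with a number lower than the
    /// current best are skipped instead of being set as best.
    fn set_best_block(&mut self, header: Header) -> bool {
        if header.number < self.best_number {
            return false;
        }
        self.best_number = header.number;
        true
    }
}

fn main() {
    let mut client = Client { best_number: 10 };
    // A header below the current best is rejected...
    assert!(!client.set_best_block(Header { number: 7 }));
    assert_eq!(client.best_number, 10);
    // ...while a higher one is accepted as usual.
    assert!(client.set_best_block(Header { number: 11 }));
    assert_eq!(client.best_number, 11);
    println!("ok");
}
```

As the discussion below notes, this means a fork with a lower block number is not adopted as best immediately; it only becomes best once it grows past the current best number.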


@bkchr Could there be a case where a re-org on the relay chain would actually cause a re-org on the parachain, leading to a lower block number becoming best?
I'm thinking of the case where a re-org happens over 2 blocks on the relay chain, while the parachain network has only produced 1 block on the fork branch.

Member Author

Then it will take one more block to set a new best block, and you will see the reorg later.


I know this is probably a rare case, but what if the node is the selected author for the new block after this re-org?
The node would not import the new fork block as best, and so would not produce a new one?

(This can/should be solved by other mechanisms, like selecting authors based on other input than just the previous block, but it still remains a possible issue.)

Contributor

This is certainly not a perfect solution and I think you are correct that it may lead to missed authorship opportunities during reorgs.

Contributor

We will probably need changes in the Substrate client to avoid this entire issue, but it is without a doubt possible to improve this, and future client versions will address it.

Member Author

Which block would not be imported?

This has nothing to do with authorship. The best block is also not relevant for block authorship.

Member Author

This change here is entirely about which block will be set as new best block. And the bug we had seen will only happen on major sync, but never when we are syncing at the tip of the chain.

@JoshOrndorff
Contributor

JoshOrndorff commented Aug 5, 2021

Moonbeam nodes still occasionally panic with this existence of block ... implies existence... error (although seemingly less frequently than before this PR). One user reported it in moonbeam-foundation/moonbeam#646

And also @purestakeoskar captured some logs of it happening on our own nodes. Can you please share those logs?

@arkpar
Member

arkpar commented Aug 5, 2021

That panic is no longer in the substrate code. Are they using the latest version of cumulus?

@bkchr
Member Author

bkchr commented Aug 5, 2021

We have not yet had a release that contains the Substrate version that removes the panic.

@bkchr
Member Author

bkchr commented Aug 5, 2021

> Moonbeam nodes still occasionally panic with this existence of block ... implies existence... error (although it seems less frequently than before this PR). One user reported it in PureStake/moonbeam#646
>
> And also @purestakeoskar captured some logs of it happening on our own nodes. Can you please share those logs?

Is Moonbeam 0.9.6 using the polkadot-v0.9.8 branch?

@crystalin

@bkchr yes it is

@bkchr
Member Author

bkchr commented Aug 5, 2021

@JoshOrndorff @crystalin the last log moonbeam-foundation/moonbeam#646 (comment) doesn't include the mentioned panic. So, this is not related.

@JoshOrndorff
Contributor

JoshOrndorff commented Aug 5, 2021

@bkchr You're right about the user-reported issue. He now has a different problem. I failed to notice that. I'm sorry about that.

However, @purestakeoskar (please chime in) does have this same panic. He is running Moonbeam 0.9.6 (polkadot-v0.9.8 dependency branches).

It happens on this block (https://moonbase.subscan.io/block/74522, https://moonbase-blockscout.testnet.moonbeam.network/blocks/74522/transactions) which is pretty simple (3 inherents, one eth transfer).

@bkchr
Member Author

bkchr commented Aug 5, 2021

This bug cannot really occur on a specific block. It would be good to get some logs.

@gilmouta

@bkchr There is a Moonbeam node runner having this problem consistently (even after deleting the parachain db). He's running Moonbeam v0.11.2, which is based on Polkadot v0.9.8. Interestingly, he is also getting stuck syncing on the block reported above (74522) before the error occurs.

2021-08-24 06:37:00 [Relaychain] ✨ Imported #1289678 (0x1917…02d3)    
2021-08-24 06:37:01 [🌗] ⚙️  Syncing  0.0 bps, target=#608868 (31 peers), best: #74522 (0x0b3b…c1a0), finalized #34966 (0xced9…350c), ⬇ 163.8kiB/s ⬆ 1.3kiB/s    
2021-08-24 06:37:01 [Relaychain] 💤 Idle (33 peers), best: #1289678 (0x1917…02d3), finalized #1289675 (0x211c…2fe8), ⬇ 9.5kiB/s ⬆ 9.7kiB/s    

====================

Version: 0.11.2-ae60d2a-x86_64-linux-gnu

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/panicking.rs:627:17
   2: std::panicking::begin_panic_handler::{{closure}}
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/panicking.rs:520:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/sys_common/backtrace.rs:141:18
   4: rust_begin_unwind
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/panicking.rs:516:5
   5: core::panicking::panic_fmt
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/core/src/panicking.rs:93:14
   6: core::option::expect_failed
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/core/src/option.rs:1578:5
   7: sc_client_db::Backend<Block>::try_commit_operation
   8: <sc_client_db::Backend<Block> as sc_client_api::backend::Backend<Block>>::commit_operation
   9: <sc_service::client::client::Client<B,E,Block,RA> as sc_client_api::backend::LockImportRun<Block,B>>::lock_import_and_run::{{closure}}
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  12: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  14: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  16: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  17: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
  18: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
  19: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
  20: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  21: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
  22: std::thread::local::LocalKey<T>::with
  23: futures_executor::local_pool::block_on
  24: tokio::runtime::task::core::Core<T,S>::poll
  25: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  26: tokio::runtime::task::harness::Harness<T,S>::poll
  27: tokio::runtime::blocking::pool::Inner::run
  28: tokio::runtime::context::enter
  29: std::sys_common::backtrace::__rust_begin_short_backtrace
  30: core::ops::function::FnOnce::call_once{{vtable.shim}}
  31: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/alloc/src/boxed.rs:1572:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/alloc/src/boxed.rs:1572:9
      std::sys::unix::thread::Thread::new::thread_start
             at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/sys/unix/thread.rs:91:17
  32: start_thread
  33: clone


Thread 'tokio-runtime-worker' panicked at 'existence of block with number `new_canonical` implies existence of blocks with all numbers before it; qed', /home/gh-actions/.cargo/git/checkouts/substrate-7e08433d4c370a21/74101dc/client/db/src/lib.rs:1309
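The invariant behind this panic message can be sketched as follows. This is a hedged, simplified illustration, not the actual `sc-client-db` code: `canonicalize` and `hashes_by_number` are hypothetical names. During canonicalization the database assumes a hash exists for every block number up to `new_canonical`; the old code enforced this with an `.expect("existence of block ... implies ...; qed")`, which panics when a number is missing, whereas returning a `Result` (as bkchr notes newer versions do) surfaces the same condition as a recoverable error.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: collect the canonical chain of hashes for all
/// numbers 0..=new_canonical. A missing number yields an error instead
/// of panicking, as an `.expect(...)` in the old code would have.
fn canonicalize(
    hashes_by_number: &HashMap<u64, [u8; 32]>,
    new_canonical: u64,
) -> Result<Vec<[u8; 32]>, String> {
    (0..=new_canonical)
        .map(|n| {
            hashes_by_number.get(&n).copied().ok_or_else(|| {
                // The invariant from the panic message, violated:
                format!("missing block hash for number {}", n)
            })
        })
        .collect()
}

fn main() {
    let mut db = HashMap::new();
    for n in 0..=2u64 {
        db.insert(n, [0u8; 32]);
    }
    // All numbers up to 2 exist, so canonicalization succeeds.
    assert!(canonicalize(&db, 2).is_ok());
    // Number 3 is missing: the invariant fails, but as an Err, not a panic.
    assert!(canonicalize(&db, 3).is_err());
    println!("ok");
}
```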

@bkchr
Member Author

bkchr commented Aug 24, 2021

Not sure how this can still happen. However, with the latest polkadot branch this is just an error and no longer a panic. So, updating should fix it.

slumber pushed a commit that referenced this pull request Sep 1, 2021
…higher number (#544)

* Do not set best block when there already exists a best block with an higher number

* Apply suggestions from code review

Co-authored-by: André Silva <123550+andresilva@users.noreply.github.com>

Co-authored-by: André Silva <123550+andresilva@users.noreply.github.com>
Successfully merging this pull request may close these issues.

existence of block with number new_canonical implies existence of blocks with all numbers before it