existence of block with number new_canonical implies existence of blocks with all numbers before it #541

Closed
Comr8-Nga opened this issue Jun 26, 2021 · 23 comments

Comments


Comr8-Nga commented Jun 26, 2021

2021-06-26 11:50:02 Moonbeam Parachain Collator
2021-06-26 11:50:02 ✌️ version 0.8.0-unknown-x86_64-linux-gnu
2021-06-26 11:50:02 ❤️ by PureStake, 2019-2021
2021-06-26 11:50:02 📋 Chain specification: Moonbase Alpha
2021-06-26 11:50:02 🏷 Node name: Comr8_Nga
2021-06-26 11:50:02 👤 Role: AUTHORITY
2021-06-26 11:50:02 💾 Database: RocksDb at /data/chains/moonbase_alpha/db
2021-06-26 11:50:02 ⛓ Native runtime: moonbase-40 (moonbase-1.tx2.au3)
2021-06-26 11:50:02 Parachain id: Id(1000)
2021-06-26 11:50:02 Parachain Account: 5Ec4AhPZk8STuex8Wsi9TwDtJQxKqzPJRCH7348Xtcs9vZLJ
2021-06-26 11:50:02 Parachain genesis state: 0x000000000000000000000000000000000000000000000000000000000000000000b505bc9a20d69f14620b2417b6d777c398ceb3e32119b9a53507111d1880927c03170a2e7597b7b7e3d84c05391d139a62b157e78786d8c082f29dcf4c11131400
2021-06-26 11:50:02 [Relaychain] 🏷 Local node identity is: 12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:02 [Relaychain] 📦 Highest known block at #436285
2021-06-26 11:50:02 [Relaychain] 〽️ Prometheus server started at 127.0.0.1:9616
2021-06-26 11:50:02 [Relaychain] Listening for new connections on 127.0.0.1:9945.
2021-06-26 11:50:02 Using default protocol ID "sup" because none is configured in the chain specs
2021-06-26 11:50:02 🏷 Local node identity is: 12D3KooWEKi2pXZ4TH34C8YUQ9iravpSdBAxptSZDHKcUJkrgtST
2021-06-26 11:50:02 📦 Highest known block at #74522
2021-06-26 11:50:02 〽️ Prometheus server started at 127.0.0.1:9615
2021-06-26 11:50:02 Listening for new connections on 127.0.0.1:9944.
2021-06-26 11:50:03 🔍 Discovered new external address for our node: /ip4/65.21.185.66/tcp/30333/p2p/12D3KooWEKi2pXZ4TH34C8YUQ9iravpSdBAxptSZDHKcUJkrgtST
2021-06-26 11:50:03 [Relaychain] 🔍 Discovered new external address for our node: /ip4/65.21.185.66/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:03 [Relaychain] 🔍 Discovered new external address for our node: /ip4/10.2.112.215/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:03 🔍 Discovered new external address for our node: /ip4/10.2.112.215/tcp/30333/p2p/12D3KooWEKi2pXZ4TH34C8YUQ9iravpSdBAxptSZDHKcUJkrgtST
2021-06-26 11:50:03 🔍 Discovered new external address for our node: /ip4/10.2.96.244/tcp/30333/p2p/12D3KooWEKi2pXZ4TH34C8YUQ9iravpSdBAxptSZDHKcUJkrgtST
2021-06-26 11:50:03 [Relaychain] 🔍 Discovered new external address for our node: /ip4/172.18.0.3/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:04 🔍 Discovered new external address for our node: /ip6/2a01:4f9:c010:9e1f::1/tcp/30333/p2p/12D3KooWEKi2pXZ4TH34C8YUQ9iravpSdBAxptSZDHKcUJkrgtST
2021-06-26 11:50:04 🔍 Discovered new external address for our node: /ip4/10.2.96.243/tcp/30333/p2p/12D3KooWEKi2pXZ4TH34C8YUQ9iravpSdBAxptSZDHKcUJkrgtST
2021-06-26 11:50:04 [Relaychain] 🔍 Discovered new external address for our node: /ip4/127.0.0.1/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:04 [Relaychain] 🔍 Discovered new external address for our node: /ip4/192.168.170.128/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:04 [Relaychain] 🔍 Discovered new external address for our node: /ip4/192.168.211.64/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:04 [Relaychain] Candidate included without being backed? candidate_hash=0x8c80d74867da6e3fb6af49a785ec1db2d6ea02dc64289644de37f907a8923141
2021-06-26 11:50:05 🔍 Discovered new external address for our node: /ip4/10.2.112.239/tcp/30333/p2p/12D3KooWEKi2pXZ4TH34C8YUQ9iravpSdBAxptSZDHKcUJkrgtST
2021-06-26 11:50:05 [Relaychain] 🔍 Discovered new external address for our node: /ip6/2a01:4f9:c010:9e1f::1/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv
2021-06-26 11:50:05 [Relaychain] 🔍 Discovered new external address for our node: /ip4/10.132.0.60/tcp/30334/ws/p2p/12D3KooWLkVaWfT5BxW1LnxYeEy78Rb9LKXCuoEX4iDYsYownQjv

Version: 0.8.0-unknown-x86_64-linux-gnu

0: sp_panic_handler::set::{{closure}}
1: std::panicking::rust_panic_with_hook
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/std/src/panicking.rs:595:17
2: std::panicking::begin_panic_handler::{{closure}}
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/std/src/panicking.rs:497:13
3: std::sys_common::backtrace::__rust_end_short_backtrace
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/std/src/sys_common/backtrace.rs:141:18
4: rust_begin_unwind
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/std/src/panicking.rs:493:5
5: core::panicking::panic_fmt
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/core/src/panicking.rs:92:14
6: core::option::expect_failed
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/core/src/option.rs:1258:5
7: sc_client_db::Backend::try_commit_operation
8: <sc_client_db::Backend as sc_client_api::backend::Backend>::commit_operation
9: <sc_service::client::client::Client<B,E,Block,RA> as sc_client_api::backend::LockImportRun<Block,B>>::lock_import_and_run::{{closure}}
10: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
11: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
12: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
13: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
14: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
15: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
16: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
17: <sc_service::task_manager::prometheus_future::PrometheusFuture as core::future::future::Future>::poll
18: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
19: <tracing_futures::Instrumented as core::future::future::Future>::poll
20: futures_executor::local_pool::block_on
21: <std::panic::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once
22: tokio::runtime::task::harness::Harness<T,S>::poll
23: tokio::runtime::blocking::pool::Inner::run
24: tokio::runtime::context::enter
25: std::sys_common::backtrace::__rust_begin_short_backtrace
26: core::ops::function::FnOnce::call_once{{vtable.shim}}
27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/alloc/src/boxed.rs:1480:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/alloc/src/boxed.rs:1480:9
std::sys::unix::thread::Thread::new::thread_start
at rustc/7a9b552cb1621c9c57898d147228aab32b65a7c3/library/std/src/sys/unix/thread.rs:71:17
28: start_thread
29: clone

Thread 'tokio-runtime-worker' panicked at 'existence of block with number new_canonical implies existence of blocks with all numbers before it; qed', /home/actions/.cargo/git/checkouts/substrate-7e08433d4c370a21/1d7f6e1/client/db/src/lib.rs:1187

This is a bug. Please report it at:

    https://github.com/PureStake/moonbeam/issues/new

bkchr commented Jun 28, 2021

Cc @arkpar


arkpar commented Jun 28, 2021

@bkchr As far as I understand, this is the collator syncing the relay chain. Does it not keep the full relay chain block history?


bkchr commented Jun 28, 2021

It should. We don't change anything for the relay chain client when it comes to syncing a normal full node.


arkpar commented Jun 28, 2021

@Comr8-Nga

Could you reproduce this with -l state-db=trace,sync=trace?

@Comr8-Nga (Author)

@arkpar Unfortunately, I've already deleted that container and launched a new one (the node was synchronized and this bug didn't occur again)

JoshOrndorff changed the title from "a bug running a collator node" to "existence of block with number new_canonical implies existence of blocks with all numbers before it" on Jul 2, 2021
JoshOrndorff (Contributor) commented Jul 2, 2021

Also reported in #563 and #544. A lot of people are hitting this. @albertov19 Do you know anything about how often this happens, and whether it affects collators only or all nodes? Does a node restart work around it?


razumv commented Jul 2, 2021

@JoshOrndorff A reboot doesn't help.
I've even tried clearing the database and starting over, with no result. Several people in my community have the same error.

@albertov19 (Contributor)

This issue has not been reported by any of our collators. I'll keep a lookout in case anyone reports it.

crystalin (Collaborator) commented Jul 13, 2021

@arkpar we retrieved trace logs for a failing node here: #595
Let me know if you need more information.


arkpar commented Jul 13, 2021

@bkchr Looks like when syncing, the parachain head gets moved to a much earlier block. This leads to the number-to-head mappings for retracted blocks being removed from the database. The sync then tries to import a block on top of a retracted block, and this panic happens.

Changing the head by a large number of blocks is not really supported by the database layer. This also breaks the sync in a number of ways: in the log above, the best block is reset more than 70000 blocks behind, and since the sync always starts from the best block, a lot of existing blocks end up being re-downloaded.
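To make the failure mode concrete, here is a minimal, self-contained Rust sketch of a block-number-to-hash index with the same contiguity assumption. HeadIndex, retract_head_to and canonicalize are hypothetical names used only for illustration; this is not the actual sc_client_db code, just the shape of the invariant: once the head is reset far back and the mappings for retracted numbers are dropped, canonicalizing a later block hits a missing ancestor mapping, which is the case the real backend turns into the expect panic quoted above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the backend's block-number -> block-hash index.
/// Illustration only; Substrate's actual data layout differs.
struct HeadIndex {
    number_to_hash: HashMap<u64, [u8; 32]>,
    best_number: u64,
}

impl HeadIndex {
    /// Move the best block far behind and drop the mappings of retracted
    /// numbers, roughly what the reported head reset does to the index.
    fn retract_head_to(&mut self, new_best: u64) {
        for n in (new_best + 1)..=self.best_number {
            self.number_to_hash.remove(&n);
        }
        self.best_number = new_best;
    }

    /// Canonicalize `new_canonical`: the check assumes every number below it
    /// is still mapped, mirroring the invariant behind the panic message.
    fn canonicalize(&self, new_canonical: u64) -> [u8; 32] {
        for n in 0..new_canonical {
            self.number_to_hash.get(&n).expect(
                "existence of block with number new_canonical implies existence of blocks with all numbers before it; qed",
            );
        }
        self.number_to_hash[&new_canonical]
    }
}

fn main() {
    let mut index = HeadIndex {
        number_to_hash: (0..=5u64).map(|n| (n, [n as u8; 32])).collect(),
        best_number: 5,
    };
    index.retract_head_to(2); // mappings for 3..=5 are gone
    index.canonicalize(5);    // panics at n = 3: an ancestor mapping was removed
}
```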

@gdzien-co

> @bkchr Looks like when syncing, the parachain head gets moved to a much earlier block

This happened to me when all I did was start the docker container as per the manual. Is there going to be any fix soon, or should I just build "keep trying till it works" into my automation?

@rphmeier

> @bkchr Looks like when syncing, the parachain head gets moved to a much earlier block. This leads to the number-to-head mappings for retracted blocks being removed from the database. The sync then tries to import a block on top of a retracted block, and this panic happens.
>
> Changing the head by a large number of blocks is not really supported by the database layer. This also breaks the sync in a number of ways: in the log above, the best block is reset more than 70000 blocks behind, and since the sync always starts from the best block, a lot of existing blocks end up being re-downloaded.

Either way, the assertion failure within the database is not the correct way of handling an API usage issue. We should address the underlying cause (collations built by out-of-sync collators?) as well as change the database to handle this case more gracefully.
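As a hedged sketch of the kind of "more graceful" handling being asked for here (hypothetical names, not the actual Substrate patch): the same contiguity check could surface a structured error that the commit path can reject cleanly, instead of expect-ing and bringing the node down.

```rust
use std::collections::HashMap;

/// Hypothetical error type for illustration; the real backend defines its own errors.
#[derive(Debug)]
enum CommitError {
    /// A block number below the new canonical head is no longer mapped to a hash.
    MissingNumberMapping(u64),
}

/// Same contiguity check as in the sketch above, but returning a recoverable
/// error instead of panicking, so the caller (e.g. block import) can abort the
/// commit gracefully.
fn check_canonical_contiguity(
    number_to_hash: &HashMap<u64, [u8; 32]>,
    new_canonical: u64,
) -> Result<(), CommitError> {
    for n in 0..new_canonical {
        if !number_to_hash.contains_key(&n) {
            return Err(CommitError::MissingNumberMapping(n));
        }
    }
    Ok(())
}
```

A caller could then map such an error to a failed import rather than aborting the whole process.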


arkpar commented Jul 16, 2021

The assertion is that once block number N exists, it may be remapped to a different hash, but never removed, which was true at the time. It needs to be updated, I agree.

crystalin (Collaborator) commented Jul 16, 2021

@arkpar is your fix enough to allow nodes to recover, or will it prevent them from panicking but keep them stalled in that case?


arkpar commented Jul 16, 2021

@crystalin The latter. There's a more fundamental issue that needs to be fixed by someone from the Cumulus team.

@crystalin (Collaborator)

@arkpar, this seems to be a critical issue (also impacting other parachains); is someone from Parity currently working on it?
If yes, is the progress posted somewhere?


arkpar commented Jul 21, 2021

> @arkpar, this seems to be a critical issue (also impacting other parachains); is someone from Parity currently working on it?
> If yes, is the progress posted somewhere?

I'm redirecting this question to @bkchr and @rphmeier

@rphmeier

For posterity: paritytech/cumulus#544 is a patch working around the issue.

@crystalin (Collaborator)

Thank you @rphmeier. We are still testing the fix but haven't observed any errors internally.

@Comr8-Nga and others impacted by this issue, please upgrade to Moonbeam client v0.9.6 (https://github.com/PureStake/moonbeam/releases/tag/v0.9.6)

@crystalin (Collaborator)

The issue doesn't seem to appear anymore, closing it.

@crystalin (Collaborator)

I'm re-opening this, as it also happens in recent versions.
@rphmeier, it seems the fix in Cumulus wasn't enough, as we are still observing it.

The error reports a different code line; here are some of the logs that were reported (I'll try to gather more recent ones):

Version: 0.10.0-6fab892-x86_64-linux-gnu

0: sp_panic_handler::set::{{closure}}
1: std::panicking::rust_panic_with_hook
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/panicking.rs:627:17
2: std::panicking::begin_panic_handler::{{closure}}
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/panicking.rs:520:13
3: std::sys_common::backtrace::__rust_end_short_backtrace
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/sys_common/backtrace.rs:141:18
4: rust_begin_unwind
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/panicking.rs:516:5
5: core::panicking::panic_fmt
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/core/src/panicking.rs:93:14
6: core::option::expect_failed
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/core/src/option.rs:1578:5
7: sc_client_db::Backend::try_commit_operation
8: <sc_client_db::Backend as sc_client_api::backend::Backend>::commit_operation
9: <sc_service::client::client::Client<B,E,Block,RA> as sc_client_api::backend::LockImportRun<Block,B>>::lock_import_and_run::{{closure}}
10: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
11: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
12: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
13: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
14: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
15: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
16: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
17: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
18: <sc_service::task_manager::prometheus_future::PrometheusFuture as core::future::future::Future>::poll
19: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
20: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
21: <tracing_futures::Instrumented as core::future::future::Future>::poll
22: std::thread::local::LocalKey::with
23: futures_executor::local_pool::block_on
24: tokio::runtime::task::core::Core<T,S>::poll
25: <std::panic::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once
26: tokio::runtime::task::harness::Harness<T,S>::poll
27: tokio::runtime::blocking::pool::Inner::run
28: tokio::runtime::context::enter
29: std::sys_common::backtrace::__rust_begin_short_backtrace
30: core::ops::function::FnOnce::call_once{{vtable.shim}}
31: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/alloc/src/boxed.rs:1572:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/alloc/src/boxed.rs:1572:9
std::sys::unix::thread::Thread::new::thread_start
at rustc/4e282795d7d1d28a4c6c1c6521045ae2b59f3519/library/std/src/sys/unix/thread.rs:91:17
32: start_thread
33: clone

Thread 'tokio-runtime-worker' panicked at 'existence of block with number new_canonical implies existence of blocks with all numbers before it; qed', /home/gh-actions/.cargo/git/checkouts/substrate-7e08433d4c370a21/74101dc/client/db/src/lib.rs:1309

crystalin reopened this Sep 5, 2021

bkchr commented Sep 6, 2021

@crystalin this panic doesn't exist anymore. Please update your nodes. This was removed entirely.

@crystalin (Collaborator)

@bkchr sorry, I might have mixed up versions; I was expecting it to be fixed in our previous version. I'll keep this one open until I get the new version tested, to avoid confusion.
(For future reference, Moonbeam v0.12 should be the version including the fix.)
