
nearcore stuck with crashes on betanet-node7 #3084

Closed
frol opened this issue Aug 5, 2020 · 6 comments
Labels: A-chain (Area: Chain, client & related), C-bug (Category: This is a bug)

Comments

@frol
Collaborator

frol commented Aug 5, 2020

Describe the bug

The node got stuck, spamming the log with the same backtrace over and over:

Aug 05 09:37:11.845 ERROR near_client::client_actor: Error while sending an approval Chain(Error { inner:    0: failure::backtrace::Backtrace::new
   1: <near_chain::error::Error as core::convert::From<near_primitives::errors::EpochError>>::from
   2: <neard::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::get_epoch_id_from_prev_block
   3: near_client::client_actor::ClientActor::check_triggers
   4: <near_client::client_actor::ClientActor as actix::handler::Handler<near_network::types::NetworkClientMessages>>::handle
   5: <actix::address::envelope::SyncEnvelopeProxy<A,M> as actix::address::envelope::EnvelopeProxy>::handle
   6: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
   7: tokio::runtime::task::raw::poll
   8: tokio::task::local::LocalSet::tick
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  10: actix_rt::runtime::Runtime::block_on
  11: near::main
  12: std::rt::lang_start_internal::{{closure}}::{{closure}}
             at src/libstd/rt.rs:52
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at src/libstd/sys_common/backtrace.rs:130
  13: main
  14: __libc_start_main
  15: _start


DB Not Found Error: 96Bikdv3BruAoEhdxRr4yvKoLiz5Y8VAJyhQo5JGxUFU })
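
For context on the backtrace: `get_epoch_id_from_prev_block` derives the epoch of a block from its previous block, and the "DB Not Found" hash above appears to be a block whose record is missing from the local store. A minimal sketch of the shape of that lookup (not nearcore's actual implementation; the type aliases and the `MissingBlock` variant are stand-ins):

```rust
// Minimal sketch, NOT nearcore's code: types and the `MissingBlock` variant
// are stand-ins used only to illustrate the failing lookup from the backtrace.

use std::collections::HashMap;

type BlockHash = String; // stand-in for the real CryptoHash
type EpochId = String;   // stand-in for the real EpochId

#[derive(Debug)]
enum EpochError {
    MissingBlock(BlockHash), // hypothetical variant name
}

struct EpochManager {
    // stand-in for the on-disk block -> epoch index
    block_epochs: HashMap<BlockHash, EpochId>,
}

impl EpochManager {
    // The epoch of a new block is derived from its previous block, so if that
    // previous block's record was never persisted, the call fails.
    fn get_epoch_id_from_prev_block(&self, prev_hash: &BlockHash) -> Result<EpochId, EpochError> {
        self.block_epochs
            .get(prev_hash)
            .cloned()
            // This branch corresponds to the "DB Not Found Error: 96Bikdv3..." in the log.
            .ok_or_else(|| EpochError::MissingBlock(prev_hash.clone()))
    }
}

fn main() {
    let em = EpochManager { block_epochs: HashMap::new() };
    let missing = "96Bikdv3BruAoEhdxRr4yvKoLiz5Y8VAJyhQo5JGxUFU".to_string();
    // With a missing/inconsistent record, every retry hits the same error,
    // which is why the log keeps repeating the identical backtrace.
    println!("{:?}", em.get_epoch_id_from_prev_block(&missing));
}
```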

The end of the log is the following (as is, without edits; note the timestamps):


Aug 05 09:38:07.117 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:38:30.759 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:38:30.808 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:39:45.860  WARN network: Received Block while Connecting from Outbound connection.
Aug 05 09:41:01.693 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:43:21.021 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:43:22.083 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:43:22.582 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:44:20.107  WARN network: Peer stream error: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }
Aug 05 09:55:59.494 ERROR network: Failed sending broadcast message(query_active_peers): Mailbox has closed
Aug 05 10:35:43.570  WARN network: Attempting to connect to a node (ed25519:6sF1yWWwy3aXtstrEU7SiM6u1eQ5WCqaf24jdoy5u1aa@0.0.0.0:24567@bisontrails.stakingpool) with a different genesis block. Our genesis: GenesisId { chain_id: "betanet", hash: `FPJFkXFrfvQgNxjFp97VpbZgLg9jNakpBCPf7CZxGaji` }, their genesis: GenesisId { chain_id: "testnet", hash: `EUbqkM9kGbBBYVBuJcMfu6UzjYz93y7yDoKVB1M7X3VB` }
Aug 05 10:40:43.588  WARN network: Received Block while Connecting from Outbound connection.
Aug 05 10:42:43.870  WARN network: Attempting to connect to a node (ed25519:6sF1yWWwy3aXtstrEU7SiM6u1eQ5WCqaf24jdoy5u1aa@0.0.0.0:24567@bisontrails.stakingpool) with a different genesis block. Our genesis: GenesisId { chain_id: "betanet", hash: `FPJFkXFrfvQgNxjFp97VpbZgLg9jNakpBCPf7CZxGaji` }, their genesis: GenesisId { chain_id: "testnet", hash: `EUbqkM9kGbBBYVBuJcMfu6UzjYz93y7yDoKVB1M7X3VB` }
Aug 05 11:47:07.192 ERROR network: Failed sending broadcast message(query_active_peers): Mailbox has closed

The node uses 100% of all CPU cores (the VM has 2 cores and both are busy with neard) and consumes 3.3 GB of RAM.

strace reports a flood of single-byte 0x01 writes (the file descriptors are anonymous pipes):

write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(205, "\1", 1)                     = 1
write(205, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
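
Assuming those single-byte 0x01 writes are async-runtime wakeup notifications (the usual self-pipe pattern; this is an interpretation, not confirmed from the trace), the 100% CPU would be consistent with a task re-waking itself in a tight retry loop while the error above keeps firing. A minimal sketch of what produces this strace pattern, with a `UnixStream` pair standing in for the anonymous pipes:

```rust
// Minimal sketch of the self-pipe wakeup pattern (an assumption about what the
// 0x01 writes are, not something confirmed from the trace). A UnixStream pair
// stands in for the anonymous pipes seen in strace.

use std::io::Write;
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    let (mut waker, _event_loop_side) = UnixStream::pair()?;
    for _ in 0..5 {
        // Each wakeup shows up in strace as: write(fd, "\1", 1) = 1
        // A task that keeps re-waking itself produces an endless stream of these.
        waker.write_all(&[0x01])?;
    }
    Ok(())
}
```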

To Reproduce

N/A

Version (please complete the following information):

  • nearcore commit/branch: 1.9.0-beta.1 (build 157b015-modified) [official release]
  • betanet

Additional context

It is an RPC node running on the betanet-node7 instance. I am restarting the instance.

@frol added the C-bug (Category: This is a bug) and A-chain (Area: Chain, client & related) labels on Aug 5, 2020
@frol
Collaborator Author

frol commented Aug 5, 2020

In fact, this error happened "immediately" on node start ("immediately" is actually almost 10 minutes after boot, but there were no sync logs during that period):

Aug 05 09:28:45.232  INFO near: Version: 1.9.0-beta.1, Build: 157b0153-modified, Latest Protocol: 31
Aug 05 09:28:47.421  INFO near: Opening store database at "/home/ubuntu/.near/data"
Aug 05 09:29:06.101  INFO stats: Server listening at ed25519:EAV7gSD8Xvfph14HtLfohsiRPYT8GfrDwaj6Aqn1MDwU@0.0.0.0:24567
Aug 05 09:37:11.288 ERROR near_client::client_actor: Error while sending an approval Chain(Error { inner:    0: failure::backtrace::Backtrace::new
   1: <near_chain::error::Error as core::convert::From<near_primitives::errors::EpochError>>::from
   2: <neard::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::get_epoch_id_from_prev_block
   3: near_client::client_actor::ClientActor::check_triggers
   4: <near_client::client_actor::ClientActor as actix::handler::Handler<near_network::types::NetworkClientMessages>>::handle
   5: <actix::address::envelope::SyncEnvelopeProxy<A,M> as actix::address::envelope::EnvelopeProxy>::handle
   6: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll

@bowenwang1996
Collaborator

I believe this is caused by an unclean shutdown of the node, but I don't think this will cause the node to get stuck forever.

@frol
Collaborator Author

frol commented Aug 6, 2020

It was in that state for 5 hours, so I call it “forever”.

@SkidanovAlex
Collaborator

> I believe this is caused by an unclean shutdown of the node, but I don't think this will cause the node to get stuck forever.

I don't think an unclean shutdown can explain it? The batch writes to the storage are atomic, so the head should not be updated if the block is not persisted.

This appears to be a real bug.
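
For reference, the invariant being described, as a minimal sketch (not nearcore's actual storage layer; key names and types are illustrative): the block data and the HEAD pointer go into one write batch that is committed atomically, so a crash should never leave HEAD pointing at a block that was not persisted.

```rust
// Minimal sketch, NOT nearcore's storage code: key names and types are
// illustrative. It shows the invariant above: block data and the HEAD pointer
// are committed in one atomic batch.

use std::collections::HashMap;

#[derive(Default)]
struct Store {
    kv: HashMap<String, Vec<u8>>,
}

#[derive(Default)]
struct StoreUpdate {
    writes: Vec<(String, Vec<u8>)>,
}

impl StoreUpdate {
    fn set(&mut self, key: &str, value: &[u8]) {
        self.writes.push((key.to_string(), value.to_vec()));
    }
}

impl Store {
    // All writes in the batch become visible together (RocksDB's WriteBatch
    // gives the analogous guarantee on disk); a crash before commit loses the
    // whole batch, never half of it.
    fn commit(&mut self, update: StoreUpdate) {
        for (k, v) in update.writes {
            self.kv.insert(k, v);
        }
    }
}

fn main() {
    let mut store = Store::default();
    let block_hash = "96Bikdv3BruAoEhdxRr4yvKoLiz5Y8VAJyhQo5JGxUFU";

    let mut update = StoreUpdate::default();
    update.set(&format!("block:{}", block_hash), b"block body");
    update.set("HEAD", block_hash.as_bytes());
    store.commit(update);

    // If HEAD made it to the store, the block it points to is there too.
    assert!(store.kv.contains_key("HEAD"));
    assert!(store.kv.contains_key(&format!("block:{}", block_hash)));
}
```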

@bowenwang1996
Collaborator

Yes, I was wrong. I was investigating it today, but our nodes got nuked :(

@bowenwang1996
Collaborator

It should have been fixed by #3099. We can reopen if we see it again.
