
nearcore stuck with crashes on betanet-node7 #3084

Closed
frol opened this issue Aug 5, 2020 · 6 comments
Labels: A-chain (Area: Chain, client & related), C-bug (Category: This is a bug)

Comments

@frol
Collaborator

frol commented Aug 5, 2020

Describe the bug

The node got stuck, spamming the log with the same backtrace over and over:

Aug 05 09:37:11.845 ERROR near_client::client_actor: Error while sending an approval Chain(Error { inner:    0: failure::backtrace::Backtrace::new
   1: <near_chain::error::Error as core::convert::From<near_primitives::errors::EpochError>>::from
   2: <neard::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::get_epoch_id_from_prev_block
   3: near_client::client_actor::ClientActor::check_triggers
   4: <near_client::client_actor::ClientActor as actix::handler::Handler<near_network::types::NetworkClientMessages>>::handle
   5: <actix::address::envelope::SyncEnvelopeProxy<A,M> as actix::address::envelope::EnvelopeProxy>::handle
   6: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
   7: tokio::runtime::task::raw::poll
   8: tokio::task::local::LocalSet::tick
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  10: actix_rt::runtime::Runtime::block_on
  11: near::main
  12: std::rt::lang_start_internal::{{closure}}::{{closure}}
             at src/libstd/rt.rs:52
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at src/libstd/sys_common/backtrace.rs:130
  13: main
  14: __libc_start_main
  15: _start


DB Not Found Error: 96Bikdv3BruAoEhdxRr4yvKoLiz5Y8VAJyhQo5JGxUFU })
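
For context on the backtrace: `get_epoch_id_from_prev_block` derives the epoch of a block from its previous block, and the "DB Not Found" hash above appears to be a block whose record is missing from the local store. A minimal sketch of the shape of that lookup (not nearcore's actual implementation; the type aliases and the `MissingBlock` variant are stand-ins):

```rust
// Minimal sketch, NOT nearcore's code: types and the `MissingBlock` variant
// are stand-ins used only to illustrate the failing lookup from the backtrace.

use std::collections::HashMap;

type BlockHash = String; // stand-in for the real CryptoHash
type EpochId = String;   // stand-in for the real EpochId

#[derive(Debug)]
enum EpochError {
    MissingBlock(BlockHash), // hypothetical variant name
}

struct EpochManager {
    // stand-in for the on-disk block -> epoch index
    block_epochs: HashMap<BlockHash, EpochId>,
}

impl EpochManager {
    // The epoch of a new block is derived from its previous block, so if that
    // previous block's record was never persisted, the call fails.
    fn get_epoch_id_from_prev_block(&self, prev_hash: &BlockHash) -> Result<EpochId, EpochError> {
        self.block_epochs
            .get(prev_hash)
            .cloned()
            // This branch corresponds to the "DB Not Found Error: 96Bikdv3..." in the log.
            .ok_or_else(|| EpochError::MissingBlock(prev_hash.clone()))
    }
}

fn main() {
    let em = EpochManager { block_epochs: HashMap::new() };
    let missing = "96Bikdv3BruAoEhdxRr4yvKoLiz5Y8VAJyhQo5JGxUFU".to_string();
    // With a missing/inconsistent record, every retry hits the same error,
    // which is why the log keeps repeating the identical backtrace.
    println!("{:?}", em.get_epoch_id_from_prev_block(&missing));
}
```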

The end of the log is the following (as is, without edits; note the timestamps):


Aug 05 09:38:07.117 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:38:30.759 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:38:30.808 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:39:45.860  WARN network: Received Block while Connecting from Outbound connection.
Aug 05 09:41:01.693 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:43:21.021 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:43:22.083 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:43:22.582 ERROR sync: State sync received hash FURpQ91DLKnTXFiiBwKsyPGuHkHkGMVkJQMWuCfPaxM9 that we're not expecting, potential malicious peer
Aug 05 09:44:20.107  WARN network: Peer stream error: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }
Aug 05 09:55:59.494 ERROR network: Failed sending broadcast message(query_active_peers): Mailbox has closed
Aug 05 10:35:43.570  WARN network: Attempting to connect to a node (ed25519:6sF1yWWwy3aXtstrEU7SiM6u1eQ5WCqaf24jdoy5u1aa@0.0.0.0:24567@bisontrails.stakingpool) with a different genesis block. Our genesis: GenesisId { chain_id: "betanet", hash: `FPJFkXFrfvQgNxjFp97VpbZgLg9jNakpBCPf7CZxGaji` }, their genesis: GenesisId { chain_id: "testnet", hash: `EUbqkM9kGbBBYVBuJcMfu6UzjYz93y7yDoKVB1M7X3VB` }
Aug 05 10:40:43.588  WARN network: Received Block while Connecting from Outbound connection.
Aug 05 10:42:43.870  WARN network: Attempting to connect to a node (ed25519:6sF1yWWwy3aXtstrEU7SiM6u1eQ5WCqaf24jdoy5u1aa@0.0.0.0:24567@bisontrails.stakingpool) with a different genesis block. Our genesis: GenesisId { chain_id: "betanet", hash: `FPJFkXFrfvQgNxjFp97VpbZgLg9jNakpBCPf7CZxGaji` }, their genesis: GenesisId { chain_id: "testnet", hash: `EUbqkM9kGbBBYVBuJcMfu6UzjYz93y7yDoKVB1M7X3VB` }
Aug 05 11:47:07.192 ERROR network: Failed sending broadcast message(query_active_peers): Mailbox has closed

The node uses 100% of all CPU cores (the VM has 2 cores and both are busy with neard) and consumes 3.3 GB of RAM.

strace reports a flood of single-byte 0x01 writes (the file descriptors are anonymous pipes):

write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(205, "\1", 1)                     = 1
write(205, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(216, "\1", 1)                     = 1
write(386, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
write(226, "\1", 1)                     = 1
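
Assuming those single-byte 0x01 writes are async-runtime wakeup notifications (the usual self-pipe pattern; this is an interpretation, not confirmed from the trace), the 100% CPU would be consistent with a task re-waking itself in a tight retry loop while the error above keeps firing. A minimal sketch of what produces this strace pattern, with a `UnixStream` pair standing in for the anonymous pipes:

```rust
// Minimal sketch of the self-pipe wakeup pattern (an assumption about what the
// 0x01 writes are, not something confirmed from the trace). A UnixStream pair
// stands in for the anonymous pipes seen in strace.

use std::io::Write;
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    let (mut waker, _event_loop_side) = UnixStream::pair()?;
    for _ in 0..5 {
        // Each wakeup shows up in strace as: write(fd, "\1", 1) = 1
        // A task that keeps re-waking itself produces an endless stream of these.
        waker.write_all(&[0x01])?;
    }
    Ok(())
}
```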

To Reproduce

N/A

Version (please complete the following information):

  • nearcore commit/branch: 1.9.0-beta.1 (build 157b015-modified) [official release]
  • betanet

Additional context

It is an RPC node running on the betanet-node7 instance. I am restarting the instance.

@frol added the C-bug (Category: This is a bug) and A-chain (Area: Chain, client & related) labels on Aug 5, 2020
@frol
Collaborator Author

frol commented Aug 5, 2020

In fact, this error happened "immediately" on node start ("immediately" is actually almost 10 minutes after boot, but there were no sync logs during that period):

Aug 05 09:28:45.232  INFO near: Version: 1.9.0-beta.1, Build: 157b0153-modified, Latest Protocol: 31
Aug 05 09:28:47.421  INFO near: Opening store database at "/home/ubuntu/.near/data"
Aug 05 09:29:06.101  INFO stats: Server listening at ed25519:EAV7gSD8Xvfph14HtLfohsiRPYT8GfrDwaj6Aqn1MDwU@0.0.0.0:24567
Aug 05 09:37:11.288 ERROR near_client::client_actor: Error while sending an approval Chain(Error { inner:    0: failure::backtrace::Backtrace::new
   1: <near_chain::error::Error as core::convert::From<near_primitives::errors::EpochError>>::from
   2: <neard::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::get_epoch_id_from_prev_block
   3: near_client::client_actor::ClientActor::check_triggers
   4: <near_client::client_actor::ClientActor as actix::handler::Handler<near_network::types::NetworkClientMessages>>::handle
   5: <actix::address::envelope::SyncEnvelopeProxy<A,M> as actix::address::envelope::EnvelopeProxy>::handle
   6: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll

@bowenwang1996
Collaborator

I believe this is caused by an unclean shutdown of the node, but I don't think this will cause the node to get stuck forever.

@frol
Collaborator Author

frol commented Aug 6, 2020

It was in that state for 5 hours, so I call it “forever”.

@SkidanovAlex
Collaborator

> I believe this is caused by an unclean shutdown of the node, but I don't think this will cause the node to get stuck forever.

I don't think an unclean shutdown can explain it? The batch writes to the storage are atomic, so the head should not be updated if the block is not persisted.

This appears to be a real bug.
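
For reference, the invariant being described, as a minimal sketch (not nearcore's actual storage layer; key names and types are illustrative): the block data and the HEAD pointer go into one write batch that is committed atomically, so a crash should never leave HEAD pointing at a block that was not persisted.

```rust
// Minimal sketch, NOT nearcore's storage code: key names and types are
// illustrative. It shows the invariant above: block data and the HEAD pointer
// are committed in one atomic batch.

use std::collections::HashMap;

#[derive(Default)]
struct Store {
    kv: HashMap<String, Vec<u8>>,
}

#[derive(Default)]
struct StoreUpdate {
    writes: Vec<(String, Vec<u8>)>,
}

impl StoreUpdate {
    fn set(&mut self, key: &str, value: &[u8]) {
        self.writes.push((key.to_string(), value.to_vec()));
    }
}

impl Store {
    // All writes in the batch become visible together (RocksDB's WriteBatch
    // gives the analogous guarantee on disk); a crash before commit loses the
    // whole batch, never half of it.
    fn commit(&mut self, update: StoreUpdate) {
        for (k, v) in update.writes {
            self.kv.insert(k, v);
        }
    }
}

fn main() {
    let mut store = Store::default();
    let block_hash = "96Bikdv3BruAoEhdxRr4yvKoLiz5Y8VAJyhQo5JGxUFU";

    let mut update = StoreUpdate::default();
    update.set(&format!("block:{}", block_hash), b"block body");
    update.set("HEAD", block_hash.as_bytes());
    store.commit(update);

    // If HEAD made it to the store, the block it points to is there too.
    assert!(store.kv.contains_key("HEAD"));
    assert!(store.kv.contains_key(&format!("block:{}", block_hash)));
}
```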

@bowenwang1996
Collaborator

Yes, I was wrong. I was investigating it today, but our nodes got nuked :(

@bowenwang1996
Collaborator

It should have been fixed by #3099. We can reopen if we see it again.
