Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Attempt to graceful shutdown in case of panics #8999

Merged
merged 8 commits into from Jul 2, 2018
Merged

Conversation

sorpaas
Copy link
Collaborator

@sorpaas sorpaas commented Jun 28, 2018

Panics in our code indicates logic errors. However, it still happens. And if it does, the current behavior causes all Drop handlers not being called and possibly lead to resources not being freed or shutdown. This can lead to a "slippery slop", where if panic happens to an user, it can lead to other panics in the future (with a totally unrelated reason that won't help us to debug the issue). This possibly resulted in the "transaction/block not found in the database; qed" and other panic errors we occasionally see in some issue reports.

This PR makes it so that it still aborts on panic initially when the program starts. After it has started, at the first occasion when we can, switch to graceful shutdown.

  • std::panic::set_hook is thread-safe as it internally uses Mutex.
  • The should_exit boolean is added to ExitStatus, because we may have a situation where a panic happens after panic_hook::set_with, but before we can wait on exit.1.
  • The exiting atomic boolean is used because double panic can happen, and as we lock ExitStatus after the main thread is notified, it cannot be locked again.

@sorpaas sorpaas added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Jun 28, 2018
@sorpaas sorpaas added this to the 1.12 milestone Jun 28, 2018
@ordian
Copy link
Collaborator

ordian commented Jun 28, 2018

Do we have any Drop handlers that are essential for maintaining consistent state? Leaking memory on shutdown is fine, since OS will take care of reclaiming resources. What about DB flush?

@sorpaas
Copy link
Collaborator Author

sorpaas commented Jun 28, 2018

A basic search of impl Drop gives me this:

./dapps/src/tests/helpers/mod.rs:248:impl Drop for Server {
./ethcore/src/client/client.rs:2312:impl Drop for Client {
./ethcore/src/snapshot/service.rs:57:impl Drop for Guard {
./ethcore/src/snapshot/service.rs:751:impl Drop for Service {
./ethcore/stratum/src/lib.rs:120:impl Drop for Stratum {
./ethkey/src/password.rs:39:impl Drop for Password {
./ethstore/tests/util/transient_dir.rs:55:impl Drop for TransientDir {
./hw/src/lib.rs:293:impl Drop for HardwareWalletManager {
./ipfs/src/lib.rs:146:impl Drop for Listening {
./rpc/src/v1/helpers/nonce.rs:192:impl Drop for Reserved {
./rpc/src/v1/helpers/nonce.rs:243:impl Drop for Ready {
./rpc/src/v1/helpers/signing_queue.rs:156:impl Drop for ConfirmationsQueue {
./rpc/src/v1/tests/eth.rs:377:	impl Drop for PanicHandler {
./secret_store/src/listener/service_contract_listener.rs:427:impl Drop for ServiceContractListener {
./secret_store/src/key_server.rs:219:impl Drop for KeyServerCore {
./util/fetch/src/client.rs:160:impl Drop for Client {
./util/fetch/src/client.rs:783:	impl Drop for Handle {
./util/io/src/worker.rs:122:impl Drop for Worker {
./util/reactor/src/lib.rs:219:impl Drop for EventLoopHandle {
./util/kvdb-rocksdb/src/lib.rs:703:impl Drop for Database {
./util/stop-guard/src/lib.rs:41:impl Drop for StopGuard {
./util/trace-time/src/lib.rs:48:impl Drop for PerfTimer {
./util/network-devp2p/src/node_table.rs:442:impl Drop for NodeTable {
./whisper/src/rpc/filter.rs:194:impl Drop for Manager {

I think what might matter most are database waits. If we don't shutdown cleanly and wait for locks then it's possible to break db consistency (like all the "block/transaction not found" panics).

@ordian
Copy link
Collaborator

ordian commented Jun 28, 2018

There are also generic Drop impls:

dapps/src/apps/fetcher/mod.rs:58:impl<R: URLHint + 'static, F: Fetch> Drop for ContentFetcher<F, R> {
ethcore/light/src/net/load_timer.rs:191:impl<'a> Drop for LoadTimer<'a> {
ethcore/light/src/net/mod.rs:332:   impl<'a> Drop for IdGuard<'a> {
ethcore/src/snapshot/service.rs:834:impl<T: ChainRestorationParams> Drop for Service<T> {
ethcore/src/verification/queue/mod.rs:702:impl<K: Kind> Drop for VerificationQueue<K> {
local-store/src/lib.rs:223:impl<T: NodeInfo> Drop for LocalDataStore<T> {
util/io/src/service_mio.rs:492:impl<Message> Drop for IoService<Message> where Message: Send + Sync {
util/io/src/service_non_mio.rs:304:impl<Message> Drop for IoService<Message> where Message: Send + Sync {
util/mem/src/lib.rs:32:impl<T: AsMut<[u8]>> Drop for Memzero<T> {
util/patricia_trie/src/triedbmut.rs:940:impl<'a> Drop for TrieDBMut<'a> {

Copy link
Contributor

@tomaka tomaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good except for the nits.

parity/main.rs Outdated
/// Status used to exit or restart the program.
struct ExitStatus {
/// Whether the program panicked.
pub panicking: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: these fields don't have to be public.

parity/main.rs Outdated
spec_name_override: None
}), Condvar::new()));

let exiting = Arc::new(AtomicBool::new(false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should document here what you said in the comment of the PR.
It's not obvious why this exists and is not part of ExitStatus.

@tomaka
Copy link
Contributor

tomaka commented Jun 30, 2018

I'm not totally sure that trying to exit gracefully may not lead to deadlocks.
Since we sometimes use callbacks, it's possible that a thread expects a callback to be called, except that the caller thread has panicked.

@tomaka tomaka added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Jun 30, 2018
@sorpaas
Copy link
Collaborator Author

sorpaas commented Jul 2, 2018

@tomaka Not sure about that either. In shutdown sequence we have wait_for_drop, which will force the main thread to exit (thus exiting the whole program) if weak_client handle is not dropped on time. There may be cases for other Drop handlers. But from things I looked, most services, if spawning a new thread, will often create a close channel and send a value to it. If the child thread panics, then the send would fail and in worst scenario it just causes double panic.

@5chdn 5chdn merged commit a1a002f into master Jul 2, 2018
@5chdn 5chdn deleted the sorpaas-panics-graceful branch July 2, 2018 09:53
dvdplm added a commit that referenced this pull request Jul 2, 2018
* master:
  Attempt to graceful shutdown in case of panics (#8999)
  simplify kvdb error types (#8924)
  Add option for user to set max size limit for RPC requests (#9010)
  bump ntp to 0.5.0 (#9009)
  Removed duplicate dependency (#9021)
ordian added a commit to ordian/parity that referenced this pull request Jul 9, 2018
…rp_sync_on_light_client

* 'master' of https://github.com/paritytech/parity:
  Attempt to graceful shutdown in case of panics (openethereum#8999)
  simplify kvdb error types (openethereum#8924)
  Add option for user to set max size limit for RPC requests (openethereum#9010)
  bump ntp to 0.5.0 (openethereum#9009)
  Removed duplicate dependency (openethereum#9021)
  Minimal effective gas price in the queue (openethereum#8934)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A8-looksgood 🦄 Pull request is reviewed well. M4-core ⛓ Core client code / Rust.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants