
Ability to shutdown neard cleanly #3266

Closed
bowenwang1996 opened this issue Aug 31, 2020 · 9 comments · Fixed by #4429
Labels: A-chain (Area: Chain, client & related), C-enhancement (Category: An issue proposing an enhancement or a PR with one), Node (Node team), T-node (Team: issues relevant to the node experience team)

Comments

@bowenwang1996
Collaborator

Neard should shut down cleanly. More specifically,

  • SIGTERM and SIGINT should be enough to shut down neard. One should not have to resort to SIGKILL to kill neard.
  • When neard is killed, the database should not be corrupted. Currently we sometimes see DBNotFound errors after stopping and restarting a node.
@bowenwang1996 bowenwang1996 added the A-chain Area: Chain, client & related label Aug 31, 2020
@MaksymZavershynskyi
Contributor

We need to use this new feature: rust-rocksdb/rust-rocksdb#459
Kudos to @ailisp

@frol
Collaborator

frol commented Apr 29, 2021

In #4229 we implemented a ctrl+c handler for the tests infrastructure. We clearly did not shut down RocksDB gracefully there, but it might still be useful to take some inspiration from it.

@janewang janewang added the T-node Team: issues relevant to the node experience team label Jun 7, 2021
@bowenwang1996 bowenwang1996 assigned mina86 and unassigned pmnoxx Jun 22, 2021
@bowenwang1996 bowenwang1996 added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jun 29, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jun 29, 2021
Stop the system once a SIGINT is received.  This should allow for graceful
termination since System::stop will stop all the arbiters and that in turn
will stop all the actors (leading them through stopping and stopped states
thus allowing all the necessary cleanups).

Fixes: near#3266
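
For illustration, a minimal sketch of that shutdown path (not the actual neard code; it assumes an actix 0.13-style API and tokio built with the `signal` feature):

    use actix::System;

    fn main() {
        let sys = System::new();
        sys.block_on(async {
            // ... start the actors here ...
            actix::spawn(async {
                // Resolves once SIGINT (ctrl-c) is delivered to the process.
                tokio::signal::ctrl_c().await.expect("failed to listen for SIGINT");
                // Stops all arbiters, which in turn stops their actors and runs
                // their stopping/stopped hooks so cleanups can happen.
                System::current().stop();
            });
        });
        sys.run().expect("actix system failed");
    }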
mina86 added a commit to mina86/nearcore that referenced this issue Jun 29, 2021
Per RocksDB FAQ:

> Q: Is it safe to close RocksDB while another thread is issuing read,
>    write or manual compaction requests?
> A: No.  The users of RocksDB need to make sure all functions have
>    finished before they close RocksDB.  You can speed up the waiting
>    by calling CancelAllBackgroundWork().

Better be safe than sorry so add the call before the rocksdb::DB object
is dropped.

Issue: near#3266
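
A minimal sketch of that call using the rust-rocksdb crate (the feature added in rust-rocksdb/rust-rocksdb#459); the path and options are placeholders rather than nearcore's actual configuration:

    use rocksdb::{Options, DB};

    fn main() {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        let db = DB::open(&opts, "/tmp/example-db").expect("failed to open RocksDB");

        // ... reads, writes, compactions ...

        // Per the FAQ: wait (`true`) until all background work (flushes,
        // compactions) has finished before the handle is dropped.
        db.cancel_all_background_work(true);
        drop(db); // rust-rocksdb closes the database when the DB value is dropped
    }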
@janewang janewang moved this from Backlog to In Development in Node Experience Q3 2021 Jun 29, 2021
@mina86
Contributor

mina86 commented Jun 29, 2021

The issue here is that wasmer 0.17.1 catches SIGINT and interferes with our attempt to catch it as well. This was fixed upstream, but our fork still does it. All in all, this is now blocked on near/wasmer#38, which fixes this in our fork as well.

@bowenwang1996
Collaborator Author

cc @matklad

@miraclx
Contributor

miraclx commented Jun 30, 2021

This is the same issue I encountered on #4229. I think (though I may be wrong) that the problem is most likely caused by long blocking tasks on the active thread, so the signal is only occasionally caught. The workaround was to take a more direct approach: isolate the listener on a dedicated thread and have it alert the dependent tasks when the signal is caught. https://github.com/near/nearcore/pull/4229/files#diff-c4d3d011b8925d7128b2a5779e866237fa4627a9655c60ff3ce01d7f37d8bdae
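
A rough sketch of that workaround (not the actual #4229 code; it assumes the `ctrlc` crate, which runs its handler on a dedicated signal-handling thread, plus a plain channel to notify dependent tasks):

    use std::sync::mpsc;

    fn install_shutdown_listener() -> mpsc::Receiver<()> {
        let (tx, rx) = mpsc::channel();
        // The handler runs on ctrlc's own thread, so a long-blocking main
        // thread cannot delay signal delivery.
        ctrlc::set_handler(move || {
            // The receiver may already be gone during shutdown; ignore errors.
            let _ = tx.send(());
        })
        .expect("failed to install ctrl-c handler");
        rx
    }

    fn main() {
        let shutdown = install_shutdown_listener();
        // ... run the long (possibly blocking) work on other threads ...
        shutdown.recv().expect("signal listener dropped the sender");
        println!("signal received, shutting down");
    }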

mina86 added a commit to mina86/nearcore that referenced this issue Jun 30, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jun 30, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jul 1, 2021
Stop the system once a SIGINT is received.  This should allow for
graceful termination since System::stop will stop all the arbiters
and that in turn will stop all the actors (leading them through
stopping and stopped states thus allowing all the necessary cleanups).

To achieve this, also update the wasmer-runtime-core dependency to
0.17.4.  Among other things, the new version no longer catches the INT
signal, making it available for tokio to handle.

Issue: near#3266
mina86 added a commit to mina86/nearcore that referenced this issue Jul 1, 2021
The way `near_actix_test_utils::run_actix_until` was written,
the expect_panic flag didn’t actually matter:

    SET_PANIC_HOOK.call_once(|| {
        let default_hook = std::panic::take_hook();
        std::panic::set_hook(Box::new(move |info| {
            if !expect_panic {
                default_hook(info);
            }
            // ...
        }));
    });

Since `SET_PANIC_HOOK.call_once` invokes the closure only once, the
value of expect_panic when that call happens is the only one that
matters.  In other words, the first run of the `run_actix_until` function
decides what the value of `expect_panic` in the panic handler is.

Fortunately this didn’t actually matter.  The only test which set the
flag to true – `chunks_recovered_from_full_timeout_too_short` – was
marked `#[should_panic]` and running the default panic hook didn’t
negatively influence the test.

As such, get rid of `run_actix_until_panic` and rename
`run_actix_until_stop` to simply be `run_actix`.

Issue: near#3266
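
A tiny standalone example of why only the first call matters (hypothetical names, only illustrating `std::sync::Once` semantics):

    use std::sync::Once;

    static SET_HOOK: Once = Once::new();

    fn install_hook(expect_panic: bool) {
        // Only the closure from the first call ever runs; later calls to
        // `call_once` are no-ops, so their `expect_panic` values are ignored.
        SET_HOOK.call_once(|| {
            println!("hook installed with expect_panic = {}", expect_panic);
        });
    }

    fn main() {
        install_hook(false); // prints "hook installed with expect_panic = false"
        install_hook(true);  // no-op: the hook from the first call stays installed
    }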
mina86 added a commit that referenced this issue Jul 5, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jul 6, 2021
The default signal the ‘kill’ command sends is SIGTERM, so catch it in
addition to SIGINT when running under a Unix-like system.

Issue: near#3266
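
A minimal sketch of listening for both signals with tokio (not the exact neard code; Unix-only, and it assumes tokio's `signal` and `macros` features):

    use tokio::signal::unix::{signal, SignalKind};

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        // `kill <pid>` sends SIGTERM by default, so listen for it in addition
        // to SIGINT (ctrl-c).
        let mut sigint = signal(SignalKind::interrupt())?;
        let mut sigterm = signal(SignalKind::terminate())?;

        tokio::select! {
            _ = sigint.recv() => eprintln!("got SIGINT, shutting down"),
            _ = sigterm.recv() => eprintln!("got SIGTERM, shutting down"),
        }

        // ... trigger the graceful shutdown here, e.g. System::current().stop() ...
        Ok(())
    }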
mina86 added a commit to mina86/nearcore that referenced this issue Jul 6, 2021
@janewang janewang moved this from In Development to Done in Node Experience Q3 2021 Jul 6, 2021
mina86 added a commit that referenced this issue Jul 7, 2021
…ly (#4463)

* cli: unix: catch SIGTERM in addition to SIGINT and terminate gracefully

The default signal the ‘kill’ command sends is SIGTERM, so catch it in
addition to SIGINT when running under a Unix-like system.

Issue: #3266
@mina86
Contributor

mina86 commented Jul 7, 2021

The process will now gracefully handle SIGINT (i.e. ^C) and SIGTERM (i.e. the default signal used by the kill command).

cancel_all_background_work is still waiting for review.

mina86 added a commit that referenced this issue Jul 7, 2021
mina86 added a commit that referenced this issue Jul 9, 2021
near-bulldozer bot pushed a commit that referenced this issue Jul 9, 2021
…4429)

Per RocksDB FAQ:

> Q: Is it safe to close RocksDB while another thread is issuing read,
>    write or manual compaction requests?
> A: No.  The users of RocksDB need to make sure all functions have
>    finished before they close RocksDB.  You can speed up the waiting
>    by calling CancelAllBackgroundWork().

Better be safe than sorry so add the call before the rocksdb::DB object
is dropped.

Fixes: #3266
@bowenwang1996
Collaborator Author

@mina86 looks like this is still not fixed. Today I shut down a node that is running 3216f6e and tried to load it from state-viewer and got

thread 'main' panicked at 'Failed to open the database: DBError(Error { message: "Corruption: Corruption: IO error: No such file or directoryWhile open a file for random read: /home/ubuntu/.near/data/534603.ldb: No such file or directory" })', core/store/src/lib.rs:299:42

@bowenwang1996 bowenwang1996 reopened this Aug 16, 2021
Node Experience Q3 2021 automation moved this from Done to In Development Aug 16, 2021
@janewang janewang added this to Backlog in Node Experience Q4 2021 via automation Oct 8, 2021
@janewang janewang moved this from Backlog to In Development in Node Experience Q4 2021 Oct 8, 2021
@janewang janewang removed this from In Development in Node Experience Q3 2021 Oct 8, 2021
@stale

stale bot commented Nov 14, 2021

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@mina86
Contributor

mina86 commented Nov 24, 2021

I’m going to close this in favour of #5340

@mina86 mina86 closed this as completed Nov 24, 2021
Node Experience Q4 2021 automation moved this from In Development to Done Nov 24, 2021
@gmilescu gmilescu added the Node Node team label Oct 19, 2023