
Ability to shutdown neard cleanly #3266

Closed
bowenwang1996 opened this issue Aug 31, 2020 · 9 comments · Fixed by #4429
Labels: A-chain (Area: Chain, client & related), C-enhancement (Category: An issue proposing an enhancement or a PR with one), Node (Node team), T-node (Team: issues relevant to the node experience team)

Comments

@bowenwang1996
Collaborator

Neard should shut down cleanly. More specifically,

  • SIGTERM and SIGINT should be enough to shut down neard. One should not have to resort to SIGKILL to kill neard.
  • When neard is killed, the database should not be corrupted. Currently we sometimes see DBNotFound errors after stopping and restarting a node.
@bowenwang1996 bowenwang1996 added the A-chain Area: Chain, client & related label Aug 31, 2020
@MaksymZavershynskyi
Contributor

We need to use this new feature: rust-rocksdb/rust-rocksdb#459
Kudos to @ailisp

@frol
Collaborator

frol commented Apr 29, 2021

In #4229 we implemented a ctrl+c handler for the tests infrastructure. We clearly did not shut down RocksDB gracefully there, but it might still be useful to take some inspiration from it.

@janewang janewang added the T-node Team: issues relevant to the node experience team label Jun 7, 2021
@bowenwang1996 bowenwang1996 assigned mina86 and unassigned pmnoxx Jun 22, 2021
@bowenwang1996 bowenwang1996 added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jun 29, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jun 29, 2021
Stop the system once a SIGINT is received.  This should allow for graceful
termination since System::stop will stop all the arbiters and that in turn
will stop all the actors (leading them through stopping and stopped states
thus allowing all the necessary cleanups).

Fixes: near#3266
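
For illustration, a minimal sketch of that shutdown path (not the actual neard code; it assumes an actix 0.13-style API and tokio built with the `signal` feature):

    use actix::System;

    fn main() {
        let sys = System::new();
        sys.block_on(async {
            // ... start the actors here ...
            actix::spawn(async {
                // Resolves once SIGINT (ctrl-c) is delivered to the process.
                tokio::signal::ctrl_c().await.expect("failed to listen for SIGINT");
                // Stops all arbiters, which in turn stops their actors and runs
                // their stopping/stopped hooks so cleanups can happen.
                System::current().stop();
            });
        });
        sys.run().expect("actix system failed");
    }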
mina86 added a commit to mina86/nearcore that referenced this issue Jun 29, 2021
Per RocksDB FAQ:

> Q: Is it safe to close RocksDB while another thread is issuing read,
>    write or manual compaction requests?
> A: No.  The users of RocksDB need to make sure all functions have
>    finished before they close RocksDB.  You can speed up the waiting
>    by calling CancelAllBackgroundWork().

Better be safe than sorry so add the call before the rocksdb::DB object
is dropped.

Issue: near#3266
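
A minimal sketch of that call using the rust-rocksdb crate (the feature added in rust-rocksdb/rust-rocksdb#459); the path and options are placeholders rather than nearcore's actual configuration:

    use rocksdb::{Options, DB};

    fn main() {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        let db = DB::open(&opts, "/tmp/example-db").expect("failed to open RocksDB");

        // ... reads, writes, compactions ...

        // Per the FAQ: wait (`true`) until all background work (flushes,
        // compactions) has finished before the handle is dropped.
        db.cancel_all_background_work(true);
        drop(db); // rust-rocksdb closes the database when the DB value is dropped
    }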
@janewang janewang moved this from Backlog to In Development in Node Experience Q3 2021 Jun 29, 2021
@mina86
Contributor

mina86 commented Jun 29, 2021

The issue here is that wasmer 0.17.1 catches SIGINT and interferes with our attempt to catch it as well. This was fixed upstream, but our fork still does it. All in all, this is now blocked on near/wasmer#38, which fixes this in our fork as well.

@bowenwang1996
Collaborator Author

cc @matklad

@miraclx
Contributor

miraclx commented Jun 30, 2021

This is the same issue I encountered on #4229. I think (though I may be wrong) that the problem is most likely caused by long blocking tasks on the active thread, so the signal is only occasionally caught. The workaround was to take a more direct approach: isolate the listener on a dedicated thread and have it alert the dependent tasks when the signal is caught. https://github.com/near/nearcore/pull/4229/files#diff-c4d3d011b8925d7128b2a5779e866237fa4627a9655c60ff3ce01d7f37d8bdae
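
A rough sketch of that workaround (not the actual #4229 code; it assumes the `ctrlc` crate, which runs its handler on a dedicated signal-handling thread, plus a plain channel to notify dependent tasks):

    use std::sync::mpsc;

    fn install_shutdown_listener() -> mpsc::Receiver<()> {
        let (tx, rx) = mpsc::channel();
        // The handler runs on ctrlc's own thread, so a long-blocking main
        // thread cannot delay signal delivery.
        ctrlc::set_handler(move || {
            // The receiver may already be gone during shutdown; ignore errors.
            let _ = tx.send(());
        })
        .expect("failed to install ctrl-c handler");
        rx
    }

    fn main() {
        let shutdown = install_shutdown_listener();
        // ... run the long (possibly blocking) work on other threads ...
        shutdown.recv().expect("signal listener dropped the sender");
        println!("signal received, shutting down");
    }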

mina86 added a commit to mina86/nearcore that referenced this issue Jun 30, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jun 30, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jul 1, 2021
Stop the system once a SIGINT is received.  This should allow for
graceful termination since System::stop will stop all the arbiters
and that in turn will stop all the actors (leading them through
stopping and stopped states thus allowing all the necessary cleanups).

To achieve this, also update the wasmer-runtime-core dependency to
0.17.4.  Among other things, the new version no longer catches the INT
signal, making it available for tokio to handle.

Issue: near#3266
mina86 added a commit to mina86/nearcore that referenced this issue Jul 1, 2021
The way `near_actix_test_utils::run_actix_until` was written,
the expect_panic flag didn’t actually matter:

    SET_PANIC_HOOK.call_once(|| {
        let default_hook = std::panic::take_hook();
        std::panic::set_hook(Box::new(move |info| {
            if !expect_panic {
                default_hook(info);
            }
            // ...
        }));
    });

Since `SET_PANIC_HOOK.call_once` invokes the closure only once, the
value of expect_panic when that call happens is the only one that
matters.  In other words, the first run of the `run_actix_until` function
decides what the value of `expect_panic` in the panic handler is.

Fortunately this didn’t actually matter.  The only test which set the
flag to true – `chunks_recovered_from_full_timeout_too_short` – was
marked `#[should_panic]` and running the default panic hook didn’t
negatively influence the test.

As such, get rid of `run_actix_until_panic` and rename
`run_actix_until_stop` to simply be `run_actix`.

Issue: near#3266
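
A tiny standalone example of why only the first call matters (hypothetical names, only illustrating `std::sync::Once` semantics):

    use std::sync::Once;

    static SET_HOOK: Once = Once::new();

    fn install_hook(expect_panic: bool) {
        // Only the closure from the first call ever runs; later calls to
        // `call_once` are no-ops, so their `expect_panic` values are ignored.
        SET_HOOK.call_once(|| {
            println!("hook installed with expect_panic = {}", expect_panic);
        });
    }

    fn main() {
        install_hook(false); // prints "hook installed with expect_panic = false"
        install_hook(true);  // no-op: the hook from the first call stays installed
    }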
mina86 added a commit that referenced this issue Jul 5, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Jul 6, 2021
The default signal the ‘kill’ command sends is SIGTERM, so catch it in
addition to SIGINT when running under a Unix-like system.

Issue: near#3266
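
A minimal sketch of listening for both signals with tokio (not the exact neard code; Unix-only, and it assumes tokio's `signal` and `macros` features):

    use tokio::signal::unix::{signal, SignalKind};

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        // `kill <pid>` sends SIGTERM by default, so listen for it in addition
        // to SIGINT (ctrl-c).
        let mut sigint = signal(SignalKind::interrupt())?;
        let mut sigterm = signal(SignalKind::terminate())?;

        tokio::select! {
            _ = sigint.recv() => eprintln!("got SIGINT, shutting down"),
            _ = sigterm.recv() => eprintln!("got SIGTERM, shutting down"),
        }

        // ... trigger the graceful shutdown here, e.g. System::current().stop() ...
        Ok(())
    }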
mina86 added a commit to mina86/nearcore that referenced this issue Jul 6, 2021
@janewang janewang moved this from In Development to Done in Node Experience Q3 2021 Jul 6, 2021
mina86 added a commit that referenced this issue Jul 7, 2021
…ly (#4463)

* cli: unix: catch SIGTERM in addition to SIGINT and terminate gracefully

The default signal the ‘kill’ command sends is SIGTERM, so catch it in
addition to SIGINT when running under a Unix-like system.

Issue: #3266
@mina86
Contributor

mina86 commented Jul 7, 2021

The process will now gracefully handle SIGINT (i.e. ^C) and SIGTERM (i.e. the default signal used by the kill command).

cancel_all_background_work is still waiting for review.

mina86 added a commit that referenced this issue Jul 7, 2021
mina86 added a commit that referenced this issue Jul 9, 2021
near-bulldozer bot pushed a commit that referenced this issue Jul 9, 2021
…4429)

Per RocksDB FAQ:

> Q: Is it safe to close RocksDB while another thread is issuing read,
>    write or manual compaction requests?
> A: No.  The users of RocksDB need to make sure all functions have
>    finished before they close RocksDB.  You can speed up the waiting
>    by calling CancelAllBackgroundWork().

Better be safe than sorry so add the call before the rocksdb::DB object
is dropped.

Fixes: #3266
@bowenwang1996
Collaborator Author

@mina86 looks like this is still not fixed. Today I shut down a node that is running 3216f6e and tried to load it from state-viewer and got

thread 'main' panicked at 'Failed to open the database: DBError(Error { message: "Corruption: Corruption: IO error: No such file or directoryWhile open a file for random read: /home/ubuntu/.near/data/534603.ldb: No such file or directory" })', core/store/src/lib.rs:299:42

@bowenwang1996 bowenwang1996 reopened this Aug 16, 2021
Node Experience Q3 2021 automation moved this from Done to In Development Aug 16, 2021
@janewang janewang added this to Backlog in Node Experience Q4 2021 via automation Oct 8, 2021
@janewang janewang moved this from Backlog to In Development in Node Experience Q4 2021 Oct 8, 2021
@janewang janewang removed this from In Development in Node Experience Q3 2021 Oct 8, 2021
@stale

stale bot commented Nov 14, 2021

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@mina86
Contributor

mina86 commented Nov 24, 2021

I’m going to close this in favour of #5340

@mina86 mina86 closed this as completed Nov 24, 2021
Node Experience Q4 2021 automation moved this from In Development to Done Nov 24, 2021
@gmilescu gmilescu added the Node Node team label Oct 19, 2023