Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a lame duck mode for RPC services #4489

Closed
mina86 opened this issue Jul 9, 2021 · 0 comments · Fixed by #4681
Closed

Implement a lame duck mode for RPC services #4489

mina86 opened this issue Jul 9, 2021 · 0 comments · Fixed by #4681
Assignees
Labels
A-RPC Area: rpc C-enhancement Category: An issue proposing an enhancement or a PR with one. Node Node team T-node Team: issues relevant to the node experience team

Comments

@mina86
Copy link
Contributor

mina86 commented Jul 9, 2021

With #3266 done, the node handles SIGINT and SIGTERM somewhat gracefully, i.e. all the cleanup code is executed so the result is no longer effectively a crash. However, all processing is interrupted. Most notably, if a client sent an RPC request, if the node gets SIGINT it will stop the processing and close connection to the client.

This should be improved by adding a lame duck mode to RPC endpoints. That is, once SIGINT or SIGTERM is received, the node should stop accepting new requests on RPC endpoints, finish whatever it’s doing and only then exit.

Note that there’s similar situation with nodes communicating with each other when the node may drop a connection to another peer while working on some request. This is less of an issue since peers should expect and handle correctly unexpected network interruptions and peer crashes. However, RPC clients are usually less fault tolerant so it would be nice if the node handled all incoming requests properly.

@bowenwang1996 bowenwang1996 added C-enhancement Category: An issue proposing an enhancement or a PR with one. A-RPC Area: rpc T-node Team: issues relevant to the node experience team labels Jul 11, 2021
@bowenwang1996 bowenwang1996 added this to Backlog in Node Experience Q3 2021 via automation Jul 11, 2021
mina86 added a commit to mina86/nearcore that referenced this issue Aug 12, 2021
…tuple

Rather than relying on positions in a tuple, change start_with_config
function to return its data in a structure so that the elements can be
named.  This will also make it easier to add new return values and
thanks to the use of ‘..’ callers won’t be affected.

Issue: near#4489
near-bulldozer bot pushed a commit that referenced this issue Aug 13, 2021
…4676)

Rather than relying on positions in a tuple, change start_with_config
function to return its data in a structure so that the elements can be
named.  This will also make it easier to add new return values and
thanks to the use of ‘..’ callers won’t be affected.

Issue: #4489
mina86 added a commit to mina86/nearcore that referenced this issue Aug 13, 2021
With the default signal handlers actix_web HTTP servers uses, sending
a SIGINT results in a forceful shutdown of the servers even if there
are requests in flight.  This is unfriendly to the clients which may
as a result get a partial response or a broken connection.

Disable the signal handlers and instead explicitly call `stop(true)`
on the HTTP servers.  This gives the workers time to finish their
processing.  As a result, if a client successfully connected to an RPC
endpoint, they will get their response.  (That is, of course, unless
processing the request would take more than five seconds but that
should’t be the case with any of the supported RPC calls).

Fixes: near#4489
Node Experience Q3 2021 automation moved this from Backlog to Done Aug 16, 2021
near-bulldozer bot pushed a commit that referenced this issue Aug 16, 2021
With the default signal handlers actix_web HTTP servers uses, sending
a SIGINT results in a forceful shutdown of the servers even if there
are requests in flight.  This is unfriendly to the clients which may
as a result get a partial response or a broken connection.

Disable the signal handlers and instead explicitly call `stop(true)`
on the HTTP servers.  This gives the workers time to finish their
processing.  As a result, if a client successfully connected to an RPC
endpoint, they will get their response.  (That is, of course, unless
processing the request would take more than five seconds but that
should’t be the case with any of the supported RPC calls).

Fixes: #4489
@gmilescu gmilescu added the Node Node team label Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-RPC Area: rpc C-enhancement Category: An issue proposing an enhancement or a PR with one. Node Node team T-node Team: issues relevant to the node experience team
Development

Successfully merging a pull request may close this issue.

3 participants