implemented functions to restart abnormally terminated node automatically. #1101

Merged: 17 commits into klaytn:dev, Feb 3, 2022

sirano11 (Contributor) commented Jan 5, 2022

Proposed changes

This PR implements automatic restart of abnormally terminated nodes in the Klaytn node daemon scripts (kend, kcnd, ...).

To enable automatic restart, set AUTO_RESTART_NODE and AUTO_RESTART_INTERVAL in the kxxd.conf file.

  • AUTO_RESTART_NODE: enables automatic restart of the node.
  • AUTO_RESTART_INTERVAL: wait time before restarting the node, in seconds.
    • This value doubles after each restart (exponential backoff).
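
The doubling of the interval can be illustrated with a minimal sketch. This is not the code merged in this PR: `next_backoff` and `backoff_schedule` are hypothetical helper names, and the starting value of 0.1 seconds is an assumption taken from the first `backOffTime` in the log below.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of exponential backoff; not the code merged in this PR.

# Double a (possibly fractional) interval. Bash arithmetic is integer-only,
# so awk performs the floating-point multiplication.
next_backoff() {
    awk -v t="$1" 'BEGIN { print t * 2 }'
}

# Print the first N wait times, starting from an initial interval.
backoff_schedule() {
    local interval=$1 count=$2 i
    for ((i = 0; i < count; i++)); do
        echo "$interval"
        interval=$(next_backoff "$interval")
    done
}

# Assumed initial interval of 0.1 seconds, four restarts.
backoff_schedule 0.1 4
```

In a real daemon script, each printed value would be passed to `sleep` before the next restart attempt.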

I tested this script in various environments. Please see the attached log and the log below.

INFO[2022. 01. 05. (수) 01:47:01 UTC] node[3225] is down
INFO[2022. 01. 05. (수) 01:47:01 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:01 UTC] Sleep for backOffTime.... 0.1 seconds.
INFO[2022. 01. 05. (수) 01:47:01 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:01 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:01 UTC] backOffTime = 0.2, Restarted node pid = 3225

INFO[2022. 01. 05. (수) 01:47:02 UTC] node[3243] is down
INFO[2022. 01. 05. (수) 01:47:02 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:02 UTC] Sleep for backOffTime.... 0.2 seconds.
INFO[2022. 01. 05. (수) 01:47:02 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:02 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:02 UTC] backOffTime = 0.4, Restarted node pid = 3243

INFO[2022. 01. 05. (수) 01:47:03 UTC] node[3267] is down
INFO[2022. 01. 05. (수) 01:47:03 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:03 UTC] Sleep for backOffTime.... 0.4 seconds.
INFO[2022. 01. 05. (수) 01:47:04 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:04 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:04 UTC] backOffTime = 0.8, Restarted node pid = 3267

INFO[2022. 01. 05. (수) 01:47:05 UTC] node[3291] is down
INFO[2022. 01. 05. (수) 01:47:05 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:05 UTC] Sleep for backOffTime.... 0.8 seconds.
INFO[2022. 01. 05. (수) 01:47:06 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:06 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:06 UTC] backOffTime = 1.6, Restarted node pid = 3291

INFO[2022. 01. 05. (수) 01:47:07 UTC] node[3315] is down
INFO[2022. 01. 05. (수) 01:47:07 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:07 UTC] Sleep for backOffTime.... 1.6 seconds.
INFO[2022. 01. 05. (수) 01:47:08 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:08 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:08 UTC] backOffTime = 3.2, Restarted node pid = 3315

INFO[2022. 01. 05. (수) 01:47:09 UTC] node[3340] is down
INFO[2022. 01. 05. (수) 01:47:09 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:09 UTC] Sleep for backOffTime.... 3.2 seconds.
INFO[2022. 01. 05. (수) 01:47:12 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:12 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:12 UTC] backOffTime = 6.4, Restarted node pid = 3340

INFO[2022. 01. 05. (수) 01:47:13 UTC] node[3367] is down
INFO[2022. 01. 05. (수) 01:47:13 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:13 UTC] Sleep for backOffTime.... 6.4 seconds.
INFO[2022. 01. 05. (수) 01:47:20 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:20 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:20 UTC] backOffTime = 12.8, Restarted node pid = 3367

INFO[2022. 01. 05. (수) 01:47:21 UTC] node[3392] is down
INFO[2022. 01. 05. (수) 01:47:21 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:21 UTC] Sleep for backOffTime.... 12.8 seconds.
INFO[2022. 01. 05. (수) 01:47:34 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:47:34 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:47:34 UTC] backOffTime = 25.6, Restarted node pid = 3392

INFO[2022. 01. 05. (수) 01:47:35 UTC] node[3416] is down
INFO[2022. 01. 05. (수) 01:47:35 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:47:35 UTC] Sleep for backOffTime.... 25.6 seconds.
INFO[2022. 01. 05. (수) 01:48:00 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:48:00 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:48:00 UTC] backOffTime = 51.2, Restarted node pid = 3416

INFO[2022. 01. 05. (수) 01:48:01 UTC] node[3443] is down
INFO[2022. 01. 05. (수) 01:48:01 UTC] remove redundant pid file
INFO[2022. 01. 05. (수) 01:48:01 UTC] Sleep for backOffTime.... 51.2 seconds.
INFO[2022. 01. 05. (수) 01:48:53 UTC] -------------------> start network
Starting kcnd: Success to start node.
INFO[2022. 01. 05. (수) 01:48:53 UTC] -------------------> start network
INFO[2022. 01. 05. (수) 01:48:53 UTC] backOffTime = 102.4, Restarted node pid = 3443

Types of changes

Please put an x in the boxes related to your change.

  • Bugfix
  • New feature or enhancement
  • Others

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING GUIDELINES doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes ($ make test)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Related issues

  • Please leave the issue numbers or links related to this PR here.

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc...

CLAassistant commented Jan 5, 2022

CLA assistant check
All committers have signed the CLA.

hqjang-pepper previously approved these changes Jan 7, 2022

aeharvlee (Contributor) commented Jan 10, 2022

@sirano11
I'd like to suggest an idea.
How about giving each restart_daemon.pid file a unique name?

I haven't tested this yet, but I'm worried about pid file collisions (e.g., running all of the k*nd daemons on one machine with the same DATA_DIR and the auto-restart option enabled could make them share one pid file).

My idea is as follows.

in kend
Before:
auto_restart_daemon_pidfile=$DATA_DIR/restart_daemon.pid
After:
auto_restart_daemon_pidfile=$DATA_DIR/kend_restart_daemon.pid

in kcnd
Before:
auto_restart_daemon_pidfile=$DATA_DIR/restart_daemon.pid
After:
auto_restart_daemon_pidfile=$DATA_DIR/kcnd_restart_daemon.pid
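
As a rough sketch of the same idea, the prefix could be passed in rather than hard-coded in each script. `restart_pidfile_path` is a hypothetical helper and `/tmp/klay-data` is a placeholder directory; neither comes from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a per-daemon pid file path so that daemons
# sharing one DATA_DIR do not overwrite each other's pid file.
restart_pidfile_path() {
    local data_dir=$1 daemon_name=$2
    echo "${data_dir}/${daemon_name}_restart_daemon.pid"
}

# Placeholder data directory, for illustration only.
DATA_DIR=/tmp/klay-data
auto_restart_daemon_pidfile=$(restart_pidfile_path "$DATA_DIR" kend)
echo "$auto_restart_daemon_pidfile"
```

Inside a daemon script, the name could also be derived from the script itself, for example with `$(basename "$0")`, assuming the script file is named after the daemon.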

KimKyungup closed this Jan 10, 2022
KimKyungup reopened this Jan 10, 2022
jimni1222 previously approved these changes Jan 11, 2022
hqjang-pepper previously approved these changes Jan 11, 2022
aeharvlee previously approved these changes Jan 12, 2022
aidan-kwon previously approved these changes Jan 18, 2022

aidan-kwon (Member) left a comment

Code looks good to me

fi
AUTO_RESTART_DAEMON_PID_NUM=$(eval "cat $auto_restart_daemon_pidfile")
if [[ ! -z "$AUTO_RESTART_DAEMON_PID_NUM" ]]; then
export auto_restart_daemon_pid=$(eval "ps -p $AUTO_RESTART_DAEMON_PID_NUM -o pid=")

aluminous (Contributor) commented

Is it necessary to use $(eval "cmd...") instead of $(cmd...)?

It seems like this pattern was already in use in other places, but I think $(cmd) should have the same result (capturing output of the command to an environment variable) while being simpler to read and avoiding the use of eval.
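
A small sketch of the comparison, assuming a temporary pid file that holds the current shell's pid. The variable names here are illustrative and are not taken from the daemon scripts.

```shell
#!/usr/bin/env bash
# Illustrative comparison of $(eval "cmd") and $(cmd); not the PR's code.
pidfile=$(mktemp)
echo "$$" > "$pidfile"   # store this shell's own pid for the demonstration

pid_with_eval=$(eval "cat $pidfile")   # pattern used in the reviewed lines
pid_plain=$(cat "$pidfile")            # plain command substitution

if [[ -n "$pid_plain" ]]; then
    # ps prints the pid only if the process exists; the result is empty otherwise.
    running_pid=$(ps -p "$pid_plain" -o pid= 2>/dev/null || true)
fi

echo "eval: $pid_with_eval, plain: $pid_plain"
rm -f "$pidfile"
```

Both forms capture the same output here. The plain form also lets the file name be quoted, whereas `eval` parses its string a second time, which can misbehave if the path contains spaces or shell metacharacters.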

sirano11 (Contributor, Author) commented

@aluminous
I removed eval. Thanks.

aidan-kwon added this to the v1.8.0 milestone Jan 21, 2022
aidan-kwon added this to In progress in Interface via automation Jan 21, 2022
jeongkyun-oh previously approved these changes Jan 21, 2022
Interface automation moved this from In progress to Review in progress Jan 21, 2022
sirano11 merged commit 7620043 into klaytn:dev Feb 3, 2022
Interface automation moved this from Review in progress to Done Feb 3, 2022

9 participants