Enable NHC to handle Slurm boot node state #83
Closed
Add `boot` node state to the node online and offline scripts. Properly handle `scontrol reboot asap` so Slurm doesn't erroneously online a node after the first NHC call after a reboot.
For context, see #81 and https://bugs.schedmd.com/show_bug.cgi?id=6391
`scontrol reboot asap` will set the node state to `REBOOT+DRAIN` and the reason to `Reboot ASAP`. Then, after boot, and after NHC runs once, Slurm will set the node base state to `IDLE`. If reason == `Reboot ASAP`, Slurm will also clear the `DRAIN` flag. We want NHC to clear the `DRAIN` flag, not Slurm, so delete the `Reboot ASAP` reason by not preserving it below. See https://slurm.schedmd.com/scontrol.html --> reboot.
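The "not preserving" rule can be illustrated with a minimal sketch. This is not the code from this PR: the function name `reason_to_preserve` is hypothetical, and it assumes the reason string is exactly `Reboot ASAP`, as described above. It only shows the idea of carrying forward any existing reason except the one Slurm uses as its cue to clear `DRAIN`.

```shell
# Hypothetical helper (not part of NHC): given a node's existing reason,
# print the text that should be carried forward when the reason is rewritten.
# Assumption: Slurm's reason string is exactly "Reboot ASAP".
reason_to_preserve() {
    old_reason=$1
    case "$old_reason" in
        "Reboot ASAP")
            # Drop this reason so Slurm does not clear DRAIN on its own;
            # print nothing.
            ;;
        *)
            # Any other reason (for example an admin note) is kept as-is.
            printf '%s' "$old_reason"
            ;;
    esac
}
```

With this sketch, `reason_to_preserve "Reboot ASAP"` prints nothing, while `reason_to_preserve "bad disk"` prints the reason unchanged, so only the reboot marker is discarded.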