new check: check_reboot_slurm #6

Open
wants to merge 1 commit into
base: dev

Conversation

martbhell
Contributor

Hi!

Another custom script we've been using for a while and it's been working quite nicely.

Tested with:

  • Scientific Linux 6.7 and slurm 2.6.7
  • CentOS7 and slurm 15.08.04

What it does:

  • Set a Slurm node to drained with reason=reboot, and NHC will:
    • reboot the node once it is drained
    • set it back to idle once it is back online
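
The behaviour listed above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code from the patch: the names `Action` and `decide_action` are hypothetical, and so is the `rebooted_since_drain` flag, since this thread does not say how the check detects that the reboot has already happened.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"      # nothing to do yet
    REBOOT = "reboot"  # node is drained and still needs its reboot
    RESUME = "resume"  # node has rebooted and can go back to idle


def decide_action(state: str, reason: str, rebooted_since_drain: bool) -> Action:
    """Choose the next step for a node, given its Slurm state and drain reason.

    `rebooted_since_drain` is a hypothetical input: how the real check detects
    a completed reboot is not described in this thread.
    """
    # Only nodes explicitly marked with reason=reboot are touched.
    if reason.strip().lower() != "reboot":
        return Action.WAIT
    # A node that is still draining has running jobs; wait until it is drained.
    if state.strip().lower() != "drained":
        return Action.WAIT
    return Action.RESUME if rebooted_since_drain else Action.REBOOT
```

A node that is still draining, or that was drained for some other reason, is left alone, so the sketch only ever acts on nodes an admin explicitly marked for reboot.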

@kcgthb

kcgthb commented Feb 25, 2016

Sounds useful, but unless I'm missing something, which is very possible, you can do pretty much the same thing natively in Slurm, with scontrol reboot_nodes [nodelist] and ReturnToService=2

@martbhell
Contributor Author

Nice, @kcgthb, I didn't know about that. One benefit of letting NHC do this is that the node must pass all the health checks before it's put back online. Using NHC also adds a bit of delay while waiting for NHC to run.
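
To make the health-check point concrete, here is a minimal sketch of that gating, assuming each health check can be modelled as a callable returning True on success. The function names are hypothetical, and using `scontrol update ... State=RESUME` to return the node to service is an assumption about how this would be done, not something taken from the patch.

```python
from typing import Callable, Iterable, List, Optional


def resume_command(node: str) -> List[str]:
    """Build the scontrol invocation that returns a drained node to service."""
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


def gate_return_to_service(
    node: str, checks: Iterable[Callable[[], bool]]
) -> Optional[List[str]]:
    """Return the resume command only if every health check passes.

    Returns None as soon as one check fails, so the node stays drained.
    """
    for check in checks:
        if not check():
            return None
    return resume_command(node)
```

The function only builds the command and leaves running it to the caller, which keeps the sketch free of side effects.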

@mej
Owner

mej commented Feb 25, 2016

As I mentioned yesterday during my talk at the HPCAC Stanford Conference, I have had an item for "rolling reboots" on my "TODO list" for some time now, having first discussed it with someone from Compute Canada at MoabCon back in 2012 or so. I don't want to limit it to use with SLURM, so clearly it follows that the name would have to change, but I definitely want to implement something like this.

I'll provide some additional feedback once I have a chance to really dig into your patch! Thanks as always for the submission! :-)

mej self-assigned this Feb 25, 2016
mej added this to the 1.4.3 Release milestone Feb 25, 2016
mej added this to Pending in NHC 1.4.3 Release Oct 30, 2018
@mej
Owner

mej commented Dec 29, 2018

Due to the delay caused by my changing jobs, I'm having to bump the merging of new checks to the 1.4.4 release to ensure that they have ample baking time without delaying the 1.4.3 release any further. Hope you understand! This is still very much planned for merging!

mej modified the milestones: 1.4.3 Release, 1.4.4 Release Dec 29, 2018
mej removed this from Pending in NHC 1.4.3 Release Dec 29, 2018
wpoely86 pushed a commit to wpoely86/nhc that referenced this pull request Mar 10, 2021
mej modified the milestones: 1.4.4 Release, 1.4.4 Release (new), 1.5 Release Apr 17, 2021
mej added this to Pending in NHC 1.4.4 Release Apr 18, 2021
mej removed this from Pending in NHC 1.4.4 Release Apr 18, 2021
mej added this to Pending in NHC 1.5 Release Apr 18, 2021
wpoely86 added a commit to wpoely86/nhc that referenced this pull request Oct 18, 2023
replace check on /var/spool with /local for Slurm 23 with job_container/tmpfs
Projects
NHC 1.5 Release
  
Triage / TODO
NHC 1.4.4 Release
Awaiting triage

3 participants