Undrain nodes after user problems + rewrite passive checks on Python#1392
Merged
asteny reviewed Aug 7, 2025
Pull Request Overview
This PR refactors the Slurm passive health check system by consolidating multiple shell-based scripts into a new Python-based check runner. The system now automatically undrains nodes when user-fixable problems are resolved, and introduces clearer categorization of node problems.
- Replaces duplicated shell scripting logic across `prolog.sh`, `epilog.sh`, and `hc_program.sh` with a centralized Python-based `check_runner.py`
- Introduces automatic node undraining for user-fixable problems like disk space and GPU allocation issues
- Changes drain reason prefixes from `[HC]` to `[node_problem]` and `[user_problem]` for better problem categorization
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/render/common/configmap.go | Updates Slurm configuration with increased timeout and reboot program |
| internal/controller/soperatorchecks/slurm_nodes_controller.go | Updates health check reason parsing to use new [node_problem] prefix |
| internal/controller/soperatorchecks/activecheck_jobs_controller.go | Updates active check failure reason to use new [node_problem] prefix |
| internal/consts/slurm.go | Changes constant from [HC] to [node_problem] prefix |
| images/worker/supervisord_entrypoint.sh | Links SSH message of the day scripts from jail to worker nodes |
| images/worker/slurmd.dockerfile | Adds reboot script to the worker image |
| images/slurm_check_job/slurm_submit_array_job.sh | Excludes RESERVED nodes from job array submissions |
| images/jail/scripts/reboot.sh | New script for rebooting Kubernetes nodes via chroot |
| images/jail/scripts/fs_usage.sh | New utility script for displaying filesystem usage |
| images/jail/motd/10-system-info | Improves filesystem usage display in SSH welcome message |
| images/jail/jail.dockerfile | Adds the new fs_usage.sh utility script |
| helm/slurm-cluster/templates/slurm-cluster-cr.yaml | Changes health check to run on ANY node state instead of specific states |
| helm/slurm-cluster/slurm_scripts/unmap_job_dcgm.sh | Fixes DCGM mapping to use SLURM_JOB_GPUS instead of CUDA_VISIBLE_DEVICES |
| helm/slurm-cluster/slurm_scripts/prolog.sh | Replaces complex shell logic with Python check runner invocation |
| helm/slurm-cluster/slurm_scripts/map_job_dcgm.sh | Fixes DCGM mapping to use SLURM_JOB_GPUS instead of CUDA_VISIBLE_DEVICES |
| helm/slurm-cluster/slurm_scripts/health_checker.sh | Updates to use environment variables from check runner |
| helm/slurm-cluster/slurm_scripts/hc_program.sh | Replaces complex shell logic with Python check runner invocation |
| helm/slurm-cluster/slurm_scripts/epilog.sh | Replaces complex shell logic with Python check runner invocation |
| helm/slurm-cluster/slurm_scripts/cleanup_enroot.sh | Improves error handling and fixes potential failures |
| helm/slurm-cluster/slurm_scripts/checks.json | New JSON configuration file defining all health checks |
| helm/slurm-cluster/slurm_scripts/check_runner.py | New Python script that executes health checks based on JSON configuration |
| helm/slurm-cluster/slurm_scripts/boot_disk_full.sh | Simplifies disk usage check and adds user guidance for fixing issues |
| helm/slurm-cluster/slurm_scripts/alloc_gpus_busy.sh | Improves GPU process detection and adds user guidance |
| helm/slurm-cluster/slurm_scripts/all_gpus_free.sh | New script to check if all GPUs are free of processes |
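The core pattern of a JSON-driven check runner like `check_runner.py` can be sketched roughly as follows. The config schema, field names (`checks`, `script`, `timeout_sec`), and function signature are assumptions for illustration, not the PR's actual implementation:

```python
# Minimal sketch of a JSON-driven check runner (assumed schema, not the
# actual check_runner.py from this PR).
import json
import subprocess

def run_checks(config_text: str, run=subprocess.run) -> list[str]:
    """Run each configured check script; return the names of failed checks."""
    config = json.loads(config_text)
    failed = []
    for check in config["checks"]:
        result = run(
            [check["script"]],
            capture_output=True,
            timeout=check.get("timeout_sec", 60),  # assumed default
        )
        if result.returncode != 0:
            failed.append(check["name"])
    return failed
```

Centralizing this loop in one script is what lets `prolog.sh`, `epilog.sh`, and `hc_program.sh` shrink to a single invocation of the runner, with per-check behavior defined once in `checks.json`.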
asteny reviewed Aug 12, 2025
asteny previously approved these changes Aug 12, 2025
asteny approved these changes Aug 12, 2025
Now, passive checks (Slurm `Prolog`, `Epilog`, and `HealthCheckProgram` scripts) also run on drained nodes. Nodes drained by the `alloc_gpus_busy` and `boot_disk_full` checks are automatically undrained if the user fixes the problem. The logic for launching passive check commands is moved to a new Python script, `check_runner.py`, which accepts a JSON configuration file containing the settings for all checks.

Other changes:
- Moved other scripts (`map_job_dcgm.sh`, `unmap_job_dcgm.sh`, `cleanup_enroot.sh`) to the same check runner.
- Drain reasons now use the `[node_problem]` or `[user_problem]` prefix instead of the previous `[HC]`.
- Expensive commands (`chroot`, `nvidia-smi`, `sinfo`) are executed only once instead of several times.
- Added user guidance for fixing `[user_problem]` issues to node drain reasons.
- Fixed the `enroot_cleanup.sh` script.
- Fixed the `map_job_dcgm.sh` script so it maps only allocated GPUs.
- Added a new utility `/opt/soperator-utils/fs_usage.sh` that prints usage for shared, local, or in-memory volumes.
- Added support for `RebootProgram` and the `scontrol reboot` command.
- Fixed the `10-system-info.sh` SSH welcome message by printing FS usage correctly.
- Fixed `eachWorkerJobArray: true` active checks so that they don't schedule jobs for reserved nodes.
- Increased `SlurmctldTimeout` from 30 to 180 sec.
map_job_dcgm.sh,unmap_job_dcgm.sh,cleanup_enroot.sh) to the same check runner.[node_problem]or[user_problem]prefix instead the previous[HC].chroot,nvidia-smi,sinfo) are executed only once instead of several times.[user_problem]issues to node drain reasons.enroot_cleanup.shscript.map_job_dcgm.shscript so it maps only allocated GPUs./opt/soperator-utils/fs_usage.shthat prints usage for shared, local, or in-memory volumes.RebootProgramandscontrol rebootcommand.10-system-info.shSSH welcome message by printing FS usage correctly.eachWorkerJobArray: trueactive checks so that they don't schedule jobs for reserved nodes.SlurmctldTimeoutfrom 30 to 180 sec.