
Undrain nodes after user problems + rewrite passive checks on Python #1392

Merged

rdjjke merged 1 commit into main from undrain-nodes-after-user-problems/0 on Aug 12, 2025
Conversation

@rdjjke (Collaborator) commented on Aug 7, 2025

Passive checks (the Slurm Prolog, Epilog, and HealthCheckProgram scripts) now also run on drained nodes.
Nodes drained by the alloc_gpus_busy and boot_disk_full checks are automatically undrained once the user fixes the problem.

The logic for launching passive check commands is moved to a new Python script, check_runner.py, which accepts a JSON configuration file containing the settings for all checks.
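The exact schema of the configuration file is not shown on this page, so the following is only a hypothetical sketch of how a JSON-driven runner like check_runner.py might dispatch checks; the field names (`checks`, `stages`, `command`, `name`) are illustrative assumptions, not the PR's actual format.

```python
import json
import subprocess
from typing import List


def run_checks(config_path: str, stage: str) -> List[str]:
    """Run every check configured for `stage` (e.g. "prolog", "epilog",
    "hc_program") and return the names of the checks that failed."""
    with open(config_path) as f:
        config = json.load(f)

    failed = []
    for check in config.get("checks", []):
        if stage not in check.get("stages", []):
            continue
        # Each check is an external command; a non-zero exit code marks a failure.
        result = subprocess.run(check["command"], shell=True)
        if result.returncode != 0:
            failed.append(check["name"])
    return failed
```

A single configuration file like this lets Prolog, Epilog, and HealthCheckProgram share one entry point instead of duplicating shell logic per hook.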

Other changes:

  • Moved utility scripts (map_job_dcgm.sh, unmap_job_dcgm.sh, cleanup_enroot.sh) to the same check runner.
  • Nodes are drained with either a [node_problem] or [user_problem] prefix instead of the previous [HC].
  • Where possible, some checks now run outside the jail.
  • Added the option to set a comment on nodes instead of draining them.
  • Long-running calls (chroot, nvidia-smi, sinfo) are executed only once instead of several times.
  • Added recommendations for fixing [user_problem] issues to node drain reasons.
  • Fixed rare failures of the cleanup_enroot.sh script.
  • Fixed map_job_dcgm.sh script so it maps only allocated GPUs.
  • Added utility script /opt/soperator-utils/fs_usage.sh that prints usage for shared, local, or in-memory volumes.
  • Enabled Slurm RebootProgram and scontrol reboot command.
  • Fixed 10-system-info.sh SSH welcome message by printing FS usage correctly.
  • Added SSH welcome message to worker nodes (the same as on login nodes).
  • Changed active checks with eachWorkerJobArray: true so that they don't schedule jobs on reserved nodes.
  • Increased SlurmctldTimeout from 30 to 180 sec.
  • Added healthcheck for draining Slurm nodes that have available memory < allocated memory.
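The `scontrol update ... State=DRAIN/UNDRAIN` syntax used below is standard Slurm; the helper itself is a hypothetical sketch of how the new reason prefixes and the "fix recommendation" text from the list above might be attached to a drain. The function names and the `Fix:` format are illustrative assumptions.

```python
from typing import List


def drain_command(node: str, problem: str, user_problem: bool,
                  recommendation: str = "") -> List[str]:
    """Build an `scontrol` command that drains `node` with a categorized reason."""
    prefix = "[user_problem]" if user_problem else "[node_problem]"
    reason = f"{prefix} {problem}"
    if recommendation:
        # For user problems, the drain reason also tells the user how to fix it.
        reason += f" Fix: {recommendation}"
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
            f"Reason={reason}"]


def undrain_command(node: str) -> List[str]:
    """Build an `scontrol` command that returns `node` to service."""
    return ["scontrol", "update", f"NodeName={node}", "State=UNDRAIN"]
```

Keeping the category in the reason string is what lets a controller later decide, from the reason alone, whether a drained node is a candidate for automatic undraining.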

rdjjke added the feature label on Aug 7, 2025
rdjjke force-pushed the undrain-nodes-after-user-problems/0 branch 5 times, most recently from 8a8c162 to ec6c29d on August 8, 2025 at 16:02
theyoprst requested a review from Copilot on August 8, 2025 at 18:57
Copilot AI (Contributor) left a comment
Pull Request Overview

This PR refactors the Slurm passive health check system by consolidating multiple shell-based scripts into a new Python-based check runner. The system now automatically undrains nodes when user-fixable problems are resolved, and introduces clearer categorization of node problems.

  • Replaces duplicated shell scripting logic across prolog.sh, epilog.sh, and hc_program.sh with a centralized Python-based check_runner.py
  • Introduces automatic node undraining for user-fixable problems like disk space and GPU allocation issues
  • Changes drain reason prefixes from [HC] to [node_problem] and [user_problem] for better problem categorization
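The undraining logic lives in the Go controllers (e.g. slurm_nodes_controller.go); purely to illustrate the flow described in the bullets above, here is a hypothetical Python sketch in which `rerun_check` and `undrain` are injected callbacks rather than real Slurm calls.

```python
from typing import Callable, Dict, List


def reconcile_drained_nodes(drained: Dict[str, str],
                            rerun_check: Callable[[str], bool],
                            undrain: Callable[[str], None]) -> List[str]:
    """Re-check nodes drained for user problems; undrain the ones now fixed.

    `drained` maps node name -> drain reason. Only nodes whose reason carries
    the [user_problem] prefix are re-checked: a passing re-run means the user
    resolved the problem, so the node is returned to service."""
    undrained = []
    for node, reason in drained.items():
        if not reason.startswith("[user_problem]"):
            continue  # [node_problem] drains need operator action; skip them
        if rerun_check(node):
            undrain(node)
            undrained.append(node)
    return undrained
```

This is why the prefix change matters: [node_problem] and [user_problem] drains look identical to Slurm, but only the latter are safe to clear automatically.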

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

File Description
internal/render/common/configmap.go Updates Slurm configuration with increased timeout and reboot program
internal/controller/soperatorchecks/slurm_nodes_controller.go Updates health check reason parsing to use new [node_problem] prefix
internal/controller/soperatorchecks/activecheck_jobs_controller.go Updates active check failure reason to use new [node_problem] prefix
internal/consts/slurm.go Changes constant from [HC] to [node_problem] prefix
images/worker/supervisord_entrypoint.sh Links SSH message of the day scripts from jail to worker nodes
images/worker/slurmd.dockerfile Adds reboot script to the worker image
images/slurm_check_job/slurm_submit_array_job.sh Excludes RESERVED nodes from job array submissions
images/jail/scripts/reboot.sh New script for rebooting Kubernetes nodes via chroot
images/jail/scripts/fs_usage.sh New utility script for displaying filesystem usage
images/jail/motd/10-system-info Improves filesystem usage display in SSH welcome message
images/jail/jail.dockerfile Adds the new fs_usage.sh utility script
helm/slurm-cluster/templates/slurm-cluster-cr.yaml Changes health check to run on ANY node state instead of specific states
helm/slurm-cluster/slurm_scripts/unmap_job_dcgm.sh Fixes DCGM mapping to use SLURM_JOB_GPUS instead of CUDA_VISIBLE_DEVICES
helm/slurm-cluster/slurm_scripts/prolog.sh Replaces complex shell logic with Python check runner invocation
helm/slurm-cluster/slurm_scripts/map_job_dcgm.sh Fixes DCGM mapping to use SLURM_JOB_GPUS instead of CUDA_VISIBLE_DEVICES
helm/slurm-cluster/slurm_scripts/health_checker.sh Updates to use environment variables from check runner
helm/slurm-cluster/slurm_scripts/hc_program.sh Replaces complex shell logic with Python check runner invocation
helm/slurm-cluster/slurm_scripts/epilog.sh Replaces complex shell logic with Python check runner invocation
helm/slurm-cluster/slurm_scripts/cleanup_enroot.sh Improves error handling and fixes potential failures
helm/slurm-cluster/slurm_scripts/checks.json New JSON configuration file defining all health checks
helm/slurm-cluster/slurm_scripts/check_runner.py New Python script that executes health checks based on JSON configuration
helm/slurm-cluster/slurm_scripts/boot_disk_full.sh Simplifies disk usage check and adds user guidance for fixing issues
helm/slurm-cluster/slurm_scripts/alloc_gpus_busy.sh Improves GPU process detection and adds user guidance
helm/slurm-cluster/slurm_scripts/all_gpus_free.sh New script to check if all GPUs are free of processes
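fs_usage.sh itself is a shell script whose exact output format is not shown on this page; as a hedged illustration of the same idea, a standard-library Python equivalent of "print usage for a shared, local, or in-memory volume" could look like this. The output format is an assumption.

```python
import shutil


def fs_usage(path: str) -> str:
    """Return a one-line, human-readable usage summary for the filesystem
    containing `path` (works the same for shared, local, or tmpfs mounts)."""
    total, used, _free = shutil.disk_usage(path)
    gib = 1024 ** 3
    pct = (used / total * 100) if total else 0.0
    return f"{path}: {used / gib:.1f} GiB / {total / gib:.1f} GiB used ({pct:.0f}%)"
```

The same kind of summary is what the fixed 10-system-info message of the day prints for each mounted volume.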

rdjjke force-pushed the undrain-nodes-after-user-problems/0 branch from ec6c29d to 41c8143 on August 11, 2025 at 16:41
rdjjke requested a review from MeenaAlfons as a code owner on August 11, 2025 at 16:41
rdjjke force-pushed the undrain-nodes-after-user-problems/0 branch 7 times, most recently from 6110306 to 71c6d1c on August 11, 2025 at 17:58
asteny previously approved these changes on Aug 12, 2025
rdjjke force-pushed the undrain-nodes-after-user-problems/0 branch from 71c6d1c to 50876e1 on August 12, 2025 at 10:16
rdjjke merged commit ec230e0 into main on Aug 12, 2025
4 checks passed