Add SLURM exporter documentation #1058

Merged
theyoprst merged 1 commit into dev from slurm-exporter-docs on Jun 24, 2025

@theyoprst (Collaborator)

Summary

This PR adds documentation for the SLURM Exporter, the component that exposes Prometheus metrics for SLURM cluster monitoring.

Changes

  • New Documentation: Created docs/slurm-exporter.md with complete metrics catalog
  • Metrics Coverage: Documents all 8 exported metrics including:
    • Node metrics (slurm_node_info, slurm_node_gpu_seconds_total, slurm_node_fails_total)
    • Job metrics (slurm_job_info, slurm_node_job)
    • Controller RPC metrics (slurm_controller_rpc_*, slurm_controller_server_thread_count)
  • Detailed Specifications: Each metric includes type, description, labels, and example output
  • Grafana Integration: References production dashboard from nebius-solutions-library

The documentation focuses on the metrics themselves and their practical usage for monitoring SLURM cluster health and performance.
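As a rough illustration of how the exported metrics might be consumed, the sketch below parses a scrape in the Prometheus text exposition format and sums sample values per `slurm_*` metric name. The sample payload, the `node_name` label, and the helper `sum_by_metric` are hypothetical and are not taken from docs/slurm-exporter.md; the parser also assumes samples carry no trailing timestamp.

```python
from collections import defaultdict

# Hypothetical scrape output in the Prometheus text exposition format.
# Metric names come from the PR description; the label name and all
# values are illustrative assumptions, not real exporter output.
SAMPLE = """\
# TYPE slurm_node_fails_total counter
slurm_node_fails_total{node_name="worker-0"} 2
slurm_node_fails_total{node_name="worker-1"} 1
# TYPE slurm_controller_server_thread_count gauge
slurm_controller_server_thread_count 3
"""


def sum_by_metric(text):
    """Sum sample values per metric name for metrics prefixed with 'slurm_'.

    Assumes each sample line ends with its value (no timestamp).
    """
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        # Drop the label set, if any, to get the bare metric name.
        name = series.split("{", 1)[0]
        if name.startswith("slurm_"):
            totals[name] += float(value)
    return dict(totals)


if __name__ == "__main__":
    for name, total in sorted(sum_by_metric(SAMPLE).items()):
        print(name, total)
```

In practice a Prometheus server and the referenced Grafana dashboard would do this aggregation; the snippet is only meant to show the shape of the data the documentation describes.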

Documents all exported metrics including node, job, and controller RPC metrics.
Includes metric types, labels, examples, and Grafana dashboard reference.
@theyoprst force-pushed the slurm-exporter-docs branch from 818d99c to d8cb762 on June 24, 2025 16:35
@theyoprst added the documentation label (Improvements or additions to documentation) on Jun 24, 2025
@theyoprst merged commit de23cce into dev on Jun 24, 2025
@asteny deleted the slurm-exporter-docs branch on June 25, 2025 08:19
