Soperator Exporter: add node metrics. #971

Merged: theyoprst merged 1 commit into dev from exporter-node-metrics on Jun 5, 2025
theyoprst (Collaborator) commented Jun 5, 2025

➜ curl localhost:8080/metrics
# HELP slurm_active_node_gpu_seconds_total Total GPU seconds on active Slurm nodes (not down, not idle+drain)
# TYPE slurm_active_node_gpu_seconds_total counter
slurm_active_node_gpu_seconds_total{node_name="worker-0"} 30.072787542
slurm_active_node_gpu_seconds_total{node_name="worker-1"} 51.23264987500001
# HELP slurm_node_fails_total Total number of times a node has failed (went from not down/drain to down/drain state)
# TYPE slurm_node_fails_total counter
slurm_node_fails_total{node_name="worker-0",reason="Scheduled maintenance"} 2
# HELP slurm_node_info Slurm node info
# TYPE slurm_node_info gauge
slurm_node_info{address="10.0.15.254",base_state="idle",compute_instance_id="computeinstance-xxx",is_drain="true",node_name="worker-0"} 1
slurm_node_info{address="10.0.63.138",base_state="idle",compute_instance_id="computeinstance-yyy",is_drain="false",node_name="worker-1"} 1
# HELP soperator_cluster_info Soperator cluster information
# TYPE soperator_cluster_info gauge
soperator_cluster_info{soperator_version="1.19.xxx"} 1
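
The HELP strings above describe two state-based rules: a node fail is counted when a node goes from not down/drain to down/drain, and GPU seconds accumulate only while a node is active (not down, not idle+drain). Below is a minimal Python sketch of those two rules, not the exporter's actual code. The names `NodeState`, `NodeCounters`, and `observe` are hypothetical, and two details are assumptions: that "down/drain" means "base state is down, or the drain flag is set", and that GPU seconds are computed as GPU count times elapsed seconds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of a node, mirroring the slurm_node_info labels."""
    base_state: str  # e.g. "idle", "allocated", "down"
    is_drain: bool


def is_failed(state: NodeState) -> bool:
    # Assumption: "down/drain" means base state "down" OR the drain flag set.
    return state.base_state == "down" or state.is_drain


def is_active(state: NodeState) -> bool:
    # From the HELP text: active means not down and not idle+drain.
    if state.base_state == "down":
        return False
    return not (state.base_state == "idle" and state.is_drain)


class NodeCounters:
    """Hypothetical in-memory counters keyed like the metrics' labels."""

    def __init__(self):
        self.fails = defaultdict(int)          # (node_name, reason) -> count
        self.gpu_seconds = defaultdict(float)  # node_name -> GPU seconds
        self._last = {}                        # node_name -> previous NodeState

    def observe(self, node, state, reason="", gpus=0, elapsed_seconds=0.0):
        prev = self._last.get(node)
        # Count a fail only on the transition into a down/drain state.
        if prev is not None and not is_failed(prev) and is_failed(state):
            self.fails[(node, reason)] += 1
        # Assumption: GPU seconds = GPU count * seconds since the last poll.
        if is_active(state):
            self.gpu_seconds[node] += gpus * elapsed_seconds
        self._last[node] = state
```

With this sketch, a node that stays in idle+drain across several polls contributes one fail and stops accumulating GPU seconds until it becomes active again.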

theyoprst force-pushed the exporter-node-metrics branch from e28acfc to 1c779eb on June 5, 2025 15:06
theyoprst marked this pull request as ready for review on June 5, 2025 15:45
theyoprst merged commit 18937b4 into dev on Jun 5, 2025
4 checks passed
asteny deleted the exporter-node-metrics branch on June 23, 2025 13:46