SLURMSUPPORT-75: add more state unavailable node to slurm exporter by Uburro · Pull Request #1988 · nebius/soperator

Uburro · 2026-01-06T14:42:14Z

Problem

Some Slurm node states were not treated as unavailable by the Slurm exporter, leading to inaccurate monitoring and resource availability metrics.

Solution

The Slurm exporter logic was extended to include additional node states as unavailable, ensuring correct reporting in exported metrics.

Testing

All changes were validated through existing CI pipelines and checks, with no regressions detected.

Release Notes

Improved Slurm Exporter accuracy by expanding the set of node states reported as unavailable for monitoring and alerting

Copilot

Pull request overview

This PR enhances the Slurm exporter to better track unavailable node states by expanding the conditions that mark a node as unavailable.

Key Changes:

Added two new node state detection methods: IsNotRespondingState() and IsInvalidState()
Expanded the isNodeUnavailable() logic to include NOT_RESPONDING, UNKNOWN, ERROR, FAIL, and INVALID states
Added three new metric labels: state_is_not_responding, state_is_invalid, and is_unavailable to provide more detailed node state information

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File	Description
internal/slurmapi/node.go	Added `IsNotRespondingState()` and `IsInvalidState()` methods; added FUTURE to base states list; updated documentation URL; removed unused `valueOrDefault` function
internal/exporter/collector.go	Added three new labels to `slurm_node_info` metric; expanded `isNodeUnavailable()` to check for additional unavailable states; reformatted metric descriptor definitions
internal/exporter/collector_test.go	Refactored tests to use flexible label matching via new `assertMetricHasLabels()` helper function instead of exact string matching to accommodate new labels
docs/slurm-exporter.md	Updated documentation to describe new `state_is_not_responding`, `state_is_invalid`, and `is_unavailable` labels

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-06T16:48:37Z

internal/exporter/collector.go

 func isNodeUnavailable(node slurmapi.Node) bool {
 	if node.IsDownState() {
 		return true
 	}
+	if node.IsNotRespondingState() {
+		return true
+	}
 	if node.BaseState() == api.V0041NodeStateIDLE && node.IsDrainState() {
 		return true
 	}
+	if node.BaseState() == api.V0041NodeStateUNKNOWN {
+		return true
+	}
+	if node.BaseState() == api.V0041NodeStateERROR {
+		return true
+	}
+	if node.IsFailState() {
+		return true
+	}
+	if node.IsInvalidState() {
+		return true
+	}
 	return false
 }


The test for isNodeUnavailable() doesn't include test cases for the newly added unavailability conditions: NOTRESPONDING state, UNKNOWN base state, ERROR base state, FAIL state, and INVALID state. Add test cases to ensure these conditions correctly identify nodes as unavailable.

Copilot · 2026-01-06T16:48:38Z

internal/slurmapi/node.go

+func (n *Node) IsNotRespondingState() bool {
+	_, exists := n.States[api.V0041NodeStateNOTRESPONDING]
+	return exists
+}
+
+func (n *Node) IsInvalidState() bool {
+	_, exists := n.States[api.V0041NodeStateINVALID]
+	return exists
+}


The newly added IsNotRespondingState() and IsInvalidState() methods lack test coverage. Consider adding test cases to verify these methods correctly identify nodes in NOT_RESPONDING and INVALID states.

Copilot · 2026-01-06T16:48:38Z

internal/slurmapi/node.go

+
 // baseStates defines the mutually exclusive base states of a Slurm node.
 // The node state is a 32-bit integer where the lowest 4 bits (0x0000000f) encode
 // exactly 6 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN.


The comment states "exactly 6 mutually exclusive base states" but with the addition of FUTURE on line 185, there are now 7 base states. Update the comment to reflect this change.

Suggested change

// exactly 6 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN.

// exactly 7 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN, FUTURE.

Copilot · 2026-01-06T16:48:38Z

docs/slurm-exporter.md

+| **slurm_node_info**<br>*Gauge* | Provides detailed information about SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `instance_id` - Kubernetes instance identifier<br>• `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• `state_is_drain` - Whether node is in drain state ("true"/"false")<br>• `state_is_maintenance` - Whether node is in maintenance state ("true"/"false")<br>• `state_is_reserved` - Whether node is in reserved state ("true"/"false")<br>• `state_is_completing` - Whether node is in completing state ("true" or empty)<br>• `state_is_fail` - Whether node is in fail state ("true" or empty)<br>• `state_is_planned` - Whether node is in planned state ("true" or empty)<br>• `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty)<br>• `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty)<br>• `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*), empty string otherwise<br>• `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters)<br>• `address` - IP address of the node<br>• `reason` - Reason for current node state (empty string if node has no reason set) |
 | **slurm_node_gpu_seconds_total**<br>*Counter* | Total GPU seconds accumulated on SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag |
 | **slurm_node_fails_total**<br>*Counter* | Total number of node state transitions to failed states (DOWN/DRAIN)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state at time of failure<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag<br>• `reason` - Reason for the node failure |
 | **slurm_node_unavailability_duration_seconds**<br>*Histogram* | Duration of completed node unavailability events (DOWN+* or IDLE+DRAIN+*)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br><br>**Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy |


The documentation for the is_unavailable label states it is "true when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*)". However, the code in isNodeUnavailable() now includes additional conditions: NOTRESPONDING, UNKNOWN, ERROR, FAIL, and INVALID states. Update the documentation to accurately reflect all the conditions that make a node unavailable.

Suggested change

| **slurm_node_info** *Gauge* | Provides detailed information about SLURM nodes **Labels:** • `node_name` - Name of the SLURM node • `instance_id` - Kubernetes instance identifier • `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN) • `state_is_drain` - Whether node is in drain state ("true"/"false") • `state_is_maintenance` - Whether node is in maintenance state ("true"/"false") • `state_is_reserved` - Whether node is in reserved state ("true"/"false") • `state_is_completing` - Whether node is in completing state ("true" or empty) • `state_is_fail` - Whether node is in fail state ("true" or empty) • `state_is_planned` - Whether node is in planned state ("true" or empty) • `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty) • `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty) • `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*), empty string otherwise • `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters) • `address` - IP address of the node • `reason` - Reason for current node state (empty string if node has no reason set) |

| **slurm_node_gpu_seconds_total** *Counter* | Total GPU seconds accumulated on SLURM nodes **Labels:** • `node_name` - Name of the SLURM node • `state_base` - Base node state • `state_is_drain` - Drain state flag • `state_is_maintenance` - Maintenance state flag • `state_is_reserved` - Reserved state flag |

| **slurm_node_fails_total** *Counter* | Total number of node state transitions to failed states (DOWN/DRAIN) **Labels:** • `node_name` - Name of the SLURM node • `state_base` - Base node state at time of failure • `state_is_drain` - Drain state flag • `state_is_maintenance` - Maintenance state flag • `state_is_reserved` - Reserved state flag • `reason` - Reason for the node failure |

| **slurm_node_unavailability_duration_seconds** *Histogram* | Duration of completed node unavailability events (DOWN+* or IDLE+DRAIN+*) **Labels:** • `node_name` - Name of the SLURM node **Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy |

| **slurm_node_info** *Gauge* | Provides detailed information about SLURM nodes **Labels:** • `node_name` - Name of the SLURM node • `instance_id` - Kubernetes instance identifier • `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN) • `state_is_drain` - Whether node is in drain state ("true"/"false") • `state_is_maintenance` - Whether node is in maintenance state ("true"/"false") • `state_is_reserved` - Whether node is in reserved state ("true"/"false") • `state_is_completing` - Whether node is in completing state ("true" or empty) • `state_is_fail` - Whether node is in fail state ("true" or empty) • `state_is_planned` - Whether node is in planned state ("true" or empty) • `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty) • `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty) • `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (e.g., DOWN+*, IDLE+DRAIN+*, or when in NOTRESPONDING, UNKNOWN, ERROR, FAIL, or INVALID states), empty string otherwise • `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters) • `address` - IP address of the node • `reason` - Reason for current node state (empty string if node has no reason set) |

| **slurm_node_gpu_seconds_total** *Counter* | Total GPU seconds accumulated on SLURM nodes **Labels:** • `node_name` - Name of the SLURM node • `state_base` - Base node state • `state_is_drain` - Drain state flag • `state_is_maintenance` - Maintenance state flag • `state_is_reserved` - Reserved state flag |

| **slurm_node_fails_total** *Counter* | Total number of node state transitions to failed states (DOWN/DRAIN) **Labels:** • `node_name` - Name of the SLURM node • `state_base` - Base node state at time of failure • `state_is_drain` - Drain state flag • `state_is_maintenance` - Maintenance state flag • `state_is_reserved` - Reserved state flag • `reason` - Reason for the node failure |

| **slurm_node_unavailability_duration_seconds** *Histogram* | Duration of completed node unavailability events as defined by the `is_unavailable` label (e.g., DOWN+*, IDLE+DRAIN+*, or NOTRESPONDING, UNKNOWN, ERROR, FAIL, INVALID states) **Labels:** • `node_name` - Name of the SLURM node **Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy |

Copilot · 2026-01-06T16:48:39Z

internal/exporter/collector.go

 // isNodeUnavailable checks if a node is in unavailable state
-// Unavailable state: DOWN+* or IDLE+DRAIN+*
+// Unavailable state: DOWN+* or IDLE+DRAIN+* or NOTRESPONDING or UNKNOWN or ERROR or FAIL or INVALID


The updated comment describes the unavailable state conditions but doesn't match the actual implementation. The comment lists "NOTRESPONDING or UNKNOWN or ERROR or FAIL or INVALID" but the implementation checks for these in different ways - NOTRESPONDING and INVALID are checked via dedicated state methods, while UNKNOWN and ERROR are checked as base states, and FAIL is checked via IsFailState(). Consider making the comment more precise to reflect the actual checks being performed.

…1988) * SLURMSUPPORT-75: add more state unavailable node to slurm exporter (cherry picked from commit 9cbe6dd)

SLURMSUPPORT-75: add more state unavailable node to slurm exporter

4275057

Uburro added the fix label Jan 6, 2026

Uburro requested a review from theyoprst January 6, 2026 14:47

Uburro added 2 commits January 6, 2026 17:19

coule of updates

a8b8a19

fix make lint

264e6c5

Uburro requested a review from Copilot January 6, 2026 16:41

theyoprst approved these changes Jan 6, 2026

View reviewed changes

Copilot started reviewing on behalf of Uburro January 6, 2026 16:42 View session

Uburro merged commit 9cbe6dd into main Jan 6, 2026
13 checks passed

Uburro deleted the SLURMSUPPORT-75/0 branch January 6, 2026 16:47

Copilot AI reviewed Jan 6, 2026

View reviewed changes

theyoprst pushed a commit that referenced this pull request Jan 8, 2026

SLURMSUPPORT-75: add more state unavailable node to slurm exporter (#…

7ed11eb

…1988) * SLURMSUPPORT-75: add more state unavailable node to slurm exporter (cherry picked from commit 9cbe6dd)

This was referenced Jan 8, 2026

SCHED-706: Cherry pick to/soperator release 1.23/from/main/1 #1998

Merged

Cherry pick to/soperator release 1.23/from/main/2 #2000

Merged

github-actions bot mentioned this pull request Jan 9, 2026

Merge to main: Cherry pick to/soperator release 1.23/from/main/2 #2004

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SLURMSUPPORT-75: add more state unavailable node to slurm exporter#1988

SLURMSUPPORT-75: add more state unavailable node to slurm exporter#1988
Uburro merged 3 commits intomainfrom
SLURMSUPPORT-75/0

Uburro commented Jan 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 6, 2026

Uh oh!

Copilot AI Jan 6, 2026

Uh oh!

Copilot AI Jan 6, 2026

Uh oh!

Copilot AI Jan 6, 2026

Uh oh!

Copilot AI Jan 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	// exactly 6 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN.
	// exactly 7 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN, FUTURE.

Conversation

Uburro commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Testing

Release Notes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uburro commented Jan 6, 2026 •

edited

Loading