SLURMSUPPORT-75: add more state unavailable node to slurm exporter#1988
SLURMSUPPORT-75: add more state unavailable node to slurm exporter#1988
Conversation
There was a problem hiding this comment.
Pull request overview
This PR enhances the Slurm exporter to better track unavailable node states by expanding the conditions that mark a node as unavailable.
Key Changes:
- Added two new node state detection methods:
IsNotRespondingState()andIsInvalidState() - Expanded the
isNodeUnavailable()logic to include NOT_RESPONDING, UNKNOWN, ERROR, FAIL, and INVALID states - Added three new metric labels:
state_is_not_responding,state_is_invalid, andis_unavailableto provide more detailed node state information
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/slurmapi/node.go | Added IsNotRespondingState() and IsInvalidState() methods; added FUTURE to base states list; updated documentation URL; removed unused valueOrDefault function |
| internal/exporter/collector.go | Added three new labels to slurm_node_info metric; expanded isNodeUnavailable() to check for additional unavailable states; reformatted metric descriptor definitions |
| internal/exporter/collector_test.go | Refactored tests to use flexible label matching via new assertMetricHasLabels() helper function instead of exact string matching to accommodate new labels |
| docs/slurm-exporter.md | Updated documentation to describe new state_is_not_responding, state_is_invalid, and is_unavailable labels |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func isNodeUnavailable(node slurmapi.Node) bool { | ||
| if node.IsDownState() { | ||
| return true | ||
| } | ||
| if node.IsNotRespondingState() { | ||
| return true | ||
| } | ||
| if node.BaseState() == api.V0041NodeStateIDLE && node.IsDrainState() { | ||
| return true | ||
| } | ||
| if node.BaseState() == api.V0041NodeStateUNKNOWN { | ||
| return true | ||
| } | ||
| if node.BaseState() == api.V0041NodeStateERROR { | ||
| return true | ||
| } | ||
| if node.IsFailState() { | ||
| return true | ||
| } | ||
| if node.IsInvalidState() { | ||
| return true | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
The test for isNodeUnavailable() doesn't include test cases for the newly added unavailability conditions: NOTRESPONDING state, UNKNOWN base state, ERROR base state, FAIL state, and INVALID state. Add test cases to ensure these conditions correctly identify nodes as unavailable.
| func (n *Node) IsNotRespondingState() bool { | ||
| _, exists := n.States[api.V0041NodeStateNOTRESPONDING] | ||
| return exists | ||
| } | ||
|
|
||
| func (n *Node) IsInvalidState() bool { | ||
| _, exists := n.States[api.V0041NodeStateINVALID] | ||
| return exists | ||
| } |
There was a problem hiding this comment.
The newly added IsNotRespondingState() and IsInvalidState() methods lack test coverage. Consider adding test cases to verify these methods correctly identify nodes in NOT_RESPONDING and INVALID states.
|
|
||
| // baseStates defines the mutually exclusive base states of a Slurm node. | ||
| // The node state is a 32-bit integer where the lowest 4 bits (0x0000000f) encode | ||
| // exactly 6 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN. |
There was a problem hiding this comment.
The comment states "exactly 6 mutually exclusive base states" but with the addition of FUTURE on line 185, there are now 7 base states. Update the comment to reflect this change.
| // exactly 6 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN. | |
| // exactly 7 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN, FUTURE. |
| | **slurm_node_info**<br>*Gauge* | Provides detailed information about SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `instance_id` - Kubernetes instance identifier<br>• `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• `state_is_drain` - Whether node is in drain state ("true"/"false")<br>• `state_is_maintenance` - Whether node is in maintenance state ("true"/"false")<br>• `state_is_reserved` - Whether node is in reserved state ("true"/"false")<br>• `state_is_completing` - Whether node is in completing state ("true" or empty)<br>• `state_is_fail` - Whether node is in fail state ("true" or empty)<br>• `state_is_planned` - Whether node is in planned state ("true" or empty)<br>• `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty)<br>• `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty)<br>• `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*), empty string otherwise<br>• `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters)<br>• `address` - IP address of the node<br>• `reason` - Reason for current node state (empty string if node has no reason set) | | ||
| | **slurm_node_gpu_seconds_total**<br>*Counter* | Total GPU seconds accumulated on SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag | | ||
| | **slurm_node_fails_total**<br>*Counter* | Total number of node state transitions to failed states (DOWN/DRAIN)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state at time of failure<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag<br>• `reason` - Reason for the node failure | | ||
| | **slurm_node_unavailability_duration_seconds**<br>*Histogram* | Duration of completed node unavailability events (DOWN+* or IDLE+DRAIN+*)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br><br>**Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy | |
There was a problem hiding this comment.
The documentation for the is_unavailable label states it is "true when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*)". However, the code in isNodeUnavailable() now includes additional conditions: NOTRESPONDING, UNKNOWN, ERROR, FAIL, and INVALID states. Update the documentation to accurately reflect all the conditions that make a node unavailable.
| | **slurm_node_info**<br>*Gauge* | Provides detailed information about SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `instance_id` - Kubernetes instance identifier<br>• `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• `state_is_drain` - Whether node is in drain state ("true"/"false")<br>• `state_is_maintenance` - Whether node is in maintenance state ("true"/"false")<br>• `state_is_reserved` - Whether node is in reserved state ("true"/"false")<br>• `state_is_completing` - Whether node is in completing state ("true" or empty)<br>• `state_is_fail` - Whether node is in fail state ("true" or empty)<br>• `state_is_planned` - Whether node is in planned state ("true" or empty)<br>• `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty)<br>• `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty)<br>• `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*), empty string otherwise<br>• `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters)<br>• `address` - IP address of the node<br>• `reason` - Reason for current node state (empty string if node has no reason set) | | |
| | **slurm_node_gpu_seconds_total**<br>*Counter* | Total GPU seconds accumulated on SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag | | |
| | **slurm_node_fails_total**<br>*Counter* | Total number of node state transitions to failed states (DOWN/DRAIN)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state at time of failure<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag<br>• `reason` - Reason for the node failure | | |
| | **slurm_node_unavailability_duration_seconds**<br>*Histogram* | Duration of completed node unavailability events (DOWN+* or IDLE+DRAIN+*)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br><br>**Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy | | |
| | **slurm_node_info**<br>*Gauge* | Provides detailed information about SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `instance_id` - Kubernetes instance identifier<br>• `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• `state_is_drain` - Whether node is in drain state ("true"/"false")<br>• `state_is_maintenance` - Whether node is in maintenance state ("true"/"false")<br>• `state_is_reserved` - Whether node is in reserved state ("true"/"false")<br>• `state_is_completing` - Whether node is in completing state ("true" or empty)<br>• `state_is_fail` - Whether node is in fail state ("true" or empty)<br>• `state_is_planned` - Whether node is in planned state ("true" or empty)<br>• `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty)<br>• `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty)<br>• `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (e.g., DOWN+*, IDLE+DRAIN+*, or when in NOTRESPONDING, UNKNOWN, ERROR, FAIL, or INVALID states), empty string otherwise<br>• `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters)<br>• `address` - IP address of the node<br>• `reason` - Reason for current node state (empty string if node has no reason set) | | |
| | **slurm_node_gpu_seconds_total**<br>*Counter* | Total GPU seconds accumulated on SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag | | |
| | **slurm_node_fails_total**<br>*Counter* | Total number of node state transitions to failed states (DOWN/DRAIN)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state at time of failure<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag<br>• `reason` - Reason for the node failure | | |
| | **slurm_node_unavailability_duration_seconds**<br>*Histogram* | Duration of completed node unavailability events as defined by the `is_unavailable` label (e.g., DOWN+*, IDLE+DRAIN+*, or NOTRESPONDING, UNKNOWN, ERROR, FAIL, INVALID states)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br><br>**Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy | |
| // isNodeUnavailable checks if a node is in unavailable state | ||
| // Unavailable state: DOWN+* or IDLE+DRAIN+* | ||
| // Unavailable state: DOWN+* or IDLE+DRAIN+* or NOTRESPONDING or UNKNOWN or ERROR or FAIL or INVALID |
There was a problem hiding this comment.
The updated comment describes the unavailable state conditions but doesn't match the actual implementation. The comment lists "NOTRESPONDING or UNKNOWN or ERROR or FAIL or INVALID" but the implementation checks for these in different ways - NOTRESPONDING and INVALID are checked via dedicated state methods, while UNKNOWN and ERROR are checked as base states, and FAIL is checked via IsFailState(). Consider making the comment more precise to reflect the actual checks being performed.
Problem
Some Slurm node states were not treated as unavailable by the Slurm exporter, leading to inaccurate monitoring and resource availability metrics.
Solution
The Slurm exporter logic was extended to include additional node states as unavailable, ensuring correct reporting in exported metrics.
Testing
All changes were validated through existing CI pipelines and checks, with no regressions detected.
Release Notes
Improved Slurm Exporter accuracy by expanding the set of node states reported as unavailable for monitoring and alerting