
SLURMSUPPORT-75: add more state unavailable node to slurm exporter #1988

Merged
Uburro merged 3 commits into main from SLURMSUPPORT-75/0 on Jan 6, 2026

Conversation

Uburro (Collaborator) commented Jan 6, 2026

Problem

Some Slurm node states were not treated as unavailable by the Slurm exporter, leading to inaccurate monitoring and resource availability metrics.

Solution

The Slurm exporter logic was extended to include additional node states as unavailable, ensuring correct reporting in exported metrics.

Testing

All changes were validated through existing CI pipelines and checks, with no regressions detected.

Release Notes

Improved Slurm Exporter accuracy by expanding the set of node states reported as unavailable for monitoring and alerting.

Uburro added the fix label Jan 6, 2026
Uburro requested a review from theyoprst January 6, 2026 14:47
Uburro requested a review from Copilot January 6, 2026 16:41
Uburro merged commit 9cbe6dd into main Jan 6, 2026
13 checks passed
Uburro deleted the SLURMSUPPORT-75/0 branch January 6, 2026 16:47

Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances the Slurm exporter to better track unavailable node states by expanding the conditions that mark a node as unavailable.

Key Changes:

  • Added two new node state detection methods: IsNotRespondingState() and IsInvalidState()
  • Expanded the isNodeUnavailable() logic to include NOT_RESPONDING, UNKNOWN, ERROR, FAIL, and INVALID states
  • Added three new metric labels: state_is_not_responding, state_is_invalid, and is_unavailable to provide more detailed node state information

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| internal/slurmapi/node.go | Added IsNotRespondingState() and IsInvalidState() methods; added FUTURE to base states list; updated documentation URL; removed unused valueOrDefault function |
| internal/exporter/collector.go | Added three new labels to slurm_node_info metric; expanded isNodeUnavailable() to check for additional unavailable states; reformatted metric descriptor definitions |
| internal/exporter/collector_test.go | Refactored tests to use flexible label matching via new assertMetricHasLabels() helper function instead of exact string matching to accommodate new labels |
| docs/slurm-exporter.md | Updated documentation to describe new state_is_not_responding, state_is_invalid, and is_unavailable labels |
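
For orientation, here is a minimal sketch of how the three new label values might be derived. It assumes the "true" or empty-string convention that the docs change describes for these labels. The `nodeFlags` type and the `boolLabel` and `newLabels` helpers are hypothetical names for illustration and are not taken from the exporter's code.

```go
package main

import "fmt"

// boolLabel is a hypothetical helper: it renders a flag as "true" when set
// and as an empty string otherwise, matching the convention in the docs.
func boolLabel(set bool) string {
	if set {
		return "true"
	}
	return ""
}

// nodeFlags is a stand-in for the state information the exporter reads.
type nodeFlags struct {
	NotResponding bool
	Invalid       bool
	Unavailable   bool
}

// newLabels returns only the three labels introduced by this PR.
func newLabels(f nodeFlags) map[string]string {
	return map[string]string{
		"state_is_not_responding": boolLabel(f.NotResponding),
		"state_is_invalid":        boolLabel(f.Invalid),
		"is_unavailable":          boolLabel(f.Unavailable),
	}
}

func main() {
	labels := newLabels(nodeFlags{NotResponding: true, Unavailable: true})
	for _, name := range []string{"state_is_not_responding", "state_is_invalid", "is_unavailable"} {
		fmt.Printf("%s=%q\n", name, labels[name])
	}
}
```

Note that an empty label value and a missing label are equivalent in Prometheus, so dashboards should match on `="true"` rather than on `="false"` for these labels.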

Comment on lines 139 to 162
func isNodeUnavailable(node slurmapi.Node) bool {
	if node.IsDownState() {
		return true
	}
	if node.IsNotRespondingState() {
		return true
	}
	if node.BaseState() == api.V0041NodeStateIDLE && node.IsDrainState() {
		return true
	}
	if node.BaseState() == api.V0041NodeStateUNKNOWN {
		return true
	}
	if node.BaseState() == api.V0041NodeStateERROR {
		return true
	}
	if node.IsFailState() {
		return true
	}
	if node.IsInvalidState() {
		return true
	}
	return false
}

Copilot AI Jan 6, 2026

The test for isNodeUnavailable() doesn't cover the newly added unavailability conditions: the NOT_RESPONDING state, the UNKNOWN base state, the ERROR base state, the FAIL state, and the INVALID state. Add test cases to ensure these conditions correctly identify nodes as unavailable.
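
One possible shape for those cases is a table-driven check, sketched below. To stay self-contained it does not use `slurmapi.Node` or the `api.V0041NodeState*` constants. The `Node` struct, the `newNode` constructor, and the plain-string state names are stand-ins, and DOWN is modelled as a base state, which is an assumption about how `IsDownState()` behaves.

```go
package main

import "fmt"

// Node is a stand-in for slurmapi.Node; plain strings replace the
// api.V0041NodeState* constants so the sketch compiles on its own.
type Node struct {
	Base  string          // base state, e.g. "IDLE"
	Flags map[string]bool // additional state flags, e.g. "DRAIN"
}

// newNode is a hypothetical constructor used only by this sketch.
func newNode(base string, flags ...string) Node {
	n := Node{Base: base, Flags: map[string]bool{}}
	for _, f := range flags {
		n.Flags[f] = true
	}
	return n
}

// isNodeUnavailable mirrors the conditions in the diff above.
// Assumption: DOWN is modelled as a base state here.
func isNodeUnavailable(n Node) bool {
	switch n.Base {
	case "DOWN", "UNKNOWN", "ERROR":
		return true
	case "IDLE":
		if n.Flags["DRAIN"] {
			return true
		}
	}
	return n.Flags["NOT_RESPONDING"] || n.Flags["FAIL"] || n.Flags["INVALID"]
}

func main() {
	cases := []struct {
		name string
		node Node
		want bool
	}{
		{"idle", newNode("IDLE"), false},
		{"idle+drain", newNode("IDLE", "DRAIN"), true},
		{"allocated+drain", newNode("ALLOCATED", "DRAIN"), false},
		{"allocated+not_responding", newNode("ALLOCATED", "NOT_RESPONDING"), true},
		{"unknown", newNode("UNKNOWN"), true},
		{"error", newNode("ERROR"), true},
		{"mixed+fail", newNode("MIXED", "FAIL"), true},
		{"idle+invalid", newNode("IDLE", "INVALID"), true},
	}
	for _, c := range cases {
		got := isNodeUnavailable(c.node)
		fmt.Printf("%-26s got=%-5t want=%-5t match=%t\n", c.name, got, c.want, got == c.want)
	}
}
```

In the real test file the same table could feed `t.Run` subtests, with each new condition paired with an otherwise healthy base state so that a regression in a single branch is visible.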

Copilot uses AI. Check for mistakes.
Comment on lines +158 to +166
func (n *Node) IsNotRespondingState() bool {
	_, exists := n.States[api.V0041NodeStateNOTRESPONDING]
	return exists
}

func (n *Node) IsInvalidState() bool {
	_, exists := n.States[api.V0041NodeStateINVALID]
	return exists
}

Copilot AI Jan 6, 2026

The newly added IsNotRespondingState() and IsInvalidState() methods lack test coverage. Consider adding test cases to verify these methods correctly identify nodes in NOT_RESPONDING and INVALID states.
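
A rough sketch of such coverage follows, assuming `States` is a set keyed by state name. The `NodeState` type and its string values are placeholders for the generated `api.V0041NodeState*` constants. The nil-map case is worth including because a lookup in a nil Go map is safe and reports the key as absent.

```go
package main

import "fmt"

// NodeState and its values are placeholders for the generated API constants.
type NodeState string

const (
	StateDrain         NodeState = "DRAIN"
	StateNotResponding NodeState = "NOT_RESPONDING"
	StateInvalid       NodeState = "INVALID"
)

// Node keeps its states as a set, as the methods under review imply.
type Node struct {
	States map[NodeState]struct{}
}

func (n *Node) IsNotRespondingState() bool {
	_, exists := n.States[StateNotResponding]
	return exists
}

func (n *Node) IsInvalidState() bool {
	_, exists := n.States[StateInvalid]
	return exists
}

func main() {
	draining := &Node{States: map[NodeState]struct{}{StateDrain: {}}}
	lost := &Node{States: map[NodeState]struct{}{StateNotResponding: {}}}
	broken := &Node{States: map[NodeState]struct{}{StateInvalid: {}, StateDrain: {}}}
	empty := &Node{} // nil map: lookups do not panic

	for name, n := range map[string]*Node{"draining": draining, "lost": lost, "broken": broken, "empty": empty} {
		fmt.Printf("%s: notResponding=%t invalid=%t\n", name, n.IsNotRespondingState(), n.IsInvalidState())
	}
}
```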

// baseStates defines the mutually exclusive base states of a Slurm node.
// The node state is a 32-bit integer where the lowest 4 bits (0x0000000f) encode
// exactly 6 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN.

Copilot AI Jan 6, 2026

The comment states "exactly 6 mutually exclusive base states" but with the addition of FUTURE on line 185, there are now 7 base states. Update the comment to reflect this change.

Suggested change
// exactly 7 mutually exclusive base states: IDLE, DOWN, ALLOCATED, ERROR, MIXED, UNKNOWN, FUTURE.
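
The layout that comment describes can be illustrated with a small sketch that splits a state word into its base state and its flag bits. Only the `0x0000000f` mask comes from the comment. The numeric base-state values and the `flagDrain` bit position are illustrative assumptions, not values copied from Slurm.

```go
package main

import "fmt"

// baseMask selects the lowest 4 bits, which hold the base state.
const baseMask uint32 = 0x0000000f

// Illustrative base-state values (assumed, not taken from Slurm headers).
const (
	baseUnknown uint32 = iota
	baseDown
	baseIdle
	baseAllocated
	baseError
	baseMixed
	baseFuture
)

// flagDrain is an illustrative flag bit located above the base-state bits.
const flagDrain uint32 = 1 << 9

// splitState separates the mutually exclusive base state from the flags.
func splitState(state uint32) (base, flags uint32) {
	return state & baseMask, state &^ baseMask
}

func main() {
	state := baseIdle | flagDrain
	base, flags := splitState(state)
	fmt.Printf("base=%d drain=%t\n", base, flags&flagDrain != 0)
}
```

Because the base state occupies a fixed 4-bit field, a node has exactly one base state at a time, while any number of flags can be set alongside it.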

Comment on lines +68 to 71
| **slurm_node_info**<br>*Gauge* | Provides detailed information about SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `instance_id` - Kubernetes instance identifier<br>• `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• `state_is_drain` - Whether node is in drain state ("true"/"false")<br>• `state_is_maintenance` - Whether node is in maintenance state ("true"/"false")<br>• `state_is_reserved` - Whether node is in reserved state ("true"/"false")<br>• `state_is_completing` - Whether node is in completing state ("true" or empty)<br>• `state_is_fail` - Whether node is in fail state ("true" or empty)<br>• `state_is_planned` - Whether node is in planned state ("true" or empty)<br>• `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty)<br>• `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty)<br>• `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*), empty string otherwise<br>• `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters)<br>• `address` - IP address of the node<br>• `reason` - Reason for current node state (empty string if node has no reason set) |
| **slurm_node_gpu_seconds_total**<br>*Counter* | Total GPU seconds accumulated on SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag |
| **slurm_node_fails_total**<br>*Counter* | Total number of node state transitions to failed states (DOWN/DRAIN)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state at time of failure<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag<br>• `reason` - Reason for the node failure |
| **slurm_node_unavailability_duration_seconds**<br>*Histogram* | Duration of completed node unavailability events (DOWN+* or IDLE+DRAIN+*)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br><br>**Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy |

Copilot AI Jan 6, 2026

The documentation for the is_unavailable label states that it is "true" when the node is considered unavailable (DOWN+* or IDLE+DRAIN+*). However, isNodeUnavailable() now also checks the NOT_RESPONDING, UNKNOWN, ERROR, FAIL, and INVALID states. Update the documentation to reflect every condition that makes a node unavailable.

Suggested change
| **slurm_node_info**<br>*Gauge* | Provides detailed information about SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `instance_id` - Kubernetes instance identifier<br>• `state_base` - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• `state_is_drain` - Whether node is in drain state ("true"/"false")<br>• `state_is_maintenance` - Whether node is in maintenance state ("true"/"false")<br>• `state_is_reserved` - Whether node is in reserved state ("true"/"false")<br>• `state_is_completing` - Whether node is in completing state ("true" or empty)<br>• `state_is_fail` - Whether node is in fail state ("true" or empty)<br>• `state_is_planned` - Whether node is in planned state ("true" or empty)<br>• `state_is_not_responding` - Whether the node is marked as not responding ("true" or empty)<br>• `state_is_invalid` - Whether the node state is considered invalid by SLURM ("true" or empty)<br>• `is_unavailable` - Computed by the exporter: "true" when the node is considered unavailable (e.g., DOWN+*, IDLE+DRAIN+*, or when in NOTRESPONDING, UNKNOWN, ERROR, FAIL, or INVALID states), empty string otherwise<br>• `reservation_name` - Reservation that currently includes the node (trimmed to 50 characters)<br>• `address` - IP address of the node<br>• `reason` - Reason for current node state (empty string if node has no reason set) |
| **slurm_node_gpu_seconds_total**<br>*Counter* | Total GPU seconds accumulated on SLURM nodes<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag |
| **slurm_node_fails_total**<br>*Counter* | Total number of node state transitions to failed states (DOWN/DRAIN)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br>• `state_base` - Base node state at time of failure<br>• `state_is_drain` - Drain state flag<br>• `state_is_maintenance` - Maintenance state flag<br>• `state_is_reserved` - Reserved state flag<br>• `reason` - Reason for the node failure |
| **slurm_node_unavailability_duration_seconds**<br>*Histogram* | Duration of completed node unavailability events as defined by the `is_unavailable` label (e.g., DOWN+*, IDLE+DRAIN+*, or NOTRESPONDING, UNKNOWN, ERROR, FAIL, INVALID states)<br><br>**Labels:**<br>• `node_name` - Name of the SLURM node<br><br>**Note:** Observations are recorded when unavailability events complete. Duration tracking is reset on exporter restarts, which may affect accuracy |

Comment on lines 137 to +138
// isNodeUnavailable checks if a node is in unavailable state
- // Unavailable state: DOWN+* or IDLE+DRAIN+*
+ // Unavailable state: DOWN+* or IDLE+DRAIN+* or NOTRESPONDING or UNKNOWN or ERROR or FAIL or INVALID

Copilot AI Jan 6, 2026

The updated comment describes the unavailable state conditions but doesn't match the actual implementation. The comment lists "NOTRESPONDING or UNKNOWN or ERROR or FAIL or INVALID" but the implementation checks for these in different ways - NOTRESPONDING and INVALID are checked via dedicated state methods, while UNKNOWN and ERROR are checked as base states, and FAIL is checked via IsFailState(). Consider making the comment more precise to reflect the actual checks being performed.

theyoprst pushed a commit that referenced this pull request Jan 8, 2026
…1988)

* SLURMSUPPORT-75: add more state unavailable node to slurm exporter

(cherry picked from commit 9cbe6dd)