#710 Add reactions to active checks#998
Merged
ChessProfessor merged 9 commits intodevfrom Jun 17, 2025
Merged
Conversation
theyoprst
reviewed
Jun 13, 2025
c2f7037 to
6dfefca
Compare
Contributor
There was a problem hiding this comment.
Pull Request Overview
This PR extends the ActiveCheck feature by adding configurable reactions (draining nodes or setting conditions) when SLURM checks fail and refactors the SLURM API client to return multiple jobs with richer metadata.
- Refactor SLURM client: replace
GetJobStatuswithGetJobsByID, updateJobstruct (pointer fields,SubmitTime), error out on missing state, and add helper converters. - Implement ActiveCheck reactions: drain nodes or update conditions based on job failure states in the controller.
- Extend CRDs and Helm charts: add
reactionsunderActiveCheck, introducelastJobFailReasons, remove deprecated fields, and update deep-copy logic.
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| internal/slurmapi/job_test.go | Update tests for pointer-based NodeCount, mandatory state, and new error cases |
| internal/slurmapi/job_status.go | Remove obsolete JobStatus type and helpers |
| internal/slurmapi/job.go | Expand Job model, add state validation, converters, and terminal-state helpers |
| internal/slurmapi/interface.go | Rename GetJobStatus → GetJobsByID in the client interface |
| internal/slurmapi/fake/mock_client.go | Update mock methods and types to match GetJobsByID |
| internal/slurmapi/client.go | Implement GetJobsByID, loop through API jobs, drop single-job assumption |
| internal/controller/soperatorchecks/slurm_nodes_controller.go | Add TODO for handling active-check failures |
| internal/controller/soperatorchecks/activecheck_jobs_controller.go | Apply reactions (drain/set condition), compute aggregated job status |
| internal/consts/slurm.go | Add SlurmNodeReasonActiveCheckFailed constant |
| internal/consts/conditions.go | Add ActiveCheckSlurmJobStatus enum for job outcomes |
| internal/consts/activecheck.go | Add SlurmNodeReasonActiveCheckFailedUnknown constant |
| helm/soperator/crds/slurmcluster-crd.yaml | Add drainSlurmNode flag and lastJobFailReasons to CRD |
| helm/soperator-crds/templates/slurmcluster-crd.yaml | Template changes mirroring CRD schema updates |
| helm/soperator-activechecks/values.yaml | Introduce reactions.setCondition and drainSlurmNode values |
| helm/soperator-activechecks/templates/activecheck.yaml | Render reactions under ActiveCheck spec |
| config/crd/bases/slurm.nebius.ai_activechecks.yaml | Base CRD updates for reactions and job-failure fields |
| api/v1alpha1/zz_generated.deepcopy.go | Deep-copy logic updated for LastJobFailReasons |
| api/v1alpha1/activecheck_types.go | CRD type updates: Reactions, ActiveCheckSlurmJobsStatus |
Comments suppressed due to low confidence (6)
helm/soperator/crds/slurmcluster-crd.yaml:7208
- This CRD block defines
lastJobNametwice, which will produce invalid YAML. Remove the duplicate key.
lastJobName:
api/v1alpha1/activecheck_types.go:159
- The code now sets
LastJobIdas a string, but the CRD schema still expects an integer. This mismatch will break validation—update the CRD or revert the type change.
LastJobId string `json:"lastJobId"`
internal/slurmapi/interface.go:13
- [nitpick]
GetJobsByIDreturns multiple jobs for a single ID string; consider renaming toListJobsByIDorGetArrayJobsto clarify its purpose.
GetJobsByID(ctx context.Context, jobID string) ([]Job, error)
internal/slurmapi/job_test.go:80
- [nitpick] New fields like
SubmitTimeand pointer-basedNodeCountlack dedicated tests. Consider adding cases that verify their conversion logic.
for _, tt := range tests {
internal/slurmapi/job.go:48
- You're calling fmt.Errorf in this function but
fmtis not imported in the file. Please addimport "fmt"to the import block.
job.State = string((*apiJob.JobState)[0])
internal/controller/soperatorchecks/activecheck_jobs_controller.go:193
- There is no
GetNodeListmethod onJob; did you mean to callGetRequiredNodeList(or implementGetNodeList)?
nodes, err := slurmJob.GetNodeList()
theyoprst
approved these changes
Jun 17, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.