
#710 Add reactions to active checks #998

Merged
ChessProfessor merged 9 commits into dev from issue-710/add-reactions-to-active-checks on Jun 17, 2025


@ChessProfessor
Collaborator

No description provided.

@theyoprst requested a review from Copilot on June 13, 2025 08:51

This comment was marked as outdated.

@ChessProfessor force-pushed the issue-710/add-reactions-to-active-checks branch from c2f7037 to 6dfefca on June 17, 2025 12:49
@ChessProfessor self-assigned this on Jun 17, 2025
@ChessProfessor requested a review from theyoprst on June 17, 2025 13:42
@theyoprst requested a review from Copilot on June 17, 2025 13:57
Contributor

Copilot AI left a comment


Pull Request Overview

This PR extends the ActiveCheck feature by adding configurable reactions (draining nodes or setting conditions) when SLURM checks fail and refactors the SLURM API client to return multiple jobs with richer metadata.

  • Refactor SLURM client: replace GetJobStatus with GetJobsByID, update Job struct (pointer fields, SubmitTime), error out on missing state, and add helper converters.
  • Implement ActiveCheck reactions: drain nodes or update conditions based on job failure states in the controller.
  • Extend CRDs and Helm charts: add reactions under ActiveCheck, introduce lastJobFailReasons, remove deprecated fields, and update deep-copy logic.
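The reaction behaviour summarised above can be sketched roughly as follows. This is a minimal illustration, not the controller code from the PR: the `Reactions` field names, the `Action` type, and `planReactions` are assumptions made for the example, and the real controller talks to the SLURM API and Kubernetes rather than returning a plan.

```go
package main

import "fmt"

// Reactions mirrors the idea of the reactions block on an ActiveCheck.
// Field names here are assumptions for illustration only.
type Reactions struct {
	DrainSlurmNode bool
	SetCondition   bool
}

// Action is a hypothetical description of one step the controller would take.
type Action struct {
	Node   string
	Kind   string // "drain" or "set-condition"
	Reason string
}

// planReactions lists the actions for every node of a failed job.
// A job that did not fail produces no actions.
func planReactions(r Reactions, failed bool, nodes []string, reason string) []Action {
	if !failed {
		return nil
	}
	var actions []Action
	for _, n := range nodes {
		if r.DrainSlurmNode {
			actions = append(actions, Action{Node: n, Kind: "drain", Reason: reason})
		}
		if r.SetCondition {
			actions = append(actions, Action{Node: n, Kind: "set-condition", Reason: reason})
		}
	}
	return actions
}

func main() {
	r := Reactions{DrainSlurmNode: true}
	nodes := []string{"worker-0", "worker-1"}
	for _, a := range planReactions(r, true, nodes, "active check failed") {
		fmt.Printf("%s %s (%s)\n", a.Kind, a.Node, a.Reason)
	}
}
```

Separating the decision (which actions apply) from the side effects keeps the failure handling easy to unit test without a cluster.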

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| internal/slurmapi/job_test.go | Update tests for pointer-based NodeCount, mandatory state, and new error cases |
| internal/slurmapi/job_status.go | Remove obsolete JobStatus type and helpers |
| internal/slurmapi/job.go | Expand Job model, add state validation, converters, and terminal-state helpers |
| internal/slurmapi/interface.go | Rename GetJobStatus → GetJobsByID in the client interface |
| internal/slurmapi/fake/mock_client.go | Update mock methods and types to match GetJobsByID |
| internal/slurmapi/client.go | Implement GetJobsByID, loop through API jobs, drop single-job assumption |
| internal/controller/soperatorchecks/slurm_nodes_controller.go | Add TODO for handling active-check failures |
| internal/controller/soperatorchecks/activecheck_jobs_controller.go | Apply reactions (drain/set condition), compute aggregated job status |
| internal/consts/slurm.go | Add SlurmNodeReasonActiveCheckFailed constant |
| internal/consts/conditions.go | Add ActiveCheckSlurmJobStatus enum for job outcomes |
| internal/consts/activecheck.go | Add SlurmNodeReasonActiveCheckFailedUnknown constant |
| helm/soperator/crds/slurmcluster-crd.yaml | Add drainSlurmNode flag and lastJobFailReasons to CRD |
| helm/soperator-crds/templates/slurmcluster-crd.yaml | Template changes mirroring CRD schema updates |
| helm/soperator-activechecks/values.yaml | Introduce reactions.setCondition and drainSlurmNode values |
| helm/soperator-activechecks/templates/activecheck.yaml | Render reactions under ActiveCheck spec |
| config/crd/bases/slurm.nebius.ai_activechecks.yaml | Base CRD updates for reactions and job-failure fields |
| api/v1alpha1/zz_generated.deepcopy.go | Deep-copy logic updated for LastJobFailReasons |
| api/v1alpha1/activecheck_types.go | CRD type updates: Reactions, ActiveCheckSlurmJobsStatus |

Comments suppressed due to low confidence (6)

helm/soperator/crds/slurmcluster-crd.yaml:7208

  • This CRD block defines lastJobName twice, which will produce invalid YAML. Remove the duplicate key.
                  lastJobName:

api/v1alpha1/activecheck_types.go:159

  • The code now sets LastJobId as a string, but the CRD schema still expects an integer. This mismatch will break validation—update the CRD or revert the type change.
	LastJobId string `json:"lastJobId"`

internal/slurmapi/interface.go:13

  • [nitpick] GetJobsByID returns multiple jobs for a single ID string; consider renaming to ListJobsByID or GetArrayJobs to clarify its purpose.
	GetJobsByID(ctx context.Context, jobID string) ([]Job, error)

internal/slurmapi/job_test.go:80

  • [nitpick] New fields like SubmitTime and pointer-based NodeCount lack dedicated tests. Consider adding cases that verify their conversion logic.
	for _, tt := range tests {

internal/slurmapi/job.go:48

  • You're calling fmt.Errorf in this function but fmt is not imported in the file. Please add import "fmt" to the import block.
	job.State = string((*apiJob.JobState)[0])

internal/controller/soperatorchecks/activecheck_jobs_controller.go:193

  • There is no GetNodeList method on Job; did you mean to call GetRequiredNodeList (or implement GetNodeList)?
					nodes, err := slurmJob.GetNodeList()

@ChessProfessor merged commit cb9f416 into dev on Jun 17, 2025
4 checks passed
@ChessProfessor deleted the issue-710/add-reactions-to-active-checks branch on June 17, 2025 14:30
@asteny added the feature label on Jun 25, 2025
4 participants