Release 1.21.9 #1248

Merged
Uburro merged 29 commits into main from dev
Jul 16, 2025

@Uburro Uburro commented Jul 16, 2025

No description provided.

Uburro and others added 29 commits July 10, 2025 17:44
* [fix] bug with dependency kruise
Return bash health checks and fix HC program node states
* Fix active checks for manual runs

* PR fixes
Fix worker startup race conditions with controller readiness check
* set default slurm path to jail #1221

* fix
* add jailConfigPath for sconfigcontroller #1229
NOTICE: Run bash healthchecks only on 8-GPU nodes
* fix incorrect CPU limit for rebooter

* Update internal/render/nodeconfigurator/container.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ecks/0

1203: Fix Slurm nodes stuck in PLANNED when eachWorkerJobArray active…
- Add UserID field to Job struct to access user IDs from Slurm API
- Add user_id label to job_info metric descriptor alongside existing labels
- Include numeric user_id value in job_info metric output
- Update tests to include UserID field and expect user_id label in metrics
- Update documentation to describe new user_id label

Example metric:
slurm_job_info{job_id="12345",user_name="researcher",user_id="1000",...} 1
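As a rough illustration of the change described above, the sketch below shows how a job record carrying a numeric user ID could end up as a `user_id` label next to `job_id` and `user_name`. It uses only the standard library and renders the metric line by hand; the `Job` fields, the `label` type, and the `jobInfoLabels` and `renderJobInfo` helpers are hypothetical names chosen for this example, not the exporter's actual code, and the real `job_info` metric carries further labels that are elided in the example metric.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Job is a hypothetical, trimmed-down job record. Only the UserID field
// mirrors the change described in the commit message; the rest is assumed.
type Job struct {
	ID       int32
	UserName string
	UserID   int32
}

// label is a hypothetical name/value pair used to keep label order stable.
type label struct{ name, value string }

// jobInfoLabels returns the labels for one job, with the numeric user ID
// rendered as a decimal string next to the existing user_name label.
func jobInfoLabels(j Job) []label {
	return []label{
		{"job_id", strconv.FormatInt(int64(j.ID), 10)},
		{"user_name", j.UserName},
		{"user_id", strconv.FormatInt(int64(j.UserID), 10)},
	}
}

// renderJobInfo formats a single text-exposition style line with value 1.
// %q is used as a simple stand-in for label value escaping.
func renderJobInfo(j Job) string {
	var b strings.Builder
	b.WriteString("slurm_job_info{")
	for i, l := range jobInfoLabels(j) {
		if i > 0 {
			b.WriteByte(',')
		}
		fmt.Fprintf(&b, "%s=%q", l.name, l.value)
	}
	b.WriteString("} 1")
	return b.String()
}

func main() {
	// Prints: slurm_job_info{job_id="12345",user_name="researcher",user_id="1000"} 1
	fmt.Println(renderJobInfo(Job{ID: 12345, UserName: "researcher", UserID: 1000}))
}
```

Keeping `user_id` as a separate label means a job can still be attributed to a user when the name lookup returns an empty `user_name`.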
…b_info metric

REST container now mounts the jail volume read-only and links passwd files to enable
username resolution in Slurm REST API. This fixes empty user_name labels in job_info metric.
Fix empty user_name labels in job_info metric
Fix alloc_gpus_busy Prolog check for non-exclusive GPU jobs
…h files atomically (#1239)

* Catch error from Close in FileStore.Add

Previously, an error from Close was silently ignored; now it propagates to the reconciler and triggers an error requeue.

* Replace fmt.Sprintf with filepath.Join

* Replace direct writes in FileStore.Add with temp file + atomic rename

* Use hardcoded mode in FileStore.Add

* Wait for directory entry cache invalidation
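The pattern named in these commits (temp file plus atomic rename, with the error from Close returned instead of dropped) can be sketched roughly as follows. This is a minimal standard-library illustration, not the actual `FileStore.Add`: `writeFileAtomic`, `demo`, the `0o644` mode, and the `slurm.conf` file name are assumptions made for the example, and the wait for directory entry cache invalidation is not covered.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic is a hypothetical helper: it writes data to a temp file in
// the target directory and renames it over the final name, so readers see
// either the old content or the new content, never a partial write.
func writeFileAtomic(dir, name string, data []byte) (err error) {
	target := filepath.Join(dir, name) // filepath.Join instead of fmt.Sprintf

	tmp, err := os.CreateTemp(dir, name+".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	tmpName := tmp.Name()
	// Remove the temp file if any later step fails.
	defer func() {
		if err != nil {
			_ = os.Remove(tmpName)
		}
	}()

	if _, err = tmp.Write(data); err != nil {
		_ = tmp.Close()
		return fmt.Errorf("write temp file: %w", err)
	}
	// The Close error is returned to the caller rather than ignored.
	if err = tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	// Hardcoded mode (assumed value for this sketch).
	if err = os.Chmod(tmpName, 0o644); err != nil {
		return fmt.Errorf("chmod temp file: %w", err)
	}
	if err = os.Rename(tmpName, target); err != nil {
		return fmt.Errorf("rename temp file: %w", err)
	}
	return nil
}

// demo writes the same file twice in a scratch directory and returns the
// final content, to show that the second write replaces the first.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "filestore-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	for _, v := range []string{"v1", "v2"} {
		if err := writeFileAtomic(dir, "slurm.conf", []byte(v)); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(filepath.Join(dir, "slurm.conf"))
	if err != nil {
		return "", err
	}
	return string(got), nil
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Println(got)
}
```

Creating the temp file in the same directory as the target keeps the rename on one filesystem, which is what makes the replacement atomic on POSIX systems.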

---------

Co-authored-by: Mikhail Cheshkov <mcheshkov@nebius.com>
@Uburro Uburro added the release label (PR for the release of new version) Jul 16, 2025
@Uburro Uburro merged commit 4a0855b into main Jul 16, 2025
4 checks passed
Labels

release PR for the release of new version

6 participants