4.0.0
Changes made since version 3.0.4 prior to version 4.0.0:
🚀 Features
- SCHED-795, SCHED-797: TaskProlog should check for recursive CPU bindings
- PR: #2316
- SCHED-1024 SSSD for LDAP integration
- PR: #2295
- SCHED-1210 use cuda_force_upgrade for upgrading CUDA version and upgr…
- PR: #2355
- SCHED-1228 Create active check with force options to upgrade cuda, nccl-tests, etc
- PR: #2359
- SCHED-1231 Do not collect diff and logs from sensitive files like .bashrc
- PR: #2361
- SCHED-856: Implement PersistentPodState CRD to ensure pods always schedule on the same node
- PR: #2362
- SCHED-1347: Add customizable liveness and readiness probe templates for all soperator CRDs and cluster components via Helm values
- PR: #2401
- SCHED-1079 First iteration of e2e using Godog
- PR: #2277
- SCHED-1260: add the initial number of powered up nodes
- PR: #2406
- SCHED-1295 Cluster Creation Acceptance
- PR: #2428
- SCHED-1320 Make Internal SSH test up to date
- PR: #2443
- SCHED-1302 Add RUN_UNSTABLE_TESTS flag
- PR: #2447
- SCHED-1080 Use Nebius docker registry proxies for public images
- PR: #2469
- SCHED-1298 Enroot container test
- PR: #2473
- SCHED-1297 SCHED-1298 Docker container test
- PR: #2448
- SCHED-1669: Cherry-pick Slurm controller metrics to release-4.0
- PR: #2521
- Remove ClusterType (refactoring)
- PR: #2547
🐛 Fixes
- SCHED-1204: Revert task prolog feature (PR #2316)
- PR: #2349
- fix: handle unregistered node in scontrol update during worker init
- PR: #2330
- fix: avoid nil pointer to empty string for ProcMount in slurmd contai…
- PR: #2329
- SCHED-951: e2e jail upload should not fail on semicolons
- PR: #2358
- SCHED-1232 fix Ansible warning "ansible_facts["fact_name"]"
- PR: #2363
- SCHED-1056 run slurm-divert twice
- PR: #2365
- runAfterCreation: false for manage-jail-state-force
- PR: #2369
- SCHED-1272: hostUsers to activechecks
- PR: #2364
- fix custom envs in nodesets
- PR: #2387
- SCHED-1229: Automatically undrain nodes after pod_ephemeral_storage check
- PR: #2399
- SCHED-1206 Do not set-unhealthy to instances assigned after drain time
- PR: #2397
- SCHED-1402: Requeue when populate jail job exists but has not completed yet
- PR: #2421
- SCHED-1429 Fix activecheck_jobs_controller skipping unfinished jobs
- PR: #2423
- SCHED-1372 pin libcublas-dev-13-0 package version
- PR: #2430
- SCHED-1372 add force option for upgrading nccl-tests
- PR: #2441
- SCHED-1471: Allow initialNumberEphemeralNodes to be set to 0
- PR: #2453
- SCHED-1464: Gate otel-collector jail-logs on soperator-outputs creation
- PR: #2455
- SCHED-1471: helm/nodesets: render initialNumberEphemeralNodes when set to 0
- PR: #2458
- SCHED-1389 Bind-mount SSSD sockets if they exist to the jail
- PR: #2461
- SCHED-1498 upgrade mocks for libnvidia-compute and create mock for libnvidia-ml1 and libnvidia-ml.so.1
- PR: #2505
- SCHED-1654 Change activeDeadlineSeconds for manage-jail-state checks
- PR: #2511
- remove [node_problem] prefix for nvme health check
- PR: #2529
- remove [node_problem] prefix for nvme health check
- PR: #2534
- SCHED-1660 Bind-mount libdummy not only on login nodes but also on CPU-only workers in GPU clusters
- PR: #2538
- Fix populate_jail_entrypoint for NFS
- PR: #2542
- remove if clusterType statement in Slurm healthCheckConfig for cpu clusters
- PR: #2543
📦 Dependencies
- Bump dorny/paths-filter from 3 to 4
- PR: #2320
- Bump softprops/action-gh-release from 2.5.0 to 2.6.1
- PR: #2327
- Bump actions/upload-artifact from 6 to 7
- PR: #2303
- Bump google.golang.org/grpc from 1.72.1 to 1.79.3 in the go_modules group across 1 directory
- PR: #2335
- Bump cryptography from 46.0.5 to 46.0.6 in /ansible in the pip group across 1 directory
- PR: #2367
📔 Docs
- Add Readme for SSSD integration
- PR: #2345
Other
- Merge to soperator release 4.0 from/pr 2429/sched 1347/1
- PR: #2436
- Grafana dashboards: new panels and bug fixes
- PR: #2418
- New dashboard: GPU stats
- PR: #2477
Contributors:
@theyoprst, @github-actions[bot], @dependabot[bot], @faucct, @asteny, @ivaravko, @Uburro, @rdjjke, @itechdima, @ChessProfessor, @ali-sattari
| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
|---|---|---|---|---|
| 4059 | 178 | 310 | 29065 | 6470 |