4.0.0

github-actions released this 26 May 14:09
· 272 commits to main since this release
9b04017

Changes made since version 3.0.4 prior to version 4.0.0:

🚀 Features

  • SCHED-795, SCHED-797: TaskProlog should check for recursive CPU bindings
  • SCHED-1024 SSSD for LDAP integration
  • SCHED-1210 use cuda_force_upgrade for upgrading CUDA version and upgr…
  • SCHED-1228 Create active check with force options to upgrade cuda, nccl-tests, etc
  • SCHED-1231 Do not collect diff and logs from sensitive files like .bashrc
  • SCHED-856: Implement PersistentPodState CRD to ensure pods always schedule on the same node
  • SCHED-1347: Add customizable liveness and readiness probe templates for all soperator CRDs and cluster components via Helm values
  • SCHED-1079 First iteration of e2e using Godog
  • SCHED-1260: add the initial number of powered up nodes
  • SCHED-1295 Cluster Creation Acceptance
  • SCHED-1320 Make Internal SSH test up to date
  • SCHED-1302 Add RUN_UNSTABLE_TESTS flag
  • SCHED-1080 Use Nebius docker registry proxies for public images
  • SCHED-1298 Enroot container test
  • SCHED-1297 SCHED-1298 Docker container test
  • SCHED-1669: Cherry-pick Slurm controller metrics to release-4.0
  • Remove ClusterType (refactoring)

🐛 Fixes

  • SCHED-1204: Revert task prolog feature (PR #2316)
  • fix: handle unregistered node in scontrol update during worker init
  • fix: avoid nil pointer to empty string for ProcMount in slurmd contai…
  • SCHED-951: e2e jail upload should not fail on semicolons
  • SCHED-1232 fix Ansible warning "ansible_facts["fact_name"]"
  • SCHED-1056 run slurm-divert twice
  • runAfterCreation: false for manage-jail-state-force
  • SCHED-1272: hostUsers to activechecks
  • fix custom envs in nodesets
  • SCHED-1229: Automatically undrain nodes after pod_ephemeral_storage check
  • SCHED-1206 Do not set-unhealthy to instances assigned after drain time
  • SCHED-1402: Requeue when populate jail job exists but has not completed yet
  • SCHED-1429 Fix activecheck_jobs_controller skipping unfinished jobs
  • SCHED-1372 pin libcublas-dev-13-0 package version
  • SCHED-1372 add force option for upgrading nccl-tests
  • SCHED-1471: Allow initialNumberEphemeralNodes to be set to 0
  • SCHED-1464: Gate otel-collector jail-logs on soperator-outputs creation
  • SCHED-1471: helm/nodesets: render initialNumberEphemeralNodes when set to 0
  • SCHED-1389 Bind-mount SSSD sockets if they exist to the jail
  • SCHED-1498 upgrade mocks for libnvidia-compute and create mock for libnvidia-ml1 and libnvidia-ml.so.1
  • SCHED-1654 Change activeDeadlineSeconds for manage-jail-state checks
  • remove [node_problem] prefix for nvme health check
  • SCHED-1660 Bind-mount libdummy not only on login nodes but also on CPU-only workers in GPU clusters
  • Fix populate_jail_entrypoint for NFS
  • remove if clusterType statement in Slurm healthCheckConfig for cpu clusters

📦 Dependencies

  • Bump dorny/paths-filter from 3 to 4
  • Bump softprops/action-gh-release from 2.5.0 to 2.6.1
  • Bump actions/upload-artifact from 6 to 7
  • Bump google.golang.org/grpc from 1.72.1 to 1.79.3 in the go_modules group across 1 directory
  • Bump cryptography from 46.0.5 to 46.0.6 in /ansible in the pip group across 1 directory

📔 Docs

  • Add Readme for SSSD integration

Other

  • Merge to soperator release 4.0 from/pr 2429/sched 1347/1
  • Grafana dashboards: new panels and bug fixes
  • New dashboard: GPU stats

Contributors:
@theyoprst, @github-actions[bot], @dependabot[bot], @faucct, @asteny, @ivaravko, @Uburro, @rdjjke, @itechdima, @ChessProfessor, @ali-sattari

📁 Categorized PRs 📂 Uncategorized PRs 📥 Commits Lines added Lines deleted
4059 178 310 29065 6470