2.0.0
Changes made since version 1.23.2 prior to version 2.0.0:
Features
- Change base cuda image and use ansible for nccl-tests
- PR: #1967
- Change base Neubuntu image for all images
- PR: #1975
- feat: Adding node metrics to track node status in Slurm
- PR: #1927
- Support disabling controllers via flag or env
- PR: #1978
- Use base neubuntu image with ansible and resume pushing stable release images to the GitHub Docker registry
- PR: #1987
- Allow setting procMount
- PR: #1980
- SCHED-696: update umbrella chart for logging
- PR: #1991
- Move common-packages, repos and python roles to the base layers of images (repo ml-containers)
- PR: #1993
- SCHED-696: customize log headers
- PR: #2007
- SCHED-626 Removing slurm installation from images (use base image)
- PR: #2008
- SCHED-696: configure logs endpoint
- PR: #2028
- Use base images with Nebius apt snapshots
- PR: #2031
- SCHED-696: add attribute for service provider application
- PR: #2033
- SCHED-761 move openmpi role to the ml-containers repo
- PR: #2035
- SCHED-773 Move dcgmi, cuda and nccl-tests roles to the ml-containers
- PR: #2047
- Bump docker and nvtop
- PR: #2051
- Use base image for jail and active_checks with ansible roles for downloading binaries: nccl-tests, cuda-samples and mlc
- PR: #2062
- Use slurm_training_diag as base image for jail
- PR: #2064
- SCHED-864 Create sansible docker image for handling the jail state
- PR: #2108
- SCHED-906 remove outdated scripts
- PR: #2142
- bump nvtop
- PR: #2156
Fixes
- SCHED-567: Ensure deterministic startup order between DB, accounting and controller
- PR: #1918
- Fix k8up backup image repo and tag
- PR: #1966
- SLURMSUPPORT-75: add more unavailable node states to slurm exporter
- PR: #1988
- SCHED-690: removing exporter rb, sa, role from soperator to helm chart
- PR: #1986
- Fix pod monitor bug in renderer
- PR: #1992
- SCHED-785: Plug-in SPANK plugins properly in slurm job active checks
- PR: #2058
- SCHED-807: fix autohealing for nodesets
- PR: #2072
- Get correct environment for passive checks
- PR: #2070
- turn off dcgmi diag active checks
- PR: #2082
- increase default reconfiguration period
- PR: #2087
- Use base images without workdir /opt/ansible
- PR: #2084
- SCHED-855 add WorkingDir for activecheck container images
- PR: #2100
- do not always undrain node
- PR: #2113
- Use CLOUD nodes and make gres.conf configurable for NodeSets
- PR: #2119
- SCHED-898 Ignore non-draining checks in wait-for-active-checks job
- PR: #2132
- SCHED-885: change init containers order
- PR: #2133
- Add slurm script that does chmod a+rw for enroot image layers
- PR: #2138
- [SCHED-804] Deprecate and make optional slurmNodes.worker field in SlurmCluster CRD
- PR: #2141
- Use default nfs-in-k8s for e2e
- PR: #2152
Dependencies
- Bump golang.org/x/crypto from 0.45.0 to 0.46.0
- PR: #1919
- Bump github.com/onsi/gomega from 1.38.2 to 1.38.3
- PR: #1920
- Bump k8s.io/client-go from 0.34.2 to 0.34.3
- PR: #1930
- Bump k8s.io/component-base from 0.34.2 to 0.34.3
- PR: #1929
- Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.2 to 0.87.1
- PR: #1931
- Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
- PR: #1935
- Bump actions/upload-artifact from 5 to 6
- PR: #1934
- Bump actions/download-artifact from 6 to 7
- PR: #1933
- Bump filelock from 3.20.0 to 3.20.1 in /ansible in the pip group across 1 directory
- PR: #1946
- Bump actions/checkout from 4 to 6
- PR: #1994
- Bump actions/checkout from 4 to 6
- PR: #2023
Other
- Support customizing built-in Slurm scripts
- PR: #1915
- Make Helm chart soperator-activechecks customizable
- PR: #1954
- Fixes for issues with metrics and dashboards
- PR: #2061
- Add fsGroupChangePolicy: "OnRootMismatch" to NFS server StatefulSet
- PR: #2066
- NOTIC: Move status and resolution fields to log labels from body
- PR: #2125
Contributors:
@github-actions[bot], @dependabot[bot], @dstaroff, @Uburro, @ali-sattari, @asteny, @aaroniscode, @mateusclira-nv, @theyoprst, @andriishestakov, @itechdima, @rdjjke, @ChessProfessor
| Categorized PRs | Uncategorized PRs | Commits | Lines added | Lines deleted |
|---|---|---|---|---|
| 3985 | 358 | 250 | 41893 | 5554 |