2.0.0

@github-actions github-actions released this 10 Feb 19:05
· 977 commits to main since this release
d07f06e

Changes made since version 1.23.2 prior to version 2.0.0:

🚀 Features

  • Change base cuda image and use ansible for nccl-tests
  • Change base Neubuntu image for all images
  • feat: Adding node metrics to track node status in Slurm
  • Support disabling controllers via flag or env
  • Use base neubuntu image with ansible and resume pushing stable release images to the GitHub Docker registry
  • Allow setting procMount
  • SCHED-696: update umbrella chart for logging
  • Move common-packages, repos and python roles to the base layers of images (repo ml-containers)
  • SCHED-696: customize log headers
  • SCHED-626 Removing slurm installation from images (use base image)
  • SCHED-696: configure logs endpoint
  • Use base images with Nebius apt snapshots
  • SCHED-696: add attribute for service provider application
  • SCHED-761 move openmpi role to the ml-containers repo
  • SCHED-773 Move dcgmi, cuda and nccl-tests roles to the ml-containers
  • Bump docker and nvtop
  • Use base image for jail and active_checks with ansible roles for downloading binaries: nccl-tests, cuda-samples and mlc
  • Use slurm_training_diag as base image for jail
  • SCHED-864 Create sansible docker image for handling the jail state
  • SCHED-906 remove outdated scripts
  • bump nvtop

🐛 Fixes

  • SCHED-567: Ensure deterministic startup order between DB, accounting and controller
  • Fix k8up backup image repo and tag
  • SLURMSUPPORT-75: add more unavailable node states to slurm exporter
  • SCHED-690: move exporter rb, sa, role from soperator to helm chart
  • Fix pod monitor bug in renderer
  • SCHED-785: Plug-in SPANK plugins properly in slurm job active checks
  • SCHED-807: fix autohealing for nodesets
  • Get correct environment for passive checks
  • turn off dcgmi diag active checks
  • increase default reconfiguration period
  • Use base images without workdir /opt/ansible
  • SCHED-855 add WorkingDir for activecheck container images
  • do not always undrain node
  • Use CLOUD nodes and make gres.conf configurable for NodeSets
  • SCHED-898 Ignore non-draining checks in wait-for-active-checks job
  • SCHED-885: change init containers order
  • Add slurm script that does chmod a+rw for enroot image layers
  • [SCHED-804] Deprecate and make optional slurmNodes.worker field in SlurmCluster CRD
  • Use default nfs-in-k8s for e2e

📦 Dependencies

  • Bump golang.org/x/crypto from 0.45.0 to 0.46.0
  • Bump github.com/onsi/gomega from 1.38.2 to 1.38.3
  • Bump k8s.io/client-go from 0.34.2 to 0.34.3
  • Bump k8s.io/component-base from 0.34.2 to 0.34.3
  • Bump github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring from 0.86.2 to 0.87.1
  • Bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4
  • Bump actions/upload-artifact from 5 to 6
  • Bump actions/download-artifact from 6 to 7
  • Bump filelock from 3.20.0 to 3.20.1 in /ansible in the pip group across 1 directory
  • Bump actions/checkout from 4 to 6

Other

  • Support customizing built-in Slurm scripts
  • Make Helm chart soperator-activechecks customizable
  • Fixes for issues with metrics and dashboards
  • Add fsGroupChangePolicy: "OnRootMismatch" to NFS server StatefulSet
  • NOTIC: Move status and resolution fields to log labels from body

Contributors:
@github-actions[bot], @dependabot[bot], @dstaroff, @Uburro, @ali-sattari, @asteny, @aaroniscode, @mateusclira-nv, @theyoprst, @andriishestakov, @itechdima, @rdjjke, @ChessProfessor

📝 Categorized PRs: 3985
📂 Uncategorized PRs: 358
📥 Commits: 250
➕ Lines added: 41893
➖ Lines deleted: 5554