# NixOS-CUDA

CI/CD infrastructure, including NixOS configurations for Hydra and the builders. This is not an official NixOS project.
The purpose of this system is to improve the maintainability of hardware-accelerated (specifically CUDA) software in Nixpkgs. Sustainable maintenance and development of Nixpkgs CUDA requires both a comprehensive test suite run on a schedule, for retroactive detection of breakage, and a lighter on-push test suite, for notifying contributors early and preventing regressions from being merged.
We aim to detect and distinguish between:
- build failures;
- breakages of basic functionality, such as downstream applications failing to load shared libraries in their GPU branches (see the sketch after this list);
- architecture-specific errors;
- errors in collective communication libraries;
- regressions in performance and closure sizes.
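To make the second category concrete, below is a minimal sketch of the kind of GPU smoke test we want. It is illustrative, not an existing check: the test name is made up, `saxpy` stands in for any small CUDA program (nixpkgs ships one as `cudaPackages.saxpy`), and the builder is assumed to advertise the `cuda` system feature and to expose the GPU devices inside the build sandbox.

```nix
# Hypothetical GPU smoke test: fails if the program cannot load the CUDA
# driver and its shared libraries at run time.
{ runCommand, saxpy }:

runCommand "saxpy-gpu-smoke-test"
  {
    # Routes the job to builders advertising the `cuda` system feature;
    # the builder must also expose /dev/nvidia* inside the sandbox.
    requiredSystemFeatures = [ "cuda" ];
  }
  ''
    ${saxpy}/bin/saxpy  # exercises the GPU branch of a downstream program
    touch $out          # mark the check as passed
  ''
```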
## Hardware

An overview of the currently available hardware and access:
| Hostname | Address | GPU | GPU architecture |
|---|---|---|---|
| ada | ada.nixos-cuda.org - 144.76.101.55 | RTX 4000 Ada (SFF) | Ada Lovelace |
| pascal | pascal.nixos-cuda.org - 95.216.72.164 | GeForce GTX 1080 | Pascal |
| CPU builder courtesy of Gaetan and liberodark | N/A | None | None |
## Hydra jobsets
- `cuda-gpu-tests`: runs the nixpkgs GPU tests on builders with the `cuda` capability (see the sketch below).
- `cuda-packages`: builds nixpkgs's `release-cuda.nix` jobset.
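For illustration, a builder can advertise the `cuda` capability roughly as follows. This is a sketch rather than our actual configuration; the SSH user and key path in particular are placeholders.

```nix
# Sketch: registering a GPU host as a remote builder with the `cuda`
# system feature (illustrative values, not the production configuration).
{
  nix.distributedBuilds = true;
  nix.buildMachines = [
    {
      hostName = "ada.nixos-cuda.org";
      system = "x86_64-linux";
      sshUser = "builder";              # placeholder
      sshKey = "/etc/nix/builder-key";  # placeholder
      maxJobs = 4;
      # Jobs with `requiredSystemFeatures = [ "cuda" ]` are dispatched
      # only to machines that list the feature here.
      supportedFeatures = [ "cuda" ];
    }
  ];
}
```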
Hydra currently runs on `ada`.
Hydra's binary cache is exposed for development purposes; the substituter is currently backed by harmonia. For a compliant way to consume CUDA with Nix, refer to NVIDIA.
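Consuming the cache looks roughly like the following; the URL and public key below are placeholders, not the real values.

```nix
# Sketch: pointing a NixOS machine at the cache (placeholder URL and key).
{
  nix.settings = {
    extra-substituters = [ "https://cache.nixos-cuda.org" ];
    extra-trusted-public-keys = [
      "cache.nixos-cuda.org:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="  # placeholder
    ];
  };
}
```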
## Roadmap

- Coverage:
  - Remove hard-coded attribute lists: cf. "Collect `gpuChecks` by following `recurseIntoAttrs`" in "MVE"; same for packages.
  - Data-Center Hardware and Multi-GPU Set-ups:
    - Probably requires ephemeral builders due to cost.
    - Currently no multi-GPU/collective-communications test suites are available in Nixpkgs.
  - Jetson (tentatively, based on owned hardware and colocation).
- Efficiency:
  - `harmonia` → `snix-narbridge`;
  - virtiofsd flat stores → snix virtiofs; in particular, we hope to eliminate the inefficient Nix substitution.
- Ephemeral Builders:
  - Make NixOS work on Azure (within pain limits).
  - Basic functionality: on-demand deployment and automatic deallocation of remote builders; hooking the builders up to Hydra.
  - IO costs: synchronizing the closures is likely to be the bottleneck. Cf. the snix virtiofs item above.
- Isolation and Access Control:
  - [Serge] Move remote builders, Hydra, and web services to microvms with isolated stores.
  - Prevent unaudited SSH access to hypervisors and to Hydra (currently Gaetan and Serge are in the authorized keys).
  - Pull-based Deployment.
- Minimal Viable Example:
  - [third parties via Jonas] Initial funding for GPU hardware.
  - [Jonas] GitHub organization, domain names, web page.
  - [Gaetan] Set up NixOS and Hydra.
  - [Gaetan] ZFS Nix store on `ada`, `pascal`.
  - [Gaetan] Set up `sops-nix` for managing the secrets.
  - [Gaetan] Hydra.
  - [Gaetan] Back up the Hydra configuration (DB?, jobsets?).
  - [Gaetan] Move Hydra to `ada` (more storage available).
  - [Serge] Figure out how Hydra inputs work.
  - Open a PR for the `cuda-gpu-tests` jobset (currently the input points at Gaetan's branch).
  - Collect `gpuChecks` by following `recurseIntoAttrs` and `passthru.tests` (currently using a hard-coded list); see the sketch after this list.
  - Declarative jobsets (currently configured via the web UI).
  - [Gaetan] Expose the binary cache.
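The `gpuChecks` collection above could look roughly like this sketch. It assumes `pkgs` is an instantiated nixpkgs and that GPU checks live in `passthru.tests` under names starting with `gpu`; that naming convention, and the error handling, are simplifications.

```nix
# Sketch: collect GPU checks by walking nixpkgs, descending only into sets
# marked by `recurseIntoAttrs` (i.e. with `recurseForDerivations = true`),
# the same bound nixpkgs' release machinery uses for its traversals.
{ lib, pkgs }:

let
  collectGpuChecks = attrs:
    lib.concatMap
      (name:
        let r = builtins.tryEval attrs.${name}; in
        if !r.success then
          [ ] # skip attributes that fail to evaluate
        else if lib.isDerivation r.value then
          # `passthru.tests` is merged into the derivation's attributes.
          lib.attrValues
            (lib.filterAttrs (n: _: lib.hasPrefix "gpu" n)
              (r.value.tests or { }))
        else if lib.isAttrs r.value && (r.value.recurseForDerivations or false) then
          collectGpuChecks r.value
        else
          [ ])
      (builtins.attrNames attrs);
in
collectGpuChecks pkgs
```

Bounding the recursion by `recurseForDerivations` keeps evaluation time manageable while still discovering tests without a hard-coded list.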