
# NixOS-CUDA CI/CD

NixOS-CUDA CI/CD Infrastructure, including NixOS configurations for Hydra and the builders. This is not an official NixOS project.

## Scope

The purpose of this system is to improve the maintainability of hardware-accelerated (specifically CUDA) software in Nixpkgs. Sustainable maintenance and development of Nixpkgs CUDA requires both a comprehensive test suite run on a schedule, for retroactive detection of breakage, and a lighter on-push test suite for early notification of contributors and for preventing regressions from being merged.

We aim to detect and distinguish between:

- build failures;
- breakages of basic functionality, such as downstream applications failing to load shared libraries in their GPU branches;
- architecture-specific errors;
- errors in collective communication libraries;
- regressions in performance and closure sizes.
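A runtime check of this kind can be sketched as a derivation that is only scheduled on builders advertising the `cuda` system feature. The attribute path, binary name, and structure below are illustrative, not taken from this repository; `cudaPackages.saxpy` is a small test package that exists in Nixpkgs, but the exact check used here may differ.

```nix
# Sketch of a GPU smoke test in the style of Nixpkgs gpuChecks.
# Assumptions: a builder that declares the "cuda" system feature and has a
# working driver; the package and binary names are placeholders.
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; config.cudaSupport = true; } }:

pkgs.runCommand "example-gpu-smoke-test"
  {
    # Only builders declaring the "cuda" feature are eligible to run this,
    # so it never lands on a CPU-only machine.
    requiredSystemFeatures = [ "cuda" ];
    nativeBuildInputs = [ pkgs.cudaPackages.saxpy ];
  }
  ''
    # A build failure of saxpy itself is caught at build time; running it
    # here distinguishes the runtime class of breakage: shared libraries
    # failing to load, or architecture-specific faults that only surface
    # on real hardware.
    saxpy
    touch $out
  ''
```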

## Hosts

An overview of the currently available hardware and access.

| Hostname | Address | GPU | GPU architecture |
| --- | --- | --- | --- |
| ada | ada.nixos-cuda.org (144.76.101.55) | RTX 4000 Ada (SFF) | Ada Lovelace |
| pascal | pascal.nixos-cuda.org (95.216.72.164) | GeForce GTX 1080 | Pascal |
| CPU builder (courtesy of Gaetan and liberodark) | N/A | None | None |
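Hosts like these can be wired into a frontend as remote builders. A minimal NixOS sketch follows; the user name, key path, and job limits are assumptions for illustration, not this repository's actual configuration.

```nix
# Sketch: registering a GPU host from the table above as a remote builder
# that advertises the "cuda" system feature, so GPU checks can be
# scheduled onto it. sshUser, sshKey, and maxJobs are placeholders.
{
  nix.distributedBuilds = true;
  nix.buildMachines = [{
    hostName = "pascal.nixos-cuda.org";
    system = "x86_64-linux";
    maxJobs = 2;
    speedFactor = 1;
    supportedFeatures = [ "cuda" "kvm" "big-parallel" ];
    sshUser = "builder";
    sshKey = "/etc/nix/builder_ed25519";
  }];
}
```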

## Hydra jobsets

Hydra currently runs on ada.

Hydra's binary cache is exposed for development purposes; the substituter is currently backed by harmonia. For a compliant way to consume CUDA with Nix, refer to NVIDIA.
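Consuming such a cache from a client looks roughly like the following NixOS fragment. The substituter URL and public key below are placeholders, since the real values are not stated here.

```nix
# Sketch only: the cache URL and signing key are invented placeholders,
# not the actual values for this deployment.
{
  nix.settings = {
    extra-substituters = [ "https://cache.example.nixos-cuda.org" ];
    extra-trusted-public-keys = [
      "cache.example.nixos-cuda.org-1:AAAA...placeholder..."
    ];
  };
}
```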

## ROADMAP

- Coverage
  - Remove hard-coded attribute lists: cf. "Collect gpuChecks by following recurseIntoAttrs" under "Minimal Viable Example"; same for packages.
  - Data-center hardware and multi-GPU set-ups
    - Probably requires ephemeral builders due to cost.
    - Currently no multi-GPU/collective-communication test suites are available in Nixpkgs.
  - Jetson (tentatively, based on owned hardware and colocation)
- Efficiency
  - harmonia → snix-narbridge;
  - virtiofsd flat stores → snix virtiofs; in particular, we hope to eliminate the inefficient Nix substitution;
  - Ephemeral builders:
    - Make NixOS work on Azure (within pain limits).
    - Basic functionality: on-demand deployment and automatic deallocation of remote builders; hooking the builders up to Hydra.
    - IO costs: synchronizing the closures is likely to be the bottleneck. Cf. the snix virtiofs story.
- Isolation and Access Control
  - [Serge] Move remote builders, Hydra, and web services to microVMs with isolated stores.
  - Prevent unaudited SSH access to the hypervisors and to Hydra (currently Gaetan and Serge are in the authorized keys).
  - Pull-based deployment.
- Minimal Viable Example
  - [third parties via Jonas] Initial funding for GPU hardware.
  - [Jonas] GitHub organization, domain names, web page.
  - [Gaetan] Set up NixOS and Hydra.
  - [Gaetan] ZFS Nix store on ada and pascal.
  - [Gaetan] Set up sops-nix for managing the secrets.
  - [Gaetan] Hydra.
    - [Gaetan] Back up the Hydra configuration (DB?, jobsets?).
    - [Gaetan] Move Hydra to ada (more storage available).
    - [Serge] Figure out how Hydra inputs work.
    - Open a PR for the cuda-gpu-tests jobset (currently the input points at Gaetan's branch).
    - Collect gpuChecks by following recurseIntoAttrs and passthru.tests (currently using a hard-coded list).
    - Declarative jobsets (currently configured via the web UI).
  - [Gaetan] Expose the binary cache.
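The "collect gpuChecks by following recurseIntoAttrs" item above can be sketched as a small Nix traversal. This is an assumed approach, not the repository's implementation; following `passthru.tests` would be an analogous extra descent step.

```nix
# Sketch: walk an attribute set, descending only where recurseForDerivations
# is set (the flag that lib.recurseIntoAttrs adds), and collect every
# derivation found. This mirrors how Hydra and nix-env discover jobs, and
# would replace a hard-coded attribute list.
{ lib }:

let
  collectDrvs = attrs:
    lib.concatLists (lib.mapAttrsToList
      (_name: value:
        if lib.isDerivation value then
          [ value ]
        else if lib.isAttrs value && (value.recurseForDerivations or false) then
          collectDrvs value
        else
          [ ])
      attrs);
in
collectDrvs
```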

## About

Machine config for the CUDA team [maintainer=@GaetanLepage]
