Tools for building GPU clusters
ajdecon Merge pull request #304 from dholt/rc-local-playbook
Add playbook to manage ad-hoc startup commands via rc.local
Latest commit c8beb2d Oct 8, 2019
DeepOps

GPU infrastructure and automation tools

Overview

The DeepOps project encapsulates best practices for deploying GPU server clusters and for sharing single powerful nodes (such as NVIDIA DGX Systems). DeepOps can also be adapted or used in a modular fashion to match site-specific cluster needs. For example:

  • An on-prem, air-gapped data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
  • An existing cluster running Kubernetes where DeepOps scripts are used to deploy Kubeflow and connect NFS storage
  • An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm, Kubernetes, or a hybrid of both
  • A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime

Watch the video tutorial to see how to use DeepOps to deploy Kubernetes and Kubeflow on a single DGX Station. A single-node deployment like this is a good testing ground before larger deployments.

Releases

Latest release: DeepOps 19.07 Release

For stable code, use the latest release branch (linked above). All development takes place on the master branch, which is generally functional but may change significantly between releases.

Getting Started

For detailed help or guidance, read through our Getting Started Guide or pick one of the deployment options documented below.

Deployment Options

Supported distributions

DeepOps currently supports the following Linux distributions:

  • NVIDIA DGX OS 4
  • Ubuntu 18.04 LTS
  • CentOS 7
  • Red Hat Enterprise Linux 7
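
Before running DeepOps against a host, it can be useful to confirm that the host runs one of the distributions listed above. The following is a minimal sketch, not part of DeepOps itself: it parses the contents of `/etc/os-release` and compares the `ID` and `VERSION_ID` fields against the list. The function names and the `ID` strings (`ubuntu`, `centos`, `rhel`) are assumptions made for illustration. The sketch does not try to detect NVIDIA DGX OS separately, because it makes no assumption about how DGX OS identifies itself.

```python
# Hypothetical pre-flight check; not part of the DeepOps repository.
# Maps an assumed /etc/os-release ID to the supported release from the list above.
SUPPORTED = {
    "ubuntu": "18.04",  # Ubuntu 18.04 LTS
    "centos": "7",      # CentOS 7
    "rhel": "7",        # Red Hat Enterprise Linux 7
}


def parse_os_release(text):
    """Parse KEY=value lines (os-release format) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def is_supported(fields):
    """Return True if the ID/VERSION_ID pair matches a supported release."""
    wanted = SUPPORTED.get(fields.get("ID", ""))
    if wanted is None:
        return False
    version = fields.get("VERSION_ID", "")
    # Accept an exact match ("7") or a minor release of it ("7.6").
    return version == wanted or version.startswith(wanted + ".")


def check_host(path="/etc/os-release"):
    """Read the local os-release file and report whether the host is supported."""
    with open(path, encoding="utf-8") as handle:
        return is_supported(parse_os_release(handle.read()))
```

Calling `check_host()` on a target node returns `True` or `False`. Keeping the parsing separate from the file access means the comparison logic can be exercised with sample strings.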

Kubernetes

Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications.

Consult the Kubernetes Guide for instructions on building a GPU-enabled Kubernetes cluster using DeepOps.

For more information on Kubernetes in general, refer to the official Kubernetes docs.

Slurm

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Consult the Slurm Guide for instructions on building a GPU-enabled Slurm cluster using DeepOps.

For more information on Slurm in general, refer to the official Slurm docs.

DGX POD Hybrid Cluster

A hybrid cluster with both Kubernetes and Slurm can also be deployed. This is recommended for DGX POD and other setups that wish to make maximal use of the cluster.

Consult the DGX POD Guide for step-by-step instructions on building a GPU-enabled hybrid cluster using DeepOps.

Virtual

To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This can be used for testing, adding new features, or configuring DeepOps to meet deployment-specific needs.

Consult the Virtual Guide to build a GPU-enabled virtual cluster with DeepOps.

Updating DeepOps

To update from a previous version of DeepOps to a newer release, please consult the Update Guide.

Copyright and License

This project is released under the BSD 3-clause license.

Issues

NVIDIA DGX customers should file an NVES ticket via NVIDIA Enterprise Services.

Otherwise, bug reports and feature requests can be submitted by filing a GitHub Issue.

Contributing

To contribute, please open a pull request against the master branch from a local fork.

A signed copy of the Contributor License Agreement needs to be provided to deepops@nvidia.com before any change can be accepted.
