Multi GPU Programming Models

This project implements the well-known multi-GPU Jacobi solver with different multi-GPU programming models; a minimal sketch of the halo-exchange pattern these variants build on is shown after the list:

  • single_threaded_copy Single Threaded using cudaMemcpy for inter GPU communication
  • multi_threaded_copy Multi Threaded with OpenMP using cudaMemcpy for inter GPU communication
  • multi_threaded_copy_overlap Multi Threaded with OpenMP using cudaMemcpy for inter GPU communication with overlapping communication
  • multi_threaded_p2p Multi Threaded with OpenMP using GPUDirect P2P mappings for inter GPU communication
  • multi_threaded_p2p_opt Multi Threaded with OpenMP using GPUDirect P2P mappings for inter GPU communication with delayed norm execution
  • multi_threaded_um Multi Threaded with OpenMP relying on transparent peer mappings with Unified Memory for inter GPU communication
  • mpi Multi Process with MPI using CUDA-aware MPI for inter GPU communication
  • mpi_overlap Multi Process with MPI using CUDA-aware MPI for inter GPU communication with overlapping communication
  • nccl Multi Process with MPI and NCCL using NCCL for inter GPU communication
  • nccl_overlap Multi Process with MPI and NCCL using NCCL for inter GPU communication with overlapping communication
  • nccl_graphs Multi Process with MPI and NCCL using NCCL for inter GPU communication with overlapping communication combined with CUDA Graphs
  • nvshmem Multi Process with MPI and NVSHMEM using NVSHMEM for inter GPU communication.
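
To make the shared pattern concrete, the following is a minimal, simplified sketch (not code from this repository) of a single host thread driving several GPUs and exchanging halo rows with cudaMemcpy, roughly in the spirit of single_threaded_copy. The kernel, the domain sizes, the periodic decomposition in y, the omitted convergence (norm) check and the omitted error checking are all illustrative assumptions.

#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Illustrative 5-point Jacobi update of the interior rows of one GPU's slab.
__global__ void jacobi_kernel(float* a_new, const float* a, int nx, int ny) {
    const int ix = blockIdx.x * blockDim.x + threadIdx.x;
    const int iy = blockIdx.y * blockDim.y + threadIdx.y + 1;  // rows 0 and ny-1 are halo rows
    if (ix > 0 && ix < nx - 1 && iy < ny - 1) {
        a_new[iy * nx + ix] = 0.25f * (a[iy * nx + ix - 1] + a[iy * nx + ix + 1] +
                                       a[(iy - 1) * nx + ix] + a[(iy + 1) * nx + ix]);
    }
}

int main() {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    const int nx = 1024, chunk = 256;  // illustrative sizes
    const int ny = chunk + 2;          // interior rows plus two halo rows per GPU
    std::vector<float*> a(num_devices), a_new(num_devices);
    for (int dev = 0; dev < num_devices; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&a[dev], nx * ny * sizeof(float));
        cudaMalloc(&a_new[dev], nx * ny * sizeof(float));
        cudaMemset(a[dev], 0, nx * ny * sizeof(float));
        cudaMemset(a_new[dev], 0, nx * ny * sizeof(float));
    }

    const dim3 block(32, 32);
    const dim3 grid((nx + block.x - 1) / block.x, (chunk + block.y - 1) / block.y);
    for (int iter = 0; iter < 1000; ++iter) {
        // 1) a single host thread launches the interior update on every GPU
        for (int dev = 0; dev < num_devices; ++dev) {
            cudaSetDevice(dev);
            jacobi_kernel<<<grid, block>>>(a_new[dev], a[dev], nx, ny);
        }
        // 2) wait for all updates before reading the boundary rows
        for (int dev = 0; dev < num_devices; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
        }
        // 3) exchange halo rows with the neighboring GPUs via cudaMemcpy
        for (int dev = 0; dev < num_devices; ++dev) {
            const int top = (dev + 1) % num_devices;
            const int bottom = (dev + num_devices - 1) % num_devices;
            cudaSetDevice(dev);
            // last interior row of this GPU -> halo row 0 of the top neighbor
            cudaMemcpy(a_new[top], a_new[dev] + chunk * nx,
                       nx * sizeof(float), cudaMemcpyDeviceToDevice);
            // first interior row of this GPU -> halo row chunk+1 of the bottom neighbor
            cudaMemcpy(a_new[bottom] + (chunk + 1) * nx, a_new[dev] + nx,
                       nx * sizeof(float), cudaMemcpyDeviceToDevice);
        }
        // 4) make sure all copies have completed, then swap the buffers
        for (int dev = 0; dev < num_devices; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
            std::swap(a[dev], a_new[dev]);
        }
    }

    for (int dev = 0; dev < num_devices; ++dev) {
        cudaSetDevice(dev);
        cudaFree(a_new[dev]);
        cudaFree(a[dev]);
    }
    std::printf("ran Jacobi sketch on %d device(s)\n", num_devices);
    return 0;
}

The multi_threaded_* variants replace the outer loops over devices with one OpenMP thread per GPU, and the P2P, Unified Memory, MPI, NCCL and NVSHMEM variants replace step 3 with the respective communication mechanism.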

Each variant is a stand-alone Makefile project and most variants have been discussed in various GTC Talks, e.g.:

  • single_threaded_copy, multi_threaded_copy, multi_threaded_copy_overlap, multi_threaded_p2p, multi_threaded_p2p_opt, mpi, mpi_overlap and nvshmem on DGX-1V at GTC Europe 2017 in 23031 - Multi GPU Programming Models
  • single_threaded_copy, multi_threaded_copy, multi_threaded_copy_overlap, multi_threaded_p2p, multi_threaded_p2p_opt, mpi, mpi_overlap and nvshmem on DGX-2 at GTC 2019 in S9139 - Multi GPU Programming Models
  • multi_threaded_copy, multi_threaded_copy_overlap, multi_threaded_p2p, multi_threaded_p2p_opt, mpi, mpi_overlap, nccl, nccl_overlap and nvshmem on DGX A100 at GTC 2021 in A31140 - Multi-GPU Programming Models

Some examples in this repository are the basis for an interactive tutorial: FZJ-JSC/tutorial-multi-gpu.

Requirements

  • CUDA: version 11.0 (9.2 if built with DISABLE_CUB=1) or later is required by all variants.
    • nccl_graphs requires NCCL 2.15.1, CUDA 11.7 and CUDA Driver 515.65.01 or newer
  • OpenMP capable compiler: Required by the Multi Threaded variants. The examples have been developed and tested with gcc.
  • MPI: The mpi and mpi_overlap variants require a CUDA-aware[1] implementation. For the NVSHMEM and NCCL variants, a non-CUDA-aware MPI is sufficient. The examples have been developed and tested with OpenMPI.
  • NVSHMEM (version 0.4.1 or later): Required by the NVSHMEM variant.
  • NCCL (version 2.8 or later): Required by the NCCL variants.

Building

Each variant comes with a Makefile and can be built by simply issuing make, e.g.

multi-gpu-programming-models$ cd multi_threaded_copy
multi_threaded_copy$ make
nvcc -DHAVE_CUB -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 jacobi.cu -o jacobi
multi_threaded_copy$ ls jacobi
jacobi
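
As noted in the requirements, CUB can be disabled for older CUDA toolkits. Assuming the Makefiles pick up DISABLE_CUB as an ordinary make (or environment) variable, a build without CUB would look like:

multi_threaded_copy$ make DISABLE_CUB=1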

Run instructions

All variants have the following command line options (example invocations are shown after the option lists):

  • -niter: How many iterations to carry out (default 1000)
  • -nccheck: How often to check for convergence (default 1)
  • -nx: Size of the domain in x direction (default 16384)
  • -ny: Size of the domain in y direction (default 16384)
  • -csv: Print performance results as csv
  • -use_hp_streams: In mpi_overlap use high priority streams to hide kernel launch latencies of boundary kernels.

The nvshmem variant additionally provides:

  • -use_block_comm: Use block cooperative nvshmemx_float_put_nbi_block instead of nvshmem_float_p for communication.
  • -norm_overlap: Enable delayed norm execution as also implemented in multi_threaded_p2p_opt
  • -neighborhood_sync: Use custom neighbor only sync instead of nvshmemx_barrier_all_on_stream
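
For example, a single-process variant such as multi_threaded_copy can be started directly, and the MPI-based variants are typically launched with one rank per GPU through the MPI launcher; the flag values and the rank count below are illustrative:

multi_threaded_copy$ ./jacobi -niter 1000 -nx 16384 -ny 16384
mpi$ mpirun -np 8 ./jacobi -niter 1000 -csv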

The provided script bench.sh contains examples for executing all the benchmarks presented in the GTC Talks referenced above.

Developers guide

The code follows the style guide defined in the .clang-format file. clang-format version 7 or later should be used to format the code prior to submitting it, e.g. with:

multi-gpu-programming-models$ cd multi_threaded_copy
multi_threaded_copy$ clang-format -style=file -i jacobi.cu

Footnotes

  1. A check for CUDA-aware support is done at compile and run time (see the OpenMPI FAQ for details). If your CUDA-aware MPI implementation does not support this check, which requires MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support() to be defined in mpi-ext.h, it can be skipped by setting SKIP_CUDA_AWARENESS_CHECK=1.
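
     For reference, a minimal sketch of such a check (not the repository's exact code), using the Open MPI extension symbols named above, could look as follows; the mpi-ext.h header and the macro semantics are Open MPI specific, and SKIP_CUDA_AWARENESS_CHECK is assumed to reach the compiler as a preprocessor define:

#include <cstdio>
#include <mpi.h>
#include <mpi-ext.h>  // Open MPI extension header: defines MPIX_CUDA_AWARE_SUPPORT

#if !defined(SKIP_CUDA_AWARENESS_CHECK) && !defined(MPIX_CUDA_AWARE_SUPPORT)
#error "Cannot determine CUDA-awareness at compile time, build with SKIP_CUDA_AWARENESS_CHECK=1 to skip this check"
#endif

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
#if !defined(SKIP_CUDA_AWARENESS_CHECK) && defined(MPIX_CUDA_AWARE_SUPPORT)
    // Run-time confirmation that the MPI library actually has CUDA-aware support enabled.
    if (1 != MPIX_Query_cuda_support()) {
        std::fprintf(stderr, "The MPI library does not have CUDA-aware support enabled.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
#endif
    // ... solver code passing CUDA device pointers to MPI calls ...
    MPI_Finalize();
    return 0;
}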
