
# Learning Objectives

In this lab, we will learn the following:

* CUDA-aware MPI concepts and APIs.
* Impact of fine-tuning CUDA-aware MPI on application performance.
* Underlying GPUDirect technologies like P2P and RDMA.

**Note:** Execution results can vary significantly based on the MPI installation, supporting libraries, workload manager, and underlying CPU and GPU hardware configuration and topology. The codes in this lab have been tested on DGX-1 nodes with 8 Tesla V100 16 GB GPUs each, connected by Mellanox InfiniBand NICs and running OpenMPI v4.1.1 with HPCX 2.8.1 and CUDA v11.3.0.0, as well as on DGX nodes with 8 Ampere A100 80 GB GPUs each (OpenMPI v4.1.1, HPC SDK 22.7, HPCX 2.11).


# Improving Application Performance

## Analysis

Thus far, we have passed host (system) memory pointers to the MPI calls. With a regular MPI implementation, only pointers to host memory can be passed to MPI. However, when we combine MPI and CUDA, we need to send and receive GPU buffers instead of host buffers. Thus, with regular MPI, we need to stage GPU buffers through host memory explicitly using `cudaMemcpy`, as we saw in the previous lab.

As mentioned in the previous lab, MPI calls initially take a lot of time, and their latency and throughput gradually improve. It is therefore helpful to zoom out of a particular Jacobi iteration and look at the bigger picture, that is, the average time taken for a halo exchange. Recall that with the `--stats=true` flag, stats are visible on the terminal as well. In particular, observe the NVTX Push-Pop stats:

![mpi_memcpy_nvtx_stats](../../images/mpi_memcpy_nvtx_stats.png)

The minimum, maximum, and average time taken for a single halo exchange, including software overhead, is visible. The average time is $84\mu$s, the minimum is $50\mu$s, and the maximum is $6382\mu$s. The average time taken is a useful statistic for us.

We can also view the throughput and latency of HtoD and DtoH copy operations as follows:

![mpi_host_staging_throughput_latency](../../images/mpi_host_staging_throughput_latency.png)

### Opportunity for improvement

Using multiple Memcpy operations alongside each MPI call incurs considerable software overhead. Moreover, HtoD and DtoH throughput and latency are worse than DtoD because CPU-GPU communication goes over PCIe and NVLink is not utilized.

With regular MPI, we could write a program in which intra-node communication is handled within a single process, and then enable P2P and the other optimizations that we learnt in previous labs. We would also need separate code for inter-node communication. This is a complex and time-consuming approach, and it will not scale well, especially for more communication-intensive programs.

Thus, we need to make use of CUDA-aware MPI, which simplifies the code substantially and enables many optimizations under the hood, transparently to the user.

## CUDA-aware MPI

With CUDA-aware MPI, GPU buffers can be passed directly to MPI. A CUDA-aware MPI implementation handles a buffer differently depending on whether it resides in host or device memory. With the Unified Virtual Addressing (UVA) feature, the host memory and the memory of all GPUs in a system (a single node) are combined into one large (virtual) address space. The MPI library can then infer from a pointer whether the memory resides on the host or on a device, and handle the operation accordingly.

From an API standpoint, CUDA-aware MPI results in simplified codes where CUDA memory pointers can seamlessly be used in MPI calls. Without CUDA-aware MPI, we need to stage GPU buffers through host memory buffers (`s_buf_h`, `r_buf_h`), using `cudaMemcpy` as shown in the following code excerpt:

```c
//MPI rank 0
cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

//MPI rank 1
MPI_Recv(r_buf_h, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);
```

With a CUDA-aware MPI library this is not necessary; the GPU buffers (`s_buf_d`, `r_buf_d`) can be directly passed to MPI as in the following excerpt:

```c
//MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

//MPI rank 1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
```

Indeed, the implementation is quite intuitive and easy-to-use. Now, let us use CUDA-aware MPI in our application.

## Implementation Exercise: Part 2

Open the [jacobi_cuda_aware_mpi.cpp](../../source_code/mpi/jacobi_cuda_aware_mpi.cpp) and [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files. Alternatively, you can navigate to the `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane, then click to open the `jacobi_cuda_aware_mpi.cpp` and `jacobi_kernels.cu` files. The `jacobi_kernels.cu` file is the same as in the previous lab.

Also open the [Makefile](../../source_code/mpi/Makefile) and note that compilation and linking are also the same as in the previous lab.

Understand the flow of the `jacobi_cuda_aware_mpi.cpp` program and observe the following:

1. `local_rank` is used to set the current GPU device.
2. Device kernel calls have been replaced with function wrappers for ease of compilation.
3. Rank 0 is used to calculate efficiency and other metrics, even though all ranks run the `single_gpu` function to verify the multi-GPU implementation's correctness.
4. Each halo exchange is accomplished with an `MPI_Sendrecv` call with no explicit `cudaMemcpy` function calls.

### To-Do

Now, implement the following marked as `TODO: Part 2-`:

* Implement top and bottom halo exchanges using `MPI_Sendrecv` call for each exchange. Use only GPU buffers in the MPI call's function arguments.
* Reduce the rank-local L2 Norm to a global L2 norm using `MPI_Allreduce` function.

After implementing these, compile the program:

In [None]:
!cd ../../source_code/mpi && make clean && make jacobi_cuda_aware_mpi

Ensure there are no compilation errors. Now, let us validate the program with a smaller size grid using half a node.

**Due to limited resources, we will be using a smaller grid ($2K\times4K$) on 2 GPUs.**

- **To run on half a node with 4 GPUs with $4K\times8K$ grid size, use: `mpirun -np 4 --map-by ppr:4:socket ./jacobi_cuda_aware_mpi -nx 4096 -ny 8192`.**
- **To run on a full node with 8 GPUs with $16K\times16K$ grid size, use: `mpirun -np 8 --map-by ppr:4:socket ./jacobi_cuda_aware_mpi -ny 16384`.**
- **To run across 2 nodes with 16 GPUs with $16K\times32K$ grid size, use: `mpirun -np 16 --map-by ppr:4:socket ./jacobi_cuda_aware_mpi -ny 32768` inside a batch script, or simply run `srun --partition=gpu --nodes=2 --gres=gpu:8 --ntasks=16 --mpi=pmix --ntasks-per-socket=4 ./jacobi_cuda_aware_mpi -ny 32768` on the command line.** ***<mark>NOTE: If resources are available</mark>***

Run the program with 2 processes:

In [None]:
!cd ../../source_code/mpi && srun --partition=gpu -n1 --gres=gpu:2  --ntasks=2 --mpi=pmix --ntasks-per-socket=2 ./jacobi_cuda_aware_mpi -nx 2048 -ny 4096

### DGX system with 8 Ampere A100
Partial results obtained from a DGX system with 8 A100s:

Using 2 GPUs (2K$\times$4K grid size):

```bash
Num GPUs: 2.
2048x4096: 1 GPU:   0.1356 s, 2 GPUs:   0.1282 s, speedup:     1.06, efficiency:    52.87 
```

Using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size):

```bash
Num GPUs: 16.
16384x32768: 1 GPU:   6.5659 s, 16 GPUs:   0.5674 s, speedup:    11.57, efficiency:    72.33 
```

### DGX system with 8 Tesla V100
Partial results obtained from a DGX system with 8 Tesla V100s:

Using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size):

```bash
Num GPUs: 16.
16384x32768: 1 GPU:   8.9087 s, 16 GPUs:   1.1786 s, speedup:     7.56, efficiency:    47.24
```
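The speedup and efficiency columns in the outputs above can be reproduced from the two timings on each line. The sketch below assumes the usual definitions (speedup is the single-GPU time divided by the multi-GPU time, and efficiency is the speedup divided by the number of GPUs, expressed in percent). The function names are hypothetical and are not taken from the lab's source code.

```c
#include <assert.h>

/* Speedup of the multi-GPU run relative to the single-GPU run.
   Assumed definition: t_single_gpu / t_multi_gpu. */
double speedup(double t_single_gpu, double t_multi_gpu) {
    assert(t_single_gpu > 0.0 && t_multi_gpu > 0.0);
    return t_single_gpu / t_multi_gpu;
}

/* Parallel efficiency in percent.
   Assumed definition: 100 * speedup / number of GPUs. */
double efficiency_percent(double t_single_gpu, double t_multi_gpu, int num_gpus) {
    assert(num_gpus > 0);
    return 100.0 * speedup(t_single_gpu, t_multi_gpu) / num_gpus;
}
```

For the 16-GPU Tesla V100 run, `efficiency_percent(8.9087, 1.1786, 16)` evaluates to roughly 47.2, in line with the reported value. Small differences in the last digit are expected because the printed timings are themselves rounded.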

You may observe a drop in efficiency. Recall that initially MPI calls take a lot of time and they gradually improve in latency and throughput. Try running the program again with 5000 Jacobi loop iterations by using the `-niter 5000` option:

In [None]:
!cd ../../source_code/mpi && srun --partition=gpu -n1 --gres=gpu:2  --ntasks=2 --mpi=pmix --ntasks-per-socket=2 ./jacobi_cuda_aware_mpi -nx 2048 -ny 4096 -niter 5000

### DGX system with 8 Ampere A100
Partial results obtained from a DGX system with 8 A100s:

Using 2 GPUs (2K$\times$4K grid size):

```bash
Num GPUs: 2.
2048x4096: 1 GPU:   0.6726 s, 2 GPUs:   0.6453 s, speedup:     1.04, efficiency:    52.11 
```

Using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size):

```bash
Num GPUs: 16.
16384x32768: 1 GPU:  32.8321 s, 16 GPUs:   2.5078 s, speedup:    13.09, efficiency:    81.82 
```

### DGX system with 8 Tesla V100
Partial results obtained from a DGX system with 8 Tesla V100s:

Using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size):

```bash
Num GPUs: 16.
16384x32768: 1 GPU:  44.5246 s, 16 GPUs:   3.7889 s, speedup:    11.75, efficiency:    73.45 
```

As seen in the above example outputs, increasing the number of iterations improves the efficiency for the larger grid size. Now, let's profile the program to understand what's happening here.


## Profiling

Before we profile the binary, note that our program runs both the single-GPU and multi-GPU versions to calculate efficiency and speedup. However, this feature exists to check the correctness of the multi-GPU code. Once we know that our implementation is correct, we don't need to run the single-GPU version every time, as it takes a lot of time, which you will have noticed when running the 5000-iteration version.

Moreover, we are not interested in profiling the single-GPU version, as profiling it increases both the profiling time and the `.nsys-rep` file size. So, we will skip running the single-GPU version by passing the `-skip_single_gpu` flag to the binary. Note that we will then not get the speedup and efficiency numbers.

That isn't a problem, however: NVTX statistics provide the runtime of our multi-GPU Jacobi loop as well as the time taken for halo exchange, so we can use them for comparison.

Now, let us profile only the multi-GPU version for the baseline 1K iterations and for 5K iterations (on 4 GPUs):

In [None]:
!cd ../../source_code/mpi && sbatch profiling_4g_cuda_aware

<mark>**NOTE:**</mark> If you are interested in viewing the profiler report in the GUI, please download the files from the `../../source_code/mpi/output_profiler/` folder and import them into Nsight Systems. You can also view the report in the Slurm output via the Jupyter tab. The Slurm output is located at `../../source_code/mpi/` and the file name format is `slurm-{job id}.out`.


We also ran this for 10K and 25K iterations (on a DGX system with 8 Tesla V100), and we share the relevant NVTX stats for all these versions:

![mpi_cuda_aware_halo_exchange_latency](../../images/mpi_cuda_aware_halo_exchange_latency.png)

We also ran the `jacobi_memcpy_mpi` binary for 25K iterations (on a DGX system with 8 Tesla V100) and the results are as follows:

![mpi_memcpy_halo_exchange_latency](../../images/mpi_memcpy_halo_exchange_latency.png)

At 25K iterations, the CUDA-aware MPI version outperforms the Memcpy+MPI version in both average halo exchange latency and total execution time of the Jacobi loop. The average time taken by the CUDA-aware MPI version is 19.1s, compared to 20.5s for the Memcpy+MPI version.

### Optimizations Employed by CUDA-aware MPI

Let us now understand the optimizations that are employed by CUDA-aware MPI transparently to the user. 

#### GPUDirect P2P

We have already learnt about this technology in the previous module on CUDA-based single-node multi-GPU programming. Peer-to-Peer Memory Access is enabled by GPUDirect P2P technology. Here's a quick recap of how it works:

![gpudirect_p2p](../../images/gpudirect_p2p.png)

This accelerates intra-node communication. With GPUDirect P2P, buffers can be copied directly between the memories of two GPUs in the same system. Recall that because NVLink is present in our DGX-1V system, it is used for data transfer instead of PCIe. The profiler's description of the transfer confirms this:

![mpi_cuda_aware_p2p_metrics](../../images/mpi_cuda_aware_p2p_metrics.png)

#### GPUDirect RDMA 

With GPUDirect Remote Direct Memory Access (RDMA), abbreviated as GDR, buffers can be directly sent from the GPU memory to a network adapter without staging through host memory as shown below:

![gpudirect_rdma](../../images/gpudirect_rdma.png)

To understand the impact of GDR, we run the program on 2 GPUs with 1 GPU per node. This way, the GPUs must communicate either via GPUDirect RDMA or via host staging. Moreover, we decrease the grid size to $16384\times128$ to make the application more communication-bound. Note that the size of each copy operation is still the same: 16K elements $\times$ 4 B per `float` = 64 KB.

Run the binary with GDR enabled (the default configuration) with 1 GPU per node for 10K iterations, then run it again with GDR disabled and compare the results. We can disable GDR by passing the `-x UCX_IB_GPU_DIRECT_RDMA=no` flag to the `mpirun` command, or simply by running `export UCX_IB_GPU_DIRECT_RDMA=no` before you run the executable.
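The message size quoted above depends only on the row length, which is why shrinking the grid in the other dimension leaves it unchanged. The sketch below is a minimal illustration, assuming that one halo is a single row of `nx` values stored as `float` (as the 4 B element size in the text suggests). The helper name `halo_row_bytes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one halo row, assuming nx single-precision values per row.
   The result does not depend on ny, so reducing ny leaves the
   per-exchange message size unchanged. */
size_t halo_row_bytes(size_t nx) {
    assert(nx > 0);
    return nx * sizeof(float);
}
```

On platforms where `float` is 4 bytes, `halo_row_bytes(16384)` returns 65536, that is, 64 KB.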

Example commands to use are `mpirun -np 2 --map-by ppr:1:node ./jacobi_cuda_aware_mpi -ny 128 -skip_single_gpu -niter 10000` and `mpirun -np 2 --map-by ppr:1:node -x UCX_IB_GPU_DIRECT_RDMA=no ./jacobi_cuda_aware_mpi -ny 128 -skip_single_gpu -niter 10000`.

**NOTE: Due to limited resources, we will only view the example outputs.**

### DGX system with 8 Ampere A100
Partial results obtained from a DGX system with 8 A100s:

- GDR enabled
```bash
Num GPUs: 2.
16384x128: 2 GPUs:   0.6396 s 
```


- GDR disabled
```bash
Num GPUs: 2.
16384x128: 2 GPUs:   1.1941 s 
```

### DGX system with 8 Tesla V100
Partial results obtained from a DGX system with 8 Tesla V100s:

- GDR enabled
```bash
Num GPUs: 2.
16384x128: 2 GPUs:   1.0814 s
```


- GDR disabled
```bash
Num GPUs: 2.
16384x128: 2 GPUs:   1.3647 s
```

As seen in the above example outputs, the execution time increases considerably when GDR is disabled. On the DGX system with 8 Tesla V100, it increases by about $26\%$, from 1.08s to 1.36s. On the DGX system with 8 Ampere A100, it increases by about $87\%$, from 0.64s to 1.19s.

The profiler output of these two runs highlights the significant difference in halo exchange time. Focus on the minimum latency, as it reflects the most optimized inter-process communication with the given configuration options. The average latency also decreases for the GDR-enabled run.
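The relative slowdown from disabling GDR can be computed directly from the printed timings. The sketch below uses a hypothetical helper, `percent_increase`, that is not part of the lab's source code. With the unrounded timings it gives roughly 26% for the Tesla V100 runs and roughly 87% for the Ampere A100 runs.

```c
#include <assert.h>

/* Relative increase of t_new over t_base, in percent. */
double percent_increase(double t_base, double t_new) {
    assert(t_base > 0.0);
    return 100.0 * (t_new - t_base) / t_base;
}
```

For example, `percent_increase(1.0814, 1.3647)` covers the V100 case and `percent_increase(0.6396, 1.1941)` covers the A100 case.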

![mpi_cuda_aware_gdr_latency](../../images/mpi_cuda_aware_gdr_latency.png)

Note that GDR-based transfers are not visible in the Nsight Systems timeline. You will see an `MPI_Sendrecv` call in NVTX, but no memory copy operations will be visible on either the CPU or the GPU.

**Note:** If your OpenMPI installation does not use the UCX PML and instead relies on the `openib` BTL, you can disable GDR by using the `--mca btl_openib_want_cuda_gdr 0` flag.

There are several other optimizations employed by CUDA-aware MPI that we will not cover in detail. Some of them are: 

* GDR Copy: While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices like NICs, it is possible to use the same APIs to create valid CPU mappings of the GPU memory. The advantage of a CPU driven copy is the very small overhead involved. That might be useful when low latencies are required.
* GPUDirect for Accelerated Communication with Network and Storage Devices: This feature allows the network fabric driver (like MLX5) and the CUDA driver to share a common pinned buffer in order to avoid an unnecessary `memcpy` within host memory between the intermediate pinned buffers of the CUDA driver and the network fabric buffer.
* Pipelining: All operations that are required to carry out the message transfer can be pipelined.

**Solution:** The solution for this exercise is present in `source_code/mpi/solutions` directory: [jacobi_cuda_aware_mpi.cpp](../../source_code/mpi/solutions/jacobi_cuda_aware_mpi.cpp).

We now have an in-depth understanding of CUDA-aware MPI and how it simplifies the code while being highly performant. We have also covered GPUDirect technologies like P2P and RDMA and their effects on application performance. 

Now, let us learn about high-performance NVIDIA libraries NCCL and NVSHMEM that allow us to extract more performance while simplifying the code and runtime configuration further. 

Click below to access the lab and learn more about NVIDIA's NCCL library:

# [Next: NCCL Library](../nccl/nccl.ipynb)

Here's a link to the home notebook through which all other notebooks are accessible:

# [HOME](../../../start_here.ipynb)

---
## Links and Resources

* [Concepts: CUDA-aware MPI and GPUDirect Technologies](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/)
* [Concepts: GPUDirect Technologies](http://developer.download.nvidia.com/devzone/devcenter/cuda/docs/GPUDirect_Technology_Overview.pdf)
* [Documentation: GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html)
* [Documentation: CUDA support in OpenMPI](https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-support)
* [Code: GDRCopy Library](https://github.com/NVIDIA/gdrcopy)
* [Code: Multi-GPU Programming Models](https://github.com/NVIDIA/multi-gpu-programming-models)
* [Code: GPU Bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/)

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

## Licensing
Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
