# CUDAWARE-MPI on Multi-GPU

In this notebook, we introduce how MPI and CUDA work together, how efficient the combination is, and how it can be used through the CUDAWARE-MPI API.

## Objectives

By the time you complete this notebook you will:

- Understand the concepts of MPI, CUDA and CUDAWARE-MPI on multiple GPUs.
- Understand the CUDAWARE-MPI API.

## Ping-Pong Benchmarks

In this notebook, we will look at a simple *ping pong* code that measures the bandwidth for data transfers between 2 MPI processes. We will look at the following versions:

- A first version using only the CPU with __MPI__;
- A second version with __MPI + CUDA__ between two GPUs, which stages the data through CPU memory;
- And the last one that uses __CUDAWARE-MPI__, which exchanges data directly between GPUs using GPUDirect or NVLink.

### MPI

We will start by looking at a CPU-only version of the code to understand the idea behind a simple data transfer program (*ping-pong*). Two MPI processes pass data back and forth, and the bandwidth is calculated from the measured transfer time and the known amount of data transferred. Let us look at the `ping-pong-MPI.c` code to see how it is implemented. At the top of the main program, we initialize MPI and determine the total number of processes and the rank of each one. The *ping-pong* assumes exactly two ranks, so the program must be launched with two processes:

```cpp
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status status; 
```

We then enter the main `for` *loop*, where each iteration performs the data transfers and the bandwidth calculation for a different message size, ranging from 8 bytes to 1 GB:

```cpp
   for(int i = 0; i <= 27; i++)
   {
     long int N = 1 << i; /* doubles per message; the steps below run inside this loop */
   }
```
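As a quick check of the size range, the sketch below computes the message size in bytes for a given loop exponent. The helper names `message_bytes` and `print_message_sizes` are hypothetical and not part of the benchmark; the sketch only assumes 8-byte `double` values, as the benchmark itself does.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not in the benchmark): size in bytes of a message
   holding 2^exponent doubles, assuming sizeof(double) == 8. */
long long message_bytes(int exponent)
{
    assert(exponent >= 0 && exponent <= 27);  /* range used by the benchmark loop */
    long long n = 1LL << exponent;            /* number of doubles, N = 2^exponent */
    return n * (long long)sizeof(double);
}

/* Print the whole sweep of message sizes, one line per loop iteration. */
void print_message_sizes(void)
{
    for (int i = 0; i <= 27; i++)
        printf("i = %2d  ->  %lld bytes\n", i, message_bytes(i));
}
```

The first iteration (`i = 0`) sends a single `double`, and the last one (`i = 27`) sends 2^27 doubles, which is 2^30 bytes.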

Next, we allocate and zero-initialize the *A* array and define the tags (`tag1` and `tag2`) used to match the MPI send/receive pairs.

```cpp
   double *A = (double*) calloc (N, sizeof(double)); 
```

Basically, each iteration of the *loop* does the following:

- If rank is 0, it first sends a message with the data from the array `A` to rank 1, then waits to receive a message from rank 1.

- If rank is 1, it first waits to receive a message from rank 0 and then sends a message back to rank 0.

```cpp
    start_time = MPI_Wtime();
    for(int i = 1; i <= loop_count; i++)
    {
      if(rank == 0)
      {
        MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
      }else if(rank == 1)
       {
         MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
         MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
       }
    }
    stop_time = MPI_Wtime();
```

The previous two points describe a *ping-pong* data transfer. Each iteration of the timed loop contains two transfers (one in each direction), so the average time per transfer is the elapsed time divided by twice the number of iterations. Then, from the timing results and the known size of the data transfers, we calculate the bandwidth and print the results:

```cpp
long int num_B = 8 * N;
long int B_in_GB = 1 << 30;
double num_GB = (double)num_B / (double)B_in_GB;
double avg_time_per_transfer=elapsed_time/(2.0*(double)loop_count);
```
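To make the arithmetic concrete, here is a minimal sketch that wraps the same calculation in a function. The name `bandwidth_gb_per_s` and its parameters are hypothetical and used for illustration only; as in the benchmark, 1 GB is taken to be 2^30 bytes and each `double` is assumed to occupy 8 bytes.

```c
#include <assert.h>

/* Hypothetical helper mirroring the benchmark's arithmetic:
   n_doubles  - number of doubles per message (N)
   elapsed_s  - wall time of the whole timed loop, in seconds
   loop_count - number of round trips in the timed loop */
double bandwidth_gb_per_s(long long n_doubles, double elapsed_s, int loop_count)
{
    assert(n_doubles > 0 && elapsed_s > 0.0 && loop_count > 0);

    /* message size converted from bytes to GB (2^30 bytes) */
    double num_gb = (double)(8 * n_doubles) / (double)(1LL << 30);

    /* each round trip contains two transfers, one per direction */
    double avg_time_per_transfer = elapsed_s / (2.0 * (double)loop_count);

    return num_gb / avg_time_per_transfer;
}
```

With made-up numbers, a 1 GB message (`n_doubles = 1 << 27`) whose 50 round trips take 10 seconds in total averages 0.1 seconds per transfer, which corresponds to about 10 GB/s.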

The complete program is written to `ping-pong-MPI.c` by the next cell. Remember that MPI programs must be compiled with an MPI compiler wrapper, such as `mpicc`, as done in the Run the Code section.

In [None]:
%%writefile ping-pong-MPI.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status status;

    double start_time, stop_time, elapsed_time;
       
    for(int i = 0; i <= 27; i++) 
    {
       long int N = 1 << i; /*Loop from 8 Bytes to 1 GB*/

       double *A = (double*)calloc( N, sizeof(double));  /*Allocate memory for A on CPU*/

       int tag1 = 1000;
       int tag2 = 2000;

       int loop_count = 50;

       /********************************/      
       /**/ start_time = MPI_Wtime();/**/
       /********************************/

       for(int i = 1; i <= loop_count; i++)
       {
            if(rank == 0)
            {
               MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
               MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
            }
            else if(rank == 1)
            {
               MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
               MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

       /*********************************/      
       /**/  stop_time = MPI_Wtime(); /**/
       /********************************/      
      
        /*measured time*/
        elapsed_time = stop_time - start_time;  
        long int num_B = 8 * N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) 
            printf("Transfer size (Bytes): %10li, Transfer Time (seconds): %15.9f, Bandwidth (GB/s): %15.9f\n", 
                   num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );  

        free(A);   
    }

    MPI_Finalize();

    return 0;
}

#### Run the Code

In [None]:
!mpicc ping-pong-MPI.c -o ping-pong-MPI

In [None]:
!mpirun -np 2 ./ping-pong-MPI

### MPI + CUDA

Now that we are familiar with the basic *ping-pong* code in MPI, let us look at a version that includes GPUs with CUDA. In this example, we still pass data back and forth between two MPI ranks, but this time the data resides in GPU memory. More specifically, rank 0 has a memory buffer on GPU 0, rank 1 has a memory buffer on GPU 1, and the data is passed between the memories of the two GPUs. Here, to get data from GPU 0 to GPU 1, we first stage it in CPU (*host*) memory. The timed loop below shows the difference from the previous version:

```cpp
 start_time = MPI_Wtime();
 for(int i = 1; i <= loop_count; i++)
 {
  if(rank == 0)
  {
   cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
   MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
   MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
   cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
   }else if(rank == 1)
    {
     MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
     cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
     cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
     MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
    }
 }
 stop_time = MPI_Wtime();
```

Similar to the CPU-only version, we initialize MPI and find the identifier of each MPI rank, but here we also assign each rank a different GPU (i.e., rank 0 is assigned to GPU 0 and rank 1 is mapped to GPU 1).

```cpp
int size, rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Status status;

cudaSetDevice(rank);
```
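Note that `cudaSetDevice(rank)` relies on every rank number being a valid device index on its node, which holds for the two-rank, single-node run used here. A minimal sketch of a more defensive mapping, assuming the ranks on a node are numbered consecutively, wraps the rank around the number of visible GPUs. The helper `device_for_rank` is hypothetical; in a real program, `device_count` would come from `cudaGetDeviceCount`.

```c
#include <assert.h>

/* Hypothetical helper: choose a GPU index for a rank by wrapping around
   the number of GPUs visible on the node (round-robin assignment). */
int device_for_rank(int rank, int device_count)
{
    assert(rank >= 0 && device_count > 0);
    return rank % device_count;
}
```

With two GPUs, ranks 0 and 1 map to devices 0 and 1, exactly as in the notebook. The returned index would then be passed to `cudaSetDevice`.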

For this version, the program does the following:

- We enter the main `for` *loop*, which iterates over the different message sizes, and allocate and initialize the __A__ array. However, we now also call `cudaMalloc` to reserve a memory buffer __d_A__ on the GPU and `cudaMemcpy` to transfer the data initialized in the __A__ array to the buffer __d_A__. This `cudaMemcpy` call is needed to get the data onto the GPU before we start our *ping-pong*.

- In each iteration of the timed loop, rank 0 first copies the data from the buffer in GPU 0 memory to the one in CPU memory, and then an MPI call sends it to rank 1. Once rank 1 has the data in its CPU memory buffer, it copies it to GPU 1 memory. The return trip mirrors these steps: rank 1 copies the data back to CPU memory and sends it to rank 0, which receives it and copies it to GPU 0 memory.

As in the CPU-only case, we calculate the bandwidth from the timing results and the known size of the data transfers, print the results, and free the allocated memory. Finally, we finalize MPI and exit the program.

In [None]:
%%writefile ping-pong-MPI+CUDA.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status status;

    cudaSetDevice(rank);

    double start_time, stop_time, elapsed_time;

    for(int i = 0; i <= 27; i++)
    {
        long int N = 1 << i; /*Loop from 8 Bytes to 1 GB*/
   
        double *A = (double*)calloc(N, sizeof(double)); /*Allocate memory for A on CPU*/

        double *d_A;

        cudaMalloc(&d_A, N * sizeof(double)) ;
        cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);

        int tag1 = 1000;
        int tag2 = 2000;

        int loop_count = 50;

       /********************************/      
       /**/ start_time = MPI_Wtime();/**/
       /********************************/

        for(int i = 1; i <= loop_count; i++)
        {
            if(rank == 0)
            {
                cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
                MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
                cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
            }
            else if(rank == 1)
            {
                MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
                cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
                cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
                MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

       /**********************************/      
       /**/  stop_time = MPI_Wtime(); /**/
       /*********************************/

        /*measured time*/
        elapsed_time = stop_time - start_time;
        long int num_B = 8 * N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) 
          printf("Transfer size (Bytes): %10li, Transfer Time (seconds): %15.9f, Bandwidth (GB/s): %15.9f\n", 
                    num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );

        cudaFree(d_A);
        free(A);
    }

    MPI_Finalize();

    return 0;
}

#### Run the Code

##### Compile with Shell Script

In [None]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh ogbon"
}

ogbon()
{
 nvcc -I/opt/share/openmpi/4.1.1-cuda/include -L/opt/share/openmpi/4.1.1-cuda/lib64 -lmpi -o ping-pong-MPI+CUDA ping-pong-MPI+CUDA.cu
}

#args in command line
if [ "$#" -eq 0 ]; then
 usage
 exit 1
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

In [None]:
!bash howtocompile.sh ogbon

##### Execute with Shell Script

In [None]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh ogbon"
}

ogbon()
{
 sbatch slurm-MPI+CUDA.sh
}

localnode()
{
 mpirun -np 2 --report-bindings --map-by numa -x UCX_MEMTYPE_CACHE=n  -mca pml ucx -mca btl ^vader,tcp,openib,smcuda -x UCX_NET_DEVICES=mlx5_0:1 ./ping-pong-MPI+CUDA
}

#args in command line
if [ "$#" -eq 0 ]; then
 usage
 exit 1
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

#localnode
if [[ $1 == "localnode" ]];then
 localnode
fi

In [None]:
!bash howtoexecute.sh localnode

### CUDAWARE-MPI

Before looking at this code example, let us first describe [CUDAWARE-MPI](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/) and [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). CUDAWARE-MPI is an MPI implementation that allows GPU buffers (e.g., GPU memory allocated with `cudaMalloc`) to be used directly in MPI calls. However, CUDAWARE-MPI alone does not specify whether the data is staged through CPU memory or passed directly from GPU to GPU. That depends on the hardware and software configuration of the execution environment.

GPUDirect is an umbrella name for several specific technologies. In MPI, the GPUDirect technologies cover all kinds of inter-rank communication: intra-node, inter-node, and RDMA inter-node communication. Now let us take a look at the code below. It is the same as the MPI+CUDA version tested above, but now there are no calls to `cudaMemcpy` during the ping-pong steps. Instead, we use our GPU buffers (__d_A__) directly in the MPI calls:

```cpp
    start_time = MPI_Wtime();
    for(int i = 1; i <= loop_count; i++)
    {
      if(rank == 0)
      {
        MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
      }else if(rank == 1)
       {
         MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
         MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
       }
    }
    stop_time = MPI_Wtime();
```

In [None]:
%%writefile ping-pong-CUDAWARE-MPI.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[]){

    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status status;

    cudaSetDevice(rank);

    double start_time, stop_time, elapsed_time;

    for(int i = 0; i <= 27; i++)
    {
        long int N = 1 << i; /*Loop from 8 Bytes to 1 GB*/
   
        double *A = (double*)calloc(N, sizeof(double)); /*Allocate memory for A on CPU*/

        double *d_A;

        cudaMalloc(&d_A, N * sizeof(double)) ;
        cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);

        int tag1 = 1000;
        int tag2 = 2000;

        int loop_count = 50;

       /********************************/      
       /**/ start_time = MPI_Wtime();/**/
       /********************************/

        for(int i = 1; i <= loop_count; i++)
        {
            if(rank == 0)
            {
              MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
              MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
            }
            else if(rank == 1)
            {
              MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
              MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
         }

       /**********************************/      
       /**/  stop_time = MPI_Wtime(); /**/
       /*********************************/

        /*measured time*/
        elapsed_time = stop_time - start_time;
        long int num_B = 8 * N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) 
            printf("Transfer size (Bytes): %10li, Transfer Time (seconds): %15.9f, Bandwidth (GB/s): %15.9f\n", 
                    num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );

        cudaFree(d_A);
        free(A);
    }

    MPI_Finalize();

    return 0;
}

#### Run the Code

##### Compile with Shell Script 

In [None]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh ogbon"
}

ogbon()
{
 nvcc -I/opt/share/openmpi/4.1.1-cuda/include -L/opt/share/openmpi/4.1.1-cuda/lib64 -lmpi ping-pong-CUDAWARE-MPI.cu -o ping-pong-CUDAWARE-MPI
}

#args in command line
if [ "$#" -eq 0 ]; then
 usage
 exit 1
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

In [None]:
!bash howtocompile.sh ogbon

##### Execute with Shell Script

In [None]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh ogbon"
}

ogbon()
{
 sbatch slurm-CUDAWARE-MPI.sh
}

localnode()
{
 mpirun -np 2 --report-bindings --map-by numa -x UCX_MEMTYPE_CACHE=n -mca pml ucx -mca btl ^vader,tcp,openib,smcuda -x UCX_NET_DEVICES=mlx5_0:1 ./ping-pong-CUDAWARE-MPI
}

#args in command line
if [ "$#" -eq 0 ]; then
 usage
 exit 1
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

#localnode
if [[ $1 == "localnode" ]];then
 localnode
fi

In [None]:
!bash howtoexecute.sh localnode

## Exercise 1: Matrix Multiply in CUDAWARE-MPI

Program a matrix multiplication using CUDAWARE-MPI in which the matrices are generated in process 0 and all processes work on parts of the multiplication. The final result must be gathered in process 0. Implement it with a row-block distribution and compare the execution times while varying the problem size.
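
As a starting hint, a row-block distribution can be described by two small index calculations. The sketch below is only one possible convention, in which any leftover rows go to the lowest-numbered ranks; the helper names `block_rows` and `block_first_row` are hypothetical and not part of any library.

```c
#include <assert.h>

/* Hypothetical helper: number of rows owned by `rank` when an n-row matrix
   is split into row blocks over `size` ranks. The first (n % size) ranks
   receive one extra row. */
int block_rows(int n, int size, int rank)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    return n / size + (rank < n % size ? 1 : 0);
}

/* Hypothetical helper: index of the first row owned by `rank`. */
int block_first_row(int n, int size, int rank)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    int base = n / size;   /* rows every rank receives */
    int extra = n % size;  /* leftover rows, one each for the lowest ranks */
    return rank * base + (rank < extra ? rank : extra);
}
```

For example, 10 rows over 4 ranks gives blocks of 3, 3, 2 and 2 rows starting at rows 0, 3, 6 and 8. Multiplied by the number of columns, these values could serve as the counts and displacements for `MPI_Scatterv` and `MPI_Gatherv`.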

## References

* OpenMPI, https://www.open-mpi.org/doc/current/man1/mpirun.1.php (accessed January 12, 2023).

* CUDAWARE, https://github.com/olcf-tutorials/MPI_ping_pong (accessed January 16, 2023).



