# CUDA-aware MPI on Multi-GPU

In this notebook we will introduce how MPI and CUDA work together, how efficient the combination is, and how to use the CUDA-aware MPI API.

## Objectives

By the time you complete this notebook you will:

- Understand the concepts of MPI, CUDA, and CUDA-aware MPI on multiple GPUs.
- Understand the CUDA-aware MPI API.

## Ping-Pong Benchmarks

In this notebook, we will look at a simple *ping pong* code that measures the bandwidth for data transfers between 2 MPI processes. We will look at the following versions:

- A first version using CPU with __MPI__;
- A second version with __MPI + CUDA__ between two GPUs, which stages the data through CPU memory;
- And a last one that uses __CUDA-aware MPI__, which exchanges data directly between GPUs using GPUDirect or NVLink.

### MPI

We will start by looking at a CPU-only version of the code to understand the idea behind a simple data transfer program (*ping-pong*). The two MPI processes pass data back and forth, and the bandwidth is calculated from the measured transfer time, since the size of the data being transferred is known. Let us look at the `ping-pong-MPI.c` code to see how it is implemented. At the top of the main program, we initialize MPI and determine the total number of processes and the rank identifier of each one. The *ping-pong* expects exactly two ranks, so the program must be launched with two processes:

```cpp
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status status; 
```

We then enter the main `for` *loop*, where each iteration performs the data transfers and the bandwidth calculation for a different message size, ranging from 8 bytes to 1 GB:

```cpp
   for(int i = 0; i <= 27; i++)
   {
     long int N = 1 << i; /* doubles per message; the rest of the body follows below */
   }
```

Next, we allocate and zero-initialize the *A* array and define the tags that match the MPI send/receive pairs.

```cpp
   double *A = (double*) calloc (N, sizeof(double)); 
```

Basically, each iteration of the *loop* does the following:

- If the rank is 0, it first sends a message with the data from the array `A` to rank 1, then waits to receive a message from rank 1.

- If the rank is 1, it first waits to receive a message from rank 0 and then sends a message back to rank 0.

```cpp
    start_time = MPI_Wtime();
    for(int i = 1; i <= loop_count; i++)
    {
      if(rank == 0)
      {
        MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
      }
      else if(rank == 1)
      {
        MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
        MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
      }
    }
    stop_time = MPI_Wtime();
```

The two steps above make up one *ping-pong* exchange. Then, from the timing results and the known size of the data transfers, we calculate the bandwidth and print the results:

```cpp
long int num_B = 8 * N;
long int B_in_GB = 1 << 30;
double num_GB = (double)num_B / (double)B_in_GB;
double avg_time_per_transfer=elapsed_time/(2.0*(double)loop_count);
```

The cell below writes the complete program to `ping-pong-MPI.c`. Remember that MPI programs must be compiled with an MPI compiler wrapper, such as the `mpicxx` command used in the *Run the Code* step.

In [1]:
%%writefile ping-pong-MPI.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status status;

    double start_time, stop_time, elapsed_time;
       
    for(int i = 0; i <= 27; i++) 
    {
       long int N = 1 << i; /*Loop from 8 Bytes to 1 GB*/

       double *A = (double*)calloc( N, sizeof(double));  /*Allocate memory for A on CPU*/

       int tag1 = 1000;
       int tag2 = 2000;

       int loop_count = 50;

       /********************************/      
       /**/ start_time = MPI_Wtime();/**/
       /********************************/

       for(int i = 1; i <= loop_count; i++)
       {
            if(rank == 0)
            {
               MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
               MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
            }
            else if(rank == 1)
            {
               MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
               MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

       /*********************************/      
       /**/  stop_time = MPI_Wtime(); /**/
       /********************************/      
      
        /*measured time*/
        elapsed_time = stop_time - start_time;  
        long int num_B = 8 * N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) 
            printf("Transfer size (Bytes): %10li, Transfer Time (seconds): %15.9f, Bandwidth (GB/s): %15.9f\n", 
                   num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );  

        free(A);   
    }

    MPI_Finalize();

    return 0;
}

Overwriting ping-pong-MPI.c


#### Run the Code

In [3]:
!mpicxx ping-pong-MPI.c -o ping-pong-MPI

In [11]:
!mpirun -np 2 ./ping-pong-MPI

[c018:61156] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../..]
[c018:61156] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]], socket 1[core 20[hwt 0-1]], socket 1[core 21[hwt 0-1]], socket 1[core 22[hwt 0-1]], socket 1[core 23[hwt 0-1]], socket 1[core 24[hwt 0-1]], socket 1[core 25[hwt 0-1]], socket 1[core 26[hwt 0-1]], socket 1[core 27[hwt 0-1]], socket 1[core 28[hwt 0-1]], socket 1[core 29[hwt 

---
### Exercise: Re-configuration of the processes using MPI

- Add the options `--report-bindings` and `--map-by numa` to the `mpirun` command when launching the program:

In [None]:
!mpirun --report-bindings --map-by numa -np 2 ./ping-pong-MPI

- Explain why the measured bandwidth values differ when these options are used.

### MPI + CUDA

Now that we are familiar with the basic *ping-pong* code in MPI, let us look at a version that includes GPUs with CUDA. In this example, we are still passing data back and forth between two MPI ranks, but this time the data lives in GPU memory. More specifically, rank 0 has a memory buffer on GPU 0, rank 1 has a memory buffer on GPU 1, and they pass the data between the memories of the two GPUs. Here, to get data from the memory of GPU 0 to GPU 1, we first stage it in CPU (*host*) memory. The timed loop below shows the differences between the previous version and the new MPI+CUDA version:

```cpp
 start_time = MPI_Wtime();
 for(int i = 1; i <= loop_count; i++)
 {
   if(rank == 0)
   {
     cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
     MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
     MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
     cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
   }
   else if(rank == 1)
   {
     MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
     cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
     cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
     MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
   }
 }
 stop_time = MPI_Wtime();
```

Similar to the CPU-only version, we initialize MPI and find the identifier of each MPI rank, but here we also assign each rank a different GPU (i.e., rank 0 is assigned to GPU 0 and rank 1 is mapped to GPU 1).

```cpp
int size, rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Status status;

cudaSetDevice(rank);
```

In this version, each iteration of the *loop* does the following:

- We enter the main `for` *loop*, which iterates over the different message sizes, and allocate and initialize the __A__ array. However, we now also call `cudaMalloc` to reserve a memory buffer __d_A__ on the GPU and `cudaMemcpy` to transfer the data initialized in the __A__ array to the buffer __d_A__. We need this `cudaMemcpy` to get the data onto the GPU before we start our *ping-pong*.

- In the *ping-pong* itself, rank 0 must first transfer the data from its buffer in GPU 0 memory to a buffer in CPU memory. Then an MPI call passes the data from rank 0 to rank 1. Now that rank 1 has the data in a CPU memory buffer, it can transfer it to GPU 1 memory. The return trip repeats the same steps in the opposite direction.

As in the CPU-only case, we calculate the bandwidth from the timing results and the known size of the data transfers, print the results, and free the memory on both the host and the GPU. Finally, we finalize MPI and end the program.

In [5]:
%%writefile ping-pong-MPI+CUDA.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status status;

    cudaSetDevice(rank);

    double start_time, stop_time, elapsed_time;

    for(int i = 0; i <= 27; i++)
    {
        long int N = 1 << i; /*Loop from 8 Bytes to 1 GB*/
   
        double *A = (double*)calloc(N, sizeof(double)); /*Allocate memory for A on CPU*/

        double *d_A;

        cudaMalloc(&d_A, N * sizeof(double)) ;
        cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);

        int tag1 = 1000;
        int tag2 = 2000;

        int loop_count = 50;

       /********************************/      
       /**/ start_time = MPI_Wtime();/**/
       /********************************/

        for(int i = 1; i <= loop_count; i++)
        {
            if(rank == 0)
            {
                cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
                MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
                cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
            }
            else if(rank == 1)
            {
                MPI_Recv(A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
                cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);
                cudaMemcpy(A, d_A, N * sizeof(double), cudaMemcpyDeviceToHost);
                MPI_Send(A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

       /**********************************/      
       /**/  stop_time = MPI_Wtime(); /**/
       /*********************************/

        /*measured time*/
        elapsed_time = stop_time - start_time;
        long int num_B = 8 * N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) 
          printf("Transfer size (Bytes): %10li, Transfer Time (seconds): %15.9f, Bandwidth (GB/s): %15.9f\n", 
                    num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );

        cudaFree(d_A);
        free(A);
    }

    MPI_Finalize();

    return 0;
}

Writing ping-pong-MPI+CUDA.cu


#### Run the Code

##### Compile with Shell Script

In [6]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh ogbon"
}

ogbon()
{
 nvcc -I/opt/share/openmpi/4.1.1-cuda/include -L/opt/share/openmpi/4.1.1-cuda/lib64 -lnccl -lmpi -o ping-pong-MPI+CUDA ping-pong-MPI+CUDA.cu
}

#args in command line
if [ "$#" ==  0 ]; then
 usage
 exit
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

Writing howtocompile.sh


In [7]:
!bash howtocompile.sh ogbon

##### Execute with Shell Script

In [8]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh ogbon"
}

ogbon()
{
 sbatch slurm-MPI+CUDA.sh
}

localnode()
{
 mpirun -np 2 --report-bindings --map-by numa -x UCX_MEMTYPE_CACHE=n  -mca pml ucx -mca btl ^vader,tcp,openib,smcuda -x UCX_NET_DEVICES=mlx5_0:1 ./ping-pong-MPI+CUDA
}

#args in command line
if [ "$#" ==  0 ]; then
 usage
 exit
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

#localhost
if [[ $1 == "localnode" ]];then
 localnode
fi

Writing howtoexecute.sh


In [9]:
!bash howtoexecute.sh localnode

[c018:45711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../..]
[c018:45711] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]], socket 1[core 20[hwt 0-1]], socket 1[core 21[hwt 0-1]], socket 1[core 22[hwt 0-1]], socket 1[core 23[hwt 0-1]], socket 1[core 24[hwt 0-1]], socket 1[core 25[hwt 0-1]], socket 1[core 26[hwt 0-1]], socket 1[core 27[hwt 0-1]], socket 1[core 28[hwt 0-1]], socket 1[core 29[hwt 

### CUDA-aware MPI

Before looking at this code example, let us first describe [CUDA-aware MPI](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/) and [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). CUDA-aware MPI is an MPI implementation that allows GPU buffers (e.g., GPU memory allocated with `cudaMalloc`) to be used directly in MPI calls. However, CUDA-aware MPI alone does not specify whether the data is staged through CPU memory or passed directly from GPU to GPU. That depends on the hardware and software configuration of the execution environment.

GPUDirect is an umbrella name for several specific technologies. In MPI, the GPUDirect technologies cover all kinds of inter-rank communication: intra-node, inter-node, and RDMA inter-node communication. Now let us take a look at the code below. It is the same as the MPI+CUDA version tested above, but now there are no calls to `cudaMemcpy` during the ping-pong steps. Instead, we use our GPU buffers (__d_A__) directly in the MPI calls:

```cpp
    start_time = MPI_Wtime();
    for(int i = 1; i <= loop_count; i++)
    {
      if(rank == 0)
      {
        MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
        MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
      }
      else if(rank == 1)
      {
        MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
        MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
      }
    }
    stop_time = MPI_Wtime();
```

In [12]:
%%writefile ping-pong-CUDAWARE-MPI.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[]){

    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status status;

    cudaSetDevice(rank);

    double start_time, stop_time, elapsed_time;

    for(int i = 0; i <= 27; i++)
    {
        long int N = 1 << i; /*Loop from 8 Bytes to 1 GB*/
   
        double *A = (double*)calloc(N, sizeof(double)); /*Allocate memory for A on CPU*/

        double *d_A;

        cudaMalloc(&d_A, N * sizeof(double)) ;
        cudaMemcpy(d_A, A, N * sizeof(double), cudaMemcpyHostToDevice);

        int tag1 = 1000;
        int tag2 = 2000;

        int loop_count = 50;

       /********************************/      
       /**/ start_time = MPI_Wtime();/**/
       /********************************/

        for(int i = 1; i <= loop_count; i++)
        {
            if(rank == 0)
            {
              MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
              MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &status);
            }
            else if(rank == 1)
            {
              MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);
              MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
         }

       /**********************************/      
       /**/  stop_time = MPI_Wtime(); /**/
       /*********************************/

        /*measured time*/
        elapsed_time = stop_time - start_time;
        long int num_B = 8 * N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) 
            printf("Transfer size (Bytes): %10li, Transfer Time (seconds): %15.9f, Bandwidth (GB/s): %15.9f\n", 
                    num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );

        cudaFree(d_A);
        free(A);
    }

    MPI_Finalize();

    return 0;
}

Writing ping-pong-CUDAWARE-MPI.cu


#### Run the Code

##### Compile with Shell Script 

In [13]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh ogbon"
}

ogbon()
{
 nvcc -I/opt/share/openmpi/4.1.1-cuda/include -L/opt/share/openmpi/4.1.1-cuda/lib64 -lmpi ping-pong-CUDAWARE-MPI.cu -o ping-pong-CUDAWARE-MPI
}

#args in command line
if [ "$#" ==  0 ]; then
 usage
 exit
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

Overwriting howtocompile.sh


In [14]:
!bash howtocompile.sh ogbon

##### Execute with Shell Script

In [15]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh ogbon"
}

ogbon()
{
 sbatch slurm-CUDAWARE-MPI.sh
}

localnode()
{
 mpirun -np 2 --report-bindings --map-by numa -x UCX_MEMTYPE_CACHE=n -mca pml ucx -mca btl ^vader,tcp,openib,smcuda -x UCX_NET_DEVICES=mlx5_0:1 ./ping-pong-CUDAWARE-MPI
}

#args in command line
if [ "$#" ==  0 ]; then
 usage
 exit
fi

#ogbon
if [[ $1 == "ogbon" ]];then
 ogbon
fi

#localhost
if [[ $1 == "localnode" ]];then
 localnode
fi

Overwriting howtoexecute.sh


In [16]:
!bash howtoexecute.sh localnode

[c018:64488] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../..]
[c018:64488] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]], socket 1[core 20[hwt 0-1]], socket 1[core 21[hwt 0-1]], socket 1[core 22[hwt 0-1]], socket 1[core 23[hwt 0-1]], socket 1[core 24[hwt 0-1]], socket 1[core 25[hwt 0-1]], socket 1[core 26[hwt 0-1]], socket 1[core 27[hwt 0-1]], socket 1[core 28[hwt 0-1]], socket 1[core 29[hwt 

## Exercise 1: Intranode Performance Comparison: NCCL vs. CUDA-aware MPI

Compare the following ping-pong code, which uses NCCL within one compute node, with the previous CUDA-aware MPI implementation. The goal is to understand why the measured values differ even though both versions use the same high-speed channel inside the node.

In [17]:
%%writefile ping-pong-NCCL.cu
#include <iostream>
#include <nccl.h>
#include <cuda_runtime.h>
#include <chrono>

#define NUM_GPUS 2

__global__ void print_values(int gpu_id, float *data) {
  printf("GPU %d: %f\n", gpu_id, data[threadIdx.x]);
}

int main(int argc, char *argv[]) {
  ncclComm_t comms[NUM_GPUS];

  cudaStream_t streams[NUM_GPUS];

  // Initializing NCCL
  ncclUniqueId id;
  ncclGetUniqueId(&id);
  ncclGroupStart();
  for (int i = 0; i < NUM_GPUS; ++i) {
    cudaSetDevice(i);
    ncclCommInitRank(&comms[i], NUM_GPUS, id, i);
  }
  ncclGroupEnd();

  // Create a stream on each GPU
  for (int i = 0; i < NUM_GPUS; ++i) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
  }

  for (int i = 0; i <= 27; i++) {
    long int N = 1 << i;
    size_t numBytes = N * sizeof(float);
    float *buffers[NUM_GPUS];

    // Allocate memory on each GPU
    for (int j = 0; j < NUM_GPUS; ++j) {
      cudaSetDevice(j);
      cudaMalloc(&buffers[j], numBytes);
    }

    // Initializing data on each GPU
    for (int j = 0; j < NUM_GPUS; ++j) {
      cudaSetDevice(j);
      float *h_data = new float[N];
      for (int k = 0; k < N; ++k) h_data[k] = j + 1.0f;
      cudaMemcpy(buffers[j], h_data, numBytes, cudaMemcpyHostToDevice);
      delete[] h_data;
    }

    int loop_count = 50;

    // Performing ping-pong between GPUs and measuring time
    cudaEvent_t start, stop;
    cudaSetDevice(0);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, streams[0]);

    for (int j = 0; j < loop_count; ++j) {
      int src = j % NUM_GPUS;
      int dst = (j + 1) % NUM_GPUS;

      ncclGroupStart();
      cudaSetDevice(src);
      ncclSend(buffers[src], N, ncclFloat, dst, comms[src], streams[src]);

      cudaSetDevice(dst);
      ncclRecv(buffers[dst], N, ncclFloat, src, comms[dst], streams[dst]);
      ncclGroupEnd();
    }

    cudaEventRecord(stop, streams[0]);
    cudaEventSynchronize(stop);

    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    
    /*measured*/
    long int num_B = 8 * N;
    long int B_in_GB = 1 << 30;
    double num_GB = (double)num_B / (double)B_in_GB;
    double avg_time_per_transfer = (elapsedTime * 1e-3) / (2.0*(double)loop_count);
    float bandwidth = num_GB/avg_time_per_transfer ;
  
    printf("Transfer size (Bytes): %10li, Transfer Time (seconds): %15.9f, Bandwidth (GB/s): %15.9f\n", 
                  num_B, avg_time_per_transfer, bandwidth  );
 
    // Cleanup memory
    for (int j = 0; j < NUM_GPUS; ++j) {
      cudaSetDevice(j);
      cudaFree(buffers[j]);
    }
  }

  // Destroy NCCL communicators
  for (int i = 0; i < NUM_GPUS; ++i) 
  {
    cudaSetDevice(i);
    ncclCommDestroy(comms[i]);
  }

  return 0;
}

Writing ping-pong-NCCL.cu


### Compile and Run the Code

In [19]:
!nvcc ping-pong-NCCL.cu -o ping-pong-NCCL -lnccl -std=c++11

In [20]:
!./ping-pong-NCCL

Transfer size (Bytes):          8, Transfer Time (seconds):     0.000055060, Bandwidth (GB/s):     0.000135316
Transfer size (Bytes):         16, Transfer Time (seconds):     0.000004209, Bandwidth (GB/s):     0.003540612
Transfer size (Bytes):         32, Transfer Time (seconds):     0.000004209, Bandwidth (GB/s):     0.007081224
Transfer size (Bytes):         64, Transfer Time (seconds):     0.000004198, Bandwidth (GB/s):     0.014196990
Transfer size (Bytes):        128, Transfer Time (seconds):     0.000004157, Bandwidth (GB/s):     0.028673723
Transfer size (Bytes):        256, Transfer Time (seconds):     0.000004229, Bandwidth (GB/s):     0.056375459
Transfer size (Bytes):        512, Transfer Time (seconds):     0.000004168, Bandwidth (GB/s):     0.114413090
Transfer size (Bytes):       1024, Transfer Time (seconds):     0.000004229, Bandwidth (GB/s):     0.225501835
Transfer size (Bytes):       2048, Transfer Time (seconds):     0.000004168, Bandwidth (GB/s):     0.457652360
T

## Clear the Memory

Before moving on, please execute the following cell to clear up the CPU memory. This is required to move on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

Please continue to the next notebook: [_4-OGBON-MCπ-SGPU.ipynb_](4-OGBON-MCπ-SGPU.ipynb).