# Case Study: Monte Carlo Approximation of $\pi$ - CUDAWARE-MPI

In this notebook we will introduce CUDAWARE-MPI, which gives us the benefits of the SPMD paradigm while retaining direct peer-to-peer memory access between multiple GPUs.

## Objectives

By the time you complete this notebook you will:

- Be able to use CUDAWARE-MPI to run multiple copies of a CUDA application on multiple GPUs while leveraging direct peer-to-peer memory access between GPUs.

## CUDAWARE-MPI

MPI helped clean up much of the boilerplate we used when managing multiple devices explicitly. But we also gave up the benefit of multiple GPUs talking to each other directly. MPI is a [distributed memory parallel programming model](https://en.wikipedia.org/wiki/Distributed_memory), where each processor has its own (virtual) memory and address space, even if all ranks are on the same server and thus share the same physical memory. (This is typically contrasted with [shared memory parallel programming models](https://en.wikipedia.org/wiki/Shared_memory), where each processing thread has access to the same memory space, like [OpenMP](https://en.wikipedia.org/wiki/OpenMP), and also like traditional single-GPU CUDA programming where all threads have access to global memory.) So we copied the result for each GPU to the CPU and then summed the results on the CPU.

But as long as we're staying on a single server the rules of the [CUDA unified virtual address space](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space) still hold, so all *CUDA* allocations result in virtual addresses that can be meaningfully shared across processes (even if the normal CPU dynamic memory allocations cannot be). As a result, it's possible for MPI to directly implement peer memory copies under the hood. For communication among remote servers this is not possible, but there are other technologies that allow direct GPU to GPU communication through a network interface, in particular [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). Recognizing the value in leveraging these technologies for efficient communication, many MPI implementations (including OpenMPI, which we are using in this workshop) provide [CUDAWARE-MPI](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/), which allows the programmer to pass an MPI communication routine a buffer address that may reside in device memory. The MPI implementation is then free to use whatever scheme it desires to transmit the data from one GPU to another, including the use of GPUDirect P2P and GPUDirect RDMA where appropriate. (Note that [GPUDirect](https://developer.nvidia.com/gpudirect) refers to a family of technologies while CUDAWARE-MPI refers to an API which may use those technologies under the hood, although it is common to see the two terms incorrectly conflated.)

<center><img src="images/GPUDirectRDMA.png" width="1000"></center>

So CUDAWARE-MPI provides the benefit of simplified programming while retaining the performance benefit of avoiding unnecessary copies to CPU memory. With that in mind, one way to write the final reduction is...

```cpp
MPI_Reduce(d_hits, total_hits, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
```

where MPI automatically detects that the send buffer `d_hits` resides on the device while the receive buffer `total_hits` resides on the host and does the right thing behind the scenes to enable this copy.

## Perform Reduction Entirely on GPUs with CUDAWARE-MPI

Now, rewrite this application to do the reduction entirely in GPU memory.

To do this, allocate a device array to hold the total number of hits, pass it as the receive buffer in the call to `MPI_Reduce`, and then explicitly copy the result back to the host on rank 0 at the end.

You can check the solution in the file `monte_carlo_mgpu_cuda_mpi_cuda_aware.cu`.

In [None]:
%%writefile monte_carlo_mgpu_cuda_mpi_cuda_aware.cu
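#include <cmath>      // std::abs, M_PI
#include <cstdlib>    // malloc, free
#include <iostream>
#include <curand_kernel.h>
#include <mpi.h>

// Total number of samples (parenthesized so the macro expands safely)
#define N (1024*1024)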

__global__ void calculate_pi(int* hits, int device) 
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Initialize random number state (unique for every thread in the grid)
    int seed = device;
    int offset = 0;
    curandState_t curand_state;
    curand_init(seed, idx, offset, &curand_state);

    // Generate random coordinates within (0.0, 1.0]
    float x = curand_uniform(&curand_state);
    float y = curand_uniform(&curand_state);

    // Increment hits counter if this point is inside the circle
    if (x * x + y * y <= 1.0f) 
        atomicAdd(hits, 1);
}

int main(int argc, char** argv)
{
    // Initialize MPI
    MPI_Init(&argc, &argv);

    // Obtain our rank and the total number of ranks
    // MPI_COMM_WORLD means that we want to include all processes
    // (it is possible in MPI to create "communicators" that only
    // include some of the ranks).

    int rank, num_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

    // Ensure that we don't have more ranks than GPUs
    int device_count;
    cudaGetDeviceCount(&device_count);

    if (num_ranks > device_count) 
    {
        std::cout << "Error: more MPI ranks than GPUs" << std::endl;
        // Shut MPI down cleanly before exiting
        MPI_Finalize();
        return -1;
    }

    // Each rank (arbitrarily) chooses the GPU corresponding to its rank
    int dev = rank;
    cudaSetDevice(dev);

    // Allocate host and device values
    int* hits;
    hits = (int*) malloc(sizeof(int));

    int* d_hits;
    cudaMalloc((void**) &d_hits, sizeof(int));

    // Initialize number of hits and copy to device
    *hits = 0;
    cudaMemcpy(d_hits, hits, sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel to do the calculation
    int threads_per_block = 256;
    // Split the N samples evenly across the MPI ranks (one GPU per rank)
    int blocks = (N / num_ranks + threads_per_block - 1) / threads_per_block;

    calculate_pi<<<blocks, threads_per_block>>>(d_hits, dev);
    cudaDeviceSynchronize();

    // Accumulate the results across all ranks to the result on rank 0
    int* d_total_hits;
    cudaMalloc((void**) &d_total_hits, sizeof(int));

    int root = 0;
    MPI_Reduce(d_hits, d_total_hits, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);

    if (rank == root) 
    {
        // Copy result back to host
        int* total_hits = (int*) malloc(sizeof(int));
        cudaMemcpy(total_hits, d_total_hits, sizeof(int), cudaMemcpyDeviceToHost);

        // Calculate final value of pi
        float pi_est = (float) *total_hits / (float) (N) * 4.0f;
        free(total_hits);

        // Print out result
        std::cout << "Estimated value of pi = " << pi_est << std::endl;
        std::cout << "Error = " << std::abs((M_PI - pi_est) / pi_est) << std::endl;
    }

    // Clean up
    free(hits);
    cudaFree(d_hits);
    cudaFree(d_total_hits);

    // Finalize MPI
    MPI_Finalize();

    return 0;
}

#### Run the Code

##### Compile with Shell Script 

In [None]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh sdumont"
}

sdumont()
{
 module load openmpi/gnu/4.1.4+cuda-11.2
 nvcc $CPPFLAGS $LDFLAGS -lmpi -ccbin=mpicxx monte_carlo_mgpu_cuda_mpi_cuda_aware.cu -o monte_carlo_mgpu_cuda_mpi_cuda_aware
}

#args in command line
if [ "$#" ==  0 ]; then
 usage
 exit
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtocompile.sh sdumont

##### Execute with Shell Script

In [None]:
%%writefile Slurm-MONTECARLO-CUDAWARE-MPI.sh
#!/bin/sh

#SBATCH --job-name=MONTECARLO-CUDAWARE-MPI                # Job name
#SBATCH --nodes=2                                         # Run all processes on 2 nodes  
#SBATCH --partition=sequana_gpu_dev                       # Partition SDUMONT
#SBATCH --output=out_v100_%j-MONTECARLO-CUDAWARE-MPI.log  # Standard output and error log
#SBATCH --ntasks-per-node=1                               # 1 task (MPI rank) per node

module load openmpi/gnu/4.1.4+cuda-11.2
mpirun -np 2 --report-bindings --map-by numa -x UCX_MEMTYPE_CACHE=n -mca pml ucx -mca btl ^vader,tcp,openib,smcuda -x UCX_NET_DEVICES=mlx5_0:1 ./monte_carlo_mgpu_cuda_mpi_cuda_aware

In [None]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh sdumont"
}

sdumont()
{
 sbatch Slurm-MONTECARLO-CUDAWARE-MPI.sh
}

#args in command line
if [ "$#" ==  0 ]; then
 usage
 exit
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtoexecute.sh sdumont

#### Print output in log file

In [None]:
!cat *-MONTECARLO-CUDAWARE-MPI.log

## Clear the Memory

Before moving on, please execute the following cell to clear up the CPU memory. This is required to move on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

Please continue to the next notebook: [_7-SDumont-MCπ-NVSHMEM.ipynb_](7-SDumont-MCπ-NVSHMEM.ipynb).