#  Case Study: Monte Carlo Approximation of $\pi$ - Multiple GPUs

In this notebook we refactor the single-GPU implementation of the Monte Carlo approximation of $\pi$ to run on multiple GPUs by looping over the available GPU devices and performing work on each. While this is a perfectly valid technique, this notebook begins to show that it can quickly add significant complexity to your code.

## Objectives

By the time you complete this notebook you will:

- Be able to utilize multiple GPUs by looping over them to perform work on each.

## Extending to Multiple GPUs

A simple way to extend our example to multiple GPUs is to use a single host process that manages all of them. If we have *M* GPUs and *N* sample points to evaluate, we can assign *N/M* points to each GPU and, in principle, calculate the result up to *M* times faster.

To enact this approach, we:
- Use `cudaGetDeviceCount` to ascertain the number of available GPUs.
- Loop over the number of GPUs, using `cudaSetDevice` in each loop iteration.
- Perform the correct fraction of the work for the set GPU.

```cpp
int device_count;
cudaGetDeviceCount(&device_count);

for (int i = 0; i < device_count; ++i) 
{
    cudaSetDevice(i);
    // Do a single GPU's worth of work.
}
```

## Refactor to Multiple GPUs

Note that in this example each GPU receives a different seed for the random number generator, so each GPU does different work. As a result, the estimate will differ slightly from the single-GPU result. The solution code is available here: [Monte Carlo Multi-GPU](solutions/monte_carlo_mgpu_cuda.cpp).

In [None]:
%%writefile monte_carlo_mgpu_cuda.cu
#include <cmath>     // M_PI, std::abs
#include <cstdlib>   // malloc, free
#include <iostream>
#include <curand_kernel.h>
#define N (1024 * 1024)

__global__ void calculate_pi(int* hits, int device) 
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Initialize random number state (unique for every thread in the grid)
    int seed = device;
    int offset = 0;
    curandState_t curand_state;
    curand_init(seed, idx, offset, &curand_state);

    // Generate random coordinates within (0.0, 1.0]
    float x = curand_uniform(&curand_state);
    float y = curand_uniform(&curand_state);

    // Increment hits counter if this point is inside the circle
    if (x * x + y * y <= 1.0f) 
        atomicAdd(hits, 1);
    
}

int main(int argc, char** argv) 
{
    // Determine number of GPUs
    int device_count;
    cudaGetDeviceCount(&device_count);

    std::cout << "Using " << device_count << " GPUs" << std::endl;

    // Allocate host and device values (one per GPU)
    int** hits = (int**) malloc(device_count * sizeof(int*));
    for (int i = 0; i < device_count; ++i) 
        hits[i] = (int*) malloc(sizeof(int));
    

    int** d_hits = (int**) malloc(device_count * sizeof(int*));
    for (int i = 0; i < device_count; ++i) 
    {
        cudaSetDevice(i);
        cudaMalloc((void**) &d_hits[i], sizeof(int));
    }

    // Initialize number of hits and copy to device
    for (int i = 0; i < device_count; ++i) 
    {
        *hits[i] = 0;
        cudaSetDevice(i);
        cudaMemcpy(d_hits[i], hits[i], sizeof(int), cudaMemcpyHostToDevice);
    }

    // Launch kernel to do the calculation
    int threads_per_block = 256;
    int blocks = (N / device_count + threads_per_block - 1) / threads_per_block;

    // Allow for asynchronous execution by launching all kernels first
    // and then synchronizing on all devices after.
    for (int i = 0; i < device_count; ++i) 
    {
        cudaSetDevice(i);
        calculate_pi<<<blocks, threads_per_block>>>(d_hits[i], i);
    }

    for (int i = 0; i < device_count; ++i) 
    {
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }

    // Copy final result back to the host
    for (int i = 0; i < device_count; ++i) 
    {
        cudaSetDevice(i);
        cudaMemcpy(hits[i], d_hits[i], sizeof(int), cudaMemcpyDeviceToHost);
    }

    // Sum number of hits over all devices
    int hits_total = 0;
    for (int i = 0; i < device_count; ++i) 
        hits_total += *hits[i];

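    // Calculate final value of pi. Each thread evaluates one sample, so divide by
    // the number of threads actually launched, which can exceed N after rounding up.
    int samples_total = blocks * threads_per_block * device_count;
    float pi_est = (float) hits_total / (float) samples_total * 4.0f;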

    // Print out result
    std::cout << "Estimated value of pi = " << pi_est << std::endl;
    std::cout << "Error = " << std::abs((M_PI - pi_est) / pi_est) << std::endl;

    // Clean up
    for (int i = 0; i < device_count; ++i) 
    {
        free(hits[i]);
        cudaFree(d_hits[i]);
    }
    free(hits);
    free(d_hits);

    return 0;
}

#### Run the Code

##### Compile with Shell Script

In [None]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh sdumont"
}

sdumont()
{
 module load openmpi/gnu/4.1.4+cuda-11.2
 nvcc monte_carlo_mgpu_cuda.cu -o monte_carlo_mgpu_cuda $CPPFLAGS $LDFLAGS
}

# Arguments on the command line
if [ "$#" -eq 0 ]; then
 usage
 exit
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtocompile.sh sdumont

##### Execute with Shell Script

In [None]:
%%writefile v100-MonteCarloMultiGPU.sh
#!/bin/bash

#SBATCH --job-name=MonteCarloMultiGPU               # Job name
#SBATCH --nodes=1                                   # Run on 1 node  
#SBATCH --partition=sequana_gpu_dev                 # Partition SDUMONT
#SBATCH --output=out_v100_%j-MonteCarloMultiGPU.log # Standard output and error log
#SBATCH --ntasks-per-node=1                         # 1 task per node

module load openmpi/gnu/4.1.4+cuda-11.2
./monte_carlo_mgpu_cuda

In [None]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh sdumont"
}

sdumont()
{
 sbatch v100-MonteCarloMultiGPU.sh
}

# Arguments on the command line
if [ "$#" -eq 0 ]; then
 usage
 exit
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtoexecute.sh sdumont

#### Print output in log file

In [None]:
!cat *-MonteCarloMultiGPU.log

## Clear the Memory

Before moving on, please execute the following cell to restart the notebook kernel and clear up the CPU memory. This is required to move on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

Please continue to the next notebook: [_6-SDumont-MCπ-CUDAWARE-MPI.ipynb_](6-SDumont-MCπ-CUDAWARE-MPI.ipynb).