# Case Study: Monte Carlo Approximation of $\pi$ - Single GPU

We will use the highly parallelizable [Monte Carlo approximation of $\pi$](https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle's_area) algorithm to introduce several multi-GPU programming motifs. In this notebook we introduce the algorithm and begin our exploration by running it on a single GPU.

## Objectives

By the time you complete this notebook you will:

- Understand the key features of the Monte Carlo approximation of $\pi$ algorithm.
- Be familiar with a single-GPU CUDA implementation of the algorithm, which serves as the starting point for exploring several multi-GPU implementations.

## The Algorithm at a High Level

A [well-known technique](https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle's_area) for numerically estimating $\pi$ is to select a large number of random points within the [unit square](https://en.wikipedia.org/wiki/Unit_square) and count the fraction that fall within the quarter of the [unit circle](https://en.wikipedia.org/wiki/Unit_circle) that lies inside that square. Since the area of the square is 1 and the area of the quarter circle is $\pi / 4$, the fraction of points that fall in the circle, multiplied by 4, is a good approximation of $\pi$.

<center><img src="https://upload.wikimedia.org/wikipedia/commons/8/84/Pi_30K.gif" width="600">

© [User:nicoguaro](https://commons.wikimedia.org/wiki/User:Nicoguaro) / [Wikimedia Commons](https://commons.wikimedia.org/wiki/Main_Page) / [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/deed.en)
</center>

## Highly Parallelizable

A nice property of this algorithm from the perspective of parallel programming is that each random point can be evaluated independently. A point's coordinates $(x, y)$ are all we need to decide whether it falls within the circle: if $x^2 + y^2 \le 1$, it does, and we increment our counter of points within the circle. The only shared state is that counter, so we must handle any race conditions that arise when many threads update it at once.

## A Single GPU Implementation

Let's see how this looks in CUDA for a single GPU. We have provided a sample implementation: execute the following cell to write `monte_carlo_pi_cuda.cu`, then review each part of the code.

In [None]:
%%writefile monte_carlo_pi_cuda.cu
#include <cmath>     // std::abs, M_PI
#include <cstdlib>   // malloc, free
#include <iostream>
#include <curand_kernel.h>
#define N (1024*1024)

__global__ void calculate_pi(int* hits) 
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Initialize random number state (unique for every thread in the grid)
    int seed = 0;
    int offset = 0;
    curandState_t curand_state;
    curand_init(seed, idx, offset, &curand_state);

    // Generate random coordinates within (0.0, 1.0] (curand_uniform excludes 0.0 and includes 1.0)
    float x = curand_uniform(&curand_state);
    float y = curand_uniform(&curand_state);

    // Increment hits counter if this point is inside the circle
    if (x * x + y * y <= 1.0f) 
        atomicAdd(hits, 1);    
}

int main(int argc, char** argv) 
{
    // Allocate host and device values
    int* hits;
    hits = (int*) malloc(sizeof(int));

    int* d_hits;
    cudaMalloc((void**) &d_hits, sizeof(int));

    // Initialize number of hits and copy to device
    *hits = 0;
    cudaMemcpy(d_hits, hits, sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel to do the calculation
    int threads_per_block = 256;
    int blocks = (N + threads_per_block - 1) / threads_per_block;

    calculate_pi<<<blocks, threads_per_block>>>(d_hits);
    cudaDeviceSynchronize();

    // Copy final result back to the host
    cudaMemcpy(hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);

    // Calculate final value of pi
    float pi_est = (float) *hits / (float) (N) * 4.0f;

    // Print out result
    std::cout << "Estimated value of pi = " << pi_est << std::endl;
    std::cout << "Error = " << std::abs((M_PI - pi_est) / M_PI) << std::endl;  // relative to the true value

    // Clean up
    free(hits);
    cudaFree(d_hits);

    return 0;
}

Note that this code is meant for instructional purposes only; it is not meant to be especially high performance. In particular:

- We are using the [device-side API](https://docs.nvidia.com/cuda/curand/device-api-overview.html) of [cuRAND](https://developer.nvidia.com/curand) to generate random numbers directly in the kernel. It is OK if you are unfamiliar with cuRAND; just know that every CUDA thread will have its own unique random numbers.
- Every thread evaluates only a single point, so the arithmetic intensity is quite low.
- We will have a lot of atomic collisions while updating the `hits` counter.

Nevertheless, we can quickly estimate $\pi$ using about one million sample points, and we should see a relative error of only about 0.05% compared to the correct value.

#### Run the Code

##### Compile with Shell Script

In [None]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh sdumont"
}

sdumont()
{
 module load openmpi/gnu/4.1.4+cuda-11.2
 nvcc monte_carlo_pi_cuda.cu -o monte_carlo_pi_cuda $CPPFLAGS $LDFLAGS
}

# Require at least one command-line argument
if [ "$#" -eq 0 ]; then
 usage
 exit 1
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtocompile.sh sdumont

##### Execute with Shell Script

In [None]:
%%writefile v100-MonteCarlo1GPU.sh
#!/bin/bash

#SBATCH --job-name=MonteCarlo1GPU               # Job name
#SBATCH --nodes=1                               # Run on 1 node  
#SBATCH --partition=sequana_gpu_dev             # Partition SDUMONT
#SBATCH --output=out_v100_%j-MonteCarlo1GPU.log # Standard output and error log
#SBATCH --ntasks-per-node=1                     # 1 task per node

module load openmpi/gnu/4.1.4+cuda-11.2
./monte_carlo_pi_cuda

In [None]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh sdumont"
}

sdumont()
{
 sbatch v100-MonteCarlo1GPU.sh
}

# Require at least one command-line argument
if [ "$#" -eq 0 ]; then
 usage
 exit 1
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtoexecute.sh sdumont

#### Print output in log file

In [None]:
!cat *-MonteCarlo1GPU.log

## Clear the Memory

Before moving on, please execute the following cell to restart the kernel and release the memory it holds. This is required before moving on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

Please continue to the next notebook: [_5-SDumont-MCπ-MGPU.ipynb_](5-SDumont-MCπ-MGPU.ipynb).