# Case Study: Monte Carlo Approximation of $\pi$ - NVSHMEM

In this notebook we will introduce NVSHMEM and make a first pass at using it in the Monte Carlo approximation of $\pi$ program.

## Objectives

By the time you complete this notebook you will:

- Understand the benefits of using NVSHMEM for multi-GPU applications.
- Be able to write, compile, and run an NVSHMEM program that utilizes multiple GPUs.

## NVSHMEM

[NVSHMEM](https://developer.nvidia.com/nvshmem) is a parallel programming model for efficient and scalable communication across multiple NVIDIA GPUs. NVSHMEM, which is based on [OpenSHMEM](http://openshmem.org/site/), provides a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams. NVSHMEM offers a compelling multi-GPU programming model for many application use cases, and is especially valuable on modern GPU servers that have a high density of GPUs per server node and complex interconnects such as [NVIDIA NVSwitch](https://www.nvidia.com/en-us/data-center/nvlink/) on the [NVIDIA DGX A100 server](https://www.nvidia.com/en-us/data-center/dgx-a100/).

<center><img src="images/NVSHMEM.png" width="1000"></center>

## Motivation for NVSHMEM

Traditionally, communication patterns involving GPUs on multiple servers may look like the following: <span style="color:limegreen">compute</span> happens on the GPU, while <span style="color:skyblue">communication</span> happens on the CPU after synchronizing the GPU (to ensure that the data we send is valid). While this approach is very easy to program, it inserts the latency of initiating the communication and/or launching the kernel on the application's critical path. We are losing out on the ability to overlap communication with compute. If we do overlap communication with compute by pipelining the work, we can partially hide the latency, but at the cost of increased application complexity.

<center><img src="images/CPU_initiated_communication.png" width="1000"/></center>

By contrast, in a model with GPU-initiated rather than CPU-initiated communication, we do *both* compute and communication directly from the GPU. We can write extremely fine-grained communication patterns this way, and we can hide communication latency by the very nature of the GPU architecture (where warps that are computing can continue on while other warps are stalled waiting for data).

<center><img src="images/GPU_initiated_communication.png" width="1000"></center>

## Launching NVSHMEM Applications

NVSHMEM, like MPI, follows the SPMD (single program, multiple data) programming style: the same executable is launched as several independent processes. NVSHMEM provides a launcher script[<sup>1</sup>](#footnote1) called `nvshmrun` that handles launching these processes. The arguments to `nvshmrun` are `-np`, the number of processes to launch, followed by the application executable and any arguments to that executable. Each independent process is called a **Processing Element (PE)** and has a unique (zero-indexed) numerical identifier associated with it[<sup>2</sup>](#footnote2).

<center><img src="images/nvshmrun.png" width="1000"></center>

## Using NVSHMEM in Code

Let's learn the mechanics of launching multiple processes with NVSHMEM in application code.

### Initializing and Finalizing NVSHMEM

As a core requirement on the host side, we must initialize and finalize NVSHMEM as the first and last things in our program.

```cpp
nvshmem_init();
...
nvshmem_finalize();
```

### Obtaining Processing Element IDs

The API call [nvshmem_my_pe()](https://docs.nvidia.com/hpc-sdk/nvshmem/api/gen/api/setup.html#nvshmem-my-pe) returns the unique numerical ID of each PE.

```cpp
int my_pe = nvshmem_my_pe();
int device = my_pe;
cudaSetDevice(device);
```

In multi-node environments you will have to account for the fact that CUDA devices are always zero-indexed within each node. In that case you would obtain the PE identifier *local* to that node. For example, if we were using two nodes with four GPUs each, then we would ask our job launcher to run four tasks per node (e.g. `nvshmrun -np 8 -ppn 4 -hosts hostname1,hostname2`) and then do[<sup>3</sup>](#footnote3):

```cpp
int my_pe_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
int device = my_pe_node;
cudaSetDevice(device);
```
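
To make the node-local numbering concrete, here is a minimal sketch of the arithmetic that a node-local PE ID corresponds to in the two-node, four-GPU example above. The helpers `local_device_for_pe` and `node_for_pe` are hypothetical and not part of NVSHMEM. They assume that the launcher places PEs on nodes in contiguous blocks and that each PE uses exactly one GPU. Actual placement depends on the launcher, so in real code the `nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE)` call shown above remains the safer choice.

```cpp
#include <cassert>

// Hypothetical helper (not an NVSHMEM API): map a global PE ID to a
// node-local device index. Assumes PEs are placed on nodes in contiguous
// blocks of `pes_per_node` and that each PE drives one GPU.
int local_device_for_pe(int global_pe, int pes_per_node)
{
    assert(global_pe >= 0 && pes_per_node > 0);
    return global_pe % pes_per_node;
}

// Hypothetical helper: index of the node that hosts a given PE,
// under the same block-placement assumption.
int node_for_pe(int global_pe, int pes_per_node)
{
    assert(global_pe >= 0 && pes_per_node > 0);
    return global_pe / pes_per_node;
}
```

Under these assumptions, with eight PEs and four PEs per node, `local_device_for_pe(5, 4)` gives device 1 and `node_for_pe(5, 4)` gives node 1, so PE 5 would use the second GPU on the second node.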

In [None]:
%%writefile nvshmem_pi.cu
#include <cmath>
#include <iostream>
#include <curand_kernel.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#define N (1024*1024)

__global__ void calculate_pi(int* hits) 
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Initialize random number state (unique for every thread in the grid)
    int seed = 0;
    int offset = 0;
    curandState_t curand_state;
    curand_init(seed, idx, offset, &curand_state);

    // Generate random coordinates within (0.0, 1.0]
    float x = curand_uniform(&curand_state);
    float y = curand_uniform(&curand_state);

    // Increment hits counter if this point is inside the circle
    if (x * x + y * y <= 1.0f) 
        atomicAdd(hits, 1);
    
}

int main(int argc, char** argv) 
{
    // Initialize NVSHMEM
    nvshmem_init();

    // Obtain our NVSHMEM processing element ID
    int my_pe = nvshmem_my_pe();

    // Each PE (arbitrarily) chooses the GPU corresponding to its ID
    int device = my_pe;
    cudaSetDevice(device);

    // Allocate host and device values
    int* hits;
    hits = (int*) malloc(sizeof(int));

    int* d_hits;
    cudaMalloc((void**) &d_hits, sizeof(int));

    // Initialize number of hits and copy to device
    *hits = 0;
    cudaMemcpy(d_hits, hits, sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel to do the calculation
    int threads_per_block = 256;
    int blocks = (N + threads_per_block - 1) / threads_per_block;

    calculate_pi<<<blocks, threads_per_block>>>(d_hits);
    cudaDeviceSynchronize();

    // Copy final result back to the host
    cudaMemcpy(hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);

    // Calculate final value of pi
    float pi_est = (float) *hits / (float) (N) * 4.0f;

    // Print out result
    std::cout << "Estimated value of pi on PE " << my_pe << " = " << pi_est << std::endl;
    std::cout << "Relative error on PE " << my_pe << " = " << std::abs((M_PI - pi_est) / M_PI) << std::endl;

    free(hits);
    cudaFree(d_hits);

    // Finalize nvshmem
    nvshmem_finalize();

    return 0;
}
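
As a brief aside on why the program multiplies the hit fraction by 4, the standard argument can be sketched as follows. It assumes the sampled points are independent and uniformly distributed on the unit square. The symbols $H$ (the number of hits) and $\hat{\pi}$ (the estimate) are introduced here for illustration only.

$$
P\left(x^2 + y^2 \le 1\right) = \frac{\pi}{4}, \qquad \hat{\pi} = 4\,\frac{H}{N}, \qquad \sigma_{\hat{\pi}} = \sqrt{\frac{\pi\,(4 - \pi)}{N}}
$$

The first term is the area of the quarter circle divided by the area of the unit square. The last term follows from the variance of a Bernoulli trial with success probability $\pi/4$. With $N = 1024 \times 1024$ the standard error is roughly $1.6 \times 10^{-3}$, so an absolute error on the order of $10^{-3}$ is to be expected. Because every PE in the program above uses the same seed and the same per-thread sequence numbers, each PE should report the same estimate.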

### Compiling NVSHMEM Code

[Compiling](https://docs.nvidia.com/hpc-sdk/nvshmem/api/using.html#compiling-nvshmem-programs) looks similar to before, but we now need to point to the relevant include and library directories for NVSHMEM (`-I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem`; we've provided the environment variable for you) and also link in the CUDA driver API (`-lcuda`). We also need to add to the code `#include <nvshmem.h>`[<sup>4</sup>](#footnote4) and `#include <nvshmemx.h>`[<sup>5</sup>](#footnote5). Finally we need to add `-rdc=true` to enable [relocatable device code](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda), a requirement of NVSHMEM.

```shell
nvcc -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o nvshmem_pi nvshmem_pi.cu
```

#### Run the Code

##### Compile with Shell Script

In [None]:
%%writefile howtocompile.sh
#!/bin/bash

usage()
{
 echo "howtocompile.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtocompile.sh <supercomputer>"
 echo -e "  e.g.: bash howtocompile.sh sdumont"
}

sdumont()
{
 module load nvshmem/2.8.0_cuda-11.2
 nvcc -arch=sm_70 -rdc=true $CPPFLAGS $LDFLAGS -lnvshmem_host -lnvshmem_device -o nvshmem_pi nvshmem_pi.cu
}

# Check command-line arguments
if [ "$#" -eq 0 ]; then
 usage
 exit
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtocompile.sh sdumont

##### Execute with Shell Script

In [None]:
%%writefile v100-nvshmem_pi.sh
#!/bin/bash

#SBATCH --job-name=nvshmem_pi                  # Job name
#SBATCH --nodes=1                              # Run all processes on 1 node
#SBATCH --partition=sequana_gpu_dev            # Partition SDUMONT
#SBATCH --output=out_v100_%j-nvshmem_pi.log    # Standard output and error log
#SBATCH --ntasks-per-node=4                    # 4 tasks (PEs) per node

module load nvshmem/2.8.0_cuda-11.2
nvshmrun -np 4 nvshmem_pi                       

In [None]:
%%writefile howtoexecute.sh
#!/bin/bash

usage()
{
 echo "howtoexecute.sh: wrong number of input parameters. Exiting."
 echo -e "Usage: bash howtoexecute.sh <supercomputer>"
 echo -e "  e.g.: bash howtoexecute.sh sdumont"
}

sdumont()
{
 sbatch v100-nvshmem_pi.sh
}

# Check command-line arguments
if [ "$#" -eq 0 ]; then
 usage
 exit
fi

#sdumont
if [[ $1 == "sdumont" ]];then
 sdumont
fi

In [None]:
!bash howtoexecute.sh sdumont

#### Print the Output Log File

In [None]:
!cat *-nvshmem_pi.log

## Clear the Memory

Before moving on, please execute the following cell to restart the notebook kernel and clear its memory. This is required to move on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

Please continue to the next notebook: [_8-SDumont-Jacobi.ipynb_](8-SDumont-Jacobi.ipynb).