# Introduction to Parallel Programming (RMIT-NCI 2022)

In [1]:
import os
# The Jupyter notebook is launched from your $HOME; change the working directory, assuming your username directory has been created under /scratch/vp91
os.chdir(os.path.expandvars("/scratch/vp91/$USER/RMIT2022"))

## 1. OpenMP
Our example ([monte-carlo-pi-serial](./monte-carlo-pi-serial.c)), meant to help you get the hang of parallel programming, is slightly more complicated than a hello-world program. Nevertheless, it is simple enough to showcase a basic OpenMP program.

The program approximates Pi using the Monte Carlo method. Run the next cell to compile and execute the serial code.
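
As a rough illustration of the method (not the course source itself), the sketch below draws points uniformly in the unit square and counts how many land inside the quarter circle of radius 1; the hit ratio approximates Pi/4. The names `next_uniform` and `estimate_pi` and the xorshift generator are assumptions made for this sketch, and the actual program may generate its random numbers differently.

```c
#include <stdint.h>

/* Hypothetical helper: a small xorshift generator so the sketch is
   self-contained. The course code may use another random source. */
static double next_uniform(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    /* Keep the top 53 bits and scale them into [0, 1). */
    return (double)(x >> 11) / 9007199254740992.0;
}

/* Estimate Pi from n sample points in the unit square. */
double estimate_pi(long n, uint64_t seed) {
    uint64_t state = seed ? seed : 1;  /* xorshift state must be non-zero */
    long hits = 0;
    for (long i = 0; i < n; i++) {
        double x = next_uniform(&state);
        double y = next_uniform(&state);
        if (x * x + y * y <= 1.0) {
            hits++;  /* the point lies inside the quarter circle */
        }
    }
    /* hits / n approximates Pi / 4 */
    return 4.0 * (double)hits / (double)n;
}
```

Calling `estimate_pi(4000000, 12345)` from a `main` function uses the same number of sampling points as the first line of the program output.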


In [29]:
!make clean && make mc-serial && echo "Compilation Successful!" && ./monte-carlo-pi-serial

rm -f *.o laplace_mpi_blocking laplace_mpi_nonblocking laplace_mpi_persistent
gcc -g -Wall -fopenmp -o monte-carlo-pi-serial monte-carlo-pi-serial.c -lm
Compilation Successful!
MATH Pi 3.141593
/////////////////////////////////////////////////////
Sampling points 4000000; Hit numbers 3140304; Approx Pi 3.140304, Total time in 0.064008 seconds 
Sampling points 8000000; Hit numbers 6282953; Approx Pi 3.141477, Total time in 0.102494 seconds 
Sampling points 16000000; Hit numbers 12565669; Approx Pi 3.141417, Total time in 0.204732 seconds 
Sampling points 32000000; Hit numbers 25130758; Approx Pi 3.141345, Total time in 0.410154 seconds 
Sampling points 64000000; Hit numbers 50264936; Approx Pi 3.141558, Total time in 0.818817 seconds 
Sampling points 128000000; Hit numbers 100532384; Approx Pi 3.141637, Total time in 1.640441 seconds 
Sampling points 256000000; Hit numbers 201059854; Approx Pi 3.141560, Total time in 3.281945 seconds 
Sampling points 512000000; Hit numbers 402124922; Ap

The multithreaded version ([monte-carlo-pi-openmp.c](./monte-carlo-pi-openmp.c)) is implemented with OpenMP. In essence, the $N$ random numbers are distributed across multiple threads.
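
One way to distribute the work is an OpenMP parallel loop with a reduction, sketched below under stated assumptions. The function name `count_hits_omp` and the interleaved layout of `random_array` (point `i` stored at indices `2*i` and `2*i + 1`) are hypothetical; the actual file may organise its data and loop differently.

```c
/* Count the points (x, y) that fall inside the unit quarter circle.
   Assumed layout: random_array holds 2*n values in [0, 1), with point i
   stored as (random_array[2*i], random_array[2*i + 1]). */
long count_hits_omp(const double *random_array, long n) {
    long count = 0;
    /* Each thread processes a chunk of the iterations with a private
       copy of count; the reduction clause sums the private copies. */
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; i++) {
        double x = random_array[2 * i];
        double y = random_array[2 * i + 1];
        if (x * x + y * y <= 1.0) {
            count++;
        }
    }
    return count;
}
```

Without the `reduction` clause, all threads would update the shared `count` at the same time, which is a data race. If the file is compiled without `-fopenmp`, the pragma is ignored and the loop simply runs serially.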


Run the next cell to compile the OpenMP code.

In [23]:
!make clean && make mc-omp && echo "Compilation Successful!" 

rm -f *.o laplace_mpi_blocking laplace_mpi_nonblocking laplace_mpi_persistent
gcc  -g -fopenmp -Wall -o monte-carlo-pi-openmp monte-carlo-pi-openmp.c -lm
Compilation Successful!


Run the program with a fixed number of threads (12 in this example).

In [24]:
!OMP_NUM_THREADS=12 ./monte-carlo-pi-openmp

MATH Pi 3.141593
/////////////////////////////////////////////////////
Sampling points 4000000; Hit numbers 3142455; Approx Pi 3.142455, Total time in 0.015405 seconds 
Sampling points 8000000; Hit numbers 6284800; Approx Pi 3.142400, Total time in 0.016148 seconds 
Sampling points 16000000; Hit numbers 12566293; Approx Pi 3.141573, Total time in 0.019214 seconds 
Sampling points 32000000; Hit numbers 25133048; Approx Pi 3.141631, Total time in 0.035671 seconds 
Sampling points 64000000; Hit numbers 50264658; Approx Pi 3.141541, Total time in 0.071952 seconds 
Sampling points 128000000; Hit numbers 100525709; Approx Pi 3.141428, Total time in 0.143536 seconds 
Sampling points 256000000; Hit numbers 201053080; Approx Pi 3.141454, Total time in 0.285846 seconds 
Sampling points 512000000; Hit numbers 402112108; Approx Pi 3.141501, Total time in 0.572919 seconds 


## 2. OpenACC
Now we offload the computation to a GPU to accelerate the for-loop. To do this, we first need to load the NVIDIA HPC SDK module on Gadi.

**`TODO`**: Refactor [monte-carlo-pi-openacc.c](./monte-carlo-pi-openacc.c) so that it uses OpenACC directives and clauses.

Since we compile with managed memory, there is no need to include data transfer clauses. However, relying on managed memory can become a limitation when tuning for more performance.
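
For comparison, this is roughly what an explicit data transfer clause looks like when managed memory is not used. It is a sketch, not the course solution: the function name `count_hits_acc` and the interleaved layout of `random_array` are assumptions, while the `count` reduction and the `random_array` copy-in correspond to what the compiler reports for the real code.

```c
/* Count hits on the accelerator.
   Assumed layout: random_array holds 2*n values in [0, 1), with point i
   stored as (random_array[2*i], random_array[2*i + 1]). */
long count_hits_acc(const double *random_array, long n) {
    long count = 0;
    /* copyin transfers the 2*n input values to the device before the
       loop runs; with managed memory this clause can be left out.
       reduction(+:count) combines the per-thread partial counts. */
    #pragma acc parallel loop reduction(+:count) copyin(random_array[0:2*n])
    for (long i = 0; i < n; i++) {
        double x = random_array[2 * i];
        double y = random_array[2 * i + 1];
        if (x * x + y * y <= 1.0) {
            count++;
        }
    }
    return count;
}
```

A compiler without OpenACC support ignores the pragma and runs the loop on the CPU, so the function returns the same count either way.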

The following flags are used in compiling the OpenACC code:

`-Minfo=accel`: show information about the code that OpenACC accelerates

`-ta=tesla:managed`: target NVIDIA GPUs with managed memory

We also use the NVTX library, which provides annotations for profiling the code.

If you get stuck, peek at the [solution](./solution/monte-carlo-pi-openacc.c).

Once you have added the correct OpenACC directives to the code, run the next cell to compile and execute the program.

In [6]:
!make clean && make mc-acc && echo "Compilation Successful!" && ./monte-carlo-pi-openacc

rm -f *.o laplace_mpi_blocking laplace_mpi_nonblocking laplace_mpi_persistent
nvc -g -Wextra -acc -Minfo=accel -ta=tesla:managed  -o monte-carlo-pi-openacc monte-carlo-pi-openacc.c -lm -lnvToolsExt
calc_pi:
     39, Generating NVIDIA GPU code
         39, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:count)
     39, Generating implicit copyin(random_array[:]) [if not already present]
         Generating implicit copy(count) [if not already present]
Compilation Successful!
MATH Pi 3.141593
Sampling points 4000000; Hit numbers 3141015; Approx Pi 3.141015, Total time in 0.240623 seconds 
Sampling points 8000000; Hit numbers 6281556; Approx Pi 3.140778, Total time in 0.158552 seconds 
Sampling points 16000000; Hit numbers 12565785; Approx Pi 3.141446, Total time in 0.301478 seconds 
Sampling points 32000000; Hit numbers 25129756; Approx Pi 3.141220, Total time in 0.587117 seconds 
Sampling points 64000000; Hit numbers 50263089; Approx P

Now we will demonstrate how to submit a batch job.

## 3. MPI
Our last parallel programming model uses MPI. The $N$ random numbers are split across multiple processes. Each process independently counts the hits among the random numbers it stores locally. The results from the individual MPI ranks are then collected and summed at the root process (MPI rank 0).
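
The splitting and summing steps can be sketched in plain C without any MPI calls. The names `block_range_t`, `block_range`, and `sum_counts` are hypothetical and only illustrate the idea: each rank owns a contiguous half-open index range, and the root adds up the per-rank counts, which is the job `MPI_Reduce` with `MPI_SUM` does in the real program. The actual source may compute its ranges differently.

```c
/* Half-open index range [start, end) owned by one rank. */
typedef struct {
    long start;
    long end;
} block_range_t;

/* Split n items across size ranks as evenly as possible.
   The first (n % size) ranks each take one extra item. */
block_range_t block_range(long n, int rank, int size) {
    long base = n / size;
    long rem = n % size;
    block_range_t r;
    r.start = rank * base + (rank < rem ? rank : rem);
    r.end = r.start + base + (rank < rem ? 1 : 0);
    return r;
}

/* Serial stand-in for the sum reduction performed at rank 0. */
long sum_counts(const long *counts, int size) {
    long total = 0;
    for (int i = 0; i < size; i++) {
        total += counts[i];
    }
    return total;
}
```

With 4000000 samples on 4 ranks, `block_range(4000000, 1, 4)` yields the range from 1000000 to 2000000 for rank 1.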

Look out for the following parts in the program ([monte-carlo-pi-mpi.c](./monte-carlo-pi-mpi.c)).
```cpp
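#include <mpi.h>

MPI_Init(&argc, &argv);                  // start the MPI runtime

MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // ID of this process within the communicator

MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes

MPI_Wtime();                             // wall-clock time in seconds

// sum every rank's count into count_tot on rank 0
MPI_Reduce(&count, &count_tot, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

MPI_Barrier(MPI_COMM_WORLD);             // wait until all ranks reach this point

MPI_Finalize();                          // shut down the MPI runtime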
```

Run the next cell to execute the MC_pi program.

In [30]:
!make clean && make mc-mpi && echo "Compilation Successful!" && mpiexec -np 4 ./monte-carlo-pi-mpi

rm -f *.o laplace_mpi_blocking laplace_mpi_nonblocking laplace_mpi_persistent
mpicc -g -Wall -o monte-carlo-pi-mpi monte-carlo-pi-mpi.c -lmpiP -lm -lbfd -liberty -lunwind
Compilation Successful!
start 1000000 end 2000000 rank 1 
start 2000000 end 3000000 rank 2 
start 3000000 end 4000000 rank 3 
mpiP: 
mpiP: mpiP: mpiP V3.4.1 (Build Apr  1 2020/12:09:46)
mpiP: Direct questions and errors to mpip-help@lists.sourceforge.net
mpiP: 
start 0 end 1000000 rank 0 
MPI program runtime = 0.016044 on rank 1
Hit numbers  3140429 Approx Pi 3.140429
MPI program runtime = 0.018354 on rank 3
MPI program runtime = 0.032232 on rank 2
MPI program runtime = 0.030789 on rank 0
mpiP: 
mpiP: Storing mpiP output in [./monte-carlo-pi-mpi.4.1968846.1.mpiP].
mpiP: 


### Profile with mpiP

Run the next cell to inspect the profiling results.

In [31]:
!cat *.mpiP
!rm -r *.mpiP

@ mpiP
@ Command : ./monte-carlo-pi-mpi 
@ Version                  : 3.4.1
@ MPIP Build date          : Apr  1 2020, 12:09:46
@ Start time               : 2022 11 15 21:31:48
@ Stop time                : 2022 11 15 21:31:48
@ Timer Used               : PMPI_Wtime
@ MPIP env var             : [null]
@ Collector Rank           : 0
@ Collector PID            : 1968846
@ Final Output Dir         : .
@ Report generation        : Single collector task
@ MPI Task Assignment      : 0 gadi-gpu-v100-0090.gadi.nci.org.au
@ MPI Task Assignment      : 1 gadi-gpu-v100-0090.gadi.nci.org.au
@ MPI Task Assignment      : 2 gadi-gpu-v100-0090.gadi.nci.org.au
@ MPI Task Assignment      : 3 gadi-gpu-v100-0090.gadi.nci.org.au

---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0     0.0398     