**DeapSECURE module 6: Parallel Programming**

# Session 2: Parallelizing a Serial Code with MPI

Welcome to the DeapSECURE online training program!
This is a Jupyter notebook for the hands-on learning activities of the
["Parallel and High-Performance Programming" module](https://deapsecure.gitlab.io/deapsecure-lesson06-par/), episodes 5: ["Problem Decomposition"](https://deapsecure.gitlab.io/deapsecure-lesson06-par/20-problem-decomposition/index.html) and 7: ["Parallel Computation of Statistics of a Large Array"](https://deapsecure.gitlab.io/deapsecure-lesson06-par/25-parallel-reduction/index.html).
Please visit the [DeapSECURE](https://deapsecure.gitlab.io/) website to learn more about our training program.


<a id="TOC"></a>
**Quick Links** (sections of this notebook):

* 1 [Setup](#sec-Setup)
* 2 [Introduction](#sec-Intro)
* 3 [Serial Program](#sec-Serial_prog)
* 4 [Problem Decomposition in One Dimension](#sec-Problem_decomp)
* 5 [Parallel Program](#sec-Par_prog)
* 6 [Distributing Data](#sec-Distrib)
* 7 [Additional Improvements and Exercises](#sec-Additional_ex)


> **CAUTIONS**
>
> <!-- FIXME FIXME FIXME
> In this session, we will use this notebook as partly Python and partly UNIX shell to invoke MPI programs directly from the Jupyter notebook in order to capture the results in the same notebook.
> This notebook was designed with ODU Wahab cluster and Open OnDemand in mind, in which it is possible to launch MPI programs within the notebook environment.
> Sufficient number of CPU cores must be requested when starting this Jupyter session to run MPI programs interactively from the notebook.
> Otherwise, the program must be invoked through your cluster's job scheduler.
> -->
>
> **FOR THE TIME BEING, DO NOT LAUNCH MPI JOBS FROM THIS NOTEBOOK.**
>
> The trick for launching MPI programs from within a notebook is still broken, so this cannot be done yet.
> Please launch MPI jobs directly from the login node's terminal using an appropriate job script and the `sbatch` command, not from the Jupyter terminal.
> Please follow your instructor regarding the specific way of running MPI programs on your HPC site.
>
> There is a lot of variability in how MPI programs must be run on a particular HPC site, as well as in the timing results of the MPI programs.
> Running the programs on different clusters will almost certainly result in different timings as well.
> For this reason, this notebook is not intended to be strictly reproducible across different environments.

<a id="sec-Setup"></a>
## 1. Setup Instructions

If you are opening this notebook from Wahab cluster's OnDemand interface, you're all set.

If you see this notebook elsewhere and want to perform the exercises on Wahab cluster, please follow the steps outlined in our setup procedure.

1. Make sure you have activated your HPC service.
2. Point your web browser to https://ondemand.wahab.hpc.odu.edu/ and sign in with your MIDAS ID and password.
3. Create a new Jupyter session with the following parameters: Python version **3.7**, Python suite `tensorflow 2.6 + pytorch 1.10`, Number of Cores **1**, Number of GPU **0**, Partition `main`, and Number of Hours at least **4**. (See <a href="https://wiki.hpc.odu.edu/en/ood-jupyter" target="_blank">ODU HPC wiki</a> for more detailed help.)

4. Get the necessary files using commands below within Jupyter:

       mkdir -p ~/CItraining/module-par
       cp -pr /shared/DeapSECURE/module-par/. ~/CItraining/module-par

Using the file manager on the left sidebar, now change the working directory to `~/CItraining/module-par`.
The file name of this notebook is `Par-session-2-Reduction.ipynb`.

<!--
> **VERY IMPORTANT:** Make sure that you start the Jupyter session with at least **4 cores**. We will use this notebook to launch real MPI programs, so multiple CPU cores are needed.
-->

### 1.1 Reminder

* Throughout this notebook, `#TODO` is used as a placeholder where you need to fill in with something appropriate. 
* To run the code in a cell, press `Shift+Enter`.
* Use `ls` to view the contents of a directory.

### 1.2 Loading Python Libraries

<!--
Now we need to **import** the required libraries into this Jupyter notebook:`numpy`.
-->

<!--
**Important**: On Wahab HPC, software packages, including Python libraries, are managed and deployed via *environment modules*.
Before we can import the Python libraries in our current notebook, we have to load the corresponding environment modules.
We have setup a custom environment "DeapSECURE" which will load all the required libraries for this workshop. Please load "DeapSECURE" module.

* Load the modules above using the `module("load", "MODULE")` or `module("load", "MODULE1", "MODULE2", "MODULE n")` statement.
* Next, invoke `module("list")` to confirm that these modules are loaded.
* In this module, we have setup a custom environment "DeapSECURE" including all the required libraries. Please load "DeapSECURE" module. (You can also setup your [custom environment](https://wiki.hpc.odu.edu/Software/Python#install-additional-python-modules))
-->

Now we can import the following standard Python libraries:
`os`, `re`, `sys`, `numpy`, `time`.

In [None]:
"""Uncomment, edit, and run code below to import libraries""";
#import #TODO

Now load the functions from DeapSECURE's special module, `parallel_prog_env`, to make MPI programs available & invocable from your notebook:

In [None]:
"""Uncomment and run these commands""";

#import parallel_prog_env
#from parallel_prog_env import *

In [None]:
# Makes MPI-related programs and modules available to this notebook:
load_parallel_prog_env(module)

> **NOTE**: `parallel_prog_env` is a DeapSECURE-specific module.

### Running Shell Commands inside the Jupyter Notebook

Lines that start with `!` are passed directly to the system shell. For example, `!ls` will run `ls` in the current directory.

Find out the following:

* Use the `hostname` command to print the name of the compute node you're running on.
* Use the `date` command to print the present date/time.
* Use the `pwd` command to print your current directory.

<a id="sec-Intro"></a>
## 2. Introduction

Parallel programming is a challenging task.
For this reason, we provide several options for learners to take according to what's comfortable to them.

1) **Option 1 (easy)**: study the example of an (almost fully) parallelized code and understand the effects of parallelization.
   There will be opportunities to practice some MPI programming by improving the code.

2) **Option 2 (challenging)**: perform the parallelization from scratch, based on the serial code and the skeleton of `master_template.py`.
   This requires strong programming skills and a willingness to troubleshoot the code.


In this workshop, we provide two Python programs to parallelize:

1. `rand_reduction_seq.py` -- a program which generates many random numbers and computes their average and standard deviation. Let's give a nickname to this computational problem: "**rand_reduction**".

2. `encrypt_img.py` -- a program which reads an image file and encrypts it pixel-by-pixel, and saves the encrypted image in a JSON format. Let's name this computational problem "**encrypt_img**".

**CHALLENGE**: This notebook is focused on the parallelization of the first program, **rand_reduction**.
Those who are interested in a challenge and find the first program too trivial can go straight to the **encrypt_img** program and attempt to parallelize it from scratch.
Make sure you follow the steps prescribed below, though, as they are (mostly) independent of the program being parallelized.

<a id="sec-Serial_prog"></a>
## 3. The Serial Program: `rand_reduction_seq.py`

The first step in parallelizing any program is to become familiar with the original (serial) program and obtain the performance characteristics of the program.
The `rand_reduction_seq.py` program is located in the `reduction` subdirectory.

If you have not already done so, please issue: `cd reduction` once. Check if you have the `rand_reduction_seq.py` file in your new directory.

In [None]:
#TODO

### 3.1 Objective of Computation

The **rand_reduction** program generates many random numbers $\{ n_0, n_1, ... n_{N-1} \}$ (where $N$ = 10 million by default; this can be adjusted) and saves them in the array called `NUMBERS`.
(Important: The program must be general enough because we must be able to replace the random numbers with a different stream of (non-random) numbers later on.)

The goal of the program is to compute the average and standard deviation of these numbers.
To do so, we must compute two sums:

$$P \equiv n_0 + n_1 + ... + n_{N-1} \equiv \sum_i n_i$$

$$Q \equiv n_0^2 + n_1^2 + ... + n_{N-1}^2 \equiv \sum_i n_i^2$$


The average is given by:

$$\langle n \rangle = P/N$$

and the standard deviation is given by

$$\sigma = \sqrt{ Q/N - (P/N)^2 }$$
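
The formulas above translate almost directly into NumPy. The following is a minimal sketch, not the actual code of `rand_reduction_seq.py`; the function name `mean_and_stddev` and the tiny input array are made up for illustration.

```python
import numpy

def mean_and_stddev(numbers):
    """Return (average, standard deviation) using the sums P and Q defined above."""
    N = len(numbers)
    P = numpy.sum(numbers)         # P = sum of n_i
    Q = numpy.sum(numbers ** 2)    # Q = sum of n_i^2
    average = P / N
    sigma = numpy.sqrt(Q / N - average ** 2)
    return average, sigma

# Small illustrative array (not the program's random numbers)
NUMBERS = numpy.array([1.0, 2.0, 3.0, 4.0])
average, sigma = mean_and_stddev(NUMBERS)
print("average =", average, " sigma =", sigma)
```

Because the formula divides by $N$, this is the population standard deviation, which is also what `numpy.std` computes by default. Calling `numpy.std(NUMBERS)` is therefore a convenient cross-check.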

### 3.2 Running the Serial Program

**EXERCISE:** Run the `rand_reduction_seq.py` program and get the timing information!

* There are no input files required. 
* Extra bonus: Create a job script for the program and run it through the SLURM job scheduler.
* Observe how much time is required to complete the calculation. The program reports this time at the end.

*HINT:* The program can be run in UNIX shell or in this Jupyter notebook. Recall that `!` allows shell commands to be run in Jupyter.

*>> (edit this cell to record your answer & observation)*

> ### BONUS EXERCISE: Job Script
>
> A job script can be created to run the program through the job scheduler.
> This is *the* recommended way to run jobs on HPC (especially the long-running jobs).
> This script becomes essential when the program is converted to parallel.
>
> ```bash
> #!/bin/bash
> #SBATCH --job-name=rand_reduction_seq
> #SBATCH --ntasks=1
> #SBATCH --output=rand_reduction_seq.out
> #SBATCH --time=1:00:00
>
> source parallel-prog-env
> python3 rand_reduction_seq.py
> ```
>
> *Notes*:
>
> * The job script above is for running the serial program.
> * The `--ntasks` flag determines the number of processes (UNIX tasks), which is obviously one for a serial program.
> * The `source parallel-prog-env` reads additional commands from `parallel-prog-env`, which loads the necessary environment modules for `mpiexec`, `python3` and other tools needed later on.

### 3.3 Analyzing the Serial Program and Identifying the Parts

**EXERCISE:**
Open the program in a text editor or viewer and identify the steps of the program.
Since this is a purely serial program, the parts that you need to identify will only be
  
  - `PROCESS_INPUT_DATA`
  - `DO_WORK`
  - `FINALIZE_AND_REPORT_RESULTS`
  
*(You are encouraged to work on this exercise with your breakout group!)*

*>> (edit this cell to record your answer & observation)*

### 3.4 Finding the Execution Hotspot

The goal of parallel programming is to *significantly* shorten the program's execution time.
Usually, a few parts of the program account for a large fraction of the execution time.
This means that the very first thing to do is to find which part(s) of the code are taking a lot of time.
For this exercise, we will employ a very simple time measurement, like this:

In [None]:
import time

# ... some prior codes here

t1 = time.time()
# Replace the sleep() below with the real code line(s), whose execution time will be measured
time.sleep(0.05)
t2 = time.time()
print('timing: <DESCRIBE_CODE_PART>: ', t2 - t1, 'secs')

# ... more codes here

The code snippet above measures and prints the execution time, defined by `t2 - t1`, of the code line(s) in between the two time measurements.
In the example above we only measure a single "sleep" command.
In real codes, you may measure a loop, or a few lines, or a few blocks of code.

There can be more than one time measurement in a given code.
Many good codes record and print the time measurement of their major computations/actions.
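
When several sections need to be timed, repeating the `t1`/`t2` pattern gets tedious. One possible way to package it is a small context manager, sketched below; the name `timed` and the `timings` dictionary are inventions for this example, not part of the lesson's programs.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    t1 = time.time()
    try:
        yield
    finally:
        elapsed = time.time() - t1
        if results is not None:
            results[label] = elapsed   # keep the number for later comparison
        print('timing: %s: %.6f secs' % (label, elapsed))

timings = {}
with timed('sleep', timings):
    time.sleep(0.05)
with timed('sum of squares', timings):
    total = sum(i * i for i in range(100000))
```

Collecting the measurements in a dictionary makes it easy to compare the sections afterwards and spot the most expensive one.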

*HINTS:* Quite frequently, execution hot spots involve:

* Loops with a large number of iterations
* Computation with lots of data (e.g. large arrays)

Additionally, you should suspect function calls, since we may not know what is being done inside them.

**EXERCISE 3:**
Using the simple timing trick above, find a section of the code that takes a long time to compute.

*>> (edit this cell to record your answer & observation)*

*(This may take a few rounds of trial and error to find out. Don't give up!)*

Now that you have identified the expensive part(s) of the program, you will want to devise a strategy to parallelize those part(s).

> ### Profiling the Code
>
> Timing sections of a code as done in the previous exercise is not always practical.
> Unless you are familiar with the computational methodology and have a good sense of where to measure, it will be a lot of trial and error to find the expensive section of the code!
> There is a tool called "profiler" which will help you pinpoint the lines of code, or the function calls, which take a lot of time to execute.
>
> Python has a package called `line_profiler` which can measure the amount of time spent in a function, line-by-line.
> There is a good lesson for this which you can follow: <http://www.hpc-carpentry.org/hpc-parallel-novice/01-estimate-of-pi/index.html> .
> If you have a real application to speed up (whether to parallelize or not), this is definitely the first step you will want to do.
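
`line_profiler` is a third-party package that may need to be installed separately. As a stand-in, the sketch below uses `cProfile` from the Python standard library, which reports the time per function call rather than per line. The functions `generate`, `reduce_sums`, and `main` are hypothetical workloads that only give the profiler something to measure.

```python
import cProfile
import io
import pstats
import random

def generate(count):
    """Hypothetical workload: build a list of pseudo-random numbers."""
    return [random.random() for _ in range(count)]

def reduce_sums(numbers):
    """Hypothetical workload: compute the sum and the sum of squares."""
    return sum(numbers), sum(x * x for x in numbers)

def main():
    numbers = generate(200000)
    return reduce_sums(numbers)

profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Print the five most expensive entries, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

The `cumtime` column of the report shows how much time was spent in each function including the functions it calls, which is usually enough to decide where to place finer-grained time measurements.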

<a id="sec-Problem_decomp"></a>
## 4. Problem Decomposition in One Dimension

To parallelize the expensive part of a program, we must first break up the computation into parts that can be executed in parallel.
This effort of "breaking-up" tasks into smaller pieces is often referred to as ***problem decomposition*** or ***domain decomposition***.

A very common scenario in problem-decomposition is the distribution of $N$ independent tasks as evenly as possible among $P$ workers.
(This is to say that there is no interdependence among these tasks that would yield invalid results when executed in parallel.)

![Decomposition of array N=12 P=4](images/domainDecomp-1D-array_N12_P4.png)

The illustration above shows the distribution of $N=12$ numbers into $P=4$ partitions, where each partition corresponds to a worker.
This approach works well to minimize the processing time if each task takes the same amount of time to process.

<!--
Suppose we have N items that have to be split as evenly as possible across P partitions.

Examples:

* N=10 million elements to be processed by P=4 workers
* N=100 rows to be processed by P=7 workers

-->

**EXERCISE**: Create a subprogram that divides the $N$ items as evenly as possible among the workers.
Per MPI convention, we will label the workers with $r = 0, 1, ... (P-1)$.

### 4.1 Cases to Test

Here are some cases against which to test the correctness of your subprogram (or function):

1. $N$ being an integer multiple of $P$.
   Examples:

   * $N = 12; P = 4$
   * $N = 100; P = 4$
   * $N = 72; P = 8$

   
2. $N$ not being an integer multiple of $P$.
   Examples:

   * $N = 14; P = 4$
   * $N = 100; P = 7$


3. $N < P$ (corner case, but not uncommon).
   Example: $N = 5; P = 7$


4. $N == 0; P > 0$.
   Example: $N = 0; P = 7$

### 4.2 Manual Solution Workout

We will first try to solve this manually, then derive a formula (i.e. a Python expression) that can be evaluated by the computer.

#### Easy Case: $N$ divisible by $P$

Consider a concrete example of $N=12, P=4$.

In [None]:
N = 12
P = 4

1. How many items does each worker receive (call this `worksize`)? Express it also in terms of `N` and `P` (i.e. as a mathematical formula involving $N$ and $P$).

*(your answer here)*

2. What are the *inclusive* lower (`L`) and *exclusive* upper (`U`) bounds of the portion of the original array received by the worker with rank `r`? We follow Python's convention for the upper bound: the elements assigned to worker `r` will be `NUMBERS[L]`, `NUMBERS[L+1]`, ... `NUMBERS[U-1]`.

*(your answer here)*

- for rank 0: L=... U=...
- for rank 1: L=... U=...
- ...

Note that by definition, `worksize = U - L`.

3. Based on the manually worked-out `L`'s and `U`'s above, create the appropriate formulas for them involving `N`, `P`, and `r`.

*Hints*: you can use the integer division operator `//` or the `int()` function to yield integers instead of real numbers.
The `L` and `U` variables should be an array with `r` as its index, as each worker is supposed to have a non-overlapping range of tasks (or data elements) to process.

*(your answer here)*

4. Test your formula for various `r`'s for the same `N` and `P`.

In [None]:
## Use as many python cells as you need here

> *Hint*: It pays to write a loop and/or a function to expedite the tests below.
> For example, you can use this as a starting point:
>
> ```python
> L = [0] * P
> U = [0] * P
> for r in range(P):
>     L[r] = ...
>     U[r] = ...
> ```
>
> <br>
>
> *Advanced Hint*: If you know NumPy, you can replace `[0] * P` with `numpy.zeros((P,), dtype=int)` for a more robust array, and use NumPy's array operations to get rid of the `for` loop altogether!

#### Generalizing the Formula

It is important to test that your formula works correctly on general cases.
Test your formula for various combinations of `N` and `P`, trying out several `r` values as well.
At minimum, test it against:

* $N=14, P=4$  (N not divisible by P)
* $N=5, P=7$   (N < P)

**Important**:
You must make sure that all the `N` elements will be distributed into the partitions, and no element is assigned multiple times.

**HINTS**: *Possible* correct solution tables are given below.

For $N=14, P=4$:

```
r           L        U  worksize
0           0        3         3
1           3        7         4
2           7       10         3
3          10       14         4
```

For $N=5, P=7$:

```
r           L        U  worksize
0           0        0         0
1           0        1         1
2           1        2         1
3           2        2         0
4           2        3         1
5           3        4         1
6           4        5         1
```

Your solution table may vary depending on the specific algorithm you use.
For example, in the first table, your method may give a slightly different answer as to which rank gets `worksize=3` or `4`.
But the sum of all `worksize`'s must equal $N$, and there must not be any overlap or gap among the $[L, (U-1)]$ intervals.
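
The requirement above can be checked mechanically. The sketch below is one possible checker, assuming that each rank owns a single contiguous range and that the ranges are listed in rank order; the function name `check_partition` is made up for this example. It does not compute the bounds for you, it only validates the ones you supply.

```python
def check_partition(L, U, N):
    """Return True if the half-open ranges [L[r], U[r]) cover 0..N-1 exactly once."""
    if len(L) != len(U):
        return False
    expected_start = 0
    for lo, hi in zip(L, U):
        # Each range must start where the previous one ended (no gap, no overlap)
        if lo != expected_start or hi < lo:
            return False
        expected_start = hi
    # The last range must end at N, so the worksizes add up to N
    return expected_start == N

# Bounds copied from the N=14, P=4 table above
print(check_partition([0, 3, 7, 10], [3, 7, 10, 14], 14))        # True

# Bounds copied from the N=5, P=7 table above
print(check_partition([0, 0, 1, 2, 2, 3, 4],
                      [0, 1, 2, 2, 3, 4, 5], 5))                 # True

# A faulty decomposition: element 3 is assigned to both rank 0 and rank 1
print(check_partition([0, 3, 7, 10], [4, 7, 10, 14], 14))        # False
```

Running such a check over all the test cases in section 4.1 is a quick way to gain confidence in your formula before using it in the parallel program.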

----

**Optional Exercise**

Write a simple Python code here to print a table containing 4 columns: rank, lower bound, upper bound, number of elements.
A possible output looks like:

```
0  0 14 14
1 14 28 14
2 28 42 14
3 42 57 15
...
```
or

```
(0, 0, 14, 14)
(1, 14, 28, 14)
(2, 28, 42, 14)
(3, 42, 57, 15)
...
```

(The columns may not align, as long as it makes it easy for you to inspect.)

*Hints*: Use the `for r in range(P): ...` construct to print the values line-by-line. Advanced users may try using pandas' DataFrame.

----

The code above can be very useful for visually inspecting the result of decomposition. 

(end optional exercise)

----

<a id="sec-Par_prog"></a>
## 5. Parallel Program: Starter's Edition

We provide an example of a starter program that is partially parallelized---it still needs some work.
The starter parallel program is located at `reduction/rand_reduction_par.py`.

For **option 1**, we will guide you to complete this parallelization.
If you want to parallelize from scratch, feel free to use this program as a "cheat sheet" to see how certain things are done, but be careful of the issue pointed out in section "**6. Distributing Data**" below.

Regardless of whether you want to parallelize the program yourself, please go through the following exercises to understand the characteristics of the code.

### 5.1 Running the Parallel Program

**EXERCISE**: Run `rand_reduction_par.py` and observe the speedup

* Run it with 4 cores
* Use `mpiexec python3 SCRIPTNAME.py` to launch the MPI program.
* Create a job script for the program and submit it to the job scheduler (`sbatch JOB_SCRIPT`).
* Verify that the results of the parallel computation are still the same as those of the serial computation, and that the execution time is reduced.

Let's name the execution timing with 4 cores: $T_4$.

*>> (edit this cell to record your answer & observation)*

In [None]:
"""How to run parallel program interactively from Jupyter:
Uncomment the following line and add --oversubscribe flag on Wahab.""";
#! mpiexec   -n 4 python3 #TODO

### 5.2 Parallel Speedup

Let's define the *speedup* due to 4-worker parallelism as

$$S_4 \equiv \frac{T_{serial}}{T_4}$$

What is the speedup of this parallel run?

Ideally $S_4$ would be 4, but in practice it will usually be lower.
The actual speedup divided by the ideal speedup is called the *parallel efficiency*.
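
As a worked example of the two definitions, the sketch below computes the speedup and the parallel efficiency. The function names and the timings are made up for illustration; they are not measurements from Wahab or any other cluster.

```python
def speedup(t_serial, t_parallel):
    """Speedup S_P = T_serial / T_P."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nprocs):
    """Parallel efficiency: actual speedup divided by the ideal speedup P."""
    return speedup(t_serial, t_parallel) / nprocs

# Made-up timings in seconds, for illustration only
T_serial_example = 8.0
T_4_example = 2.5

S_example = speedup(T_serial_example, T_4_example)
E_example = efficiency(T_serial_example, T_4_example, 4)
print("speedup    =", S_example)
print("efficiency =", E_example)
```

With these invented numbers the speedup is 3.2 and the efficiency is 0.8, meaning that the four workers deliver 80% of the ideal fourfold speedup.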

In [None]:
"""Uncomment and complete the code below with your observed results""";

# T_serial = #TODO    ## What was the timing result from running the serial program?
# T_4 = #TODO         ## What was the result from running the parallel program with 4 cores?
# S_4 = #TODO         ## How to find the speedup?
# print("4-worker speedup =", S_4)

### 5.3 Parallel Scaling

**EXERCISE**: Run the code with different number of workers ($P$):

  * 1 core (any difference from the serial run?)
  * 2 cores
  * 4 cores (the case above)
  * 8 cores
  * 16 cores

Take note of the execution timing in each case ($T_1$, $T_2$, ...). Using Matplotlib, plot $T_P$ as a function of $P$ and see the effect of parallelization.

In [None]:
"""Replace numbers with execution timings found above.
You may want to consider numpy arrays for versatility.""";
#T = numpy.array([#TODO])
#S = T_serial / T
#print("Timing =", T)
#print("Speedup =", S)

Sample code to display your speedup result:

```python
# This is a simple way to plot the data;
# check out the documentation on matplotlib for more ways

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()

# Fill in the timings to plot the speedup vs. the number of processes
Procs = [ 1, 2, 4, 8, 16 ]
# WP's results (early 2021 on Wahab w/ legacy Python)
Timings = #TODO
Speedups = [ Timings[0] / T for T in Timings ]

ax.plot(Procs, Speedups)
ax.set_xlabel('Number of processes')
ax.set_ylabel('Speedup')
ax.set_title('Speedup vs. Number of Processes')
```

<a id="sec-Distrib"></a>
## 6. Distributing Data

The starter parallel program can get reasonable speedup by running it with multiple workers, but there is one **fundamental problem**: the working data (`NUMBERS`) was **not distributed**.

*(snippet of the original rand_reduction_par.py)*
~~~python
if rank == 0:
    COUNT = 10000000
    SEED = COUNT    # for the time being, random seed is the same as COUNT

    numpy.random.seed(SEED)
    NUMBERS = numpy.random.random((COUNT,))
else:
    # Assign them with dummy values because the var names must exist on bcast below
    COUNT = None
    SEED = None
    NUMBERS = None

# Replicate the following data to all workers
count_g = comm.bcast(COUNT, root=0)
seed_g = comm.bcast(SEED, root=0)
numbers_g = comm.bcast(NUMBERS, root=0)  ### This is problematic
~~~

In the starter's program, the random numbers were generated on the master rank, then *broadcast* to all the ranks.
Consequently, each worker holds a full copy of `NUMBERS`, the primary data of the program.
(There are even two copies of the `NUMBERS` data in the master rank--verify that.)
This violates the principle of distributed-memory programming, where the working data should be distributed across the workers.
Each worker should only hold the portion it needs to perform its part of the computation.
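
To get a feel for the cost of replication, here is a back-of-the-envelope calculation. It assumes 4 MPI processes (the snippet above does not fix this number) and counts only the array elements themselves, 8 bytes each because `numpy.random.random` returns 64-bit floats.

```python
COUNT = 10000000          # number of elements, as in the snippet above
BYTES_PER_ELEMENT = 8     # 64-bit floating-point numbers
P = 4                     # assumed number of MPI processes

full_copy = COUNT * BYTES_PER_ELEMENT

# Replicated: every rank keeps a full copy of the array
replicated_total = P * full_copy

# Distributed: every rank keeps only about COUNT / P elements
per_rank_distributed = full_copy // P

print("one full copy         : %d MB" % (full_copy // 10**6))
print("replicated, all ranks : %d MB" % (replicated_total // 10**6))
print("distributed, per rank : %d MB" % (per_rank_distributed // 10**6))
```

Under these assumptions the replicated layout needs 320 MB in total, while a distributed layout needs only 20 MB per rank. The gap grows with both the array size and the number of processes.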


**EXERCISE**:
Modify `rand_reduction_par.py` so that each worker receives only its relevant partition of the data.
For example, worker rank 1 should only receive `NUMBERS[2500000:5000000]`.

> **IMPORTANT**:
> This exercise is central to the objective of our workshop, so be sure to attempt this one.
> You will need to modify the program to perform the correct data distribution *and* make the workers work correctly with their portions of the data.
> Please create a copy of the original program and modify the copy.

**DISCUSSION QUESTIONS**:

1. For a given piece of data (e.g. an array), when does it make sense to *distribute* the data, and when can we just *replicate* it?
What are the advantages and disadvantages of each choice?

2. After data distribution, can process 1 access the portion of the array that is held by process 2?

3. Does distributing data have any effect on the application timing? Why, or why not?

<a id="sec-Additional_ex"></a>
## 7. Additional Improvements

Besides the critical data distribution, there are a number of improvements that can be made in the starter's version of the parallel code.

### 7.1 Using Scatter Operation

**EXERCISE**:
The distribution of work (`U` and `L`) was done using a series of `send`'s on the master side, and a single `recv` on the other workers.
Of course, this is exactly what the `scatter` function does!
Please replace the `send`/`recv` combination in `rand_reduction_par.py` with a single call of `scatter`.
Applying this modification will reduce the number of MPI function calls in the program.
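
As a rough illustration of the idea (not the actual code of `rand_reduction_par.py`), the sketch below wraps the distribution of the bounds in one function. It assumes an mpi4py-style communicator, i.e. an object such as `MPI.COMM_WORLD` that provides the `Get_rank()` and `scatter(sendobj, root=...)` methods. The function name `distribute_bounds` and the packing of each rank's bounds into an `(L, U)` tuple are choices made for this example.

```python
def distribute_bounds(comm, lower=None, upper=None):
    """Give each rank its own (L, U) pair with one collective call.

    `comm` is assumed to be an mpi4py-style communicator.
    `lower` and `upper` hold one entry per rank and are only needed on rank 0.
    """
    if comm.Get_rank() == 0:
        # One (L, U) tuple per rank; the list length must equal the number of ranks
        payload = list(zip(lower, upper))
    else:
        # Non-root ranks contribute nothing to a scatter
        payload = None
    # Every rank, including rank 0, receives exactly one element of the list
    return comm.scatter(payload, root=0)
```

In an MPI program, every rank could call this as `my_L, my_U = distribute_bounds(MPI.COMM_WORLD, L, U)` after `from mpi4py import MPI`, with `L` and `U` holding the full lists on rank 0 and `None` on the other ranks. Note that `scatter` is a collective operation, so all ranks must make the call.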

### 7.2 Problem Decomposition on Worker Processes

The domain decomposition for a given rank is often a well-defined computation that can be carried out *independently* by each worker process.

**QUESTIONS**:

* On the original starter's parallel program, can we do the domain decomposition on the worker processes?

* On the parallel program with properly distributed data, can we do the domain decomposition on the worker processes?

* If this can be done, what are the advantages of performing the domain decomposition in parallel?
  Applying this modification can potentially reduce the amount of data exchange at the start of the program.

> ## MPI Improvement Challenges
>
> The following subsections contain more extensive programming exercises that may require significant reworking of the program.
> These exercises are valuable because they highlight common issues encountered in high-performance computing in the real world.
> You are encouraged to attempt one or more of them on your own, and use our training's Slack channel to discuss your issues or findings.


### 7.3 Conserving Memory Usage (Challenge)

The original code generates *all* the random numbers in the master process before dispersing all of them to the worker processes.
This is a problem when the data is extremely large.
One way to get around this is to use *pipelining*: generating the random numbers in chunks and letting each chunk be processed before the next batch is generated.

**EXERCISE**: Modify `rand_reduction_par.py` to take this approach.

*Hint*: In this approach, the master process may have to take on the unique role of providing the random numbers to all the workers, i.e. it does not participate in the computation as the rest of the workers do.
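
The chunking idea can be tried out in a serial setting first. The sketch below only illustrates how the sums can be accumulated chunk by chunk so that the full array never has to exist in memory; the MPI communication between the master and the workers is left for the exercise. The function names, the chunk size, and the seed are arbitrary choices for this example.

```python
import numpy

def random_chunks(count, chunk_size, seed):
    """Yield `count` pseudo-random numbers in arrays of at most `chunk_size` elements."""
    rng = numpy.random.RandomState(seed)
    remaining = count
    while remaining > 0:
        n = min(chunk_size, remaining)
        yield rng.random_sample(n)
        remaining -= n

def accumulate_sums(chunks):
    """Accumulate the element count, P (sum), and Q (sum of squares) chunk by chunk."""
    N, P, Q = 0, 0.0, 0.0
    for chunk in chunks:
        N += len(chunk)
        P += chunk.sum()
        Q += (chunk ** 2).sum()
        # The chunk can be discarded here; only N, P, and Q are kept
    return N, P, Q

N, P, Q = accumulate_sums(random_chunks(count=1000000, chunk_size=100000, seed=12345))
average = P / N
sigma = numpy.sqrt(Q / N - average ** 2)
print("average =", average, " sigma =", sigma)
```

At any given time only one chunk of 100,000 numbers is held in memory, regardless of the total count. In the parallel version, each chunk would be handed to a worker instead of being processed in the same loop.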


### 7.4 Speeding Up Random Number Generation (Challenge)

The `numpy.random.random` function used in the reduction code generates so-called *pseudo-random numbers*, i.e. a stream of numbers that look random to the eye but are actually generated deterministically (i.e. the stream can be replicated).
The generation of pseudo-random numbers can be a serious bottleneck in parallel computing.
In real computing, we can use a *parallel random number generator* to let the workers generate their own streams of random numbers independently, so that no single process becomes the bottleneck.
However, this also leads to issues in reproducibility: The computed results cannot be reproduced unless we use exactly the same set of random number streams--which mostly translates to repeating the computation with exactly the same parallel configuration.
In many computing applications, this is not a serious issue because the results from different computations only have to match statistically.

**EXERCISE**: Modify `rand_reduction_par.py` to use parallel random number generators.

> *Hints:* Start from numpy documentation: https://numpy.org/devdocs/reference/random/parallel.html .
> On Wahab, you must use numpy version 1.17 or newer to use `SeedSequence`.

(As an alternative, find resources on parallel random number generators.)
*Hint*: For many good random number generators, such as those provided by `numpy`, we can pre-mix the MPI rank into the random seed, optionally hashing the result (e.g. with MD5, SHA1, or SHA256), to obtain random streams that are both independent *and* reproducible.
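
As a starting point, the sketch below follows the `SeedSequence.spawn` approach described in the NumPy documentation linked above. It simulates four workers inside a single process, so no MPI is involved; the function name `make_worker_rngs` and the seed value are choices made for this example, and NumPy 1.17 or newer is assumed.

```python
import numpy

def make_worker_rngs(seed, nworkers):
    """Create one independent pseudo-random generator per worker from a single seed."""
    root = numpy.random.SeedSequence(seed)
    children = root.spawn(nworkers)          # one child seed sequence per worker
    return [numpy.random.default_rng(child) for child in children]

# Simulate 4 workers in a single process; under MPI, rank r would use rngs[r] only
rngs = make_worker_rngs(seed=10000000, nworkers=4)
samples = [rng.random(3) for rng in rngs]
for r, s in enumerate(samples):
    print("worker", r, s)
```

Repeating the call with the same `seed` reproduces the same streams, while different workers obtain different streams. In an MPI program each rank would construct only its own generator rather than the whole list.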

## Summary

The objective of this hands-on lesson module is to give you a "taste" of MPI's basic capabilities to enable parallel programming, and to guide you through a hands-on experience of parallelizing a simple program, (nearly) from scratch.

Parallel programming involves both science and art--even more so than writing a serial program.
In creating a parallel program, there are a lot of design decisions that have to be made with much consideration, because each of these may dramatically improve or hurt performance.
It takes some experience to judge whether to parallelize a portion of a code or not, and to decide the distribution layout of the data, and so on.
These considerations are beyond the scope of this workshop.
Please refer to some learning resources below to learn more.

## Learning Resources

The following resources are valuable if you want to gain mastery in parallel programming using MPI.
Most resources use the C/C++/Fortran languages because these are the languages targeted by the MPI standard.
(Truthfully, if you aim for truly high performance, you should seriously consider programming in these languages.)

### Message Passing Interface (MPI) -- A Tutorial by LLNL

https://hpc-tutorials.llnl.gov/mpi/

Written for C/C++/Fortran languages.
Covers more than the basic functionality, including derived data types and topologies.


### MPI Tutorial

<https://mpitutorial.com/tutorials/>

Targets the C++ language.
This excellent tutorial focuses on in-depth, technical discussion of each MPI capability and how to use it properly.
Sample programs are provided.


### Parallel Programming Concepts and High-Performance Computing: Introduction

<https://cvw.cac.cornell.edu/parallel/default>

Covers many high-level concepts for architecting a parallel program,
as well as considerations for achieving high performance.


### HPC University Bi-Weekly Challenge

* Challenges: <http://hpcuniversity.org/students/weeklyChallenge/>
* Resources: <http://hpcuniversity.org/students/weeklyChallenge/resources/>

Contains many hands-on exercises for MPI and high-performance programming.
Sample/starting programs are available for download.
Note: Not all challenges are related to MPI.