# Shared Memory Parallelism Using OpenMP
We want to use OpenMP to enable parallel execution of our codes. If multiple workers (threads) share the work of one job, execution can be sped up.

In [None]:
export OMP_NUM_THREADS=10
export OMP_PROC_BIND=close
export OMP_PLACES=cores

In [None]:
pygmentize omp_examples/a01-parallel.F90

We need to tell the compiler that our program contains OpenMP directives with the `-fopenmp` option.

In [None]:
gfortran omp_examples/a01-parallel.F90 -fopenmp -o parallel.out

Now we can run the generated executable.

In [None]:
./parallel.out

<div class="alert alert-block alert-warning">
    We see that each worker enters the parallel region and executes the code independently of the other workers.<br>
    If we run the same block multiple times, the order of the output changes from run to run, so it is not deterministic. The numbers printed, though, are always the same.
</div>    

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>1.</b> Can you explain why the numbers are the same but the order is jumbled up?
    
<b>2.</b> Why do you see this number of outputs?
</div>

<b>ANSWERS</b><p>

1. ...
2. ...

## Infos About the Region
OpenMP lets us query the number of threads present in each parallel region.

In [None]:
pygmentize omp_examples/a02-infos.F90

In [None]:
gfortran omp_examples/a02-infos.F90 -fopenmp -o infos.out && ./infos.out

<div class="alert alert-block alert-warning">
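    We can inspect the current thread number and the total number of threads via OpenMP runtime library routines (in standard OpenMP these are <code>omp_get_thread_num()</code> and <code>omp_get_num_threads()</code>)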
</div>    

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>3.</b> Can you think of a good reason why this information might be useful?
    
</div>

<b>ANSWERS</b><p>

3. ...

## Setting the Number of Threads
We can set the number of threads either on each parallel region or with an environment variable.

In [None]:
pygmentize omp_examples/a03-threadnum.F90

In [None]:
gfortran omp_examples/a03-threadnum.F90 -fopenmp -o threadnum.out && ./threadnum.out

<div class="alert alert-block alert-warning">
    We can set the number of threads with the <code>num_threads</code> clause for parallel regions
</div>

In [None]:
OMP_NUM_THREADS=4 ./infos.out

<div class="alert alert-block alert-warning">
    We can set the number of threads to use when nothing is specified with the <code>OMP_NUM_THREADS</code> environment variable.
</div>

In [None]:
OMP_NUM_THREADS=4 ./threadnum.out

<div class="alert alert-block alert-warning">
    We see that <code>num_threads</code> takes precedence over <code>OMP_NUM_THREADS</code>
</div>

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>4.</b> Why do you think num_threads is higher up in the order of precedence?<br>
<b>5.</b> Why do we need two ways of controlling this? Are these two ways fulfilling the same purpose?
</div>

<b>ANSWERS</b><p>

4. ...
5. ...
    

## Parallel Loops
Since loops are such an important target for parallelization, there is a special directive for them.

In [None]:
pygmentize omp_examples/a04-loops.F90

In [None]:
gfortran omp_examples/a04-loops.F90 -fopenmp -o loops.out

In [None]:
OMP_NUM_THREADS=4 ./loops.out

<div class="alert alert-block alert-warning">
    We see that the loop order is not preserved: threads take different numbers of iterations and their output is not sorted
</div>

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>6.</b> Since this is an existing concept, it has its applications. Can you think of a simple example where the loop order does not matter? Can you also think of one where an unordered loop will break the program flow?<br>
    
</div>

<b>ANSWERS</b><p>

6. ...

In [None]:
OMP_NUM_THREADS=12 ./loops.out

<div class="alert alert-block alert-warning">
    If more threads than loop iterations are available, only the first set of threads is used
</div>

## Scheduling
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Controlling the loop execution with the `schedule` clause can be important for performance. Since the work per iteration is often similar in weather and climate codes, this is not critical for our example, but it is still a concept worth knowing.

In [None]:
pygmentize omp_examples/a05-schedule.F90

In [None]:
gfortran omp_examples/a05-schedule.F90 -fopenmp -o schedule.out

In [None]:
OMP_NUM_THREADS=2 ./schedule.out

<div class="alert alert-block alert-warning">
    Static scheduling allows us to assign fixed chunks of the iterations to the same thread
</div>
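
The chunk assignment can be pictured with a small Python model. This is only a sketch, assuming a `schedule(static, chunk_size)` style assignment in which chunks are handed out round-robin in thread order. The function name `static_schedule` is hypothetical and the iteration indices are 0-based, unlike a typical Fortran loop.

```python
def static_schedule(n_iterations, n_threads, chunk_size):
    """Model a round-robin chunked assignment: returns {thread_id: [iterations]}."""
    assignment = {tid: [] for tid in range(n_threads)}
    # Walk over the iteration space one chunk at a time.
    for chunk_index, start in enumerate(range(0, n_iterations, chunk_size)):
        tid = chunk_index % n_threads          # chunks go to threads in turn
        stop = min(start + chunk_size, n_iterations)
        assignment[tid].extend(range(start, stop))
    return assignment


# 10 iterations, 2 threads, chunks of 2 iterations each
for tid, iterations in static_schedule(10, 2, 2).items():
    print(f"thread {tid}: {iterations}")
```

Because the mapping depends only on the loop bounds, the thread count and the chunk size, it is the same on every run, which is what distinguishes it from the run-to-run variation seen in the output order.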

## Variable Scoping
We try to understand how variables are scoped, who owns them at which parts of the code and who can see effects of writing them.

In [None]:
pygmentize omp_examples/a06-scoping.F90

In [None]:
gfortran omp_examples/a06-scoping.F90 -fopenmp -o scoping.out

In [None]:
./scoping.out

<div class="alert alert-block alert-warning">
    We see that shared variables can cause race conditions
</div>
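
How an update can get lost is easier to see when the interleaving is written out by hand. The following Python sketch does not use real threads. It replays one possible ordering of the read and write steps of two hypothetical threads that both increment a shared counter (the function name `increment_twice` is made up for illustration).

```python
def increment_twice(interleaved):
    """Replay two 'threads' incrementing a shared counter, step by step."""
    shared = {"counter": 0}
    if interleaved:
        # Both threads read the old value before either one writes back.
        read_a = shared["counter"]
        read_b = shared["counter"]
        shared["counter"] = read_a + 1
        shared["counter"] = read_b + 1   # overwrites the update of thread A
    else:
        # Thread A finishes its read-modify-write before thread B starts.
        read_a = shared["counter"]
        shared["counter"] = read_a + 1
        read_b = shared["counter"]
        shared["counter"] = read_b + 1
    return shared["counter"]


print("one after the other:", increment_twice(interleaved=False))
print("interleaved:        ", increment_twice(interleaved=True))
```

In a real parallel region the ordering is decided by the scheduler at run time, so either outcome may occur.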

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>7.</b> Run the above code 5 times. What do you observe?<br>
<b>8.</b> Why is the code above problematic? Is what we're doing here still compatible with the Turing Machine way of modelling computers?
    
</div>

<b>ANSWERS</b><p>

7. ...
8. ...

The `private` clause gives each thread its own copy of a variable:

In [None]:
pygmentize omp_examples/a07-private.F90

In [None]:
gfortran omp_examples/a07-private.F90 -fopenmp -o private.out

In [None]:
OMP_NUM_THREADS=10 ./private.out

<div class="alert alert-block alert-warning">
    We see:
    <ul>
        <li>Private variables are uninitialized when entering the parallel region</li>
        <li>Private variables do not cause race conditions</li>
        <li>The values in private variables are lost after exiting the parallel region</li>
    </ul>
</div>

`shared` is the default but can also be stated explicitly:

In [None]:
pygmentize omp_examples/a08-shared.F90

In [None]:
gfortran omp_examples/a08-shared.F90 -fopenmp -o shared.out

In [None]:
OMP_NUM_THREADS=10 ./shared.out

<div class="alert alert-block alert-warning">
    We find a way to explicitly use shared variables in parallel regions
</div>

## Special Regions
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Some pieces of code are sensitive to how many threads execute them, or to which thread does. OpenMP offers the following options:

In [None]:
pygmentize omp_examples/a09-regions.F90

In [None]:
gfortran omp_examples/a09-regions.F90 -fopenmp -o regions.out && ./regions.out

<div class="alert alert-block alert-warning">
    We learn that thread 0 is the only one entering <code>master</code>, only one thread ever enters <code>single</code> and every thread enters <code>critical</code>, but only one at a time
</div>

## What if parallel regions span multiple tasks
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Depending on how the parallelization was done, threads may need to wait for each other. The `!$omp barrier` directive (`#pragma omp barrier` in C/C++) is used for that.

In [None]:
pygmentize omp_examples/a10-barrier.F90

In [None]:
gfortran omp_examples/a10-barrier.F90 -fopenmp -o barrier.out && ./barrier.out

<div class="alert alert-block alert-warning">
    We learn that there is no guarantee on the execution order of different statements in a parallel block across threads
</div>

In [None]:
pygmentize omp_examples/a11-barrier.F90

In [None]:
gfortran omp_examples/a11-barrier.F90 -fopenmp -o barrier2.out && ./barrier2.out

<div class="alert alert-block alert-warning">
    We learn that barriers help synchronize code
</div>
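
The same synchronization pattern can be tried out with Python's standard `threading.Barrier`. This is an analogy rather than OpenMP itself, and the names `run_with_barrier` and `worker` are hypothetical: every thread logs a first phase, waits at the barrier, and only then logs a second phase.

```python
import threading


def run_with_barrier(n_threads):
    """Run n_threads workers that synchronize once between two phases."""
    log = []                                  # list.append is thread-safe in CPython
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        log.append(("phase1", tid))
        barrier.wait()                        # nobody continues until all have arrived
        log.append(("phase2", tid))

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


log = run_with_barrier(4)
print(log)
```

The order of the thread ids within each phase may still change between runs, but no `phase2` entry can appear before the last `phase1` entry.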

## Nowait
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Certain OpenMP constructs come with an implicit barrier at their end; `nowait` is the keyword to explicitly disable it.

In [None]:
pygmentize omp_examples/a12-loops.F90

In [None]:
gfortran omp_examples/a12-loops.F90 -fopenmp -o loops.out && ./loops.out

<div class="alert alert-block alert-warning">
    We learn that loops synchronize afterwards
</div>

In [None]:
pygmentize omp_examples/a13-nowait.F90

In [None]:
gfortran omp_examples/a13-nowait.F90 -fopenmp -o nowait.out && ./nowait.out

<div class="alert alert-block alert-warning">
    We learn that the <code>nowait</code> keyword can remove the synchronization
</div>

## Reductions
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Since reductions are such an omnipresent motif, we do not want to implement them with critical / atomic every time, so there is a keyword for them:

In [None]:
pygmentize omp_examples/a14-reduction.F90

In [None]:
gfortran omp_examples/a14-reduction.F90 -fopenmp -o reduction.out && ./reduction.out

<div class="alert alert-block alert-warning">
    We learn about the reduction keyword
</div>
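
Conceptually, a sum reduction lets every thread accumulate into its own private partial result and combines the partial results once at the end. The Python sketch below models this serially. The function name `reduction_style_sum` and the way the values are dealt out to workers are assumptions made for illustration, not how an OpenMP runtime is required to distribute the work.

```python
def reduction_style_sum(values, n_workers):
    """Sum values via per-worker partial sums, then combine them once."""
    partials = []
    for worker in range(n_workers):
        # Each worker only touches its own slice and its own partial sum,
        # so no two workers ever write to the same variable.
        partials.append(sum(values[worker::n_workers]))
    return sum(partials), partials


total, partials = reduction_style_sum(list(range(1, 101)), 4)
print("partial sums:", partials)
print("total:", total)
```

Only the final combination step touches a common result, which is why no `critical` or `atomic` protection is needed inside the loop.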

# Amdahl's law

We try to investigate the performance of a simple example, computing $\pi$ using the Leibniz formula:

$$1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \dots = \frac{\pi}{4}$$

Our goal is to understand what might affect performance and how this is reflected in strong and weak scaling plots
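
For reference, a serial Python version of the series might look like the sketch below. It is not the Fortran code used in the timing examples, and the function name `leibniz_pi` is hypothetical.

```python
import math


def leibniz_pi(n_terms):
    """Approximate pi with the first n_terms terms of the Leibniz series."""
    total = 0.0
    for k in range(n_terms):
        # k-th term: (-1)^k / (2k + 1), i.e. 1, -1/3, 1/5, -1/7, ...
        total += (-1.0) ** k / (2 * k + 1)
    return 4.0 * total


for n_terms in (10, 1000, 100000):
    approximation = leibniz_pi(n_terms)
    print(n_terms, approximation, abs(approximation - math.pi))
```

Every term can be computed independently of the others and the terms are combined by a sum, so the loop is a natural candidate for a parallel loop with a reduction.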

## Strong Scaling

Understanding how much faster the code gets through parallelization at a fixed problem size.
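
Amdahl's law gives an upper bound for the speedup: if a fraction $p$ of the runtime can be parallelized perfectly, the speedup on $N$ threads is at most $1 / ((1 - p) + p / N)$. The Python sketch below evaluates this bound. The parallel fraction of 0.95 is an arbitrary illustrative value, not a measurement of our example, and the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Upper bound on speedup according to Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# Illustrative only: assume 95% of the runtime parallelizes perfectly.
for n_threads in (1, 2, 4, 8, 12):
    print(n_threads, round(amdahl_speedup(0.95, n_threads), 2))
```

However many threads are used, the bound never exceeds $1 / (1 - p)$, which makes it a useful reference curve when comparing measured speedups with the ideal linear line.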

In [None]:
pygmentize omp_examples/b01-timing.F90

In [None]:
gfortran -fopenmp omp_examples/b01-timing.F90 -o timing.out

In [None]:
echo -n > out.txt
for nthread in `seq 12` ; do
  ./timing.out $nthread >> out.txt
done

In [None]:
python -c "
import matplotlib
matplotlib.use('Agg')
import numpy as np, matplotlib.pyplot as plt;
data = np.loadtxt('out.txt');
threads = data[:,0].astype(int); runtimes = data[:,1];
plt.figure(figsize=(8,6));
plt.bar(threads, runtimes, color='purple', label='runtime');
plt.xlabel('# threads'); plt.ylabel('runtime [s]'); plt.legend(); plt.tight_layout();
plt.savefig('out.png', dpi=72)
"

cat out.png | display

In [None]:
python -c "
import matplotlib
matplotlib.use('Agg')
import numpy as np, matplotlib.pyplot as plt
data = np.loadtxt('out.txt')
threads = data[:, 0].astype(int)
runtimes = data[:, 1]
base_runtime = runtimes[0]
speedup = base_runtime / runtimes
linear = threads
plt.figure(figsize=(8,6))
plt.bar(threads, speedup, color='purple', label='speedup')
plt.plot(threads, linear, color='turquoise', label='linear')
plt.xlabel('# threads')
plt.ylabel('speedup (relative to 1 core)')
plt.ylim(0, 14)
plt.legend()
plt.tight_layout()
plt.savefig('out.png', dpi=72)
"

cat out.png | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>9.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>    
</div>

<b>ANSWERS</b><p>

9. ...


## Weak scaling
Measuring the performance hit we get when the problem size grows together with the number of threads.

In [None]:
pygmentize omp_examples/b02-weakscaling.F90

In [None]:
gfortran -fopenmp omp_examples/b02-weakscaling.F90 -o weak.out

In [None]:
echo -n > weak.txt
for nthread in `seq 72` ; do
    ./weak.out $nthread >> weak.txt
done

In [None]:
python -c "
import matplotlib
matplotlib.use('Agg')
import numpy as np, matplotlib.pyplot as plt
data = np.loadtxt('weak.txt')
threads = data[:, 0].astype(int)
runtimes = data[:, 1]
base_runtime = runtimes[0]
efficiency = (base_runtime / runtimes)
plt.figure(figsize=(8,6))
plt.bar(threads, efficiency, color='purple', label='efficiency')
plt.xlabel('# threads')
plt.ylabel('parallel efficiency')
plt.ylim(0, 1.1)
plt.legend()
plt.tight_layout()
plt.savefig('out.png', dpi=72)
"

cat out.png | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>10.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>
</div>

<b>ANSWERS</b><p>

10. ...


## Caching

Here we see the implication of caching in a multithreaded environment

In [None]:
pygmentize omp_examples/b03-caching.F90

In [None]:
gfortran -fopenmp omp_examples/b03-caching.F90 -o caching.out

In [None]:
echo -n > caching.txt
for nthread in `seq 72` ; do
  ./caching.out $nthread >> caching.txt
done

In [None]:
python -c "
import matplotlib
matplotlib.use('Agg')
import numpy as np, matplotlib.pyplot as plt
data = np.loadtxt('caching.txt')
threads = data[:, 0].astype(int)
runtimes = data[:, 1]
plt.figure(figsize=(8,6))
plt.bar(threads, runtimes, color='purple', label='runtime')
plt.xlabel('# threads')
plt.ylabel('runtime [s]')
plt.ylim(0, 0.03)
plt.legend()
plt.tight_layout()
plt.savefig('out.png', dpi=72)
"

cat out.png | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>11.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>
</div>

<b>ANSWERS</b><p>

11. ...

In [None]:
pygmentize omp_examples/b04-caching_fast.F90

In [None]:
gfortran -fopenmp omp_examples/b04-caching_fast.F90 -o caching2.out

In [None]:
echo -n > caching2.txt
for nthread in `seq 72` ; do
  ./caching2.out $nthread >> caching2.txt
done

In [None]:
python -c "
import matplotlib
matplotlib.use('Agg')
import numpy as np, matplotlib.pyplot as plt
data = np.loadtxt('caching2.txt')
threads = data[:, 0].astype(int)
runtimes = data[:, 1]
plt.figure(figsize=(8,6))
plt.bar(threads, runtimes, color='purple', label='runtime')
plt.xlabel('# threads')
plt.ylabel('runtime [s]')
#plt.ylim(0, 0.03)
plt.legend()
plt.tight_layout()
plt.savefig('out.png', dpi=72)
"

cat out.png | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>12.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>
</div>

<b>ANSWERS</b><p>

12. ...

In [None]:
make clean_examples