# Shared Memory Parallelism Using OpenMP
We use OpenMP to run parts of our code in parallel on a shared-memory machine. If several threads share the same job, the execution time can go down.

In [None]:
pygmentize omp_examples/a01-parallel.cpp

We need to tell the compiler that our program contains OpenMP pragmas with the `-fopenmp` option.

In [None]:
g++ omp_examples/a01-parallel.cpp -fopenmp -o parallel

Now we can run the generated executable.

In [None]:
./parallel
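
As a rough sketch of what a parallel region does (this is not the content of `a01-parallel.cpp`; the function name `count_region_executions` is made up for illustration), the block following `#pragma omp parallel` is executed once by every thread in the team:

```cpp
// Hypothetical helper: runs one parallel region and returns how many
// threads executed its body.
int count_region_executions() {
    int executions = 0;  // shared by all threads in the region
    #pragma omp parallel
    {
        // Every thread in the team runs this block once; the atomic
        // directive protects the concurrent update of the shared counter.
        #pragma omp atomic
        executions += 1;
    }
    return executions;
}
```

Compiled with `-fopenmp`, the function returns the size of the thread team. Without the flag the pragmas are ignored and it returns 1.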

## Information About the Region
OpenMP lets us query information about a parallel region at runtime, such as the number of threads present in it.

In [None]:
pygmentize omp_examples/a02-infos.cpp

In [None]:
g++ omp_examples/a02-infos.cpp -fopenmp -o infos && ./infos
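
A minimal sketch of such queries, assuming the two standard runtime calls `omp_get_thread_num()` and `omp_get_num_threads()` from `omp.h`. The helper name `observed_team_sizes` is hypothetical, and the `_OPENMP` guard only exists so that the code also builds without `-fopenmp`:

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: each thread records the team size it observes.
// Returns one entry per thread, indexed by thread id.
std::vector<int> observed_team_sizes() {
    std::vector<int> sizes(1, 1);  // serial fallback: one thread, team of 1
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Exactly one thread resizes the vector; the implicit barrier at
        // the end of `single` makes the new size visible to the whole team.
        #pragma omp single
        sizes.assign(static_cast<std::size_t>(omp_get_num_threads()), 0);

        // omp_get_thread_num() is this thread's id within the team (0..n-1).
        sizes[omp_get_thread_num()] = omp_get_num_threads();
    }
#endif
    return sizes;
}
```

Every entry of the returned vector equals the length of the vector, since all threads of a region see the same team size.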

## Setting the Number of Threads
We can set the number of threads either on each parallel region or globally with an environment variable.

In [None]:
pygmentize omp_examples/a03-threadnum.cpp

In [None]:
g++ omp_examples/a03-threadnum.cpp -fopenmp -o threadnum && ./threadnum

The `OMP_NUM_THREADS` environment variable sets the number of threads for parallel regions that do not specify one themselves.

In [None]:
OMP_NUM_THREADS=4 ./infos

In [None]:
OMP_NUM_THREADS=4 ./threadnum
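
The per-region variant could look like the following sketch, which uses the `num_threads` clause. The helper name `team_size_with_clause` is an assumption for illustration and is not taken from `a03-threadnum.cpp`:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: requests `requested` threads for one region and
// returns the team size that the runtime actually provided.
int team_size_with_clause(int requested) {
    int team_size = 1;  // value when compiled without OpenMP
#ifdef _OPENMP
    // The num_threads clause applies to this region only and takes
    // precedence over OMP_NUM_THREADS.
    #pragma omp parallel num_threads(requested)
    {
        #pragma omp single
        team_size = omp_get_num_threads();
    }
#endif
    return team_size;
}
```

A call such as `team_size_with_clause(3)` asks for three threads regardless of `OMP_NUM_THREADS`; the runtime is allowed to provide fewer.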

## Parallel Loops
Since loops are such an important construct to parallelize, there is a special directive for them.

In [None]:
pygmentize omp_examples/a04-loops.cpp

In [None]:
g++ omp_examples/a04-loops.cpp -fopenmp -o loops

In [None]:
OMP_NUM_THREADS=4 ./loops

In [None]:
OMP_NUM_THREADS=12 ./loops
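
As an illustration of the loop directive (a sketch, not the code in `a04-loops.cpp`; the function `squared` is a made-up example), `#pragma omp parallel for` splits the iterations of the following loop among the threads:

```cpp
#include <vector>

// Hypothetical example: square every element of a vector in parallel.
std::vector<double> squared(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    const long n = static_cast<long>(in.size());
    // `parallel for` creates a team and distributes the iterations among
    // its threads. Each iteration writes a different element, so the
    // threads never touch the same memory location.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
    return out;
}
```

The loop works because its iterations are independent of each other; a loop in which iteration `i` reads what iteration `i - 1` wrote could not be parallelized this way.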

## Scheduling
<div class="alert alert-block alert-warning">
This is an important concept to understand but is not necessarily used in our example
</div>

Controlling how loop iterations are distributed among threads with the `schedule` clause can matter for performance. Since the work per iteration is often similar in weather and climate codes, this is not critical for our example, but it is still a concept worth knowing.

In [None]:
pygmentize omp_examples/a05-schedule.cpp

In [None]:
g++ omp_examples/a05-schedule.cpp -fopenmp -o schedule

In [None]:
OMP_NUM_THREADS=2 ./schedule
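
To see where scheduling matters, consider a loop whose iterations have unequal cost. The sketch below is an assumed workload (the function `triangular_numbers` is hypothetical and unrelated to `a05-schedule.cpp`) that uses `schedule(dynamic, 4)`:

```cpp
#include <vector>

// Hypothetical workload whose cost grows with i, so a split into equal
// contiguous blocks would give the threads handling large i more work.
std::vector<long> triangular_numbers(int n) {
    std::vector<long> result(n, 0);
    // schedule(dynamic, 4): a thread that becomes idle grabs the next
    // chunk of 4 iterations. Try schedule(static) or schedule(guided)
    // here and compare the runtimes.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; ++i) {
        long sum = 0;
        for (int k = 1; k <= i; ++k) sum += k;  // i additions for entry i
        result[i] = sum;
    }
    return result;
}
```

Dynamic scheduling balances the load at the price of some bookkeeping per chunk, which is why it rarely pays off when all iterations cost the same.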

## Variable Scoping
We try to understand how variables are scoped: which thread owns them in which part of the code, and which threads can see the effect of writing to them.

In [None]:
pygmentize omp_examples/a06-scoping.cpp

In [None]:
g++ omp_examples/a06-scoping.cpp -fopenmp -o scoping

In [None]:
./scoping

The `private` clause gives each thread its own copy of a variable. The copies are uninitialized when the region starts:

In [None]:
pygmentize omp_examples/a07-private.cpp

In [None]:
g++ omp_examples/a07-private.cpp -fopenmp -o private

In [None]:
OMP_NUM_THREADS=10 ./private
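
A typical use of `private` is a temporary that is declared outside a parallel loop. The following is a sketch under that assumption; the function `hypotenuses` and its arguments are invented for illustration and do not come from `a07-private.cpp`:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical example: element-wise sqrt(a^2 + b^2) with a temporary
// that is declared outside the loop. Assumes a and b have equal length.
std::vector<double> hypotenuses(const std::vector<double>& a,
                                const std::vector<double>& b) {
    std::vector<double> out(a.size());
    const long n = static_cast<long>(a.size());
    double tmp;  // declared outside the loop, so it would be shared by default
    // private(tmp) gives every thread its own uninitialized copy, so the
    // threads do not overwrite each other's intermediate result.
    #pragma omp parallel for private(tmp)
    for (long i = 0; i < n; ++i) {
        tmp = a[i] * a[i] + b[i] * b[i];  // assigned before it is read
        out[i] = std::sqrt(tmp);
    }
    return out;
}
```

Declaring `tmp` inside the loop body would have the same effect, since variables declared inside the region are private automatically.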

`shared` is the default for variables declared outside the parallel region, but it can also be stated explicitly.

In [None]:
pygmentize omp_examples/a08-shared.cpp

In [None]:
g++ omp_examples/a08-shared.cpp -fopenmp -o shared

In [None]:
OMP_NUM_THREADS=10 ./shared

## Special Regions
<div class="alert alert-block alert-warning">
This is an important concept to understand but is not necessarily used in our example
</div>

Some pieces of code are sensitive to how threads execute them, for example because only one thread may run them at a time. Here are the options:

In [None]:
pygmentize omp_examples/a09-regions.cpp

In [None]:
g++ omp_examples/a09-regions.cpp -fopenmp -o regions && ./regions
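
Two constructs commonly used for such code are `critical` and `single`. Whether `a09-regions.cpp` uses exactly these is an assumption; the struct `RegionCounts` and the function `visit_regions` below are hypothetical names for this sketch:

```cpp
// Hypothetical bookkeeping for the two kinds of region.
struct RegionCounts {
    int critical_visits;  // incremented by every thread, one at a time
    int single_visits;    // incremented by exactly one thread
};

RegionCounts visit_regions() {
    RegionCounts counts{0, 0};
    #pragma omp parallel
    {
        // critical: every thread executes this statement, but never two
        // threads at the same time.
        #pragma omp critical
        counts.critical_visits += 1;

        // single: only one thread of the team executes this statement;
        // the others wait at the implicit barrier that follows it.
        #pragma omp single
        counts.single_visits += 1;
    }
    return counts;
}
```

After the call, `single_visits` is 1, while `critical_visits` equals the number of threads in the team.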

## What If Parallel Regions Span Multiple Tasks?
<div class="alert alert-block alert-warning">
This is an important concept to understand but is not necessarily used in our example
</div>

Depending on how the parallelization was done, threads might need to wait for each other. The `#pragma omp barrier` directive is used for that: no thread continues past it until all threads of the team have reached it.

In [None]:
pygmentize omp_examples/a10-barrier.cpp

In [None]:
g++ omp_examples/a10-barrier.cpp -fopenmp -o barrier && ./barrier

In [None]:
pygmentize omp_examples/a11-barrier.cpp

In [None]:
g++ omp_examples/a11-barrier.cpp -fopenmp -o barrier2 && ./barrier2
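
A sketch of a situation that needs a barrier: in a first phase every thread writes its own slot, and in a second phase it reads the slot of its neighbour. The function `neighbours_visible` is a made-up example and not the content of the barrier files shown here:

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical two-phase computation: returns true if every thread saw
// the value that its neighbour wrote in phase 1.
bool neighbours_visible() {
    std::vector<int> slot;
    bool all_ok = true;
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int id = omp_get_thread_num();
        const int n = omp_get_num_threads();
#else
        const int id = 0, n = 1;  // serial fallback
#endif
        // One thread allocates the slots; implicit barrier after single.
        #pragma omp single
        slot.assign(static_cast<std::size_t>(n), -1);

        slot[id] = id;  // phase 1: write own slot

        // Without this barrier a thread could read a neighbour's slot
        // before the neighbour has written it.
        #pragma omp barrier

        const int right = (id + 1) % n;  // phase 2: read neighbour's slot
        if (slot[right] != right) {
            #pragma omp critical
            all_ok = false;
        }
    }
    return all_ok;
}
```

Removing the `barrier` line turns the read in phase 2 into a race, so the function may then return `false` on some runs.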

## Nowait
<div class="alert alert-block alert-warning">
This is an important concept to understand but is not necessarily used in our example
</div>

Certain OpenMP constructs end with an implicit barrier. The `nowait` clause explicitly disables it.

In [None]:
pygmentize omp_examples/a12-loops.cpp

In [None]:
g++ omp_examples/a12-loops.cpp -fopenmp -o loops && ./loops

In [None]:
pygmentize omp_examples/a13-nowait.cpp

In [None]:
g++ omp_examples/a13-nowait.cpp -fopenmp -o nowait && ./nowait
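
The following sketch shows the idea on two independent loops inside one parallel region. The function `fill_both` is a hypothetical example and not taken from `a13-nowait.cpp`:

```cpp
#include <vector>

// Hypothetical example: two independent loops inside one parallel region.
void fill_both(std::vector<int>& a, std::vector<int>& b) {
    const int na = static_cast<int>(a.size());
    const int nb = static_cast<int>(b.size());
    #pragma omp parallel
    {
        // nowait removes the implicit barrier at the end of this loop, so
        // a thread that finishes early can start on the second loop right
        // away. This is only safe because the second loop never reads `a`.
        #pragma omp for nowait
        for (int i = 0; i < na; ++i) a[i] = 2 * i;

        // This loop keeps its implicit barrier.
        #pragma omp for
        for (int j = 0; j < nb; ++j) b[j] = j + 1;
    }
}
```

If the second loop depended on values of `a` computed by other threads, the `nowait` would have to be removed again.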

## Reductions
<div class="alert alert-block alert-warning">
This is an important concept to understand but is not necessarily used in our example
</div>

Since reductions are such an omnipresent motif, we do not want to implement them with `critical` or `atomic` every time, so there is a clause for them:

In [None]:
pygmentize omp_examples/a14-reduction.cpp

In [None]:
g++ omp_examples/a14-reduction.cpp -fopenmp -o reduction && ./reduction
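
A minimal sketch of the `reduction` clause on a sum (the function `parallel_sum` is an illustrative name, not the code in `a14-reduction.cpp`):

```cpp
#include <vector>

// Hypothetical example: sum the elements of a vector in parallel.
double parallel_sum(const std::vector<double>& values) {
    double total = 0.0;
    const long n = static_cast<long>(values.size());
    // reduction(+:total): each thread accumulates into a private copy
    // initialized to 0, and the copies are added to `total` at the end.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += values[i];
    }
    return total;
}
```

Because floating-point addition is not associative, the result can differ in the last digits from run to run when the values are not exactly representable; the order in which the partial sums are combined is not fixed.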

# Amdahl's Law

We try to investigate the performance of a simple example, computing $\pi$ using the Leibniz formula:

$$1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \dots = \frac{\pi}{4}$$

Our goal is to understand what might affect performance and how this is reflected in strong and weak scaling plots.

## Strong Scaling

Understanding how much faster the code gets through parallelization when the problem size stays fixed.
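
Amdahl's law gives an upper bound for this speedup. In its usual form, with $p$ the fraction of the runtime that can be parallelized and $N$ the number of threads (both symbols are introduced here and are not defined elsewhere in this notebook), it reads:

$$S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}$$

For example, if $p = 0.9$, the speedup can never exceed $10$, no matter how many threads are used. With $N = 12$ it is at most $1 / (0.1 + 0.9/12) \approx 5.7$.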

In [None]:
pygmentize omp_examples/b01-timing.cpp

In [None]:
g++ -fopenmp omp_examples/b01-timing.cpp -o timing

In [None]:
./timing 1 > out.txt
./timing 2 >> out.txt
./timing 3 >> out.txt
./timing 4 >> out.txt
./timing 5 >> out.txt
./timing 6 >> out.txt
./timing 7 >> out.txt
./timing 8 >> out.txt
./timing 9 >> out.txt
./timing 10 >> out.txt
./timing 11 >> out.txt
./timing 12 >> out.txt

In [None]:
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:0.1]; \
set xlabel '# cores'; \
set ylabel 'runtime [s]'; \
plot 'out.txt' using 2: xtic(1) title 'runtime' with histogram \
" | display

In [None]:
base=`head -1 out.txt | awk '{print $2}'`
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:14]; \
set xlabel '# cores'; \
set ylabel 'speedup (relative to 1 core)'; \
plot 'out.txt' using ($base/\$2): xtic(1) title 'speedup' with histogram, \
'out.txt' using :(\$1) title 'linear' with lines\
" | display

## Weak Scaling
Measuring the performance hit we get when the problem size grows together with the number of threads, so that the work per thread stays the same.

In [None]:
pygmentize omp_examples/b02-weakscaling.cpp

In [None]:
g++ -fopenmp omp_examples/b02-weakscaling.cpp -o weak

In [None]:
./weak 1 > weak.txt
./weak 2 >> weak.txt
./weak 4 >> weak.txt
./weak 8 >> weak.txt
./weak 12 >> weak.txt
./weak 18 >> weak.txt
./weak 20 >> weak.txt
./weak 24 >> weak.txt
./weak 30 >> weak.txt

In [None]:
base=`head -1 weak.txt | awk '{print $2}'`
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:1.1]; \
set xlabel '# cores'; \
set ylabel 'parallel efficiency'; \
plot 'weak.txt' using ($base/\$2): xtic(1) title 'efficiency' with histogram \
" | display

## Caching

Here we see the implications of caching in a multithreaded environment.

In [None]:
pygmentize omp_examples/b03-caching.cpp

In [None]:
g++ -fopenmp omp_examples/b03-caching.cpp -o caching

In [None]:
./caching 1 > caching.txt
./caching 2 >> caching.txt
./caching 4 >> caching.txt
./caching 8 >> caching.txt
./caching 12 >> caching.txt
./caching 13 >> caching.txt
./caching 20 >> caching.txt
./caching 24 >> caching.txt

In [None]:
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:0.1]; \
set xlabel '# cores'; \
set ylabel 'runtime [s]'; \
plot 'caching.txt' using 2: xtic(1) title 'runtime' with histogram \
" | display