# Shared Memory Parallelism Using OpenMP
We want to use OpenMP to run our code in parallel. If multiple workers share the same job, execution speeds up.

In [None]:
pygmentize omp_examples/a01-parallel.cpp

We need to tell the compiler that our program contains OpenMP pragmas with the `-fopenmp` option.

In [None]:
g++ omp_examples/a01-parallel.cpp -fopenmp -o parallel.out

Now we can run the generated executable.

In [None]:
./parallel.out

<div class="alert alert-block alert-warning">
    We see that each worker enters the parallel region and executes the code independently of the other workers.<br>
    If we run the same cell multiple times, we see that the output is not deterministic because the order changes. We see, though, that the numbers printed are always the same.
</div>    
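
As a minimal sketch of what a parallel region looks like in source code (this is not the content of `a01-parallel.cpp`; the helper name `hello_region` and the counter are assumptions made for illustration), the block after `#pragma omp parallel` is executed once by every thread of the team:

```cpp
#include <cstdio>

// Hypothetical helper, not taken from a01-parallel.cpp.
// Returns how many threads executed the parallel block.
int hello_region() {
    int entered = 0;  // declared outside the region, so all threads share it
    #pragma omp parallel
    {
        // Every thread of the team runs this block once, in no fixed order.
        std::printf("hello from a worker\n");
        #pragma omp atomic
        entered++;  // atomic update so concurrent increments are not lost
    }
    return entered;
}
```

Calling `hello_region()` from `main` and compiling with `-fopenmp` should print one line per thread; without `-fopenmp` the pragmas are ignored and the block runs once.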

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>1.</b> Can you explain why the numbers are the same but the order is jumbled up?
    
<b>2.</b> Why do you see this number of outputs?
</div>

<b>TODO</b><br>
1: Why are the numbers the same?

2: Why do you see as many outputs as you see?

## Infos About the Region
OpenMP lets us query the number of threads present in each parallel region.

In [None]:
pygmentize omp_examples/a02-infos.cpp

In [None]:
g++ omp_examples/a02-infos.cpp -fopenmp -o infos.out && ./infos.out

<div class="alert alert-block alert-warning">
    We can inspect the current thread number and the total number of threads via OpenMP runtime library routines
</div>    
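
A small sketch of how these queries can look, assuming the standard routines `omp_get_thread_num()` and `omp_get_num_threads()` from `<omp.h>`. The helper `collect_thread_ids` and the serial fallbacks are assumptions for illustration and do not come from `a02-infos.cpp`:

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (an assumption, so the sketch also builds without -fopenmp).
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical helper: entry t of the result is written by thread t.
std::vector<int> collect_thread_ids() {
    std::vector<int> ids;
    #pragma omp parallel
    {
        // One thread sizes the vector; the implicit barrier at the end of
        // `single` ensures it is ready before any thread writes to it.
        #pragma omp single
        ids.resize(omp_get_num_threads(), -1);

        const int me = omp_get_thread_num();  // ranges from 0 to team size - 1
        ids[me] = me;                         // each thread touches only its own slot
    }
    return ids;
}
```

The length of the returned vector is the team size, and its entries are the thread numbers `0, 1, 2, ...` in order.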

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>3.</b> Can you think of a good reason why this information might be useful?
    
</div>

<b>TODO</b><br>
3: What is a good application of this information?

## Setting the Number of Threads
We can set the number of threads either on each parallel region or with an environment variable.

In [None]:
pygmentize omp_examples/a03-threadnum.cpp

In [None]:
g++ omp_examples/a03-threadnum.cpp -fopenmp -o threadnum.out && ./threadnum.out

<div class="alert alert-block alert-warning">
    We can set the number of threads with the <code>num_threads</code> clause for parallel regions
</div>
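
To illustrate the clause, here is a sketch under the assumption that we simply count how many threads enter a region. The helper `team_size_with_clause` is hypothetical and is not the code in `a03-threadnum.cpp`:

```cpp
// Hypothetical helper: requests `requested` threads for one region and
// returns how many threads actually executed it.
int team_size_with_clause(int requested) {
    int entered = 0;  // shared counter
    #pragma omp parallel num_threads(requested)
    {
        #pragma omp atomic
        entered++;  // one increment per thread in the team
    }
    return entered;
}
```

The clause is a request for this one region only. The runtime may provide fewer threads than requested, so the returned value lies between 1 and `requested`.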

In [None]:
OMP_NUM_THREADS=4 ./infos.out

<div class="alert alert-block alert-warning">
    We can set the number of threads to use when nothing is specified with the <code>OMP_NUM_THREADS</code> environment variable.
</div>

In [None]:
OMP_NUM_THREADS=4 ./threadnum.out

<div class="alert alert-block alert-warning">
    We see that <code>num_threads</code> takes precedence over <code>OMP_NUM_THREADS</code>
</div>

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>4.</b> Why do you think <code>num_threads</code> is higher up in the order of precedence?<br>
<b>5.</b> Why do we need two ways of controlling this? Do these two ways fulfil the same purpose?
    
</div>

<b>TODO</b><br>
4: Why does one take precedence?

5: Why do we have two ways to control this?

## Parallel Loops
Since loops are such an important concept to parallelize, there is a special directive for them.

In [None]:
pygmentize omp_examples/a04-loops.cpp

In [None]:
g++ omp_examples/a04-loops.cpp -fopenmp -o loops.out

In [None]:
OMP_NUM_THREADS=4 ./loops.out

<div class="alert alert-block alert-warning">
    We see that the loop order is not preserved: threads take different numbers of iterations, and the output is not sorted
</div>
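
A minimal sketch of the loop directive, assuming a hypothetical helper `visit_counts` that is not part of `a04-loops.cpp`. The iterations are divided among the threads rather than repeated by each of them, so every index is handled exactly once:

```cpp
#include <vector>

// Hypothetical helper: counts how often each iteration index is executed.
std::vector<int> visit_counts(int n) {
    std::vector<int> visits(n, 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        visits[i] += 1;  // index i belongs to exactly one thread, so no race
    }
    return visits;
}
```

Every entry of the result is 1, regardless of how many threads took part or in which order they ran.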

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>6.</b> Since this is an existing concept, it has its applications. Can you think of a simple example where the loop order does not matter? Can you also think of one where an unordered loop will break the program flow?<br>
    
</div>

<b>TODO</b><br>
6a: Example of where loop order does not matter

6b: Example of where random loop execution breaks the code

In [None]:
OMP_NUM_THREADS=12 ./loops.out

<div class="alert alert-block alert-warning">
    If more threads than loop iterations are available, only the first set of threads is used
</div>

## Scheduling
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Controlling the loop execution with the `schedule` clause can be important for performance. Since the work per iteration is often similar in weather and climate codes, this is not critical for our example, but it is still a concept worth knowing.

In [None]:
pygmentize omp_examples/a05-schedule.cpp

In [None]:
g++ omp_examples/a05-schedule.cpp -fopenmp -o schedule.out

In [None]:
OMP_NUM_THREADS=2 ./schedule.out

<div class="alert alert-block alert-warning">
    Static scheduling allows us to assign fixed chunks of iterations to the same thread
</div>
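
The following sketch records which thread runs which iteration under `schedule(static, chunk)`. The names `Assignment` and `static_owners` and the serial fallbacks are assumptions for illustration and are not taken from `a05-schedule.cpp`. With a static schedule and an explicit chunk size, chunks are handed out round-robin in order of thread number:

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (an assumption, so the sketch also builds without -fopenmp).
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical result type: team size plus the owning thread of each iteration.
struct Assignment {
    int team_size;
    std::vector<int> owner;
};

Assignment static_owners(int n, int chunk) {
    Assignment a{1, std::vector<int>(n, -1)};
    #pragma omp parallel
    {
        // Only thread 0 records the team size, so there is no race on it.
        if (omp_get_thread_num() == 0) a.team_size = omp_get_num_threads();

        #pragma omp for schedule(static, chunk)
        for (int i = 0; i < n; ++i) {
            a.owner[i] = omp_get_thread_num();
        }
    }
    return a;
}
```

For this schedule, iteration `i` ends up on thread `(i / chunk) % team_size`, which makes the mapping reproducible from run to run.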

## Variable Scoping
We try to understand how variables are scoped, who owns them at which parts of the code and who can see effects of writing them.

In [None]:
pygmentize omp_examples/a06-scoping.cpp

In [None]:
g++ omp_examples/a06-scoping.cpp -fopenmp -o scoping.out

In [None]:
./scoping.out

<div class="alert alert-block alert-warning">
    We see that shared variables can cause race conditions
</div>

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>7.</b> Run the above code 5 times. What do you observe?<br>
<b>8.</b> Why is the code above problematic? Is what we're doing here still compatible with the Turing Machine way of modelling computers?
    
</div>

<b>TODO</b><br>
7: Observation

8: What is the problem?

The `private` clause gives each thread its own copy of a variable:

In [None]:
pygmentize omp_examples/a07-private.cpp

In [None]:
g++ omp_examples/a07-private.cpp -fopenmp -o private.out

In [None]:
OMP_NUM_THREADS=10 ./private.out

<div class="alert alert-block alert-warning">
    We see:
    <ul>
        <li>Private variables are uninitialized on entry to the parallel region</li>
        <li>Private variables do not cause race conditions</li>
        <li>The values in private variables are lost after exiting the parallel region</li>
    </ul>
</div>
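
These three observations can be sketched in a few lines. The helper `private_demo` is hypothetical and is not the code in `a07-private.cpp`:

```cpp
// Hypothetical helper illustrating the private clause.
int private_demo() {
    int x = 42;  // the original variable
    #pragma omp parallel private(x)
    {
        // Each thread gets its own x, uninitialized on entry,
        // so we assign it before reading it.
        x = 7;
    }
    // With OpenMP enabled the private copies are gone and the
    // original x still holds 42.
    return x;
}
```

When compiled with `-fopenmp` the function returns 42. Without the flag the pragma is ignored, the assignment hits the original variable, and the function returns 7.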

`shared` is the default but can also be stated explicitly:

In [None]:
pygmentize omp_examples/a08-shared.cpp

In [None]:
g++ omp_examples/a08-shared.cpp -fopenmp -o shared.out

In [None]:
OMP_NUM_THREADS=10 ./shared.out

<div class="alert alert-block alert-warning">
    We find a way to explicitly use shared variables in parallel regions
</div>
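
As a counterpart, a sketch with an explicitly shared variable. The helper `shared_demo` is an assumption for illustration and is not the code in `a08-shared.cpp`. We let a single thread do the write, which avoids a race:

```cpp
// Hypothetical helper illustrating the shared clause.
int shared_demo() {
    int x = 42;
    #pragma omp parallel shared(x)
    {
        // All threads see the same x. Exactly one of them writes it.
        #pragma omp single
        x = 7;
    }
    return x;  // the write is visible after the region
}
```

The function returns 7, because the write inside the region goes to the one variable that all threads share.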

## Special Regions
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Some parts of the code need special care in how threads execute them. Here are the options:

In [None]:
pygmentize omp_examples/a09-regions.cpp

In [None]:
g++ omp_examples/a09-regions.cpp -fopenmp -o regions.out && ./regions.out

<div class="alert alert-block alert-warning">
    We learn that thread 0 is the only one entering <code>master</code>, only one thread ever enters <code>single</code> and every thread enters <code>critical</code>, but only one at a time
</div>
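
The three constructs can be compared by counting how often each block is entered. This is a sketch with hypothetical names (`RegionCounts`, `count_region_entries`), not the code in `a09-regions.cpp`:

```cpp
// Hypothetical result type: how often each kind of block was executed.
struct RegionCounts {
    int master = 0;
    int single = 0;
    int critical = 0;
};

RegionCounts count_region_entries() {
    RegionCounts c;
    #pragma omp parallel
    {
        #pragma omp master
        c.master++;    // executed by thread 0 only

        #pragma omp single
        c.single++;    // executed by one thread, whichever arrives first

        #pragma omp critical
        c.critical++;  // executed by every thread, one at a time
    }
    return c;
}
```

The `master` and `single` counters end up at 1, while the `critical` counter equals the number of threads in the team.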

## What if parallel regions span multiple tasks
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Depending on how parallelization was done there might be a need to let threads wait for each other. The `#pragma omp barrier` is used for that.

In [None]:
pygmentize omp_examples/a10-barrier.cpp

In [None]:
g++ omp_examples/a10-barrier.cpp -fopenmp -o barrier.out && ./barrier.out

<div class="alert alert-block alert-warning">
    We learn that there is no guarantee on the execution order of different statements in a parallel block across threads
</div>

In [None]:
pygmentize omp_examples/a11-barrier.cpp

In [None]:
g++ omp_examples/a11-barrier.cpp -fopenmp -o barrier2.out && ./barrier2.out

<div class="alert alert-block alert-warning">
    We learn that barriers help synchronize code
</div>
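
A sketch of a typical two-phase pattern where a barrier is needed: every thread first writes its own slot and then reads the slot of its neighbour. The helper `stale_reads` and the serial fallbacks are assumptions for illustration and are not taken from `a11-barrier.cpp`:

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (an assumption, so the sketch also builds without -fopenmp).
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical helper: returns how many threads read a neighbour slot
// that had not been written yet.
int stale_reads() {
    std::vector<int> slot;
    int stale = 0;
    #pragma omp parallel
    {
        const int n = omp_get_num_threads();
        const int me = omp_get_thread_num();

        #pragma omp single
        slot.resize(n, -1);  // implicit barrier at the end of single

        slot[me] = me;       // phase 1: write own slot

        #pragma omp barrier  // wait until every slot has been written

        const int right = (me + 1) % n;  // phase 2: read the neighbour's slot
        if (slot[right] != right) {
            #pragma omp atomic
            stale++;
        }
    }
    return stale;
}
```

With the barrier in place the function returns 0. Removing it would allow a thread to read its neighbour's slot before that neighbour has written it.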

## Nowait
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Certain OpenMP constructs end with an implicit barrier; the `nowait` clause explicitly disables it.

In [None]:
pygmentize omp_examples/a12-loops.cpp

In [None]:
g++ omp_examples/a12-loops.cpp -fopenmp -o loops.out && ./loops.out

<div class="alert alert-block alert-warning">
    We learn that loops synchronize afterwards
</div>

In [None]:
pygmentize omp_examples/a13-nowait.cpp

In [None]:
g++ omp_examples/a13-nowait.cpp -fopenmp -o nowait.out && ./nowait.out

<div class="alert alert-block alert-warning">
    We learn that the <code>nowait</code> clause can remove synchronization
</div>
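
A sketch of a case where dropping the barrier is safe, assuming two loops that touch unrelated arrays. The helper `fill_two` is hypothetical and is not the code in `a13-nowait.cpp`:

```cpp
#include <vector>

// Hypothetical helper: fills two independent arrays inside one parallel region.
void fill_two(std::vector<double>& a, std::vector<double>& b) {
    const int na = static_cast<int>(a.size());
    const int nb = static_cast<int>(b.size());
    #pragma omp parallel
    {
        // No barrier after this loop: a thread that finishes early
        // may start on the second loop right away.
        #pragma omp for nowait
        for (int i = 0; i < na; ++i) a[i] = 1.0;

        // The second loop never reads a, so skipping the barrier is safe.
        #pragma omp for
        for (int i = 0; i < nb; ++i) b[i] = 2.0;
    }
}
```

If the second loop depended on values written by the first, the `nowait` clause would have to be removed again.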

## Reductions
<div class="alert alert-block alert-danger">
This is an important concept to understand but is not necessarily used in our example
</div>

Since reductions are such an omnipresent motif, we do not want to implement it with critical / atomic every time, so there is a keyword for it:

In [None]:
pygmentize omp_examples/a14-reduction.cpp

In [None]:
g++ omp_examples/a14-reduction.cpp -fopenmp -o reduction.out && ./reduction.out

<div class="alert alert-block alert-warning">
    We learn about the <code>reduction</code> clause
</div>
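
A minimal sketch of a sum reduction, using a hypothetical helper `sum_to` that is not taken from `a14-reduction.cpp`. Each thread accumulates into its own private copy of `sum`, and OpenMP combines the copies at the end of the loop:

```cpp
// Hypothetical helper: sums the integers 1..n with a reduction clause.
long sum_to(long n) {
    long sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 1; i <= n; ++i) {
        sum += i;  // updates the thread's private partial sum
    }
    return sum;    // partial sums have been added together here
}
```

For example, `sum_to(100)` returns 5050 for any number of threads.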

# Amdahl's Law

We try to investigate the performance of a simple example, computing $\pi$ using the Leibniz formula:

$$1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \cdots = \frac{\pi}{4}$$

Our goal is to understand what might affect performance and how this is reflected in strong and weak scaling plots

## Strong Scaling

Understanding how much faster the code gets through parallelization
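
For reference, the quantities behind strong scaling can be written down as follows. The symbols are introduced here and do not appear in the notebook: $T(N)$ is the runtime on $N$ threads for a fixed problem size and $p$ is the fraction of the runtime that can be parallelized. The first expression is the measured speedup; the second is Amdahl's law, an upper bound under the idealized assumption that the parallel part scales perfectly and adds no overhead:

$$S(N) = \frac{T(1)}{T(N)}, \qquad S(N) \le \frac{1}{(1 - p) + \frac{p}{N}}$$

For $N \to \infty$ the bound tends to $\frac{1}{1 - p}$, so the serial fraction limits the achievable speedup no matter how many threads are added.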

In [None]:
pygmentize omp_examples/b01-timing.cpp

In [None]:
g++ -fopenmp omp_examples/b01-timing.cpp -o timing.out

In [None]:
./timing.out 1 > out.txt
./timing.out 2 >> out.txt
./timing.out 3 >> out.txt
./timing.out 4 >> out.txt
./timing.out 5 >> out.txt
./timing.out 6 >> out.txt
./timing.out 7 >> out.txt
./timing.out 8 >> out.txt
./timing.out 9 >> out.txt
./timing.out 10 >> out.txt
./timing.out 11 >> out.txt
./timing.out 12 >> out.txt

In [None]:
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:0.1]; \
set xlabel '# cores'; \
set ylabel 'runtime [s]'; \
plot 'out.txt' using 2: xtic(1) title 'runtime' with histogram \
" | display

In [None]:
base=`head -1 out.txt | awk '{print $2}'`
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:14]; \
set xlabel '# cores'; \
set ylabel 'speedup (relative to 1 core)'; \
plot 'out.txt' using ($base/\$2): xtic(1) title 'speedup' with histogram, \
'out.txt' using :(\$1) title 'linear' with lines\
" | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>9.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>    
</div>

<b>TODO</b><br>
9: Explain the numbers


## Weak Scaling
Measuring the performance hit we get by increasing the problem size as well as the number of threads
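
One common way to quantify this is the weak-scaling efficiency. The notation is an assumption introduced here: $T(N)$ is the runtime on $N$ threads when the problem size grows in proportion to $N$, so that the work per thread stays constant:

$$E(N) = \frac{T(1)}{T(N)}$$

In the ideal case the runtime stays constant and $E(N) = 1$. Values below 1 measure the cost of adding threads.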

In [None]:
pygmentize omp_examples/b02-weakscaling.cpp

In [None]:
g++ -fopenmp omp_examples/b02-weakscaling.cpp -o weak.out

In [None]:
./weak.out 1 > weak.txt
./weak.out 2 >> weak.txt
./weak.out 4 >> weak.txt
./weak.out 8 >> weak.txt
./weak.out 12 >> weak.txt
./weak.out 18 >> weak.txt
./weak.out 20 >> weak.txt
./weak.out 24 >> weak.txt
./weak.out 30 >> weak.txt

In [None]:
base=`head -1 weak.txt | awk '{print $2}'`
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:1.1]; \
set xlabel '# cores'; \
set ylabel 'parallel efficiency'; \
plot 'weak.txt' using ($base/\$2): xtic(1) title 'efficiency' with histogram \
" | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>10.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>
</div>

<b>TODO</b><br>
10: Explain the numbers


## Caching

Here we see the implication of caching in a multithreaded environment

In [None]:
pygmentize omp_examples/b03-caching.cpp

In [None]:
g++ -fopenmp omp_examples/b03-caching.cpp -o caching.out

In [None]:
./caching.out 1 > caching.txt
./caching.out 2 >> caching.txt
./caching.out 4 >> caching.txt
./caching.out 8 >> caching.txt
./caching.out 12 >> caching.txt
./caching.out 13 >> caching.txt
./caching.out 20 >> caching.txt
./caching.out 24 >> caching.txt

In [None]:
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:0.1]; \
set xlabel '# cores'; \
set ylabel 'runtime [s]'; \
plot 'caching.txt' using 2: xtic(1) title 'runtime' with histogram \
" | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>11.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>
</div>

<b>TODO</b><br>
11: Explain the numbers


In [None]:
pygmentize omp_examples/b04-caching_fast.cpp

In [None]:
g++ -fopenmp omp_examples/b04-caching_fast.cpp -o caching2.out

In [None]:
./caching2.out 1 > caching2.txt
./caching2.out 2 >> caching2.txt
./caching2.out 4 >> caching2.txt
./caching2.out 8 >> caching2.txt
./caching2.out 12 >> caching2.txt
./caching2.out 13 >> caching2.txt
./caching2.out 20 >> caching2.txt
./caching2.out 24 >> caching2.txt

In [None]:
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:0.1]; \
set xlabel '# cores'; \
set ylabel 'runtime [s]'; \
plot 'caching2.txt' using 2: xtic(1) title 'runtime' with histogram \
" | display

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>12.</b> Can you explain the performance numbers that we see here? What effects are in place?<br>
</div>

<b>TODO</b><br>
12: Explain the numbers


In [None]:
make clean_examples