# Shared Memory Parallelism using OpenMP

We want to use OpenMP to run our code in parallel. If several workers (threads) share the same job, execution can be sped up.

In [1]:
pygmentize omp_examples/a01-parallel.cpp

#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

int main(int argc, char const* argv[]) {
  srand(712);
  double a = rand() % 100;
  std::cout << std::to_string(a) + "\n\n";

#pragma omp parallel
  {
    double b = rand() % 100;
    std::cout << std::to_string(b) + "\n";
  }
  std::cout << std::flush;
}


The `-fopenmp` option tells the compiler to honor the OpenMP pragmas in our program.

In [None]:
g++ omp_examples/a01-parallel.cpp -fopenmp -o parallel

Now we can run the generated executable.

In [None]:
./parallel

# Information about the region
OpenMP lets us query the number of threads in each parallel region, as well as the id of the calling thread.

In [None]:
pygmentize omp_examples/a02-infos.cpp

In [None]:
g++ omp_examples/a02-infos.cpp -fopenmp -o infos && ./infos
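The contents of `a02-infos.cpp` are not shown here, so the following is only a minimal sketch of how such a query might look. It uses the standard runtime calls `omp_get_thread_num()` and `omp_get_num_threads()`; the helper name `collect_thread_ids` and the serial fallbacks in the `#else` branch are assumptions added for illustration.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Assumed serial fallbacks so the sketch also builds without -fopenmp.
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical helper: returns one entry per thread of the team,
// where entry i holds the id reported by thread i.
std::vector<int> collect_thread_ids() {
  std::vector<int> ids;
#pragma omp parallel
  {
    // One thread sizes the vector; the implicit barrier at the end of
    // `single` makes the other threads wait until that is done.
#pragma omp single
    ids.assign(omp_get_num_threads(), -1);

    // Each thread writes only its own slot, so no locking is needed.
    ids[omp_get_thread_num()] = omp_get_thread_num();
  }
  return ids;
}
```

Called from `main`, the length of the returned vector is the team size of that region, and thread ids run from 0 to that size minus one.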

# Setting the number of threads
We can set the number of threads either per parallel region, with the `num_threads` clause, or globally with an environment variable.

In [None]:
pygmentize omp_examples/a03-threadnum.cpp

In [None]:
g++ omp_examples/a03-threadnum.cpp -fopenmp -o threadnum && ./threadnum

The `OMP_NUM_THREADS` environment variable sets the number of threads used by every region that does not specify one itself.

In [None]:
OMP_NUM_THREADS=4 ./infos

In [None]:
OMP_NUM_THREADS=4 ./threadnum
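As a rough illustration of the per-region setting (the actual `a03-threadnum.cpp` is not shown), the sketch below requests a team size with the `num_threads` clause and counts how many threads really took part. The helper name `team_size` is hypothetical.

```cpp
// Hypothetical helper: asks for `requested` threads and reports how many
// threads actually executed the region (the runtime may provide fewer).
int team_size(int requested) {
  int size = 0;
#pragma omp parallel num_threads(requested)
  {
    // Every thread adds 1; atomic keeps the concurrent updates safe.
#pragma omp atomic
    ++size;
  }
  return size;
}
```

A `num_threads` clause takes precedence over `OMP_NUM_THREADS` for the region it is attached to, so `team_size(3)` should report at most 3 regardless of the environment variable.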

# Parallel loops
Loops are the most common thing to parallelize, so OpenMP has a dedicated directive for them: `#pragma omp for` (or `#pragma omp parallel for` as a shorthand).

In [None]:
pygmentize omp_examples/a04-loops.cpp

In [None]:
g++ omp_examples/a04-loops.cpp -fopenmp -o loops

In [None]:
OMP_NUM_THREADS=4 ./loops

In [None]:
OMP_NUM_THREADS=12 ./loops
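A minimal sketch of a parallel loop, assuming independent iterations (the real `a04-loops.cpp` may look different; `doubled` is a hypothetical helper name):

```cpp
#include <vector>

// Hypothetical helper: doubles every element. The iterations do not depend
// on each other, so OpenMP can divide them among the threads of the team.
std::vector<double> doubled(const std::vector<double>& in) {
  std::vector<double> out(in.size());
  const long n = static_cast<long>(in.size());
#pragma omp parallel for
  for (long i = 0; i < n; ++i) {
    out[i] = 2.0 * in[i];  // each iteration writes a different element
  }
  return out;
}
```

The loop variable is private to each thread automatically; `in` and `out` are shared, which is safe here because no two iterations touch the same element.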

# Scheduling
The `schedule` clause controls how loop iterations are distributed among the threads.

In [None]:
pygmentize omp_examples/a05-schedule.cpp

In [None]:
g++ omp_examples/a05-schedule.cpp -fopenmp -o schedule

In [None]:
OMP_NUM_THREADS=2 ./schedule
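To make the effect of a schedule visible, one option is to record which thread ran each iteration. The sketch below does that for `schedule(static, 2)`, which deals out chunks of two iterations round-robin in thread order. The helper name `iteration_owners` and the serial fallback are assumptions; `a05-schedule.cpp` itself is not shown.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Assumed serial fallback so the sketch also builds without -fopenmp.
inline int omp_get_thread_num() { return 0; }
#endif

// Hypothetical helper: returns, for each of n iterations, the id of the
// thread that executed it.
std::vector<int> iteration_owners(int n) {
  std::vector<int> owner(n, -1);
#pragma omp parallel for schedule(static, 2)
  for (int i = 0; i < n; ++i) {
    owner[i] = omp_get_thread_num();
  }
  return owner;
}
```

With two threads, eight iterations would be expected to map to threads 0, 0, 1, 1, 0, 0, 1, 1. Replacing the clause with `schedule(dynamic, 2)` hands chunks to whichever thread is free, so the mapping can change from run to run.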

# Variable Scoping
Who owns which variables, and what are their values?

In [None]:
pygmentize omp_examples/a06-scoping.cpp

In [None]:
g++ omp_examples/a06-scoping.cpp -fopenmp -o scoping

In [None]:
./scoping

The `private` clause gives each thread its own, initially uninitialized, copy of a variable.

In [None]:
pygmentize omp_examples/a07-private.cpp

In [None]:
g++ omp_examples/a07-private.cpp -fopenmp -o private

In [None]:
OMP_NUM_THREADS=10 ./private
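A small sketch of why a private copy matters, assuming a scratch variable declared outside the loop (`squares` and `tmp` are hypothetical names, not taken from `a07-private.cpp`):

```cpp
#include <vector>

// Hypothetical helper: computes i*i for i in [0, n) via a scratch variable.
std::vector<double> squares(int n) {
  std::vector<double> out(n);
  double tmp = 0.0;  // declared outside the loop, so shared by default
#pragma omp parallel for private(tmp)
  for (int i = 0; i < n; ++i) {
    // Each thread works on its own copy of tmp. Without private(tmp),
    // threads could overwrite each other's value between these two lines.
    tmp = static_cast<double>(i);
    out[i] = tmp * tmp;
  }
  return out;
}
```

Declaring `tmp` inside the loop body would have the same effect, since variables declared inside a parallel region are private automatically.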

`shared` is the default for variables declared outside the parallel region, but it can also be stated explicitly.

In [None]:
pygmentize omp_examples/a08-shared.cpp

In [None]:
g++ omp_examples/a08-shared.cpp -fopenmp -o shared

In [None]:
OMP_NUM_THREADS=10 ./shared
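The following sketch states the sharing explicitly for a counter that all threads update. It is an illustration under assumed names (`count_even`, `evens`), not the content of `a08-shared.cpp`.

```cpp
#include <vector>

// Hypothetical helper: counts the even entries of `values`.
int count_even(const std::vector<int>& values) {
  int evens = 0;
  const long n = static_cast<long>(values.size());
  // `evens` would be shared anyway; the clause only makes that visible.
  // `values` and `n` are shared implicitly and are only read.
#pragma omp parallel for shared(evens)
  for (long i = 0; i < n; ++i) {
    if (values[i] % 2 == 0) {
      // All threads see the same `evens`, so updates must be protected.
#pragma omp atomic
      ++evens;
    }
  }
  return evens;
}
```

Because a shared variable exists only once, every write to it from inside the region needs some form of protection, which is what the `atomic` directive provides here.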

# Special Regions
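Some code inside a parallel region must not be executed by all threads at the same time, or must be executed by only one of them. OpenMP provides special regions for these cases.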

In [None]:
pygmentize omp_examples/a09-regions.cpp

In [None]:
g++ omp_examples/a09-regions.cpp -fopenmp -o regions && ./regions
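Which regions `a09-regions.cpp` demonstrates is not visible here. Assuming that `critical` and `single` are among them, a minimal sketch could look like this (the helper name `special_regions_demo` is hypothetical):

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: returns {entries logged, times the single block ran}.
std::pair<int, int> special_regions_demo() {
  std::vector<int> log;
  int single_runs = 0;
#pragma omp parallel
  {
    // critical: every thread runs this, but only one at a time,
    // which matters because push_back is not thread-safe.
#pragma omp critical
    log.push_back(1);

    // single: exactly one thread of the team runs this.
#pragma omp single
    ++single_runs;
  }
  return {static_cast<int>(log.size()), single_runs};
}
```

The first value equals the team size, since each thread logs once; the second is 1 no matter how many threads run the region.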

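# What if parallel regions span multiple tasks
When a parallel region contains several consecutive steps, a barrier makes every thread wait until all threads have finished the previous step.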

In [None]:
pygmentize omp_examples/a10-barrier.cpp

In [None]:
g++ omp_examples/a10-barrier.cpp -fopenmp -o barrier && ./barrier

In [None]:
pygmentize omp_examples/a11-barrier.cpp

In [None]:
g++ omp_examples/a11-barrier.cpp -fopenmp -o barrier2 && ./barrier2
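As a hedged sketch of such a two-step region (not the code of `a10-barrier.cpp` or `a11-barrier.cpp`), each thread below first publishes a value and then reads the value of its neighbour. The helper name `neighbour_values` and the serial fallbacks are assumptions.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Assumed serial fallbacks so the sketch also builds without -fopenmp.
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical helper: entry i is the value thread i read from its neighbour.
std::vector<int> neighbour_values() {
  std::vector<int> written, seen;
#pragma omp parallel
  {
    const int id = omp_get_thread_num();
    const int n = omp_get_num_threads();
#pragma omp single
    {
      written.assign(n, 0);
      seen.assign(n, 0);
    }  // implicit barrier at the end of single

    written[id] = 10 * (id + 1);  // step 1: publish own value

    // Without this barrier, step 2 could read a slot that its owner
    // has not written yet.
#pragma omp barrier

    seen[id] = written[(id + 1) % n];  // step 2: read the neighbour's value
  }
  return seen;
}
```

The last thread wraps around and reads the slot of thread 0, so the final entry is always 10.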

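# Nowait
The `nowait` clause does the opposite of a barrier: it removes the implicit barrier at the end of a worksharing construct such as `for`.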

In [None]:
pygmentize omp_examples/a12-loops.cpp

In [None]:
g++ omp_examples/a12-loops.cpp -fopenmp -o loops && ./loops

In [None]:
pygmentize omp_examples/a13-nowait.cpp

In [None]:
g++ omp_examples/a13-nowait.cpp -fopenmp -o nowait && ./nowait
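A minimal sketch of `nowait`, assuming two loops that do not depend on each other (`fill_two` is a hypothetical helper; the files `a12-loops.cpp` and `a13-nowait.cpp` are not shown):

```cpp
#include <vector>

// Hypothetical helper: fills two independent arrays and returns their sum.
std::vector<int> fill_two(int n) {
  std::vector<int> a(n), b(n), sum(n);
#pragma omp parallel
  {
    // nowait: a thread that finishes its share of this loop moves on
    // to the next loop without waiting for the others.
#pragma omp for nowait
    for (int i = 0; i < n; ++i) a[i] = i;

    // Safe only because this loop never reads `a`.
#pragma omp for
    for (int i = 0; i < n; ++i) b[i] = 2 * i;
  }
  // The parallel region ends with a barrier, so a and b are complete here.
  for (int i = 0; i < n; ++i) sum[i] = a[i] + b[i];
  return sum;
}
```

If the second loop read from `a`, the `nowait` would introduce a race, because another thread might not have written the needed element yet.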

# Reductions
Reductions are such a common pattern that we do not want to implement them with `critical` or `atomic` every time, so OpenMP provides a `reduction` clause:

In [None]:
pygmentize omp_examples/a14-reduction.cpp

In [None]:
g++ omp_examples/a14-reduction.cpp -fopenmp -o reduction && ./reduction
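As a small sketch of the clause (with the hypothetical helper name `sum_of_squares`, not taken from `a14-reduction.cpp`), each thread accumulates into a private copy of `sum`, and the copies are combined when the loop ends:

```cpp
// Hypothetical helper: returns 1*1 + 2*2 + ... + n*n.
long sum_of_squares(int n) {
  long sum = 0;
  // Each thread gets a private `sum` initialized to 0; at the end of the
  // loop the private sums are added to the original variable.
#pragma omp parallel for reduction(+ : sum)
  for (int i = 1; i <= n; ++i) {
    sum += static_cast<long>(i) * i;
  }
  return sum;
}
```

For example, `sum_of_squares(10)` yields 385 for any number of threads, because integer addition does not depend on the order in which the partial sums are combined.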

# Amdahl's law

We look at the performance of the simple code above (slightly changed for better output readability).
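For reference, Amdahl's law in its usual form, where $p$ is the fraction of the runtime that can be parallelized and $N$ is the number of workers (both symbols are introduced here and do not appear elsewhere in this notebook):

$$S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}$$

As a worked example, with $p = 0.9$ and $N = 12$ the speedup is $1 / (0.1 + 0.075) \approx 5.7$, and it can never exceed $10$ however many workers are added.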

### Example

Computing $\pi$ using the Leibniz formula:
$$1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \cdots = \frac{\pi}{4}$$

In [None]:
pygmentize omp_examples/b01-timing.cpp

In [None]:
g++ -fopenmp omp_examples/b01-timing.cpp -o timing
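The contents of `b01-timing.cpp` are not shown, so the following is only a sketch of how the Leibniz series might be summed in parallel. The helper name `leibniz_pi` and its parameters are assumptions; in a timing program the call could be wrapped in two `omp_get_wtime()` calls.

```cpp
// Hypothetical helper: approximates pi with the first `terms` terms of the
// Leibniz series, using up to `threads` OpenMP threads.
double leibniz_pi(long terms, int threads) {
  double sum = 0.0;
#pragma omp parallel for num_threads(threads) reduction(+ : sum)
  for (long k = 0; k < terms; ++k) {
    const double sign = (k % 2 == 0) ? 1.0 : -1.0;  // +, -, +, -, ...
    sum += sign / (2.0 * k + 1.0);                  // 1, 1/3, 1/5, ...
  }
  return 4.0 * sum;  // the series converges to pi/4
}
```

The series converges slowly: the error after $n$ terms is roughly $1/n$, which makes it a convenient way to generate enough work to measure.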

In [None]:
./timing 1 > out.txt
./timing 2 >> out.txt
./timing 3 >> out.txt
./timing 4 >> out.txt
./timing 5 >> out.txt
./timing 6 >> out.txt
./timing 7 >> out.txt
./timing 8 >> out.txt
./timing 9 >> out.txt
./timing 10 >> out.txt
./timing 11 >> out.txt
./timing 12 >> out.txt

In [None]:
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:0.1]; \
set xlabel '# cores'; \
set ylabel 'runtime [s]'; \
plot 'out.txt' using 2: xtic(1) title 'runtime' with histogram \
" | display

In [None]:
base=`head -1 out.txt | awk '{print $2}'`
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:14]; \
set xlabel '# cores'; \
set ylabel 'speedup (relative to 1 core)'; \
plot 'out.txt' using ($base/\$2): xtic(1) title 'speedup' with histogram, \
'out.txt' using :(\$1) title 'linear' with lines\
" | display

## Caching

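Here we see the effect of caching in a multithreaded environment. With `schedule(static, 1)`, neighbouring iterations go to different threads, so several threads write to adjacent elements that typically lie in the same cache line.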

In [3]:
pygmentize omp_examples/b03-caching.cpp

#include <iostream>
#include <omp.h>
#include <vector>

int main(int argc, char const* argv[]) {

  std::vector<double> input(10000000, 1);
  std::vector<double> output(10000000, 0);

  omp_set_num_threads(atoi(argv[1]));

  double tic = omp_get_wtime();

#pragma omp parallel for schedule(static, 1)
  for(int i = 0; i < input.size(); ++i) {
    output[i] = 2 * input[i];
    input[i] = 0;
  }

  double toc = omp_get_wtime();

#pragm

In [None]:
g++ -fopenmp omp_examples/b03-caching.cpp -o caching
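For comparison, a sketch of the same loop body with plain `schedule(static)`, which gives each thread one contiguous block of iterations. Threads then mostly write to different cache lines, which would be expected to reduce the cache traffic between cores. The helper name `double_and_clear` is hypothetical and not part of `b03-caching.cpp`.

```cpp
#include <vector>

// Hypothetical helper: same work as the loop in b03-caching.cpp,
// but with one contiguous chunk of iterations per thread.
std::vector<double> double_and_clear(std::vector<double>& input) {
  std::vector<double> output(input.size(), 0.0);
  const long n = static_cast<long>(input.size());
#pragma omp parallel for schedule(static)
  for (long i = 0; i < n; ++i) {
    output[i] = 2 * input[i];
    input[i] = 0;
  }
  return output;
}
```

Timing both variants with the same thread counts is one way to see how much of the runtime is due to the iteration layout rather than the arithmetic.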

In [None]:
./caching 1 > caching.txt
./caching 2 >> caching.txt
./caching 4 >> caching.txt
./caching 8 >> caching.txt
./caching 12 >> caching.txt
./caching 13 >> caching.txt
./caching 20 >> caching.txt
./caching 24 >> caching.txt

In [None]:
gnuplot -e "\
set terminal png; \
set style fill solid; \
set yrange[0:0.1]; \
set xlabel '# cores'; \
set ylabel 'runtime [s]'; \
plot 'caching.txt' using 2: xtic(1) title 'runtime' with histogram \
" | display