# Motivation

We want to use OpenMP to run our code in parallel. If several workers share the same job, execution can be sped up.
### Example:
Computing $\pi$ using the Leibniz formula:
$$1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \cdots = \frac{\pi}{4}$$

In [None]:
cat omp_examples/01-demo.cpp

In [None]:
g++ omp_examples/01-demo.cpp -o serial && ./serial

Seems good enough.


Now if we have 12 workers available, what is the easiest way to parallelize this?


In [None]:
cat omp_examples/01-demoparallel.cpp

Compiling an OpenMP program with `g++` requires the `-fopenmp` flag, which enables the pragmas and links the OpenMP runtime library:

In [None]:
g++ -fopenmp omp_examples/01-demoparallel.cpp -o parallel

In [None]:
OMP_NUM_THREADS=12 ./parallel

Check the performance impact of that one added pragma line:

In [None]:
time for i in {1..10}; do ./serial; done
time for i in {1..10}; do OMP_NUM_THREADS=12 ./parallel; done

and a bit of cleanup

In [32]:
rm serial parallel

# Amdahl's law

We look at the performance of the simple code above (slightly changed for better output readability).
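
As background for the measurements in this section, Amdahl's law bounds the speedup $S(n)$ on $n$ threads when only a fraction $p$ of the serial runtime can be parallelized. The symbols $S$, $n$ and $p$ are introduced here for illustration and do not appear in the example files.

$$S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}$$

For example, with $p = 0.9$ and $n = 12$ the bound is $S = 1 / (0.1 + 0.075) \approx 5.7$, well short of a 12-fold speedup.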

In [None]:
cat omp_examples/02-timing.cpp

In [None]:
g++ -fopenmp omp_examples/02-timing.cpp -o timing

In [None]:
./timing 1 > out.txt
./timing 2 >> out.txt
./timing 3 >> out.txt
./timing 4 >> out.txt
./timing 5 >> out.txt
./timing 6 >> out.txt
./timing 7 >> out.txt
./timing 8 >> out.txt
./timing 9 >> out.txt
./timing 10 >> out.txt
./timing 11 >> out.txt
./timing 12 >> out.txt

In [None]:
gnuplot -e 'set terminal png; set style fill solid; set yrange[0:0.1]; plot "out.txt" using 2: xtic(1) with histogram' > graph.png

![generated](./graph.png)

In [31]:
rm graph.png out.txt timing

# Race conditions

Since the threads share memory and can all interfere with each other, we need to be careful.

In [33]:
cat omp_examples/03-race.cpp

#include <iostream>
#include <omp.h>

int main(int argc, char const* argv[]) {
  int solution = -1;
#pragma omp parallel
  { solution = omp_get_thread_num(); }
  std::cout << solution << std::endl;
  return 0;
}



What happens if we write to the same memory location with more than one thread?

In [38]:
g++ -fopenmp omp_examples/03-race.cpp -o test

In [39]:
OMP_NUM_THREADS=10 ./test

9


This does not only affect variables we define ourselves. Any shared resource is affected, for example the output stream `std::cout`:

In [43]:
cat omp_examples/03-race2.cpp

#include <iostream>
#include <omp.h>

int main() {

#pragma omp parallel num_threads(10)
  {
    std::cout << "I am processor " << omp_get_thread_num() << std::endl;
  }
  return 0;
}


In [41]:
g++ -fopenmp omp_examples/03-race2.cpp -o output

In [44]:
./output

I am processor I am processor I am processor I am processor I am processor I am processor I am processor I am processor I am processor I am processor 0894251


6



7
3



and a bit of cleanup

In [40]:
rm test output

# Synchronization
Options to prevent race conditions are:
- Ensure only one thread is in the critical region at once
- Make writes atomic
## Ensure only one thread is present

In [85]:
cat omp_examples/04-critical.cpp

#include <iostream>
#include <omp.h>

int main() {

#pragma omp parallel num_threads(10)
  {
#pragma omp critical(output)
    std::cout << "I am processor " << omp_get_thread_num() << std::endl;
  }
  return 0;
}


In [None]:
g++ -fopenmp omp_examples/04-critical.cpp -o test && ./test

In [87]:
cat omp_examples/04-ordered.cpp

#include <iostream>
#include <omp.h>

int main() {
#pragma omp parallel for ordered 
  for (int i = 0; i < 10; ++i){
    int j = (100+i)*10 / 7.1;
#pragma omp ordered
    std::cout << "This is iteration " << i << std::endl;
  }
  return 0;
}


In [None]:
g++ -fopenmp omp_examples/04-ordered.cpp -o test && ./test

In [88]:
cat omp_examples/04-flush.cpp

#include <iostream>
#include <omp.h>

int main() {
  int data = 0;
  int flag = 0;
#pragma omp parallel num_threads(10)
  {
    if(flag == 0) {
      data += omp_get_thread_num() + 100;
      flag = omp_get_thread_num() + 100;
#pragma omp flush(data, flag)
    }
  }
  std::cout << data << std::endl;
  std::cout << flag << std::endl;
  return 0;
}

In [None]:
g++ -fopenmp omp_examples/04-flush.cpp -o test && ./test

In [None]:
rm -rf test

## Caching
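Here we see the effect of caching in a multithreaded environment. With `schedule(static, 1)`, consecutive iterations go to different threads, so neighbouring array elements that share a cache line are written by different threads (false sharing).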

In [74]:
cat omp_examples/05-caching.cpp

#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

int main(int argc, char const* argv[]) {
    std::vector<double> input(10000000,1);
    std::vector<double> output(10000000,0);
    
    omp_set_num_threads(atoi(argv[1]));
    double tick = omp_get_wtime();
    
#pragma omp parallel for schedule(static, 1)
  for (int i = 0; i < input.size(); ++i){
    output[i] = 2*input[i]; 
    input[i] = 0;
  }
    
    double tock = omp_get_wtime();
#pragma omp parallel
  {
    if(omp_get_thread_num() == 0)
      std::cout << omp_get_num_threads() << "\t" << tock - tick << std::endl;
  }
  return 0;
}


In [65]:
g++ -fopenmp omp_examples/05-caching.cpp -o timing

In [75]:
./timing 1 > caching.txt
./timing 2 >> caching.txt
./timing 4 >> caching.txt
./timing 8 >> caching.txt
./timing 12 >> caching.txt
./timing 13 >> caching.txt
./timing 20 >> caching.txt
./timing 24 >> caching.txt

In [78]:
gnuplot -e 'set terminal png; set style fill solid; set yrange[0:0.1]; plot "caching.txt" using 2: xtic(1) with histogram' > caching.png

![generated](./caching.png)

In [82]:
rm -rf timing caching.png caching.txt