# Bonus Questions: OpenMP in More Depth

In this notebook we explore some of the more advanced concepts surrounding OpenMP. Not all of them are critical for speeding up weather and climate codes, which is why they are collected in this bonus notebook.

## Parallel Execution Time

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B1: Assume you have a program that is 99% parallelizable. Parallelizing it adds overhead to the code (through communication). This overhead costs $0.001 \cdot \log(n)$ in runtime, where $n$ is the number of nodes. How many nodes make the program run as fast as possible?
</div>
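
One possible way to model B1, offered as a hedged sketch rather than the reference solution: normalize the serial runtime to 1, so that the runtime on $n$ nodes becomes $T(n) = 0.01 + 0.99/n + 0.001 \cdot \log(n)$. The sketch assumes the natural logarithm, since the question does not state the base. The names `modeled_run_time`, `candidate_nodes` and the search range of 1 to 5000 nodes are illustrative choices, not part of the exercise.

```python
import numpy as np

def modeled_run_time(n, parallel_fraction=0.99, overhead_coeff=0.001):
    """Normalized runtime on n nodes: serial part + parallel part + overhead.

    Assumption: the overhead uses the natural logarithm.
    """
    n = np.asarray(n, dtype=float)
    serial_part = 1.0 - parallel_fraction
    parallel_part = parallel_fraction / n
    overhead = overhead_coeff * np.log(n)
    return serial_part + parallel_part + overhead

# Scan a range of node counts (the upper bound is an arbitrary choice).
candidate_nodes = np.arange(1, 5001)
modeled_times = modeled_run_time(candidate_nodes)

# argmin returns an index into the array, so map it back to a node count.
best_nodes = int(candidate_nodes[np.argmin(modeled_times)])
print(best_nodes, float(modeled_times.min()))
```

Note that `np.argmin` returns a position in the array, not a node count, so the index has to be mapped back through the list of candidates. Swapping `np.log` for `np.log2` or `np.log10` shows how strongly the optimum depends on the assumed base.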

In [None]:
import numpy as np
import math

cores = [None] #TODO
run_time = [None] #TODO
index_min = np.argmin(run_time)
print(index_min)


<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B2: Can we mathematically prove that this is actually ideal?
    </div>
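
A sketch of one possible argument, assuming the serial runtime is normalized to 1 and the overhead uses the natural logarithm (neither is stated in the question). The runtime on $n$ nodes is then

$$T(n) = 0.01 + \frac{0.99}{n} + 0.001 \ln n .$$

Setting the first derivative to zero gives the only stationary point:

$$T'(n) = -\frac{0.99}{n^2} + \frac{0.001}{n} = 0 \quad\Longrightarrow\quad n^* = \frac{0.99}{0.001} = 990 .$$

The second derivative is positive there, so the stationary point is a minimum:

$$T''(n) = \frac{1.98}{n^3} - \frac{0.001}{n^2}, \qquad T''(990) = \frac{1.98 - 0.99}{990^3} = \frac{0.99}{990^3} > 0 .$$

Since $T'(n) < 0$ for $n < 990$ and $T'(n) > 0$ for $n > 990$, the minimum is global for $n \geq 1$. For a logarithm of base $b$ the same steps give $n^* = 990 \ln b$.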

## Amdahl's Law

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B3: Assume you have a program that is 99% parallelizable. You have three machines available: your laptop with 8 cores, all the fat nodes of Brutus (3840 cores), and the full CPU partition of Piz Daint with its 65268 cores. How much speedup does each of these machines offer you?
    </div>
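
As a hedged starting point for B3, Amdahl's law gives the speedup on $n$ cores as $S(n) = 1 / ((1 - p) + p/n)$ for a parallel fraction $p$. The sketch below evaluates it for the core counts in the question. The function name `amdahl_speedup`, the dictionary `machine_cores` and the extra single-core baseline entry are illustrative assumptions.

```python
def amdahl_speedup(num_cores, parallel_fraction=0.99):
    """Speedup over a single core according to Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_cores)

# Core counts from the question, plus one core as the baseline.
machine_cores = {
    "Single core": 1,
    "Laptop": 8,
    "Brutus": 3840,
    "Piz Daint": 65268,
}

for name, n in machine_cores.items():
    print(f"{name:12s} {n:6d} cores -> speedup {amdahl_speedup(n):6.2f}")

# With infinitely many cores only the serial part remains.
print(f"upper bound: {1.0 / (1.0 - 0.99):.1f}")
```

The upper bound of $1 / (1 - p)$ explains why going from thousands to tens of thousands of cores barely changes the result: the serial 1% dominates long before that.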

In [None]:
cores = [] # TODO
run_time = [] # TODO
speedup = [] # TODO
print(speedup)


In [None]:
import matplotlib.pyplot as plt

# creating the dataset
hardware = ["Single Node", "Laptop", "Brutus", "Daint"]
fig = plt.figure(figsize = (10, 5))
 
# creating the bar plot
plt.bar(hardware, speedup, color ='maroon',
        width = 0.4)
 
plt.xlabel("Machine")
plt.ylabel("Speedup")
plt.title("Speedup on different machines")
plt.show()

## Exploration of intrinsics

In this section we explore the difference in speed between various intrinsics.

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B4: Write a program `expensive.cpp` that loops over $n$ iterations and, in each iteration, takes the $\arccos$ of the $\cos$ of the $\arcsin$ of the $\sin$ of the absolute value of the iteration number divided by the total number of iterations.

</div>
We sum all these values up and verify correctness by printing the result.
Ideally the program is parametrized with the number of iterations as well as the number of threads used.


In [None]:
%%bash
make clean

In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC expensive.cpp -fopenmp -o expensive.x -O3

In [None]:
%%bash
srun -n 1 ./expensive.x 1 10

This call (1 thread, 10 iterations) should print a value of 4.5.
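
The expected value can be cross-checked without OpenMP. For $0 \le x < 1 < \pi/2$ both $\arcsin(\sin x) = x$ and $\arccos(\cos x) = x$ hold, so with 10 iterations the sum is $(0 + 1 + \dots + 9)/10 = 4.5$. The following Python sketch evaluates the same expression serially; `reference_sum` is a hypothetical helper name, not part of the exercise files.

```python
import math

def reference_sum(num_iterations):
    """Serial reference for the sum that expensive.cpp is meant to compute."""
    total = 0.0
    for i in range(num_iterations):
        x = abs(i / num_iterations)  # always in [0, 1)
        total += math.acos(math.cos(math.asin(math.sin(x))))
    return total

# Matches 4.5 up to floating-point rounding.
print(reference_sum(10))
```

Because of rounding in the trigonometric functions, the C++ output is best compared against such a reference with a small tolerance rather than for exact equality.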

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B5: Duplicate the file into a file called `critical.cpp`. Parallelize the for loop and protect the updates of the sum with a critical section.
How much speedup do we get?
</div>    

In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC critical.cpp -fopenmp -o critical.x -O3

In [None]:
%%bash
srun -n 1 ./critical.x 12 1000

<div class="alert alert-block alert-success">
<b>TODO.</b><br>
Can you explain the speedup?
</div>

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B6: Duplicate the file into a file called `atomic.cpp` and replace the critical section with an atomic update.
How much speedup do we get?
</div>    

In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC atomic.cpp -fopenmp -o atomic.x -O3

In [None]:
%%bash
srun -n 1 ./atomic.x 12 1000

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B7: Duplicate the file into a file called `reduction.cpp` and change the loop to use OpenMP's built-in reduction. How do the times of the reduction and the atomic version compare?
</div>

In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC reduction.cpp -fopenmp -o reduction.x -O3

In [None]:
%%bash
srun -n 1 ./reduction.x 12 1000

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B8: Lastly, we try to completely parallelize the code and have a single thread sum up the result in `fully_parallel.cpp`. How does that compare to the above times?
</div>

In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC fully_parallel.cpp -fopenmp -o fully_parallel.x -O3

In [None]:
%%bash
srun -n 1 ./fully_parallel.x 12 1000

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B9: How does this version compare to the reduction if we replace the complicated expression with a very simple one that just adds up the iteration numbers?
</div>

In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC iterations_count_atomic.cpp -fopenmp -o iterations_count_atomic.x -O3
CC iterations_count_fully_parallel.cpp -fopenmp -o iterations_count_fully_parallel.x -O3

In [None]:
%%bash
srun -n 1 ./iterations_count_atomic.x 12 1000
srun -n 1 ./iterations_count_fully_parallel.x 12 1000

## Exploration of caching
<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B10: Parallelize `expensive.cpp` in a new file `static_small_expensive.cpp` with a static loop schedule of chunk size 1. Change the loop body to store its result in an array:
<br>`output[i] = acos(cos(asin(sin(abs(input[i]))))) + output[i];`
</div>


<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B11: How does our time change if we move to a static schedule with chunk size 80? Copy `static_small_expensive.cpp` to `static_large_expensive.cpp` and change the schedule there.
</div>
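
To build intuition for why the chunk size can matter, the sketch below maps loop iterations to threads the way a static schedule with a chunk size does: chunks are handed out round-robin in thread order. It then counts how many different threads write into the same cache line of `output`. The 64-byte cache line, 8-byte doubles, an array aligned to a cache-line boundary, and the helper names are all assumptions made for illustration. The 12 threads and 1000 iterations mirror the `srun` calls in this notebook.

```python
def static_owner(i, chunk, num_threads):
    """Thread that executes iteration i under a static schedule with this chunk size."""
    return (i // chunk) % num_threads

def max_threads_per_cache_line(n, chunk, num_threads, doubles_per_line=8):
    """Largest number of distinct threads writing to one cache line of output.

    Assumes 64-byte lines holding 8 doubles and a line-aligned array.
    """
    worst = 0
    for start in range(0, n, doubles_per_line):
        stop = min(start + doubles_per_line, n)
        owners = {static_owner(i, chunk, num_threads) for i in range(start, stop)}
        worst = max(worst, len(owners))
    return worst

small_chunk = max_threads_per_cache_line(1000, chunk=1, num_threads=12)
large_chunk = max_threads_per_cache_line(1000, chunk=80, num_threads=12)
print(small_chunk, large_chunk)
```

When several threads keep writing to the same cache line, the line bounces between cores (false sharing). Whether this shows up in the measured times depends on how expensive each iteration is compared to the memory traffic, which is what the timing comparisons in this section explore.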


In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC static_small_expensive.cpp -fopenmp -o static_small_expensive.x -O3
CC static_large_expensive.cpp -fopenmp -o static_large_expensive.x -O3


In [None]:
%%bash
srun -n 1 ./static_small_expensive.x 12 1000
srun -n 1 ./static_large_expensive.x 12 1000

<div class="alert alert-block alert-success">
<b>Now it's your turn...</b><br>
B12: How do these two results change if we switch to a cheap iteration that just adds up the iteration number? Explore this in `static_small_cheap.cpp` and `static_large_cheap.cpp`.
</div>



In [None]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray

CC static_small_cheap.cpp -fopenmp -o static_small_cheap.x -O3
CC static_large_cheap.cpp -fopenmp -o static_large_cheap.x -O3


In [None]:
%%bash
srun -n 1 ./static_small_cheap.x 12 1000
srun -n 1 ./static_large_cheap.x 12 1000