# Introduction to OpenMP

![streaming memory](./images/ram.jpg)


This notebook contains some simple OpenMP examples to illustrate the OpenMP interface.

You can find a more full-featured tutorial from [Lawrence Livermore National Laboratory](https://computing.llnl.gov/tutorials/openMP/).

This class is not designed to go into the interface in detail: you should feel comfortable looking something up in [the reference](https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf).

For the purposes of performance in this class, there are a few basic things you want to know about the directives in OpenMP:

- When you are not giving specific directions to each worker thread, the OpenMP runtime will make some *scheduling* decisions about how the work should be divided.  You should be aware of what the default scheduling is for different directives, and how you can control it.

- Whenever work is not embarrassingly parallel, threads will interact, which means that the *synchronization* behavior of directives becomes important:
  - What is the policy of the directive for handling oversubscribed resources (such as concurrent writes to the same location)?
  - Does the construct have any *implicit synchronization*, such as a barrier at the start or end of the scope of the directive?

In [1]:
module use $CSE6230_DIR/modulefiles

In [4]:
module load cse6230



In [5]:
cd $CSE6230_DIR/notes/openmp

## Hello, world (OpenMP has environment variables)

In [6]:
pygmentize openmp-ex00.c

/* Hello threads: adapted from Edmond Chow's OpenMP notes */
#include <stdio.h>

int main(void)
{
  printf ("You're all individuals!\n");
  /* create a team of threads for the following structured block */
#pragma omp parallel
  {
    printf("Yes, we're all individuals!\n");
  }
  /* team of threads join master thread after the structured block */

  return 0;
}


In [9]:
make -B openmp-ex00 # -B means "force recompilation" in this case
./openmp-ex00

icc -I../../utils/tictoc -g -O -Wall -qopenmp -o openmp-ex00 openmp-ex00.c 
You're all individuals!
Yes, we're all individuals!


Hmm, when I ran this, only one worker was present.  Let's see if I have the environment variable OMP_NUM_THREADS set:

In [10]:
echo $OMP_NUM_THREADS

1


Let's try this again:

In [11]:
OMP_NUM_THREADS=4 ./openmp-ex00

You're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!


OpenMP is often used as the path of least resistance to parallelizing a program.  Because of that, a useful programming style for OpenMP is for the program to remain valid *if OpenMP is not available*.  This is not possible with every style of parallelism in OpenMP (for example, you can use OpenMP to set up a server/client type of parallelism that does not serialize), but it is a useful guideline, particularly for debugging: if your program isn't doing the right thing without OpenMP, then the problem is elsewhere.

In [15]:
make -B openmp-ex00 OMPFLAGS=""
OMP_NUM_THREADS=4 ./openmp-ex00

icc -I../../utils/tictoc -g -O -Wall  -o openmp-ex00 openmp-ex00.c 
  #pragma omp parallel
          ^

You're all individuals!
Yes, we're all individuals!


When a program is compiled without support for the OpenMP runtime, the OpenMP environment variables do nothing.

## OpenMP has a C API as well

So we can choose a level of parallelism a priori, from within the program:

In [16]:
pygmentize openmp-ex01.c

#include <stdio.h>
/* OpenMP includes some library calls, for which we need the header file */
#include <omp.h>

int main(void)
{
  int max_threads = 5;

  printf ("You're all individuals!\n");

  /* one library call sets the number of threads in the next parallel region */
  omp_set_num_threads(max_threads);
#pragma omp parallel
  {
    printf("Yes, we're all individuals!\n");
  }

  return 0;
}


In [17]:
make -B openmp-ex01
./openmp-ex01

icc -I../../utils/tictoc -g -O -Wall -qopenmp -o openmp-ex01 openmp-ex01.c 
You're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!


Who wins in a fight between the environment variables and the C interfaces?

In [18]:
OMP_NUM_THREADS=100 ./openmp-ex01

You're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!


Note that example 01 violates the "valid without OpenMP" approach, because it calls an OpenMP library function unconditionally:

In [19]:
make -B openmp-ex01 OMPFLAGS=""

icc -I../../utils/tictoc -g -O -Wall  -o openmp-ex01 openmp-ex01.c 
  #pragma omp parallel
          ^

/tmp/iccI85Iiu.o: In function `main':
/nv/coc-ice/tisaac3/srv/rep/cse6230/notes/openmp/openmp-ex01.c:12: undefined reference to `omp_set_num_threads'
make: *** [openmp-ex01] Error 1



## OpenMP directives have Clauses

Clauses are extra terms controlling the behavior of `#pragma omp` directives.

These can also be used to control the parallelism in the code.  Who wins in a fight between environment variables, the C interface, and clauses?

In [21]:
pygmentize openmp-ex02.c

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int max_threads = 5;

  printf ("You're all individuals!\n");

  omp_set_num_threads(max_threads);
  /* now we have two competing values for the number of threads in this
   * region: who wins? */
#pragma omp parallel num_threads(7)
  {
    printf("Yes, we're all individuals!\n");
  }

  return 0;
}


In [22]:
make -B openmp-ex02
OMP_NUM_THREADS=1000 ./openmp-ex02

icc -I../../utils/tictoc -g -O -Wall -qopenmp -o openmp-ex02 openmp-ex02.c 
You're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!
Yes, we're all individuals!


It looks like clauses reign supreme.

## Fork-Join

In [23]:
pygmentize openmp-ex03.c

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int num_threads, my_thread;

  /* OpenMP implements "fork-join", where one master thread runs outside of
   * the parallel regions, forks to create them, and joins them at the end of
   * the region.  Let's see if we can confirm this. */

  /* Count the number of threads and my thread number before ... */
  num_threads = omp_get_num_threads();
  my_thread   = omp_get_thread_num();
  printf ("\"You're all individuals!\" said %d of %d.\n", my_thread, num_threads);

#pragma omp parallel
  {
    /* during ... */

In [24]:
make -B openmp-ex03
OMP_NUM_THREADS=4 ./openmp-ex03

icc -I../../utils/tictoc -g -O -Wall -qopenmp -o openmp-ex03 openmp-ex03.c 
"You're all individuals!" said 0 of 1.
"Yes, we're all individuals!" replied 0 of 4.
"Yes, we're all individuals!" replied 1 of 4.
"Yes, we're all individuals!" replied 3 of 4.
"Yes, we're all individuals!" replied 2 of 4.
"I'm not," said 0 of 1.


## Variable Scoping

Variables declared outside a `#pragma omp parallel` region are shared by default;
variables declared inside such a region are private to each thread.

In [33]:
pygmentize openmp-ex04.c

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void)
{
  int num_threads, my_thread;

  num_threads = omp_get_num_threads();
  my_thread   = omp_get_thread_num();
  printf ("\"You're all individuals!\" said %d of %d.\n", my_thread, num_threads);

#pragma omp parallel
  {
    num_threads = omp_get_num_threads();
    my_thread   = omp_get_thread_num();
    /* what if the parallel region takes a little longer? */
    sleep(1);
    printf("\"Yes, we're all individuals!\"

In [39]:
make -B openmp-ex04 CFLAGS="-O0" # On pace-ice higher optimization affects this behavior
                                 # It seems to correct for multiple threads writing to `my_thread`,
                                 # Probably by eliminating that variable entirely.  I wouldn't
                                 # want to rely on that...
OMP_NUM_THREADS=4 ./openmp-ex04

icc -I../../utils/tictoc -O0 -qopenmp -o openmp-ex04 openmp-ex04.c 
"You're all individuals!" said 0 of 1.
"Yes, we're all individuals!" replied 3 of 4, sleepily.
"Yes, we're all individuals!" replied 3 of 4, sleepily.
"Yes, we're all individuals!" replied 3 of 4, sleepily.
"Yes, we're all individuals!" replied 3 of 4, sleepily.
"I'm not," said 0 of 1.


Run examples 5-9 to see different ways of achieving the desired behavior when
"each thread writes to the same symbol": in 5, the symbol is private in scope;
in 6-9, a privately scoped variable is created that shadows the symbol.  The examples explore the scope and initialization of private variables that shadow public ones.

In [48]:
pygmentize openmp-ex05.c
make -B openmp-ex05 CFLAGS="-O0"
OMP_NUM_THREADS=4 ./openmp-ex05
pygmentize openmp-ex06.c
make -B openmp-ex06 CFLAGS="-O0"
OMP_NUM_THREADS=4 ./openmp-ex06
pygmentize openmp-ex07.c
make -B openmp-ex07 CFLAGS="-O0"
OMP_NUM_THREADS=4 ./openmp-ex07
pygmentize openmp-ex08.c
make -B openmp-ex08 CFLAGS="-O0"
OMP_NUM_THREADS=4 ./openmp-ex08

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void)
{
  int orig_num_threads, orig_my_thread;

  orig_num_threads = omp_get_num_threads();
  orig_my_thread   = omp_get_thread_num();
  printf ("\"You're all individuals!\" said %d of %d.\n", orig_my_thread, orig_num_threads);

#pragma omp parallel
  {
    /* The last example showed that variables are shared by default in
     * parallel regions: having multiple threads write to the same variable
     * creates a race condition.
     *
     *

## Loop scheduling

Example 10 shows how loop scheduling can be done with nothing but `#pragma omp parallel`.

In [52]:
pygmentize openmp-ex10.c

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int N = 10;

  /* We could do loop parallelization with just what we've seen so far */
#pragma omp parallel
  {
    int my_thread   = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    int istart      = (N * my_thread) / num_threads;
    int iend        = (N * (my_thread+1)) / num_threads;
    int i;

    for (i = istart; i < iend; i++) {
      printf("iteration %d, thread %d\n", i, my_thread);
    }
  }

  return 0;
}


In [54]:
make -B openmp-ex10
OMP_NUM_THREADS=4 ./openmp-ex10

icc -I../../utils/tictoc -g -O -Wall -qopenmp -o openmp-ex10 openmp-ex10.c 
iteration 0, thread 0
iteration 1, thread 0
iteration 2, thread 1
iteration 3, thread 1
iteration 4, thread 1
iteration 7, thread 3
iteration 5, thread 2
iteration 6, thread 2
iteration 8, thread 3
iteration 9, thread 3


`#pragma omp for` exists for that, but then OpenMP is in charge of scheduling.  You can, however, tell it how to schedule.

In [56]:
pygmentize openmp-ex11.c

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int N = 10;

#pragma omp parallel
  {
    int my_thread = omp_get_thread_num();
    int i;

    /* But openmp has a directive "for" for for loops */
#pragma omp for
    for (i = 0; i < N; i++) {
      printf("iteration %d, thread %d\n", i, my_thread);
    }
  }

  return 0;
}


In [57]:
make -B openmp-ex11
OMP_NUM_THREADS=4 ./openmp-ex11

icc -I../../utils/tictoc -g -O -Wall -qopenmp -o openmp-ex11 openmp-ex11.c 
iteration 0, thread 0
iteration 1, thread 0
iteration 2, thread 0
iteration 8, thread 3
iteration 9, thread 3
iteration 3, thread 1
iteration 4, thread 1
iteration 5, thread 1
iteration 6, thread 2
iteration 7, thread 2


In [60]:
OMP_DISPLAY_ENV=true OMP_NUM_THREADS=4 ./openmp-ex11


OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201307'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


iteration 0, thread 0
iteration 1, thread 0
iteration 6, thread 2
iteration 7, thread 2
iteration 3, thread 1
iteration 4, thread 1
iteration 5, thread 1
iteration 8, thread 3
iteration 9, thread 3
iteration 2, thread 0


In [67]:
pygmentize openmp-ex14.c

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int N = 10;
  int i;

  /* Thus far the first thread has always received the start of the loop, but
   * we can control this with the schedule() clause: schedule(runtime) means
   * we can control the schedule with the OMP_SCHEDULE environment variable.
   * */
#pragma omp parallel for schedule(runtime)
  for (i = 0; i < N; i++) {
    int my_thread = omp_get_thread_num();

    printf("iteration %d, thread %d\n", i, my_thread);
  }

  return 0;
}


In [68]:
make -B openmp-ex14
OMP_NUM_THREADS=4 OMP_SCHEDULE="static,1" ./openmp-ex14

icc -I../../utils/tictoc -g -O -Wall -qopenmp -o openmp-ex14 openmp-ex14.c 
iteration 0, thread 0
iteration 4, thread 0
iteration 8, thread 0
iteration 2, thread 2
iteration 6, thread 2
iteration 3, thread 3
iteration 7, thread 3
iteration 1, thread 1
iteration 5, thread 1
iteration 9, thread 1
