# Lab2: Parallelism Generation in Multithreading

The objective of this lab is to better understand parallelism generation in OpenMP, that is, the ability to create workers. Some of these concepts were briefly introduced in [lab1](../Lab1/lab1.ipynb). In this lab we describe them in more depth.

## Table of contents

1. Parallel regions
    1. OMP_NUM_THREADS environment variable
    2. OMP_THREAD_LIMIT environment variable
    3. omp_set_num_threads() API call
    4. num_threads() clause
    5. Mixed number of threads
    6. if() clause
    7. Dynamic parallelism
    8. Programs relying on the number of threads
2. Teams regions
    1. thread_limit() clause
3. Parallel and teams
4. Nested parallelism

## Parallel regions
Since its origin, OpenMP has focused on parallel regions. A parallel region is created with the `#pragma omp parallel` directive, whose purpose is to generate a collection of software threads, each entering the same region of code.

The `parallel` construct generates a `team of threads` containing one or more threads. Several factors influence how many threads are created, and the user has some control over them. One factor that users cannot control is the system limitation. OpenMP is portable across systems, but each system may impose limits on how many threads can be generated. For example, an embedded system with a lightweight operating system may not allow thread oversubscription (see lab 1 for details on this), a batch scheduler may restrict the number of threads in an application, or an OpenMP implementation may cap the number of threads it can handle.

There are two rules of thumb:
1. The number of threads used by the application is not guaranteed to equal the number of threads requested. The actual number may be lower, and in some cases it is implementation defined. It is not a good idea for applications to rely on a specific number of threads, as this limits portability. (Use discretion here, as there are good reasons to break this rule.)
2. System oversubscription is not recommended when all the threads will be constantly busy, as it has been shown to reduce application performance. Still, some applications oversubscribe on purpose so that some threads do little work.

Elements that affect the number of threads:
* The `OMP_NUM_THREADS` environment variable
* The `omp_set_num_threads()` API call
* The `num_threads()` clause of the `parallel` construct
* The `if()` clause of the `parallel` construct
* The `OMP_DYNAMIC` environment variable
* The `omp_set_dynamic()` API call
* The `thread_limit()` clause in the `teams` directive
* The `OMP_THREAD_LIMIT` and `OMP_TEAMS_THREAD_LIMIT` environment variables
* The `OMP_NESTED` and `OMP_MAX_ACTIVE_LEVELS` environment variables

In this section we will try to cover most of these. For the following examples, we assume that we are running on a system that supports more threads than the examples request.
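Before changing any of the settings listed above, it can help to see what the runtime currently reports. The following is a minimal sketch, not part of the lab files: the function name `report_thread_settings` is hypothetical, and the serial stand-ins under `#else` are an assumption added only so the sketch also builds when OpenMP is disabled.

```C
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins, assumed here only so the sketch also builds without -fopenmp. */
static int omp_get_num_procs(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_limit(void) { return 1; }
static int omp_get_dynamic(void) { return 0; }
#endif

/* Prints the values the runtime consults when the next parallel region forks. */
void report_thread_settings(void) {
    printf("processors available:          %d\n", omp_get_num_procs());
    printf("upper bound for next region:   %d\n", omp_get_max_threads());
    printf("thread limit:                  %d\n", omp_get_thread_limit());
    printf("dynamic adjustment (0 = off):  %d\n", omp_get_dynamic());
}
```

Calling `report_thread_settings()` at the start of `main` under different environment settings shows how each setting moves these values. An upper bound larger than the number of processors is a hint that the next region may oversubscribe the system.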

### OMP_NUM_THREADS environment variable
Sets the default number of threads to be used by `parallel` regions inside the OpenMP code.

For the following hello world code:

```C
#pragma omp parallel
{
    printf("Hi from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

The number of threads can be changed from one run to the next by setting the `OMP_NUM_THREADS` environment variable, without the need to recompile.

In [1]:
# First, let's compile it
!clang -fopenmp C/hello_world.c -o C/hello_world.exe

In [2]:
# Now let's run it with different values

#Running with default num threads. Implementation defined
!C/./hello_world.exe

Hi from 3 out of 6
Hi from 4 out of 6
Hi from 2 out of 6
Hi from 1 out of 6
Hi from 5 out of 6
Hi from 0 out of 6


In [3]:
#Running with different num threads modifying OMP_NUM_THREADS
!OMP_NUM_THREADS=3 C/./hello_world.exe

Hi from 0 out of 3
Hi from 1 out of 3
Hi from 2 out of 3


In [4]:
#Running with different num threads modifying OMP_NUM_THREADS
!OMP_NUM_THREADS=5 C/./hello_world.exe

Hi from 0 out of 5
Hi from 3 out of 5
Hi from 1 out of 5
Hi from 2 out of 5
Hi from 4 out of 5


Note:
```
Environment variables depend on the shell you're using. Here I am assuming bash and using the inline format. In bash you can also use the `export` command before running the program. Use your favorite shell, but change the commands accordingly.
```

You can check this code in [hello_world.c](C/hello_world.c).

### OMP_THREAD_LIMIT environment variable

This environment variable imposes an upper limit on the number of threads that are used, regardless of the number of threads requested. Using the same example as above, we can set both environment variables to demonstrate the precedence order.

In [5]:
#Building the hello world program
!clang -fopenmp C/./hello_world.c -o C/hello_world.exe

In [6]:
# Running it with OMP_THREAD_LIMIT
!OMP_THREAD_LIMIT=4 OMP_NUM_THREADS=1000 C/./hello_world.exe

OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
Hi from 0 out of 4
Hi from 3 out of 4
Hi from 2 out of 4
Hi from 1 out of 4


Some implementations, like the one shipped with clang, may print a hint or a warning when they cannot satisfy the requested number of threads.
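The same cap can be observed from inside a program. The sketch below is not one of the lab files: the function name `delivered_team_size` is hypothetical, and the stand-ins under `#else` are an assumption so it also builds without OpenMP. It requests a team size with `num_threads()` and reports how many threads the runtime delivered next to the value of `omp_get_thread_limit()`.

```C
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins, assumed here only so the sketch also builds without -fopenmp. */
static int omp_get_thread_limit(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Requests `requested` threads and returns how many the team really has. */
int delivered_team_size(int requested) {
    int delivered = 0;
    #pragma omp parallel num_threads(requested)
    {
        /* One thread records the team size; the others wait at the implicit barrier. */
        #pragma omp single
        delivered = omp_get_num_threads();
    }
    printf("requested %d, thread limit %d, delivered %d\n",
           requested, omp_get_thread_limit(), delivered);
    return delivered;
}
```

Running a program that calls `delivered_team_size(8)` with `OMP_THREAD_LIMIT=4` should report a delivered size no larger than the limit.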

### omp_set_num_threads() API call
Another way of setting the number of threads before the execution of a parallel region is the `omp_set_num_threads()` API function call. This call takes precedence over the `OMP_NUM_THREADS` environment variable.

Take for example the following code, which sets the number of threads to 4 before the parallel region.

```C
omp_set_num_threads(4);
#pragma omp parallel
{
    printf("Hi from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

In [29]:
#Building the program
!clang -fopenmp C/omp_set_num_threads.c -o C/omp_set_num_threads.exe

In [30]:
#Running the program
!C/./omp_set_num_threads.exe

Setting the number of threads to 4
Hi from 0 out of 4
Hi from 3 out of 4
Hi from 1 out of 4
Hi from 2 out of 4


In [31]:
#Running the program trying to force num threads with env variable
!OMP_NUM_THREADS=1000 C/./omp_set_num_threads.exe

Setting the number of threads to 4
Hi from 0 out of 4
Hi from 3 out of 4
Hi from 2 out of 4
Hi from 1 out of 4


The environment variable is ignored, because the API call has higher precedence.

To play with this code go to [omp_set_num_threads.c](C/omp_set_num_threads.c).
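One way to see the precedence is to read the upper bound reported by `omp_get_max_threads()` before and after the API call. This is a small sketch under stated assumptions: `request_threads` is a hypothetical helper, not part of `omp_set_num_threads.c`, and the stand-ins under `#else` exist only so it builds without OpenMP.

```C
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins, assumed here only so the sketch also builds without -fopenmp. */
static void omp_set_num_threads(int n) { (void)n; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Asks for n threads and returns the upper bound the runtime reports afterwards. */
int request_threads(int n) {
    int before = omp_get_max_threads();  /* default or OMP_NUM_THREADS */
    omp_set_num_threads(n);              /* overrides the value above */
    int after = omp_get_max_threads();
    printf("before: %d, after omp_set_num_threads(%d): %d\n", before, n, after);
    return after;
}
```

With `OMP_NUM_THREADS=1000`, the `before` value reflects the environment variable, while the `after` value reflects the API call (possibly reduced by a thread limit).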

### num_threads() clause

Yet another way to change the number of threads is the `num_threads()` clause supported by the `parallel` construct. A clause applied directly to the parallel region takes precedence over both the `OMP_NUM_THREADS` environment variable and the `omp_set_num_threads()` API call.

Take for example the following code:

```C
#pragma omp parallel num_threads(4)
{
    printf("Hi from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

In [32]:
#Building the program
!clang -fopenmp C/num_threads.c -o C/num_threads.exe

In [33]:
# Running
!C/./num_threads.exe

Hi from 1 out of 4
Hi from 0 out of 4
Hi from 2 out of 4
Hi from 3 out of 4


In [35]:
# Running with OMP_NUM_THREADS to demonstrate precedence
!OMP_NUM_THREADS=1000 C/./num_threads.exe

Hi from 2 out of 4
Hi from 1 out of 4
Hi from 0 out of 4
Hi from 3 out of 4


You can play with this code in [num_threads.c](C/num_threads.c).

### Mixed number of threads

It is not necessary to use the same number of threads throughout your program. Furthermore, it is possible to use the different aforementioned methods within the same program and rely on their changes to the Internal Control Variables (ICVs).

The following example uses the default number of threads or `OMP_NUM_THREADS`, the `omp_set_num_threads()` API call, and the `num_threads()` clause within the same program.

In [38]:
# Building
!clang -fopenmp C/num_threads_mixed.c -o C/num_threads_mixed.exe

In [39]:
# Running
!OMP_NUM_THREADS=1 C/./num_threads_mixed.exe

Using default or OMP_NUM_THREADS from 0 out of 1
Using omp_set_num_threads() from 0 out of 2
Using omp_set_num_threads() from 1 out of 2
using num_threads() clause from 0 out of 4
using num_threads() clause from 1 out of 4
using num_threads() clause from 2 out of 4
using num_threads() clause from 3 out of 4


You can play with this code in [num_threads_mixed.c](C/num_threads_mixed.c).
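A program of this kind might look like the following sketch. It is an assumption about the general shape, not the contents of `num_threads_mixed.c`, and the name `mixed_team_sizes` is hypothetical. It adds a fourth region to show that the `num_threads()` clause affects only its own region, while the value set by `omp_set_num_threads()` persists.

```C
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins, assumed here only so the sketch also builds without -fopenmp. */
static void omp_set_num_threads(int n) { (void)n; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Stores the team size of four consecutive parallel regions in sizes[0..3]. */
void mixed_team_sizes(int sizes[4]) {
    /* Region 1: default number of threads or OMP_NUM_THREADS. */
    #pragma omp parallel
    {
        #pragma omp single
        sizes[0] = omp_get_num_threads();
    }

    /* Region 2: the API call changes the ICV for the regions that follow. */
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single
        sizes[1] = omp_get_num_threads();
    }

    /* Region 3: the clause overrides the ICV for this region only. */
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        sizes[2] = omp_get_num_threads();
    }

    /* Region 4: no clause, so the value set by the API call applies again. */
    #pragma omp parallel
    {
        #pragma omp single
        sizes[3] = omp_get_num_threads();
    }
}
```

On a system with enough threads, the last three entries would typically be 2, 4, and 2, although the runtime is allowed to deliver fewer.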

### The if() clause

The `if()` clause is used to enable or disable different directives (e.g. `parallel`, `teams`, and `target`). In the context of the number of threads, when the expression in the `if()` clause evaluates to false, the parallel region is executed by a single thread, regardless of the `num_threads()` clause or the ICV that controls the number of threads.

Take for example the following code:
```C
#pragma omp parallel if(false) num_threads(1000)
{
    printf("Hi from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

In this case, the parallel region will only contain a single thread.

In [42]:
# Building
!clang -fopenmp C/parallel_if.c -o C/parallel_if.exe

In [43]:
# Running
!C/./parallel_if.exe

Hi from 0 out of 1


You can play with this code by changing the file [parallel_if.c](C/parallel_if.c).

Note:
```
A parallel region with a single thread and a program with no OpenMP parallel region may seem identical. However, the OpenMP parallel region is still outlined and sent through the runtime. This can lead to missed optimization opportunities during compilation, which can make a program with if(false) slower than a program with no OpenMP at all. Depending on the compiler implementation, an if() clause that evaluates to false at compile time may be optimized away. Johannes Doerfert and Hal Finkel have an excellent paper on this called "Compiler Optimizations for OpenMP". A recommended reading on this topic.
```
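In practice the `if()` expression is usually evaluated at run time, for example to keep small problems serial. The sketch below illustrates that idea under stated assumptions: the function `team_size_for` and its `threshold` parameter are hypothetical and do not appear in `parallel_if.c`.

```C
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-in, assumed here only so the sketch also builds without -fopenmp. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Returns the team size used for a problem of n items.
   Problems smaller than `threshold` run on a single thread. */
int team_size_for(long n, long threshold) {
    int size = 0;
    #pragma omp parallel if(n >= threshold)
    {
        #pragma omp single
        size = omp_get_num_threads();
    }
    return size;
}
```

With these definitions, `team_size_for(10, 1000)` returns 1 because the condition is false, while `team_size_for(5000, 1000)` returns whatever team size the runtime delivers.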

### Programs relying on number of threads
Notice that it is possible to create programs that hang or fail due to an insufficient number of threads. Take for example the following code:

```C
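int a_var = 0;
#pragma omp parallel shared(a_var)
{
    if (omp_get_thread_num() == 0) {
        // Thread 0 spins until another thread increments a_var.
        // With a single thread this loop never ends.
        int seen = 0;
        while (seen == 0) {
            // Atomic read so the compiler cannot hoist the load out of the loop
            #pragma omp atomic read
            seen = a_var;
        }
    } else {
        #pragma omp atomic
        a_var++;
    }
}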
```

Really important note:
```
The example above is terrible code meant only to demonstrate that it is possible to create programs that rely on a given number of threads; in this case there must be more than one. Please use caution when using this code, and don't blame me.
```

In [17]:
#building the example
!clang -fopenmp C/lock_program.c -o C/lock_program.exe

In [24]:
#Executing with many threads
!OMP_NUM_THREADS=4 C/./lock_program.exe

I'm starting
running with 4 threads
I'm waiting for any other thread
I've finished


In [25]:
#Executing with timeout to avoid locking this cell
!OMP_NUM_THREADS=1 timeout 10 C/./lock_program.exe || if [ $? -eq 124 ]; then echo "Program took too long"; fi

I'm starting
running with 1 threads
I'm waiting for any other thread
Program took too long


Therefore, if your program relies on the number of threads, it is preferable to check how many threads the parallel region will create. The `omp_get_max_threads()` API call returns an upper bound on the number of threads that an upcoming parallel region without a `num_threads()` clause would use, based on the current state of the Internal Control Variables (ICVs).

In [26]:
# Building
!clang -fopenmp C/lock_safe_program.c -o C/lock_safe_program.exe

In [28]:
# Running with sufficient threads
!OMP_NUM_THREADS=4 C/./lock_safe_program.exe

I'm starting
running with 4 threads
I'm waiting for any other thread
I've finished


In [27]:
# Running without sufficient threads
!OMP_NUM_THREADS=1 C/./lock_safe_program.exe

Early termination. insufficient number of threads


Notice how the same code we had before has an early termination check that avoids this issue. Again, this code is for demonstration purposes only, as there are better ways to do this.

You can play with these codes by going to [lock_program.c](C/lock_program.c) and [lock_safe_program.c](C/lock_safe_program.c).
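An early termination check of this kind could be sketched as follows. This is an assumption about one possible approach, not the contents of `lock_safe_program.c`, and the name `guarded_handshake` is hypothetical. Because `omp_get_max_threads()` is only an upper bound, the sketch also checks the delivered team size inside the region.

```C
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins, assumed here only so the sketch also builds without -fopenmp. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* Returns 0 if the handshake ran, -1 if fewer than two threads were available. */
int guarded_handshake(void) {
    if (omp_get_max_threads() < 2)
        return -1;                    /* early termination, before forking */

    int flag = 0;
    int ran = 1;
    #pragma omp parallel shared(flag, ran)
    {
        if (omp_get_num_threads() < 2) {
            ran = 0;                  /* the runtime delivered a single thread */
        } else if (omp_get_thread_num() == 0) {
            int seen = 0;
            while (seen == 0) {       /* wait for any other thread */
                #pragma omp atomic read
                seen = flag;
            }
        } else {
            #pragma omp atomic
            flag++;
        }
    }
    return ran ? 0 : -1;
}
```

A caller can print an error message and exit when the function returns -1, instead of hanging as the unguarded version does with `OMP_NUM_THREADS=1`.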