# Lab2.1: Parallelism Generation in Multithreading

The objective of this lab is to better understand parallelism generation in OpenMP. Parallelism generation is the ability to create workers. Some of these concepts were briefly introduced in [lab1](../Lab1/lab1.ipynb). In this lab we will describe them in more depth.

## Table of contents:

1. Parallel regions
    1. OMP_NUM_THREADS environment variable
    2. OMP_THREAD_LIMIT environment variable
    3. omp_set_num_threads() API call
    4. num_threads() clause
    5. Mixed number of threads
    6. if() clause
    7. Programs relying on the number of threads
2. Teams regions
    1. Creating teams
3. Parallel and teams
    1. Limiting threads per team
4. Nested parallelism
    1. omp_get_ancestor_thread_num() API call
    2. Some additional notes on nested parallelism

## Parallel regions
Since its origin, OpenMP has focused on parallel regions. Parallel regions are created with the `#pragma omp parallel` directive, whose purpose is to generate a collection of software threads, each entering the same region of code.

The `parallel` construct generates a `team of threads` containing one or more threads. Different factors influence how many threads are created, and the user has some control over these. One factor that we cannot control as users is the system limitation. OpenMP is portable across systems; however, each system may impose limitations on how many threads can be generated. For example, an embedded system with a lightweight operating system may not allow thread oversubscription (see lab 1 for details on this), a batch scheduler may restrict the number of threads in an application, or an OpenMP compiler implementation may decide to limit the number of threads it can handle to a maximum number.

There are two rules of thumb:
1. The number of threads used by the application is not guaranteed to be equal to the number of threads requested. The actual number of threads may be lower, and how it is chosen is implementation defined. It is not a good idea for applications to rely on the number of threads, as this will limit portability. (Use discretion here, as there are good reasons to break this rule.)
2. System oversubscription is not recommended when all the threads will be constantly busy, as this has been shown to reduce application performance. Still, some applications use thread oversubscription to have some threads that do little work.

Elements that affect the number of threads:
* The `OMP_NUM_THREADS` environment variable
* The `omp_set_num_threads()` API call
* The `num_threads()` clause of the `parallel` construct
* The `if()` clause of the `parallel` construct
* The `OMP_DYNAMIC` environment variable
* The `omp_set_dynamic()` API call
* The `thread_limit()` clause in the `teams` directive
* The `OMP_THREAD_LIMIT` and `OMP_TEAMS_THREAD_LIMIT` environment variables
* The `OMP_NESTED` and `OMP_MAX_ACTIVE_LEVELS` environment variables

In this section we will try to cover most of these. For the following examples, we assume that we are running on a system that supports more threads than we request.

### OMP_NUM_THREADS environment variable
Sets the default number of threads to be used by `parallel` regions inside the OpenMP code.

For the following hello world code

```C
#pragma omp parallel
{
    printf("Hello from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

The number of threads can be changed at run time by using the `OMP_NUM_THREADS` environment variable, without the need to recompile.

In [None]:
# First, let's compile it
!srun -N 1 -c 8 clang -fopenmp C/hello_world.c -o C/hello_world.exe

In [None]:
# Now let's run it with different values

#Running with default num threads. Implementation defined
!srun -N 1 -c 8 C/./hello_world.exe

In [None]:
#Running with different num threads modifying OMP_NUM_THREADS
!OMP_NUM_THREADS=3 srun -N 1 -c 8 C/./hello_world.exe

In [None]:
#Running with different num threads modifying OMP_NUM_THREADS
!OMP_NUM_THREADS=5 srun -N 1 -c 8 C/./hello_world.exe

Note:
```
Environment variables depend on the shell you're using. Here I am assuming bash and I am using the inline format. In bash you can also call `export` command prior to the execution of the program. Use your favorite shell, but change the code appropriately
```

You can check this code in [hello_world.c](C/hello_world.c).

### OMP_THREAD_LIMIT environment variable

This environment variable imposes an additional limit on the number of threads that are used, regardless of the number of threads requested. Using the same example as above, we can set both environment variables to demonstrate the precedence order.

In [None]:
#Building the hello world program
!srun -N 1 -c 8 clang -fopenmp C/./hello_world.c -o C/hello_world.exe

In [None]:
# Running it with OMP_THREAD_LIMIT
!OMP_THREAD_LIMIT=4 OMP_NUM_THREADS=1000 srun -N 1 -c 8 C/./hello_world.exe

Some implementations, like the one that ships with clang, may give a warning about not being able to satisfy the requested number of threads.

### omp_set_num_threads() API call
Another way of setting the number of threads before the execution of a parallel region is to use the `omp_set_num_threads()` API function call. This call takes precedence over the `OMP_NUM_THREADS` environment variable.

Take for example the following code that sets the number of threads to 4 before the parallel region.

```C
omp_set_num_threads(4);
#pragma omp parallel
{
    printf("Hi from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

In [None]:
#Building the program
!srun -N 1 -c 8 clang -fopenmp C/omp_set_num_threads.c -o C/omp_set_num_threads.exe

In [None]:
#Running the program
!srun -N 1 -c 8 C/./omp_set_num_threads.exe

In [None]:
#Running the program trying to force num threads with env variable
!OMP_NUM_THREADS=1000 srun -N 1 -c 8 C/./omp_set_num_threads.exe

The environment variable is ignored, because the API call has higher precedence.

To play with this code, go to [omp_set_num_threads.c](C/omp_set_num_threads.c).

### num_threads() clause

Yet another way to change the number of threads is with the `num_threads()` clause supported by the `parallel` construct. The clause applied directly to the parallel region has higher precedence than both the `OMP_NUM_THREADS` environment variable and the `omp_set_num_threads()` API call.

Take for example the following code

```C
#pragma omp parallel num_threads(4)
{
    printf("Hi from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

In [None]:
#Building the program
!srun -N 1 -c 8 clang -fopenmp C/num_threads.c -o C/num_threads.exe

In [None]:
# Running
!srun -N 1 -c 8 C/./num_threads.exe

In [None]:
# Running with OMP_NUM_THREADS to demonstrate precedence
!OMP_NUM_THREADS=1000 srun -N 1 -c 8 C/./num_threads.exe

You can play with this code in [num_threads.c](C/num_threads.c)

### Mixed number of threads

It is not necessary to use the same number of threads throughout your program. Furthermore, it is possible to use the different aforementioned methods within the same program and rely on their changes to the Internal Control Variables (ICVs).

The following example uses the default number of threads or `OMP_NUM_THREADS`, the `omp_set_num_threads()` API call and the `num_threads()` clauses within the same program. 

In [None]:
# Building
!srun -N 1 -c 8 clang -fopenmp C/num_threads_mixed.c -o C/num_threads_mixed.exe

In [None]:
# Running
!OMP_NUM_THREADS=1 srun -N 1 -c 8 C/./num_threads_mixed.exe

You can play with this code in [num_threads_mixed.c](C/num_threads_mixed.c)

### The if() clause

The `if()` clause is used to conditionally enable or disable different directives (e.g. `parallel`, `teams`, and `target`). In the context of the number of threads, an `if()` clause that evaluates to false creates a region that uses only a single thread, regardless of the ICV that controls the number of threads.

Take for example the following code
```C
#pragma omp parallel if(false) num_threads(1000)
{
    printf("Hi from %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
```

In this case, the parallel region will only contain a single thread.

In [None]:
# Building
!srun -N 1 -c 8 clang -fopenmp C/parallel_if.c -o C/parallel_if.exe

In [None]:
# Running
!srun -N 1 -c 8 C/./parallel_if.exe

You can play with this code changing the file [parallel_if.c](C/parallel_if.c)

Note:
```
A parallel region with a single thread and no parallel region at all may seem identical. However, the OpenMP parallel region is still outlined and sent through the runtime. This can lead to missed optimization opportunities during compilation, which can make a program with if(false) slower than a program with no OpenMP at all. Depending on the compiler implementation, an `if()` clause that evaluates to false at compile time may be optimized away. Johannes Doerfert and Hal Finkel have an excellent paper on this called "Compiler Optimizations for OpenMP". It is recommended reading on this topic.
```

### Programs relying on number of threads
Notice that it is possible to create programs that hang or fail due to an insufficient number of threads. Take for example the following code:

```C
int a_var = 0;
#pragma omp parallel shared(a_var)
{
    if (omp_get_thread_num() == 0) {
        while(a_var == 0);
    } else {
        #pragma omp atomic
        a_var++;
    }
}
```

Really important note:
```
The example above is a terrible code meant only to demonstrate that it is possible to create programs that rely on a given number of threads, in this case it must be more than 1. Please use caution when using this code, and don't blame me.
```

In [None]:
#building the example
!srun -N 1 -c 8 clang -fopenmp C/lock_program.c -o C/lock_program.exe

In [None]:
#Executing with many threads
!OMP_NUM_THREADS=4 srun -N 1 -c 8  C/./lock_program.exe

In [None]:
#Executing with timeout to avoid locking this cell
!OMP_NUM_THREADS=1 srun -N 1 -c 8 timeout 10 C/./lock_program.exe || if [ $? -eq 124 ]; then echo "Program took too long"; fi

Therefore, if your program relies on the number of threads, it is preferable to check how many threads the parallel region will create. The `omp_get_max_threads()` API call returns an upper bound on the number of threads that an upcoming parallel region without a `num_threads()` clause will use, by checking the current state of the Internal Control Variable (ICV).

In [None]:
# Building
!srun -N 1 -c 8 clang -fopenmp C/lock_safe_program.c -o C/lock_safe_program.exe

In [None]:
# Running with sufficient threads
!OMP_NUM_THREADS=4 srun -N 1 -c 8 C/./lock_safe_program.exe

In [None]:
# Running without sufficient threads
!OMP_NUM_THREADS=1 srun -N 1 -c 8 C/./lock_safe_program.exe

Notice how the same code we had before has an early termination check that avoids this issue. Again, this code is for demonstration purposes only, as there are better ways to do this.

You can play with these codes by going to [lock_program.c](C/lock_program.c) and [lock_safe_program.c](C/lock_safe_program.c).

## Teams regions

Team regions were originally introduced for expressing parallelism inside target regions. However, teams are now available for the host as well. Therefore, it is not necessary to have a target region in order to support teams.

The `teams` directive creates a league of teams. In contrast with threads, teams are loosely coupled and should not synchronize with each other. Teams within a league are not guaranteed to be executed concurrently. Therefore, if a team is waiting for another team that has not been scheduled, this may cause a deadlock.

When a thread encounters a `teams` construct, a league of _initial teams_ is created. Each initial team contains a single thread, its _initial_ thread, which becomes the _primary_ thread (formerly called _master_ thread) of any parallel region it later encounters. The teams region is executed by the initial thread of each team.

The number of teams is affected by:
1. the `OMP_NUM_TEAMS` environment variable
2. the `omp_set_num_teams()` API call
3. the `num_teams()` clause
4. the `if()` clause

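However, in contrast to threads, the requested number of teams is guaranteed to be honored exactly, as long as it does not exceed the system's limit. The upper bound currently in effect for a `teams` region without a `num_teams()` clause can be queried with `omp_get_max_teams()`.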

### Creating teams

The `teams` directive is used to create teams. Take the following example.

```C
#pragma omp teams num_teams(8)
{
    printf("Hi from team %d and I have %d threads\n", omp_get_team_num(), omp_get_num_threads());
}
```

In [None]:
# Building
!srun -N 1 -c 8 clang -fopenmp C/teams.c -o C/teams.exe

In [None]:
# Running
!srun -N 1 -c 8 C/./teams.exe

You can play with this code in [teams.c](C/teams.c)

Notice that this example is similar to threads. However, there are differences between these. First, teams are mainly used to group threads creating an additional level of scheduling locality. Through the use of `OMP_PLACES` (to be covered in another laboratory), teams can be used to better control the distribution of threads in the system. Furthermore, teams can be combined with threads as we will see later in this tutorial.

## Parallel and teams

A `parallel` region can be used inside a `teams` region as well. Teams make it possible to create sets of threads that are scheduled independently and placed in specific parts of the system.

Take for example the following code:

```C
    #pragma omp teams num_teams(6)
    {
        #pragma omp parallel num_threads(2)
        {
            printf("Hi from thread %d in team %d\n", omp_get_thread_num(), omp_get_team_num());
        }
    }
```

This code creates up to 12 threads in total. However, an implementation may decide to create and execute one team at a time, with 2 threads each, thus reducing the contention for resources that comes with thread oversubscription.

In [None]:
#building (using GCC due to poor support in clang of teams in the host)
!srun -N 1 -c 8 gcc -fopenmp C/teams_parallel.c -o C/teams_parallel.exe

In [None]:
#Running
!srun -N 1 -c 8 C/./teams_parallel.exe

Play with this code in [teams_parallel.c](C/teams_parallel.c)

### Limiting threads per team

The number of threads per team can be limited by using:
1. Environment variable `OMP_TEAMS_THREAD_LIMIT`
2. API call `omp_set_teams_thread_limit()`
3. `thread_limit()` clause

Imagine that we want to create a function containing a parallel region. However, we want to limit how many threads are executed in this parallel region at the same time, depending on the number of teams (e.g. depending on the machine configuration). This is possible by creating a code like this:

```C
void foo() {
    #pragma omp parallel num_threads(12)
    {
        printf("hello from %d\n", omp_get_thread_num());
    }
}

int main() {
    foo();
    #pragma omp teams num_teams(2) thread_limit(6)
    {
        foo();
    }
}
```

Of course, there is no particular logic behind the `thread_limit` value chosen here, but this code provides an example of how limiting the number of threads can lead to a different thread distribution.

In [None]:
#Building (Using GCC due to a limited implementation in clang)
!srun -N 1 -c 8 gcc -fopenmp C/thread_limit.c -o C/thread_limit.exe

In [None]:
#Running
!srun -N 1 -c 8 C/./thread_limit.exe

You can play with this code in [thread_limit.c](C/thread_limit.c)

## Nested parallelism

New parallel regions can be spawned within already existing regions. This is referred to as nested parallelism. Nested parallelism must be supported by the implementation, and it is possible to restrict the number of levels that are supported. This is important when oversubscription of threads leads to performance degradation.

The term level refers to the number of parallel regions that have been created at the moment of creating a parallel region. The first parallel region is level 1, increasing as we add parallel regions. 

Let us begin with a simple example:

```C
#pragma omp parallel num_threads(1)
{
    printf("Level %d\n", omp_get_level());
    #pragma omp parallel num_threads(1)
    {
        printf("Level %d\n", omp_get_level());
        #pragma omp parallel num_threads(1)
        {
            printf("Level %d\n", omp_get_level());
        }
    }
}
```

In [None]:
#build
!srun -N 1 -c 8 clang -fopenmp C/levels.c -o C/levels.exe

In [None]:
#Execute
!srun -N 1 -c 8 C/./levels.exe

Not all levels are active levels. It is possible to limit how many nested active levels are supported. An active level will spawn more threads, while an inactive level will not.

The maximum number of levels can be set with:
* Environment variable `OMP_MAX_ACTIVE_LEVELS`
* `omp_set_max_active_levels()` API call

Let's take a look at this in action:

```C
void foo(int a) {
    if (a == 0) return;
    printf("Level %d is %s\n", omp_get_level(), 
            (omp_get_level() == omp_get_active_level()) ? "Active": "Inactive");
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        foo(a-1);
    }
}
```

In [None]:
#Build
!srun -N 1 -c 8 clang -fopenmp C/active_levels.c -o C/active_levels.exe

In [None]:
#Execute
!OMP_MAX_ACTIVE_LEVELS=3 srun -N 1 -c 8 C/./active_levels.exe

The `omp single` construct ensures that only one thread of each team makes the recursive call, so each level prints its active/inactive status once. Removing it still displays the same messages, but every thread of every team repeats them.

To play with this code go to [active_levels.c](C/active_levels.c)

### omp_get_ancestor_thread_num API call

In the previous example we used other API functions that are intended for understanding nested parallelism.

`omp_get_level()` returns the number of nested parallel regions, active or not, that enclose the call. In contrast, `omp_get_active_level()` returns only the number of enclosing parallel regions (levels) that are active, as seen before.

Another important API call is `omp_get_ancestor_thread_num(level)`, which returns the thread number that the current thread, or its ancestor, had at the given level.

The aforementioned code can be rewritten to use this, spawn all threads, and only let a single thread per level output a message.

```C
void foo(int a) {
    if (a == 0) return;
    int i = 0;
    // Check if this thread and all of its ancestors are thread 0
    while (i <= omp_get_level() && omp_get_ancestor_thread_num(i) == 0) i++;
    if (i == omp_get_level() + 1)
        printf("Level %d is %s\n", omp_get_level(), 
            (omp_get_level() == omp_get_active_level()) ? "Active": "Inactive");
    #pragma omp parallel num_threads(2)
    {
        foo(a-1);
    }
}
```


In [None]:
#Build
!srun -N 1 -c 8 clang -fopenmp C/active_levels2.c -o C/active_levels2.exe

In [None]:
#Execute
!OMP_MAX_ACTIVE_LEVELS=3 srun -N 1 -c 8 C/./active_levels2.exe

You can play with this code in [active_levels2.c](C/active_levels2.c)

### Some additional notes on nested parallelism
* Previously, `OMP_NESTED` would enable or disable nested parallelism. This has been deprecated in favor of active levels.
* The maximum number of active levels supported overall by an implementation can be queried using `omp_get_supported_active_levels()`.