# Lab 4.2: Exploiting parallelism in the GPU

The objective of this lab is to understand how to create workers and assign work to them on the GPU. This lab introduces the `teams`, `parallel`, `distribute`, and `for` constructs and their related clauses.

This tutorial is expected to run in a Linux-like environment.

## Table of contents

* The GPGPU
* OMP Target is not parallel
    * teams
    * parallel
    * Controlling number of teams and threads
* Worksharing-loop constructs
    * Distribute
    * for
    * Scheduling
* Data environment

## The GPGPU

A GPU often contains a large number of cores. These cores are grouped together into sets, each named **Streaming Multiprocessor (SM)**. Each streaming multiprocessor contains a local memory that can be shared across the cores that belong to the same SM. Each core may also contain private memory (e.g. registers, and L1 cache), but these are not exposed to OpenMP. SMs are connected through a global GPU memory that supports atomic operations. We showed an example of this in the slides, feel free to refer to this example. 

We can inspect the GPUs in a system with different tools, either specific to the vendor (e.g. `nvidia-smi` and `rocminfo`) or to the compiler. Let's take a look at the GPUs in your system from the perspective of the LLVM compiler runtime using `llvm-omp-device-info`:


In [1]:
# Command line tool
!llvm-omp-device-info

Device (0):
    This is a generic-elf-64bit device

Device (1):
    This is a generic-elf-64bit device

Device (2):
    This is a generic-elf-64bit device

Device (3):
    This is a generic-elf-64bit device

Device (4):
    CUDA Driver Version: 		11040 
    CUDA Device Number: 		0 
    Device Name: 			Quadro P1000 
    Global Memory Size: 		4236312576 bytes 
    Number of Multiprocessors: 		5 
    Concurrent Copy and Execution: 	Yes 
    Total Constant Memory: 		65536 bytes
    Max Shared Memory per Block: 	49152 bytes 
    Registers per Block: 		65536 
    Warp Size: 				32 Threads 
    Maximum Threads per Block: 		1024 
    Maximum Block Dimensions: 		1024, 1024, 64 
    Maximum Grid Dimensions: 		2147483647 x 65535 x 65535 
    Maximum Memory Pitch: 		2147483647 bytes 
    Texture Alignment: 			512 bytes 
    Clock Rate: 			1480500 kHz
    Execution Timeout: 			No 
    Integrated Device: 			No 
    Can Map Host Memory: 		Yes 
    Compute Mode: 			DEFAULT 
    Concurrent Kernels: 		Y

This command helps you understand which devices the OpenMP runtime sees in your system, as well as their characteristics. Assuming you run it in the same environment as your application, it shows the devices your application will see. To get more details on what the above numbers mean, please see the vendor-specific manuals (or just ask the instructor of this lab).

```
Note: Streaming Multiprocessor is the name NVIDIA gives to the hardware units that contain multiple cores. Since NVIDIA is currently the most common architecture, I am using this term here. However, be aware that each vendor has a different name for this. AMD uses Compute Unit (CU). Intel uses Compute Slice or Xe Core. These last two architectures also exploit SIMD instruction-level parallelism, which is not common in NVIDIA GPUs. While the architectures differ, the OpenMP specification only considers the three levels in the hierarchy described below: Teams, Threads, and SIMD.
```

## A target is not parallel

So far in this tutorial we have been using `#pragma omp target` to specify a region of code to be executed on the device. However, it is important to understand that if you only use `target`, your code runs on the GPU, but that does not mean it runs in parallel. A single GPU core is often slower and simpler than a CPU core, so just moving your application into a target region will most likely result in a considerable slowdown in execution time.

Let's take a look at the following example:


```C
    start = omp_get_wtime();
    for (int i = 0; i < 100000; i++)
        A[i]++;
    end = omp_get_wtime();
    printf("CPU Time = %f\n", end-start);

    start = omp_get_wtime();
    #pragma omp target map(tofrom:A[0:100000])
    for (int i = 0; i < 100000; i++)
        A[i]++;
    end = omp_get_wtime();
    printf("GPU Time = %f\n", end-start);
```



In [10]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/target_only.c -o C/./target_only.exe

In [11]:
# Running
!srun -N 1 -c 8 C/./target_only.exe

CPU Time = 0.001914
GPU Time = 1.474614
GPU no data move Time = 1.395071


Most likely the above code will take longer on the GPU than on the CPU. We will fix this in this lab.

## Teams, Parallel, and SIMD

There are three levels of parallelism in OpenMP: teams, threads, and SIMD lanes. The latter is not exercised by the Clang compiler for NVIDIA GPUs, so we will not provide an example in this lab.

### Teams

A team is, in principle, an aggregation of threads. The `teams` directive creates a *league of teams*, which is a set of teams. Teams map to the Streaming Multiprocessors of the GPU. If you're familiar with CUDA, a `team` is similar to a `block` of threads. Teams should never be synchronized other than at the beginning and end of the `teams` region. Teams are also not guaranteed to execute concurrently, so you should not rely on any mechanism (e.g. atomic operations in memory) to synchronize them; doing so may lead to deadlocks.

Teams begin execution with a single thread. To create more threads, use the `parallel` construct, described in the next subsection.

Let's take a look at the teams example:


```C
    #pragma omp target teams
    {
        int teamNum = omp_get_team_num();
        int threadNum = omp_get_thread_num();
        int numThreads = omp_get_num_threads();
        int numTeams = omp_get_num_teams();
        if ( teamNum == 0)
            printf("There are %d teams in total\n", numTeams);
        printf("I am thread %d in team %d and there are %d threads inside me\n", threadNum, teamNum, numThreads);
    }
```


In [12]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/teams_only.c -o C/./teams_only.exe

In [13]:
!srun -N 1 -c 8 C/./teams_only.exe

There are 128 teams in total
I am thread 0 in team 5 and there are 1 threads inside me
I am thread 0 in team 10 and there are 1 threads inside me
I am thread 0 in team 7 and there are 1 threads inside me
I am thread 0 in team 17 and there are 1 threads inside me
I am thread 0 in team 12 and there are 1 threads inside me
I am thread 0 in team 2 and there are 1 threads inside me
I am thread 0 in team 6 and there are 1 threads inside me
I am thread 0 in team 16 and there are 1 threads inside me
I am thread 0 in team 11 and there are 1 threads inside me
I am thread 0 in team 1 and there are 1 threads inside me
I am thread 0 in team 8 and there are 1 threads inside me
I am thread 0 in team 18 and there are 1 threads inside me
I am thread 0 in team 13 and there are 1 threads inside me
I am thread 0 in team 3 and there are 1 threads inside me
I am thread 0 in team 19 and there are 1 threads inside me
I am thread 0 in team 14 and there are 1 threads inside me
I am thread 0 in team 4 and there 

Play with this code in [teams_only.c](C/teams_only.c)

### Parallel

The `parallel` construct creates multiple threads within a team. It only affects the surrounding team. Since a team is mapped to a single SM, using only `parallel` will likely underutilize the GPU. Threads can synchronize (e.g. `single`, `master` or `masked`, `barrier`), and they are guaranteed to be executed concurrently, even when the system is oversubscribed (this was discussed during lab 1). Threads map to the different cores of an SM.

Take a look at the following code


```C
    #pragma omp target parallel
    {
        int teamNum = omp_get_team_num();
        int threadNum = omp_get_thread_num();
        int numThreads = omp_get_num_threads();
        int numTeams = omp_get_num_teams();
        #pragma omp master
            printf("There are %d teams in total and %d threads\n", numTeams, numThreads);
        printf("I am thread %d out of %d in team %d \n", threadNum, numThreads, teamNum);

        #pragma omp barrier

        #pragma omp single
            printf("This message is guaranteed to go at the end\n");
    }
```

In [18]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/parallel_only.c -o C/./parallel_only.exe

In [19]:
!srun -N 1 -c 8 C/./parallel_only.exe


There are 1 teams in total and 128 threads
I am thread 32 out of 128 in team 0 
I am thread 33 out of 128 in team 0 
I am thread 34 out of 128 in team 0 
I am thread 35 out of 128 in team 0 
I am thread 36 out of 128 in team 0 
I am thread 37 out of 128 in team 0 
I am thread 38 out of 128 in team 0 
I am thread 39 out of 128 in team 0 
I am thread 40 out of 128 in team 0 
I am thread 41 out of 128 in team 0 
I am thread 42 out of 128 in team 0 
I am thread 43 out of 128 in team 0 
I am thread 44 out of 128 in team 0 
I am thread 45 out of 128 in team 0 
I am thread 46 out of 128 in team 0 
I am thread 47 out of 128 in team 0 
I am thread 48 out of 128 in team 0 
I am thread 49 out of 128 in team 0 
I am thread 50 out of 128 in team 0 
I am thread 51 out of 128 in team 0 
I am thread 52 out of 128 in team 0 
I am thread 53 out of 128 in team 0 
I am thread 54 out of 128 in team 0 
I am thread 55 out of 128 in team 0 
I am thread 56 out of 128 in team 0 
I am thread 57 out of 128 in tea

You can play with this code in [parallel_only.c](C/parallel_only.c)

### Controlling number of teams and threads

It is possible to change the number of teams and threads that are created. The clauses `num_teams()` and `num_threads()` on the `teams` and `parallel` directives, respectively, allow the developer to control this.

Take a look at the following code:


```C
#pragma omp target
{
    #pragma omp teams num_teams(5)
    {
        #pragma omp parallel num_threads(32)
        {
            int teamNum = omp_get_team_num();
            int threadNum = omp_get_thread_num();
            int numThreads = omp_get_num_threads();
            int numTeams = omp_get_num_teams();
            //#pragma omp master -- there's a bug in clang
            if (threadNum == 0)
                printf("There are %d teams in total and %d threads in this team\n", numTeams, numThreads);
            printf("I am thread %d out of %d in team %d \n", threadNum, numThreads, teamNum);

            #pragma omp barrier

            //#pragma omp single -- there's a bug in clang
            if (threadNum == 0)
                printf("This message is guaranteed to go at the end of parallel but not across teams\n");
        }
    }
}
```

In [23]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/teams_parallel.c -o C/./teams_parallel.exe

In [24]:
!srun -N 1 -c 8 C/./teams_parallel.exe

There are 5 teams in total and 32 threads in this team
There are 5 teams in total and 32 threads in this team
There are 5 teams in total and 32 threads in this team
There are 5 teams in total and 32 threads in this team
There are 5 teams in total and 32 threads in this team
I am thread 0 out of 32 in team 0 
I am thread 1 out of 32 in team 0 
I am thread 2 out of 32 in team 0 
I am thread 3 out of 32 in team 0 
I am thread 4 out of 32 in team 0 
I am thread 5 out of 32 in team 0 
I am thread 6 out of 32 in team 0 
I am thread 7 out of 32 in team 0 
I am thread 8 out of 32 in team 0 
I am thread 9 out of 32 in team 0 
I am thread 10 out of 32 in team 0 
I am thread 11 out of 32 in team 0 
I am thread 12 out of 32 in team 0 
I am thread 13 out of 32 in team 0 
I am thread 14 out of 32 in team 0 
I am thread 15 out of 32 in team 0 
I am thread 16 out of 32 in team 0 
I am thread 17 out of 32 in team 0 
I am thread 18 out of 32 in team 0 
I am thread 19 out of 32 in team 0 
I am thread 20 

Play with the above code in [teams_parallel.c](C/teams_parallel.c)

An important note here is that `num_threads` corresponds to the **desired** number of threads. The compiler or runtime may create fewer threads than the number specified, but never more. Setting the *OMP_DYNAMIC* environment variable to `false` disables dynamic adjustment of the number of threads, asking the runtime to create the number of threads specified.
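One way to observe how many threads were actually created is to let every thread increment a shared counter once. The following is a minimal sketch, not part of the lab files; the helper name `count_threads` is hypothetical. If the file is compiled without OpenMP enabled, the pragmas are ignored and the function simply returns 1.

```c
#include <assert.h>

/* Hypothetical helper (not part of the lab files): requests `requested`
   threads in a single team on the device and returns how many threads
   actually executed the region. */
int count_threads(int requested) {
    assert(requested > 0);
    int count = 0;
    #pragma omp target parallel num_threads(requested) map(tofrom: count)
    {
        /* Each existing thread increments the counter exactly once. */
        #pragma omp atomic
        count++;
    }
    return count;
}
```

Calling `count_threads(32)` returns a value between 1 and 32; a value below 32 means the runtime created fewer threads than requested.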

## Worksharing loops

The idea behind worksharing loops is to distribute an iteration space across teams and threads. As you noticed above, the code within the `teams` or `parallel` region is executed by all the threads. If we have a loop inside, then this loop is executed from beginning to end by all the threads.

Imagine that you have a loop with 1000 iterations. Instead of splitting those iterations among the workers, every thread executes the whole loop, so the loop body runs `1000*numTeams*numThreads` times. That is far more work than originally planned, and potentially an error in the result.

See the following example, where 100 iterations, 5 teams, and 32 threads per team yield `100*5*32 = 16000` increments:


```C
    #pragma omp target map(tofrom:a)
    {
        #pragma omp teams num_teams(5)
        {
            #pragma omp parallel num_threads(32)
            {
                for (int i = 0; i < 100; i++)
                    #pragma omp atomic
                        a++;
            }
        }
    }
```

In [25]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/teams_parallel_no_worksharing.c -o C/./teams_parallel_no_worksharing.exe

In [26]:
!srun -N 1 -c 8 C/./teams_parallel_no_worksharing.exe

Num Threads = 32, Num Teams = 5
a was incremented 16000 times


Play with this code in [teams_parallel_no_worksharing.c](C/teams_parallel_no_worksharing.c)

However, developers often use loops to express the same work over different data. When the work performed in each iteration is independent from other iterations, it is possible to safely execute this code in parallel. These are the cases where GPUs shine: Single Programs performing the same work on Multiple Data (SPMD).
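To make the notion of independent iterations concrete, here is a small sketch in plain C, with no OpenMP involved. Both function names are hypothetical and chosen only for illustration. The first loop can be split among workers safely; the second cannot be split naively, because each iteration needs the result of the previous one.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: iteration i reads and writes only x[i],
   so the iterations can run in any order or at the same time. */
void scale_in_place(double *x, size_t n, double factor) {
    assert(x != NULL);
    for (size_t i = 0; i < n; i++)
        x[i] *= factor;
}

/* Dependent iterations: iteration i reads x[i - 1], which the previous
   iteration has just written, so the order of execution matters. */
void prefix_sum_in_place(double *x, size_t n) {
    assert(x != NULL);
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```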

### Distribute directive

The `distribute` construct partitions an iteration space across the different teams. If, for example, we have 3 teams and an iteration space of 9 iterations, a possible distribution strategy would be:

`|T0|T0|T0|T1|T1|T1|T2|T2|T2|`

If the user does not specify a scheduling strategy for the distribution of work (see below), then the compiler is free to choose a distribution strategy. So even with 3 teams, the following distribution is also valid:

`|T0|T0|T0|T0|T0|T0|T0|T0|T0`

Let's take a look at an example


```C
#pragma omp target teams distribute num_teams(3)
for(int i = 0; i < 10; i++) {
    printf("iteration %d executed by team %d\n", i, omp_get_team_num());
}
```

In [32]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/teams_distribute.c -o C/./teams_distribute.exe

In [31]:
!srun -N 1 -c 8 C/./teams_distribute.exe

iteration 0 executed by team 0
iteration 4 executed by team 1
iteration 7 executed by team 2
iteration 1 executed by team 0
iteration 5 executed by team 1
iteration 8 executed by team 2
iteration 2 executed by team 0
iteration 6 executed by team 1
iteration 9 executed by team 2
iteration 3 executed by team 0


Play with the above code in [teams_distribute.c](C/teams_distribute.c)

### For directive
The `for` construct works very similarly to `distribute`, but it distributes the iteration space over the threads of a team. Like before, if scheduling is not specified, then it is up to the compiler or runtime to determine it.

The same code as above, but using `parallel for`:


```C
#pragma omp target parallel for num_threads(3)
for(int i = 0; i < 10; i++) {
    printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
}
```

In [33]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/parallel_for.c -o C/./parallel_for.exe

In [34]:
!srun -N 1 -c 8 C/./parallel_for.exe

iteration 0 executed by thread 0
iteration 1 executed by thread 0
iteration 2 executed by thread 0
iteration 3 executed by thread 0
iteration 4 executed by thread 0
iteration 5 executed by thread 0
iteration 6 executed by thread 0
iteration 7 executed by thread 0
iteration 8 executed by thread 0
iteration 9 executed by thread 0


### Combined construct

Often, we want the iteration space to be distributed across both teams and threads. In this case it is possible to specify a combined construct that spawns both teams and threads, and distributes the iteration space accordingly.

The commonly used combined construct `#pragma omp target teams distribute parallel for` achieves this. Scheduling, when not specified, is still up to the compiler or runtime. Take, for example, an iteration space with 9 iterations, 3 teams, and 3 threads per team. Here the number following the capital T is the team number, and the one following the lowercase t is the thread number.

`|T0t0|T0t1|T0t2|T1t0|T1t1|T1t2|T2t0|T2t1|T2t2|`

Let's run this code again:


```C
#pragma omp target teams distribute parallel for num_teams(3) thread_limit(3)
for(int i = 0; i < 10; i++) {
    printf("iteration %d executed by team %d thread %d\n", i, omp_get_team_num(), omp_get_thread_num());
}
```


In [35]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/teams_distribute_parallel_for.c -o C/./teams_distribute_parallel_for.exe

In [36]:
!srun -N 1 -c 8 C/./teams_distribute_parallel_for.exe

iteration 3 executed by team 1 thread 0
iteration 0 executed by team 0 thread 0
iteration 6 executed by team 2 thread 0
iteration 4 executed by team 1 thread 0
iteration 1 executed by team 0 thread 0
iteration 7 executed by team 2 thread 0
iteration 5 executed by team 1 thread 0
iteration 2 executed by team 0 thread 0
iteration 8 executed by team 2 thread 0
iteration 9 executed by team 0 thread 0


Play with the above code in [teams_distribute_parallel_for.c](C/teams_distribute_parallel_for.c)
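The combined construct also fixes the over-counting seen in the earlier example without worksharing. The sketch below is not one of the lab files, and the helper name `count_iterations` is hypothetical. Because the iterations are now shared among teams and threads, each one runs exactly once, so the counter ends at `n` no matter how many teams and threads the runtime creates.

```c
#include <assert.h>

/* Hypothetical helper (not part of the lab files): increments a shared
   counter once per loop iteration and returns the final value. */
int count_iterations(int n) {
    assert(n >= 0);
    int a = 0;
    #pragma omp target teams distribute parallel for \
            num_teams(5) num_threads(32) map(tofrom: a)
    for (int i = 0; i < n; i++) {
        /* Several workers may update `a` at once, so keep the update atomic. */
        #pragma omp atomic
        a++;
    }
    return a;
}
```

With `n = 100`, the function returns 100 rather than the 16000 reported by the version without worksharing.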

### Scheduling decisions

Scheduling a worksharing loop means giving the compiler a strategy to distribute the iteration space across workers. There are different types of scheduling policies (e.g. `dynamic`, `static`, `auto`, ...), most of which pre-date GPU programming and were meant for CPUs. On GPUs, static scheduling is often the desired choice. In the case of teams, `dist_schedule` only supports `static`. A more complete description of scheduling strategies can be found in a different lab. We will focus only on static scheduling.

There are two scheduling clauses: `schedule()` and `dist_schedule()`, for the `for` and `distribute` worksharing loops respectively. Static scheduling applies to both, and it tells the compiler the size of the group of iterations (chunk) to be assigned to each worker. A clause `schedule(static,2)` creates chunks of two iterations and assigns them to the threads in round-robin order. For example, for an iteration space of 9 iterations and three threads, `schedule(static,2)` results in the following distribution of the iteration space.

`|t0|t0|t1|t1|t2|t2|t0|t0|t1|`
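The round-robin assignment of chunks can be written as a small formula. The following sketch is plain C and only models the mapping described above; the function names `static_owner` and `print_static_distribution` are hypothetical and are not part of OpenMP or of the lab files.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: worker (thread or team) that owns iteration i when
   chunks of `chunk` iterations are handed out round-robin to `workers`. */
int static_owner(int i, int chunk, int workers) {
    assert(i >= 0 && chunk > 0 && workers > 0);
    return (i / chunk) % workers;
}

/* Prints the distribution using the |t0|t0|t1|... notation of this lab. */
void print_static_distribution(int iterations, int chunk, int workers) {
    for (int i = 0; i < iterations; i++)
        printf("|t%d", static_owner(i, chunk, workers));
    printf("|\n");
}
```

Calling `print_static_distribution(9, 2, 3)` reproduces the distribution shown above. The same formula models `dist_schedule(static,3)` with three teams, where iteration 9 wraps around to team 0.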

Let's take a look at a more complex example:


```C
#pragma omp target teams distribute parallel for \
        num_teams(3) dist_schedule(static,3) \
        num_threads(100) schedule(static,2)
for(int i = 0; i < 10; i++) {
    printf("iteration %d executed by team %d thread %d\n", i, omp_get_team_num(), omp_get_thread_num());
}
```

```
Note: We use a large value in `num_threads` to keep dynamic thread creation from resulting in a single thread only. The explanation of why this is the case is outside the scope of this tutorial. Allow us to hand-wave this one.
```


In [44]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/teams_distribute_parallel_for_sched.c -o C/./teams_distribute_parallel_for_sched.exe

In [45]:
!srun -N 1 -c 8 C/./teams_distribute_parallel_for_sched.exe

iteration 3 executed by team 1 thread 0
iteration 5 executed by team 1 thread 1
iteration 0 executed by team 0 thread 0
iteration 2 executed by team 0 thread 1
iteration 6 executed by team 2 thread 0
iteration 8 executed by team 2 thread 1
iteration 4 executed by team 1 thread 0
iteration 1 executed by team 0 thread 0
iteration 7 executed by team 2 thread 0
iteration 9 executed by team 0 thread 0


You can play with this code in [teams_distribute_parallel_for_sched.c](C/teams_distribute_parallel_for_sched.c).

## Exercise 3

Change the modified version of the example used in the introduction to use all the concepts learned here. What's the new execution time?

Go to [exercise3.c](Exercises/exercise3.c) to provide your solution. Scale the size of the problem by changing N, and try to plot the results.

In [72]:
# Building Solution
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 -lm Exercises/exercise3.c -o Exercises/exercise3.exe

# Running solution
!srun -N 1 -c 8 Exercises/./exercise3.exe

CPU Time = 0.007429
GPU Time = 3.009654
GPU no data move Time = 2.926691


The following is a possible solution to this exercise: [exercise3.c](Solutions/exercise3.c)



In [69]:
# Building Solution
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 -lm Solutions/exercise3.c -o Solutions/exercise3.exe

# Running solution
!srun -N 1 -c 8 Solutions/./exercise3.exe

CPU Time = 7.382452
GPU Time = 5.611982
GPU no data move Time = 5.283598
