# OpenMP* Device Parallelism (Fortran)

#### Sections
- [Learning Objectives](#Learning-Objectives)
- [Device Parallelism](#Device-Parallelism)
- [GPU Architecture](#GPU-Architecture)
- ["Normal" OpenMP constructs](#"Normal"-OpenMP-constructs)
- [League of Teams](#League-of-Teams)
- [Worksharing with Teams](#Worksharing-with-Teams)
- _Code:_ [Lab Exercise: OpenMP Device Parallelism](#Lab-Exercise:-OpenMP-Device-Parallelism)


## Learning Objectives

* Explain basic GPU Architecture 
* Be able to use OpenMP offload worksharing constructs to fully utilize the GPU

### Prerequisites
A basic understanding of OpenMP constructs is assumed for this module. You should also have already gone through the [Introduction to OpenMP Offload module](../intro/intro_f.ipynb) and the [Managing Device Data module](../datatransfer/datatransfer_f.ipynb), which cover the basics of using the Jupyter notebooks with the Intel® DevCloud and introduce the OpenMP `target` and `target data` constructs.

***
## Device Parallelism
As we've discussed in the previous modules, the OpenMP `target` construct transfers the control flow to the target device. However, the transfer of control is sequential and synchronous.

In OpenMP, offload and parallelism are separate, so programmers need to explicitly create parallel regions on the target device. In theory, constructs that create parallelism on offload devices can be combined with any OpenMP construct, but in practice, only a subset of OpenMP constructs is useful for the target device.

## GPU Architecture
Before diving into OpenMP parallelism constructs for target devices, let's first examine Intel® GPU architecture.

<img src="Assets/GPU_Arch.png">

Intel® GPUs contain one or more slices. Each slice is composed of several subslices. Each subslice contains multiple execution units (EUs), likely 8 or more, and has its own thread dispatcher unit, instruction cache, shared local memory, and other resources. EUs are compute processors that drive the SIMD ALUs.

The following table displays how the OpenMP concepts of League, Team, Thread, and SIMD are mapped to GPU hardware.

|OpenMP | GPU Hardware |
|:----:|:----|
|SIMD | SIMD Lane (Channel)|
|Thread | SIMD Thread mapped to an EU |
|Team | Group of threads mapped to a Subslice |
|League | Multiple Teams mapped to a GPU |


## "Normal" OpenMP constructs
OpenMP GPU offload supports all "normal" OpenMP constructs, such as `parallel`, `do`, `barrier`, `sections`, and `task`. However, not every construct is useful on the GPU. When using these constructs, the full threading model is supported only within a subslice, because there is no synchronization among subslices and no coherence or memory fence among the subslices' L1 caches.

Let's examine the following example.
```fortran
subroutine saxpy(a, x, y, sz)
    ! Declarations Omitted
    !$omp target map(to:x(1:sz)) map(tofrom:y(1:sz))
    !$omp parallel do simd
    do i=1,sz
        y(i) = a * x(i) + y(i)
    end do
    !$omp end target
end subroutine
```
Here, we use the `target` directive to offload execution to the GPU. We then use `parallel` to create a team of threads, `do` to distribute loop iterations among those threads, and `simd` to request vectorization of the iterations with SIMD instructions. However, because of the restrictions described above, only one GPU subslice is used here, so the GPU is significantly underutilized. In some cases, the compiler may deduce `teams distribute` from `parallel do` and still use the entire GPU.

## League of Teams
To take advantage of multiple subslices, use the `teams` directive to create multiple **master** threads for execution. When combined with the `parallel` directive, these master threads become a league of thread teams. Because there is no synchronization across teams of threads, the teams can be assigned to different GPU subslices.

<img src="Assets/teams.JPG">

When using the `teams` construct, the number of teams created is implementation defined, although you may optionally specify an upper limit with the **num_teams** clause. The **thread_limit** clause of the `teams` directive can optionally be used to limit the number of threads in each team.

Example: `!$omp teams num_teams(8) thread_limit(16)`
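
To check how many teams the implementation actually creates, the clauses above can be combined with the OpenMP runtime routines. The following is a minimal sketch, not part of the lab files: the program name `teams_query` and the variables `nteams` and `on_host` are hypothetical names chosen for illustration. Only team 0 writes the results so that the teams do not race on the mapped variables.

```fortran
program teams_query
    use omp_lib
    implicit none
    integer :: nteams      ! hypothetical name: number of teams actually created
    logical :: on_host     ! hypothetical name: .true. if the region ran on the host

    nteams = 0
    on_host = .true.

    ! Request at most 8 teams with at most 16 threads per team
    !$omp target teams num_teams(8) thread_limit(16) map(from: nteams, on_host)
    ! Every team's master thread executes this region; only team 0 records results
    if (omp_get_team_num() == 0) then
        nteams = omp_get_num_teams()
        on_host = omp_is_initial_device()
    end if
    !$omp end target teams

    print *, "Ran on host:   ", on_host
    print *, "Teams created: ", nteams
end program teams_query
```

Because **num_teams** is only an upper limit, the value printed may be smaller than 8, depending on the implementation and the device.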


## Worksharing with Teams
After a league of teams is created by `teams`, use the `distribute` construct to distribute chunks of iterations of a loop across the different teams in the league. This is analogous to what the `do` construct does for `parallel` regions. The `distribute` directive is associated with a loop nest inside a teams region.

For nested loops, the **collapse** clause can be used to specify how many loops are associated with the `distribute` directive. You may specify a **collapse** clause with a parameter value greater than 1 to collapse the associated loops into one large loop.
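
As a rough illustration of **collapse**, the sketch below scales a two-dimensional array. The subroutine `scale2d` and its arguments are hypothetical and do not appear in the lab files. With `collapse(2)`, both loops are merged into a single iteration space of `m*n` iterations, which is then divided among the teams.

```fortran
subroutine scale2d(a, m, n, s)
    implicit none
    integer, intent(in) :: m, n
    real, intent(in) :: s
    real, intent(inout) :: a(m, n)
    integer :: i, j

    ! Both loops are associated with distribute: m*n iterations are shared among teams
    !$omp target teams distribute collapse(2) map(tofrom: a)
    do j = 1, n
        do i = 1, m
            a(i, j) = s * a(i, j)
        end do
    end do
    !$omp end target teams distribute
end subroutine scale2d
```

Without the clause, only the outer `j` loop would be distributed, which can leave teams idle when `n` is small.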

You can also use the **dist_schedule** clause on the `distribute` construct to manually specify the size of the chunks that are distributed to the master threads of each team. For example, `!$omp distribute dist_schedule(static, 512)` would create chunks of 512 iterations.
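
The sketch below places the `dist_schedule` example in context. The subroutine name `saxpy_chunked` is hypothetical, and the declarations are an assumption about how the saxpy arguments might be typed. Each team's master thread receives chunks of 512 iterations, assigned to the teams in round-robin order.

```fortran
subroutine saxpy_chunked(a, x, y, sz)
    implicit none
    integer, intent(in) :: sz
    real, intent(in) :: a, x(sz)
    real, intent(inout) :: y(sz)
    integer :: i

    !$omp target teams map(to: x) map(tofrom: y)
    ! Hand out the iterations to the teams in chunks of 512
    !$omp distribute dist_schedule(static, 512)
    do i = 1, sz
        y(i) = a * x(i) + y(i)
    end do
    !$omp end distribute
    !$omp end target teams
end subroutine saxpy_chunked
```

Note that no `parallel` region is created here, so each chunk is executed by a single thread per team. This version is meant only to show the clause, not to perform well.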

### Example with Combined Constructs
For convenience, OpenMP supports combined constructs for OpenMP offload. The code below shows how a single line can encompass all of the directives that we've discussed.
```fortran
subroutine saxpy(a, x, y, sz)
    ! Declarations Omitted
    !$omp target teams distribute parallel do simd map(to:x(1:sz)) map(tofrom:y(1:sz))
    do i=1,sz
        y(i) = a * x(i) + y(i)
    end do
    !$omp end target teams distribute parallel do simd
end subroutine
```
When these constructs are used without additional clauses, the number of teams created, the number of threads created per team, and how loop iterations are distributed are all implementation defined.
The following diagram breaks down the effect of each directive in the previous example. Here we assume that there are a total of 128 loop iterations and that the implementation creates 4 teams with 4 threads per team.

1. The `omp target` directive offloads execution to the device.
2. The `omp teams` directive creates multiple master threads, 4 thread teams in this diagram.
3. The `omp distribute` directive distributes the loop iterations among those 4 thread teams, 32 iterations for each team shown.
4. The `omp parallel` directive creates a team of threads for each master thread (team), 4 threads for each team shown.
5. The `omp do` directive distributes each team's 32 iterations among its 4 threads, 8 iterations per thread.
6. The `omp simd` directive specifies that multiple iterations of the loop can be executed using SIMD instructions.

<img src="Assets/distribute.JPG">
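
The steps above can also be written with separate directives, which makes it easier to see where each one takes effect. This is a sketch under the assumption that the saxpy arguments are declared as shown. The name `saxpy_expanded` is hypothetical, and the behavior is intended to match the combined form.

```fortran
subroutine saxpy_expanded(a, x, y, sz)
    implicit none
    integer, intent(in) :: sz
    real, intent(in) :: a, x(sz)
    real, intent(inout) :: y(sz)
    integer :: i

    ! Step 1: offload execution to the device
    !$omp target map(to: x(1:sz)) map(tofrom: y(1:sz))
    ! Step 2: create a league of teams, each with one master thread
    !$omp teams
    ! Steps 3-6: split iterations among teams, then among each team's threads, then vectorize
    !$omp distribute parallel do simd
    do i = 1, sz
        y(i) = a * x(i) + y(i)
    end do
    !$omp end distribute parallel do simd
    !$omp end teams
    !$omp end target
end subroutine saxpy_expanded
```

The `teams` construct has to be nested directly inside `target` with no other statements in between, which is one reason the combined form is usually more convenient.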

## Lab Exercise: OpenMP Device Parallelism
In this exercise, we will practice using the offload worksharing constructs on the saxpy function that we've already worked with in the previous modules.

In [None]:
#Optional, see the contents of main.f90
%pycat main.f90

In the cell below, add an OpenMP directive before the outer loop to perform the following tasks.
1. Offload execution to the GPU using the clauses `map(tofrom:y) map(to:x) map(from:is_cpu, num_teams)`.
2. Create NUM_BLOCKS **master** threads using the clause `num_teams(NUM_BLOCKS)`.
3. Distribute the outer loop iterations to the various master threads.

Be sure to also include the appropriate end directive.

In [None]:
%%writefile lab/saxpy_func_parallel.f90
! Add combined directive here to
!    1. Offload execution to the GPU, use the clause map(tofrom:y)
!       map(to: x) map(from:is_cpu) map(from:num_teams)
!    2. Create multiple master threads, use the clause num_teams(NUM_BLOCKS)
!    3. Distribute loop iterations to the various master threads.

do ib=1,ARRAY_SIZE, NUM_BLOCKS
        if (ib==1) then
                !Test if target is the CPU host or the GPU device
                is_cpu=omp_is_initial_device()
                !Query number of teams created
                num_teams=omp_get_num_teams()
        end if

        do i=ib, ib+NUM_BLOCKS-1
                y(i) = a*x(i) + y(i)
        end do
end do

!TODO add the appropriate end directive here


Next, compile the code using *compile_f.sh*. If you would like to see the contents of compile_f.sh, execute the following cell.

In [None]:
%pycat compile_f.sh

In [None]:
#Execute this cell to compile the code
! chmod 755 compile_f.sh; ./compile_f.sh;

Once the code has been successfully compiled, run the code by executing the _run.sh_ script.

In [None]:
#Optionally examine the run script by executing this cell.
%pycat run.sh

Execute the following cell to run the program. Make sure you see the "Passed!" message.

In [None]:
! chmod 755 q; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the samples, please restart the kernel and compile the samples again._

Execute the following cell to see the solution.

In [None]:
%pycat saxpy_func_parallel_solution.f90

# Summary
In this module, you have learned the following:
* A high-level overview of GPU architecture and how OpenMP constructs map to it.
* How to create multiple master threads that can be assigned to GPU subslices using the `teams` construct.
* How to distribute loop iterations to those master threads using the `distribute` construct.
* How to combine the `teams` and `distribute` constructs with other OpenMP constructs for better performance.

***

@Intel Corporation | [\*Trademark](https://www.intel.com/content/www/us/en/legal/trademarks.html)