# Lab 4.1: The offloading model

The objective of this lab is to introduce the concept of offloading to accelerator devices. It covers target regions, accelerator devices, and memory management for the device.

This tutorial is expected to run in a Unix-like environment.

## Table of contents:

* Offloading model
* Target directive
* Memory
    * Device memory
    * Implicit memory mapping
    * Structured memory management
    * Unstructured memory management


# The offloading model

OpenMP supports accelerator devices. These are special devices whose compute capabilities differ from those of traditional CPU architectures. The most common accelerator device is the General-Purpose Graphics Processing Unit (GPGPU, or simply GPU).

NVIDIA devices can be discovered using `nvidia-smi` and AMD devices using `rocminfo`. Likewise, compilers such as LLVM provide commands that list the available accelerator devices, for example `llvm-omp-device-info`.

In [1]:
# If you have an NVIDIA device use
!nvidia-smi

Tue Jun 21 20:11:51 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro P1000        Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   44C    P0    N/A /  N/A |      0MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# If you have an AMD device use
!rocminfo

In [2]:
# If you're using llvm use
!llvm-omp-device-info

/bin/bash: llvm-omp-device-info: command not found


# Target directive
The target directive allows a programmer to use accelerator devices, as long as the device is supported by the implementation. For the remainder of this tutorial we will use LLVM with the Clang front end, and we will assume that the device is an NVIDIA device. Other compilers that support OpenMP offloading can be found in the [compilers section of the OpenMP website](https://www.openmp.org/resources/openmp-compilers-tools/).

## Our first target program
One of the simplest target programs that one can write is the following:

```C
#include <stdio.h>
#include <omp.h>

int main() {
    int a[1] = {0};
    #pragma omp target
    {
        a[0] = omp_is_initial_device();
    }
    printf("Code executed in the %s\n",a[0] ? "Host":"Device");
    return 0;
}
```

In [3]:
# Building this code
!clang -fopenmp -fopenmp-targets=nvptx64 C/simple_target.c -o C/simple_target.exe

clang-12: error: No library 'libomptarget-nvptx-cuda_110-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.


In [None]:
# Running the code on the host
!OMP_TARGET_OFFLOAD=disabled C/./simple_target.exe

Open and play with this code in [simple_target.c](C/simple_target.c).

In this example we check where the code was executed by using the API function `omp_is_initial_device()`. This function returns true if the code is executing on the device that initiated the target region, i.e. the host. The `omp target` directive does not expose any parallelism per se. Instead, it tells the compiler that the enclosed region is meant to execute on the device. The compiler generates a host version and a device version of the same region. Notice that target regions are not necessarily executed on the accelerator device (the GPU in our case): it is still possible to force them to run on the host, as the `OMP_TARGET_OFFLOAD=disabled` setting above does.

```
Note: An array with only 1 position is used instead of a scalar to keep the code simple. As we will learn later on, arrays have a default mapping of tofrom, while scalars are firstprivate. Using an array therefore avoids an explicit data mapping.
```

## Controlling target regions
Some important clauses to mention for the target construct are:
* `if(condition)`: Allows code to be conditionally executed on the device. When the condition is false, the region runs on the host.
* `device(device_num)`: When multiple devices are available, selects the device that executes the region.
* `nowait`: Enables asynchronous execution of code. More on this in a future lab.

Take for example the following code:

```C
#include <stdio.h>
#include <omp.h>

int main() {
    int num_devices = omp_get_num_devices();
    // A variable-length array cannot have an initializer in C,
    // so it is filled explicitly.
    int device_num[num_devices + 1];
    int i;
    for (i = 0; i <= num_devices; i++)
        device_num[i] = -1;

    // Iterate over each available device and execute code
    // If i == num_devices, execute on the host.
    for (i = 0; i <= num_devices; i++) {
        #pragma omp target device(i) if(i != num_devices)
        {
            // Store the result in slot i (not slot 0) so each region is recorded.
            device_num[i] = omp_get_device_num();
        }
    }

    // Print which device was used for each region.
    for (i = 0; i <= num_devices; i++)
        printf("Code executed for i = %d in device %d\n", i, device_num[i]);
    return 0;
}
```

In [5]:
# Building the example
!clang -fopenmp -fopenmp-targets=nvptx64 C/simple_multiple_devices.c -o C/simple_multiple_devices.exe

clang-12: error: No library 'libomptarget-nvptx-cuda_110-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.


In [6]:
# Running the example
!C/./simple_multiple_devices.exe

Open and play with this code in [simple_multiple_devices](C/simple_multiple_devices.c).

## Data mapping
Devices often feature their own memory and a corresponding address space that is independent from the host. While OpenMP supports `unified address space` and `unified shared memory`, learning to manage data between host and device is essential for application performance. OpenMP uses the term **"mapping"** for the process of moving variables from the host to the device and from the device to the host, as well as allocating and deallocating them on the device.

The following example shows that a variable has a different address inside a target region than on the host, as long as the code is not executed with unified shared memory support.

```C
#include <stdio.h>
#include <omp.h>

int main() {
    int a[1];
    // Print pointers with %p; casting to unsigned long is not portable.
    printf("Address of a in host = %p\n", (void *)a);
    #pragma omp target
    {
        printf("Address of a in device = %p\n", (void *)a);
    }
    return 0;
}
```

In [8]:
# Building this code
!clang -fopenmp -fopenmp-targets=nvptx64 C/different_addresses.c -o C/different_addresses.exe

clang-12: error: No library 'libomptarget-nvptx-cuda_110-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.


In [7]:
# Running the code
!C/./different_addresses.exe

/bin/bash: C/./different_addresses.exe: No such file or directory


If you want to play with this code, open [different_addresses.c](C/different_addresses.c).

A mapping can take one of the following forms:
* `to`: From host to device at the beginning of the region.
* `from`: From device to host at the end of the region.
* `tofrom`: Both from host to device at the beginning of the region, and from device to host at the end of the region.
* `alloc`: Only allocate memory on the device, without copying any values.
* `delete`: Used with unstructured data mapping (see below). Deletes a variable on the device.

## Implicit data mapping

So far, none of the example codes we have shown use the `map()` clause. Whenever a variable is referenced inside a target region and does not appear in a `map()` clause, the variable is said to be implicitly mapped.

Different variable types have different implicit data mappings. While the complete list of rules can be found in the [specification document](https://www.openmp.org/specifications/), here is a set of rules of thumb that developers should follow. 

### Scalar variables
Variables that use scalar data types such as `int`, `double`, `float`, etc. are treated as `firstprivate` by default. Therefore, when these variables are implicitly mapped and modified on the device, the new values are not copied back to the host. The following example shows this behavior.

```C
int a = 10;
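#pragma omp target // implicit firstprivate(a)
{
    printf("a = %d\n", a); // prints the host value, 10
    a = 20;                // modifies only the device copy
}
printf("a = %d\n", a);     // the host value is unchanged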
```

In [9]:
# building
!clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_scalar.c -o C/implicit_map_scalar.exe

clang-12: error: No library 'libomptarget-nvptx-cuda_110-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.


In [10]:
# Running
!C/./implicit_map_scalar.exe

/bin/bash: C/./implicit_map_scalar.exe: No such file or directory


Use the file [implicit_map_scalar.c](C/implicit_map_scalar.c) to modify and play with the above code.

### Non-scalar types (classes and structs)
User-defined types in C and C++ are mapped as `map(tofrom:...)` by default. This means that in standalone target regions, these variables are copied to the device at the beginning of the region, and from the device back to the host at the end of the region.

```
Note: Be aware that mapping copies contiguous memory regions. Therefore, no deep copy is performed by default. In order to support deep copy the user must either create a "declare mapper" or specify the mapping of the individual members. These cases are ignored for now as they are beyond the purpose of this section.
```
The following is an example of the default mapping of a struct:

```C
#include <stdio.h>
#include <stdlib.h>

typedef struct myS {
    int a;
    double *b;
} myS_t;

int main() {
    myS_t myStruct = {1, NULL};

    myStruct.b = (double *)malloc(sizeof(double));
    *myStruct.b = 11.1; // myStruct is not a pointer, so dereference the member b

    printf("Host {%d, %p}\n", myStruct.a, (void *)myStruct.b);

    #pragma omp target // implicit map(tofrom:myStruct). Not implicit map(tofrom:myStruct.b[0:1])
    {
        myStruct.a = 10;
        // printf("%f\n", *myStruct.b); error since b is not deep copied
        printf("Device {%d, %p}\n", myStruct.a, (void *)myStruct.b);
    }

    printf("Host {%d, %p}\n", myStruct.a, (void *)myStruct.b);
    free(myStruct.b);
    return 0;
}
```

In [11]:
# building
!clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_struct.c -o C/implicit_map_struct.exe

clang-12: error: No library 'libomptarget-nvptx-cuda_110-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.


In [12]:
# Running
!C/./implicit_map_struct.exe

/bin/bash: C/./implicit_map_struct.exe: No such file or directory


Play with this code in [implicit_map_struct.c](C/implicit_map_struct.c).

### Arrays (not pointers)

Arrays for which the compiler can determine the size are also mapped as `map(tofrom:...)`.

```C
#include <stdio.h>

int main() {
    int A[] = {1,2,3};

    printf("Host {%d, %d, %d}\n", A[0],A[1],A[2]);

    #pragma omp target // implicit map(tofrom:A[0:3])
    {
        printf("Device {%d, %d, %d}\n", A[0],A[1],A[2]);
        A[0]++; A[1]++; A[2]++;
    }

    printf("Host {%d, %d, %d}\n", A[0],A[1],A[2]);
    return 0;
}
```

In [13]:
# building
!clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_arrays.c -o C/implicit_map_arrays.exe

clang-12: error: No library 'libomptarget-nvptx-cuda_110-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.


In [14]:
# Running
!C/./implicit_map_arrays.exe

/bin/bash: C/./implicit_map_arrays.exe: No such file or directory


To play with this code, go to [implicit_map_arrays.c](C/implicit_map_arrays.c).

### Pointers

Pointers are a special case. Pointers are also mapped as `map(tofrom:...)` by default. However, since the compiler cannot know how many elements a pointer points to, it cannot determine the size of the map. Pointers are therefore mapped as `tofrom:ptr[0:0]`, where `[0:0]` means: starting from position 0, copy 0 elements. This is confusing at first, but it allows the implementation to perform pointer translation when the array has been previously mapped to the device (e.g. using structured or unstructured data mapping, as we will see later on).

```C
#include <stdio.h>
#include <stdlib.h>

int main() {
    int *A = (int*) malloc(3*sizeof(int));
    A[0] = 1; A[1] = 2; A[2] = 3; // initialize before reading

    printf("Host {%d, %d, %d}\n", A[0],A[1],A[2]);

    #pragma omp target // implicit map(tofrom:A[0:0])
    {
        // printf("Device {%d, %d, %d}\n", A[0],A[1],A[2]); Error since the pointed-to data is not mapped.
        // A[0]++; A[1]++; A[2]++; Error since the pointed-to data is not mapped
        printf("Cannot access A[]\n");
    }

    #pragma omp target data map(tofrom: A[0:3])
    {
        #pragma omp target // implicit mapping map(tofrom:A[0:0])
        {
            // A is automatically translated to a previously mapped location
            printf("Device {%d, %d, %d}\n", A[0],A[1],A[2]); 
            A[0]++; A[1]++; A[2]++;
        }
    }

    printf("Host {%d, %d, %d}\n", A[0],A[1],A[2]);
    free(A);
    return 0;
}
```

In [15]:
# building
!clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_pointers.c -o C/implicit_map_pointers.exe

clang-12: error: No library 'libomptarget-nvptx-cuda_110-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.


In [16]:
# Running
!C/./implicit_map_pointers.exe

/bin/bash: C/./implicit_map_pointers.exe: No such file or directory


To play with this code, go to [implicit_map_pointers.c](C/implicit_map_pointers.c).