# Lab 4.1: The offloading model

The objective of this lab is to introduce the concept of offloading to accelerator devices. It covers target regions, accelerator devices, and memory management for the device.

This tutorial is expected to run in a Unix-like environment.

## Table of contents

* Offloading model
* Target directive
* Memory
    * Device memory
    * Implicit memory mapping
    * Structured memory management
    * Unstructured memory management


# The offloading model

OpenMP supports accelerator devices. These are devices whose compute capabilities differ from those of traditional CPU architectures. The most common accelerator device is the General Purpose Graphics Processing Unit (GPGPU, or simply GPU).

NVIDIA devices can be discovered using `nvidia-smi`, and AMD devices using `rocminfo`. Likewise, compilers such as LLVM ship commands that list the available accelerator devices, for example `llvm-omp-device-info`.

In [42]:
# If you have an NVIDIA device use
!nvidia-smi

Wed Jun 22 16:32:25 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro P1000        Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   44C    P0    N/A /  N/A |      0MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [43]:
# If you have an AMD device use
!rocm-info

/bin/bash: rocm-info: command not found


In [44]:
# If you're using llvm use
!llvm-omp-device-info

Device (0):
    This is a generic-elf-64bit device

Device (1):
    This is a generic-elf-64bit device

Device (2):
    This is a generic-elf-64bit device

Device (3):
    This is a generic-elf-64bit device

Device (4):
    CUDA Driver Version: 		11040 
    CUDA Device Number: 		0 
    Device Name: 			Quadro P1000 
    Global Memory Size: 		4236312576 bytes 
    Number of Multiprocessors: 		5 
    Concurrent Copy and Execution: 	Yes 
    Total Constant Memory: 		65536 bytes
    Max Shared Memory per Block: 	49152 bytes 
    Registers per Block: 		65536 
    Warp Size: 				32 Threads 
    Maximum Threads per Block: 		1024 
    Maximum Block Dimensions: 		1024, 1024, 64 
    Maximum Grid Dimensions: 		2147483647 x 65535 x 65535 
    Maximum Memory Pitch: 		2147483647 bytes 
    Texture Alignment: 			512 bytes 
    Clock Rate: 			1480500 kHz
    Execution Timeout: 			No 
    Integrated Device: 			No 
    Can Map Host Memory: 		Yes 
    Compute Mode: 			DEFAULT 
    Concurrent Kernels: 		Y

# Target directive
The target directive allows a programmer to use accelerator devices, as long as the device is supported by the implementation. For the remainder of this tutorial we use LLVM with the Clang front end and assume that the device is an NVIDIA GPU. Other compilers that support OpenMP offloading can be found in the [compilers section of the OpenMP website](https://www.openmp.org/resources/openmp-compilers-tools/).

## Our first target program
One of the simplest target programs that one can write is the following:

```C
#include <stdio.h>
#include <omp.h>

int main() {
    int a[1] = {0};
    #pragma omp target
    {
        a[0] = omp_is_initial_device();
    }
    printf("Code executed in the %s\n",a[0] ? "Host":"Device");
    return 0;
}
```


In [45]:
# Building this code
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/simple_target.c -o C/simple_target.exe

In [46]:
# Running the code on the host
!OMP_TARGET_OFFLOAD=disabled srun -N 1 -c 8 C/./simple_target.exe

Code executed in the Host


In [47]:
# Running the code on the device
!OMP_TARGET_OFFLOAD=mandatory srun -N 1 -c 8  C/./simple_target.exe

Code executed in the Device


Open and play with this code in [simple_target.c](C/simple_target.c).

If the above code fails, there is a problem with your compiler or offloading setup. Please fix your environment before continuing with this lab.

`OMP_TARGET_OFFLOAD` is an environment variable that lets users specify whether device offloading is `mandatory`, `disabled`, or left to the `default` behavior.

In this example we check where the code was executed by using the API function `omp_is_initial_device()`. This function returns true if the calling code is executing on the host, that is, the device that initiated the target region. The `omp target` directive does not expose any parallelism per se. Instead, it tells the compiler that the enclosed region is meant to execute on the device. The compiler generates several versions of the same code, one for each possible device, so this program has both a host version and a device version. Consequently, target regions are not necessarily executed on the accelerator (the GPU in our case): it is still possible to run them on the host, as the `OMP_TARGET_OFFLOAD=disabled` run above shows.

```
Note: An array of only 1 element is used instead of a scalar because arrays and scalars are treated differently by OpenMP. As we will learn later on, arrays have a default mapping of tofrom, while scalars are firstprivate. Using an array lets us avoid explicit data mapping.
```

## Controlling target regions
Some important clauses to mention for the target construct are:
* `if(condition)`: Allows code to be conditionally executed in the device.
* `device(device_num)`: When multiple devices are available, allows the selection of a particular device.
* `nowait`: Enables asynchronous execution of code. More on this in a future lab.

Take for example the following code

```C
#include <stdio.h>
#include <omp.h>

int main() {
    int num_dev = omp_get_num_devices();
    num_dev = num_dev > 10 ? 10 : num_dev;
    int device_num[11];
    int i;
    // Iterate over each available device and execute code
    // If i == omp_get_num_devices(), execute on the host.
    for (i = 0; i <= num_dev; i++) {
        #pragma omp target device(i) if(i != num_dev)
        {
            device_num[i] = omp_get_device_num();
        }
    }

    // Print which device was used for each region.
    for (i = 0; i <= num_dev; i++)
        printf("Code executed for i = %d in device %d\n", i,device_num[i]);
    return 0;
}
```


In [48]:
# Building the example
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/simple_multiple_devices.c -o C/simple_multiple_devices.exe

In [49]:
# Running the example
!srun -N 1 -c 8 C/./simple_multiple_devices.exe

Code executed for i = 0 in device 0 that is not the initial device
Code executed for i = 1 in device 1 that is the initial device


Open and play with this code in [simple_multiple_devices](C/simple_multiple_devices.c).
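
The condition of the `if` clause can also depend on a runtime value, for instance to keep small problems on the host. The following is a minimal sketch and not one of the lab files: the helper name `ran_on_host` and its `threshold` parameter are hypothetical, and the `#ifdef _OPENMP` fallback is only an assumption that lets the sketch build when OpenMP is disabled.

```C
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumption: without OpenMP everything runs on the host. */
static int omp_is_initial_device(void) { return 1; }
#endif

/* Hypothetical helper: runs a target region that is offloaded only when
 * n >= threshold, and reports whether the region ran on the host. */
int ran_on_host(int n, int threshold) {
    int on_host = 1;

    /* Scalars are firstprivate by default, so on_host is mapped explicitly. */
    #pragma omp target if(n >= threshold) map(from: on_host)
    {
        on_host = (omp_is_initial_device() != 0);
    }

    /* When the if clause is false, the region must execute on the host. */
    if (n < threshold)
        assert(on_host);
    return on_host;
}
```

A call such as `ran_on_host(10, 1000)` always reports host execution, while `ran_on_host(5000, 1000)` reports device execution only if a device is available and offloading is enabled.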

## Exercise 1

Create a program that assigns values to an array using target regions. The positions of the array with odd indexes should be assigned by the host. The positions of the array with even index should be assigned by the device. Following this pattern:

`|Device|Host|Device|Host|Device|Host`

Use `omp_get_device_num()` to determine which device filled the corresponding position.

Write your code in [exercise1.c](Exercises/exercise1.c).

In [6]:
# Build
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 Exercises/exercise1.c -o Exercises/exercise1.exe

# Run
!srun -N 1 -c 8 Exercises/./exercise1.exe

A = [ 1293095552  32767  400991600  32714  401094768  32714  1400905104  22099  1400898192  22099  1424728416  22099  1293095640  32767  402185388  32714  1293095616  32767  402210268  32714  1424728432  22099  401082605  32714  380962544  32714  1293095656  32767  1424728416  22099  1400905104  22099  1400905104  22099  0  0  1424676952  22099  1400905072  22099  1424728432  22099  1424518832  22099  1400905088  22099  1  0  0  0  1424518832  22099  397286432  32714  1424518832  22099  1400905104  22099  401099408  32714  401099404  32714  401018079  32714  1293095816  32767  1293095808  32767  1424518832  22099  401074160  32714  0  0  0  0  1  0  1293096200  32767  1293096216  32767  1400905040  22099  2  0  1400889469  22099  0  0  1400889949  22099  400495336  32714  1400889872  22099  0  0  1400889488  22099 ]


See the solution to this exercise in [exercise1.c](Solutions/exercise1.c)

In [1]:
# Build
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 Solutions/exercise1.c -o Solutions/exercise1.exe

# Run
!srun -N 1 -c 8 Solutions/./exercise1.exe

A = [ 0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1 ]


## Data mapping
Devices often feature their own memory and a corresponding address space that is independent from the host's. While OpenMP supports `unified address space` and `unified shared memory`, learning to manage data between host and device is essential for application performance. In OpenMP, **"mapping"** refers to the process of allocating and deallocating variables in the device and moving their values from the host to the device and back.

The following example will show that a variable within a target region has a different address value, as long as this code is not executed with unified shared memory support.

```C
#include <stdio.h>
#include <omp.h>

int main() {
    int a[1];

    printf("Address of a in host = %lx\n", (unsigned long)a);
    
    #pragma omp target
    {
        printf("Address of a in device = %lx\n", (unsigned long)a);
    }
    return 0;
}
```


In [50]:
# Building this code
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/different_addresses.c -o C/different_addresses.exe

In [51]:
# Running the code
!srun -N 1 -c 8 C/./different_addresses.exe

Addres of a in host = 7ffe636458d8
Addres of a in device = 7f5d97a00000


The important point here is that the address of `a` is different in the host and in the device. If you want to play with this code, open [different_addresses.c](C/different_addresses.c).

Mapping can be of the form:
* `to`: From host to device.
* `from`: From device to host.
* `tofrom`: Both from host to device at the beginning of the region, and from device to host at the end of the region.
* `alloc`: Only allocate memory in the device, without copying values over.
* `delete`: Used with unstructured data mapping (see below). Deletes a variable from the device.

## Implicit data mapping

So far, none of the example codes we have shown use the `map()` clause. Whenever a variable is referenced inside a target region and does not appear in a `map()` clause, the variable is said to be implicitly mapped.

Different variable types have different implicit data mappings. While the complete list of rules can be found in the [specification document](https://www.openmp.org/specifications/), here is a set of rules of thumb that developers should follow. 

### Scalar variables
Variables of scalar types such as `int`, `double`, and `float` are treated as `firstprivate` by default. Therefore, when such variables are implicitly mapped and modified in the device, the new values are not copied back to the host. The following example shows this behavior.

```C
int a = 10;
#pragma omp target // implicit firstprivate(a)
{
    printf("a = %d\n",a);
    a = 20;
}
printf("a = %d\n",a);
```

In [52]:
# building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_scalar.c -o C/implicit_map_scalar.exe

In [53]:
# Running
!srun -N 1 -c 8 C/./implicit_map_scalar.exe

a = 10
a = 10


Although `a` is modified in the device, the value `20` is not seen by the host after the target region, because `a` is firstprivate in the region.

Use the file [implicit_map_scalar.c](C/implicit_map_scalar.c) to modify and play with the above code.
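
If the host does need the updated scalar, the implicit `firstprivate` behavior can be overridden with an explicit `map` clause. The following is a minimal sketch of that idea; the helper name `double_on_device` is hypothetical and not part of the lab files.

```C
#include <assert.h>

/* Hypothetical helper: doubles a scalar inside a target region and
 * returns the value observed by the host afterwards. */
int double_on_device(int a) {
    int expected = 2 * a;   /* reference value computed on the host */

    /* map(tofrom: a) replaces the implicit firstprivate(a): the value is
     * copied to the device on entry and back to the host on exit. */
    #pragma omp target map(tofrom: a)
    {
        a = 2 * a;
    }

    assert(a == expected);  /* the device update is visible on the host */
    return a;
}
```

With the explicit map, `double_on_device(10)` returns 20, regardless of whether the region runs on the device or falls back to the host.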

### Non scalar types (classes and structs)
User-defined types in C and C++ are mapped as `map(tofrom:...)` by default. This means that in standalone target regions, these variables are copied to the device at the beginning of the region and back to the host at the end of the region.

```
Note: Be aware that mapping copies contiguous memory regions. Therefore, no deep copy is performed by default. In order to support deep copy the user must either create a "declare mapper" or specify the mapping of the individual members. These cases are left out for now, as they are beyond the scope of this section.
```

The following is an example of the default mapping of a struct.


```C
#include <stdio.h>
#include <stdlib.h>

typedef struct myS{
    int a;
    double *b;
}myS_t;

int main() {
    myS_t myStruct = {1, NULL};

    myStruct.b = (double *)malloc(sizeof(double));
    *(myStruct.b) = 11.1;

    printf("Host {%d, %lx}\n", myStruct.a, (unsigned long)myStruct.b);

    #pragma omp target // implicit map(tofrom:myStruct). Not implicit map(tofrom:myStruct.b[0:1])
    {
        myStruct.a = 10;
        // printf("%f\n", *myStruct.b); error since b is not deep copied
        printf("Device {%d, %lx}\n", myStruct.a, (unsigned long)myStruct.b);
    }

    printf("Host {%d, %lx}\n", myStruct.a, (unsigned long)myStruct.b);
    free(myStruct.b);
    return 0;
}
```


In [None]:
# building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_struct.c -o C/implicit_map_struct.exe

In [55]:
# Running
!srun -N 1 -c 8 C/./implicit_map_struct.exe

Host {1, 564fdf16d030}
Device {10, 564fdf16d030}
Host {10, 564fdf16d030}


Notice how the device prints the same value for `myStruct.b` as the host: the pointer itself is copied bitwise. However, on a system without a unified address space, this is not a valid pointer in the device, and dereferencing it there could lead to a segmentation fault.

Play with this code in [implicit_map_struct.c](C/implicit_map_struct.c).
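
One simple way to make the pointed-to data usable in the device, without a "declare mapper", is to copy the pointer member into a local pointer and map the storage it points to with an array section. The sketch below illustrates this under stated assumptions: `scale_payload` is a hypothetical helper, and the caller is assumed to know the number of elements `n` behind `s->b`.

```C
#include <assert.h>
#include <stddef.h>

typedef struct myS {
    int a;
    double *b;
} myS_t;

/* Hypothetical helper: scales the n doubles pointed to by s->b. */
void scale_payload(myS_t *s, int n, double factor) {
    assert(s != NULL && s->b != NULL && n > 0);

    double *b = s->b;   /* local alias of the pointer member */

    /* Map the pointed-to storage explicitly; n and factor are scalars
     * and are therefore firstprivate in the region. */
    #pragma omp target map(tofrom: b[0:n])
    {
        for (int i = 0; i < n; i++)
            b[i] *= factor;
    }
}
```

The struct itself is never sent to the device here, so `s->a` and the host value of `s->b` are left untouched.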

### Arrays (not pointers)

Arrays for which the compiler can determine the size are also mapped as `map(tofrom:...)`.

```C

#include <stdio.h>

int main() {
    int A[] = {1,2,3};

    printf("Host {%d, %d, %d}\n", A[0],A[1],A[2]);

    #pragma omp target // implicit map(tofrom:A[0:3])
    {
        printf("Device {%d, %d, %d}\n", A[0],A[1],A[2]);
        A[0]++; A[1]++; A[2]++;
    }

    printf("Host {%d, %d, %d}\n", A[0],A[1],A[2]);
    return 0;
}
```


In [56]:
# building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_arrays.c -o C/implicit_map_arrays.exe

In [57]:
# Running
!srun -N 1 -c 8 C/./implicit_map_arrays.exe

Host {1, 2, 3}
Device {1, 2, 3}
Host {2, 3, 4}


To play with this code go to [implicit_map_arrays.c](C/implicit_map_arrays.c)

### Pointers

Pointers are a special case. They are also mapped as `map(tofrom:...)` by default. However, since the compiler cannot know how many elements a pointer points to, it cannot determine the size of the map. Pointers are therefore mapped as `tofrom:ptr[0:0]`, where `[0:0]` means: starting from position 0, copy 0 elements. This is confusing at first, but it allows the runtime to perform pointer translation when the array has been previously mapped to the device (e.g. using structured or unstructured data mapping, as we will see later on).

```C

#include <stdio.h>
#include <stdlib.h>

int main() {
    int *A = (int*) malloc(3*sizeof(int));
    A[0] = 1; A[1] = 2; A[2] = 3;

    printf("Host A = {%d, %d, %d}\n", A[0],A[1],A[2]);

    #pragma omp target // implicit map(tofrom:A[0:0])
    {
        // printf("Device A = {%d, %d, %d}\n", A[0],A[1],A[2]); Error since pointer is not mapped. 
        // A[0]++; A[1]++; A[2]++; Error since pointer is not mapped
        printf("Cannot access A[]\n");
    }

    #pragma omp target data map(tofrom: A[0:3])
    {
        #pragma omp target // implicit mapping map(tofrom:A[0:0])
        {
            // A is automatically translated to a previously mapped location
            printf("Device A = {%d, %d, %d}\n", A[0],A[1],A[2]); 
            A[0]++; A[1]++; A[2]++;
        }
    }

    printf("Host A = {%d, %d, %d}\n", A[0],A[1],A[2]);
    return 0;
}
```


In [58]:
# building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/implicit_map_pointers.c -o C/implicit_map_pointers.exe

In [59]:
# Running
!srun -N 1 -c 8 C/./implicit_map_pointers.exe

Host A = {1, 2, 3}
Cannot access A[]
Device A = {1, 2, 3}
Host A = {2, 3, 4}


The first region uses a default map of A[0:0]. Therefore, no allocation has been made in the target device. Attempting to access A will lead to undefined behavior (likely a segmentation fault). Default mapping of pointers is useful when variables are already in the device, like in the second part of the example. In this case no data movement is made when entering the `target` region, but A inside the target region points to the already mapped variable from the `target data` region. Bear with us a little longer to understand `target data`. 

To play with this code go to [implicit_map_pointers.c](C/implicit_map_pointers.c)
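
When the data behind a pointer has not been mapped beforehand, the `target` construct can be given an explicit array section so that the runtime knows how many elements to transfer. A minimal sketch, where `increment_all` is a hypothetical helper and the caller is assumed to pass the correct element count:

```C
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: increments the n integers pointed to by A. */
void increment_all(int *A, int n) {
    assert(A != NULL && n > 0);

    /* The explicit section replaces the implicit map(tofrom: A[0:0]),
     * so n elements are copied to the device and back. */
    #pragma omp target map(tofrom: A[0:n])
    {
        for (int i = 0; i < n; i++)
            A[i]++;
    }
}
```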

## Explicit data mapping

In order to control how variables are mapped to/from the device, the `map()` clause is used. In a target region each variable can be mapped as `to`, `from`, `tofrom`, or `alloc`. To use different mappings for different variables, multiple `map()` clauses can be specified. Likewise, a single `map()` clause can contain multiple variables. Here are some examples:

* `map(A)`: Maps variable A `tofrom`.
* `map(tofrom: B)`: Maps variable B `tofrom`.
* `map(alloc: C, D)`: Maps variable C and D `alloc`.
* `map(from: E) map(to: F)`: Maps variable E as `from` and F as `to`.

Let's take a look at these different mappings in action in the following code. Let's ignore the array section notation (i.e. `[:]` and `[0:10]`) as this will be discussed shortly. Can you guess what would be the values of the variables before and after the target region?

```C
    typedef struct myStr_s{
        char a[10];
        int b;
    }myStr;
    ...
    
    int x = 3;
    int y[3] = {1,2,3};
    int *z = (int*)malloc(3 * sizeof(int));
    z[0] = 1; z[1] = 2; z[2] = 3;
    myStr s = {"aaa",0};
    myStr sa[3] = {{"hola", 1},{"chao", 2},{"hello", 3}};

    #pragma omp target          \
            map (alloc: y[:])   \
            map (to: z[0:3])    \
            map (from: sa)
    {
        x = 44;
        y[0] = 1; y[1] = 44; y[2] = 2;
        z[0] = 99; z[1] = 44; z[2] = 2;
        strcpy(sa[0].a, "hey");
        sa[0].b = 3;
        strcpy(sa[1].a, "you");
        sa[1].b = 4;
        strcpy(sa[2].a, "there");
        sa[2].b = 5;
        strcpy(s.a, "hi");
        s.b = 42;
    }

```

In [60]:
# building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/explicit_map.c -o C/explicit_map.exe

In [61]:
# running
!srun -N 1 -c 8 C/./explicit_map.exe

In host: 
x = 3,
y = {1, 2, 3},
z = {1, 2, 3},
s = {aaa, 0},
sa = [{hola, 1},{chao, 2},{hello, 3}]
In device:
x = 3,
y = {0, 0, 0},
z = {1, 2, 3},
s = {aaa, 0},
sa = [{, 0},{, 0},{, 0}]
modifying variables
In device modified:
x = 44,
y = {1, 44, 2},
z = {99, 44, 2},
s = {k, 42},
sa = [{h, 3},{y, 4},{t, 5}]
In host: 
x = 3,
y = {1, 2, 3},
z = {1, 2, 3},
s = {k, 42},
sa = [{h, 3},{y, 4},{t, 5}]


If you want to play with this code, open [explicit_map.c](C/explicit_map.c).

The following rules apply:

* `x` was treated as `firstprivate(x)` because of the implicit mapping behavior, so the host value is unchanged.
* `y` was only allocated in the device. Its device content lived only in the device, and no data was sent from host to device or the other way around.
* `z` was allocated in the device and initialized with the host values. However, modifications made in the device were not sent back, because `to` does not copy data from the device to the host.
* `s` was implicitly mapped as `map(tofrom:s)`. Therefore, it was allocated in the device, initialized with the host value, and the host copy was updated with the device value at the end of the target region.
* `sa` was allocated in the device but not initialized, which explains the values seen when printing it the first time inside the device. Its values were transferred from the device to the host at the end of the target region, updating the host values.

It is important to notice that `s.a` is a char array with static size. In memory its elements are consecutive and are followed by the integer member of the struct. Memory looks as follows.

`| char[0] | char[1] | char[2] | ... | char[9] | int |`

Mapping copies consecutive memory regions, making it possible to have a copy of the char array. If `s.a` were a pointer instead of an array, the characters would not have been copied.

# Array sections

When mapping arrays, it is possible to only map sections of the array. To do this, OpenMP uses an array section notation with the following syntax:

`base_ptr[lower_bound : length : stride]`

Currently the LLVM C compiler (*clang*) only supports `lower_bound` and `length`.

Array sections are useful to map data to different devices, as well as to reduce size of memory copies. Take for example the following code:


```C
int A[100];
for (int i = 0; i < 50; i++)
   A[i] = i;

#pragma omp target map(to:A[0:50]) map(from:A[50:50])
{
    for (int i = 50; i < 100; i++) {
        A[i] = A[i-50];
    }
}
```

In this example, only the first 50 elements of the array A are copied over to the device, and the last 50 elements are copied back from the device.

In [62]:
# building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/map_array_sections.c -o C/map_array_sections.exe

In [63]:
# running
!srun -N 1 -c 8 C/./map_array_sections.exe

A = [ 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 ]
B = [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 ]
A = [ 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  

If you want to play with this code open [map_array_sections.c](C/map_array_sections.c)
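
An array section does not have to start at element 0. The sketch below maps only a window of a larger array; `scale_window` is a hypothetical helper, and the caller is assumed to guarantee that `lo + len` does not exceed the array length.

```C
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiplies A[lo] .. A[lo + len - 1] by factor,
 * transferring only that window to and from the device. */
void scale_window(double *A, int lo, int len, double factor) {
    assert(A != NULL && lo >= 0 && len > 0);

    #pragma omp target map(tofrom: A[lo:len])
    {
        /* Indices stay relative to the start of A, not to the section. */
        for (int i = lo; i < lo + len; i++)
            A[i] *= factor;
    }
}
```

Elements outside the window are neither copied nor valid in the device, so the loop must not touch them.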

## Structured data mapping

Structured data mapping refers to a `target data` region, enclosed by a pair of braces that mark the beginning and end of a data environment. Structured mapping uses the `target data` directive, which is different from the `target` directive. The `target data` directive **does not offload code to the device**; rather, it creates a data environment in the device, allocating variables and moving data between host and device according to the `map()` clauses.

Structured data mapping makes it easy to understand the scope of variables in the device. However, it can only be used within a single block of code, so the beginning and end of the region cannot live in different functions.

Here's an example:


```C
int A[100];

for (int i = 0; i < 100; i++)
       A[i] = i;
#pragma omp target data map(tofrom: A)
{
    for (int i = 0; i < 100; i++) // EXECUTED IN THE HOST
       A[i] = 0;
}
```

This example highlights two aspects of the `target data` region. First, the code inside the `target data` region (unless it is a `target` region itself) is executed in the host. Second, at the end of the `target data` region, data is copied back to the host when variables are mapped `from` or `tofrom`. Therefore, data in the host will be overwritten.

What do you think is the value of `A[]` after the end of the `target data` region?

In [64]:
# building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/target_data_simple.c -o C/target_data_simple.exe

In [65]:
# running
!srun -N 1 -c 8 C/./target_data_simple.exe

A = [ 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 ]
A = [ 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 ]


The answer is that the modification of `A` inside the `target data` region (`A[i] = 0;`) is not visible after the region, because the end of the `target data` region overwrites `A` with the values that were sent to the device in the first place.

Play with this code in [target_data_simple.c](C/target_data_simple.c)
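
If host-side changes made inside a `target data` region must reach the device, they can be pushed explicitly with the `target update` directive (listed in the unstructured data mapping section below). The following sketch shows one way to combine the two; `reset_then_increment` is a hypothetical helper and not part of the lab files.

```C
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: zeroes A on the host inside a target data region,
 * pushes the zeros to the device, and then increments every element there. */
void reset_then_increment(int *A, int n) {
    assert(A != NULL && n > 0);

    #pragma omp target data map(tofrom: A[0:n])
    {
        for (int i = 0; i < n; i++)   /* executed in the host */
            A[i] = 0;

        /* Without this update the device would still hold the old values. */
        #pragma omp target update to(A[0:n])

        #pragma omp target            /* A is already present in the device */
        {
            for (int i = 0; i < n; i++)
                A[i] += 1;
        }
    }   /* the device copy is written back to the host here */
}
```

After the call every element of `A` is 1, which shows that the device worked on the zeros written by the host rather than on the original values.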

Now, let's take a look at the effect of using `target data` and how it can help us avoid moving data around. An important rule of thumb when programming accelerator devices is that data movements are often expensive. It is necessary to avoid them as much as possible, and only leave those operations that are absolutely necessary.

Take for example the following code, which is a common pattern in many applications.


```C
int A[N];
for (int step = 0; step < STEPS; step++)
   #pragma omp target
   {
      foo(A);
   }
```

Each time a new step is executed, the target region moves data from the host to the device and back. These operations add up to a considerable overhead.

By using `target data` it is possible to create a data environment that persists across multiple executions of the target region.

```C
int A[N];
#pragma omp target data map(tofrom:A)
{
for (int step = 0; step < STEPS; step++)
   #pragma omp target
   {
      foo(A);
   }
}
```

The OpenMP runtime will only move data at the beginning and at the end of the `target data` region. Let's compare execution times for both.

In [66]:
# Compiling
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/target_data_simulation.c -o C/target_data_simulation.exe

In [67]:
# Running
!srun -N 1 -c 8 C/./target_data_simulation.exe

Exec time without target data = 11.577703
Exec time with target data = 2.463462


Even though the `target` region does exactly the same work in both cases, moving data back and forth increases the execution time considerably.

```
Note: We have not exposed parallelism in the GPU so far, so the code from the example above is a really bad example on how to use GPUs, but it emphasizes the importance of data movements for computation. 
```

Play with this code opening the file [target_data_simulation.c](C/target_data_simulation.c)

## Unstructured data mapping

Unstructured data mapping is used to move data between host and device on demand, that is, without creating a data region that has a beginning and an end. This feature is particularly useful for managing memory inside functions and classes.

Unstructured data mapping has two versions:

* Using variables and the `map()` clause
    * `#pragma omp target enter data map(...)`
    * `#pragma omp target exit data map(...)`
    * `#pragma omp target update to/from(...)`
* Manually managing device pointers using the OpenMP API
    * `omp_target_alloc()`
    * `omp_target_free()`
    * `omp_target_memcpy()`
    * `omp_target_memcpy_async()`
    * `omp_target_memcpy_rect()`
    * `omp_target_memcpy_rect_async()`
    * `omp_target_associate_ptr()`


### Unstructured data using map()
When using the `map()` clause, all the rules for data mapping apply. Without the `always` modifier (e.g. `map(always, to: a)`), `map()` only copies data to the device when a variable's reference count goes from 0 to 1, and only copies data back to the host when the reference count goes from 1 to 0.

Take for example the following code:


```C
    int A[1];

    A[0] = 0;
    #pragma omp target enter data map(to:A[:]) // Data copied
        
    A[0] = 1;
    // Data not copied (reference = 2)
    #pragma omp target enter data map(to:A[:]) 

    // Update always copies
    #pragma omp target update to(A[0:1])
    
    #pragma omp target
        A[0] = 99;
    
    // data not copied because reference is not 0
    #pragma omp target exit data map(from:A[:])

    // Update always copies
    #pragma omp target update from(A[0:1])

    #pragma omp target
        A[0] = 66;

    // Reference reaches 0. Data is copied
    #pragma omp target exit data map(from:A[:])

```


In [3]:
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/target_update_and_enter_exit_data.c -o C/target_update_and_enter_exit_data.exe

In [4]:
!srun -N 1 -c 8 C/./target_update_and_enter_exit_data.exe

Device A[0] = 0
Device A[0] = 1
Host A[0] = 1
Device A[0] = 99
Host A[0] = 66


To play with this code go to [target_update_and_enter_exit_data.c](C/target_update_and_enter_exit_data.c).

### Unstructured mapping using API



OpenMP allows programmers to take full control over data management in the device. The OpenMP API calls for data management communicate directly with the underlying hardware API and runtime. Therefore, they bypass many of the mechanisms that exist in the OpenMP runtime.

Unstructured mapping is also useful when interacting with external libraries that may not use OpenMP, since it provides pointers that are valid in the device address space.

However, when combining memory managed through the API with `target` regions, it is important to add some additional information (the `is_device_ptr()` clause described below), so that the OpenMP runtime does not treat these pointers as host pointers.

`omp_target_alloc()` and `omp_target_free()` work similarly to the C calls `malloc()` and `free()`. The different versions of `omp_target_memcpy()` move data between different devices, including the host.

Let's take a look at an example:

```C
int *A;int B[100];
int dev_id = omp_get_default_device();
int host_id = omp_get_initial_device();

// Create data in device
A = omp_target_alloc(/*size*/ sizeof(int)*100,
                     /*dest_dev_id*/ dev_id);

// This will generate a runtime error if not unified shared memory
// A is a pointer to the device
// *A = 5; ERROR

// Move data from host to device
omp_target_memcpy(/*dest_ptr*/ A,
                  /*src_ptr*/ B, 
                  /*length*/ sizeof(int)*100, 
                  /*offset_dst*/ 0, 
                  /*offset_src*/ 0, 
                  /*dest_dev_id*/ dev_id, 
                  /*src_dev_id*/ host_id);

// Delete data from device
omp_target_free(/*dev_ptr*/ A,
                /*dev_id*/  dev_id);
```

In [1]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/api_target_data.c -o C/api_target_data.exe

In [14]:
# Running and profiling
# !nvprof --print-gpu-trace C/./api_target_data.exe
!srun -N 1 -c 8 C/./api_target_data.exe

This example does not output anything, since we are only moving data to the device. However, if you use a profiler, you should be able to see a trace showing the data movement from host to device.

If you are using an NVIDIA GPU, you can use the following command instead to see the events in the GPU. You should be able to find a line that contains `CUDA memcpy HtoD`, corresponding to the moment when the array B is copied to location A.

`nvprof --print-gpu-trace C/./api_target_data.exe`

Play with the above code in the file [api_target_data.c](C/api_target_data.c)

### is_device_ptr()

We did not introduce a target region above because target regions, by default, attempt to translate host pointers to device pointers. Therefore, in order to use manual data management, it is necessary to add another clause to the `target` directive: `is_device_ptr()`.

A pointer is just a number that represents a memory location. When dereferencing a pointer (e.g. `*A = 5;`), the compiler translates this into an instruction that accesses the memory location whose address is equal to the number in the pointer. However, when a system does not have unified shared memory, the address returned by `omp_target_alloc()` is not *valid* on the host. Therefore, attempting to dereference this pointer on the host will cause an error or undefined behavior. The above example has a commented segment of code that illustrates this.

The following example completes the earlier `omp_target_alloc()` and `omp_target_memcpy()` example with a target region and the corresponding `is_device_ptr()` clause. It also shows what happens when a pointer is not used with `is_device_ptr()`, by printing the number in the pointer (i.e. the address).

```C
int *A; int B[100];

for (int i = 0; i < 100; i++) {
    B[i] = i;
}

int dev_id = omp_get_default_device();
int host_id = omp_get_initial_device();

A = omp_target_alloc(/*size*/ sizeof(int)*100,
                    /*dest_dev_id*/ dev_id);

omp_target_memcpy(/*dest_ptr*/ A,
                /*src_ptr*/ B, 
                /*length*/ sizeof(int)*100, 
                /*offset_dst*/ 0, 
                /*offset_src*/ 0, 
                /*dest_dev_id*/ dev_id, 
                /*src_dev_id*/ host_id);

#pragma omp target
{
   // A[0] will fail
}

#pragma omp target is_device_ptr(A)
{
    for (int i = 0; i < 100; i++)
        A[i]++;
}

// Manually copying data back into B
omp_target_memcpy(/*dest_ptr*/ B,
            /*src_ptr*/ A, 
            /*length*/ sizeof(int)*100, 
            /*offset_dst*/ 0, 
            /*offset_src*/ 0, 
            /*dest_dev_id*/ host_id, 
            /*src_dev_id*/ dev_id);

omp_target_free(/*dev_ptr*/ A,
                /*dev_id*/  dev_id);

```

In [25]:
# Building
!srun -N 1 -c 8 clang -fopenmp -fopenmp-targets=nvptx64 C/target_is_device_ptr.c -o C/./target_is_device_ptr.exe

In [26]:
# Running
!srun -N 1 -c 8 C/./target_is_device_ptr.exe

Values of B in the host are:
B = [ 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19 ]
Allocating a device pointer with address A = 0x7F3107600000:
With no is_device_ptr() device sees (invalid) A = 0x0:
With is_device_ptr() we can access the address of A = 0x7F3107600000:
Copying back A = 0x7F3107600000: to B = 0x7FFEFF3B5970
Freeing A = 0x7F3107600000 from device 
Values of B in the host are:
B = [ 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20 ]


Play with the above code at [target_is_device_ptr.c](C/target_is_device_ptr.c)

## Exercise 2

Create a class that allocates data on the device in its constructor and deallocates it in its destructor. The class should provide a method that initializes the array with 1s on the GPU, and a method that computes the sum of all the elements on the GPU.

```
Hint: When working with class members, all references to them inside the methods are implicitly preceded by "this->". This makes them references into an aggregate variable (i.e. the class instance itself), which complicates mapping and referencing multiple attributes. Therefore, it is necessary to map all the attributes that are going to be used. This exercise uses many of the concepts learned above.
```

Open and modify the following code: [exercise2.cpp](Exercises/exercise2.cpp)

In [None]:
# Building Solution
!srun -N 1 -c 8 clang++ -fopenmp -fopenmp-targets=nvptx64 Exercises/exercise2.cpp -o Exercises/exercise2.exe

# Running solution
!srun -N 1 -c 8 Exercises/./exercise2.exe

To see a solution go to [exercise2.cpp](Solutions/exercise2.cpp)

In [47]:
# Building Solution
!srun -N 1 -c 8 clang++ -fopenmp -fopenmp-targets=nvptx64 Solutions/exercise2.cpp -o Solutions/exercise2.exe -gline-tables-only

# Running solution
!srun -N 1 -c 8 Solutions/./exercise2.exe

Allocating array in the device
Initializing array in the device
Computing sum
Sum of array with 1000 elements is 1000
Deallocating array in the device
