# Data management

## Why do we have to care about data transfers?

The main bottleneck in using GPUs for computing is data transfers between the host and the GPU.

Let's have a look at the bandwidths.

<img alt="Typical bandwidths" src="../../pictures/bandwidths.png" style="float:none" width="40%"/>

In this picture, the size of the arrows represents the bandwidth.
To give a better idea, here are some typical numbers:

- GPU to its internal memory (HBM2): 900 GB/s
- GPU to CPU via PCIe: 16 GB/s
- GPU to GPU via NVLink: 25 GB/s
- CPU to RAM (DDR4): 128 GB/s

So if you have to remember only one thing: take care of memory transfers.

## The easy way: NVIDIA managed memory

NVIDIA offers a feature called [Unified Memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) which allows developers to "forget" about data transfers.
The host and the GPU share a single memory space, so that the normal [_page fault_ mechanism](https://en.wikipedia.org/wiki/Page_fault) can be used to manage transfers.

This feature is activated with the compiler options:

- NVIDIA compilers: `-gpu=managed`
- PGI: `-ta=tesla:managed`

This might give good performance results and you might just forget explicit data transfers. However, depending on the complexity of your data structures, you might need to deal explicitly with data transfers. The next section gives an introduction to manual data management.

Unified Memory also makes it possible to use more memory than is physically available on the GPU (so-called GPU memory oversubscription).
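
As an illustration, here is a minimal sketch of what a kernel can look like when managed memory is enabled. It assumes that the compiler only manages dynamically allocated data (hence the `allocatable` array) and that the code is compiled with the option listed above; the program and variable names are hypothetical.

```fortran
program managed_memory_sketch
    use iso_fortran_env, only : INT32, REAL64
    implicit none

    integer(kind=INT32), parameter               :: n = 100000
    real   (kind=REAL64), dimension(:), allocatable :: x
    integer(kind=INT32)                          :: i

    ! Assumption: with managed memory enabled, this dynamic allocation
    ! is placed in memory accessible from both the host and the GPU
    allocate(x(n))

    ! No data clause is given: data migration is left to the runtime
    !$acc parallel loop
    do i = 1, n
        x(i) = 2.0_real64 * real(i, kind=REAL64)
    enddo

    ! The host reads the value without any explicit update directive
    write(0,*) x(42)

    deallocate(x)
end program managed_memory_sketch
```

Without managed memory the same source still compiles, but the transfers of `x` are then decided implicitly by the compiler at the boundaries of the kernel.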

## Manual data movement

### Data clauses

There are multiple data directives that accept the same data clauses. So we start with the data clauses and then continue with the data directives.

In order to choose the right data clause for data transfers, you need to answer the following two questions:

- Does the kernel need the values computed beforehand by the CPU?
- Are the values computed inside the kernel needed on the CPU afterwards?

|                  | Needed after        | Not needed after  |
|------------------|---------------------|-------------------|
|Needed before     |  `copy(var1, ...)`    | `copyin(var2, ...)` |
|Not needed before |  `copyout(var3, ...)` | `create(var4, ...)` |

The figure below illustrates the transfers, if any, between the CPU and the GPU for these four clauses.

<img alt="Data clauses" src="../../pictures/data_clauses_new.png" style="float:none" width="95%"/>

**Important**: the presence of variables on the GPU is checked at runtime. If a variable is already present on the GPU, these clauses have no effect on it.
This means that you cannot use them to update variables (on the GPU at region entry or on the CPU at region exit).
In this case, you have to use the `acc update` directive.

Other data clauses include:

- `present`: check if data is present in the GPU memory; an error is raised if it is not the case
- `deviceptr`: pass the GPU pointer; used for interoperability between other APIs (e.g. CUDA, Thrust) and OpenACC
- `attach`: attach a pointer to memory already allocated in the GPU

#### Array shapes and partial data transfers

For array transfers, full or partial, one has to follow the language syntax.

In Fortran, you have to specify the range of the array in the format `(first index:last index)`.

```fortran
    !$acc data copyout(myarray(1:size))
       ! Some really fast kernels
    !$acc end data
```

For partial data transfer, you can specify a subarray. For example: 

```fortran
 !$acc data copyout(myarray(2:size-1))
```
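
To make the effect of a subarray concrete, here is a small self-contained sketch. It assumes an array of hypothetical size `n` (the name `n` is used instead of `size` to avoid shadowing the Fortran intrinsic) whose boundary elements are only handled by the CPU.

```fortran
program partial_transfer_sketch
    use iso_fortran_env, only : INT32, REAL64
    implicit none

    integer(kind=INT32), parameter   :: n = 1000
    real   (kind=REAL64), dimension(n) :: myarray
    integer(kind=INT32)              :: i

    ! All values are first set on the CPU;
    ! myarray(1) and myarray(n) are never touched by the GPU
    myarray(:) = -1.0_real64

    ! Only the interior myarray(2:n-1) is allocated on the GPU
    ! and copied back to the CPU at the end of the region
    !$acc data copyout(myarray(2:n-1))
    !$acc parallel loop present(myarray(2:n-1))
    do i = 2, n-1
        myarray(i) = real(i, kind=REAL64)
    enddo
    !$acc end data

    ! Boundary elements keep their CPU values, the interior comes from the GPU
    write(0,*) myarray(1), myarray(2), myarray(n)
end program partial_transfer_sketch
```

The kernel must only access indices inside the transferred range; accessing `myarray(1)` or `myarray(n)` on the GPU would be out of bounds there.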


Before moving on to data directives, some vocabulary needs to be introduced. According to data lifetime on the GPU, two types of data regions can be distinguished: _structured_ and _unstructured_. Structured data regions are defined within the same scope (e.g. routine), while unstructured data regions allow data creation and deletion in different scopes.

### Implicit structured data regions associated with compute constructs

Any of the three compute constructs -- `parallel`, `kernels`, or `serial` -- opens an implicit data region. Data transfers will occur just before the kernel starts and just after the kernel ends.

In the [Get started](../Get_started.ipynb) notebook, we have already seen that it is possible to specify data clauses in `acc parallel` to manage our variables.
The compiler checks which variables (scalars or arrays) are needed in the kernel and tries to add the necessary _data clauses_.

#### Exercise

- Create a parallel region for each loop.
- For each parallel region, what _data clause_ should be added?

Example stored in: `../../examples/Fortran/Data_Management_vector_sum_exercise.f90`

In [None]:
%%idrrun
program vector_sum
    use iso_fortran_env, only : INT32, REAL64
    use openacc
    implicit none

    integer(kind=INT32), parameter              :: system_size  = 10000
    integer(kind=INT32), dimension(system_size) :: a, b, c
    integer(kind=INT32)                         :: i

    ! Insert OpenACC directive
    do i = 1, system_size
        a(i) = i
        b(i) = i * 2
    enddo

    ! Insert OpenACC directive
    do i = 1, system_size
       c(i) = a(i) + b(i)
    enddo

    write(0,"(a22,i3)") "value at position 12: ", c(12)

end program vector_sum        

#### Answer

- For each parallel region, what _data clause_ should be added?
  - Loop 1: The initialization of a and b is done directly on GPU so we don't need to copy the values from CPU.
              Variables a and b are used to compute c after execution of the first parallel region.
              We need to `copyout` a and b.
  - Loop 2: We need the values of a and b to compute c. This computation is the initialization of c.
              We print the value of one element of c after execution.
              The values of a and b are not needed anymore.
              We need to `copyin` a and b. We need to `copyout` c.

Example stored in: `../../examples/Fortran/Data_Management_vector_sum_solution.f90`

In [None]:
%%idrrun -a
program vector_sum
    use iso_fortran_env, only : INT32, REAL64
    use openacc
    implicit none

    integer(kind=INT32), parameter              :: system_size  = 10000
    integer(kind=INT32), dimension(system_size) :: a, b, c
    integer(kind=INT32)                         :: i

    !$acc parallel loop copyout(a(:), b(:)) 
    do i = 1, system_size
        a(i) = i
        b(i) = i * 2
    enddo

    !$acc parallel loop copyin(a(:), b(:)) copyout(c(:))
    do i = 1, system_size
       c(i) = a(i) + b(i)
    enddo

    write(0,"(a22,i3)") "value at position 12: ", c(12)

end program vector_sum        

If you use NVIDIA compilers (formerly PGI), most of the time the right directives will be added _implicitly_.

Our advice is to make explicit all actions performed implicitly by the compiler.
It will help you keep your code understandable and avoid porting problems if you have to change compilers.

Not all compilers choose the same default behavior.

### Explicit structured data regions `acc data`

Using the _data regions_ associated with kernels is quite convenient and is a good strategy for incremental porting of your code.

However, this results in a large number of data transfers that can be avoided.

If we take a look at the previous example, we count 5 data transfers:

- Loop 1: copyout(a, b)
- Loop 2: copyin(a, b) copyout(c)

If we look closely we can see that we do not need a and b on the CPU between the kernels.
It means that data transfers of a and b at the end of kernel1 and at the beginning of kernel2 are useless.

The solution is to encapsulate the two loops in a _structured data region_ that you can open with the directive `acc data`.
The syntax is:

```fortran
!$acc data <data clauses>
    ! Your code
!$acc end data
```

#### Exercise

Analyze the code to create a _structured data region_ that encompasses both loops.
The data clause `present` has been added to the _data region associated with kernels_. You should not remove this part.

How many data transfers occurred?

Example stored in: `../../examples/Fortran/Data_Management_structured_data_region_exercise.f90`

In [None]:
%%idrrun -a
program vector_addition
    use iso_fortran_env, only : INT32
    use openacc
    implicit none

    integer(kind=INT32), parameter              :: system_size  = 10000
    integer(kind=INT32), dimension(system_size) :: a, b, c
    integer(kind=INT32)                         :: i

!   Structured data region

    !$acc parallel loop present(a(:), b(:))
    do i = 1, system_size
        a(i) = i
        b(i) = i * 2
    enddo

    !$acc parallel loop present(a(:), b(:), c(:))
    do i = 1, system_size
       c(i) = a(i) + b(i)
    enddo

!   End of structured data region

    write(0,"(a22,i3)") "value at position 12: ", c(12)
end program vector_addition

#### Solution

Example stored in: `../../examples/Fortran/Data_Management_structured_data_region_solution.f90`

In [None]:
%%idrrun -a
program vector_addition
    use iso_fortran_env, only : INT32
    use openacc
    implicit none

    integer(kind=INT32), parameter              :: system_size  = 10000
    integer(kind=INT32), dimension(system_size) :: a, b, c
    integer(kind=INT32)                         :: i

!   Structured data region
    !$acc data create(a, b) copyout(c)

    !$acc parallel loop present(a(:), b(:))
    do i = 1, system_size
        a(i) = i
        b(i) = i * 2
    enddo

    !$acc parallel loop present(a(:), b(:), c(:))
    do i = 1, system_size
       c(i) = a(i) + b(i)
    enddo

!   End of structured data region
    !$acc end data

    write(0,"(a22,i3)") "value at position 12: ", c(12)
end program vector_addition

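When using a _structured data region_, we advise using the `present` data clause on the enclosed compute constructs: it states that the data must already be in GPU memory and raises an error otherwise.
With this data region, only one transfer remains: `c` is copied back to the CPU at the end of the region.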

#### WRONG example

Here we are going to simulate a case where we make modifications on the CPU between 2 GPU kernels.
This can happen when you are in the porting phase or because some parts of the computation cannot be executed on the GPU.

The example given below doesn't give the right results on the CPU. Why?

Example stored in: `../../examples/Fortran/Data_Management_wrong_example.f90`

In [None]:
%%idrrun -a
program wrong_usage
    use iso_fortran_env, only : INT32, REAL64
    use openacc
    implicit none
        
    integer(kind=INT32 ), parameter              :: system_size = 10000
    real   (kind=real64), dimension(system_size) :: a, b, c
    integer(kind=INT32 )                         :: i

!  Structured data region
    !$acc data create(a, b) copyout(c)

        !$acc parallel loop present(a(:), b(:))
        do i = 1, system_size
            a(i) = i
            b(i) = i*2
        enddo

        ! We update an element of the array on the CPU
        a(12) = 42

        !$acc parallel loop present(b(:), c(:)) copyin(a(:))
        do i = 1, system_size
            c(i) = a(i) + b(i)
        enddo
    !$acc end data
    write(0,"(a22,f10.5)") "value at position 12: ", c(12)
end program wrong_usage

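This example emphasizes that you cannot update data with data clauses.
Because `a` is already present on the GPU (it was created by the enclosing `acc data` region), the `copyin(a(:))` clause on the second kernel has no effect: the GPU still sees `a(12) = 12`, so `c(12)` is 36 instead of the intended 66.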

### Updating data

Let's say that not all of your code has been ported to the GPU.
This means that some variables (arrays or scalars) will be used in computations by both the CPU and the GPU.

To keep the results correct, you will have to update those variables when needed.

#### `acc update device`

To update the value a variable has on the GPU with the value it has on the CPU, use:
```fortran
!$acc update device(var1, var2, ...)
```

**Important**: The directive cannot be used inside a compute construct.

#### `acc update self`

Conversely, if not all of your code has been ported to the GPU, the values computed on the GPU may be needed afterwards on the CPU.

The directive to use is:
```fortran
!$acc update self(var1, var2, ...)
```


Correct the previous example in order to obtain correct results:

Example stored in: `../../examples/Fortran/Data_Management_wrong_example.f90`

In [None]:
%%idrrun -a
program wrong_usage
    use iso_fortran_env, only : INT32, REAL64
    use openacc
    implicit none
        
    integer(kind=INT32 ), parameter              :: system_size = 10000
    real   (kind=real64), dimension(system_size) :: a, b, c
    integer(kind=INT32 )                         :: i

!  Structured data region
    !$acc data create(a, b) copyout(c)

        !$acc parallel loop present(a(:), b(:))
        do i = 1, system_size
            a(i) = i
            b(i) = i*2
        enddo

        ! We update an element of the array on the CPU
        a(12) = 42

        !$acc parallel loop present(b(:), c(:)) copyin(a(:))
        do i = 1, system_size
            c(i) = a(i) + b(i)
        enddo
    !$acc end data
    write(0,"(a22,f10.5)") "value at position 12: ", c(12)
end program wrong_usage
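
One possible correction is sketched below; it is not necessarily the reference solution of the course. It assumes that only element 12 is modified on the CPU, so only that element is pushed to the GPU with `acc update device`, and `a` is then declared `present` in the second kernel. The program name is hypothetical.

```fortran
program update_sketch
    use iso_fortran_env, only : INT32, REAL64
    implicit none

    integer(kind=INT32 ), parameter              :: system_size = 10000
    real   (kind=REAL64), dimension(system_size) :: a, b, c
    integer(kind=INT32 )                         :: i

    !$acc data create(a, b) copyout(c)

        !$acc parallel loop present(a(:), b(:))
        do i = 1, system_size
            a(i) = i
            b(i) = i*2
        enddo

        ! We update an element of the array on the CPU
        a(12) = 42

        ! Push the modified element to the copy living on the GPU
        !$acc update device(a(12:12))

        ! a is already on the GPU: present documents it and checks it at runtime
        !$acc parallel loop present(a(:), b(:), c(:))
        do i = 1, system_size
            c(i) = a(i) + b(i)
        enddo
    !$acc end data
    ! Expected value: 42 + 24 = 66
    write(0,"(a22,f10.5)") "value at position 12: ", c(12)
end program update_sketch
```

Updating a single element keeps the transfer as small as possible; `!$acc update device(a(:))` would also work but would overwrite the whole GPU copy with the CPU values, which are not initialized here.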

### Explicit unstructured data regions `acc enter data`

Each time you run a code on the GPU, a data region is created for the lifetime of the program.

There are two directives to manage data inside this region:

- `acc enter data <input data clause>`: to put data inside the region (allocate memory, copy data from the CPU to the GPU)
- `acc exit data <output data clause>`: to remove data (deallocate memory, copy data from the GPU to the CPU)

This feature is helpful when you have your variables declared at one point of your code and used in another one (modular programming).
You can allocate memory as soon as the variable is created and just use _present_ when you create kernels.

#### `acc enter data`

This directive is used to put data on the GPU inside the unstructured data region spanning the lifetime of the program.
It will allocate the memory necessary for the variables and, if asked, copy the data present on the CPU to the GPU.

It accepts the clauses:

- _create_: allocate memory on the GPU
- _copyin_: allocate memory on the GPU and initialize it with the values that the variable has on the CPU
- _attach_: attach a pointer to memory already in the GPU

The most common clauses are _create_ and _copyin_. The _attach_ clause is a bit more advanced and is not covered in this part.

Here is an example of syntax:
```fortran
!$acc enter data copyin(var1(:), ...) create(var2(:))
```


**Important:** the directive must appear after the allocation of the memory on the CPU.

```fortran
real, dimension(:,:,:), allocatable :: var
allocate(var(nx,ny,nz))
!$acc enter data create(var(:,:,:))
```


Otherwise you will have a runtime error.

#### `acc exit data`

By default, the memory allocated with `acc enter data` is freed at the end of the program.
However, GPU memory is limited (it depends on the card, but usually a few tens of GB are available),
so fine control over what is present may be necessary.

The directive `acc exit data <output data clause>` is used to remove data from the GPU.
It accepts the clauses:

- _copyout_: copy to the CPU the values that the variable has on the GPU, then free the memory on the GPU
- _delete_: free the memory on the GPU
- _detach_: remove the attachment of the pointer to the memory

**Important:** the directive must appear before memory deallocation on the CPU.

```fortran
!$acc exit data delete(var)
deallocate(var)
```

Otherwise you will have a runtime error.

#### Exercise

In this exercise you have to add data management directives in order to:

- allocate memory on the GPU for `array`
- perform the initialization on the GPU
- free the memory on the GPU.

Example stored in: `../../examples/Fortran/Data_Management_unstructured_exercise.f90`

In [None]:
%%idrrun -a
program allocate_array_separately
    use iso_fortran_env, only : INT32, REAL64
    use openacc
    implicit none

    real   (kind=REAL64), dimension(:), allocatable :: array
    integer(kind=INT32 )                            :: system_size
    integer(kind=INT32 )                            :: i
 
    system_size = 100000

    call init(array, system_size)

    do i = 1, system_size
        array(i) = dble(i)
    enddo

    write(0,*) array(42)

    deallocate(array)

    contains
     subroutine init(array, system_size)            
     
     real   (kind=REAL64), dimension(:), allocatable, intent(inout) :: array
     integer(kind=INT32 ), intent(in)                               :: system_size

     allocate(array(system_size))

     end subroutine init
end program allocate_array_separately

#### Solution

Example stored in: `../../examples/Fortran/Data_Management_unstructured_solution.f90`

In [None]:
%%idrrun -a
program allocate_array_separately
    use iso_fortran_env, only : INT32, REAL64
    use openacc
    implicit none

    real   (kind=REAL64), dimension(:), allocatable :: array
    integer(kind=INT32 )                            :: system_size
    integer(kind=INT32 )                            :: i
 
    system_size = 100000

    call init(array, system_size)

    !$acc parallel loop present(array(:))
    do i = 1, system_size
        array(i) = dble(i)
    enddo

    !$acc exit data copyout(array(:))
    write(0,*) array(42)

    deallocate(array)

    contains
     subroutine init(array, system_size)            
     
     real   (kind=REAL64), dimension(:), allocatable, intent(inout) :: array
     integer(kind=INT32 ), intent(in)                               :: system_size

     allocate(array(system_size))
     !$acc enter data create(array(1:system_size))

     end subroutine init
end program allocate_array_separately
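
The same pattern extends to code split across several routines. The sketch below is a hypothetical variation (the module `field_mod` and its routines are invented for illustration): the data is sent with `copyin` when it is created, used with `present` in another routine, read back with `acc update self`, and released with `delete` before the CPU deallocation.

```fortran
module field_mod
    use iso_fortran_env, only : INT32, REAL64
    implicit none
    contains
        subroutine field_init(field, n)
            real   (kind=REAL64), dimension(:), allocatable, intent(inout) :: field
            integer(kind=INT32 ), intent(in)                               :: n
            allocate(field(n))
            field(:) = 1.0_real64
            ! Allocate on the GPU and send the values initialized on the CPU
            !$acc enter data copyin(field(:))
        end subroutine field_init

        subroutine field_scale(field, factor)
            real   (kind=REAL64), dimension(:), intent(inout) :: field
            real   (kind=REAL64), intent(in)                  :: factor
            integer(kind=INT32 )                              :: i
            ! The data was placed on the GPU by field_init
            !$acc parallel loop present(field(:))
            do i = 1, size(field)
                field(i) = factor * field(i)
            enddo
        end subroutine field_scale

        subroutine field_finalize(field)
            real   (kind=REAL64), dimension(:), allocatable, intent(inout) :: field
            ! Free the GPU memory before the CPU deallocation
            !$acc exit data delete(field(:))
            deallocate(field)
        end subroutine field_finalize
end module field_mod

program unstructured_lifecycle_sketch
    use field_mod
    implicit none
    real(kind=REAL64), dimension(:), allocatable :: field

    call field_init(field, 1000_INT32)
    call field_scale(field, 3.0_real64)
    ! Only the element printed below is fetched from the GPU
    !$acc update self(field(1:1))
    write(0,*) field(1)
    call field_finalize(field)
end program unstructured_lifecycle_sketch
```

Keeping `enter data` next to `allocate` and `exit data` next to `deallocate` makes it easier to respect the ordering constraints described above.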

### Implicit data regions `acc declare`

An implicit data region is created for the program and for each subprogram (function or subroutine).
You can manage data inside these data regions with the `acc declare` directive.


```fortran
integer, parameter :: size = 1000000
real               :: array(size)
!$acc declare create(array(1:size))
```

In Fortran this directive can also be used for variables declared inside modules.
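
A minimal sketch of the module case follows. The module `coefficients` and its contents are hypothetical; a fixed-size array is used so that no assumption is needed about how allocation interacts with the directive.

```fortran
module coefficients
    use iso_fortran_env, only : INT32, REAL64
    implicit none
    integer(kind=INT32 ), parameter        :: ncoef = 4
    real   (kind=REAL64), dimension(ncoef) :: coef
    ! A GPU copy of coef is created for the lifetime of the program
    !$acc declare create(coef)
end module coefficients

program declare_module_sketch
    use coefficients
    implicit none
    integer(kind=INT32 ), parameter    :: n = 1000
    real   (kind=REAL64), dimension(n) :: y
    integer(kind=INT32 )               :: i

    ! coef is initialized on the CPU, then its GPU copy is refreshed
    coef(:) = [1.0_real64, 2.0_real64, 3.0_real64, 4.0_real64]
    !$acc update device(coef)

    ! coef is already on the GPU thanks to the declare directive
    !$acc parallel loop copyout(y(:)) present(coef)
    do i = 1, n
        y(i) = coef(1 + mod(i, ncoef))
    enddo

    write(0,*) y(1), y(4)
end program declare_module_sketch
```

Since `create` only allocates memory on the GPU, the `acc update device` directive is what actually makes the CPU values visible to the kernel.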

In addition to the regular data clauses, the `acc declare` directive accepts the `device_resident` clause for variables needed only on the GPU.

The example given below illustrates the usage of this clause.

##### Example

In this example we normalize rows (C) or columns (Fortran) of a square matrix.
The algorithm uses a temporary array (norms) which is only used on the GPU.

Example stored in: `../../examples/Fortran/Data_Management_unstructured_declare_example.f90`

In [None]:
%%idrrun -a
module utils
    use iso_fortran_env, only : REAL64, INT32
    contains
        subroutine normalize_cols(mat, mat_size)
            real    (kind=REAL64), allocatable, dimension(:,:), intent(inout) :: mat
            integer (kind=INT32 )                             , intent(in)    :: mat_size  
            real    (kind=REAL64)                                             :: norm = 0.0_real64
            integer (kind=INT32 )                                             :: i,j
            real    (kind=REAL64), allocatable, dimension(:) :: norms
            !$acc declare device_resident(norms(:))
            allocate(norms(mat_size))
!! Compute the L1 norm of each column
            !$acc parallel loop present(mat(:,:), norms(:))
            do j = 1, mat_size
                norm = 0
                !$acc loop reduction(+:norm)
                do i = 1, mat_size
                    norm = norm + mat(i,j)
                enddo
                norms(j) = norm
            enddo
!! Divide each element of the columns by the L1 norm
            !$acc parallel loop present(mat(:,:), norms(:))
            do j = 1, mat_size
                do i = 1, mat_size
                    mat(i,j) = mat(i,j)/norms(j)
                enddo
            enddo
        end subroutine normalize_cols
end module utils

program normalize
    use utils
    real    (kind=REAL64), allocatable, dimension(:,:)  :: mat
    real    (kind=REAL64)                               :: mat_sum
    integer (kind=INT32)                                :: mat_size=2000
    integer (kind=INT32)                                :: i, j

    allocate(mat(mat_size, mat_size))
    !$acc enter data create(mat)
    call random_number(mat)
    !$acc update device(mat(:,:))
    call normalize_cols(mat, mat_size)
!! Compute the sum of all elements of the matrix
    mat_sum = 0.0_real64  ! the reduction variable must be initialized
    !$acc parallel loop present(mat(:,:)) reduction(+:mat_sum)
    do j = 1, mat_size
        do i = 1, mat_size
            mat_sum = mat_sum + mat(i,j)
        enddo
    enddo
    !$acc exit data delete(mat)
    deallocate(mat)
    print *, mat_sum, "=", mat_size, "?"
end program normalize