# Performing several tasks at the same time on the GPU

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Atomic operations](./Atomic_operations.ipynb)
- [Manual building](./Manual_building.ipynb)
- [Data management](./Data_management.ipynb)

---

This part describes how to overlap several kernels on the GPU and how to overlap kernels with data transfers.
This feature is called asynchronous execution, and it can improve performance whenever your algorithm allows it.

On the GPU you can have several independent execution queues (called _streams_ or _activity queues_) running at the same time.
A _stream_ can be viewed as a pipeline that you feed with kernels and data transfers, which are then executed in the order they were submitted.

As a developer, you can decide to use several streams if the operations in your code are independent enough to run concurrently.
OpenACC lets you manage streams with two tools:

- _async_ clause
- _wait_ clause or directive

By default, only one stream is created.

## _async_ clause

Some directives accept the _async_ clause, which runs them on a stream other than the default one.
You can give it an integer argument (which can be a variable) to select a stream, so that several streams can run concurrently.

If you omit the optional integer then a "default" extra stream is used.

The directives which accept _async_ are:

- the compute constructs: `acc parallel`, `acc kernels`, `acc serial`
- the unstructured data directives: `acc enter data`, `acc exit data`, `acc update`
- the `acc wait` directive

For example, we can use 2 streams so that a data transfer and a kernel can overlap.

```fortran
integer :: stream1=1
integer :: stream2=2

!$acc enter data copyin(array(:)) async(stream1)
! Some stuff
!$acc parallel async(stream2)
    ! A wonderful kernel
!$acc end parallel
```
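The fragment above can be turned into a complete program. The following is only a minimal sketch: the program name, the arrays `a` and `b` and the size `n` are illustrative choices, and the final `!$acc wait` (the standalone directive described later in this part) is added so that the host does not read `b` before both streams have finished.

```fortran
program async_overlap
    implicit none
    integer, parameter :: n = 100000                ! illustrative size
    integer, parameter :: stream1 = 1, stream2 = 2
    real, allocatable  :: a(:), b(:)
    integer            :: i

    allocate(a(n), b(n))
    a = 1.0

    ! Allocate b on the device (synchronous); the kernel below fills it
    !$acc enter data create(b(:))

    ! stream1: host-to-device transfer of a
    !$acc enter data copyin(a(:)) async(stream1)

    ! stream2: an independent kernel that does not touch a,
    ! so it is allowed to overlap with the transfer
    !$acc parallel loop present(b(:)) async(stream2)
    do i = 1, n
        b(i) = 2.0 * real(i)
    enddo

    ! Block the host until every stream is drained
    !$acc wait

    !$acc exit data copyout(b(:)) delete(a(:))

    if (b(n) /= 2.0 * real(n)) error stop "unexpected result"
    print *, "b(n) =", b(n)
    deallocate(a, b)
end program async_overlap
```

Whether the transfer and the kernel really overlap depends on the hardware and the runtime; a profile is the way to check.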

## _wait_ clause

Running fast is important but having correct results is surely more important.

If a kernel needs the result of another kernel, or needs a data transfer to be complete, you have to wait for those operations to finish before launching it.
You can add the _wait_ clause (with an optional integer argument) to the following directives:

- the compute constructs: `acc parallel`, `acc kernels`, `acc serial`
- the unstructured data directives: `acc enter data`, `acc exit data`, `acc update`

This example implements 2 streams but this time the kernel needs the data transfer on stream1 to complete before being executed.

```fortran
integer :: stream1=1
integer :: stream2=2
!$acc enter data copyin(array(:)) async(stream1)
! Some stuff

!$acc parallel async(stream2) wait(stream1)
    ! A wonderful kernel
!$acc end parallel
```
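As a hedged illustration, the same dependency can be written as a self-contained program. The array size, the arithmetic and the final check are assumptions made for this sketch only.

```fortran
program wait_clause_demo
    implicit none
    integer, parameter :: n = 1000                  ! illustrative size
    integer, parameter :: stream1 = 1, stream2 = 2
    real               :: array(n)
    integer            :: i

    array = 1.0

    ! The transfer is queued on stream1
    !$acc enter data copyin(array(:)) async(stream1)

    ! The kernel is queued on stream2 but starts only once stream1 is done
    !$acc parallel loop present(array(:)) async(stream2) wait(stream1)
    do i = 1, n
        array(i) = array(i) + 41.0
    enddo

    ! The copy back is queued on stream2 as well, so it runs after the kernel
    !$acc exit data copyout(array(:)) async(stream2)

    ! The host blocks here until stream2 is drained
    !$acc wait(stream2)

    if (any(array /= 42.0)) error stop "unexpected result"
    print *, "array(1) =", array(1)
end program wait_clause_demo
```

Without `wait(stream1)` the kernel could start before the transfer is complete and read uninitialized device memory.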

Furthermore, you can wait for several streams to complete by giving a comma-separated list of integers as the clause argument.

This example implements 3 streams, and the last kernel needs both the kernel on stream3 and the data transfer on stream1 to complete before being executed.

```fortran
integer :: stream1=1
integer :: stream2=2
integer :: stream3=3
!$acc parallel loop async(stream3)
do i = 1, system_size
    ! Kernel launched on stream3
enddo

!$acc enter data copyin(array(:)) async(stream1)
! Some stuff

!$acc parallel async(stream2) wait(stream1, stream3)
    ! A wonderful kernel
!$acc end parallel
```


If you omit the clause argument, the operation waits until all previously launched asynchronous operations have completed.

```fortran
!$acc parallel wait
    ! A wonderful kernel
!$acc end parallel
```
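A possible complete example of the argument-free form is sketched below. It assumes two producer kernels on streams 1 and 2 and uses a structured data region for brevity; all names and values are illustrative.

```fortran
program wait_all_demo
    implicit none
    integer, parameter :: n = 1000   ! illustrative size
    real               :: a(n), b(n), c(n)
    integer            :: i

    !$acc data create(a(:), b(:)) copyout(c(:))

    ! Two independent kernels on two different streams
    !$acc parallel loop async(1)
    do i = 1, n
        a(i) = 1.0
    enddo

    !$acc parallel loop async(2)
    do i = 1, n
        b(i) = 41.0
    enddo

    ! No argument: wait for every stream, then run this kernel synchronously
    !$acc parallel loop wait
    do i = 1, n
        c(i) = a(i) + b(i)
    enddo

    !$acc end data

    if (any(c /= 42.0)) error stop "unexpected result"
    print *, "c(1) =", c(1)
end program wait_all_demo
```

Waiting on everything is the simplest option, but it can serialize more work than necessary; listing only the streams you depend on keeps unrelated work running.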

## _wait_ directive

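_wait_ is also available as a standalone directive. In the example below, the host blocks at `!$acc wait(stream3)` until all the work queued on stream3 has completed.
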
```fortran
integer :: stream1=1
integer :: stream2=2
integer :: stream3=3

!$acc parallel loop async(stream3)
do i = 1, system_size
    ! Kernel launched on stream3
enddo

!$acc enter data copyin(array(:)) async(stream1)
! Some stuff

!$acc wait(stream3)

!$acc parallel async(stream2)
    ! A wonderful kernel
!$acc end parallel
```
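Since the `acc wait` directive itself accepts the _async_ clause, a dependency between two streams can be expressed without blocking the host. The sketch below assumes a producer kernel on `stream1` and a consumer kernel on `stream2`; the arrays and values are illustrative.

```fortran
program wait_async_demo
    implicit none
    integer, parameter :: n = 1000                  ! illustrative size
    integer, parameter :: stream1 = 1, stream2 = 2
    real               :: x(n), y(n)
    integer            :: i

    !$acc enter data create(x(:), y(:))

    ! Producer kernel on stream1
    !$acc parallel loop present(x(:)) async(stream1)
    do i = 1, n
        x(i) = real(i)
    enddo

    ! stream2 waits for stream1; the host continues immediately
    !$acc wait(stream1) async(stream2)

    ! Consumer kernel on stream2, safe to read x
    !$acc parallel loop present(x(:), y(:)) async(stream2)
    do i = 1, n
        y(i) = 2.0 * x(i)
    enddo

    !$acc exit data copyout(y(:)) delete(x(:)) async(stream2)

    ! Host-side synchronization before y is read on the host
    !$acc wait(stream2)

    if (any(y /= [(2.0 * real(i), i = 1, n)])) error stop "unexpected result"
    print *, "y(n) =", y(n)
end program wait_async_demo
```

The only point where the host stops is the last `!$acc wait(stream2)`, so it remains free to queue other work in the meantime.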

## Exercise

In this exercise you have to compute the matrix product $C = A \times B$.

You have to add directives to:

- use the program lifetime unstructured data region to allocate memory on the GPU
- perform the matrix initialization on the GPU
- perform the matrix product on the GPU
- create and analyze a profile (add the option `--profile` to idrrun)
- save the .qdrep file
- check what can be done asynchronously and implement it
- create and analyze a profile (add the option `--profile` to idrrun)
- save the .qdrep file

Your solution is considered correct if no implicit actions (data transfers or synchronizations) are performed.

Example stored in: `../../examples/Fortran/async_async_exercise.f90`

```fortran
%%idrrun -a
program prod_mat
    use iso_fortran_env, only : INT32, REAL64
    implicit none
    integer (kind=INT32)               :: rank=5000
    real    (kind=REAL64), allocatable :: A(:,:), B(:,:), C(:,:)
    integer (kind=INT32)               :: i, j, k
    integer (kind=INT32)               :: streamA, streamB, streamC

    streamA = 1
    streamB = 2
    streamC = 3

    call create_mat(A, rank, streamA)
    call create_mat(B, rank, streamB)
    call create_mat(C, rank, streamC)

    call init_mat(A, rank, 3.0_real64 , streamA)
    call init_mat(B, rank, 14.0_real64, streamB)
    call init_mat(C, rank, 0.0_real64 , streamC)

    do j=1, rank
        do k=1, rank
            do i=1, rank
                C(i,j) = C(i,j) + A(i,k)*B(k,j)
            enddo
        enddo
    enddo
    print *, "Check that this is close to 42.0:", C(12,12)
    deallocate(A, B, C)
    contains
        subroutine create_mat(mat, rank, stream)
            real   (kind=REAL64), intent(inout), allocatable   :: mat(:,:)
            integer(kind=INT32 ), intent(in)                   :: rank, stream
            allocate(mat(rank,rank))
        end subroutine create_mat

        subroutine init_mat(mat, rank, diag, stream)
            real    (kind=REAL64), intent(inout)   :: mat(:,:)
            real    (kind=REAL64), intent(in)      :: diag
            integer (kind=INT32 ), intent(in)      :: rank, stream
            integer (kind=INT32 )                  :: i, j

            do j=1, rank
                do i=1, rank
                   mat(i,j) = 0.0_real64
                enddo
            enddo

            do j=1, rank
                mat(j,j) = diag
            enddo
        end subroutine init_mat
end program prod_mat
```

### Solution

Example stored in: `../../examples/Fortran/async_async_solution.f90`

```fortran
%%idrrun -a --profile
program prod_mat
    use iso_fortran_env, only : INT32, REAL64
    implicit none
    integer (kind=INT32)               :: rank=5000
    real    (kind=REAL64), allocatable :: A(:,:), B(:,:), C(:,:)
    integer (kind=INT32)               :: i, j, k
    integer (kind=INT32)               :: streamA, streamB, streamC

    streamA = 1
    streamB = 2
    streamC = 3

    call create_mat(A, rank, streamA)
    call create_mat(B, rank, streamB)
    call create_mat(C, rank, streamC)

    call init_mat(A, rank, 3.0_real64 , streamA)
    call init_mat(B, rank, 14.0_real64, streamB)
    call init_mat(C, rank, 0.0_real64 , streamC)

    !$acc parallel loop &
    !$acc present(A(:rank,:rank), B(:rank,:rank), C(:rank,:rank)) &
    !$acc gang wait(1,2,3)
    do j=1, rank
        do k=1, rank
            !$acc loop vector
            do i=1, rank
                !$acc atomic update
                C(i,j) = C(i,j) + A(i,k)*B(k,j)
            enddo
        enddo
    enddo
    !$acc exit data delete(A(:rank,:rank), B(:rank,:rank)) copyout(C(:rank,:rank))
    print *, "Check that this is close to 42.0:", C(12,12)
    deallocate(A, B, C)
    contains
        subroutine create_mat(mat, rank, stream)
            real   (kind=REAL64), intent(inout), allocatable   :: mat(:,:)
            integer(kind=INT32 ), intent(in)                   :: rank, stream
            allocate(mat(rank,rank))
            !$acc enter data create(mat(:rank,:rank)) async(stream)
        end subroutine create_mat

        subroutine init_mat(mat, rank, diag, stream)
            real    (kind=REAL64), intent(inout)   :: mat(:,:)
            real    (kind=REAL64), intent(in)      :: diag
            integer (kind=INT32 ), intent(in)      :: rank, stream
            integer (kind=INT32 )                  :: i, j

            !$acc parallel loop collapse(2) async(stream)
            do j=1, rank
                do i=1, rank
                   mat(i,j) = 0.0_real64
                enddo
            enddo

            !$acc parallel loop async(stream)
            do j=1, rank
                mat(j,j) = diag
            enddo
        end subroutine init_mat
end program prod_mat
```

In an ideal world, the solution would produce a profile like this one:

<img src="../../pictures/async.png" style="float:none"/>

### Comments

- Several threads will update the same memory location for C so you have to use an `acc atomic update`
- `collapse(2)` is used to fuse the 2 nested loops of the matrix initialization. It helps the compiler generate more efficient code

## Advanced NVIDIA compiler option to use Pinned Memory: `-gpu=pinned`

If you look at the profiles of your code (at this point "if" should be "when"), you can see that the memory transfers occur in chunks of more or less constant size.
Even if you transfer a large memory block, it is split into several smaller pieces which have the size of a memory page.

Memory not pinned:

<img alt="Nsight output unpinned memory" src="../../pictures/NSight-matmul_not_pinned.png" style="float:none"/>

Memory pinned:

<img alt="Nsight output pinned memory" src="../../pictures/NSight-matmul_pinned.png" style="float:none"/>

The transfer time is usually reduced when pinned memory is used.
However, this option can also cause segmentation faults, so test your code carefully!

### Bonus

You can launch the exercise with `%%idrrun -a --profile --accopts "cc70,pinned"` to test the effect of pinned memory.
You can save a profile to compare the 3 versions.