# Multi GPU programming with OpenACC

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Data Management](./Data_management.ipynb)

---

## Disclaimer

This part requires that you have a basic knowledge of OpenMP and/or MPI.

## Introduction

If you want your code to run on multiple GPUs, several strategies are available. The simplest ones are to create either several threads or several MPI tasks, each one addressing one GPU.

## API description

For this part, the following API functions are needed:

- *acc_get_device_type()*: retrieve the type of accelerator available on the host
- *acc_get_num_devices(device_type)*: retrieve the number of accelerators of the given type
- *acc_set_device_num(id, device_type)*: set the id of the device of the given type to use
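
As a minimal sketch of how these calls fit together, the program below queries the device type, counts the devices of that type, and selects each one in turn. It assumes an OpenACC-enabled compiler (for instance `nvfortran -acc` or `gfortran -fopenacc`); the program name `list_devices` is only illustrative. It also uses *acc_get_device_num(device_type)*, which returns the number of the device currently selected.

```fortran
program list_devices
    use openacc
    implicit none
    integer(kind=acc_device_kind) :: device_type
    integer                       :: num_devices, id

    ! Type of accelerator that the runtime uses for this process
    device_type = acc_get_device_type()
    ! Number of devices of that type visible to this process
    num_devices = acc_get_num_devices(device_type)
    print *, "Number of devices of the current type: ", num_devices

    ! Device numbers start at 0: select each device and read the selection back
    do id = 0, num_devices - 1
        call acc_set_device_num(id, device_type)
        print *, "Selected device ", acc_get_device_num(device_type)
    enddo
end program list_devices
```

The output depends on the machine: on a node without any accelerator the loop may simply not execute.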

## MPI strategy

In this strategy, you follow the classical MPI approach where several tasks are executed. We use either an OpenACC directive or the API to make each task use one GPU.

Have a look at the [examples/C/init_openacc.h](../../examples/C/init_openacc.h)

Because of a bug with MPI in the notebooks, you need to save the file before running the next cell.
It is a good way to practice manual building!
Please add the correct extension for the language you are running.

Example stored in: `../../examples/Fortran/MultiGPU_mpi_example.f90`

In [None]:
%%idrrun -m 4 -a --gpus 2 --option "-cpp"
! you should add ` --option "-cpp" ` as argument to the idrrun command
program multigpu
    use ISO_FORTRAN_ENV, only : INT32
    use mpi
    use openacc
    implicit none
    integer(kind=INT32), dimension(100) :: a
    integer                             :: comm_size, my_rank, code, i
    integer                             :: num_gpus, my_gpu
    integer(kind=acc_device_kind)       :: device_type

    ! Useful for OpenMPI and GPU DIRECT
    call initialisation_openacc()

    ! MPI stuff
    call MPI_Init(code)
    call MPI_Comm_size(MPI_COMM_WORLD, comm_size, code)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, code)

    ! OpenACC stuff
    #ifdef _OPENACC
    device_type = acc_get_device_type()
    num_gpus = acc_get_num_devices(device_type)
    my_gpu   = mod(my_rank,num_gpus)
    call acc_set_device_num(my_gpu, device_type)
    my_gpu   = acc_get_device_num(device_type)   
    ! Alternatively you can set the GPU number with !$acc set device_num(my_gpu)

    !$acc parallel loop
    do i = 1, 100
        a(i) = i
    enddo   
    #endif
    write(0,"(a13,i2,a17,i2,a9,i2,a10,i2)") "Here is rank ",my_rank,": I am using GPU ",my_gpu, & 
                                            " of type ",device_type,". a(42) = ",a(42)
    call MPI_Finalize(code)

    contains
        #ifdef _OPENACC
        subroutine initialisation_openacc
        use openacc
        
        type accel_info
            integer :: current_devices
            integer :: total_devices
        end type accel_info
       
        type(accel_info) :: info
        character(len=6) :: local_rank_env
        integer          :: local_rank_env_status, local_rank
        ! Initialisation of OpenACC
        !$acc init
 
       ! Recovery of the local rank of the process via the environment variable
       ! set by Slurm, as MPI_Comm_rank cannot be used here because this routine
       ! is used BEFORE the initialisation of MPI
       call get_environment_variable(name="SLURM_LOCALID", value=local_rank_env, status=local_rank_env_status)
       info%total_devices = acc_get_num_devices(acc_get_device_type())
       if (local_rank_env_status == 0) then
           read(local_rank_env, *) local_rank
           ! Definition of the GPU to be used via OpenACC
           call acc_set_device_num(local_rank, acc_get_device_type())
           info%current_devices = local_rank
       else
           print *, "Error : impossible to determine the local rank of the process"
           stop 1
       endif
       end subroutine initialisation_openacc
       #endif    

end program multigpu

### Remarks

It is possible to have several tasks accessing the same GPU. This can be useful if one task is not enough to keep the GPU busy throughout the computation.

If you use NVIDIA GPUs, you should have a look at the [Multi Process Service](https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf).
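
The example above assigns GPUs with `mod(my_rank, num_gpus)`, which is what makes several tasks share a GPU as soon as there are more tasks than GPUs. The sketch below isolates that round-robin mapping in plain Fortran so it can be checked without any GPU; the module and function names (`gpu_mapping`, `gpu_for_rank`, `ranks_on_gpu`) are hypothetical helpers, not part of OpenACC or MPI.

```fortran
module gpu_mapping
    implicit none
contains
    ! Round-robin assignment: ranks 0, 1, 2, ... are spread over GPUs
    ! 0 .. num_gpus-1 and wrap around when there are more ranks than GPUs.
    pure integer function gpu_for_rank(rank, num_gpus) result(gpu)
        integer, intent(in) :: rank, num_gpus
        gpu = mod(rank, num_gpus)
    end function gpu_for_rank

    ! Number of ranks that end up sharing a given GPU.
    pure integer function ranks_on_gpu(gpu, num_ranks, num_gpus) result(n)
        integer, intent(in) :: gpu, num_ranks, num_gpus
        integer             :: rank
        n = 0
        do rank = 0, num_ranks - 1
            if (gpu_for_rank(rank, num_gpus) == gpu) n = n + 1
        enddo
    end function ranks_on_gpu
end module gpu_mapping
```

For 4 tasks and 2 GPUs, `gpu_for_rank` gives 0, 1, 0, 1 for ranks 0 to 3, so `ranks_on_gpu(0, 4, 2)` is 2: each GPU is shared by two tasks.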

## Multithreading strategy

Another way to use several GPUs is with multiple threads. Each thread will use one GPU and several threads can share 1 GPU.

Example stored in: `../../examples/Fortran/MultiGPU_openmp_example.f90`

In [None]:
%%idrrun -a -t -g 4 --threads 4 --option "-cpp"
! you should add ` --option "-cpp" ` as argument to the idrrun command
program MultiGPU_openmp
    use ISO_FORTRAN_ENV, only : INT32
    use OMP_LIB
    use openacc
    implicit none    
    integer(kind=INT32)           :: my_rank
    integer                       :: num_gpus, my_gpu
    integer(kind=acc_device_kind) :: device_type

    !$omp parallel private(my_rank, num_gpus, my_gpu, device_type)
        my_rank = omp_get_thread_num()
        ! OpenACC Stuff
        #ifdef _OPENACC
        device_type = acc_get_device_type()
        num_gpus = acc_get_num_devices(device_type)
        my_gpu   = mod(my_rank,num_gpus)
        call acc_set_device_num(my_gpu, device_type)
        ! We check what GPU is really in use
        my_gpu = acc_get_device_num(device_type)
        ! Alternatively you can set the GPU number with !$acc set device_num(my_gpu)
        write(0,"(a15,i2,a18,i2,a9,i2)") "Here is thread ",my_rank," : I am using GPU ",my_gpu," of type ",device_type
        #endif  
    !$omp end parallel
end program MultiGPU_openmp


## Exercise

1. Copy one cell from a previous notebook with a sequential code
2. Modify the code to use several GPUs
3. Check the correctness of the figure

## GPU to GPU data transfers

If you have several GPUs on your machine, they are likely interconnected. For NVIDIA GPUs, there are two kinds of connection: PCI Express or NVLink.
[NVLink](https://www.nvidia.com/fr-fr/data-center/nvlink/) is a fast interconnect between GPUs. Be careful since it might not be available on your machine.
The main difference between the two connections is the bandwidth available for transfers between GPUs, which is higher for NVLink.

The GPUDirect feature of CUDA-aware MPI libraries allows direct data transfers between GPUs without an intermediate copy to the CPU memory. If you have access to a CUDA-aware MPI implementation with GPUDirect support, you should definitely adapt your code to benefit from this feature.

For information, during this training course we are using OpenMPI, which is CUDA-aware.
You can find a list of CUDA-aware implementations on the [NVIDIA website](https://developer.nvidia.com/mpi-solutions-gpus).

By default, the data transfers between GPUs are not direct. The scheme is the following:

1. The __origin__ task generates a Device to Host data transfer
2. The __origin__ task sends the data to the __destination__ task.
3. The __destination__ task generates a Host to Device data transfer

Here we can see that 2 transfers between Host and Device are necessary. This is costly and should be avoided if possible.
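
The three steps above can be sketched with `update` directives around ordinary MPI calls. This is only an illustration of the staged (non-direct) scheme: the buffer size `n`, the ranks `origin` and `destination` and the program name are arbitrary choices, each task keeps its default GPU, and the program needs at least 2 MPI tasks.

```fortran
program staged_transfer
    use mpi
    use openacc
    implicit none
    integer, parameter                  :: n = 1000      ! arbitrary buffer size
    integer, parameter                  :: origin = 0, destination = 1, tag = 0
    integer, dimension(n)               :: array
    integer, dimension(MPI_STATUS_SIZE) :: mpi_stat
    integer                             :: my_rank, comm_size, code, i

    call MPI_Init(code)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, code)
    call MPI_Comm_size(MPI_COMM_WORLD, comm_size, code)
    if (comm_size < 2) then
        if (my_rank == 0) print *, "This sketch needs at least 2 MPI tasks"
        call MPI_Finalize(code)
        stop
    endif

    array = 0
    ! Each task creates its own device copy of the buffer
    !$acc enter data copyin(array)

    if (my_rank == origin) then
        ! Fill the buffer on the GPU of the origin task
        !$acc parallel loop present(array)
        do i = 1, n
            array(i) = i
        enddo
        ! 1. Device to Host transfer on the origin task
        !$acc update host(array)
        ! 2. Host to host message
        call MPI_Send(array, n, MPI_INTEGER, destination, tag, MPI_COMM_WORLD, code)
    else if (my_rank == destination) then
        call MPI_Recv(array, n, MPI_INTEGER, origin, tag, MPI_COMM_WORLD, mpi_stat, code)
        ! 3. Host to Device transfer on the destination task
        !$acc update device(array)
    endif

    !$acc exit data delete(array)
    call MPI_Finalize(code)
end program staged_transfer
```

Both `update` directives move the whole buffer across the Host/Device link, which is the cost that a direct GPU to GPU transfer avoids.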

### `acc host_data` directive

To be able to transfer data directly between GPUs, we introduce the __host_data__ directive.
```fortran
!$acc host_data use_device(array)
...
!$acc end host_data
```


This directive makes the device address of the listed variables available in host code: inside the region, any reference to `array` resolves to the address of its copy on the device.
You can then pass this address to your MPI calls.
__You have to call the MPI functions on the host.__

Here is an example of code using a direct GPU to GPU transfer.
```fortran
integer, parameter :: system_size = 1000
integer, dimension(system_size) :: array
! Create the array on the GPU
!$acc enter data create(array(1:system_size))
! Perform some stuff on the GPU
!$acc parallel present(array(1:system_size))
...
!$acc end parallel
! Transfer the data between GPUs
if (my_rank .eq. origin ) then
    ! Inside host_data, array refers to the copy on the device
    !$acc host_data use_device(array)
    call MPI_Send(array, system_size, MPI_INTEGER, destination, tag, MPI_COMM_WORLD, code)
    !$acc end host_data
endif
if (my_rank .eq. destination) then
    !$acc host_data use_device(array)
    call MPI_Recv(array, system_size, MPI_INTEGER, origin, tag, MPI_COMM_WORLD, status, code)
    !$acc end host_data
endif

```

### Exercise

As an exercise, you can complete the following MPI code that measures the bandwidth between the GPUs:

1. Add directives to create the buffers on the GPU
2. Measure the effective bandwidth between GPUs by adding the directives necessary to transfer data from one GPU to another one in the following cases:

- Not using NVLink
- Using NVLink
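
The exercise code reports the bandwidth as `data_volume/(finish-start)`, where `data_volume` is the total size of the `reps` messages. The sketch below reproduces that arithmetic in a small module; the names `bandwidth_tools`, `volume_gib` and `bandwidth_gib_per_s` are hypothetical helpers introduced for illustration. Note that dividing by 1024³ gives a volume in GiB, even though the exercise prints the result with a "GB/s" label.

```fortran
module bandwidth_tools
    use ISO_FORTRAN_ENV, only : REAL64
    implicit none
contains
    ! Volume in GiB of `reps` messages of `n` double precision values (8 bytes each)
    pure function volume_gib(n, reps) result(volume)
        integer, intent(in) :: n, reps
        real(kind=REAL64)   :: volume
        volume = real(n, REAL64) * real(reps, REAL64) * 8.0_REAL64 / 1024.0_REAL64**3
    end function volume_gib

    ! Effective bandwidth in GiB/s for a volume (GiB) moved in `elapsed` seconds
    pure function bandwidth_gib_per_s(volume, elapsed) result(bandwidth)
        real(kind=REAL64), intent(in) :: volume, elapsed
        real(kind=REAL64)             :: bandwidth
        bandwidth = volume / elapsed
    end function bandwidth_gib_per_s
end module bandwidth_tools
```

With the values of the exercise (`system_size = 2e8/8`, that is 25 000 000 elements, and `reps = 5`), the volume is 10⁹ bytes, or about 0.93 GiB per measured pair of tasks.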

Because of a bug with MPI in the notebooks, you need to save the file before running the next cell.
It is a good way to practice manual building!
Please add the correct extension for the language you are running.

Example stored in: `../../examples/Fortran/MultiGPU_mpi_exercise.f90`

In [None]:
%%idrrun -m 4 -a --gpus 2 --option "-cpp"
! you should add ` --option "-cpp" ` as argument to the idrrun command
program MultiGPU_exercice
    use ISO_FORTRAN_ENV, only : INT32, REAL64
    use mpi
    use openacc
    implicit none
    real   (kind=REAL64), dimension(:), allocatable :: send_buffer, receive_buffer
    real   (kind=REAL64)                            :: start, finish , data_volume   
    integer(kind=INT32 ), parameter                 :: system_size = 2e8/8
    integer                                         :: comm_size, my_rank, code, reps, i, j, k
    integer                                         :: num_gpus, my_gpu
    integer(kind=acc_device_kind)                   :: device_type
    integer, dimension(MPI_STATUS_SIZE)             :: mpi_stat

    ! Useful for OpenMPI and GPU DIRECT
    call initialisation_openacc()

    ! MPI stuff
    reps = 5
    data_volume = dble(reps*system_size)*8*1024_real64**(-3.0)

    call MPI_Init(code)
    call MPI_Comm_size(MPI_COMM_WORLD, comm_size, code)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, code)
    allocate(send_buffer(system_size), receive_buffer(system_size))

    ! OpenACC stuff
    #ifdef _OPENACC
    device_type = acc_get_device_type()
    num_gpus = acc_get_num_devices(device_type)
    my_gpu   = mod(my_rank,num_gpus)
    call acc_set_device_num(my_gpu, device_type)
    #endif

    do j = 0, comm_size - 1
        do i = 0, comm_size - 1
            if ( (my_rank .eq. j) .and. (j .ne. i) ) then
                start = MPI_Wtime()
                do k = 1, reps
                    call MPI_Send(send_buffer,system_size, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, code)
                enddo
            endif 
            if ( (my_rank .eq. i) .and. (i .ne. j) ) then
                do k = 1, reps
                    call MPI_Recv(receive_buffer, system_size, MPI_DOUBLE, j, 0, MPI_COMM_WORLD, mpi_stat, code)
                enddo
            endif
            if ( (my_rank .eq. j) .and. (j .ne. i) ) then
                finish = MPI_Wtime()
                write(0,"(a11,i2,a2,i2,a2,f20.8,a5)") "bandwidth ",j,"->",i,": ",data_volume/(finish-start)," GB/s"
            endif
        enddo
    enddo
    
    deallocate(send_buffer, receive_buffer)
    
    call MPI_Finalize(code)

    contains
        #ifdef _OPENACC
        subroutine initialisation_openacc
            use openacc
            implicit none
            type accel_info
                integer :: current_devices
                integer :: total_devices
            end type accel_info

            type(accel_info) :: info
            character(len=6) :: local_rank_env
            integer          :: local_rank_env_status, local_rank
        ! Initialisation of OpenACC
            !$acc init

        ! Recovery of the local rank of the process via the environment variable
        ! set by Slurm, as MPI_Comm_rank cannot be used here because this routine
        ! is used BEFORE the initialisation of MPI
            call get_environment_variable(name="SLURM_LOCALID", value=local_rank_env, status=local_rank_env_status)
            info%total_devices = acc_get_num_devices(acc_get_device_type())
            if (local_rank_env_status == 0) then
                read(local_rank_env, *) local_rank
                ! Definition of the GPU to be used via OpenACC
                call acc_set_device_num(local_rank, acc_get_device_type())
                info%current_devices = local_rank
            else
                print *, "Error : impossible to determine the local rank of the process"
                stop 1
            endif
        end subroutine initialisation_openacc
        #endif 

end program MultiGPU_exercice

#### Solution

Because of a bug with MPI in the notebooks, you need to save the file before running the next cell.
It is a good way to practice manual building!
Please add the correct extension for the language you are running.

Example stored in: `../../examples/Fortran/MultiGPU_mpi_solution.f90`

In [None]:
%%idrrun -m 4 -a --gpus 2 --option "-cpp"
! you should add ` --option "-cpp" ` as argument to the idrrun command
program MultiGPU_solution
    use ISO_FORTRAN_ENV, only : INT32, REAL64
    use mpi
    use openacc
    implicit none
    real   (kind=REAL64), dimension(:), allocatable :: send_buffer, receive_buffer
    real   (kind=REAL64)                            :: start, finish , data_volume   
    integer(kind=INT32 ), parameter                 :: system_size = 2e8/8
    integer                                         :: comm_size, my_rank, code, reps, i, j, k
    integer                                         :: num_gpus, my_gpu
    integer(kind=acc_device_kind)                   :: device_type
    integer, dimension(MPI_STATUS_SIZE)             :: mpi_stat

    ! Useful for OpenMPI and GPU DIRECT
    call initialisation_openacc()

    ! MPI stuff
    reps = 5
    data_volume = dble(reps*system_size)*8*1024_real64**(-3.0)

    call MPI_Init(code)
    call MPI_Comm_size(MPI_COMM_WORLD, comm_size, code)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, code)
    allocate(send_buffer(system_size), receive_buffer(system_size))
    ! OpenACC stuff
    #ifdef _OPENACC
    device_type = acc_get_device_type()
    num_gpus = acc_get_num_devices(device_type)
    my_gpu   = mod(my_rank,num_gpus)
    call acc_set_device_num(my_gpu, device_type)
    #endif
    ! Create the buffers once the GPU is selected, so that they live on the device used below
    !$acc enter data create(send_buffer(1:system_size), receive_buffer(1:system_size))

    do j = 0, comm_size - 1
        do i = 0, comm_size - 1
            if ( (my_rank .eq. j) .and. (j .ne. i) ) then
                start = MPI_Wtime()
                !$acc host_data use_device(send_buffer)
                do k = 1, reps
                    call MPI_Send(send_buffer,system_size, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, code)
                enddo
                !$acc end host_data
            endif 
            if ( (my_rank .eq. i) .and. (i .ne. j) ) then
                !$acc host_data use_device(receive_buffer)
                do k = 1, reps
                    call MPI_Recv(receive_buffer, system_size, MPI_DOUBLE, j, 0, MPI_COMM_WORLD, mpi_stat, code)
                enddo
                !$acc end host_data
            endif
            if ( (my_rank .eq. j) .and. (j .ne. i) ) then
                finish = MPI_Wtime()
                write(0,"(a11,i2,a2,i2,a2,f20.8,a5)") "bandwidth ",j,"->",i,": ",data_volume/(finish-start)," GB/s"
            endif
        enddo
    enddo
    !$acc exit data delete(send_buffer, receive_buffer)
    deallocate(send_buffer, receive_buffer)
    
    call MPI_Finalize(code)

    contains
        #ifdef _OPENACC
        subroutine initialisation_openacc
            use openacc
            implicit none
            type accel_info
                integer :: current_devices
                integer :: total_devices
            end type accel_info

            type(accel_info) :: info
            character(len=6) :: local_rank_env
            integer          :: local_rank_env_status, local_rank
        ! Initialisation of OpenACC
            !$acc init

        ! Recovery of the local rank of the process via the environment variable
        ! set by Slurm, as MPI_Comm_rank cannot be used here because this routine
        ! is used BEFORE the initialisation of MPI
            call get_environment_variable(name="SLURM_LOCALID", value=local_rank_env, status=local_rank_env_status)
            info%total_devices = acc_get_num_devices(acc_get_device_type())
            if (local_rank_env_status == 0) then
                read(local_rank_env, *) local_rank
                ! Definition of the GPU to be used via OpenACC
                call acc_set_device_num(local_rank, acc_get_device_type())
                info%current_devices = local_rank
            else
                print *, "Error : impossible to determine the local rank of the process"
                stop 1
            endif
        end subroutine initialisation_openacc
        #endif 

end program MultiGPU_solution