# Multi GPU programming with OpenACC

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Data Management](./Data_management.ipynb)

---

## Disclaimer

This part requires that you have a basic knowledge of OpenMP and/or MPI.

## Introduction

If you wish to run your code on multiple GPUs, several strategies are available. The simplest ones are to create either several threads or several MPI tasks, each one addressing one GPU.

## API description

For this part, the following API functions are needed:

- *acc_get_device_type()*: retrieve the type of accelerator available on the host
- *acc_get_num_devices(device_type)*: retrieve the number of accelerators of the given type
- *acc_set_device_num(id, device_type)*: set the id of the device of the given type to use

## MPI strategy

In this strategy, you follow a classical MPI procedure where several tasks are executed. We will use either the OpenACC directives or the API to make each task use one GPU.

Have a look at [examples/C/init_openacc.h](../../examples/C/init_openacc.h).

Because of a bug affecting MPI in the notebooks, you need to save the file before running the next cell.
It is a good way to practice manual building!
Please add the correct extension for the language you are running.

Example stored in: `../../examples/C/MultiGPU_mpi_example.c`

```c
%%idrrun -m 4 -a --gpus 2 --option "-cpp"
#include <stdio.h>
#include <mpi.h>
#include <openacc.h>
#include "../../examples/C/init_openacc.h"
int main(int argc, char** argv)
{
    // Useful for OpenMPI and GPU DIRECT
    initialisation_openacc();
    MPI_Init(&argc, &argv);

    // MPI Stuff
    int my_rank;
    int comm_size;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    int a[100] = {0};
    // Declared outside the #ifdef so that the printf below always compiles;
    // -1 means "no GPU selected"
    int my_gpu = -1;
    int device_type = -1;

    // OpenACC Stuff
    #ifdef _OPENACC
    acc_device_t acc_type = acc_get_device_type();
    device_type = (int) acc_type;
    int num_gpus = acc_get_num_devices(acc_type);
    // Round-robin mapping of the MPI ranks onto the available GPUs
    my_gpu = my_rank%num_gpus;
    acc_set_device_num(my_gpu, acc_type);
    // We check what GPU is really in use
    my_gpu = acc_get_device_num(acc_type);
    // Alternatively you can set the GPU number with #pragma acc set device_num(my_gpu)

    #pragma acc parallel
    {
        #pragma acc loop
        for(int i = 0; i < 100; ++i)
            a[i] = i;
    }
    #endif
    printf("Here is rank %d: I am using GPU %d of type %d. a[42] = %d\n", my_rank, my_gpu, device_type, a[42]);
    MPI_Finalize();
    return 0;
}
```


### Remarks

It is possible to have several tasks accessing the same GPU. This can be useful if one task is not enough to keep the GPU busy throughout the computation.

If you use NVIDIA GPUs, you should have a look at the [Multi Process Service](https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf).

## Multithreading strategy

Another way to use several GPUs is with multiple threads. Each thread uses one GPU, and several threads can share the same GPU.

Example stored in: `../../examples/C/MultiGPU_openmp_example.c`

```c
%%idrrun -a -t -g 4 --threads 4 --option "-cpp"
#include <stdio.h>
#include <openacc.h>
#include <omp.h>
int main(int argc, char** argv)
{
    #pragma omp parallel
    {
        int my_rank = omp_get_thread_num();
        // OpenACC Stuff
        #ifdef _OPENACC
        acc_device_t dev_type = acc_get_device_type();
        int num_gpus = acc_get_num_devices(dev_type);
        int my_gpu = my_rank%num_gpus;
        acc_set_device_num(my_gpu, dev_type);
        // We check what GPU is really in use
        my_gpu = acc_get_device_num(dev_type);
        // Alternatively you can set the GPU number with #pragma acc set device_num(my_gpu)
        printf("Here is thread %d: I am using GPU %d of type %d.\n", my_rank, my_gpu, dev_type);
        #endif
    }
}
```


## Exercise

1. Copy one cell from a previous notebook with a sequential code
2. Modify the code to use several GPUs
3. Check the correctness of the figure

## GPU to GPU data transfers

If you have several GPUs on your machine, they are likely interconnected. For NVIDIA GPUs, there are two flavors of connection: PCI Express or NVLink.
[NVLink](https://www.nvidia.com/fr-fr/data-center/nvlink/) is a fast interconnect between GPUs. Be careful, since it might not be available on your machine.
The main difference between the two connections is the bandwidth available for transfers between GPUs, which is higher with NVLink.

The GPUDirect feature of CUDA-aware MPI libraries allows direct data transfers between GPUs without an intermediate copy to the CPU memory. If you have access to a CUDA-aware MPI implementation with GPUDirect support, you should definitely adapt your code to benefit from this feature.

For information, during this training course we are using OpenMPI, which is CUDA-aware.
You can find a list of CUDA-aware implementations on the [NVIDIA website](https://developer.nvidia.com/mpi-solutions-gpus).

By default, the data transfers between GPUs are not direct. The scheme is the following:

1. The __origin__ task generates a Device to Host data transfer
2. The __origin__ task sends the data to the __destination__ task.
3. The __destination__ task generates a Host to Device data transfer

Here we can see that 2 transfers between Host and Device are necessary. This is costly and should be avoided if possible.

### `acc host_data` directive

To be able to transfer data directly between GPUs, we introduce the __host_data__ directive.
```c
#pragma acc host_data use_device(array)
```


Inside the region that follows this directive, the compiler uses the device address of the variable instead of its host address.
You can then use the pointer with your MPI calls.
__You have to call the MPI functions on the host.__

Here is an example of code using a direct GPU to GPU transfer.
```c
int size = 1000;
int* array = (int*) malloc(size*sizeof(int));
#pragma acc enter data create(array[0:1000])
// Perform some stuff on the GPU
#pragma acc parallel present(array[0:1000])
{
...
}
// Transfer the data between GPUs
if (my_rank == origin )
{
    #pragma acc host_data use_device(array)
    MPI_Send(array, size, MPI_INT, destination, tag, MPI_COMM_WORLD);
}
else if (my_rank == destination)
{
    #pragma acc host_data use_device(array)
    MPI_Recv(array, size, MPI_INT, origin, tag, MPI_COMM_WORLD, &status);
}

```

### Exercise

As an exercise, you can complete the following MPI code that measures the bandwidth between the GPUs:

1. Add directives to create the buffers on the GPU
2. Measure the effective bandwidth between GPUs by adding the directives necessary to transfer data from one GPU to another one in the following cases:

- Not using NVLink
- Using NVLink

Because of a bug affecting MPI in the notebooks, you need to save the file before running the next cell.
It is a good way to practice manual building!
Please add the correct extension for the language you are running.

Example stored in: `../../examples/C/MultiGPU_mpi_exercise.c`

```c
%%idrrun -m 4 -a --gpus 2 --option "-cpp"
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <openacc.h>
#include <math.h>
#include "../../examples/C/init_openacc.h"
int main(int argc, char** argv)
{
    initialisation_openacc();
    MPI_Init(&argc, &argv);
    fflush(stdout);
    double start;
    double end;

    int size = 2e8/8;

    double* send_buffer = (double*)malloc(size*sizeof(double));
    double* receive_buffer = (double*)malloc(size*sizeof(double));
    // MPI Stuff
    int my_rank;
    int comm_size;
    int reps = 5;
    double data_volume = (double)reps*(double)size*sizeof(double)*pow(1024,-3.0);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Status status;

    // OpenACC Stuff
    acc_device_t device_type = acc_get_device_type();
    int num_gpus = acc_get_num_devices(device_type);
    int my_gpu = my_rank%num_gpus;
    acc_set_device_num(my_gpu, device_type);
    for (int i = 0; i < comm_size; ++i)
    {
        for (int j = 0; j < comm_size; ++j)
        {
            if (my_rank == i && i != j)
            {
                start = MPI_Wtime();
                for (int k = 0; k < reps; ++k)
                    MPI_Ssend(send_buffer, size, MPI_DOUBLE, j, 0, MPI_COMM_WORLD);
            }
            if (my_rank == j && i != j)
            {
                for (int k = 0; k < reps; ++k)
                    MPI_Recv(receive_buffer, size, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            }
            if (my_rank == i && i != j)
            {
                end = MPI_Wtime();
                printf("bandwidth %d->%d: %10.5f GB/s\n", i, j, data_volume/(end-start));
            }
        }
    }
    MPI_Finalize();
    return 0;
}
```


#### Solution

Because of a bug affecting MPI in the notebooks, you need to save the file before running the next cell.
It is a good way to practice manual building!
Please add the correct extension for the language you are running.

Example stored in: `../../examples/C/MultiGPU_mpi_solution.c`

```c
%%idrrun -m 4 -a --gpus 2 --option "-cpp"
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <openacc.h>
#include <math.h>
#include "../../examples/C/init_openacc.h"
int main(int argc, char** argv)
{
    initialisation_openacc();
    MPI_Init(&argc, &argv);
    fflush(stdout);
    double start;
    double end;

    int size = 2e8/8;

    double* send_buffer = (double*)malloc(size*sizeof(double));
    double* receive_buffer = (double*)malloc(size*sizeof(double));
    // MPI Stuff
    int my_rank;
    int comm_size;
    int reps = 5;
    double data_volume = (double)reps*(double)size*sizeof(double)*pow(1024,-3.0);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Status status;

    // OpenACC Stuff
    acc_device_t device_type = acc_get_device_type();
    int num_gpus = acc_get_num_devices(device_type);
    int my_gpu = my_rank%num_gpus;
    acc_set_device_num(my_gpu, device_type);
    // Create the buffers once the GPU has been selected,
    // so that they are allocated on the device used by this task
    #pragma acc enter data create(send_buffer[:size], receive_buffer[:size])
    for (int i = 0; i < comm_size; ++i)
    {
        for (int j = 0; j < comm_size; ++j)
        {
            if (my_rank == i && i != j)
            {
                start = MPI_Wtime();
                // Pass the device address of send_buffer to MPI
                #pragma acc host_data use_device(send_buffer)
                for (int k = 0; k < reps; ++k)
                    MPI_Ssend(send_buffer, size, MPI_DOUBLE, j, 0, MPI_COMM_WORLD);
            }
            if (my_rank == j && i != j)
            {
                // Pass the device address of receive_buffer to MPI
                #pragma acc host_data use_device(receive_buffer)
                for (int k = 0; k < reps; ++k)
                    MPI_Recv(receive_buffer, size, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            }
            if (my_rank == i && i != j)
            {
                end = MPI_Wtime();
                printf("bandwidth %d->%d: %10.5f GB/s\n", i, j, data_volume/(end-start));
            }
        }
    }
    // Release the device and host buffers
    #pragma acc exit data delete(send_buffer[:size], receive_buffer[:size])
    free(send_buffer);
    free(receive_buffer);
    MPI_Finalize();
    return 0;
}
```