# Introduction to GPU programming with directives

## What is a GPU?

Graphics Processing Units (GPUs) were designed to accelerate graphics processing and have boomed thanks to video games, which require ever more computing power.

For this course we will use the terminology from NVIDIA.

GPUs have a large number of computing cores that are very efficient at processing large matrices.
For example, the NVIDIA H100 GPU (Hopper generation) has 132 processors, called Streaming Multiprocessors (SMs), each with different kinds of specialized cores:

| Core Type | Number per SM |
|-----------|---------------|
|      FP32 |           128 |
|      FP64 |            64 |
|     INT32 |            64 |
| TensorCore|             4 |
|     Total |           260 |

This means that you have roughly 34k cores on one GPU (132 × 260 = 34,320).
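The arithmetic behind this estimate can be written as a small C sketch. The per-SM counts come from the table above; the function name `total_cores` is only an illustrative choice.

```c
/* Core counts per SM, copied from the table above. */
enum {
    FP32_PER_SM   = 128,
    FP64_PER_SM   = 64,
    INT32_PER_SM  = 64,
    TENSOR_PER_SM = 4
};

/* Total number of cores on a GPU with `num_sm` streaming multiprocessors. */
int total_cores(int num_sm) {
    int per_sm = FP32_PER_SM + FP64_PER_SM + INT32_PER_SM + TENSOR_PER_SM; /* 260 */
    return num_sm * per_sm;
}
```

Calling `total_cores(132)` gives 34320, which is the "roughly 34k" quoted above.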

By comparison, CPUs typically have at most a few tens of cores (for example, an AMD EPYC Rome has up to 64 cores).
However, comparing core counts directly is not fully meaningful, since the two architectures differ a lot.

The diagram below shows one streaming multiprocessor of an [NVIDIA H100 GPU](https://resources.nvidia.com/en-us-tensor-core).

<img alt="NVIDIA H100 GPU architecture" src="../../pictures/H100.png" style="float:none" width="50%"/>

## Programming models

<img alt="Programming models for GPUs" src="../../pictures/models.png" style="float:none" width="50%"/>

You can choose between several programming models to port your code to GPUs:

- Low-level programming languages (CUDA, OpenCL)
- Programming models (Kokkos)
- GPU libraries (CUDA accelerated libraries, MAGMA, Thrust, AmgX)
- Directive-based languages (OpenACC, OpenMP target)

Most of the time they are interoperable, and you can get the best of all worlds as long as you take enough time to learn everything :).

In this training course we focus on directive-based languages.

### Low level programming languages: CUDA, OpenCL

#### CUDA

The introduction of floating-point processing and programming capabilities on GPU cards at the turn of the century opened the door to general-purpose GPU (GPGPU) programming. GPGPU became widely accessible with the arrival of the [CUDA programming language](https://docs.nvidia.com/cuda/index.html) in 2007.

CUDA is a language close to C++ in which you have to manage everything that happens on the GPU yourself:

- Allocation of memory
- Data transfers
- Kernel (piece of code running on the GPU) execution

The kernel launch configuration (number of blocks and number of threads per block) also has to be written explicitly in your code.

```c
#define BLOCK_SIZE 512 // Illustrative value; must match the number of threads per block used at launch.

// Assumes the array length is a multiple of BLOCK_SIZE and that *c is zero before the launch.
__global__ void dot(float* a, float* b, float* c) {
    __shared__ float temp[BLOCK_SIZE]; // Shared by all threads of the block.
    int idx = threadIdx.x + blockDim.x * blockIdx.x; // Global index of the current thread.

    temp[threadIdx.x] = a[idx] * b[idx];
    __syncthreads(); // Synchronizes all threads in the block.

    if (threadIdx.x == 0) { // Only the first thread of the block computes the partial sum.
        float sum = 0.0f;
        for (int i = 0; i < BLOCK_SIZE; ++i)
            sum += temp[i];
        atomicAdd(c, sum); // Atomically adds the partial sum to the value pointed to by c.
    }
}
```
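To make the structure of this kernel easier to follow, here is a plain C sketch that emulates it sequentially on the host: the outer loop plays the role of the blocks, the inner loop the role of the threads of one block. The function name `dot_blockwise` and the tiny `BLOCK_SIZE` are illustrative assumptions, and the array length is assumed to be a multiple of `BLOCK_SIZE`.

```c
#include <stddef.h>

#define BLOCK_SIZE 4 /* Illustrative value, kept small for readability. */

/* Sequential emulation of the block-wise dot product.
   n is assumed to be a multiple of BLOCK_SIZE. */
float dot_blockwise(const float *a, const float *b, size_t n) {
    float c = 0.0f;
    for (size_t block = 0; block < n / BLOCK_SIZE; ++block) { /* blockIdx.x */
        float temp[BLOCK_SIZE];                                /* per-block shared array */
        for (size_t t = 0; t < BLOCK_SIZE; ++t) {              /* threadIdx.x */
            size_t idx = t + BLOCK_SIZE * block;               /* global index */
            temp[t] = a[idx] * b[idx];
        }
        /* On the GPU, the block-level synchronization happens here. */
        float sum = 0.0f;
        for (size_t i = 0; i < BLOCK_SIZE; ++i)                /* done by thread 0 */
            sum += temp[i];
        c += sum;                                              /* atomic add on the GPU */
    }
    return c;
}
```

On the GPU the iterations of both loops run concurrently, which is why the synchronization and the atomic addition are needed there but not in this sequential version.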

All of this means that if you want to port your code to GPUs with CUDA, you have to write specialized portions of code.
In return you potentially have access to the full processing power of the GPU, but you have to learn a new language.

Since CUDA is only available on NVIDIA GPUs, your code is not portable to other platforms.

[HIP](https://rocm.docs.amd.com/projects/HIP/) is a portable alternative to CUDA. HIP is developed by AMD. It comes with a ROCm backend for AMD GPUs and a CUDA backend for NVIDIA GPUs. There are ongoing efforts to add [HIP backends for Intel GPUs](https://github.com/CHIP-SPV/chipStar). The overhead of using the HIP API on NVIDIA GPUs is minimal. Syntactically, HIP is close to CUDA, and a tool exists for CUDA-to-HIP conversion.

#### OpenCL

[OpenCL](https://www.khronos.org/opencl/) has been available since 2009 and was developed to write code that can run on several kinds of architectures (CPU, GPU, FPGA, ...).

OpenCL is supported by the major hardware companies, so choosing this option alleviates the portability issue.
However, you still have to manage everything happening on the GPU by hand.

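```c
#define BLOCK_SIZE 512 // Illustrative value; must match the work-group size used at launch.

// OpenCL C has no built-in atomic add for floats, so it is emulated with a compare-and-swap loop.
void atomic_add_float(volatile __global float* addr, float val) {
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int*)addr, old_val.u, new_val.u) != old_val.u);
}

// Named dot_kernel because dot() is already a built-in OpenCL C function.
__kernel void dot_kernel(__global float* a, __global float* b, __global float* c) {
    __local float temp[BLOCK_SIZE];
    int idx = get_global_id(0); // Global index of the current work-item.
    temp[get_local_id(0)] = a[idx] * b[idx];
    barrier(CLK_LOCAL_MEM_FENCE); // Synchronizes all work-items in the work-group.
    if (get_local_id(0) == 0) { // Only the first work-item computes the partial sum.
        float sum = 0.0f;
        for (int i = 0; i < BLOCK_SIZE; ++i)
            sum += temp[i];
        atomic_add_float(c, sum); // Adds the partial sum to the value pointed to by c.
    }
}
```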

### Using libraries

Let's say that your code spends most of its time in a single type of computation (linear algebra, FFTs, etc).
In that case it is worth looking for specialized libraries developed for this kind of computation:

- [NVIDIA CUDA libraries](https://docs.nvidia.com/#nvidia-cuda-libraries): FFT, BLAS, Sparse algebra, ...
- [MAGMA](https://icl.cs.utk.edu/magma/): Dense linear algebra
- etc

The implementation cost is much lower than writing your own kernels, and you (hopefully) get very good performance.

### Directives

In the general case, where libraries do not cover a significant part of your code, you can use [OpenACC](https://www.openacc.org/) or [OpenMP 4.5 and above](https://www.openmp.org/) with the `target` construct.

With this approach you annotate your code with directives, which the compiler treats as comments and ignores unless you enable the corresponding compiler options.

For OpenACC:

```c
#pragma acc parallel loop
for (int i=0; i<size; ++i)
{
    // Code to offload to GPU
}
```

For OpenMP target:

```c
#pragma omp target teams distribute parallel for
for (int i=0; i<size; ++i)
{
    // Code to offload to GPU
}
```
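As a more complete illustration, here is a minimal sketch of the dot product shown earlier for CUDA and OpenCL, this time written with OpenACC. The function name `dot_acc` and the choice of clauses are assumptions made for this example, not a prescribed pattern.

```c
/* Dot product with an OpenACC directive.
   Without OpenACC compiler options the pragma is ignored and the loop
   simply runs sequentially on the CPU. */
float dot_acc(const float *a, const float *b, int n) {
    float sum = 0.0f;
    /* copyin: send a and b to the device; reduction: combine the partial sums. */
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The same source file builds as ordinary C. To activate the directive you would typically compile with something like `nvc -acc` (NVIDIA HPC SDK) or `gcc -fopenacc`; check your compiler's documentation for the exact offloading options.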

The implementation cost is much lower than with low-level programming languages, and you can usually get up to 95% of the performance you would get by writing your own specialized code.

Even though the modifications to your code are smaller than a full rewrite, keep in mind that some changes might be necessary to get the best possible performance.
Those changes can affect:

- the algorithms
- the data structures
- etc

## OpenACC

The first version of the OpenACC specification was released in November 2011.
It was created by:

- Cray
- NVIDIA
- PGI (now part of NVIDIA)
- CAPS

The [3.3 specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.3-final.pdf) was released in November 2022.

### Compilers for OpenACC

Several [compilers](https://www.openacc.org/tools) are available to compile OpenACC code:

- [Cray Programming environment](https://pubs.cray.com/category/pe-tile) (for Cray hardware)
- [NVIDIA HPC SDK](https://developer.nvidia.com/hpc-sdk) (formerly PGI)
- [GCC 12](https://gcc.gnu.org/gcc-12/): OpenACC 2.6 support available since version 10
- [AMD Sourcery CodeBench](https://docs.amd.com/r/en-US/ug821-zynq-7000-swdev/Sourcery-CodeBench-Lite-Edition-for-AMD-Cortex-A9-Compiler-Toolchain)
- etc

Be careful: the maturity of each compiler and the version of the specification it supports vary.

#### Disclaimer

The training course is based on version 2.7 of the specification.

Here we will mainly use the NVIDIA HPC compilers, available on their [website](https://developer.nvidia.com/hpc-sdk), which fully support [specification 2.7](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf).
You will also be able to test the GCC compilers, which support [specification 2.6](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf).

## OpenMP target

The `target` construct, which enables GPU offloading, was introduced in OpenMP 4.0 (July 2013).
Specification [4.5](https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf), released in November 2015, significantly improved it and is the minimum version considered here.
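A minimal sketch of the `target` construct on a dot product could look like the following. The function name `dot_omp` and the `map` clauses are assumptions chosen for this example.

```c
/* Dot product offloaded with OpenMP target.
   Without OpenMP compiler options the pragma is ignored and the loop
   runs sequentially on the CPU. */
float dot_omp(const float *a, const float *b, int n) {
    float sum = 0.0f;
    /* map(to:) sends a and b to the device; sum travels both ways. */
    #pragma omp target teams distribute parallel for reduction(+:sum) \
            map(to: a[0:n], b[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Offloading is typically enabled with options such as `nvc -mp=gpu` or `gcc -fopenmp` (with a GCC build configured for offloading). If no device is available, the region normally falls back to the host.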

The OpenMP [5.2](https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf) specification was released in November 2021.

### Compilers for OpenMP target

The list of compilers supporting OpenMP is available on the [OpenMP website](https://www.openmp.org/resources/openmp-compilers-tools/).
Check whether the `target` construct (offloading) is supported.

The main compilers which support offloading to GPU are:

- IBM XL for [C/C++](https://www.ibm.com/products/c-and-c-plus-plus-compiler-family) and [Fortran](https://www.ibm.com/products/xl-fortran-linux-compiler-power)
- [GCC since version 7](https://gcc.gnu.org/)
- [Clang](https://clang.llvm.org/)
- [NVIDIA HPC SDK](https://developer.nvidia.com/hpc-sdk) (formerly PGI)
- [Cray Programming environment](https://pubs.cray.com/category/pe-tile) (for Cray hardware)

## Host-driven language

OpenACC is a host-driven programming language.
This means that the host (usually a CPU) is in charge of launching everything that happens on the device (usually a GPU), including:

- Executing kernels
- Memory allocations
- Data transfers
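These three responsibilities of the host can be seen in a short OpenACC sketch. The function name `scale_and_shift` is hypothetical; the `data` construct and the `copy` clause are standard OpenACC.

```c
/* Scale then shift a vector using two kernels,
   keeping the data on the device between them. */
void scale_and_shift(float *x, int n, float alpha, float beta) {
    /* Memory allocation and data transfers: the host allocates x on the
       device, copies it in, and copies it back at the end of the region. */
    #pragma acc data copy(x[0:n])
    {
        /* Kernel execution: the host launches a first kernel. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            x[i] *= alpha;

        /* The host launches a second kernel; x is already on the device. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            x[i] += beta;
    }
}
```

Grouping both loops in a single `data` region avoids transferring `x` back and forth between the two kernels.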

<img alt="The CPU controls the GPU" src="../../pictures/CPUcontrolGPU.png" style="float:none" width="30%"/>

## Levels of parallelism

On the GPU, four levels of parallelism can be activated:

- Coarse grain: gang
- Fine grain: worker
- Vectorization: vector
- Sequential: seq

One gang is made of several workers, and each worker is a vector of threads (by default of length one).
You can increase the number of threads per worker by activating vectorization.

Within a kernel, all gangs run the same number of threads,
but this number can differ from one kernel to another.

So the total number of threads used by a kernel is $\text{Number of gangs} \times \text{Number of workers} \times \text{Vector length}$.
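The sketch below shows how these levels can be requested explicitly, together with the thread-count formula. The clause values (108 gangs, 4 workers, vectors of 32) and the function names are illustrative assumptions; in practice the compiler often picks good values by itself.

```c
/* Total number of threads of a kernel, following the formula above. */
long total_threads(long num_gangs, long num_workers, long vector_length) {
    return num_gangs * num_workers * vector_length;
}

/* Add two n x m matrices stored row by row. */
void matrix_add(const float *a, const float *b, float *c, int n, int m) {
    #pragma acc parallel num_gangs(108) num_workers(4) vector_length(32) \
            copyin(a[0:n*m], b[0:n*m]) copyout(c[0:n*m])
    {
        /* Rows are distributed over gangs and workers. */
        #pragma acc loop gang worker
        for (int i = 0; i < n; ++i) {
            /* Elements of a row are distributed over the vector lanes. */
            #pragma acc loop vector
            for (int j = 0; j < m; ++j) {
                c[i * m + j] = a[i * m + j] + b[i * m + j];
            }
        }
    }
}
```

With these values, `total_threads(108, 4, 32)` gives 13824 threads for the kernel.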

<img alt=" Levels of parallelism in OpenACC" src="../../pictures/examples_SM_gang_worker_vector.png" style="float:none" width="50%"/>

## Important notes

- There is no way to synchronize threads between gangs.
- The compiler may decide to add synchronization between the threads of one gang.
- The threads of a worker work in [SIMT](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads) mode.
  It means that all threads run the same instruction at the same time.
  For example, on NVIDIA GPUs, threads are grouped into warps of 32 threads.
- Usually NVIDIA compilers set the number of workers to one.

### Information about NVIDIA devices

The `nvaccelinfo` command gives useful information about the available devices.

For example, running it on the Jean Zay A100 partition gives:

```bash
$ nvaccelinfo

Device Number:                 7
Device Name:                   NVIDIA A100-SXM4-80GB
Device Revision Number:        8.0
Global Memory Size:            85051572224
Number of Multiprocessors:     108
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Clock Rate:                    1410 MHz
Concurrent Kernels:            Yes
Memory Clock Rate:             1593 MHz
L2 Cache Size:                 41943040 bytes
Max Threads Per SMP:           2048
Async Engines:                 3
Managed Memory:                Yes
Default Target:                cc80
...
```
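A few of these numbers can be combined to get a feeling for the scale of the device. The sketch below only does arithmetic on values copied from the output above; the function names are illustrative.

```c
/* Values copied from the nvaccelinfo output above (A100). */
enum {
    NUM_SM             = 108,  /* Number of Multiprocessors */
    MAX_THREADS_PER_SM = 2048, /* Max Threads Per SMP */
    WARP_SIZE          = 32    /* Warp Size */
};

/* Upper bound on the number of threads resident on the GPU at once. */
long max_resident_threads(void) {
    return (long)NUM_SM * MAX_THREADS_PER_SM;
}

/* Maximum number of warps resident on one SM. */
int warps_per_sm(void) {
    return MAX_THREADS_PER_SM / WARP_SIZE;
}

/* Number of warps needed to run `threads` threads (rounded up). */
int warps_needed(int threads) {
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}
```

For this A100, at most 108 × 2048 = 221184 threads can be resident at the same time, organized in up to 64 warps per SM. Since threads are scheduled by warps, a vector length of 33 would occupy two warps, which is why vector lengths are usually chosen as multiples of 32.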