# Accelerating CUDA C++ Applications with Multiple GPUs

Welcome to the material for _Accelerating CUDA C++ Applications with Multiple GPUs_, prepared for the `Instruction certification application` at NVIDIA.

## Personal Information:

Let me introduce myself:

- My name is **Murilo Boratto**;
- I am a **Computer Engineer**;
- I earned my PhD ten years ago, in **HPC** (High Performance Computing), at the **Universitat Politècnica de València**, in Spain;
- Currently I work at the **Supercomputing Center SENAI CIMATEC**, in the city of Salvador, Bahia, Brazil. My role at this center is **Research Leader in HPC and Parallel Computing projects**. My research line is **accelerated computing**, mainly optimizing and porting sequential code to GPU environments;
- I have skills in GPU programming using the APIs CUDA, OpenACC, CUDAWARE, NCCL, MPI, OpenMP ...

That is a short summary about me!



## `JupyterLab`

### What is JupyterLab?

JupyterLab is an **`interface for interactive computing that runs in the browser and can be used to compile and execute parallel codes`**. It extends the functionality of traditional Jupyter Notebooks by offering a more flexible and versatile interface, combining components such as notebooks, terminals, text editors, and data file viewers in one unified workspace in the browser.

### What are the advantages of using JupyterLab?

1. Flexible Interface

2. Rich Text and Code Editing

### In which areas can I use JupyterLab?

1. HPC, Data Science, and Machine Learning

2. Software Development and Code Prototyping

## `CUDA thread hierarchy`

### What is CUDA thread hierarchy?

In CUDA, the thread hierarchy plays a critical role in organizing the parallel execution of code and optimizing performance. It provides a **`structured way to launch and manage threads in a grid of blocks`**, which are executed on the GPU.

### What does the CUDA thread hierarchy consist of?

The CUDA thread hierarchy consists of three main levels: **threads**, **blocks**, and **grids**. Each level serves a specific purpose in organizing the parallel execution of tasks:

▶ **Thread**: **The smallest unit of execution**. Each thread executes a kernel function independently. Threads can access their own local memory, shared memory within the block, and global memory accessible by all threads.

▶ **Block**: **Threads are grouped into blocks**. Each block contains a fixed number of threads, typically arranged in 1D, 2D, or 3D. Blocks execute independently, making it possible for different blocks to be executed on different streaming multiprocessors (SMs). Threads within a block can cooperate with each other using shared memory, which is fast and shared among threads in the same block.

▶ **Grid**: **A grid is composed of multiple blocks**. When a kernel is launched, it is executed across all the threads in all the blocks of the grid. Like blocks, grids can also be one-, two-, or three-dimensional, providing flexibility in the organization of threads.
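The relation between the three levels comes down to a little index arithmetic. The sketch below is host-only C++ (it also compiles with `nvcc`) and uses two hypothetical helper names, `blocksFor` and `globalIndex`, that are not part of CUDA. It only mirrors, on the CPU, how many blocks a 1D grid needs to cover `n` elements and how a kernel derives a global index from `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <cassert>

// Number of blocks needed so that numBlocks * threadsPerBlock >= n
// (integer ceiling division, the usual way to size a 1D grid).
int blocksFor(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Global index of a thread in a 1D launch. Inside a kernel this is
// written as blockIdx.x * blockDim.x + threadIdx.x.
int globalIndex(int block, int threadsPerBlock, int thread)
{
    return block * threadsPerBlock + thread;
}
```

For instance, 1000 elements with 256 threads per block need 4 blocks, which launches 1024 threads. The 24 surplus threads are the reason kernels guard their work with a bounds check such as `if (i < n)`.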

### Example of the CUDA Thread Hierarchy

#### Matrix Multiply

In [None]:
%%writefile mm.cu
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

__global__ void kernel(int *A, int *B, int *C, int size)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;

  if((i < size) && (j < size))
    for(int k = 0; k < size; k++)
       C[i * size + j] += A[i * size + k] * B[k * size + j];

}

void initializeMatrix(int *A, int size)
{
  for(int i = 0; i < size; i++)
    for(int j = 0; j < size; j++)
      A[i * size + j] = rand() % (10 - 1) * 1;
}

void printMatrix(int *A, int size)
{
  for(int i = 0; i < size; i++){
    for(int j = 0; j < size; j++)
      printf("%d\t", A[i * size + j]);
    printf("\n");
  }
  printf("\n");
}

int main(int argc, char **argv)
{
  if (argc < 3)
  {
    printf("%s [SIZE] [BLOCKSIZE]\n", argv[0]);
    exit(-1);
  }

  int n = atoi(argv[1]);
  int blockSize = atoi(argv[2]);
  double t1, t2;

 //Memory Allocation in the Host
  int  *A = (int *) malloc (sizeof(int)*n*n);
  int  *B = (int *) malloc (sizeof(int)*n*n);
  int  *C = (int *) calloc ((size_t) n * n, sizeof(int)); // zero-initialized, because the kernel accumulates into C

  initializeMatrix(A, n);
  initializeMatrix(B, n);

 //printMatrix(A, n);
 //printMatrix(B, n);
 //printMatrix(C, n);

 // Memory Allocation in the Device
  int *d_A, *d_B, *d_C;
  cudaMalloc((void **) &d_A, n * n * sizeof(int) ) ;
  cudaMalloc((void **) &d_B, n * n * sizeof(int) ) ;
  cudaMalloc((void **) &d_C, n * n * sizeof(int) ) ;

  t1 = omp_get_wtime();

 // Copy of data from host to device
  cudaMemcpy( d_A, A, n * n * sizeof(int), cudaMemcpyHostToDevice ) ;
  cudaMemcpy( d_B, B, n * n * sizeof(int), cudaMemcpyHostToDevice ) ;
  cudaMemcpy( d_C, C, n * n * sizeof(int), cudaMemcpyHostToDevice ) ;

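 // 2D Computational Grid (integer ceiling division covers sizes that are not a multiple of blockSize)
  dim3 dimGrid( (n + blockSize - 1) / blockSize, (n + blockSize - 1) / blockSize );
  dim3 dimBlock( blockSize, blockSize);

 // Launch the kernel with the device pointers, not the host ones
  kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, n);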

 // Copy of data from device to host
  cudaMemcpy( C, d_C, n * n * sizeof(int), cudaMemcpyDeviceToHost ) ;

  t2 = omp_get_wtime();

  printf("%d\t%f\n", n, t2-t1);

 //printMatrix(A, n);
 //printMatrix(B, n);
 //printMatrix(C, n);

// Free the device memory
 cudaFree(d_A) ;
 cudaFree(d_B) ;
 cudaFree(d_C) ;

// Free the host memory
 free(A);
 free(B);
 free(C);

 return 0;
}

In [None]:
!nvcc mm.cu -o mm -Xcompiler -fopenmp -O3

In [None]:
!./mm 10000 32

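**threadIdx.x**: Index of the thread within its block, along the x dimension.

**blockIdx.x**: Index of the block within the grid, along the x dimension.

**blockDim.x**: Number of threads per block along the x dimension.

The `.y` members play the same role for the second dimension, as used to compute `j` in the kernel above. Because the block is `blockSize` x `blockSize` threads, `blockSize` must not exceed 32 (at most 1024 threads per block).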

### What are the advantages of the CUDA Thread Hierarchy?

▶ **Scalability**: the **`ability to efficiently handle increased workloads by adding more computational resources`**.

▶ **Flexible Mapping**: the **`ability to assign computational tasks or processes to available resources`**.

▶ **Efficient Memory Usage**: the **`efficient use of shared memory, which reduces memory latency`**.

The CUDA thread hierarchy is fundamental to understanding how to design and optimize GPU-accelerated programs. Properly leveraging this structure can lead to significant performance improvements in parallel computing tasks.

## `CUDA kernel`

### What is a CUDA kernel?

**It is a function that is executed on the GPU**. To execute a kernel, you launch it from the CPU code using a special syntax (`<<<...>>>`) that specifies the number of blocks and the number of threads per block. This launch configuration determines the overall parallelism, with each block containing multiple threads.
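As an illustration of the launch syntax, here is a minimal sketch of a complete program. The kernel name `vectorAdd`, the problem size, and the 256 threads per block are assumptions chosen for the example, not values taken from this material.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the last block may be only partially used
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;           // hypothetical problem size
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    // Launch configuration: enough 256-thread blocks to cover n elements.
    const int threadsPerBlock = 256;
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // This blocking copy waits for the kernel to finish before copying.
    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", h_c[0]);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```

Error checking is left out here to keep the launch itself visible.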

## `Concurrent CUDA streams and their behaviors`

### What are Concurrent CUDA Streams?

In CUDA, **streams are sequences of operations that are executed on the GPU in order**. Using multiple concurrent CUDA streams allows for overlapping kernel execution, memory transfers, and other operations, improving performance by enabling asynchronous execution. 

### How to Create and Use CUDA Streams?

#### Creating a CUDA Stream:

CUDA streams are created using the **cudaStreamCreate()** function and destroyed with cudaStreamDestroy().

~~~c++
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
~~~

#### Using CUDA Streams in Operations:

When launching a kernel or performing memory transfers, you can specify the stream in which the operation should execute.

~~~c++
myKernel<<<numBlocks, threadsPerBlock, 0, stream1>>>(args...);
cudaMemcpyAsync(dest, src, size, cudaMemcpyHostToDevice, stream2);
~~~

In the example above, myKernel runs in stream1, while the asynchronous memory copy happens in stream2. If resources allow, both can execute concurrently.


#### Destroying a CUDA Stream:

~~~c++
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
~~~
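Putting the three steps together, a minimal sketch of a full program could look as follows. The kernel `scale`, the buffer names, and the sizes are hypothetical. The host buffer is allocated with `cudaMallocHost()` on the assumption that the copy should be truly asynchronous, which requires pinned memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in;                         // pinned host buffer
    cudaMallocHost(&h_in, bytes);
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Independent work in two streams: the kernel on d_a and the copy
    // into d_b may overlap if the hardware allows it.
    scale<<<(n + 255) / 256, 256, 0, stream1>>>(d_a, n, 2.0f);
    cudaMemcpyAsync(d_b, h_in, bytes, cudaMemcpyHostToDevice, stream2);

    // Wait for both streams before using the results or freeing memory.
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_in);
    printf("done\n");
    return 0;
}
```

Operations within one stream still run in issue order. Only operations placed in different streams are candidates for concurrent execution.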

## `Memory management and data transfer`

Memory management and data transfers in a GPU environment using CUDA are crucial for achieving optimal performance in parallel computing applications. 

### What are memory management and data transfer on multi-GPU systems?

In GPU systems, the concept of Data Transfer Between Host and Devices **`refers to the movement of data between the CPU (host) and GPU (device) memory spaces`**. This process is necessary because the CPU and GPU typically have separate memory pools, and data that is processed by the GPU often originates from the CPU.

### What types are possible?

- **Host-to-Device (H2D) and Device-to-Host (D2H) Transfers**: These are typically needed to initialize input data and retrieve results.

- **Pinned Memory**: Page-locked host memory that the GPU can read and write through **direct memory access (DMA)**. It enables faster transfers and is required for copies to be truly asynchronous.

- **Unified Memory**: Introduced to simplify memory management, it allows for automatic migration of data between the host and device. This can be convenient, but manual management is often faster due to better control over **data locality**.
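The sketch below contrasts the two approaches in one program: explicit H2D/D2H copies from a pinned buffer, and unified memory where the runtime migrates the data. The kernel `increment` and the variable names are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    const int blocks = (n + 255) / 256;

    // Variant 1: pinned host memory with explicit transfers.
    int *h_pinned, *d_data;
    cudaMallocHost(&h_pinned, bytes);      // page-locked host allocation
    cudaMalloc(&d_data, bytes);
    for (int i = 0; i < n; i++) h_pinned[i] = i;

    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);  // H2D
    increment<<<blocks, 256>>>(d_data, n);
    cudaMemcpy(h_pinned, d_data, bytes, cudaMemcpyDeviceToHost);  // D2H

    // Variant 2: unified memory, one pointer valid on host and device.
    int *managed;
    cudaMallocManaged(&managed, bytes);
    for (int i = 0; i < n; i++) managed[i] = i;

    increment<<<blocks, 256>>>(managed, n);
    cudaDeviceSynchronize();   // the host must wait before touching managed data again

    printf("pinned: %d, managed: %d\n", h_pinned[10], managed[10]);

    cudaFreeHost(h_pinned);
    cudaFree(d_data);
    cudaFree(managed);
    return 0;
}
```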

### What are asynchronous memory transfers on GPU systems?

Asynchronous Memory Transfers on GPU systems refer to the **ability to transfer data between host and device memory while other work proceeds at the same time**. This allows the CPU to continue processing other tasks while the data is being moved, which can significantly improve overall system efficiency, especially in high-performance and real-time applications.


- CUDA supports asynchronous data transfers using cudaMemcpyAsync(), which allows overlapping data transfers with computation. This helps hide data transfer latency by using streams: each stream is a queue of operations that runs in order, while operations in different streams can execute concurrently. For the copy to be truly asynchronous, the host buffer must be pinned.

- Stream Management: Multiple streams can be used to perform different tasks simultaneously, such as overlapping data transfers and kernel execution, leading to higher GPU utilization.
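A common way to apply both points on a single GPU is to split the data into pieces and give each piece its own stream, so that the copy of one piece can overlap with the kernel of another. The following is a minimal sketch under stated assumptions: the kernel `square`, the choice of four streams, and a size that divides evenly are all illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main()
{
    const int n = 1 << 22;
    const int numStreams = 4;              // assumption: n is divisible by numStreams
    const int chunk = n / numStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost(&h_data, n * sizeof(float));   // pinned, so the copies can be asynchronous
    cudaMalloc(&d_data, n * sizeof(float));
    for (int i = 0; i < n; i++) h_data[i] = 2.0f;

    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; s++) cudaStreamCreate(&streams[s]);

    // Each stream handles one piece: copy in, compute, copy out.
    for (int s = 0; s < numStreams; s++) {
        const int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        square<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();               // wait for all streams to finish
    printf("h_data[0] = %.1f\n", h_data[0]);

    for (int s = 0; s < numStreams; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

How much overlap is actually obtained depends on the device, so it is worth confirming it in a profiler timeline.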

### What are the Optimization Strategies for Memory Management?

- Minimize Data Transfers
- Use Asynchronous Transfers and Overlap Computation
- Exploit Unified Memory

## `Data chunking strategy for multi-stream and multi-GPU`

### What is the Data Chunking Strategy?

The data chunking strategy for multi-stream and multi-GPU environments is a technique used to optimize data processing and parallelism in CUDA applications. This strategy involves **`dividing large datasets into smaller chunks that can be processed independently and in parallel across multiple CUDA streams or multiple GPUs`**. This approach maximizes hardware utilization and improves throughput by reducing idle times and overlapping computation with data transfers.

In multi-GPU environments, multiple GPUs can be used simultaneously to process large datasets. By **distributing the data chunks across different GPUs**, the workload can be parallelized further, leveraging the combined computational power of all GPUs. Each GPU operates independently, and data chunking ensures that the workload is balanced across all GPUs. Properly distributing the chunks among GPUs is crucial to avoid bottlenecks and ensure even load distribution.

Example: If we have $N$ data items and $M$ GPUs, we can distribute $N/M$ items to each GPU of the system. Different strategies can be used, e.g., Static (equal chunks), Dynamic (small chunks that are assigned continuously as GPUs become free), Hybrid, Out-of-core (the data does not fit in GPU memory), ...
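For the Static strategy, the split can be computed on the host before any GPU call. The sketch below is plain C++ (it also compiles with `nvcc`). The `Chunk` struct and the `staticChunks` function are hypothetical names introduced here, and the policy of giving the remainder to the first chunks is one reasonable choice among several.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Chunk {
    std::size_t offset;  // index of the first item in the chunk
    std::size_t size;    // number of items in the chunk
};

// Static strategy: split n items into numGpus nearly equal chunks.
// When n is not divisible by numGpus, the first (n % numGpus) chunks
// receive one extra item, so chunk sizes differ by at most one.
std::vector<Chunk> staticChunks(std::size_t n, std::size_t numGpus)
{
    assert(numGpus > 0);
    std::vector<Chunk> chunks;
    const std::size_t base = n / numGpus;
    const std::size_t extra = n % numGpus;
    std::size_t offset = 0;
    for (std::size_t g = 0; g < numGpus; g++) {
        const std::size_t size = base + (g < extra ? 1 : 0);
        chunks.push_back({offset, size});
        offset += size;
    }
    return chunks;
}
```

With 10 items and 4 GPUs this yields sizes 3, 3, 2, 2 at offsets 0, 3, 6, 8. GPU `g` would then receive `chunks[g].size` items starting at `chunks[g].offset`.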

## `Copy/compute overlap with multiple GPUs`

### How to do copy/compute overlap with multiple GPUs?

Overlapping data transfers with computation across multiple GPUs is a powerful strategy in CUDA programming to maximize performance and hardware utilization. The goal is to minimize idle time by overlapping memory transfers (host-to-device and device-to-host) with kernel execution on multiple GPUs. This is achieved using asynchronous operations and streams. 

### Is there a step-by-step explanation of how to implement this technique?

1. Prepare the Environment

2. Divide the Data into Chunks

3. Allocate Memory on Multiple GPUs

4. Set Up CUDA Streams for Each GPU

5. Copy Data to GPUs Asynchronously

6. Launch Kernels Asynchronously

7. Transfer Results Back to the Host Asynchronously

8. Synchronize the Streams and Devices
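The eight steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the kernel `square`, the data size, and the use of a single stream per GPU are choices made for brevity. The step numbers in the comments refer to the list above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main()
{
    // 1. Prepare the environment
    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);
    if (numGpus == 0) { printf("No CUDA device found\n"); return 1; }

    const int n = 1 << 22;                  // hypothetical problem size
    float *h_data;                          // pinned and portable across devices
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocPortable);
    for (int i = 0; i < n; i++) h_data[i] = 3.0f;

    std::vector<float *> d_data(numGpus);
    std::vector<cudaStream_t> streams(numGpus);
    std::vector<int> offsets(numGpus), sizes(numGpus);

    int offset = 0;
    for (int g = 0; g < numGpus; g++) {
        // 2. Divide the data into chunks (remainder goes to the first GPUs)
        sizes[g] = n / numGpus + (g < n % numGpus ? 1 : 0);
        offsets[g] = offset;
        offset += sizes[g];

        cudaSetDevice(g);
        cudaMalloc(&d_data[g], sizes[g] * sizeof(float));   // 3. Allocate memory on each GPU
        cudaStreamCreate(&streams[g]);                      // 4. One stream per GPU
    }

    for (int g = 0; g < numGpus; g++) {
        cudaSetDevice(g);
        const size_t bytes = sizes[g] * sizeof(float);
        // 5. Copy the chunk to the GPU asynchronously
        cudaMemcpyAsync(d_data[g], h_data + offsets[g], bytes,
                        cudaMemcpyHostToDevice, streams[g]);
        // 6. Launch the kernel asynchronously in the same stream
        square<<<(sizes[g] + 255) / 256, 256, 0, streams[g]>>>(d_data[g], sizes[g]);
        // 7. Transfer the results back asynchronously
        cudaMemcpyAsync(h_data + offsets[g], d_data[g], bytes,
                        cudaMemcpyDeviceToHost, streams[g]);
    }

    // 8. Synchronize the streams, then release the resources
    for (int g = 0; g < numGpus; g++) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
        cudaStreamDestroy(streams[g]);
        cudaFree(d_data[g]);
    }

    printf("h_data[0] = %.1f\n", h_data[0]);
    cudaFreeHost(h_data);
    return 0;
}
```

With one stream per GPU, the overlap happens between GPUs: one device can compute while another is still copying. To also overlap copy and compute inside each GPU, each chunk would be split further across several streams on that device.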

## `CUDA error handling`

### What is it?

CUDA error handling is a crucial aspect of CUDA programming that ensures the correct and stable execution of GPU-accelerated applications. It **involves detecting, reporting, and addressing errors that can occur during CUDA function calls**, kernel launches, memory operations, or any other GPU-related activity. Proper error handling helps developers identify issues, debug programs, and prevent undefined behavior or crashes.

### Why is CUDA Error Handling Important?

- Detecting Failures

- Debugging

- Deploying the code

### Can you give me an example?

~~~c++
myKernel<<<gridDim, blockDim>>>(args);

cudaError_t err = cudaGetLastError();

if (err != cudaSuccess) 
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
~~~
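The same check is often wrapped in a macro so that every runtime call can be verified with little extra code. The `CUDA_CHECK` macro below is a common user-defined convention, not part of the CUDA API, and the kernel `fill` is a hypothetical example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// User-defined helper: abort with file and line if a CUDA call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void fill(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    const int n = 1024;
    int *d_data;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(d_data, n, 7);
    CUDA_CHECK(cudaGetLastError());        // catches launch errors, e.g. an invalid configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs

    CUDA_CHECK(cudaFree(d_data));
    printf("all CUDA calls succeeded\n");
    return 0;
}
```

Kernel launches return no status themselves, which is why the launch is followed by `cudaGetLastError()` and a synchronization call.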

## `NVIDIA® Nsight™ Systems Visual Profiler`

### What is it?

It is a **profiling tool**. The objective is to **identify bottlenecks and analyze performance on GPU systems**.

### What is possible with the tool?

- Memory analysis
- Identify latency
- Debugging performance