

Stefano Cozzini Area Science Park 14.12.2021/16.12.2021

## Agenda

- Why GPUs? (a little bit of history)
- Basics of GPU architecture
- How to use and program GPUs

DISCLAIMER: many of these slides come from NVIDIA training materials.

# A little bit of history 1: the rise (the 90s)

- The CPU has always been slow for Graphics Processing
  - Visualization
  - Games
- Graphics processing is inherently parallel, with a great deal of parallelism available
  - O(pixels)
- GPUs were built to do graphics processing only
- Initially, hardwired logic replicated to provide parallelism
  - Little to no programmability

### From Nvidia web site

1999

NVIDIA INVENTS THE GPU

NVIDIA invents the graphics processing unit, putting it on a path to reshape the industry. GeForce 256 is launched as the world's first GPU, a term NVIDIA defines as "a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second." Modern GPUs process more than 7 billion polygons per second.



# A little bit of history 2: the 2000s

- Like CPUs, GPUs benefited from Moore's Law
  - Evolved from fixed-function hardwired logic to flexible, programmable ALUs
- Around 2003/2004, GPUs became programmable "enough" to do some non-graphics computations
  - Severely limited by the graphics programming model
- In 2006, GPUs became "fully" programmable: NVIDIA released the "CUDA" language for writing non-graphics programs that run on GPUs

### From Nvidia web site







2006

CUDA ARCHITECTURE UNVEILED

NVIDIA unveils CUDA, a revolutionary architecture for general purpose GPU computing. CUDA will enable scientists and researchers to harness the parallel processing capabilities of GPUs to tackle their most complex computing challenges.

# A little bit of history 3: the 2010s (the present)

- GPUs are widely deployed as accelerators
- GPUs have been so successful that the other alternatives to CPUs are dead
  - Sony/IBM Cell BE
  - ClearSpeed CSX
  - Intel MIC
- GPU enabled the ML/DL/AI revolution and started the HPC/AI convergence
  - Tensor Cores
- There is ONE winner: NVIDIA

### From Nvidia web site

2019

NVIDIA ANNOUNCES ADVANCES ACROSS HPC, EMBEDDED, DATA CENTER, AUTONOMOUS VEHICLE, AND PRO GRAPHICS MARKETS

NVIDIA reference design platform is introduced, enabling companies to quickly build GPU-accelerated Arm®-based servers for a growing range of HPC applications.




## Nvidia slide: computing mode for accelerator



## Heterogeneous computing

- Terminology:
  - Host: the CPU and its memory (host memory)
  - Device: the GPU and its memory (device memory)





### CPU vs GPU



# Nvidia cards for HPC/AI

| Nvidia Datacenter GPU        | Nvidia Tesla V100 | Nvidia A100                      |
|------------------------------|-------------------|----------------------------------|
| GPU codename                 | GV100             | GA100                            |
| GPU architecture             | Volta             | Ampere                           |
| Launch date                  | May 2017          | May 2020                         |
| GPU process                  | TSMC 12nm         | TSMC 7nm                         |
| Die size                     | 815 mm²           | 826 mm²                          |
| Transistor count             | 21.1 billion      | 54 billion                       |
| FP64 CUDA cores              | 2,560             | 3,456                            |
| FP32 CUDA cores              | 5,120             | 6,912                            |
| Tensor Cores                 | 640               | 432                              |
| Streaming Multiprocessors    | 80                | 108                              |
| Peak FP64                    | 7.8 teraflops     | 9.7 teraflops                    |
| Peak FP64 Tensor Core        | -                 | 19.5 teraflops                   |
| Peak FP32                    | 15.7 teraflops    | 19.5 teraflops                   |
| Peak FP32 Tensor Core (TF32) | -                 | 156 teraflops / 312 teraflops\*  |
| Peak BFLOAT16 Tensor Core    | -                 | 312 teraflops / 624 teraflops\*  |
| Peak FP16 Tensor Core        | -                 | 312 teraflops / 624 teraflops\*  |
| Peak INT8 Tensor Core        | -                 | 624 TOPS / 1,248 TOPS\*          |

<sup>\*</sup>Effective TOPS / TFLOPS using the new Sparsity feature
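
As a rough cross-check of the peak FP64 figures in the table, peak throughput can be estimated as the number of cores times two floating-point operations per cycle (one fused multiply-add) times the clock frequency. The boost clocks used here (about 1.53 GHz for the V100 and 1.41 GHz for the A100) are not given in the table and are assumptions for this sketch:

$$
P_{\text{peak}} = N_{\text{cores}} \times 2\,\frac{\text{FLOP}}{\text{cycle}} \times f_{\text{clock}}
$$

$$
\text{V100: } 2560 \times 2 \times 1.53\,\text{GHz} \approx 7.8\ \text{teraflops}, \qquad
\text{A100: } 3456 \times 2 \times 1.41\,\text{GHz} \approx 9.7\ \text{teraflops}
$$

The same estimate with the FP32 core counts (5,120 and 6,912) gives roughly 15.7 and 19.5 teraflops.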

## Simple process flow




# CUDA Parallel Computing Platform



From: www.nvidia.com/getcuda

# 3 ways to accelerate your application




# Libraries: Easy, High-Quality Acceleration

- Ease of use: using libraries enables GPU acceleration without in-depth knowledge of GPU programming
- "Drop-in": many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes
- Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications
- Performance: NVIDIA libraries are tuned by experts

### Some GPU-accelerated libraries

### Math Libraries

GPU-accelerated math libraries lay the foundation for compute-intensive applications in areas such as molecular dynamics, computational fluid dynamics, computational chemistry, medical imaging, and seismic exploration.



- **cuBLAS**: GPU-accelerated basic linear algebra (BLAS) library
- **cuSOLVER**: GPU-accelerated dense and sparse direct solvers
- **cuFFT**: GPU-accelerated library for Fast Fourier Transforms
- **cuSPARSE**: GPU-accelerated BLAS for sparse matrices
- **CUDA Math Library**: GPU-accelerated standard mathematical function library
- **cuTENSOR**: GPU-accelerated tensor linear algebra library
- **cuRAND**: GPU-accelerated random number generation (RNG)
- **AmgX**: GPU-accelerated linear solvers for simulations and implicit unstructured methods

### Deep Learning Libraries

GPU-accelerated libraries for Deep Learning applications that leverage CUDA and specialized hardware components of GPUs.

- **NVIDIA cuDNN**: GPU-accelerated library of primitives for deep neural networks
- **NVIDIA TensorRT™**: High-performance deep learning inference optimizer and runtime for production deployment
- **NVIDIA Jarvis**: Platform for developing engaging and contextual AI-powered conversation apps
- **NVIDIA DeepStream SDK**: Real-time streaming analytics toolkit for AI-based video understanding and multi-sensor processing
- **NVIDIA DALI**: Portable, open-source library for decoding and augmenting images and videos to accelerate deep learning applications

## 3 Steps to CUDA-accelerated application

- Step 1: Substitute library calls with equivalent CUDA library calls

```
saxpy ( ... ) ---> cublasSaxpy ( ... )
```

- Step 2: Manage data locality
  - with CUDA: cudaMalloc(), cudaMemcpy(), etc.
  - with CUBLAS: cublasAlloc(), cublasSetVector(), etc.
- Step 3: Rebuild and link the CUDA-accelerated library
  - nvcc myobj.o -lcublas
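
The three steps can be sketched as a small, self-contained program. This is only an illustration, not the course's reference solution: it assumes the current cuBLAS API (`cublas_v2.h`, with a handle) and allocates device memory with `cudaMalloc()` rather than the legacy `cublasAlloc()`. The vector length and the input values are arbitrary choices.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1 << 20;        // vector length (arbitrary, for illustration)
    const float alpha = 2.0f;
    size_t bytes = n * sizeof(float);

    // Host vectors
    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 3.0f; }

    // Step 2: manage data locality (allocate on the device, copy inputs)
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    // Step 1: the saxpy call becomes cublasSaxpy: y = alpha * x + y
    cublasStatus_t status = cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    if (status != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSaxpy failed\n");

    // Copy the result back to the host
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);
    printf("y[0] = %f\n", y[0]);  // each element is 2 * 1 + 3

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```

Step 3 is then a matter of linking against cuBLAS, for example `nvcc saxpy_cublas.cu -lcublas` (the file name is a placeholder).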

# Explore the CUDA (Libraries) Ecosystem

CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem

### Tools & Ecosystem



- **GPU-Accelerated Libraries**: Accelerating an application can be as easy as calling a library function.
- **Debugging Solutions**: Powerful tools can help debug complex parallel applications in intuitive ways.
- **Accelerated Web Services**: Micro services with visual and intelligent capabilities using deep learning.
- **Language and APIs**: GPU acceleration can be accessed from most popular programming languages.
- **Data Center Tools**: Software tools for every step of the HPC and AI software life cycle.
- **Cluster Management**: Managing your cluster and job scheduling can be simple and intuitive.
- **Performance Analysis Tools**: Find the best solutions for analyzing your application's performance profile.
- **Key Technologies**: Learn more about parallel computing technologies and architectures.

## Tutorial: performing DGEMM on CPU and GPU

• See github repo

# 3 ways to accelerate your application



### OpenACC directives

- Simple Compiler hints
- Compiler parallelizes code
- Works on many-core GPUs & multicore CPUs



Your original Fortran or C code

### OpenACC

- Easy: directives are the easy path to accelerate compute-intensive applications
- Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
- Powerful: GPU directives allow complete access to the massive parallel power of a GPU

## Is OpenACC available?

- OpenACC is at version 2.7
- PGI compiler fully implements it
- GCC 9 includes initial support for OpenACC 2.6

### OpenACC - Directive Based Approach

- Directives are added to serial source code
  - Manage loop parallelization
  - Manage data transfer between CPU and GPU memory
- Works with C, C++, or Fortran
  - can be combined with explicit CUDA C/Fortran usage
- Directives are formatted as comments (Fortran) or pragmas (C/C++)
  - They don't interfere with serial execution
- Maintains portability of the original code

# 3 ways to accelerate your application



### GPU Programming Languages



## CUDA programming

- CUDA = Compute Unified Device Architecture
  - Expose general-purpose GPU computing as first-class capability
  - Retain traditional DirectX/OpenGL graphics performance

### CUDA C

- Based on industry-standard C
- A handful of language extensions to allow heterogeneous programs
- Straightforward APIs to manage devices, memory, etc.

### CUDA basic concepts

- The GPU is viewed as a compute device that:
  - has its own RAM (device memory)
  - runs data-parallel portions of an application as kernels by using many threads
- GPU vs. CPU threads
  - GPU threads are extremely lightweight
  - Very little creation overhead
  - GPU needs 1000s of threads for full efficiency
  - multi-core CPU needs only a few (basically one thread per core)

### Hello world

- Standard C that runs on the host
- NVIDIA compiler (nvcc) can be used to compile programs with no device code
- At its simplest, CUDA C is just C!

```
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}
```

- To compile: nvcc hello\_world.cu

#### Hello world on device

- To compile: nvcc -o simple\_kernel simple\_kernel.cu
- To execute: ./simple\_kernel

```
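#include <stdio.h>

// Empty kernel: runs on the device and does nothing
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();        // launch the kernel on the device
    printf("Hello World!\n");   // printed by the host
    return 0;
}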
```

- CUDA C keyword \_\_global\_\_ indicates that a function
  - Runs on the device
  - Called from host code
- nvcc splits the source file into host and device components:
  - NVIDIA's compiler handles device functions like mykernel()
  - A standard host compiler (gcc, icc, ...) handles host functions like main()

- Triple angle brackets mark a call from host code to device code
  - Also called a "kernel launch"
  - We'll return to the parameters (1,1) in a moment

That's all that is required to execute a function on the GPU
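
In the example above the kernel is empty, so the message is printed by the host. As a minimal sketch of a kernel that does something visible, the variant below prints from the device. It assumes device-side `printf()` (supported by CUDA on current GPUs), and the kernel name `hello_kernel` is made up for this illustration. The call to `cudaDeviceSynchronize()` makes the host wait for the kernel, so the device output is flushed before the program exits.

```cuda
#include <stdio.h>

// Hypothetical kernel name; runs on the device, called from the host
__global__ void hello_kernel(void) {
    printf("Hello World from the device!\n");
}

int main(void) {
    hello_kernel<<<1,1>>>();      // kernel launch: 1 block, 1 thread
    cudaDeviceSynchronize();      // wait for the kernel and flush device output
    printf("Hello World from the host!\n");
    return 0;
}
```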

## Parallel programming on GPU

- Adding two vectors on the GPU
- A simple kernel to start with, which adds two integers:

```
__global__ void add(int *a, int *b, int *c)
{
 *c = *a + *b;
}
```

- As before \_\_global\_\_ is a CUDA C/C++ keyword meaning:
  - add() will execute on the device
  - add() will be called from the host



## Memory management (1)

- Host and device memory are separate entities
  - Device pointers point to GPU memory
  - May be passed to/from host code
  - May not be dereferenced in host code
- Host pointers point to CPU memory
  - May be passed to/from device code
  - May not be dereferenced in device code

## Memory management (2)

- Simple CUDA API for handling device memory
  - cudaMalloc(&p, size)
  - cudaFree(p)
  - cudaMemcpy(t, s, size, direction)
- Similar to the C equivalents
  - malloc(), free(), memcpy()
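
A minimal sketch of these calls in use: a small array is copied to the device and back, and every call is checked. The `CHECK` macro is a hypothetical helper written for this example (it is not part of CUDA); `cudaGetErrorString()` is the runtime call that turns an error code into text.

```cuda
#include <stdio.h>
#include <stdlib.h>

#define COUNT 8   // number of elements (arbitrary, for illustration)

// Hypothetical helper: abort with a message if a CUDA runtime call fails
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main(void) {
    int h_src[COUNT], h_dst[COUNT];
    size_t size = COUNT * sizeof(int);
    for (int i = 0; i < COUNT; i++) h_src[i] = i * i;

    int *d_buf = NULL;                        // device pointer: never dereferenced on the host
    CHECK(cudaMalloc((void **)&d_buf, size)); // takes the address of the pointer
    CHECK(cudaMemcpy(d_buf, h_src, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(h_dst, d_buf, size, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_buf));                   // takes the pointer itself

    // Verify the round trip
    int mismatches = 0;
    for (int i = 0; i < COUNT; i++)
        if (h_dst[i] != h_src[i]) mismatches++;
    printf("mismatches: %d\n", mismatches);
    return 0;
}
```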

Starting with CUDA 6.0, there is also a **Unified Memory** feature.

## Memory management (3)

- Unified Memory creates a pool of managed memory that is shared between the CPU and GPU.
- Managed memory is accessible to both the CPU and GPU using a single pointer.
- The system automatically migrates data allocated in Unified Memory between host and device
- API: cudaMallocManaged()
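
As a sketch of how this looks in practice, the integer addition kernel can be run on managed memory, with no explicit `cudaMemcpy()` calls. The same pointers are used on the host and on the device. The `cudaDeviceSynchronize()` call is needed before the host reads the result, because the kernel launch returns immediately.

```cuda
#include <stdio.h>

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main(void) {
    int *a, *b, *c;   // single pointers, valid on both host and device

    // Allocate managed memory
    cudaMallocManaged((void **)&a, sizeof(int));
    cudaMallocManaged((void **)&b, sizeof(int));
    cudaMallocManaged((void **)&c, sizeof(int));

    // Initialize directly from the host
    *a = 2;
    *b = 7;

    add<<<1,1>>>(a, b, c);
    cudaDeviceSynchronize();   // wait for the kernel before touching *c on the host

    printf("%d + %d = %d\n", *a, *b, *c);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```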



## Addition on the Device: add()

```
__global__ void add(int *a, int *b, int *c)
{
    *c = *a + *b;
}
```

Let's take a look at main()...

#### Addition on the Device: main()

```
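int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);
    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Setup input values
    a = 2;
    b = 7;
    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}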
```

## Moving to parallel...

So how do we run code in parallel on the device?

• Instead of executing add() once, execute N times in parallel

#### Vector Addition on the Device

- With add() running in parallel we can do vector addition
- Terminology: each parallel invocation of add() is referred to as a block
  - The set of blocks is referred to as a grid
  - Each invocation can refer to its block index using blockIdx.x

```
__global__ void add(int *a, int *b, int *c) {
   c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```

By using blockIdx.x to index into the array, each block handles a different index

With four blocks, for example, each block executes its own addition in parallel:

```
Block 0:  c[0] = a[0] + b[0];
Block 1:  c[1] = a[1] + b[1];
Block 2:  c[2] = a[2] + b[2];
Block 3:  c[3] = a[3] + b[3];
```

# Vector Addition on the Device: complete program (1)

```
#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies of a, b, c and setup input values
    // (random_ints() is a helper not shown here, assumed to fill an array with random integers)
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
```

# Vector Addition on the Device: complete program (2)

```
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

#### Review so far:

- Difference between host and device
  - Host: CPU
  - Device: GPU
- Using \_\_global\_\_ to declare a function as device code
  - Executes on the device
  - Called from the host
- Passing parameters from host code to a device function
- Basic device memory management
  - cudaMalloc()
  - cudaMemcpy()
  - cudaFree()
- Launching parallel kernels
  - Launch N copies of add() with add<<<N,1>>>(...);
  - Use blockIdx.x to access block index

#### CUDA execution model

- Thread: Sequential execution unit
  - All threads execute same sequential program
  - Threads execute in parallel
- Thread block: a group of threads
  - Executes on a single Streaming Multiprocessor (SM)
  - Threads within a block can cooperate
    - Light-weight synchronization
    - Data exchange
- Grid: a collection of thread blocks
  - Thread blocks of a grid execute across multiple SMs
  - Thread blocks do not synchronize with each other
  - Communication between blocks is expensive



#### CUDA threads

- Terminology: a block can be split into parallel threads
- Let's change add() to use parallel threads instead of parallel blocks
- We use threadIdx.x instead of blockIdx.x
- Need to make one change in main()...

```
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
```

## Vector Addition on the Device: using threads

```
#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
```

## Vector Addition on the Device: using threads

```
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

## Combining blocks and threads

- We've seen parallel vector addition using:
  - Many blocks with one thread each
  - One block with many threads
- We want to adapt vector addition to use both blocks and threads
- First let's discuss data indexing...

## Indexing Arrays with Blocks and Threads

- No longer as simple as using blockIdx.x and threadIdx.x
- Consider indexing an array with one element per thread (8 threads/block)



With M threads/block a unique index for each thread is given by:

```
int index = threadIdx.x + blockIdx.x * M;
```
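
To see the formula at work, the sketch below has every thread record which block and thread it is, at the position given by its index. With M = 8 threads per block, element 21 (an element picked here purely for illustration) is handled by thread 5 of block 2, since 5 + 2 * 8 = 21. The kernel name `record_owner` and the number of blocks are assumptions made for this example.

```cuda
#include <stdio.h>

#define M 8                    // threads per block, as in the 8 threads/block case
#define NUM_BLOCKS 4           // assumed number of blocks for this illustration
#define N (M * NUM_BLOCKS)     // one array element per thread

// Each thread stores its own block and thread index at its global position
__global__ void record_owner(int *block_of, int *thread_of) {
    int index = threadIdx.x + blockIdx.x * M;
    block_of[index]  = blockIdx.x;
    thread_of[index] = threadIdx.x;
}

int main(void) {
    int block_of[N], thread_of[N];
    int *d_block_of, *d_thread_of;
    size_t size = N * sizeof(int);

    cudaMalloc((void **)&d_block_of, size);
    cudaMalloc((void **)&d_thread_of, size);

    record_owner<<<NUM_BLOCKS, M>>>(d_block_of, d_thread_of);

    // cudaMemcpy waits for the kernel to finish before copying
    cudaMemcpy(block_of, d_block_of, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(thread_of, d_thread_of, size, cudaMemcpyDeviceToHost);

    int i = 21;   // element chosen for illustration
    printf("element %d -> block %d, thread %d\n", i, block_of[i], thread_of[i]);

    cudaFree(d_block_of); cudaFree(d_thread_of);
    return 0;
}
```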

## Indexing Arrays: Example

Which thread will operate on the red element?



#### Addition with Blocks and Threads

Use the built-in variable blockDim.x for threads per block

```
int index = threadIdx.x + blockIdx.x * blockDim.x;
```

Combined version of add() to use parallel threads and parallel blocks

```
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
```

#### Vector Addition using threads and blocks

```
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
```

## Vector Addition on the Device: using threads and blocks

```
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N threads in total
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

## Handling Arbitrary Vector Sizes

- Typical problems are not friendly multiples of blockDim.x
- Avoid accessing beyond the end of the arrays:

```
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                        // threads past the end do nothing
        c[index] = a[index] + b[index];
}
```

Update the kernel launch, where M is the number of threads per block and the number of blocks is rounded up:

```
add<<<(N + M-1)/M,M>>>(d_a, d_b, d_c, N);
```
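
Putting the bounds check and the rounded-up launch together, a complete program could look like the sketch below. The vector length is deliberately chosen so that it is not a multiple of the block size, and the inputs are filled with known values (instead of random ones) so that the result can be verified on the host. These choices are assumptions made for the example.

```cuda
#include <stdio.h>
#include <stdlib.h>

#define N 1000003                 // deliberately not a multiple of the block size
#define THREADS_PER_BLOCK 512

__global__ void add(const int *a, const int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                // guard against accessing beyond the arrays
        c[index] = a[index] + b[index];
}

int main(void) {
    size_t size = N * sizeof(int);

    // Host arrays with known contents: a[i] = i, b[i] = 2 * i
    int *a = (int *)malloc(size);
    int *b = (int *)malloc(size);
    int *c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Device arrays
    int *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Round the number of blocks up so that every element is covered
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify on the host: every element should equal 3 * i
    int errors = 0;
    for (int i = 0; i < N; i++)
        if (c[i] != 3 * i) errors++;
    printf("errors: %d\n", errors);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```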

#### Review

- Launching parallel kernels
  - Launch N copies of add() with add<<<N/M,M>>>(...);
  - Use blockIdx.x to access block index
  - Use threadIdx.x to access thread index within block
- Allocate elements to threads:

```
int index = threadIdx.x + blockIdx.x * blockDim.x;
```

# Why bother with threads?

- Threads seem unnecessary
  - They add a level of complexity
  - What do we gain?
- Unlike parallel blocks, threads have mechanisms to:
  - Communicate
  - Synchronize
- To look closer, we need a new example...

#### 1D stencil

- Consider applying a 1D stencil to a 1D array of elements
  - Each output element is the sum of input elements within a radius
- If radius is 3, then each output element is the sum of 7 input elements:
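
Written as a formula, with $R$ denoting the radius (a symbol introduced here for convenience), each output element is

$$
\text{out}[i] = \sum_{j=-R}^{R} \text{in}[i + j]
$$

so for $R = 3$ the sum has $2R + 1 = 7$ terms, from $\text{in}[i-3]$ to $\text{in}[i+3]$.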



## Implement within a block

- Each thread processes one output element
  - blockDim.x elements per block
- Input elements are read several times
  - With radius 3, each input element is read seven times





## Sharing Data between Threads

- Terminology: within a block, threads share data via shared memory
- Extremely fast on-chip memory
  - As opposed to device memory, which is referred to as global memory
  - Like a user-managed cache
- Declare using \_\_shared\_\_, allocated per block
- Data is not visible to threads in other blocks

#### Implementing With Shared Memory

- Cache data in shared memory
  - Read (blockDim.x + 2 \* radius) input elements from global memory to shared memory
  - Compute blockDim.x output elements
  - Write blockDim.x output elements to global memory



Each block needs a halo of radius elements at each boundary

#### Stencil kernel

```
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
```

## Apply the stencil

```
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}
```

#### Data race!

- The stencil example will not work...
- Suppose thread 15 reads the halo before thread 0 has fetched it...

# \_\_syncthreads()

```
void __syncthreads();
```

- Synchronizes all threads within a block
  - Used to prevent RAW / WAR / WAW hazards
- All threads must reach the barrier
  - In conditional code, the condition must be uniform across the block

#### Stencil kernel

```
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}
```
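
For reference, a complete program around the stencil kernel could look like the sketch below. The host side is an assumption made for this example: the input array is padded with RADIUS extra elements on each side, and the kernel receives a pointer shifted by RADIUS so that the halo reads stay inside the allocation. N is assumed to be a multiple of BLOCK_SIZE. The input is filled with ones, so every output element should equal 2 * RADIUS + 1.

```cuda
#include <stdio.h>
#include <stdlib.h>

#define N          (1024 * 1024)   // assumed to be a multiple of BLOCK_SIZE
#define RADIUS     3
#define BLOCK_SIZE 256

__global__ void stencil_1d(const int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements (and the halo) into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS]     = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // All threads of the block wait here until shared memory is filled
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}

int main(void) {
    const int padded = N + 2 * RADIUS;        // input length including both halos
    size_t in_bytes  = padded * sizeof(int);
    size_t out_bytes = N * sizeof(int);

    int *in  = (int *)malloc(in_bytes);
    int *out = (int *)malloc(out_bytes);
    for (int i = 0; i < padded; i++) in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, in_bytes);
    cudaMalloc((void **)&d_out, out_bytes);
    cudaMemcpy(d_in, in, in_bytes, cudaMemcpyHostToDevice);

    // Shift the input pointer by RADIUS so in[gindex - RADIUS] is valid for gindex = 0
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);

    cudaMemcpy(out, d_out, out_bytes, cudaMemcpyDeviceToHost);

    // Verify on the host: with an input of ones, each sum is 2 * RADIUS + 1
    int errors = 0;
    for (int i = 0; i < N; i++)
        if (out[i] != 2 * RADIUS + 1) errors++;
    printf("errors: %d\n", errors);

    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```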

#### Coordinating host and device

- Kernel launches are asynchronous
  - Control returns to the CPU immediately
- CPU needs to synchronize before consuming the results
- A few ways to do this:

  - cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins when all preceding CUDA calls have completed
  - cudaMemcpyAsync(): asynchronous, does not block the CPU
  - cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
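
A minimal sketch of this coordination: the launch returns immediately, `cudaGetLastError()` reports problems with the launch itself, and `cudaDeviceSynchronize()` makes the host wait for the kernel and returns any error raised while it ran. The kernel `fill` and the sizes are assumptions made for this example.

```cuda
#include <stdio.h>

// Hypothetical kernel: sets every element of data to value
__global__ void fill(int *data, int n, int value) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] = value;
}

int main(void) {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));

    // Asynchronous: control returns to the CPU right after the launch
    fill<<<(n + 255) / 256, 256>>>(d_data, n, 42);

    // Errors in the launch itself (e.g. an invalid configuration) show up here
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // The host could do unrelated work here while the GPU is busy

    // Block the CPU until the kernel has finished; reports execution errors
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    // cudaMemcpy is itself blocking, so the value is ready when it returns
    int first;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("data[0] = %d\n", first);

    cudaFree(d_data);
    return 0;
}
```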

#### What to do tomorrow:

- CUDA
  - CUDA101:
    - Simple commands/program to interact with GPU cards on ORFEO
  - Vector:
    - Exercises on vector
  - 1DStencil
    - Exercise on 1D-stencil
- GEMM:
  - Compare CPU vs GPU performance on Matrix-Matrix multiplication