

# Objective

- To learn the basic concepts involved in a simple CUDA kernel function
  - Declaration
  - Built-in variables
  - Thread index to data index mapping

### **Example: Vector Addition Kernel**

#### Device Code

```
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

__global___
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x+blockDim.x*blockIdx.x;
    if(i<n) C[i] = A[i] + B[i];
}</pre>
```

□ NIDIA
 □ NIDIA

T ILLINOIS

#### Example: Vector Addition Kernel Launch (Host Code)

#### Host Code

```
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0),256>>>(d_A, d_B, d_C, n);
}
```

The ceiling function makes sure that there are enough threads to cover all elements.

### More on Kernel Launch (Host Code)

#### Host Code

```
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}
```

This is an equivalent way to express the ceiling function.

🐵 NVIDIA



#### Kernel execution in a nutshell

```
__host__
void vecAdd(...)

{
    dim3 DimGrid(ceil(n/256.0),1,1);
    dim3 DimBlock(256,1,1);

vecAddKernel<<<<DimGrid,DimBlock>>>(d_A,d_B),d_C,n);
}

int i = blockIdx.x * blockDim.x + threadIdx.x;

if(i<n) C[i] = A[i]+B[i];
}

Grid

Bk N-1

RAM

RAM
```

#### More on CUDA Function Declarations

|                           | Executed on the: | Only callable from the: |
|---------------------------|------------------|-------------------------|
| device float DeviceFunc() | device           | device                  |
| global void KernelFunc()  | device           | host                    |
| host float HostFunc()     | host             | host                    |

global defines a kernel function Each " " consists of two underscore characters A kernel function must return void device and host can be used together host is optional if used alone

# Compiling A CUDA Program





# Objective

- To understand multidimensional Grids
  - Multi-dimensional block and thread indices
  - Mapping block/thread indices to data indices

# A Multi-Dimensional Grid Example



NVIDIA ILLINOIS

# Processing a Picture with a 2D Grid



62×76 picture

# Row-Major Layout in C/C++



13 Carrier 1

#### Source Code of a PictureKernel

Scale every pixel value by 2.0

□ ILLINOIS
 □ ILLINOIS

### Host Code for Launching PictureKernel

```
// assume that the picture is m × n,

// m pixels in y dimension and n pixels in x dimension

// input d_Pin has been allocated on and copied to device

// output d_Pout has been allocated on device

...

dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1);

dim3 DimBlock(16, 16, 1);

PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_Pout, m, n);

...
```

🕖 🕪 NVIDIÆ

I ILLINOIS

#### Covering a 62×76 Picture with 16×16 Blocks





Not all threads in a Block will follow the same control flow path.