# CUDA Threads

* [Overview](#overview) 
* [CUDA threads](#ekf)
* [References](#refs)

## <a name="overview"></a> Overview

In this section we give an overview of CUDA threads. In general, threads in CUDA are arranged in blocks, which are themselves arranged into grids. If any of the dimensions is not specified, it is set to 1 by default. Each thread in CUDA is aware of its position within the grid/block hierarchy via the ```threadIdx``` and ```blockIdx``` built-in variables, while ```blockDim``` and ```gridDim``` give the sizes of the block and of the grid.

## <a name="ekf"></a> CUDA threads

In CUDA, threads are organised into blocks. A block can be 1D, 2D or 3D. Blocks are arranged into grids. Similar to blocks, a grid can be 1D, 2D or 3D. Both a block and a grid are represented via the ```dim3``` data structure. For example, the following launches a _kernel_ onto a grid of $(4, 3, 2)$ blocks where each block is a $(3, 2)$ arrangement of threads:

```
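dim3 block(3, 2);        // 3 x 2 threads per block
dim3 grid(4, 3, 2);      // 4 x 3 x 2 blocks in the grid
foo<<<grid, block>>>();  // foo is a __global__ kernel defined elsewhere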
```
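To make the size of such a launch concrete, here is a minimal host-side C++ sketch that needs no GPU. `Dim3`, `volume` and `total_threads` are hypothetical names introduced for illustration; `Dim3` only imitates the defaulting behaviour of CUDA's `dim3`, where unspecified dimensions become 1.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's dim3:
// any dimension that is not specified defaults to 1.
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Number of elements described by a Dim3
// (threads in a block, or blocks in a grid).
constexpr unsigned long long volume(const Dim3& d) {
    return 1ULL * d.x * d.y * d.z;
}

// Total number of threads created by a launch with the given shapes.
unsigned long long total_threads(const Dim3& grid, const Dim3& block) {
    // Every dimension must be at least 1 for the launch to be meaningful.
    assert(volume(grid) > 0 && volume(block) > 0);
    return volume(grid) * volume(block);
}
```

With the shapes used above, `total_threads(Dim3(4, 3, 2), Dim3(3, 2))` gives 24 blocks of 6 threads each, i.e. 144 threads.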

If any of the dimensions is not specified, it is set to 1 by default, so ```dim3 block(3, 2)``` describes a $(3, 2, 1)$ block. Each thread in CUDA is aware of its position within the grid/block hierarchy via four built-in variables:

- ```blockDim``` contains the size of each block $(B_x, B_y, B_z)$
- ```gridDim``` contains the size of the grid in blocks $(G_x, G_y, G_z)$
- ```threadIdx``` is the thread position within a block. It has three components as follows
    - $x \in [0, B_x -1]$
    - $y \in [0, B_y -1]$
    - $z \in [0, B_z -1]$
- ```blockIdx``` is the $(b_x, b_y, b_z)$ position of a thread's block within the grid. Its components follow the same pattern as those of ```threadIdx```, but are bounded by the grid size: $b_x \in [0, G_x - 1]$, $b_y \in [0, G_y - 1]$ and $b_z \in [0, G_z - 1]$.

When launching multiple blocks, ```threadIdx``` alone is not unique, since two or more threads
in different blocks can have the same index. Given this, how can we calculate a unique global id for our thread?

We can assume that each thread is an element of a 6D array that has the following arrangement

```
Thread t[G.z][G.y][G.x][B.z][B.y][B.x]
```

We can then calculate a unique id by flattening this array in row-major order, i.e. with the rightmost index varying fastest:

```
int t_id = (blockIdx.z * gridDim.x * gridDim.y +
            blockIdx.y * gridDim.x +
            blockIdx.x) * blockDim.x * blockDim.y * blockDim.z +
            threadIdx.z * blockDim.x * blockDim.y +
            threadIdx.y * blockDim.x +
            threadIdx.x;
```
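The formula can be checked on the host without a GPU. The following is a minimal C++ sketch, not device code: `Dim3` and `UInt3` are hypothetical stand-ins for CUDA's `dim3` and `uint3`, and the built-in variables are passed in explicitly as `grid_dim`, `block_dim`, `block_idx` and `thread_idx`. The helper `ids_are_unique` (also a name made up here) visits every thread of a launch and confirms that each id in $[0, N-1]$ is produced exactly once.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side stand-ins for CUDA's dim3 and uint3.
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};
struct UInt3 { unsigned x, y, z; };

// Same arithmetic as the kernel-side formula, with the built-ins as parameters.
unsigned global_id(const Dim3& grid_dim, const Dim3& block_dim,
                   const UInt3& block_idx, const UInt3& thread_idx) {
    assert(block_idx.x < grid_dim.x && block_idx.y < grid_dim.y && block_idx.z < grid_dim.z);
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y && thread_idx.z < block_dim.z);
    // Position of the block within the grid, flattened in row-major order.
    unsigned block_id = block_idx.z * grid_dim.x * grid_dim.y +
                        block_idx.y * grid_dim.x + block_idx.x;
    unsigned threads_per_block = block_dim.x * block_dim.y * block_dim.z;
    // Position of the thread within its block, flattened the same way.
    unsigned local_id = thread_idx.z * block_dim.x * block_dim.y +
                        thread_idx.y * block_dim.x + thread_idx.x;
    return block_id * threads_per_block + local_id;
}

// Returns true if the ids of all threads in the launch are distinct and below N.
bool ids_are_unique(const Dim3& g, const Dim3& b) {
    unsigned n = g.x * g.y * g.z * b.x * b.y * b.z;
    std::vector<bool> seen(n, false);
    for (unsigned bz = 0; bz < g.z; ++bz)
     for (unsigned by = 0; by < g.y; ++by)
      for (unsigned bx = 0; bx < g.x; ++bx)
       for (unsigned tz = 0; tz < b.z; ++tz)
        for (unsigned ty = 0; ty < b.y; ++ty)
         for (unsigned tx = 0; tx < b.x; ++tx) {
             unsigned id = global_id(g, b, UInt3{bx, by, bz}, UInt3{tx, ty, tz});
             if (id >= n || seen[id]) return false;  // out of range or duplicate
             seen[id] = true;
         }
    return true;
}
```

For the $(4, 3, 2)$ grid of $(3, 2)$ blocks, the last thread of the last block (block $(3, 2, 1)$, thread $(2, 1, 0)$) gets id $23 \cdot 6 + 5 = 143$, which is the largest of the 144 ids.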

The global coordinates of the threads can be obtained according to

```
// start of block + local component
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
```
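As a rough illustration of how these coordinates behave, the host-side C++ sketch below evaluates them for a given block and thread. `Dim3`, `UInt3`, `global_coord` and `row_major_offset` are hypothetical names, not part of CUDA; the second function assumes the global coordinates index a dense array whose extent is the whole launch, which is a common but not the only way to use them.

```cpp
#include <cassert>

// Hypothetical host-side stand-ins for CUDA's dim3 and uint3.
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};
struct UInt3 { unsigned x, y, z; };

// Start of block + local component, applied to each dimension.
UInt3 global_coord(const Dim3& block_dim, const UInt3& block_idx, const UInt3& thread_idx) {
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y && thread_idx.z < block_dim.z);
    return UInt3{block_idx.x * block_dim.x + thread_idx.x,
                 block_idx.y * block_dim.y + thread_idx.y,
                 block_idx.z * block_dim.z + thread_idx.z};
}

// Assumption: the coordinates address a dense row-major array that spans
// the whole launch, i.e. (G_x * B_x) by (G_y * B_y) by (G_z * B_z) elements.
unsigned row_major_offset(const UInt3& c, const Dim3& grid_dim, const Dim3& block_dim) {
    unsigned width  = grid_dim.x * block_dim.x;
    unsigned height = grid_dim.y * block_dim.y;
    return (c.z * height + c.y) * width + c.x;
}
```

For the $(4, 3, 2)$ grid of $(3, 2)$ blocks, thread $(2, 1, 0)$ of block $(3, 2, 1)$ lands at global coordinate $(11, 5, 1)$. Note that an offset computed this way generally differs from `t_id`: the first thread of block $(1, 0, 0)$ has global coordinate $(3, 0, 0)$ and offset 3, while its `t_id` is 6.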

In the next section we will discuss in more detail how threads are arranged in CUDA. In particular, we will see that the threads within a block are not necessarily all executed at the same time. Instead, they are scheduled in groups called warps. The size of a warp is hardware-specific; on NVIDIA GPUs to date it is 32 threads.
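As a small preview, the host-side C++ sketch below estimates how a block is split into warps. It rests on two assumptions that are labelled in the code: a warp size of 32 (device code can read the actual value from the built-in `warpSize`), and warps being formed from consecutive local thread ids, with `threadIdx.x` varying fastest, as the CUDA programming guide describes. `Dim3`, `UInt3`, `kWarpSize`, `warps_per_block` and `warp_of` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical host-side stand-ins for CUDA's dim3 and uint3.
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};
struct UInt3 { unsigned x, y, z; };

// Assumption: 32 threads per warp.
constexpr unsigned kWarpSize = 32;

// Number of warps needed to hold all threads of a block (rounded up).
unsigned warps_per_block(const Dim3& block_dim) {
    unsigned threads = block_dim.x * block_dim.y * block_dim.z;
    assert(threads > 0);
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Index, within its block, of the warp that a given thread belongs to.
// Assumption: warps are filled with consecutive local thread ids.
unsigned warp_of(const UInt3& thread_idx, const Dim3& block_dim) {
    unsigned local_id = thread_idx.z * block_dim.x * block_dim.y +
                        thread_idx.y * block_dim.x + thread_idx.x;
    return local_id / kWarpSize;
}
```

Under these assumptions the $(3, 2)$ block used earlier occupies a single warp in which only 6 of the 32 lanes do work, whereas a $(16, 16)$ block fills 8 warps exactly.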

## <a name="refs"></a> References