# Before going to the main content ...
let's check if nvcc is available.<br>
If it shows a report on the Cuda compiler driver, then we are good to go!

In [7]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0


# Hello world!

Let's try to run the hello-world program.<br>
Let me explain the code first.<br>
The code is located in this directory : ```codes/1_introduction/```<br>
It is a ```.cu``` file, which looks very similar to a ```.c``` file.<br>
However, some necessary changes have to be made.
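- A function that runs on the GPU and is launched from the CPU is called a *kernel*. It is declared with ```__global__ void```.
- Kernels ALWAYS have to return ```void```. Other GPU functions, declared with ```__device__```, may return values, but they can only be called from code that is already running on the GPU.
- The syntax for launching a kernel is : ```func <<<num1, num2>>>();``` (I'll explain what these numbers are.)
- A kernel launch is asynchronous, so the CPU does not wait for the GPU to finish. Calling ```cudaDeviceSynchronize();``` makes the CPU wait until the GPU is done, which also flushes the GPU's ```printf``` output to the screen.<br>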
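
To make the difference between the two kinds of GPU functions concrete, here is a minimal sketch. The names `square`, `squareKernel` and `squareOnGpu` are made up for this illustration and are not part of the files in ```codes/1_introduction/```. Since a kernel cannot return a value, the sketch assumes the usual workaround of writing the result to GPU memory and copying it back.

```cuda
#include <cstdio>

// A __device__ function runs on the GPU, can only be called from GPU code,
// and is allowed to return a value.
__device__ int square(int x) {
    return x * x;
}

// A __global__ function (kernel) is launched from the CPU and must return void,
// so it hands its result back through a pointer to GPU memory.
__global__ void squareKernel(int x, int *result) {
    *result = square(x);
}

// Hypothetical host-side helper: launches the kernel and copies the result back.
int squareOnGpu(int x) {
    int *d_result = nullptr;   // lives in GPU memory
    int h_result = -1;         // lives in CPU memory
    cudaMalloc(&d_result, sizeof(int));
    squareKernel<<<1, 1>>>(x, d_result);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}

int main() {
    printf("square(7) = %d\n", squareOnGpu(7));
    return 0;
}
```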

The syntax for compiling and running is the following. <br>
```nvcc -o output_exe input_code -run```<br>

In this particular example, there are three parts.<br>

#### A function that runs on the CPU:
```
void helloCPU(){
    printf("Hello from the CPU.\n");
}
```
#### A function that runs on the GPU:

```
__global__ void helloGPU(){
    printf("Hello from the GPU.\n");
}
```
#### A main function:
```
int main(){
    helloCPU();
    helloGPU<<<1, 1>>>();
    cudaDeviceSynchronize();
}
```
Inside the main function, the CPU first runs the ```helloCPU()``` function just like a regular C program. Then the ```helloGPU()``` kernel is launched on the GPU with 1 block containing 1 thread, so it runs exactly once. The launch returns immediately, without waiting for the GPU. After that, ```cudaDeviceSynchronize()``` makes the CPU wait until the kernel has finished. Output from ```printf``` inside a kernel is buffered on the GPU and is flushed at synchronization points like this one. Without the call, the program could exit before "Hello from the GPU." is printed on the screen.
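
Putting the three parts together, a complete file could look like the sketch below. The actual ```hello.cu``` is not reproduced in this notebook, so the include line, the error checks and the helper name `runHello` are my assumptions about what a robust version might contain, not a copy of the file.

```cuda
#include <cstdio>

void helloCPU() {
    printf("Hello from the CPU.\n");
}

__global__ void helloGPU() {
    printf("Hello from the GPU.\n");
}

// Hypothetical helper: returns 0 on success, 1 if the GPU reported an error.
int runHello() {
    helloCPU();
    helloGPU<<<1, 1>>>();                    // asynchronous launch: 1 block, 1 thread
    cudaError_t err = cudaGetLastError();    // catches errors in the launch itself
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();       // waits for the kernel, flushes GPU printf
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main() {
    return runHello();
}
```

Checking the return values is optional for a hello-world program, but without it a failed launch (for example, when no GPU is available) simply prints nothing from the GPU.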

In [8]:
!nvcc -o executables/hello codes/1_introduction/hello.cu -run

Hello from the CPU.
Hello from the GPU.


# CUDA Thread Hierarchy

CUDA stands for Compute Unified Device Architecture. A **thread** is the smallest unit of work that can be scheduled and executed by a GPU. Threads are organized into a hierarchy of blocks and grids to make efficient use of the GPU's parallel processing capabilities. A single kernel launch can create thousands or even millions of threads, allowing for massive parallelism.

Threads are grouped into **blocks**. A block is a logical unit that provides synchronization, communication, and memory sharing among its threads. You can think of it as a collection of threads working together. All threads within a block can access the same shared memory. The maximum number of threads per block depends on the GPU architecture.

A **grid** is a collection of blocks. Blocks within a grid execute independently of each other, allowing for further parallelism. The blocks in a grid can be scheduled on any available streaming multiprocessor (SM) within the GPU, in any order.

Now, let's come back to the syntax : ```func <<<num1, num2>>>();```
Here, the GPU *kernel* is executed in parallel by many threads spread across many blocks. The two numbers are the number of blocks and the number of threads per block, i.e.,<br>
```func <<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>();```<br>

**The kernel code is executed by every thread in every thread block configured when the kernel is launched**.<br>
- ```someKernel<<<1, 1>>>()``` is configured to run in a single thread block which has a single thread and will therefore run only once.
- ```someKernel<<<1, 10>>>()``` is configured to run in a single thread block which has 10 threads and will therefore run 10 times.
- ```someKernel<<<10, 1>>>()``` is configured to run in 10 thread blocks which each have a single thread and will therefore run 10 times.
- ```someKernel<<<10, 10>>>()``` is configured to run in 10 thread blocks which each have 10 threads and will therefore run 100 times.
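
One way to check these counts is to let every thread add 1 to a shared counter. The sketch below does this with `atomicAdd`. The names `countRuns` and `countKernelRuns` are made up for this illustration.

```cuda
#include <cstdio>

// Every thread that executes the kernel adds 1 to the counter.
// atomicAdd makes sure that simultaneous updates are not lost.
__global__ void countRuns(int *counter) {
    atomicAdd(counter, 1);
}

// Hypothetical helper: launches the kernel with the given configuration
// and returns how many times the kernel body was executed.
int countKernelRuns(int blocks, int threads) {
    int *d_counter = nullptr;
    int h_counter = 0;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));
    countRuns<<<blocks, threads>>>(d_counter);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main() {
    printf("<<<1, 1>>>   ran %d time(s)\n", countKernelRuns(1, 1));
    printf("<<<1, 10>>>  ran %d time(s)\n", countKernelRuns(1, 10));
    printf("<<<10, 1>>>  ran %d time(s)\n", countKernelRuns(10, 1));
    printf("<<<10, 10>>> ran %d time(s)\n", countKernelRuns(10, 10));
    return 0;
}
```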

Let's see this in action.

### Playing with number of threads and blocks:

Each thread and each block has an integer ID that starts from 0. Inside a kernel, these IDs are available through built-in variables such as ```threadIdx.x``` and ```blockIdx.x```. The following is an example where I am printing out these IDs from every thread. You can see that the lines do not appear in order of block ID. The blocks run in parallel, and the order in which they execute is not guaranteed.
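
The source of ```threads_and_blocks.cu``` is not shown in this notebook. A kernel of roughly the following shape, launched with 2 blocks of 2 threads, would print lines in the same format. Treat it as a sketch: the names `printIds` and `launchPrintIds` are my own.

```cuda
#include <cstdio>

// Every thread prints its own thread ID and the ID of the block it belongs to.
__global__ void printIds() {
    int thread_id = threadIdx.x;
    int block_id = blockIdx.x;
    printf("Thread ID = %d , Block ID = %d \n", thread_id, block_id);
}

// Hypothetical helper: launches the kernel and reports whether it succeeded.
cudaError_t launchPrintIds(int blocks, int threads) {
    printIds<<<blocks, threads>>>();
    return cudaDeviceSynchronize();   // wait for the GPU and flush its printf output
}

int main() {
    cudaError_t err = launchPrintIds(2, 2);   // 2 blocks x 2 threads = 4 lines
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```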

In [4]:
!nvcc -o executables/printing_numbers codes/1_introduction/threads_and_blocks.cu -run

Thread ID = 0 , Block ID = 1 
Thread ID = 1 , Block ID = 1 
Thread ID = 0 , Block ID = 0 
Thread ID = 1 , Block ID = 0 


### Making the code idiot proof
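There is an issue with manually setting nBlocks and nThreads. A launch creates nBlocks x nThreads threads in total, and this product has to be at least the number of work items, N. It does not have to match the number of cores. My GPU has 896 cores, but a launch can create far more threads than that, because the hardware schedules the blocks onto the cores in turn. What is limited is the number of threads per block (at most 1024 on current GPUs). The real problem is that N is usually not a multiple of the number of threads per block. That's why we can't just come up with some arbitrary number of blocks and threads per block: with too few threads, some work is left undone, and with too many, the extra threads try to work on items that don't exist.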

This is how we fix the problem.

Set the number of work items, for example ```int N = 896;```<br>

Suppose we want exactly 256 threads per block.<br>

```size_t threads_per_block = 256;```<br>

Then the number of blocks should be N divided by threads_per_block, rounded up.<br> 

```size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;```
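
The expression above is integer division that rounds up instead of down. Writing $T$ for `threads_per_block` (my shorthand, not a name used in the code), the arithmetic works out as follows.

$$
\text{number\_of\_blocks} = \left\lceil \frac{N}{T} \right\rceil = \left\lfloor \frac{N + T - 1}{T} \right\rfloor
$$

$$
N = 896,\ T = 256:\quad \left\lfloor \frac{896 + 255}{256} \right\rfloor = \left\lfloor \frac{1151}{256} \right\rfloor = 4 \ \text{blocks},\quad 4 \times 256 = 1024 \ \text{threads}
$$

$$
N = 10,\ T = 4:\quad \left\lfloor \frac{10 + 3}{4} \right\rfloor = \left\lfloor \frac{13}{4} \right\rfloor = 3 \ \text{blocks},\quad 3 \times 4 = 12 \ \text{threads}
$$

In both cases more threads are launched than there are work items (128 extra in the first case, 2 in the second).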

Inside the kernel, we can also calculate a unique global index for each thread (printed as 'Core ID' in the output below) by doing the following.

`int idx = threadIdx.x + blockIdx.x * blockDim.x;`

Then, we tell the GPU to execute the job only when `idx < N`, so that any extra threads in the last block do nothing.

Let's modify the previous example to include this global index. I am printing out the integers from 0 to 9, so N = 10. Also, I am choosing the number of threads per block to be 4. This gives 3 blocks and 12 threads, of which the last 2 are skipped by the `idx < N` check.
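
The source of ```threads_and_blocks_2.cu``` is not shown either. The sketch below follows the steps described above and prints lines in the same format. The names `printGuarded` and `runGuarded`, and the extra counter that records how many threads passed the check, are my own additions for illustration.

```cuda
#include <cstdio>

// Each thread computes its global index and works only if the index is valid.
__global__ void printGuarded(int n, int *printed) {
    int thread_id = threadIdx.x;
    int block_id = blockIdx.x;
    int idx = thread_id + block_id * blockDim.x;   // unique index across all blocks
    if (idx < n) {                                 // extra threads skip the work
        printf("Thread ID = %d , Block ID = %d, Core ID = %d \n",
               thread_id, block_id, idx);
        atomicAdd(printed, 1);                     // count the threads that printed
    }
}

// Hypothetical helper: rounds the block count up, launches the kernel,
// and returns how many threads passed the idx < n check.
int runGuarded(int n, int threads_per_block) {
    int number_of_blocks = (n + threads_per_block - 1) / threads_per_block;
    int *d_printed = nullptr;
    int h_printed = 0;
    cudaMalloc(&d_printed, sizeof(int));
    cudaMemset(d_printed, 0, sizeof(int));
    printGuarded<<<number_of_blocks, threads_per_block>>>(n, d_printed);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_printed, d_printed, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_printed);
    return h_printed;
}

int main() {
    int printed = runGuarded(10, 4);   // launches 3 blocks x 4 threads = 12 threads
    printf("%d threads passed the check\n", printed);
    return 0;
}
```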

In [5]:
!nvcc -o executables/printing_numbers_2 codes/1_introduction/threads_and_blocks_2.cu -run

Thread ID = 0 , Block ID = 2, Core ID = 8 
Thread ID = 1 , Block ID = 2, Core ID = 9 
Thread ID = 0 , Block ID = 0, Core ID = 0 
Thread ID = 1 , Block ID = 0, Core ID = 1 
Thread ID = 2 , Block ID = 0, Core ID = 2 
Thread ID = 3 , Block ID = 0, Core ID = 3 
Thread ID = 0 , Block ID = 1, Core ID = 4 
Thread ID = 1 , Block ID = 1, Core ID = 5 
Thread ID = 2 , Block ID = 1, Core ID = 6 
Thread ID = 3 , Block ID = 1, Core ID = 7 


Notice that the core IDs are not printed in order from 0 to 9. This is because the blocks are not executed in any guaranteed order. Within each block, the thread IDs go from 0 to 3, except for the last block, where only threads 0 and 1 print. Threads 2 and 3 of that block have the indices 10 and 11, which fail the `idx < N` check.

You can do a fun exercise of printing out the integers from 0 to 9. But instead of using a for loop on the CPU, where the print statement is executed one iteration after another, I am using the index features of the GPU to execute the print statements in parallel.

I have two algorithms for doing that. The first one involves launching 10 threads and printing out the core ID (from 0 to 9). This is similar to the previous example, so I am not repeating it here. The second one involves a matrix of numbers {block id, thread id}, where only the diagonal elements are printed on screen. The GPU function will look like this.

```
__global__ void print_integers(){
    int block_id = blockIdx.x;
    int thread_id = threadIdx.x;
    if(block_id == thread_id){ //for only the diagonal elements in the matrix {block id, thread id}
        printf("%d", block_id);
    } 
}
```
Since we are printing out numbers from 0 to 9, both `block_id` and `thread_id` should have values from 0 to 9. That's why in the main function, we run the kernel as follows.
`print_integers<<<10, 10>>>();`

Try this example yourself. The disadvantage of this method is that it launches N x N threads to print only N numbers, and N is limited by the maximum number of threads per block. On current GPUs that maximum is 1024, so this method can only print numbers up to 1023. That's why it's better to use the global index and make the code 'idiot proof' by executing the work only when the index is less than N.