
In [1]:
!nvidia-smi

Sat Dec  5 03:02:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [8]:
# %%capture
!sudo apt update
!sudo add-apt-repository ppa:graphics-drivers
!sudo apt-key adv --fetch-keys  http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
!sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
!sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda_learn.list'

In [9]:
# %%capture
!sudo apt install cuda-10-1
!sudo apt install libcudnn7

In [6]:
%%capture
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git

In [5]:
%load_ext nvcc_plugin

created output directory at /content/src
Out bin /content/result.out


# Intro to GPU using CUDA programming paradigm.

CPU : Host 
GPU : Device


---


The CPU is optimized to minimize latency, whereas the GPU is optimized to maximize throughput.

Latency vs Throughput

Example:
We have to travel from LA to Virginia, a distance of 4500 km.

A Ferrari carries 2 people at 200 km/hr.
A bus carries 40 people at 50 km/hr.

Car
Latency    : 22.5 hr            (distance / speed = 4500 / 200)
Throughput : 0.089 people/hr    (people carried / latency = 2 / 22.5)

Bus
Latency    : 90 hr              (distance / speed = 4500 / 50)
Throughput : 0.44 people/hr     (people carried / latency = 40 / 90)

The car plays the role of the CPU (each trip finishes sooner) and the bus plays the role of the GPU (more people delivered per hour). In GPU terms, the "people" correspond to threads.


---


A CPU has
  1. Complex control hardware.
  2. A high cost in terms of power consumption.
  3. Flexibility and strong single-thread performance.

A GPU is built with
  1. Lots of simple ALUs.
  2. An explicitly parallel programming model.
  3. A design that optimizes for throughput, not latency.


A CUDA kernel is written as a serial program for a single thread; the GPU then runs that same code on many threads at once. The CPU (host) is responsible for launching the kernel and for specifying how many threads execute it.

The goal is always to keep as many threads busy as possible; otherwise the GPU's potential is wasted.

# CUDA Program 1.1 | Square

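int main(int argc, char** argv)

Here `char** argv` is a pointer to an array of C strings (a pointer to pointers to `char`). It is unrelated to Python's `**kwargs`; the closer Python analogy is `sys.argv`, the list of command-line arguments.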

```
argc will be the number of strings pointed to by argv. This will (in practice) be 1 plus the number of arguments, as virtually all implementations will prepend the name of the program to the array.

The variables are named argc (argument count) and argv (argument vector) by convention, but they can be given any valid identifier: int main(int num_args, char** arg_strings) is equally valid.

They can also be omitted entirely, yielding int main(), if you do not intend to process command line arguments.

Try the following program:

#include <iostream>

int main(int argc, char** argv) {
    std::cout << "Have " << argc << " arguments:" << std::endl;
    for (int i = 0; i < argc; ++i) {
        std::cout << argv[i] << std::endl;
    }
}
Running it with ./test a1 b2 c3 will output

Have 4 arguments:
./test
a1
b2
c3
```



# CUDA core functions 

compiler :
*   filename: square.cu
*   compile:  nvcc -o square square.cu


functions :

float *d_in;



*   cudaMalloc [Cuda memory allocation]

    cudaMalloc( (void**) &d_in, ARRAY_BYTES);

*   cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

*   Transfer of variables CPU <--> CUDA
    1. cudaMemcpyHostToDevice [copy var from CPU to GPU]
    2. cudaMemcpyDeviceToHost [copy var from GPU to CPU]
    3. cudaMemcpyDeviceToDevice [copy var from GPU to GPU]

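*   _kernel_name_<<<blocks, threads>>>(args...) [launch the kernel with the given number of blocks and threads per block; the arguments are the kernel's own parameters, e.g. device pointers, not argc/argv]

*   cudaDeviceSynchronize(); [block the host until all previously launched GPU work has finished]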

```
  maximum number of threads/block is 512 (older GPUs) or 1024 (newer GPUs)
	for instance:
		128 threads? 
			square<<<1,  128>>>( ... )	OK
		1280 threads? 
			square<<<10, 128>>>( ... )	OK
			square<<<5,  256>>>( ... )	OK
			square<<<1, 1280>>>( ... ) !!!NO!!!
	For a 2-dimensional problem, like image processing, a 128x128 pixel image
	can use <<<128, 128>>> (128 blocks of 128 threads, one thread per pixel).
	
	kernel<<< grid of blocks , block of threads >>> ( ... )
				1,2 or 3D       1,2 or 3D
	dim3( x, y, z )
	dim3( w, 1, 1 ) == dim3( w ) == w
	
	kernel<<< dim3( bx, by, bz ) , dim3( tx, ty, tz ), shmem >>> ( ... )
	
	bx * by * bz	grid of blocks
	tx * ty * tz	block of threads
	shmem			memory shared per block (in bytes)
	
	each thread knows its threadIdx
	
	threadIdx	thread within a block
		threadIdx.x
		threadIdx.y
	blockDim	size of a block
	blockIdx	block within a grid
	gridDim		size of a grid
	
	
	Map concept
	
	Map:
		- set of elements (eg. 64 float array)
		- function to run on each element independently	(eg. square function)
	so Map is:
		map(elements,function)
	
	GPUs are good at map because:
		- gpus have many parallel processors
		- gpus optimize for throughput

```



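# Why use the type cast (void**)

We can draw an analogy from C++, where an allocation returns the new pointer directly:

DS* var = new DS();
int* arr = new int[5];

cudaMalloc cannot do that, because its return value is an error code. Instead it takes the address of your pointer and writes the device address through it. Conceptually, cudaMalloc looks like this: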



```

void myMalloc(void** ptr, int size) {

  *ptr = malloc(size);

  return;

}

int* ptr = NULL;

myMalloc((void**) &ptr, size);

ptr[0] = 0;

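// The function must change the caller's pointer, so it needs the ADDRESS of
// that pointer:
//   ptr   has type int*   (points to the data)
//   &ptr  has type int**  (points to the pointer)
// The function is generic, so its parameter is void** rather than int**.
// C++ does not convert int** to void** implicitly, which is why the address
// of the pointer is explicitly cast with (void**) in the call.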

```



In [39]:
%%cu

#include <stdio.h>

/*
	Device (GPU) code
	__global__ - declaration specifier - GPU code mark
*/

__global__ void square(float *d_out, float *d_in) {
	int idx = threadIdx.x;
	float f = d_in[idx];
	d_out[idx] = f * f;
}

/*
	Host (CPU) code
*/
int main(int argc, char** argv) {
	const int ARRAY_SIZE = 64;
	const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
	
	// generate the input array on the host

	float h_in[ARRAY_SIZE];
	for( int i=0; i<ARRAY_SIZE; i++) {
		h_in[i] = float(i);
	}
	
	float h_out[ARRAY_SIZE];
	
	// declare GPU memory pointers
	float* d_in;
	float* d_out;
	
	// allocate GPU memory
	cudaMalloc( (void**) &d_in, ARRAY_BYTES);
	cudaMalloc( (void**) &d_out, ARRAY_BYTES);
	

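	// transfer the input array to the GPU
	cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

	// launch the kernel: 1 block of ARRAY_SIZE threads, one thread per element
	square<<<1, ARRAY_SIZE>>>(d_out, d_in);
	
	// wait for the kernel to finish before reading the results
	cudaDeviceSynchronize();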
	
	// copy back the result array to the CPU
	cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
	
	// print out the resulting array
	for( int i=0; i<ARRAY_SIZE; i++) {
		printf("%f", h_out[i]);
		printf(((i%4) != 3) ? "\t" : "\n");
	}
	
	// free the GPU memory allocation
	cudaFree(d_in);
	cudaFree(d_out);
	
	return 0;
}

0.000000	1.000000	4.000000	9.000000
16.000000	25.000000	36.000000	49.000000
64.000000	81.000000	100.000000	121.000000
144.000000	169.000000	196.000000	225.000000
256.000000	289.000000	324.000000	361.000000
400.000000	441.000000	484.000000	529.000000
576.000000	625.000000	676.000000	729.000000
784.000000	841.000000	900.000000	961.000000
1024.000000	1089.000000	1156.000000	1225.000000
1296.000000	1369.000000	1444.000000	1521.000000
1600.000000	1681.000000	1764.000000	1849.000000
1936.000000	2025.000000	2116.000000	2209.000000
2304.000000	2401.000000	2500.000000	2601.000000
2704.000000	2809.000000	2916.000000	3025.000000
3136.000000	3249.000000	3364.000000	3481.000000
3600.000000	3721.000000	3844.000000	3969.000000



In [54]:
%%cu
#include <stdlib.h>
#include <stdio.h>

__global__ void cube(float *d_out, float *d_in){
    int idx = threadIdx.x;
    float f = d_in[idx];
    d_out[idx] = f*f*f;
}
/*
d_var indicates device variables.
h_var indicates host variables.
*/
int main(int argc, char **argv){
    
    const int ARRAY_SIZE = 96;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);   

    float h_in[ARRAY_SIZE];
    float h_out[ARRAY_SIZE];

    for(int i=0;i<ARRAY_SIZE;i++){
        h_in[i] = float(i);
        // printf("%f \n",h_in[i]);
    }

    // declare GPU (device) pointers, just as we would declare ordinary pointers in serial code

    float *d_in;
    float *d_out;

    // allocate GPU memory

    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);

    // transfer CPU arrays to GPU.

    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel: 1 block of ARRAY_SIZE threads, one thread per element
    cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // wait for the kernel to finish
    cudaDeviceSynchronize();

    // transfer GPU arrays to CPU.

    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    
    // print out the resulting array
    for (int i =0; i < ARRAY_SIZE; i++) {
      printf("%f", h_out[i]);
      printf(((i % 4) != 3) ? "\t" : "\n");
    }

    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}

0.000000	1.000000	8.000000	27.000000
64.000000	125.000000	216.000000	343.000000
512.000000	729.000000	1000.000000	1331.000000
1728.000000	2197.000000	2744.000000	3375.000000
4096.000000	4913.000000	5832.000000	6859.000000
8000.000000	9261.000000	10648.000000	12167.000000
13824.000000	15625.000000	17576.000000	19683.000000
21952.000000	24389.000000	27000.000000	29791.000000
32768.000000	35937.000000	39304.000000	42875.000000
46656.000000	50653.000000	54872.000000	59319.000000
64000.000000	68921.000000	74088.000000	79507.000000
85184.000000	91125.000000	97336.000000	103823.000000
110592.000000	117649.000000	125000.000000	132651.000000
140608.000000	148877.000000	157464.000000	166375.000000
175616.000000	185193.000000	195112.000000	205379.000000
216000.000000	226981.000000	238328.000000	250047.000000
262144.000000	274625.000000	287496.000000	300763.000000
314432.000000	328509.000000	343000.000000	357911.000000
373248.000000	389017.000000	405224.000000	421875.000000
438976.000000	456533.00