<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/csc413-week1/blob/master/cuda1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Example1

To write code, create a code cell and put %%writefile filename.cu on its first line. Running the cell creates the file, or overwrites it if it already exists.

In [None]:
%%writefile example1.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
__global__ void kernel(){

}
int main(){
    kernel<<<1,1>>>();
    cudaDeviceSynchronize(); // kernel launches are asynchronous: wait for it to finish
    std::cout<<"done\n";
}

Overwriting example1.cu


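We compile with nvcc, the NVIDIA CUDA C++ compiler. The file extension must be .cu; otherwise nvcc treats the file as ordinary C++ and hands it to the "regular" host C++ compiler, which does not understand the CUDA language extensions.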

In [None]:
!nvcc example1.cu -o example1
!./example1

done


## Example2

Recall that before computing on the GPU we need to transfer the input data from host memory to device memory. Once the computation is done, we transfer the result back to the host.
Below is a simple example of that process.

In [None]:
%%writefile example2.cu
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <iostream>
__global__ void kernel(int *x,int *y,int *z){
    *z=*x+*y;
}
int main(){
    int a=1,b=2,c=0; //host variables
    int *d_a,*d_b,*d_c;//will hold device addresses
    // allocate device memory for one integer per variable and
    // store the device addresses in d_a, d_b and d_c
    cudaMalloc(&d_a,sizeof(int));
    cudaMalloc(&d_b,sizeof(int));
    cudaMalloc(&d_c,sizeof(int));
    // copy the value of a and b
    // TO device FROM host
    cudaMemcpy(d_a,&a,sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(d_b,&b,sizeof(int),cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(d_a,d_b,d_c);
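    // copy the result TO host FROM device; cudaMemcpy waits for the
    // kernel to finish and returns once the copy is complete
    cudaMemcpy(&c,d_c,sizeof(int),cudaMemcpyDeviceToHost);
    std::cout<<"value of c is "<<c<<"\n";
    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}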

Writing example2.cu


In [None]:
!nvcc example2.cu -o example2
!./example2

value of c is 3


## Example3
This is the first example where we use parallelism, computing the sum of two arrays.
Each thread computes the sum of one pair of elements. To accomplish that we map the thread id to the array index. In this example we use a __single__, __linear__ block, so the thread id is equal to the built-in variable threadIdx.x.

In [None]:
%%writefile example3.cu
#include <iostream>
#include <cstdlib> // malloc, free
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
__global__ void kernel(float* a, float* b, float* c) {
	int id = threadIdx.x;
	c[id] = a[id] + b[id];
}

int main() {
	int N = 1024;
	float* a, * b, * c;
	float* da, * db, * dc;
  /* allocate memory on host */
	a = (float*)malloc(N * sizeof(float));
	b = (float*)malloc(N * sizeof(float));
	c = (float*)malloc(N * sizeof(float));
  /* allocate memory on device */
	cudaMalloc(&da, N * sizeof(float));
	cudaMalloc(&db, N * sizeof(float));
	cudaMalloc(&dc, N * sizeof(float));
  /* initialize the arrays a and b */
	for (int i = 0; i < N; ++i) {
		a[i] = i;
		b[i] = 2 * i;
	}
  /* copy arrays a and b to device */
	cudaMemcpy(da, a, N * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(db, b, N * sizeof(float), cudaMemcpyHostToDevice);
  /* launch kernel with one block of N threads */
	kernel<<<1, N>>>(da, db, dc);
  /* copy result to host */
	cudaMemcpy(c, dc, N * sizeof(float), cudaMemcpyDeviceToHost);
  /* print the first 10 elements */
	for (int i = 0; i < 10; ++i)
		std::cout << c[i] << ' ';
	std::cout << std::endl;
	/* free memory on host and device */
	free(a);
	free(b);
	free(c);
	cudaFree(db);
	cudaFree(dc);
	cudaFree(da);

}

Overwriting example3.cu


In [None]:
!nvcc example3.cu -o example3
!./example3

0 3 6 9 12 15 18 21 24 27 
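
The printed values match a[i] + b[i] = 3i for the first ten elements. To check all N elements rather than only the first ten, one option is a host-side comparison such as the sketch below. The helper name sumIsCorrect and its tol parameter are illustrative assumptions; they are not part of the original example or of the CUDA API.

```cpp
#include <cmath>

// Hypothetical helper (not part of CUDA): returns true when every c[i]
// equals a[i] + b[i] within an absolute tolerance tol.
// With the default tol = 0 the comparison is exact, which is enough here
// because the inputs are small integer-valued floats.
bool sumIsCorrect(const float* a, const float* b, const float* c,
                  int n, float tol = 0.0f) {
    for (int i = 0; i < n; ++i) {
        float expected = a[i] + b[i];
        if (std::fabs(c[i] - expected) > tol) {
            return false; // first mismatch found
        }
    }
    return true;
}
```

In example3.cu it could be called as sumIsCorrect(a, b, c, N) right after the device-to-host cudaMemcpy, while the host arrays are still allocated.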


## Thread blocks

In CUDA the __maximum__ number of threads in a block is 1024. What if, in the previous example, we wanted to compute the sum of two vectors with more than 1024 elements? We use multiple blocks.
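
The arithmetic behind using multiple blocks can be sketched on the host, assuming one-dimensional blocks that all have the same size. The helper names blocksNeeded and globalIndex are hypothetical and exist only for illustration; inside a kernel, the same index is normally written as blockIdx.x * blockDim.x + threadIdx.x.

```cpp
// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= n (ceiling division).
constexpr int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: the array index handled by a given thread,
// mirroring blockIdx.x * blockDim.x + threadIdx.x in a kernel.
constexpr int globalIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}
```

For instance, 4096 elements with 1024 threads per block need 4 blocks, and thread 5 of block 3 handles element 3077. When n is not a multiple of the block size, the last block has threads whose index is n or larger, so a kernel would typically guard the array access with a check such as id < n.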

### Example4
We repeat the previous example, this time using multiple blocks.