<a href="https://colab.research.google.com/github/mkhfring/parallel-c/blob/main/A5_(2022).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A5 (10 Marks)
---
**Focus**: CUDA(A, B, C) - Introduction (warming up!!)

© Dr. Abdallah Mohamed

This is the Google Colab version of the assignment; it takes advantage of the free cloud GPUs offered through the platform. Colab notebooks (you're reading one right now!) are typically designed to run Python code. However, we'll modify this one so that it can run CUDA code (as discussed in the lectures) on the GPU.

Note that code can be written and run directly within this notebook. Keep in mind that any time your runtime disconnects or is restarted, **you must re-run the Notebook Setup code block**. This applies to all CUDA assignments done using Google Colab.

## Notebook Setup: CUDA Compilation

To enable CUDA code compilation in Colab notebooks, we'll use the NVCC4Jupyter plugin (source code/documentation available [here](https://github.com/engasa/nvcc4jupyter)). This plugin turns any Colab code block that starts with `%%cu` into compilable/runnable CUDA code.

To download/install/enable NVCC4Jupyter, please run the following code block. **Running this block is required any time you connect/restart/reconnect to an instance.** To run a code block, mouse over it and click the play button on the left side.

In [1]:
# Run the following to configure your notebook for CUDA code
!pip install git+https://github.com/engasa/nvcc4jupyter.git
%load_ext nvcc_plugin

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/engasa/nvcc4jupyter.git
  Cloning https://github.com/engasa/nvcc4jupyter.git to /tmp/pip-req-build-1bmrb3b6
  Running command git clone -q https://github.com/engasa/nvcc4jupyter.git /tmp/pip-req-build-1bmrb3b6
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-py3-none-any.whl size=4407 sha256=243b07a5c1b844a38a9f829622f345e6e51a9073ce9e7efa68d9112d8aa7f0a4
  Stored in directory: /tmp/pip-ephem-wheel-cache-mlanb7lt/wheels/d2/a3/04/ef659d715dcdd196d998813ca085af3cab3df66f4bb27576b5
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2
created output directory at /content/src
Out bin /content/result.out


You should see some output when you click the play button. Wait until the code block has finished running (the stop button disappears when it is done). The last couple of lines of output should look something like the following:

```
created output directory at /content/src
Out bin /content/result.out
```

## Notebook Setup: GPU Runtime

Before writing/running any CUDA code, we need to ensure Colab is provisioning a Cloud GPU for us. To do this, click on the "Runtime" menu item in the top bar and select the "Change runtime type" option. Select "GPU" from the list of Hardware accelerators and click "Ok". 

You can also check whether a GPU has been allocated. Colab notebooks without a GPU still have access to NVCC and will compile and execute CPU/host code; however, GPU/device code will silently fail. To catch this situation, the following code prints a warning when no GPU is detected.

In [2]:
%%cu
#include <stdio.h>
#include "device_launch_parameters.h"
int main() {
  int count = 0;  // stays 0 if the query below fails
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess || count <= 0)
    printf("WARNING: NO GPU DETECTED ON THIS COLAB INSTANCE. YOU SHOULD CHANGE THE RUNTIME TYPE.\n");
  else
    printf("GPU ENABLED. - You are ready to start\n");
  return 0;
}

GPU ENABLED. - You are ready to start



### *If the output above is ```GPU ENABLED. - You are ready to start```*
### *then you're ready to begin the assignment!*

# Question 1. [+3] 

Querying your GPU: In this question, you will run simple query code to discover the properties and limits of your Colab-provisioned NVIDIA card. Run the code block below, then capture your answers and **submit** them as an image file named A5_Q1.png.

**Note:** See below that `%%cu` needs to be added to let Colab know that the code block is CUDA code.

*Marking Guide: +3 for a screenshot with the required info*


In [3]:
%%cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    int count;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("----- General Information for device %d ---\n", i);
        printf("Name:	%s\n", prop.name);
        printf("Compute capability:	%d.%d\n", prop.major, prop.minor);
        printf("Clock rate:	%d\n", prop.clockRate);
        printf("Device copy overlap:	");
        printf(prop.deviceOverlap ? "Enabled\n" : "Disabled\n");
        printf("Kernel execution timeout: ");
        printf(prop.kernelExecTimeoutEnabled ? "Enabled\n" : "Disabled\n");
        printf("----- Memory Information for device %d ---\n", i);
        // The memory-size fields below are size_t, so they are printed with %zu.
        printf("Total global mem:\t%zu\n", prop.totalGlobalMem);
        printf("Total constant Mem:\t%zu\n", prop.totalConstMem);
        printf("Max mem pitch:\t%zu\n", prop.memPitch);
        printf("Texture Alignment:\t%zu\n", prop.textureAlignment);
        printf("----- MP Information for device %d ---\n", i);
        printf("Multiprocessor count:\t%d\n", prop.multiProcessorCount);
        // Note: the next two fields are per-block limits, despite the "per mp" labels.
        printf("Shared mem per mp:\t%zu\n", prop.sharedMemPerBlock);
        printf("Registers per mp:\t%d\n", prop.regsPerBlock);
        printf("Threads in warp:	%d\n", prop.warpSize);
        printf("Max threads per block:	%d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions:	(%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dimensions:	(%d, %d, %d)\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("\n");
    }
    return 0;
}

----- General Information for device 0 ---
Name:	Tesla T4
Compute capability:	7.5
Clock rate:	1590000
Device copy overlap:	Enabled
Kernel execution timeout: Disabled
----- Memory Information for device 0 ---
Total global mem:	15843721216
Total constant Mem:	65536
Max mem pitch:	2147483647
Texture Alignment:	512
----- MP Information for device 0 ---
Multiprocessor count:	40
Shared mem per mp:	49152
Registers per mp:	65536
Threads in warp:	32
Max threads per block:	1024
Max thread dimensions:	(1024, 1024, 64)
Max grid dimensions:	(2147483647, 65535, 65535)




# Question 2. [+7]

**Simple CUDA code:** consider this loop for initializing an array **a**:

```c
const int n = 10000000; // 10 million
for (int i = 0; i < n; i++)
    a[i] = (double)i / n;
```

Submit:

1.   The serial implementation running on the CPU.
2.   The CUDA implementation (1 thread per array element).

In both cases, add code to print the first and last 5 elements of the array to verify your code.

*Note that you need to use the placeholder %.7f to print 7 digits after the decimal point.*

***Sample output:***

```
a[0]: 0.0000000
a[1]: 0.0000001
a[2]: 0.0000002
a[3]: 0.0000003
a[4]: 0.0000004
...
a[9999995]: 0.9999995 
a[9999996]: 0.9999996 
a[9999997]: 0.9999997 
a[9999998]: 0.9999998 
a[9999999]: 0.9999999
```

***Marking Guide:***

+2 for measuring the time of the parallel and serial code 

+2 for the kernel function

+3 for launch configuration and properly calling the kernel



### CPU Implementation

Please code and run your CPU implementation in the code block below. When submitting your assignment, please copy the code block into a text/c file.

In [53]:
%%cu

// CPU Implementation goes here!
#include <stdio.h>
#include <stdlib.h>  // malloc, exit
#include <time.h>

void vector_init(double *arr, int size);
int main(){
    clock_t start, end;
    const int n = 10000000;
    double *a = (double*) malloc(n * sizeof(double));
    if (NULL == a){
        printf("Unable to allocate memory\n");
        exit(1);  // non-zero exit code signals failure
    }
    start = clock();
    vector_init(a, n);
    end = clock();
    double total_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("The total time to execute the serial code is %.3f \n", total_time);
    for (int i=0; i<5; i++){
        printf("a[%d]: %.7f \n", i, a[i]);
    }
    printf("....\n");
    for (int i=n-5; i<n; i++){
        printf("a[%d]: %.7f \n", i, a[i]);
    }

    free(a);  // release the host array
    return 0;
}
void vector_init(double *arr, int size){
    for(int i=0; i<size; i++){
        arr[i] = (double) i/size;
    }
}

The total time to execute the serial code is 0.092 
a[0]: 0.0000000 
a[1]: 0.0000001 
a[2]: 0.0000002 
a[3]: 0.0000003 
a[4]: 0.0000004 
....
a[9999995]: 0.9999995 
a[9999996]: 0.9999996 
a[9999997]: 0.9999997 
a[9999998]: 0.9999998 
a[9999999]: 0.9999999 



### CUDA Implementation

Please code and run your CUDA implementation in the code block below. When submitting your assignment, please copy the code block into a text/cu file.

In [64]:
%%cu

// CUDA Implementation goes here!
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "cuda_runtime.h"
#define MaxThreads 1024
// Abort with a readable message if a CUDA runtime call fails.
#define CHK(call) { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        printf("Error %d: %s:%d\n", err, __FILE__, __LINE__); \
        printf("%s\n", cudaGetErrorString(err)); \
        cudaDeviceReset(); \
        exit(1); \
    } \
}


__global__ void vector_init(double *arr, int size);
int main(){
    clock_t start, end;
    double *a, *d_a;
    const int n = 10000000;
    // Round up so that every element gets a thread (ceiling division).
    int block_number = (n + MaxThreads - 1) / MaxThreads;
    a = (double*) malloc(n * sizeof(double));
    if (NULL == a){
        printf("Unable to allocate memory\n");
        exit(1);
    }
    CHK(cudaMalloc(&d_a, n * sizeof(double)));
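    start = clock();
    vector_init<<<block_number, MaxThreads>>>(d_a, n);
    CHK(cudaGetLastError());  // catches an invalid launch configuration
    // cudaMemcpy waits for the kernel to finish, so the timing covers kernel + copy.
    CHK(cudaMemcpy(a, d_a, n * sizeof(double), cudaMemcpyDeviceToHost));
    end = clock();
    double total_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("The total time to execute the CUDA code is %.3f \n", total_time);
    for (int i=0; i<5; i++){
        printf("a[%d]: %.7f \n", i, a[i]);
    }
    printf("....\n");
    for (int i=n-5; i<n; i++){
        printf("a[%d]: %.7f \n", i, a[i]);
    }

    CHK(cudaFree(d_a));  // release device memory
    free(a);             // release host memory
    return 0;
}
__global__ void vector_init(double *arr, int size){
    // One thread per element: compute this thread's global index.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i < size){  // threads past the end of the array do nothing
        double val = (double) i / size;
        arr[i] = val;
    }
}

The total time to execute the CUDA code is 0.055 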
a[0]: 0.0000000 
a[1]: 0.0000001 
a[2]: 0.0000002 
a[3]: 0.0000003 
a[4]: 0.0000004 
....
a[9999995]: 0.9999995 
a[9999996]: 0.9999996 
a[9999997]: 0.9999997 
a[9999998]: 0.9999998 
a[9999999]: 0.9999999 



---

**Submission instructions**

For this assignment, you need to do the following:

1. Compress the PNG file from Q1 and the source code files from Q2 (i.e., the .c/.cu files, not the whole notebook) into one zip file, and name it after your ID (e.g., 1234567.zip).

2. Submit the zipped file to Canvas.

Note that you can resubmit an assignment, but the new submission overwrites the old submission and receives a new timestamp.