<a href="https://colab.research.google.com/github/UBCO-COSC-407-Winter-2021-Term-1/lab-6---introduction-to-cuda-mtwichan/blob/main/A6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A6 (10 Marks)
---
**Focus**: CUDA Basics - Introduction (Warming Up!!)

© 2021 Dr. Scott Fazackerley (Based on labs from Dr. Abdallah Mohamed © 2020)

For this course, we are going to use Google Colab for CUDA development, taking advantage of the free cloud GPUs offered through the platform. Colab Notebooks (you're reading one right now!) are typically designed to run Python code; however, we'll modify them so that we can run CUDA code (as discussed in the lectures) on the GPU.

Please note that your code will be written and run directly within this assignment. You will need to save a local copy and ensure that you save and upload your completed notebook to your GitHub repo for submission.   You will also need to provide screenshots of output in many cases for submissions.  

**CRITICAL**
Lastly, keep in mind that anytime your runtime disconnects or is restarted **you must re-run the Notebook Setup code block**. This applies to all CUDA assignments done using Google Colab.

## Notebook Setup: GPU Runtime

Before writing/running any CUDA code, we need to ensure Colab is provisioning a Cloud GPU for us. To do this, click on the "Runtime" menu item in the top bar and select the "Change runtime type" option. Select "GPU" from the list of Hardware accelerators and click "Ok". 

## Notebook Setup: CUDA Compilation

To enable CUDA code compilation on Colab Notebooks, we'll use the NVCC4Jupyter plugin (source code/documentation available [here](https://github.com/UBCO-COSC-407-Winter-2021-Term-1/nvcc4jupyter)). This plugin effectively turns any Colab Notebook code block that includes `%%cu` into compilable/runnable CUDA code.

To download/install/enable NVCC4Jupyter, please run the following code block. **Running this block is required anytime you connect/restart/reconnect to an instance.** To run a code block, mouse over it and click the play button on the left side.

You should see some output when you click the play button. Wait until the code block has finished running (the stop button disappears when it is done). The last couple of lines of output should look something like the following:

```
created output directory at /content/src
Out bin /content/result.out
```

If your last two lines of output look something like above, you're ready to begin the assignment!

In [2]:
# Run the following to configure your notebook for CUDA code
!pip install git+git://github.com/UBCO-COSC-407-Winter-2021-Term-1/nvcc4jupyter.git
%load_ext nvcc_plugin

Collecting git+git://github.com/UBCO-COSC-407-Winter-2021-Term-1/nvcc4jupyter.git
  Cloning git://github.com/UBCO-COSC-407-Winter-2021-Term-1/nvcc4jupyter.git to /tmp/pip-req-build-2gn_dciv
  Running command git clone -q git://github.com/UBCO-COSC-407-Winter-2021-Term-1/nvcc4jupyter.git /tmp/pip-req-build-2gn_dciv
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-py3-none-any.whl size=4405 sha256=1b57e9dd5762efd397a4d7316fb77ffcd433e2f68554fd026d808c341b00fdba
  Stored in directory: /tmp/pip-ephem-wheel-cache-5l_bsd62/wheels/a5/e9/0b/81648e44e04e6ae47e0ec176f5c1805063e4f687ee2bfceca6
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2
created output directory at /content/src
Out bin /content/result.out


## Question 1. [+3] 

Querying your GPU: In this question, you will run simple query code to discover the properties and limits of your Colab-provisioned NVIDIA card. Run the code block below, then capture your answers and **submit** them as an image file named A6_Q1.png.  This is important as we will want to know what resources are available on a card. 

**Note:** See below that `%%cu` needs to be added to let Colab know that the code block is CUDA code.

*Marking Guide: +3 for a screenshot with the required info*


In [3]:
%%cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    int count = 0;  // stays 0 if the query below fails
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("----- General Information for device %d ---\n", i);
        printf("Name:\t%s\n", prop.name);
        printf("Compute capability:\t%d.%d\n", prop.major, prop.minor);
        printf("Clock rate:\t%d\n", prop.clockRate);  // in kHz
        printf("Device copy overlap:\t%s\n",
               prop.deviceOverlap ? "Enabled" : "Disabled");
        printf("Kernel execution timeout: %s\n",
               prop.kernelExecTimeoutEnabled ? "Enabled" : "Disabled");
        printf("----- Memory Information for device %d ---\n", i);
        // The memory fields are size_t (in bytes), so print them with %zu.
        printf("Total global mem:\t%zu\n", prop.totalGlobalMem);
        printf("Total constant Mem:\t%zu\n", prop.totalConstMem);
        printf("Max mem pitch:\t%zu\n", prop.memPitch);
        printf("Texture Alignment:\t%zu\n", prop.textureAlignment);
        printf("----- MP Information for device %d ---\n", i);
        printf("Multiprocessor count:\t%d\n", prop.multiProcessorCount);
        // The next two fields are per-block limits, despite the "per mp" labels.
        printf("Shared mem per mp:\t%zu\n", prop.sharedMemPerBlock);
        printf("Registers per mp:\t%d\n", prop.regsPerBlock);
        printf("Threads in warp:\t%d\n", prop.warpSize);
        printf("Max threads per block:\t%d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions:\t(%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dimensions:\t(%d, %d, %d)\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("\n");
    }
    return 0;
}

----- General Information for device 0 ---
Name:	Tesla K80
Compute capability:	3.7
Clock rate:	823500
Device copy overlap:	Enabled
Kernel execution timeout: Disabled
----- Memory Information for device 0 ---
Total global mem:	11996954624
Total constant Mem:	65536
Max mem pitch:	2147483647
Texture Alignment:	512
----- MP Information for device 0 ---
Multiprocessor count:	13
Shared mem per mp:	49152
Registers per mp:	65536
Threads in warp:	32
Max threads per block:	1024
Max thread dimensions:	(1024, 1024, 64)
Max grid dimensions:	(2147483647, 65535, 65535)
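
The raw numbers in this output are easier to interpret after a unit conversion. The following is a minimal sketch in plain host-side C, assuming (as the CUDA runtime documents) that `totalGlobalMem` is reported in bytes and `clockRate` in kilohertz. The helper names `bytes_to_gib` and `khz_to_mhz` are hypothetical and not part of the CUDA API.

```c
#include <stddef.h>

/* Hypothetical helpers (not part of the CUDA API) for reading the
   query output. Assumes totalGlobalMem is in bytes and clockRate
   is in kilohertz. */

/* Convert a byte count to gibibytes (1 GiB = 2^30 bytes). */
static double bytes_to_gib(unsigned long long bytes) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}

/* Convert a clock rate in kHz to MHz. */
static double khz_to_mhz(int khz) {
    return khz / 1000.0;
}
```

For the Tesla K80 values shown above, `bytes_to_gib(11996954624ULL)` gives roughly 11.17 GiB and `khz_to_mhz(823500)` gives 823.5 MHz.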




## Question 2. [+7]

**Simple CUDA code:** consider this loop for initializing an array **a**:

```c
const int n = 10000000; // 10 million
for (int i = 0; i < n; i++)
    a[i] = (double)i / n;
```

Submit:

1.   The serial implementation running on the CPU.
2.   The CUDA implementation (1 thread per array element).  In your implementation, you will need to ensure:
    1. Memory on the card is correctly allocated (and the host as well)
    2. Data is properly divided up as well as defining the number of blocks in a grid and the number of threads per block
    3. Data is copied back from the card at the end of the kernel launch
    4. Data on the card is freed when you are done.
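
The launch configuration in step 2 comes down to a ceiling division, plus a bounds check inside the kernel for the threads left over in the last block. The sketch below mirrors that arithmetic on the host in plain C so it can be checked without a GPU. The helper names are hypothetical; `global_index` uses the same expression a kernel would write as `blockIdx.x * blockDim.x + threadIdx.x`.

```c
/* Hypothetical host-side helpers that mirror the launch arithmetic. */

/* Number of blocks needed so that blocks * threads_per_block >= n
   (ceiling division). Assumes n > 0 and threads_per_block > 0. */
static int blocks_needed(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Global element index of one thread; the same expression a kernel
   computes as blockIdx.x * blockDim.x + threadIdx.x. */
static long long global_index(int block_idx, int block_dim, int thread_idx) {
    return (long long)block_idx * block_dim + thread_idx;
}

/* Threads in the last block whose index is >= n; the kernel's
   `if (i < n)` guard must make these do nothing. */
static long long idle_threads(int n, int threads_per_block) {
    return (long long)blocks_needed(n, threads_per_block) * threads_per_block - n;
}
```

With n = 10,000,000 and 1024 threads per block, `blocks_needed` returns 9766 and `idle_threads` returns 384, which is why the kernel needs the bounds check.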

In both cases, add code to print the first and last 5 elements of the array to verify your code.  Add a text block commenting on the timing results.

*Note that you need to use the placeholder %.7f to print 7 digits after the decimal point.*
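
As a rough illustration of the verification step, the sketch below prints the first and last `k` elements of an array using the `%.7f` placeholder. The helper name `print_ends` is hypothetical, and the code assumes the array holds at least `2 * k` elements.

```c
#include <stdio.h>

/* Hypothetical helper: print the first and last k elements of
   a[0..n-1] with 7 digits after the decimal point, separated by
   a "..." line. Assumes n >= 2 * k.
   Returns the number of element lines printed. */
static int print_ends(const double *a, int n, int k) {
    int printed = 0;
    for (int i = 0; i < k; i++, printed++)
        printf("a[%d]: %.7f\n", i, a[i]);
    printf("...\n");
    for (int i = n - k; i < n; i++, printed++)
        printf("a[%d]: %.7f\n", i, a[i]);
    return printed;
}
```

Calling `print_ends(a, n, 5)` on the filled array would produce output in the same layout as the sample below.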

***Sample output:***

```
a[0]: 0.0000000
a[1]: 0.0000001
a[2]: 0.0000002
a[3]: 0.0000003
a[4]: 0.0000004
...
a[9999995]: 0.9999995 
a[9999996]: 0.9999996 
a[9999997]: 0.9999997 
a[9999998]: 0.9999998 
a[9999999]: 0.9999999
```

***Note***:  You can find details on timing CUDA code at: https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/ and https://programmerfish.com/profiling-cuda-kernels-and-wrapper-functions/.
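
When comparing the two timings, keep the units straight: `clock()` readings are converted to seconds by dividing by `CLOCKS_PER_SEC`, while `cudaEventElapsedTime` reports milliseconds. The following sketch covers only the host-side unit handling; the helper names are hypothetical. Note that `clock()` measures processor time rather than wall-clock time.

```c
#include <time.h>

/* Hypothetical helpers for reporting timings in seconds. */

/* Elapsed CPU time between two clock() readings, in seconds. */
static double elapsed_seconds(clock_t start, clock_t end) {
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* cudaEventElapsedTime reports milliseconds; convert to seconds. */
static double ms_to_seconds(float ms) {
    return ms / 1000.0;
}
```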

***Marking Guide:***
+2 for the kernel function
+3 for launch configuration and properly calling the kernel
+2 for measuring the time of the parallel and serial code  

### CPU Implementation

Please code and run your CPU implementation in the code block below. When submitting your assignment, please copy the code block into a text/c file.

In [82]:
%%cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 10000000;
    int i = 0;
    int idx = 0;
    clock_t start_t, end_t;  // clock() returns clock_t, not time_t

    double *a = (double*) malloc(sizeof(double) * n);
    if (a == NULL) {
        fprintf(stderr, "Allocation failed\n");
        return 1;
    }

    // Time only the initialization loop.
    start_t = clock();
    for (i = 0; i < n; i++) {
        a[i] = (double)i / n;
    }
    end_t = clock();

    printf("Execution Time: %fs\n", ((double)(end_t - start_t) / CLOCKS_PER_SEC));

    // Print the first 5 elements.
    for (i = 0; i < 5; i++) {
        printf("a[%d]: %.7f\n", i, a[i]);
    }

    // Print the last 5 elements.
    for (i = 5; i > 0; i--) {
        idx = n - i;
        printf("a[%d]: %.7f\n", idx, a[idx]);
    }

    free(a);
    return 0;
}

Execution Time: 0.094922s
a[0]: 0.0000000
a[1]: 0.0000001
a[2]: 0.0000002
a[3]: 0.0000003
a[4]: 0.0000004
a[9999995]: 0.9999995
a[9999996]: 0.9999996
a[9999997]: 0.9999997
a[9999998]: 0.9999998
a[9999999]: 0.9999999



### CUDA Implementation

Please code and run your CUDA implementation in the code block below. When submitting your assignment, please copy the code block into a text/cu file.

In [81]:
%%cu

#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>

// Each thread initializes one element: A[i] = i / N.
__global__ void assignArray(double* A, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {  // the last block may contain threads past the end
        A[i] = (double)i / N;
    }
}

int main(void) {
    const int N = 10000000;
    float time = 0;  // elapsed kernel time in milliseconds
    int i = 0, idx = 0;
    size_t nBytes = (size_t)N * sizeof(double);
    int nthreads = 1024;
    int nblocks = (N - 1) / nthreads + 1;  // ceiling of N / nthreads

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    double* A = (double*) malloc(nBytes);
    double* d_A = NULL;
    if (A == NULL || cudaMalloc(&d_A, nBytes) != cudaSuccess) {
        fprintf(stderr, "Allocation failed\n");
        free(A);
        return 1;
    }

    // No host-to-device copy is needed: the kernel writes every element.
    cudaEventRecord(start);
    assignArray<<<nblocks, nthreads>>>(d_A, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(A, d_A, nBytes, cudaMemcpyDeviceToHost);
    cudaEventElapsedTime(&time, start, stop);
    printf("Execution Time: %fs\n", time / 1000.0);

    for (i = 0; i < 5; i++) {
        printf("a[%d]: %.7f\n", i, A[i]);
    }

    for (i = 5; i > 0; i--) {
        idx = N - i;
        printf("a[%d]: %.7f\n", idx, A[idx]);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(A);
    cudaFree(d_A);
    return 0;
}

Execution Time: 0.000593s
a[0]: 0.0000000
a[1]: 0.0000001
a[2]: 0.0000002
a[3]: 0.0000003
a[4]: 0.0000004
a[9999995]: 0.9999995
a[9999996]: 0.9999996
a[9999997]: 0.9999997
a[9999998]: 0.9999998
a[9999999]: 0.9999999



In [None]:
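Comments on timing results:
The CUDA kernel completed in 0.000593 s, whereas the serial CPU loop completed in 0.094922 s, a speedup of roughly 160x. Note that the GPU figure covers only the kernel itself (the CUDA events bracket the launch); it excludes device memory allocation and the copy back to the host.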

---

**Submission Instructions**

For this assignment, you need to do the following:

1. Save your A6.ipynb file from Colab to GitHub. Make sure that this file is saved (you have a copy) and that it is successfully committed to your repo. Ensure that your code solutions are in your Notebook, as they will be marked from there.
2. Add the PNG file from Q1 and the source code file.

Note that you can resubmit an assignment, but the new submission overwrites the old submission and receives a new timestamp with GitHub.  Make sure to review the due date for this submission. 