# CMDA 3634 SP2024 Parallel Programming Skills Activity
# GPU Standard Deviation
# 50 points
# Instructions
* Complete this Jupyter notebook by writing and testing the requested CUDA code.
* Also, answer the given questions in the markdown cells provided.  
* **Upload your completed .ipynb file to your cmda3634_arc repo on code.vt.edu under the directory PSA05.**
* **Also submit a printed version of your .ipynb file as a .pdf file to Canvas.**

# Academic Integrity

* The use of code from prior sections of the class, from similar classes at other institutions, or from outside sources (**Chegg, Course Hero, GitHub,
Stack Overflow, ChatGPT, rent-a-coder sites, etc.**) is **strictly prohibited**, regardless of how it is obtained.

# Honor Code
* By submitting this assignment, you acknowledge that you have adhered to the Virginia Tech Honor Code and attest to the following:
        
*I have neither given nor received unauthorized assistance on this assignment.  The work I am presenting is ultimately my own.*


# Standard Deviation

* The standard deviation of the numbers $1, \ldots, N$ is given by
$$\sigma = \sqrt{\frac{N^2-1}{12}}$$

* A sequential code to compute the standard deviation of the numbers
$1, \ldots, N$ is given below.  

* To test the sequential version, configure Google Colab to use a CPU runtime (not GPU!).  
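
As a brief check on the closed form for $\sigma$ above, here is a sketch of where it comes from, assuming the population (divide by $N$) definition of variance that the sequential code uses. The mean of $1, \ldots, N$ and the mean of the squares are

$$\mu = \frac{1}{N}\sum_{i=1}^{N} i = \frac{N+1}{2}, \qquad \frac{1}{N}\sum_{i=1}^{N} i^2 = \frac{(N+1)(2N+1)}{6}$$

so the variance is

$$\sigma^2 = \frac{(N+1)(2N+1)}{6} - \frac{(N+1)^2}{4} = \frac{(N+1)\left[(4N+2)-(3N+3)\right]}{12} = \frac{N^2-1}{12}$$

Taking the square root gives $\sigma = \sqrt{(N^2-1)/12}$, which is the value printed next to the computed standard deviation.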




In [None]:
%%writefile std_dev.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef unsigned long long int uint64;

int main (int argc, char** argv) {

    /* get N from the command line */
    if (argc < 2) {
        printf ("Command usage : %s %s\n",argv[0],"N");
        return 1;
    }
    uint64 N = atol(argv[1]);

    /* compute the mean */
    uint64 sum = 0;
    for (uint64 i=1;i<=N;i++) {
        sum += i;
    }
    double mean = 1.0*sum/N;

    /* compute the sum of differences squared */
    double sum_diff_sq = 0;
    for (uint64 i=1;i<=N;i++) {
        sum_diff_sq += (i-mean)*(i-mean);
    }

    /* compute the standard deviation */
    double std_dev = sqrt(sum_diff_sq/N);

    /* print the results */
    printf ("computed std dev is %.1lf",std_dev);
    printf (", sqrt((N^2-1)/12) is %.1lf\n",sqrt((N*N-1)/12.0));

}

Writing std_dev.c


In [None]:
!gcc -o std_dev std_dev.c -lm

In [None]:
!time ./std_dev 100000000

computed std dev is 28867513.5, sqrt((N^2-1)/12) is 28867513.5

real	0m0.642s
user	0m0.636s
sys	0m0.002s


# Part 1 : Complete a CUDA standard deviation version that uses a single thread block.

# 25 points

## For this part the main function is already complete so you just need to write the kernel.  

## You will need to have one thread print the standard deviation from inside the kernel.

### Use a GPU runtime on Google Colab to test your code.  

### Be careful to only use the GPU runtime when you are actively running CUDA code.

### It is possible to temporarily lose your free access to a GPU on Google!

### In particular, when writing code or answering questions you should have your GPU runtime disconnected.  

### You can disconnect your GPU runtime by selecting *Disconnect and delete runtime* from the *Runtime* menu.  

# Question: Explain how you made your kernel code thread safe and parallel efficient.

## Answer:

### Hints: Your kernel should use 3 barriers.  How many times does each thread execute an atomic instruction?

# Question: What is the primary limitation of using a single thread block?  Explain in terms of the GPU hardware (i.e. the SMs, or streaming multiprocessors).

## Answer:

In [None]:
%%writefile gpu_std_dev_v1.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

typedef unsigned long long int uint64;

__global__ void stdevKernel(uint64 N) {

    /*****************************/
    /* add your kernel code here */
    /*****************************/

}

int main(int argc, char **argv) {

    /* get N and num_threads from the command line */
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }

    uint64 N = atol(argv[1]);
    int num_threads = atoi(argv[2]);

    printf ("num_threads = %d\n",num_threads);

    stdevKernel <<< 1, num_threads >>> (N);
    cudaDeviceSynchronize();

}

Writing gpu_std_dev_v1.cu


In [None]:
!nvcc -arch=sm_75 -o gpu_std_dev_v1 gpu_std_dev_v1.cu

# Use $N$ equal to 1 billion to test your code for accuracy.

### Note: Here we are using 128 threads.

In [None]:
!time ./gpu_std_dev_v1 1000000000 128

num_threads = 128
computed std dev is 288675134.6, sqrt((N^2-1)/12) is 288675134.6

real	0m1.688s
user	0m1.339s
sys	0m0.249s


# Use $N$ equal to 10 billion to test your code for speed.

### Note: In order to load the GPU we need to use a large value of $N$.  Unfortunately, for $N$ equal to 10 billion the intermediate integer calculations overflow the 64-bit unsigned integer type `uint64` (both the sum $N(N+1)/2 \approx 5 \times 10^{19}$ and the product $N^2 = 10^{20}$ exceed $2^{64} \approx 1.8 \times 10^{19}$), so the answers output are not correct!

In [None]:
!time ./gpu_std_dev_v1 10000000000 128

num_threads = 128
computed std dev is 4684509367.1, sqrt((N^2-1)/12) is 804481180.2

real	0m12.049s
user	0m11.415s
sys	0m0.246s


# Part 2 : Complete a CUDA standard deviation version that uses multiple thread blocks.

# 25 points

## For this part you will need to write two kernels.  The interfaces to the kernels are provided below.

## Note: In each kernel, each thread calculates the sum of $T$ terms.  

## In addition you will have to finish writing the main function which is partially provided below.  

# Question: Explain how you made your kernel code thread safe and parallel efficient.

## Answer:

# Question: About how many times faster is your version 2 kernel (with $T$ equal to 1000) than your version 1 kernel for $N$ equal to 10 billion?

## Answer:

# Question: When you run your version 2 with $N$ equal to 1 billion and $T$ equal to 1, the runtime is actually longer than the version 1 runtime with the same value of $N$!  Explain why version 2 is slower than version 1 when $T$ is equal to 1 despite the fact that version 2 is using every SM and version 1 is only using 1 SM.  

## Answer:


In [None]:
%%writefile gpu_std_dev_v2.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

typedef unsigned long long int uint64;

__global__ void sumKernel(uint64 N, uint64 T, uint64* sum) {

    /*****************************/
    /* add your kernel code here */
    /*****************************/

}

__global__ void sumdiffsqKernel(uint64 N, uint64 T, double mean, double* sum_diff_sq) {

    /*****************************/
    /* add your kernel code here */
    /*****************************/

}

int main (int argc, char** argv) {

    /* get N, T, and B from the command line */
    /* T is the number of terms per thread */
    /* B is the number of threads per block */
    /* we typically choose B to be a multiple of 32 */
    /* the maximum value of B is 1024 */
    if (argc < 4) {
        printf ("Command usage : %s %s %s %s\n",argv[0],"N","T","B");
        return 1;
    }
    uint64 N = atol(argv[1]);
    uint64 T = atol(argv[2]);
    int B = atoi(argv[3]);

    /* G is the number of thread blocks */
    /* the maximum number of thread blocks G is 2^31 - 1 = 2147483647 */
    /* We choose G to be the minimum number of thread blocks to have at least N/T threads */

    /***********************************/
    /* add your code to compute G here */
    int G;
    /***********************************/

    printf ("N = %llu\n",N);
    printf ("terms per thread T = %llu\n",T);
    printf ("threads per block B = %d\n",B);
    printf ("number of thread blocks G = %d\n",G);
    printf ("number of threads G*B = %d\n",G*B);

    /***************************/
    /* add your host code here */
    double std_dev;
    /***************************/

    /* output the results */
    printf ("computed std dev is %.1lf",std_dev);
    printf (", sqrt((N^2-1)/12) is %.1lf\n",sqrt((N*N-1)/12.0));

    /*************************************/
    /* add your code to free memory here */
    /*************************************/

}

Writing gpu_std_dev_v2.cu


In [None]:
!nvcc -arch=sm_75 -o gpu_std_dev_v2 gpu_std_dev_v2.cu

# Use $N$ equal to 1 billion to test your code for accuracy.

### Note: Here we are using T=1000 and B=128

In [None]:
!time ./gpu_std_dev_v2 1000000000 1000 128

N = 1000000000
terms per thread T = 1000
threads per block B = 128
number of thread blocks G = 7813
number of threads G*B = 1000064
standard deviation (using formula) = 288675134.59
computed std dev is 288675134.6, sqrt((N^2-1)/12) is 288675134.6

real	0m0.320s
user	0m0.090s
sys	0m0.216s


# Use $N$ equal to 10 billion to test your code for speed.

### Note: In order to load the GPU we need to use a large value of $N$.  Unfortunately, for $N$ equal to 10 billion the intermediate integer calculations overflow the 64-bit unsigned integer type `uint64` (both the sum $N(N+1)/2 \approx 5 \times 10^{19}$ and the product $N^2 = 10^{20}$ exceed $2^{64} \approx 1.8 \times 10^{19}$), so the answers output are not correct!

In [None]:
!time ./gpu_std_dev_v2 10000000000 1000 128

N = 10000000000
terms per thread T = 1000
threads per block B = 128
number of thread blocks G = 78125
number of threads G*B = 10000000
standard deviation (using formula) = 804481180.19
computed std dev is 4684509367.1, sqrt((N^2-1)/12) is 804481180.2

real	0m0.685s
user	0m0.461s
sys	0m0.212s


# Use $N$ equal to 1 billion and $T$ equal to 1.  

In [None]:
!time ./gpu_std_dev_v2 1000000000 1 128

N = 1000000000
terms per thread T = 1
threads per block B = 128
number of thread blocks G = 7812500
number of threads G*B = 1000000000
standard deviation (using formula) = 288675134.59
computed std dev is 288675134.6, sqrt((N^2-1)/12) is 288675134.6

real	0m3.126s
user	0m2.935s
sys	0m0.122s
