
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/DD2360/Assignment2/question1

/content/drive/MyDrive/DD2360/Assignment2/question1


In [None]:
%%writefile vectorAdd.cu
#include <math.h>      // fabs
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>  // gettimeofday

#define DataType double

// Vector addition kernel
__global__ void vecAdd(DataType *in1, DataType *in2, DataType *out, int len) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < len) {
    out[index] = in1[index] + in2[index];
  }
}

// Return the current wall-clock time in seconds.
double CPUtimer() {
  struct timeval ti;
  gettimeofday(&ti, NULL);
  return ((double)ti.tv_sec + (double)ti.tv_usec * 1e-6);
}


int main(int argc, char **argv) {

  int inputLength;
  DataType *hostInput1;
  DataType *hostInput2;
  DataType *hostOutput;
  DataType *resultRef;
  DataType *deviceInput1;
  DataType *deviceInput2;
  DataType *deviceOutput;
  bool flag = true;
  double start,end,duration;

  //@@ Insert code below to read in inputLength from args
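  if (argc != 2) {
    fprintf(stderr, "Usage: %s <inputLength>\n", argv[0]);
    exit(EXIT_FAILURE);
  }
  inputLength = atoi(argv[1]);
  if (inputLength <= 0) {
    fprintf(stderr, "inputLength must be a positive integer\n");
    exit(EXIT_FAILURE);
  }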
  printf("The input length is %d\n", inputLength);

  //@@ Insert code below to allocate Host memory for input and output
  hostInput1 = (DataType *)malloc(inputLength * sizeof(DataType));
  hostInput2 = (DataType *)malloc(inputLength * sizeof(DataType));
  hostOutput = (DataType *)malloc(inputLength * sizeof(DataType));
  resultRef = (DataType *)malloc(inputLength * sizeof(DataType));

  //@@ Insert code below to initialize hostInput1 and hostInput2 to random numbers, and create reference result in CPU
  for (int i = 0; i < inputLength; ++i) {
    hostInput1[i] = (DataType)rand() / RAND_MAX;
    hostInput2[i] = (DataType)rand() / RAND_MAX;
    resultRef[i] = hostInput1[i] + hostInput2[i];
  }

  //@@ Insert code below to allocate GPU memory here
  cudaMalloc((void **)&deviceInput1, inputLength * sizeof(DataType));
  cudaMalloc((void **)&deviceInput2, inputLength * sizeof(DataType));
  cudaMalloc((void **)&deviceOutput, inputLength * sizeof(DataType));

  //@@ Insert code below to copy memory to the GPU here
  start = CPUtimer();
  cudaMemcpy(deviceInput1, hostInput1, inputLength * sizeof(DataType), cudaMemcpyHostToDevice);
  cudaMemcpy(deviceInput2, hostInput2, inputLength * sizeof(DataType), cudaMemcpyHostToDevice);
  end = CPUtimer();
  duration = end - start;
  printf("time of data copy from host to device: %f.\n", duration);


  //@@ Initialize the 1D grid and block dimensions here
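  int threadsPerBlock = 256;
  // Round up so that every element is covered by at least one thread.
  int blocksPerGrid = (inputLength + threadsPerBlock - 1) / threadsPerBlock;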

  //@@ Launch the GPU Kernel here
  start = CPUtimer();
  vecAdd<<<blocksPerGrid, threadsPerBlock>>>(deviceInput1, deviceInput2, deviceOutput, inputLength);
  cudaDeviceSynchronize(); // Wait for the GPU to finish
  end = CPUtimer();
  duration = end - start;
  printf("time of the CUDA kernel: %f.\n", duration);

  //@@ Copy the GPU memory back to the CPU here
  start = CPUtimer();
  cudaMemcpy(hostOutput, deviceOutput, inputLength * sizeof(DataType), cudaMemcpyDeviceToHost);
  end = CPUtimer();
  duration = end - start;
  printf("time of data copy from device to host: %f.\n", duration);

  //@@ Insert code below to compare the output with the reference
  for (int i = 0; i < inputLength; ++i) {
    if (fabs(hostOutput[i] - resultRef[i]) > 1e-5) {
      printf("Mismatch at index %d: Host %f, GPU %f\n", i, resultRef[i], hostOutput[i]);
      flag = false;
      break;
    }
  }

  if (flag) {
    printf("Two vectors are the same\n");
  }

  //@@ Free the GPU memory here
  cudaFree(deviceInput1);
  cudaFree(deviceInput2);
  cudaFree(deviceOutput);

  //@@ Free the CPU memory here
  free(hostInput1);
  free(hostInput2);
  free(hostOutput);
  free(resultRef);

  return 0;
}


Overwriting vectorAdd.cu
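
The launch configuration in `vectorAdd.cu` has to supply at least one thread per element, and the `index < len` guard in `vecAdd` then masks off any surplus threads. As a minimal host-side sketch of that arithmetic (plain C++, no GPU needed), the helpers `blocksFor`, `idleThreads` and `reportLaunch` below are hypothetical names that are not part of the original file. They compute the smallest sufficient grid and the number of threads the guard would mask:

```cpp
#include <cstdio>

// Smallest number of blocks such that blocks * threadsPerBlock >= len
// (integer ceiling division). Hypothetical helper, not in vectorAdd.cu.
int blocksFor(int len, int threadsPerBlock) {
  return (len + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of launched threads that fail the `index < len` guard.
// Returns 0 when the grid is too small to cover len.
long long idleThreads(int len, int blocks, int threadsPerBlock) {
  long long launched = (long long)blocks * threadsPerBlock;
  return launched > len ? launched - len : 0;
}

// Print the minimal launch shape for one input length.
void reportLaunch(int len, int threadsPerBlock) {
  int blocks = blocksFor(len, threadsPerBlock);
  printf("len=%d blocks=%d idle=%lld\n", len, blocks,
         idleThreads(len, blocks, threadsPerBlock));
}
```

Calling `reportLaunch(131070, 256)` would report 512 blocks with 2 idle threads. By comparison, a fixed grid of 1000 blocks of 256 threads would launch 256000 threads regardless of the input length, so `idleThreads(1024, 1000, 256)` gives 254976.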


In [None]:
!nvcc vectorAdd.cu
!ls

a.out  Assignment2_question1.ipynb  vectorAdd.cu


In [None]:
!./a.out 1024

The input length is 1024
time of data copy from host to device: 0.000326.
time of the CUDA kernel: 0.000033.
time of data copy from device to host: 0.000025.
Two vectors are the same

In [None]:
!ncu --set default --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./a.out 1024

The input length is 1024
==PROF== Connected to process 2985 (/content/drive/MyDrive/DD2360/Assignment2/question1/a.out)
time of data copy from host to device: 0.000334.
==PROF== Profiling "vecAdd" - 0: 0%....50%....100% - 8 passes
time of the CUDA kernel: 0.498671.
time of data copy from device to host: 0.000062.
==PROF== Disconnected from process 2985
Two vectors are the same[2985] a.out@127.0.0.1
  vecAdd(double *, double *, double *, int), 2023-Nov-26 15:27:38, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          53.44
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: GPU Speed Of Light Throughput
    ------------------------------------------------------

In [None]:
!./a.out 131070

The input length is 131070
time of data copy from host to device: 0.000920.
time of the CUDA kernel: 0.000081.
time of data copy from device to host: 0.000805.
Two vectors are the same

In [None]:
!ncu --set default --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./a.out 131070

The input length is 131070
==PROF== Connected to process 3067 (/content/drive/MyDrive/DD2360/Assignment2/question1/a.out)
time of data copy from host to device: 0.000977.
==PROF== Profiling "vecAdd" - 0: 0%....50%....100% - 8 passes
time of the CUDA kernel: 0.298360.
time of data copy from device to host: 0.000936.
==PROF== Disconnected from process 3067
Two vectors are the same[3067] a.out@127.0.0.1
  vecAdd(double *, double *, double *, int), 2023-Nov-26 15:27:51, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          76.79
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: GPU Speed Of Light Throughput
    ----------------------------------------------------

In [None]:
!./a.out 2048

The input length is 2048
time of data copy from host to device: 0.000345.
time of the CUDA kernel: 0.000030.
time of data copy from device to host: 0.000032.
Two vectors are the same

In [None]:
!./a.out 4096

The input length is 4096
time of data copy from host to device: 0.000375.
time of the CUDA kernel: 0.000038.
time of data copy from device to host: 0.000049.
Two vectors are the same

In [None]:
!./a.out 8192

The input length is 8192
time of data copy from host to device: 0.000349.
time of the CUDA kernel: 0.000047.
time of data copy from device to host: 0.000076.
Two vectors are the same

In [None]:
!./a.out 16384

The input length is 16384
time of data copy from host to device: 0.000359.
time of the CUDA kernel: 0.000035.
time of data copy from device to host: 0.000121.
Two vectors are the same

In [None]:
!./a.out 32768

The input length is 32768
time of data copy from host to device: 0.000441.
time of the CUDA kernel: 0.000037.
time of data copy from device to host: 0.000239.
Two vectors are the same

In [None]:
!./a.out 65536

The input length is 65536
time of data copy from host to device: 0.000619.
time of the CUDA kernel: 0.000040.
time of data copy from device to host: 0.000438.
Two vectors are the same
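

The copy timings above grow with the input length, and dividing the bytes moved by the elapsed time gives a rough effective transfer rate. The sketch below is plain host-side C++. The names `copyBytes`, `effectiveBandwidthGBps` and `reportCopy` are hypothetical helpers, and the sketch assumes that the printed times are in seconds (as `CPUtimer` returns them) and that each element is an 8-byte `double`:

```cpp
#include <cstddef>
#include <cstdio>

// Bytes moved when copying `vectors` arrays of `len` doubles each.
// Hypothetical helper: vectors = 2 for the host-to-device step
// (two inputs), vectors = 1 for the device-to-host step (one output).
std::size_t copyBytes(std::size_t len, std::size_t vectors) {
  return len * vectors * sizeof(double);
}

// Effective transfer rate in GB/s, taking 1 GB as 1e9 bytes.
// `seconds` is the elapsed wall-clock time of the copy.
double effectiveBandwidthGBps(std::size_t bytes, double seconds) {
  return static_cast<double>(bytes) / seconds / 1e9;
}

// Print the effective rate for one measured copy.
void reportCopy(const char *label, std::size_t len, std::size_t vectors,
                double seconds) {
  printf("%s: %.2f GB/s\n", label,
         effectiveBandwidthGBps(copyBytes(len, vectors), seconds));
}
```

For the 65536-element run, the host-to-device step moves 2 × 65536 × 8 = 1048576 bytes in 0.000619 s, which is roughly 1.69 GB/s. At these sizes the figure is probably dominated by fixed per-call overhead rather than by the bus, so it is better read as a lower bound than as a measurement of peak bandwidth.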