Massive parallel programming on GPUs and applications, by Lokman ABBAS TURKI  

# 3. Add arrays

## 3.1 Objective

The main purpose of this lab is to familiarize students with the CUDA API through the implementation of vector addition. Students will gain hands-on experience by writing both the GPU kernel and the corresponding host code for array addition. The host code for allocating memory on the GPU and for copying data between the CPU and the GPU will serve as an example for future labs. We also introduce the use of unified memory.

Students are encouraged to use the CUDA documentation, enabling them to discover:

1) the specifications of CUDA API functions in the [CUDA_Runtime_API](https://docs.nvidia.com/cuda/cuda-runtime-api/index.html);
2) examples of how to use these functions in the [CUDA_C_Programming_Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html).

## 3.2 Content

Your directory must contain the file Add.cu as well as the header file timer.h.

Compile Add.cu using

In [7]:
!nvcc Add.cu -o ADD

Execute ADD using (on a Windows machine, ./ is not needed)

In [3]:
!./ADD

 ( 1410064908 ): 1410064908
 ( 1410064918 ): 1410064918
 ( 1410064928 ): 1410064928
 ( 1410064938 ): 1410064938
 ( 1410064948 ): 1410064948
CPU Timer for the addition on the CPU of vectors: 5.008873 s
Kernel execution time: 0.108118 seconds


As long as you have not added any instruction to the file Add.cu, the execution above should print

( 999999500 ): 999999500
( 999999510 ): 999999510
( 999999520 ): 999999520
( 999999530 ): 999999530
( 999999540 ): 999999540
CPU Timer for the addition on the CPU of vectors: 0.126

Of course, the execution time varies from one run to the next and depends on the host's performance.

### 3.2.1 Addition operation on the device with explicit data transfer and CPU timer

a) Allocate aGPU, bGPU, cGPU on the GPU using cudaMalloc.

b) Transfer the values of a, b to aGPU, bGPU using cudaMemcpy.

c) Write the kernel addVect_k that adds aGPU to bGPU and stores the result in cGPU.

d) Transfer the values of cGPU to c using cudaMemcpy.

e) Free the GPU memory using cudaFree.

f) Call the kernel addVect_k instead of the function addVect.

g) Do not forget to use cudaDeviceSynchronize after calling the kernel.
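
Steps a) to g) can be sketched as a single host function. This is a minimal sketch under stated assumptions, not the reference solution: the wrapper name `addVectOnDevice`, the `CUDA_CHECK` macro, and the block size of 256 threads are introduced here for illustration, and the `int` element type is taken from the kernel signature that appears in the profiler output of this lab.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                           \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
              cudaGetErrorString(err_), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

// c) One thread per element; the guard handles n not divisible by blockDim.x.
__global__ void addVect_k(int *a, int *b, int *c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

// Adds the host arrays a and b into the host array c using the GPU.
void addVectOnDevice(const int *a, const int *b, int *c, int n) {
  size_t bytes = (size_t)n * sizeof(int);
  int *aGPU = nullptr, *bGPU = nullptr, *cGPU = nullptr;

  // a) Allocate device memory.
  CUDA_CHECK(cudaMalloc(&aGPU, bytes));
  CUDA_CHECK(cudaMalloc(&bGPU, bytes));
  CUDA_CHECK(cudaMalloc(&cGPU, bytes));

  // b) Copy the inputs from host to device.
  CUDA_CHECK(cudaMemcpy(aGPU, a, bytes, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemcpy(bGPU, b, bytes, cudaMemcpyHostToDevice));

  // f) Launch enough blocks to cover all n elements (256 is an assumed size).
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  addVect_k<<<blocks, threads>>>(aGPU, bGPU, cGPU, n);
  CUDA_CHECK(cudaGetLastError());       // reports launch configuration errors

  // g) Wait for the kernel to finish.
  CUDA_CHECK(cudaDeviceSynchronize());

  // d) Copy the result back to the host.
  CUDA_CHECK(cudaMemcpy(c, cGPU, bytes, cudaMemcpyDeviceToHost));

  // e) Release device memory.
  CUDA_CHECK(cudaFree(aGPU));
  CUDA_CHECK(cudaFree(bGPU));
  CUDA_CHECK(cudaFree(cGPU));
}
```

Inside main, a call such as `addVectOnDevice(a, b, c, n)` would then take the place of the CPU call to addVect, where `n` stands for whatever length variable Add.cu actually uses.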


### 3.2.2 A Few Optimizations and GPU timer

a) Replace the CPU timer with a GPU timer based on cudaEvent (which reports times in milliseconds).

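b) Check for yourself that using threadIdx.x*gridDim.x + blockIdx.x to access global memory is a very bad choice: consecutive threads of a block then access addresses that are gridDim.x elements apart, so the accesses cannot be coalesced.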

c) What happens if you measure the execution time of both the call to addVect_k and the data transfers?

d) Profile your code further using: !nvprof ./ADD

e) Fix the slow transfer of a, b to aGPU, bGPU by initializing the arrays directly on the device. The host arrays a and b can then be removed.

f) Allocate aGPU, bGPU, and cGPU in unified memory using cudaMallocManaged. The host array c can then be removed as well.

g) What happens if we keep the initialization on the host but place all arrays in unified memory?

h) You can speed up solution g) by calling cudaMemPrefetchAsync on a and b before launching addVect_k.
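
Items a), f), g), and h) can be combined as sketched below: the arrays live in unified memory, are initialized on the host, are prefetched to the GPU, and the kernel is timed with CUDA events. This is only a sketch under assumptions. The function name `addVectManaged`, the `ok` flag, the test values `i` and `2 * i`, and the block size of 256 are hypothetical. The prefetch call uses the four-argument form of cudaMemPrefetchAsync (pointer, size, destination device, stream); newer toolkits may expect a different signature, so check the Runtime API reference for your CUDA version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addVect_k(int *a, int *b, int *c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

// Adds two managed arrays on the GPU and returns the kernel time in ms
// (-1.0f on failure). *ok reports whether every c[i] equals a[i] + b[i].
float addVectManaged(int n, bool *ok) {
  *ok = false;
  size_t bytes = (size_t)n * sizeof(int);
  int *a = nullptr, *b = nullptr, *c = nullptr;

  // f) Unified memory: the same pointers are valid on the host and the device.
  if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
      cudaMallocManaged(&b, bytes) != cudaSuccess ||
      cudaMallocManaged(&c, bytes) != cudaSuccess) {
    cudaFree(a); cudaFree(b); cudaFree(c);   // cudaFree(nullptr) is a no-op
    return -1.0f;
  }

  // g) Initialization on the host (assumed test values).
  for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

  // h) Ask the driver to migrate the inputs before the kernel needs them.
  //    Assumption: four-argument signature. The call is treated as a hint,
  //    so a failure (e.g. on systems without prefetch support) is cleared.
  int device = 0;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(a, bytes, device, 0);
  cudaMemPrefetchAsync(b, bytes, device, 0);
  cudaGetLastError();

  // a) GPU timer: events are recorded in the same stream as the kernel,
  //    after the prefetches, so the migration is not part of the measurement.
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  cudaEventRecord(start, 0);
  addVect_k<<<blocks, threads>>>(a, b, c, n);
  cudaEventRecord(stop, 0);

  // Waiting on stop also guarantees that c is ready to be read on the host.
  float ms = -1.0f;
  if (cudaEventSynchronize(stop) == cudaSuccess) {
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    *ok = true;
    for (int i = 0; i < n; i++) {
      if (c[i] != a[i] + b[i]) { *ok = false; break; }
    }
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(a); cudaFree(b); cudaFree(c);
  return ms;
}
```

Commenting out the two prefetch calls gives the situation of item g), where pages are typically migrated on demand while the kernel runs; comparing the two timings shows what the prefetch buys on a given machine.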

In [9]:
!nvprof ./ADD

 ( 1410064908 ): 1410064908
 ( 1410064918 ): 1410064918
 ( 1410064928 ): 1410064928
 ( 1410064938 ): 1410064938
 ( 1410064948 ): 1410064948
CPU Timer for the addition on the CPU of vectors: 4.998796 s
==30737== NVPROF is profiling process 30737, command: ./ADD
Kernel execution time: 0.022129 seconds
==30737== Profiling application: ./ADD
==30737== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   76.68%  695.29us         1  695.29us  695.29us  695.29us  initVect_k(int*, int*, int)
                   23.32%  211.49us         1  211.49us  211.49us  211.49us  addVect_k(int*, int*, int*, int)
      API calls:   83.76%  117.55ms         2  58.774ms     630ns  117.55ms  cudaEventCreate
                   14.65%  20.563ms         3  6.8544ms  49.580us  20.451ms  cudaMallocManaged
                    0.64%  903.09us         1  903.09us  903.09us  903.09us  cudaDeviceSynchronize
                    0.46%  644.85us         2  3