# CUDA Exercise 06
> Another approach to parallelized vector addition.

This Jupyter notebook can also be opened in Google Colab, so you don't need a PC with a graphics card to experiment with CUDA. To launch it in Google Colab, click the badge below.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg#left)](https://colab.research.google.com/github/SuperChange001/CUDA_Learning/blob/main/Solution/Exercise_06.ipynb)

## Initialize the CUDA dev environment

In [1]:
# Install and load the nvcc4jupyter plugin (the older install method is kept below, commented out)
# !pip install git+git://github.com/depctg/nvcc4jupyter.git
# %load_ext nvcc_plugin
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
# Check the environment
!lsb_release -a
!nvcc --version
!nvidia-smi

Collecting nvcc4jupyter
  Downloading nvcc4jupyter-1.2.1-py3-none-any.whl.metadata (5.1 kB)
Downloading nvcc4jupyter-1.2.1-py3-none-any.whl (10 kB)
Installing collected packages: nvcc4jupyter
Successfully installed nvcc4jupyter-1.2.1
Detected platform "Colab". Running its setup...
Source files will be saved in "/tmp/tmpddfr6ih1".
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Sat Jun  7 03:17:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persiste

## Vector Add with Multiple Threads across Blocks

In [2]:
%%writefile verctor_add_multi_blocks_thread.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#define VECTOR_LENGTH 10000
#define MAX_ERR 1e-4

__global__ void vector_add(float *out, float *a, float *b, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if(tid<n)
    {
        out[tid] = a[tid] + b[tid];
    }
}

int main(int argc, char *argv[])
{
  float *a, *b, *out;
  float *d_a, *d_b, *d_out;
  int list_of_test_block_size[]={1,64,128,256,512,1024};
  int block_size = 1;

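  if( argc == 2 ) {
    // argv[0] is the program name; argv[1] is an index into list_of_test_block_size
    int arg1 = atoi(argv[1]);  // atoi = ASCII to int
    int n_sizes = sizeof(list_of_test_block_size) / sizeof(list_of_test_block_size[0]);
    // Keep the default block size if the index is out of range
    if( arg1 >= 0 && arg1 < n_sizes ) block_size = list_of_test_block_size[arg1];
  }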
  else if( argc > 2 ) {
    printf("Too many arguments supplied.\n");
  }
  else {
    printf("One argument expected.\n");

  }

  printf("The Block size is %d.\n", block_size);


  // Allocate memory on CPU
  a = (float*)malloc(sizeof(float) * VECTOR_LENGTH);
  b = (float*)malloc(sizeof(float) * VECTOR_LENGTH);
  out = (float*)malloc(sizeof(float) * VECTOR_LENGTH);

  // Initialize the input data
  for(int i = 0; i < VECTOR_LENGTH; i++)
  {
      a[i] = 3.0f;
      b[i] = 0.14f;
  }

  // Allocate memory on GPU
  cudaMalloc((void**)&d_a, sizeof(float) * VECTOR_LENGTH);
  cudaMalloc((void**)&d_b, sizeof(float) * VECTOR_LENGTH);
  cudaMalloc((void**)&d_out, sizeof(float) * VECTOR_LENGTH);

  // Copy the operands to the GPU
  cudaMemcpy(d_a, a, sizeof(float) * VECTOR_LENGTH, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, sizeof(float) * VECTOR_LENGTH, cudaMemcpyHostToDevice);

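  for(int i=0;i<100;i++)
  {
    // Launch the kernel repeatedly; launches are asynchronous, so the CPU does not wait here.
    // Round the grid size up so that every element is covered by a thread.
    int grid_size = ((VECTOR_LENGTH + block_size - 1) / block_size);
    vector_add<<<grid_size,block_size>>>(d_out, d_a, d_b, VECTOR_LENGTH);
  }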
  // Get results from the GPU
  cudaMemcpy(out, d_out, sizeof(float) * VECTOR_LENGTH,
              cudaMemcpyDeviceToHost);

  // Test the result
  for(int i = 0; i < VECTOR_LENGTH; i++){
      assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
  }
  printf("out[0] = %f\n", out[0]);
  printf("PASSED\n");

  // Free the memory
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_out);
  free(a);
  free(b);
  free(out);
}

Writing verctor_add_multi_blocks_thread.cu
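
The listing above never inspects the return codes of the CUDA runtime calls or of the kernel launch. If a launch fails, `d_out` is never written and the later `assert` compares uninitialized memory, which is one plausible explanation (an assumption, not something the source confirms) for an assertion failure combined with "No kernels were profiled". The environment output above reports toolkit release 12.5 next to a driver CUDA version of 12.4, and a mismatch of that kind is one possible trigger. The sketch below shows how such failures could be surfaced. `CUDA_CHECK` and `blocks_for` are hypothetical helpers introduced for illustration; `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are standard CUDA runtime calls.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper, not part of the original listing: print the CUDA
// error string and exit when a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                                 \
  do {                                                                   \
    cudaError_t err_ = (call);                                           \
    if (err_ != cudaSuccess) {                                           \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,   \
              cudaGetErrorString(err_));                                 \
      exit(EXIT_FAILURE);                                                \
    }                                                                    \
  } while (0)

__global__ void vector_add(float *out, const float *a, const float *b, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {  // threads past the end of the vector do nothing
        out[tid] = a[tid] + b[tid];
    }
}

// Hypothetical helper: smallest number of blocks whose threads cover n elements.
static int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

int main(void)
{
    const int n = 10000;
    const int block_size = 256;
    const size_t bytes = sizeof(float) * n;

    float *a = (float*)malloc(bytes);
    float *b = (float*)malloc(bytes);
    float *out = (float*)malloc(bytes);
    if (!a || !b || !out) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; i++) { a[i] = 3.0f; b[i] = 0.14f; }

    float *d_a, *d_b, *d_out;
    CUDA_CHECK(cudaMalloc((void**)&d_a, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_b, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice));

    vector_add<<<blocks_for(n, block_size), block_size>>>(d_out, d_a, d_b, n);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports failures during execution

    CUDA_CHECK(cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost));
    printf("out[0] = %f\n", out[0]);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    free(a);
    free(b);
    free(out);
    return 0;
}
```

With checks like these, a failed launch stops the program with a readable error message instead of reaching the result comparison. If the message points to an unsupported toolchain or architecture, compiling for the installed GPU's architecture with nvcc's `-arch` option may help; the correct value depends on the GPU and is not assumed here.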


## Evaluation

Measure the execution time of the CUDA kernel with nvprof.
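
Kernel time can also be measured from inside the program with CUDA events, which is useful when an external profiler is unavailable. The following is a minimal sketch under stated assumptions, not the method used in this exercise: `time_vector_add_ms` is a hypothetical helper, the inputs are zero-filled because only timing matters, and the block size of 256 is an arbitrary choice.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vector_add(float *out, const float *a, const float *b, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = a[tid] + b[tid];
    }
}

// Hypothetical helper: average time per kernel launch in milliseconds,
// or -1.0f if any CUDA call fails (for example, when no GPU is available).
static float time_vector_add_ms(int n, int block_size, int repeats)
{
    const size_t bytes = sizeof(float) * n;
    const int grid_size = (n + block_size - 1) / block_size;
    float *d_a = NULL, *d_b = NULL, *d_out = NULL;
    cudaEvent_t start = NULL, stop = NULL;
    float total_ms = 0.0f;

    bool ok = cudaMalloc((void**)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void**)&d_b, bytes) == cudaSuccess
           && cudaMalloc((void**)&d_out, bytes) == cudaSuccess
           && cudaMemset(d_a, 0, bytes) == cudaSuccess
           && cudaMemset(d_b, 0, bytes) == cudaSuccess
           && cudaEventCreate(&start) == cudaSuccess
           && cudaEventCreate(&stop) == cudaSuccess;

    if (ok) {
        cudaEventRecord(start);            // timestamp before the launches
        for (int i = 0; i < repeats; i++) {
            vector_add<<<grid_size, block_size>>>(d_out, d_a, d_b, n);
        }
        cudaEventRecord(stop);             // timestamp after the launches
        cudaEventSynchronize(stop);        // wait until the GPU reaches 'stop'
        ok = cudaGetLastError() == cudaSuccess
          && cudaEventElapsedTime(&total_ms, start, stop) == cudaSuccess;
    }

    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(d_a);                         // cudaFree(NULL) is a no-op
    cudaFree(d_b);
    cudaFree(d_out);
    return ok ? total_ms / repeats : -1.0f;
}

int main(void)
{
    float ms = time_vector_add_ms(10000, 256, 100);
    if (ms < 0.0f) {
        printf("timing failed: a CUDA call returned an error\n");
        return 1;
    }
    printf("average time per launch: %f ms\n", ms);
    return 0;
}
```

The first launch typically includes one-time setup cost, so averaging over many launches, as the exercise's loop of 100 launches does, gives a steadier number than timing a single call.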

In [3]:
!nvcc -o verctor_add_multi_blocks_thread verctor_add_multi_blocks_thread.cu
!nvprof ./verctor_add_multi_blocks_thread 0
!nvprof ./verctor_add_multi_blocks_thread 1
!nvprof ./verctor_add_multi_blocks_thread 2
!nvprof ./verctor_add_multi_blocks_thread 3

The Block size is 1.
==779== NVPROF is profiling process 779, command: ./verctor_add_multi_blocks_thread 0
verctor_add_multi_blocks_thread: verctor_add_multi_blocks_thread.cu:77: int main(int, char**): Assertion `fabs(out[i] - a[i] - b[i]) < MAX_ERR' failed.
==779== Profiling application: ./verctor_add_multi_blocks_thread 0
==779== Profiling result:
No kernels were profiled.
No API activities were profiled.
The Block size is 64.
==794== NVPROF is profiling process 794, command: ./verctor_add_multi_blocks_thread 1
verctor_add_multi_blocks_thread: verctor_add_multi_blocks_thread.cu:77: int main(int, char**): Assertion `fabs(out[i] - a[i] - b[i]) < MAX_ERR' failed.
==794== Profiling application: ./verctor_add_multi_blocks_thread 1
==794== Profiling result:
No kernels were profiled.
No API activities were profiled.
The Block size is 128.
==805== NVPROF is profiling process 805, command: ./verctor_add_multi_blocks_thread 2
verctor_add_multi_blocks_thread: verctor_add_multi_blocks_thread.cu: