# 00 Vector Add V1

**FreeCodeCamp CUDA Course - Module 5**

Original Course: [https://www.youtube.com/watch?v=86FAWCzIe_4](https://www.youtube.com/watch?v=86FAWCzIe_4)

Source File: `00_vector_add_v1.cu`

---

## Overview

Learn how to perform parallel vector addition on the GPU.

---

## Learning Objectives

By the end of this notebook, you will:

1. Understand CUDA kernel syntax and execution
2. Learn GPU memory allocation and data transfer

---

## Setup: Google Colab GPU

First, make sure a GPU is enabled in Colab:

1. Go to **Runtime** → **Change runtime type**
2. Select **T4 GPU** as Hardware accelerator
3. Click **Save**

Let's verify CUDA is available:

In [None]:
# Check GPU availability
!nvidia-smi

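Now install the nvcc4jupyter plugin to compile CUDA code inline. The cells below use the `%%cu` magic; newer plugin releases may register it as `%%cuda` instead, so adjust the first line of the cell if the magic is not found: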

In [None]:
# Install nvcc4jupyter for inline CUDA compilation
!pip install nvcc4jupyter -q
%load_ext nvcc4jupyter

---

## Key Concepts

- **Kernel Function**: Uses the `__global__` qualifier for GPU execution
- **Device Memory**: Allocated using `cudaMalloc`
- **Data Transfer**: Uses `cudaMemcpy` between host and device
- **Kernel Launch**: Syntax `kernel<<<blocks, threads>>>(...)`
- **Synchronization**: `cudaDeviceSynchronize()` waits for GPU completion

---

## CUDA Implementation
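Before the full benchmark, the five concepts above can be seen together in a much smaller program. The following is a minimal sketch, not part of the original source file: the kernel name `add_small` and the single-block launch of 8 threads are illustrative assumptions, and the input values are the `A` and `B` from the example comment in the implementation.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel function: __global__ marks code that runs on the GPU.
// (add_small is a hypothetical name used only for this sketch.)
__global__ void add_small(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // surplus threads do nothing
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 5;
    float h_a[n] = {1, 2, 3, 4, 5};
    float h_b[n] = {6, 7, 8, 9, 10};
    float h_c[n] = {0};
    size_t size = n * sizeof(float);

    // Device memory: allocated with cudaMalloc
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Data transfer: host -> device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Kernel launch: 1 block of 8 threads; only the first 5 pass the bounds check
    add_small<<<1, 8>>>(d_a, d_b, d_c, n);

    // Synchronization: wait for the GPU to finish
    cudaDeviceSynchronize();

    // Data transfer: device -> host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) {
        printf("%.0f%s", h_c[i], i + 1 < n ? " " : "\n");
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```

On a machine with a working CUDA device this should print `7 9 11 13 15`. Launching more threads than elements is deliberate here: it shows why the `if (i < n)` guard in the kernel is needed.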

In [None]:
%%cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // fabs
#include <time.h>
#include <cuda_runtime.h>

#define N 10000000  // Vector size = 10 million
#define BLOCK_SIZE 256

// Example:
// A = [1, 2, 3, 4, 5]
// B = [6, 7, 8, 9, 10]
// C = A + B = [7, 9, 11, 13, 15]

// CPU vector addition
void vector_add_cpu(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// CUDA kernel for vector addition: one thread per element
__global__ void vector_add_gpu(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // threads past the end of the vector do nothing
        c[i] = a[i] + b[i];
    }
}

// Initialize vector with random values in [0, 1]
void init_vector(float *vec, int n) {
    for (int i = 0; i < n; i++) {
        vec[i] = (float)rand() / RAND_MAX;
    }
}

// Function to measure execution time (seconds, monotonic clock)
double get_time() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main() {
    float *h_a, *h_b, *h_c_cpu, *h_c_gpu;
    float *d_a, *d_b, *d_c;
    size_t size = N * sizeof(float);

    // Allocate host memory
    h_a = (float*)malloc(size);
    h_b = (float*)malloc(size);
    h_c_cpu = (float*)malloc(size);
    h_c_gpu = (float*)malloc(size);

    // Initialize vectors
    srand(time(NULL));
    init_vector(h_a, N);
    init_vector(h_b, N);

    // Allocate device memory
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Copy data to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Define grid and block dimensions (ceiling division so every element is covered)
    int num_blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    // Example: N = 1024, BLOCK_SIZE = 256 -> (1024 + 256 - 1) / 256 = 1279 / 256 = 4 blocks
    // Example: N = 1025, BLOCK_SIZE = 256 -> (1025 + 256 - 1) / 256 = 1280 / 256 = 5 blocks

    // Warm-up runs
    printf("Performing warm-up runs...\n");
    for (int i = 0; i < 3; i++) {
        vector_add_cpu(h_a, h_b, h_c_cpu, N);
        vector_add_gpu<<<num_blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);
        cudaDeviceSynchronize();
    }

    // Benchmark CPU implementation
    printf("Benchmarking CPU implementation...\n");
    double cpu_total_time = 0.0;
    for (int i = 0; i < 20; i++) {
        double start_time = get_time();
        vector_add_cpu(h_a, h_b, h_c_cpu, N);
        double end_time = get_time();
        cpu_total_time += end_time - start_time;
    }
    double cpu_avg_time = cpu_total_time / 20.0;

    // Benchmark GPU implementation (kernel only; host/device copies are not timed)
    printf("Benchmarking GPU implementation...\n");
    double gpu_total_time = 0.0;
    for (int i = 0; i < 20; i++) {
        double start_time = get_time();
        vector_add_gpu<<<num_blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);
        cudaDeviceSynchronize();  // the launch is asynchronous, so wait before stopping the clock
        double end_time = get_time();
        gpu_total_time += end_time - start_time;
    }
    double gpu_avg_time = gpu_total_time / 20.0;

    // Print results
    printf("CPU average time: %f milliseconds\n", cpu_avg_time * 1000);
    printf("GPU average time: %f milliseconds\n", gpu_avg_time * 1000);
    printf("Speedup: %fx\n", cpu_avg_time / gpu_avg_time);

    // Verify results (optional)
    cudaMemcpy(h_c_gpu, d_c, size, cudaMemcpyDeviceToHost);
    bool correct = true;
    for (int i = 0; i < N; i++) {
        if (fabs(h_c_cpu[i] - h_c_gpu[i]) > 1e-5) {
            correct = false;
            break;
        }
    }
    printf("Results are %s\n", correct ? "correct" : "incorrect");

    // Free memory
    free(h_a);
    free(h_b);
    free(h_c_cpu);
    free(h_c_gpu);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
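The program above ignores the return values of the CUDA runtime calls and times the kernel with a host clock. As a hedged alternative, the sketch below checks every call and times a single launch with CUDA events. The `CUDA_CHECK` macro is a hypothetical helper introduced here, not something from the original source file, and the vector size `1 << 20` with zero-filled inputs is an arbitrary choice for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (an assumption of this sketch):
// print the error and exit if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void vector_add_gpu(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;            // illustrative size, not the N used above
    const int block_size = 256;
    const int num_blocks = (n + block_size - 1) / block_size;
    size_t size = n * sizeof(float);

    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, size));
    CUDA_CHECK(cudaMalloc(&d_b, size));
    CUDA_CHECK(cudaMalloc(&d_c, size));
    CUDA_CHECK(cudaMemset(d_a, 0, size));  // zero-filled inputs are enough for timing
    CUDA_CHECK(cudaMemset(d_b, 0, size));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    vector_add_gpu<<<num_blocks, block_size>>>(d_a, d_b, d_c, n);
    CUDA_CHECK(cudaGetLastError());          // reports launch errors, e.g. an invalid block size
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));  // wait until the kernel and the stop event complete

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Kernel time: %f ms\n", ms);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    return 0;
}
```

Events are recorded on the GPU's own timeline, so the measurement does not include the host-side overhead of calling `cudaDeviceSynchronize()`. A single launch is noisy; averaging over several launches, as the benchmark above does, still applies.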

## Exercises

Try these modifications to deepen your understanding:

1. **Change Vector Size**: Modify `N` to different values (100, 1000, 100000000)
   - Observe how execution time changes
   - What happens with very large vectors?
2. **Adjust Block Size**: Try different `BLOCK_SIZE` values (64, 128, 512, 1024)
   - How does it affect performance?
   - What's the maximum allowed?
3. **Add Vector Subtraction**: Create a new kernel for `C = A - B`
4. **Multiple Operations**: Compute `D = (A + B) * (A - B)` in a single kernel
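For the block-size exercise, the device itself can report its limits. The sketch below is one possible way to do that, assuming device 0 is the GPU in use; `blocks_needed` is a hypothetical helper that repeats the ceiling division from the implementation above.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: blocks required to cover n elements (ceiling division).
int blocks_needed(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumes device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device: %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x): %d\n", prop.maxGridSize[0]);

    const int n = 10000000;  // same N as the implementation above
    const int sizes[] = {64, 128, 256, 512, 1024};
    const int count = sizeof(sizes) / sizeof(sizes[0]);

    for (int i = 0; i < count; i++) {
        if (sizes[i] <= prop.maxThreadsPerBlock) {
            printf("BLOCK_SIZE %4d -> %d blocks\n", sizes[i], blocks_needed(n, sizes[i]));
        } else {
            printf("BLOCK_SIZE %4d exceeds this device's limit\n", sizes[i]);
        }
    }
    return 0;
}
```

A launch with more threads per block than `maxThreadsPerBlock` does not run the kernel; the failure only becomes visible if the error state is checked after the launch, for example with `cudaGetLastError()`.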

---

## Key Takeaways

- CUDA enables massive parallelism for compute-intensive tasks
- Proper memory management is crucial for performance
- Understanding the thread hierarchy helps write efficient kernels
- Always synchronize when needed to ensure correctness

---

## Next Steps

Continue to the next notebook in Module 5 to learn more CUDA concepts!

---

## Notes

*Use this space for your learning notes:*