<a href="https://colab.research.google.com/github/jpmantuano/csc612m/blob/main/mc01/MCO1_Juachon_Mantuano.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Group 3 : MC01** ###

Jean Philip Juachon

Joseph Paulo Mantuano

### Explaining Multiple Data Transfer Methods ###

This notebook looks at how data moves between the computer's main memory (used by the CPU) and the graphics card's memory (used by the GPU). Four CPU-GPU data transfer methods are compared:

1.) Unified memory

2.) Prefetching of data with memory advice

3.) Data initialization as a CUDA kernel

4.) Old method of transferring data between CPU and GPU memory (`cudaMalloc` + `cudaMemcpy`)

# Summary: Test Results #

### GPU Kernel Execution Time ###

| Version | Type                      | Total Run time (ms) | Average (ms) | Ranking |
|---------|---------------------------|---------------------|--------------|---------|
| 0       | Vanilla C                 | 31510.242           | 1050.34      | 5       |
| 1       | Unified Memory Version    | 1029.68             | 34.32        | 3       |
| 2       | Prefetch and Mem Advise   | 585.92              | 19.53        | 1       |
| 3       | Data Initialization with CUDA | 925.49          | 30.85        | 2       |
| 4       | Old Method                | 1284.66             | 42.822       | 4       |


### Comparison of GPU Kernel Execution Time Between Data Transfer Methods ###

| Version | Type                      | vs Vanilla C (x times better) | vs Unified Memory | vs Prefetching and Mem Advise | vs Data Initialization with CUDA | vs Old Method |
|---------|---------------------------|-------------------------------|-------------------|------------------------------|----------------------------------|---------------|
| 0       | Vanilla C                 | 1.00                          | 0.03              | 0.02                         | 0.03                             | 0.04          |
| 1       | Unified Memory Version    | 30.60                         | 1.00              | 0.57                         | 0.90                             | 1.25          |
| 2       | Prefetch and Mem Advise   | 53.78                         | 1.76              | 1.00                         | 1.58                             | 2.19          |
| 3       | Data Initialization with CUDA | 34.05                    | 1.11              | 0.63                         | 1.00                             | 1.39          |
| 4       | Old Method                | 24.53                         | 0.80              | 0.46                         | 0.72                             | 1.00          |
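The figures in the two kernel-time tables above follow from simple arithmetic: each average is the total run time divided by the 30 runs, and each comparison cell is the column method's time divided by the row method's time. A minimal C sketch of that arithmetic follows; the helper names `mean_per_run` and `speedup` are hypothetical and are not part of the notebook's code.

```c
#include <assert.h>

/* Hypothetical helper: average time per run, given a total over `runs` runs. */
double mean_per_run(double total_ms, int runs) {
    assert(runs > 0);
    return total_ms / runs;
}

/* Hypothetical helper: how many times faster `candidate_ms` is than
   `baseline_ms`. Values above 1 mean the candidate is faster. */
double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

For example, `speedup(31510.242, 585.92)` gives roughly 53.78, the "vs Vanilla C" entry for the prefetch version, and `mean_per_run(1029.68, 30)` gives roughly 34.32 ms, the Unified Memory average.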


### Data Transfer Time ###

| Version | Type                      | Total Data Transfer Time (ms) | Ranking |
|---------|---------------------------|-------------------------------|---------|
| 0       | Vanilla C                 | NA                            | NA      |
| 1       | Unified Memory Version    | 306.37                        | 3       |
| 2       | Prefetch and Mem Advise   | 252.98                        | 1       |
| 3       | Data Initialization with CUDA | 263.22                    | 2       |
| 4       | Old Method                | 951.3                         | 4       |


### Comparison of Data Transfer Time Between Data Transfer Methods ###

| Version | Type                      | vs Vanilla C (x times better) | vs Unified Memory | vs Prefetching and Mem Advise | vs Data Initialization with CUDA | vs Old Method |
|---------|---------------------------|-------------------------------|-------------------|------------------------------|----------------------------------|---------------|
| 0       | Vanilla C                 | NA                            | NA                | NA                           | NA                               | NA            |
| 1       | Unified Memory Version    | NA                            | 1.00              | 0.83                         | 0.86                             | 3.11          |
| 2       | Prefetch and Mem Advise   | NA                            | 1.21              | 1.00                         | 1.04                             | 3.76          |
| 3       | Data Initialization with CUDA | NA                        | 1.16              | 0.96                         | 1.00                             | 3.61          |
| 4       | Old Method                | NA                            | 0.32              | 0.27                         | 0.28                             | 1.00          |


# CUDA Programming Project Specifications : 1D Convolution #

This project implements a 1D convolution and compares the performance of a CPU-based implementation against GPU-accelerated CUDA implementations, exploring different data transfer methods and CUDA optimization techniques.

#### General Implementation of the 1D convolution defined as : out[i] = (in[i] + in[i + 1] + in[i + 2]) / 3.0f ####


```
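// One thread per output element; the last two positions have no full 3-element window.
__global__ void conv1D_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 2) {
        out[i] = (in[i] + in[i + 1] + in[i + 2]) / 3.0f;
    }
}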
```
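To make the kernel's indexing easier to follow, here is a plain C emulation that runs the same arithmetic sequentially. It is only a sketch: the function name `conv1d_emulated` is hypothetical, and the two nested loops stand in for `blockIdx.x` and `threadIdx.x`, which the GPU would execute in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* CPU emulation of the kernel's indexing (hypothetical helper).
   The outer loop plays the role of blockIdx.x, the inner loop threadIdx.x. */
void conv1d_emulated(const float *in, float *out, int n, int threads_per_block) {
    assert(in != NULL && out != NULL && threads_per_block > 0);
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    for (int block = 0; block < blocks; block++) {
        for (int thread = 0; thread < threads_per_block; thread++) {
            int i = block * threads_per_block + thread;  /* global index */
            if (i < n - 2) {                             /* same bounds guard as the kernel */
                out[i] = (in[i] + in[i + 1] + in[i + 2]) / 3.0f;
            }
        }
    }
}
```

With `in[i] = i`, each output is `(i + (i+1) + (i+2)) / 3 = i + 1`, and the last two output slots are never written.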



#### Input: `2^28` elements or `268435456` ####
Generally initialized using the following code:


```
float* initialize_input(float *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        in[i] = (float)i;
    }
    return in;
}
```
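One consequence of storing the index as a `float` is worth keeping in mind when reading the printed results: a 32-bit float has a 24-bit significand, so not every integer above 2^24 can be represented exactly. Between 2^27 and 2^28 the representable values are 16 apart, which is why the last output elements print as repeated values such as 268435440 and 268435456. The sketch below illustrates this; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the cast used in initialize_input: the index stored as a float. */
float index_as_float(size_t i) {
    return (float)i;
}

/* Hypothetical helper: returns 1 if index i survives the round trip
   through float unchanged, 0 if it was rounded to a neighbouring value. */
int is_exact_in_float(size_t i) {
    assert(i <= ((size_t)1 << 28));  /* index range used in this notebook */
    return (size_t)index_as_float(i) == i;
}
```

For instance, `is_exact_in_float(16777216)` holds, while 16777217 (2^24 + 1) is rounded, and `index_as_float(268435441)` yields 268435440.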



The first and last 20 elements of the output will be displayed at the end of the run, excluding the last two elements, which the kernel never writes and which therefore print as zero.



```
void print_results(const float *out, size_t n) {
    printf("First %d elements of output:\n", PRINT_VALUES);
    for (int i = 0; i < PRINT_VALUES; i++) {
        printf("out[%d] = %f\n", i, out[i]);
    }

    printf("Last %d elements of output:\n", PRINT_VALUES);
    for (size_t i = n - PRINT_VALUES; i < n - 2; i++) {
        printf("out[%zu] = %f\n", i, out[i]);
    }
}
```



The convolution method is run 30 times and the average runtime is recorded, to smooth out caching effects and other run-to-run inconsistencies.



```
//Kernel run method
void run_kernel(const float *in, float *out) {
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    for (int iter = 0; iter < RUNS; iter++) {
        conv1D_kernel<<<blocks, THREADS_PER_BLOCK>>>(in, out, N);
        cudaDeviceSynchronize();
    }
}
```
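The run-and-average idea can also be sketched on the host side in plain C, in the same way Version 0 times its loop with `clock()`. This is an illustration only: the CUDA timings reported in this notebook come from `nvprof`, not from this helper, and the names `average_ms`, `sample_work`, and `call_count` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the function being measured. */
static int call_count = 0;
static void sample_work(void) { call_count++; }

/* Calls `work` `runs` times and returns the mean CPU time per call in
   milliseconds, measured with clock(). */
double average_ms(void (*work)(void), int runs) {
    assert(work != NULL && runs > 0);
    double total_ms = 0.0;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        work();
        clock_t end = clock();
        total_ms += (double)(end - start) * 1e3 / CLOCKS_PER_SEC;
    }
    return total_ms / runs;
}
```

A call such as `average_ms(sample_work, 30)` mirrors the 30-run setup used throughout. Note that `clock()` reports processor time rather than wall-clock time.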



The error-checking method also excludes the last two elements, since they are always zero.

```
size_t check_errors(const float *in, const float *out, size_t n) {
    size_t err_count = 0;
    for (size_t i = 0; i < n - 2; i++) {
        float expected = (in[i] + in[i + 1] + in[i + 2]) / 3.0f;
        if (fabs(out[i] - expected) > 1e-5) {
            err_count++;
        }
    }
    return err_count;
}
```



#### CUDA configurations ####
* The number of blocks is computed automatically from the number of input elements and the number of threads per block.
* The number of threads per block is set to 1024.
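The block count is a ceiling division of the element count by the threads per block, the same expression the CUDA programs in this notebook use for `numBlocks`. A small C sketch, with the hypothetical helper name `blocks_for`:

```c
#include <assert.h>
#include <stddef.h>

/* Ceiling division: the smallest number of blocks whose threads cover
   all n elements (hypothetical helper name). */
size_t blocks_for(size_t n, size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For the input size used here, `blocks_for((size_t)1 << 28, 1024)` returns 262144, which matches the `numBlocks` value printed by each CUDA version. An input of 1025 elements would need 2 blocks, with most threads of the second block idle.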

# Version 0: Using C to implement 1D Convolution #

This version measures the runtime of the 1D convolution implemented in plain C. It serves as the baseline for comparison with the different CUDA implementations of the 1D convolution.

In [6]:
%%writefile conv1d_c.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define ARRAY_SIZE (1 << 28)  // 268,435,456 elements

// ***C function version
// Reads in[i+2], so `in` must hold at least n + 2 elements.
void conv_1d(size_t n, float* out, float *in)
{
  for (size_t i = 0; i < n; i++)
     out[i] = (in[i] + in[i+1] + in[i+2])/3.0f;

}

int main() {
    float *in = (float *)malloc((ARRAY_SIZE + 2) * sizeof(float));  // +2 for safe bounds
    float *out = (float *)malloc(ARRAY_SIZE * sizeof(float));

    const size_t loope = 30;

    // Initialize input with some values
    for (size_t i = 0; i < ARRAY_SIZE + 2; i++) {
        in[i] = (float)i;
    }

    //time here
    double elapse, time_taken;

    //timer variables
    clock_t start, end;
    elapse = 0.0f;

    // 1D Convolution (naive C)
    for (int i=0; i<loope; i++){
        start = clock();
        conv_1d(ARRAY_SIZE,out, in);
        end = clock();
        time_taken = ((double)(end-start))*1E3/CLOCKS_PER_SEC;
        elapse = elapse + time_taken;
    }
    printf("Function (in C) average time for %zu loops is %f milliseconds to execute an array size %d \n", loope, elapse/loope, ARRAY_SIZE);

    // Print first and last 20 values
    printf("First 20 elements of output:\n");
    for (int i = 0; i < 20; i++) printf("%f ", out[i]);
    printf("\nLast 20 elements of output:\n");
    for (size_t i = ARRAY_SIZE - 20; i < ARRAY_SIZE; i++) printf("%f ", out[i]);
    printf("\n");

    free(in);
    free(out);
    return 0;
}

Overwriting conv1d_c.c


In [7]:
%%shell
gcc conv1d_c.c -o conv1d_c



In [None]:
%%shell
./conv1d_c

Function (in C) average time for 30 loops is 1050.341400 milliseconds to execute an array size 268435456 
First 20 elements of output:
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000 18.000000 19.000000 20.000000 
Last 20 elements of output:
268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 




# GPU Check

Check which GPU Google Colab assigned to this session. The GPU id is usually 0 on free-tier accounts.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Jun  3 12:09:28 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Version 1: Unified Memory Version of 1D Convolution

Unified Memory provides a single memory address space that both the CPU and the GPU can access. See the [NVIDIA blog](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) for an introduction.

Analogy: Instead of two separate offices, imagine a shared workspace where both the CPU and GPU can reach the same files. You don’t have to move files around; you just work on them where they are.

How it works:
- You allocate memory with cudaMallocManaged.
- Both the CPU and GPU can access this memory.
- CUDA automatically moves data behind the scenes when needed.

Example:

```
float *data;
cudaMallocManaged(&data, N * sizeof(float));  // Accessible by CPU/GPU
```

In [None]:
%%writefile CUDA_conv1d_base.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__
void conv_1d(size_t n, float *out, float *in){

    //IMPORTANT 1: ID = BLOCKID, BLOCKDIM, THREADID
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    //IMPORTANT 2: GRID STRIDE LOOP
    for (int i = index; i < n-2; i += stride){
        out[i] = (in[i] + in[i+1] + in[i+2]) / 3.0f;
    }
}

int main(){
    const size_t ARRAY_SIZE = 1<<28;
    const size_t ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    //loop
    const size_t loope = 30;

    //declare array
    float *in, *out;
    cudaMallocManaged(&in, ARRAY_BYTES);
    cudaMallocManaged(&out, ARRAY_BYTES);

    //Initialize array
    for (int i=0; i<ARRAY_SIZE; i++)
      in[i] = (float) i;

    //setup CUDA kernel
    size_t numThreads = 1024;

    //size_t numBlocks = 1;
    //IMPORTANT 3: AUTO COMPUTE BLOCKS
    size_t numBlocks = (ARRAY_SIZE + numThreads-1) / numThreads; //round-up

    printf("*** function = Conv 1d\n");
    printf("numElements = %lu\n", ARRAY_SIZE);
    printf("numBlocks = %lu, numThreads = %lu \n",numBlocks,numThreads);

    // Launch the kernel `loope` times, then wait once for all launches to finish
    for (size_t i=0; i<loope;i++) {
        conv_1d <<<numBlocks, numThreads>>> (ARRAY_SIZE,out,in);
    }

    //barrier
    cudaDeviceSynchronize();

    //error checking    //this is cpu
    size_t err_count = 0;

    // ***C version
    for (int i=0; i<ARRAY_SIZE-2; i++){
        if (fabs(out[i] - (in[i] + in[i+1] + in[i+2])/3.0f) > 1e-5)
            err_count++;
   }

    printf("Error count(CUDA program): %zu\n", err_count);
    if (err_count == 0){
        printf("CUDA Version is Correct\n");
    }
    else{
        printf("CUDA Version is Incorrect\n");
    }


    printf("First 20 elements of output:\n");
    for (int i = 0; i < 20; i++)
        printf("%f ", out[i]);
    printf("\nLast 20 elements of output:\n");
    for (int i = ARRAY_SIZE - 20; i < ARRAY_SIZE; i++)
        printf("%f ", out[i]);
    printf("\n");

    //free memory
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Writing CUDA_conv1d_base.cu


In [None]:
%%shell
nvcc CUDA_conv1d_base.cu -o CUDA_conv1d_base -arch=sm_75



In [None]:
%%shell
nvprof ./CUDA_conv1d_base

==4948== NVPROF is profiling process 4948, command: ./CUDA_conv1d_base
*** function = Conv 1d
numElements = 268435456
numBlocks = 262144, numThreads = 1024 
Error count(CUDA program): 0
CUDA Version is Correct
First 20 elements of output:
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000 18.000000 19.000000 20.000000 
Last 20 elements of output:
268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 0.000000 0.000000 
==4948== Profiling application: ./CUDA_conv1d_base
==4948== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.02968s        30  34.323ms 



# Version 2: CUDA with Prefetching data and Memory Advice

Explicit control over data locality using:

**a. cudaMemPrefetchAsync**

Pre-migrates data to a specific processor (CPU/GPU) before access:

```
cudaMemPrefetchAsync(data, size, deviceId);  // Migrate data to the given device before it is accessed
```

**b. cudaMemAdvise**

Hints about access patterns:

```
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, deviceId);  // Hint: this device will access the data
```




Analogy: Imagine you know you’ll need a certain file in the GPU office soon. You tell your assistant in advance, “Hey, please have this file ready on the GPU desk before I get there.” That way, there’s no waiting around.

How it works:
- With Unified Memory, you use cudaMemPrefetchAsync to pre-load data to where it’ll be needed (CPU or GPU).
- You can also give “advice” with cudaMemAdvise to help CUDA optimize data placement.

In [None]:
%%writefile CUDA_conv1d_complete.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__
void conv_1d(size_t n, float *out, float *in){

    //IMPORTANT 1: ID = BLOCKID, BLOCKDIM, THREADID
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    //IMPORTANT 2: GRID STRIDE LOOP
    for (int i = index; i < n-2; i += stride){
        out[i] = (in[i] + in[i+1] + in[i+2]) / 3.0f;
    }
}

int main(){
    const size_t ARRAY_SIZE = 1<<28;
    const size_t ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    //loop
    const size_t loope = 30;

    //declare array
    float *in, *out;
    cudaMallocManaged(&in, ARRAY_BYTES);
    cudaMallocManaged(&out, ARRAY_BYTES);


    //get gpu id
    int device = -1;
    cudaGetDevice(&device);

    // MEMORY ADVISE
    cudaMemAdvise(in, ARRAY_BYTES, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(out, ARRAY_BYTES, cudaMemAdviseSetReadMostly, cudaCpuDeviceId);

    //"prefetch data" to create CPU page memory
    cudaMemPrefetchAsync(in,ARRAY_BYTES,cudaCpuDeviceId,NULL);
    //"prefetch data" to create GPU page memory
    cudaMemPrefetchAsync(out,ARRAY_BYTES,device,NULL);

    //Initialize array
    for (int i=0; i<ARRAY_SIZE; i++)
      in[i] = (float) i;

    //"Prefetch data" from CPU-GPU
    cudaMemPrefetchAsync(in,ARRAY_BYTES,device,NULL);

    //****** SETUP CUDA KERNEL ******
    size_t numThreads = 1024;

    //IMPORTANT 3: AUTO COMPUTE BLOCKS
    size_t numBlocks = (ARRAY_SIZE + numThreads-1) / numThreads; //round-up

    printf("*** function = Conv 1d\n");
    printf("numElements = %lu\n", ARRAY_SIZE);
    printf("numBlocks = %lu, numThreads = %lu \n",numBlocks,numThreads);

    // Launch the kernel `loope` times, then wait once for all launches to finish
    for (size_t i=0; i<loope;i++) {
        conv_1d <<<numBlocks, numThreads>>> (ARRAY_SIZE,out,in);
    }

    cudaDeviceSynchronize();

    //"Prefetch data" from GPU-CPU before error checking because error checking is in CPU.
    cudaMemPrefetchAsync(out,ARRAY_BYTES,cudaCpuDeviceId,NULL);
    cudaMemPrefetchAsync(in,ARRAY_BYTES,cudaCpuDeviceId,NULL);


    //error checking    //this is cpu
    size_t err_count = 0;

    // ***C version
    for (int i=0; i<ARRAY_SIZE-2; i++){
        if (fabs(out[i] - (in[i] + in[i+1] + in[i+2])/3.0f) > 1e-5)
            err_count++;
   }

    printf("Error count(CUDA program): %zu\n", err_count);
    if (err_count == 0){
        printf("CUDA Version is Correct\n");
    }
    else{
        printf("CUDA Version is Incorrect\n");
    }


    printf("First 20 elements of output:\n");
    for (int i = 0; i < 20; i++)
        printf("%f ", out[i]);
    printf("\nLast 20 elements of output:\n");
    for (int i = ARRAY_SIZE - 20; i < ARRAY_SIZE; i++)
        printf("%f ", out[i]);
    printf("\n");

    //free memory
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Writing CUDA_conv1d_complete.cu


In [None]:
%%shell
nvcc CUDA_conv1d_complete.cu -o CUDA_conv1d_complete -arch=sm_75



In [None]:
%%shell
nvprof ./CUDA_conv1d_complete

==5030== NVPROF is profiling process 5030, command: ./CUDA_conv1d_complete
*** function = Conv 1d
numElements = 268435456
numBlocks = 262144, numThreads = 1024 
Error count(CUDA program): 0
CUDA Version is Correct
First 20 elements of output:
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000 18.000000 19.000000 20.000000 
Last 20 elements of output:
268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 0.000000 0.000000 
==5030== Profiling application: ./CUDA_conv1d_complete
==5030== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  332.94ms        30  1



# Version 3: Data Initialization with CUDA Kernel

Direct GPU-side initialization avoids host-device transfers:

**Advantages**

* Eliminates cudaMemcpy for initial data setup.

* Faster for large datasets (no PCIe transfer).

Example:
```
__global__ void init_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;  // Initialize directly on GPU
}
```

Analogy: Instead of carrying a box of blank forms from the CPU office to the GPU office, you just print the forms right there in the GPU office!

How it works:
- You write a small CUDA kernel (a function that runs on the GPU) to fill in or generate your data directly in GPU memory.
- No need to copy data from CPU to GPU.

In [None]:
%%writefile CUDA_conv1d_v3.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__
void init_array(size_t n, float *in){
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < n; i += stride) {
        in[i] = (float)i;
    }
}

__global__
void conv_1d(size_t n, float *out, float *in){
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < n - 2; i += stride){
        out[i] = (in[i] + in[i+1] + in[i+2]) / 3.0f;
    }
}

int main(){
    const size_t ARRAY_SIZE = 1<<28;  // 268,435,456
    const size_t ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    const size_t loope = 30;

    float *in, *out;
    cudaMallocManaged(&in, ARRAY_BYTES);
    cudaMallocManaged(&out, ARRAY_BYTES);

    // Get GPU device id
    int device = -1;
    cudaGetDevice(&device);

    // Memory advise for better behavior
    cudaMemAdvise(in, ARRAY_BYTES, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(out, ARRAY_BYTES, cudaMemAdviseSetReadMostly, cudaCpuDeviceId);

    // Setup thread/block sizes
    size_t numThreads = 1024;
    size_t numBlocks = (ARRAY_SIZE + numThreads - 1) / numThreads;

    // Initialize array on GPU with kernel
    init_array<<<numBlocks, numThreads>>>(ARRAY_SIZE, in);
    cudaDeviceSynchronize();

    // Prefetch initialized data to GPU (optional but recommended)
    cudaMemPrefetchAsync(in, ARRAY_BYTES, device, NULL);
    cudaMemPrefetchAsync(out, ARRAY_BYTES, device, NULL);

    printf("*** function = Conv 1d with CUDA Kernel Init\n");
    printf("numElements = %lu\n", ARRAY_SIZE);
    printf("numBlocks = %lu, numThreads = %lu \n", numBlocks, numThreads);

    // Run convolution kernel multiple times for timing stability
    for (size_t i = 0; i < loope; i++){
        conv_1d<<<numBlocks, numThreads>>>(ARRAY_SIZE, out, in);
    }
    cudaDeviceSynchronize();

    // Prefetch results back to CPU for validation
    cudaMemPrefetchAsync(out, ARRAY_BYTES, cudaCpuDeviceId, NULL);

    // Validation on CPU
    size_t err_count = 0;
    for (int i = 0; i < ARRAY_SIZE - 2; i++){
        float expected = (in[i] + in[i+1] + in[i+2]) / 3.0f;
        if (fabs(out[i] - expected) > 1e-5){
            err_count++;
        }
    }

    printf("Error count (CUDA kernel init): %zu\n", err_count);
    if (err_count == 0)
        printf("CUDA Version is Correct\n");
    else
        printf("CUDA Version is Incorrect\n");

    printf("First 20 elements of output:\n");
    for (int i = 0; i < 20; i++)
        printf("%f ", out[i]);
    printf("\nLast 20 elements of output:\n");
    for (int i = ARRAY_SIZE - 20; i < ARRAY_SIZE; i++)
        printf("%f ", out[i]);
    printf("\n");

    cudaFree(in);
    cudaFree(out);
    return 0;
}

Overwriting CUDA_conv1d_v3.cu


In [None]:
%%shell
nvcc CUDA_conv1d_v3.cu -o CUDA_conv1d_v3 -arch=sm_75



In [None]:
%%shell
nvprof ./CUDA_conv1d_v3

==5179== NVPROF is profiling process 5179, command: ./CUDA_conv1d_v3
*** function = Conv 1d with CUDA Kernel Init
numElements = 268435456
numBlocks = 262144, numThreads = 1024 
Error count (CUDA kernel init): 0
CUDA Version is Correct
First 20 elements of output:
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000 18.000000 19.000000 20.000000 
Last 20 elements of output:
268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 0.000000 0.000000 
==5179== Profiling application: ./CUDA_conv1d_v3
==5179== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   57.78%  361.27



# Version 4: Old method of data transfer

Explicit host/device memory management:

**Steps**

1. Allocate host memory (malloc).
2. Allocate device memory (cudaMalloc).
3. Copy data:
```
cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);  // Host→Device
```
4. Process on GPU.
5. Copy results back:
```
cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);  // Device→Host
```

Analogy: Imagine you’re working in two separate offices — one for the CPU and one for the GPU. If you want to work on a file in the GPU office, you have to physically copy it over from the CPU office.

How it works:
- You allocate space on the GPU using something called cudaMalloc (like reserving a desk in the GPU office).
- You copy your data from the CPU to the GPU with cudaMemcpy (like carrying a box of files from one office to the other).
- When you’re done, you copy the results back the same way.
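To get a feel for how much data these explicit allocations and copies involve, the sizes can be worked out in plain C. This is a rough sketch with hypothetical helper names. It assumes 4-byte floats, and the 15360 MiB device capacity in the example is taken from the `nvidia-smi` output shown earlier in this notebook.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one array of `elements` floats (assumes 4-byte float). */
size_t array_bytes(size_t elements) {
    assert(sizeof(float) == 4);
    return elements * sizeof(float);
}

/* Convert a byte count to MiB. */
double bytes_to_mib(size_t bytes) {
    return (double)bytes / (1024.0 * 1024.0);
}

/* Returns 1 if `arrays` arrays of `elements` floats fit in `device_mib` MiB.
   Ignores any other memory already in use on the device. */
int fits_on_device(size_t elements, int arrays, double device_mib) {
    assert(arrays > 0);
    return bytes_to_mib(array_bytes(elements)) * arrays <= device_mib;
}
```

With 2^28 elements, each array is 1024 MiB (1 GiB), so each `cudaMemcpy` in this version moves 1 GiB, and `fits_on_device((size_t)1 << 28, 2, 15360.0)` confirms that the input and output arrays together fit on the Tesla T4.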

In [None]:
%%writefile CUDA_conv1d_v4_1.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

__global__
void conv_1d(size_t n, float *out, const float *in) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < n; i += stride) {
        out[i] = (in[i] + in[i + 1] + in[i + 2]) / 3.0f;
    }
}

int main() {
    const size_t ARRAY_SIZE = 1 << 28;  // ~268 million elements
    const size_t ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
    const size_t LOOP_COUNT = 30;

    // Host arrays
    float *h_in = (float *)malloc(ARRAY_BYTES);
    float *h_out = (float *)malloc(ARRAY_BYTES);

    if (!h_in || !h_out) {
        fprintf(stderr, "Failed to allocate host memory\n");
        return -1;
    }

    // Initialize input on CPU
    for (size_t i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = (float)i;
    }

    // Device pointers
    float *d_in = NULL;
    float *d_out = NULL;

    // Allocate device memory
    cudaMalloc((void **)&d_in, ARRAY_BYTES);
    cudaMalloc((void **)&d_out, ARRAY_BYTES);

    // Copy input data from CPU to GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // Kernel launch parameters
    size_t numThreads = 1024;
    size_t numBlocks = (ARRAY_SIZE + numThreads - 1) / numThreads;

    printf("*** function = Conv 1d (Old cudaMalloc + cudaMemcpy method)\n");
    printf("numElements = %lu\n", ARRAY_SIZE);
    printf("numBlocks = %lu, numThreads = %lu\n", numBlocks, numThreads);

    for (size_t i = 0; i < LOOP_COUNT; i++) {
        conv_1d<<<numBlocks, numThreads>>>(ARRAY_SIZE - 2, d_out, d_in);
    }
    cudaDeviceSynchronize();

    // Copy result back from GPU to CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // Verify results
    size_t err_count = 0;
    for (size_t i = 0; i < ARRAY_SIZE - 2; i++) {
        float expected = (h_in[i] + h_in[i + 1] + h_in[i + 2]) / 3.0f;
        if (fabs(h_out[i] - expected) > 1e-5) {
            err_count++;
        }
    }

    printf("Error count (CUDA program): %zu\n", err_count);
    if (err_count == 0) {
        printf("CUDA Version is Correct\n");
    } else {
        printf("CUDA Version is Incorrect\n");
    }

    printf("First 20 elements of output:\n");
    for (int i = 0; i < 20; i++) {
        printf("%f ", h_out[i]);
    }
    printf("\nLast 20 elements of output:\n");
    for (size_t i = ARRAY_SIZE - 20; i < ARRAY_SIZE; i++) {
        printf("%f ", h_out[i]);
    }
    printf("\n");

    // Free memory
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);

    return 0;
}


Writing CUDA_conv1d_v4_1.cu


In [None]:
%%shell
nvcc CUDA_conv1d_v4_1.cu -o CUDA_conv1d_v4_1 -arch=sm_75



In [None]:
%%shell
nvprof ./CUDA_conv1d_v4_1

==5326== NVPROF is profiling process 5326, command: ./CUDA_conv1d_v4_1
*** function = Conv 1d (Old cudaMalloc + cudaMemcpy method)
numElements = 268435456
numBlocks = 262144, numThreads = 1024
Error count (CUDA program): 0
CUDA Version is Correct
First 20 elements of output:
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000 18.000000 19.000000 20.000000 
Last 20 elements of output:
268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435440.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 268435456.000000 0.000000 0.000000 
==5326== Profiling application: ./CUDA_conv1d_v4_1
==5326== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   

