# 📘 Task Explanation: Memory Coalescing, Stride Experiments, and Nsight Systems Profiling

## 🎯 Objective
The objective of this task is to understand how **global memory access patterns** affect CUDA kernel performance, with a focus on **memory coalescing**.  
You will experimentally measure how different **memory access strides** impact performance and use **Nsight Systems** to profile and interpret the results.

This task builds intuition for why GPU kernels can be slow even when they are highly parallel.

---

## 🧠 Background: What Is Memory Coalescing?
On NVIDIA GPUs, global memory accesses are serviced in fixed-size **memory transactions**: 32-byte sectors, which recent architectures group into 128-byte cache lines.

When the threads of the **same warp (32 threads)** access addresses that are:
- **Contiguous** in memory
- Aligned to sector and cache-line boundaries

their accesses are **coalesced** into fewer transactions, resulting in:
- Higher effective bandwidth
- Fewer fetched bytes that no thread actually uses

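If threads access memory with a **stride** (gaps between addresses), the same warp-wide request spans more sectors. The GPU must then issue **more memory transactions**, and part of every transaction carries bytes that no thread uses, which can significantly degrade performance.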
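
The effect of stride and alignment on transaction count can be sketched with a small host-side model. This is only an illustration under stated assumptions (4-byte elements, 32-byte sectors, 32 lanes per warp); the function name `sectorsTouchedByWarp` and the `baseByteOffset` parameter are hypothetical and do not come from any CUDA API.

```cpp
#include <cstddef>
#include <set>

// Count the distinct 32-byte sectors one warp touches when lane k reads
// the element at index k * strideElems.
// Assumed model: 4-byte elements, 32-byte sectors, 32 lanes per warp.
// baseByteOffset models an array that does not start on a sector boundary.
std::size_t sectorsTouchedByWarp(std::size_t strideElems,
                                 std::size_t baseByteOffset = 0) {
    const std::size_t kWarpSize = 32, kElemBytes = 4, kSectorBytes = 32;
    std::set<std::size_t> sectors;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t firstByte = baseByteOffset + lane * strideElems * kElemBytes;
        std::size_t lastByte = firstByte + kElemBytes - 1;
        sectors.insert(firstByte / kSectorBytes);  // sector holding the first byte
        sectors.insert(lastByte / kSectorBytes);   // sector holding the last byte
    }
    return sectors.size();
}
```

Under this model, strides 1, 2, and 4 touch 4, 8, and 16 sectors for the same 128 useful bytes, and a stride-1 access that starts 4 bytes past a sector boundary touches 5 sectors instead of 4.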

---

## 🧩 Part A: Study Memory Coalescing
### Task
Study how threads in a warp access global memory and how memory transactions are formed.

### What to Learn
- How `threadIdx.x` maps to memory addresses
- How a warp of 32 threads loads data from global memory
- Why `A[i]` is fast but `A[i * stride]` can be slow

### Expected Outcome
You should be able to explain:
- Why contiguous access is optimal
- How strided access increases memory traffic
- Why memory coalescing is critical for ML kernels (e.g., LayerNorm, Softmax, Attention)
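
One way to see why this matters for row-oriented ML kernels is to look at how neighbouring lanes map onto a row-major matrix. The sketch below is a host-side illustration only; the helper names are hypothetical, and it assumes a row-major layout in which element `(row, col)` lives at offset `row * cols + col`.

```cpp
#include <cstddef>

// Linear offset of element (row, col) in a row-major matrix with `cols` columns.
std::size_t rowMajorOffset(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}

// Distance in elements between the addresses of neighbouring lanes when
// each lane owns one column of the same row (contiguous access).
std::size_t laneStrideColumnPerLane(std::size_t cols) {
    return rowMajorOffset(0, 1, cols) - rowMajorOffset(0, 0, cols);
}

// Distance in elements when each lane owns one row and all lanes read the
// same column (access strided by the row length).
std::size_t laneStrideRowPerLane(std::size_t cols) {
    return rowMajorOffset(1, 0, cols) - rowMajorOffset(0, 0, cols);
}
```

For a matrix with 1024 columns, the column-per-lane mapping gives a lane stride of 1 element, while the row-per-lane mapping gives a lane stride of 1024 elements, so every lane lands in a different sector.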

---

## 🔬 Part B: Stride Performance Experiment (stride = 1 / 2 / 4)

### Task
Modify a CUDA kernel so that each thread accesses memory using a configurable stride:

```cpp
C[i] = A[i * stride] + B[i * stride];
```
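
Before moving to the GPU, it can help to pin down the indexing and the array sizes on the host. The following is a minimal sketch of the formula above, not the kernel itself; `stridedAddReference` and `requiredInputLength` are hypothetical helper names, and the sketch assumes the output is written contiguously as `C[i]`.

```cpp
#include <cstddef>
#include <vector>

// Smallest input length that keeps every A[i * stride] access in bounds
// for i in [0, n).
std::size_t requiredInputLength(std::size_t n, std::size_t stride) {
    return n == 0 ? 0 : (n - 1) * stride + 1;
}

// Host-side reference for C[i] = A[i * stride] + B[i * stride].
// The output holds n contiguous elements; the inputs must hold at least
// requiredInputLength(n, stride) elements each.
std::vector<float> stridedAddReference(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       std::size_t n, std::size_t stride) {
    std::vector<float> C(n);
    for (std::size_t i = 0; i < n; ++i) {
        // .at() throws std::out_of_range if the inputs are too short.
        C[i] = A.at(i * stride) + B.at(i * stride);
    }
    return C;
}
```

With `n = 4` and `stride = 2`, the inputs need 7 elements and the function reads indices 0, 2, 4, and 6.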

### Run the Kernel With
Execute the CUDA kernel multiple times using different memory access strides:

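- **stride = 1** → fully coalesced: with 4-byte floats, a warp's 32 elements exactly fill four 32-byte sectors
- **stride = 2** → partially coalesced: only half of each fetched sector is used
- **stride = 4** → poorly coalesced (highly fragmented): only a quarter of each fetched sector is used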

---

### What to Measure
For each stride configuration, measure:

- **Kernel execution time**
- **(Optional)** Effective memory bandwidth

---

### Expected Observations

| Stride | Memory Access Pattern | Performance |
|------:|-----------------------|-------------|
| 1 | Fully contiguous | Fastest |
| 2 | Partially coalesced | Slower |
| 4 | Highly fragmented | Much slower |

---

### Why This Matters
Many CUDA kernels are slow **not because of computation**, but because of **inefficient memory access patterns**.  
This experiment makes the performance cost of **uncoalesced global memory access** visible and measurable, helping you understand why memory behavior often dominates GPU performance.

---

## 📊 Part C: Profiling With Nsight Systems

### Task
Use **Nsight Systems** to profile kernel execution for each stride configuration.

---

### What to Look For
When analyzing the profiling results, focus on:

- Kernel execution duration  
- GPU utilization  
- Differences in memory throughput across strides  
- CPU-GPU synchronization overhead  

---

### Key Questions to Answer
- Does kernel execution time increase as the stride increases?  
- Is the kernel **memory-bound** or **compute-bound**?  
- Are there unnecessary synchronizations or idle GPU periods?

---

### Expected Outcome
You should be able to clearly correlate:

> **Poor memory coalescing → more memory transactions → longer kernel runtime**

---

## 🧪 Deliverables
You should produce the following:

1. A **CUDA kernel** supporting configurable memory stride  
2. **Benchmark results** showing runtime vs. stride  
3. **Nsight Systems screenshots or logs**  
4. A **short written analysis** explaining:
   - Why stride affects performance  
   - How memory coalescing explains the observed results

In [None]:
!nvcc --version
!nvidia-smi

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Tue Dec 23 22:23:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                       

In [None]:
!apt-get update
!apt-get install -y cuda-toolkit-12-4

In [None]:
%%writefile stride_coalescing_skeleton.cu
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do {                                   \
  cudaError_t err = (call);                                     \
  if (err != cudaSuccess) {                                     \
    fprintf(stderr, "CUDA error %s:%d: %s\n",                   \
            __FILE__, __LINE__, cudaGetErrorString(err));       \
    std::exit(EXIT_FAILURE);                                    \
  }                                                             \
} while(0)

// -------------------------------------------
// TODO: Implement a kernel that uses STRIDED memory access.
// Requirements:
//  - Each thread computes a global index tid
//  - Each thread should operate on indices that depend on `stride`
//  - Make sure you do not read/write out-of-bounds
//  - You may use either:
//      (A) "logical index" i in [0, N) and access A[i*stride], or
//      (B) a grid-stride loop over i, but with strided addressing
//  - The goal is to change memory coalescing behavior as stride changes.
// -------------------------------------------
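__global__ void vectorAddStrided(const float* A, const float* B, float* C, int N, int stride) {
    // Global thread index and total number of threads in the grid.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int gridStride = blockDim.x * gridDim.x;

    // Grid-stride loop over logical indices i in [0, N). Threads with
    // tid >= N never enter the loop, so no separate bounds check is needed.
    for (int i = tid; i < N; i += gridStride) {
        // The loads and the store all use the strided index, so both lose
        // coalescing as stride grows. The caller must allocate at least
        // (N - 1) * stride + 1 elements per array.
        int idx = i * stride;
        C[idx] = A[idx] + B[idx];
    }
}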

// CPU reference (for correctness)
static void vectorAddStridedCPU(const float* A, const float* B, float* C, int N, int stride) {
    // Mirrors the kernel: only indices i * stride are read and written;
    // every other element of C is left untouched.
    for (int i = 0; i < N; ++i) {
        int idx = i * stride;
        C[idx] = A[idx] + B[idx];
    }
}

// -------------------------------------------
// TODO: correctness check helper
// Requirements:
//  - Compare gpu[] vs cpu[] within tolerance
//  - Print the first mismatch and return false
// -------------------------------------------
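static bool checkClose(const float* gpu, const float* cpu, int count, float tol) {
    for (int i = 0; i < count; ++i) {
        float diff = fabsf(gpu[i] - cpu[i]);
        // The negated comparison also catches NaN, which `diff > tol` would miss.
        if (!(diff <= tol)) {
            fprintf(stderr, "Mismatch at index %d: gpu=%f cpu=%f (|diff|=%g)\n",
                    i, gpu[i], cpu[i], diff);
            return false;
        }
    }
    return true;
}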

// -------------------------------------------
// Timing helper (CUDA events)
// -------------------------------------------
static float timeKernelMs(const float* dA, const float* dB, float* dC, int N, int stride,
                          int gridSize, int blockSize, int warmupIters, int iters) {
    // Warmup
    for (int i = 0; i < warmupIters; ++i) {
        vectorAddStrided<<<gridSize, blockSize>>>(dA, dB, dC, N, stride);
    }
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i) {
        vectorAddStrided<<<gridSize, blockSize>>>(dA, dB, dC, N, stride);
    }
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));

    return ms / iters;
}

// -------------------------------------------
// TODO: optional bandwidth calculation
// Tip: estimate bytes moved per *kernel invocation* and convert to GB/s.
// -------------------------------------------
static double estimateBandwidthGBs(int N, int stride, float kernel_ms) {
    // Effective bandwidth: count only the useful bytes, i.e. two float reads
    // and one float write for each of the N logical outputs. The hardware
    // moves more than this when stride > 1, so the figure drops with stride.
    (void)stride;  // the useful byte count does not depend on stride
    double bytes_total = (double)N * 3.0 * sizeof(float);
    double sec = kernel_ms / 1e3;
    return (bytes_total / sec) / 1e9;
}

int main() {
    // -------------------------------------------
    // Experiment setup
    // N controls how many output elements you compute (your design).
    // If you access A[i*stride], make sure the allocated arrays are large enough.
    // -------------------------------------------
    const int N = 1 << 24;           // base size (tune if needed)
    const float tol = 1e-6f;

    // We'll test these strides:
    const int strides[] = {1, 2, 4};
    const int numStrides = sizeof(strides) / sizeof(strides[0]);

    // -------------------------------------------
    // TODO: Decide how big your arrays must be to support strided access.
    // If you access A[i*stride] for i in [0, N), you may need:
    //   allocCount = N * stride
    // or something similar to avoid out-of-bounds.
    // -------------------------------------------
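    // Assumes strides[] is sorted ascending, so the last entry is the largest.
    // The largest index touched is (N - 1) * maxStride, so N * maxStride elements suffice.
    int maxStride = strides[numStrides - 1];
    int allocCount = N * maxStride;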

    size_t bytes = size_t(allocCount) * sizeof(float);

    // Host alloc
    float* hA = (float*)std::malloc(bytes);
    float* hB = (float*)std::malloc(bytes);
    float* hC_gpu = (float*)std::malloc(bytes);  // may only need N outputs; up to you
    float* hC_cpu = (float*)std::malloc(bytes);

    if (!hA || !hB || !hC_gpu || !hC_cpu) {
        fprintf(stderr, "Host allocation failed.\n");
        return EXIT_FAILURE;
    }

    // Init
    for (int i = 0; i < allocCount; ++i) {
        hA[i] = 0.001f * i;
        hB[i] = 0.002f * i;
    }

    // Device alloc
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    CUDA_CHECK(cudaMalloc(&dA, bytes));
    CUDA_CHECK(cudaMalloc(&dB, bytes));
    CUDA_CHECK(cudaMalloc(&dC, bytes));

    CUDA_CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    // -------------------------------------------
    // TODO: Choose launch config
    // -------------------------------------------
    int blockSize = 256;
    int gridSize  = (N + blockSize - 1) / blockSize;  // one thread per logical index

    // Timing params
    const int warmupIters = 5;
    const int iters = 20;

    printf("=== Stride Coalescing Experiment ===\n");
    printf("N=%d, allocCount=%d, blockSize=%d, gridSize=%d\n", N, allocCount, blockSize, gridSize);

    for (int s = 0; s < numStrides; ++s) {
        int stride = strides[s];

        // Optional: clear output
        CUDA_CHECK(cudaMemset(dC, 0, bytes));

        // Measure
        float ms = timeKernelMs(dA, dB, dC, N, stride, gridSize, blockSize, warmupIters, iters);

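        // Copy back the whole buffer: outputs live at indices i * stride
        CUDA_CHECK(cudaMemcpy(hC_gpu, dC, bytes, cudaMemcpyDeviceToHost));

        // CPU reference. Clear the buffer first so values left over from the
        // previous stride (or from malloc) do not leak into the comparison.
        for (int i = 0; i < allocCount; ++i) hC_cpu[i] = 0.0f;
        vectorAddStridedCPU(hA, hB, hC_cpu, N, stride);

        // Correctness check over every element: written positions must match,
        // and untouched positions must still be zero in both buffers (the GPU
        // side relies on the cudaMemset above).
        int checkCount = allocCount;
        bool ok = checkClose(hC_gpu, hC_cpu, checkCount, tol);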

        // Optional bandwidth estimate
        double gbs = estimateBandwidthGBs(N, stride, ms);

        printf("[stride=%d] time=%.4f ms | bandwidth=%.2f GB/s | correctness=%s\n",
               stride, ms, gbs, ok ? "PASS" : "FAIL");
    }

    // Cleanup
    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    std::free(hA);
    std::free(hB);
    std::free(hC_gpu);
    std::free(hC_cpu);

    return 0;
}


Overwriting stride_coalescing_skeleton.cu


In [None]:
!nvcc -arch=sm_75 stride_coalescing_skeleton.cu -o stride_coalescing_skeleton
!./stride_coalescing_skeleton

=== Stride Coalescing Experiment ===
N=16777216, allocCount=67108864, blockSize=256, gridSize=65536
[stride=1] time=0.7965 ms | bandwidth=252.78 GB/s | correctness=PASS
[stride=2] time=1.9959 ms | bandwidth=100.87 GB/s | correctness=PASS
[stride=4] time=4.1621 ms | bandwidth=48.37 GB/s | correctness=PASS
