# 📘 Task Explanation: Warp-Level Reduction and `__shfl_down_sync`

## 🎯 Objective
The objective of this task is to understand how **warp-level primitives** can be used to implement **fast parallel reductions** on the GPU, and how the CUDA intrinsic `__shfl_down_sync` enables **efficient communication between threads within a warp**.

This task teaches a core optimization technique used in many high-performance CUDA and ML kernels.

---

## 🧠 Background: What Is a Reduction?
A **reduction** is an operation that combines multiple values into a single result, such as:

- Sum
- Maximum / minimum
- Logical AND / OR

Example:
\[
\text{sum} = \sum_{i=0}^{N-1} x_i
\]

Reductions are fundamental in:
- Loss computation
- Mean / variance (LayerNorm, BatchNorm)
- Softmax
- Dot products

---

## 🧩 Part A: Warp-Level Reduction

### What Is a Warp?
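A **warp** is a group of **32 threads** that are scheduled together and issue the same instruction on NVIDIA GPUs.

Key properties:
- The lanes named in the mask of a `*_sync` intrinsic are synchronized by that call. Since Volta introduced independent thread scheduling, lockstep execution is not guaranteed otherwise
- No need for `__syncthreads()` when lanes exchange data within a warp through these intrinsics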

---

### Task: Implement Warp-Level Reduction
Instead of performing reductions across an entire block using shared memory, you will implement a reduction **within a single warp**.

Each thread starts with one value, and the warp cooperatively reduces these values to a single result.

### Why Warp-Level Reduction Is Fast
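- No shared-memory traffic: the values stay in registers
- No block-wide barriers (`__syncthreads()`)
- Lanes exchange registers directly through shuffle instructions

A 32-lane reduction finishes in five shuffle steps (offsets 16, 8, 4, 2, 1), which typically gives **much lower latency** than a shared-memory block reduction.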

---

## 🧠 Part B: Understanding `__shfl_down_sync`

### What Is `__shfl_down_sync`?
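`__shfl_down_sync` is a CUDA intrinsic that allows a thread to **read a register value from another thread in the same warp**. Lane `i` receives the `value` held by lane `i + offset`; lanes whose source would fall past the end of the warp get their own value back. The `mask` argument names the lanes that take part in the exchange.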

Conceptually:
```cpp
value_from_other_thread = __shfl_down_sync(mask, value, offset);
```

In [1]:
!nvcc --version
!nvidia-smi

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Sat Dec 27 02:49:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   57C    P8             13W /   70W |       0MiB /  15360MiB |      0%      Default |
|                       

In [None]:
!apt-get update
!apt-get install -y cuda-toolkit-12-4

In [11]:
%%writefile warp_reduce_shfl_skeleton.cu
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do {                                   \
  cudaError_t err = (call);                                     \
  if (err != cudaSuccess) {                                     \
    fprintf(stderr, "CUDA error %s:%d: %s\n",                   \
            __FILE__, __LINE__, cudaGetErrorString(err));       \
    std::exit(EXIT_FAILURE);                                    \
  }                                                             \
} while(0)

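__device__ __forceinline__ float warpReduceSum(float val) {
    // Shuffle-down reduction across all 32 lanes of the warp.
    // Every lane must call this together, so pass the full mask;
    // __activemask() only reports which lanes happen to be converged here.
    const unsigned mask = 0xffffffffu;
    for (int offset = 16; offset > 0; offset >>= 1) {
        // Lane i adds the value held by lane i + offset.
        val += __shfl_down_sync(mask, val, offset);
    }
    return val;  // only lane 0 holds the complete sum
}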

__global__ void reduceWarpSumKernel(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int N) {
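    // - Each thread accumulates a grid-stride partial sum (guarded by i < N)
    // - Each warp reduces its values with warpReduceSum
    // - Lane 0 of each warp stores its partial sum in shared memory
    // - Warp 0 reduces the per-warp partials and writes one sum per block to out
    // Assumes blockDim.x is a multiple of 32 so every warp is fully populated.
    __shared__ float warpSums[32];  // one slot per warp; a block has at most 32 warps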
    int tid = threadIdx.x;
    int lane = tid & 31;          // tid % 32
    int warp = tid >> 5;          // tid / 32

    float sum = 0.0f;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(int i = idx;i < N; i += stride){
        sum += in[i];
    }

    sum = warpReduceSum(sum);

    if (lane == 0) warpSums[warp] = sum;

    __syncthreads();

    float blockSum = 0.0f;
    if(warp == 0){
        int numWarps = (blockDim.x + 31) / 32;
        blockSum = (lane < numWarps) ? warpSums[lane] : 0.0f;
        blockSum = warpReduceSum(blockSum);

        if(lane == 0) out[blockIdx.x] = blockSum;
    }


}

static float cpuSum(const float* a, int N) {
    double acc = 0.0;
    for (int i = 0; i < N; ++i) acc += a[i];
    return (float)acc;
}

static bool checkClose(float gpu, float cpu, float tol) {
    float diff = std::fabs(gpu - cpu);
    if (diff > tol) {
        printf("Mismatch: gpu=%f cpu=%f diff=%f\n", gpu, cpu, diff);
        return false;
    }
    return true;
}

int main() {
    const int N = 1 << 20;
    const float tol = 1e-2f;

    size_t bytes = size_t(N) * sizeof(float);

    float* hIn = (float*)std::malloc(bytes);
    if (!hIn) return 1;

    for (int i = 0; i < N; ++i) hIn[i] = 0.001f * (i % 1000);

    float* dIn = nullptr;
    CUDA_CHECK(cudaMalloc(&dIn, bytes));
    CUDA_CHECK(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));

    // Launch configuration: one thread per element, 256 threads (8 warps) per block
    int blockSize = 256;
    int gridSize  = (N + blockSize - 1) / blockSize;

    // Allocate output: the kernel writes one partial sum per block
    int warpsTotal = gridSize;      // unused; nvcc warns about it
    size_t outCount = (size_t)gridSize;
    size_t outBytes = outCount * sizeof(float);
    float* dOut = nullptr;
    float* hOut = (float*)std::malloc(outBytes);
    if (!hOut) return 1;
    CUDA_CHECK(cudaMalloc(&dOut, outBytes));

    // Launch the kernel
    reduceWarpSumKernel<<<gridSize, blockSize>>>(dIn, dOut, N);

    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy the per-block partial sums back and finish the reduction on the CPU
    CUDA_CHECK(cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost));

    float gpuSum = 0.0f;
    for (size_t i = 0; i < outCount; ++i) gpuSum += hOut[i];

    float cpuRef = cpuSum(hIn, N);
    printf("GPU sum = %f\nCPU sum = %f\n", gpuSum, cpuRef);
    printf("Correctness: %s\n", checkClose(gpuSum, cpuRef, tol) ? "PASS" : "FAIL");

    // Cleanup
    CUDA_CHECK(cudaFree(dIn));
    CUDA_CHECK(cudaFree(dOut));
    std::free(hOut);
    std::free(hIn);
    return 0;
}


Overwriting warp_reduce_shfl_skeleton.cu


In [12]:
!nvcc -arch=sm_75 warp_reduce_shfl_skeleton.cu -o warp_reduce
!./warp_reduce

      int warpsTotal = gridSize;
          ^


GPU sum = 523641.625000
CPU sum = 523641.625000
Correctness: PASS
