# 📘 Task Explanation (Day by Day): PyTorch LayerNorm C++/CUDA Extension

This week focuses on building a **production-style PyTorch LayerNorm (LN) extension**, starting from a C++ forward wrapper and ending with a **complete forward + backward CUDA implementation**, benchmarked against PyTorch’s official kernel.

The goal is to understand **how real PyTorch operators are written, registered, validated, and optimized**.

---

## 🗓️ Day 2: Register LN Forward (C++ Wrapper) & Python Test

### 🎯 Objective
Expose a **custom LN forward implementation** to Python via a PyTorch C++ extension and validate the **Python → C++ → CUDA** execution path.

### 🧩 Tasks
- Write a C++ forward wrapper using ATen:
  - Accept `at::Tensor` inputs
  - Validate device, dtype, and layout
  - Allocate output tensors
  - Dispatch to a CUDA kernel
- Register the forward function using `PYBIND11_MODULE`
- Call the operator from Python and verify:
  - Correct execution
  - Correct output shape and dtype

### 🧠 Key Concepts
- PyTorch C++ extension registration
- ATen tensor checks and allocation
- Python ↔ C++ ABI boundary
- Kernel launch from C++

### 📦 Deliverables
- Callable `ln_forward()` from Python
- Successful Python test script
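
For reference, the per-row math that the wrapper's CUDA path should reproduce can be sketched in plain Python. This is a minimal sketch with no PyTorch dependency: `layer_norm_rows` is a hypothetical helper name, and it assumes a `[B, D]` input with normalization over the last dimension using the biased variance, as `torch.nn.LayerNorm` does.

```python
import math

def layer_norm_rows(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm over the last dimension of a [B, D] list of lists.

    Returns (y, mean, inv_std), mirroring the buffers the C++ wrapper allocates.
    """
    y, mean, inv_std = [], [], []
    for row in x:
        d = len(row)
        mu = sum(row) / d
        var = sum((v - mu) ** 2 for v in row) / d  # biased variance (divide by D)
        inv = 1.0 / math.sqrt(var + eps)
        y.append([(v - mu) * inv * g + b for v, g, b in zip(row, gamma, beta)])
        mean.append(mu)
        inv_std.append(inv)
    return y, mean, inv_std

x = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 10.0, 10.0]]
y, mean, inv_std = layer_norm_rows(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(mean)
```

With `gamma = 1` and `beta = 0`, each output row should have mean close to 0 and variance close to 1, which is a cheap sanity check to run on the extension's output as well.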

---

## 🗓️ Day 3: Register LN Backward & Verify Gradient Correctness

### 🎯 Objective
Extend the LN operator to support **backward propagation** and ensure it integrates correctly with PyTorch’s autograd system.

### 🧩 Tasks
- Implement and register LN backward:
  - Compute gradients for `dx`, `dgamma`, and `dbeta`
  - Use CUDA kernels for gradient computation
- Bind backward logic via:
  - Custom `torch::autograd::Function` **or**
  - Manual backward registration (educational setup)
- Verify gradient correctness:
  - Compare against PyTorch autograd results
  - Use numerical tolerances

### 🧠 Key Concepts
- Autograd mechanics
- Forward/backward dependency management
- Gradient reduction patterns
- Numerical stability in backward pass

### 📦 Deliverables
- Working backward kernel
- Gradient correctness test (PASS)
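
The gradient check can be rehearsed on a single row before any CUDA code exists. The sketch below is one way to do it in plain Python, assuming the standard LN backward formulas with `g = dout * gamma`; the helper names (`ln_forward_row`, `ln_backward_row`, `numeric_dx`) are hypothetical and the inputs are arbitrary. It compares the analytic `dx` against central finite differences.

```python
import math

def ln_forward_row(x, gamma, beta, eps):
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    inv = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mu) * inv for v in x]
    y = [h * g + b for h, g, b in zip(xhat, gamma, beta)]
    return y, xhat, inv

def ln_backward_row(dout, xhat, inv, gamma):
    """Analytic gradients for one row; dgamma/dbeta are this row's contribution."""
    d = len(dout)
    g = [do * ga for do, ga in zip(dout, gamma)]   # dL/dxhat
    s1 = sum(g)
    s2 = sum(gi * h for gi, h in zip(g, xhat))
    dx = [inv * (gi - s1 / d - h * s2 / d) for gi, h in zip(g, xhat)]
    dgamma = [do * h for do, h in zip(dout, xhat)]
    dbeta = list(dout)
    return dx, dgamma, dbeta

def numeric_dx(x, gamma, beta, dout, eps, h=1e-6):
    """Central finite differences of L = sum(dout * y) with respect to x."""
    grads = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        lp = sum(do * yv for do, yv in zip(dout, ln_forward_row(xp, gamma, beta, eps)[0]))
        lm = sum(do * yv for do, yv in zip(dout, ln_forward_row(xm, gamma, beta, eps)[0]))
        grads.append((lp - lm) / (2 * h))
    return grads

x = [0.5, -1.2, 3.3, 0.7, 2.1]
gamma = [1.0, 0.5, -2.0, 1.5, 0.3]
beta = [0.0, 0.1, 0.2, -0.3, 0.4]
dout = [0.3, -0.7, 1.1, 0.2, -0.5]
eps = 1e-5

_, xhat, inv = ln_forward_row(x, gamma, beta, eps)
dx, dgamma, dbeta = ln_backward_row(dout, xhat, inv, gamma)
max_err = max(abs(a - n) for a, n in zip(dx, numeric_dx(x, gamma, beta, dout, eps)))
print(f"max |dx_analytic - dx_numeric| = {max_err:.2e}")
```

Two properties are worth asserting on the CUDA result too: `dx` sums to approximately zero along each row, and `dgamma`/`dbeta` are sums of these per-row terms over the batch.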

---

## 🗓️ Day 4: Compile, Debug, and Fix Edge Cases

### 🎯 Objective
Harden the extension so it behaves correctly across **realistic and corner-case inputs**.

### 🧩 Tasks
- Fix compilation issues:
  - Template errors
  - Device/dtype mismatches
- Debug runtime errors:
  - Illegal memory access
  - Incorrect indexing
- Handle edge cases:
  - Non-multiple-of-warp feature sizes
  - Small batch sizes
  - Large/small variance values
- Add assertions and sanity checks

### 🧠 Key Concepts
- CUDA debugging strategies
- Shape- and stride-related pitfalls
- Numerical edge cases in normalization
- Defensive programming in C++ extensions

### 📦 Deliverables
- Stable, crash-free extension
- Clean compilation with `-O3`
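
One numerical edge case is easy to reproduce without a GPU: the single-pass variance `E[x^2] - E[x]^2` can lose its significant digits when the mean is large relative to the spread. The sketch below uses Python doubles and an offset of `1e8` purely for illustration; in a float32 kernel the same effect appears at much smaller offsets. The function names are hypothetical.

```python
def var_one_pass(xs):
    """E[x^2] - E[x]^2: one read of the data, but prone to cancellation."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(v * v for v in xs) / n - mean * mean

def var_two_pass(xs):
    """Mean first, then mean of squared deviations: needs a second read of the row."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((v - mean) ** 2 for v in xs) / n

xs = [1e8 + i for i in range(4)]   # true variance is 1.25
naive = var_one_pass(xs)
stable = var_two_pass(xs)
clamped = max(naive, 0.0)          # guard to apply before rsqrt(var + eps)
print(naive, stable, clamped)
```

If the kernel keeps the single-pass form for speed, clamping the variance at zero before `rsqrt` is a cheap guard against NaNs; a two-pass or Welford-style update trades extra work for accuracy.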

---

## 🗓️ Day 5: Benchmark: Custom LN vs PyTorch Official Kernel

### 🎯 Objective
Quantitatively compare your LN implementation against **PyTorch’s official LayerNorm**.

### 🧩 Tasks
- Benchmark forward and backward:
  - Your custom LN extension
  - `torch.nn.LayerNorm`
- Measure:
  - Kernel execution time
  - End-to-end forward/backward time
- Use consistent input sizes and warm-up

### 🧠 Key Concepts
- Fair benchmarking methodology
- Kernel launch overhead
- Memory-bound vs compute-bound behavior
- Why official kernels are highly optimized

### 📦 Deliverables
- Benchmark table or logs
- Short performance analysis

---

## 🗓️ Day 6: Implement Fused GELU + Bias CUDA Kernel

### 🎯 Objective
Apply the same extension workflow to a **fused operator**, reinforcing kernel fusion concepts common in ML systems.

### 🧩 Tasks
- Implement a CUDA kernel that fuses:
  - Bias addition
  - GELU activation
- Register the fused kernel as a PyTorch extension
- Test correctness against PyTorch reference

### 🧠 Key Concepts
- Kernel fusion benefits
- Reducing memory traffic
- Elementwise kernel optimization
- Operator fusion in Transformers

### 📦 Deliverables
- Working fused GELU + Bias kernel
- Python correctness test
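
Before writing the kernel, it helps to pin down which GELU variant the reference uses, because the erf form and the tanh approximation differ slightly. The plain-Python sketch below shows both formulas and the fused `GELU(x + bias)` computation; the function names are hypothetical. In PyTorch, `torch.nn.functional.gelu` selects between the two through its `approximate` argument.

```python
import math

def gelu_erf(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    k = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(k * (x + 0.044715 * x ** 3)))

def fused_bias_gelu(x_rows, bias, gelu=gelu_erf):
    """y[b][d] = GELU(x[b][d] + bias[d]) in one pass, with no intermediate x + bias buffer."""
    return [[gelu(v + b) for v, b in zip(row, bias)] for row in x_rows]

grid = [i / 100.0 for i in range(-600, 601)]
max_diff = max(abs(gelu_erf(v) - gelu_tanh(v)) for v in grid)
print(f"max |erf - tanh| on [-6, 6]: {max_diff:.1e}")
print(fused_bias_gelu([[0.0, 1.0]], bias=[0.0, -1.0]))
```

The gap between the two variants is small but larger than typical fp32 test tolerances, so the correctness test should compare the kernel against the same variant it implements.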

---

## 🗓️ Day 7: Weekly Project: Full PyTorch LN Extension

### 🎯 Objective
Deliver a **complete, reusable PyTorch LN extension** suitable for learning portfolios or ML systems interviews.

### 🧩 Tasks
- Integrate:
  - LN forward
  - LN backward
- Clean up codebase:
  - Clear APIs
  - Consistent naming
- Add:
  - Python test scripts
  - Benchmark script
  - README-style documentation

### 🧠 Key Concepts
- End-to-end operator development
- Code organization for extensions
- Production-style validation and benchmarking

### 📦 Final Deliverable
- A full **PyTorch LayerNorm C++/CUDA extension**
- Runnable from Python with forward + backward
- Benchmarked and validated

---

## 🧠 Weekly Takeaway
> **This week trains you to think like an ML systems engineer: designing, registering, debugging, validating, and benchmarking a real PyTorch operator, not just writing a CUDA kernel.**


In [None]:
!nvcc --version
!nvidia-smi

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Sun Jan 25 12:24:44 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                       

In [1]:
!apt-get update
!apt-get install -y cuda-toolkit-12-4


In [4]:
!pip install ninja

Collecting ninja
  Downloading ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (5.1 kB)
Downloading ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (180 kB)
Installing collected packages: ninja
Successfully installed ninja-1.13.0


## 🧱 Day 2: Register LN Forward (C++ wrapper) + Python call test

In [8]:
# Day 2 (ONE CELL): forward-only LN extension (C++ wrapper + warp-reduce CUDA kernel)

import os, textwrap, torch
from torch.utils.cpp_extension import load

# Install ninja if not already present
!pip install ninja

# -----------------------------
# Write files
# -----------------------------
os.makedirs("ln_ext", exist_ok=True)

open("ln_ext/ext.h", "w").write(r"""
#pragma once
#include <torch/extension.h>

torch::Tensor ln_forward(torch::Tensor x,
                         torch::Tensor gamma,
                         torch::Tensor beta,
                         double eps);

void ln_forward_cuda_launcher(torch::Tensor x,
                              torch::Tensor gamma,
                              torch::Tensor beta,
                              torch::Tensor y,
                              torch::Tensor mean,
                              torch::Tensor inv_std,
                              double eps);
""")

open("ln_ext/ext.cpp", "w").write(r"""
#include <torch/extension.h>
#include "ext.h"

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")
#define CHECK_2D(x) TORCH_CHECK((x).dim() == 2, #x " must be 2D [B, D]")
#define CHECK_1D(x) TORCH_CHECK((x).dim() == 1, #x " must be 1D [D]")

torch::Tensor ln_forward(torch::Tensor x,
                         torch::Tensor gamma,
                         torch::Tensor beta,
                         double eps) {
    // Steps:
    // - validate x: CUDA / contiguous / float32 / 2D
    // - validate gamma, beta: CUDA / contiguous / float32 / 1D with size(0) == D
    // - allocate y [B,D], mean [B], inv_std [B]
    // - call ln_forward_cuda_launcher(...)
    // - return y (mean and inv_std are computed but not returned in this forward-only version)

    // Device, layout, dtype, and rank checks
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x); CHECK_2D(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma); CHECK_1D(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);  CHECK_1D(beta);

    TORCH_CHECK(x.device() == gamma.device(), "x and gamma must be on same device");
    TORCH_CHECK(x.device() == beta.device(),  "x and beta must be on same device");
    TORCH_CHECK(x.size(0) > 0 && x.size(1) > 0, "x must have non-zero B and D");
    TORCH_CHECK(eps > 0.0, "eps must be > 0");

    // Shape checks against the feature dimension D
    auto B = x.size(0);
    auto D = x.size(1);
    TORCH_CHECK(gamma.size(0) == D, "gamma must have shape [D]");
    TORCH_CHECK(beta.size(0)  == D, "beta must have shape [D]");

    auto y = torch::empty_like(x);
    auto mean = torch::empty({B}, x.options());
    auto inv_std = torch::empty({B}, x.options());

    ln_forward_cuda_launcher(x, gamma, beta, y, mean, inv_std, eps);

    return y;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &ln_forward, "LayerNorm forward (CUDA, skeleton)");
}
""")

open("ln_ext/ext_cuda.cu", "w").write(r"""
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")

// Warp-level sum via shuffles: after the loop, lane 0 holds the sum over all 32 lanes.
__device__ __forceinline__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, offset);
    }
    return v;
}

// Forward kernel: one block per row, block-level reduction built on warp reductions
// x: [B,D], gamma/beta: [D], y:[B,D], mean/inv_std:[B]
__global__ void ln_forward_kernel(const float* __restrict__ x,
                                  const float* __restrict__ gamma,
                                  const float* __restrict__ beta,
                                  float* __restrict__ y,
                                  float* __restrict__ mean,
                                  float* __restrict__ inv_std,
                                  int B, int D, float eps) {
    // Plan:
    // - one block per row; threads stride over the D features
    // - reduce sum(x) and sum(x^2) within each warp, then across warps
    // - thread 0 computes and writes mean/inv_std
    // - every thread normalizes, applies the affine transform, and writes y
    int row = blockIdx.x;
    int tid = threadIdx.x;
    int lane = tid & 31;  // lane index within the warp (tid % 32)
    int warp = tid >> 5;  // warp index within the block (tid / 32); a left shift would multiply by 32
    int num_warp = (blockDim.x + 31) / 32;

    const float* xrow = x + (size_t)row * D;
    float* yrow = y + (size_t)row * D;

    // step 1: each thread accumulates a partial sum and a partial sum of squares
    float sumx = 0.0f;
    float sumq = 0.0f;
    for (int i = tid; i < D; i += blockDim.x){
        float xi = xrow[i];
        sumx += xi;
        sumq += xi * xi;
    }

    //step2: reduce within each warp
    sumx = warpReduceSum(sumx);
    sumq = warpReduceSum(sumq);

    //step3: write warp partials to shared memory, then reduce again using warp 0
    __shared__ float warp_sums[32];
    __shared__ float warp_sumsq[32];

    if (lane == 0){
        warp_sums[warp] = sumx;
        warp_sumsq[warp] = sumq;
    }
    __syncthreads();

    //get block sum
    float block_sum = 0.f, block_sumsq = 0.f;
    if(warp == 0){
      block_sum = (lane < num_warp) ? warp_sums[lane]  : 0.f;
      block_sumsq = (lane < num_warp) ? warp_sumsq[lane] : 0.f;

      block_sum   = warpReduceSum(block_sum);
      block_sumsq = warpReduceSum(block_sumsq);
    }

    //step4: broadcast mean and inv_std to all threads
    __shared__ float sh_mu, sh_inv;
    if (tid == 0){
        float mu = block_sum / float(D);
        float var = fmaxf(block_sumsq / (float)D - mu * mu, 0.0f);  // clamp: rounding can push E[x^2] - mu^2 slightly below 0
        float inv = rsqrtf(var + eps);
        sh_mu = mu;
        sh_inv = inv;
        mean[row] = mu;
        inv_std[row] = inv;
    }
    __syncthreads();


    //step5: get result
    float mu = sh_mu, inv = sh_inv;
    for(int i = tid; i < D; i += blockDim.x){
        float xi = xrow[i];
        float xhat = (xi - mu) * inv;
        yrow[i] = xhat * gamma[i] + beta[i];
    }
}

void ln_forward_cuda_launcher(torch::Tensor x,
                              torch::Tensor gamma,
                              torch::Tensor beta,
                              torch::Tensor y,
                              torch::Tensor mean,
                              torch::Tensor inv_std,
                              double eps) {
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);
    CHECK_CUDA(y);     CHECK_CONTIG(y);     CHECK_F32(y);
    CHECK_CUDA(mean);  CHECK_CONTIG(mean);  CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);

    int B = (int)x.size(0);
    int D = (int)x.size(1);

    // Launch config: one block per row; 128 threads when D <= 128, otherwise 256.
    // The thread count is kept a multiple of 32 so every warp is full.
    auto round_up_warp = [](int x) { return (x + 31) & ~31; };
    int f_threads = 256;
    if (D <= 128) f_threads = 128;
    f_threads = round_up_warp(f_threads);

    dim3 f_block(f_threads, 1, 1);
    dim3 f_grid(B, 1, 1);

    ln_forward_kernel<<<f_grid, f_block>>>(
        (const float*)x.data_ptr<float>(),
        (const float*)gamma.data_ptr<float>(),
        (const float*)beta.data_ptr<float>(),
        (float*)y.data_ptr<float>(),
        (float*)mean.data_ptr<float>(),
        (float*)inv_std.data_ptr<float>(),
        B, D, (float)eps
    );

    // Surface launch-time errors (e.g. an invalid configuration) as a Python exception
    cudaError_t err = cudaGetLastError();
    TORCH_CHECK(err == cudaSuccess, "ln_forward_kernel launch failed: ", cudaGetErrorString(err));
}
""")

# -----------------------------
# Build extension
# -----------------------------
ext = load(
    name="ln_ext_forward",
    sources=["ln_ext/ext.cpp", "ln_ext/ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "-lineinfo"],
    with_cuda=True,
    verbose=False
)

print("Day2 extension loaded:", ext)

# -----------------------------
# Python call test: checks that the op runs and returns the expected shape/dtype/device
# -----------------------------
RUN_TEST = True
if RUN_TEST:
    B, D = 16, 128
    x = torch.randn(B, D, device="cuda", dtype=torch.float32)
    gamma = torch.ones(D, device="cuda", dtype=torch.float32)
    beta  = torch.zeros(D, device="cuda", dtype=torch.float32)
    y = ext.forward(x, gamma, beta, 1e-5)
    print("y:", y.shape, y.dtype, y.device)


Day2 extension loaded: <module 'ln_ext_forward_v2' from '/root/.cache/torch_extensions/py312_cu126/ln_ext_forward/ln_ext_forward_v2.so'>
y: torch.Size([16, 128]) torch.float32 cuda:0
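
The two-level reduction in the forward kernel can be traced on the CPU. The sketch below models `__shfl_down_sync` for a full warp in plain Python, assuming 32-lane warps and a block size that is a multiple of 32; the function names are hypothetical. It also shows where the lane index (`tid & 31`) and warp index (`tid >> 5`) enter.

```python
WARP = 32

def shfl_down(vals, offset):
    """Model of __shfl_down_sync over a full warp: lane i reads lane i + offset.

    Lanes whose source would fall outside the warp get their own value back.
    """
    return [vals[i + offset] if i + offset < WARP else vals[i] for i in range(WARP)]

def warp_reduce_sum(vals):
    vals = list(vals)
    offset = WARP // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals  # only vals[0] is guaranteed to hold the full sum

def block_reduce_sum(per_thread):
    """Two-level reduction: warp sums -> shared memory -> warp 0 reduces them."""
    num_warps = (len(per_thread) + WARP - 1) // WARP
    warp_sums = [0.0] * WARP                      # stands in for the __shared__ array
    for tid in range(0, len(per_thread), WARP):
        lane0 = warp_reduce_sum(per_thread[tid:tid + WARP])[0]
        warp_sums[tid >> 5] = lane0               # written by the thread with (tid & 31) == 0
    first_warp = [warp_sums[lane] if lane < num_warps else 0.0 for lane in range(WARP)]
    return warp_reduce_sum(first_warp)[0]

per_thread = [float(t) for t in range(128)]       # 4 warps of per-thread partial sums
print(block_reduce_sum(per_thread))
```

Only lane 0 of each warp ends up with the complete warp sum, which is why exactly one thread per warp should write to shared memory before the second pass.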


## ✅ Day 3: Register LN Backward + gradient correctness check (skeleton)

In [None]:
# Day 3 (ONE CELL): add backward + autograd wrapper skeleton (NO SOLUTION)

import os, textwrap, torch
from torch.utils.cpp_extension import load

# Overwrite C++/CUDA files for backward-enabled extension
open("ln_ext/ext.h", "w").write(r"""
#pragma once
#include <torch/extension.h>

// Forward returns (y, mean, inv_std) for backward reuse
std::vector<torch::Tensor> ln_forward(torch::Tensor x,
                                      torch::Tensor gamma,
                                      torch::Tensor beta,
                                      double eps);

// Backward returns (dx, dgamma, dbeta)
std::vector<torch::Tensor> ln_backward(torch::Tensor x,
                                       torch::Tensor gamma,
                                       torch::Tensor mean,
                                       torch::Tensor inv_std,
                                       torch::Tensor dout);

void ln_forward_cuda_launcher(torch::Tensor x,
                              torch::Tensor gamma,
                              torch::Tensor beta,
                              torch::Tensor y,
                              torch::Tensor mean,
                              torch::Tensor inv_std,
                              double eps);

void ln_backward_cuda_launcher(torch::Tensor x,
                               torch::Tensor gamma,
                               torch::Tensor mean,
                               torch::Tensor inv_std,
                               torch::Tensor dout,
                               torch::Tensor dx,
                               torch::Tensor dgamma,
                               torch::Tensor dbeta);
""")

open("ln_ext/ext.cpp", "w").write(r"""
#include <torch/extension.h>
#include "ext.h"

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")

static void check_forward_args(torch::Tensor x, torch::Tensor gamma, torch::Tensor beta) {
    // TODO: add full checks (dims, shapes)
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);
}

static void check_backward_args(torch::Tensor x, torch::Tensor gamma,
                                torch::Tensor mean, torch::Tensor inv_std,
                                torch::Tensor dout) {
    // TODO: add full checks (dims, shapes)
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(mean); CHECK_CONTIG(mean); CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);
    CHECK_CUDA(dout); CHECK_CONTIG(dout); CHECK_F32(dout);
}

std::vector<torch::Tensor> ln_forward(torch::Tensor x,
                                      torch::Tensor gamma,
                                      torch::Tensor beta,
                                      double eps) {
    check_forward_args(x, gamma, beta);
    auto B = x.size(0);

    auto y = torch::empty_like(x);
    auto mean = torch::empty({B}, x.options());
    auto inv_std = torch::empty({B}, x.options());

    // TODO: real CUDA forward
    ln_forward_cuda_launcher(x, gamma, beta, y, mean, inv_std, eps);

    return {y, mean, inv_std};
}

std::vector<torch::Tensor> ln_backward(torch::Tensor x,
                                       torch::Tensor gamma,
                                       torch::Tensor mean,
                                       torch::Tensor inv_std,
                                       torch::Tensor dout) {
    check_backward_args(x, gamma, mean, inv_std, dout);

    auto dx = torch::empty_like(x);
    auto dgamma = torch::zeros_like(gamma);
    auto dbeta  = torch::zeros_like(gamma);

    // TODO: real CUDA backward
    ln_backward_cuda_launcher(x, gamma, mean, inv_std, dout, dx, dgamma, dbeta);

    return {dx, dgamma, dbeta};
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &ln_forward, "LayerNorm forward (CUDA, skeleton)");
    m.def("backward", &ln_backward, "LayerNorm backward (CUDA, skeleton)");
}
""")

open("ln_ext/ext_cuda.cu", "w").write(r"""
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")

__device__ __forceinline__ float warpReduceSum(float v) {
    // TODO: __shfl_down_sync
    return v;
}

__global__ void ln_forward_kernel(const float* __restrict__ x,
                                  const float* __restrict__ gamma,
                                  const float* __restrict__ beta,
                                  float* __restrict__ y,
                                  float* __restrict__ mean,
                                  float* __restrict__ inv_std,
                                  int B, int D, float eps) {
    // TODO
}

__global__ void ln_backward_kernel(const float* __restrict__ x,
                                   const float* __restrict__ gamma,
                                   const float* __restrict__ mean,
                                   const float* __restrict__ inv_std,
                                   const float* __restrict__ dout,
                                   float* __restrict__ dx,
                                   float* __restrict__ dgamma,
                                   float* __restrict__ dbeta,
                                   int B, int D) {
    // TODO:
    // - compute dx
    // - reduce dgamma/dbeta (atomics or 2-pass strategy)
}

void ln_forward_cuda_launcher(torch::Tensor x, torch::Tensor gamma, torch::Tensor beta,
                              torch::Tensor y, torch::Tensor mean, torch::Tensor inv_std,
                              double eps) {
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);
    CHECK_CUDA(y);     CHECK_CONTIG(y);     CHECK_F32(y);
    CHECK_CUDA(mean);  CHECK_CONTIG(mean);  CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);

    int B = (int)x.size(0);
    int D = (int)x.size(1);

    dim3 block(0,0,1); // TODO
    dim3 grid(0,0,1);  // TODO

    ln_forward_kernel<<<grid, block>>>(
        x.data_ptr<float>(), gamma.data_ptr<float>(), beta.data_ptr<float>(),
        y.data_ptr<float>(), mean.data_ptr<float>(), inv_std.data_ptr<float>(),
        B, D, (float)eps
    );
}

void ln_backward_cuda_launcher(torch::Tensor x, torch::Tensor gamma,
                               torch::Tensor mean, torch::Tensor inv_std,
                               torch::Tensor dout,
                               torch::Tensor dx, torch::Tensor dgamma, torch::Tensor dbeta) {
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(mean); CHECK_CONTIG(mean); CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);
    CHECK_CUDA(dout); CHECK_CONTIG(dout); CHECK_F32(dout);
    CHECK_CUDA(dx); CHECK_CONTIG(dx); CHECK_F32(dx);
    CHECK_CUDA(dgamma); CHECK_CONTIG(dgamma); CHECK_F32(dgamma);
    CHECK_CUDA(dbeta); CHECK_CONTIG(dbeta); CHECK_F32(dbeta);

    int B = (int)x.size(0);
    int D = (int)x.size(1);

    dim3 block(0,0,1); // TODO
    dim3 grid(0,0,1);  // TODO

    ln_backward_kernel<<<grid, block>>>(
        x.data_ptr<float>(), gamma.data_ptr<float>(),
        mean.data_ptr<float>(), inv_std.data_ptr<float>(),
        dout.data_ptr<float>(),
        dx.data_ptr<float>(), dgamma.data_ptr<float>(), dbeta.data_ptr<float>(),
        B, D
    );
}
""")

ext = load(
    name="ln_ext_fwd_bwd",
    sources=["ln_ext/ext.cpp", "ln_ext/ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "-lineinfo"],
    with_cuda=True,
    verbose=False
)

print("Day3 extension loaded:", ext)

# Optional: gradient check harness (disabled until TODOs are implemented)
RUN_GRAD_TEST = False
if RUN_GRAD_TEST:
    B, D = 8, 256
    eps = 1e-5
    x = torch.randn(B, D, device="cuda", dtype=torch.float32, requires_grad=True)
    gamma = torch.randn(D, device="cuda", dtype=torch.float32, requires_grad=True)
    beta  = torch.randn(D, device="cuda", dtype=torch.float32, requires_grad=True)

    # TODO: compare to torch.nn.functional.layer_norm gradients
    # - call ext.forward -> (y, mean, inv_std)
    # - build dout
    # - call ext.backward -> (dx, dgamma, dbeta)
    # - compare to autograd reference
    pass
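
The backward kernel's TODO mentions a two-pass strategy for `dgamma`/`dbeta` as an alternative to atomics. A minimal CPU sketch of that idea, with hypothetical function names and a made-up `rows_per_block` parameter: each block sums a slice of rows into its own `[D]` partial, then a second pass sums the partials.

```python
def partial_column_sums(rows, rows_per_block):
    """Pass 1: each 'block' sums its slice of rows, producing one [D] partial per block."""
    d = len(rows[0])
    partials = []
    for start in range(0, len(rows), rows_per_block):
        acc = [0.0] * d
        for row in rows[start:start + rows_per_block]:
            for j, v in enumerate(row):
                acc[j] += v
        partials.append(acc)
    return partials

def final_column_sums(partials):
    """Pass 2: sum the per-block partials into the final [D] result."""
    return [sum(col) for col in zip(*partials)]

# Per-element contributions: dgamma uses dout * xhat, dbeta uses dout.
dout = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
xhat = [[-1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
dgamma_terms = [[g * h for g, h in zip(gr, hr)] for gr, hr in zip(dout, xhat)]

dgamma = final_column_sums(partial_column_sums(dgamma_terms, rows_per_block=2))
dbeta = final_column_sums(partial_column_sums(dout, rows_per_block=2))
print(dgamma, dbeta)
```

Compared with `atomicAdd` into a single `[D]` buffer, this costs a temporary of size `[num_blocks, D]` and a second launch, but the summation order is fixed, so results are reproducible from run to run.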


## ✅ Day 4: Compile / debug / edge cases (skeleton)

In [None]:
# Day 4 (ONE CELL): edge case test scaffolding + debug aids (NO SOLUTION)

import torch, math

# Edge cases to test (you can expand)
CASES = [
    (1, 7),       # tiny D
    (2, 33),      # not multiple of warp
    (4, 128),
    (16, 1024),
    (3, 4096),    # large D
]

# Toggle when your kernels are implemented
RUN_EDGE_TESTS = False

def run_edge_suite(ext):
    for (B, D) in CASES:
        x = torch.randn(B, D, device="cuda", dtype=torch.float32)
        gamma = torch.randn(D, device="cuda", dtype=torch.float32)
        beta  = torch.randn(D, device="cuda", dtype=torch.float32)

        # TODO: call ext.forward and validate shape/dtype/device
        # y, mean, inv = ext.forward(x, gamma, beta, 1e-5)
        # assert y.shape == x.shape
        # assert mean.shape == (B,)
        # assert inv.shape == (B,)

        # TODO: check numerical sanity (no NaN/Inf)
        # assert torch.isfinite(y).all()

        # TODO: backward sanity
        # dout = torch.randn_like(x)
        # dx, dgamma, dbeta = ext.backward(x, gamma, mean, inv, dout)

        print(f"[EdgeCase] B={B} D={D} -> TODO checks")

# If you already loaded Day3 extension as `ext`, you can run:
if RUN_EDGE_TESTS:
    run_edge_suite(ext)
else:
    print("Day4: Edge suite is ready. Set RUN_EDGE_TESTS=True after implementing kernels.")
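
To see what each edge case exercises, the launch configuration can be tabulated on the CPU. This sketch mirrors the rule used by the Day 2 forward launcher (one block per row, 128 threads when `D <= 128`, otherwise 256); `launch_config` is a hypothetical helper, and a different launcher would need a different mirror.

```python
def round_up_warp(n):
    """Round n up to a multiple of 32 (same bit trick as the launcher)."""
    return (n + 31) & ~31

def launch_config(B, D):
    """Mirror of the forward launcher's choice: one block per row, 128 or 256 threads."""
    threads = 128 if D <= 128 else 256
    threads = round_up_warp(threads)
    iters_per_thread = (D + threads - 1) // threads   # upper bound on strided-loop trips
    idle_threads = max(threads - D, 0)                # threads that never load an element
    return {"grid": B, "block": threads, "iters": iters_per_thread, "idle": idle_threads}

CASES = [(1, 7), (2, 33), (4, 128), (16, 1024), (3, 4096)]
for B, D in CASES:
    print((B, D), launch_config(B, D))
```

The small-`D` cases leave most threads idle, so they mainly test that idle threads contribute zeros to the reduction; the large-`D` cases test the strided loop.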


## ✅ Day 6: Fused GELU + Bias CUDA kernel (extension skeleton)

In [None]:
# Day 6 (ONE CELL): fused Bias+GELU extension skeleton (NO SOLUTION)

import os, torch
from torch.utils.cpp_extension import load

os.makedirs("fused_gelu", exist_ok=True)

open("fused_gelu/ext.cpp", "w").write(r"""
#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be CUDA")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F16F32(x) TORCH_CHECK((x).scalar_type()==at::kHalf || (x).scalar_type()==at::kFloat, #x " must be fp16 or fp32")

void fused_gelu_bias_cuda_launcher(torch::Tensor x, torch::Tensor bias, torch::Tensor y);

torch::Tensor fused_gelu_bias(torch::Tensor x, torch::Tensor bias) {
    // TODO:
    // - checks (CUDA/contig/dtype/shape)
    // - allocate y
    // - call launcher
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F16F32(x);
    CHECK_CUDA(bias); CHECK_CONTIG(bias); CHECK_F16F32(bias);

    auto y = torch::empty_like(x);
    fused_gelu_bias_cuda_launcher(x, bias, y);
    return y;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &fused_gelu_bias, "Fused Bias+GELU forward (CUDA, skeleton)");
}
""")

open("fused_gelu/ext_cuda.cu", "w").write(r"""
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be CUDA")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")

// TODO: implement GELU approximation or exact (no solution here)
// kernel: y = GELU(x + bias)

__global__ void fused_bias_gelu_kernel(/* TODO args */) {
    // TODO
}

void fused_gelu_bias_cuda_launcher(torch::Tensor x, torch::Tensor bias, torch::Tensor y) {
    CHECK_CUDA(x); CHECK_CONTIG(x);
    CHECK_CUDA(bias); CHECK_CONTIG(bias);
    CHECK_CUDA(y); CHECK_CONTIG(y);

    // TODO: grid/block
    dim3 block(0,0,1);
    dim3 grid(0,0,1);

    // TODO: dispatch by dtype (fp16/fp32)
    // fused_bias_gelu_kernel<<<grid, block>>>(...);
}
""")

fused = load(
    name="fused_gelu_bias_ext",
    sources=["fused_gelu/ext.cpp", "fused_gelu/ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "-lineinfo"],
    with_cuda=True,
    verbose=False
)

print("Day6 extension loaded:", fused)

RUN_TEST = False
if RUN_TEST:
    # TODO: compare vs torch.nn.functional.gelu(x + bias)
    pass
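
The grid/block TODO and the bias broadcast both come down to index arithmetic, which can be checked on the CPU first. The sketch below emulates a 1D launch over a contiguous `[B, D]` tensor, assuming one thread per element and a bias of shape `[D]`; the function names are hypothetical, and ReLU is used as a stand-in activation only so the expected values are easy to verify by hand.

```python
def grid_size(n, block=256):
    """Number of blocks needed so that grid * block >= n (ceil division)."""
    return (n + block - 1) // block

def simulate_fused_launch(x_flat, bias, act, block=256):
    """Emulate a 1D launch over a contiguous [B, D] tensor stored as a flat list."""
    n, d = len(x_flat), len(bias)
    y = [0.0] * n
    for block_idx in range(grid_size(n, block)):
        for thread_idx in range(block):
            idx = block_idx * block + thread_idx
            if idx < n:                                    # bounds guard for the last block
                y[idx] = act(x_flat[idx] + bias[idx % d])  # bias broadcasts along rows
    return y

relu = lambda v: max(v, 0.0)   # stand-in activation, not GELU
x_flat = [-1.0, 2.0, -3.0, 4.0, -5.0, 6.0]                 # B=2, D=3
print(grid_size(1000), simulate_fused_launch(x_flat, [1.0, 1.0, 10.0], relu, block=4))
```

The `idx % d` mapping relies on `x` being contiguous with `D` as the innermost dimension, which is what the `CHECK_CONTIG` macro in the wrapper enforces.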


## ✅ Day 7: Weekly project packaging (full LN extension) + scripts (skeleton)

In [None]:
# Day 7 (ONE CELL): project packaging skeleton (NO SOLUTION)
# Creates placeholders for README, tests, benchmark, and Nsight Compute script.

import os, textwrap

os.makedirs("project_ln", exist_ok=True)

open("project_ln/README.md", "w").write(r"""
# PyTorch LayerNorm C++/CUDA Extension (Skeleton)

## What you should have by end of Week
- LN forward (CUDA)
- LN backward (CUDA)
- Python API: forward/backward or autograd wrapper
- Correctness tests vs PyTorch
- Benchmarks vs torch.nn.LayerNorm
- Nsight Compute profiling commands

## TODO
- Document build steps (Colab and local)
- Add usage examples
- Add performance notes and profiling screenshots
""")

open("project_ln/test_ln.py", "w").write(r"""
import torch

def test_forward(ext):
    # TODO: compare ext.forward vs torch layer_norm
    pass

def test_backward(ext):
    # TODO: compare gradients vs autograd
    pass

if __name__ == "__main__":
    # TODO: import your built extension module and run tests
    pass
""")

open("project_ln/bench_ln.py", "w").write(r"""
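import torch

# Not wrapped in torch.no_grad(): fn may call .backward(), which needs autograd enabled.
# Returns the average time per call in milliseconds.
def bench_fn(fn, iters=200, warmup=50):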
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    s = torch.cuda.Event(enable_timing=True)
    e = torch.cuda.Event(enable_timing=True)
    s.record()
    for _ in range(iters):
        fn()
    e.record()
    torch.cuda.synchronize()
    return s.elapsed_time(e) / iters

def main():
    # TODO: load ext
    # TODO: build benchmark cases
    pass

if __name__ == "__main__":
    main()
""")

open("project_ln/run_ncu.sh", "w").write(r"""#!/usr/bin/env bash
set -euo pipefail

# TODO:
# - Build your extension (if building outside Colab JIT)
# - Run Nsight Compute on forward/backward kernels

# Example:
# ncu --set full --kernel-name "ln_forward_kernel" -o ncu_ln_fwd python project_ln/bench_ln.py
# ncu --set full --kernel-name "ln_backward_kernel" -o ncu_ln_bwd python project_ln/bench_ln.py

echo "Edit this script with your kernel names and driver script."
""")

os.system("chmod +x project_ln/run_ncu.sh")
print("Day7 packaging skeleton created under ./project_ln/")
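
For the correctness tests in `test_ln.py`, the pass/fail rule is worth making explicit. The sketch below implements the mixed absolute/relative criterion in plain Python, the same form `torch.allclose` documents; the default tolerances here are assumed starting points for float32 kernels rather than recommended values, and the helper names are hypothetical.

```python
def allclose(actual, expected, rtol=1e-4, atol=1e-5):
    """Elementwise |a - e| <= atol + rtol * |e| over two flat sequences."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))

def max_abs_err(actual, expected):
    """Largest absolute difference, useful to print alongside PASS/FAIL."""
    return max(abs(a - e) for a, e in zip(actual, expected))

ref = [1.0, 2.0]
print(allclose([1.0, 2.00001], ref), allclose([1.0, 2.1], ref), max_abs_err([1.0, 2.5], ref))
```

Reporting the maximum error next to the verdict makes it easier to tell a borderline tolerance miss from a genuinely wrong kernel.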

