# 📘 Task Explanation (Day by Day): PyTorch LayerNorm C++/CUDA Extension

This week focuses on building a **production-style PyTorch LayerNorm (LN) extension**, starting from a C++ forward wrapper and ending with a **complete forward + backward CUDA implementation**, benchmarked against PyTorch's official kernel.

The goal is to understand **how real PyTorch operators are written, registered, validated, and optimized**.

---

## 🗓️ Day 2 - Register LN Forward (C++ Wrapper) & Python Test

### 🎯 Objective
Expose a **custom LN forward implementation** to Python via a PyTorch C++ extension and validate the **Python → C++ → CUDA** execution path.

### 🧩 Tasks
- Write a C++ forward wrapper using ATen:
  - Accept `at::Tensor` inputs
  - Validate device, dtype, and layout
  - Allocate output tensors
  - Dispatch to a CUDA kernel
- Register the forward function using `PYBIND11_MODULE`
- Call the operator from Python and verify:
  - Correct execution
  - Correct output shape and dtype
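
Checking shape and dtype confirms that the call path works, but not that the values are right. As a minimal sketch (the helper name `layer_norm_ref` is hypothetical and not part of the extension), a pure-Python row-wise reference can serve as ground truth for small inputs. In the notebook itself, `torch.nn.functional.layer_norm` plays the same role.

```python
import math

def layer_norm_ref(x, gamma, beta, eps=1e-5):
    """Row-wise LayerNorm over a list of rows (hypothetical reference helper)."""
    out = []
    for row in x:
        d = len(row)
        mu = sum(row) / d
        var = sum((v - mu) ** 2 for v in row) / d   # biased variance, as LayerNorm uses
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mu) * inv_std * g + b
                    for v, g, b in zip(row, gamma, beta)])
    return out

x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 2.0, 0.0]]
y = layer_norm_ref(x, gamma=[1.0] * 4, beta=[0.0] * 4)
for row in y:
    print([round(v, 4) for v in row])
```

With `gamma = 1` and `beta = 0`, every output row should have mean close to 0 and variance close to 1, which is a quick sanity check to run before comparing against the CUDA output.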

### 🧠 Key Concepts
- PyTorch C++ extension registration
- ATen tensor checks and allocation
- Python ↔ C++ ABI boundary
- Kernel launch from C++

### 📦 Deliverables
- Callable `ln_forward()` from Python
- Successful Python test script

---

## 🗓️ Day 3 - Register LN Backward & Verify Gradient Correctness

### 🎯 Objective
Extend the LN operator to support **backward propagation** and ensure it integrates correctly with PyTorch's autograd system.

### 🧩 Tasks
- Implement and register LN backward:
  - Compute gradients for `dx`, `dgamma`, and `dbeta`
  - Use CUDA kernels for gradient computation
- Bind backward logic via:
  - Custom `torch::autograd::Function` **or**
  - Manual backward registration (educational setup)
- Verify gradient correctness:
  - Compare against PyTorch autograd results
  - Use numerical tolerances
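
The gradient check can be sketched without a GPU. The following pure-Python version works on a single row, uses the same `s1 = sum(g)` and `s2 = sum(g * xhat)` reductions as the backward kernel later in this notebook, and compares the analytic `dx` against a central finite difference. The function names and the sample values are illustrative assumptions.

```python
import math

def ln_forward(x, gamma, beta, eps):
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    inv = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mu) * inv for v in x]
    y = [h * g + b for h, g, b in zip(xhat, gamma, beta)]
    return y, xhat, inv

def ln_backward(x, gamma, beta, dout, eps):
    d = len(x)
    _, xhat, inv = ln_forward(x, gamma, beta, eps)
    g = [dy * gm for dy, gm in zip(dout, gamma)]        # dL/dxhat
    s1 = sum(g)
    s2 = sum(gi * h for gi, h in zip(g, xhat))
    dx = [inv * (gi - s1 / d - h * s2 / d) for gi, h in zip(g, xhat)]
    dgamma = [dy * h for dy, h in zip(dout, xhat)]      # one row; sum over rows for a batch
    dbeta = list(dout)
    return dx, dgamma, dbeta

def numeric_dx(x, gamma, beta, dout, eps, h=1e-6):
    """Central finite difference of L = sum(dout * y) with respect to x."""
    grads = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        yp, _, _ = ln_forward(xp, gamma, beta, eps)
        ym, _, _ = ln_forward(xm, gamma, beta, eps)
        grads.append(sum(dy * (a - b) for dy, a, b in zip(dout, yp, ym)) / (2 * h))
    return grads

x     = [0.3, -1.2, 2.0, 0.7, -0.4]
gamma = [1.5, 0.5, -1.0, 2.0, 0.8]
beta  = [0.0, 0.1, -0.2, 0.3, 0.0]
dout  = [0.1, -0.2, 0.3, 0.4, -0.5]
eps   = 1e-5

dx, dgamma, dbeta = ln_backward(x, gamma, beta, dout, eps)
max_err = max(abs(a - b) for a, b in zip(dx, numeric_dx(x, gamma, beta, dout, eps)))
print(f"max |analytic - numeric| = {max_err:.3e}")
```

Python floats are float64, so the finite difference is accurate here. A float32 CUDA kernel is better compared against an autograd reference with explicit tolerances.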

### 🧠 Key Concepts
- Autograd mechanics
- Forward/backward dependency management
- Gradient reduction patterns
- Numerical stability in backward pass

### 📦 Deliverables
- Working backward kernel
- Gradient correctness test (PASS)

---

## 🗓️ Day 4 - Compile, Debug, and Fix Edge Cases

### 🎯 Objective
Harden the extension so it behaves correctly across **realistic and corner-case inputs**.

### 🧩 Tasks
- Fix compilation issues:
  - Template errors
  - Device/dtype mismatches
- Debug runtime errors:
  - Illegal memory access
  - Incorrect indexing
- Handle edge cases:
  - Non-multiple-of-warp feature sizes
  - Small batch sizes
  - Large/small variance values
- Add assertions and sanity checks
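
One numerical edge case worth reproducing in isolation: the forward kernel in this notebook computes variance as `E[x^2] - mean^2`, which loses precision when the mean is large relative to the spread. The sketch below contrasts that one-pass formula with Welford's running update. It is an illustration under stated assumptions: Python floats are float64, so the offset is exaggerated to `1e9`, whereas float32 (24-bit significand) shows the same effect at much smaller offsets.

```python
def var_naive(xs):
    """One-pass variance: E[x^2] - mean^2 (prone to cancellation)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(v * v for v in xs) / n - mu * mu

def var_welford(xs):
    """Welford's running mean / M2 update (biased variance)."""
    mean, m2 = 0.0, 0.0
    for n, v in enumerate(xs, start=1):
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return m2 / len(xs)

xs = [1e9 + 1, 1e9 + 2, 1e9 + 3]   # true biased variance is 2/3
print("naive  :", var_naive(xs))
print("welford:", var_welford(xs))
```

The naive result can even come out negative, in which case `rsqrtf(var + eps)` in a kernel would return NaN. Clamping the variance at zero or switching to a Welford-style or two-pass reduction are the usual remedies.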

### 🧠 Key Concepts
- CUDA debugging strategies
- Shape- and stride-related pitfalls
- Numerical edge cases in normalization
- Defensive programming in C++ extensions

### 📦 Deliverables
- Stable, crash-free extension
- Clean compilation with `-O3`

---

## 🗓️ Day 5 - Benchmark: Custom LN vs PyTorch Official Kernel

### 🎯 Objective
Quantitatively compare your LN implementation against **PyTorch's official LayerNorm**.

### 🧩 Tasks
- Benchmark forward and backward:
  - Your custom LN extension
  - `torch.nn.LayerNorm`
- Measure:
  - Kernel execution time
  - End-to-end forward/backward time
- Use consistent input sizes and warm-up
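
A minimal timing harness for the measurement steps above might look like the sketch below. The `bench` helper and its `sync` parameter are hypothetical. The assumption is that for GPU work you would pass `torch.cuda.synchronize` as `sync`, because CUDA kernel launches return before the kernel has finished. The demo workload here is plain Python so the snippet runs anywhere.

```python
import statistics
import time

def bench(fn, warmup=10, iters=50, sync=None):
    """Return the median wall-clock time of fn() in milliseconds.

    sync: optional callable that blocks until queued work is done
          (assumed to be torch.cuda.synchronize for CUDA workloads).
    """
    for _ in range(warmup):          # warm-up: caches, JIT, lazy initialization
        fn()
    if sync is not None:
        sync()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        if sync is not None:         # wait for asynchronous work before stopping the clock
            sync()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

ms = bench(lambda: sum(i * i for i in range(10_000)))
print(f"median: {ms:.3f} ms")
```

Reporting the median rather than the mean makes the result less sensitive to occasional slow iterations. For kernel-only timings, `torch.cuda.Event(enable_timing=True)` pairs are a common alternative to wall-clock measurement.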

### 🧠 Key Concepts
- Fair benchmarking methodology
- Kernel launch overhead
- Memory-bound vs compute-bound behavior
- Why official kernels are highly optimized

### 📦 Deliverables
- Benchmark table or logs
- Short performance analysis

---

## 🗓️ Day 6 - Implement Fused GELU + Bias CUDA Kernel

### 🎯 Objective
Apply the same extension workflow to a **fused operator**, reinforcing kernel fusion concepts common in ML systems.

### 🧩 Tasks
- Implement a CUDA kernel that fuses:
  - Bias addition
  - GELU activation
- Register the fused kernel as a PyTorch extension
- Test correctness against PyTorch reference
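
Before writing the fused kernel, it helps to pin down the math it must reproduce. The sketch below is a pure-Python reference for bias addition followed by GELU, in both the exact (erf) form and the common tanh approximation. The function names are hypothetical. Which variant the kernel should match depends on the reference you test against (in PyTorch, `F.gelu` takes an `approximate` argument for this).

```python
import math

def gelu_erf(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def gelu_tanh(v):
    """Tanh approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3)))

def bias_gelu_ref(x, bias, approximate=False):
    """Fused reference: y[r][c] = GELU(x[r][c] + bias[c])."""
    act = gelu_tanh if approximate else gelu_erf
    return [[act(v + b) for v, b in zip(row, bias)] for row in x]

x = [[-1.0, 0.0, 1.0], [2.0, -2.0, 0.5]]
bias = [0.1, -0.1, 0.0]
print(bias_gelu_ref(x, bias))
print(bias_gelu_ref(x, bias, approximate=True))
```

The fused CUDA kernel performs the same per-element computation in a single pass, so the intermediate `x + bias` tensor never has to be written to and re-read from global memory.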

### 🧠 Key Concepts
- Kernel fusion benefits
- Reducing memory traffic
- Elementwise kernel optimization
- Operator fusion in Transformers

### 📦 Deliverables
- Working fused GELU + Bias kernel
- Python correctness test

---

## 🗓️ Day 7 - Weekly Project: Full PyTorch LN Extension

### 🎯 Objective
Deliver a **complete, reusable PyTorch LN extension** suitable for learning portfolios or ML systems interviews.

### 🧩 Tasks
- Integrate:
  - LN forward
  - LN backward
- Clean up codebase:
  - Clear APIs
  - Consistent naming
- Add:
  - Python test scripts
  - Benchmark script
  - README-style documentation

### 🧠 Key Concepts
- End-to-end operator development
- Code organization for extensions
- Production-style validation and benchmarking

### 📦 Final Deliverable
- A full **PyTorch LayerNorm C++/CUDA extension**
- Runnable from Python with forward + backward
- Benchmarked and validated

---

## 🧠 Weekly Takeaway
> **This week trains you to think like an ML systems engineer: designing, registering, debugging, validating, and benchmarking a real PyTorch operator, not just writing a CUDA kernel.**


In [1]:
!nvcc --version
!nvidia-smi

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Fri Jan 30 11:45:42 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   63C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                       

In [2]:
!apt-get update
!apt-get install -y cuda-toolkit-12-4
!pip install ninja


## 🧱 Step 1 - Register LN Forward (C++ wrapper) + Python call test (skeleton)

In [3]:
# Step 1 (ONE CELL): forward-only LN extension skeleton (NO SOLUTION)

import os, textwrap, torch
from torch.utils.cpp_extension import load

# Install ninja if not already present
!pip install ninja

# -----------------------------
# Write files
# -----------------------------
os.makedirs("ln_ext", exist_ok=True)

open("ln_ext/ext.h", "w").write(r"""
#pragma once
#include <torch/extension.h>

torch::Tensor ln_forward(torch::Tensor x,
                         torch::Tensor gamma,
                         torch::Tensor beta,
                         double eps);

void ln_forward_cuda_launcher(torch::Tensor x,
                              torch::Tensor gamma,
                              torch::Tensor beta,
                              torch::Tensor y,
                              torch::Tensor mean,
                              torch::Tensor inv_std,
                              double eps);
""")

open("ln_ext/ext.cpp", "w").write(r"""
#include <torch/extension.h>
#include "ext.h"

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")
#define CHECK_2D(x) TORCH_CHECK((x).dim() == 2, #x " must be 2D [B, D]")
#define CHECK_1D(x) TORCH_CHECK((x).dim() == 1, #x " must be 1D [D]")

torch::Tensor ln_forward(torch::Tensor x,
                         torch::Tensor gamma,
                         torch::Tensor beta,
                         double eps) {
    // TODO:
    // - validate: x CUDA/contiguous/float32/2D
    // - validate: gamma,beta CUDA/contiguous/float32/1D and gamma.size(0)==D
    // - allocate y [B,D], mean [B], inv_std [B]
    // - call ln_forward_cuda_launcher(...)
    // - return y (or return a tuple if you prefer, but keep API consistent)

    // Type, layout, and rank checks for each tensor.
    // Shape agreement (gamma/beta vs. D) and device agreement are checked further down.
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x); CHECK_2D(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma); CHECK_1D(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);  CHECK_1D(beta);



    TORCH_CHECK(x.device() == gamma.device(), "x and gamma must be on same device");
    TORCH_CHECK(x.device() == beta.device(),  "x and beta must be on same device");
    TORCH_CHECK(x.size(0) > 0 && x.size(1) > 0, "x must have non-zero B and D");
    TORCH_CHECK(eps > 0.0, "eps must be > 0");



    auto B = x.size(0);
    auto D = x.size(1);
    TORCH_CHECK(gamma.size(0) == D, "gamma must have shape [D]");
    TORCH_CHECK(beta.size(0)  == D, "beta must have shape [D]");

    auto y = torch::empty_like(x);
    auto mean = torch::empty({B}, x.options());
    auto inv_std = torch::empty({B}, x.options());

    // Launch the CUDA kernel; it fills y, mean, and inv_std.
    ln_forward_cuda_launcher(x, gamma, beta, y, mean, inv_std, eps);

    return y;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &ln_forward, "LayerNorm forward (CUDA, skeleton)");
}
""")

open("ln_ext/ext_cuda.cu", "w").write(r"""
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")

// Warp-level sum via shuffle-down; only lane 0 is guaranteed to hold the full sum.
__device__ __forceinline__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, offset);
    }
    return v;
}

// TODO: forward kernel (warp reduce)
// x: [B,D], gamma/beta: [D], y:[B,D], mean/inv_std:[B]
__global__ void ln_forward_kernel(const float* __restrict__ x,
                                  const float* __restrict__ gamma,
                                  const float* __restrict__ beta,
                                  float* __restrict__ y,
                                  float* __restrict__ mean,
                                  float* __restrict__ inv_std,
                                  int B, int D, float eps) {
    // TODO:
    // - map row(s) to warps/blocks
    // - compute mean via warp reduction
    // - compute variance via warp reduction
    // - write mean/inv_std
    // - normalize + affine and write y
    int row = blockIdx.x;
    int tid = threadIdx.x;
    int lane = tid & 31;  // lane within the warp (tid % 32)
    int warp = tid >> 5; // consider why it is like this instead of tid << 5
    int num_warp = (blockDim.x + 31)/ 32;

    const float* xrow = x + (size_t)row * D;
    float* yrow = y + (size_t)row * D;

    // step 1: each thread accumulates a partial sum and a partial sum of squares
    float sumx = 0.0f;
    float sumq = 0.0f;
    for (int i = tid; i < D; i += blockDim.x){
        float xi = xrow[i];
        sumx += xi;
        sumq += xi * xi;
    }

    //step2: reduce within each warp
    sumx = warpReduceSum(sumx);
    sumq = warpReduceSum(sumq);

    //step3: write warp partials to shared memory, then reduce again using warp 0
    __shared__ float warp_sums[32];
    __shared__ float warp_sumsq[32];

    if (lane == 0){
        warp_sums[warp] = sumx;
        warp_sumsq[warp] = sumq;
    }
    __syncthreads();

    //get block sum
    float block_sum = 0.f, block_sumsq = 0.f;
    if(warp == 0){
      block_sum = (lane < num_warp) ? warp_sums[lane]  : 0.f;
      block_sumsq = (lane < num_warp) ? warp_sumsq[lane] : 0.f;

      block_sum   = warpReduceSum(block_sum);
      block_sumsq = warpReduceSum(block_sumsq);
    }

    //step4: broadcast mean and inv_std to all threads
    __shared__ float sh_mu, sh_inv;
    if (tid == 0){
        float mu = block_sum / float(D);
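        // E[x^2] - mu^2 can round slightly negative in float32; clamp to avoid NaN from rsqrtf
        float var = fmaxf(block_sumsq / (float)D - mu * mu, 0.0f);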
        float inv = rsqrtf(var + eps);
        sh_mu = mu;
        sh_inv = inv;
        mean[row] = mu;
        inv_std[row] = inv;
    }
    __syncthreads();


    //step5: get result
    float mu = sh_mu, inv = sh_inv;
    for(int i = tid; i < D; i += blockDim.x){
        float xi = xrow[i];
        float xhat = (xi - mu) * inv;
        yrow[i] = xhat * gamma[i] + beta[i];
    }
}

void ln_forward_cuda_launcher(torch::Tensor x,
                              torch::Tensor gamma,
                              torch::Tensor beta,
                              torch::Tensor y,
                              torch::Tensor mean,
                              torch::Tensor inv_std,
                              double eps) {
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);
    CHECK_CUDA(y);     CHECK_CONTIG(y);     CHECK_F32(y);
    CHECK_CUDA(mean);  CHECK_CONTIG(mean);  CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);

    int B = (int)x.size(0);
    int D = (int)x.size(1);

    // Launch config: one block per row; 128 threads when D <= 128, otherwise 256.
    auto round_up_warp = [](int x) { return (x + 31) & ~31; };
    int f_threads = 256;
    if (D <= 128) f_threads = 128;
    f_threads = round_up_warp(f_threads);

    dim3 f_block(f_threads, 1, 1);
    dim3 f_grid(B, 1, 1);


    // Launch on the default stream: B blocks, each handling one row.
    ln_forward_kernel<<<f_grid, f_block>>>(
        (const float*)x.data_ptr<float>(),
        (const float*)gamma.data_ptr<float>(),
        (const float*)beta.data_ptr<float>(),
        (float*)y.data_ptr<float>(),
        (float*)mean.data_ptr<float>(),
        (float*)inv_std.data_ptr<float>(),
        B, D, (float)eps
    );
}
""")

# -----------------------------
# Build extension
# -----------------------------
ext = load(
    name="ln_ext_forward",
    sources=["ln_ext/ext.cpp", "ln_ext/ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "-lineinfo"],
    with_cuda=True,
    verbose=False
)

print("Step1 extension loaded:", ext)

# -----------------------------
# Python call test (smoke test: checks shape, dtype, and device only)
# -----------------------------
RUN_TEST = True
if RUN_TEST:
    B, D = 16, 128
    x = torch.randn(B, D, device="cuda", dtype=torch.float32)
    gamma = torch.ones(D, device="cuda", dtype=torch.float32)
    beta  = torch.zeros(D, device="cuda", dtype=torch.float32)
    y = ext.forward(x, gamma, beta, 1e-5)
    print("y:", y.shape, y.dtype, y.device)


Step1 extension loaded: <module 'ln_ext_forward' from '/root/.cache/torch_extensions/py312_cu126/ln_ext_forward/ln_ext_forward.so'>
y: torch.Size([16, 128]) torch.float32 cuda:0
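
The two-level reduction in the forward kernel (shuffle within each warp, then a second shuffle pass over the per-warp partials in warp 0) can be traced on the CPU. The sketch below simulates it in plain Python, assuming a block size that is a multiple of 32. The lane index is the low five bits of the thread id (`tid & 31`) and the warp index is `tid >> 5`. The function names are hypothetical.

```python
WARP = 32

def warp_reduce_sum(vals):
    """Simulate the shuffle-down tree on one warp's 32 lane values.

    Only lane 0 is guaranteed to hold the full sum afterwards.
    """
    v = list(vals)
    offset = WARP // 2
    while offset > 0:
        # Each lane adds the value held by lane + offset before this step.
        # Lanes whose source is out of range get their own value back.
        v = [v[l] + (v[l + offset] if l + offset < WARP else v[l])
             for l in range(WARP)]
        offset >>= 1
    return v

def block_reduce_sum(partials):
    """Simulate the block-wide sum over per-thread partials."""
    block_dim = len(partials)
    assert block_dim % WARP == 0, "sketch assumes full warps"
    num_warps = block_dim // WARP
    regs = [[0] * WARP for _ in range(num_warps)]
    for tid, p in enumerate(partials):
        lane, warp = tid & 31, tid >> 5      # same index math as the kernel
        regs[warp][lane] = p
    warp_sums = [0] * WARP                   # mirrors __shared__ float warp_sums[32]
    for warp in range(num_warps):
        warp_sums[warp] = warp_reduce_sum(regs[warp])[0]   # lane 0 writes
    first = [warp_sums[lane] if lane < num_warps else 0 for lane in range(WARP)]
    return warp_reduce_sum(first)[0]         # result valid in tid 0 only

print(block_reduce_sum(list(range(256))))    # 32640
```

Because only thread 0 ends up with the block total, the kernel has to broadcast the derived mean and inverse standard deviation through shared memory before the normalization loop.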


## ✅ Step 2 - Register LN Backward + gradient correctness check (skeleton)

In [4]:
# Step 2 (ONE CELL): add backward + autograd wrapper skeleton (NO SOLUTION)

import os
import torch
import torch.nn.functional as F
from torch.utils.cpp_extension import load

# Overwrite C++/CUDA files for backward-enabled extension
open("ln_ext/ext.h", "w").write(r"""
#pragma once
#include <torch/extension.h>

// Forward returns (y, mean, inv_std) for backward reuse
std::vector<torch::Tensor> ln_forward(torch::Tensor x,
                                      torch::Tensor gamma,
                                      torch::Tensor beta,
                                      double eps);

// Backward returns (dx, dgamma, dbeta)
std::vector<torch::Tensor> ln_backward(torch::Tensor x,
                                       torch::Tensor gamma,
                                       torch::Tensor mean,
                                       torch::Tensor inv_std,
                                       torch::Tensor dout);

void ln_forward_cuda_launcher(torch::Tensor x,
                              torch::Tensor gamma,
                              torch::Tensor beta,
                              torch::Tensor y,
                              torch::Tensor mean,
                              torch::Tensor inv_std,
                              double eps);

void ln_backward_cuda_launcher(torch::Tensor x,
                               torch::Tensor gamma,
                               torch::Tensor mean,
                               torch::Tensor inv_std,
                               torch::Tensor dout,
                               torch::Tensor dx,
                               torch::Tensor dgamma,
                               torch::Tensor dbeta);
""")

open("ln_ext/ext.cpp", "w").write(r"""
#include <torch/extension.h>
#include "ext.h"

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")

static void check_forward_args(torch::Tensor x, torch::Tensor gamma, torch::Tensor beta) {
    // Device, layout, and dtype checks
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);

    // Dims
    TORCH_CHECK(x.dim() == 2, "x must be 2D [B, D], got dim=", x.dim());
    TORCH_CHECK(gamma.dim() == 1, "gamma must be 1D [D], got dim=", gamma.dim());
    TORCH_CHECK(beta.dim() == 1,  "beta must be 1D [D], got dim=", beta.dim());


    // Shapes
    const int64_t B = x.size(0);
    const int64_t D = x.size(1);
    TORCH_CHECK(B > 0 && D > 0, "x must have non-zero shape, got [", B, ", ", D, "]");
    TORCH_CHECK(gamma.size(0) == D, "gamma must have shape [D] with D=", D,
                ", got gamma.size(0)=", gamma.size(0));
    TORCH_CHECK(beta.size(0) == D,  "beta must have shape [D] with D=", D,
                ", got beta.size(0)=", beta.size(0));

    // Same device (important for multi-GPU)
    TORCH_CHECK(x.device() == gamma.device(),
                "x and gamma must be on the same device, got x=", x.device(),
                " gamma=", gamma.device());
    TORCH_CHECK(x.device() == beta.device(),
                "x and beta must be on the same device, got x=", x.device(),
                " beta=", beta.device());

}

static void check_backward_args(torch::Tensor x, torch::Tensor gamma,
                                torch::Tensor mean, torch::Tensor inv_std,
                                torch::Tensor dout) {
    // Device, layout, and dtype checks
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(mean); CHECK_CONTIG(mean); CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);
    CHECK_CUDA(dout); CHECK_CONTIG(dout); CHECK_F32(dout);

    // Dims
    TORCH_CHECK(x.dim() == 2, "x must be 2D [B, D], got dim=", x.dim());
    TORCH_CHECK(dout.dim() == 2, "dout must be 2D [B, D], got dim=", dout.dim());
    TORCH_CHECK(gamma.dim() == 1, "gamma must be 1D [D], got dim=", gamma.dim());
    TORCH_CHECK(mean.dim() == 1, "mean must be 1D [B], got dim=", mean.dim());
    TORCH_CHECK(inv_std.dim() == 1, "inv_std must be 1D [B], got dim=", inv_std.dim());

    // Shapes
    const int64_t B = x.size(0);
    const int64_t D = x.size(1);
    TORCH_CHECK(B > 0 && D > 0, "x must have non-zero shape, got [", B, ", ", D, "]");

    TORCH_CHECK(dout.size(0) == B && dout.size(1) == D,
                "dout must have shape [B, D]=[", B, ", ", D, "], got [",
                dout.size(0), ", ", dout.size(1), "]");

    TORCH_CHECK(gamma.size(0) == D,
                "gamma must have shape [D] with D=", D,
                ", got gamma.size(0)=", gamma.size(0));

    TORCH_CHECK(mean.size(0) == B,
                "mean must have shape [B] with B=", B,
                ", got mean.size(0)=", mean.size(0));

    TORCH_CHECK(inv_std.size(0) == B,
                "inv_std must have shape [B] with B=", B,
                ", got inv_std.size(0)=", inv_std.size(0));

    // Same device for all tensors
    const auto dev = x.device();
    TORCH_CHECK(gamma.device() == dev,   "gamma must be on same device as x");
    TORCH_CHECK(mean.device() == dev,    "mean must be on same device as x");
    TORCH_CHECK(inv_std.device() == dev, "inv_std must be on same device as x");
    TORCH_CHECK(dout.device() == dev,    "dout must be on same device as x");
}

std::vector<torch::Tensor> ln_forward(torch::Tensor x,
                                      torch::Tensor gamma,
                                      torch::Tensor beta,
                                      double eps) {
    check_forward_args(x, gamma, beta);
    auto B = x.size(0);

    auto y = torch::empty_like(x);
    auto mean = torch::empty({B}, x.options());
    auto inv_std = torch::empty({B}, x.options());

    // Launch the CUDA forward kernel; it fills y, mean, and inv_std.
    ln_forward_cuda_launcher(x, gamma, beta, y, mean, inv_std, eps);

    return {y, mean, inv_std};
}

std::vector<torch::Tensor> ln_backward(torch::Tensor x,
                                       torch::Tensor gamma,
                                       torch::Tensor mean,
                                       torch::Tensor inv_std,
                                       torch::Tensor dout) {
    check_backward_args(x, gamma, mean, inv_std, dout);

    auto dx = torch::empty_like(x);
    auto dgamma = torch::zeros_like(gamma);
    auto dbeta  = torch::zeros_like(gamma);

    // Launch the CUDA backward kernel; dgamma/dbeta start at zero because it accumulates into them.
    ln_backward_cuda_launcher(x, gamma, mean, inv_std, dout, dx, dgamma, dbeta);

    return {dx, dgamma, dbeta};
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &ln_forward, "LayerNorm forward (CUDA, skeleton)");
    m.def("backward", &ln_backward, "LayerNorm backward (CUDA, skeleton)");
}
""")

open("ln_ext/ext_cuda.cu", "w").write(r"""
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F32(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")

// Warp-level sum via shuffle-down; only lane 0 is guaranteed to hold the full sum.
__device__ __forceinline__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, offset);
    }
    return v;
}

__global__ void ln_forward_kernel(const float* __restrict__ x,
                                  const float* __restrict__ gamma,
                                  const float* __restrict__ beta,
                                  float* __restrict__ y,
                                  float* __restrict__ mean,
                                  float* __restrict__ inv_std,
                                  int B, int D, float eps) {
    // TODO
  int row = blockIdx.x;
  int tid = threadIdx.x;
  int lane = tid & 31;
  int warp = tid >> 5; // consider why it is like this instead of tid << 5
  int num_warp = (blockDim.x + 31)/ 32;

  const float* xrow = x + (size_t)row * D;
  float* yrow = y + (size_t)row * D;

  // step 1: each thread accumulates a partial sum and a partial sum of squares

  float sumx = 0.0f;
  float sumq = 0.0f;
  for (int i = tid; i < D; i += blockDim.x){
      float xi = xrow[i];
      sumx += xi;
      sumq += xi * xi;
  }

  //step2: reduce within each warp
  sumx = warpReduceSum(sumx);
  sumq = warpReduceSum(sumq);

  //step3: write warp partials to shared memory, then reduce again using warp 0
  __shared__ float warp_sums[32];
  __shared__ float warp_sumsq[32];

  if (lane == 0){
      warp_sums[warp] = sumx;
      warp_sumsq[warp] = sumq;
  }
  __syncthreads();

  //get block sum
  float block_sum = 0.f, block_sumsq = 0.f;
  if(warp == 0){
    block_sum = (lane < num_warp) ? warp_sums[lane]  : 0.f;
    block_sumsq = (lane < num_warp) ? warp_sumsq[lane] : 0.f;

    block_sum   = warpReduceSum(block_sum);
    block_sumsq = warpReduceSum(block_sumsq);
  }

  //step4: broadcast mean and invstd to all threads
  __shared__ float sh_mu, sh_inv;
  if (tid == 0){
      float mu = block_sum / float(D);
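      // E[x^2] - mu^2 can round slightly negative in float32; clamp to avoid NaN from rsqrtf
      float var = fmaxf(block_sumsq / (float)D - mu * mu, 0.0f);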
      float inv = rsqrtf(var + eps);
      sh_mu = mu;
      sh_inv = inv;
      mean[row] = mu;
      inv_std[row] = inv;
  }
  __syncthreads();


  //step5: get result
  float mu = sh_mu, inv = sh_inv;
  for(int i = tid; i < D; i += blockDim.x){
      float xi = xrow[i];
      float xhat = (xi - mu) * inv;
      yrow[i] = xhat * gamma[i] + beta[i];
  }
}

__global__ void ln_backward_kernel(const float* __restrict__ x,
                                   const float* __restrict__ gamma,
                                   const float* __restrict__ mean,
                                   const float* __restrict__ inv_std,
                                   const float* __restrict__ dout,
                                   float* __restrict__ dx,
                                   float* __restrict__ dgamma,
                                   float* __restrict__ dbeta,
                                   int B, int D) {
    // TODO:
    // - compute dx
    // - reduce dgamma/dbeta (atomics or 2-pass strategy)
  int row = blockIdx.x;
  int tid = threadIdx.x;
  int lane = tid & 31;
  int warp = tid >> 5;
  int num_warps = (blockDim.x + 31) / 32;

  const float* xrow  = x  + (size_t)row * D;
  const float* dyrow = dout + (size_t)row * D;
  float* dxrow       = dx + (size_t)row * D;

  float mu = mean[row];
  float inv = inv_std[row];

  // step1: Accumulate partial sums for s1 and s2 in FP32
  float s1 = 0.f;   // sum(g)
  float s2 = 0.f;   // sum(g * xhat)

  for(int i = tid; i < D; i+= blockDim.x){
      float xi = xrow[i];
      float dyi = dyrow[i];
      float gi  = dyi * gamma[i];
      float xhat = (xi - mu) * inv;
      s1 += gi;
      s2 += gi * xhat;
  }

  // step2: Warp reduce

  s1 = warpReduceSum(s1);
  s2 = warpReduceSum(s2);

  // step3: Warp partials -> shared, then reduce with warp 0
  __shared__ float warp_s1[32];
  __shared__ float warp_s2[32];

  if (lane == 0) {
    warp_s1[warp] = s1;
    warp_s2[warp] = s2;
  }
  __syncthreads();

  float block_s1 = 0.f;
  float block_s2 = 0.f;

  if (warp == 0) {
    block_s1 = (lane < num_warps) ? warp_s1[lane] : 0.f;
    block_s2 = (lane < num_warps) ? warp_s2[lane] : 0.f;
    block_s1 = warpReduceSum(block_s1);
    block_s2 = warpReduceSum(block_s2);
  }

  // step4: Broadcast block_s1/block_s2
  __shared__ float sh_s1, sh_s2;
  if (tid == 0) {
    sh_s1 = block_s1;
    sh_s2 = block_s2;
  }
  __syncthreads();

  float S1 = sh_s1;
  float S2 = sh_s2;

  //step5: Write dx

  float invD = 1.0f / (float)D;

  for (int i = tid; i < D; i += blockDim.x) {
    float xi  = xrow[i];
    float dyi = dyrow[i];
    float gi  = dyi * gamma[i];
    float xhat = (xi - mu) * inv;

    float dx_i = inv * (gi - S1 * invD - xhat * (S2 * invD));
    dxrow[i] = dx_i;
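    // dgamma/dbeta are summed across rows (one block per row), so accumulate atomically.
    // Both buffers are zero-initialized by the caller.
    atomicAdd(&dbeta[i],  dyi);
    atomicAdd(&dgamma[i], dyi * xhat);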

  }
}

void ln_forward_cuda_launcher(torch::Tensor x, torch::Tensor gamma, torch::Tensor beta,
                              torch::Tensor y, torch::Tensor mean, torch::Tensor inv_std,
                              double eps) {
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(beta);  CHECK_CONTIG(beta);  CHECK_F32(beta);
    CHECK_CUDA(y);     CHECK_CONTIG(y);     CHECK_F32(y);
    CHECK_CUDA(mean);  CHECK_CONTIG(mean);  CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);

    int B = (int)x.size(0);
    int D = (int)x.size(1);

    auto round_up_warp = [](int x) { return (x + 31) & ~31; };

    int f_threads = 256;
    if (D <= 128) f_threads = 128;
    f_threads = round_up_warp(f_threads);

    dim3 f_block(f_threads, 1, 1);
    dim3 f_grid(B, 1, 1);

    // One block per row; block size chosen above from D.

    ln_forward_kernel<<<f_grid, f_block>>>(
        x.data_ptr<float>(), gamma.data_ptr<float>(), beta.data_ptr<float>(),
        y.data_ptr<float>(), mean.data_ptr<float>(), inv_std.data_ptr<float>(),
        B, D, (float)eps
    );
}

void ln_backward_cuda_launcher(torch::Tensor x, torch::Tensor gamma,
                               torch::Tensor mean, torch::Tensor inv_std,
                               torch::Tensor dout,
                               torch::Tensor dx, torch::Tensor dgamma, torch::Tensor dbeta) {
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F32(x);
    CHECK_CUDA(gamma); CHECK_CONTIG(gamma); CHECK_F32(gamma);
    CHECK_CUDA(mean); CHECK_CONTIG(mean); CHECK_F32(mean);
    CHECK_CUDA(inv_std); CHECK_CONTIG(inv_std); CHECK_F32(inv_std);
    CHECK_CUDA(dout); CHECK_CONTIG(dout); CHECK_F32(dout);
    CHECK_CUDA(dx); CHECK_CONTIG(dx); CHECK_F32(dx);
    CHECK_CUDA(dgamma); CHECK_CONTIG(dgamma); CHECK_F32(dgamma);
    CHECK_CUDA(dbeta); CHECK_CONTIG(dbeta); CHECK_F32(dbeta);

    int B = (int)x.size(0);
    int D = (int)x.size(1);

    // Same launch shape as forward: one block per row, 128 or 256 threads.
    auto round_up_warp = [](int x) { return (x + 31) & ~31; };
    int b_threads = 256;
    if (D <= 128) b_threads = 128;
    b_threads = round_up_warp(b_threads);

    dim3 b_block(b_threads, 1, 1);
    dim3 b_grid(B, 1, 1);

    ln_backward_kernel<<<b_grid, b_block>>>(
        x.data_ptr<float>(), gamma.data_ptr<float>(),
        mean.data_ptr<float>(), inv_std.data_ptr<float>(),
        dout.data_ptr<float>(),
        dx.data_ptr<float>(), dgamma.data_ptr<float>(), dbeta.data_ptr<float>(),
        B, D
    );
}
""")

ext = load(
    name="ln_ext_fwd_bwd",
    sources=["ln_ext/ext.cpp", "ln_ext/ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "-lineinfo"],
    with_cuda=True,
    verbose=False
)

print("Step2 extension loaded:", ext)

# Gradient check harness: compares extension gradients against the autograd reference
RUN_GRAD_TEST = True
if RUN_GRAD_TEST:
    B, D = 8, 256
    eps = 1e-5
    x = torch.randn(B, D, device="cuda", dtype=torch.float32, requires_grad=True)
    gamma = torch.randn(D, device="cuda", dtype=torch.float32, requires_grad=True)
    beta  = torch.randn(D, device="cuda", dtype=torch.float32, requires_grad=True)

    # TODO: compare to torch.nn.functional.layer_norm gradients
    # - call ext.forward -> (y, mean, inv_std)
    # - build dout
    # - call ext.backward -> (dx, dgamma, dbeta)
    # - compare to autograd reference

    # Step1: Extension forward -> (y, mean, inv_std)
    y_ext, mean_ext, inv_std_ext = ext.forward(x, gamma, beta, eps)

    # Build upstream gradient dout (same shape as y)
    dout = torch.randn_like(y_ext)

    # Extension backward -> (dx, dgamma, dbeta)
    dx_ext, dgamma_ext, dbeta_ext = ext.backward(x, gamma, mean_ext, inv_std_ext, dout)

    # Step 2: autograd reference via F.layer_norm, using the same eps, weight, and bias.
    # Use detached clones so the reference gradients do not mix with the ext path.
    x_ref = x.detach().clone().requires_grad_(True)
    gamma_ref = gamma.detach().clone().requires_grad_(True)
    beta_ref  = beta.detach().clone().requires_grad_(True)

    y_ref = F.layer_norm(x_ref, normalized_shape=(D,), weight=gamma_ref, bias=beta_ref, eps=eps)

    # Backprop with the same dout
    y_ref.backward(dout)

    dx_ref = x_ref.grad
    dgamma_ref = gamma_ref.grad
    dbeta_ref = beta_ref.grad

    # Step 3: compare
    def report(name, a, b, atol=1e-4, rtol=1e-3):
        diff = (a - b).abs()
        max_abs = diff.max().item()
        max_rel = (diff / b.abs().clamp_min(1e-12)).max().item()
        ok = torch.allclose(a, b, atol=atol, rtol=rtol)
        print(f"{name}: allclose={ok}  max_abs={max_abs:.6e}  max_rel={max_rel:.6e}  "
              f"(atol={atol}, rtol={rtol})")
        return ok

    print("[grad test] comparing ext vs torch.autograd reference")
    ok_dx = report("dx", dx_ext, dx_ref, atol=1e-3, rtol=1e-3)
    ok_dg = report("dgamma", dgamma_ext, dgamma_ref, atol=1e-3, rtol=1e-3)
    ok_db = report("dbeta", dbeta_ext, dbeta_ref, atol=1e-3, rtol=1e-3)

    if not (ok_dx and ok_dg and ok_db):
        # Print the single worst (largest abs diff) flat index per gradient for debugging
        def worst_idx(a, b):
            diff = (a - b).abs().reshape(-1)
            idx = diff.argmax().item()
            return idx, diff[idx].item()

        idx, val = worst_idx(dx_ext, dx_ref)
        print(f"worst dx idx(flat)={idx}, abs_diff={val:.6e}")
        idx, val = worst_idx(dgamma_ext, dgamma_ref)
        print(f"worst dgamma idx(flat)={idx}, abs_diff={val:.6e}")
        idx, val = worst_idx(dbeta_ext, dbeta_ref)
        print(f"worst dbeta idx(flat)={idx}, abs_diff={val:.6e}")


Step2 extension loaded: <module 'ln_ext_fwd_bwd' from '/root/.cache/torch_extensions/py312_cu126/ln_ext_fwd_bwd/ln_ext_fwd_bwd.so'>
[grad test] comparing ext vs torch.autograd reference
dx: allclose=True  max_abs=9.536743e-07  max_rel=5.937974e-04  (atol=0.001, rtol=0.001)
dgamma: allclose=True  max_abs=9.536743e-07  max_rel=1.114033e-05  (atol=0.001, rtol=0.001)
dbeta: allclose=True  max_abs=9.536743e-07  max_rel=3.072305e-06  (atol=0.001, rtol=0.001)
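
The gradient check above treats `F.layer_norm` as the reference. As a complementary sketch, the closed-form LayerNorm gradients can be written in plain Python and checked against central finite differences on the CPU. The helper names here (`ln_forward`, `ln_backward`, `loss`, `central_diff`) are illustrative only and are not part of the extension; the formulas are the standard ones for row-wise LayerNorm with biased variance, and the CUDA kernel is assumed, not confirmed, to implement the same math.

```python
import math

def ln_forward(x, gamma, beta, eps=1e-5):
    """Row-wise LayerNorm on a list of rows; returns (y, mean, inv_std)."""
    y, means, inv_stds = [], [], []
    for row in x:
        d = len(row)
        mu = sum(row) / d
        var = sum((v - mu) ** 2 for v in row) / d  # biased variance
        inv = 1.0 / math.sqrt(var + eps)
        y.append([(v - mu) * inv * g + b for v, g, b in zip(row, gamma, beta)])
        means.append(mu)
        inv_stds.append(inv)
    return y, means, inv_stds

def ln_backward(x, gamma, means, inv_stds, dout):
    """Closed-form (dx, dgamma, dbeta) given the saved mean and inv_std."""
    d = len(gamma)
    dgamma, dbeta, dx = [0.0] * d, [0.0] * d, []
    for row, mu, inv, g_row in zip(x, means, inv_stds, dout):
        xhat = [(v - mu) * inv for v in row]
        dxhat = [g * w for g, w in zip(g_row, gamma)]
        m1 = sum(dxhat) / d                                # mean(dxhat)
        m2 = sum(a * h for a, h in zip(dxhat, xhat)) / d   # mean(dxhat * xhat)
        dx.append([inv * (a - m1 - h * m2) for a, h in zip(dxhat, xhat)])
        for j in range(d):
            dgamma[j] += g_row[j] * xhat[j]  # reduced over the batch
            dbeta[j] += g_row[j]             # reduced over the batch
    return dx, dgamma, dbeta

def loss(x, gamma, beta, dout):
    """Scalar sum(y * dout), whose gradient w.r.t. y is exactly dout."""
    y, _, _ = ln_forward(x, gamma, beta)
    return sum(a * b for yr, gr in zip(y, dout) for a, b in zip(yr, gr))

def central_diff(f, h=1e-6):
    """Central finite difference of f at 0, where f(t) perturbs one entry by t."""
    return (f(h) - f(-h)) / (2 * h)

x = [[0.5, -1.2, 2.0, 0.3], [1.5, 0.1, -0.7, 0.9]]
gamma = [1.0, 0.5, -2.0, 1.5]
beta = [0.0, 0.1, 0.2, -0.3]
dout = [[0.3, -0.6, 1.1, 0.2], [-0.4, 0.8, 0.5, -1.0]]

_, means, inv_stds = ln_forward(x, gamma, beta)
dx, dgamma, dbeta = ln_backward(x, gamma, means, inv_stds, dout)

def perturb_x(t, i=0, j=1):
    xp = [row[:] for row in x]
    xp[i][j] += t
    return loss(xp, gamma, beta, dout)

def perturb_gamma(t, j=2):
    gp = gamma[:]
    gp[j] += t
    return loss(x, gp, beta, dout)

print("dx[0][1]  analytic vs numeric:", dx[0][1], central_diff(perturb_x))
print("dgamma[2] analytic vs numeric:", dgamma[2], central_diff(perturb_gamma))
```

A useful side property for debugging: each row of `dx` sums to zero (up to rounding), because LayerNorm is invariant to adding a constant to a row.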


## ✅ Step 3 - Compile / debug / edge cases (skeleton)

In [5]:
# Step 3 (ONE CELL): edge case test scaffolding + debug aids (NO SOLUTION)

import torch

# Edge cases to test (you can expand)
CASES = [
    (1, 7),       # tiny D
    (2, 33),      # not multiple of warp
    (4, 128),
    (16, 1024),
    (3, 4096),    # large D
]

# Toggle when your kernels are implemented
RUN_EDGE_TESTS = False

def run_edge_suite(ext):
    for (B, D) in CASES:
        x = torch.randn(B, D, device="cuda", dtype=torch.float32)
        gamma = torch.randn(D, device="cuda", dtype=torch.float32)
        beta  = torch.randn(D, device="cuda", dtype=torch.float32)

        # Forward: validate shape, dtype, and device of all three outputs.
        # eps is defined locally so this cell does not depend on the grad-test cell.
        eps = 1e-5
        y, mean, inv = ext.forward(x, gamma, beta, eps)
        assert y.shape == x.shape, f"y shape mismatch: {y.shape} vs {x.shape}"
        assert mean.shape == (B,), f"mean shape mismatch: {mean.shape} vs {(B,)}"
        assert inv.shape == (B,), f"inv_std shape mismatch: {inv.shape} vs {(B,)}"

        assert y.dtype == x.dtype
        assert mean.dtype == x.dtype
        assert inv.dtype == x.dtype

        assert y.device.type == "cuda"
        assert mean.device.type == "cuda"
        assert inv.device.type == "cuda"

        # Numerical sanity (no NaN/Inf)
        assert torch.isfinite(y).all(), "NaN/Inf found in y"
        assert torch.isfinite(mean).all(), "NaN/Inf found in mean"
        assert torch.isfinite(inv).all(), "NaN/Inf found in inv_std"

        # Backward sanity: random upstream gradient with the same shape as y
        dout = torch.randn_like(y)
        dx, dgamma, dbeta = ext.backward(x, gamma, mean, inv, dout)

        # Shape checks
        assert dx.shape == x.shape, f"dx shape mismatch: {dx.shape}"
        assert dgamma.shape == gamma.shape, f"dgamma shape mismatch: {dgamma.shape}"
        assert dbeta.shape == beta.shape, f"dbeta shape mismatch: {dbeta.shape}"

        # Numerical sanity
        assert torch.isfinite(dx).all(), "NaN/Inf found in dx"
        assert torch.isfinite(dgamma).all(), "NaN/Inf found in dgamma"
        assert torch.isfinite(dbeta).all(), "NaN/Inf found in dbeta"

        print(f"[EdgeCase] B={B} D={D} -> OK")

# If you already loaded Day3 extension as `ext`, you can run:
if RUN_EDGE_TESTS:
    run_edge_suite(ext)
else:
    print("Step3: Edge suite is ready. Set RUN_EDGE_TESTS=True after implementing kernels.")



Step3: Edge suite is ready. Set RUN_EDGE_TESTS=True after implementing kernels.
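
The edge cases above mix values of `D` that are smaller than a warp, not a multiple of 32, and much larger than the block size. The following sketch mirrors the block-size rule from the backward launcher (128 threads for `D <= 128`, otherwise 256) and simulates a block-stride loop on the CPU to confirm that every column is visited exactly once. The stride loop is an assumption about how the kernel iterates over columns (a common pattern, not confirmed by the code shown here); `pick_threads` and `stride_loop_coverage` are hypothetical helper names.

```python
def round_up_warp(n, warp=32):
    """Round n up to the next multiple of the warp size (same bit trick as the launcher)."""
    return (n + warp - 1) & ~(warp - 1)

def pick_threads(D):
    """Mirror of the launcher's rule: 128 threads for D <= 128, else 256."""
    return round_up_warp(128 if D <= 128 else 256)

def stride_loop_coverage(D, threads):
    """Count visits per column when thread t handles columns t, t + threads, ..."""
    visits = [0] * D
    for tid in range(threads):
        for col in range(tid, D, threads):
            visits[col] += 1
    return visits

CASES = [(1, 7), (2, 33), (4, 128), (16, 1024), (3, 4096)]
for B, D in CASES:
    threads = pick_threads(D)
    visits = stride_loop_coverage(D, threads)
    idle = max(0, threads - D)  # threads that never enter the loop body
    covered = all(v == 1 for v in visits)
    print(f"D={D:5d} threads={threads:3d} idle_threads={idle:3d} all_covered_once={covered}")
```

For tiny `D` most threads in the block are idle, which is one reason small shapes are worth keeping in the edge suite: they exercise the reduction code with many threads contributing nothing.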


## ✅ Step 4 - Fused GELU + Bias CUDA kernel (extension skeleton)

In [10]:
# Step 4 (ONE CELL): fused Bias+GELU extension skeleton (NO SOLUTION)

import os, torch
from torch.utils.cpp_extension import load

os.makedirs("fused_gelu", exist_ok=True)

open("fused_gelu/ext.cpp", "w").write(r"""
#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be CUDA")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_F16F32(x) TORCH_CHECK((x).scalar_type()==at::kHalf || (x).scalar_type()==at::kFloat, #x " must be fp16 or fp32")

void fused_gelu_bias_cuda_launcher(torch::Tensor x, torch::Tensor bias, torch::Tensor y);

torch::Tensor fused_gelu_bias(torch::Tensor x, torch::Tensor bias) {
    // Validate inputs (device, layout, dtype, shape), allocate y, then call the launcher.
    CHECK_CUDA(x); CHECK_CONTIG(x); CHECK_F16F32(x);
    CHECK_CUDA(bias); CHECK_CONTIG(bias); CHECK_F16F32(bias);

    TORCH_CHECK(x.scalar_type() == bias.scalar_type(),
                "x and bias must have the same dtype, got x=", x.scalar_type(),
                " bias=", bias.scalar_type());

    TORCH_CHECK(x.device() == bias.device(),
                "x and bias must be on the same CUDA device");


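    // Check ranks before indexing sizes so bad inputs get a clear error message.
    TORCH_CHECK(x.dim() == 2, "x must be 2D [B,D], got ", x.dim(), " dims");
    TORCH_CHECK(bias.dim() == 1, "bias must be 1D [D], got ", bias.dim(), " dims");
    const auto B = x.size(0);
    const auto D = x.size(1);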
    TORCH_CHECK(B > 0 && D > 0, "x must have non-zero shape [B,D], got [", B, ",", D, "]");
    TORCH_CHECK(bias.size(0) == D,
                "bias must have shape [D] with D=x.size(1). Got bias.size(0)=",
                bias.size(0), " D=", D);


    auto y = torch::empty_like(x);
    fused_gelu_bias_cuda_launcher(x, bias, y);
    return y;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &fused_gelu_bias, "Fused Bias+GELU forward (CUDA, skeleton)");
}
""")

open("fused_gelu/ext_cuda.cu", "w").write(r"""
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be CUDA")
#define CHECK_CONTIG(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")

// The kernels below compute y = GELU(x + bias) using the tanh approximation.

// Wrap a CUDA runtime call and surface failures as PyTorch errors.
#define CUDA_CHECK(call) do {                                  \
  cudaError_t err = (call);                                    \
  TORCH_CHECK(err == cudaSuccess, "CUDA error: ",              \
              cudaGetErrorString(err));                        \
} while(0)


// GELU (tanh approximation): common in fused kernels
__device__ __forceinline__ float gelu_tanh(float x) {
    // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    const float sqrt_2_over_pi = 0.7978845608028654f;
    const float x_factor = 0.044715f;
    float x3 = x * x * x;
    float t = sqrt_2_over_pi * (x + x_factor * x3);
    return 0.5f * x * (1.0f + tanhf(t));
}

__global__ void fused_bias_gelu_kernel_f32(
        const float* __restrict__ x,
        const float* __restrict__ bias,
        float* __restrict__ y,
        int B, int D){
        int idx = (int)blockIdx.x * (int)blockDim.x + (int)threadIdx.x;
        int n = B * D;
        if (idx < n){
            int d = idx % D; // column index, selects the bias element
            float t = x[idx] + bias[d];
            y[idx] = gelu_tanh(t);
            }
        }

__global__ void fused_bias_gelu_kernel_f16(
        const __half* __restrict__ x,
        const __half* __restrict__ bias,
        __half* __restrict__ y,
        int B, int D){
        int idx = (int)blockIdx.x * (int)blockDim.x + (int)threadIdx.x;
        int n = B * D;
        if (idx < n){
            int d = idx % D; // column index, selects the bias element
            float t = __half2float(x[idx]) + __half2float(bias[d]);
            float out = gelu_tanh(t);
            y[idx] = __float2half_rn(out);
            }
        }

void fused_gelu_bias_cuda_launcher(torch::Tensor x, torch::Tensor bias, torch::Tensor y) {
    CHECK_CUDA(x); CHECK_CONTIG(x);
    CHECK_CUDA(bias); CHECK_CONTIG(bias);
    CHECK_CUDA(y); CHECK_CONTIG(y);


    TORCH_CHECK(x.dim() == 2, "x must be 2D [B,D]");
    TORCH_CHECK(bias.dim() == 1, "bias must be 1D [D]");
    TORCH_CHECK(y.dim() == 2, "y must be 2D [B,D]");
    TORCH_CHECK(x.scalar_type() == bias.scalar_type(), "x and bias must have same dtype");
    TORCH_CHECK(x.scalar_type() == y.scalar_type(), "x and y must have same dtype");

    int B = (int)x.size(0);
    int D = (int)x.size(1);
    TORCH_CHECK((int)bias.size(0) == D, "bias.size(0) must equal D");

    // 1D launch: one thread per element, 256 threads per block.
    constexpr int threads = 256;
    // The kernels index with 32-bit ints, so reject inputs that would overflow them.
    TORCH_CHECK(x.numel() <= (int64_t(1) << 31) - 1 - threads,
                "x has too many elements for 32-bit indexing");
    int n = B * D;
    int blocks = (n + threads - 1) / threads;

    // Dispatch by dtype (fp32 / fp16).


    if (x.scalar_type() == at::kFloat) {
        fused_bias_gelu_kernel_f32<<<blocks, threads>>>(
            x.data_ptr<float>(),
            bias.data_ptr<float>(),
            y.data_ptr<float>(),
            B, D
        );
    } else if (x.scalar_type() == at::kHalf) {
        fused_bias_gelu_kernel_f16<<<blocks, threads>>>(
            (const __half*)x.data_ptr<at::Half>(),
            (const __half*)bias.data_ptr<at::Half>(),
            (__half*)y.data_ptr<at::Half>(),
            B, D
        );
    } else {
        TORCH_CHECK(false, "Unsupported dtype: ", x.scalar_type());
    }
    CUDA_CHECK(cudaGetLastError());
}
""")

fused = load(
    name="fused_gelu_bias_ext",
    sources=["fused_gelu/ext.cpp", "fused_gelu/ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "-lineinfo"],
    with_cuda=True,
    verbose=False
)

print("Step4 extension loaded:", fused)

RUN_TEST = True
if RUN_TEST:
    # Compare against torch.nn.functional.gelu(x + bias)
    import torch.nn.functional as F

    def test_case(B, D, dtype):
        x = torch.randn(B, D, device="cuda", dtype=dtype)
        bias = torch.randn(D, device="cuda", dtype=dtype)

        y = fused.forward(x, bias)
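        # Reference: F.gelu defaults to the exact erf form, so the error below includes
        # the tanh-approximation gap; F.gelu(..., approximate="tanh") is the like-for-like reference.
        y_ref = F.gelu(x + bias)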

        # sanity
        assert y.shape == x.shape and y.dtype == x.dtype and y.device == x.device
        assert torch.isfinite(y).all(), "NaN/Inf in fused output"

        diff = (y - y_ref).abs()
        print(f"[test] dtype={dtype} B={B} D={D} max_err={diff.max().item():.6e} mean_err={diff.mean().item():.6e}")

    test_case(256, 1024, torch.float32)
    test_case(256, 1024, torch.float16)


Step4 extension loaded: <module 'fused_gelu_bias_ext_v2' from '/root/.cache/torch_extensions/py312_cu126/fused_gelu_bias_ext/fused_gelu_bias_ext_v2.so'>
[test] dtype=torch.float32 B=256 D=1024 max_err=4.734993e-04 mean_err=1.230978e-04
[test] dtype=torch.float16 B=256 D=1024 max_err=3.906250e-03 mean_err=1.772642e-04
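
The fp32 `max_err` above is much larger than fp32 rounding error, which is expected when a tanh-approximation kernel is compared against the exact erf-based GELU. A minimal CPU sketch, using only the Python standard library and the same constants as `gelu_tanh` in the kernel, estimates the size of that approximation gap on a dense grid. The grid range `[-8, 8]` and step are arbitrary choices made for illustration.

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation with the same constants as the CUDA gelu_tanh."""
    k = math.sqrt(2.0 / math.pi)  # matches 0.7978845608028654f in the kernel
    return 0.5 * x * (1.0 + math.tanh(k * (x + 0.044715 * x ** 3)))

# Dense grid over the range that randn inputs plus randn bias realistically cover.
xs = [i / 1000.0 for i in range(-8000, 8001)]
gap, at = max((abs(gelu_tanh(v) - gelu_exact(v)), v) for v in xs)
print(f"max |tanh - erf| on [-8, 8] = {gap:.3e} at x = {at:+.3f}")

# Spacing between adjacent fp16 values in [4, 8): exponent 2, 10 mantissa bits.
fp16_spacing_4_to_8 = 2.0 ** (2 - 10)
print(f"fp16 spacing in [4, 8) = {fp16_spacing_4_to_8:.6e}")
```

The fp16 `max_err` of `3.906250e-03` equals `2**-8`, the spacing between adjacent fp16 values in `[4, 8)`, so it is consistent with the two results differing by a single rounding step on a large output rather than with a kernel bug.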


## ✅ Step 5 - Weekly project packaging (full LN extension) + scripts (skeleton)

In [None]:
# Step 5 (ONE CELL): project packaging skeleton (NO SOLUTION)
# Creates placeholders for README, tests, benchmark, and Nsight Compute script.

import os

os.makedirs("project_ln", exist_ok=True)

open("project_ln/README.md", "w").write(r"""
# PyTorch LayerNorm C++/CUDA Extension (Skeleton)

## What you should have by end of Week
- LN forward (CUDA)
- LN backward (CUDA)
- Python API: forward/backward or autograd wrapper
- Correctness tests vs PyTorch
- Benchmarks vs torch.nn.LayerNorm
- Nsight Compute profiling commands

## TODO
- Document build steps (Colab and local)
- Add usage examples
- Add performance notes and profiling screenshots
""")

open("project_ln/test_ln.py", "w").write(r"""
import torch

def test_forward(ext):
    # TODO: compare ext.forward vs torch layer_norm
    pass

def test_backward(ext):
    # TODO: compare gradients vs autograd
    pass

if __name__ == "__main__":
    # TODO: import your built extension module and run tests
    pass
""")

open("project_ln/bench_ln.py", "w").write(r"""
import torch

@torch.no_grad()
def bench_fn(fn, iters=200, warmup=50):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    s = torch.cuda.Event(enable_timing=True)
    e = torch.cuda.Event(enable_timing=True)
    s.record()
    for _ in range(iters):
        fn()
    e.record()
    torch.cuda.synchronize()
    return s.elapsed_time(e) / iters

def main():
    # TODO: load ext
    # TODO: build benchmark cases
    pass

if __name__ == "__main__":
    main()
""")

open("project_ln/run_ncu.sh", "w").write(r"""#!/usr/bin/env bash
set -euo pipefail

# TODO:
# - Build your extension (if building outside Colab JIT)
# - Run Nsight Compute on forward/backward kernels

# Example:
# ncu --set full --kernel-name "ln_forward_kernel" -o ncu_ln_fwd python project_ln/bench_ln.py
# ncu --set full --kernel-name "ln_backward_kernel" -o ncu_ln_bwd python project_ln/bench_ln.py

echo "Edit this script with your kernel names and driver script."
""")

os.system("chmod +x project_ln/run_ncu.sh")
print("Step5 packaging skeleton created under ./project_ln/")
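
When filling in `bench_ln.py`, raw milliseconds are hard to compare across shapes. One common approach is to convert the `bench_fn` result into effective memory bandwidth. The sketch below assumes the forward pass reads `x`, `gamma`, and `beta` once and writes `y`, `mean`, and `inv_std` once; real kernels may move more data than this lower bound. The helper names and the example timing are hypothetical, not measurements.

```python
def ln_forward_bytes(B, D, itemsize=4):
    """Lower bound on global-memory traffic for one LN forward pass, in bytes.

    Counts x read + y written (2*B*D), gamma + beta read (2*D),
    and per-row mean + inv_std written (2*B).
    """
    return itemsize * (2 * B * D + 2 * D + 2 * B)

def effective_gbps(ms_per_iter, B, D, itemsize=4):
    """Convert milliseconds per call (as returned by bench_fn) to GB/s, with 1 GB = 1e9 bytes."""
    seconds = ms_per_iter * 1e-3
    return ln_forward_bytes(B, D, itemsize) / seconds / 1e9

# Hypothetical timing for illustration only, not a measured result.
B, D, ms = 4096, 1024, 0.1
print(f"B={B} D={D} bytes={ln_forward_bytes(B, D)} "
      f"effective_bandwidth={effective_gbps(ms, B, D):.2f} GB/s")
```

Comparing this number for the extension and for `torch.nn.LayerNorm` on the same shape, and against the GPU's peak memory bandwidth, gives a quick sense of how far a memory-bound kernel is from the hardware limit.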

