# 📘 Task Explanation: PyTorch C++ Extensions (TensorAccessor & ATen API)

## 🎯 Objective
The objective of this task is to understand the **foundations of PyTorch C++ extensions** and learn how to write **custom C++/CUDA operators** that integrate seamlessly with PyTorch.

This task focuses on:
- How PyTorch exposes tensors to C++/CUDA
- How to safely and efficiently access tensor data
- How custom operators interact with PyTorch's autograd system

---

## 🧠 Background: Why PyTorch C++ Extensions?
While Python is ideal for model definition and experimentation, **performance-critical operators** (e.g., LayerNorm, Softmax, fused kernels) are often implemented in **C++/CUDA**.

PyTorch C++ extensions allow you to:
- Write custom high-performance kernels
- Call them directly from Python
- Register forward and backward functions
- Participate fully in PyTorch's autograd system

---

## 🧩 Part A: PyTorch C++ Extension Basics

### Task
Learn the basic structure of a PyTorch C++ extension and how it is built and loaded.

You should understand:
- How to create a C++ extension using `torch.utils.cpp_extension`
- How Python code loads compiled shared libraries
- The role of `PYBIND11_MODULE` in binding C++ functions to Python
- The difference between:
  - Pure C++ extensions
  - C++ + CUDA extensions

### Key Concepts
- `setup.py` or `load()` workflow
- CMake / NVCC integration
- ABI compatibility with PyTorch
- CPU vs CUDA dispatch

---

## 🧩 Part B: TensorAccessor

### What Is TensorAccessor?
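`TensorAccessor` is a lightweight, non-owning view that pairs a tensor's data pointer with its sizes and strides, giving **type-safe, multi-dimensional indexing** into tensor data. The dtype and the number of dimensions are fixed at compile time. In host (CPU) code you obtain one with `tensor.accessor<T, N>()`; CUDA kernels take the packed variant, `PackedTensorAccessor32` or `PackedTensorAccessor64`, which stores sizes and strides by value so it can be passed as a kernel argument.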

It allows you to:
- Index tensors using `tensor[i][j]` syntax
- Avoid manual pointer arithmetic
- Improve code readability and safety

### Task
Learn how to:
- Convert a PyTorch tensor to a `TensorAccessor`
- Use `TensorAccessor` inside CUDA kernels
- Understand layout assumptions (contiguous, strides)

### Key Considerations
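- Accessors index through the tensor's strides, so they also work on non-contiguous tensors; contiguity is only required when you do raw `data_ptr` arithmetic or, as in this task, choose to validate it up front
- Access patterns affect memory coalescing: map `threadIdx.x` to the innermost (stride-1) dimension
- Accessors do not bounds-check indices on the device, so kernels must guard with explicit checks such as `if (b < B && d < D)`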
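
As a rough model of what happens when a kernel writes `x[b][d]`, the sketch below reproduces the stride arithmetic in plain Python. `Accessor2D` is a hypothetical illustration class, not PyTorch's implementation; it only shows that indexing goes through strides and performs no bounds check.

```python
# Hypothetical pure-Python model of accessor indexing (not PyTorch code).
class Accessor2D:
    """Non-owning 2D view of a flat buffer, described by sizes and strides."""

    def __init__(self, data, sizes, strides):
        self.data = data        # flat storage, shared between views
        self.sizes = sizes      # (rows, cols)
        self.strides = strides  # step in elements per unit of each index

    def offset(self, b, d):
        # Same arithmetic an accessor applies for x[b][d]; no bounds check.
        return b * self.strides[0] + d * self.strides[1]

    def get(self, b, d):
        return self.data[self.offset(b, d)]


# A 2x3 row-major tensor stored flat: [[0, 1, 2], [3, 4, 5]].
buf = [0, 1, 2, 3, 4, 5]
x = Accessor2D(buf, sizes=(2, 3), strides=(3, 1))

# Its transpose is a 3x2 view of the same buffer with swapped strides.
xt = Accessor2D(buf, sizes=(3, 2), strides=(1, 3))

print(x.get(1, 2))   # row 1, column 2 of the original
print(xt.get(2, 1))  # the same element, reached through the transposed view
```

Both lookups resolve to the same buffer offset. That is why stride-aware indexing does not need a contiguous layout, while raw pointer arithmetic that assumes `b * D + d` does.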

---

## 🧩 Part C: ATen API

### What Is ATen?
**ATen** is PyTorch's core C++ tensor library.  
It provides:
- Tensor creation and manipulation
- Device and dtype abstraction
- Dispatch to CPU or CUDA implementations

### Task
Learn how to:
- Use `at::Tensor` in C++ code
- Access tensor metadata (shape, dtype, device)
- Launch CUDA kernels using ATen utilities (for example, `at::cuda::getCurrentCUDAStream()` to get the stream PyTorch is currently using)
- Write device-agnostic code where possible

### Common ATen Operations
- `tensor.data_ptr<T>()`
- `tensor.size(dim)`
- `tensor.stride(dim)`
- `at::zeros_like`, `at::empty`
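
To make `size(dim)` and `stride(dim)` concrete, here is a small Python sketch of the row-major strides a freshly allocated tensor reports, together with a simplified contiguity test. The helper names are hypothetical, and the check is a simplification: the real `is_contiguous()` also tolerates arbitrary strides on dimensions of size 1.

```python
def contiguous_strides(sizes):
    """Row-major strides (in elements) for a tensor with the given sizes."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)  # innermost dimension gets stride 1
        running *= size
    return tuple(reversed(strides))


def is_contiguous(sizes, strides):
    """Simplified check: strides equal the row-major strides for these sizes."""
    return tuple(strides) == contiguous_strides(sizes)


sizes = (256, 1024)
strides = contiguous_strides(sizes)
print(strides)                                # (1024, 1)
print(is_contiguous(sizes, strides))          # True
print(is_contiguous((1024, 256), (1, 1024)))  # transposed view: False
```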

---

## 🔍 Key Questions to Answer
- How does PyTorch pass tensors from Python to C++?
- What are the differences between `data_ptr` and `TensorAccessor`?
- When should you prefer ATen APIs over raw CUDA code?
- How does PyTorch ensure correct device and dtype dispatch?

---

## 🧪 Deliverables
You should produce:
1. A minimal PyTorch C++ extension that can be imported in Python
2. A C++ function that:
   - Accepts `at::Tensor` inputs
   - Accesses tensor data using `TensorAccessor`
3. A basic CUDA kernel launched via ATen
4. A short write-up explaining:
   - Data flow from Python → C++ → CUDA
   - Why this approach is used in real ML systems

---

## 🎓 What You Learn from This Task
By completing this task, you will understand:
- How PyTorch integrates Python, C++, and CUDA
- How tensors are represented internally
- How high-performance ML operators are built
- How to extend PyTorch beyond Python

---

## 🚀 Relevance to ML Systems
PyTorch C++ extensions are used in:
- Custom fused operators
- FlashAttention and fused LayerNorm kernels
- High-performance training and inference backends

Mastering TensorAccessor and ATen is a **key step toward ML Systems and CUDA kernel engineering roles**.

---

## 🧠 Key Takeaway
> **PyTorch C++ extensions bridge the gap between Python productivity and C++/CUDA performance, enabling production-grade ML operators.**


In [1]:
!nvcc --version
!nvidia-smi

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Sun Jan 25 12:24:44 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                       

In [None]:
!apt-get update
!apt-get install -y cuda-toolkit-12-4

# 🧩 Task: PyTorch C++/CUDA Extension on Colab

## 🎯 Goal
In this task, you will build a **minimal PyTorch C++/CUDA extension** directly in **Google Colab** that:

- Compiles C++ + CUDA code using PyTorch utilities
- Uses **ATen API** in C++
- Uses **TensorAccessor** inside a CUDA kernel
- Is callable from Python

⚠️ This is a **skeleton only**. You must fill in all TODO sections.

---

## 📌 Environment Assumptions
- Google Colab with **GPU enabled**
- CUDA already available via PyTorch
- No local files required (everything written via `%%writefile`)

---

## 🧱 Step 1: Create C++ Interface (ext.h)


In [3]:
%%writefile ext.h
#pragma once
#include <torch/extension.h>

// C++ forward declaration
torch::Tensor my_op_forward(torch::Tensor input);

// CUDA launcher declaration
void my_op_cuda_launcher(torch::Tensor input, torch::Tensor output);


Writing ext.h


## 🧱 Step 2: C++ Wrapper with ATen (ext.cpp)


In [21]:
%%writefile ext.cpp
#include <torch/extension.h>
#include "ext.h"

// ------------------------------------------------------------
// TODO: ATen wrapper
// Requirements:
//  - Validate input device (CUDA), dtype (float32), contiguous, 2D
//  - Allocate output tensor with same shape/device/dtype
//  - Call CUDA launcher
//  - Return output
// ------------------------------------------------------------
torch::Tensor my_op_forward(torch::Tensor input) {
    // Device check
    TORCH_CHECK(input.is_cuda(), "my_op_forward: input must be a CUDA tensor");

    // Dtype check (float32)
    TORCH_CHECK(input.scalar_type() == at::kFloat,
                "my_op_forward: input must be float32 (torch.float32)");

    // Layout / contiguity check
    TORCH_CHECK(input.is_contiguous(),
                "my_op_forward: input must be contiguous (call .contiguous())");

    // Shape check: 2D
    TORCH_CHECK(input.dim() == 2,
                "my_op_forward: input must be 2D, got dim=", input.dim());

    // Allocate output (same shape/device/dtype)
    auto output = torch::empty_like(input);

    // Launch CUDA kernel (implemented in .cu)
    my_op_cuda_launcher(input, output);

    return output;
}

// ------------------------------------------------------------
// PyBind
// ------------------------------------------------------------
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &my_op_forward, "MyOp forward (CUDA)");
}

Overwriting ext.cpp


## 🧱 Step 3: Write CUDA Kernel + TensorAccessor (ext_cuda.cu)

In [22]:
%%writefile ext_cuda.cu
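#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>     // fprintf, used by CUDA_CHECK
#include <stdexcept>  // std::runtime_error, used by CUDA_CHECK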

#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_FLOAT(x) TORCH_CHECK((x).scalar_type() == at::ScalarType::Float, #x " must be float32")
#define CHECK_2D(x) TORCH_CHECK((x).dim() == 2, #x " must be 2D")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x); CHECK_FLOAT(x); CHECK_2D(x)

// Macro for checking CUDA errors
#define CUDA_CHECK(call)                                                          \
  do {                                                                            \
    cudaError_t cudaStatus = call;                                                \
    if (cudaStatus != cudaSuccess) {                                              \
      fprintf(stderr, "CUDA Error: %s at %s:%d\n", cudaGetErrorString(cudaStatus), \
              __FILE__, __LINE__);                                                \
      throw std::runtime_error(cudaGetErrorString(cudaStatus));                   \
    }                                                                             \
  } while (0)

// ------------------------------------------------------------
// TODO: CUDA kernel using TensorAccessor
// Input/Output shape: [B, D]
// Task: elementwise transform y[b,d] = f(x[b,d]) (you choose f)
// Requirements:
//  - Use TensorAccessor (PackedTensorAccessor32)
//  - Correct indexing (b,d)
//  - Bounds checks
// ------------------------------------------------------------
__global__ void my_kernel(
    torch::PackedTensorAccessor32<float, 2, torch::RestrictPtrTraits> x,
    torch::PackedTensorAccessor32<float, 2, torch::RestrictPtrTraits> y,
    int B, int D
) {
    // threadIdx.x -> d (innermost, stride-1 dimension) so neighbouring threads
    // touch neighbouring addresses; threadIdx.y -> b. The bounds check below
    // guards the partial tiles at the edges of the grid.
    const int d = static_cast<int>(blockIdx.x) * static_cast<int>(blockDim.x) +
                  static_cast<int>(threadIdx.x);
    const int b = static_cast<int>(blockIdx.y) * static_cast<int>(blockDim.y) +
                  static_cast<int>(threadIdx.y);

    if (b < B && d < D) {
        const float v = x[b][d];
        y[b][d] = v * v;  // f(v)
    }
}

// ------------------------------------------------------------
// TODO: CUDA launcher
// Requirements:
//  - Validate tensors
//  - Choose block/grid
//  - Create accessors from tensors
//  - Launch kernel
// ------------------------------------------------------------
void my_op_cuda_launcher(torch::Tensor input, torch::Tensor output) {
    CHECK_INPUT(input);
    CHECK_INPUT(output);

    // Input and output must both be [B, D] with matching shapes
    TORCH_CHECK(input.dim() == 2, "my_op_cuda_launcher: input must be 2D");
    TORCH_CHECK(output.dim() == 2, "my_op_cuda_launcher: output must be 2D");
    TORCH_CHECK(input.size(0) == output.size(0) && input.size(1) == output.size(1),
                "my_op_cuda_launcher: input/output shapes must match");

    const int B = static_cast<int>(input.size(0));
    const int D = static_cast<int>(input.size(1));


    // 2D tile for (b, d):
    //  - x dimension covers D
    //  - y dimension covers B
    // The grid is rounded up so partial tiles at the edges are still covered.
    constexpr int TX = 32;  // columns
    constexpr int TY = 8;   // rows
    dim3 block(TX, TY);
    dim3 grid((D + TX - 1) / TX, (B + TY - 1) / TY);

    // Create accessors (PackedTensorAccessor32)
    auto x_acc = input.packed_accessor32<float, 2, torch::RestrictPtrTraits>();
    auto y_acc = output.packed_accessor32<float, 2, torch::RestrictPtrTraits>();

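    // Launch on the default stream.
    // Note: code that must interoperate with torch.cuda.stream(...) usually passes
    // at::cuda::getCurrentCUDAStream() (declared in <ATen/cuda/CUDAContext.h>)
    // as the fourth launch parameter instead.
    my_kernel<<<grid, block>>>(x_acc, y_acc, B, D);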

    // Optional but strongly recommended for catching launch errors early
    CUDA_CHECK(cudaGetLastError());
}

Overwriting ext_cuda.cu
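
The launch geometry above can be sanity-checked without a GPU. The sketch below is a plain-Python simulation, not part of the extension: it replays the kernel's index arithmetic for every block and thread and counts how often each `(b, d)` is visited. `launch_coverage` is a hypothetical helper name. The tile sizes match `TX = 32`, `TY = 8` from the launcher.

```python
def ceil_div(n, tile):
    """Number of tiles of width `tile` needed to cover n elements."""
    return (n + tile - 1) // tile


def launch_coverage(B, D, TX=32, TY=8):
    """Replay the kernel's index math and count visits per (b, d)."""
    grid_x, grid_y = ceil_div(D, TX), ceil_div(B, TY)
    hits = {}
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(TY):
                for tx in range(TX):
                    d = bx * TX + tx   # blockIdx.x * blockDim.x + threadIdx.x
                    b = by * TY + ty   # blockIdx.y * blockDim.y + threadIdx.y
                    if b < B and d < D:  # same guard as the kernel
                        hits[(b, d)] = hits.get((b, d), 0) + 1
    return (grid_x, grid_y), hits


# Deliberately awkward sizes that are not multiples of the tile.
grid, hits = launch_coverage(B=10, D=70)
print(grid)                            # (3, 2)
print(len(hits) == 10 * 70)            # every element is covered
print(set(hits.values()) == {1})       # and covered exactly once
```

Without the `b < B && d < D` guard, the threads in the partial edge tiles would index past the tensor, which an accessor will not catch on the device.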


## Step 4: Build Extension (JIT Compile)

In [18]:
!pip install ninja

Collecting ninja
  Downloading ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (5.1 kB)
Downloading ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (180 kB)
Installing collected packages: ninja
Successfully installed ninja-1.13.0


In [23]:
import torch
from torch.utils.cpp_extension import load
# TIP: verbose=True prints the full compile log, which helps when the build fails
ext = load(
    name="tensor_accessor_ext",
    sources=["ext.cpp", "ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
    with_cuda=True,
    verbose=True
)

print("Extension loaded:", ext)


Extension loaded: <module 'tensor_accessor_ext_v1' from '/root/.cache/torch_extensions/py312_cu126/tensor_accessor_ext/tensor_accessor_ext_v1.so'>


## Step 5: CPU Reference + Test Harness (Correctness)

In [24]:
import torch

# ------------------------------------------------------------
# TODO: CPU reference (must match your CUDA kernel's f(x))
# Requirements:
#  - input: x [B, D] on CPU
#  - output: y [B, D] on CPU
# ------------------------------------------------------------
def my_op_cpu_reference(x: torch.Tensor) -> torch.Tensor:
    # Same math as the CUDA kernel: f(x) = x * x
    assert x.device.type == "cpu", "CPU reference expects a CPU tensor"
    assert x.dtype == torch.float32, "CPU reference expects float32"
    assert x.dim() == 2, "CPU reference expects 2D [B, D]"
    return x * x  # f(x)

# ------------------------------------------------------------
# TODO: correctness check
# Requirements:
#  - create test tensor on CUDA
#  - run ext.forward
#  - compare with CPU reference (move tensors appropriately)
#  - print max error
# ------------------------------------------------------------
def test_correctness(B=256, D=1024, atol=1e-5, rtol=1e-4):
    assert torch.cuda.is_available(), "CUDA is not available"

    # Create input on CUDA
    x_cuda = torch.randn(B, D, device="cuda", dtype=torch.float32)

    # Run extension forward (CUDA)
    y_cuda = ext.forward(x_cuda)

    # CPU reference: move input to CPU, compute, then compare on CPU
    x_cpu = x_cuda.detach().cpu()
    y_ref_cpu = my_op_cpu_reference(x_cpu)

    y_cuda_cpu = y_cuda.detach().cpu()
    diff = (y_cuda_cpu - y_ref_cpu).abs()
    max_err = diff.max().item()

    # Relative error (avoid div by 0)
    denom = y_ref_cpu.abs().clamp_min(1e-12)
    rel = (diff / denom).max().item()

    ok = torch.allclose(y_cuda_cpu, y_ref_cpu, atol=atol, rtol=rtol)

    print(f"[correctness] B={B}, D={D}, atol={atol}, rtol={rtol}")
    print(f"  max_abs_err = {max_err:.6e}")
    print(f"  max_rel_err = {rel:.6e}")
    print(f"  allclose    = {ok}")

    if not ok:
        # helpful extra info
        idx = diff.argmax().item()
        b = idx // D
        d = idx % D
        print(f"  worst at (b={b}, d={d}): y_cuda={y_cuda_cpu[b,d].item():.6e}, "
              f"y_ref={y_ref_cpu[b,d].item():.6e}, diff={diff[b,d].item():.6e}")

test_correctness()


[correctness] B=256, D=1024, atol=1e-05, rtol=0.0001
  max_abs_err = 0.000000e+00
  max_rel_err = 0.000000e+00
  allclose    = True
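
For reference, `torch.allclose` accepts a pair of values when `|actual - expected| <= atol + rtol * |expected|`. The scalar sketch below applies that rule with the tolerances used in `test_correctness`. `is_close` is a hypothetical helper that ignores NaN and infinity handling.

```python
def is_close(actual, expected, atol=1e-5, rtol=1e-4):
    """Scalar version of the allclose criterion (no NaN/inf handling)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Around 1.0 the allowed error is atol + rtol * 1.0 = 1.1e-4.
print(is_close(1.00005, 1.0))  # error 5e-5 is within tolerance: True
print(is_close(1.0002, 1.0))   # error 2e-4 exceeds tolerance: False

# Around 0.0 only atol applies, so the check is purely absolute.
print(is_close(5e-6, 0.0))     # True
print(is_close(2e-5, 0.0))     # False
```

This is why the harness reports both absolute and relative error: near zero the relative error can be large even when `allclose` passes.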


## Step 6: Benchmark (CUDA Events)

In [25]:
import torch

# ------------------------------------------------------------
# TODO: benchmark
# Requirements:
#  - time ext.forward(x) with CUDA events
#  - include warmup
#  - print average ms
# ------------------------------------------------------------
def benchmark(B=4096, D=1024, iters=200, warmup=20):
    assert torch.cuda.is_available(), "CUDA is not available"

    x = torch.randn(B, D, device="cuda", dtype=torch.float32)

    # Warmup (important for CUDA context, caching, JIT, etc.)
    for _ in range(warmup):
        y = ext.forward(x)
    torch.cuda.synchronize()

    # CUDA events timing
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(iters):
        y = ext.forward(x)
    end.record()

    torch.cuda.synchronize()
    total_ms = start.elapsed_time(end)
    avg_ms = total_ms / iters

    print(f"[benchmark] B={B}, D={D}, iters={iters}, warmup={warmup}")
    print(f"  total   = {total_ms:.3f} ms")
    print(f"  avg/iter= {avg_ms:.6f} ms")

benchmark()


[benchmark] B=4096, D=1024, iters=200, warmup=20
  total   = 31.330 ms
  avg/iter= 0.156652 ms
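
An elementwise kernel like this one is limited by memory traffic, so it helps to turn the average time into an effective bandwidth. The sketch below assumes each element is read once and written once as float32. `effective_bandwidth_gbps` is a hypothetical helper. The measured time also includes launch overhead and the `empty_like` allocation, so the result is a lower bound on what the kernel itself achieves.

```python
def effective_bandwidth_gbps(B, D, avg_ms, bytes_per_elem=4):
    """Effective bandwidth in GB/s, assuming one read and one write per element."""
    bytes_moved = 2 * B * D * bytes_per_elem
    seconds = avg_ms * 1e-3
    return bytes_moved / seconds / 1e9


# Figures taken from the benchmark output above.
gbps = effective_bandwidth_gbps(B=4096, D=1024, avg_ms=0.156652)
print(f"{gbps:.1f} GB/s")
```

Comparing this number with the peak memory bandwidth on the GPU's datasheet gives a quick sense of how much headroom remains.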


## Step 7: Nsight Compute Profiling Script (Colab-Friendly Output)

In [28]:
%%writefile run_ncu_ext.sh
#!/usr/bin/env bash
set -euo pipefail

# ------------------------------------------------------------
# Nsight Compute profiling helper for the PyTorch C++/CUDA extension
#
# Notes:
#  - Build is handled by Python JIT (torch.utils.cpp_extension.load)
#  - This script assumes Nsight Compute (ncu) is available in PATH
#  - You must ensure the Python workload actually launches `my_kernel`
# ------------------------------------------------------------

# -----------------------
# User-adjustable params
# -----------------------
KERNEL_NAME="my_kernel"        # must exactly match the CUDA kernel symbol
OUT="ncu_report"               # output base name (ncu_report.ncu-rep)
DRIVER="driver.py"             # Python script that calls ext.forward(...)
SET="full"                     # or another section set, e.g. basic or detailed (see: ncu --list-sets)

# -----------------------
# Sanity checks
# -----------------------
if ! command -v ncu &>/dev/null; then
  echo "[WARN] Nsight Compute (ncu) not found in PATH."
  echo "       If you're on Colab, ncu is usually NOT available."
  echo "       Run this script on a local machine or cloud VM with Nsight Compute installed."
  exit 0
fi

if [[ ! -f "${DRIVER}" ]]; then
  echo "[ERROR] ${DRIVER} not found."
  echo "Create a driver that imports the extension and calls ext.forward(x)."
  exit 1
fi

# -----------------------
# Run profiling
# -----------------------
echo "[INFO] Profiling kernel: ${KERNEL_NAME}"
echo "[INFO] Driver: ${DRIVER}"
echo "[INFO] Output: ${OUT}.ncu-rep"

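# --kernel-name-base function (the default) matches the bare function name,
# so "my_kernel" matches; "demangled" would require the full signature.
# --launch-skip 0: driver.py launches my_kernel only once, so skipping the
# first launch would leave nothing to profile. Raise it if the driver adds warmup.
ncu \
  --set "${SET}" \
  --kernel-name "${KERNEL_NAME}" \
  --kernel-name-base function \
  --launch-skip 0 \
  --launch-count 1 \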
  -o "${OUT}" \
  python "${DRIVER}"

echo "[DONE] Nsight Compute report generated: ${OUT}.ncu-rep"
echo "Open it with: ncu-ui ${OUT}.ncu-rep"


Overwriting run_ncu_ext.sh


In [29]:
!chmod +x run_ncu_ext.sh
!./run_ncu_ext.sh


[ERROR] driver.py not found.
Create a driver that imports the extension and calls ext.forward(x).


In [2]:
%%writefile driver.py
import torch
from torch.utils.cpp_extension import load

# Load the extension (parameters must match the Step 4 build cell so the cached build is reused)
ext = load(
    name="tensor_accessor_ext",
    sources=["ext.cpp", "ext_cuda.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
    with_cuda=True,
    verbose=True  # prints the compile log; set to False for quieter output
)

# Create a dummy tensor and call the forward pass to trigger the kernel
# Use parameters that ensure the kernel is actually launched
B, D = 256, 1024 # Example dimensions
x = torch.randn(B, D, device="cuda", dtype=torch.float32)

# Call the extension's forward method
y = ext.forward(x)

# Ensure CUDA operations are complete before exiting, especially for profiling
torch.cuda.synchronize()

print("driver.py executed: ext.forward called successfully.")

Overwriting driver.py


Now that `driver.py` has been created, you can run the Nsight Compute profiling script again.

In [3]:
!./run_ncu_ext.sh

/bin/bash: line 1: ./run_ncu_ext.sh: No such file or directory
