# Deep Learning Refresher

In [4]:
%pip install torch

Note: you may need to restart the kernel to use updated packages.


In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

torch.__version__

'2.8.0'

## Neural network types at a glance

- **MLP (Dense/Feed‑Forward):** Good when inputs are tabular features $x\in\mathbb{R}^n$.
- **CNN:** Local connectivity + weight sharing for data on grids (images, audio, videos).
- **RNN/Seq models:** For sequences; often superseded by attention/Transformers for long‑range dependencies.
- **GNN:** Generalizes convolutions to graphs; a CNN can be viewed as a GNN on a regular pixel grid.


## Multi‑Layer Perceptron (MLP)

**Key idea:** Stack affine transforms + nonlinearities to approximate functions.  
The Universal Approximation Theorem (UAT) states that a 1‑hidden‑layer MLP with enough units and a suitable (e.g., non‑polynomial) nonlinearity $\sigma$ can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy:
$$
f(x)\approx \sum_{i=1}^{N} a_i\,\sigma(w_i^\top x + b_i).
$$
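
As a minimal sketch of how the sum above maps onto layers: the hidden `nn.Linear` holds the $w_i$ and $b_i$, and a bias‑free output `nn.Linear` holds the coefficients $a_i$. The sizes `n`, `N`, the choice of `tanh` for $\sigma$, and the names `hidden`, `readout`, `x_demo` are illustrative assumptions, not part of the theorem.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, N = 3, 16                            # input dimension and hidden units (arbitrary demo sizes)

hidden = nn.Linear(n, N)                # row i of .weight is w_i, entry i of .bias is b_i
readout = nn.Linear(N, 1, bias=False)   # .weight holds the coefficients a_i
uat_net = nn.Sequential(hidden, nn.Tanh(), readout)

x_demo = torch.randn(5, n)

# Evaluate sum_i a_i * sigma(w_i^T x + b_i) term by term for each sample
a = readout.weight.squeeze(0)           # shape [N]
manual = torch.stack([
    sum(a[i] * torch.tanh(hidden.weight[i] @ xb + hidden.bias[i]) for i in range(N))
    for xb in x_demo
]).unsqueeze(1)                         # shape [5, 1]

uat_matches = torch.allclose(uat_net(x_demo), manual, atol=1e-5)
print(uat_matches)
```

The theorem only guarantees that suitable weights exist for large enough $N$; it says nothing about whether gradient descent finds them.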

### Example: Tabular regression with an MLP
Below we learn a toy map $x\mapsto y=\sin(x)$ with noise to illustrate capacity and training.


In [6]:
import math, random
torch.manual_seed(0)

# Toy dataset: y = sin(x) + noise
N = 256
x = torch.linspace(-3*math.pi, 3*math.pi, N).unsqueeze(1)
y = torch.sin(x) + 0.1*torch.randn_like(x)

mlp = nn.Sequential(
    nn.Linear(1, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1)
)

opt = optim.Adam(mlp.parameters(), lr=1e-2)
losses = []
for step in range(1500):
    opt.zero_grad()
    pred = mlp(x)
    loss = F.mse_loss(pred, y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        losses.append(loss.item())

# Evaluate without tracking gradients so converting the loss to a Python float raises no warning
with torch.no_grad():
    final_loss = F.mse_loss(mlp(x), y).item()

losses[-5:], final_loss


([0.017481729388237,
  0.008053405210375786,
  0.00952376052737236,
  0.008279295638203621,
  0.01027391105890274],
 0.05733589082956314)

## Convolutional Neural Networks (CNNs)

**Convolution (continuous):** $(f*g)(t)=\int f(\tau)\,g(t-\tau)\,d\tau$  
**Convolution (discrete 2D):**
$$
(\omega * f)(x,y)=\sum_{i=-a}^{a}\sum_{j=-b}^{b}\omega(i,j)\,f(x-i,\,y-j).
$$
Deep‑learning libraries typically implement **cross‑correlation** (no kernel flip); because the weights are learned, this is functionally equivalent to convolution with a flipped kernel.
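
The relationship between the two operations can be checked numerically. The sketch below assumes a single‑channel $5\times5$ input and a random $3\times3$ kernel (both arbitrary), and compares `F.conv2d` on a flipped kernel against the discrete formula evaluated by hand at one pixel.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.randn(1, 1, 5, 5)
kernel = torch.randn(1, 1, 3, 3)        # kernel[0, 0, i + 1, j + 1] plays the role of omega(i, j)

# F.conv2d slides the kernel without flipping it, i.e. cross-correlation
xcorr = F.conv2d(img, kernel, padding=1)
# Flipping both spatial axes first turns the same call into a true convolution
true_conv = F.conv2d(img, torch.flip(kernel, dims=(-2, -1)), padding=1)

# Direct evaluation of sum_{i,j} omega(i, j) * f(x - i, y - j) at the centre pixel
cx, cy = 2, 2
manual = sum(
    kernel[0, 0, i + 1, j + 1] * img[0, 0, cx - i, cy - j]
    for i in range(-1, 2)
    for j in range(-1, 2)
)

print(torch.allclose(true_conv[0, 0, cx, cy], manual, atol=1e-5))
print(torch.allclose(xcorr, true_conv))
```

For a random kernel the two outputs differ, but a network trained with either convention can simply learn the flipped weights.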

**Properties:** weight sharing, sparse connectivity, receptive fields, multi‑channel kernels.


In [7]:
# A tiny ConvNet that shows spatial shape changes via stride/pooling
class TinyConv(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),  # keep H,W
            nn.ReLU(),
            nn.MaxPool2d(2),                            # halves H,W
            nn.Conv2d(8, 16, kernel_size=3, padding=1, stride=2),  # halves H,W again
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1,1)),               # global average pooling
        )
        self.head = nn.Linear(32, 10)

    def forward(self, x):
        x = self.net(x)          # [B, 32, 1, 1]
        x = x.flatten(1)         # [B, 32]
        return self.head(x)      # [B, 10]

x = torch.randn(4, 3, 64, 64)
model = TinyConv()
out = model(x)
x.shape, out.shape

(torch.Size([4, 3, 64, 64]), torch.Size([4, 10]))

### Downsampling: pooling vs. strided convolutions

- **Pooling:** Max/Average over windows to reduce spatial size; adds some tolerance to small translations.
- **Strided conv:** Learnable downsampling by moving the filter with stride $s>1$.
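
Both options follow the same size rule, $\lfloor (H + 2p - k)/s \rfloor + 1$ for dilation 1. The helper `out_size` below is a hypothetical convenience function written for this sketch, checked against the shapes PyTorch actually produces.

```python
import torch
import torch.nn as nn

def out_size(size, kernel, stride=1, padding=0):
    """Spatial output length of a conv/pool layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

probe = torch.randn(1, 1, 64, 64)
pooled = nn.MaxPool2d(2)(probe)                                        # kernel 2, stride defaults to 2
strided = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)(probe)   # kernel 3, stride 2, padding 1

print(pooled.shape[-1], out_size(64, kernel=2, stride=2))
print(strided.shape[-1], out_size(64, kernel=3, stride=2, padding=1))
```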


In [8]:
X = torch.arange(0, 4*4.).view(1,1,4,4)  # simple toy image
pool = nn.MaxPool2d(2)
conv_s2 = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1, bias=False)
with torch.no_grad():
    conv_s2.weight.fill_(1/9)  # average-like

X, pool(X), conv_s2(X)

(tensor([[[[ 0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.],
           [ 8.,  9., 10., 11.],
           [12., 13., 14., 15.]]]]),
 tensor([[[[ 5.,  7.],
           [13., 15.]]]]),
 tensor([[[[ 1.1111,  2.6667],
           [ 5.6667, 10.0000]]]], grad_fn=<ConvolutionBackward0>))

### Batch Normalization

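Normalizes intermediate activations per channel, using the mean $\mu_c$ and variance $\sigma_c^2$ computed over the batch and spatial dimensions, then applies a learned scale $\gamma_c$ and shift $\beta_c$: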
$$
\hat{I}_{b,c,x,y} = \frac{I_{b,c,x,y}-\mu_c}{\sqrt{\sigma_c^2+\varepsilon}},\quad
O_{b,c,x,y} = \gamma_c \hat{I}_{b,c,x,y} + \beta_c.
$$
**Effects:** faster training, more stable gradients, allows larger learning rates.
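
A minimal sketch that reproduces the formula by hand and compares it with `nn.BatchNorm2d` in training mode. The tensor sizes and the names `bn_demo` and `acts` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn_demo = nn.BatchNorm2d(4)                 # training mode by default, so batch statistics are used
acts = torch.randn(8, 4, 5, 5) * 3 + 2      # deliberately not zero-mean / unit-variance

# Per-channel statistics over batch and spatial dimensions
mu = acts.mean(dim=(0, 2, 3), keepdim=True)
var = acts.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # biased variance is used for normalization

gamma = bn_demo.weight.view(1, -1, 1, 1)    # initialised to 1
beta = bn_demo.bias.view(1, -1, 1, 1)       # initialised to 0
manual = gamma * (acts - mu) / torch.sqrt(var + bn_demo.eps) + beta

bn_matches = torch.allclose(bn_demo(acts), manual, atol=1e-5)
print(bn_matches)
```

In `eval()` mode the layer switches to running estimates of $\mu_c$ and $\sigma_c^2$ collected during training, so this equality only holds in training mode.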


In [9]:
bn = nn.BatchNorm2d(8)
T = torch.randn(16, 8, 32, 32)
out = bn(T)
out.mean(dim=(0,2,3)).abs().max().item(), out.std(dim=(0,2,3)).sub(1).abs().max().item()

(1.7462298274040222e-08, 2.562999725341797e-05)

## Residual connections (ResNets)

Skip connections add the input back to the output of a few layers to ease optimization:
$$
\mathrm{y} = F(\mathrm{x}) + \mathrm{x}.
$$
This helps gradient flow and enables **very deep** networks.
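
One way to see the effect on gradients: since $\partial \mathrm{y}/\partial \mathrm{x} = \partial F/\partial \mathrm{x} + I$, the identity term survives even when $F$ contributes nothing. The sketch below uses an extreme, artificial case (a conv branch with all weights set to zero) purely to isolate that term.

```python
import torch
import torch.nn as nn

branch = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # stands in for F
with torch.no_grad():
    branch.weight.zero_()                            # artificial worst case: F passes no signal
    branch.bias.zero_()

inp = torch.randn(1, 4, 8, 8, requires_grad=True)

plain_out = branch(inp)                              # y = F(x)
(grad_plain,) = torch.autograd.grad(plain_out.sum(), inp)

res_out = branch(inp) + inp                          # y = F(x) + x
(grad_res,) = torch.autograd.grad(res_out.sum(), inp)

print(grad_plain.abs().max().item(), grad_res.min().item())
```

Without the skip, no gradient reaches the input; with it, every input element receives a gradient of exactly one through the identity path.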


In [10]:
class BasicBlock(nn.Module):
    def __init__(self, C):
        super().__init__()
        self.conv1 = nn.Conv2d(C, C, 3, padding=1)
        self.bn1   = nn.BatchNorm2d(C)
        self.conv2 = nn.Conv2d(C, C, 3, padding=1)
        self.bn2   = nn.BatchNorm2d(C)

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = F.relu(out + identity)   # residual add
        return out

blk = BasicBlock(16)
y = blk(torch.randn(2,16,32,32))
y.shape

torch.Size([2, 16, 32, 32])

## Increasing spatial size: upsampling, unpooling, transposed conv

- **Nearest/bilinear upsampling:** Non‑learned resize.
- **Max‑unpooling:** Uses saved pooling indices to place values back.
- **Transposed convolution:** Learnable upsampling; can produce checkerboard artifacts, especially when the kernel size is not divisible by the stride.

Below: simple transposed convolution doubling the size.


In [11]:
tconv = nn.ConvTranspose2d(4, 2, kernel_size=3, stride=2, padding=1, output_padding=1)
Z = torch.randn(1,4,8,8)
tconv(Z).shape

torch.Size([1, 2, 16, 16])
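
The doubling from 8 to 16 follows from the transposed‑convolution size rule $(H-1)\,s - 2p + k + p_{\text{out}}$ (dilation 1). The helper `tconv_out_size` is a hypothetical function written for this sketch and checked against the layer's actual output.

```python
import torch
import torch.nn as nn

def tconv_out_size(size, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

up = nn.ConvTranspose2d(4, 2, kernel_size=3, stride=2, padding=1, output_padding=1)
up_shape = up(torch.randn(1, 4, 8, 8)).shape
expected = tconv_out_size(8, kernel=3, stride=2, padding=1, output_padding=1)

print(up_shape[-1], expected)   # (8 - 1) * 2 - 2 + 3 + 1 = 16
```

Without `output_padding=1` the same settings would give 15, which is why that argument is needed for an exact doubling.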

## Bottleneck layers and $1\times 1$ convolutions

A $1\times 1$ convolution can **reduce channels** before an expensive $3\times 3$ op, and a second $1\times 1$ can restore them, cutting parameters while keeping the block's input and output width:
- Direct $3\times 3$ with $C=256$ in/out: $(3\cdot 3 \cdot 256 + 1)\cdot 256 = 590{,}080$ params.
- Bottleneck $256\to 64 \to 256$ using $1\times 1 \to 3\times 3 \to 1\times 1$: $70{,}016$ params, roughly $8\times$ fewer.

This idea appears in **ResNet** and **Inception** families.
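
As a worked check of the bottleneck count, each layer contributes $(k^2 C_{\text{in}} + 1)\,C_{\text{out}}$ parameters when biases are included:
$$
\begin{aligned}
1\times 1,\ 256\to 64: &\quad (256+1)\cdot 64 = 16{,}448,\\
3\times 3,\ 64\to 64: &\quad (3\cdot 3\cdot 64+1)\cdot 64 = 36{,}928,\\
1\times 1,\ 64\to 256: &\quad (64+1)\cdot 256 = 16{,}640,\\
\text{total}: &\quad 16{,}448+36{,}928+16{,}640 = 70{,}016.
\end{aligned}
$$
Most of the saving comes from running the $3\times 3$ layer at 64 channels instead of 256.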


In [12]:
def count_params(m):
    return sum(p.numel() for p in m.parameters())

direct = nn.Conv2d(256, 256, 3, padding=1, bias=True)

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=True),
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1, bias=True),
    nn.ReLU(),
    nn.Conv2d(64, 256, 1, bias=True),
)

count_params(direct), count_params(bottleneck)

(590080, 70016)

## Separable and depthwise‑separable convolutions

A 2D kernel may factorize, e.g. smoothing:
$$
\frac{1}{3}\begin{bmatrix}1\\1\\1\end{bmatrix}\ *\ \frac{1}{3}\begin{bmatrix}1&1&1\end{bmatrix}
= \frac{1}{9}\begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}.
$$
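
A short numerical check of this factorization, assuming a random $6\times 6$ single‑channel input: filtering with the column kernel and then the row kernel should match one pass with the $3\times 3$ box filter.

```python
import torch
import torch.nn.functional as F

col = torch.full((3, 1), 1 / 3)          # vertical 3x1 smoothing kernel
row = torch.full((1, 3), 1 / 3)          # horizontal 1x3 smoothing kernel
box = col @ row                          # outer product: 3x3 kernel with every entry 1/9

torch.manual_seed(0)
img = torch.randn(1, 1, 6, 6)

# Two cheap 1D passes versus one 2D pass (no padding, so both give a 4x4 output)
two_pass = F.conv2d(F.conv2d(img, col.view(1, 1, 3, 1)), row.view(1, 1, 1, 3))
one_pass = F.conv2d(img, box.view(1, 1, 3, 3))

print(torch.allclose(two_pass, one_pass, atol=1e-5))
```

Per output pixel the two‑pass version needs $3 + 3 = 6$ multiplications instead of $9$; the gap widens for larger kernels.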

**Depthwise‑separable** conv splits **spatial** and **cross‑channel** correlations:
1) depthwise: per‑channel $k\times k$, 2) pointwise: $1\times 1$ across channels.
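
Ignoring biases, the parameter saving of this split can be written in closed form:
$$
\underbrace{k^2\,C_{\text{in}}}_{\text{depthwise}} + \underbrace{C_{\text{in}}\,C_{\text{out}}}_{\text{pointwise}}
\quad\text{vs.}\quad
k^2\,C_{\text{in}}\,C_{\text{out}},
\qquad
\frac{k^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}}{k^2 C_{\text{in}} C_{\text{out}}} = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}.
$$
For $k=3$, $C_{\text{in}}=32$, $C_{\text{out}}=64$ this gives $288 + 2048 = 2336$ parameters against $18{,}432$, a ratio of $\tfrac{1}{64}+\tfrac{1}{9}\approx 0.127$.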


In [13]:
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, nin, nout, k=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(nin, nin, kernel_size=k, padding=padding, groups=nin, bias=False)
        self.pointwise = nn.Conv2d(nin, nout, kernel_size=1, bias=False)
    def forward(self, x):
        return self.pointwise(self.depthwise(x))

dwsc = DepthwiseSeparableConv(32, 64)
plain = nn.Conv2d(32, 64, 3, padding=1, bias=False)

count_params(plain), count_params(dwsc)

(18432, 2336)

## Multi‑headed networks

One shared backbone with **multiple heads** lets you learn related tasks jointly (e.g., policy + value in games, or class + bounding box in detection).

Below: a tiny example with two heads and combined losses.


In [14]:
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16,32,3,padding=1),   nn.ReLU(),
            nn.AdaptiveAvgPool2d((1,1))
        )
    def forward(self, x):
        return self.feat(x).flatten(1)

class MultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()
        self.class_head = nn.Linear(32, 5)   # classification logits
        self.reg_head   = nn.Linear(32, 2)   # simple 2D regression
    
    def forward(self, x):
        h = self.backbone(x)
        return self.class_head(h), self.reg_head(h)

net = MultiHead()
x = torch.randn(8,3,64,64)
cls, reg = net(x)

# Example combined loss: cross-entropy + L1
target_cls = torch.randint(0,5,(8,))
target_reg = torch.randn(8,2)
loss = F.cross_entropy(cls, target_cls) + F.l1_loss(reg, target_reg)
cls.shape, reg.shape, float(loss)

(torch.Size([8, 5]), torch.Size([8, 2]), 2.2864792346954346)
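
When the losses are summed, the shared layers receive the (weighted) sum of each task's gradient. The sketch below illustrates this with a small linear stand‑in for the backbone; the names `trunk`, `head_cls`, `head_reg` and the task weights `w_cls`, `w_reg` are hypothetical values chosen for the demo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
trunk = nn.Linear(10, 16)                # stand-in for a shared backbone
head_cls = nn.Linear(16, 5)              # classification head
head_reg = nn.Linear(16, 2)              # regression head

feats = torch.relu(trunk(torch.randn(8, 10)))
loss_cls = F.cross_entropy(head_cls(feats), torch.randint(0, 5, (8,)))
loss_reg = F.l1_loss(head_reg(feats), torch.randn(8, 2))

# Gradient that each task alone sends into the shared weights
g_cls = torch.autograd.grad(loss_cls, trunk.weight, retain_graph=True)[0]
g_reg = torch.autograd.grad(loss_reg, trunk.weight, retain_graph=True)[0]

w_cls, w_reg = 1.0, 0.5                  # hypothetical task weights
total = w_cls * loss_cls + w_reg * loss_reg
total.backward()

combined_ok = torch.allclose(trunk.weight.grad, w_cls * g_cls + w_reg * g_reg, atol=1e-6)
print(combined_ok)
```

Because the task weights scale each gradient directly, they act as a simple knob when one loss is on a much larger scale than the other.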

## Sparsity and pruning (brief)

Encourage sparse weights (e.g., with $\ell_1$ regularization) and prune small‑magnitude connections to reduce compute. Many structured/unstructured pruning algorithms exist. Below: a quick, illustrative unstructured magnitude prune.


In [15]:
# Simple magnitude pruning demonstration
lin = nn.Linear(64, 64, bias=False)
with torch.no_grad():
    lin.weight.normal_(0, 0.1)

threshold = lin.weight.abs().quantile(0.8).item()  # keep top-20% magnitude
mask = (lin.weight.abs() >= threshold).float()
pruned_params = mask.numel() - int(mask.sum())

with torch.no_grad():      # in-place edits of a leaf parameter must bypass autograd
    lin.weight.mul_(mask)  # zero-out small weights
pruned_params, mask.mean().item()
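
PyTorch also ships a pruning utility, `torch.nn.utils.prune`, that handles the mask bookkeeping. The sketch below shows one possible way to apply it for unstructured magnitude pruning; the layer size and the 80% amount are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(64, 64, bias=False)

# Zero the 80% of weights with the smallest absolute value
prune.l1_unstructured(layer, name="weight", amount=0.8)

sparsity = (layer.weight == 0).float().mean().item()
has_mask = hasattr(layer, "weight_mask")   # binary mask stored as a buffer

# Make the pruning permanent: drops the mask and the weight_orig parameter
prune.remove(layer, "weight")
mask_removed = not hasattr(layer, "weight_mask")

print(round(sparsity, 3), has_mask, mask_removed)
```

Until `prune.remove` is called, the original values live in `weight_orig` and the mask is reapplied on every forward pass, which keeps pruned weights at zero during fine‑tuning.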

## Classic CNN architectures (very short recap)

- **AlexNet:** Early deep CNN; large initial kernels/strides + pooling.
- **VGG:** Only $3\times3$ convs; depth via stacking; heavy parameter counts.
- **GoogLeNet/Inception:** Multi‑resolution branches + $1\times 1$ bottlenecks.
- **ResNet:** Residual connections enable very deep nets; bottleneck blocks in deeper variants.

> Try swapping blocks in the toy models above to feel differences in parameter counts and shapes.


## Where to go next

- Replace synthetic data with a small real dataset (e.g., CIFAR‑10) for end‑to‑end demos.
- Explore **depthwise separable convs** vs. standard convs on speed and accuracy.
- Extend the multi‑head example to include an **auxiliary loss** (e.g., self‑supervised feature loss).

This notebook is designed as a compact **practice companion** to the slides.
