In [1]:
# !nvidia-smi

# MNIST — MLP warm-up

We’ll load MNIST, normalize it, define a small MLP, and run a quick
sanity check forward pass to verify shapes before training.


In [2]:
# OPTIONAL: only run this if your torch/torchvision install is broken.
# For GPU on Kaggle (CUDA 12.1 wheels):
# !pip install --upgrade --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# For CPU-only:
# !pip install --upgrade --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu


## Imports & device

We’ll autodetect CUDA and fall back to CPU. The code works either way.


In [3]:
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("device:", device)


torch: 2.6.0+cu124
torchvision: 0.21.0+cu124
device: cuda


## Dataset & transforms

- `ToTensor()` → converts the PIL image to a float tensor of shape [1, 28, 28] with pixels scaled to [0,1].
- `Normalize((0.1307,), (0.3081,))` → centers and scales using the MNIST training-set mean and std.
  (Note the commas: single-element tuples.)


### Why do we normalize MNIST with `(0.1307,), (0.3081,)`?

After `ToTensor()`, MNIST images are scaled to `[0,1]`, but their distribution isn’t centered and doesn’t have unit variance:
- Mean pixel value is about **0.1307** (most pixels are dark background).
- Standard deviation is about **0.3081**.

**Why normalize?**
- Centering (subtracting the mean) makes neuron inputs hover around zero, which helps gradients flow and speeds up learning.
- Scaling (dividing by the std) puts features on a comparable scale, making optimization more stable and less sensitive to learning rates.

**Why those exact numbers?**
- They are the empirical mean and std of the MNIST training set computed over all pixels.
- Using dataset-specific stats is better than generic choices (like 0.5/0.5) because it matches the true data distribution.

**Why the tuples — and why the comma is so important?**
- `Normalize` expects a *sequence* (list or tuple) with one value per channel.
- MNIST has 1 channel → we need one mean and one std → a 1-element tuple.
- In Python:
  - `(0.3081,)` → a tuple containing one float ✅
  - `(0.3081)` → just a float ❌
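- If you forget the comma, you pass a bare float instead of a sequence. Recent torchvision versions may still accept this for single-channel images, because a scalar broadcasts over the whole tensor, but it departs from the documented one-value-per-channel contract. Keep the comma so the same pattern stays correct when you move to multi-channel data such as RGB.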

**What if we skip normalization?**
- The model may still learn (MNIST is simple), but:
  - Training is slower.
  - Optimization is less stable.
  - Accuracy may plateau lower.
- For harder datasets (like CIFAR or ImageNet), skipping normalization typically hurts more: training can become noticeably slower or less stable.

**TL;DR**
Normalization with `(0.1307,), (0.3081,)` standardizes MNIST inputs to approximately zero mean and unit variance.  
The trailing comma makes those values one-element tuples rather than plain floats, matching the per-channel sequences `Normalize` is documented to take.
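To make the comma point concrete, here is a small pure-Python sketch (no torch required). It shows how Python parses the two spellings and what range a `[0,1]` pixel maps to under the statistics quoted above. The helper name `normalize_pixel` is made up for illustration; it only mirrors the per-pixel formula `(p - mean) / std`.

```python
# Illustrative only: `normalize_pixel` is a hypothetical helper, not a torchvision function.
MEAN, STD = 0.1307, 0.3081  # MNIST statistics quoted in the text

def normalize_pixel(p, mean=MEAN, std=STD):
    """Per-pixel standardization: subtract the mean, divide by the std."""
    return (p - mean) / std

with_comma = (0.3081,)    # parentheses + trailing comma -> tuple of length 1
without_comma = (0.3081)  # parentheses alone are just grouping -> float

print(type(with_comma).__name__, len(with_comma))  # tuple 1
print(type(without_comma).__name__)                # float

# Range of normalized values for pixels in [0, 1]
lo, hi = normalize_pixel(0.0), normalize_pixel(1.0)
print(round(lo, 4), round(hi, 4))
```

The black background (pixel value 0) lands a little below zero, while bright strokes land well above it, which is what you would expect when most pixels are dark.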


In [4]:
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.1307,), (0.3081,))
])

train_mnist = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)
test_mnist = torchvision.datasets.MNIST(
    root="./data", train=False, download=True, transform=transform
)

# quick peek
x0, y0 = train_mnist[0]
print("one sample:", x0.shape, y0)  # torch.Size([1, 28, 28]) label_int


100%|██████████| 9.91M/9.91M [00:00<00:00, 17.9MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 484kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 4.46MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 8.53MB/s]

one sample: torch.Size([1, 28, 28]) 5





## Model

A simple fully-connected classifier:
- Flatten 28×28 → 784
- Hidden: two layers of 300 units, each followed by LeakyReLU
- Output: 10 logits (no Softmax; CrossEntropyLoss expects logits)


In [5]:
import torch

# In PyTorch, p.numel() returns the number of elements (scalars) in the tensor p.

p = torch.randn(3, 4)   # shape [3,4]
print(p.numel())        # 12


12


In [6]:
model = torch.nn.Sequential(
    torch.nn.Linear(28*28, 300),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(300, 300),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(300, 10)  # logits
).to(device)

sum_params = sum(p.numel() for p in model.parameters())
print("model params:", sum_params)


model params: 328810
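The parameter count can be checked by hand. The sketch below uses plain Python and assumes only that each `Linear(in_features, out_features)` layer stores `in * out` weights plus `out` biases, and that `LeakyReLU` has no parameters.

```python
# Each Linear(in_features, out_features) holds in*out weights plus out biases.
layer_shapes = [(28 * 28, 300), (300, 300), (300, 10)]

per_layer = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
total = sum(per_layer)

print(per_layer)  # [235500, 90300, 3010]
print(total)      # 328810
```

The first layer accounts for most of the parameters, because it connects all 784 input pixels to 300 hidden units.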


## Sanity check (single example)

Flatten to 784 features, forward once, confirm output shape [1, 10].


In [7]:
digit, cls = train_mnist[0]
digit = digit.to(device).view(1, 28*28)  # add batch dim = 1
with torch.no_grad():
    out = model(digit)
print("single forward shape:", out.shape)  # torch.Size([1, 10])


single forward shape: torch.Size([1, 10])


## Sanity check with dataset loop (first item only)

Iterate the dataset, move to device, flatten, run model, print shape, break.


In [8]:
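for digit, cls in train_mnist:
    digit = digit.to(device)
    # A dataset item is [1, 28, 28]: shape[0] is the channel dim (1), which
    # acts as a batch of one here. With a DataLoader it is the real batch size.
    digit = digit.view(digit.shape[0], 28*28)
    with torch.no_grad():
        print(model(digit).shape)  # expected: torch.Size([1, 10])
    break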


torch.Size([1, 10])


### Why do we use `digit.view(digit.shape[0], 28*28)`?

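Iterating the dataset directly, as in the loop above, yields one image of shape `[1, 28, 28]`, so `digit.shape[0]` is the channel dimension (1) and merely acts as a batch of one. Batches from a `DataLoader` have shape `[B, 1, 28, 28]`, and the same call then works unchanged: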
- `B` = batch size  
- `1` = number of channels (grayscale)  
- `28 × 28` = image height and width  

Our model starts with a `Linear(28*28, 300)` layer, which expects a
**flat vector of 784 features per image**, not a 2D grid.

The call

``` python
digit = digit.view(digit.shape[0], 28*28)
```

does two things:
1. Keeps the batch dimension (`digit.shape[0]`).
2. Flattens each `[1,28,28]` image into a single vector `[784]`.

So:
- Before: `[B, 1, 28, 28]`  
- After:  `[B, 784]`  

This reshaping step bridges the gap between image-shaped data and the
fully connected (dense) layers of our MLP.
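As a quick check of the reshaping, the sketch below uses a random tensor as a stand-in for a batch of images (an assumption, so no dataset download is needed). It compares `view` with two common equivalents: `-1` to infer the flattened size, and `torch.flatten(..., start_dim=1)`.

```python
import torch

# Random stand-in for a batch of 4 MNIST images.
batch = torch.randn(4, 1, 28, 28)

flat_view = batch.view(batch.shape[0], 28 * 28)  # explicit target shape
flat_auto = batch.view(batch.shape[0], -1)       # -1 lets PyTorch infer 784
flat_fn = torch.flatten(batch, start_dim=1)      # flatten everything after dim 0

print(flat_view.shape)  # torch.Size([4, 784])
print(torch.equal(flat_view, flat_auto), torch.equal(flat_view, flat_fn))  # True True

# A single dataset item has no batch dimension: its dim 0 is the channel.
single = torch.randn(1, 28, 28)
print(single.view(single.shape[0], 28 * 28).shape)  # torch.Size([1, 784])
```

All three spellings produce the same `[B, 784]` tensor; which one to use is mostly a matter of taste.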


## Dataloaders

We’ll iterate in mini-batches for efficient training.


In [9]:
from torch.utils.data import DataLoader 

batch_size = 62 
train_dl = DataLoader(
    train_mnist, 
    batch_size=batch_size, 
    shuffle=True,
    num_workers=2, 
    pin_memory=(device=="cuda")
)

test_dl = DataLoader(
    test_mnist, 
    batch_size=batch_size, 
    shuffle=False, 
    num_workers=2,
    pin_memory=(device=="cuda")
)

len(train_dl), len(test_dl)

(968, 162)
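The two lengths printed above follow from simple arithmetic. This sketch assumes the standard MNIST split sizes (60,000 training and 10,000 test images) and the `DataLoader` default `drop_last=False`, under which the final partial batch is kept.

```python
import math

batch_size = 62
n_train, n_test = 60_000, 10_000  # MNIST split sizes

# With drop_last=False the final partial batch is kept, so we round up.
n_train_batches = math.ceil(n_train / batch_size)
n_test_batches = math.ceil(n_test / batch_size)
print(n_train_batches, n_test_batches)  # 968 162

# Size of the last (smaller) training batch
last_train_batch = n_train - (n_train_batches - 1) * batch_size
print(last_train_batch)  # 46
```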

### Understanding `DataLoader` arguments

When we wrap our MNIST datasets in `DataLoader`, we specify a few
important options:

- **`batch_size=62`**  
  - How many samples to group together in one batch.  
  - Instead of returning a single `[1, 28, 28]` image, the loader
    returns `[62, 1, 28, 28]` tensors (the last batch may be smaller).  
  - Larger batch sizes improve GPU utilization and give smoother
    gradient estimates, but also use more memory.  
  - On CPU, smaller batches can be more practical to keep things fast
    and memory-efficient.

- **`shuffle=True` (for training)**  
  - Each epoch, the training data is shuffled.  
  - Keeps the model from picking up on any ordering in the data.  
  - Helps generalization because each mini-batch looks different each
    epoch.  
  - For evaluation (`test_dl`), we use `shuffle=False` so results are
    deterministic and ordered.

- **`num_workers=2`**  
  - Number of subprocesses used to load data in parallel.  
  - `0` means load in the main process (often slower, since loading
    cannot overlap with training).  
  - On CPU or GPU, having a few workers (like 2–4) allows data to be
    prefetched while the model is training on the previous batch,
    keeping the pipeline efficient.  
  - On Kaggle, small values (like 2) are often safe.

- **`pin_memory=(device=="cuda")`**  
  - *Pinned (page-locked) memory* speeds up data transfer from CPU RAM
    to GPU memory.  
  - If `device=="cuda"`, we set `pin_memory=True` so each batch can be
    moved to GPU more efficiently with `.to("cuda")`.  
  - If `device=="cpu"`, this option does nothing and can safely remain
    `False`.

---

**Summary:**
- Training loader: `batch_size=62`, `shuffle=True`  
- Test loader: `batch_size=62`, `shuffle=False`  
- Use a few `num_workers` to overlap data loading with computation.  
- Enable `pin_memory` only when training on CUDA for faster CPU→GPU
  transfers.
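To see batching in action without downloading anything, the sketch below wraps a small synthetic dataset in a `DataLoader`. The 130 random "images" are an assumption for illustration, and `num_workers=0` keeps the example simple (the notebook itself uses 2).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 130 fake images and labels.
images = torch.randn(130, 1, 28, 28)
labels = torch.randint(0, 10, (130,))
fake_ds = TensorDataset(images, labels)

# Same batch size as the notebook's loaders; no shuffling so the order is fixed.
loader = DataLoader(fake_ds, batch_size=62, shuffle=False, num_workers=0)

shapes = [tuple(xb.shape) for xb, yb in loader]
print(len(loader))  # 3
print(shapes)       # [(62, 1, 28, 28), (62, 1, 28, 28), (6, 1, 28, 28)]
```

The last batch holds only the 6 leftover samples. Passing `drop_last=True` would discard it instead, which some training loops prefer so that every batch has the same size.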


### Loss & Optimizer

- **Loss:** `CrossEntropyLoss` compares the model’s **logits** to the
  ground-truth class indices. It internally applies `log_softmax`, so
  we **do not** put a `Softmax` layer in the model.
- **Optimizer:** `Adam` with its default learning rate (1e-3) works well
  for this small MLP. It adapts per-parameter step sizes and often
  converges faster than plain SGD.
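- **Seed:** we set a manual seed so that random operations after this
  point (such as the shuffling order in `train_dl`) are reproducible.
  The model was already built in an earlier cell, so its weight
  initialization is not covered by this seed; call `torch.manual_seed`
  before constructing the model if the initial weights should be
  reproducible too. Adam's own state starts from zeros and involves no
  randomness.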
- This works the same on **CPU or CUDA**; no device-specific changes are
  needed for defining the loss/optimizer.
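The claim that `CrossEntropyLoss` works on raw logits can be verified directly. The sketch below uses random logits and arbitrary targets (both made up for illustration) and compares the loss against an explicit `log_softmax` followed by `nll_loss`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw model outputs: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])  # ground-truth class indices

loss_ce = torch.nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_ce, loss_manual))  # True
```

Because the softmax is already folded into the loss, adding a `Softmax` layer to the model would apply it twice and distort the gradients.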


In [10]:
torch.manual_seed(42)

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

print(loss_fn)
print(optimizer)

CrossEntropyLoss()
Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    weight_decay: 0
)
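A seed only influences random operations that run after it is set. The sketch below illustrates this for weight initialization; the helper `make_layer` is hypothetical and exists only to pair the seed with the layer construction.

```python
import torch

def make_layer(seed):
    """Seed the global RNG, then build a layer so its init draws from that seed."""
    torch.manual_seed(seed)
    return torch.nn.Linear(784, 300)

a = make_layer(42)
b = make_layer(42)
c = torch.nn.Linear(784, 300)  # built without re-seeding: the RNG has advanced

print(torch.equal(a.weight, b.weight))  # True
print(torch.equal(a.weight, c.weight))  # False
```

To get identical initial weights across runs of this notebook, the seed would need to be set before the cell that builds `model`.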
