In [1]:
import torch
import torchvision

# MLP for MNIST Classification

This notebook trains a **Multilayer Perceptron (MLP)** to classify **MNIST** digits (0–9). Below is a concise theory guide to frame each step of the pipeline.

---

## 1) Problem setup

- **Inputs**: grayscale images $x \in \mathbb{R}^{1 \times 28 \times 28}$ (values in `[0,1]` after `ToTensor`).
- **Flattening**: we reshape each image to a vector $x_{\text{vec}} \in \mathbb{R}^{784}$ (since `28×28=784`) before feeding it to the MLP.
- **Targets**: integer class labels $y \in \{0,\dots,9\}$.
- **Goal**: learn a function $f_\theta : \mathbb{R}^{784} \to \mathbb{R}^{10}$ that maps an image to ten class scores, the largest of which should belong to the correct digit.

> **Why flatten?** A fully connected layer takes a 1-D feature vector, so each image must be reshaped first. Convolutional nets (CNNs) exploit spatial structure; an MLP does not. It is a simpler **fully-connected** network that treats each pixel as a separate feature.
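
The flattening step can be sketched on a synthetic batch. The tensor `batch` below is random data standing in for MNIST images, not the real dataset.

```python
import torch

# Synthetic stand-in for a batch of MNIST images: (batch, channels, height, width).
batch = torch.rand(32, 1, 28, 28)

# Flatten every dimension except the batch dimension.
flat = batch.flatten(start_dim=1)

print(batch.shape)  # torch.Size([32, 1, 28, 28])
print(flat.shape)   # torch.Size([32, 784])
```

Inside a model, `nn.Flatten()` performs the same reshape as a layer.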

---

## 2) Model: Multilayer Perceptron (MLP)

An MLP stacks **linear layers** with **nonlinear activations**:

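$$ x \in \mathbb{R}^{784} \to \text{Linear}(784 \to 300) \to \text{ReLU} \to \text{Linear}(300 \to 300) \to \text{ReLU} \to \text{Linear}(300 \to 10) \to \text{logits} \in \mathbb{R}^{10} $$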

- **Linear layer**: `h = W x + b`.
- **ReLU**: `ReLU(z) = max(0, z)` adds nonlinearity so the network can learn complex functions.
- **Output**: a length-10 vector of **logits** (unnormalized scores), one per class.

**Parameter count (for this exact architecture):**
- `784×300 + 300 = 235,500`
- `300×300 + 300 = 90,300`
- `300×10  + 10  = 3,010`  
**Total = 328,810 parameters**

> **Why logits (not probabilities)?** The softmax is folded into the loss: `CrossEntropyLoss` computes `log_softmax` directly from the logits, which avoids the overflow and underflow that a separate softmax followed by a log can produce.
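
As a minimal sketch, the architecture above could be written with `nn.Sequential`. The variable names (`net`, `n_params`) are illustrative choices, and the input is random data rather than MNIST.

```python
import torch
from torch import nn

# Sketch of the 784 -> 300 -> 300 -> 10 MLP described above.
net = nn.Sequential(
    nn.Flatten(),          # (B, 1, 28, 28) -> (B, 784)
    nn.Linear(784, 300),
    nn.ReLU(),
    nn.Linear(300, 300),
    nn.ReLU(),
    nn.Linear(300, 10),    # raw logits, no softmax
)

# Count trainable parameters (weights and biases of the three linear layers).
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 328810

# Forward pass on a random batch to check the output shape.
logits = net(torch.rand(32, 1, 28, 28))
print(logits.shape)  # torch.Size([32, 10])
```

The printed count matches the per-layer sums listed above.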

---

## 3) From logits to probabilities

For a sample `x`, the network outputs logits $z \in \mathbb{R}^{10}$. The **softmax** turns logits into probabilities:
$$
p_k = \frac{e^{z_k}}{\sum_{j=1}^{10} e^{z_j}}
$$
But in PyTorch, **you should not apply softmax before the loss** if you use `CrossEntropyLoss`.

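> **PyTorch note**: `nn.CrossEntropyLoss` is a module: instantiate it once (`criterion = nn.CrossEntropyLoss()`) and then call `criterion(logits, y)`. Internally it applies `log_softmax` followed by `NLLLoss`, so give it **raw logits** and integer class labels `y` of dtype `torch.long`.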
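
The equivalence can be checked numerically. This is a small sketch on made-up logits and labels; the names `criterion`, `loss_ce`, and `loss_manual` are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)        # made-up logits for 4 samples
y = torch.tensor([3, 0, 9, 1])     # integer labels (dtype torch.int64)

criterion = nn.CrossEntropyLoss()  # instantiate the module, then call it
loss_ce = criterion(logits, y)

# Manual equivalent: log-softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), y)

# Softmax is only needed when you want to inspect probabilities.
probs = F.softmax(logits, dim=1)

print(torch.allclose(loss_ce, loss_manual))  # True
print(probs.sum(dim=1))                      # each row sums to 1
```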

---

## 4) Loss: Multi-class cross-entropy

For a single example with true class `y`, the cross-entropy loss is:
$$
\mathcal{L} = -\log p_y
$$
Averaged over the batch, this encourages the model to put high probability mass on the true class.

**Useful gradient fact** (with softmax + cross-entropy):
$$
\frac{\partial \mathcal{L}}{\partial z_k} = p_k - \mathbb{1}[k=y]
$$
i.e., the gradient on the logits is simply **(predicted prob − one-hot target)**.
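
Autograd can be used to check this identity on a single example. The logits and the true class below are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(10, requires_grad=True)  # logits for one sample
y = torch.tensor(7)                      # arbitrary true class

# cross_entropy expects a batch dimension, so add one of size 1.
loss = F.cross_entropy(z.unsqueeze(0), y.unsqueeze(0))
loss.backward()

p = F.softmax(z.detach(), dim=0)
expected = p - F.one_hot(y, num_classes=10).float()

print(torch.allclose(loss.detach(), -torch.log(p[7])))  # loss equals -log p_y
print(torch.allclose(z.grad, expected, atol=1e-6))      # gradient equals p - one_hot
```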

---

## 5) Optimization: Gradient descent & Adam

Training iteratively improves parameters by following the negative gradient of the loss:
$$
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathbb{E}_{(x,y)}[\mathcal{L}(f_\theta(x), y)]
$$

- **Mini-batches**: use `DataLoader` to estimate the gradient on small batches (e.g., 32 samples). Each update is far cheaper than a full pass over the data, at the cost of a noisier gradient estimate.
- **Adam optimizer**: an adaptive variant of SGD that maintains running estimates of first/second moments of gradients. Typical start: `lr=1e-3`.

> **Epoch** = one full pass over the training set. Accuracy typically improves over multiple epochs.
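
A mini-batch training loop with Adam might look like the sketch below. It runs on synthetic data (random noise with one brightened row per class) so that it is self-contained; the dataset, the number of epochs, and all variable names are assumptions for illustration, not the notebook's actual training cell.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for MNIST: noise images where row `label` is brightened.
Y = torch.randint(0, 10, (256,))
X = torch.rand(256, 28, 28) * 0.5
X[torch.arange(256), Y] += 1.0
X = X.unsqueeze(1)  # add the channel dimension: (256, 1, 28, 28)

loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)

net = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 300), nn.ReLU(), nn.Linear(300, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(5):
    net.train()
    running = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = criterion(net(xb), yb)  # forward pass on one mini-batch
        loss.backward()                # backpropagate
        optimizer.step()               # Adam parameter update
        running += loss.item() * xb.size(0)
    epoch_losses.append(running / len(X))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

With the real dataset, `loader` would wrap the MNIST training set instead of the synthetic tensors.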

---

## 6) Data preprocessing

- **`ToTensor()`** scales pixel values from `[0,255]` to `[0,1]`.
- **Normalization (recommended)**:
  $$
  x \leftarrow \frac{x - 0.1307}{0.3081}
  $$
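  These are the commonly used mean and standard deviation of the MNIST training pixels, computed after scaling to `[0,1]`. Normalizing to roughly zero mean and unit variance can stabilize and speed up convergence.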
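
The effect of the normalization formula can be seen on a few hand-picked pixel values. This sketch uses plain tensor arithmetic; in a torchvision pipeline the same step is usually expressed as `transforms.Normalize((0.1307,), (0.3081,))` placed after `transforms.ToTensor()`.

```python
import torch

MNIST_MEAN, MNIST_STD = 0.1307, 0.3081  # values quoted above

# Stand-in for a ToTensor() output: pixel values already in [0, 1].
img = torch.tensor([[0.0, 0.5, 1.0]])

normalized = (img - MNIST_MEAN) / MNIST_STD
print(normalized)  # black (0.0) maps to about -0.42, white (1.0) to about 2.82
```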

---

## 7) Evaluation: from logits to predictions

- **Predicted class**: $\hat{y} = \arg\max_k z_k$. Taking the `argmax` over the softmax probabilities gives the same class, because softmax preserves the ordering of the logits.
- **Accuracy**:
  $$
  \text{acc} = \frac{\#\{i : \hat{y}_i = y_i\}}{N}
  $$
- **Confusion matrix (optional)**: shows which digits the model confuses (e.g., 4 vs 9).

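> **Important**: at test time, call `net.eval()` so that dropout is switched off and batch norm uses its running statistics, and wrap inference in `torch.no_grad()` to skip gradient tracking: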
```python
net.eval()
with torch.no_grad():
    ...
```
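
Putting these pieces together, an evaluation helper could be sketched as follows. The function name `evaluate`, the tiny untrained model, and the random data are all hypothetical and serve only to make the example runnable.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def evaluate(net, loader, num_classes=10):
    """Return (accuracy, confusion matrix); rows = true class, columns = predicted."""
    net.eval()
    confusion = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for xb, yb in loader:
        preds = net(xb).argmax(dim=1)  # predicted class = argmax over logits
        for t, p in zip(yb.tolist(), preds.tolist()):
            confusion[t, p] += 1
    accuracy = confusion.diag().sum().item() / confusion.sum().item()
    return accuracy, confusion

# Usage on synthetic data with an untrained model.
torch.manual_seed(0)
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
data = TensorDataset(torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,)))
acc, cm = evaluate(net, DataLoader(data, batch_size=16))
print(f"accuracy: {acc:.3f}")
```

Off-diagonal entries of `cm` count the confusions, for example `cm[4, 9]` is the number of 4s predicted as 9.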

---

## 8) Overfitting, regularization, and sanity checks

- **Overfitting**: training accuracy ≫ test accuracy.  
  Remedies: more data, **weight decay** (L2), **dropout**, reduce hidden size, early stopping.
- **Underfitting**: both train and test accuracy are low.  
  Remedies: train longer, increase model capacity, tune LR.
- **Sanity checks**:
  - Can the model **overfit a tiny subset** (e.g., 100 samples)? If not, something’s wrong (bugs, LR too low, etc.).
  - Does loss decrease over iterations? If not, check LR, device placement, label dtype, and that you’re using logits with `CrossEntropyLoss`.
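
The tiny-subset check can be sketched as below. The 100 samples are random inputs with random labels, used here as a hypothetical stand-in for a slice of the training set; a healthy model and training loop should be able to memorize them.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical tiny subset: 100 random inputs with random labels.
x_small = torch.randn(100, 784)
y_small = torch.randint(0, 10, (100,))

net = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 300), nn.ReLU(),
    nn.Linear(300, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

losses = []
for step in range(200):  # full-batch updates on the tiny subset
    optimizer.zero_grad()
    loss = criterion(net(x_small), y_small)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    train_acc = (net(x_small).argmax(dim=1) == y_small).float().mean().item()

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, train acc {train_acc:.2f}")
```

If the loss stays near its starting value here, the problem is more likely in the pipeline than in the model's capacity.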

---

## 9) Device, dtype, and reproducibility

- **Device**: keep model and tensors on the same device:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)                     # nn.Module.to() moves parameters in place
x, y = x.to(device), y.to(device)  # Tensor.to() is not in-place, so reassign
```
- **Seeds** (help reproducibility, not exact determinism across all backends):
```python
import torch, random, numpy as np
# Seed the PyTorch, NumPy, and Python RNGs.
torch.manual_seed(0)
np.random.seed(0)
random.seed(0)
```
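- **Mixed precision** (optional): running the forward pass under `torch.autocast(device_type="cuda")` can speed up training on GPUs with reduced-precision hardware support. With `float16` it is usually paired with gradient scaling to avoid underflow.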
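
To illustrate what seeding buys you, the sketch below re-creates a layer under the same and under a different seed. The helper `make_layer` is a hypothetical name; the comparison assumes both layers are built on the same device and PyTorch version.

```python
import torch
from torch import nn

def make_layer(seed):
    torch.manual_seed(seed)  # reseed before any random initialization
    return nn.Linear(784, 10)

a, b, c = make_layer(0), make_layer(0), make_layer(1)
print(torch.equal(a.weight, b.weight))  # True: same seed, same initial weights
print(torch.equal(a.weight, c.weight))  # False: different seed
```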
