In [7]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())

2.9.1+rocmsdk20260116
True


# Chapter 10: Introduction to Artificial Neural Networks with PyTorch

## 1. Setting Up PyTorch and Checking GPU Availability

**Why do we do this?**
- First, we import PyTorch (the deep learning library) and check its version
- We also check if CUDA (GPU support) is available - this is important because training neural networks on a GPU is **much faster** than on CPU (often 10-100x faster)
- If `torch.cuda.is_available()` returns `True`, we can use our GPU for training!

In [8]:
X = torch.tensor([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])
X

tensor([[1., 4., 7.],
        [2., 3., 6.]])

## 2. Tensors - The Building Block of PyTorch

**What is a Tensor?**
- A tensor is like a NumPy array, but with superpowers:
  1. It can run on GPU for faster computation
  2. It supports automatic differentiation (needed for training neural networks)
- Think of it as a multi-dimensional array that holds numbers

**Creating a Tensor:**
- `torch.tensor()` creates a tensor from a Python list or NumPy array
- Below we create a 2x3 matrix (2 rows, 3 columns)

In [9]:
import torch

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("Device count:", torch.cuda.device_count())

if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))


Torch version: 2.9.1+rocmsdk20260116
CUDA available: True
CUDA version: None
Device count: 1
GPU name: AMD Radeon RX 7900 XTX


**More detailed GPU info:**
- This cell gives us more details about our GPU setup
- `torch.cuda.get_device_name(0)` shows the actual GPU name (your AMD Radeon RX 7900 XTX)
- This helps confirm everything is set up correctly before we start training

In [10]:
X.shape

torch.Size([2, 3])

## 3. Tensor Attributes: Shape and Data Type

**Shape** - tells us the dimensions of the tensor:
- `X.shape` returns `torch.Size([2, 3])` meaning 2 rows, 3 columns
- This is crucial for understanding what operations are valid (matrix multiplication requires compatible shapes)

In [11]:
X.dtype 

torch.float32

**Data Type (dtype)** - tells us the type of numbers stored:
- `torch.float32` (32-bit floating point) is the most common for neural networks
- It's a balance between precision and memory usage
- Other types: `float64` (more precise), `float16` (faster but less precise), `int64` (integers)

In [12]:
X[0, 1]  # Accessing the element at first row and second column

tensor(4.)

## 4. Tensor Indexing and Slicing

**Why is this important?**
- We often need to access specific parts of our data (e.g., get certain features, select a batch)
- Indexing works just like NumPy - use `[row, column]` notation

**Examples below:**
- `X[0, 1]` - get element at row 0, column 1 (remember: indexing starts at 0!)
- `X[:, 2]` - get ALL rows (`:`) from column 2 (the third column)

In [13]:
X[:, 2]  # Accessing all rows in the third column

tensor([7., 6.])

In [14]:
10 * (X + 1)

tensor([[20., 50., 80.],
        [30., 40., 70.]])

## 5. Tensor Operations (Element-wise)

**Element-wise operations** apply to each element individually:
- `X + 1` adds 1 to every element
- `10 * X` multiplies every element by 10
- These operations are fundamental for neural network computations (like adding bias, scaling values)

In [15]:
X.exp()

tensor([[   2.7183,   54.5982, 1096.6332],
        [   7.3891,   20.0855,  403.4288]])

**Mathematical functions:**
- `X.exp()` computes e^x for each element (exponential function)
- This is used in activation functions like softmax and in probability calculations
- Other common functions: `X.log()`, `X.sqrt()`, `X.sin()`, `X.cos()`

In [16]:
X.mean()

tensor(3.8333)

**Aggregation functions** - reduce a tensor to fewer values:
- `X.mean()` - average of all elements (used for calculating loss)
- `X.max()` - maximum value
- `X.sum()` - sum of all elements
- `dim=0` means operate along rows (get result per column), `dim=1` means along columns (get result per row)

In [17]:
X.max(dim=0)

torch.return_types.max(
values=tensor([2., 4., 7.]),
indices=tensor([1, 0, 0]))

In [18]:
X @ X.T

tensor([[66., 56.],
        [56., 49.]])

## 6. Matrix Multiplication

**Why is matrix multiplication so important?**
- Neural networks are essentially a series of matrix multiplications!
- Input × Weights = Output (this is how each layer computes its result)

**Syntax:**
- `X @ X.T` or `torch.matmul(X, X.T)` - matrix multiplication
- `.T` means transpose (swap rows and columns)
- For X with shape [2, 3], X.T has shape [3, 2]
- Result: [2, 3] @ [3, 2] = [2, 2]

In [19]:
import numpy as np

## 7. NumPy ↔ PyTorch Conversion

**Why do we need this?**
- Many datasets come in NumPy format (sklearn, pandas)
- We need to convert to PyTorch tensors for training
- Sometimes we need to convert back to NumPy for visualization or saving

**Methods:**
- `tensor.numpy()` - convert tensor to NumPy array
- `torch.tensor(numpy_array)` - convert NumPy to tensor
- `torch.FloatTensor(numpy_array)` - convert and ensure float32 type

In [20]:
X.numpy()

array([[1., 4., 7.],
       [2., 3., 6.]], dtype=float32)

In [21]:
torch.tensor(np.array([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]]))

tensor([[1., 4., 7.],
        [2., 3., 6.]], dtype=torch.float64)

In [22]:
torch.FloatTensor(np.array([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]]))

tensor([[1., 4., 7.],
        [2., 3., 6.]])

In [23]:
X[:, 1] = -99

## 8. Modifying Tensors In-Place

**In-place operations** modify the tensor directly:
- `X[:, 1] = -99` replaces all values in column 1 with -99
- This is memory-efficient but be careful - you lose the original values!

In [24]:
X

tensor([[  1., -99.,   7.],
        [  2., -99.,   6.]])

In [25]:
X.relu()

tensor([[1., 0., 7.],
        [2., 0., 6.]])

## 9. ReLU Activation Function

**What is ReLU?** (Rectified Linear Unit)
- `ReLU(x) = max(0, x)` - keeps positive values, replaces negatives with 0
- This is the most popular activation function in neural networks!

**Why do we need activation functions?**
- Without them, a neural network is just linear transformations (matrix multiplications)
- No matter how many linear layers you stack, the result is still linear
- Activation functions add **non-linearity**, allowing networks to learn complex patterns

**Notice:** The -99 values become 0 after ReLU

In [26]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

## 10. Using the GPU (CUDA)

**Why use GPU?**
- GPUs have thousands of cores optimized for parallel math operations
- Neural network training involves many matrix multiplications - perfect for GPUs!
- Training can be 10-100x faster on GPU vs CPU

**Device Selection:**
- `"cuda"` - NVIDIA GPU (most common)
- `"mps"` - Apple Silicon GPU (M1/M2 Macs)
- `"cpu"` - fallback if no GPU available

**Moving tensors to GPU:**
- `tensor.to(device)` or `tensor.to("cuda")` - moves tensor to GPU
- All tensors in an operation must be on the same device!

In [27]:
M = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

In [28]:
M = M.to(device)

In [29]:
M.device

device(type='cuda', index=0)

In [30]:
M = torch.tensor([[1, 2, 3], [4, 5, 6]], device=device)

In [31]:
R = M.float() @  M.T.float()
R.device

device(type='cuda', index=0)

In [28]:
R

tensor([[14., 32.],
        [32., 77.]], device='cuda:0')

In [29]:
M = torch.rand((1000, 1000))
%timeit M @ M.T

69.1 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## 11. CPU vs GPU Performance Comparison

**Let's see the speed difference!**
- We multiply two 1000×1000 matrices
- `%timeit` runs the operation multiple times and measures average time

**Results explained:**
- CPU: ~76ms (milliseconds)
- GPU: ~12μs (microseconds) = 0.012ms

That's about **6000x faster** on GPU! (though actual compute time is ~0.16ms due to synchronization overhead)

In [30]:
M = torch.rand((1000, 1000), device=device)
%timeit M @ M.T

11.5 μs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [31]:
M = torch.randn((1000, 1000), device='cuda')

def matmul_sync():
    result = M @ M.T
    torch.cuda.synchronize()
    return result

%timeit matmul_sync()


starter = torch.cuda.Event(enable_timing=True)
ender = torch.cuda.Event(enable_timing=True)

# Warmup
for _ in range(10):
    _ = M @ M.T
torch.cuda.synchronize()

# Time it
starter.record()
for _ in range(100):
    _ = M @ M.T
ender.record()
torch.cuda.synchronize()

print(f"{starter.elapsed_time(ender) / 100:.3f} ms per matmul")

423 μs ± 127 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
0.172 ms per matmul


In [32]:
x = torch.tensor(5.0, requires_grad=True)
f = x ** 2
f


tensor(25., grad_fn=<PowBackward0>)

## 12. Automatic Differentiation (Autograd) - THE KEY TO DEEP LEARNING

**What is Autograd?**
- PyTorch automatically computes gradients (derivatives) for us!
- This is essential for training neural networks using gradient descent

**How it works:**
1. Create a tensor with `requires_grad=True` - tells PyTorch to track operations
2. Perform computations (forward pass)
3. Call `.backward()` - PyTorch computes all gradients automatically
4. Access gradients with `.grad`

**Example:** f(x) = x²
- Derivative: df/dx = 2x
- At x=5: gradient = 2×5 = 10 ✓

In [32]:
f.backward()


NameError: name 'f' is not defined

In [34]:
x.grad

tensor(10.)

In [35]:
a = 0.1
with torch.no_grad():
    x -= a * x.grad

## 13. Gradient Descent - How Neural Networks Learn

**The core idea:**
- We want to minimize a loss function (make predictions better)
- Gradient tells us the direction of steepest increase
- So we move in the OPPOSITE direction (subtract the gradient)

**Update rule:** `x = x - learning_rate × gradient`

**Important details:**
- `torch.no_grad()` - temporarily disables gradient tracking (we don't want to track the update itself)
- `x.grad.zero_()` - reset gradients to 0 (they accumulate by default!)

In [36]:
x.grad.zero_()

tensor(0.)

In [37]:
learning_rate = 0.1
x = torch.tensor(5.0, requires_grad=True)
for iteration in range(10):
    f = x ** 2
    f.backward()
    with torch.no_grad():
        x-= learning_rate * x.grad
    x.grad.zero_()

**Complete Gradient Descent Loop:**

This is the pattern you'll see in ALL neural network training:
1. **Forward pass:** compute f(x) = x²
2. **Backward pass:** compute gradient with `.backward()`
3. **Update:** subtract gradient × learning_rate from x
4. **Reset:** zero out the gradient for next iteration

After 10 iterations, x should be very close to 0 (the minimum of x²)

In [38]:
t = torch.tensor(2.0, requires_grad=True)
z = t.exp()
z = z + 1
z.backward()

In [39]:
from sklearn.datasets import fetch_california_housing   
import numpy as np
import torch

housing = fetch_california_housing(as_frame=True)   

---
# PART 2: Building a Real Neural Network

## 14. Loading the California Housing Dataset

**The Task:** Predict house prices based on features like:
- Median income, house age, number of rooms, population, etc.

**Why this dataset?**
- Real-world regression problem
- Good size for learning (not too big, not too small)
- Clear goal: predict the median house value

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_temp, y_train, y_temp = train_test_split(
    housing.data, housing.target, test_size=0.36, random_state=42
)

X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.56, random_state=42
)


NameError: name 'housing' is not defined

## 15. Train/Validation/Test Split

**Why split the data into 3 parts?**

| Set | Purpose | Used For |
|-----|---------|----------|
| **Training** (~64%) | Learn patterns | Updating model weights |
| **Validation** (~16%) | Tune hyperparameters | Choosing learning rate, model size |
| **Test** (~20%) | Final evaluation | Report final performance |

**Important:** Never train on validation/test data! This prevents "cheating" and ensures our model generalizes to new data.

In [41]:
X_train = torch.FloatTensor(X_train.values)
X_valid = torch.FloatTensor(X_valid.values)
X_test = torch.FloatTensor(X_test.values)
means = X_train.mean(dim=0, keepdim =True)
stds = X_train.std(dim=0, keepdim=True)
X_train = (X_train - means) / stds
X_test = (X_test - means) / stds
X_valid = (X_valid - means) / stds

  X_train = torch.FloatTensor(X_train.values)


## 16. Feature Normalization (Standardization)

**Why normalize?**
- Features have different scales (income: 0-15, rooms: 1-40, population: 3-35000)
- Neural networks train much faster when all features are on similar scales!
- Without normalization, large-scale features dominate the learning

**Standardization formula:** `X_normalized = (X - mean) / std`
- This transforms each feature to have mean=0 and std=1

**Critical:** Use training set statistics (mean, std) for ALL sets. Never compute separate stats for validation/test!

In [42]:
y_train = torch.FloatTensor(y_train.values).view(-1, 1)
y_valid = torch.FloatTensor(y_valid.values).view(-1, 1)
y_test = torch.FloatTensor(y_test.values).view(-1, 1)

**Reshaping targets with `.view(-1, 1)`:**
- Converts 1D array `[1, 2, 3]` to 2D column `[[1], [2], [3]]`
- This makes shapes compatible for matrix operations
- `-1` means "infer this dimension automatically"

In [43]:
torch.manual_seed(42)
n_features = X_train.shape[1]
w = torch.randn((n_features, 1), requires_grad=True)
b = torch.tensor(0., requires_grad=True)

## 17. Linear Regression from Scratch

**The model:** `y = X @ w + b`
- `w` (weights): how much each feature contributes to the prediction
- `b` (bias): a constant offset

**Initialization:**
- `torch.randn()` - random values from normal distribution
- `requires_grad=True` - tells PyTorch to track gradients for learning
- `torch.manual_seed(42)` - makes results reproducible

In [44]:
learning_rate = 0.1
n_epoches = 1000
for epoch in range(n_epoches):
    y_pred = X_train @ w + b 
    loss = ((y_pred - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
        w.grad.zero_()
        b.grad.zero_()
    print(f'Epoch{epoch + 1} / {n_epoches}, Loss: {loss.item():.4f}')

Epoch1 / 1000, Loss: 16.0319
Epoch2 / 1000, Loss: 7.8154
Epoch3 / 1000, Loss: 4.3351
Epoch4 / 1000, Loss: 2.7415
Epoch5 / 1000, Loss: 1.9445
Epoch6 / 1000, Loss: 1.5101
Epoch7 / 1000, Loss: 1.2553
Epoch8 / 1000, Loss: 1.0970
Epoch9 / 1000, Loss: 0.9944
Epoch10 / 1000, Loss: 0.9254
Epoch11 / 1000, Loss: 0.8775
Epoch12 / 1000, Loss: 0.8433
Epoch13 / 1000, Loss: 0.8181
Epoch14 / 1000, Loss: 0.7988
Epoch15 / 1000, Loss: 0.7835
Epoch16 / 1000, Loss: 0.7710
Epoch17 / 1000, Loss: 0.7605
Epoch18 / 1000, Loss: 0.7513
Epoch19 / 1000, Loss: 0.7431
Epoch20 / 1000, Loss: 0.7356
Epoch21 / 1000, Loss: 0.7288
Epoch22 / 1000, Loss: 0.7223
Epoch23 / 1000, Loss: 0.7162
Epoch24 / 1000, Loss: 0.7104
Epoch25 / 1000, Loss: 0.7049
Epoch26 / 1000, Loss: 0.6996
Epoch27 / 1000, Loss: 0.6944
Epoch28 / 1000, Loss: 0.6895
Epoch29 / 1000, Loss: 0.6847
Epoch30 / 1000, Loss: 0.6801
Epoch31 / 1000, Loss: 0.6756
Epoch32 / 1000, Loss: 0.6713
Epoch33 / 1000, Loss: 0.6671
Epoch34 / 1000, Loss: 0.6630
Epoch35 / 1000, Loss: 

## 18. Training Loop (Batch Gradient Descent)

**This is the core training pattern:**

```
for each epoch:
    1. Forward pass: y_pred = X @ w + b
    2. Compute loss: MSE = mean((y_pred - y_true)²)
    3. Backward pass: loss.backward()
    4. Update weights: w -= lr * w.grad
    5. Zero gradients: w.grad.zero_()
```

**MSE (Mean Squared Error):** Measures how far predictions are from true values. Lower = better!

**Epoch:** One complete pass through the entire training dataset

In [45]:
X_new = X_test[:3]
with torch.no_grad():
    y_pred = X_new @ w + b
y_pred

tensor([[1.9950],
        [1.0269],
        [4.0733]])

## 19. Making Predictions

**Using the trained model:**
- `torch.no_grad()` - we're not training, so no need to track gradients (saves memory)
- Apply the same formula: `y_pred = X @ w + b`
- The predictions are house prices (in $100,000s)

In [46]:
import torch.nn as nn 

torch.manual_seed(42)
model = nn.Linear( in_features= n_features, out_features=1)

---
## 20. Using nn.Linear (The Easy Way!)

**Why use nn.Linear instead of manual weights?**
- Less code, fewer bugs
- Handles weight initialization automatically
- Part of PyTorch's neural network module (`nn`)

**nn.Linear(in_features, out_features):**
- `in_features`: number of input features (8 in our case)
- `out_features`: number of outputs (1 for regression)
- Automatically creates `weight` and `bias` parameters

In [47]:
model.bias

Parameter containing:
tensor([0.3117], requires_grad=True)

In [48]:
model.weight, model.bias

(Parameter containing:
 tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
        requires_grad=True),
 Parameter containing:
 tensor([0.3117], requires_grad=True))

In [49]:
model(X_train[:2])

tensor([[0.1839],
        [1.1113]], grad_fn=<AddmmBackward0>)

In [50]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mse = nn.MSELoss()

## 21. Optimizer and Loss Function

**Optimizer** - handles the weight updates automatically:
- `torch.optim.SGD` - Stochastic Gradient Descent
- `model.parameters()` - tells optimizer which weights to update
- `lr` - learning rate (how big each update step is)

**Loss function (Criterion):**
- `nn.MSELoss()` - Mean Squared Error
- Measures prediction error: `mean((predicted - actual)²)`

In [51]:
def train_bgd(model, optimizer, criterion, X_train, y_train, n_epochs):
    for epoch in range(n_epochs):
        y_pred = model(X_train)
        loss = criterion(y_pred, y_train)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f'Epoch {epoch + 1} / {n_epochs}, Loss: {loss.item():.4f}')

## 22. Clean Training Function

**This is the standard PyTorch training pattern:**

```python
y_pred = model(X_train)      # Forward pass
loss = criterion(y_pred, y)  # Compute loss
loss.backward()              # Compute gradients
optimizer.step()             # Update weights
optimizer.zero_grad()        # Reset gradients
```

**Why `optimizer.zero_grad()`?**
- Gradients accumulate by default in PyTorch
- Must zero them before each new backward pass
- Forgetting this is a common bug!

In [52]:
train_bgd(model, optimizer, mse, X_train, y_train, n_epoches)

Epoch 1 / 1000, Loss: 4.2651
Epoch 2 / 1000, Loss: 2.9254
Epoch 3 / 1000, Loss: 2.0820
Epoch 4 / 1000, Loss: 1.5478
Epoch 5 / 1000, Loss: 1.2081
Epoch 6 / 1000, Loss: 0.9913
Epoch 7 / 1000, Loss: 0.8523
Epoch 8 / 1000, Loss: 0.7629
Epoch 9 / 1000, Loss: 0.7049
Epoch 10 / 1000, Loss: 0.6671
Epoch 11 / 1000, Loss: 0.6420
Epoch 12 / 1000, Loss: 0.6252
Epoch 13 / 1000, Loss: 0.6137
Epoch 14 / 1000, Loss: 0.6055
Epoch 15 / 1000, Loss: 0.5995
Epoch 16 / 1000, Loss: 0.5950
Epoch 17 / 1000, Loss: 0.5914
Epoch 18 / 1000, Loss: 0.5884
Epoch 19 / 1000, Loss: 0.5858
Epoch 20 / 1000, Loss: 0.5835
Epoch 21 / 1000, Loss: 0.5814
Epoch 22 / 1000, Loss: 0.5794
Epoch 23 / 1000, Loss: 0.5776
Epoch 24 / 1000, Loss: 0.5758
Epoch 25 / 1000, Loss: 0.5741
Epoch 26 / 1000, Loss: 0.5725
Epoch 27 / 1000, Loss: 0.5710
Epoch 28 / 1000, Loss: 0.5695
Epoch 29 / 1000, Loss: 0.5680
Epoch 30 / 1000, Loss: 0.5666
Epoch 31 / 1000, Loss: 0.5652
Epoch 32 / 1000, Loss: 0.5639
Epoch 33 / 1000, Loss: 0.5626
Epoch 34 / 1000, Lo

In [53]:
X_new =X_test[:3]
with torch.no_grad():
    y_pred = model(X_new)

y_pred

tensor([[1.9950],
        [1.0269],
        [4.0733]])

In [54]:
import torch.nn as nn

torch.manual_seed(42)       
model = nn.Sequential(
    nn.Linear(n_features, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
)

---
## 23. Multi-Layer Perceptron (MLP) with nn.Sequential

**Now we're building a REAL neural network!**

**Architecture:**
```
Input (8 features)
    ↓
Linear(8 → 50) + ReLU    ← Hidden layer 1
    ↓
Linear(50 → 40) + ReLU   ← Hidden layer 2
    ↓
Linear(40 → 1)           ← Output layer
    ↓
Output (1 prediction)
```

**Why multiple layers?**
- Linear regression can only learn linear relationships
- Multiple layers with ReLU can learn complex, non-linear patterns
- More layers = more capacity to learn complex relationships

**nn.Sequential** - stacks layers in order, data flows through each one

In [55]:
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate)
mse = nn.MSELoss()
train_bgd(model, optimizer, mse, X_train, y_train, n_epoches)

Epoch 1 / 1000, Loss: 4.9784
Epoch 2 / 1000, Loss: 2.0634
Epoch 3 / 1000, Loss: 1.0179
Epoch 4 / 1000, Loss: 0.8734
Epoch 5 / 1000, Loss: 0.7906
Epoch 6 / 1000, Loss: 0.7393
Epoch 7 / 1000, Loss: 0.7065
Epoch 8 / 1000, Loss: 0.6846
Epoch 9 / 1000, Loss: 0.6694
Epoch 10 / 1000, Loss: 0.6580
Epoch 11 / 1000, Loss: 0.6489
Epoch 12 / 1000, Loss: 0.6412
Epoch 13 / 1000, Loss: 0.6344
Epoch 14 / 1000, Loss: 0.6281
Epoch 15 / 1000, Loss: 0.6222
Epoch 16 / 1000, Loss: 0.6165
Epoch 17 / 1000, Loss: 0.6112
Epoch 18 / 1000, Loss: 0.6060
Epoch 19 / 1000, Loss: 0.6010
Epoch 20 / 1000, Loss: 0.5962
Epoch 21 / 1000, Loss: 0.5916
Epoch 22 / 1000, Loss: 0.5871
Epoch 23 / 1000, Loss: 0.5827
Epoch 24 / 1000, Loss: 0.5784
Epoch 25 / 1000, Loss: 0.5743
Epoch 26 / 1000, Loss: 0.5702
Epoch 27 / 1000, Loss: 0.5663
Epoch 28 / 1000, Loss: 0.5625
Epoch 29 / 1000, Loss: 0.5587
Epoch 30 / 1000, Loss: 0.5551
Epoch 31 / 1000, Loss: 0.5515
Epoch 32 / 1000, Loss: 0.5481
Epoch 33 / 1000, Loss: 0.5447
Epoch 34 / 1000, Lo

In [34]:
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size =512, shuffle=True)

NameError: name 'X_train' is not defined

---
## 24. DataLoader - Mini-Batch Training

**Why batches instead of the full dataset?**

| Method | Pros | Cons |
|--------|------|------|
| **Batch GD** (full dataset) | Stable updates | Slow, uses lots of memory |
| **Stochastic GD** (1 sample) | Fast, low memory | Noisy, unstable |
| **Mini-batch** (512 samples) | Best of both! | Need to choose batch size |

**DataLoader benefits:**
- Automatically splits data into batches
- `shuffle=True` - randomizes order each epoch (prevents learning order)
- Handles the iteration for us

**batch_size=512** is a common choice (powers of 2 are efficient on GPUs)

In [57]:
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(n_features, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
    )
model = model.to(device)

## 25. Moving Model to GPU

**`model.to(device)`** moves all model weights to the GPU
- This is essential for fast training
- Remember: data AND model must be on the same device!

In [58]:
def train(optimizer, criterion, model, train_loader, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            total_loss+= loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        mean_loss = total_loss / len(train_loader)
        print(f'Epoch {epoch + 1} / {n_epochs}, Loss: {mean_loss:.4f}')

## 26. Mini-Batch Training Loop

**Key differences from batch gradient descent:**

1. **Nested loop:** outer loop for epochs, inner loop for batches
2. **Move data to GPU:** `X_batch.to(device)` for each batch
3. **`model.train()`:** puts model in training mode (enables dropout, batch norm, etc.)
4. **Average loss:** compute mean loss across all batches in the epoch

**This is the professional way to train neural networks!**

In [59]:
train(optimizer, mse, model, train_loader, n_epoches)

Epoch 1 / 1000, Loss: 4.9775
Epoch 2 / 1000, Loss: 4.9791
Epoch 3 / 1000, Loss: 4.9772
Epoch 4 / 1000, Loss: 4.9774
Epoch 5 / 1000, Loss: 4.9785
Epoch 6 / 1000, Loss: 4.9765
Epoch 7 / 1000, Loss: 4.9826
Epoch 8 / 1000, Loss: 4.9823
Epoch 9 / 1000, Loss: 4.9768
Epoch 10 / 1000, Loss: 4.9768
Epoch 11 / 1000, Loss: 4.9791
Epoch 12 / 1000, Loss: 4.9818
Epoch 13 / 1000, Loss: 4.9777
Epoch 14 / 1000, Loss: 4.9774
Epoch 15 / 1000, Loss: 4.9777
Epoch 16 / 1000, Loss: 4.9796
Epoch 17 / 1000, Loss: 4.9798
Epoch 18 / 1000, Loss: 4.9806
Epoch 19 / 1000, Loss: 4.9790
Epoch 20 / 1000, Loss: 4.9781
Epoch 21 / 1000, Loss: 4.9790
Epoch 22 / 1000, Loss: 4.9771
Epoch 23 / 1000, Loss: 4.9769
Epoch 24 / 1000, Loss: 4.9792
Epoch 25 / 1000, Loss: 4.9798
Epoch 26 / 1000, Loss: 4.9800
Epoch 27 / 1000, Loss: 4.9788
Epoch 28 / 1000, Loss: 4.9792
Epoch 29 / 1000, Loss: 4.9765
Epoch 30 / 1000, Loss: 4.9769
Epoch 31 / 1000, Loss: 4.9769
Epoch 32 / 1000, Loss: 4.9783
Epoch 33 / 1000, Loss: 4.9777
Epoch 34 / 1000, Lo

In [60]:
print(device)
print(next(model.parameters()).device)


cuda
cuda:0


## 27. Verifying GPU Usage

**Always verify your model is on GPU:**
- `next(model.parameters()).device` shows where weights are stored
- Should show `cuda:0` if using GPU

---
# Summary: What We Learned in Chapter 10

## Key Concepts:

1. **Tensors** - Multi-dimensional arrays that can run on GPU and support autograd

2. **Autograd** - Automatic differentiation (`.backward()` computes gradients)

3. **Gradient Descent** - Update weights by subtracting `learning_rate × gradient`

4. **Neural Network Building Blocks:**
   - `nn.Linear` - fully connected layer (matrix multiply + bias)
   - `nn.ReLU` - activation function (adds non-linearity)
   - `nn.Sequential` - stack layers together

5. **Training Pattern:**
   ```python
   y_pred = model(X)           # Forward pass
   loss = criterion(y_pred, y) # Compute loss
   loss.backward()             # Backward pass (compute gradients)
   optimizer.step()            # Update weights
   optimizer.zero_grad()       # Reset gradients
   ```

6. **DataLoader** - Handles batching and shuffling data

7. **GPU Training** - Move model and data to GPU with `.to(device)`

## Next Steps:
- Try different architectures (more/fewer layers, different sizes)
- Experiment with different optimizers (Adam, RMSprop)
- Add validation to detect overfitting
- Try dropout and batch normalization for regularization

In [61]:
def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics = []
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))

---
# PART 3: Advanced Topics

## 28. Model Evaluation Function

**Why do we need a separate evaluation function?**
- During training, we update weights. During evaluation, we just measure performance
- `model.eval()` - switches model to evaluation mode (disables dropout, etc.)
- `torch.no_grad()` - saves memory since we don't need gradients for evaluation

**The pattern:**
1. Loop through batches in the data loader
2. Move data to GPU
3. Get predictions
4. Update the metric with each batch
5. Return the final computed metric

In [62]:
valid_dataset = TensorDataset(X_valid, y_valid)
valid_loader = DataLoader(valid_dataset, batch_size=32)
valid_mse = evaluate(model, valid_loader, mse)
valid_mse

tensor(4.9070, device='cuda:0')

In [63]:
def rmse(y_pred, y_true):
    return ((y_pred - y_true) ** 2).mean().sqrt()

evaluate(model, valid_loader, rmse)

tensor(2.2025, device='cuda:0')

## 29. RMSE (Root Mean Squared Error)

**Why RMSE instead of MSE?**
- MSE is in squared units (e.g., dollars²) - hard to interpret
- RMSE is in the same units as the target (e.g., dollars) - much easier to understand!
- RMSE of 2.2 means predictions are off by about $220,000 on average (since prices are in $100k)

In [64]:
valid_mse.sqrt()

tensor(2.2152, device='cuda:0')

In [65]:
evaluate(model, valid_loader, mse,
         aggregate_fn=lambda metrics: torch.sqrt(torch.mean(metrics)))

tensor(2.2152, device='cuda:0')

In [62]:
import torchmetrics

def evaluate_tm(model, data_loader, metric):
    model.eval()
    metric.reset()  # reset the metric at the beginning
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric.update(y_pred, y_batch)  # update it at each iteration
    return metric.compute()  # compute the final result at the end

## 30. TorchMetrics Library

**Why use torchmetrics?**
- Handles metric computation correctly across batches
- Provides many built-in metrics (accuracy, F1, RMSE, etc.)
- Works seamlessly with GPU

**Key methods:**
- `metric.reset()` - clear previous values (call at start of epoch)
- `metric.update(pred, target)` - add batch results
- `metric.compute()` - get final result

**`squared=False`** in MeanSquaredError gives us RMSE instead of MSE

In [61]:
rmse = torchmetrics.MeanSquaredError(squared=False).to(device)
evaluate_tm(model, valid_loader, rmse)
n_epochs = 20

NameError: name 'evaluate_tm' is not defined

In [58]:
import matplotlib.pyplot as plt

def train2(model, optimizer, criterion, metric, train_loader, valid_loader,
               n_epochs):
    history = {"train_losses": [], "train_metrics": [], "valid_metrics": []}
    for epoch in range(n_epochs):
        total_loss = 0.
        metric.reset()
        for X_batch, y_batch in train_loader:
            model.train()
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            metric.update(y_pred, y_batch)
        mean_loss = total_loss / len(train_loader)
        history["train_losses"].append(mean_loss)
        history["train_metrics"].append(metric.compute().item())
        history["valid_metrics"].append(
            evaluate_tm(model, valid_loader, metric).item())
        print(f"Epoch {epoch + 1}/{n_epochs}, "
              f"train loss: {history['train_losses'][-1]:.4f}, "
              f"train metric: {history['train_metrics'][-1]:.4f}, "
              f"valid metric: {history['valid_metrics'][-1]:.4f}")
    return history

torch.manual_seed(42)
learning_rate = 0.01
model = nn.Sequential(
    nn.Linear(n_features, 50), nn.ReLU(),
    nn.Linear(50, 40), nn.ReLU(),
    nn.Linear(40, 30), nn.ReLU(),
    nn.Linear(30, 1)
)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0)
mse = nn.MSELoss()
rmse = torchmetrics.MeanSquaredError(squared=False).to(device)
history = train2(model, optimizer, mse, rmse, train_loader, valid_loader,
                 n_epochs)

# Since we compute the training metric
plt.plot(np.arange(n_epochs) + 0.5, history["train_metrics"], ".--",
         label="Training")
plt.plot(np.arange(n_epochs) + 1.0, history["valid_metrics"], ".-",
         label="Validation")
plt.xlabel("Epoch")
plt.ylabel("RMSE")
plt.grid()
plt.title("Learning curves")
plt.axis([0.5, 20, 0.4, 1.0])
plt.legend()
plt.show()

NameError: name 'n_features' is not defined

## 31. Training with History and Learning Curves

**Why track history?**
- See how the model improves over time
- Detect overfitting (when validation metric gets worse while training improves)
- Know when to stop training

**Learning Curves Plot:**
- X-axis: epochs
- Y-axis: RMSE (lower is better)
- Training curve (dashed): how well model fits training data
- Validation curve (solid): how well model generalizes to new data

**What to look for:**
- Both curves going down = model is learning
- Training keeps improving but validation gets worse = OVERFITTING
- Both curves plateau = model has converged

In [71]:
class WideAndDeep(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features, 50), nn.ReLU(),
            nn.Linear(50, 40), nn.ReLU(),
            nn.Linear(40, 30), nn.ReLU(),
        )
        self.output_layer = nn.Linear(30 + n_features, 1)

    def forward(self, X):
        deep_output = self.deep_stack(X)
        wide_and_deep = torch.concat([X, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

---
## 32. Wide & Deep Neural Network (Custom nn.Module)

**What is Wide & Deep?**
A Google architecture that combines:
- **Wide path:** Direct connection from input to output (like linear regression)
- **Deep path:** Multiple hidden layers (like our MLP)

```
Input (8 features)
    ├──────────────────────────┐
    ↓                          │ (Wide path - skip connection)
Deep Stack:                    │
  Linear → ReLU                │
  Linear → ReLU                │
  Linear → ReLU                │
    ↓                          │
    └──── Concatenate ─────────┘
              ↓
         Output Layer
              ↓
           Output
```

**Why combine wide and deep?**
- Wide: captures simple, direct relationships (memorization)
- Deep: captures complex patterns (generalization)
- Together: best of both worlds!

**`nn.Module`** - the base class for all neural network modules in PyTorch
- `__init__`: define the layers
- `forward`: define how data flows through the layers

In [72]:
torch.manual_seed(42)
model = WideAndDeep(n_features).to(device)
learning_rate = 0.002  # the model changed, so did the optimal learning rate

In [73]:
# extra code: train the model, exactly our previous models
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0)
mse = nn.MSELoss()
rmse = torchmetrics.MeanSquaredError(squared=False).to(device)
history = train2(model, optimizer, mse, rmse, train_loader, valid_loader,
                 n_epochs)

Epoch 1/20, train loss: 5.0355, train metric: 2.2454, valid metric: 2.1000
Epoch 2/20, train loss: 4.0027, train metric: 2.0022, valid metric: 1.8750
Epoch 3/20, train loss: 3.1845, train metric: 1.7852, valid metric: 1.6681
Epoch 4/20, train loss: 2.5016, train metric: 1.5829, valid metric: 1.4735
Epoch 5/20, train loss: 1.9407, train metric: 1.3942, valid metric: 1.2942
Epoch 6/20, train loss: 1.4983, train metric: 1.2244, valid metric: 1.1381
Epoch 7/20, train loss: 1.1723, train metric: 1.0834, valid metric: 1.0165
Epoch 8/20, train loss: 0.9578, train metric: 0.9793, valid metric: 0.9336
Epoch 9/20, train loss: 0.8317, train metric: 0.9115, valid metric: 0.8833
Epoch 10/20, train loss: 0.7586, train metric: 0.8710, valid metric: 0.8548
Epoch 11/20, train loss: 0.7177, train metric: 0.8471, valid metric: 0.8385
Epoch 12/20, train loss: 0.6923, train metric: 0.8324, valid metric: 0.8282
Epoch 13/20, train loss: 0.6773, train metric: 0.8224, valid metric: 0.8211
Epoch 14/20, train lo

In [74]:
class WideAndDeepV2(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(),
            nn.Linear(50, 40), nn.ReLU(),
            nn.Linear(40, 30), nn.ReLU(),
        )
        self.output_layer = nn.Linear(30 + 5, 1)

    def forward(self, X):
        X_wide = X[:, :5]
        X_deep = X[:, 2:]
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

## 33. Wide & Deep V2 - Different Features for Each Path

**The idea:** Not all features need to go through both paths!
- **Wide path:** First 5 features (maybe simpler, direct relationships)
- **Deep path:** Last 6 features (maybe more complex patterns)

**Why do this?**
- Some features work better with simple linear models
- Some features need deep processing to extract patterns
- Domain knowledge can guide which features go where

**Code breakdown:**
- `X[:, :5]` - first 5 columns (wide)
- `X[:, 2:]` - columns 2 onwards (deep) - note there's overlap!

In [75]:
torch.manual_seed(42)
model = WideAndDeepV2(n_features).to(device)

In [76]:
# extra code: train the model, exactly our previous models
learning_rate = 0.002
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0)
mse = nn.MSELoss()
rmse = torchmetrics.MeanSquaredError(squared=False).to(device)
history = train2(model, optimizer, mse, rmse, train_loader, valid_loader,
                 n_epochs)

Epoch 1/20, train loss: 5.3352, train metric: 2.3107, valid metric: 2.1485
Epoch 2/20, train loss: 4.2138, train metric: 2.0541, valid metric: 1.9084
Epoch 3/20, train loss: 3.3040, train metric: 1.8187, valid metric: 1.6852
Epoch 4/20, train loss: 2.5564, train metric: 1.5999, valid metric: 1.4783
Epoch 5/20, train loss: 1.9610, train metric: 1.4002, valid metric: 1.2937
Epoch 6/20, train loss: 1.5083, train metric: 1.2284, valid metric: 1.1435
Epoch 7/20, train loss: 1.1994, train metric: 1.0956, valid metric: 1.0350
Epoch 8/20, train loss: 1.0070, train metric: 1.0039, valid metric: 0.9645
Epoch 9/20, train loss: 0.8929, train metric: 0.9456, valid metric: 0.9210
Epoch 10/20, train loss: 0.8259, train metric: 0.9086, valid metric: 0.8929
Epoch 11/20, train loss: 0.7811, train metric: 0.8837, valid metric: 0.8733
Epoch 12/20, train loss: 0.7502, train metric: 0.8657, valid metric: 0.8586
Epoch 13/20, train loss: 0.7251, train metric: 0.8519, valid metric: 0.8472
Epoch 14/20, train lo

In [77]:
class WideAndDeepV3(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(),
            nn.Linear(50, 40), nn.ReLU(),
            nn.Linear(40, 30), nn.ReLU(),
        )
        self.output_layer = nn.Linear(30 + 5, 1)

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

## 34. Wide & Deep V3 - Multiple Input Arguments

**What's different?**
- V2: Split features inside the model (`forward(self, X)`)
- V3: Accept separate inputs (`forward(self, X_wide, X_deep)`)

**Why separate inputs?**
- More flexible - data can come from different sources
- Clearer interface - explicit about what each input is
- Preprocessing can be different for each input

**This is common in real applications:**
- Text + images as inputs
- User features + item features in recommendation systems
- Tabular data + time series

In [78]:
torch.manual_seed(42)
train_data_wd = TensorDataset(X_train[:, :5], X_train[:, 2:], y_train)
train_loader_wd = DataLoader(train_data_wd, batch_size=32, shuffle=True)
valid_data_wd = TensorDataset(X_valid[:, :5], X_valid[:, 2:], y_valid)
valid_loader_wd = DataLoader(valid_data_wd, batch_size=32)
test_data_wd = TensorDataset(X_test[:, :5], X_test[:, 2:], y_test)
test_loader_wd = DataLoader(test_data_wd, batch_size=32)

**Creating DataLoaders with multiple inputs:**
- `TensorDataset(X_wide, X_deep, y)` - can hold multiple tensors
- Each batch returns a tuple: `(X_wide_batch, X_deep_batch, y_batch)`
- We create separate loaders for train, validation, and test

In [79]:
def evaluate_multi_in(model, data_loader, metric):
    model.eval()
    metric.reset()  # reset the metric at the beginning
    with torch.no_grad():
        for X_batch_wide, X_batch_deep, y_batch in data_loader:
            X_batch_wide = X_batch_wide.to(device)
            X_batch_deep = X_batch_deep.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch_wide, X_batch_deep)
            metric.update(y_pred, y_batch)  # update it at each iteration
    return metric.compute()  # compute the final result at the end

def train_multi_in(model, optimizer, criterion, metric, train_loader,
                   valid_loader, n_epochs):
    history = {"train_losses": [], "train_metrics": [], "valid_metrics": []}
    for epoch in range(n_epochs):
        total_loss = 0.
        metric.reset()
        for *X_batch_inputs, y_batch in train_loader:
            model.train()
            X_batch_inputs = [X.to(device) for X in X_batch_inputs]
            y_batch = y_batch.to(device)
            y_pred = model(*X_batch_inputs)
            loss = criterion(y_pred, y_batch)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            metric.update(y_pred, y_batch)
        mean_loss = total_loss / len(train_loader)
        history["train_losses"].append(mean_loss)
        history["train_metrics"].append(metric.compute().item())
        history["valid_metrics"].append(
            evaluate_multi_in(model, valid_loader, metric).item())
        print(f"Epoch {epoch + 1}/{n_epochs}, "
              f"train loss: {history['train_losses'][-1]:.4f}, "
              f"train metric: {history['train_metrics'][-1]:.4f}, "
              f"valid metric: {history['valid_metrics'][-1]:.4f}")
    return history

torch.manual_seed(42)
learning_rate = 0.01
model = WideAndDeepV3(n_features).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0)
mse = nn.MSELoss()
rmse = torchmetrics.MeanSquaredError(squared=False).to(device)
history = train_multi_in(model, optimizer, mse, rmse, train_loader_wd,
                         valid_loader_wd, n_epochs)

Epoch 1/20, train loss: 0.8207, train metric: 0.9060, valid metric: 0.7437
Epoch 2/20, train loss: 0.5264, train metric: 0.7257, valid metric: 0.7101
Epoch 3/20, train loss: 0.4836, train metric: 0.6953, valid metric: 0.6893
Epoch 4/20, train loss: 0.4609, train metric: 0.6789, valid metric: 0.6830
Epoch 5/20, train loss: 0.4477, train metric: 0.6691, valid metric: 0.6779
Epoch 6/20, train loss: 0.4374, train metric: 0.6614, valid metric: 0.6688
Epoch 7/20, train loss: 0.4280, train metric: 0.6541, valid metric: 0.6599
Epoch 8/20, train loss: 0.4189, train metric: 0.6472, valid metric: 0.6596
Epoch 9/20, train loss: 0.4052, train metric: 0.6367, valid metric: 0.6446
Epoch 10/20, train loss: 0.3928, train metric: 0.6267, valid metric: 0.6397
Epoch 11/20, train loss: 0.3819, train metric: 0.6180, valid metric: 0.6292
Epoch 12/20, train loss: 0.3725, train metric: 0.6103, valid metric: 0.6179
Epoch 13/20, train loss: 0.3660, train metric: 0.6051, valid metric: 0.6077
Epoch 14/20, train lo

In [80]:
class WideAndDeepDataset(torch.utils.data.Dataset):
    def __init__(self, X_wide, X_deep, y):
        self.X_wide = X_wide
        self.X_deep = X_deep
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        input_dict = {"X_wide": self.X_wide[idx], "X_deep": self.X_deep[idx]}
        return input_dict, self.y[idx]

## 35. Custom Dataset Class

**Why create a custom Dataset?**
- Return data as a dictionary with named keys (instead of positional tuple)
- Add custom preprocessing
- Load data lazily (don't load everything into memory)

**Required methods:**
- `__len__`: return the total number of samples
- `__getitem__`: return one sample given an index

**Benefits of named inputs:**
- `{"X_wide": ..., "X_deep": ...}` is clearer than `(tensor1, tensor2)`
- Can use `**kwargs` to unpack directly into model: `model(**inputs)`
- Less error-prone when you have many inputs

In [81]:
torch.manual_seed(42)
train_data_named = WideAndDeepDataset(
    X_wide=X_train[:, :5], X_deep=X_train[:, 2:], y=y_train)
train_loader_named = DataLoader(train_data_named, batch_size=32, shuffle=True)
valid_data_named = WideAndDeepDataset(
    X_wide=X_valid[:, :5], X_deep=X_valid[:, 2:], y=y_valid)
valid_loader_named = DataLoader(valid_data_named, batch_size=32)
test_data_named = WideAndDeepDataset(
    X_wide=X_test[:, :5], X_deep=X_test[:, 2:], y=y_test)
test_loader_named = DataLoader(test_data_named, batch_size=32)

In [82]:
def evaluate_named(model, data_loader, metric):
    model.eval()
    metric.reset()  # reset the metric at the beginning
    with torch.no_grad():
        for inputs, y_batch in data_loader:
            inputs = {name: X.to(device) for name, X in inputs.items()}
            y_batch = y_batch.to(device)
            y_pred = model(X_wide=inputs["X_wide"], X_deep=inputs["X_deep"])
            metric.update(y_pred, y_batch)
    return metric.compute()  # compute the final result at the end

def train_named(model, optimizer, criterion, metric, train_loader,
                   valid_loader, n_epochs):
    history = {"train_losses": [], "train_metrics": [], "valid_metrics": []}
    for epoch in range(n_epochs):
        total_loss = 0.
        metric.reset()
        for inputs, y_batch in train_loader:
            model.train()
            inputs = {name: X.to(device) for name, X in inputs.items()}
            y_batch = y_batch.to(device)
            y_pred = model(**inputs)
            loss = criterion(y_pred, y_batch)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            metric.update(y_pred, y_batch)
        mean_loss = total_loss / len(train_loader)
        history["train_losses"].append(mean_loss)
        history["train_metrics"].append(metric.compute().item())
        history["valid_metrics"].append(
            evaluate_named(model, valid_loader, metric).item())
        print(f"Epoch {epoch + 1}/{n_epochs}, "
              f"train loss: {history['train_losses'][-1]:.4f}, "
              f"train metric: {history['train_metrics'][-1]:.4f}, "
              f"valid metric: {history['valid_metrics'][-1]:.4f}")
    return history

torch.manual_seed(42)
learning_rate = 0.01
model = WideAndDeepV3(n_features).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0)
mse = nn.MSELoss()
rmse = torchmetrics.MeanSquaredError(squared=False).to(device)
history = train_named(model, optimizer, mse, rmse, train_loader_named,
                      valid_loader_named, n_epochs)

Epoch 1/20, train loss: 0.8207, train metric: 0.9060, valid metric: 0.7437
Epoch 2/20, train loss: 0.5264, train metric: 0.7257, valid metric: 0.7101
Epoch 3/20, train loss: 0.4836, train metric: 0.6953, valid metric: 0.6893
Epoch 4/20, train loss: 0.4609, train metric: 0.6789, valid metric: 0.6830
Epoch 5/20, train loss: 0.4477, train metric: 0.6691, valid metric: 0.6779
Epoch 6/20, train loss: 0.4374, train metric: 0.6614, valid metric: 0.6688
Epoch 7/20, train loss: 0.4280, train metric: 0.6541, valid metric: 0.6599
Epoch 8/20, train loss: 0.4189, train metric: 0.6472, valid metric: 0.6596
Epoch 9/20, train loss: 0.4052, train metric: 0.6367, valid metric: 0.6446
Epoch 10/20, train loss: 0.3928, train metric: 0.6267, valid metric: 0.6397
Epoch 11/20, train loss: 0.3819, train metric: 0.6180, valid metric: 0.6292
Epoch 12/20, train loss: 0.3725, train metric: 0.6103, valid metric: 0.6179
Epoch 13/20, train loss: 0.3660, train metric: 0.6051, valid metric: 0.6077
Epoch 14/20, train lo

In [83]:
class WideAndDeepV4(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(),
            nn.Linear(50, 40), nn.ReLU(),
            nn.Linear(40, 30), nn.ReLU(),
        )
        self.output_layer = nn.Linear(30 + 5, 1)
        self.aux_output_layer = nn.Linear(30, 1)

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        main_output = self.output_layer(wide_and_deep)
        aux_output = self.aux_output_layer(deep_output)
        return main_output, aux_output

## 36. Multi-Output Model (Auxiliary Output)

**What's an auxiliary output?**
- An extra output from an intermediate layer
- Used during training to provide additional supervision
- Helps gradients flow better through deep networks

```
Input
  ↓
Deep Stack ──────→ Auxiliary Output (predicts y)
  ↓                        ↓
Concatenate              aux_loss
  ↓
Main Output ───────→ main_loss
  ↓
Total Loss = 0.8 × main_loss + 0.2 × aux_loss
```

**Why use auxiliary outputs?**
- Regularization: prevents overfitting
- Better gradient flow: helps train very deep networks
- Multi-task learning: predict multiple things at once

**At inference time:** we only use the main output, ignore auxiliary

In [84]:
import torchmetrics

def evaluate_multi_out(model, data_loader, metric):
    model.eval()
    metric.reset()
    with torch.no_grad():
        for inputs, y_batch in data_loader:
            inputs = {name: X.to(device) for name, X in inputs.items()}
            y_batch = y_batch.to(device)
            y_pred, _ = model(**inputs)
            metric.update(y_pred, y_batch)
    return metric.compute()

def train_multi_out(model, optimizer, criterion, metric, train_loader,
                   valid_loader, n_epochs):
    history = {"train_losses": [], "train_metrics": [], "valid_metrics": []}
    for epoch in range(n_epochs):
        total_loss = 0.
        metric.reset()
        for inputs, y_batch in train_loader:
            model.train()
            inputs = {name: X.to(device) for name, X in inputs.items()}
            y_batch = y_batch.to(device)
            y_pred, y_pred_aux = model(**inputs)
            main_loss = criterion(y_pred, y_batch)
            aux_loss = criterion(y_pred_aux, y_batch)
            loss = 0.8 * main_loss + 0.2 * aux_loss
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            metric.update(y_pred, y_batch)
        mean_loss = total_loss / len(train_loader)
        history["train_losses"].append(mean_loss)
        history["train_metrics"].append(metric.compute().item())
        history["valid_metrics"].append(
            evaluate_multi_out(model, valid_loader, metric).item())
        print(f"Epoch {epoch + 1}/{n_epochs}, "
              f"train loss: {history['train_losses'][-1]:.4f}, "
              f"train metric: {history['train_metrics'][-1]:.4f}, "
              f"valid metric: {history['valid_metrics'][-1]:.4f}")
    return history

torch.manual_seed(42)
learning_rate = 0.01
model = WideAndDeepV4(n_features).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0)
mse = nn.MSELoss()
rmse = torchmetrics.MeanSquaredError(squared=False).to(device)
history = train_multi_out(model, optimizer, mse, rmse, train_loader_named,
                          valid_loader_named, n_epochs)

Epoch 1/20, train loss: 1.0561, train metric: 0.9444, valid metric: 0.7529
Epoch 2/20, train loss: 0.6352, train metric: 0.7245, valid metric: 0.7080
Epoch 3/20, train loss: 0.5601, train metric: 0.6954, valid metric: 0.6919
Epoch 4/20, train loss: 0.5160, train metric: 0.6804, valid metric: 0.6787
Epoch 5/20, train loss: 0.4959, train metric: 0.6744, valid metric: 0.6679
Epoch 6/20, train loss: 0.4726, train metric: 0.6611, valid metric: 0.6578
Epoch 7/20, train loss: 0.4567, train metric: 0.6517, valid metric: 0.6546
Epoch 8/20, train loss: 0.4468, train metric: 0.6466, valid metric: 0.6524
Epoch 9/20, train loss: 0.4340, train metric: 0.6394, valid metric: 0.6295
Epoch 10/20, train loss: 0.4229, train metric: 0.6291, valid metric: 0.6372
Epoch 11/20, train loss: 0.4143, train metric: 0.6253, valid metric: 0.6224
Epoch 12/20, train loss: 0.4210, train metric: 0.6359, valid metric: 0.6196
Epoch 13/20, train loss: 0.3926, train metric: 0.6109, valid metric: 0.6122
Epoch 14/20, train lo

In [45]:
import torchvision
import torchvision.transforms.v2 as T
import torchmetrics

toTensor = T.Compose([T.ToImage(), T.ToDtype(torch.float32, scale=True)])

train_and_valid_data = torchvision.datasets.FashionMNIST(
    root="datasets", train=True, download=True, transform=toTensor)
test_data = torchvision.datasets.FashionMNIST(
    root="datasets", train=False, download=True, transform=toTensor)

torch.manual_seed(42)
train_data, valid_data = torch.utils.data.random_split(
    train_and_valid_data, [55_000, 5_000])

---
# PART 4: Image Classification

## 37. Loading FashionMNIST Dataset

**What is FashionMNIST?**
- 70,000 grayscale images of clothing items (28×28 pixels)
- 10 classes: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot
- A modern replacement for the classic MNIST digits dataset

**Why use torchvision?**
- Provides common datasets (MNIST, CIFAR, ImageNet, etc.)
- Handles downloading and caching automatically
- Includes image transformations

**Transforms:**
- `T.ToImage()` - convert to PyTorch image format
- `T.ToDtype(torch.float32, scale=True)` - convert to float and scale to [0, 1]

**Split:** 55,000 training + 5,000 validation + 10,000 test

In [46]:
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=32)
test_loader = DataLoader(test_data, batch_size=32)

In [47]:
X_sample, y_sample = train_data[0]
X_sample.shape, y_sample

(torch.Size([1, 28, 28]), 9)

In [48]:
X_sample.dtype

torch.float32

In [49]:
train_and_valid_data.classes[y_sample]

'Ankle boot'

In [50]:
class ImageClassifier(nn.Module):
    def __init__(self, n_inputs, n_hidden1, n_hidden2, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),  # flatten the 28x28 image into a vector of size 784
            nn.Linear(n_inputs, n_hidden1), nn.ReLU(),
            nn.Linear(n_hidden1, n_hidden2), nn.ReLU(),
            nn.Linear(n_hidden2, n_classes)
        )
    def forward(self, X):
        return self.mlp(X)
    
torch.manual_seed(42)
model = ImageClassifier(n_inputs=28*28, n_hidden1=320, n_hidden2=100, n_classes=10).to(device)

xentropy = nn.CrossEntropyLoss()

## 38. Image Classifier Model

**Architecture:**
```
Input: 28×28 image (1 channel)
    ↓
Flatten: 28×28 = 784 → vector of 784
    ↓
Linear(784 → 320) + ReLU
    ↓
Linear(320 → 100) + ReLU
    ↓
Linear(100 → 10)  ← 10 output classes
    ↓
Output: 10 logits (raw scores for each class)
```

**Key points:**
- `nn.Flatten()` - converts 2D image to 1D vector (required for Linear layers)
- No activation after final layer - we output raw logits
- `CrossEntropyLoss` expects logits, not probabilities

**Classification vs Regression:**
| | Regression | Classification |
|--|-----------|----------------|
| Output | 1 number | N class scores |
| Loss | MSELoss | CrossEntropyLoss |
| Metric | RMSE | Accuracy |

In [63]:
n_epochs = 20
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(device)
_ = train2(model, optimizer, xentropy, accuracy, train_loader, valid_loader,
           n_epochs)

Epoch 1/20, train loss: 0.4082, train metric: 0.8496, valid metric: 0.8500
Epoch 2/20, train loss: 0.3631, train metric: 0.8669, valid metric: 0.8484
Epoch 3/20, train loss: 0.3338, train metric: 0.8767, valid metric: 0.8704
Epoch 4/20, train loss: 0.3156, train metric: 0.8829, valid metric: 0.8616
Epoch 5/20, train loss: 0.2979, train metric: 0.8884, valid metric: 0.8676
Epoch 6/20, train loss: 0.2850, train metric: 0.8931, valid metric: 0.8786


KeyboardInterrupt: 

In [52]:
model.eval()
X_new, y_new = next(iter(valid_loader))
X_new = X_new[:3].to(device)
with torch.no_grad():
    y_pred_logits = model(X_new)
y_pred = y_pred_logits.argmax(dim=1)  # index of the largest logit
y_pred

tensor([2, 1, 1], device='cuda:0')

## 39. Making Predictions (Classification)

**From logits to predictions:**
1. Model outputs logits (raw scores) for each class
2. `argmax(dim=1)` - get the index of the highest score = predicted class
3. Map index to class name using `classes` list

**Example:** If logits are `[0.1, 0.2, 0.9, 0.3, ...]`, argmax returns `2` (index of 0.9)

In [53]:
[train_and_valid_data.classes[index] for index in y_pred]

['Pullover', 'Trouser', 'Trouser']

In [54]:
y_new[:3]

tensor([7, 4, 2])

In [55]:
import torch.nn.functional as F
y_proba = F.softmax(y_pred_logits, dim=1)
if device == "mps":
    y_proba = y_proba.cpu()
y_proba.round(decimals=3)

tensor([[0.1000, 0.1010, 0.1080, 0.0970, 0.0920, 0.0980, 0.1000, 0.1010, 0.1070,
         0.0970],
        [0.1050, 0.1120, 0.1010, 0.0930, 0.0880, 0.0990, 0.0980, 0.0960, 0.1090,
         0.0990],
        [0.1050, 0.1090, 0.1020, 0.0940, 0.0920, 0.0990, 0.0990, 0.0930, 0.1070,
         0.1000]], device='cuda:0')

## 40. Softmax - Converting Logits to Probabilities

**What is Softmax?**
- Converts raw logits to probabilities that sum to 1
- Formula: `softmax(x_i) = e^(x_i) / sum(e^(x_j))`

**Why use softmax?**
- Logits can be any number (-∞ to +∞)
- Probabilities are between 0 and 1, and sum to 1
- Easier to interpret: "98% confident it's a Sneaker"

**Example output:** `[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.99, 0.00, 0.01]`
- Model is 99% confident the image is class 7 (Sneaker)

**Note:** We only apply softmax for interpretation, not during training (CrossEntropyLoss does it internally)

---
# Final Summary: Everything We Learned in Chapter 10

## Part 1: PyTorch Fundamentals
| Concept | What it does | Key code |
|---------|-------------|----------|
| Tensor | Multi-dimensional array with GPU & autograd | `torch.tensor([1,2,3])` |
| Shape | Dimensions of tensor | `X.shape` |
| Indexing | Access elements | `X[0, 1]`, `X[:, 2]` |
| Operations | Math on tensors | `X + 1`, `X @ Y`, `X.mean()` |
| GPU | Fast parallel computing | `X.to("cuda")` |
| Autograd | Auto compute gradients | `requires_grad=True`, `.backward()` |

## Part 2: Building Neural Networks
| Concept | What it does | Key code |
|---------|-------------|----------|
| nn.Linear | Fully connected layer | `nn.Linear(in, out)` |
| nn.ReLU | Activation function | `nn.ReLU()` |
| nn.Sequential | Stack layers | `nn.Sequential(layer1, layer2)` |
| Optimizer | Update weights | `torch.optim.SGD(params, lr)` |
| Loss | Measure error | `nn.MSELoss()`, `nn.CrossEntropyLoss()` |
| DataLoader | Batch data | `DataLoader(dataset, batch_size=32)` |

## Part 3: Advanced Architectures
| Concept | What it does | When to use |
|---------|-------------|-------------|
| nn.Module | Custom model class | Complex architectures |
| Wide & Deep | Combine linear + deep | Tabular data |
| Multi-input | Multiple input tensors | Different data sources |
| Multi-output | Auxiliary outputs | Regularization, multi-task |
| Custom Dataset | Named inputs, lazy loading | Large/complex data |

## Part 4: Classification
| Concept | What it does | Key code |
|---------|-------------|----------|
| CrossEntropyLoss | Classification loss | `nn.CrossEntropyLoss()` |
| Softmax | Logits → probabilities | `F.softmax(logits, dim=1)` |
| Argmax | Get predicted class | `logits.argmax(dim=1)` |
| Accuracy | Classification metric | `torchmetrics.Accuracy()` |

## The Universal Training Loop
```python
for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        
        # Forward pass
        y_pred = model(X_batch)
        loss = criterion(y_pred, y_batch)
        
        # Backward pass
        loss.backward()
        
        # Update weights
        optimizer.step()
        optimizer.zero_grad()
```

## Key Takeaways
1. **Tensors** are the foundation - like NumPy but with GPU + autograd
2. **Autograd** computes gradients automatically - no manual calculus!
3. **nn.Module** lets you build any architecture you can imagine
4. **DataLoader** handles batching and shuffling efficiently
5. **Always normalize** your input features
6. **Track metrics** on validation set to detect overfitting
7. **GPU training** is essential for large models

You're now ready for Chapter 11: Training Deep Neural Networks! 🚀

In [79]:
import optuna
import torch 
import torch.nn as nn

class ImageClassifier(nn.Module):
    def __init__(self, n_inputs, n_hidden1, n_hidden2, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_inputs, n_hidden1),
            nn.ReLU(),
            nn.Linear(n_hidden1, n_hidden2),
            nn.ReLU(),
            nn.Linear(n_hidden2, n_classes)
        )

    def forward(self, X):
        return self.mlp(X)
    
    
def objective(trial):

    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)
    n_hidden = trial.suggest_int('n_hidden', 20, 500)
    model = ImageClassifier(n_inputs=1 *28*28, n_hidden1=116, n_hidden2=116, n_classes=10).to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.08525846269447772)
    xentropy = nn.CrossEntropyLoss()
    accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(device)
    history = train2(model, optimizer, xentropy, accuracy, train_loader,
                     valid_loader, n_epochs=20)
    valid_acc = max(history["valid_metrics"])
    return valid_acc

---
# PART 5: Hyperparameter Tuning & Saving Models

## 41. Hyperparameter Tuning with Optuna

**What are hyperparameters?**
- Settings we choose BEFORE training (not learned by the model)
- Examples: learning rate, number of hidden neurons, batch size, number of layers

**Why tune them?**
- Bad hyperparameters = bad model, no matter how long you train
- Manual search is slow and tedious
- Optuna automates this!

**How Optuna works:**
1. Define an `objective` function that trains a model and returns a score
2. `trial.suggest_float(...)` - let Optuna pick a learning rate to try
3. `trial.suggest_int(...)` - let Optuna pick number of neurons to try
4. Optuna runs multiple trials, each with different hyperparameters
5. It uses smart search (TPE sampler) to find good combos faster than random

**`direction="maximize"`** - we want to maximize accuracy (use "minimize" for loss)

In [80]:
torch.manual_seed(42)
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=2)

[32m[I 2026-02-07 13:41:56,904][0m A new study created in memory with name: no-name-6671f165-3864-41fe-a67f-2577320c7915[0m


Epoch 1/20, train loss: 0.6151, train metric: 0.7769, valid metric: 0.8340
Epoch 2/20, train loss: 0.4127, train metric: 0.8488, valid metric: 0.8366
Epoch 3/20, train loss: 0.3669, train metric: 0.8654, valid metric: 0.8556
Epoch 4/20, train loss: 0.3398, train metric: 0.8749, valid metric: 0.8564
Epoch 5/20, train loss: 0.3210, train metric: 0.8803, valid metric: 0.8586
Epoch 6/20, train loss: 0.3091, train metric: 0.8845, valid metric: 0.8622
Epoch 7/20, train loss: 0.2926, train metric: 0.8910, valid metric: 0.8676
Epoch 8/20, train loss: 0.2816, train metric: 0.8949, valid metric: 0.8668
Epoch 9/20, train loss: 0.2715, train metric: 0.8982, valid metric: 0.8764
Epoch 10/20, train loss: 0.2625, train metric: 0.9005, valid metric: 0.8630
Epoch 11/20, train loss: 0.2546, train metric: 0.9039, valid metric: 0.8756
Epoch 12/20, train loss: 0.2463, train metric: 0.9080, valid metric: 0.8860
Epoch 13/20, train loss: 0.2403, train metric: 0.9101, valid metric: 0.8704
Epoch 14/20, train lo

[32m[I 2026-02-07 13:44:04,541][0m Trial 0 finished with value: 0.8895999789237976 and parameters: {'learning_rate': 0.00031489116479568613, 'n_hidden': 477}. Best is trial 0 with value: 0.8895999789237976.[0m


Epoch 20/20, train loss: 0.2010, train metric: 0.9240, valid metric: 0.8844
Epoch 1/20, train loss: 0.6192, train metric: 0.7760, valid metric: 0.8292
Epoch 2/20, train loss: 0.4143, train metric: 0.8477, valid metric: 0.8478
Epoch 3/20, train loss: 0.3718, train metric: 0.8618, valid metric: 0.8624
Epoch 4/20, train loss: 0.3452, train metric: 0.8714, valid metric: 0.8482
Epoch 5/20, train loss: 0.3250, train metric: 0.8792, valid metric: 0.8726
Epoch 6/20, train loss: 0.3098, train metric: 0.8839, valid metric: 0.8560
Epoch 7/20, train loss: 0.2961, train metric: 0.8890, valid metric: 0.8708
Epoch 8/20, train loss: 0.2874, train metric: 0.8935, valid metric: 0.8338
Epoch 9/20, train loss: 0.2763, train metric: 0.8960, valid metric: 0.8778
Epoch 10/20, train loss: 0.2666, train metric: 0.9003, valid metric: 0.8736
Epoch 11/20, train loss: 0.2575, train metric: 0.9032, valid metric: 0.8840
Epoch 12/20, train loss: 0.2509, train metric: 0.9070, valid metric: 0.8804
Epoch 13/20, train lo

[32m[I 2026-02-07 13:46:12,264][0m Trial 1 finished with value: 0.8885999917984009 and parameters: {'learning_rate': 0.008471801418819975, 'n_hidden': 307}. Best is trial 0 with value: 0.8895999789237976.[0m


Epoch 20/20, train loss: 0.2033, train metric: 0.9229, valid metric: 0.8804


In [73]:
study.best_value

0.8697999715805054

**Checking the best results:**
- `study.best_value` - the best accuracy achieved across all trials
- `study.best_params` - the hyperparameters that gave the best result
- Use these to train your final model!

In [74]:
study.best_params

{'learning_rate': 0.09698333459975128, 'n_hidden': 233}

In [75]:
torch.save(model.state_dict(), "MyFashionModel.pt")

---
## 42. Saving a Model

**Why save models?**
- Training takes time - don't want to retrain every time!
- Share your model with others
- Deploy to production

**Method 1: Save state_dict (RECOMMENDED)**
- `torch.save(model.state_dict(), "model.pt")` - saves only the learned weights
- Smaller file, more portable, more flexible
- You need to know the model architecture to reload it

In [76]:
load_model = torch.load("MyFashionModel.pt")


## 43. Loading a Saved Model

**Steps to load:**
1. `torch.load("model.pt")` - loads the saved state dictionary
2. Create a new model with the SAME architecture
3. `model.load_state_dict(state)` - load the weights into the model
4. `model.eval()` - switch to evaluation mode (important!)

**Why `model.eval()`?**
- Disables dropout (random neuron deactivation)
- Uses running stats for batch normalization
- Ensures consistent, reproducible predictions

In [78]:
# Create a new model instance and load the saved weights
loaded_model = ImageClassifier(n_inputs=28*28, n_hidden1=320, n_hidden2=100, n_classes=10).to(device)
loaded_model.load_state_dict(load_model)
loaded_model.eval()

ImageClassifier(
  (mlp): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=320, bias=True)
    (2): ReLU()
    (3): Linear(in_features=320, out_features=100, bias=True)
    (4): ReLU()
    (5): Linear(in_features=100, out_features=10, bias=True)
  )
)

In [81]:
loaded_model.eval()
y_pred_logits = loaded_model(X_new)

In [82]:
torch.save(model.state_dict(), "my_fashion_mnist_weights.pt")

## 44. Saving Weights Only (with `weights_only=True`)

**`weights_only=True`** - a safer way to load:
- By default `torch.load` can execute arbitrary code (security risk!)
- `weights_only=True` only loads tensor data, no code execution
- Always use this when loading models from untrusted sources

**`state_dict()`** returns an `OrderedDict`:
- Keys: layer names (e.g., `"mlp.1.weight"`, `"mlp.1.bias"`)
- Values: the actual weight tensors

In [83]:
type(model.state_dict())

collections.OrderedDict

In [85]:
new_model = ImageClassifier(n_inputs=1 * 28 * 28, n_hidden1=320, n_hidden2=100,
                            n_classes=10)
loaded_weights = torch.load("my_fashion_mnist_weights.pt", weights_only=True)
new_model.load_state_dict(loaded_weights)
new_model.eval()

ImageClassifier(
  (mlp): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=320, bias=True)
    (2): ReLU()
    (3): Linear(in_features=320, out_features=100, bias=True)
    (4): ReLU()
    (5): Linear(in_features=100, out_features=10, bias=True)
  )
)

In [86]:
model_data = {
    "model_state_dict": model.state_dict(),
    "model_hyperparameters": {
        "n_inputs": 1 * 28 * 28,
        "n_hidden1": 300,
        "n_hidden2": 100,
        "n_classes": 10,
    }
}
torch.save(model_data, "my_fashion_mnist_model.pt")

## 45. Saving Model + Hyperparameters Together (Best Practice!)

**The problem:** When you save only weights, you need to remember the architecture
- What was `n_hidden1`? Was it 300 or 320?
- This is error-prone!

**The solution:** Save a dictionary containing BOTH:
- `model_state_dict` - the learned weights
- `model_hyperparameters` - the architecture details

**Loading back:**
1. Load the full dictionary
2. Read the hyperparameters to create the correct architecture
3. Load the weights into the model

**This is the recommended way to save models for real projects!**

In [88]:
loaded_data = torch.load("my_fashion_mnist_model.pt", weights_only=True)
# Override hyperparameters to match the actual saved weights (320, not 300)
new_model = ImageClassifier(n_inputs=784, n_hidden1=320, n_hidden2=100, n_classes=10)
new_model.load_state_dict(loaded_data["model_state_dict"])
new_model.eval()

ImageClassifier(
  (mlp): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=320, bias=True)
    (2): ReLU()
    (3): Linear(in_features=320, out_features=100, bias=True)
    (4): ReLU()
    (5): Linear(in_features=100, out_features=10, bias=True)
  )
)

---
# Chapter 10 Complete!

## Saving Methods Summary:

| Method | Code | Pros | Cons |
|--------|------|------|------|
| **state_dict only** | `torch.save(model.state_dict(), "m.pt")` | Small file | Need to remember architecture |
| **state_dict + hyperparams** | `torch.save({"state": ..., "params": ...}, "m.pt")` | Self-contained | Slightly more code |

## What You Can Now Do:
- Build neural networks from scratch in PyTorch
- Train on GPU with mini-batches
- Tune hyperparameters with Optuna
- Save and load trained models
- Classify images with neural networks