# Lab 02: Non-Linear Model & Training Functions

In this lab, we'll build upon our baseline model by adding **non-linear activation functions** (ReLU) and creating **reusable training functions**.

**What we'll cover:**
1. Code from Lab 01 (data loading, DataLoaders, helper functions)
2. Understanding the ReLU activation function
3. Building a non-linear model
4. Creating reusable `train_step()` and `test_step()` functions
5. Training and comparing with the baseline

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install pandas matplotlib scikit-learn

## 1. Code from Lab 01

Before building our new model, we need to set up the same foundation from Lab 01. The following cell contains all the essential code:
- Import libraries
- Load FashionMNIST dataset
- Create DataLoaders
- Define accuracy and timing helper functions

Run this cell to get everything ready.

In [None]:
# ============================================================
# CODE FROM LAB 01 - Data Loading and Setup
# ============================================================

# Import libraries
import torch
from torch import nn
import torchvision
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
from timeit import default_timer as timer

print(f"PyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")

# Load FashionMNIST dataset
train_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
    target_transform=None
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

# Get class names from the dataset
class_names = train_data.classes

print(f"\nTraining samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")
print(f"Classes: {class_names}")

# Create DataLoaders
BATCH_SIZE = 32

train_dataloader = DataLoader(
    dataset=train_data,
    batch_size=BATCH_SIZE,
    shuffle=True
)

test_dataloader = DataLoader(
    dataset=test_data,
    batch_size=BATCH_SIZE,
    shuffle=False
)

print(f"\nNumber of training batches: {len(train_dataloader)}")
print(f"Number of test batches: {len(test_dataloader)}")

# Accuracy function from Lab 01
def accuracy_fn(y_true, y_pred):
    """Calculate accuracy between true and predicted labels."""
    correct = torch.eq(y_true, y_pred).sum().item()
    accuracy = (correct / len(y_true)) * 100
    return accuracy

# Timing function from Lab 01
def print_train_time(start: float, end: float):
    """Print and return training time."""
    total_time = end - start
    print(f"Train time: {total_time:.3f} seconds")
    return total_time

## 2. Understanding Non-Linearity

### The Problem with Linear-Only Networks

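In Lab 01, our baseline model used only linear layers. The mathematical problem is that **stacking linear layers with nothing in between is equivalent to a single linear layer** (biases are omitted here for brevity; they fold into one combined bias in the same way):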

```
y = W2(W1(x)) = (W2 * W1)(x) = W_combined(x)
```

This means our "deep" network can only learn linear relationships!
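
The collapse described above can be checked numerically. The following is a minimal sketch, not part of the Lab 01 code: the layer sizes (4, 3, 2) and the names `layer_1`, `layer_2`, `W_combined`, and `b_combined` are arbitrary choices for illustration. It folds two `nn.Linear` layers, including their biases, into one weight matrix and one bias vector and compares the outputs.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two stacked linear layers with small, arbitrary sizes (illustrative only)
layer_1 = nn.Linear(in_features=4, out_features=3)
layer_2 = nn.Linear(in_features=3, out_features=2)
x = torch.randn(5, 4)  # 5 random samples with 4 features each

with torch.no_grad():
    # Output of the two-layer stack
    stacked = layer_2(layer_1(x))

    # Fold both layers into a single weight matrix and bias vector
    W_combined = layer_2.weight @ layer_1.weight                # shape [2, 4]
    b_combined = layer_2.weight @ layer_1.bias + layer_2.bias   # shape [2]
    single = x @ W_combined.T + b_combined

    # With a ReLU in between, the same folding no longer reproduces the output
    with_relu = layer_2(torch.relu(layer_1(x)))

same_without_relu = torch.allclose(stacked, single, atol=1e-6)
same_with_relu = torch.allclose(with_relu, single, atol=1e-6)

print(f"Stacked linear layers match one linear layer: {same_without_relu}")
print(f"Stack with ReLU matches one linear layer: {same_with_relu}")
```

The first check should print `True` for any weights. The second is expected to be `False` whenever at least one hidden value is negative, which is the typical case with random weights.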

### The Solution: Activation Functions

By adding non-linear activation functions between layers, we allow the network to learn complex, non-linear patterns.

### ReLU (Rectified Linear Unit)

The most popular activation function is **ReLU**:

```
ReLU(x) = max(0, x)
```

- If x > 0: output = x (passes through)
- If x ≤ 0: output = 0 (clips negative values)

Let's visualize it:

In [None]:
# Visualize ReLU activation function
x = torch.linspace(-5, 5, 100)
relu = nn.ReLU()
y = relu(x)

plt.figure(figsize=(8, 5))
plt.plot(x.numpy(), y.numpy(), 'b-', linewidth=2, label='ReLU(x)')
plt.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
plt.axvline(x=0, color='k', linestyle='-', linewidth=0.5)
plt.grid(True, alpha=0.3)
plt.xlabel('x')
plt.ylabel('ReLU(x)')
plt.title('ReLU Activation Function: f(x) = max(0, x)')
plt.legend()
plt.show()

In [None]:
# Demonstrate ReLU with actual values
sample_values = torch.tensor([-3, -1, 0, 1, 3], dtype=torch.float32)
relu_output = relu(sample_values)

print("ReLU in action:")
for inp, out in zip(sample_values.numpy(), relu_output.numpy()):
    print(f"  Input: {inp:4.1f} → ReLU → Output: {out:4.1f}")

## 3. Build the Non-Linear Model (V1)

Now let's create a new model that adds ReLU activation functions between layers.

![Non-Linear Model Architecture](https://raw.githubusercontent.com/poridhiEng/lab-asset/8104ff41aaf569aa65977e43cdbadc13fc1b7a34/tensorcode/Deep-learning-with-pytorch/Computer-Vision/Lab_02/images/infra-7.svg)

The diagram above shows how our non-linear model (V1) processes image data. Each 28×28 grayscale image in a batch is first **flattened** into a 1D vector of 784 values (Input layer). This vector then passes through **linear layers with ReLU activation**: after each linear transformation, the ReLU function is applied to introduce non-linearity, allowing the model to learn more complex patterns.

![Model Architecture](https://raw.githubusercontent.com/poridhiEng/lab-asset/8104ff41aaf569aa65977e43cdbadc13fc1b7a34/tensorcode/Deep-learning-with-pytorch/Computer-Vision/Lab_02/images/infra-6.svg)

### Key Difference from V0:

**V0 (Baseline):** `Flatten → Linear → Linear`

**V1 (Non-Linear):** `Flatten → Linear → ReLU → Linear → ReLU`

The model processes data as follows:

| Layer | Input Shape | Output Shape | Description |
|-------|-------------|--------------|-------------|
| **Input** | [1, 28, 28] | - | 28×28 grayscale image |
| **Flatten** | [1, 28, 28] | [784] | Flatten the image into a 1D vector |
| **Linear 1** | [784] | [10] | First transformation |
| **ReLU** | [10] | [10] | Non-linear activation |
| **Linear 2** | [10] | [10] | Output layer |
| **ReLU** | [10] | [10] | Non-linear activation |
| **Output** | - | [10] | Class scores (non-negative, because the final ReLU clips the logits) |
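
The table lists per-sample shapes. As a rough check, the sketch below pushes a dummy batch through an equivalent `nn.Sequential` stack and records the shape after every layer. The batch size of 32 and the random input are assumptions for illustration only; note that `nn.Flatten` keeps the batch dimension, so each shape gains a leading 32.

```python
import torch
from torch import nn

# Same layer order as the table above, with 10 hidden units and 10 classes
layers = nn.Sequential(
    nn.Flatten(),
    nn.Linear(in_features=784, out_features=10),
    nn.ReLU(),
    nn.Linear(in_features=10, out_features=10),
    nn.ReLU(),
)

# Dummy batch standing in for real images: [batch, channels, height, width]
x = torch.rand(32, 1, 28, 28)
shapes = [tuple(x.shape)]
print(f"{'Input':<8} -> {tuple(x.shape)}")

with torch.inference_mode():
    for layer in layers:
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:<8} -> {tuple(x.shape)}")

# The final ReLU guarantees that no output score is negative
print(f"Smallest output value: {x.min().item():.4f}")
```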

In [None]:
class FashionMNISTModelV1(nn.Module):
    """Model with non-linear activation functions (ReLU)."""
    
    def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
        super().__init__()
        
        self.layer_stack = nn.Sequential(
            nn.Flatten(),  # Flatten: [1, 28, 28] -> [784]
            nn.Linear(in_features=input_shape, out_features=hidden_units),
            nn.ReLU(),  # Non-linearity after first linear layer
            nn.Linear(in_features=hidden_units, out_features=output_shape),
            nn.ReLU()   # Non-linearity after second linear layer
        )
    
    def forward(self, x):
        return self.layer_stack(x)

In [None]:
# Create model instance
torch.manual_seed(42)

model_1 = FashionMNISTModelV1(
    input_shape=784,          # 28*28 pixels
    hidden_units=10,          # Same as baseline for fair comparison
    output_shape=len(class_names)  # 10 classes
)

print(f"Model architecture:\n{model_1}")

## 4. Create Reusable Training Functions

In Lab 01, we wrote the training loop inline. Now let's create **reusable functions** that we can use for any model.

### Why functionalize the training loop?

- **Reusability** - Use the same functions for different models.
- **Cleaner code** - Main training code becomes much simpler.
- **Easier debugging** - Isolate and test each step.
- **Consistency** - Ensure all models are trained the same way.

### The train_step() Function

Performs a single training epoch by:
1. Setting model to training mode
2. Looping through all batches
3. Forward pass → Loss → Backward pass → Optimizer step

In [None]:
def train_step(model: torch.nn.Module,
               data_loader: torch.utils.data.DataLoader,
               loss_fn: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               accuracy_fn):
    """Performs a single training epoch.
    
    Args:
        model: The neural network model to train
        data_loader: DataLoader containing training data
        loss_fn: Loss function to optimize
        optimizer: Optimizer to update model parameters
        accuracy_fn: Function to calculate accuracy
    """
    train_loss, train_acc = 0, 0
    
    # Put model in training mode
    model.train()
    
    for batch, (X, y) in enumerate(data_loader):
        # 1. Forward pass
        y_pred = model(X)
        
        # 2. Calculate loss
        loss = loss_fn(y_pred, y)
        train_loss += loss.item()
        train_acc += accuracy_fn(y_true=y, y_pred=y_pred.argmax(dim=1))
        
        # 3. Zero gradients
        optimizer.zero_grad()
        
        # 4. Backward pass
        loss.backward()
        
        # 5. Update weights
        optimizer.step()
    
    # Calculate average loss and accuracy per epoch
    train_loss /= len(data_loader)
    train_acc /= len(data_loader)
    
    print(f"Train loss: {train_loss:.5f} | Train accuracy: {train_acc:.2f}%")

### The test_step() Function

Evaluates model on test data by:
1. Setting model to evaluation mode
2. Using inference mode (no gradients)
3. Looping through test batches
4. Accumulating loss and accuracy

In [None]:
def test_step(model: torch.nn.Module,
              data_loader: torch.utils.data.DataLoader,
              loss_fn: torch.nn.Module,
              accuracy_fn):
    """Evaluates model on test data.
    
    Args:
        model: The neural network model to evaluate
        data_loader: DataLoader containing test data
        loss_fn: Loss function for evaluation
        accuracy_fn: Function to calculate accuracy
    """
    test_loss, test_acc = 0, 0
    
    # Put model in evaluation mode
    model.eval()
    
    # Turn on inference mode (no gradients needed)
    with torch.inference_mode():
        for X, y in data_loader:
            # 1. Forward pass
            test_pred = model(X)
            
            # 2. Calculate loss and accuracy
            test_loss += loss_fn(test_pred, y).item()
            test_acc += accuracy_fn(y_true=y, y_pred=test_pred.argmax(dim=1))
    
    # Calculate averages
    test_loss /= len(data_loader)
    test_acc /= len(data_loader)
    
    print(f"Test loss: {test_loss:.5f} | Test accuracy: {test_acc:.2f}%")
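
One detail of the averaging in `train_step()` and `test_step()` is worth knowing: both take the mean of per-batch accuracies. With 10,000 test samples and a batch size of 32, the last batch holds only 16 samples but counts as much as a full batch. The sketch below uses a deliberately contrived case (every full batch entirely correct, the last batch entirely wrong) to show how the two ways of averaging can differ. The numbers are hypothetical, not results from the model.

```python
import math

num_samples = 10_000  # size of the FashionMNIST test set
batch_size = 32

num_batches = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (num_batches - 1) * batch_size

# Contrived scenario: all full batches 100% correct, final small batch 0% correct
batch_sizes = [batch_size] * (num_batches - 1) + [last_batch_size]
correct_per_batch = [batch_size] * (num_batches - 1) + [0]

# Mean of per-batch accuracies (what the lab's functions compute)
mean_of_batch_acc = (
    sum(c / s for c, s in zip(correct_per_batch, batch_sizes)) / num_batches * 100
)

# Accuracy over all samples (correct predictions / total samples)
per_sample_acc = sum(correct_per_batch) / num_samples * 100

print(f"Batches: {num_batches}, last batch size: {last_batch_size}")
print(f"Mean of batch accuracies: {mean_of_batch_acc:.2f}%")
print(f"Per-sample accuracy:      {per_sample_acc:.2f}%")
```

Even in this extreme case the gap is small (about 99.68% versus 99.84%), so the per-batch mean is a reasonable approximation for this lab. The training set divides evenly (60,000 / 32 = 1,875 batches), so the two averages coincide there.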

## 5. Set Up Loss Function and Optimizer

In [None]:
# Loss function for multi-class classification
loss_fn = nn.CrossEntropyLoss()

# Optimizer - SGD with same learning rate as baseline
optimizer = torch.optim.SGD(params=model_1.parameters(), lr=0.1)
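
`nn.CrossEntropyLoss` expects raw logits of shape `[batch, classes]` and integer class indices of shape `[batch]`, and applies log-softmax internally. The small sketch below, using made-up logits and targets purely for illustration, compares its result with a manual negative log-likelihood computation.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Made-up logits for 2 samples and 3 classes (illustrative values only)
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # integer class indices, one per sample

# Built-in loss: takes raw logits, not probabilities
builtin_loss = nn.CrossEntropyLoss()(logits, targets)

# Manual equivalent: log-softmax, pick the log-probability of the true class,
# negate, and average over the batch
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(len(targets)), targets].mean()

print(f"Built-in loss: {builtin_loss.item():.5f}")
print(f"Manual loss:   {manual_loss.item():.5f}")
```

This is also why the model does not need a softmax layer of its own when it is trained with this loss.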

## 6. Train the Model

Now we can use our clean training functions. Notice how much simpler the main training loop is compared to Lab 01!

In [None]:
# Set random seed for reproducibility
torch.manual_seed(42)

# Start timing
train_time_start = timer()

# Number of epochs
epochs = 3

# Training loop - now much cleaner!
for epoch in range(epochs):
    print(f"\nEpoch: {epoch}\n---------")
    
    # Training
    train_step(
        model=model_1,
        data_loader=train_dataloader,
        loss_fn=loss_fn,
        optimizer=optimizer,
        accuracy_fn=accuracy_fn
    )
    
    # Testing
    test_step(
        model=model_1,
        data_loader=test_dataloader,
        loss_fn=loss_fn,
        accuracy_fn=accuracy_fn
    )

# End timing
train_time_end = timer()

# Print total training time
total_train_time_model_1 = print_train_time(
    start=train_time_start,
    end=train_time_end
)

## 7. Create Evaluation Function

Let's create an evaluation function to get final metrics as a dictionary. This will be useful for comparing multiple models.

In [None]:
def eval_model(model: nn.Module,
               data_loader: DataLoader,
               loss_fn: nn.Module,
               accuracy_fn):
    """Evaluate model and return metrics as dictionary."""
    loss, acc = 0, 0
    
    model.eval()
    with torch.inference_mode():
        for X, y in data_loader:
            y_pred = model(X)
            loss += loss_fn(y_pred, y).item()
            acc += accuracy_fn(y_true=y, y_pred=y_pred.argmax(dim=1))
    
    loss /= len(data_loader)
    acc /= len(data_loader)
    
    return {
        "model_name": model.__class__.__name__,
        "model_loss": loss,
        "model_acc": acc
    }

In [None]:
# Evaluate the non-linear model
model_1_results = eval_model(
    model=model_1,
    data_loader=test_dataloader,
    loss_fn=loss_fn,
    accuracy_fn=accuracy_fn
)

print("\nNon-Linear Model Results:")
print(f"Model: {model_1_results['model_name']}")
print(f"Loss: {model_1_results['model_loss']:.4f}")
print(f"Accuracy: {model_1_results['model_acc']:.2f}%")

## 8. The Surprising Result!

You might expect that adding non-linearity would improve performance. Let's compare with our baseline from Lab 01:

| Model | Architecture | Expectation | Approx. test accuracy |
|-------|--------------|----------|--------|
| **V0 (Baseline)** | Linear → Linear | Lower accuracy | ~83% |
| **V1 (Non-Linear)** | Linear → ReLU → Linear → ReLU | Higher accuracy | ~75% |

**The non-linear model performs WORSE!** Why?

### Analysis

**ReLU after output layer** - The final ReLU clips all negative logits to zero, which can hurt classification performance. When the model outputs a negative value for a class, that negative value carries information (the model is "against" that class). Clipping it to zero loses this information.

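**Model capacity** - With only 10 hidden units, the model is already very constrained. ReLU zeros out negative activations, so for any given input some of those 10 units contribute nothing. A unit whose pre-activation is negative for every input (a "dead" ReLU) also stops receiving gradient, which effectively reduces the model's capacity further.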

**Architecture mismatch** - For image data, fully connected layers aren't ideal. Images have spatial structure (nearby pixels are related), but our flatten operation destroys this structure. CNNs are much better suited for images.
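
To make the first point concrete, here is a small sketch with made-up logits (not taken from the trained model). It computes the cross-entropy loss on raw logits and on the same logits passed through ReLU, then inspects the gradients. The values and variable names are assumptions chosen for illustration.

```python
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()
target = torch.tensor([0])  # the true class is class 0

# Case 1: raw logits go straight into the loss
raw_logits = torch.tensor([[2.0, -1.0, -3.0]], requires_grad=True)
loss_raw = loss_fn(raw_logits, target)
loss_raw.backward()
grad_raw = raw_logits.grad.clone()

# Case 2: the same logits are clipped by a final ReLU first
pre_relu = torch.tensor([[2.0, -1.0, -3.0]], requires_grad=True)
clipped = torch.relu(pre_relu)  # the two negative scores both become 0
loss_clipped = loss_fn(clipped, target)
loss_clipped.backward()
grad_clipped = pre_relu.grad.clone()

print(f"Clipped scores: {clipped.detach().tolist()}")
print(f"Loss on raw logits:     {loss_raw.item():.4f}")
print(f"Loss on clipped logits: {loss_clipped.item():.4f}")
print(f"Gradient without ReLU: {grad_raw.tolist()}")
print(f"Gradient with ReLU:    {grad_clipped.tolist()}")
```

In this example the two negative scores become indistinguishable after clipping, the loss is higher even though the prediction is correct, and the clipped positions receive zero gradient, so the optimizer gets no signal for adjusting them.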

### Key Lesson

**Adding complexity doesn't automatically improve performance!** This is why we:
- Always start with a simple baseline
- Add changes incrementally
- Test each modification empirically
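
Following that advice, one incremental change suggested by the analysis is to keep the hidden ReLU but drop the one after the output layer. The class below is a hypothetical variant, not part of the lab: the name `FashionMNISTModelV1NoFinalReLU` is made up here, and whether it beats the baseline is something to measure with `train_step()` and `test_step()` rather than assume.

```python
import torch
from torch import nn


class FashionMNISTModelV1NoFinalReLU(nn.Module):
    """Hypothetical V1 variant: Flatten -> Linear -> ReLU -> Linear (raw logits)."""

    def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
        super().__init__()
        self.layer_stack = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=input_shape, out_features=hidden_units),
            nn.ReLU(),  # Non-linearity between the two linear layers only
            nn.Linear(in_features=hidden_units, out_features=output_shape),
            # No ReLU here: the outputs stay as raw logits and may be negative
        )

    def forward(self, x):
        return self.layer_stack(x)


torch.manual_seed(42)
model_variant = FashionMNISTModelV1NoFinalReLU(
    input_shape=784, hidden_units=10, output_shape=10
)

# Quick shape check with a dummy batch (random values, not real images)
dummy_batch = torch.rand(4, 1, 28, 28)
with torch.inference_mode():
    logits = model_variant(dummy_batch)

print(f"Output shape: {tuple(logits.shape)}")
```

Because the constructor signature matches `FashionMNISTModelV1`, the variant can be passed to the same training and evaluation functions with a fresh optimizer.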

## 9. Make Predictions on Sample Images

In [None]:
# Visualize predictions
torch.manual_seed(42)

fig, axes = plt.subplots(3, 3, figsize=(9, 9))

model_1.eval()
with torch.inference_mode():
    for i, ax in enumerate(axes.flatten()):
        # Get random sample
        random_idx = torch.randint(0, len(test_data), size=[1]).item()
        image, true_label = test_data[random_idx]
        
        # Add batch dimension
        image_batch = image.unsqueeze(0)
        
        # Make prediction
        pred_logits = model_1(image_batch)
        pred_label = pred_logits.argmax(dim=1).item()
        
        # Plot
        ax.imshow(image.squeeze(), cmap="gray")
        
        # Color title based on correct/incorrect
        title_color = "green" if pred_label == true_label else "red"
        ax.set_title(
            f"True: {class_names[true_label]}\nPred: {class_names[pred_label]}",
            color=title_color,
            fontsize=10
        )
        ax.axis(False)

plt.suptitle("Model V1 Predictions (Non-Linear)", fontsize=14)
plt.tight_layout()
plt.show()

## Summary

### What We Learned:

1. **Non-linearity is essential** for neural networks to learn complex patterns, but placement matters!

2. **ReLU activation**: `max(0, x)` - simple, efficient, and widely used.

3. **Reusable training functions**: `train_step()` and `test_step()` make code cleaner and more maintainable.

4. **Empirical testing is crucial**: Adding non-linearity didn't help here - always test your assumptions!

### Key Takeaway:

The baseline model (V0) outperformed the non-linear model (V1), most likely because:
- ReLU after the output layer clips negative logits
- Small hidden layer (10 units) limits capacity
- Fully connected layers aren't optimal for image data

### What's Next?

In Lab 03, we'll:
- Build a **Convolutional Neural Network (CNN)**
- Learn about **Conv2d** and **MaxPool2d** layers
- See how architecture designed for images dramatically improves performance
- Compare all three models