# DATASCI 315, Homework 8: Fashion MNIST with Regularization

In this homework, you will apply deep learning to the Fashion MNIST dataset and explore techniques to address underfitting and overfitting.

The Fashion MNIST dataset is a popular benchmark for machine learning and computer vision, often used as a drop-in replacement for the original MNIST dataset of handwritten digits. It consists of 70,000 grayscale images, each 28x28 pixels. The dataset is split into 60,000 training images and 10,000 test images. Each image depicts a clothing item or accessory from one of ten classes: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.

This dataset provides a more challenging classification task than the original MNIST digits, since the images are more complex and contain subtler differences between classes.

PyTorch, via torchvision, makes it easy to load the Fashion MNIST dataset. Run the cell below to download the data, hold out 30% of the training images for validation, and create training, validation, and test dataloaders.

In [None]:
import torch
import torchvision
from matplotlib import pyplot as plt
from torch import nn, optim
from torch.utils.data import DataLoader, random_split
from torchvision import transforms

# Define a transform to convert images to tensors
transform = transforms.Compose([transforms.ToTensor()])

# Download the full training dataset
full_train_dataset = torchvision.datasets.FashionMNIST(
    root="./data", train=True, download=True, transform=transform
)

# Define the sizes for train/validation split
train_size = int(0.7 * len(full_train_dataset))
val_size = len(full_train_dataset) - train_size

# Split the dataset into training and validation sets
train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])

# Download the test dataset
test_dataset = torchvision.datasets.FashionMNIST(
    root="./data", train=False, download=True, transform=transform
)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=1024, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)

# Loss function to use throughout
criterion = nn.CrossEntropyLoss()
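
As a quick check on the split above, the dataset sizes and the number of batches per epoch follow from simple arithmetic. The sketch below uses only numbers stated in this notebook (60,000 training images, a 70/30 split, batch sizes of 256 and 1024) and assumes the `DataLoader` default `drop_last=False`, so the final batch of each epoch may be smaller than the rest. The helper `num_batches` is an illustrative name, not part of PyTorch.

```python
import math

full_train_size = 60_000  # Fashion MNIST training images
test_size = 10_000        # Fashion MNIST test images

# Same split arithmetic as the data-loading cell
train_size = int(0.7 * full_train_size)
val_size = full_train_size - train_size


def num_batches(num_samples, batch_size):
    """Batches per epoch when the last, possibly smaller, batch is kept."""
    return math.ceil(num_samples / batch_size)


train_batches = num_batches(train_size, 256)
val_batches = num_batches(val_size, 1024)
test_batches = num_batches(test_size, 1024)

print(f"train: {train_size} samples, {train_batches} batches")
print(f"val:   {val_size} samples, {val_batches} batches")
print(f"test:  {test_size} samples, {test_batches} batches")
```

These batch counts are what `len(train_loader)`, `len(val_loader)`, and `len(test_loader)` should return under the stated assumption.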

We provide you with the following function for training your model.

In [None]:
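def train_model(model, train_loader, val_loader, optimizer, num_epochs=50):
    """Train `model` and return per-epoch training and validation losses.

    Uses the global `criterion` defined above. Each reported loss is the
    average of the per-batch mean losses for that epoch.
    """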
    val_losses = []
    training_losses = []

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        for batch_x, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_loader)
        training_losses.append(train_loss)

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_x, batch_y in val_loader:
                outputs = model(batch_x)
                loss = criterion(outputs, batch_y)
                val_loss += loss.item()

        val_loss /= len(val_loader)

        val_losses.append(val_loss)
        print(
            f"Epoch {epoch + 1}/{num_epochs}, "
            f"Training Loss: {train_loss:.4f}, "
            f"Validation Loss: {val_loss:.4f}"
        )

    return training_losses, val_losses


def plot_losses(training_losses, val_losses):
    """Plot training and validation loss curves."""
    plt.plot(training_losses, label="Training Loss")
    plt.plot(val_losses, label="Validation Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()
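
One detail of `train_model` is worth keeping in mind when reading the loss curves. `nn.CrossEntropyLoss` averages over the batch by default, and the function divides the summed batch losses by `len(train_loader)`, so each reported value is a mean of batch means. When the last batch is smaller than the others, this differs slightly from the mean over all samples. The sketch below illustrates the difference with made-up numbers; the helper names and the loss values are hypothetical.

```python
def mean_of_batch_means(batch_means):
    """What train_model reports: every batch counts equally."""
    return sum(batch_means) / len(batch_means)


def per_sample_mean(batch_means, batch_sizes):
    """Mean over all samples: every batch is weighted by its size."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical epoch: two full batches and one short final batch
batch_sizes = [256, 256, 16]
batch_means = [0.50, 0.50, 2.00]

unweighted = mean_of_batch_means(batch_means)
weighted = per_sample_mean(batch_means, batch_sizes)

print(f"mean of batch means: {unweighted:.4f}")
print(f"per-sample mean:     {weighted:.4f}")
```

With many batches per epoch, the effect of a single short batch is small, so the curves remain a reasonable summary of training progress.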

### Problem 1: Underfitting

Create a model that underfits the data when trained for 50 epochs. Your model should demonstrate:
- Training loss that drops steadily but remains above 1.0
- Validation loss that remains similar to training loss (within 0.05) throughout training

**Hint:** Use a very small hidden layer size (e.g., 2 neurons) to limit the model's capacity.
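
For context on the 1.0 threshold, it helps to know the loss of a model that has learned nothing. A classifier that assigns equal probability to all ten classes has a cross-entropy loss of ln(10), about 2.30. The sketch below computes this in plain Python; `cross_entropy` is an illustrative helper that mirrors the softmax cross-entropy formula for a single example, not the PyTorch implementation.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy of one example from raw logits (natural log)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# Equal logits give a uniform prediction over the 10 classes
chance_loss = cross_entropy([0.0] * 10, target=3)

# A confident, correct prediction gives a much smaller loss
confident_logits = [0.0] * 10
confident_logits[3] = 5.0
confident_loss = cross_entropy(confident_logits, target=3)

print(f"chance-level loss: {chance_loss:.4f}")
print(f"confident loss:    {confident_loss:.4f}")
```

An underfit model in this problem should therefore end up well below the chance-level loss but still above 1.0.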

In [None]:
# BEGIN SOLUTION
# Use a very small hidden layer to limit model capacity and cause underfitting
model_underfit = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 2),
    nn.ReLU(),
    nn.Linear(2, 10),
)

optimizer = optim.Adam(model_underfit.parameters(), lr=0.001)
# END SOLUTION

training_losses_1, val_losses_1 = train_model(
    model_underfit, train_loader, val_loader, optimizer, num_epochs=50
)
plot_losses(training_losses_1, val_losses_1)

In [None]:
# Test assertions
assert training_losses_1[-1] > 1.0, "Final training loss should be above 1.0 for underfitting"
assert (
    abs(training_losses_1[-1] - val_losses_1[-1]) < 0.1
), "Training and validation loss should be similar for underfitting"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses_1) == 50, "Should train for exactly 50 epochs"
assert training_losses_1[0] > training_losses_1[-1], "Training loss should decrease"
# END HIDDEN TESTS

### Problem 2: Overfitting

Create a larger model that overfits the data. Your model should demonstrate:
- Training loss that drops steadily throughout training
- Validation loss that drops at first but then begins increasing

**Hint:** Use larger hidden layers (e.g., 128 neurons) to increase the model's capacity.
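
One way to check the second criterion from the recorded losses is to find the epoch with the lowest validation loss and compare it with the final value. The sketch below does this in plain Python; `best_epoch` is an illustrative helper, and the loss values are made up to show the typical shape of an overfitting curve.

```python
def best_epoch(val_losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return index + 1, val_losses[index]


# Hypothetical validation losses: they fall at first, then rise
example_val_losses = [0.55, 0.42, 0.38, 0.36, 0.37, 0.40, 0.45]

epoch, lowest = best_epoch(example_val_losses)
rise_after_best = example_val_losses[-1] - lowest

print(f"lowest validation loss {lowest:.2f} at epoch {epoch}")
print(f"increase by the final epoch: {rise_after_best:.2f}")
```

The same helper can be applied to the `val_losses_2` list returned by `train_model`.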

In [None]:
# BEGIN SOLUTION
# Use larger hidden layers to increase model capacity and cause overfitting
model_overfit = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = optim.Adam(model_overfit.parameters(), lr=0.001)
# END SOLUTION

training_losses_2, val_losses_2 = train_model(
    model_overfit, train_loader, val_loader, optimizer, num_epochs=50
)
plot_losses(training_losses_2, val_losses_2)

In [None]:
# Test assertions
assert (
    training_losses_2[-1] < val_losses_2[-1]
), "Training loss should be lower than validation loss for overfitting"
assert training_losses_2[-1] < 0.3, "Training loss should be low for an overfit model"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses_2) == 50, "Should train for exactly 50 epochs"
assert (
    min(val_losses_2) < val_losses_2[-1]
), "Validation loss should increase after initial decrease"
# END HIDDEN TESTS

### Problem 3: Early Stopping

Early stopping is a regularization technique that halts training once the validation loss stops improving. It works by:
1. Tracking the best validation loss seen so far
2. Incrementing a counter each time the validation loss fails to improve on the best loss by more than a threshold (`min_delta`)
3. Stopping training when this counter reaches a limit (`patience`)

Modify the training function below to implement early stopping. The key changes you need to make are:
- Update `best_val_loss` when the validation loss improves on it by more than `min_delta`
- Reset `patience_counter` when such an improvement occurs
- Increment `patience_counter` when there is no improvement
- Break out of the training loop when `patience_counter >= patience`
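
The stopping rule described above can be sketched on its own, without any model, by replaying a list of validation losses. This is a minimal illustration in plain Python: `early_stopping_epoch` is a hypothetical helper, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.01):
    """Return the 1-based epoch at which training would stop, or None."""
    best_val_loss = float("inf")
    patience_counter = 0

    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_val_loss - min_delta:
            # Improvement by more than min_delta: record it and reset
            best_val_loss = val_loss
            patience_counter = 0
        else:
            # No sufficient improvement this epoch
            patience_counter += 1

        if patience_counter >= patience:
            return epoch

    return None  # patience never ran out


# Hypothetical validation losses: the best value is reached at epoch 3
losses = [0.60, 0.50, 0.45, 0.445, 0.46, 0.47, 0.48, 0.49, 0.50]
stop_epoch = early_stopping_epoch(losses, patience=3, min_delta=0.01)
print(f"training would stop after epoch {stop_epoch}")
```

Note that the drop from 0.45 to 0.445 at epoch 4 is smaller than `min_delta`, so it counts as no improvement, and training stops `patience` epochs after the best loss was recorded.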

In [None]:
def train_model_with_early_stopping(
    model,
    train_loader,
    val_loader,
    optimizer,
    num_epochs=100,
    patience=5,
    min_delta=0.01,
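):
    """Train `model`, stopping early once validation loss stops improving.

    Training stops after `patience` consecutive epochs in which the validation
    loss fails to beat the best loss so far by more than `min_delta`.
    """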
    best_val_loss = float("inf")
    patience_counter = 0

    validation_losses = []
    training_losses = []

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_loader)
        training_losses.append(train_loss)

        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                val_loss += loss.item()

        val_loss /= len(val_loader)
        validation_losses.append(val_loss)

        # Print training and validation loss
        print(
            f"Epoch {epoch + 1}/{num_epochs}, "
            f"Training Loss: {train_loss:.4f}, "
            f"Validation Loss: {val_loss:.4f}"
        )

        # BEGIN SOLUTION
        # Check for early stopping
        if val_loss < best_val_loss - min_delta:
            best_val_loss = val_loss
            patience_counter = 0  # Reset the counter if there's an improvement
        else:
            patience_counter += 1

        if patience_counter >= patience:
            print("Early stopping triggered")
            break
        # END SOLUTION

    return training_losses, validation_losses

Now train the same architecture from Problem 2 using early stopping:

In [None]:
# BEGIN SOLUTION
# Use the same architecture as Problem 2
model_early = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = optim.Adam(model_early.parameters(), lr=0.001)
# END SOLUTION

training_losses_3, val_losses_3 = train_model_with_early_stopping(
    model_early, train_loader, val_loader, optimizer, num_epochs=50
)
plot_losses(training_losses_3, val_losses_3)

In [None]:
# Test assertions
assert len(training_losses_3) < 50, "Early stopping should stop training before 50 epochs"
assert len(training_losses_3) > 5, "Training should run for at least a few epochs"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses_3) == len(val_losses_3), "Loss lists should have same length"
# END HIDDEN TESTS

### Problem 4: Smaller Model

Try reducing the size of your model from Problem 2 to avoid overfitting. Use smaller hidden layer sizes while keeping the same number of layers.

**Hint:** Try using 32 neurons in each hidden layer instead of 128.
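
To see how much capacity this removes, one can count trainable parameters: each `nn.Linear(n_in, n_out)` layer has `n_in * n_out` weights plus `n_out` biases. The sketch below is plain Python; `mlp_param_count` is an illustrative helper, and the layer sizes assume the fully connected architectures used in this notebook's solution cells (one hidden layer of 2 units for Problem 1, two hidden layers otherwise).

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with these layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


input_size = 28 * 28  # flattened 28x28 image
num_classes = 10

underfit_params = mlp_param_count([input_size, 2, num_classes])
smaller_params = mlp_param_count([input_size, 32, 32, num_classes])
overfit_params = mlp_param_count([input_size, 128, 128, num_classes])

print(f"hidden size 2:   {underfit_params:>7} parameters")
print(f"hidden size 32:  {smaller_params:>7} parameters")
print(f"hidden size 128: {overfit_params:>7} parameters")
```

Under these assumptions, the 32-unit model has fewer than a quarter of the parameters of the 128-unit model.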

In [None]:
# BEGIN SOLUTION
# Use smaller hidden layers to reduce model capacity
model_smaller = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
)
optimizer = optim.Adam(model_smaller.parameters(), lr=0.001)
# END SOLUTION

training_losses_4, val_losses_4 = train_model(
    model_smaller, train_loader, val_loader, optimizer, num_epochs=50
)
plot_losses(training_losses_4, val_losses_4)

In [None]:
# Test assertions
assert val_losses_4[-1] < 0.5, "Validation loss should be reasonably low"
assert (
    abs(training_losses_4[-1] - val_losses_4[-1]) < 0.2
), "Gap between training and validation loss should be small"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses_4) == 50, "Should train for exactly 50 epochs"
# END HIDDEN TESTS

### Problem 5: L2 Regularization

Use L2 regularization (weight decay) to train the same architecture from Problem 2 without overfitting.

**Hint:** Use the `weight_decay` parameter in the Adam optimizer. Try a value like 0.01.
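
In `torch.optim.Adam`, `weight_decay` acts as an L2 penalty: `weight_decay * w` is added to each parameter's gradient before the Adam update, which is the gradient of the penalty `(weight_decay / 2) * sum(w ** 2)`. The sketch below checks that relationship numerically in plain Python. The helper names and weight values are illustrative, and the code does not reproduce the Adam update itself.

```python
def l2_penalty(weights, weight_decay):
    """Penalty term whose gradient is weight_decay * w."""
    return 0.5 * weight_decay * sum(w * w for w in weights)


def l2_penalty_grad(weights, weight_decay):
    """Analytic gradient of the penalty with respect to each weight."""
    return [weight_decay * w for w in weights]


def numerical_grad(f, weights, eps=1e-6):
    """Central finite-difference gradient of f at `weights`."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


weights = [0.5, -2.0, 3.0]  # hypothetical parameter values
weight_decay = 0.01

analytic = l2_penalty_grad(weights, weight_decay)
numeric = numerical_grad(lambda w: l2_penalty(w, weight_decay), weights)

for a, n in zip(analytic, numeric):
    print(f"analytic={a:+.6f} numeric={n:+.6f}")
```

Because the added gradient is proportional to each weight, larger weights are pulled toward zero more strongly, which limits how closely the model can fit the training data.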

In [None]:
# BEGIN SOLUTION
# Use the same architecture as Problem 2 but with L2 regularization via weight_decay
model_l2 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = optim.Adam(model_l2.parameters(), lr=0.001, weight_decay=0.01)
# END SOLUTION

training_losses_5, val_losses_5 = train_model(
    model_l2, train_loader, val_loader, optimizer, num_epochs=50
)
plot_losses(training_losses_5, val_losses_5)

In [None]:
# Test assertions
assert val_losses_5[-1] < 0.6, "Validation loss should be reasonable with L2"
assert training_losses_5[-1] > 0.3, "L2 should prevent training loss from getting too low"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses_5) == 50, "Should train for exactly 50 epochs"
# END HIDDEN TESTS

### Problem 6: Dropout

Add dropout to the architecture from Problem 2 to avoid overfitting.

**Hint:** Add an `nn.Dropout` layer after one or more of the ReLU activations. Try a dropout probability of 0.5 to 0.7.
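
During training, `nn.Dropout(p)` zeroes each activation with probability `p` and scales the survivors by `1 / (1 - p)`; in evaluation mode it passes its input through unchanged. The sketch below imitates that behavior on a plain Python list, assuming `0 <= p < 1`. The `dropout` function here is an illustrative stand-in, not the PyTorch implementation.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of activations (assumes 0 <= p < 1)."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    keep_scale = 1.0 / (1.0 - p)
    # Drop each value with probability p, rescale the ones that are kept
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000

train_out = dropout(activations, p=0.7, training=True, rng=rng)
eval_out = dropout(activations, p=0.7, training=False, rng=rng)

mean_train = sum(train_out) / len(train_out)
fraction_dropped = train_out.count(0.0) / len(train_out)

print(f"fraction dropped in training: {fraction_dropped:.3f}")
print(f"mean activation in training:  {mean_train:.3f}")
print(f"unchanged in evaluation:      {eval_out == activations}")
```

The rescaling keeps the expected activation the same in both modes. Since `train_model` computes the training loss with dropout active and the validation loss with it disabled, the training loss can end up higher than the validation loss.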

In [None]:
# BEGIN SOLUTION
# Add dropout after the first ReLU to regularize the model
model_dropout = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Dropout(0.7),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = optim.Adam(model_dropout.parameters(), lr=0.001)
# END SOLUTION

training_losses_6, val_losses_6 = train_model(
    model_dropout, train_loader, val_loader, optimizer, num_epochs=50
)
plot_losses(training_losses_6, val_losses_6)

In [None]:
# Test assertions
assert val_losses_6[-1] < 0.5, "Validation loss should be reasonable with dropout"
assert (
    training_losses_6[-1] > val_losses_6[-1] - 0.1
), "With dropout, training loss is typically higher than validation loss"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses_6) == 50, "Should train for exactly 50 epochs"
# END HIDDEN TESTS

### Problem 7: L2 and Dropout Combined

Use both L2 regularization and dropout together to avoid overfitting.

In [None]:
# BEGIN SOLUTION
# Combine dropout and L2 regularization
model_l2dropout = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Dropout(0.7),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

optimizer = optim.Adam(model_l2dropout.parameters(), lr=0.001, weight_decay=0.01)
# END SOLUTION

training_losses_7, val_losses_7 = train_model(
    model_l2dropout, train_loader, val_loader, optimizer, num_epochs=50
)
plot_losses(training_losses_7, val_losses_7)

In [None]:
# Test assertions
assert val_losses_7[-1] < 0.7, "Validation loss should be reasonable"
assert training_losses_7[-1] > 0.4, "Combined regularization should prevent low training loss"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses_7) == 50, "Should train for exactly 50 epochs"
# END HIDDEN TESTS

## Model Comparison

Run the code below to print the validation and test loss for each model.

In [None]:
models = [
    model_underfit,
    model_overfit,
    model_early,
    model_smaller,
    model_l2,
    model_dropout,
    model_l2dropout,
]
model_names = [
    "Underfit",
    "Overfit",
    "Early Stopping",
    "Smaller Model",
    "L2 Regularization",
    "Dropout",
    "L2 and Dropout",
]


def get_loss(model, loader):
    """Return the average per-batch loss of `model` over `loader`."""
    model.eval()
    loss = 0
    with torch.no_grad():
        for batch_x, batch_y in loader:
            outputs = model(batch_x)
            batch_loss = criterion(outputs, batch_y)
            loss += batch_loss.item()
    return loss / len(loader)


for model, model_name in zip(models, model_names, strict=True):
    val = get_loss(model, val_loader)
    test = get_loss(model, test_loader)
    print(f"{model_name:>20}: validation={val:.4f} test={test:.4f}")

### Problem 8: Model Selection

If you had to use one of these models, which one would you use and why? Write your answer below.

# BEGIN SOLUTION
Based on the results, the Early Stopping model typically achieves the best test loss. This makes sense because training halts shortly after the validation loss stops improving, before the model has overfit substantially. The Dropout model also performs well: randomly zeroing activations during training keeps the model from relying too heavily on any single neuron, which improves generalization.

Key observations:
- The Underfit model has high loss on both the training and test sets, indicating that it lacks the capacity to learn the patterns in the data.
- The Overfit model has low training loss but higher test loss, showing it has memorized the training data rather than learning generalizable patterns.
- Regularization techniques (early stopping, smaller model, L2, dropout) help reduce the gap between training and test performance.
# END SOLUTION