# DATASCI 315, Group Work 6: Regularizing Neural Networks

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. *During lab, feel free to flag down your GSI to ask questions at any point!* Upon completion, one member of the team should submit their team's work through Canvas as HTML.

## Introduction to underfitting and overfitting

When you train a neural network, its accuracy on the validation data typically peaks after a number of epochs and then stagnates or starts decreasing.

In other words, the model *overfits* to the training data. Learning how to deal with overfitting is important. Although it's often possible to achieve high accuracy on the *training set*, what you really want is to develop models that generalize well to a *testing set* (or data they haven't seen before).

The opposite of overfitting is *underfitting*. Underfitting occurs when there is still room for improvement on the training data. This can happen for a number of reasons: the model is not powerful enough, is over-regularized, or has simply not been trained long enough. In each case, the network has not learned the relevant patterns in the training data.

If you train for too long, though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. You need to strike a balance. Understanding how to train for an appropriate number of epochs, as you'll explore below, is a useful skill.
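One common way to choose the number of epochs is to watch the validation loss and stop once it has failed to improve for a while. The sketch below is a minimal illustration of that idea and is not part of the assignment: the function name `best_epoch_with_patience`, its `patience` parameter, and the loss values are all made up for demonstration.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    best_epoch is the 0-based epoch with the lowest validation loss seen
    before stopping; stop_epoch is the epoch at which training would halt
    because the loss failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    # Patience never ran out: training used every epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: they improve, then drift upward (overfitting).
example_losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.61, 0.64, 0.66]
best, stopped = best_epoch_with_patience(example_losses, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

In practice you would also keep a copy of the model weights from the best epoch, since the weights at the stopping epoch are already past the minimum.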

To prevent overfitting, the best solution is to use more complete training data. The dataset should cover the full range of inputs that the model is expected to handle. Additional data may only be useful if it covers new and interesting cases.

A model trained on more complete data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store.  If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.

In this group-work assignment, you'll explore several common regularization techniques, and use them to improve on a classification model.

## Setup

Before getting started, import the necessary packages:

In [None]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from matplotlib import pyplot as plt
from torch.utils.data import DataLoader, Dataset, random_split

print(torch.__version__)

### The Higgs dataset

The goal of this assignment is not to do particle physics, so don't dwell on the details of the dataset. It contains 11,000,000 examples, each with 28 features, and a binary class label.

We first download the dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/280/higgs). The download is about 2.6 GB and may take a few minutes to complete.

In [None]:
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Define the URL and the local file path
url = "https://archive.ics.uci.edu/static/public/280/higgs.zip"
zip_filename = "data/higgs.zip"
csv_filename = "data/HIGGS.csv.gz"

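# Download and extract the file
if not Path(csv_filename).exists():
    # Make sure the target directory exists before writing into it
    Path(zip_filename).parent.mkdir(parents=True, exist_ok=True)
    if not Path(zip_filename).exists():
        print(f"Downloading {zip_filename}...")
        urlretrieve(url, zip_filename)
    print(f"Extracting {csv_filename}...")
    with zipfile.ZipFile(zip_filename, "r") as zip_ref:
        # Extract next to the zip so the file lands at csv_filename
        # (assumes the archive stores HIGGS.csv.gz at its top level)
        zip_ref.extractall(Path(csv_filename).parent)
else:
    print(f"{csv_filename} already exists.")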

# Load the data using pandas
data = pd.read_csv(csv_filename, compression="gzip", header=None)

# Display the first few rows
print(data.head())

We split the dataset and create dataloaders for easy batching.

In [None]:
# Convert DataFrame to numpy arrays
features = data.iloc[:, 1:].values  # All others except the first column are features
labels = data.iloc[:, 0].values  # First column is label


# Custom Dataset class
class DataFrameDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Create Dataset object
dataset = DataFrameDataset(features, labels)

# Split into training and validation subsets (the remaining rows are discarded)
train_dataset, val_dataset, _ = random_split(dataset, [10000, 1000, len(dataset) - 11000])

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=1000, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=500, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=500, shuffle=False)

The histogram below summarizes the distribution of all feature values in a single batch of 1,000 examples.

In [None]:
for batch_features, _batch_labels in dataloader:
    print(batch_features)
    plt.hist(batch_features.numpy().ravel(), bins=101)
    break

### Problem 1: Write a Training Function

The simplest way to prevent overfitting is to start with a small model: a model with a small number of learnable parameters (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's "capacity".
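To make "number of learnable parameters" concrete, the sketch below counts the weights and biases in a stack of fully connected layers. It is an illustration in plain Python: the helper name `count_dense_parameters` is hypothetical, and the layer sizes are example architectures with 28 inputs and 2 outputs.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes lists the width of every layer, input and output included,
    e.g. [28, 16, 2] for 28 inputs, one hidden layer of 16 units, 2 outputs.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # bias vector
    return total


# Example architectures of increasing width and depth.
for sizes in ([28, 16, 2], [28, 64, 64, 64, 2], [28, 512, 512, 512, 2]):
    print(sizes, count_dense_parameters(sizes))
```

For an actual PyTorch model, `sum(p.numel() for p in model.parameters())` should report the same quantity, since activation layers such as `nn.ELU` have no parameters of their own.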

Intuitively, a model with more parameters has more "memorization capacity" and can therefore easily learn a perfect dictionary-like mapping between training samples and their targets. Such a mapping has no generalization power, so it is useless for making predictions on previously unseen data.

Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.

On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting to the training data. There is a balance between "too much capacity" and "not enough capacity".

Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.

To find an appropriate model size, it's best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss.

Start with a simple model using only densely-connected layers as a baseline, then create larger models, and compare them.

We will train a number of models, so having a function that trains a given model will be convenient. Complete the `train_model` function below by filling in the training loop. The function should:
1. Set the model to training mode and iterate over the training data
2. For each batch: zero gradients, compute outputs, compute loss, backpropagate, and update weights
3. Track the average training loss per epoch
4. Set the model to evaluation mode and compute the validation loss
5. Return both the training and validation loss histories

In [None]:
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=100):
    validation_losses = []
    training_losses = []

    for epoch in range(num_epochs):
        # BEGIN SOLUTION
        # Training phase
        model.train()
        train_loss = 0
        for batch_x, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        train_loss /= len(train_loader)
        training_losses.append(train_loss)

        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_x, batch_y in val_loader:
                outputs = model(batch_x)
                loss = criterion(outputs, batch_y)
                val_loss += loss.item()

        val_loss /= len(val_loader)
        validation_losses.append(val_loss)
        # END SOLUTION
        if epoch % 10 == 0:
            print(
                f"Epoch {epoch+1}/{num_epochs}, "
                f"Training Loss: {train_loss:.4f}, "
                f"Validation Loss: {val_loss:.4f}"
            )

    return training_losses, validation_losses

In [None]:
# Test assertions
# Create a simple model and test the train_model function
test_model = nn.Sequential(nn.Linear(28, 8), nn.ReLU(), nn.Linear(8, 2))
test_criterion = nn.CrossEntropyLoss()
test_optimizer = optim.Adam(test_model.parameters(), lr=0.01)

train_losses, val_losses = train_model(
    test_model, train_loader, val_loader, test_criterion, test_optimizer, num_epochs=10
)

assert len(train_losses) == 10, f"Expected 10 training losses, got {len(train_losses)}"
assert len(val_losses) == 10, f"Expected 10 validation losses, got {len(val_losses)}"
assert all(isinstance(loss, float) for loss in train_losses), "Training losses should be floats"
assert all(isinstance(loss, float) for loss in val_losses), "Validation losses should be floats"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert train_losses[0] > train_losses[-1], "Training loss should decrease over time"
assert all(loss > 0 for loss in train_losses), "Training losses should be positive"
assert all(loss > 0 for loss in val_losses), "Validation losses should be positive"
# END HIDDEN TESTS

## Comparing Model Architectures

We will observe how the loss trajectories evolve as we increase the size of the models. As we increase the model's size, a divergence between the training and the validation loss will appear. In very large models, validation loss will increase despite the falling training loss.

### Problem 2a: Tiny Model

Write a tiny model with one hidden layer using `nn.Sequential`. The hidden layer should have width `16` and use the `nn.ELU()` activation function. The input dimension is 28 (the number of features) and the output dimension is 2 (binary classification).

Then, train the model using the function from Problem 1.

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(nn.Linear(28, 16), nn.ELU(), nn.Linear(16, 2))
# END SOLUTION
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

training_losses, validation_losses = train_model(
    model, train_loader, val_loader, criterion, optimizer, num_epochs=200
)

In [None]:
# Test assertions
assert "model" in dir(), "model should be defined"
assert isinstance(model, nn.Module), "model should be an nn.Module"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify model structure for 2a
# END HIDDEN TESTS

Now check how the model did:

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

### Problem 2b: Small Model

Write a small model with two hidden layers using `nn.Sequential`. Each hidden layer should have width `16` and use the `nn.ELU()` activation function.

Then, train the model using the function from Problem 1.

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(nn.Linear(28, 16), nn.ELU(), nn.Linear(16, 16), nn.ELU(), nn.Linear(16, 2))
# END SOLUTION
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Test assertions
assert "model" in dir(), "model should be defined"
assert isinstance(model, nn.Module), "model should be an nn.Module"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify model structure for 2b
# END HIDDEN TESTS

In [None]:
training_losses, validation_losses = train_model(
    model, train_loader, val_loader, criterion, optimizer, num_epochs=200
)

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

### Problem 2c: Medium Model

Write a medium model with three hidden layers using `nn.Sequential`. Each hidden layer should have width `64` and use the `nn.ELU()` activation function.

Then, train the model using the function from Problem 1.

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(
    nn.Linear(28, 64),
    nn.ELU(),
    nn.Linear(64, 64),
    nn.ELU(),
    nn.Linear(64, 64),
    nn.ELU(),
    nn.Linear(64, 2),
)
# END SOLUTION
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Test assertions
assert "model" in dir(), "model should be defined"
assert isinstance(model, nn.Module), "model should be an nn.Module"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify model structure for 2c
# END HIDDEN TESTS

And train the model using the same data:

In [None]:
training_losses, validation_losses = train_model(
    model, train_loader, val_loader, criterion, optimizer, num_epochs=200
)

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

### Problem 2d: Large Model

Write a large model with three hidden layers using `nn.Sequential`. Each hidden layer should have width `512` and use the `nn.ELU()` activation function.

Then, train the model using the function from Problem 1. You should observe significant overfitting in this model.

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(
    nn.Linear(28, 512),
    nn.ELU(),
    nn.Linear(512, 512),
    nn.ELU(),
    nn.Linear(512, 512),
    nn.ELU(),
    nn.Linear(512, 2),
)
# END SOLUTION
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

training_losses, validation_losses = train_model(
    model, train_loader, val_loader, criterion, optimizer, num_epochs=200
)

In [None]:
# Test assertions
assert "model" in dir(), "model should be defined"
assert isinstance(model, nn.Module), "model should be an nn.Module"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify model structure for 2d
# END HIDDEN TESTS

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

## Adding Regularization

### Problem 3a: L2 Regularization (Weight Decay)

You may be familiar with the principle of Occam's razor: given two explanations for something, the explanation most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.

A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as demonstrated in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

* [L1 regularization](https://developers.google.com/machine-learning/glossary/#L1_regularization), where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).

* [L2 regularization](https://developers.google.com/machine-learning/glossary/#L2_regularization), where the cost added is proportional to the square of the value of the weight coefficients (i.e. to what is called the squared "L2 norm" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: for plain SGD the two are mathematically equivalent, and PyTorch's `optim.Adam` implements its `weight_decay` argument as an L2 penalty.

L1 regularization pushes weights towards exactly zero, encouraging a sparse model. L2 regularization penalizes the weight parameters without making them sparse, because the gradient of the squared penalty shrinks towards zero as a weight gets small. This is one reason why L2 is more common.
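The difference between the two penalties is easiest to see in their gradients. The following sketch uses plain Python and a made-up coefficient `lam`; the helper names are illustrative and are not PyTorch functions.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute values (the L1 norm)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam times the sum of squares (the squared L2 norm)."""
    return lam * sum(w * w for w in weights)


def l1_gradient(w, lam):
    """Derivative of lam * |w|: constant magnitude, taken as 0 at w == 0."""
    if w == 0:
        return 0.0
    return lam if w > 0 else -lam


def l2_gradient(w, lam):
    """Derivative of lam * w**2: proportional to w itself."""
    return 2 * lam * w


lam = 0.01  # illustrative regularization strength
for w in (1.0, 0.1, 0.001):
    print(f"w={w}: L1 grad={l1_gradient(w, lam)}, L2 grad={l2_gradient(w, lam)}")
```

The L1 gradient keeps the same magnitude however small the weight is, so it can push a weight all the way to zero. The L2 gradient fades as the weight shrinks, so weights become small but rarely exactly zero.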

Here, we apply L2 regularization using the `weight_decay` parameter in PyTorch optimizers. Read the [PyTorch optimizer documentation](https://pytorch.org/docs/stable/optim.html) and set the `weight_decay` parameter to reduce overfitting in the medium-sized model (width 64, three hidden layers).
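To see what `weight_decay` does, the sketch below runs a few Adam steps on a made-up three-parameter problem twice: once with the optimizer's `weight_decay` argument, and once with an explicit penalty of `0.5 * wd * sum(w**2)` added to the loss. It relies on the PyTorch documentation's description of `weight_decay` as an L2 penalty; the helper `run_adam`, the target values, and the hyperparameters are illustrative assumptions only.

```python
import torch


def run_adam(use_weight_decay, steps=5, lr=0.1, wd=0.01):
    """Run a few Adam steps on a toy quadratic loss and return the weights."""
    w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
    target = torch.tensor([0.0, 1.0, 2.0])

    if use_weight_decay:
        opt = torch.optim.Adam([w], lr=lr, weight_decay=wd)
    else:
        opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = ((w - target) ** 2).sum()
        if not use_weight_decay:
            # Explicit L2 penalty; its gradient is wd * w.
            loss = loss + 0.5 * wd * (w ** 2).sum()
        loss.backward()
        opt.step()
    return w.detach()


w_built_in = run_adam(use_weight_decay=True)
w_explicit = run_adam(use_weight_decay=False)
print("same weights:", torch.allclose(w_built_in, w_explicit, atol=1e-5))
```

If the two runs agree, the built-in option behaves like a penalty of `wd / 2` times the squared L2 norm of the weights. Note that `optim.AdamW` applies weight decay in a decoupled way, so this equivalence is not expected to hold for that optimizer.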

Plot the loss curve to verify that the L2 regularization was effective.

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(
    nn.Linear(28, 64),
    nn.ELU(),
    nn.Linear(64, 64),
    nn.ELU(),
    nn.Linear(64, 64),
    nn.ELU(),
    nn.Linear(64, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
# END SOLUTION

In [None]:
# Test assertions
assert "model" in dir(), "model should be defined"
assert isinstance(model, nn.Module), "model should be an nn.Module"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify model structure for 3a
# END HIDDEN TESTS

In [None]:
training_losses, validation_losses = train_model(
    model, train_loader, val_loader, criterion, optimizer, num_epochs=200
)

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

### Problem 3b: Dropout

Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto.

The intuitive explanation for dropout is that because individual nodes in the network cannot rely on the output of the others, each node must output features that are useful on their own.

Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. For example, a given layer would normally have returned a vector `[0.2, 0.5, 1.3, 0.8, 1.1]` for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. `[0, 0.5, 1.3, 0, 1.1]`.

The "dropout rate" is the fraction of the features that are zeroed out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out. To keep the expected size of the outputs the same in both modes, PyTorch's `nn.Dropout` scales the surviving values up by a factor of 1/(1 - rate) during training, so no rescaling is needed at test time.

In PyTorch, you can introduce dropout in a network via the `nn.Dropout` layer, which gets applied to the output of the layer right before it.
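The short sketch below shows how `nn.Dropout` behaves in training mode versus evaluation mode. The input of all ones and the seed are arbitrary choices made for illustration; the layer is used on its own here rather than inside a model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each entry is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.
drop.train()
y_train = drop(x)

# Evaluation mode: the layer passes its input through unchanged.
drop.eval()
y_eval = drop(x)

print("distinct training values:", sorted(y_train.unique().tolist()))
print("fraction zeroed:", (y_train == 0).float().mean().item())
print("eval output unchanged:", torch.equal(y_eval, x))
```

This is why calling `model.train()` and `model.eval()` at the right points in a training loop matters: the same dropout layer produces different outputs depending on the mode.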

Add `nn.Dropout(0.5)` layers after each activation function in the medium-sized model (width 64, three hidden layers) to reduce overfitting. Do not use weight decay for this problem.

Plot the loss curve to verify that the dropout was effective.

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(
    nn.Linear(28, 64),
    nn.ELU(),
    nn.Dropout(0.5),
    nn.Linear(64, 64),
    nn.ELU(),
    nn.Dropout(0.5),
    nn.Linear(64, 64),
    nn.ELU(),
    nn.Dropout(0.5),
    nn.Linear(64, 2),
)
# END SOLUTION
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Test assertions
assert "model" in dir(), "model should be defined"
assert isinstance(model, nn.Module), "model should be an nn.Module"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify model structure for 3b
# END HIDDEN TESTS

In [None]:
training_losses, validation_losses = train_model(
    model, train_loader, val_loader, criterion, optimizer, num_epochs=200
)

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

### Problem 3c: Combined Regularization

Combine both strategies from Problems 3a and 3b to improve the large model (width 512, three hidden layers). Use both dropout layers and weight decay.

Plot the loss curves to verify that the combined regularization was effective at reducing overfitting.

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(
    nn.Linear(28, 512),
    nn.ELU(),
    nn.Dropout(0.5),
    nn.Linear(512, 512),
    nn.ELU(),
    nn.Dropout(0.5),
    nn.Linear(512, 512),
    nn.ELU(),
    nn.Dropout(0.5),
    nn.Linear(512, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
# END SOLUTION

In [None]:
# Test assertions
assert "model" in dir(), "model should be defined"
assert isinstance(model, nn.Module), "model should be an nn.Module"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify model structure for 3c
# END HIDDEN TESTS

In [None]:
training_losses, validation_losses = train_model(
    model, train_loader, val_loader, criterion, optimizer, num_epochs=200
)

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

## Conclusions

To recap, here are the most common ways to prevent overfitting in neural networks:

* Get more training data.
* Reduce the capacity of the network.
* Add weight regularization.
* Add dropout.

Two important approaches not covered in this notebook are:

* [Data augmentation](../images/data_augmentation.ipynb)
* Batch normalization (`torch.nn.functional.batch_norm`)

Remember that each method can help on its own, but often combining them can be even more effective.