# DATASCI 503, Group Work 11: Building and Tuning Neural Networks in PyTorch

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. **During lab, feel free to flag down your GSI to ask questions at any point!** Upon completion, one member of the team should submit their team's work through Canvas **as html**.

In [None]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from matplotlib import pyplot as plt

# PLEASE USE RANDOM STATE 42 FOR THIS ASSIGNMENT

# **NOTE**: This assignment requires training multiple neural networks for several epochs. Please ensure you provide yourself with ample time to run the full notebook from start to finish and generate all outputs before submission.

### Problem 1a: Activation Functions

In 1-2 sentences, explain what an activation function is and why it is useful for neural networks.

> BEGIN SOLUTION

An activation function introduces non-linearity into a neural network by transforming the output of each neuron. Without activation functions, a neural network would only be able to learn linear relationships regardless of its depth, since composing linear functions still produces a linear function.
> END SOLUTION
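
The claim that stacked linear layers collapse into a single linear map can be checked numerically. The sketch below is only an illustration: the layer sizes and the variable names are arbitrary choices, not part of the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Two stacked linear layers with no activation in between
stacked = nn.Sequential(nn.Linear(3, 4), nn.Linear(4, 2))
first, second = stacked[0], stacked[1]

with torch.no_grad():
    # Collapse them into one linear map: W = W2 @ W1, b = W2 @ b1 + b2
    combined_weight = second.weight @ first.weight
    combined_bias = second.weight @ first.bias + second.bias

    x = torch.randn(5, 3)
    stacked_out = stacked(x)
    collapsed_out = x @ combined_weight.T + combined_bias

# The two outputs agree up to floating-point error
print(torch.allclose(stacked_out, collapsed_out, atol=1e-5))
```

Inserting a non-linearity such as `nn.ReLU()` between the two layers breaks this equivalence, which is what lets depth add expressive power.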


### Problem 1b: List Activation Functions

List 5 different activation functions and explain why each might be useful in neural networks. Use the following format: Name, Formula: Use Case.

1. Tanh, $\tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})$: The tanh function is smooth, differentiable everywhere, and bounded between -1 and 1.
2. List item
3. List item
4. List item
5. List item

> BEGIN SOLUTION

1. Tanh, $\tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})$: The tanh function is smooth, differentiable everywhere, and bounded between -1 and 1.
2. ReLU, $\text{ReLU}(x) = \max(0, x)$: Fast to compute and helps mitigate the vanishing gradient problem; widely used in hidden layers.
3. Sigmoid, $\sigma(x) = 1 / (1 + e^{-x})$: Outputs values between 0 and 1, making it useful for binary classification output layers.
4. Leaky ReLU, $\text{LeakyReLU}(x) = \max(\alpha x, x)$ where $\alpha$ is small (e.g., 0.01): Addresses the "dying ReLU" problem by allowing small negative gradients.
5. Softmax, $\text{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$: Converts a vector of logits into a probability distribution, useful for multi-class classification output layers.
> END SOLUTION
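
As a sanity check, the formulas above can be written out by hand and compared with PyTorch's built-in versions. This is a rough sketch: the input vector and the dictionary names `manual` and `builtin` are made up for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
alpha = 0.01  # negative slope for Leaky ReLU

# Each formula written out by hand
manual = {
    "tanh": (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x)),
    "relu": torch.clamp(x, min=0.0),
    "sigmoid": 1.0 / (1.0 + torch.exp(-x)),
    "leaky_relu": torch.maximum(alpha * x, x),
    "softmax": torch.exp(x) / torch.exp(x).sum(),
}

# The corresponding PyTorch built-ins
builtin = {
    "tanh": torch.tanh(x),
    "relu": F.relu(x),
    "sigmoid": torch.sigmoid(x),
    "leaky_relu": F.leaky_relu(x, negative_slope=alpha),
    "softmax": F.softmax(x, dim=0),
}

for name in manual:
    print(name, torch.allclose(manual[name], builtin[name], atol=1e-6))
```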


### Problem 1c: Understanding Loss Functions

Write down the names and formulas of 2 loss functions (ideally the most widely used) for neural networks. One should be for classification and one should be for regression. Include one sentence explaining intuitively what quantities the model is trying to bring close together and/or why the function is useful to minimize.

> BEGIN SOLUTION

**CLASSIFICATION**: Cross-Entropy Loss, $L = -\sum_{i} y_i \log(\hat{y}_i)$. This loss measures the difference between the predicted probability distribution and the true distribution, encouraging the model to assign high probability to the correct class.

**REGRESSION**: Mean Squared Error (MSE), $L = \frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2$. This loss measures the average squared difference between predicted and actual values, penalizing larger errors more heavily and encouraging predictions close to the true values.
> END SOLUTION
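
Both losses can be computed by hand and compared with PyTorch's loss modules. The sketch below uses made-up logits, labels, and regression targets. Note that `nn.CrossEntropyLoss` takes raw logits and integer class labels and averages over the batch by default.

```python
import torch
import torch.nn as nn

# Classification: 3 samples, 2 classes (toy values)
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [1.0, 1.0]])
labels = torch.tensor([0, 1, 0])

probs = torch.softmax(logits, dim=1)
# With one-hot targets, -sum_i y_i log(y_hat_i) keeps only the true-class term
manual_ce = -torch.log(probs[torch.arange(3), labels]).mean()
builtin_ce = nn.CrossEntropyLoss()(logits, labels)

# Regression: mean of squared residuals (toy values)
y_true = torch.tensor([1.0, 2.0, 3.0])
y_pred = torch.tensor([1.5, 1.5, 3.0])
manual_mse = ((y_true - y_pred) ** 2).mean()
builtin_mse = nn.MSELoss()(y_pred, y_true)

print(manual_ce.item(), builtin_ce.item())
print(manual_mse.item(), builtin_mse.item())
```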


### Problem 2: Gradient Descent with Autograd

Consider the simple regression problem given $X=(x_1,\dots,x_n)$ as a vector of length $n$ and targets $Y=(y_1,\dots,y_n)$. Use gradient descent to minimize the Mean Squared Error loss:

$$f(w) = \frac{1}{n}\sum_i (y_i - w_0 - w_1 x_i)^2$$

Write a function `mse_gd` that takes as input:
- `features`: Input feature tensor (the vector $X$)
- `targets`: Target tensor (the vector $Y$)  
- `eta`: Step size (learning rate), default 0.1
- `max_iter`: Maximum number of iterations, default 1000

The function should return the learned weights `w` as a tensor of shape `(2,)` where `w[0]` is the intercept and `w[1]` is the slope.

**Note:** This is similar to the problem from last week's groupwork, but now use PyTorch with autograd instead of NumPy.
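
Before writing the full loop, it can help to confirm that autograd returns the same gradient as the hand-derived one for $f(w)$. The sketch below is only a check under assumed toy data and an arbitrary starting point; it is not the required solution.

```python
import torch

torch.manual_seed(42)
x = torch.rand(20)
y = 2.0 - 1.4 * x + 0.3 * torch.randn(20)

# Arbitrary starting weights: w[0] is the intercept, w[1] is the slope
w = torch.tensor([0.5, -0.5], requires_grad=True)

# Autograd gradient of the MSE loss f(w)
loss = torch.mean((y - w[0] - w[1] * x) ** 2)
loss.backward()

# Hand-derived gradient with residual r_i = y_i - w0 - w1 * x_i:
# df/dw0 = -2 * mean(r), df/dw1 = -2 * mean(r * x)
with torch.no_grad():
    residual = y - w[0] - w[1] * x
    analytic_grad = torch.stack([-2 * residual.mean(), -2 * (residual * x).mean()])

print(w.grad, analytic_grad)
```

A gradient descent step then subtracts `eta * w.grad` from `w` inside `torch.no_grad()` and resets `w.grad` before the next iteration.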

In [None]:
def mse_gd(features, targets, eta=1e-1, max_iter=1000):
    # BEGIN SOLUTION
    # Randomly initialize weights from standard normal distribution
    # weights[0] is intercept, weights[1] is slope
    weights = torch.randn(2, requires_grad=True)

    for _ in range(max_iter):
        # Compute MSE loss: mean of squared residuals
        loss = torch.mean(torch.square(targets - weights[0] - weights[1] * features))

        # Compute gradients via backpropagation
        loss.backward()

        # Perform gradient descent step without tracking gradients
        with torch.no_grad():
            weights -= eta * weights.grad

        # Reset gradients for next iteration
        weights.grad.zero_()

    return weights
    # END SOLUTION

In [None]:
# Test assertions
torch.manual_seed(10)
X_test_gd = torch.rand(60)
y_test_gd = 2.0 - 1.4 * X_test_gd + torch.randn(60) * 0.3

weights_result = mse_gd(X_test_gd, y_test_gd)
weights_np = weights_result.detach().numpy()

assert weights_result.shape == (2,), f"Expected shape (2,), got {weights_result.shape}"
assert abs(weights_np[0] - 2.0) < 0.3, f"Intercept should be close to 2.0, got {weights_np[0]:.4f}"
assert abs(weights_np[1] - (-1.4)) < 0.3, f"Slope should be close to -1.4, got {weights_np[1]:.4f}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
torch.manual_seed(42)
X_hidden = torch.rand(100)
y_hidden = 3.0 + 2.0 * X_hidden + torch.randn(100) * 0.1
weights_hidden = mse_gd(X_hidden, y_hidden)
w_hidden_np = weights_hidden.detach().numpy()
assert abs(w_hidden_np[0] - 3.0) < 0.2, "Intercept estimation failed on hidden test"
assert abs(w_hidden_np[1] - 2.0) < 0.2, "Slope estimation failed on hidden test"
# END HIDDEN TESTS

Run the next cell to check your implementation.

In [None]:
# NO NEED TO EDIT THIS CELL
torch.manual_seed(10)
X = torch.rand(60)
y = 2.0 - 1.4 * X + torch.randn(60) * 0.3

w = mse_gd(X, y)
w = w.detach().numpy()  # need to detach tensors that require grad before calling numpy

plt.scatter(X, y)
plt.plot([0, 1], [2, 0.6], color="red", linestyle="dashed")  # true line in red
plt.plot([0, 1], [w[0], w[0] + w[1]], color="blue", linestyle="dashed")  # estimated line in blue
plt.show()



---
**Tips and Tricks for Training Your NNs**

*   **Start Small**: Models with fewer trainable parameters have less opportunity to overfit.
*   **Try a Wide Variety of Settings**: There are so many activation functions, widths, depths, network types, and regularization techniques at your disposal that you sometimes need to experiment with many combinations to get a sense of what works and what doesn't.
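
To make "start small" concrete, one can count a model's trainable parameters before training it. The helper name `count_trainable_parameters` and the layer sizes below are assumptions chosen for illustration.

```python
import torch.nn as nn


def count_trainable_parameters(model):
    """Hypothetical helper: total number of trainable scalars in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Assumed shapes for illustration: 31 inputs, 2 output classes
small = nn.Sequential(nn.Linear(31, 32), nn.ReLU(), nn.Linear(32, 2))
large = nn.Sequential(
    nn.Linear(31, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 2),
)

# small: (31*32 + 32) + (32*2 + 2) = 1090 parameters
print(count_trainable_parameters(small))
print(count_trainable_parameters(large))
```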




---



### Problem 3: Working with DataLoaders

To work effectively with PyTorch, we use the `DataLoader` class, which provides:

1. Automatic Batching
2. Sample Shuffling
3. Parallel Data Loading
4. Parallel Data Transformations

Using the Higgs dataset loaded below, perform the following operations:

1. Split the data into train and test splits (80/20) using stratification.
2. Split the resulting train split again into train and validation splits (80/20).
3. Standard scale the training set and use the **same** scaling for both the validation and test sets.
4. Convert each split into a `TensorDataset`.
5. Create a `DataLoader` object for each split with an appropriate batch size (use a power of 2, such as 512).

**Hint:** You should use a single `StandardScaler` object, fitting only on training data.
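
To see what automatic batching does, here is a minimal sketch on synthetic stand-in data (not the Higgs dataset). The sample count, feature count, and batch size are arbitrary assumptions.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic stand-in data: 1,000 samples with 4 features and binary labels
features = torch.randn(1000, 4)
labels = torch.randint(0, 2, (1000,))

dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=256, shuffle=True)

# len(loader) counts batches, not samples: ceil(1000 / 256) = 4
print(len(loader), math.ceil(len(dataset) / 256))

# Every batch is full except the last one (1000 - 3 * 256 = 232)
batch_sizes = [batch_X.shape[0] for batch_X, batch_y in loader]
print(batch_sizes)
```

Because `drop_last` defaults to `False`, the final partial batch is kept, so every sample is seen once per epoch.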

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# NO NEED TO CHANGE THIS CELL

NUM_SAMPLES = 50_000

df = pd.read_csv("data/higgs.csv", index_col="EventId")
df = df.sample(n=NUM_SAMPLES, random_state=42)

# Last column is the label ("s" marks a Higgs signal event); the rest are features
df.columns = [f"feature_{i}" for i in range(1, df.shape[1])] + ["is_higgs_signal"]

df["is_higgs_signal"] = (df["is_higgs_signal"] == "s").astype(int)

# Split into features and target
X = df.drop("is_higgs_signal", axis=1)
y = df["is_higgs_signal"]

X.shape, y.shape

In [None]:
# NO NEED TO CHANGE THIS CELL
X.head()

In [None]:
# NO NEED TO CHANGE THIS CELL
y.head()

In [None]:
# BEGIN SOLUTION
# Split data into training and testing sets (80/20) with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
)

# Split training data into train and validation sets (80/20) with stratification
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, shuffle=True, stratify=y_train
)

# Scale the data using a single scaler fit on training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_validation = scaler.transform(X_validation)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors (features as float32, labels as long for CrossEntropyLoss)
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train.values, dtype=torch.long)
X_validation = torch.tensor(X_validation, dtype=torch.float32)
y_validation = torch.tensor(y_validation.values, dtype=torch.long)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test.values, dtype=torch.long)

# Create TensorDataset for each split
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
validation_dataset = torch.utils.data.TensorDataset(X_validation, y_validation)
test_dataset = torch.utils.data.TensorDataset(X_test, y_test)

# Create DataLoaders with batch size of 512 (power of 2 for GPU efficiency)
# Only the training loader is shuffled; a fixed order keeps evaluation deterministic
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=512, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_dataset, batch_size=512, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=512, shuffle=False)
# END SOLUTION

In [None]:
# Test assertions
# Verify data split sizes:
# 50000 * 0.8 * 0.8 = 32000 train, 50000 * 0.8 * 0.2 = 8000 val, 50000 * 0.2 = 10000 test
assert len(train_dataset) == 32000, f"Expected 32000 training samples, got {len(train_dataset)}"
assert (
    len(validation_dataset) == 8000
), f"Expected 8000 validation samples, got {len(validation_dataset)}"
assert len(test_dataset) == 10000, f"Expected 10000 test samples, got {len(test_dataset)}"

# Verify tensor types
assert X_train.dtype == torch.float32, "X_train should be float32"
assert y_train.dtype == torch.long, "y_train should be long"

# Verify DataLoader batch size
assert train_loader.batch_size == 512, f"Expected batch size 512, got {train_loader.batch_size}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify scaling was applied (training data should have mean close to 0)
assert abs(X_train.mean().item()) < 0.1, "Training data should be approximately zero-centered"
# Verify feature dimensions are preserved
assert X_train.shape[1] == 31, "Should have 31 features"
# Verify labels are binary
assert set(y_train.unique().tolist()) == {0, 1}, "Labels should be binary (0, 1)"
# END HIDDEN TESTS

In [None]:
len(validation_loader)

### Problem 4: Training a Model

We have provided a commented `train_model` function for you. Review the function to understand each argument and step.

Then, create a small neural network with one hidden layer of width 32 and train it for 100 epochs using the provided training function.

In [None]:
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=100, patience=0):
    """
    model (torch.nn.Module): The neural network
    train_loader (torch.utils.data.DataLoader): Training data
    val_loader (torch.utils.data.DataLoader): Validation data
    criterion (torch.nn Loss function): The loss function
    optimizer (torch.optim): The optimizer used to make gradient steps
    num_epochs (int): The number of training epochs to perform
    patience (int): Number of epochs for early stopping. If 0, early stopping is not used.
    """

    # initialize lists to keep track of training and validation losses
    validation_losses = []
    training_losses = []

    min_validation_loss = float("inf")
    patience_counter = 0

    # for each epoch...
    for epoch in range(num_epochs):
        # set the model to train mode
        model.train()
        train_loss = 0
        for batch_X, batch_y in train_loader:
            # new iteration so we set gradients to zero
            optimizer.zero_grad()
            # get outputs from the model (forward pass)
            outputs = model(batch_X)
            # get the loss using the outputs and the truth
            loss = criterion(outputs, batch_y)
            # compute gradients of the loss with respect to model parameters (backward pass)
            loss.backward()
            # take a step with optimizer to update model parameters using computed gradients
            optimizer.step()
            # add to the epoch's running training loss
            train_loss += loss.item()

        # get the mean training loss
        train_loss /= len(train_loader)
        # append training loss observation to training loss list
        training_losses.append(train_loss)

        # we want to evaluate model losses on validation set so we set model
        # to eval mode here
        model.eval()
        val_loss = 0
        # we also don't want to do backpropagation while not training so we
        # turn off gradient
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                # val forward pass
                outputs = model(batch_X)
                # val loss function
                loss = criterion(outputs, batch_y)
                # add to total running validation loss
                val_loss += loss.item()

        # get the mean validation loss
        val_loss /= len(val_loader)
        # append validation loss observation to validation loss list
        # (before the early-stopping check, so both loss lists stay the same length)
        validation_losses.append(val_loss)
        if patience:
            if val_loss < min_validation_loss:
                min_validation_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    print(f"Early stopping at epoch {epoch + 1}")
                    break
        # once in a while...
        if epoch % 10 == 0:
            # Print epoch progress, training loss, and validation loss
            print(
                f"Epoch {epoch + 1}/{num_epochs}, "
                f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}"
            )

    return training_losses, validation_losses, model

In [None]:
# BEGIN SOLUTION
# Create a simple neural network with one hidden layer of width 32
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

# Use CrossEntropyLoss for classification
criterion = nn.CrossEntropyLoss()

# Use Adam optimizer with learning rate 0.001
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train for 100 epochs
training_losses, validation_losses, trained_model = train_model(
    model, train_loader, validation_loader, criterion, optimizer, num_epochs=100
)
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(model, nn.Module), "model should be an nn.Module"
assert isinstance(criterion, nn.Module), "criterion should be a loss function"
assert isinstance(optimizer, optim.Optimizer), "optimizer should be an Optimizer"
assert len(training_losses) > 0, "training_losses should have entries after training"
assert len(validation_losses) > 0, "validation_losses should have entries after training"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(training_losses) == 100, "Should train for 100 epochs"
assert all(loss >= 0 for loss in training_losses), "Training losses should be non-negative"
# END HIDDEN TESTS

In [None]:
# CHECK HOW THE MODEL DID
# NO CHANGES NECESSARY HERE
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

## Part 5: Comparing Model Architectures

We will observe how the loss trajectories evolve as we increase the size of the models. As we increase the model's size, a divergence between the training and the validation loss will appear. In very large models, validation loss will increase despite the falling training loss.

### Problem 5a: Model Building Function

Write a function called `build_model` that takes in the following arguments:

1. `n_hidden` (int): The number of hidden layers to use.
2. `n_neurons` (int): The width of each hidden layer.
3. `activation` (str): A string indicating the activation function to use (e.g., 'relu', 'tanh', 'sigmoid').
4. `input_dim` (int): The input dimension of your data.
5. `output_dim` (int): The output dimension of your data.
6. `include_dropout` (bool or list): Whether to include dropout layers. If a list, use the dropout rates per layer.

The function should return a neural network (`nn.Sequential`) with these properties, ready to be trained with your `train_model` function.

In [None]:
def build_model(
    n_hidden=1, n_neurons=30, activation="relu", input_dim=0, output_dim=0, include_dropout=False
):
    # BEGIN SOLUTION
    # Map activation names to PyTorch activation classes
    activation_map = {
        "relu": nn.ReLU,
        "tanh": nn.Tanh,
        "sigmoid": nn.Sigmoid,
        "leaky_relu": nn.LeakyReLU,
    }

    # Get activation function class, default to ReLU if not found
    activation_fn = activation_map.get(activation.lower(), nn.ReLU)

    layers = []

    # Build hidden layers
    for layer_idx in range(n_hidden):
        # First layer takes input_dim, subsequent layers take n_neurons
        in_features = input_dim if layer_idx == 0 else n_neurons
        layers.append(nn.Linear(in_features, n_neurons))
        layers.append(activation_fn())

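        # Add dropout if specified
        if include_dropout is True:
            # A bare True uses nn.Dropout's default rate (p=0.5)
            layers.append(nn.Dropout())
        elif include_dropout:
            # A list gives one dropout rate per hidden layer
            layers.append(nn.Dropout(include_dropout[layer_idx]))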

    # Add output layer (fed directly by the input when there are no hidden layers)
    layers.append(nn.Linear(n_neurons if n_hidden > 0 else input_dim, output_dim))

    return nn.Sequential(*layers)
    # END SOLUTION

In [None]:
# Test assertions
test_model = build_model(n_hidden=2, n_neurons=16, activation="relu", input_dim=10, output_dim=2)

# Check that model is nn.Sequential
assert isinstance(test_model, nn.Sequential), "Model should be an nn.Sequential"

# Check number of layers (2 hidden layers * 2 (linear + activation) + 1 output = 5)
assert len(test_model) == 5, f"Expected 5 layers, got {len(test_model)}"

# Check input/output dimensions
test_input = torch.randn(4, 10)
test_output = test_model(test_input)
assert test_output.shape == (4, 2), f"Expected output shape (4, 2), got {test_output.shape}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test with dropout
dropout_model = build_model(
    n_hidden=3,
    n_neurons=32,
    activation="tanh",
    input_dim=20,
    output_dim=5,
    include_dropout=[0.2, 0.3, 0.4],
)
assert (
    len(dropout_model) == 10
), "Model with dropout should have 10 layers (3*(linear+activation+dropout) + output)"

# Verify activation type
assert isinstance(dropout_model[1], nn.Tanh), "Should use Tanh activation when specified"

# Test forward pass with dropout model
dropout_input = torch.randn(8, 20)
dropout_output = dropout_model(dropout_input)
assert dropout_output.shape == (8, 5), "Dropout model output shape incorrect"
# END HIDDEN TESTS

### Problem 5b: Evaluating Neural Network Architectures

Using your `build_model` method, try out at least 2 depths, 2 widths, and 2 activations (8 total models). Record the final validation loss of each model using a dictionary. Print out the test loss of the best performing model, and in the markdown cell below, describe which model performed best and report its test loss.

In [None]:
# BEGIN SOLUTION
# Architecture parameters to test
num_layers_list = [3, 5]
num_neurons_list = [16, 32]
activations_list = ["relu", "tanh"]

# Dictionary to store performance results
model_performance = {}
lowest_val_loss = float("inf")
best_model = None

# Try all combinations of architectures
for n_hidden in num_layers_list:
    for n_neurons in num_neurons_list:
        for activation in activations_list:
            print(f"Training: {n_hidden} layers, {n_neurons} neurons, {activation} activation")

            # The model stays on the CPU because train_model and the DataLoaders
            # never move batches to another device
            model = build_model(
                n_hidden, n_neurons, activation, input_dim=X_train.shape[1], output_dim=2
            )
            criterion = nn.CrossEntropyLoss()
            optimizer = optim.Adam(model.parameters(), lr=0.001)
            train_losses, val_losses, trained_model = train_model(
                model, train_loader, validation_loader, criterion, optimizer, num_epochs=100
            )

            # Record performance
            model_performance[(n_hidden, n_neurons, activation)] = val_losses[-1]
            if val_losses[-1] < lowest_val_loss:
                lowest_val_loss = val_losses[-1]
                best_model = model

            print(f"Final Val Loss: {val_losses[-1]:.6f}")
# END SOLUTION

In [None]:
# BEGIN SOLUTION
# Find best model configuration
best_config = min(model_performance, key=model_performance.get)
print(f"Best model configuration: {best_config}")

# Evaluate best model on test set (eval mode, no gradient tracking)
criterion = nn.CrossEntropyLoss()
best_model.eval()
with torch.no_grad():
    test_loss = criterion(best_model(X_test), y_test)
print(f"Test loss: {test_loss.item():.6f}")
# END SOLUTION

> BEGIN SOLUTION

My best performing model was a neural network with 5 hidden layers, 16 neurons per layer, and ReLU activation. It achieved a test loss of approximately 0.0058.
> END SOLUTION


## Part 6: Adding Regularization

### Problem 6a: Weight Decay

Pick a deep model of your choosing (using `build_model`) and train it with an optimizer that includes weight decay (L2 regularization).



---


You may be familiar with the principle of Occam's Razor: given two explanations for something, the one most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. The same applies to the models learned by neural networks: given two networks that achieve similar performance, prefer the simpler one, since it carries less risk of overfitting. One way to favor simpler models is weight regularization, which adds a cost for large weights to the loss function. This cost comes in two flavors:

* [L1 regularization](https://developers.google.com/machine-learning/glossary/#L1_regularization), where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights). L1 regularization produces sparse models because shrinking a weight from 0.1 to 0 reduces the penalty by exactly as much as shrinking it from 10.1 to 10.0, so small weights are pushed all the way to zero.

* [L2 regularization](https://developers.google.com/machine-learning/glossary/#L2_regularization), where the cost added is proportional to the square of the weight coefficients (i.e. to what is called the squared "L2 norm" of the weights). L2 regularization is also called **weight decay** in the context of neural networks.
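
The difference between the two penalties shows up in a little arithmetic. The sketch below, with hypothetical helper names `l1_penalty` and `l2_penalty`, compares how much each penalty falls when a single weight shrinks by 0.1.

```python
def l1_penalty(w):
    return abs(w)


def l2_penalty(w):
    return w**2


# Penalty reduction from shrinking a single weight by 0.1
drops = {}
for before, after in [(0.1, 0.0), (10.1, 10.0)]:
    l1_drop = l1_penalty(before) - l1_penalty(after)
    l2_drop = l2_penalty(before) - l2_penalty(after)
    drops[(before, after)] = (l1_drop, l2_drop)
    print(f"{before} -> {after}: L1 falls by {l1_drop:.2f}, L2 falls by {l2_drop:.2f}")
```

The L1 reduction is 0.10 in both cases, while the L2 reduction is 0.01 for the small weight and 2.01 for the large one. L2 therefore has little incentive to push an already small weight to exactly zero, whereas L1 does.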

Here, we apply L2 regularization using the `weight_decay` feature of PyTorch.
Feel free to read the [`torch.optim.Adam` documentation](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) and adjust the `weight_decay` parameter to reduce overfitting in your model.
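
To see why `weight_decay` corresponds to an L2 penalty, the sketch below takes one optimizer step two ways. It deliberately uses plain `torch.optim.SGD` rather than Adam so that the arithmetic is easy to follow. Adam adds the same `weight_decay * w` term to the gradient, but then rescales it adaptively. The toy loss `data_loss` and all values are assumptions.

```python
import torch

lr, weight_decay = 0.1, 0.01

# Two copies of the same parameter vector
w_builtin = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
w_manual = w_builtin.detach().clone().requires_grad_(True)


def data_loss(w):
    # Toy loss standing in for the network's training loss
    return ((w - 1.0) ** 2).sum()


# (a) Let the optimizer apply weight decay
optimizer = torch.optim.SGD([w_builtin], lr=lr, weight_decay=weight_decay)
optimizer.zero_grad()
data_loss(w_builtin).backward()
optimizer.step()

# (b) Add the L2 penalty (weight_decay / 2) * ||w||^2 to the loss by hand
penalized = data_loss(w_manual) + 0.5 * weight_decay * (w_manual**2).sum()
penalized.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

print(w_builtin.data, w_manual.data)
```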


---




In [None]:
# BEGIN SOLUTION
# Build a deep model
model = build_model(
    n_hidden=5, n_neurons=64, activation="relu", input_dim=X_train.shape[1], output_dim=2
)

# Use CrossEntropyLoss and Adam optimizer with weight decay
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-6)
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(model, nn.Module), "model should be an nn.Module"
assert isinstance(optimizer, optim.Adam), "optimizer should be Adam"
assert optimizer.defaults.get("weight_decay", 0) > 0, "optimizer should have weight_decay > 0"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(model, "forward"), "model should have forward method"
# END HIDDEN TESTS

In [None]:
training_losses, validation_losses, trained_model = train_model(
    model, train_loader, validation_loader, criterion, optimizer, num_epochs=200
)

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

### Problem 6b: Dropout

Your `build_model` function already supports dropout via the `include_dropout` parameter. Use it to build a model with dropout layers, specifying dropout rates that increase with depth (lower near input, higher in deeper layers).

In [None]:
# BEGIN SOLUTION
# Define dropout rates that increase with depth
n_hidden = 5
dropout_rates = [0.2, 0.2, 0.3, 0.4, 0.5]

# Build model with dropout
model = build_model(
    n_hidden=n_hidden,
    n_neurons=64,
    activation="relu",
    input_dim=X_train.shape[1],
    output_dim=2,
    include_dropout=dropout_rates,
)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(model, nn.Module), "model should be an nn.Module"
# Check that model contains dropout layers
has_dropout = any(isinstance(layer, nn.Dropout) for layer in model.modules())
assert has_dropout, "model should contain Dropout layers"
print("All tests passed!")

# BEGIN HIDDEN TESTS
dropout_layers = [layer for layer in model.modules() if isinstance(layer, nn.Dropout)]
assert len(dropout_layers) == n_hidden, f"Should have {n_hidden} dropout layers"
# END HIDDEN TESTS

In [None]:
training_losses, validation_losses, trained_model = train_model(
    model, train_loader, validation_loader, criterion, optimizer, num_epochs=200
)

In [None]:
# Test assertions
assert isinstance(model, nn.Module), "model should be an nn.Module"
assert isinstance(criterion, nn.Module), "criterion should be a loss function"
assert isinstance(optimizer, optim.Optimizer), "optimizer should be an Optimizer"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(model, "forward"), "model should have forward method"
# END HIDDEN TESTS

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()
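
The dropout layers from Problem 6b behave differently in training and evaluation mode, which is why `train_model` switches between `model.train()` and `model.eval()`. The sketch below illustrates this on a standalone `nn.Dropout` layer with an assumed rate of 0.5 and an all-ones input.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Train mode: each entry is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)
print(train_out.unique())  # only the values 0 and 2 appear
print((train_out == 0).float().mean().item())  # close to 0.5

# Eval mode: dropout is the identity
dropout.eval()
eval_out = dropout(x)
print(torch.equal(eval_out, x))
```

The rescaling in train mode keeps the expected activation unchanged, so no correction is needed at evaluation time.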

### Problem 6c: Early Stopping

The `train_model` function already includes early stopping via the `patience` parameter. Demonstrate early stopping by training a model with `patience=5`.
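
The patience rule can be traced on a plain list of validation losses. The sketch below mirrors the counter logic in `train_model`. The helper name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Hypothetical helper: 1-based epoch at which training would stop, or None."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            # New best validation loss: reset the counter
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: best value at epoch 3, then no improvement
losses = [0.90, 0.70, 0.60, 0.61, 0.65, 0.62, 0.66, 0.70]
print(early_stopping_epoch(losses, patience=3))  # stops at epoch 6
print(early_stopping_epoch(losses, patience=10))  # never stops: None
```

Note that the counter only resets when the loss beats the best value seen so far, so the dip at epoch 6 (0.62) does not count as an improvement.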

In [None]:
# BEGIN SOLUTION
# Build a model without dropout or weight decay to demonstrate early stopping
model = build_model(
    n_hidden=5, n_neurons=64, activation="relu", input_dim=X_train.shape[1], output_dim=2
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(model, nn.Module), "model should be an nn.Module"
assert isinstance(optimizer, optim.Adam), "optimizer should be Adam"
# Early stopping is configured through `patience` in train_model, so this model
# needs neither weight decay nor dropout
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(model, "forward"), "model should have forward method"
# END HIDDEN TESTS

In [None]:
# BEGIN SOLUTION
# Train with early stopping (patience=5)
training_losses, validation_losses, trained_model = train_model(
    model, train_loader, validation_loader, criterion, optimizer, num_epochs=200, patience=5
)
# END SOLUTION

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

### Problem 6d: Putting It All Together

Choose any combination of at least 2 regularization strategies (weight decay, dropout, early stopping) and train a model for up to 500 epochs. The model should stop early if using early stopping.

In [None]:
# BEGIN SOLUTION
# Combine weight decay and early stopping
model = build_model(
    n_hidden=5, n_neurons=64, activation="relu", input_dim=X_train.shape[1], output_dim=2
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-6)
# END SOLUTION

In [None]:
# BEGIN SOLUTION
# Train with early stopping and weight decay
training_losses, validation_losses, trained_model = train_model(
    model, train_loader, validation_loader, criterion, optimizer, num_epochs=500, patience=10
)
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(model, nn.Module), "model should be an nn.Module"
assert isinstance(optimizer, optim.Adam), "optimizer should be Adam"
# Training should have used early stopping
assert len(training_losses) < 500, "Training should have stopped early with patience"
# At least one regularization technique should be used
uses_weight_decay = optimizer.defaults.get("weight_decay", 0) > 0
has_dropout = any(isinstance(layer, nn.Dropout) for layer in model.modules())
assert uses_weight_decay or has_dropout, "Should use at least one regularization technique"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(model, "forward"), "model should have forward method"
# END HIDDEN TESTS

In [None]:
plt.plot(training_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
validation_losses