# DATASCI 315, Homework 6: Training Models with PyTorch

In this homework, you will use PyTorch to train models. There are 3 problems:
1. Training a support vector machine (SVM) using PyTorch for simple synthetic data.
2. Training a shallow neural network for regression on California housing data.
3. Training a deep neural network for classification on digits data.

In [None]:
import matplotlib.pyplot as plt
import torch

## Part 1: Support Vector Machine

This problem will test your understanding of using `torch` tensors and automatic differentiation in `torch`.

### Problem 1a: Generate Data

Generate $n=100$ sample points $(x_i, y_i), i=1,\dots,n$ where $x_i\in \mathbb R^2$ is drawn from the standard bivariate normal distribution. For each such $x_i$, generate $y_i\in \{-1,1\}$ such that

$$y_i = \begin{cases}
 -1 & x_{i1} > \frac{3}{4} x_{i2} \\
 +1 & x_{i1} \leq \frac{3}{4} x_{i2}.
\end{cases}$$

Generate this without using `for`/`while` loops, and without using `numpy`: your code should use PyTorch vectorization. `features` should be of shape $(n,2)$ and `labels` should be of shape $(n,)$.

Also, create a scatter plot showing each training sample $x_i$ colored by its response $y_i$.

Note: The class labels here are $-1$ and $1$, not $0$ and $1$. The transformation $y\mapsto 2y-1$ can be used to map 0/1 labels to -1/1 labels.

In [None]:
torch.manual_seed(10)

n = 100
# BEGIN SOLUTION
features = torch.randn(n, 2)
labels = 2 * (features[:, 0] <= 0.75 * features[:, 1]).int() - 1

# Create scatter plot
plt.figure(figsize=(4, 3))
plt.scatter(features[:, 0], features[:, 1], c=labels)
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert features.shape == (
    100,
    2,
), f"features should have shape (100, 2), got {features.shape}"
assert labels.shape == (100,), f"labels should have shape (100,), got {labels.shape}"
assert set(labels.unique().tolist()) == {-1, 1}, "labels should only contain -1 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert features.dtype == torch.float32, "features should be float32"
assert (labels == 1).sum() + (labels == -1).sum() == 100, "All labels should be -1 or 1"
# Check the labeling rule: y_i = 1 iff x_i1 <= 0.75 * x_i2
expected_labels = 2 * (features[:, 0] <= 0.75 * features[:, 1]).int() - 1
assert torch.all(labels == expected_labels), "Labels do not follow the specified rule"
# END HIDDEN TESTS

### Problem 1b: SVM Loss

A support vector machine (SVM) is fitted using *hinge loss,* which is given by

$$L(Y, \widehat{Y}) = \sum_{i=1}^n \max \{0, 1 - y_i \widehat{y}_i\}.$$

For a linear SVM, the prediction is given as

$$\widehat{y}_i = w_0 + w_1 x_{i1} + w_2 x_{i2}$$

Using PyTorch, compute the hinge loss for a linear SVM. Your function should take the parameters $w=(w_0,w_1,w_2)$ and the data `features`, `labels` as input and return $L(Y, \widehat{Y})$.

**Hint:** Be consistent with your shapes for `labels` and $\widehat{y}$. The parameter $w$ is provided as a $(3,)$ shaped tensor (not a $(3,1)$ tensor). Note `features` is passed as a tensor of shape $(n,2)$.

**Hint:** Use `torch.maximum` or `torch.clamp` to compute the max with zero.
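
As a quick sanity check on the formula above, here is a minimal sketch that evaluates the hinge loss on three hand-picked points. The tensors `toy_labels` and `toy_scores` are made up for illustration and are not part of the assignment data.

```python
import torch

# Hypothetical toy values, chosen so the loss can be checked by hand
toy_labels = torch.tensor([1.0, -1.0, 1.0])
toy_scores = torch.tensor([2.0, 0.5, -1.0])

# 1 - y * y_hat gives [-1.0, 1.5, 2.0]; clamping at zero gives [0.0, 1.5, 2.0]
toy_margins = 1 - toy_labels * toy_scores
toy_hinge = torch.clamp(toy_margins, min=0).sum()
print(toy_hinge.item())  # 3.5
```

Only the second and third points contribute here: the first is on the correct side with margin at least 1, so its term is clamped to zero.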

In [None]:
def svm_loss(weights, features, labels):
    # BEGIN SOLUTION
    # Prepend column of ones to features for the bias term
    ones_column = torch.ones((features.shape[0], 1))
    features_with_bias = torch.cat((ones_column, features), dim=1)
    predictions = features_with_bias @ weights
    return torch.sum(torch.maximum(1 - labels * predictions, torch.tensor(0.0)))
    # END SOLUTION

In [None]:
# Test assertions
w_test = torch.tensor([0.2, 0.5, -0.3])
loss_test = svm_loss(w_test, features, labels)
assert loss_test.shape == (), f"Loss should be a scalar, got shape {loss_test.shape}"
assert loss_test.item() > 0, "Loss should be positive for this test case"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test with good weights (loss should be relatively small)
w_good = torch.tensor([0.0, -0.8, 0.6])
loss_good = svm_loss(w_good, features, labels)
assert loss_good.item() < 50, "Loss with good weights should be relatively small"
# Test that loss is non-negative
assert loss_test.item() >= 0, "Hinge loss should be non-negative"
# Check that loss is in reasonable range for these weights
assert 100 < loss_test.item() < 200, f"Loss should be in range (100, 200), got {loss_test.item()}"
# END HIDDEN TESTS

### Problem 1c: Training via Gradient Descent

Write the following `train_svm` function, taking `features` and `labels` as input and using gradient descent to train the parameters (here just $w$). Use automatic differentiation to get gradients. The function also takes as input `eta` (learning rate / step size) and `epochs` (the number of iterations). There is no need to check for convergence. Return the final value of parameter $w$ and a list of the loss evaluated at each iteration.

Do not use `torch.optim` for this problem. Instead, implement gradient descent directly.

**Hint:** Set up the parameters to enable gradient computations with `requires_grad=True`.
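
The update pattern can be sketched on a much simpler problem first. The example below assumes a made-up one-dimensional objective $f(w) = (w-3)^2$ (not the SVM loss) and the hypothetical names `toy_w` and `toy_step`; it only illustrates the backward / update / zero-gradient cycle.

```python
import torch

# Hypothetical objective f(w) = (w - 3)^2, minimized at w = 3
toy_w = torch.tensor(0.0, requires_grad=True)
toy_step = 0.1

for _ in range(100):
    objective = (toy_w - 3) ** 2
    objective.backward()              # populates toy_w.grad with df/dw
    with torch.no_grad():             # the update itself must not be tracked
        toy_w -= toy_step * toy_w.grad
    toy_w.grad.zero_()                # gradients accumulate unless reset

print(toy_w.item())  # close to 3.0
```

Skipping the `zero_()` call would add each new gradient to the previous ones, so the steps would no longer follow the gradient of the current loss.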

In [None]:
def train_svm(features, labels, eta=1e-1, epochs=1000):
    _n, d = features.shape

    # BEGIN SOLUTION
    # Initialize weights randomly with gradient tracking
    weights = torch.randn(d + 1, requires_grad=True)
    losses = []

    # Training loop
    for _i in range(epochs):
        # Compute loss
        loss = svm_loss(weights, features, labels)
        # Backward pass
        loss.backward()
        # Gradient descent update (no gradient tracking)
        with torch.no_grad():
            weights -= eta * weights.grad
            weights.grad.zero_()
        # Record loss
        losses.append(loss.detach().item())
    # END SOLUTION
    return weights, losses

In [None]:
# Test assertions
torch.manual_seed(42)
w_trained, losses = train_svm(features, labels)
assert len(losses) == 1000, f"Should have 1000 loss values, got {len(losses)}"
assert losses[-1] < losses[0], "Loss should decrease during training"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert w_trained.shape == (3,), f"w should have shape (3,), got {w_trained.shape}"
assert w_trained.requires_grad, "w should have requires_grad=True"
# Check that final loss is reasonably small
assert losses[-1] < 10, f"Final loss should be small, got {losses[-1]}"
# Check that the learned weights approximate the true boundary
w_norm = w_trained.detach() / torch.linalg.norm(w_trained.detach())
# The true boundary is x1 = 0.75 * x2, so w should be proportional to [0, -0.8, 0.6]
assert abs(w_norm[1].item() + 0.8) < 0.2, "w1 should be close to -0.8 (normalized)"
# END HIDDEN TESTS

In [None]:
# Verify trained weights
w_trained, losses = train_svm(features, labels)
print(f"first loss = {losses[0]:.3f} and last loss = {losses[-1]:.3f}")
w_detached = w_trained.detach()
print(f"optimal w (normalized) = {w_detached / torch.linalg.norm(w_detached)}")

w_true = torch.tensor([0.0, -4.0, 3.0])
print(f"original w (normalized) = {w_true / torch.linalg.norm(w_true)}")

## Part 2: Shallow Network Regression

In this problem, you will train a shallow neural network on the California housing dataset using gradient descent.

In [None]:
# Load the dataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

california_housing = fetch_california_housing(return_X_y=True, as_frame=True)
housing_features = california_housing[0]
housing_prices = california_housing[1]
train_features_unscaled, test_features_unscaled, train_prices, test_prices = train_test_split(
    housing_features, housing_prices, test_size=0.3, random_state=42
)
sc = StandardScaler()
train_features_scaled = sc.fit_transform(train_features_unscaled)
test_features_scaled = sc.transform(test_features_unscaled)

In [None]:
# Convert numpy arrays to tensors
train_features = torch.tensor(train_features_scaled, dtype=torch.float32)
test_features = torch.tensor(test_features_scaled, dtype=torch.float32)
train_prices = torch.tensor(train_prices.values, dtype=torch.float32)
test_prices = torch.tensor(test_prices.values, dtype=torch.float32)

### Problem 2a: Sizes

Fill in the input dimension and output dimension of the shallow neural network for regression on this dataset. For this homework, we will choose a hidden layer with dimension double that of the input dimension.

In [None]:
# BEGIN SOLUTION
input_dim = train_features.shape[1]
hidden_dim = 2 * input_dim
output_dim = 1
# END SOLUTION

In [None]:
# Test assertions
assert input_dim == 8, f"Input dimension should be 8, got {input_dim}"
assert hidden_dim == 16, f"Hidden dimension should be 16, got {hidden_dim}"
assert output_dim == 1, f"Output dimension should be 1 for regression, got {output_dim}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert input_dim == train_features.shape[1], "input_dim should match number of features"
assert hidden_dim == 2 * input_dim, "hidden_dim should be 2 * input_dim"
# END HIDDEN TESTS

### Problem 2b: Initialization

Define `torch` tensors for weights `W1`, `b1`, `W2`, and `b2`. Initialize both biases `b1` and `b2` to be zero vectors and initialize `W1` and `W2` by picking values uniformly at random from the interval $[0,1]$.

Note: All variables should enable gradient computations.

In [None]:
# BEGIN SOLUTION
W1 = torch.rand((input_dim, hidden_dim), requires_grad=True)
W2 = torch.rand((hidden_dim, output_dim), requires_grad=True)
b1 = torch.zeros(hidden_dim, requires_grad=True)
b2 = torch.zeros(output_dim, requires_grad=True)
# END SOLUTION

In [None]:
# Test assertions
assert W1.shape == (
    input_dim,
    hidden_dim,
), f"W1 shape should be ({input_dim}, {hidden_dim})"
assert W2.shape == (
    hidden_dim,
    output_dim,
), f"W2 shape should be ({hidden_dim}, {output_dim})"
assert b1.shape == (hidden_dim,), f"b1 shape should be ({hidden_dim},)"
assert b2.shape == (output_dim,), f"b2 shape should be ({output_dim},)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert W1.requires_grad, "W1 should have requires_grad=True"
assert W2.requires_grad, "W2 should have requires_grad=True"
assert b1.requires_grad, "b1 should have requires_grad=True"
assert b2.requires_grad, "b2 should have requires_grad=True"
assert torch.all(b1 == 0), "b1 should be initialized to zeros"
assert torch.all(b2 == 0), "b2 should be initialized to zeros"
assert torch.all((W1 >= 0) & (W1 <= 1)), "W1 values should be in [0, 1]"
assert torch.all((W2 >= 0) & (W2 <= 1)), "W2 values should be in [0, 1]"
# END HIDDEN TESTS

### Problem 2c: Model and Loss

(i) Complete the `model` function below to define a shallow network. The `inputs` parameter is a matrix of shape $n \times d$, where the $i$th row contains $x_i^\intercal$. It should return a `torch` tensor of shape $(n,)$ containing the output of the network.

**Hint:** Use `torch.nn.ReLU()` for activation.

(ii) Write a function that computes the mean-squared error given the predictions and targets. Use only `torch` operations so that automatic differentiation works.
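
One shape pitfall is worth checking when a prediction of shape $(n,1)$ meets a target of shape $(n,)$: subtraction broadcasts to an $(n,n)$ matrix instead of raising an error. The sketch below uses made-up tensors (`column_preds`, `targets_1d`) to show the effect on the mean-squared error.

```python
import torch

column_preds = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1)
targets_1d = torch.tensor([1.0, 2.0, 3.0])           # shape (3,)

# Broadcasting silently produces a (3, 3) matrix of all pairwise differences
wrong_diff = column_preds - targets_1d
# Squeezing the trailing dimension gives the intended elementwise difference
right_diff = column_preds.squeeze(-1) - targets_1d

wrong_mse = torch.mean(wrong_diff**2).item()
right_mse = torch.mean(right_diff**2).item()
print(wrong_diff.shape, right_diff.shape)
print(wrong_mse, right_mse)  # the predictions are perfect, yet wrong_mse is 4/3
```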

In [None]:
def model(inputs):
    # BEGIN SOLUTION
    hidden = torch.nn.ReLU()(inputs @ W1 + b1)
    # Drop the trailing singleton dimension so the output has shape (n,)
    return (hidden @ W2 + b2).squeeze(-1)
    # END SOLUTION


def mse(predictions, targets):
    # BEGIN SOLUTION
    sq_error = torch.square(predictions.squeeze() - targets)
    return torch.mean(sq_error)
    # END SOLUTION

In [None]:
# Test assertions
test_output = model(train_features[:10])
assert test_output.shape[0] == 10, f"Model output should have 10 rows, got {test_output.shape[0]}"
test_mse = mse(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 4.0]))
assert abs(test_mse.item() - 1 / 3) < 0.01, f"MSE calculation incorrect, got {test_mse.item()}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test model output shape
full_output = model(train_features)
assert full_output.shape[0] == train_features.shape[0], "Model should output one value per input"
# Test MSE with known values
mse_test = mse(torch.tensor([0.0, 0.0]), torch.tensor([1.0, 1.0]))
assert abs(mse_test.item() - 1.0) < 0.01, "MSE of [0,0] vs [1,1] should be 1.0"
# END HIDDEN TESTS

### Problem 2d: Training via Batches

For this problem, complete the following code for training the model via batches. We are not using `DataLoader` to create batches. Instead, we shuffle the training data at each epoch and go through it $B$ samples at a time (where $B=500$ is the batch size). Use the stochastic gradient descent optimizer from `torch.optim` to perform the optimization by filling in the following code chunk.
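
The shuffle-then-slice scheme described above can be sketched on toy data. The sizes and names below (`toy_n`, `toy_batch_size`, `toy_x`) are hypothetical and much smaller than the homework's, so the indexing is easy to follow.

```python
import torch

torch.manual_seed(0)
toy_n, toy_batch_size = 10, 4                 # hypothetical sizes
toy_x = torch.arange(toy_n, dtype=torch.float32).unsqueeze(1)  # shape (10, 1)

perm = torch.randperm(toy_n)                  # fresh permutation for this epoch
toy_x = toy_x[perm]

toy_num_batches = toy_n // toy_batch_size     # 2 full batches; the last 2 samples are skipped
seen = []
for b in range(toy_num_batches):
    batch = toy_x[b * toy_batch_size : (b + 1) * toy_batch_size]
    seen.extend(batch.squeeze(1).tolist())

print(len(seen), len(set(seen)))  # 8 8
```

With integer division, any samples left over after the last full batch are skipped in that epoch. Because the data are reshuffled every epoch, different samples are skipped each time.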

In [None]:
from torch import optim

learning_rate = 0.01
n_epochs = 100
batch_size = 500

# BEGIN SOLUTION
num_samples = train_features.shape[0]
num_batches = num_samples // batch_size  # how many iterations within each epoch

# Set up optimizer
optimizer = optim.SGD(params=[W1, b1, W2, b2], lr=learning_rate)
# END SOLUTION

for epoch in range(n_epochs):
    # Shuffle data
    random_shuffle_idx = torch.randperm(num_samples)
    train_features = train_features[random_shuffle_idx]
    train_prices = train_prices[random_shuffle_idx]

    # Training within epoch through batches
    # BEGIN SOLUTION
    for i in range(num_batches):
        batch_start = i * batch_size
        batch_end = (i + 1) * batch_size
        pred = model(train_features[batch_start:batch_end, :])
        loss = mse(pred, train_prices[batch_start:batch_end])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # END SOLUTION

    # Evaluation
    if epoch % 10 == 0:
        with torch.no_grad():
            loss = mse(model(train_features), train_prices)
            print(f"Loss at epoch {epoch}: {loss.item():.4f}")

In [None]:
# Test assertions
with torch.no_grad():
    final_train_loss = mse(model(train_features), train_prices).item()
assert (
    final_train_loss < 3.0
), f"Training loss should be below 3.0 after training, got {final_train_loss}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert final_train_loss < 2.0, f"Training loss should be below 2.0, got {final_train_loss}"
# END HIDDEN TESTS

### Problem 2e: Evaluation on Test Data

Compute the loss on the test data and print the value. Make sure no gradient computations are done here.
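
To see what disabling gradient computations does, the following minimal sketch compares the same computation inside and outside a `torch.no_grad()` block. The tensor `toy_param` is a made-up stand-in for a model parameter.

```python
import torch

toy_param = torch.ones(3, requires_grad=True)

tracked = (2 * toy_param).sum()          # part of the autograd graph
with torch.no_grad():
    untracked = (2 * toy_param).sum()    # same value, no graph recorded

print(tracked.requires_grad, untracked.requires_grad)  # True False
```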

In [None]:
# BEGIN SOLUTION
with torch.no_grad():
    predictions = model(test_features)
    test_loss = mse(predictions, test_prices)
    print(f"Loss on test data = {test_loss.detach().item():.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert test_loss.item() < 3.0, f"Test loss should be below 3.0, got {test_loss.item()}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert test_loss.item() < 2.0, f"Test loss should be below 2.0, got {test_loss.item()}"
assert test_loss.item() > 0.1, f"Test loss suspiciously low: {test_loss.item()}"
# END HIDDEN TESTS

## Part 3: Deep Network Classification

In this problem, you will train a network with two hidden layers using `torch` for classification on the `digits` dataset, which consists of handwritten digits (as $8\times 8$ grayscale images) and their labels (0–9).

In [None]:
import torch
from sklearn import datasets

# Load the digits dataset
digit_images, digit_labels = datasets.load_digits(return_X_y=True)
print(f"Number of samples = {len(digit_images)}")
print(f"Target classes = {torch.unique(torch.from_numpy(digit_labels))}")
digit_images.shape

In [None]:
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for i, ax in enumerate(axes):
    ax.set_axis_off()
    ax.imshow(digit_images[i].reshape(8, 8), cmap="gray")
    ax.set_title(f"Sample: {digit_labels[i]}")

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
digit_images = sc.fit_transform(digit_images)

### Problem 3a: DataLoader

Split the data into train and test sets using `sklearn`, holding out 30% of the data for testing. Set up a dataloader object for batch training on the training set. Use batch size $32$. Keep `shuffle=True` so that the training data are reshuffled at each epoch.

**Hint:** `DataLoader` takes a `torch` `TensorDataset` object, so you first need to create one. To do that, convert everything to `torch` tensors of the correct data types (for `nn.CrossEntropyLoss`, the target must have data type `torch.long`).
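
A minimal sketch of the `TensorDataset` / `DataLoader` pairing on made-up data may help. The names and sizes here (`toy_inputs`, `toy_targets`, 10 samples, batch size 4) are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_inputs = torch.randn(10, 4)        # float32 by default
toy_targets = torch.arange(10) % 3     # int64, i.e. torch.long

toy_dataset = TensorDataset(toy_inputs, toy_targets)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_inputs, batch_targets = next(iter(toy_loader))
print(len(toy_loader), batch_inputs.shape, batch_targets.dtype)
```

Here `len(toy_loader)` counts batches, not samples: 10 samples with batch size 4 give 3 batches, the last one holding only 2 samples.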

In [None]:
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# BEGIN SOLUTION
digit_images = torch.tensor(digit_images, dtype=torch.float32)
digit_labels = torch.tensor(digit_labels, dtype=torch.long)

train_images, test_images, train_labels, test_labels = train_test_split(
    digit_images, digit_labels, test_size=0.3, random_state=42
)

training_data = TensorDataset(train_images, train_labels)
train_dataloader = DataLoader(training_data, batch_size=32, shuffle=True)
# END SOLUTION

In [None]:
# Test assertions
assert len(train_images) == 1257, f"train_images should have 1257 samples, got {len(train_images)}"
assert len(test_images) == 540, f"test_images should have 540 samples, got {len(test_images)}"
assert (
    train_labels.dtype == torch.long
), f"train_labels should be torch.long, got {train_labels.dtype}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    train_images.dtype == torch.float32
), f"train_images should be float32, got {train_images.dtype}"
assert len(training_data) == len(train_images), "TensorDataset length should match train_images"
# Check dataloader batch size
first_batch = next(iter(train_dataloader))
assert first_batch[0].shape[0] == 32, f"Batch size should be 32, got {first_batch[0].shape[0]}"
# END HIDDEN TESTS

### Problem 3b: Setting up Model

Build a `torch` model using `nn.Sequential` by adding appropriate layers (Linear and activation). For this problem use ReLU activation. The first hidden layer should have $128$ units and the second should have $64$ units.

**Hint:** Are the images already flattened?

**Hint:** Set up the shapes correctly. The target has 10 classes.

**How many total parameters are there in the model?**
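
The count can be worked out by hand before building the model. The sketch below assumes fully connected layers with biases and the layer widths stated above (64 flattened pixels in, 128 and 64 hidden units, 10 classes out); `layer_sizes` is a hypothetical helper list.

```python
# Layer widths: 64 inputs, hidden layers of 128 and 64 units, 10 output classes
layer_sizes = [64, 128, 64, 10]

# Each linear layer contributes fan_in * fan_out weights plus fan_out biases
per_layer = [
    fan_in * fan_out + fan_out
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
param_count = sum(per_layer)
print(per_layer, param_count)  # [8320, 8256, 650] 17226
```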

In [None]:
from torch import nn

# BEGIN SOLUTION
model = nn.Sequential(
    nn.Linear(in_features=64, out_features=128),
    nn.ReLU(),
    nn.Linear(in_features=128, out_features=64),
    nn.ReLU(),
    nn.Linear(in_features=64, out_features=10),
)
# END SOLUTION

In [None]:
# BEGIN SOLUTION
total_params = sum(param.numel() for param in model.parameters())
# END SOLUTION
total_params

In [None]:
# Test assertions
assert total_params == 17226, f"Total parameters should be 17226, got {total_params}"
test_input = torch.randn(5, 64)
test_out = model(test_input)
assert test_out.shape == (
    5,
    10,
), f"Model output shape should be (5, 10), got {test_out.shape}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check model structure
assert len(list(model.children())) == 5, "Model should have 5 layers (3 Linear + 2 ReLU)"
# Check first linear layer
first_layer = next(iter(model.children()))
assert first_layer.in_features == 64, "First layer input should be 64"
assert first_layer.out_features == 128, "First layer output should be 128"
# END HIDDEN TESTS

### Problem 3c: Setting up Loss and Optimizer

Set up the appropriate loss function (from `torch`) and initialize the SGD optimizer with correct trainable parameters. Use learning rate 0.1.
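
As a small check on how `nn.CrossEntropyLoss` is called, the sketch below feeds it made-up, all-zero logits for 10 classes. It takes raw scores (no softmax) and integer class indices, and uniform scores should give a loss of $\log 10$. The names `toy_logits` and `toy_classes` are illustrative only.

```python
import math

import torch
from torch import nn

toy_logits = torch.zeros(4, 10)              # raw scores, no softmax applied
toy_classes = torch.tensor([0, 3, 5, 9])     # integer class indices, dtype torch.long

toy_ce = nn.CrossEntropyLoss()(toy_logits, toy_classes)
print(toy_ce.item(), math.log(10))           # both approximately 2.3026
```

This value is also a useful reference point during training: a loss near $\log 10 \approx 2.30$ corresponds to uniform predictions over the 10 classes.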

In [None]:
# BEGIN SOLUTION
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(loss_fn, nn.CrossEntropyLoss), "Should use CrossEntropyLoss for classification"
assert isinstance(optimizer, optim.SGD), "Should use SGD optimizer"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check learning rate
assert (
    optimizer.defaults["lr"] == 0.1
), f"Learning rate should be 0.1, got {optimizer.defaults['lr']}"
# END HIDDEN TESTS

### Problem 3d: Training

Train the model for 100 epochs by filling in parts of the following code chunk. It should display the average loss over the training data every 10 epochs.

In [None]:
for epoch in range(100):
    epoch_training_loss = 0

    # Training loop
    # BEGIN SOLUTION
    for inputs, target in train_dataloader:
        # Forward pass
        outputs = model(inputs)
        loss = loss_fn(outputs, target)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Zero gradients
        optimizer.zero_grad()

        # Compute loss value
        epoch_training_loss += loss.detach().item()

    avg_train_loss = epoch_training_loss / len(train_dataloader)
    # END SOLUTION
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: average training loss = {avg_train_loss:.4f}")

In [None]:
# Test assertions
assert avg_train_loss < 0.1, f"Final average training loss should be < 0.1, got {avg_train_loss}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    avg_train_loss < 0.05
), f"Training loss should be very small after 100 epochs, got {avg_train_loss}"
# END HIDDEN TESTS

### Problem 3e: Test Accuracy

Compute the test accuracy with gradient computations disabled. Accuracy is defined as the proportion of correct predictions (this is not the same as the cross-entropy loss that you used as the loss function for optimization above).
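
The definition can be illustrated on four made-up score vectors over three classes. The tensors `toy_class_scores` and `toy_truth` are hypothetical; the predicted class is the index of the largest score in each row.

```python
import torch

toy_class_scores = torch.tensor([
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.0, 5.0],
    [4.0, 0.0, 1.0],
])
toy_truth = torch.tensor([0, 1, 2, 1])

toy_pred = toy_class_scores.argmax(dim=1)                 # tensor([0, 1, 2, 0])
toy_accuracy = (toy_pred == toy_truth).float().mean().item()
print(toy_accuracy)  # 0.75
```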

In [None]:
# BEGIN SOLUTION
with torch.no_grad():
    outputs = model(test_images)
    correct = (outputs.argmax(dim=1) == test_labels).sum().item()
    test_accuracy = correct / len(test_labels)
    print(f"Test accuracy = {100 * test_accuracy:.2f}%")
# END SOLUTION

In [None]:
# Test assertions
assert test_accuracy > 0.90, f"Test accuracy should be > 90%, got {100 * test_accuracy:.2f}%"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert test_accuracy > 0.95, f"Test accuracy should be > 95%, got {100 * test_accuracy:.2f}%"
assert test_accuracy <= 1.0, "Test accuracy cannot exceed 100%"
# END HIDDEN TESTS