# **DATASCI 315, Group Work Assignment 7: High-Dimensional Spaces, the Bias-Variance Trade-Off, Ensemble Methods, and Data Augmentation**

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. *During lab, feel free to flag down your GSI to ask questions at any point!* Upon completion, one member of the team should submit their team's work through Canvas as html.

To begin, let's import some packages that we'll use throughout this assignment.

In [None]:
import math

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset

plt.style.use("seaborn-v0_8-bright")

# Part A: High-Dimensional Space

This part investigates the strange properties of high-dimensional spaces. We consider the following two properties of high-dimensional spaces:

1. The closeness of random points in a high-dimensional space
2. The proportion of a bounding hypercube contained within a hypersphere.

### Problem 1a: Distance in High-Dimensional Space

Given $n$ random points drawn from a standard multivariate normal distribution, compute the average norm, the minimum and maximum pairwise distances, and the ratio of the maximum to the minimum distance. Complete the following function, which returns the average norm and the ratio.

Note: When computing the minimum, ignore the self-distances, which are 0.
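
A tiny worked example may help pin down what ignoring the self-distances means. The sketch below uses only the standard library and a hypothetical helper `max_min_ratio` (not part of the assignment). Iterating over unordered pairs with `itertools.combinations` never pairs a point with itself, so no zeros enter the minimum.

```python
import itertools
import math


def max_min_ratio(points):
    """Return (max, min, max/min) over all pairwise Euclidean distances."""
    # combinations yields each unordered pair once and never (p, p),
    # so the zero self-distances are excluded by construction
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return max(dists), min(dists), max(dists) / min(dists)


# Three collinear points: the pairwise distances are 5, 10 and 5
example_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_max, d_min, ratio = max_min_ratio(example_points)
print(d_max, d_min, ratio)
```

A tensor implementation that builds the full distance matrix has to mask the diagonal instead, since every diagonal entry is a self-distance of 0.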


In [None]:
# BEGIN SOLUTION
def distance(n_dim=1, n_data=1000):
    data_points = torch.randn(n_data, n_dim)

    # compute the average norm
    avg_norm = torch.linalg.norm(data_points, dim=1).mean().item()

    # compute pairwise distances
    reshaped_data = data_points.reshape(n_data, 1, n_dim)
    diff = reshaped_data - reshaped_data.transpose(0, 1)
    pairwise_distance = torch.sqrt(torch.sum(diff**2, dim=2))

    # compute maximum and minimum (be careful)
    maximum_distance = pairwise_distance.max().item()
    pairwise_distance.fill_diagonal_(float("inf"))
    minimum_distance = pairwise_distance.min().item()

    # compute ratio
    ratio = maximum_distance / minimum_distance
    return avg_norm, ratio


# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 1a"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 1a
assert True
# END HIDDEN TESTS

In [None]:
dim = torch.arange(5, 500, 100)

# Run distance function for each dimension
norms = []
ratios = []
for d in dim:
    n, r = distance(n_dim=d.item())
    norms.append(n)
    ratios.append(r)

fig, ax = plt.subplots(ncols=3, figsize=(10, 3))
ax[0].plot(dim, norms)
ax[0].set_xlabel("dimension")
ax[0].set_ylabel("average norm")

ax[1].plot(dim, ratios)
ax[1].set_xlabel("dimension")
ax[1].set_ylabel("ratio of max/min distance")

ax[2].plot(dim[3:], ratios[3:])
ax[2].set_xlabel("dimension")
ax[2].set_ylabel("ratio of max/min distance")
ax[2].set_ylim(0, 3)

plt.tight_layout()
plt.show()
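
One way to read the average-norm panel: for a standard normal vector in $d$ dimensions, $E\Vert x\Vert^2 = d$, so the average norm approaches $\sqrt{d}$ as $d$ grows. The sketch below checks this with the standard library only; `average_norm` is a hypothetical helper, and the sample size of 200 and the seed are arbitrary choices.

```python
import math
import random


def average_norm(n_dim, n_data=200, seed=0):
    """Average Euclidean norm of n_data standard normal points in n_dim dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_data):
        point = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
        total += math.sqrt(sum(v * v for v in point))
    return total / n_data


# Compare the simulated average norm with sqrt(d) for a few dimensions
for d in (5, 105, 405):
    print(d, round(average_norm(d), 2), round(math.sqrt(d), 2))
```

For small $d$ the average norm sits slightly below $\sqrt{d}$; the gap becomes negligible in high dimensions.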

### Problem 1b: Hypersphere in Bounding Hypercube

Consider the hypersphere $B(0,r)=\{x\in R^d: \Vert x\Vert \leq r\}$ of radius $r$ - this is a generalization of the disk (in 2-d) and sphere (in 3-d). Let $B=B(0,1)$ be the standard hypersphere (unit Euclidean ball).

 This hypersphere $B$ is a subset of the hypercube $H=\{x\in R^d: -1 \leq x_i \leq 1 \text{ for all } i=1,\dots,d\}$ - this is a generalization of a square (in 2-d) and cube (in 3-d). See visualization in 2-d for reference. $H$ is the smallest possible cube that contains the ball $B$. We are interested in how much of the volume of $H$ is taken up by $B$.

 The volume of the hypercube is

$$V_d(H) = 2^d.$$

The volume of the hypersphere is given by

$$V_d(B) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}$$

where $\Gamma$ is the Gamma function (use `math.gamma` for scalar computations).

(i) Complete the function, which takes `n_dim` as input and returns the ratio (volume of the hypersphere $B$) divided by (volume of hypercube $H$) of that dimension.

(ii) Plot the ratio for dimensions from 2 to 20.
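
Before coding the ratio, the volume formula can be checked against familiar cases: it should give 2 for the interval $[-1,1]$, $\pi$ for the unit disk, and $4\pi/3$ for the unit ball in 3-d. The helper name `ball_volume` below is an illustrative choice, not part of the assignment.

```python
import math


def ball_volume(n_dim):
    """Volume of the unit Euclidean ball in n_dim dimensions."""
    return math.pi ** (n_dim / 2) / math.gamma(n_dim / 2 + 1)


print(ball_volume(1))  # length of the interval [-1, 1]
print(ball_volume(2))  # area of the unit disk
print(ball_volume(3))  # volume of the unit ball in 3-d
```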

In [None]:
# BEGIN SOLUTION
def volume(n_dim):
    # compute hypercube volume
    cube_volume = 2**n_dim

    # compute sphere volume (use math.gamma for the Gamma function)
    sphere_volume = math.pi ** (n_dim / 2) / math.gamma(n_dim / 2 + 1)

    # ratio
    return sphere_volume / cube_volume


volume(2)
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 1b"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 1b
assert True
# END HIDDEN TESTS

In 2-d, the volume of the hypercube (the square $[-1,1]^2$) is 4 and that of the hypersphere (unit disk) is $\pi\approx 3.14159$, hence the ratio is $\pi/4\approx 0.78539816$. Here is a visualization. We are interested in the volume in the square outside the disk (i.e., the red part shown).

In [None]:
circle = plt.Circle((0, 0), 1, fill=True, color="blue", label="Unit Circle", alpha=0.5)

# Unit square
square = plt.Rectangle((-1, -1), 2, 2, fill=True, color="red", label="Unit Square", alpha=0.5)

# Create plot
fig, ax = plt.subplots(1)
ax.add_artist(square)
ax.add_artist(circle)

# Set plot limits and aspect ratio
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect("equal", adjustable="box")

# Add labels and title
plt.xlabel("x")
plt.ylabel("y")
plt.title("Unit Disk and Bounding Unit Square")

# Show plot
plt.show()

Write your code for plotting the ratio of volumes across dimensions. If implemented correctly, you should see the ratio rapidly approach 0: the sphere takes up hardly any space within the cube. Most of the volume of the cube lies outside the ball, toward the corners.

In [None]:
# Dimensions 2 through 20 inclusive (range excludes its upper bound)
dim = list(range(2, 21))

ratios = [volume(d) for d in dim]
plt.plot(dim, ratios)
plt.xlabel("dimension")
plt.ylabel("ratio of volume of sphere/volume of cube")
plt.show()
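
The formula-based curve can be cross-checked by simulation: draw points uniformly from the cube $[-1,1]^d$ and count the fraction that land inside the unit ball. This is a rough Monte Carlo sketch using only the standard library; `fraction_in_ball`, `exact_ratio`, the seed, and the sample size are illustrative assumptions.

```python
import math
import random


def fraction_in_ball(n_dim, n_samples=20000, seed=0):
    """Monte Carlo estimate of vol(unit ball) / vol([-1, 1]^n_dim)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        point = [rng.uniform(-1.0, 1.0) for _ in range(n_dim)]
        if sum(v * v for v in point) <= 1.0:
            inside += 1
    return inside / n_samples


def exact_ratio(n_dim):
    """Closed-form ratio from the volume formulas."""
    return math.pi ** (n_dim / 2) / math.gamma(n_dim / 2 + 1) / 2**n_dim


for d in (2, 5, 10):
    print(d, fraction_in_ball(d), round(exact_ratio(d), 4))
```

For large $d$ almost no samples land inside the ball, so the Monte Carlo estimate becomes unreliable; the closed-form ratio is the better tool there.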

# Part B: A Regression Problem to Illustrate the Bias-Variance Trade-off and Ensemble Methods

In this section, we go over (i) the bias-variance trade-off and (ii) ensemble methods using a simple regression problem. Below are some helper functions and a plot of the true function (whose domain is $[0,1]$). This follows Sections 8.2-8.3 of the textbook, using a single-layer network (no activation) whose least squares solution has a closed-form expression.

In [None]:
# The true function that we are trying to estimate, defined on [0,1]
def true_function(x):
    return torch.exp(torch.sin(x * (2 * math.pi)))


# Generate some data points with or without noise
def generate_data(n_data, sigma_y=0.3):
    # Generate x values quasi uniformly
    x = torch.zeros(n_data)
    for i in range(n_data):
        x[i] = torch.rand(1).item() * (1 / n_data) + i / n_data

    # y value from running through function and adding noise
    y = true_function(x) + sigma_y * torch.randn(n_data)
    return x, y


# Draw the fitted function, together with uncertainty used to generate points
def plot_function(
    x_func,
    y_func,
    x_data=None,
    y_data=None,
    x_model=None,
    y_model=None,
    sigma_func=None,
    sigma_model=None,
    ax=None,
):
    if ax is None:
        _fig, ax = plt.subplots()
    ax.plot(x_func, y_func, "k-")
    if sigma_func is not None:
        ax.fill_between(x_func, y_func - 2 * sigma_func, y_func + 2 * sigma_func, color="lightgray")

    if x_data is not None:
        ax.plot(x_data, y_data, "o", color="#d18362")

    if x_model is not None:
        ax.plot(x_model, y_model, "-", color="#7fe7de")

    if sigma_model is not None:
        ax.fill_between(
            x_model,
            y_model - 2 * sigma_model,
            y_model + 2 * sigma_model,
            color="lightgray",
        )

    ax.set_xlim(0, 1)
    ax.set_xlabel("Input, $x$")
    ax.set_ylabel("Output, $y$")
    return ax


# Generate true function
x_func = torch.linspace(0, 1.0, 100)
y_func = true_function(x_func)

# Generate some data points
torch.manual_seed(1)
sigma_func = 0.3
n_data = 15
x_data, y_data = generate_data(n_data, sigma_func)

# Plot the function, data and uncertainty
plot_function(x_func, y_func, x_data, y_data, sigma_func=sigma_func)
plt.show()

We also provide code for the model and its closed-form least squares fit.
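
For reference, the model and its least squares fit in this part can be written out explicitly. The notation ($D$ for `n_hidden`, $A$ for the design matrix, $\phi$ for the stacked parameters) is chosen here for illustration. The expression with the inverse assumes $A^\top A$ is invertible, which is not guaranteed when there are few data points.

$$f(x) = \beta + \sum_{j=1}^{D} \omega_j \max\left(x - \frac{j-1}{D},\, 0\right)$$

$$A_{i,0} = 1, \qquad A_{i,j} = \max\left(x_i - \frac{j-1}{D},\, 0\right), \qquad \begin{pmatrix}\hat\beta \\ \hat\omega\end{pmatrix} = \arg\min_{\phi} \Vert A\phi - y\Vert^2 = (A^\top A)^{-1} A^\top y$$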

In [None]:
# Define model -- beta is a scalar and omega has size n_hidden,1
def network(x, beta, omega):
    # Retrieve number of hidden units
    n_hidden = omega.shape[0]

    y = torch.zeros_like(x)
    for c_hidden in range(n_hidden):
        # Evaluate activations based on shifted lines (figure 8.4b-d)
        line_vals = x - c_hidden / n_hidden
        h = line_vals * (line_vals > 0)
        # Weight activations by omega parameters and sum
        y = y + omega[c_hidden] * h
    # Add bias, beta
    return y + beta


# This fits the n_hidden+1 parameters (see fig 8.4a) in closed form.
# If you have studied linear algebra, then you will know it is a least
# squares solution of the form (A^TA)^-1A^Tb, where A is the design matrix and b the targets.
# If you don't recognize that, then just take it on trust that this gives you the best
# possible solution.
def fit_model_closed_form(x, y, n_hidden):
    n_data = len(x)
    design_matrix = torch.ones((n_data, n_hidden + 1))
    for i in range(n_data):
        for j in range(1, n_hidden + 1):
            design_matrix[i, j] = x[i] - (j - 1) / n_hidden
            design_matrix[i, j] = max(design_matrix[i, j], 0)

    beta_omega = torch.linalg.lstsq(design_matrix, y).solution

    beta = beta_omega[0]
    omega = beta_omega[1:]

    return beta, omega


# Closed form solution
beta, omega = fit_model_closed_form(x_data, y_data, n_hidden=3)

# Get prediction for model across graph range
x_model = torch.linspace(0, 1, 100)
y_model = network(x_model, beta, omega)

# Draw the function and the model
plot_function(x_func, y_func, x_data, y_data, x_model, y_model)
plt.show()

## Part B (i) Bias-Variance Trade-Off

A very important aspect of machine learning is understanding bias-variance trade-off. As we know, there are three sources of error in our modeling: (i) bias, (ii) variance and (iii) noise (irreducible part).

*Rule of thumb: Model with higher complexity will lower bias at cost of higher variance*
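
In symbols, consider a fixed input $x$ with noisy target $y = f(x) + \epsilon$, where $\epsilon$ has mean zero and variance $\sigma^2$ and is independent of the training set $\mathcal{D}$. The expected squared error of a model $\hat f_{\mathcal{D}}$ fitted to a random dataset then splits into the three terms listed above. This is the standard decomposition, written in notation chosen here rather than taken from the assignment:

$$E_{\mathcal{D},\epsilon}\left[\left(\hat f_{\mathcal{D}}(x) - y\right)^2\right] = \underbrace{\left(E_{\mathcal{D}}[\hat f_{\mathcal{D}}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{D}}\left[\left(\hat f_{\mathcal{D}}(x) - E_{\mathcal{D}}[\hat f_{\mathcal{D}}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$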

### Problem 2a: Model Mean and Variance Helper Function

The function repeats the experiment `n_datasets` times, each time drawing a random dataset of size `n_data` with noise level `sigma_func` and fitting it with a simple model with `n_hidden` weights and a bias term (no activation). It then computes the mean and standard deviation of the fitted model over 100 equispaced points in $[0,1]$ (`x_model`). These are used to estimate the bias and variance of the model.

In [None]:
# BEGIN SOLUTION
# Run the model many times with different datasets and return the mean and standard deviation
def get_model_mean_variance(n_data, n_datasets, n_hidden, sigma_func):
    # Create array that stores model results in rows
    y_model_all = torch.zeros((n_datasets, x_model.shape[0]))

    for c_dataset in range(n_datasets):
        # TODO -- Generate n_data x,y, pairs with standard deviation sigma_func
        # Replace this line
        x_data, y_data = generate_data(n_data, sigma_func)

        # TODO -- Fit the model
        # Replace this line:
        beta, omega = fit_model_closed_form(x_data, y_data, n_hidden=n_hidden)

        # TODO -- Run the fitted model on x_model
        # Replace this line
        y_model = network(x_model, beta, omega)

        # Store the model results
        y_model_all[c_dataset, :] = y_model

    # Get mean and standard deviation of model
    mean_model = torch.mean(y_model_all, dim=0)
    std_model = torch.std(y_model_all, dim=0)

    # Return the mean and standard deviation of the fitted model
    return mean_model, std_model


# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 2a"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 2a
assert True
# END HIDDEN TESTS

### Problem 2b: Recreate Figure 8.6

Use 100 repetitions for each of sample size 6, 10, 100 (using $\sigma=0.3$ and `n_hidden`=3). Plot the bias and variances in 3 plots side-by-side.

In [None]:
# BEGIN SOLUTION
# Generate N random data sets, fit the model N times
n_datasets = 100
sigma_func = 0.3
n_hidden = 3

# Get mean and variance of fitted model
torch.manual_seed(1)
samples = [6, 10, 100]

# Plot the results
fig, ax = plt.subplots(ncols=3, figsize=(10, 3))
for i in range(3):
    mean_model, std_model = get_model_mean_variance(samples[i], n_datasets, n_hidden, sigma_func)
    plot_function(
        x_func,
        y_func,
        x_model=x_model,
        y_model=mean_model,
        sigma_model=std_model,
        ax=ax[i],
    )
    ax[i].set_title(f"{samples[i]} samples")
plt.tight_layout()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 2b"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 2b
assert True
# END HIDDEN TESTS

### Problem 2c: Recreate Figure 8.7

Use 100 repetitions for each of `n_hidden` 3, 5, 10 (using $\sigma=0.3$ and `n_data`=20). Plot the bias and variances in 3 plots side-by-side.

In [None]:
# BEGIN SOLUTION
# Generate N random data sets, fit the model N times
n_datasets = 100
sigma_func = 0.3
n_data = 20

# Get mean and variance of fitted model
torch.manual_seed(2)
hidden = [3, 5, 10]

# Plot the results
fig, ax = plt.subplots(ncols=3, figsize=(10, 3))
for i in range(3):
    mean_model, std_model = get_model_mean_variance(n_data, n_datasets, hidden[i], sigma_func)
    plot_function(
        x_func,
        y_func,
        x_model=x_model,
        y_model=mean_model,
        sigma_model=std_model,
        ax=ax[i],
    )
    ax[i].set_title(f"{hidden[i]} regions")
plt.tight_layout()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 2c"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 2c
assert True
# END HIDDEN TESTS

### Problem 2d: Recreate Figure 8.9

Plot the bias and variance terms as a function of the model capacity (number of hidden units) in the simplified model, using the settings from the previous problem with `n_data`=15. Use 100 repetitions for each capacity.
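
The two quantities requested here can be estimated from a stack of fitted models: the variance term averages the per-input spread of the fits, and the (squared) bias term averages the squared gap between the mean fit and the true function. The numbers below are made up purely for illustration, and the sketch uses the standard library rather than tensors.

```python
import statistics

# Hypothetical predictions: 3 fitted models (rows) evaluated at 2 inputs (columns)
predictions = [
    [1.0, 2.0],
    [2.0, 2.0],
    [3.0, 2.0],
]
truth = [2.0, 3.0]  # true function values at the same 2 inputs

# Transpose so each entry holds all models' predictions at one input
per_input = list(zip(*predictions))

mean_model = [statistics.mean(col) for col in per_input]
# statistics.variance is the sample variance (divides by n - 1)
var_model = [statistics.variance(col) for col in per_input]

variance_term = statistics.mean(var_model)
bias_term = statistics.mean([(m - t) ** 2 for m, t in zip(mean_model, truth)])
print(bias_term, variance_term)
```

`torch.std` also applies the $n-1$ correction by default, so squaring its output gives the same kind of sample variance as `statistics.variance` here.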

In [None]:
# BEGIN SOLUTION
# Plot the noise, bias and variance as a function of capacity
hidden_variables = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
bias = torch.zeros(len(hidden_variables))
variance = torch.zeros(len(hidden_variables))

n_datasets = 100
n_data = 15
sigma_func = 0.3
n_hidden = 5

# Set random seed so that we get the same result every time
torch.manual_seed(1)

for c_hidden in range(len(hidden_variables)):
    # Get mean and variance of fitted model
    mean_model, std_model = get_model_mean_variance(
        n_data, n_datasets, hidden_variables[c_hidden], sigma_func
    )
    # TODO -- Estimate bias and variance
    # Replace these lines
    # Compute variance (avg squared deviation of fitted models)
    variance[c_hidden] = torch.mean(std_model**2)
    # Compute bias (average squared deviation of mean fitted model around true function)
    bias[c_hidden] = torch.mean((mean_model - y_func) ** 2)

# Plot the results
_fig, ax = plt.subplots()
ax.plot(hidden_variables, variance, label="variance", color="mediumaquamarine")
ax.plot(hidden_variables, bias, label="bias", color="sandybrown")
ax.plot(
    hidden_variables,
    variance + bias,
    linestyle="dashed",
    color="gray",
    label="bias+variance",
)
ax.set_xlim(1, 12)
ax.set_ylim(0, 0.4)
ax.set_xlabel("Model capacity")
ax.set_ylabel("Mean squared error")
plt.legend()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 2d"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 2d
assert True
# END HIDDEN TESTS

## Part B (ii) Ensemble Methods

This section investigates how ensembling can improve the performance of models. We'll work with the same ground truth and neural network model as in Part B (i), which we can fit in closed form, so we can eliminate any errors due to not finding the global minimum of the loss.

We start with a baseline model using `n_hidden`=14.

In [None]:
# Generate true function
x_func = torch.linspace(0, 1.0, 100)
y_func = true_function(x_func)

# Generate some data points
torch.manual_seed(1)
sigma_func = 0.3
n_data = 15
x_data, y_data = generate_data(n_data, sigma_func)

# Closed form solution
beta, omega = fit_model_closed_form(x_data, y_data, n_hidden=14)

# Get prediction for model across graph range
x_model = torch.linspace(0, 1, 100)
y_model = network(x_model, beta, omega)

# Draw the function and the model
fig, ax = plt.subplots(figsize=(5, 4))
plot_function(x_func, y_func, x_data, y_data, x_model, y_model, ax=ax)
plt.title("Single Model")
plt.show()

# Compute MSE between fitted model and true curve
mean_sq_error = torch.mean((y_model - y_func) * (y_model - y_func))
print(f"Mean square error = {mean_sq_error:3.3f}")

### Problem 3a: Ensembling 10 Models

Let `n_model`=10 be the number of models used. Each model will use the same architecture (controlled, as before, via the `n_hidden` parameter; here `n_hidden`=14). However, each model will be trained on a bootstrapped sample (a sample of size $n$ taken *with replacement* from the original training data, also of size $n$).
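
Sampling with replacement means some points appear several times and others not at all: on average a bootstrap sample contains about $1 - 1/e \approx 63\%$ of the distinct original points. The sketch below illustrates this with the standard library; `bootstrap_indices`, the seed, and the sample size are illustrative choices, not part of the assignment.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)
n = 10000
indices = bootstrap_indices(n, rng)

# Fraction of the original points that appear at least once in the resample
unique_fraction = len(set(indices)) / n
print(len(indices), round(unique_fraction, 3))
```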

Complete the code chunk below to achieve this and collect the results in `all_y_model`


In [None]:
# BEGIN SOLUTION
# Now let's resample the data with replacement n_model times.
n_model = 10
# Array to store the prediction from all of our models
all_y_model = torch.zeros((n_model, len(y_model)))

# For each model
for i, c_model in enumerate(range(n_model)):
    # TODO Sample data indices with replacement (use torch.randint)
    # Replace this line
    resampled_indices = torch.randint(0, n_data, (n_data,))

    # Extract the resampled x and y data
    x_data_resampled = x_data[resampled_indices]
    y_data_resampled = y_data[resampled_indices]

    # Fit the model
    beta, omega = fit_model_closed_form(x_data_resampled, y_data_resampled, n_hidden=14)

    # Run the model
    y_model_resampled = network(x_model, beta, omega)

    # Store the results
    all_y_model[c_model, :] = y_model_resampled

    # Compute MSE between fitted model and true curve
    mean_sq_error = torch.mean((y_model_resampled - y_func) * (y_model_resampled - y_func))
    print(f"Model {i}: Mean square error = {mean_sq_error:3.3f}")
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 3a"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 3a
assert True
# END HIDDEN TESTS

### Problem 3b: Aggregation

Now, we have results from 10 different models. Thus at each $x$, we have 10 different predictions. To aggregate these, one can use **mean** or **median** (for classification task, this can be a majority vote). For both of these aggregation methods, compute the mean squared error.
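
The difference between the two aggregation rules shows up when one model is far off. In the made-up example below (standard library only), a single outlying prediction drags the mean but leaves the median untouched.

```python
import statistics

# Hypothetical predictions from 5 models at one input; the last one is an outlier
predictions_at_x = [1.0, 1.1, 0.9, 1.0, 5.0]

mean_prediction = statistics.mean(predictions_at_x)      # pulled toward the outlier
median_prediction = statistics.median(predictions_at_x)  # unaffected by the outlier
print(mean_prediction, median_prediction)
```

One detail to keep in mind with an even number of models: `statistics.median` averages the two middle values, whereas `torch.median` returns the lower of the two.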

In [None]:
# BEGIN SOLUTION
# Replace this line
y_model_median = torch.median(all_y_model, dim=0).values
y_model_mean = torch.mean(all_y_model, dim=0)

# Compute the mean squared error between the fitted model and the true curve
mean_sq_error = torch.mean((y_model_median - y_func) * (y_model_median - y_func))
print(f"Mean square error for Median = {mean_sq_error:3.3f}")

mean_sq_error = torch.mean((y_model_mean - y_func) * (y_model_mean - y_func))
print(f"Mean square error for Mean = {mean_sq_error:3.3f}")
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 3b"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 3b
assert True
# END HIDDEN TESTS

In [None]:
# Draw the function and the model
fig, ax = plt.subplots(ncols=2, figsize=(10, 4))
plot_function(x_func, y_func, x_data, y_data, x_model, y_model_median, ax=ax[0])
ax[0].set_title("Median")

plot_function(x_func, y_func, x_data, y_data, x_model, y_model_mean, ax=ax[1])
ax[1].set_title("Mean")

plt.suptitle("Ensemble Methods")
plt.tight_layout()
plt.show()

You should see that both the median and mean models are better than most of the individual models. We have improved our performance at the cost of ten times as much training time, storage, and inference time. Note in the plots how much of the overfitting is also eliminated.

# Part C: MNIST 1-D Dataset

The MNIST 1-D dataset is a 1-dimensional version of the MNIST digit dataset; you can check the details [here](https://github.com/greydanus/mnist1d). Each digit is now represented as a 1-d vector with 40 features. We do not need to get into the details of how it was created; rather, we take the dataset as given. The only thing to keep in mind is that this is a slightly harder dataset than the usual MNIST. This part of the group work focuses on this dataset and on coming up with a good deep neural classifier for it.

In [None]:
# Run this if you're in a Colab to install MNIST 1D repository
%pip install git+https://github.com/greydanus/mnist1d

Let's generate a training and test dataset using the MNIST1D code. The dataset gets saved as a .pkl file so it doesn't have to be regenerated each time.

In [None]:
import mnist1d

args = mnist1d.data.get_dataset_args()
data = mnist1d.data.get_dataset(args, path="./mnist1d_data.pkl", download=False, regenerate=False)

# The training and test input and outputs are in
# data['x'], data['y'], data['x_test'], and data['y_test']
print("Examples in training set: {}".format(len(data["y"])))
print("Examples in test set: {}".format(len(data["y_test"])))
print("Length of each example: {}".format(data["x"].shape[-1]))

Let us visualize the dataset in 2-d using PCA and t-SNE.

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])

# Download and load the training data
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
mnist.data = torch.flatten(mnist.data, start_dim=1)

mnist_labels = mnist.targets
idx = torch.randperm(60000)[:4000]
mnist_subset = mnist.data[idx]
mnist_labels = mnist_labels[idx]

X = data["x"]
labels = data["y"]

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, learning_rate="auto", init="random", perplexity=3).fit_transform(X)
mnist_pca = PCA(n_components=2).fit_transform(mnist_subset)
mnist_tsne = TSNE(n_components=2, learning_rate="auto", init="random", perplexity=3).fit_transform(
    mnist_subset
)

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(10, 10))
ax[0, 0].scatter(X_pca[:, 0], X_pca[:, 1], c=labels, alpha=0.3, cmap="hsv")
ax[0, 1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, alpha=0.3, cmap="hsv")
ax[1, 0].scatter(mnist_pca[:, 0], mnist_pca[:, 1], c=mnist_labels, alpha=0.3, cmap="hsv")
ax[1, 1].scatter(mnist_tsne[:, 0], mnist_tsne[:, 1], c=mnist_labels, alpha=0.3, cmap="hsv")
ax[0, 0].set_ylabel("MNIST 1D")
ax[1, 0].set_ylabel("MNIST original")

ax[0, 0].set_title("PCA")
ax[0, 1].set_title("t-SNE")
plt.suptitle("2D Visualization of MNIST 1-d and original (color by label)")
plt.tight_layout()
plt.show()

As seen above, there is not much separation between the MNIST 1-D classes (at least in this two-dimensional view). Compare this with the original MNIST data, whose classes separate more clearly (at least under t-SNE). Hence the classification task on the 1-d version is expected to be harder.

In [None]:
def weights_init(layer_in):
    # Initialize the parameters with He initialization
    # Replace this line (see figure 7.8 of book for help)
    if isinstance(layer_in, nn.Linear):
        nn.init.kaiming_normal_(layer_in.weight)
        layer_in.bias.data.fill_(0.0)

## Performance on MNIST 1-D

### Problem 4a: Training Function for MNIST-1D

 The `verbose` parameter can be toggled to either print loss/error through the training process or not.

Hint: A scheduler can be created in torch as `StepLR(optimizer, step_size, gamma)`; it multiplies the learning rate by the factor `gamma` every `step_size` epochs.
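
To see what such a step schedule does to the learning rate, the resulting values can be computed by hand. The pure-Python sketch below is not the PyTorch implementation; `step_lr` is a hypothetical helper that assumes the scheduler is stepped once at the end of every epoch, with epochs counted from 0.

```python
def step_lr(initial_lr, step_size, gamma, epoch):
    """Learning rate in effect during `epoch` under a step schedule."""
    # The rate is multiplied by gamma once for every completed block of step_size epochs
    return initial_lr * gamma ** (epoch // step_size)


for epoch in (0, 9, 10, 25, 49):
    print(epoch, step_lr(0.05, step_size=10, gamma=0.5, epoch=epoch))
```

With `step_size=10` and `gamma=0.5`, the rate stays at 0.05 for epochs 0 to 9, drops to 0.025 at epoch 10, and keeps halving every 10 epochs after that.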

In [None]:
# BEGIN SOLUTION
def train(
    model,
    weights_init,
    data,
    batch_size,
    learning_rate,
    momentum,
    decay=0,
    schedule_params=(10, 0.5),
    n_epoch=50,
    *,
    verbose=True,
):
    # choose cross entropy loss function (equation 5.24)
    loss_function = nn.CrossEntropyLoss()

    # construct SGD optimizer and initialize learning rate and momentum
    optimizer = torch.optim.SGD(
        model.parameters(), lr=learning_rate, momentum=momentum, weight_decay=decay
    )

    # object that decreases learning rate by half every 10 epochs
    # schedule_params = (step_size, gamma) for StepLR object
    scheduler = StepLR(optimizer, step_size=schedule_params[0], gamma=schedule_params[1])

    # set up data
    x_train = torch.tensor(data["x"].astype("float32"))
    y_train = torch.tensor(data["y"].transpose().astype("int64"))
    x_test = torch.tensor(data["x_test"].astype("float32"))
    y_test = torch.tensor(data["y_test"].astype("int64"))

    # load the data into a class that creates the batches
    data_loader = DataLoader(
        TensorDataset(x_train, y_train),
        batch_size=batch_size,
        shuffle=True,
    )

    # Initialize model weights
    model.apply(weights_init)

    # loop over the dataset n_epoch times
    # store the loss and the % correct at each epoch
    losses_train = torch.zeros(n_epoch)
    errors_train = torch.zeros(n_epoch)
    losses_test = torch.zeros(n_epoch)
    errors_test = torch.zeros(n_epoch)

    for epoch in range(n_epoch):
        # loop over batches
        for _i, batch in enumerate(data_loader):
            # retrieve inputs and labels for this batch
            x_batch, y_batch = batch
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward pass -- calculate model output
            pred = model(x_batch)
            # compute the loss
            loss = loss_function(pred, y_batch)
            # backward pass
            loss.backward()
            # SGD update
            optimizer.step()

        # Run whole dataset to get statistics -- normally wouldn't do this
        pred_train = model(x_train)
        pred_test = model(x_test)
        _, predicted_train_class = torch.max(pred_train.data, 1)
        _, predicted_test_class = torch.max(pred_test.data, 1)
        errors_train[epoch] = 100 - 100 * (predicted_train_class == y_train).float().sum() / len(
            y_train
        )
        errors_test[epoch] = 100 - 100 * (predicted_test_class == y_test).float().sum() / len(
            y_test
        )
        losses_train[epoch] = loss_function(pred_train, y_train).item()
        losses_test[epoch] = loss_function(pred_test, y_test).item()
        if verbose and epoch % 10 == 0:
            print(
                f"Epoch {epoch:5d}, train loss {losses_train[epoch]:.6f}, "
                f"train error {errors_train[epoch]:3.2f},  "
                f"test loss {losses_test[epoch]:.6f}, "
                f"test error {errors_test[epoch]:3.2f}"
            )

        # tell scheduler to consider updating learning rate
        scheduler.step()
    return losses_train, errors_train, losses_test, errors_test


# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 4a"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 4a
assert True
# END HIDDEN TESTS

In [None]:
# consider the following baseline
D_i = 40  # Input dimensions
D_k = 100  # Hidden dimensions
D_o = 10  # Output dimensions

model = nn.Sequential(nn.Linear(D_i, D_k), nn.ReLU(), nn.Linear(D_k, D_o))

losses_train, errors_train, losses_test, errors_test = train(
    model=model,
    weights_init=weights_init,
    data=data,
    batch_size=100,
    learning_rate=0.05,
    momentum=0.9,
    schedule_params=(10, 0.5),
    n_epoch=100,
    verbose=True,
)

# Plot the results
n_epoch = len(losses_train)
fig, ax = plt.subplots(ncols=2, figsize=(8, 3))
ax[0].plot(errors_train, "r-", label="train")
ax[0].plot(errors_test, "b-", label="test")
ax[0].set_ylim(0, 100)
ax[0].set_xlim(0, n_epoch)
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Error")
ax[0].set_title(f"Train Error {errors_train[-1]:3.2f}%, Test Error {errors_test[-1]:3.2f}%")
ax[0].legend()

# Plot the results
ax[1].plot(losses_train, "r-", label="train")
ax[1].plot(losses_test, "b-", label="test")
ax[1].set_xlim(0, n_epoch)
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Loss")
ax[1].set_title(f"Train loss {losses_train[-1]:3.2f}, Test loss {losses_test[-1]:3.2f}")
ax[1].legend()
plt.tight_layout()
plt.show()

print(f"Test Accuracy = {(100 - errors_test[-1]):.3f}%, Test loss = {losses_test[-1]:.3f}")

There are several tuning knobs in the above for improving the performance:
1. Model architecture (number of layers, number of nodes in a layer, activation function, etc.)
2. Data (batch size)
3. Optimizer choices (learning rate, momentum, decreasing learning rate `scheduler`)
4. Regularizer (adding dropout layer or using `weight decay` for $L_2$ regularization)

Consider the model above as the baseline, which gives a test error of just above 40% (i.e., test accuracy just below 60%) and test loss of around 1.1. Can you do better?

### Problem 4b: Tune MLP for 65% Accuracy

Note: https://github.com/greydanus/mnist1d mentions for MLP the benchmark is 68% accuracy.

Plot the training and test loss and error (as before), and ensure there is no visibly significant overfitting (i.e., the test loss should not increase too much).
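
Besides inspecting the plot, a quick numeric check is to find the epoch where the test loss bottoms out and measure how far it has climbed by the end of training. The helper `overfitting_summary` and the loss values below are hypothetical; what counts as "too much" is left to judgment.

```python
def overfitting_summary(test_losses):
    """Return (epoch with lowest test loss, rise from that minimum to the final loss)."""
    best_epoch = min(range(len(test_losses)), key=lambda e: test_losses[e])
    rise = test_losses[-1] - test_losses[best_epoch]
    return best_epoch, rise


# Hypothetical per-epoch test losses that bottom out and then climb
example_losses = [1.5, 1.2, 1.0, 1.1, 1.3]
best_epoch, rise = overfitting_summary(example_losses)
print(best_epoch, round(rise, 2))
```

With a tensor of per-epoch test losses, such as the one returned by a training function, `losses_test.tolist()` would convert it to a plain list first.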


In [None]:
# BEGIN SOLUTION
D_k = 500
model = nn.Sequential(
    nn.Linear(D_i, D_k), nn.ELU(), nn.Linear(D_k, D_k), nn.ELU(), nn.Linear(D_k, D_o)
)
losses_train, errors_train, losses_test, errors_test = train(
    model=model,
    weights_init=weights_init,
    data=data,
    batch_size=128,
    learning_rate=0.05,
    momentum=0.9,
    decay=0.001,
    schedule_params=(20, 0.7),
    n_epoch=150,
    verbose=False,
)
print(f"Test Accuracy = {(100 - errors_test[-1]):.3f}%, Test loss = {losses_test[-1]:.3f}")
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 4b"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 4b
assert True
# END HIDDEN TESTS

In [None]:
# Plot the results
n_epoch = len(losses_train)
fig, ax = plt.subplots(ncols=2, figsize=(10, 3))
ax[0].plot(errors_train, "r-", label="train")
ax[0].plot(errors_test, "b-", label="test")
ax[0].set_ylim(0, 100)
ax[0].set_xlim(0, n_epoch)
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Error")
ax[0].set_title(f"Train Error {errors_train[-1]:3.2f}%, Test Error {errors_test[-1]:3.2f}%")
ax[0].legend()

# Plot the results
ax[1].plot(losses_train, "r-", label="train")
ax[1].plot(losses_test, "b-", label="test")
ax[1].set_xlim(0, n_epoch)
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Loss")
ax[1].set_title(f"Train loss {losses_train[-1]:3.2f}, Test loss {losses_test[-1]:3.2f}")
ax[1].legend()
plt.tight_layout()
plt.show()

## Augmentation with MNIST 1-D

This part investigates data augmentation for the MNIST-1D model. Data augmentation is a commonly used method for generating additional synthetic training samples by applying simple transformations (e.g., translation, inversion, scaling, filtering, rotation). The intuition, at least for images, is that a human can still recognize a dog in an image that has been zoomed, rotated, or filtered, regardless of where the dog appears in the frame.

Again, for the baseline model, we use the previous baseline (2 linear layers and 100 hidden nodes). Recall that this model achieved a test loss of around 1.14 and a test error of around 42%.

In [None]:
D_i = 40  # Input dimensions
D_k = 100  # Hidden dimensions
D_o = 10  # Output dimensions

model = nn.Sequential(nn.Linear(D_i, D_k), nn.ReLU(), nn.Linear(D_k, D_o))

### Problem 4c: Augment Function

Complete the following function which takes a sample $x$ (as `input_vector`) and applies some transformations and returns another vector `data_out`. For this problem, we apply two transformations:

1. Shift $K$ places to the right, where $K$ is a random offset:

$$(x_1,x_2,\dots,x_n) \mapsto (x_{n-K+1}, \dots, x_n, x_1, x_2, \dots, x_{n-K})$$

Note that the first coordinate $x_1$ (Python index 0) moves to the $(K+1)$th position (Python index $K$). The shift is cyclic, so entries that go off the end wrap around to the beginning.

For example, $n=4, K=2$: $(x_1,x_2,x_3,x_4)\mapsto (x_3,x_4,x_1,x_2)$.

2. Scaling: Multiply the shifted vector by a single random factor drawn from the uniform distribution over (0.8, 1.2).
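
As a quick sanity check of the shift convention, the sketch below reproduces the $n=4, K=2$ example with `torch.roll`. This is only one convenient way to perform a cyclic shift and is not necessarily the approach the problem intends. The variable names (`x`, `K`, `shifted`, `scale`) are chosen for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
K = 2  # fixed here for illustration; the assignment draws the offset at random

# A positive shift moves entries to the right; entries that fall off the end wrap around
shifted = torch.roll(x, shifts=K)
print(shifted)  # tensor([3., 4., 1., 2.])

# torch.rand draws from [0, 1), so 0.8 + 0.4 * u lies in [0.8, 1.2)
scale = 0.8 + 0.4 * torch.rand(1).item()
scaled = shifted * scale
```

Comparing your own `augment` output against `torch.roll` for a fixed offset is an easy way to catch an off-by-one error in the slicing.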

In [None]:
# BEGIN SOLUTION
def augment(input_vector):
    # Create output vector
    input_tensor = (
        torch.tensor(input_vector, dtype=torch.float32)
        if not isinstance(input_vector, torch.Tensor)
        else input_vector.clone()
    )
    n = len(input_tensor)
    data_out = torch.zeros_like(input_tensor)

    # Shift the input data by a random offset k
    # (cyclic, so points that would go off the end are added back to the beginning)
    k = torch.randint(0, n, (1,)).item()
    data_out[k:] = input_tensor[: (n - k)]
    data_out[:k] = input_tensor[(n - k) :]

    # Randomly scale the data by a factor drawn uniformly from [0.8, 1.2)
    # (torch.rand is uniform on [0, 1), so the width of the range must be 0.4)
    scale = 0.8 + 0.4 * torch.rand(1).item()
    return data_out * scale


# example
augment([1, 2, 3, 4])
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 4c"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 4c
assert True
# END HIDDEN TESTS

Let us augment our original training data with such transformed samples.

In [None]:
n_data_orig = data["x"].shape[0]
# We'll double the amount of data
n_data_augment = n_data_orig * 2
augmented_x = torch.zeros((n_data_augment, D_i))
augmented_y = torch.zeros(n_data_augment)
# First n_data_orig rows are original data
augmented_x[0:n_data_orig, :] = torch.tensor(data["x"], dtype=torch.float32)
augmented_y[0:n_data_orig] = torch.tensor(data["y"], dtype=torch.float32)

# Fill in the rest with augmented data
for c_augment in range(n_data_orig, n_data_augment):
    # Choose a data point randomly (the upper bound of torch.randint is exclusive)
    random_data_index = torch.randint(0, n_data_orig, (1,)).item()
    # Augment the point and store
    augmented_x[c_augment, :] = augment(
        torch.tensor(data["x"][random_data_index, :], dtype=torch.float32)
    )
    augmented_y[c_augment] = data["y"][random_data_index]

# to use the train function we created above
augmented_data = {
    "x": augmented_x.numpy(),
    "y": augmented_y.numpy(),
    "x_test": data["x_test"],
    "y_test": data["y_test"],
}
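
The row-by-row loop above can also be written in a vectorized form. The following is a minimal sketch under stated assumptions, not the assignment's reference solution: `augment_batch` is a hypothetical helper, and `x_orig` and `y_orig` are small random stand-ins for `data["x"]` and `data["y"]`. Each row gets its own random cyclic shift and its own scale factor.

```python
import torch


def augment_batch(x, low=0.8, high=1.2):
    """Cyclically shift each row by its own random offset, then scale each row."""
    n_rows, n_cols = x.shape
    shifts = torch.randint(0, n_cols, (n_rows, 1))
    # Output position j takes input position (j - shift) mod n_cols, i.e. a right shift
    idx = (torch.arange(n_cols).unsqueeze(0) - shifts) % n_cols
    shifted = torch.gather(x, 1, idx)
    # One scale factor per row, uniform on [low, high)
    scales = low + (high - low) * torch.rand(n_rows, 1)
    return shifted * scales


torch.manual_seed(0)
x_orig = torch.randn(8, 40)  # stand-in for data["x"]
y_orig = torch.randint(0, 10, (8,))  # stand-in for data["y"]

# Sample source rows with replacement; the upper bound of torch.randint is exclusive
pick = torch.randint(0, x_orig.shape[0], (x_orig.shape[0],))
x_aug = torch.cat([x_orig, augment_batch(x_orig[pick])])
y_aug = torch.cat([y_orig, y_orig[pick]])  # labels are unchanged by augmentation
```

Because a cyclic shift only permutes the entries, the norm of each augmented row differs from the norm of its source row by exactly the scale factor, which gives a simple check that the augmentation behaves as described.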

### Problem 4d: Training on Augmented Data

Use the `train` function to train the model on the augmented data, using the same tuning knobs as in the baseline case. What are the test loss and error now?

In [None]:
# BEGIN SOLUTION
losses_train, errors_train, losses_test, errors_test = train(
    model=model,
    weights_init=weights_init,
    data=augmented_data,
    batch_size=100,
    learning_rate=0.05,
    momentum=0.9,
    schedule_params=(10, 0.5),
    n_epoch=100,
    verbose=True,
)
print(f"Test Accuracy = {(100 - errors_test[-1]):.3f}%, Test loss = {losses_test[-1]:.3f}")
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 4d"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 4d
assert True
# END HIDDEN TESTS

### Problem 4e: Achieve 70% Accuracy

Report the final test accuracy. The goal is to get it above 70%.

In [None]:
# BEGIN SOLUTION
def augment(input_vector):
    # Create output vector
    input_tensor = (
        torch.tensor(input_vector, dtype=torch.float32)
        if not isinstance(input_vector, torch.Tensor)
        else input_vector.clone()
    )
    n = len(input_tensor)
    data_out = torch.zeros_like(input_tensor)

    k = torch.randint(0, n, (1,)).item()
    data_out[k:] = input_tensor[: (n - k)]
    data_out[:k] = input_tensor[(n - k) :]
    # Narrower, per-coordinate scale factors drawn uniformly from [0.95, 1.05)
    scale = 0.95 + 0.1 * torch.rand(n)
    return data_out * scale


n_data_orig = data["x"].shape[0]

# We'll increase the amount of data by 50%
n_data_augment = int(n_data_orig * 1.5)
augmented_x = torch.zeros((n_data_augment, D_i))
augmented_y = torch.zeros(n_data_augment)

# First n_data_orig rows are original data
augmented_x[0:n_data_orig, :] = torch.tensor(data["x"], dtype=torch.float32)
augmented_y[0:n_data_orig] = torch.tensor(data["y"], dtype=torch.float32)

# Fill in the rest with augmented data
for c_augment in range(n_data_orig, n_data_augment):
    # Choose a data point randomly (the upper bound of torch.randint is exclusive)
    random_data_index = torch.randint(0, n_data_orig, (1,)).item()
    # Augment the point and store
    augmented_x[c_augment, :] = augment(
        torch.tensor(data["x"][random_data_index, :], dtype=torch.float32)
    )
    augmented_y[c_augment] = data["y"][random_data_index]

# to use the train function we created above
augmented_data = {
    "x": augmented_x.numpy(),
    "y": augmented_y.numpy(),
    "x_test": data["x_test"],
    "y_test": data["y_test"],
}
# END SOLUTION

In [None]:
# Test assertions
assert True, "Solution implemented for 4e"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Hidden tests for 4e
assert True
# END HIDDEN TESTS

In [None]:
D_k = 500
model = nn.Sequential(
    nn.Linear(D_i, D_k),
    nn.ELU(),
    nn.Linear(D_k, D_k),
    nn.ELU(),
    nn.Linear(D_k, D_k),
    nn.ELU(),
    nn.Linear(D_k, D_o),
)
losses_train, errors_train, losses_test, errors_test = train(
    model=model,
    weights_init=weights_init,
    data=augmented_data,
    batch_size=128,
    learning_rate=0.01,
    momentum=0.9,
    decay=0.005,
    schedule_params=(20, 0.8),
    n_epoch=200,
    verbose=True,
)
print(f"Test Accuracy = {(100 - errors_test[-1]):.3f}%, Test loss = {losses_test[-1]:.3f}")