# DATASCI 315, Group Work 4: PyTorch Training Pipeline and FashionMNIST

In this group-work assignment, we will start working with PyTorch. The learning objectives for this assignment include gaining familiarity with:

1. Tensors in PyTorch
2. Automatic differentiation (autograd) in PyTorch
3. Deep networks in PyTorch

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. *During lab, feel free to flag down your GSI to ask questions at any point!* Upon completion, one member of the team should submit their team's work through Canvas as HTML.

**Resources:**
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)

In [None]:
import matplotlib.pyplot as plt
import torch
from torch import nn, optim

## Deep Networks for Regression

We go over the basic steps to build a (moderately) deep network for a regression problem and train it using PyTorch's autograd. Things to learn in this section:

1. Creating a `DataLoader` object (to help with training)
2. How to use `torch.nn` layers (linear and activation) and build a sequential model
3. How to compute loss and optimize using `torch.optim`

Let us first look at a regression problem with synthetic data where the goal is to train a 3-layer network.

In [None]:
from sklearn.model_selection import train_test_split

num_samples, num_features = 1000, 3
torch.manual_seed(10)
X = torch.rand(num_samples, num_features)
y = (
    2 * torch.sin(3 * X[:, 0])
    - 3 * torch.cos(-4 * X[:, 1])
    + 2.4 * X[:, 2]
    + torch.randn(num_samples)
)
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}, shape of y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}, shape of y_test: {y_test.shape}")

### Using DataLoader

The `DataLoader` class can be imported from `torch.utils.data` and is a convenient way to create batches while keeping $(X, y)$ pairs together. `DataLoader` takes a dataset object, which can either be created from raw tensors using `TensorDataset` or be a built-in dataset such as those in `torchvision.datasets` (we will see an example later).

In [None]:
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_idx, (features, targets) in enumerate(dataloader):
    print(f"Batch {batch_idx}: features shape: {features.shape}, targets shape: {targets.shape}")

With 800 training samples and `batch_size` $M=32$, there are now $B = 800/32 = 25$ batches. In each batch, $X$ has shape $(32, 3)$, where the first dimension is the number of samples in the batch and the second is the number of features. The batch also contains the corresponding $y$ values, with shape $(32, 1)$. To see the number of samples in the entire dataset, use the `len` function.

In [None]:
len(dataset)
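
The batch count follows directly from the dataset size and the batch size. The sketch below uses a hypothetical toy dataset of 810 random samples (not the data above; the names `toy_X`, `toy_y`, and `toy_loader` are made up for illustration) to show what happens when the dataset size is not a multiple of the batch size, and how the `drop_last` argument changes the count.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 810 samples, so the last batch is only partially filled
toy_X = torch.rand(810, 3)
toy_y = torch.rand(810, 1)
toy_dataset = TensorDataset(toy_X, toy_y)

toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)
toy_loader_drop = DataLoader(toy_dataset, batch_size=32, shuffle=True, drop_last=True)

# len(DataLoader) is the number of batches, not the number of samples
print(len(toy_loader), math.ceil(810 / 32))  # both count the final partial batch
print(len(toy_loader_drop), 810 // 32)  # drop_last=True discards the partial batch

batch_sizes = [features.shape[0] for features, _ in toy_loader]
print(batch_sizes[-1])  # size of the final batch: 810 - 25 * 32
```

Note that `len(dataset)` counts samples while `len(dataloader)` counts batches.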

### Building a Sequential Model

Building a sequential model (where the output from one layer is passed as the input to the next layer) is done using `torch.nn.Sequential`.

For fully connected layers, you will need:
1. Linear layers via `torch.nn.Linear`
2. Activation functions from `torch.nn`

The `Linear` layer is similar to the `Linear` class we saw in Group Work 2, except that (i) it accepts input tensors with any number of leading dimensions, and (ii) its parameters are tracked by autograd, so their gradients are computed automatically.

In [None]:
# A single linear layer
print(f"num_features = {num_features}")
layer1 = nn.Linear(in_features=num_features, out_features=10)

# Passing the full data through the layer
print(f"Shape of X: {X.shape}")
yhat = layer1(X)
print(f"Shape of output of X through this layer: {yhat.shape}")

# Passing a batch through the layer
for batch_idx, (features, _targets) in enumerate(dataloader):
    yhat = layer1(features)
    print(f"Batch {batch_idx}: shape of output through this layer: {yhat.shape}")
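
To make the two properties of `nn.Linear` concrete, here is a minimal sketch. It assumes a fresh layer called `demo_layer` and random inputs (both hypothetical, not part of the assignment data), checks that the layer computes $xW^\top + b$, and shows that only the last input dimension has to match `in_features`.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_layer = nn.Linear(in_features=3, out_features=10)

# The weight has shape (out_features, in_features); the bias has shape (out_features,)
print(demo_layer.weight.shape, demo_layer.bias.shape)

# The layer computes x @ W.T + b for each row x
x_2d = torch.rand(4, 3)
manual_out = x_2d @ demo_layer.weight.T + demo_layer.bias
print(torch.allclose(demo_layer(x_2d), manual_out, atol=1e-6))

# Extra leading dimensions are allowed; only the last one must equal in_features
x_3d = torch.rand(2, 4, 3)
print(demo_layer(x_3d).shape)

# Both parameters are tracked by autograd
print(demo_layer.weight.requires_grad, demo_layer.bias.requires_grad)
```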

Such layers can be combined as follows. Suppose we want to design a neural network with 2 hidden layers:

$$x \rightarrow z_1 \rightarrow z_2 \rightarrow y$$

where (denoting the activation function as $a$ acting coordinate-wise)

\begin{align}
  x &\in \mathbb{R}^d \\
  z_1 &= a(b_1 + W_1 x) \in \mathbb{R}^{d_1} \\
  z_2 &= a(b_2 + W_2 z_1) \in \mathbb{R}^{d_2} \\
  y &= b_3 + W_3 z_2 \in \mathbb{R}.
\end{align}

That is, the first hidden layer has $d_1$ nodes and the second has $d_2$. $W_1, b_1$ are the weights and bias for the first layer, and similarly for the other parameters. Here is how to build this model. Recall that the ReLU activation is available as `torch.nn.ReLU`.

In [None]:
hidden_dim_1 = 10
hidden_dim_2 = 5

model = nn.Sequential(
    nn.Linear(num_features, hidden_dim_1),  # input has d features, output has d_1
    nn.ReLU(),  # activation function
    nn.Linear(hidden_dim_1, hidden_dim_2),  # input has d_1 features, output has d_2
    nn.ReLU(),
    nn.Linear(hidden_dim_2, 1),  # input has d_2 features, output has 1 (as y is scalar)
)

# Let us look at the trainable parameters
for parameter in model.parameters():
    print(parameter)

Explanation:

1. The first two are $W_1$ and $b_1$ with shapes $(d_1, d)$ and $(d_1,)$ respectively, associated with the first layer.
2. The next two are $W_2$ and $b_2$, and so on.

Note that they all have `requires_grad` set to `True`.

In [None]:
print(model)
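
As a sanity check on the equations for $z_1$, $z_2$, and $y$, the forward pass can be written out by hand and compared with the `Sequential` model. This is only a sketch: `demo_model` is a freshly initialized copy of the architecture and `x` is a hypothetical batch of 4 random samples, so that the snippet runs on its own.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_model = nn.Sequential(
    nn.Linear(3, 10),
    nn.ReLU(),
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1),
)
x = torch.rand(4, 3)  # a hypothetical batch of 4 samples

# Sequential modules can be indexed; indices 0, 2, 4 are the Linear layers
W1, b1 = demo_model[0].weight, demo_model[0].bias
W2, b2 = demo_model[2].weight, demo_model[2].bias
W3, b3 = demo_model[4].weight, demo_model[4].bias

# Rows of x are samples, so b + W x becomes x @ W.T + b
z1 = torch.relu(x @ W1.T + b1)
z2 = torch.relu(z1 @ W2.T + b2)
y_manual = z2 @ W3.T + b3

print(y_manual.shape)
print(torch.allclose(y_manual, demo_model(x), atol=1e-6))
```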

### Problem 1: Shapes and Parameters

(a) What is the shape of $W_2$ and $b_2$?

(b) What is the total number of parameters in this model?

Explain your solution.

In [None]:
# BEGIN SOLUTION
# (a) W_2 has shape (d_2, d_1) = (5, 10) and b_2 has shape (d_2,) = (5,).
#
# (b) Total parameters:
#   Layer 1: d * d_1 + d_1 = 3 * 10 + 10 = 40
#   Layer 2: d_1 * d_2 + d_2 = 10 * 5 + 5 = 55
#   Layer 3: d_2 * 1 + 1 = 5 + 1 = 6
#   Total = 40 + 55 + 6 = 101

w2_shape = (hidden_dim_2, hidden_dim_1)
b2_shape = (hidden_dim_2,)
total_params = sum(parameter.numel() for parameter in model.parameters())
# END SOLUTION

In [None]:
# Test assertions
assert w2_shape == (5, 10), f"Expected W_2 shape (5, 10), got {w2_shape}"
assert b2_shape == (5,), f"Expected b_2 shape (5,), got {b2_shape}"
assert total_params == 101, f"Expected 101 total parameters, got {total_params}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert w2_shape[0] == hidden_dim_2, "W_2 rows should equal hidden_dim_2"
assert w2_shape[1] == hidden_dim_1, "W_2 cols should equal hidden_dim_1"
assert len(b2_shape) == 1, "b_2 should be 1-dimensional"
# END HIDDEN TESTS
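
The counting argument in the solution generalizes to any fully connected network: each `Linear` layer with $n_\text{in}$ inputs and $n_\text{out}$ outputs contributes $n_\text{in} n_\text{out} + n_\text{out}$ parameters. The helpers `count_mlp_params` and `build_mlp` below are hypothetical names introduced only for this sketch; they compare the formula with what PyTorch reports.

```python
from torch import nn


def count_mlp_params(layer_sizes):
    """Count weights and biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


def build_mlp(layer_sizes):
    """Build Linear layers with ReLU between them (no activation after the last layer)."""
    layers = []
    for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        layers.append(nn.Linear(n_in, n_out))
        if i < len(layer_sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


sizes = [3, 10, 5, 1]  # d, d_1, d_2, output dimension
formula_count = count_mlp_params(sizes)
pytorch_count = sum(p.numel() for p in build_mlp(sizes).parameters())
print(formula_count, pytorch_count)
```

Activation layers do not appear in the formula because they have no parameters.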

### Training the Model

This is similar to the optimization routine we saw earlier for simple linear regression. The only difference is that we pass `model.parameters()` to the optimizer so that it knows which parameters to update.

In [None]:
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

num_epochs = 200
train_loss = []
test_loss = []

for _epoch in range(num_epochs):
    for inputs, target in dataloader:
        # Forward pass
        outputs = model(inputs)
        loss = loss_fn(outputs, target)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Zero gradients
        optimizer.zero_grad()

    # Compute the loss on the full train and test sets (no gradient tracking needed)
    with torch.no_grad():
        y_hat_train = model(X_train)
        train_loss.append(loss_fn(y_hat_train, y_train).item())
        y_hat_test = model(X_test)
        test_loss.append(loss_fn(y_hat_test, y_test).item())
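
The training loop resets the gradients after every step because `backward()` adds new gradients to whatever is already stored in `.grad` rather than overwriting it. A minimal sketch with a hypothetical two-element tensor `w` shows the accumulation; here the gradient is reset by hand, which plays the role that `optimizer.zero_grad()` plays for all model parameters.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w
(w**2).sum().backward()
grad_first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added to the old one
(w**2).sum().backward()
grad_accumulated = w.grad.clone()

# Reset the stored gradient by hand, then run backward again
w.grad = None
(w**2).sum().backward()
grad_after_reset = w.grad.clone()

print(grad_first)  # tensor([2., 4.])
print(grad_accumulated)  # tensor([4., 8.])
print(grad_after_reset)  # tensor([2., 4.])
```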

In [None]:
plt.plot(train_loss, label="train loss")
plt.plot(test_loss, label="test loss")
plt.legend()
plt.xlabel("epochs")
plt.ylabel("loss")
plt.show()

**Note:** One can also keep track of the loss computed on each batch, but these values are computed while the parameters are still being updated within the epoch, so they do not correspond to a single set of parameters. One must be careful when computing the training loss, depending on what is needed. The method shown here evaluates the loss on the whole dataset at the end of each epoch, which can be very costly with large datasets; a cheaper alternative is to average the batch losses instead.
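
One possible way to track the training loss from batch losses is to accumulate them during the epoch, weighting each batch mean by its batch size. The sketch below is an illustration under stated assumptions rather than the method used above: the toy data, `toy_model`, and the learning rate of 0.05 are all hypothetical choices made so the snippet runs on its own.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Hypothetical toy regression data, only to make the snippet self-contained
toy_X = torch.rand(100, 3)
toy_y = toy_X.sum(dim=1, keepdim=True)
toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=32, shuffle=True)

toy_model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
toy_loss_fn = nn.MSELoss()  # default reduction="mean" averages over the batch
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.05)

epoch_losses = []
for _epoch in range(20):
    running_loss, num_seen = 0.0, 0
    for inputs, target in toy_loader:
        toy_optimizer.zero_grad()
        loss = toy_loss_fn(toy_model(inputs), target)
        loss.backward()
        toy_optimizer.step()

        # Weight each batch mean by its size so the smaller last batch is not over-counted
        running_loss += loss.item() * inputs.shape[0]
        num_seen += inputs.shape[0]
    epoch_losses.append(running_loss / num_seen)

print(f"first epoch: {epoch_losses[0]:.3f}, last epoch: {epoch_losses[-1]:.3f}")
```

This avoids an extra pass over the training data, at the price of reporting a loss averaged over parameters that changed during the epoch.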

## Deep Networks for Classification

This section will walk you through training a classification model with one of PyTorch's built-in datasets. We will be using the FashionMNIST dataset, which contains $28 \times 28$ grayscale images of clothing articles, each with a label (type of clothing). Let us download the dataset and visualize some samples. Note the `ToTensor` transformation applied to each sample, which converts it to a PyTorch tensor object. One could also apply other transforms directly (like `Resize`, `Normalize`, `CenterCrop`, etc.), which can be composed via `torchvision.transforms.Compose`.

In [None]:
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(root="data", train=True, download=True, transform=ToTensor())

test_data = datasets.FashionMNIST(root="data", train=False, download=True, transform=ToTensor())

In [None]:
num_train = len(training_data)
num_test = len(test_data)
print(f"Number of training samples: {num_train}")
print(f"Number of test samples: {num_test}")

In [None]:
# Visualizing the data
labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()

#### Problem 2a: DataLoaders

Create a `DataLoader` object with the training data, using a batch size of 64. Also create one for the test data.

In [None]:
# BEGIN SOLUTION
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=False)
# END SOLUTION

In [None]:
# Test assertions
num_batches_train = len(train_dataloader)
num_batches_test = len(test_dataloader)
assert num_batches_train == 938, f"Expected 938 train batches, got {num_batches_train}"
assert num_batches_test == 157, f"Expected 157 test batches, got {num_batches_test}"

# Check that dataloader can be iterated over
train_features, train_labels = next(iter(train_dataloader))
assert train_features.size() == torch.Size(
    [64, 1, 28, 28]
), f"Expected feature shape [64, 1, 28, 28], got {train_features.size()}"
assert train_labels.size() == torch.Size(
    [64]
), f"Expected label shape [64], got {train_labels.size()}"
print("All tests passed!")

# Display an image and label for verification
img = train_features[0].squeeze()
label = train_labels[0]
plt.figure(figsize=(3, 3))
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label} corresponding to {labels_map[label.item()]}")

# BEGIN HIDDEN TESTS
assert train_dataloader.batch_size == 64, "Train dataloader batch size should be 64"
assert test_dataloader.batch_size == 64, "Test dataloader batch size should be 64"
# END HIDDEN TESTS

#### Problem 2b: Flattening

Before using a sequential model, we need to reshape the features properly. Currently each batch $X$ has shape $(M, 1, 28, 28)$, where $M=64$. We need to flatten each sample to obtain $X$ of shape $(M, 784)$. Use `nn.Flatten` (which can be thought of as another layer) to do this. For this problem, initialize this layer and pass the features from a batch through it.

In [None]:
# BEGIN SOLUTION
flatten_layer = nn.Flatten()
train_features, train_labels = next(iter(train_dataloader))
train_flat_features = flatten_layer(train_features)
# END SOLUTION

In [None]:
# Test assertions
assert train_features.shape == torch.Size(
    [64, 1, 28, 28]
), f"Expected input shape [64, 1, 28, 28], got {train_features.shape}"
assert train_flat_features.shape == torch.Size(
    [64, 784]
), f"Expected output shape [64, 784], got {train_flat_features.shape}"
print(f"Shape of input: {train_features.shape}")
print(f"Shape of output: {train_flat_features.shape}")
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert train_flat_features.shape[0] == 64, "Batch dimension should be preserved"
assert train_flat_features.shape[1] == 28 * 28, "Features should be flattened to 784"
# END HIDDEN TESTS

You can use this as the first layer when building the sequential model. Like an activation layer, it has no trainable parameters.
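
To see what `nn.Flatten` does under the hood, the following sketch compares it with a manual reshape on a hypothetical random batch (the name `batch` stands in for a batch of images and is not taken from the dataloader above).

```python
import torch
from torch import nn

batch = torch.rand(64, 1, 28, 28)  # hypothetical stand-in for one batch of images

flatten = nn.Flatten()  # default start_dim=1 keeps the batch dimension
flat = flatten(batch)

# Same result as reshaping by hand
flat_manual = batch.reshape(batch.shape[0], -1)
print(flat.shape, torch.equal(flat, flat_manual))

# Flatten has nothing to train
print(len(list(flatten.parameters())))
```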

#### Problem 2c: A Sequential Model

Write a sequential model with 2 hidden layers, each with 512 nodes.

**Notes:**
1. The output should have 10 values (there are 10 target classes).
2. Make sure you flatten images first.
3. Use ReLU activation after each hidden layer.

What is the total number of parameters in this model?

**Note:** Since we will be using `CrossEntropyLoss`, which expects raw, unnormalized scores (logits) and applies log-softmax internally, we do not need a separate `softmax` operation on the last layer.
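
The following sketch checks this behavior of `CrossEntropyLoss` on hypothetical random logits and labels (4 samples, 10 classes): the loss on raw logits matches a log-softmax followed by the negative log-likelihood, and also matches the average of $-\log p(\text{true class})$ computed by hand.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)  # hypothetical raw outputs for 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])

ce = nn.CrossEntropyLoss()(logits, labels)

# Equivalent two-step computation: log-softmax, then negative log-likelihood
log_probs = F.log_softmax(logits, dim=1)
nll = F.nll_loss(log_probs, labels)

# Same thing by hand: average of -log p(true class)
by_hand = -log_probs[torch.arange(4), labels].mean()

print(ce.item(), nll.item(), by_hand.item())
```

Applying `softmax` before `CrossEntropyLoss` would therefore normalize the outputs twice.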

In [None]:
# BEGIN SOLUTION
# Architecture: 784 -> 512 -> 512 -> 10
# Total params: (784*512 + 512) + (512*512 + 512) + (512*10 + 10) = 669,706
classification_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
# END SOLUTION

In [None]:
# Test assertions
with torch.no_grad():
    train_features, train_labels = next(iter(train_dataloader))
    predicted_logits = classification_model(train_features)
    assert predicted_logits.shape == torch.Size(
        [64, 10]
    ), f"Expected output shape [64, 10], got {predicted_logits.shape}"
    print(f"Shape of predicted_logits: {predicted_logits.shape}")

model_num_params = sum(parameter.numel() for parameter in classification_model.parameters())
assert model_num_params == 669706, f"Expected 669,706 parameters, got {model_num_params}"
print(f"Total number of parameters: {model_num_params}")
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    len(list(classification_model.parameters())) == 6
), "Model should have 6 parameter tensors (3 weight + 3 bias)"
# END HIDDEN TESTS

#### Problem 2d: Loss

Initialize the appropriate loss function (`CrossEntropyLoss`) and compute the loss based on `predicted_logits` from above and `train_labels` from the same batch.

In [None]:
# BEGIN SOLUTION
classification_loss_fn = nn.CrossEntropyLoss()
with torch.no_grad():
    initial_loss = classification_loss_fn(predicted_logits, train_labels)
print(f"Initial loss: {initial_loss.item():.3f}")
# END SOLUTION

In [None]:
# Test assertions
assert initial_loss.item() > 0, "Loss should be positive"
assert initial_loss.item() < 10, "Initial loss should be reasonable (< 10)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert initial_loss.shape == torch.Size([]), "Loss should be a scalar"
# END HIDDEN TESTS
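
A useful reference point for the initial loss: if a classifier assigns equal probability to all 10 classes, the cross-entropy is $\log 10 \approx 2.303$. An untrained network often produces logits close to each other, so an initial loss in that neighborhood is plausible, though the exact value depends on the random initialization. The sketch below uses hypothetical all-zero logits to reproduce the uniform case.

```python
import math

import torch
from torch import nn

num_classes = 10
uniform_logits = torch.zeros(8, num_classes)  # equal logits -> probability 1/10 per class
labels = torch.randint(num_classes, size=(8,))

uniform_loss = nn.CrossEntropyLoss()(uniform_logits, labels).item()
print(f"{uniform_loss:.4f} vs log(10) = {math.log(num_classes):.4f}")
```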

#### Problem 2e: Training

Use the SGD optimizer from `torch.optim` to train the model for 10 epochs with learning rate $\eta = 0.001$. Keep track of both the training and test loss, and compute the classification accuracy on the test data after each epoch.

**Hint:** When computing the test loss, wrap the computation in `torch.no_grad()`: evaluation does not need gradients, and tracking them wastes memory and time.
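
A small sketch of what `torch.no_grad()` changes, using a hypothetical single `Linear` layer and random inputs: the computed values are the same, but no computation graph is recorded for the outputs.

```python
import torch
from torch import nn

layer = nn.Linear(3, 1)
x = torch.rand(5, 3)

out_tracked = layer(x)
print(out_tracked.requires_grad)  # True: a graph is recorded for backward()

with torch.no_grad():
    out_untracked = layer(x)
print(out_untracked.requires_grad)  # False: no graph is recorded

# The values agree; only the bookkeeping differs
print(torch.allclose(out_tracked.detach(), out_untracked))
```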

In [None]:
# BEGIN SOLUTION
classification_optimizer = optim.SGD(classification_model.parameters(), lr=1e-3)

num_classification_epochs = 10
classification_train_loss = []
classification_test_loss = []
classification_test_accuracy = []

for epoch in range(num_classification_epochs):
    epoch_training_loss = 0
    epoch_test_loss = 0
    num_correct = 0

    # Training loop
    for batch_features, batch_labels in train_dataloader:
        # Forward pass
        outputs = classification_model(batch_features)
        loss = classification_loss_fn(outputs, batch_labels)

        # Backward pass and optimization
        loss.backward()
        classification_optimizer.step()

        # Zero gradients
        classification_optimizer.zero_grad()

        # Accumulate loss
        epoch_training_loss += loss.detach().item()

    # Test evaluation loop
    for batch_features, batch_labels in test_dataloader:
        with torch.no_grad():
            outputs = classification_model(batch_features)
            loss = classification_loss_fn(outputs, batch_labels)
            epoch_test_loss += loss.detach().item()
            num_correct += (outputs.argmax(dim=1) == batch_labels).sum().item()

    classification_train_loss.append(epoch_training_loss / num_batches_train)
    classification_test_loss.append(epoch_test_loss / num_batches_test)
    classification_test_accuracy.append(num_correct / num_test)
    print(
        f"Epoch {epoch}: "
        f"train loss = {classification_train_loss[-1]:.2f}, "
        f"test loss = {classification_test_loss[-1]:.2f}, "
        f"test accuracy = {classification_test_accuracy[-1]:.2f}"
    )
# END SOLUTION

In [None]:
# Test assertions
assert (
    len(classification_train_loss) == 10
), f"Expected 10 training loss values, got {len(classification_train_loss)}"
assert (
    len(classification_test_loss) == 10
), f"Expected 10 test loss values, got {len(classification_test_loss)}"
assert (
    len(classification_test_accuracy) == 10
), f"Expected 10 test accuracy values, got {len(classification_test_accuracy)}"
assert (
    classification_test_accuracy[-1] > 0.5
), f"Expected final accuracy > 0.5, got {classification_test_accuracy[-1]:.2f}"
assert classification_train_loss[-1] < classification_train_loss[0], "Training loss should decrease"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert all(
    0 <= acc <= 1 for acc in classification_test_accuracy
), "Accuracy should be between 0 and 1"
assert all(loss > 0 for loss in classification_train_loss), "Loss should be positive"
# END HIDDEN TESTS
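
The accuracy computation in the evaluation loop can be checked on a tiny hand-made example. The logits and labels below are hypothetical values for 4 samples and 3 classes, chosen so the result is easy to verify by eye.

```python
import torch

# Hypothetical logits for 4 samples and 3 classes
logits = torch.tensor(
    [
        [2.0, 0.1, -1.0],
        [0.3, 0.2, 1.5],
        [-0.5, 0.9, 0.0],
        [1.2, 1.1, 0.4],
    ]
)
labels = torch.tensor([0, 2, 0, 0])

predictions = logits.argmax(dim=1)  # index of the largest logit in each row
num_correct = (predictions == labels).sum().item()
accuracy = num_correct / len(labels)

print(predictions)  # tensor([0, 2, 1, 0])
print(accuracy)  # 0.75
```

Because softmax preserves the ordering of the logits, taking `argmax` of the raw logits gives the same predicted class as taking it of the probabilities.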