# PyTorch Tutorial & Homework - Neural Networks
Prof. Lim Kwan Hui and Prof. Ezekiel Soremekun, with many thanks to Prof. Dorien Herremans for the initial version and Nelson Lui for the base text.

Homework questions are at the end of the tutorial.

**To edit the notebook**:

There are two ways to edit the notebook.

You can either open it in the "playground", where you can change and run cells. After closing the tab, your changes will be lost. To do so, press "File" > "Open in playground".

Alternatively, you can make a copy of this notebook to your own Google Drive account through "File" > "Save a copy in Drive..."

**Activating the GPU on Colab**:

Colab now gives you 12 hours of free GPU time (before you have to request a new node).
Simply select "GPU" in the Accelerator drop-down in Notebook Settings (either through the Edit menu or the command palette at cmd/ctrl-shift-P).

# Setting up the notebook on Colab

Let's check if we are using the GPU environment and CUDA is available:

In [67]:
# Import PyTorch and other libraries
import torch
import numpy as np
from tqdm import tqdm

print("PyTorch version:")
print(torch.__version__)
print("GPU Detected:")
print(torch.cuda.is_available())

# Define a shortcut flag for later:
import os
using_GPU = torch.cuda.is_available()

PyTorch version:
2.2.1
GPU Detected:
False


# Computation Graphs

A computation graph is simply a way to define a sequence of operations to go from input to model output.

You can think of the nodes in the graph as representing operations, and the edges in the graph represent tensors going in and out.

For example, say we wanted to build a linear regression model. This has the form $\hat y = Wx + b$.

In this equation, $x$ is our input, $W$ is a learned weight matrix, $b$ is a learned bias, and $\hat y$ is the predicted output.

As a computation graph, this looks like:

![Linear Regression Computation Graph](https://imgur.com/IcBhTjS.png)

When implementing deep learning models, you're basically designing and specifying computation graphs. It's a bit like playing with Legos in that you're stringing together a bunch of blocks (the operations) to achieve a final desired output.
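To make the graph idea concrete, here is a minimal sketch of the linear regression graph built with plain tensors. The sizes (3 input features, 2 outputs) are arbitrary choices for illustration and do not come from the figure above.

```python
import torch

# Hypothetical sizes: 3 input features, 2 outputs.
x = torch.randn(3)
W = torch.randn(2, 3, requires_grad=True)  # learned weight matrix
b = torch.randn(2, requires_grad=True)     # learned bias

# Each operation adds a node to the computation graph.
Wx = W @ x        # matrix-vector product node
y_hat = Wx + b    # addition node

# grad_fn records which operation produced each tensor.
print(Wx.grad_fn)
print(y_hat.grad_fn)

# Backpropagating through the graph fills in .grad on the leaf tensors.
y_hat.sum().backward()
print(W.grad.shape, b.grad.shape)
```

Printing `grad_fn` is a quick way to see that PyTorch recorded the operations (the edges carry the tensors `Wx` and `y_hat`).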

# The building blocks of deep learning models

`torch.nn` makes it easy to build neural nets by providing functions for specifying arbitrary computation graphs and abstractions for putting them all together. We'll start by covering a few classes in the `torch.nn` module that form basic building blocks of many deep learning applications.

The classes below are all callable, so you can use them with `outputs = YourDeepLearningBlock(its_inputs)`

In [68]:
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

## Linear Layers (Affine Transforms)

A linear layer (also known as an affine transform) defines a function:

$$f(x) = Wx + b$$

This linear transform is a core part of deep learning. $W$ and $b$ are the parameters of this layer, where $W$ is a learned weight matrix and $b$ is a learned bias vector.

`nn.Linear()` takes two construction parameters: the dimensionality of the input and the dimensionality of the desired output.

In [69]:
# Create a Linear layer. Input should have 5 dimensions, output will have 3.
lin = nn.Linear(5, 3)
# Data is a matrix of shape (2, 5). Can we use the linear layer on it?
data = torch.randn(2, 5)

# Yes! Running the data matrix through the layer outputs shape (2, 3).
print(lin(data))

tensor([[-0.2427,  0.0437,  0.6581],
        [-0.4578, -0.7820,  0.1781]], grad_fn=<AddmmBackward0>)


In [70]:
# What about a matrix of shape (2, 4, 5)?
data = torch.randn(2, 4, 5)
# This works as well! As long as the last dimension is the specified
# input dimension to the Linear layer, you're good.
# Output shape: (2, 4, 3)
print(lin(data))

tensor([[[ 0.6175, -0.1103, -0.4453],
         [-0.5570,  1.1604,  0.7831],
         [-0.0775,  1.1725,  0.5173],
         [ 0.1109,  0.2320,  0.5300]],

        [[-0.3850,  0.2671,  0.4981],
         [ 0.0855,  0.2707,  0.5223],
         [-0.1584,  0.5427, -0.1146],
         [ 0.0279,  0.1225,  0.2458]]], grad_fn=<ViewBackward0>)


In [71]:
# But (5, 2) is an incompatible shape (uncomment and run to see error)
data = torch.randn(5, 2)
#print(lin(data))

In [72]:
# But we can transpose it using t()!
# Now its shape is (2, 5) and all is fine.
print(lin(data.t()))

tensor([[-0.1913,  0.8792,  0.4457],
        [ 0.2834,  1.1513,  0.1965]], grad_fn=<AddmmBackward0>)
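If you want to convince yourself that `nn.Linear` really computes $Wx + b$, the sketch below inspects the layer's parameters and reproduces its output by hand. Note that PyTorch stores the weight with shape `(out_features, in_features)`, so the manual version multiplies by the transpose. The variable name `manual` is only for illustration.

```python
import torch
import torch.nn as nn

lin = nn.Linear(5, 3)
# The weight is stored as (out_features, in_features); the bias as (out_features,).
print(lin.weight.shape)
print(lin.bias.shape)

data = torch.randn(2, 5)
# Reproduce the layer by hand: each row of data is multiplied by W^T, then b is added.
manual = data @ lin.weight.t() + lin.bias
print(torch.allclose(lin(data), manual, atol=1e-6))
```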


## Nonlinearities / Activation Functions

Since composing linear transformations gives you a linear transformation, we don't gain any representational power by just chaining `Linear` layers.

In deep learning, we add nonlinearities after our Linear transforms, which lets us build more powerful models.
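The claim about chained `Linear` layers can be checked numerically. In this sketch (layer sizes chosen arbitrarily), two stacked layers are collapsed into a single weight matrix and bias using $W = W_2 W_1$ and $b = W_2 b_1 + b_2$, and the two computations agree. Putting a ReLU between the layers breaks this equivalence in general, which is exactly why nonlinearities add representational power.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
first = nn.Linear(5, 4)
second = nn.Linear(4, 3)
x = torch.randn(2, 5)

# Chaining two linear layers: W2 (W1 x + b1) + b2
chained = second(first(x))

# The same function written as ONE linear layer.
W = second.weight @ first.weight               # shape (3, 5)
b = second.weight @ first.bias + second.bias   # shape (3,)
collapsed = x @ W.t() + b

print(torch.allclose(chained, collapsed, atol=1e-6))

# With a nonlinearity in between, no single (W, b) reproduces this in general.
with_relu = second(torch.relu(first(x)))
print(with_relu)
```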

PyTorch comes with a veritable zoo of nonlinearities.

In [73]:
data = torch.randn(2, 3)
print(data)

# Nonlinearities are layers too!
relu = nn.ReLU()
print(relu)
print(relu(data))

tanh = nn.Tanh()
print(tanh)
print(tanh(data))

sigmoid = nn.Sigmoid()
print(sigmoid)
print(sigmoid(data))

tensor([[ 0.8418, -0.8038,  1.2000],
        [ 0.5772,  0.8405,  1.2129]])
ReLU()
tensor([[0.8418, 0.0000, 1.2000],
        [0.5772, 0.8405, 1.2129]])
Tanh()
tensor([[ 0.6868, -0.6662,  0.8337],
        [ 0.5207,  0.6861,  0.8375]])
Sigmoid()
tensor([[0.6988, 0.3092, 0.7685],
        [0.6404, 0.6986, 0.7708]])


If you'd prefer not to instantiate a layer object for the nonlinearity, you can also call it functionally as below:

In [74]:
data = torch.randn(2, 3)
print(data)

# Nonlinearities can also be used functionally, with no need to create a class!
print("ReLu:")
print(torch.relu(data))

print("tanh:")
print(torch.tanh(data))

print("Sigmoid:")
print(torch.sigmoid(data))

tensor([[ 0.6563,  1.1358,  0.3073],
        [-0.8320,  0.1175, -0.0789]])
ReLu:
tensor([[0.6563, 1.1358, 0.3073],
        [0.0000, 0.1175, 0.0000]])
tanh:
tensor([[ 0.5759,  0.8130,  0.2980],
        [-0.6815,  0.1170, -0.0788]])
Sigmoid:
tensor([[0.6584, 0.7569, 0.5762],
        [0.3032, 0.5293, 0.4803]])


## Dropout

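Dropout is used to regularize our models by randomly setting some outputs to 0. During training, PyTorch also scales the surviving outputs by $1/(1-p)$, which is why the nonzero values in the output below are doubled for $p = 0.5$.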

This helps to prevent overfitting by encouraging the model to look beyond specific spurious patterns and find features that generalize.

**Note that we should only apply dropout during training!**

In [75]:
data = torch.randn(2, 3)
print(data)

# Create a Dropout layer and call it on input
# Here, the probability of zeroing an element is 0.5
dropout = nn.Dropout(0.5)
print(dropout)
print(dropout(data))

# Use dropout functionally. With training=False the input passes through
# unchanged (note that F.dropout defaults to training=True).
print("Functional dropout, training=False")
print(F.dropout(data, 0.5, training=False))

# Set training=True, so things are dropped out
print("Functional dropout, training=True")
print(F.dropout(data, 0.5, training=True))

tensor([[-0.6832,  0.6947,  0.7942],
        [ 1.0470, -0.8556, -0.6907]])
Dropout(p=0.5, inplace=False)
tensor([[-0.0000,  1.3894,  0.0000],
        [ 0.0000, -1.7112, -1.3813]])
Functional dropout, training=False
tensor([[-0.6832,  0.6947,  0.7942],
        [ 1.0470, -0.8556, -0.6907]])
Functional dropout, training=True
tensor([[-0.0000,  0.0000,  0.0000],
        [ 0.0000, -1.7112, -0.0000]])
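The outputs above show surviving values that are twice the input. The sketch below, using an all-ones input chosen purely for illustration, makes that scaling visible and shows how the layer form of dropout is switched off with `.eval()` rather than with a `training` argument.

```python
import torch
import torch.nn as nn

p = 0.5
dropout = nn.Dropout(p)
data = torch.ones(1000)

# In training mode, elements are zeroed with probability p and the
# survivors are scaled by 1 / (1 - p), which is 2 for p = 0.5.
dropout.train()
dropped = dropout(data)
print(dropped.unique())
# The scaling keeps the expected value of each element unchanged.
print(dropped.mean())

# In eval mode, the layer is the identity.
dropout.eval()
print(torch.equal(dropout(data), data))
```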


# Structuring PyTorch models

At the highest level, `nn.Module` defines what most would refer to as a "model". It's a convenient way of encapsulating the trainable parameters of a model or a component of your model, and subclassing it gives you methods for moving your model to the GPU, saving it, loading it, etc.

When you're building your own model, you're going to subclass `nn.Module`. Critically, you also need to override the `__init__()` and `forward()` functions.

*   In `__init__()`, you should take arguments that modify how the model runs (e.g. # of layers, # of hidden units, output sizes). You'll also set up most of the layers that you use in the forward pass here.
*   In `forward()`, you define the "forward pass" of your model, or the operations needed to transform input to output. **You can use any of the Tensor operations in the forward pass.**
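As a warm-up for the two points above, here is a minimal sketch of a module with one hidden layer. The class name `TinyClassifier` and its sizes are hypothetical, chosen only to show where layers are created and where they are applied.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, input_size, hidden_dim, num_classes):
        super().__init__()
        # Layers assigned as attributes are registered automatically,
        # so their weights show up in .parameters().
        self.hidden = nn.Linear(input_size, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # Any tensor operation can be used here.
        x = torch.relu(self.hidden(x))
        return self.output(x)

model = TinyClassifier(input_size=8, hidden_dim=4, num_classes=3)
# Calling the module runs forward().
scores = model(torch.randn(2, 8))
print(scores.shape)
print([name for name, _ in model.named_parameters()])
```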



### Feed-forward neural net

Back to the simple neural network we covered in the lecture, we can add some intermediate layers (called hidden layers), nonlinearities, and dropout for regularization. This is essentially a multi-layer feed-forward neural net, and its implementation as a module is outlined below:

In [76]:
class FeedForwardNN(nn.Module):
  # input_size: Dimensionality of input feature vector.
  # num_classes: The number of classes in the classification problem.
  # num_hidden: The number of hidden (intermediate) layers to use.
  # hidden_dim: The size of each of the hidden layers.
  # dropout: The proportion of units to drop out after each layer.
  def __init__(self, input_size, num_classes, num_hidden, hidden_dim, dropout):
    # Always call the superclass (nn.Module) constructor first!
    super(FeedForwardNN, self).__init__()

    # Set up the hidden layers.
    assert num_hidden > 0
    # A special ModuleList to store our hidden layers.
    self.hidden_layers = nn.ModuleList([])
    # First hidden layer maps from input_size -> hidden_dim.
    self.hidden_layers.append(nn.Linear(input_size, hidden_dim))
    # Subsequent hidden layers map from hidden_dim -> hidden_dim.
    # Note that they can map to any dimensionality --- as long as the final
    # output is a distribution over your classes!
    for i in range(num_hidden - 1):
      self.hidden_layers.append(nn.Linear(hidden_dim, hidden_dim))

    # Set up the dropout layer.
    self.dropout = nn.Dropout(dropout)

    # Set up the final transform to a distribution over classes.
    self.output_projection = nn.Linear(hidden_dim, num_classes)

    # Set up the nonlinearity to use between layers.
    self.nonlinearity = nn.ReLU()

  # Forward's sole argument is the input.
  # x is of shape (batch_size, input_size)
  def forward(self, x):
    # Apply each hidden layer, followed by dropout and the nonlinearity.
    for hidden_layer in self.hidden_layers:
      x = hidden_layer(x)
      x = self.dropout(x)
      x = self.nonlinearity(x)

    # Output layer: project x to a distribution over classes.
    out = self.output_projection(x)

    # Log-softmax the out tensor to get a log-probability distribution
    # over classes for each example.
    out_distribution = F.log_softmax(out, dim=-1)
    return out_distribution

# Training PyTorch models: Losses and Optimizers

By now, we've learned how to construct models in PyTorch. In this section, we'll go over how to calculate your model's loss and how to optimize the parameters to minimize the loss.

## Loss Functions

Intuitively, loss functions serve to tell your model how poorly it's doing --- the purpose of training is to adjust the weights of our model to minimize the loss.

A loss function takes a true output $y$ and a model-predicted output $\hat y$ and calculates the loss. If $y = \hat y$, our model produced the correct output and thus our loss is 0. The further our predicted $\hat y$ is from the true $y$, the higher our loss is.

PyTorch comes with a large collection of loss functions. The most commonly used loss for classification is negative log likelihood (`nn.NLLLoss` or the closely related `nn.CrossEntropyLoss`). The difference between the two for classification problems is that `nn.NLLLoss` expects its input to be log-softmax normalized, which is easy to do with the `nn.LogSoftmax` layer. `nn.CrossEntropyLoss`, on the other hand, applies the log-softmax automatically, so you can think of it as `nn.LogSoftmax` + `nn.NLLLoss`. Which to use depends on whether you want to add the extra `nn.LogSoftmax` to your model's `forward()`.

A common loss used for regression problems is the mean squared error (`nn.MSELoss`).
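For reference, here is a small sketch of `nn.MSELoss` on made-up numbers, compared against the formula written out by hand (the mean of the squared differences). The prediction and target values are invented for illustration.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

mse = nn.MSELoss()
loss = mse(predictions, targets)

# The same quantity by hand: squared errors are 0.25, 0.25 and 0.0.
manual = ((predictions - targets) ** 2).mean()

print(loss)
print(manual)
```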

Here's a usage example of the `CrossEntropyLoss`.

In [77]:
# 3 examples, unnormalized scores over 4 classes.
model_output = torch.rand(3, 4, requires_grad = True)

# The correct labels.
targets = torch.LongTensor([1, 0, 3])

# CrossEntropyLoss
cross_entropy = nn.CrossEntropyLoss()
# Loss, averaged across all 3 batch elements.
# Can call this functionally: avg_loss = F.cross_entropy(model_output, targets)
avg_loss = cross_entropy(model_output, targets)
print("CrossEntropyLoss averaged across all 3 batch elements:")
print(avg_loss)

# Backpropagate wrt avg_loss
avg_loss.backward()
# Print out the gradients of model_output
print("Gradients of model_output")
print(model_output.grad)

CrossEntropyLoss averaged across all 3 batch elements:
tensor(1.1827, grad_fn=<NllLossBackward0>)
Gradients of model_output
tensor([[ 0.0500, -0.2170,  0.0714,  0.0956],
        [-0.2491,  0.0745,  0.0689,  0.1057],
        [ 0.0734,  0.0617,  0.0894, -0.2245]])


And here's a snippet showing that `LogSoftmax` + `NLLLoss` is the same as `CrossEntropyLoss`.

In [78]:
nll = nn.NLLLoss()
log_softmax_model_output = F.log_softmax(model_output, dim=-1)
# Loss, averaged across all 3 batch elements.
# Can call this functionally:
# avg_loss = F.nll_loss(log_softmax_model_output, targets)
avg_loss = nll(log_softmax_model_output, targets)
print("Negative-Log Likelihood averaged across all 3 batch elements:")
print(avg_loss)

Negative-Log Likelihood averaged across all 3 batch elements:
tensor(1.1827, grad_fn=<NllLossBackward0>)
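To see what both losses compute under the hood, the following sketch writes the calculation out by hand on freshly generated scores (so the printed number will differ from the cells above). It subtracts the log-sum-exp from each row to get log-probabilities, picks the entry of the correct class, and averages the negatives.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.rand(3, 4)            # 3 examples, 4 classes
targets = torch.tensor([1, 0, 3])    # correct class per example

# Log-softmax by hand: score minus log(sum(exp(scores))) per row.
log_probs = scores - torch.logsumexp(scores, dim=-1, keepdim=True)

# Pick the log-probability of the correct class for each example.
picked = log_probs[torch.arange(3), targets]

# Negative log likelihood, averaged over the batch.
manual_loss = -picked.mean()

print(manual_loss)
print(F.cross_entropy(scores, targets))
```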


## Optimizers

Now that we can calculate the loss and backpropagate through our model (with `.backward()`), we can update the weights and try to reduce the loss!

PyTorch includes a variety of optimizers that do exactly this, from the standard SGD to more recent techniques like Adam and RMSProp.

At construction, PyTorch optimizers take the parameters to optimize. When we run an input through our model, calculate the loss, and backpropagate, the gradients are automatically stored in the `.grad` attribute of each parameter (parameters are `nn.Parameter` tensors, which track gradients). With these gradients, the optimizer can update the weights.

Optimizers live in the `torch.optim` module.

In [79]:
import torch.optim as optim

To get the parameters of our model, we can just call `.parameters()` on a `Module`. Below, we create an instance of our previously-defined feed forward neural network and get its parameters.

In [80]:
input_size = 784
num_classes = 10
num_hidden = 2
hidden_dim = 50
dropout = 0.2
ffnn_clf = FeedForwardNN(input_size, num_classes, num_hidden,
                         hidden_dim, dropout)
print(ffnn_clf)

parameters = ffnn_clf.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

FeedForwardNN(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=50, bias=True)
    (1): Linear(in_features=50, out_features=50, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=50, out_features=10, bias=True)
  (nonlinearity): ReLU()
)
Shapes of model parameters:
[torch.Size([50, 784]), torch.Size([50]), torch.Size([50, 50]), torch.Size([50]), torch.Size([10, 50]), torch.Size([10])]


Now, to create an optimizer for this model, we instantiate an optimizer class (here stochastic gradient descent, `optim.SGD`) and pass it the parameters of the model.

In [81]:
ffnn_optim = optim.SGD(ffnn_clf.parameters(), lr=0.5)

Let's try using our optimizer to take a gradient update on our model! We'll generate a few random examples, and run them through our model (the forward pass).

In [82]:
# Make some fake data for our model.
# 5 examples in the batch, each example has 784 features.
sample_input = torch.randn(5, 784)
# Multiclass classification, 10 possible classes.
sample_labels = torch.LongTensor([0, 3, 9, 6, 2])

# Run the sample_input through ffnn_clf to get a distribution
# over our classes
sample_predictions = ffnn_clf(sample_input)
print("Predicted distribution over classes: ")
print(sample_predictions)
print("Target Labels:")
print(sample_labels)

Predicted distribution over classes: 
tensor([[-2.4876, -2.2540, -2.3640, -2.4481, -2.1980, -2.4108, -2.2306, -2.2909,
         -2.2048, -2.1918],
        [-2.5271, -2.2726, -2.0243, -2.4829, -2.3892, -2.3375, -2.2993, -2.2415,
         -2.1618, -2.3919],
        [-2.3128, -2.1053, -2.1353, -2.3604, -2.4080, -2.3782, -2.5198, -2.2543,
         -2.1861, -2.4521],
        [-2.4563, -2.5450, -2.2291, -2.4671, -2.1358, -2.1607, -2.2910, -2.1736,
         -2.2903, -2.3672],
        [-2.4685, -2.1693, -2.1936, -2.3258, -2.3950, -2.2589, -2.4781, -2.2594,
         -2.1467, -2.3969]], grad_fn=<LogSoftmaxBackward0>)
Target Labels:
tensor([0, 3, 9, 6, 2])


Now let's calculate the loss of our model on these examples.

In [83]:
nll_loss = F.nll_loss(sample_predictions, sample_labels)
print("Average NLL Loss:")
print(nll_loss)

Average NLL Loss:
tensor(2.3814, grad_fn=<NllLossBackward0>)


Let's print the gradient of one of the parameter matrices in our model to confirm that it is `None`. We haven't done backprop yet, so there shouldn't be any gradients.

In [84]:
print(list(ffnn_clf.parameters())[0].grad)

None


Now we can backpropagate with respect to the loss to calculate the gradients for the parameters of our model with `.backward()`. It's also good practice to call `optimizer.zero_grad()` before `loss.backward()`, which resets the gradients before backprop so that they don't accumulate across iterations.

In [85]:
ffnn_optim.zero_grad()
nll_loss.backward()

Let's check our gradients now...

In [86]:
print(list(ffnn_clf.parameters())[0].grad)

tensor([[ 0.0101,  0.0054, -0.0388,  ...,  0.0399, -0.0127, -0.0016],
        [ 0.0011, -0.0095,  0.0082,  ...,  0.0007,  0.0053,  0.0001],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [-0.0043, -0.0101,  0.0061,  ...,  0.0061,  0.0093,  0.0054],
        [-0.0212,  0.0111,  0.0089,  ..., -0.0173,  0.0111, -0.0100],
        [-0.0203,  0.0033, -0.0018,  ...,  0.0081,  0.0147,  0.0094]])
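To see why resetting gradients before each backward pass matters, consider this small sketch on a hand-made tensor `w` (not part of the model above). Calling `.backward()` twice without clearing adds the new gradient onto the old one.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(3 * w) = 3 for every element.
(w * 3).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT clearing: the gradients are added together.
(w * 3).sum().backward()
accumulated = w.grad.clone()

# Clearing first gives the fresh gradient again.
w.grad.zero_()
(w * 3).sum().backward()
after_reset = w.grad.clone()

print(first)        # tensor([3., 3.])
print(accumulated)  # tensor([6., 6.])
print(after_reset)  # tensor([3., 3.])
```

`optimizer.zero_grad()` does this clearing for every parameter the optimizer manages.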


Now that we have gradients for each of our parameters, we can update them by using `optimizer.step()`.

In [87]:
# Save the old value of the parameter for comparison later
old_parameter = list(ffnn_clf.parameters())[0].detach().clone()

# Make a gradient update with our optimizer
ffnn_optim.step()

new_parameter = list(ffnn_clf.parameters())[0].detach()

print("Difference between weight matrix before and after update:")
print(old_parameter - new_parameter)


Difference between weight matrix before and after update:
tensor([[ 5.0517e-03,  2.6928e-03, -1.9398e-02,  ...,  1.9935e-02,
         -6.3730e-03, -7.9811e-04],
        [ 5.4368e-04, -4.7666e-03,  4.1083e-03,  ...,  3.6877e-04,
          2.6550e-03,  5.4335e-05],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        ...,
        [-2.1362e-03, -5.0494e-03,  3.0563e-03,  ...,  3.0290e-03,
          4.6596e-03,  2.7048e-03],
        [-1.0605e-02,  5.5329e-03,  4.4616e-03,  ..., -8.6679e-03,
          5.5312e-03, -4.9849e-03],
        [-1.0153e-02,  1.6503e-03, -8.8810e-04,  ...,  4.0672e-03,
          7.3690e-03,  4.7086e-03]])


If you're familiar with the SGD update rule, you know that:

$$ \theta^{t+1} = \theta^{t} - \left( \eta \cdot \nabla L \left(\theta^{t} \right) \right)$$

where $\theta^{t}$ is the weight at time $t$, $\eta$ is the learning rate, and $\nabla L(\theta^{t})$ is the gradient. Since $\eta = 0.5$, it makes perfect sense that the difference between the weight matrices printed above is exactly half of the gradient.
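The relationship between the update and the gradient can also be checked programmatically rather than by eye. This sketch uses a small stand-in `nn.Linear` layer and an arbitrary squared-output loss (both assumptions for illustration), takes one plain SGD step, and compares the change in the weights against $\eta \cdot \nabla L$.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(4, 2)
lr = 0.5
optimizer = optim.SGD(layer.parameters(), lr=lr)

# Any scalar loss will do for this check.
loss = layer(torch.randn(3, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()

old_weight = layer.weight.detach().clone()
optimizer.step()
new_weight = layer.weight.detach()

# For plain SGD (no momentum), old - new should equal lr * grad.
print(torch.allclose(old_weight - new_weight, lr * layer.weight.grad, atol=1e-6))
```

With momentum or with optimizers such as Adam, the step is no longer a simple multiple of the current gradient, so this check only holds for plain SGD.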

# Example: Classification on FashionMNIST

Let's use the `FeedForwardNN` model we built earlier to do a simple classification task! This example is meant to be an annotated walkthrough of how to build, train, and evaluate a model in PyTorch. We'll use the [FashionMNIST dataset](https://github.com/zalandoresearch/fashion-mnist), where we are tasked with classifying grayscale images of clothes into 10 different classes.

## Loading Data

We'll start by loading the data with `torchvision` --- knowing how to use torchvision isn't the point of this tutorial, so it's relatively unannotated.

In [88]:
!pip install torchvision==0.17 #note: you can find compatible torch/torchvision versions here: https://github.com/pytorch/vision#installation
import torchvision
from torchvision.datasets import FashionMNIST

train_dataset = FashionMNIST(root='./torchvision-data',
                             train=True,
                             transform=torchvision.transforms.ToTensor(),
                             download=True)

test_dataset = FashionMNIST(root='./torchvision-data', train=False,
                            transform=torchvision.transforms.ToTensor())



`train_dataset` and `test_dataset` are both instances of a subclass of PyTorch's `torch.utils.data.Dataset`. The main benefit of subclassing this abstract class is that we can use `torch.utils.data.DataLoader`s to handle batching our examples and iterating over them. We'll create `DataLoader`s for our datasets now.

In [89]:
from torch.utils.data import DataLoader

# Data-related hyperparameters
batch_size = 64

# Set up a DataLoader for the training dataset.
train_dataloader = DataLoader(
    dataset=train_dataset, batch_size=batch_size, shuffle=True)

# Set up a DataLoader for the test dataset.
test_dataloader = DataLoader(
    dataset=test_dataset, batch_size=batch_size)

Let's take a look at what's inside our datasets. `torch.utils.data.Dataset`s are indexable, so we can easily peek inside.

In [90]:
# Print the first training example
print(train_dataset[0])

(tensor([[[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0039, 0.0000, 0.0000, 0.0510,
          0.2863, 0.0000, 0.0000, 0.0039, 0.0157, 0.0000

From this output, we can see the dataset elements are tuples of `(data_tensor, label)`. `data_tensor` is a `FloatTensor` of shape `(1, 28, 28)` (since the image is 28x28 with a single channel), and `label` is an integer from 0 to 9 (since there are 10 classes in the data).

Let's similarly look at what the `DataLoader` produces.

In [91]:
# Fetch only the first batch instead of materializing every batch in a list.
next(iter(train_dataloader))

[tensor([[[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           ...,
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]],
 
 
         [[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           ...,
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]],
 
 
         [[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000

As we can see, the `DataLoader` groups examples into batches of size `batch_size` (set to 64 in the code above). Thus, the shape of the returned image tensor is `(64, 1, 28, 28)`, since we essentially stacked `batch_size` examples together. Similarly, the labels are now a `LongTensor` of size `batch_size`.

Note that the label for a single example was a Python `int` --- the dataloader automatically grouped them into a `LongTensor` of the appropriate size.
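The batching behaviour can be reproduced without downloading anything. The sketch below builds a stand-in dataset of 150 random "images" with FashionMNIST-like shapes using `TensorDataset` (an assumption for illustration; unlike FashionMNIST, its labels are already tensors rather than Python ints) and prints the shape of every batch. Note that the last batch is smaller because 150 is not a multiple of 64.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 150 random single-channel 28x28 images with labels 0..9.
images = torch.rand(150, 1, 28, 28)
labels = torch.randint(0, 10, (150,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=64, shuffle=True)

batch_shapes = []
for batch_images, batch_labels in loader:
    batch_shapes.append((tuple(batch_images.shape), tuple(batch_labels.shape)))
    print(batch_images.shape, batch_labels.shape, batch_labels.dtype)
```

Passing `drop_last=True` to the `DataLoader` would discard the incomplete final batch if a fixed batch size is required.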

## Building our model

Now we can construct a `FeedForwardNN` instance that we'll train. Each FashionMNIST example is a `28x28` image, which we get as a Tensor of shape `(1, 28, 28)`.

We'll flatten out each example to a vector of size `(784,)` for compatibility with our model.
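Here is a brief sketch of that flattening step on a random batch (the batch itself is made up). `view(-1, 784)` lets PyTorch infer the batch dimension, and `flatten(start_dim=1)` is an equivalent alternative that does not require hard-coding 784.

```python
import torch

batch = torch.rand(64, 1, 28, 28)

# -1 tells PyTorch to infer that dimension from the total number of elements.
flat_view = batch.view(-1, 784)

# Equivalent: collapse every dimension after the batch dimension.
flat_flatten = batch.flatten(start_dim=1)

print(flat_view.shape)
print(torch.equal(flat_view, flat_flatten))
```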

In [92]:
# Hyperparameters of our model.
num_hidden = 2
hidden_dim = 512
dropout = 0.2

fashionmnist_ffnn_clf = FeedForwardNN(input_size=784, num_classes=10,
                                      num_hidden=num_hidden,
                                      hidden_dim=hidden_dim, dropout=dropout)
print(fashionmnist_ffnn_clf)

FeedForwardNN(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=512, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=512, out_features=10, bias=True)
  (nonlinearity): ReLU()
)


If we're using a GPU, we'll move the model to the GPU which should speed up training. We do this with the same `.cuda()` method we used for Tensors.

In [93]:
if using_GPU:
  fashionmnist_ffnn_clf = fashionmnist_ffnn_clf.cuda()

# Check if the Module is on GPU by checking if a parameter is on GPU
print("Model on GPU?:")
print(next(fashionmnist_ffnn_clf.parameters()).is_cuda)

Model on GPU?:
False
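A common alternative to branching on a flag and calling `.cuda()` is the device-agnostic `.to(device)` idiom, sketched below with a stand-in `nn.Linear` model. The same code then runs unchanged on machines with or without a GPU.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and batch, used only to illustrate the idiom.
model = nn.Linear(784, 10).to(device)
batch = torch.randn(64, 784).to(device)

# The model and its inputs must live on the same device.
out = model(batch)
print(device, out.device)
```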


## Construct other classes we need for training: loss and optimizer

Now, we'll set up a criterion for calculating the loss and an Optimizer for updating our parameters.

In [94]:
# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.1
momentum = 0.9
# Set up an optimizer for updating the parameters of fashionmnist_ffnn_clf
ffnn_optimizer = optim.SGD(fashionmnist_ffnn_clf.parameters(),
                           lr=lr, momentum=momentum)

## Train the model!

Now, we'll implement the procedure to train the model --- this is typically called the "train loop" since we loop over our batches, performing the forward pass, calculating a loss, backpropping, and then updating our parameters. This is the bulk of the code necessary to train the model.

This block looks pretty long, but that's mostly because of the comments :)

In [95]:
# Number of epochs (passes through the dataset) to train the model for.
num_epochs = 10

# A counter for the number of gradient updates we've performed.
num_iter = 0

# Iterate `num_epochs` times.
for epoch in range(num_epochs):
  print("Starting epoch {}".format(epoch + 1))
  # Iterate over the train_dataloader, unpacking the images and labels
  for (images, labels) in train_dataloader:
    # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784), since
    # that's what our model expects. Remember that -1 does shape inference!
    reshaped_images = images.view(-1, 784)

    # No Variable wrapping is needed: tensors support autograd directly.

    # If we're using the GPU, move reshaped_images and labels to the GPU.
    if using_GPU:
      reshaped_images = reshaped_images.cuda()
      labels = labels.cuda()

    # Run the forward pass through the model to get predicted log distribution.
    # predicted shape: (batch_size, 10) (since there are 10 classes)
    predicted = fashionmnist_ffnn_clf(reshaped_images)

    # Calculate the loss
    batch_loss = nll_criterion(predicted, labels)

    # Clear the gradients as we prepare to backprop.
    ffnn_optimizer.zero_grad()

    # Backprop (backward pass), which calculates gradients.
    batch_loss.backward()

    # Take a gradient step to update parameters.
    ffnn_optimizer.step()

    # Increment gradient update counter.
    num_iter += 1

    # Calculate test set loss and accuracy every 500 gradient updates
    # It's standard to have this as a separate evaluate function, but
    # we'll place it inline for didactic purposes.
    if num_iter % 500 == 0:
      # Set model to eval mode, which turns off dropout.
      fashionmnist_ffnn_clf.eval()
      # Counters for the num of examples we get right / total num of examples.
      num_correct = 0
      total_examples = 0
      total_test_loss = 0

      # Iterate over the test dataloader. torch.no_grad() disables gradient
      # tracking, which we don't need for evaluation; this speeds up inference.
      with torch.no_grad():
        for (test_images, test_labels) in test_dataloader:
          # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784) again
          reshaped_test_images = test_images.view(-1, 784)

          # If we're using the GPU, move tensors to the GPU.
          if using_GPU:
            reshaped_test_images = reshaped_test_images.cuda()
            test_labels = test_labels.cuda()

          # Run the forward pass to get predicted distribution.
          predicted = fashionmnist_ffnn_clf(reshaped_test_images)

          # Calculate loss for this test batch. This is averaged, so multiply
          # by the number of examples in batch to get a total.
          total_test_loss += nll_criterion(
              predicted, test_labels) * test_labels.size(0)

          # Get predicted labels (argmax over the class dimension).
          _, predicted_labels = torch.max(predicted, 1)

          # Count the number of examples in this batch
          total_examples += test_labels.size(0)

          # Count the total number of correctly predicted labels.
          # predicted_labels == test_labels generates a BoolTensor that is True
          # where predicted and labels match, so we can sum to get the num correct.
          num_correct += torch.sum(predicted_labels == test_labels)
      accuracy = 100 * num_correct / total_examples
      average_test_loss = total_test_loss / total_examples
      print("Iteration {}. Test Loss {}. Test Accuracy {}.".format(
          num_iter, average_test_loss, accuracy))
      # Set the model back to train mode, which activates dropout again.
      fashionmnist_ffnn_clf.train()

Starting epoch 1


Iteration 500. Test Loss 0.5992898344993591. Test Accuracy 78.23999786376953.
Starting epoch 2
Iteration 1000. Test Loss 0.5732710957527161. Test Accuracy 80.72000122070312.
Iteration 1500. Test Loss 0.554969847202301. Test Accuracy 79.45999908447266.
Starting epoch 3
Iteration 2000. Test Loss 0.5411611199378967. Test Accuracy 80.9800033569336.
Iteration 2500. Test Loss 0.5333136916160583. Test Accuracy 80.20999908447266.
Starting epoch 4
Iteration 3000. Test Loss 0.4768065810203552. Test Accuracy 82.38999938964844.
Iteration 3500. Test Loss 0.5478038787841797. Test Accuracy 79.43000030517578.
Starting epoch 5
Iteration 4000. Test Loss 0.5184205174446106. Test Accuracy 81.91000366210938.
Iteration 4500. Test Loss 0.4889791011810303. Test Accuracy 83.20999908447266.
Starting epoch 6
Iteration 5000. Test Loss 0.47661474347114563. Test Accuracy 83.1500015258789.
Iteration 5500. Test Loss 0.5552670955657959. Test Accuracy 81.36000061035156.
Starting epoch 7
Iteration 6000. Test Loss 0.5004
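The comments in the train loop mention that evaluation is usually factored out into its own function. One possible shape for such a helper is sketched below. The function name `evaluate`, its signature, and the toy model and data used to exercise it are all assumptions for illustration, not part of the tutorial's code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, dataloader, criterion, device="cpu"):
    """Return (average loss, accuracy in percent) over a dataloader."""
    was_training = model.training
    model.eval()  # turns off dropout
    total_loss, num_correct, total_examples = 0.0, 0, 0
    with torch.no_grad():  # no gradient history needed for evaluation
        for images, labels in dataloader:
            images = images.view(images.size(0), -1).to(device)
            labels = labels.to(device)
            log_probs = model(images)
            # The criterion averages over the batch, so scale back to a sum.
            total_loss += criterion(log_probs, labels).item() * labels.size(0)
            num_correct += (log_probs.argmax(dim=1) == labels).sum().item()
            total_examples += labels.size(0)
    model.train(was_training)  # restore the previous mode
    return total_loss / total_examples, 100.0 * num_correct / total_examples

# Hypothetical stand-in model and random data, just to exercise the function.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(784, 10), nn.LogSoftmax(dim=-1))
toy_data = TensorDataset(torch.rand(100, 1, 28, 28),
                         torch.randint(0, 10, (100,)))
toy_loader = DataLoader(toy_data, batch_size=32)

avg_loss, accuracy = evaluate(toy_model, toy_loader, nn.NLLLoss())
print(avg_loss, accuracy)
```

Inside a train loop, this could be called every few hundred iterations in place of the inline evaluation code.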

# Homework Exercises
**Due: 7th Mar, 11:59pm**
<br>
<br>
Based on the same FashionMNIST dataset, work on the following tasks. Submit your homework as either: (i) an ipynb file with your results inside; or (ii) a Python file and a separate PDF discussing your results.

(a) Develop a new feed-forward neural network that contains 3 hidden layers, with hidden layers 1, 2, 3 being of dimensions 512, 256, 128, respectively. Hidden layer 1 is the layer immediately after the input layer, while hidden layer 3 is the one just before the output layer.

(b) Experiment with three different activation functions and two different optimizers. Report your results and discuss your findings.

(c) Building upon Task b above, describe and implement two approaches to improve upon the best variation from Task b. Report your results and discuss your findings.


Part a:

The following feed-forward network is composed of three hidden layers of dimensions 512, 256 and 128, respectively.
I will be experimenting with three different activation functions, namely ReLU, tanh and Sigmoid, since they are the most commonly implemented, along with dropout. Dropout is applied after each activation function to reduce overfitting.

In [123]:
# Define the FeedForward Neural Network
class NewFeedForwardNet(nn.Module):
    def __init__(self, input_layer=784, output_layer=10, hidden_layer1=512, hidden_layer2=256, hidden_layer3=128, dropout=0.3):
        super(NewFeedForwardNet, self).__init__()
        self.layer1 = nn.Linear(input_layer, hidden_layer1)
        self.layer2 = nn.Linear(hidden_layer1, hidden_layer2)
        self.layer3 = nn.Linear(hidden_layer2, hidden_layer3)
        self.output = nn.Linear(hidden_layer3, output_layer)

        # Activation functions
        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, activation='relu'):
        x = x.view(x.shape[0], -1)  # Flatten input

        if activation == 'relu':
            activation_fn = self.relu
        elif activation == 'tanh':
            activation_fn = self.tanh
        elif activation == 'sigmoid':
            activation_fn = self.sigmoid
        else:
            raise ValueError("Invalid activation function")

        x = activation_fn(self.layer1(x))
        x = self.dropout(x)
        x = activation_fn(self.layer2(x))
        x = self.dropout(x)
        x = activation_fn(self.layer3(x))
        x = self.dropout(x)

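        x = self.output(x)
        # Return log-probabilities. nn.CrossEntropyLoss (used for training below)
        # applies log_softmax internally; log_softmax is idempotent, so the loss
        # is unchanged, although returning raw logits is the more common pairing.
        return F.log_softmax(x, dim=-1)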

# Instantiate the model
new_model = NewFeedForwardNet()
print(new_model)

NewFeedForwardNet(
  (layer1): Linear(in_features=784, out_features=512, bias=True)
  (layer2): Linear(in_features=512, out_features=256, bias=True)
  (layer3): Linear(in_features=256, out_features=128, bias=True)
  (output): Linear(in_features=128, out_features=10, bias=True)
  (relu): ReLU()
  (tanh): Tanh()
  (sigmoid): Sigmoid()
  (dropout): Dropout(p=0.3, inplace=False)
)
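
As a quick cross-check of the printed layer shapes, the number of trainable parameters can be worked out by hand. The sketch below is plain Python arithmetic over the layer sizes listed above (the variable and function names are illustrative, not part of the notebook); each Linear layer contributes in_features * out_features weights plus out_features biases.

```python
# Layer sizes taken from the printed model: 784 -> 512 -> 256 -> 128 -> 10
layer_sizes = [784, 512, 256, 128, 10]

def linear_param_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

per_layer = [
    linear_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(per_layer)

print(per_layer)     # [401920, 131328, 32896, 1290]
print(total_params)  # 567434
```

The activation and dropout modules have no parameters, so this total should match sum(p.numel() for p in new_model.parameters()).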


Now that the initial structure of the layers is defined, I will proceed to train and evaluate the model using the train_and_evaluate function below.

The function first selects the device: it uses the CUDA GPU if one is available and otherwise falls back to the CPU.

For the optimizers, I have chosen Adam and Stochastic Gradient Descent (SGD) since they pair well with the activation functions chosen in the previous section.

Adam: Initial learning rate = 0.001
<br>
SGD: Initial learning rate = 0.001, momentum = 0.9
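
To give a feel for what momentum = 0.9 does, here is a minimal plain-Python sketch of the momentum update in the form PyTorch's SGD uses by default (velocity = momentum * velocity + gradient, then the parameter moves by lr * velocity). It is applied to a made-up constant gradient of 1.0, and the function and variable names are illustrative only.

```python
def sgd_momentum_steps(lr=0.001, momentum=0.9, grad=1.0, num_steps=100):
    """Return the per-step parameter change under a constant gradient."""
    velocity = 0.0
    step_sizes = []
    for _ in range(num_steps):
        velocity = momentum * velocity + grad  # accumulate past gradients
        step_sizes.append(lr * velocity)       # amount subtracted from the parameter
    return step_sizes

steps = sgd_momentum_steps()
first_step = steps[0]
last_step = steps[-1]
print(first_step)  # 0.001
print(last_step)   # approaches lr / (1 - momentum) = 0.01
```

With a steady gradient direction, a momentum of 0.9 makes the effective step roughly ten times the nominal learning rate.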

optimizer.zero_grad() resets the gradients before each backward pass so that gradients from earlier batches do not accumulate and affect the update.
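
The accumulation behaviour can be seen on a single parameter. This is a small sketch, assuming only torch is installed; the scalar w stands in for a model weight and is not part of the notebook.

```python
import torch

# A single trainable scalar stands in for a model parameter (illustrative).
w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3.0 * w).backward()
grad_after_first = w.grad.item()   # d(3w)/dw = 3

(3.0 * w).backward()               # no zero_grad in between: gradients add up
grad_after_second = w.grad.item()  # 3 + 3 = 6

optimizer.zero_grad()              # clear before the next backward pass
(3.0 * w).backward()
grad_after_reset = w.grad.item()   # back to 3

print(grad_after_first, grad_after_second, grad_after_reset)  # 3.0 6.0 3.0
```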

In [128]:
def train_and_evaluate(activation='relu', optimizer_type='adam', epochs=10):
    model = NewFeedForwardNet()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Define loss function
    criterion = nn.CrossEntropyLoss()

    # Choose optimizer
    if optimizer_type == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=0.001)
    elif optimizer_type == 'sgd':
        optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    else:
        raise ValueError("Invalid optimizer choice")

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        for images, labels in train_dataloader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            output = model(images, activation)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            correct += (output.argmax(dim=1) == labels).sum().item()
        
        train_accuracy = correct / len(train_dataloader.dataset)
        print(f"Epoch {epoch+1}: Loss = {total_loss:.4f}, Training Accuracy = {train_accuracy:.4f}")

    return model

# Run experiments
print("Training with ReLU + Adam:")
model_relu_adam = train_and_evaluate('relu', 'adam')

print("\nTraining with Tanh + Adam:")
model_tanh_adam = train_and_evaluate('tanh', 'adam')

print("\nTraining with Sigmoid + SGD:")
model_sigmoid_sgd = train_and_evaluate('sigmoid', 'sgd')


Training with ReLU + Adam:
Epoch 1: Loss = 557.9737, Training Accuracy = 0.7834
Epoch 2: Loss = 400.5800, Training Accuracy = 0.8458
Epoch 3: Loss = 363.2905, Training Accuracy = 0.8592
Epoch 4: Loss = 345.3349, Training Accuracy = 0.8670
Epoch 5: Loss = 331.3600, Training Accuracy = 0.8715
Epoch 6: Loss = 318.8402, Training Accuracy = 0.8758
Epoch 7: Loss = 310.0679, Training Accuracy = 0.8785
Epoch 8: Loss = 302.0291, Training Accuracy = 0.8823
Epoch 9: Loss = 294.8465, Training Accuracy = 0.8852
Epoch 10: Loss = 287.6772, Training Accuracy = 0.8866

Training with Tanh + Adam:
Epoch 1: Loss = 521.5704, Training Accuracy = 0.7984
Epoch 2: Loss = 414.3546, Training Accuracy = 0.8412
Epoch 3: Loss = 386.7568, Training Accuracy = 0.8525
Epoch 4: Loss = 369.2481, Training Accuracy = 0.8586
Epoch 5: Loss = 356.1026, Training Accuracy = 0.8628
Epoch 6: Loss = 348.8328, Training Accuracy = 0.8655
Epoch 7: Loss = 339.0565, Training Accuracy = 0.8703
Epoch 8: Loss = 335.7524, Training Accuracy

Initial results show that ReLU+Adam had the highest training accuracy of ~88% (0.8866), tanh+Adam reached ~87% (0.8744), and Sigmoid+SGD had the lowest at ~10% (0.1057), which is chance level for 10 classes.

Hence, the next iteration was aimed at improving the performance of Sigmoid+SGD by altering its parameters. In iteration 1 below, I increased the learning rate of SGD from 0.001 to 0.01 while keeping all other variables unchanged.
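
Since only the SGD learning rate changes between runs, one option is to pass it as an argument rather than redefining the whole training function. The helper below is a hypothetical sketch (build_optimizer and its parameters are not part of the notebook); it relies only on torch.optim.Adam and torch.optim.SGD with the settings described above.

```python
import torch
from torch import nn, optim

def build_optimizer(model, optimizer_type="adam", lr=0.001, momentum=0.9):
    """Create the optimizer for `model` with an explicit learning rate."""
    if optimizer_type == "adam":
        return optim.Adam(model.parameters(), lr=lr)
    if optimizer_type == "sgd":
        return optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    raise ValueError("Invalid optimizer choice")

# Small stand-in model so the snippet runs on its own.
toy_model = nn.Linear(4, 2)
sgd_optimizers = {
    lr: build_optimizer(toy_model, "sgd", lr=lr) for lr in (0.001, 0.01)
}
configured_lrs = [opt.param_groups[0]["lr"] for opt in sgd_optimizers.values()]
print(configured_lrs)  # [0.001, 0.01]
```

A training function that accepts lr and forwards it to such a helper would let each experiment be a single call with a different value.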

In [129]:
#Iteration 1
def train_and_evaluate(activation='relu', optimizer_type='adam', epochs=10):
    model = NewFeedForwardNet()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Define loss function
    criterion = nn.CrossEntropyLoss()

    # Choose optimizer
    if optimizer_type == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=0.001)
    elif optimizer_type == 'sgd':
        optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    else:
        raise ValueError("Invalid optimizer choice")

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        for images, labels in train_dataloader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            output = model(images, activation)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            correct += (output.argmax(dim=1) == labels).sum().item()
        
        train_accuracy = correct / len(train_dataloader.dataset)
        print(f"Epoch {epoch+1}: Loss = {total_loss:.4f}, Training Accuracy = {train_accuracy:.4f}")

    return model

# Run experiments
print("Training with ReLU + Adam:")
model_relu_adam = train_and_evaluate('relu', 'adam')

print("\nTraining with Tanh + Adam:")
model_tanh_adam = train_and_evaluate('tanh', 'adam')

print("\nTraining with Sigmoid + SGD:")
model_sigmoid_sgd = train_and_evaluate('sigmoid', 'sgd')


Training with ReLU + Adam:
Epoch 1: Loss = 562.0038, Training Accuracy = 0.7818
Epoch 2: Loss = 404.4496, Training Accuracy = 0.8447
Epoch 3: Loss = 368.2507, Training Accuracy = 0.8597
Epoch 4: Loss = 349.2219, Training Accuracy = 0.8653
Epoch 5: Loss = 332.8848, Training Accuracy = 0.8712
Epoch 6: Loss = 321.4450, Training Accuracy = 0.8757
Epoch 7: Loss = 311.9161, Training Accuracy = 0.8792
Epoch 8: Loss = 302.2629, Training Accuracy = 0.8822
Epoch 9: Loss = 293.2805, Training Accuracy = 0.8860
Epoch 10: Loss = 288.6774, Training Accuracy = 0.8875

Training with Tanh + Adam:
Epoch 1: Loss = 525.9455, Training Accuracy = 0.7984
Epoch 2: Loss = 408.7132, Training Accuracy = 0.8427
Epoch 3: Loss = 383.2703, Training Accuracy = 0.8528
Epoch 4: Loss = 368.5491, Training Accuracy = 0.8595
Epoch 5: Loss = 354.9660, Training Accuracy = 0.8630
Epoch 6: Loss = 344.7906, Training Accuracy = 0.8656
Epoch 7: Loss = 340.0693, Training Accuracy = 0.8701
Epoch 8: Loss = 333.9088, Training Accuracy

The results of iteration 1 show a large improvement for Sigmoid+SGD, whose training accuracy increased from ~10% (0.1057) in the initial run to ~72% (0.7252) in iteration 1, while its loss fell drastically from ~2160 to ~696. The Adam runs use the same settings as before, so their small differences (for example, 0.8866 vs 0.8875 for ReLU+Adam) reflect run-to-run variation rather than the learning rate change. From this result we can infer that increasing the learning rate had a major impact on the accuracy of the Sigmoid activation function with SGD.
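
The ~2160 figure is consistent with a network that predicts at chance. The loss printed per epoch is the sum of the per-batch mean losses, and a uniform guess over 10 classes has a cross-entropy of ln(10), about 2.303. The arithmetic below assumes the 60,000-image FashionMNIST training set and a batch size of 64; the batch size is an assumption, since the DataLoader is not defined in this part of the notebook.

```python
import math

num_train_examples = 60_000   # FashionMNIST training set size
batch_size = 64               # assumption: not shown in this section
num_classes = 10

num_batches = math.ceil(num_train_examples / batch_size)
chance_loss_per_batch = math.log(num_classes)   # cross-entropy of a uniform guess
chance_loss_per_epoch = num_batches * chance_loss_per_batch

print(num_batches)                      # 938
print(round(chance_loss_per_epoch, 1))  # 2159.8
```

Under the same assumption, dividing a printed loss by 938 gives the mean loss per batch, for example 696 / 938 ≈ 0.74.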

In iteration 2 below, I will increase the learning rate of SGD from 0.01 to 0.1 to observe if the accuracy improves and if the loss decreases further.

In [130]:
#Iteration 2
def train_and_evaluate(activation='relu', optimizer_type='adam', epochs=10):
    model = NewFeedForwardNet()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Define loss function
    criterion = nn.CrossEntropyLoss()

    # Choose optimizer
    if optimizer_type == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=0.001)
    elif optimizer_type == 'sgd':
        optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    else:
        raise ValueError("Invalid optimizer choice")

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        for images, labels in train_dataloader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            output = model(images, activation)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            correct += (output.argmax(dim=1) == labels).sum().item()
        
        train_accuracy = correct / len(train_dataloader.dataset)
        print(f"Epoch {epoch+1}: Loss = {total_loss:.4f}, Training Accuracy = {train_accuracy:.4f}")

    return model

# Run experiments
print("Training with ReLU + Adam:")
model_relu_adam = train_and_evaluate('relu', 'adam')

print("\nTraining with Tanh + Adam:")
model_tanh_adam = train_and_evaluate('tanh', 'adam')

print("\nTraining with Sigmoid + SGD:")
model_sigmoid_sgd = train_and_evaluate('sigmoid', 'sgd')


Training with ReLU + Adam:
Epoch 1: Loss = 559.5788, Training Accuracy = 0.7825
Epoch 2: Loss = 400.5652, Training Accuracy = 0.8463
Epoch 3: Loss = 366.4172, Training Accuracy = 0.8592
Epoch 4: Loss = 347.7177, Training Accuracy = 0.8656
Epoch 5: Loss = 330.4547, Training Accuracy = 0.8723
Epoch 6: Loss = 318.9842, Training Accuracy = 0.8765
Epoch 7: Loss = 309.1889, Training Accuracy = 0.8791
Epoch 8: Loss = 303.9820, Training Accuracy = 0.8821
Epoch 9: Loss = 294.1227, Training Accuracy = 0.8857
Epoch 10: Loss = 285.1771, Training Accuracy = 0.8883

Training with Tanh + Adam:
Epoch 1: Loss = 522.2632, Training Accuracy = 0.8006
Epoch 2: Loss = 410.7835, Training Accuracy = 0.8407
Epoch 3: Loss = 378.8818, Training Accuracy = 0.8541
Epoch 4: Loss = 364.3579, Training Accuracy = 0.8594
Epoch 5: Loss = 355.6537, Training Accuracy = 0.8625
Epoch 6: Loss = 346.2382, Training Accuracy = 0.8660
Epoch 7: Loss = 336.9866, Training Accuracy = 0.8699
Epoch 8: Loss = 331.8873, Training Accuracy

The results of iteration 2 show an increase of ~13 percentage points in the training accuracy of Sigmoid+SGD, from ~72% (0.7252) in iteration 1 to ~85% (0.8572) in iteration 2, as well as a drastic reduction in loss from ~696 in iteration 1 to ~371 in iteration 2.
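
The accuracies quoted so far are measured on the training data while dropout is active. A held-out check could look like the sketch below. Here evaluate is a hypothetical helper, not part of the notebook, and the tiny model and random batches exist only so that the snippet runs on its own. With the notebook's model, one would pass a test DataLoader and the trained activation (for example activation='sigmoid') as a keyword argument.

```python
import torch
from torch import nn

def evaluate(model, dataloader, device="cpu", **forward_kwargs):
    """Return accuracy of `model` over `dataloader` with dropout disabled."""
    model.eval()                      # switch off dropout
    correct = 0
    total = 0
    with torch.no_grad():             # no gradients needed for evaluation
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            predictions = model(images, **forward_kwargs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Stand-in model and random data (illustrative only).
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
toy_batches = [
    (torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))) for _ in range(3)
]
toy_accuracy = evaluate(toy_model, toy_batches)
print(f"Accuracy on random data: {toy_accuracy:.3f}")
```

Calling model.train() again afterwards re-enables dropout before further training.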

In the final iteration, iteration 3 below, I will increase the learning rate of SGD from 0.1 to 1.0 while keeping all other variables unchanged to see whether accuracy and loss improve further.

In [131]:
#Iteration 3
def train_and_evaluate(activation='relu', optimizer_type='adam', epochs=10):
    model = NewFeedForwardNet()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Define loss function
    criterion = nn.CrossEntropyLoss()

    # Choose optimizer
    if optimizer_type == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=0.001)
    elif optimizer_type == 'sgd':
        optimizer = optim.SGD(model.parameters(), lr=1.0, momentum=0.9)
    else:
        raise ValueError("Invalid optimizer choice")

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        for images, labels in train_dataloader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            output = model(images, activation)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            correct += (output.argmax(dim=1) == labels).sum().item()
        
        train_accuracy = correct / len(train_dataloader.dataset)
        print(f"Epoch {epoch+1}: Loss = {total_loss:.4f}, Training Accuracy = {train_accuracy:.4f}")

    return model

# Run experiments
print("Training with ReLU + Adam:")
model_relu_adam = train_and_evaluate('relu', 'adam')

print("\nTraining with Tanh + Adam:")
model_tanh_adam = train_and_evaluate('tanh', 'adam')

print("\nTraining with Sigmoid + SGD:")
model_sigmoid_sgd = train_and_evaluate('sigmoid', 'sgd')


Training with ReLU + Adam:
Epoch 1: Loss = 566.5738, Training Accuracy = 0.7777
Epoch 2: Loss = 401.0009, Training Accuracy = 0.8465
Epoch 3: Loss = 368.7477, Training Accuracy = 0.8567
Epoch 4: Loss = 346.4264, Training Accuracy = 0.8675
Epoch 5: Loss = 331.7427, Training Accuracy = 0.8708
Epoch 6: Loss = 318.6099, Training Accuracy = 0.8769
Epoch 7: Loss = 312.9610, Training Accuracy = 0.8787
Epoch 8: Loss = 301.5828, Training Accuracy = 0.8831
Epoch 9: Loss = 297.4196, Training Accuracy = 0.8829
Epoch 10: Loss = 289.5726, Training Accuracy = 0.8874

Training with Tanh + Adam:
Epoch 1: Loss = 526.1339, Training Accuracy = 0.7991
Epoch 2: Loss = 410.0683, Training Accuracy = 0.8452
Epoch 3: Loss = 385.6389, Training Accuracy = 0.8517
Epoch 4: Loss = 366.4562, Training Accuracy = 0.8579
Epoch 5: Loss = 355.0481, Training Accuracy = 0.8626
Epoch 6: Loss = 343.3016, Training Accuracy = 0.8673
Epoch 7: Loss = 337.9778, Training Accuracy = 0.8693
Epoch 8: Loss = 333.9000, Training Accuracy

The final iteration showed no improvement in accuracy or loss: accuracy dropped back to ~10% and the loss rose sharply, as in the initial run.

In conclusion, of the four learning rates tried (0.001, 0.01, 0.1 and 1.0), 0.1 gave the highest accuracy for the SGD optimizer when used with the Sigmoid activation function.
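
The same qualitative pattern (too small is slow, too large fails to converge) shows up even in a toy problem. The sketch below runs plain gradient descent, without momentum, on f(w) = w^2 for the four learning rates tried above. It is an illustration only and says nothing quantitative about the FashionMNIST network.

```python
def gradient_descent_on_square(lr, num_steps=50, w_start=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = w_start
    for _ in range(num_steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad
    return w

final_w = {lr: gradient_descent_on_square(lr) for lr in (0.001, 0.01, 0.1, 1.0)}
for lr, w in final_w.items():
    print(f"lr={lr}: |w| after 50 steps = {abs(w):.6f}")
```

In this toy case lr = 0.1 gets closest to the minimum at 0, the two smaller rates are still far from it after 50 steps, and lr = 1.0 only flips the sign of w on every step without making progress.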

Having identified the best learning rate among those tried, in the next iteration I aim to check whether increasing the number of epochs from 10 to 15 has an impact on the loss and accuracy of all three activation function + optimizer combinations.

In [137]:
# Epoch experiment: number of epochs increased from 10 to 15
def train_and_evaluate(activation='relu', optimizer_type='adam', epochs=15):
    model = NewFeedForwardNet()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Define loss function
    criterion = nn.CrossEntropyLoss()

    # Choose optimizer
    if optimizer_type == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=0.001)
    elif optimizer_type == 'sgd':
        # Note: SGD uses the original learning rate of 0.001 here, not the 0.1 found best above.
        optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    else:
        raise ValueError("Invalid optimizer choice")

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0
        for images, labels in train_dataloader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            output = model(images, activation)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            correct += (output.argmax(dim=1) == labels).sum().item()
        
        train_accuracy = correct / len(train_dataloader.dataset)
        print(f"Epoch {epoch+1}: Loss = {total_loss:.4f}, Training Accuracy = {train_accuracy:.4f}")

    return model

# Run experiments
print("Training with ReLU + Adam:")
model_relu_adam = train_and_evaluate('relu', 'adam')

print("\nTraining with Tanh + Adam:")
model_tanh_adam = train_and_evaluate('tanh', 'adam')

print("\nTraining with Sigmoid + SGD:")
model_sigmoid_sgd = train_and_evaluate('sigmoid', 'sgd')


Training with ReLU + Adam:
Epoch 1: Loss = 559.2046, Training Accuracy = 0.7804
Epoch 2: Loss = 399.0117, Training Accuracy = 0.8467
Epoch 3: Loss = 364.9941, Training Accuracy = 0.8603
Epoch 4: Loss = 346.4972, Training Accuracy = 0.8666
Epoch 5: Loss = 332.2067, Training Accuracy = 0.8716
Epoch 6: Loss = 319.7782, Training Accuracy = 0.8764
Epoch 7: Loss = 306.9475, Training Accuracy = 0.8815
Epoch 8: Loss = 304.0997, Training Accuracy = 0.8827
Epoch 9: Loss = 295.7075, Training Accuracy = 0.8851
Epoch 10: Loss = 289.4848, Training Accuracy = 0.8877
Epoch 11: Loss = 280.6306, Training Accuracy = 0.8899
Epoch 12: Loss = 275.1522, Training Accuracy = 0.8907
Epoch 13: Loss = 273.9605, Training Accuracy = 0.8921
Epoch 14: Loss = 268.2923, Training Accuracy = 0.8952
Epoch 15: Loss = 264.1560, Training Accuracy = 0.8959

Training with Tanh + Adam:
Epoch 1: Loss = 518.2301, Training Accuracy = 0.8017
Epoch 2: Loss = 409.2020, Training Accuracy = 0.8430
Epoch 3: Loss = 383.2794, Training Acc

The output for the run with 15 epochs does not display any epochs for Sigmoid + SGD; the displayed log stops at the tanh+Adam stage. This could be due to several reasons, including but not limited to the following:
1. Vanishing Gradient: When using sigmoid activation functions, gradients can become extremely small during backpropagation, especially in deeper networks. This is because the derivative of the sigmoid function lies between 0 and 0.25 and is close to 0 when the input is large in magnitude, so the gradient shrinks as it is multiplied through successive layers. This causes the network to learn very slowly or not at all.

2. Learning Rate Sensitivity: SGD is particularly sensitive to the learning rate when paired with sigmoid activations. As noted in one discussion, changing from Adam to SGD required significant learning rate adjustments, but even with various learning rate values, SGD still struggled to train effectively.
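
The vanishing-gradient point in item 1 can be made concrete with a few lines of arithmetic. The sketch below evaluates the sigmoid derivative at sample inputs and the upper bound on the product of three such derivatives, one per hidden layer of the network above. It ignores the weight matrices, so it is only a rough illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_derivative = sigmoid_derivative(0.0)   # the maximum, reached at x = 0
three_layer_bound = max_derivative ** 3    # best case across 3 sigmoid layers

print(max_derivative)         # 0.25
print(three_layer_bound)      # 0.015625
print(sigmoid_derivative(5))  # much smaller for inputs far from 0
```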