# Homework 1: Feed-forward Neural Networks (100 points)

### Overview

Below you will find a PyTorch implementation of a feed-forward neural network for image recognition. We use the popular MNIST dataset, where the model predicts a single digit (0-9) for a grayscale image of a handwritten digit. This is a _classification_ task.

### NN Architecture

Each image is 28x28 pixels, with grayscale values between 0 and 255 (the `transforms.ToTensor()` call in the code below rescales them to the range 0 to 1). In preprocessing, we flatten each image into a single vector of length $28^2 = 784$, which serves as the entire input to the model.
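As a small illustration of the flattening step, the sketch below uses a made-up all-zero image stored as a plain Python list of rows (an assumption for illustration only). In the notebook itself the same reshaping is done for a whole batch by `images.view(-1, 28*28)`.

```python
# A hypothetical 28x28 image stored as a list of rows (all-zero pixels, for illustration only).
image = [[0 for _ in range(28)] for _ in range(28)]

# Flatten row by row into a single vector.
flat = [pixel for row in image for pixel in row]

print(len(flat))  # 28 * 28 entries
```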

For each image, we aim to predict one of ten classes (0-9). We could use an output layer $y$ of size 1 (a single neuron), for example with a naive mapping such as prediction $p = \mathrm{int}(10y)$. But this presupposes that a handwritten 0 is similar to a handwritten 1 and very different from a handwritten 9, which isn't the case. So instead we use an output layer $y$ of size 10, where the prediction is $p = \mathrm{argmax}(y)$, so each output neuron controls the likelihood of a particular class.
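The argmax prediction can be sketched in plain Python. The ten scores below are made-up values, and the helper names `softmax` and `predict` are hypothetical, not part of the notebook. The softmax step is shown only to make the "likelihood" reading concrete; it does not change which index is largest.

```python
import math

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return the index of the largest score, i.e. the predicted digit."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical raw outputs for one image: one score per digit 0-9.
y = [0.1, -1.2, 0.3, 2.5, 0.0, -0.7, 0.4, 4.1, 1.0, -0.2]
probs = softmax(y)
p = predict(y)
print(p, round(probs[p], 3))
```

Because softmax is monotonic, taking the argmax of the raw scores or of the probabilities gives the same digit.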

We use a simple two-layer neural network. To begin, we will have an input size of 784, a hidden layer of size 5, and an output layer of size 10.

### Your Task

At the bottom of this notebook file, there is a series of questions testing your understanding of this neural network architecture. Some questions include instructions where you will need to modify hyperparameters (marked in the code below) and re-run the model to investigate the changed results. __There is no need to read through the following code in depth to answer the questions, but it may be useful as a reference.__

Below each question is a cell with the text “Type Markdown and LaTex.” Double-click the cell and type your response to the question. Save your responses by clicking on the floppy disk icon or choosing File - Save and Checkpoint.

After responding to the questions, download your notebook as a `.html` file by choosing File - Download as - html (.html). You will be submitting this `.html` file to your instructor for grading.

In [1]:
import torch
import torch.nn as nn
import torchvision.datasets as datasets
import torchvision.transforms as transforms
torch.manual_seed(0)

<torch._C.Generator at 0x772f2c216e70>

In [2]:
root_dir = 'assets_week1'
trainDataset = datasets.MNIST(root=root_dir, train=True, transform=transforms.ToTensor(), download=True)
testDataset = datasets.MNIST(root=root_dir, train=False, transform=transforms.ToTensor())

In [3]:
class NNModel(nn.Module):
    def __init__(self, inputSize, outputSize, hiddenSize, activate):
        super().__init__()
        
        self.activate = nn.Sigmoid() if activate == "Sigmoid" else nn.Tanh() if activate == "Tanh" else nn.ReLU()
        self.layer1 = nn.Linear(inputSize, hiddenSize)
        self.layer2 = nn.Linear(hiddenSize, outputSize)
        
    def forward(self, X):
        hidden = self.activate(self.layer1(X))
        return self.layer2(hidden)

In [4]:
# The dimensionality of the input
inputSize = 784
# Number of neurons in the first layer
hiddenSize = 5 #300 #5
# Number of neurons in the second layer
outputSize = 10
# Activation function (default: ReLU, options: Sigmoid, Tanh, ReLU)
activation = "ReLU" #"Sigmoid"
# Learning rate
learningRate = .001 #1 #0.001
# Number of training epochs
numEpochs = 5 #10 #5
# Number of training examples per batch
batchSize = 200

In [5]:
trainLoader = torch.utils.data.DataLoader(dataset=trainDataset, batch_size=batchSize, shuffle=True)
testLoader = torch.utils.data.DataLoader(dataset=testDataset, batch_size=batchSize, shuffle=False)

net = NNModel(inputSize, outputSize, hiddenSize, activation)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)

print('>>> Beginning training!')
for epoch in range(numEpochs):
    for i, (images, labels) in enumerate(trainLoader):
        images = images.view(-1, 28*28)
        
        optimizer.zero_grad()
        
        # Forward propagation
        outputs = net(images)
        
        # Backpropagation
        loss = criterion(outputs, labels)
        loss.backward()
        
        # Gradient descent
        optimizer.step()
        
        # Logging
        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {}'.format(epoch+1, numEpochs, i+1, len(trainDataset)//batchSize, loss))

print()
print('>>> Beginning validation!')
correct, total = 0, 0
# Gradients are not needed for evaluation, so disable tracking to save memory
with torch.no_grad():
    for i, (images, labels) in enumerate(testLoader):
        images = images.view(-1, 28*28)

        outputs = net(images)
        _, prediction = torch.max(outputs, dim=1)
        correct += torch.sum(prediction == labels).item()
        total += labels.size(0)
print('Validation accuracy: {}%'.format(correct/total*100))

>>> Beginning training!
Epoch [1/5], Step [100/300], Loss: 1.5429329872131348
Epoch [1/5], Step [200/300], Loss: 1.1821335554122925
Epoch [1/5], Step [300/300], Loss: 1.0662295818328857
Epoch [2/5], Step [100/300], Loss: 0.8010011315345764
Epoch [2/5], Step [200/300], Loss: 0.6519669890403748
Epoch [2/5], Step [300/300], Loss: 0.5987505912780762
Epoch [3/5], Step [100/300], Loss: 0.5415259003639221
Epoch [3/5], Step [200/300], Loss: 0.5746868252754211
Epoch [3/5], Step [300/300], Loss: 0.5670210719108582
Epoch [4/5], Step [100/300], Loss: 0.39370736479759216
Epoch [4/5], Step [200/300], Loss: 0.5235522985458374
Epoch [4/5], Step [300/300], Loss: 0.5720871090888977
Epoch [5/5], Step [100/300], Loss: 0.5311781764030457
Epoch [5/5], Step [200/300], Loss: 0.737825334072113
Epoch [5/5], Step [300/300], Loss: 0.5962098836898804

>>> Beginning validation!
Validation accuracy: 87.69%


## Homework Questions

Your goal is to improve the model's accuracy by tuning hyperparameters. If a question asks you to modify a hyperparameter and you obtain improved results, retain that hyperparameter change for subsequent questions. Otherwise, revert to the original hyperparameter value.

**To make sure your code produces consistent results, it is advisable to click "Kernel -> Restart & Run All" every time you want to run your code.**

### Question 1: Loss Minimization & Gradient Descent (5 points)

Given a neural network with model parameters $\theta$, loss function $E$, and learning rate $\alpha$, what is the correct method to perform gradient descent?

a) $\theta_i += \alpha E$

b) $\theta_i -= \alpha E$

c) $\theta_i += \alpha\frac{\partial E}{\partial \theta_i}$

d) $\theta_i -= \alpha\frac{\partial E}{\partial \theta_i}$

The correct answer is d. Gradient descent computes the gradient of the loss with respect to each parameter and moves the parameter in the opposite direction, scaled by the learning rate $\alpha$. The update is repeated until the loss reaches a minimum or the gradient becomes very small.
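To see why the sign matters, here is a minimal sketch on a toy one-parameter loss $E(\theta) = (\theta - 3)^2$. The loss, starting point, and step count are assumptions chosen for illustration and are unrelated to the MNIST model. It applies update d (subtract the gradient) and update c (add the gradient) side by side.

```python
def loss(theta):
    """Toy loss E(theta) = (theta - 3)^2, minimized at theta = 3."""
    return (theta - 3.0) ** 2

def grad(theta):
    """Analytic derivative dE/dtheta = 2 * (theta - 3)."""
    return 2.0 * (theta - 3.0)

alpha = 0.1            # learning rate
theta_descent = 0.0    # updated with option d
theta_ascent = 0.0     # updated with option c

for _ in range(50):
    theta_descent -= alpha * grad(theta_descent)  # option d: step against the gradient
    theta_ascent += alpha * grad(theta_ascent)    # option c: step along the gradient

print("option d:", theta_descent, loss(theta_descent))
print("option c:", theta_ascent, loss(theta_ascent))
```

Option d moves toward the minimum at 3, while option c moves away from it and the loss grows.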

### Question 2: Class Imbalance (10 points)

Imagine you are an engineer tasked with helping a company to identify faulty parts early using a machine learning-based image recognition system. What evaluation metric would you use? More specifically, explain why a raw percent accuracy score would be a poor choice of evaluation metric for this problem space.

Class imbalance would mean that there are significantly more instances of one class (non-faulty parts) than the other (faulty parts).

Simple accuracy as the evaluation metric would lead to an over-optimistic estimate of the model's performance, as it could achieve a high accuracy score by always predicting the majority class (non-faulty parts) without detecting any faulty parts.

Therefore other evaluation metrics such as precision and recall would be more suitable. Precision measures the proportion of predicted faulty parts that are correctly classified as faulty. Recall measures the proportion of actual faulty parts that are correctly identified as faulty.
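The effect can be shown with a small made-up example. The counts below (5 faulty parts in 100) and both sets of predictions are hypothetical, and precision is taken to be 0 when nothing is predicted faulty, which is a convention assumed here.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """Fraction of predicted-faulty parts that really are faulty (0.0 if none are predicted)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(y_true, y_pred):
    """Fraction of truly faulty parts that were flagged as faulty."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical inspection batch: 1 = faulty, 0 = non-faulty.
labels = [1] * 5 + [0] * 95

# A useless "model" that always predicts the majority class.
always_ok = [0] * 100
# A hypothetical detector that catches 4 of 5 faulty parts and raises 2 false alarms.
detector = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

for name, preds in (("always_ok", always_ok), ("detector", detector)):
    print(name, accuracy(labels, preds), precision(labels, preds), recall(labels, preds))
```

The always-non-faulty predictor scores 95% accuracy yet has zero recall, while the detector's accuracy is only slightly higher even though it finds most of the faulty parts.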

### Question 3a:  Size of a Hidden Layer (10 points)

Explain how the hidden layer size influences the architecture of a feed-forward neural network. In doing so, note what can happen if the hidden size is too large and what can happen if the hidden size is too small.

If the hidden layer size is too small, the neural network may not be able to learn complex patterns in the data resulting in poor performance on both the training and testing data.

If the hidden layer size is too large, the neural network may become too complex and may overfit the training data. This means that the model may perform well on the training data but poorly on new data.

Therefore, we need to tune this hyperparameter to find the hidden layer size that works best for the dataset we are interested in.

### Question 3b: Size of a Hidden Layer  (10 points)

Increase the hidden size from 5 to 300 and re-run your trial. How does the accuracy change?

_a) It increases, since the model learns more quickly_

_b) It increases, since the model has more memory and can learn more complex features_

_c) It decreases, since the model has to learn more parameters and it doesn't have enough time_

_d) It decreases, since the model has less memory_

Accuracy with a hidden size of 5:
Validation accuracy: 74.42999999999999%

Accuracy with a hidden size of 300:
Validation accuracy: 95.48%

The accuracy jumped when we increased the hidden size from 5 to 300, but training took somewhat longer, probably because the model had to learn many more parameters.

The correct answer is b: accuracy increases, since the model has more memory and can learn more complex features.
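The growth in parameter count can be worked out directly from the layer shapes. The helper `count_parameters` below is a hypothetical name; it assumes the two `nn.Linear` layers defined in `NNModel`, each with a weight matrix plus one bias per output neuron.

```python
def count_parameters(input_size, hidden_size, output_size):
    """Count weights and biases in a two-layer fully connected network."""
    layer1 = input_size * hidden_size + hidden_size    # weights + biases of layer 1
    layer2 = hidden_size * output_size + output_size   # weights + biases of layer 2
    return layer1 + layer2

small = count_parameters(784, 5, 10)
large = count_parameters(784, 300, 10)
print(small, large)
```

With a hidden size of 5 the network has 3,985 trainable parameters; with 300 it has 238,510, roughly 60 times as many.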

### Question 4a: Learning Rate  (10 points)

Explain the purpose of a learning rate. In doing so, note what can happen if the learning rate is too large and what can happen if the learning rate is too small.

The learning rate controls the step size at which the model's parameters are updated during training. In other words, it determines how quickly or slowly the model learns from the data.

If the learning rate is too large, the model may overshoot the optimal solution, causing the loss to oscillate or diverge.

If the learning rate is too small, the model may take too long to converge and may get stuck in a poor local optimum.
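A rough way to see both failure modes is to run plain gradient descent on the toy loss $E(\theta) = \theta^2$ with three step sizes. This is only a sketch: the loss, the three rates, and the helper name `run` are assumptions, and the notebook itself trains with Adam rather than plain gradient descent, so the numbers do not transfer directly.

```python
def run(alpha, steps=20, theta=1.0):
    """Run plain gradient descent on E(theta) = theta^2 and return the final theta."""
    for _ in range(steps):
        theta -= alpha * 2.0 * theta  # dE/dtheta = 2 * theta
    return theta

final = {alpha: run(alpha) for alpha in (0.001, 0.1, 1.1)}
for alpha, theta in final.items():
    print(alpha, theta)
```

After 20 steps the smallest rate has barely moved from the starting point, the middle rate is close to the minimum at 0, and the largest rate has overshot on every step and ended up farther from the minimum than where it started.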

### Question 4b: Learning Rate  (10 points)

Increase the learning rate from 0.001 to 1. How does the accuracy change?

a) It increases, since the model learns more quickly

b) It increases, since the model is better able to converge

c) It decreases, since the model learns too slowly

d) It decreases, since the model is not able to converge

Accuracy with a learning rate of 0.001:
Validation accuracy: 75.64999999999999%

Accuracy with a learning rate of 1:
Validation accuracy: 58.37%

To determine whether accuracy decreases because the model learns too slowly or because it is not able to converge, let's increase the number of epochs to 10 and re-run the model.

Accuracy with a learning rate of 1 and numEpochs = 10 (instead of 5):
Validation accuracy: 19.470000000000002%

Increasing the number of epochs decreases the accuracy further, which means the model is not able to converge. Hence the answer is d.


### Question 5a: Activation Functions (10 points)

Explain the main purpose of an activation function in neural networks. Also, explain the main benefit of the Tanh activation function over the Sigmoid activation function, and the main benefit of the ReLU activation function over the Sigmoid activation function.

The main purpose of an activation function is to introduce non-linearity to the output of a neuron. Non-linearity means that the output is not just a linear combination of the inputs, but a more complex function that can better capture the relationships between the inputs and the outputs.

The Tanh activation function is similar to the Sigmoid activation function, but it outputs values between -1 and 1 instead of between 0 and 1. The main benefit of Tanh over Sigmoid is that its output is zero-centered and its gradient is larger near zero (at most 1, versus at most 0.25 for Sigmoid). This reduces, but does not eliminate, the vanishing gradient problem: both functions still saturate, so their gradients become very small for inputs of large magnitude.

The ReLU (Rectified Linear Unit) activation function is a very simple function that outputs the input value if it is positive, and 0 otherwise. This means that it introduces non-linearity to the output of a neuron in a very simple way. The main benefit of ReLU over Sigmoid is that it is computationally cheap and does not saturate for positive inputs (its gradient there is exactly 1), which helps prevent the vanishing gradient problem.
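The gradient behaviour of the three functions can be checked numerically. The sketch below uses the standard closed-form derivatives; the function names are hypothetical, and the ReLU derivative at exactly 0 is taken to be 0 by convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1 when x = 0."""
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (-5.0, 0.0, 5.0):
    print(x, d_sigmoid(x), d_tanh(x), d_relu(x))
```

At an input of 5 the Sigmoid and Tanh gradients are both close to zero, while the ReLU gradient is still 1.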

### Question 5b: Activation Functions (5 points)

Change the activation function in the hyperparameter list above to determine which activation function is most effective at this task.

a) ReLU

b) Sigmoid

c) Tanh

Accuracy with Sigmoid:
Validation accuracy: 78.73%

Accuracy with Tanh:
Validation accuracy: 83.3%

Accuracy with ReLU:
Validation accuracy: 85.41%

So the answer is a: ReLU is the most effective activation function for this task.

### Question 6: Overfitting  (10 points)

Define overfitting and explain how it can damage model training and results.

Overfitting is a problem in machine learning, which occurs when a model learns the training data too well, to the extent that it captures the noise and randomness in the data, rather than the underlying patterns. This can lead to poor performance when the model is applied to new, unseen data.

The problem with overfitting is that the model becomes too sensitive to the noise and randomness in the training data and loses its ability to generalize to new data. This means that the model may perform well on the training data, but perform poorly on the test data or real-world data.
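A tiny, fully made-up example can make the train/test gap concrete. The data, the two flipped labels, and both "models" below are assumptions chosen for illustration: a 1-nearest-neighbor memorizer that copies training labels (noise included) is compared with the simple underlying rule.

```python
# Toy 1-D data: the true rule is label = 1 when x >= 0.5.
train_x = [i / 20 for i in range(20)]
train_y = [1 if x >= 0.5 else 0 for x in train_x]
# Inject label noise by flipping two training labels (indices chosen arbitrarily).
for i in (3, 15):
    train_y[i] = 1 - train_y[i]

# Clean, unseen test points that fall between the training points.
test_x = [(i + 0.25) / 20 for i in range(20)]
test_y = [1 if x >= 0.5 else 0 for x in test_x]

def memorizer(x):
    """1-nearest-neighbor: copy the label of the closest training point, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def simple_rule(x):
    """The underlying pattern, ignoring the noisy labels."""
    return 1 if x >= 0.5 else 0

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

for name, model in (("memorizer", memorizer), ("simple_rule", simple_rule)):
    print(name, accuracy(model, train_x, train_y), accuracy(model, test_x, test_y))
```

The memorizer is perfect on the training set because it reproduces the noisy labels, but it carries those mistakes over to the test points. The simple rule scores lower on the noisy training set and higher on the clean test set.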

### Question 7: Early Stopping  (10 points)

Outline a procedure for early stopping to prevent overfitting. Clearly describe how you’d use the training, validation, and test set accuracy to decide when to stop.

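Early stopping is a technique used in machine learning to prevent overfitting by stopping the training process when the model's performance on the validation set stops improving. A typical procedure: after each epoch, measure accuracy on both the training and validation sets, and save the model whenever validation accuracy reaches a new best. If training accuracy keeps rising while validation accuracy has not improved for a fixed number of epochs, stop training and restore the best saved model. The test set plays no part in the stopping decision; it is evaluated only once, at the end, to report an unbiased estimate of final accuracy.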
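One common way to decide that validation performance has "stopped improving" is a patience counter. The sketch below is a hedged illustration: the function name `early_stopping_epoch`, the `patience` parameter, and the accuracy history are all hypothetical, and a real training loop would also save the model at the best epoch.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once validation accuracy has failed to improve
    for `patience` consecutive epochs.
    """
    best_acc = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran for every available epoch.
    return best_epoch, len(val_accuracies)

# Hypothetical validation accuracies, one per epoch (made-up numbers).
history = [0.80, 0.86, 0.90, 0.91, 0.90, 0.89, 0.88]
best, stopped = early_stopping_epoch(history, patience=2)
print(best, stopped)
```

With this history, training stops at epoch 6 and the model from epoch 4 is the one to keep.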

### Question 8: Regularization  (10 points)

Briefly explain a few common methods of regularization to prevent overfitting.

L1 Regularization (as in Lasso regression): This method adds a penalty term to the loss function, proportional to the sum of the absolute values of the model parameters (the L1 norm). This encourages sparse weight vectors, meaning some weights are driven to exactly zero. This makes the model simpler and less prone to overfitting.

L2 Regularization (as in Ridge regression): This method adds a penalty term to the loss function, proportional to the sum of the squares of the model parameters (the squared L2 norm). This encourages small weights, reducing the influence of any single large weight on the output and helping the model generalize.

Dropout Regularization: This method randomly drops a fraction of the neurons in the network during each training step, forcing the model to learn more robust features and reducing its reliance on any individual neuron, which in turn reduces overfitting.

Early Stopping: This method stops the training process when the model's performance on the validation set stops improving, preventing the model from overfitting.
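The two penalty terms can be computed by hand for a small weight vector. Everything below is a made-up illustration: the weights, the data loss value, and the regularization strength `lam` are assumptions, not values from the notebook.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 1.5]  # hypothetical model weights
data_loss = 0.6                  # hypothetical loss on the training batch
lam = 0.01                       # hypothetical regularization strength

total_l1 = data_loss + l1_penalty(weights, lam)
total_l2 = data_loss + l2_penalty(weights, lam)
print(total_l1, total_l2)
```

The L2 penalty grows with the square of each weight, so the weight of -2.0 dominates it, whereas the L1 penalty charges every nonzero weight in proportion to its magnitude.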