### Objectives

1. Understand how to use pytorch to write a neural network
2. Write a neural network with multiple hidden layers

Linear layers are used in this lab, 
- 3 hidden layers 
- MNIST database: 28*28 pixel input (= 784 dimensional vector, a 784 1-dimensional tensor)
- The output layer can be an integer btw 0 and 9 (= 10 dimensional vector, a 10 1-dimensional tensor)
- sigmoid activation function for hidden layers
- softmax function for output layer

![](img/lab3/network.png)

In [1]:
import torch
from torchvision import transforms, datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

train = datasets.MNIST("", train = True, download = True, transform = transforms.Compose([transforms.ToTensor()]))
test = datasets.MNIST("", train = False, download = True, transform = transforms.Compose([transforms.ToTensor()]))

# DataLoader() wraps an iterable over the given dataset and supports automatic batching, sampling, shuffling and multiprocess data loading
trainset = torch.utils.data.DataLoader(train, batch_size = 10, shuffle = True)
testset = torch.utils.data.DataLoader(test, batch_size = 10, shuffle = True)
# batch_size determine how many samples to pass through network before w and b are updated
# reduces memory downloads and increases speed to train
# one epoch involves number of samples/batch_size updates to the model

In [2]:
n = 128 # number of neurons
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28*28, n) # creating a layer of n neurons with 28*28 inputs
        self.fc2 = nn.Linear(n, n) # nn.Linear is an affine trasformation
        self.fc3 = nn.Linear(n, n)
        self.fc4 = nn.Linear(n, n)
        self.fc5 = nn.Linear(n, 10)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x)) # 0.103 (64 neurons)
        x = torch.sigmoid(self.fc2(x))
        x = torch.sigmoid(self.fc3(x)) # number of layers
        # x = torch.sigmoid(self.fc4(x)) # better accuracy without fourth layer
        # x = F.relu(self.fc1(x)) # 0.118 (64 neurons)
        # x = F.relu(self.fc2(x))
        # x = F.relu(self.fc3(x))
        # x = torch.tanh(self.fc1(x)) # 0.411 (64 neurons), 0.47 (128 neurons), 0.353 (200 neurons)
        # x = torch.tanh(self.fc2(x))
        # x = torch.tanh(self.fc3(x))
        # x = torch.tanh(self.fc4(x)) # 0.442 (128 neurons, 4 hidden layers)
        x = self.fc5(x)

        # computes to softmax function of given input
        return F.softmax(x, dim = 1)

# create instance of neural network
net = Net()

##### Tested with different activation functions, optimisers, and number of layers

Best computation was with Sigmoids and Adam. Number of hidden layers, as well as neurons, can affect the accuracy. Increasing the epochs with SGD will be very slow.

##### What is ReLu?
The ReLu function stands for rectified linear activation function, which is a piecewise linear function that will output the inpur directly if it is positive, other it will output zero.

##### Sigmoid VS Relu?
Not dataset dependent, and choice of algorithm is guided by the problem, which requires lots of practice to get better at.

##### What is Adam gradient descent?
Adams is an adaptive method, as opposed to SGD, which uses a combination of momentum and adaptive learning rate to speed up the converge of the optimisation process, and avoid getting stuck. 

1) Adaptive Learning Rate (adapts learning rate for each weight during training, converge faster and prevents oscillations)
2) Momentum Updates (keeps a running average of the gradient updates)

- Adam adapts the learning rate for each parameter based on the estimated first and second moments of gradients as opposed to using a fixed learning rate.
- Momentum allows the alg to continue moving in the same direction as the gradients, even when they are small or noisy.
- Its robustness makes it perform well even with noisy gradients, and when there are many local minima.
- Important to note, it is not always the best choice for every problem and requires trial and error.

In [3]:
# sets up optimisation method for backpropagation alg
# optimiser = optim.SGD(net.parameters(), lr = 0.001) # SGD (Stoachastic Gradient Descent)
optimiser = optim.Adam(net.parameters(), lr = 0.001) # Adam (increases Accuracy drastically)

In [4]:
Epochs = 3 # number of training epochs

# iterate over the training data (trainset)
for epoch in range(Epochs):
    for data in trainset:
        X, y = data # assign input (X) and labels (y)
        net.zero_grad() # set gradiant stored to zero to reset gradient value for each iteration
        output = net.forward(X.view(-1,28*28)) # transform 2 dimensional tensor (28x28 matrix) input to 1 dimension (784 vector)
        loss = F.nll_loss(output, y) # loss function (cross-entropy sicne we are working w classifier)
        loss.backward() # compute gradient wrt loss function over each parameter of the network (must set gradient to 0, line 46)
        optimiser.step() # update parameters of the network according to the optimisation alg and gradient stored within each variable

##### Why can't the gradients be automatically zeroed when loss.backward() is called?

- net.zero_grad() sets the gradient as 0, as by default, the gradients are accumulated in buffers and not overwritten whenever backwards() is called.
- the previous gradient is needed in two cases: 1) when we want to perform gradient descent, as optimiser.step() is called after loss.backward(). 2) We need to accumulate gradient amongst some batches; to do that, we can simply call backward multiple times and optimise once.

In [5]:
correct, total = 0, 0

with torch.no_grad(): # network won't update gradient stored in each variable in the test sessions
    for data in testset:
        X, y = data
        output = net.forward(X.view(-1, 28*28))
        for idx, i in enumerate(output):
            if torch.argmax(i) == y[idx]:
                correct += 1
            total += 1
print("Accuracy: ", round(correct/total, 3))

Accuracy:  0.938


##### What is a tensor?
A tensor is a data structure used to represent and manipulate complex data in neural networks. 1-dimensional tensors are a vector. They are used to represent input data, model parameters, and intermediate computations in neural networks. Tensors offer several advantages over vectors, it can represent data with any number of dimensions, it is easy to manipulate because they can be reshaped, transposed, sliced, concatenated, making it easier to preprocess and trasform for use in ML models. ML libaries are optimised for tensor operations, enabling faster and more efficient computation. It is also compatable with ML frameworks, which uses them as a common data format.
![](img/lab3/tensors.png)

##### Result of adding more layers? why does the accuracy go down?

The decrease in accuracy suggests that these layers are leading to overfitting, which implies that it is a good fit for the training data, but does not work well with new data, as it learns specific noise and trends that may not be general to all data.

Also increase in layers may require much more epochs for it to converge, which takes A LOT of time (could add a stopping criteria rather than running for so many epochs).

##### Choosing the number of hidden layers and nodes in a feedforward neural network

There are three types of layers: input, hidden and output.

Input Layer - the number of neurons comprising the layer is equal to the number of features (columns) in your data, with some adding an additional node for a bias term.

Hidden Layer - responsible for transofmring the input data into a form that can be used to make predictions, there can be one or more. 

Output Layer - one output layer, number of nodes depends on the output that the model is designed to produce.