<a href="https://colab.research.google.com/github/lefteryx/MNIST-Dataset-Classification/blob/main/MNIST_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Explain the different approaches you took while building this model and explain why certain approaches failed.*

***

Overall, building a neural network involves selecting appropriate functions and algorithms, experimenting with different techniques to improve performance, and carefully monitoring the model's performance on a validation set to catch overfitting.

My approach largely followed what I had learnt from 3B1B's video on neural networks and from the official PyTorch documentation-cum-tutorial: a 28*28 = 784-unit input layer, a few hidden layers, and a 10-unit output layer.

As far as the number of hidden layers is concerned, a single layer might not have had enough capacity to learn the patterns in the MNIST dataset, resulting in underfitting and poor performance.
Too many layers would have been a problem as well: the added complexity makes the model harder to train and more prone to overfitting, which hurts performance in the testing phase.

I took the same middle path for the number of neurons in the hidden layers (512 each), for reasons similar to those given above for the number of hidden layers.
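To make the capacity trade-off a little more concrete, here is a small sketch that counts the trainable parameters of a fully-connected network. The layer sizes 784, 512, 512 and 10 are the ones used in the model in this notebook; the helper name `count_parameters` and the alternative layouts are only illustrative assumptions.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully-connected network.

    layer_sizes lists the units per layer, input first, output last.
    Each pair of consecutive layers contributes a weight matrix
    (n_in * n_out entries) plus one bias per output unit.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


# Layout used in this notebook: two hidden layers of 512 units each
print(count_parameters([784, 512, 512, 10]))

# Hypothetical alternatives, shown only for comparison
print(count_parameters([784, 512, 10]))             # one hidden layer
print(count_parameters([784, 512, 512, 512, 10]))   # three hidden layers
```

Each extra 512-unit hidden layer adds 512 * 512 + 512 parameters, which is one rough way to see how depth increases the model's capacity and training cost.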

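I preferred ReLU over the sigmoid function because ReLU is cheaper to compute and does not saturate for positive inputs, whereas the sigmoid's gradient shrinks towards zero for inputs of large magnitude, which slows down training.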
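The difference between the two activations is easiest to see in their derivatives. The sketch below is plain Python rather than PyTorch, and the function names are my own; it only illustrates how the sigmoid's gradient fades for large inputs while ReLU's stays at 1 for any positive input.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0


for x in (-6.0, -2.0, 0.5, 2.0, 6.0):
    print(f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  relu'={relu_grad(x):.1f}")
```

Because backpropagation multiplies these derivatives layer by layer, small sigmoid gradients compound, while ReLU passes the gradient through unchanged wherever the unit is active.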

I also chose a moderate learning rate (0.01): if it is too low, training takes too long, and if it is too high, the updates may overshoot the minimum of the loss function and fail to converge at all.
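A toy example can show both failure modes. The sketch below runs plain gradient descent on the one-dimensional loss f(w) = w^2, which is an assumption made purely for illustration and has nothing to do with the MNIST loss itself; the three learning rates are likewise hypothetical.

```python
def descend(lr, steps=50, w0=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w.

    The gradient of w**2 is 2*w, so each step computes
    w <- w - lr * 2 * w. The minimum is at w = 0.
    """
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {descend(lr):.6g}")
```

With the smallest rate the weight has barely moved after 50 steps, with the middle rate it is very close to the minimum at 0, and with the largest rate every step overshoots further than the last, so the value grows instead of shrinking.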

For the optimization algorithm, I chose Stochastic Gradient Descent (SGD) for its simplicity and popularity. From my research, Adam and RMSProp, among others, may have been more efficient choices, but I could not find the time to study them in detail because of an unforeseen illness.

Still, SGD has a lot of benefits; the two most important ones are listed below:

1. Efficiency - SGD is computationally cheap per update because each step uses only a mini-batch of the data, so it can handle even large datasets.

2. Simplicity - SGD is a simple algorithm that is easy to implement and understand. It only requires computing the gradient of the loss function with respect to the model's weights, and moving in the opposite direction of the gradient. This makes it a straightforward algorithm to use and debug.
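The update rule described in the list above fits in a few lines. This is a minimal sketch in plain Python, not what `optim.SGD` does internally: it fits a line to synthetic, noise-free points on y = 2x + 1, updating after every single sample. The names `sgd_step` and `fit_line` and the data are assumptions made for illustration.

```python
import random


def sgd_step(weights, grads, lr):
    # Move each weight a small step against its gradient
    return [w - lr * g for w, g in zip(weights, grads)]


def fit_line(samples, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b with per-sample SGD on the squared error."""
    rng = random.Random(seed)
    data = list(samples)   # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new order each epoch
        for x, y in data:
            err = (w * x + b) - y
            # Gradients of err**2 with respect to w and b
            w, b = sgd_step([w, b], [2.0 * err * x, 2.0 * err], lr)
    return w, b


# Synthetic points on the line y = 2x + 1
points = [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w, b = fit_line(points)
print(f"w = {w:.4f}, b = {b:.4f}")
```

The training loop in this notebook follows the same pattern, with `loss.backward()` computing the gradients and `optimizer.step()` applying the update to every parameter of the model.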
 



*Find ways to explain how and why the model converges and at what point overfitting takes place.*
***

When a neural network is trained, an optimization algorithm adjusts the weights of the network in order to minimize the loss function. The model is said to converge when further updates no longer reduce the loss appreciably.

There are several factors that can influence whether a model will converge and how long it will take to converge. For example, the choice of optimization algorithm, the learning rate, and the size and complexity of the model can all affect convergence.

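In general, the model converges when the optimization algorithm settles on a set of weights at or near a (local) minimum of the loss function. Overfitting begins at the point where the training loss keeps falling while the loss on held-out data starts to rise; the run in this notebook logs only the training loss, so that point cannot be read off from its output.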
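The training loop in this notebook prints only the loss of the last mini-batch in each epoch, so the logged values are noisy. One simple way to see the underlying trend is a moving average. The sketch below uses the 20 losses from this notebook's output; the helper `moving_average` and the window of 5 epochs are my own assumptions.

```python
# Last-batch losses printed by the training run in this notebook, epochs 1 to 20
logged_losses = [
    0.6509, 0.4683, 0.4220, 0.5973, 0.2738,
    0.2659, 0.1819, 0.6452, 0.1867, 0.0864,
    0.0853, 0.1220, 0.0432, 0.0695, 0.0793,
    0.1964, 0.1369, 0.0918, 0.0625, 0.0825,
]


def moving_average(values, window):
    """Average every run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


window = 5
smoothed = moving_average(logged_losses, window)
for first_epoch, avg in enumerate(smoothed, start=1):
    last_epoch = first_epoch + window - 1
    print(f"epochs {first_epoch}-{last_epoch}: mean logged loss {avg:.4f}")
```

A flattening moving average suggests the training loss is converging. It says nothing about overfitting on its own; for that, the same kind of curve would have to be tracked on held-out data as well.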



In [None]:
# Import necessary modules
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms


class NeuralNetwork(nn.Module):
  
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        # Define the layers of the network
        self.fc1 = nn.Linear(28 * 28, 512) # Fully-connected layer with 512 units
        self.fc2 = nn.Linear(512, 512) # Fully-connected layer with 512 units
        self.fc3 = nn.Linear(512, 10) # Fully-connected layer with 10 units

    def forward(self, x):
        # Define the forward pass through the network
        x = x.view(-1, 28 * 28) # Reshape the input tensor into a 2D tensor
        x = F.relu(self.fc1(x)) # Apply ReLU activation function to the output of the first fully-connected layer
        x = F.relu(self.fc2(x)) # Apply ReLU activation function to the output of the second fully-connected layer
        x = self.fc3(x) # Pass the output of the second fully-connected layer through the third fully-connected layer
        return x

# Set hyperparameters: very important for the performance of our model
batch_size = 64
learning_rate = 0.01
num_epochs = 20

# Load the MNIST dataset
train_dataset = datasets.MNIST(root='data', train=True, download=True,
                               transform=transforms.ToTensor())
test_dataset = datasets.MNIST(root='data', train=False, download=True,
                              transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Instantiating NeuralNetwork and creating an optimizer object
model = NeuralNetwork()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)


# Train the network
n_total_steps = len(train_loader)
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Pass the data through the network
        output = model(data)

        # Calculate the loss
        loss = F.cross_entropy(output, target)

        # Zero the gradients
        optimizer.zero_grad()

        # Backpropagation
        loss.backward()

        # Update the weights
        optimizer.step()

        # Log the loss of the last mini-batch of each epoch
        if batch_idx + 1 == n_total_steps:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step[{batch_idx+1}/{n_total_steps}], Loss: {loss.item():.4f}')


# Test the network
with torch.no_grad():
    n_correct = 0
    n_samples = 0
    for images, labels in test_loader:
        # forward() already flattens each image, so no reshape is needed here
        outputs = model(images)
        # torch.max returns (values, indices); the index is the predicted digit
        _, predicted = torch.max(outputs, 1)
        n_samples += labels.size(0)
        n_correct += (predicted == labels).sum().item()

    # Accuracy of the neural network on the test set
    acc = 100.0 * n_correct / n_samples
    print()
    print(f'Accuracy of the network on the 10000 test images: {acc}%')



Epoch [1/20], Step[938/938], Loss: 0.6509
Epoch [2/20], Step[938/938], Loss: 0.4683
Epoch [3/20], Step[938/938], Loss: 0.4220
Epoch [4/20], Step[938/938], Loss: 0.5973
Epoch [5/20], Step[938/938], Loss: 0.2738
Epoch [6/20], Step[938/938], Loss: 0.2659
Epoch [7/20], Step[938/938], Loss: 0.1819
Epoch [8/20], Step[938/938], Loss: 0.6452
Epoch [9/20], Step[938/938], Loss: 0.1867
Epoch [10/20], Step[938/938], Loss: 0.0864
Epoch [11/20], Step[938/938], Loss: 0.0853
Epoch [12/20], Step[938/938], Loss: 0.1220
Epoch [13/20], Step[938/938], Loss: 0.0432
Epoch [14/20], Step[938/938], Loss: 0.0695
Epoch [15/20], Step[938/938], Loss: 0.0793
Epoch [16/20], Step[938/938], Loss: 0.1964
Epoch [17/20], Step[938/938], Loss: 0.1369
Epoch [18/20], Step[938/938], Loss: 0.0918
Epoch [19/20], Step[938/938], Loss: 0.0625
Epoch [20/20], Step[938/938], Loss: 0.0825

Accuracy of the network on the 10000 test images: 96.16%
