# 1. Introduction
This notebook introduces Deep Learning in baby steps. For a beginner with no idea about neural networks, figuring out how they work can be frustrating sometimes. But don't worry, I'll try to keep things simple and build up the concepts one at a time.

I will also be providing video lectures and articles that helped me.
<div class="alert alert-block alert-warning"> 
<strong>Note</strong>: Deep learning code is structured VERY differently from the usual <em>sklearn</em> code for machine learning. In addition, deep learning usually works with <em>images</em> and <em>text</em>, while classical <em>ML</em> usually works with <em>tabular</em> data. So please be patient with yourself: if you don't understand something right away, keep reading/coding and it will all make sense in the end.
</div>

<img src='https://miro.medium.com/max/1200/1*4br4WmxNo0jkcsY796jGDQ.jpeg' width=200>

**PyTorch** is a widely used deep learning library and a popular alternative to *Keras*. Its code is structured differently from *sklearn*: you define the model, the loss and the training loop yourself.

**Tensors**: Instead of working with tabular data or numpy arrays, we'll be working with tensors. A tensor is a container that can hold data in *N dimensions*. A tensor is similar to a numpy array, but it can also live on a GPU, which can make computations much faster than with numpy arrays.

In [None]:
# importing Pytorch modules

import torch

x = torch.empty(5,3)       # torch.empty allocates a tensor without initializing it (values are arbitrary)
print(x)

y = torch.ones(5,3)        # torch.ones creates a tensor filled with ones
print(y)

a = torch.tensor([[0,1,2],[3,4,5]])  # builds a 2x3 tensor (matrix) from nested lists
print(a)

# 2. Neural Networks

## 2.1 Video Lectures
YouTube videos are a great way to grasp the concepts quickly and save your valuable time. Here are some links to video lectures:
<div class="alert alert-block alert-info">
<img src='https://upload.wikimedia.org/wikipedia/commons/b/ba/3B1B_Logo.png' width='50' align='left'></img>
<p><a href='https://www.youtube.com/watch?v=aircAruvnKk&t=1007s'>What are Neural Networks?</a></p>
<p><a href='https://www.youtube.com/watch?v=IHZwWFHWa-w'>How do Neural Networks learn?</a></p>

Watch the videos above to visualize neural networks and how they work.
</div>

## 2.2 Perceptron
A **Perceptron** is a single-layer neural network, while a **Multi-Layer Perceptron** stacks several layers and is what we usually call a Neural Network.

I **highly suggest** reading [this blog post](https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53) for some very good explanations.

<img src = 'https://i.imgur.com/IHgw2au.png' width='400'>
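
To make the idea concrete, here is a minimal sketch of a single perceptron in plain Python. The function name `perceptron` and the hand-picked weights are illustrative assumptions, not part of this notebook's PyTorch code: the perceptron computes a weighted sum of its inputs, adds a bias, and passes the result through a step function.

```python
# A minimal perceptron sketch in plain Python (illustrative names, not from the notebook)

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a step function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# Hand-picked (assumed) weights that make the perceptron behave like a logical AND gate
and_weights, and_bias = [1.0, 1.0], -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron([a, b], and_weights, and_bias))
```

In a real network the weights are not picked by hand; they are learned during training, which is what the rest of this notebook is about.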

## 2.3 Deep vs Shallow Networks
Plain vanilla Neural Networks (or Feed Forward Neural Networks, FNNs) have the simplest architecture in the Neural Networks realm, but their basics will help you understand the much more complicated stuff ahead.

In Feed Forward Neural Nets, the hidden layers gradually increase or decrease in size (number of neurons), so that more and more **details** of the input (images, text etc.) can be grasped.


<img src='https://i.imgur.com/D8QhLWM.png' width=350>

Deep Neural Nets (thin and tall) often work **better** than Shallow ones (fat and short). This happens because the deep ones can learn more and more abstract representations the *deeper* you go. Also, a thin and tall network can have fewer parameters than a fat and short one, in which case training is faster.
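
As a rough illustration of the parameter argument, the sketch below counts the weights and biases of two fully connected networks. The helper `count_parameters` and the layer sizes are assumptions chosen for illustration; the real count always depends on the widths you pick, not on depth alone.

```python
def count_parameters(layer_sizes):
    """Total weights and biases of a fully connected net with the given layer sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + one bias per output neuron
    return total

# Assumed example architectures for 28x28 inputs and 10 output classes
shallow = count_parameters([784, 1000, 10])    # one wide hidden layer
deep = count_parameters([784, 50, 20, 10])     # two narrow hidden layers

print("Shallow (fat and short):", shallow)
print("Deep (thin and tall):   ", deep)
```

With these particular sizes the thin and tall network has far fewer parameters to update at every step.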

... Let's start programming.

# 3. The Data - MNIST

We'll be working with the MNIST dataset, which is usually the go-to dataset when starting out with Neural Networks. Nevertheless, you can apply the following principles to any dataset (images, text, tabular data, audio data), as **all data can be represented in numbers**. Because that's all it is: numbers.

In [None]:
# importing pytorch modules
import torch
import torch.nn as nn                            # to access built-in functions to make a NN
import torch.optim as optim                      # optimizers
from torchvision import datasets, transforms     # to access the MNIST dataset
import torch.nn.functional as F   # to access activation functions

# libraries you may know already
import numpy as np
import matplotlib.pyplot as plt # for plotting
%matplotlib inline
import seaborn as sns
import sklearn.metrics
import warnings
warnings.filterwarnings(action="ignore")

In [None]:
# Load in the data from torchvision datasets 
# train=True to access training images and train=False to access test images
# We also transform the images to Tensors
mnist_train = datasets.MNIST('data', train = True, download = True, transform=transforms.ToTensor())
mnist_test = datasets.MNIST('data', train = False, download = True, transform=transforms.ToTensor())

In [None]:
# How the object looks:
print('Structure of train data:', mnist_train, '\n')
print('Structure of test data:', mnist_test, '\n')
# Index the dataset directly: list(mnist_train) would load all 60,000 images first
print('Image on index 0 shape:', mnist_train[0][0].shape)
print('Image on index 0 label:', mnist_train[0][1])

In [None]:
# Visualize sample of images:
sample = datasets.MNIST('data', train=True, download=True)

plt.figure(figsize = (16, 3))
for i,(image, label) in enumerate(sample):
    if i>=16:
        break
    plt.subplot(2,8,i+1)
    plt.imshow(image)

# 4. Vanilla FNN Neural Network
If we were working with the sklearn library, we would just instantiate a predefined model class, for example `RandomForestClassifier()`, and then tune its hyperparameters. The class already exists, so we never have to write it ourselves. When working with neural networks, however, we usually have to define our own class.

Neural Networks are different because they vary so much: they depend on the structure of your input (eg. an image of shape `[3, 500, 250]`), the number of `hidden layers`, the number of `neurons` in each hidden layer, whether or not you want to use `Dropout()` layers etc. You can also build multiple neural networks and then combine them into another one (for example in a Sequence2Sequence RNN).

>Note: the `super()` call is there because the `MNISTClassifier` class inherits from its parent class `nn.Module`. Calling it runs the parent's setup code, which makes this inheritance work. Removing it would lead to an *error*.

In [None]:
# creating the network
class MNISTClassifier(nn.Module):                  # nn.Module is the base class from which we inherit
    def __init__(self):                            # defining structure
        super(MNISTClassifier, self).__init__() 
        self.layers = nn.Sequential(nn.Linear(28*28, 50),   # adding first layer: 784 neurons to 50
                                    nn.ReLU(),               # calling activation function
                                    nn.Linear(50, 20),      # adding second layer: 50 neurons to 20
                                    nn.ReLU(),               # activation function
                                    nn.Linear(20, 10)        # output layer having 10 neurons
                                   )
        # output layer contains one neuron per class we want to predict
        
    def forward(self, image, prints=False):
        if prints: print("Image shape:", image.shape)
        image = image.view(-1,28*28)                   # Flatten image: from [1, 28, 28] to [1, 784]
        if prints: print('Image reshaped:', image.shape)
        out = self.layers(image)                            # Raw output scores (logits), one per class
        if prints: print('Out shape:', out.shape)
        
        return out

Great!! You just made it through this!

## 4.1 Activation Functions
An activation function is a fancy way of saying that we are making the output of each neuron *nonlinear*, because we WANT to learn non-linear relationships between the input and the output.

There are maaany types of activation functions, but some of them are:

### Rectified Linear Unit (ReLU)
The function is linear when its input is above zero, and is equal to zero otherwise.
<img src="https://miro.medium.com/max/1026/0*g9ypL5M3k-f7EW85.png" width="350">

### Sigmoid
The sigmoid function has a tilted "S" shape, and its output is always between 0 and 1. Its outputs can be *interpreted as probabilities* (probability of the input being digit 1, probability of the input being digit 2 etc.).
<img src="https://miro.medium.com/max/4000/1*JHWL_71qml0kP_Imyx4zBg.png" width="350">

### Tanh
A variation of the Sigmoid, but it outputs values between -1 and 1.
<img src="https://mathworld.wolfram.com/images/interactive/TanhReal.gif" width="300">

Sigmoid and Tanh squish the neuron's output between 2 values, preventing big numbers from becoming much bigger and small numbers from becoming much smaller. ReLU is different: it only clips negative values to zero and leaves positive values unbounded.
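
The three functions above are simple enough to write out by hand. The following is a plain-Python sketch for single numbers (the notebook itself uses PyTorch's built-in versions such as `nn.ReLU()`), just to show what each one does to a negative, a zero and a positive input.

```python
import math

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(x)

for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), round(sigmoid(value), 3), round(tanh(value), 3))
```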

## 4.2 Making a Forward Pass

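A forward pass is when you take the images one by one (or batch by batch, we'll come back to this) and put them through the neural network, which outputs one score per class for each image (10 scores in our case). These raw scores are often called *logits*; the highest score marks the predicted digit.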

Let's look at 1 example:

<img src='https://i.imgur.com/ywMFtDz.png' width='600'>

In [None]:
torch.manual_seed(1) # set the manual seed
np.random.seed(1)  # set random seed in numpy

# Selecting 1 image with its label
image_ex, label_ex = mnist_train[0]
print("Image shape:", image_ex.shape,"\n")
print("Label:", label_ex,"\n")

# creating an instance of model
model_example = MNISTClassifier()
print(model_example,"\n")

# computing the output scores (one per digit class)
out = model_example(image_ex, prints=True)
print("out:",out,"\n")

# Choose the maximum score and then select only its index, i.e. the label (not the score itself)
prediction = out.max(dim=1)[1]
print("Prediction:", prediction)

<div class="alert alert-block alert-info"> 
The prediction is most likely wrong because we have not yet trained the model.
Don't worry, we are going to get correct predictions soon 😉.
</div>
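
The values in `out` are raw, unnormalized scores (often called logits). If you want to read them as probabilities, a softmax does the conversion. Below is a plain-Python sketch with made-up scores for 3 classes; the function name `softmax` and the example numbers are assumptions for illustration, and in PyTorch you would typically use `torch.softmax` instead.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

example_scores = [2.0, 1.0, 0.1]            # made-up scores for 3 classes
probs = softmax(example_scores)
predicted_class = probs.index(max(probs))   # same index as the largest raw score

print("Probabilities:", [round(p, 3) for p in probs])
print("Predicted class:", predicted_class)
```

Softmax keeps the ordering of the scores, which is why taking the maximum of the raw output already gives the predicted label.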

## 4.3 Backpropagation

So, the purpose is to UPDATE the weights and biases in the neural network so it *learns* to recognize the digits and accurately classify them. This is done with backpropagation: the model goes back through its layers to work out how much each parameter contributed to the error (the gradients), and the parameters (weights) are then updated a little bit. Before going any further, I highly recommend watching the following video which explains the concept of Backpropagation.

<div class="alert alert-block alert-info">
<img src='https://upload.wikimedia.org/wikipedia/commons/b/ba/3B1B_Logo.png' width='50' align='left'></img>
<p><a href='https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3'>What is Backpropagation really doing?</a></p>
<p>Cheers again to 3Blue1Brown for his amazing structured videos.</p>
</div>

### 4.3.1 Loss and Optimizer Functions:

These 2 are like brother and sister: they work hand in hand during the neural network training. Which ones you pick depends on the task, but their main purpose stays the same:

**Loss Function (`criterion`): given a prediction and the actual value, it computes how far apart they are**
* Regression loss functions:
    * MAE: `torch.nn.L1Loss()`
    * MSE: `torch.nn.MSELoss()` etc.
* Classification loss functions:
    * Cross Entropy Loss: `torch.nn.CrossEntropyLoss()`
    * Binary Cross Entropy Loss: `torch.nn.BCELoss()` etc.
* Embedding Loss functions (whether 2 inputs are similar or not):
    * Hinge Loss: `torch.nn.HingeEmbeddingLoss()`
    * Cosine Loss: `torch.nn.CosineEmbeddingLoss()` etc.

**Optimizer Function (`torch.optim`): updates the weights and biases to REDUCE the loss**
* Examples:
    * Stochastic Gradient Descent: `SGD()`
    * Adam: `Adam()`   (widely used)
    * Adagrad: `Adagrad()`
    
Different neural networks and purposes can require different loss and optimizer functions. Click [here](https://pytorch.org/docs/stable/nn.html) to check all of them.
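
To get a feeling for what a classification loss measures, here is a plain-Python sketch of the cross entropy for a single example, written as `-log(softmax(scores)[target])`. This is the quantity `torch.nn.CrossEntropyLoss()` computes from raw scores; the helper name `cross_entropy` and the example scores are assumptions for illustration.

```python
import math

def cross_entropy(scores, target):
    """Cross entropy for one example: -log(softmax(scores)[target])."""
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[target]

# A network that has no preference between the 10 digits
uniform_scores = [0.0] * 10
loss_uniform = cross_entropy(uniform_scores, target=3)

# A network that gives a high score to the correct digit
confident_scores = [0.0] * 10
confident_scores[3] = 10.0
loss_confident = cross_entropy(confident_scores, target=3)

print("Loss with no preference:", round(loss_uniform, 4))
print("Loss when confident and correct:", round(loss_confident, 4))
```

A loss close to `ln(10)` is what you would roughly expect from a network that is still guessing among 10 classes, and the loss shrinks towards 0 as the correct class gets a higher score.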

In [None]:
# LOSS and Optimizer instances

# Loss is the function that calculates how far the prediction is from the true value
criterion = nn.CrossEntropyLoss()
print("Criterion:", criterion,"\n")

# The Optimizer uses the gradients of this loss (computed by loss.backward()) to update the weights
optimizer = optim.SGD(model_example.parameters(), lr = 0.004, momentum=0.9)   #stochastic gradient descent
print("Optimizer:",optimizer,"\n")

### 4.3.2 MNISTClassifier trainable parameters:
* 1 : torch.Size([50, 784]) - 784 weights (one per input pixel) for each of the 50 neurons in the first hidden layer (28x28x50 weights in total)
* 2 : torch.Size([50]) - 50 biases
* 3 : torch.Size([20, 50]) - 50 weights for each of the 20 neurons in the second hidden layer (50x20 weights in total)
* 4 : torch.Size([20]) - 20 biases
* 5 : torch.Size([10, 20]) - 20 weights for each of the 10 output neurons (10x20 weights in total)
* 6 : torch.Size([10]) - 10 biases, one per output neuron

In [None]:
# Let's also look at the parameters (weights and biases) that are updated during 1 single backpropagation
# Parameter Understanding
for i, param in enumerate(model_example.parameters(), start=1):
    print(i, ":", param.shape)

### 4.3.3 Do 1 BackPropagation : Obtain loss and update weights

In [None]:
torch.manual_seed(1)  # set random seed
np.random.seed(1)  # set random seed for numpy

print("Model output (raw scores):",out,"\n")
print("Actual Value:", torch.tensor(label_ex).reshape(-1))

# clear gradients: needs to be done before applying backpropagation
optimizer.zero_grad()
# compute loss
loss = criterion(out, torch.tensor(label_ex).reshape(-1))
print("Loss:", loss,"\n")

#compute gradients
loss.backward()

# update weights
optimizer.step()

# After this 1 iteration the weights have updated once

Until now we:
1. Created a Vanilla FNN
2. Took 1 image through the network and created a prediction
3. Compared the prediction with the actual label and computed the loss
4. Used the loss to update the weights and biases

This is called training. The next chapters will be dedicated to training the network and improving it.
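
To see what a single update does without any neural network around it, here is a plain-Python sketch on a toy loss `(w - 3)**2`. The update follows the same form as SGD with momentum (velocity = momentum * velocity + gradient, then weight = weight - lr * velocity); the function name `sgd_momentum_step`, the toy loss and the hyperparameter values are assumptions for illustration.

```python
def sgd_momentum_step(weight, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update for a single weight."""
    velocity = momentum * velocity + grad    # accumulate past gradients
    weight = weight - lr * velocity          # move against the accumulated gradient
    return weight, velocity

# Toy loss: (w - 3)^2, whose gradient is 2 * (w - 3); the best weight is 3
weight, velocity = 0.0, 0.0
for step in range(200):
    grad = 2 * (weight - 3.0)                # plays the role of loss.backward()
    weight, velocity = sgd_momentum_step(weight, grad, velocity)   # plays the role of optimizer.step()

print("Weight after 200 updates:", round(weight, 4))
```

Training a network repeats exactly this loop, only with thousands of weights and with gradients computed by backpropagation.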

# 5. Training the Neural Network
Now that we have the structure and the data in place, our goal is to make the Vanilla FNN perform well.

## 5.1 Batches
With an artificial neural network, we may want to use more than one image at one time. That way, we can compute the *average* loss across a **mini-batch** of **multiple** images, and take a step to optimize the **average** loss. The average loss across multiple training inputs is going to be less "noisy" than the loss for a single input, and is less likely to provide "bad information" because of a "bad" input.

Batches can have different sizes:
* one extreme is `batch_size` = 1: meaning that we compute the loss and update after EACH image (so we have 60,000 batches of size 1)
* a `batch_size` = 60: means that, for 60,000 training images, we'll have 1000 batches of size 60
* the other extreme is `batch_size` = 60,000: when we input ALL images and do 1 backpropagation (we have 1 batch of size 60,000 images)

The actual batch size that we choose depends on many things. We want our batch size to be large enough to not be too "noisy", but not so large as to make each iteration too expensive to run.

<img src='https://i.imgur.com/M6ZkRXa.png' width='400'>

In the above example, instead of having 70 noisy losses we'll have just 7 averaged losses.
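
The batch arithmetic can be checked with a few lines of plain Python. The helper names and the made-up per-image losses below are assumptions for illustration; the sketch counts the batches for the three batch sizes discussed and averages 70 losses in groups of 10.

```python
import math

def number_of_batches(n_samples, batch_size):
    """How many batches are needed to go through all samples once."""
    return math.ceil(n_samples / batch_size)

def average_per_batch(losses, batch_size):
    """Average the per-image losses inside each batch."""
    means = []
    for start in range(0, len(losses), batch_size):
        batch = losses[start:start + batch_size]
        means.append(sum(batch) / len(batch))
    return means

for size in (1, 60, 60000):
    print("batch_size =", size, "->", number_of_batches(60000, size), "batches")

# 70 made-up per-image losses, averaged in batches of 10
per_image_losses = [1.0 + 0.1 * (i % 7) for i in range(70)]
batch_means = average_per_batch(per_image_losses, batch_size=10)
print(len(per_image_losses), "noisy losses ->", len(batch_means), "averaged losses")
```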

In [None]:
# Create dataloaders for train and test data
# We put shuffle=True so the images are reshuffled at every epoch
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=60, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size=60, shuffle=True)

# inspect Trainloader
print("Train loader:",train_loader,"\n")

# select first batch
imgs, labels = next(iter(train_loader))

print('Object shape:', imgs.shape)      # [60, 1, 28, 28]: 60 images of size [1, 28, 28]
print('Label values:', labels)          # actual labels for the 60 images
print('Total Images:', labels.shape)    # 60 labels in total

### 5.1.1 Training the example network on a batch instead of image by image

In [None]:
# setting up seed
torch.manual_seed(1)
np.random.seed(1)

for i, (images, labels) in enumerate(train_loader):
    # stop after three iterations
    if i>=3:
        break
        
    print("-----Batch ",i,"-----")
    #prediction
    out = model_example(images)
    print("out shape:", out.shape)
    
    # compute loss
    loss = criterion(out, labels)
    print("loss:", loss)
    
    print("Optimizing:")
    # Clears the gradients of all optimized parameters
    optimizer.zero_grad()
    # computes gradients
    loss.backward()
    #performs a single optimization step
    optimizer.step()
    
    print("Done.\n")

## 5.2 Accuracy of the Classifier
During training, we usually want to check the accuracy of the model, to see how well or how badly it is performing.

<div class="alert alert-block alert-info"> 
<strong>Note</strong>: During <strong>training</strong>, it is highly important to set the model into training mode by calling <code>your_model.train()</code>. This turns on the training behaviour of layers such as Dropout(). When you <strong>evaluate</strong> the model call <code>your_model.eval()</code>. This switches layers such as Dropout() to their evaluation behaviour. Note that <code>eval()</code> does not switch off gradient tracking by itself; for that, wrap the evaluation code in <code>with torch.no_grad():</code>.
</div>
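
The difference between the two modes is easy to check on a tiny example. The sketch below uses a standalone `nn.Dropout` layer and a small `nn.Linear` layer (both made up for illustration, not part of the classifier above) to show that `eval()` changes how Dropout behaves, while gradient tracking is only switched off inside `torch.no_grad()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)     # some values zeroed, the rest scaled by 1 / (1 - p) = 2
dropout.eval()
eval_out = dropout(x)      # input passes through unchanged

print("Zeros in training mode:", int((train_out == 0).sum()))
print("Zeros in eval mode:    ", int((eval_out == 0).sum()))

linear = nn.Linear(4, 2)
linear.eval()
out_eval = linear(torch.ones(1, 4))           # gradients are still tracked here
with torch.no_grad():
    out_no_grad = linear(torch.ones(1, 4))    # gradients are not tracked here

print("requires_grad after eval():", out_eval.requires_grad)
print("requires_grad inside no_grad():", out_no_grad.requires_grad)
```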

In [None]:
# Instantiate 2 variables for total cases and correct cases
correct_cases = 0
total_cases = 0

# Sets the module in evaluation mode (VERY IMPORTANT)
model_example.eval()

for k, (images, labels) in enumerate(train_loader):
    # Just show first 3 batches accuracy
    if k >= 3: break
    
    print('==========', k, ':')
    out = model_example(images)
    print('Out:', out.shape)
    
    # Choose maximum probability and then select only the label (not the prob number)
    prediction = out.max(dim = 1)[1]
    print('Prediction:', prediction.shape)
    
    # Number of correct cases - we first see how many are correct in the batch,
    # then we sum, then convert to integer (not tensor)
    correct_cases += (prediction == labels).sum().item()
    print('Correct:', correct_cases)
    
    # Total cases
    total_cases += images.shape[0]
    print('Total:', total_cases)
    
    
    if k < 2: print('\n')
        

print('Average Accuracy after 3 iterations:', correct_cases/total_cases)

## 5.3 Iterations vs Epochs:

**Iterations**: the number of iterations is the number of times we *update* the weights (parameters) of the FNN. For example, above we did 3 iterations.

**Epoch**: one epoch is one full pass in which *all* training data was used once to update the parameters. We count epochs because, in general, we want to train the network for longer than a single pass. Until now in this notebook we haven't yet completed a full epoch.
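
The relation between the two can be counted with a pair of nested loops. The numbers below (600 images, batches of 20, 200 epochs) are assumed values for a small hypothetical run; no model is trained here, the loops only count how often the weights would be updated.

```python
n_samples, batch_size, num_epochs = 600, 20, 200   # assumed values for illustration

iteration = 0
for epoch in range(num_epochs):                     # one epoch = one full pass over the data
    for start in range(0, n_samples, batch_size):   # one iteration = one batch = one weight update
        iteration += 1

iterations_per_epoch = iteration // num_epochs
print("Iterations per epoch:", iterations_per_epoch)
print("Total iterations:", iteration)
```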

## 5.4 Predefined Functions: Accuracy and Training Loop
Now let's create some functions, so our training process becomes easier:

### 5.4.1 Predefined Accuracy Function:

In [None]:
def get_accuracy(model, data, batchSize = 20):
    '''Iterates through data and returns the overall accuracy (correct cases / total cases).'''
    # Sets the model in evaluation mode
    model.eval()
    
    # Creates the dataloader
    data_loader = torch.utils.data.DataLoader(data, batch_size=batchSize)
    
    correct_cases = 0
    total_cases = 0
    
    # No gradients are needed for evaluation
    with torch.no_grad():
        for (images, labels) in data_loader:
            # Is formed by 20 images (by default) with 10 scores each
            out = model(images)
            # Choose the maximum score and then select only its index, i.e. the label
            prediction = out.max(dim = 1)[1]
            # First check how many are correct in the batch, then we sum then convert to integer (not tensor)
            correct_cases += (prediction == labels).sum().item()
            # Total cases
            total_cases += images.shape[0]
    
    return correct_cases / total_cases

In [None]:
get_accuracy(model_example,mnist_train,20)

### 5.4.2 Predefined Training Function

<img src='https://i.imgur.com/S1miUl0.png' width=600>

In [None]:
def train_network(model, train_data, test_data, batchSize=20, num_epochs=1, learning_rate=0.01, weight_decay=0,
                 show_plot = True, show_acc = True):
    
    '''Trains the model and computes the average accuracy for train and test data.
    If enabled, it also shows the loss and accuracy over the iterations.'''
    
    print('Get data ready...')
    # Create dataloader for training dataset - so we can train on multiple batches
    # Shuffle after every epoch
    train_loader = torch.utils.data.DataLoader(dataset=train_data, batch_size=batchSize, shuffle=True)
    
    # Create criterion and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=weight_decay)
    
    # Losses & Iterations: to keep all losses during training (for plotting)
    losses = []
    iterations = []
    # Train and test accuracies: to keep their values also (for plotting)
    train_acc = []
    test_acc = []
    
    print('Training started...')
    iteration = 0
    # Train the data multiple times
    for epoch in range(num_epochs):
        
        for images, labels in iter(train_loader):
            # Set model in training mode:
            model.train()
            
            # Compute the output scores
            out = model(images)
            # Clears the gradients from previous iteration
            optimizer.zero_grad()
            # Computes loss: how far is the prediction from the actual?
            loss = criterion(out, labels)
            # Computes gradients of the loss with respect to the parameters
            loss.backward()
            # Updates the weights
            optimizer.step()
            
            # Save information after this iteration
            iterations.append(iteration)
            iteration += 1
            # Store the loss as a plain float so it can be plotted
            losses.append(loss.item())
            # Compute accuracy after this iteration and save
            train_acc.append(get_accuracy(model, train_data))
            test_acc.append(get_accuracy(model, test_data))
            
    
    # Show Accuracies
    # Show the last accuracy registered
    if show_acc:
        print("Final Training Accuracy: {}".format(train_acc[-1]))
        print("Final Testing Accuracy: {}".format(test_acc[-1]))
    
    # Create plots
    if show_plot:
        plt.figure(figsize=(10,4))
        plt.subplot(1,2,1)
        plt.title("Loss Curve")
        plt.plot(iterations[::20], losses[::20], label="Train", linewidth=4, color='#008C76FF')
        plt.xlabel("Iterations")
        plt.ylabel("Loss")

        plt.subplot(1,2,2)
        plt.title("Accuracy Curve")
        plt.plot(iterations[::20], train_acc[::20], label="Train", linewidth=4, color='#9ED9CCFF')
        plt.plot(iterations[::20], test_acc[::20], label="Test", linewidth=4, color='#FAA094FF')
        plt.xlabel("Iterations")
        plt.ylabel("Accuracy")
        plt.legend(loc='best')
        plt.show()

# 6. Model Evaluation

Now that we have our functions ready, we can start training the network.

To make the training faster, we will not use the entire dataset. Instead we will:
* select 600 training images and 400 testing images
* leave `batch_size` at its default of 20 images/batch
* iterate through the data 200 times (`num_epochs`=200)

In [None]:
mnist_data = datasets.MNIST("data", train=True, download=True, transform = transforms.ToTensor())
mnist_data = list(mnist_data)

mnist_train = mnist_data[:600]   # 600 images for training 
mnist_test = mnist_data[600:1000]  # 400 images for testing

model = MNISTClassifier()

# train the model using above function(predefined)
train_network(model, mnist_train, mnist_test, num_epochs=200)

# 7. Overfitting

Like any other Machine Learning model, Neural Nets can suffer from overfitting. Overfitting is when a neural network model learns the quirks of the training data, rather than information that is generalizable to the task at hand.

## 7.1 Data Augmentation
Why try to collect more data when you can create some on your own? *Data Augmentation* generates more data points from our existing data set by:
* Flipping each image horizontally or vertically (won't work for digit recognition, but might for other tasks)
* Shifting each pixel a little to the left or right
* Rotating the images a little
* Adding noise to the image
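
Two of these operations are simple enough to sketch in plain Python on a tiny made-up "image" stored as a list of rows. The helper names `flip_horizontal` and `shift_right` are assumptions for illustration; in practice you would rely on the ready-made transforms in `torchvision.transforms`.

```python
def flip_horizontal(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels=1, fill=0):
    """Move every row to the right, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

tiny_image = [[0, 1, 2],
              [3, 4, 5]]       # made-up 2x3 "image"

flipped = flip_horizontal(tiny_image)
shifted = shift_right(tiny_image, pixels=1)

print("Original:", tiny_image)
print("Flipped: ", flipped)
print("Shifted: ", shifted)
```

Each transformed copy keeps the same label as the original, which is how augmentation creates extra training examples for free.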

<img src='https://www.kdnuggets.com/wp-content/uploads/cats-data-augmentation.jpg' width='400'>

For our example we'll rotate the images randomly up to 35 degrees.

In [None]:
# Predefined Function that shows 20 images
def show_image(data, title="Default"):
    plt.figure(figsize=(10,2))
    for i,(image, label) in enumerate(data):
        if i>=20:
            break
        plt.subplot(2,10,i+1)
        plt.imshow(image)
    plt.suptitle(title, fontsize=15)   # set the figure title once, after the loop

In [None]:
# creating original and rotated images
original_images = datasets.MNIST("data", train=True, download=True)
rotated_images = datasets.MNIST("data", train=True, 
                                download=True, transform = transforms.RandomRotation(35,fill=(0,)))
# show images
show_image(original_images,"Original")
show_image(rotated_images, "Rotated")

### 7.1.1 Training on Augmented Data:

* `transforms.Normalize()`: rescales the input features of a neural network so that all features are on a similar scale. For these images it is not really necessary, as they all have the same structure, but I threw it in here just for reference.
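
For a single pixel value the normalization is just `(pixel - mean) / std`. The plain-Python sketch below (the function name `normalize` is an assumption for illustration) shows that a mean of 0.5 and a standard deviation of 0.5 map pixel values from the range [0, 1] to the range [-1, 1].

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Same formula that is applied to every pixel: (pixel - mean) / std."""
    return (pixel - mean) / std

for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```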

In [None]:
# transform
mytransform = transforms.Compose([transforms.RandomRotation(35, fill=(0,)),
                                 transforms.ToTensor(),
                                 transforms.Normalize([0.5], [0.5])])
# import mnist data and apply transformations
mnist_data_aug = datasets.MNIST("data", train=True, download=True, transform=mytransform)
mnist_data_aug = list(mnist_data_aug)

# training data
mnist_train_aug = mnist_data_aug[:600]  # 600 augmented images for training

# ------ Training the model ------
# Create Model Instance
model_aug = MNISTClassifier()

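# train the network
# (mnist_test was built without Normalize, so the test inputs are on a different scale than the training inputs)
train_network(model_aug, mnist_train_aug, mnist_test, num_epochs=200)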

## 7.2 Weight Decay and Learning Rate

**Weight Decay**: The idea of weight decay is to *penalize large weights*. Large weights mean that the prediction relies heavily on the content of one or multiple pixels. So, we penalize them by adding an extra term to the loss computed by the `criterion` function.

**Learning Rate**: This one is probably not new. In FNNs we train using gradient descent to update the weights. The learning rate *controls* [how much to change the model in response to the estimated error each time the model weights are updated](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/). If we choose a learning rate that is too large we might overshoot the minimum, while with one that is too small the model takes longer to train, as the steps are tinier.

<img src='https://srdas.github.io/DLBook/DL_images/TNN2.png' width='400'>
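
Both effects can be seen on a toy problem with a single weight. The sketch below runs plain gradient descent on the loss `(w - 3)**2`, and weight decay is modelled by adding `weight_decay * w` to the gradient, which is how the `weight_decay` argument of `optim.SGD` enters the update. The function name, the toy loss and all numbers are assumptions for illustration.

```python
def run_gradient_descent(lr, steps=20, weight_decay=0.0, start=0.0):
    """Minimise the toy loss (w - 3)^2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3.0) + weight_decay * w   # loss gradient + weight decay term
        w = w - lr * grad
    return w

too_large = run_gradient_descent(lr=1.1)      # overshoots more and more on every step
reasonable = run_gradient_descent(lr=0.1)     # gets close to the minimum at 3
too_small = run_gradient_descent(lr=0.001)    # barely moves in 20 steps
with_decay = run_gradient_descent(lr=0.1, steps=200, weight_decay=0.5)   # pulled towards 0

print("lr=1.1:  ", round(too_large, 2))
print("lr=0.1:  ", round(reasonable, 4))
print("lr=0.001:", round(too_small, 4))
print("lr=0.1 with weight decay:", round(with_decay, 4))
```

With weight decay the weight settles below 3: the penalty trades a little bit of training loss for a smaller weight.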

In [None]:
model2 = MNISTClassifier()

#train the network
train_network(model2, mnist_train, mnist_test, num_epochs=220, learning_rate=0.001, weight_decay=0.0005)

## 7.3 Dropout() and Layer Optimization

### 7.3.1 Dropout() Function
This technique effectively trains *many* thinned versions of the network and then approximates averaging their predictions at test time by using the full network (this is why it is very important to call `model.eval()` when we want to evaluate).

In each training iteration we **dropout** (drop out, zero out, remove etc.) a random portion of the neurons. Hence, in different iterations of training, we will drop out a different set of neurons.

This way we prevent the weights from being overly dependent on each other: for example, one weight being unnecessarily large to compensate for another unnecessarily large weight with the opposite sign. In other words, weights are encouraged to be *strong and independent*.
<img src='https://miro.medium.com/max/1200/1*iWQzxhVlvadk6VAJjsgXgg.png' width=400>
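
Here is a plain-Python sketch of the mechanism on a list of made-up activations. During training each value is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)` (the scaling `nn.Dropout` also applies), while in evaluation mode nothing is changed. The function name `dropout` and its arguments are assumptions for illustration.

```python
import random

def dropout(values, p=0.4, training=True):
    """Zero each value with probability p during training; do nothing in evaluation."""
    if not training or p == 0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)    # keeps the expected value of each activation unchanged
    return [0.0 if random.random() < p else v * keep_scale for v in values]

random.seed(0)
activations = [1.0] * 10            # made-up neuron outputs

train_out = dropout(activations, p=0.4, training=True)
eval_out = dropout(activations, p=0.4, training=False)

print("Training mode:  ", [round(v, 2) for v in train_out])
print("Evaluation mode:", eval_out)
```

Calling the function again in training mode drops a different random set of values, which is the "different set of neurons in different iterations" described above.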


### 7.3.2 Layer Optimization
Until now, our `MNISTClassifier()` had 3 layers with a fixed number of neurons in each layer. We can make these sizes configurable, so that eventually we can apply `Grid Search` and find the best combination possible.
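
A grid search simply tries every combination of the candidate settings. The sketch below only builds the list of combinations; the candidate values are made up, and the names `layer1_size`, `layer2_size` and `dropout` are assumed names for the settings one would want to vary. Training and scoring a model for each combination is left out.

```python
from itertools import product

# Assumed candidate values for illustration
layer1_options = [50, 80]
layer2_options = [20, 50]
dropout_options = [0.3, 0.5]

grid = [
    {"layer1_size": l1, "layer2_size": l2, "dropout": d}
    for l1, l2, d in product(layer1_options, layer2_options, dropout_options)
]

print("Number of combinations to try:", len(grid))
for config in grid[:3]:
    print(config)
```

Each dictionary could then be used to build and train one model, keeping the configuration with the best test accuracy. The number of combinations grows quickly, so small grids are a sensible starting point.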

### 7.3.3 Changing the Structure of our MNISTClassifier()
Now let's change our Neural Net a bit:
* `nn.Dropout(p=0.4)`: each neuron has 40% chance of being dropped
* `layer1_size`: size of the first hidden layer
* `layer2_size`: size of the second hidden layer

In [None]:
class MNIST_Classifier_Improved(nn.Module):
    def __init__(self, layer1_size=50, layer2_size=20, dropout=0.4):    # structure of FNN
        super(MNIST_Classifier_Improved, self).__init__()
        
        self.layers = nn.Sequential(nn.Dropout(p= dropout),       # dropout on the inputs of the first layer
                      nn.Linear(28*28, layer1_size),              # 784 neurons to layer1_size (50 by default)
                      nn.ReLU(),                                  # activation function
                      nn.Dropout(p=dropout),                      # dropout on the inputs of the second layer
                      nn.Linear(layer1_size, layer2_size),        # second layer: layer1_size neurons to layer2_size (20 by default)
                      nn.ReLU(),                                  # activation function
                      nn.Dropout(p=dropout),                      # dropout on the inputs of the output layer
                      nn.Linear(layer2_size, 10))                 # output layer
         
    def forward(self, image):              # taking image through NN                       
        image= image.view(-1, 28*28)       # Flatten image (Matrix to Vector)
        out = self.layers(image)           # raw output scores (logits)
        
        return out

#### Improved Model Structure:
<img src='https://i.imgur.com/m22zEqN.png' width='600'>

In [None]:
# Training Improved model now:
model_improved = MNIST_Classifier_Improved(layer1_size=80, layer2_size=50, dropout=0.5)
print(model_improved,"\n")

train_network(model_improved, mnist_train, mnist_test,num_epochs=220,learning_rate=0.001,weight_decay=0.0005)

In [None]:
get_accuracy(model_improved, mnist_train,64)

# 8. Confusion Matrix

In [None]:
def get_confusion_matrix(model, test_data):
    # Model in Evaluation Mode
    model.eval()
    
    preds, actuals = [], []

    # Disable gradient computing while predicting
    with torch.no_grad():
        for image, label in test_data:
            # Add 1 more dimension for batching
            image = image.unsqueeze(0)
            out = model(image)

            prediction = torch.max(out, dim=1)[1].item()
            preds.append(prediction)
            actuals.append(label)
    
    # sklearn expects (y_true, y_pred): rows are actual digits, columns are predicted digits
    return sklearn.metrics.confusion_matrix(actuals, preds)

In [None]:
plt.figure(figsize=(16,6))
sns.heatmap(get_confusion_matrix(model_improved, mnist_test), cmap="icefire", annot=True, linewidths=0.1)
plt.title("Confusion Matrix", fontsize=14)

If you have any questions, please do not hesitate to ask. This notebook is meant to give a clearer understanding of the concepts and the code, so your questions would also help me add to, modify and improve it.


# References:
* [Andrada Olteanu](https://www.kaggle.com/andradaolteanu)
* [Create your own FNN](http://alexlenail.me/NN-SVG/index.html)
* [WTF is a Tensor?](https://www.kdnuggets.com/2018/05/wtf-tensor.html)
* [What the hell is a Perceptron?](https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53)
* 3Blue1Brown videos:
    * [But what is a Neural Network?](https://www.youtube.com/watch?v=aircAruvnKk&t=1007s)
    * [What is Backpropagation really doing?](https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3)
    * [Gradient Descent, how Neural Networks learn](https://www.youtube.com/watch?v=IHZwWFHWa-w)
* [All `torch.` functions (including loss & optimizer functions)](https://pytorch.org/docs/stable/nn.html)
* [Impact of Learning Rate in NNs](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/)