<a href="https://colab.research.google.com/github/mvdheram/DeepLearning-Notebooks/blob/main/Introduction_to_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction 

Deep learning vs Machine learning

Machine Learning: 

* Feature engineering (selection of **appropriate** features) has to be done before training.
* Simple models fit a straight line (suited to linearly separable data).

Deep Learning: 

* Feature Engineering (Extraction and selection of **appropriate** features) happens during training.
  * E.g. image classification, speech recognition, text.
* Fits a squiggly line (can handle data that is not linearly separable).





Why PyTorch ?

* Pythonic.
* GPU support.
* Similar to NumPy


## Defining tensor 

In [None]:
# Tensor using torch ( Tensor: Array with arbitrary number of dimensions )
import torch
import numpy as np

# 2x3 Tensor
torch.tensor([[2,3,4],[1,2,9]])

# Random 2x2 matrix
torch.rand(2,2)

# variable to matrix
a = torch.rand((3,5))
a.shape

# 2x2 of 0's
a = torch.zeros(2,2)

# 2x2 of 1's
b = torch.ones(2,2)

# Identity Matrix
c = torch.eye(2)

# Convert from numpy
c_numpy = np.identity(2)
d_torch = torch.from_numpy(c_numpy)

print(a , "\n\n", b,"\n \n",c,"\n\n",d_torch)

tensor([[0., 0.],
        [0., 0.]]) 

 tensor([[1., 1.],
        [1., 1.]]) 
 
 tensor([[1., 0.],
        [0., 1.]]) 

 tensor([[1., 0.],
        [0., 1.]], dtype=torch.float64)


## Matrix Operations


In [None]:
a = torch.tensor([[1,2],[3,4]])
b = torch.tensor([[5,6],[7,8]])

print(a , "\n\n", b,"\n \n")
torch.matmul(a,b)

tensor([[1, 2],
        [3, 4]]) 

 tensor([[5, 6],
        [7, 8]]) 
 



tensor([[19, 22],
        [43, 50]])

# Forward Passing in NN

*Neural networks can be understood as **computational graphs**.* 

A **computational graph** is a network of nodes, which represent scalars, vectors or tensors, connected by edges, which represent functions or operations.

* Under the hood, the code is converted into a computational graph, which makes automatic computation of gradients/derivatives easier.

NeuralNetworks
                  
    Input Layer -> Weighted sum of inputs -> Activation function on the weighted sum.


In [None]:
# Simple computational graph with functions(addition and multiplication) performed on scalars.

import torch

a = torch.Tensor([2])
b = torch.Tensor([-4])
c = torch.Tensor([-2])
d = torch.Tensor([2])

e = a+b
f = c*d

g = e*f
print(e,f,g)



tensor([-2.]) tensor([-4.]) tensor([8.])


# Backpropagation 




## Derivatives 


* Rate of change of function.
* Can be interpreted as indicating the Steepness of function.

Derivative Rules :

    d - Derivative w.r.t x

    Addition : (f+g)' = f'+g'
    Multiplication : (f.g)' = f.dg + g.df
    d[3x] = 3
    d[constant] = 0
    d[x] = 1

* Chain Rule:
  * Deals with composition of functions (function g inside f).
         d[f(g(x))] = f'(g(x)) * g'(x)
         d[(f(x))^n] = n(f(x))^(n-1) * f'(x)
  * Eg. 
        1. d[(sin x)^2] 
        
         = d/d(sinx) [(sinx)^2] * d/dx [sinx]
         = 2 sinx * cos x

        2. d[(x+2)^2]

        = d/d(x+2) [(x+2)^2] * d/dx [x+2]
        = 2(x+2) * 1
        = 2x+4
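
The two chain-rule examples can be checked numerically. The snippet below is a minimal sketch in plain Python: `numerical_derivative` is a hypothetical helper (not part of any library) that approximates the derivative with a central difference, and the test point `x = 0.7` is an arbitrary choice.

```python
import math

def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

x = 0.7  # arbitrary test point

# Example 1: d/dx (sin x)^2 = 2 * sin(x) * cos(x)
analytic_1 = 2 * math.sin(x) * math.cos(x)
numeric_1 = numerical_derivative(lambda t: math.sin(t) ** 2, x)

# Example 2: d/dx (x + 2)^2 = 2 * (x + 2) = 2x + 4
analytic_2 = 2 * x + 4
numeric_2 = numerical_derivative(lambda t: (t + 2) ** 2, x)

print(analytic_1, numeric_1)  # the two values should agree closely
print(analytic_2, numeric_2)  # the two values should agree closely
```

If the analytic and numeric values disagree by more than a tiny rounding error, the hand-derived formula is most likely wrong.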

**The chain rule is used in backprop to compute the gradients that readjust the weights.**


 Note :

* The gradient is the multi-variable generalization of the derivative.

* Because an NN has many variables, **"gradient" is used instead of "derivative"**. 



  
 

In [None]:
import torch

# requires_grad=True tells autograd to track these tensors so their gradients can be computed
x = torch.tensor(-3., requires_grad= True)
y = torch.tensor(5., requires_grad= True)
z = torch.tensor(-2.,requires_grad= True)

q = x+y
f = q*z

# Compute derivative of the computational graph
f.backward()

print("Gradient of z is ", str(z.grad))
print("Gradient of y is ", str(y.grad))
print("Gradient of x is ", str(x.grad))

Gradient of z is  tensor(2.)
Gradient of y is  tensor(-2.)
Gradient of x is  tensor(-2.)
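
The gradients printed above can be reproduced by hand. The following sketch applies the chain rule node by node to the same graph in plain Python, without autograd; the variable names `df_dq`, `df_dx` and so on are chosen here for illustration only.

```python
# Same graph as above: q = x + y, f = q * z
x, y, z = -3.0, 5.0, -2.0

# Forward pass
q = x + y            # 2.0
f = q * z            # -4.0

# Backward pass: start at the output and move towards the inputs
df_dq = z            # f = q * z  ->  df/dq = z
df_dz = q            # f = q * z  ->  df/dz = q
df_dx = df_dq * 1.0  # q = x + y  ->  dq/dx = 1, so df/dx = df/dq * 1
df_dy = df_dq * 1.0  # q = x + y  ->  dq/dy = 1, so df/dy = df/dq * 1

print(df_dz, df_dy, df_dx)  # 2.0 -2.0 -2.0
```

This is what `f.backward()` does automatically for every node of the computational graph.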


# Neural network

ANN vs other classifiers

* Features are extracted as a part of the network (Input layers).
  Eg. Speech recognition, image classification etc.

In [None]:
import torch

# 10 nodes of input layer
input_layer = torch.rand(10)

# Weight matrix (Number of input neurons, Number of hidden neurons)
w1 = torch.rand(10,20)
w2 = torch.rand(20,20)
w3 = torch.rand(20,4)

# Hidden layers
h1 = torch.matmul(input_layer, w1)
h2 = torch.matmul(h1,w2)

output_layer = torch.matmul(h2,w3)
print(output_layer)

tensor([251.4403, 235.6309, 236.0042, 220.8700])


## Building a Neural network - PyTorch style

In [None]:
import torch
import torch.nn as nn 

# Class Net inherits from nn.Module
class Net(nn.Module):
  def __init__(self): # Define parameters (tensors, weights)
    super(Net,self).__init__()
    self.fc1 = nn.Linear(10,20) # Fully connected nodes
    self.fc2 = nn.Linear(20,20)
    self.output = nn.Linear(20,4)

  def forward(self,x):
    x = self.fc1(x)
    x = self.fc2(x)
    x = self.output(x)
    return x

In [None]:
# 10 nodes of input layer
input_layer = torch.rand(10)
net = Net() 
result = net(input_layer)
result

tensor([-0.0687,  0.2870, -0.1255, -0.3945], grad_fn=<AddBackward0>)

## Activation Function 

*Linear algebra states that matrix multiplication is a linear transformation.*

i.e. any stack of purely linear layers can be collapsed into a single linear layer. 

**Consequence**: without activation functions, an NN can only deal with linearly separable datasets.

Why Activation Function?

* Used to deal with non-linearly separable datasets.
* Activation functions are non-linear functions.
* Used in each layer of the NN, which makes the network much more powerful.
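
A small NumPy sketch of this argument, using hand-picked toy weights (`w1`, `w2` and the input `x` are illustrative values, not taken from any model): two linear layers give the same result as one layer with the combined weight matrix, and a ReLU in between breaks that equivalence.

```python
import numpy as np

x = np.array([[1.0, 2.0]])       # one input with 2 features
w1 = np.array([[1.0, -1.0],
               [0.0, -1.0]])     # first layer weights (2 -> 2)
w2 = np.array([[2.0],
               [1.0]])           # second layer weights (2 -> 1)

# Two linear layers applied one after the other ...
two_layers = (x @ w1) @ w2
# ... equal a single linear layer whose weight matrix is w1 @ w2.
one_layer = x @ (w1 @ w2)
print(two_layers, one_layer)     # [[-1.]] [[-1.]]

# With a ReLU between the layers, the result is no longer a single matrix product.
with_relu = np.maximum(x @ w1, 0) @ w2
print(with_relu)                 # [[2.]]
```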

Types:


1. Sigmoid
        1.0 / (1 + np.exp(-1 * x))
2. tanh
        np.tanh(x)
3. ReLU
        np.maximum(x, 0)
4. Leaky ReLU
        np.maximum(0.1 * x, x)
5. Maxout
        max(w1x+b1,w2x+b2)
6. ELU
        x            x>0
        alpha(e^x-1) x<0
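
The formulas above can be written as small NumPy functions. This is a sketch for illustration: the function names and the default values `slope=0.1` and `alpha=1.0` are assumptions made here, and Maxout is left out because it needs its own weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0)

def leaky_relu(x, slope=0.1):
    # Negative inputs are scaled by `slope` instead of being set to 0.
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # np.minimum keeps exp() from being evaluated on large positive inputs.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))
print(np.tanh(x))
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # [-0.2  0.   3. ]
print(elu(x))
```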



ReLU (Rectified Linear Unit) activation function

* Most used activation function which sets negative inputs to 0.

In [None]:
import torch.nn as nn
relu = nn.ReLU()

tensor_1 = torch.tensor([2.,-4.])
print(relu(tensor_1))

tensor_2 = torch.tensor([[2.,-4.],[1.2,0.]])
print(relu(tensor_2))

tensor([2., 0.])
tensor([[2.0000, 0.0000],
        [1.2000, 0.0000]])


## Loss Functions

Loss Functions / cost function :

Measure of the error.

* For regression : least-squares (mean squared error) loss.
* For classification : softmax cross-entropy loss (softmax transforms the scores into probabilities).
* For more complicated problems (like object detection), more complicated losses are used.

**Loss functions should be differentiable or else computation of gradients is not possible**. 

    A function is differentiable at a point when there's a defined derivative at that point. This means that the slope of the tangent line of the points from the left is approaching the same value as the slope of the tangent of the points from the right.
 


### Softmax Cross Entropy loss

Softmax: Returns normalized probabilities.

Cross Entropy loss : returns -log(predicted probability of the correct class). 

In [None]:
# Scores for three classes 
logits = torch.tensor([[3.2,5.1,-1.7]])
ground_truth = torch.tensor([0])
criterion = nn.CrossEntropyLoss()

loss = criterion(logits,ground_truth)
print(loss)

tensor(2.0404)
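
The loss value can be reproduced by hand, which shows the two steps that `nn.CrossEntropyLoss` combines. A minimal sketch in plain Python, using the same logits and ground-truth class as the cell above:

```python
import math

logits = [3.2, 5.1, -1.7]   # scores for the three classes
true_class = 0              # index of the correct class

# Softmax: exponentiate the scores and normalize them so they sum to 1.
exps = [math.exp(s) for s in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Cross-entropy: negative log of the probability assigned to the correct class.
loss = -math.log(probs[true_class])

print([round(p, 4) for p in probs])
print(f"{loss:.4f}")  # about 2.04, in line with the value printed by PyTorch
```

The loss is large here because the network assigns most of the probability to class 1, while the correct class is 0.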


# Preparing datasets (Computer Vision)

Transforming datasets into PyTorch friendly format.

In [None]:
# Using CIFAR-10 color dataset for classification

import torch
import torchvision # Provides datasets, pretrained models and image transforms
import torch.utils.data 
import torchvision.transforms as transforms

# Transformation of images to torch tensors using transforms.ToTensor()
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.4914,0.48216,0.44653), # Values for standardizing images (mean and SD of each channel (R,G,B))
                          (0.24703,0.24349,0.26159))]
)

# download = True : download the dataset if it is not already in the root folder
# transform : the transform pipeline applied to every image (here: convert to a normalized torch tensor)
trainset = torchvision.datasets.CIFAR10(root = './data', train =True, download = True, transform = transform )

# train : False ( for test set)
testset = torchvision.datasets.CIFAR10(root = './data', train =False, download = True, transform = transform )

# Getting data ready for PyTorch
# batch_size, shuffle : draw randomly shuffled mini-batches of 32 images in each epoch
# num_workers : number of worker subprocesses used to load the data.
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle= True, num_workers=4)

testloader = torch.utils.data.DataLoader(testset,batch_size=32,shuffle=False,num_workers=4)



Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


In [None]:
print(testloader.dataset.data.shape, trainloader.dataset.data.shape)

(10000, 32, 32, 3) (50000, 32, 32, 3)


In [None]:
print( testloader.batch_size)

32


In [None]:
print(trainloader.sampler)

<torch.utils.data.sampler.RandomSampler object at 0x7fc76dc0b9b0>


# Training NN

 NN Pipeline:
1. Initialize the neural network with random weights.

For each mini-batch:
  2. Do a forward pass **(Weighted sum + Activation function)**.
  3. Calculate the **loss function** (1 number).
  4. Calculate the gradients using backprop.
  5. Change the weights based on the gradients with the **optimizer**.
        SGD : weight -= weight_gradient * learning_rate
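
The SGD update rule can be tried on a toy problem. The sketch below minimizes `loss(w) = (w - 3)^2` in plain Python; the starting weight, the learning rate and the number of steps are hypothetical values picked for illustration, and the gradient is derived by hand instead of by backprop.

```python
# Toy problem: find the weight that minimizes loss(w) = (w - 3)^2.
weight = 0.0          # initial weight (hypothetical starting point)
learning_rate = 0.1   # step size (hypothetical value)

for step in range(50):
    loss = (weight - 3.0) ** 2                 # forward pass + loss
    weight_gradient = 2.0 * (weight - 3.0)     # d(loss)/d(weight), derived by hand
    weight -= weight_gradient * learning_rate  # SGD update rule from above

print(weight)  # approaches the minimum at 3.0
```

Each step shrinks the distance to the minimum by a constant factor, so a larger learning rate converges faster here, until it becomes so large that the updates overshoot.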


Training pipeline :

1. Initialize model
2. Define loss function (criterion)
3. Define optimizer with learning rate for gradient descent 
  * Optimizer : the method used to search for the weights of the NN that reduce the loss.
  * Adam : Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for each parameter.
    * The learning rate controls the step size of the descent towards the optimum (minimum loss). 
4. Forward step to calculate prediction 
5. Forward step to calculate loss with respect to the defined criterion
6. Backward step to calculate the partial derivatives of the loss with respect to the parameters (weights and biases) using the computational graph.
7. Use optimizer to update the weights and biases.
  

## Neural Network - PyTorch style

### Building the model

In [None]:
# Contd. CIFAR10 Example

import torch
import torch.nn as nn
import torch.nn.functional as F # Functional API
import torch.optim as optim

class Net(nn.Module):
  def __init__(self):
    super(Net,self).__init__()
    # Fully connected layer 1 with 32x32 of 3 channels(RGB) and 500 units in hidden layer
    self.fc1 = nn.Linear(32*32*3,500)
    self.fc2 = nn.Linear(500,10) # 10 units in output layer for 10 classes
  
  def forward(self,x):
    x = F.relu(self.fc1(x)) # Apply ReLU (non-linearity) in the hidden layer
    return self.fc2(x)



## Training the model

In [None]:
net = Net()
criterion = nn.CrossEntropyLoss() # Loss function
optimizer = optim.Adam(net.parameters(),lr = 3e-4) # Adam optimizer with the model's parameters and the learning rate

for epoch in range(10): # loop over the dataset multiple times
  for i, data in enumerate(trainloader,0):

    # Get the inputs
    inputs, labels = data
    inputs = inputs.view(-1,32*32*3) # Rearranges into a vector

    # Zero the parameter gradients in order not to accumulate gradients from the previous iteration
    optimizer.zero_grad()

    # Forward + backward + optimize
    outputs = net(inputs) # Forward step results in prediction
    loss = criterion(outputs, labels) # Loss function
    loss.backward() # calculate Gradient 
    optimizer.step() # Update the weights using the optimizer's update rule



## Making predictions using model

In [None]:
correct, total = 0,0
predictions = []
net.eval() # Setting the model to evaluation for making predictions

for i, data in enumerate(testloader,0):
  inputs, labels = data
  inputs = inputs.view(-1,32*32*3)
  outputs = net(inputs) # Feed forward
  _, predicted = torch.max(outputs.data,1) # torch.max returns (max values, indices); the index is the predicted class
  predictions.append(outputs)
  total += labels.size(0)
  correct += (predicted == labels).sum().item() # Number of correct predictions

print("accuracy :", (100 * correct/total))


accuracy : 53.5


# Convolutional Neural Network

## Convolution Operator

Neural Networks on images

*  All relations between features captured.
  *  Computationally inefficient.
  * May overfit due to many parameters.

Convolutional Neural Network in images

* Units connected to only few units from previous layers. ( selected features )
* Units share weights.



Convolution:

Three operations:

1. Convolve (Dot product)
  * Dot product of the filter/kernel with the corresponding pixels of the image; the result is placed at the position of the center pixel in the activation map.
  * The depth dimension of the filter must match that of the image.
    * Image dimension (32,32,**3**)
    * Filter dimension (5,5,**3**)
2. Sliding window (Stride)
  * Slide the kernel/filter across the image in steps of the stride distance.
3. Padding
  * Adding zeros around the border of the image so that the activation map matches the size of the image.

Result: Activation Maps.



Convolution layer contains several activation maps.

Goal of CNN: 

* Learn different weighted filters that produces different activation maps corresponding to different features.
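
The dot product, stride and padding steps described above can be sketched as a naive NumPy loop. This is an illustration only: `conv2d_single` is a hypothetical helper written for this note, it handles a single image and a single filter in (height, width, channels) layout, and it is far slower than the PyTorch operators used below.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1, padding=0):
    """Naive convolution of an (H, W, C) image with one (k, k, C) filter."""
    if padding > 0:
        # Zero-pad height and width only, never the channel axis.
        image = np.pad(image, ((padding, padding), (padding, padding), (0, 0)))
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k, :]
            activation_map[i, j] = np.sum(patch * kernel)  # dot product
    return activation_map

image = np.ones((32, 32, 3))
kernel = np.ones((5, 5, 3))

print(conv2d_single(image, kernel).shape)             # (28, 28)
print(conv2d_single(image, kernel, padding=2).shape)  # (32, 32)
print(conv2d_single(image, kernel, stride=2).shape)   # (14, 14)
```

With a 5x5 filter, a padding of 2 keeps the activation map at the size of the image, while a stride of 2 roughly halves it.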


## Convolutions in PyTorch 

### OOP-based(torch.nn)

Parameters:

* in_channels(int) : Number of channels in input.   
* out_channels(int) : Number of channels produced by the convolution.
* kernel_size (int or tuple) : Size of the kernel/filter.
* stride (int or tuple, optional) : Stride of the convolution. Default: 1
* padding (int or tuple, optional) : Zero-padding added to both sides of the input. Default: 0



In [None]:
import torch
import torch.nn

# Mini-batch of 16 RGB images of 32x32 pixels, in (batch, channels, height, width) layout
image = torch.rand(16,3,32,32)
# Number of out_feature channels can be changed
conv_filter = torch.nn.Conv2d(in_channels =3, out_channels =1, kernel_size =5, stride =1, padding =0)
output_feature = conv_filter(image)

print(output_feature.shape)


torch.Size([16, 1, 28, 28])


### Functional (torch.nn.functional)

Parameters:

* input (minibatch x in_channels x iH x iW) : shape of the input tensor.
* weight (out_channels x in_channels x kH x kW) : shape of the filter.
* stride (sH,sW) : The stride of the convolving kernel. Default: 1
* padding : Implicit zero padding on both sides of the input. Default: 0

In [None]:
import torch
import torch.nn.functional as F

image = torch.rand(16,3,32,32)
filter = torch.rand(1,3,5,5) # Number of output feature channels can be changed(First parameter)
out_feat_F = F.conv2d(image,filter,stride=1,padding=0)

print(out_feat_F.shape)

torch.Size([16, 1, 28, 28])


## Pooling operator

**Convolution Layer**
* Extract different features. (**Feature maps**)

**Pooling Layer**

* Select most dominant features or combine different features.( **Feature Selection**)

Why?

* Lowers the resolution of the images (downsampling), making computation more efficient.
* Makes learning invariant to translation (robust to shifts or movements of the image).

Types:


1.   Max-Pooling
  *  Returns the maximum of the pixel map.
  *  Typically, 2x2 filter with stride 2 over the activation map.

2.   Average Pooling
  * Returns the average of the pixel map.
  * Typically, 2x2 filter with stride 2 over the activation map.
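
What a 2x2 pooling window with stride 2 computes can be shown with a NumPy reshape. This is a sketch under the assumption that the map's height and width are divisible by 2; the 4x4 `activation_map` holds the same values as the tensor used in the PyTorch cells below.

```python
import numpy as np

activation_map = np.array([[3, 1, 3, 5],
                           [6, 0, 7, 9],
                           [3, 2, 1, 4],
                           [0, 2, 4, 3]], dtype=float)

# Split the 4x4 map into non-overlapping 2x2 blocks:
# axis 0 = block row, axis 1 = row inside the block,
# axis 2 = block column, axis 3 = column inside the block.
blocks = activation_map.reshape(2, 2, 2, 2)

max_pooled = blocks.max(axis=(1, 3))   # maximum of each 2x2 block
avg_pooled = blocks.mean(axis=(1, 3))  # average of each 2x2 block

print(max_pooled)  # [[6. 9.] [3. 4.]]
print(avg_pooled)  # [[2.5 6.] [1.75 3.]]
```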






### Max-pooling in PyTorch

#### OOP

In [None]:
import torch
import torch.nn

# [mini-batch size, channels, height, width]
im = torch.Tensor([[[[3,1,3,5],[6,0,7,9],[3,2,1,4],[0,2,4,3]]]])
max_pooling = torch.nn.MaxPool2d(2) # Kernel size 2
output_feature = max_pooling(im)

print(output_feature)

tensor([[[[6., 9.],
          [3., 4.]]]])


#### Functional

In [None]:
import torch
import torch.nn.functional as F

# [mini-batch size, channels, height, width]
im = torch.Tensor([[[[3,1,3,5],[6,0,7,9],[3,2,1,4],[0,2,4,3]]]])

output_feature_F = F.max_pool2d(im,2)

print(output_feature_F)

tensor([[[[6., 9.],
          [3., 4.]]]])


### Average-Pooling in PyTorch 

#### OOP

In [None]:
import torch
import torch.nn 

im =  torch.Tensor([[[[3,1,3,5],[6,0,7,9],[3,2,1,4],[0,2,4,3]]]])
avg_pooling = torch.nn.AvgPool2d(2)
output_feature = avg_pooling(im)

print(output_feature)

tensor([[[[2.5000, 6.0000],
          [1.7500, 3.0000]]]])


#### Functional

In [None]:
import torch
import torch.nn.functional as F

im =  torch.Tensor([[[[3,1,3,5],[6,0,7,9],[3,2,1,4],[0,2,4,3]]]])
output_feature_F = F.avg_pool2d(im,2)

print(output_feature_F)

tensor([[[[2.5000, 6.0000],
          [1.7500, 3.0000]]]])


## Convolutional Neural Networks

A CNN is a neural network consisting of convolutional layers, pooling layers and fully connected layers.

CNN pipeline:

    Convolution layer -> Pooling layer -> Flattening -> Fully connected layers

AlexNet architecture:

* Architecture that revolutionised the use of CNNs.
* Developed for ImageNet Classification and consists of 
  *   5 Convolutional layers
  *   3 Max pooling layers
  *   3 fully connected layers

Output: Classification into 1000 different classes 






## AlexNet in PyTorch

In [None]:
class AlexNet(nn.Module):

  # Parameters taken from paper.
  def __init__(self,num_classes = 1000):
    super(AlexNet,self).__init__()
    self.conv1 = nn.Conv2d(3,64, kernel_size=11, stride=4, padding=2)
    self.relu = nn.ReLU(inplace=True)
    self.maxpool = nn.MaxPool2d(kernel_size=3,stride=2)
    self.conv2 = nn.Conv2d(64,192,kernel_size=5, padding=2)
    self.conv3 = nn.Conv2d(192,384,kernel_size=3, padding=1)
    self.conv4 = nn.Conv2d(384,256,kernel_size=5, padding=2)
    self.conv5 = nn.Conv2d(256,256,kernel_size=5, padding=2)
    self.avgpool = nn.AdaptiveAvgPool2d((6,6))
    self.fc1 = nn.Linear(256*6*6,4096)
    self.fc2 = nn.Linear(4096,4096)
    self.fc3 = nn.Linear(4096, num_classes)

  # Defining CNN
  def forward(self,x):
    x = self.relu(self.conv1(x)) # Pass image to conv1
    x = self.maxpool(x)
    x = self.relu(self.conv2(x))
    x = self.maxpool(x)
    x = self.relu(self.conv3(x))
    x = self.relu(self.conv4(x))
    x = self.relu(self.conv5(x))
    x = self.maxpool(x)
    x = self.avgpool(x)
    x = x.view(x.size(0),256*6*6)
    x = self.relu(self.fc1(x))
    x = self.relu(self.fc2(x))
    return self.fc3(x) # Apply the last layer to x (returning self.fc3 alone would return the layer object)

net = AlexNet()

In [None]:
net

AlexNet(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(384, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv5): Conv2d(256, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (fc1): Linear(in_features=9216, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=4096, bias=True)
  (fc3): Linear(in_features=4096, out_features=1000, bias=True)
)

## Training Fully Connected CNN

## 1. Dataloaders ( Loading train and test data)

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))])


trainset = torchvision.datasets.CIFAR10(root='./data',train = True, transform=transform, download=True )

testset = torchvision.datasets.CIFAR10(root= './data', train = False, download = True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size = 128, shuffle = True, num_workers = 2)

testloader = torch.utils.data.DataLoader(testset, batch_size = 128, shuffle = False, num_workers = 2)

Files already downloaded and verified
Files already downloaded and verified


## Building CNN

In [None]:
class CNN(nn.Module):
  # Define parameters (layers..)
  def __init__(self,num_classes = 10):
    super(CNN,self).__init__()
    self.conv1 = nn.Conv2d(in_channels= 3, out_channels=32, kernel_size=3, padding=1)
    self.conv2 = nn.Conv2d(in_channels= 32, out_channels=64, kernel_size=3, padding=1)
    self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
    self.pool = nn.MaxPool2d(2,2)
    self.fc = nn.Linear(128*4*4, num_classes) # (out_channels * height * width) <- Result of pooling and conv

  # Applying the parameters to the input
  def forward(self,x):
    x = self.pool(F.relu(self.conv1(x))) # Functional relu
    x = self.pool(F.relu(self.conv2(x)))
    x = self.pool(F.relu(self.conv3(x)))
    x = x.view(-1,128*4*4) # Flatten into a 1-D vector per image
    return self.fc(x)
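
The `128*4*4` input size of `self.fc` follows from the layer settings. The sketch below traces the spatial size of a 32x32 CIFAR-10 image through the three conv + pool stages in plain Python; `conv_output_size` is a hypothetical helper that applies the usual output-size formula `(size + 2 * padding - kernel_size) // stride + 1`.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 32  # CIFAR-10 images are 32x32
for _ in range(3):
    size = conv_output_size(size, kernel_size=3, padding=1)  # 3x3 conv, padding 1: size unchanged
    size = conv_output_size(size, kernel_size=2, stride=2)   # 2x2 max-pool, stride 2: size halved

print(size)               # 4
print(128 * size * size)  # 2048 input features for self.fc
```

The same helper gives 28 for the 5x5 convolution without padding used earlier, matching the printed shape `[16, 1, 28, 28]`.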



## Training with optimizer and loss function

In [None]:
net = CNN() # Instantiate object from class CNN
criterion = nn.CrossEntropyLoss() # Instantiate cross entropy loss 
optimizer = optim.Adam(net.parameters(),lr = 3e-4) 

# Training ( loop over trainloader )

for epoch in range(10):
  for i,data in enumerate(trainloader,0):
    # Get the inputs
    inputs, labels = data

    # Zero the parameter gradients
    optimizer.zero_grad()

    # Forward + backward + optimize
    outputs = net(inputs)
    loss = criterion(outputs,labels)
    loss.backward()
    optimizer.step()

print("Finished training")


Finished training


## Evaluating the results

In [None]:
correct, total =0,0
predictions = []
net.eval()

for i, data in enumerate(testloader,0):
  inputs, labels = data
  outputs = net(inputs)
  _,predicted = torch.max(outputs.data,1)
  predictions.append(outputs)
  total += labels.size(0)
  correct += (predicted == labels).sum().item()

print("accuracy of CNN :", 100*correct/total)

accuracy of CNN : 70.24


# Sequential Module

* PyTorch tool that is useful when building large NNs.
* The Sequential module helps make code more modular (useful in feed-forward networks).
* A more OO-based approach that allows parts (modules) to be changed independently of each other.

In [None]:
class AlexNet(nn.Module):

  # Parameters taken from paper.
  def __init__(self,num_classes = 1000):
    super(AlexNet,self).__init__()
    self.features = nn.Sequential( # Sequential module, order matters (Convolutional layers)
      nn.Conv2d(3,64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
      nn.MaxPool2d(kernel_size=3,stride=2),
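      nn.Conv2d(64,192,kernel_size=5, padding=2), nn.ReLU(inplace=True),
      nn.MaxPool2d(kernel_size=3,stride=2), # Max-pool after conv2, as in the forward pass of the earlier AlexNet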
      nn.Conv2d(192,384,kernel_size=3, padding=1), nn.ReLU(inplace=True),
      nn.Conv2d(384,256,kernel_size=5, padding=2), nn.ReLU(inplace=True),
      nn.Conv2d(256,256,kernel_size=5, padding=2), nn.ReLU(inplace=True),
      nn.MaxPool2d(kernel_size=3,stride=2),)
    self.avgpool = nn.AdaptiveAvgPool2d((6,6))
    self.classifier = nn.Sequential( # Sequential module (Fully Connected layers)
        nn.Dropout(),nn.Linear(256*6*6,4096), nn.ReLU(inplace=True), # Dropout (few units) used to avoid overfitting
        nn.Dropout(), nn.Linear(4096,4096), nn.ReLU(inplace=True),
        nn.Dropout(), nn.Linear(4096, num_classes))

  # Defining CNN w.r.t each sequential module
  def forward(self,x):
    x = self.features(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), 256*6*6)
    x = self.classifier(x)
    return x


# The problem of overfitting

Model works very well on training set, but worse on the test set.

Reason:

* Very complicated non-smooth hypothesis (separator).

**Bias and Variance tradeoff**:
 
* Bias 
  * Inability to **capture true relationship** between independent and dependent variables.

* Variance
  * **Sensitivity of the fitted model to the particular training set**: how much the predictions change when the model is trained on different data.

        low bias -> high variance
        high bias -> low variance

**Overfitting: High variance (large difference in accuracy between training and test set)**



## Prevent Overfitting:


* Training different models with changing hyperparameters on the training set and testing each of them on the test set may lead to contamination of the test set.

Solution : Validate the models on a separate validation set, and only then evaluate on the test set.

  * Training set : Train the model
  * Validation set : Select the model
  * Testing set : Test the model

Note: The testing set is used only once or a few times.

### Using validation sets in PyTorch

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

indices = np.arange(50000) # train and validation split based on indices 
np.random.shuffle(indices)

# SubsetRandomSampler draws random samples from the given indices (first 45,000 for training, last 5,000 for validation)
train_loader = torch.utils.data.DataLoader(torchvision.datasets.CIFAR10(root ='./data', train = True, download =True, transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.485,0.456,0.406),(0.229,0.224,0.225))])),
                                            batch_size = 1, shuffle = False, sampler = torch.utils.data.SubsetRandomSampler(indices[:45000]))
 
val_loader = torch.utils.data.DataLoader(torchvision.datasets.CIFAR10(root ='./data', train = True, download =True, transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.485,0.456,0.406),(0.229,0.224,0.225))])),
                                            batch_size = 1, shuffle = False, sampler = torch.utils.data.SubsetRandomSampler(indices[45000:50000]))


Files already downloaded and verified
Files already downloaded and verified


### Regularization techniques

Techniques used during training to give better predictions, generalize well and avoid overfitting.

*   **L2 - regularization** :
  * Adds a term to the loss function that penalizes large weights.
      * loss + regularization parameter * sum of the squared norms of the weight matrices. 
  * Used in regression and SVMs.
  * In PyTorch, add `weight_decay` to the optimizer:
          optim.Adam(net.parameters(), lr = 3e-4, weight_decay=0.0001)
*  **Dropout** :
  * During the forward pass, each unit is dropped from the computation with probability p. 
  * By doing so, units are forced not to depend on surrounding units (the architecture changes as different units are removed in each iteration).
  * Typically used in fully connected layers. 



```
self.classifier = nn.Sequential (
  nn.Dropout (p =0.5), # Drop each unit with probability 50%
  nn.Linear(256*6*6,4096),
  nn.ReLU(inplace=True),
  nn.Dropout(p=0.5),
  nn.Linear(4096,4096),
  nn.ReLU(inplace = True),
  nn.Linear(4096, num_classes),
)

```
* **Batch-normalization**
  *  Normalization is the process of scaling the data values to a standard scale.
           Normalization - rescale to [0, 1]
           Standardization - (x - mean) / standard deviation
    * Why?
    
      * Values on very different scales (and the large weights they induce) can cause imbalanced gradients, which may ultimately lead to the exploding gradient problem.
  * **Batch normalization applied to a layer normalizes the output of the activation function for the units of that layer**. 
      1. Normalize the output of the activation function.
              z = (x - M) / S
      2. Multiply the normalized output by a learnable parameter, g.
              z*g
      3. Add a learnable parameter, b, to the resulting product.
              (z*g)+b
      * The mean (M) and standard deviation (S) are computed from the mini-batch; g and b are trainable parameters.

  *  Batch normalization computes the mean and variance of the mini-batch for each feature and then normalizes the features based on these statistics. Insert `BatchNorm2d` after the activation layer of the features to normalize the batch of features.

        self.bn = nn.BatchNorm2d(num_features = 64, eps = 1e-05, momentum = 0.9)

* **Early stopping** : 
    * Checks the accuracy on the validation set after each epoch. If the accuracy is stagnant or has decreased, training is terminated. 
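
The three batch-normalization steps listed above can be sketched in NumPy for a mini-batch of feature vectors. This is a simplified illustration, not PyTorch's implementation: the function name `batch_norm` is chosen here, the statistics are always taken from the current mini-batch (no running averages), and `g` and `b` are plain arrays standing in for the learnable scale and shift.

```python
import numpy as np

def batch_norm(x, g, b, eps=1e-5):
    """Normalize each feature (column) of a mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                 # per-feature mean of the mini-batch
    var = x.var(axis=0)                   # per-feature variance of the mini-batch
    z = (x - mean) / np.sqrt(var + eps)   # step 1: normalize
    return z * g + b                      # steps 2 and 3: scale by g, shift by b

# Mini-batch of 4 samples with 2 features on very different scales.
batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])
g = np.ones(2)    # scale, initialized to 1
b = np.zeros(2)   # shift, initialized to 0

normalized = batch_norm(batch, g, b)
print(normalized.mean(axis=0))  # close to 0 for both features
print(normalized.std(axis=0))   # close to 1 for both features
```

After normalization both features live on the same scale, even though the second one was a hundred times larger in the input.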







### Hyperparameters


Question: How to choose all these hyperparameters (L2 regularization strength, dropout parameter, optimizer (Adam vs. gradient descent), batch-norm momentum and epsilon, number of epochs for early stopping, etc.)?

Answer:

* Train many networks with different hyperparameters (typically sampled at random).
* Evaluate them on the validation set.
* Pick the net that performs best on the validation set, then use the test set to estimate its expected accuracy on new data.

**Note: Very important to set the mode of the net** (`model.train()`, `model.eval()`)
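
Why the mode matters can be seen on a single dropout layer. The sketch below assumes PyTorch is installed and uses a standalone `nn.Dropout` instead of a full network; calling `.train()` or `.eval()` on a whole model switches all of its dropout and batch-norm layers in the same way.

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)
x = torch.ones(8)

layer.train()         # training mode: units are dropped at random
train_out = layer(x)  # each entry is 0 or 2 (kept units are scaled by 1 / (1 - p))

layer.eval()          # evaluation mode: dropout is switched off
eval_out = layer(x)   # identical to the input

print(train_out)
print(eval_out)
```

Forgetting `model.eval()` before making predictions would leave dropout active, so the same input could give different outputs on every call.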


# Transfer learning

The deeper we progress in the CNN, the more abstract the features become.

 **Consequence**: the low-level features (early layers) are more general and to a large degree dataset independent.

Transfer Learning: Using a model pretrained on a large dataset and fine-tuning it for the specific use case.

* Useful to train in less time and to train on small datasets without overfitting.

**Fine tuning** :

1. **Fine tune everything**.
    Eg. 

      1. Trained model on CIFAR10 and saved the model as cifar10_net.pth
      2. The penultimate layer has dimensions of 1024 (features) * 4 * 4 (spatial dimensions).
      3. Use this pretrained model to train on the CIFAR100 dataset, which has far fewer images per class.

      Fine-tuning the CIFAR10 model on CIFAR100.

      ```
      #Instantiate the model 
      model = Net()

      #Load the parameters from the old model trained on CIFAR10
      model.load_state_dict(torch.load('cifar10_net.pth'))

      # Change the number of units in the last layer (always correspond to the number of classes)
      model.fc = nn.Linear(4 * 4* 1024, 100)

      # Set training mode, then train and evaluate the model (same as from scratch)
      model.train()
      ```
2. **Freeze all the layer except the last layer** during back propagation and fine tune only the last layer.
  * Typically done if dataset is too small.


      
      #Instantiate the model 
      model = Net()

      #Load the parameters from the old model trained on CIFAR10
      model.load_state_dict(torch.load('cifar10_net.pth'))

      # Freeze all the layers except the final one
      for param in model.parameters():
        param.requires_grad = False
      
      # Change the number of output units
      model.fc = nn.Linear( 4 * 4 * 1024, 100)

      # Set training mode, then train and evaluate the model (same as from scratch)
      model.train()
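
Whether the freezing worked can be checked by listing the parameters that still require gradients. The sketch below assumes PyTorch is installed and uses `TinyNet`, a hypothetical stand-in for the pretrained `Net`, so that it runs without the `cifar10_net.pth` checkpoint; the layer sizes are arbitrary.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for the pretrained Net, kept tiny so it runs without a checkpoint.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 16)
        self.fc = nn.Linear(16, 10)

    def forward(self, x):
        return self.fc(self.features(x))

model = TinyNet()

# Freeze every existing parameter ...
for param in model.parameters():
    param.requires_grad = False

# ... then replace the head; a freshly created layer is trainable by default.
model.fc = nn.Linear(16, 100)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']

# Only hand the trainable parameters to the optimizer.
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=3e-4)
```

The order matters: replacing `model.fc` before the freezing loop would freeze the new layer as well.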







## Torchvision library

Library for pretrained models.

In [None]:
import torchvision

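# Download the pretrained ResNet-18 model from torchvision
model = torchvision.models.resnet18(pretrained = True)

#model.fc = nn.Linear(512,num_classes)
model.fc # ResNet has no `layers` attribute; `fc` is its final fully connected layer

Linear(in_features=512, out_features=1000, bias=True)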