# Mastering PyTorch By Ashish Ranjan Jha
We used the following tutorial to guide our hands-on session as well:
- [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)

## Chapter 1 - Overview of Deep Learning Using PyTorch

Deep Learning is a class of machine learning methods that has revolutionized the way computers are used to build automated solutions for real-life problems in a way that wasn't possible before. Deep Learning uses large amounts of data to learn non-trivial relationships between inputs and outputs in the form of complex nonlinear functions.

Some of the inputs and outputs could be:
- Input: An image of a text; output: Text
- Input: Text; output: A natural voice speaking the text
- Input: A natural voice speaking the text; output: Transcribed text

And so on. (The above examples deliberately exclude tabular input data because gradient-boosted trees (XGBoost, LightGBM, CatBoost) often still match or outperform deep learning on such data.)

Some of the well-known layers are the following:

• Fully-connected or linear: In a fully connected layer, all neurons preceding this layer are connected to all neurons succeeding this layer. Fully connected layers are a fundamental unit of many – in fact, most – deep learning classifiers.

• Convolutional: In a convolutional layer, a convolutional kernel (or filter) is convolved over the input. Convolutional layers are a fundamental unit of Convolutional Neural Networks (CNNs), which are among the most effective models for solving computer vision problems.

• Recurrent: Recurrent layers have an advantage over fully connected layers in that they exhibit memorizing capabilities, which comes in handy when working with sequential data where one needs to remember past inputs along with the present inputs.

• DeConv (the reverse of a convolutional layer, also called a transposed convolution): In contrast to a convolutional layer, a deconvolutional layer expands the input data spatially and is therefore crucial in models that aim to generate or reconstruct images, for example.
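
To make the four layer types above more concrete, here is a minimal sketch that instantiates one of each and prints the output shapes. The layer sizes, the batch size, and the choice of `nn.RNN` as the recurrent layer are arbitrary assumptions made for illustration; they are not taken from the book.

```python
import torch
import torch.nn as nn

batch_size = 2  # arbitrary choice for illustration

# Fully connected: maps 8 input features to 4 output features.
linear = nn.Linear(in_features=8, out_features=4)
linear_out = linear(torch.randn(batch_size, 8))

# Convolutional: a 3x3 kernel slides over a 3-channel 16x16 input.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3)
conv_out = conv(torch.randn(batch_size, 3, 16, 16))

# Recurrent: processes a sequence of 7 steps, each with 8 features.
rnn = nn.RNN(input_size=8, hidden_size=5, batch_first=True)
rnn_out, last_hidden = rnn(torch.randn(batch_size, 7, 8))

# Transposed convolution ("DeConv"): expands the spatial size of its input.
deconv = nn.ConvTranspose2d(in_channels=6, out_channels=3, kernel_size=2, stride=2)
deconv_out = deconv(conv_out)

print(linear_out.shape)   # torch.Size([2, 4])
print(conv_out.shape)     # torch.Size([2, 6, 14, 14])
print(rnn_out.shape)      # torch.Size([2, 7, 5])
print(deconv_out.shape)   # torch.Size([2, 3, 28, 28])
```

Note how the convolution shrinks the 16x16 input to 14x14, while the transposed convolution doubles the 14x14 feature maps to 28x28.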

### PyTorch modules

The PyTorch library, besides offering the computational functions as NumPy does, also offers a set of
modules that enable developers to quickly design, train, and test deep learning models. The following
are some of the most useful modules.

#### torch.autograd

`torch.autograd` is PyTorch’s automatic differentiation engine that powers neural network training. 

Neural networks (NNs) are like a series of linked functions that process input data. These functions have parameters (weights and biases) that are kept in tensors in PyTorch.

Training a neural network involves two main steps:

**Forward Propagation**
In this step, the neural network makes an educated guess about the output. It does this by passing the input data through its functions to produce a prediction.

**Backward Propagation**
Here, the neural network learns from its mistakes. It adjusts its parameters based on the difference between its prediction and the actual result. The network works backwards from the output, calculates how much each parameter contributed to the error (using derivatives called gradients), and updates the parameters to improve the prediction. This process of updating is done using a method called gradient descent.
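
The two steps can be traced by hand on a tiny example. The sketch below assumes a hypothetical one-parameter-pair model `prediction = w * x + b` with a squared-error loss; the values of `x`, `target`, and `learning_rate` are made up for illustration.

```python
import torch

# Hypothetical toy problem: a single input/target pair.
x = torch.tensor(2.0)
target = torch.tensor(7.0)

# Parameters tracked by autograd.
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Forward propagation: make a prediction and measure the error.
prediction = w * x + b              # 1 * 2 + 0 = 2
loss = (prediction - target) ** 2   # (2 - 7)^2 = 25

# Backward propagation: autograd fills in w.grad and b.grad.
loss.backward()
print(w.grad)  # dL/dw = 2 * (prediction - target) * x = -20
print(b.grad)  # dL/db = 2 * (prediction - target)     = -10

# Gradient descent: step against the gradient, outside of autograd tracking.
learning_rate = 0.1
with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad

# After one step the prediction is much closer to the target.
new_loss = (w * x + b - target) ** 2
print(new_loss.item())
```

The gradients printed by autograd match the derivatives worked out by hand in the comments, which is all that `loss.backward()` does for a larger network, just for many more parameters.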

In [1]:
# Let's take a look at a single training step. 

# Import dependencies 
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1,3,64,64)
labels = torch.rand(1, 1000)

# forward pass 
prediction = model(data)

# calculate the loss
loss = (prediction - labels).sum()

# backward pass
loss.backward()

# load an optimizer 
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# gradient descent
optim.step()



#### torch.nn
When building a neural network architecture, the fundamental design choices are the number of layers, the number of neurons in each layer, which of those are learnable, and so on.

The PyTorch nn module enables users to quickly instantiate neural network architectures by defining some of these high-level aspects as opposed to having to specify all the details manually.

In [2]:
import math
import torch
# Assume a 64-dimensional input and a 4-dimensional output for this 1-layer network.
# Hence, we initialize a 64x4 matrix filled with random values (scaled by 1/sqrt(64)).
weights = torch.randn(64, 4) / math.sqrt(64)
# Make the parameters of this network trainable, i.e., the numbers in the
# 64x4 matrix can be tuned with the help of backpropagation of gradients.
weights.requires_grad_()
# Finally, add the bias for the 4-dimensional output and make it trainable too.
bias = torch.zeros(4, requires_grad=True)

We can instead use `nn.Linear(64, 4)` to represent the same thing in PyTorch. In TensorFlow, this could be written as `tf.keras.layers.Dense(4, input_shape=(64,), activation=None)`, where the first argument is the number of output units.
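
As a quick check of that equivalence, the sketch below copies hand-made weights into an `nn.Linear(64, 4)` layer and compares the two outputs. The batch of 8 random inputs and the fixed seed are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Manual parameters, as in the cell above.
weights = (torch.randn(64, 4) / math.sqrt(64)).requires_grad_()
bias = torch.zeros(4, requires_grad=True)

# The same layer expressed with torch.nn.
layer = nn.Linear(64, 4)

# nn.Linear stores its weight as (out_features, in_features), i.e. 4x64,
# so the manual 64x4 matrix has to be transposed before copying it in.
with torch.no_grad():
    layer.weight.copy_(weights.t())
    layer.bias.copy_(bias)

batch = torch.randn(8, 64)            # hypothetical batch of 8 inputs
manual_out = batch @ weights + bias   # explicit matrix multiplication
layer_out = layer(batch)              # the nn.Linear forward pass

print(manual_out.shape)  # torch.Size([8, 4])
print(torch.allclose(manual_out, layer_out, atol=1e-5))
```

The only real difference is bookkeeping: `nn.Linear` creates, initializes, and registers the weight and bias for us.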

In [3]:
# What does a 64-dimensional input look like?
# A 64-dimensional input typically refers to a data point represented as a vector with 64 elements.
# Each element can be a feature, a measurement, or a value that represents some aspect of the data.
x = torch.randn(64)
print(x)

y = torch.randn(64,4)
print(y)

# see the difference between x and y here.

tensor([-0.2963,  1.3469, -0.7804,  1.1162, -0.2191, -1.3136, -0.7041,  0.4698,
        -0.8653,  0.8924,  0.4866, -1.0979, -0.6410,  0.2095, -0.2444, -0.3963,
         0.1754, -0.2154,  0.2450,  0.4057,  0.4755, -1.1892,  1.6432,  0.2591,
        -0.2277,  0.0310,  0.8761,  2.2952,  0.5770,  0.5841, -0.7167,  0.2036,
         0.2739, -1.7121, -0.7243,  0.0449, -1.2325, -0.9400, -0.6814,  2.2289,
         1.2043, -0.3879,  1.6313,  0.2762, -1.0367, -0.9182,  0.1398, -0.1379,
        -1.0946, -0.1464, -0.1394, -1.5116, -0.3661,  1.8563,  0.4153, -0.6331,
        -1.0518,  0.6365,  1.0537,  0.5225, -0.9475, -0.4695,  1.1794,  1.5329])
tensor([[-0.7816,  0.1137, -1.2018, -0.0897],
        [ 0.1645,  0.9752,  2.8006,  1.0079],
        [-1.6533,  0.8117, -0.2668,  1.0208],
        [-2.2112, -0.6038, -0.0998, -0.9875],
        [ 0.9569, -0.9595, -1.5058,  0.4872],
        [ 0.5487,  1.1966, -0.5981,  1.6318],
        [ 0.3414,  2.1036, -0.5699, -1.0466],
        [ 0.4907, -0.5955,  0.7983, -

**Differences:**

1. Shape and Dimensions:
   - `x`: This tensor is a 1-dimensional tensor (vector) with 64 elements.
   - `y`: This tensor is a 2-dimensional tensor (matrix) with 64 rows and 4 columns.

2. Structure:
   - `x`: A simple list of 64 random values drawn from a standard normal distribution (mean = 0, standard deviation = 1).
   - `y`: A matrix with 64 rows and 4 columns, where each element is a random value drawn from a standard normal distribution.

3. Usage:
   - `x`: Typically used as a single data point with 64 features. This could be useful in various applications like a feature vector in machine learning.
   - `y`: Typically used as a batch of data points, where each row represents a data point with 4 features. This is useful in scenarios like feeding data into a neural network in mini-batches.

Within the torch.nn module, there is a submodule called **torch.nn.functional**. This submodule
collects the functions of the torch.nn module, whereas the other submodules consist of classes.
These functions include loss functions, activation functions, and also neural functions that can be used
to create neural networks in a functional manner (that is, when each subsequent layer is expressed
as a function of the previous layer) such as pooling, convolutional, and linear functions.
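
The difference between the class-based and the functional style can be seen on two parameter-free operations. This is a small sketch on a random input; the tensor shape is an arbitrary assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)  # hypothetical batch of 3-channel 8x8 inputs

# Class-based style: layers are objects that are created once and then called.
relu_layer = nn.ReLU()
pool_layer = nn.MaxPool2d(kernel_size=2)
class_out = pool_layer(relu_layer(x))

# Functional style: each step is a plain function of the previous result.
functional_out = F.max_pool2d(F.relu(x), kernel_size=2)

print(class_out.shape)                         # torch.Size([2, 3, 4, 4])
print(torch.equal(class_out, functional_out))  # True
```

For operations without learnable parameters the two styles are interchangeable; layers that own weights (such as `nn.Linear` or `nn.Conv2d`) are usually kept as class instances so that their parameters are registered with the model.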

In [4]:
# Example only: this cell raises a NameError because X has not been defined yet
# (a suitable model and matching labels y are also needed; they are defined later).
import torch.nn.functional as F
loss_func = F.binary_cross_entropy
loss = loss_func(model(X), y)

NameError: name 'X' is not defined
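
A runnable variant of the cell above could look like the following sketch. `toy_model`, `X`, and `y` are hypothetical stand-ins invented here so that the call has something to work on; note that `F.binary_cross_entropy` expects probabilities in [0, 1], which is why the stand-in model ends with a sigmoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the undefined names in the cell above.
toy_model = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())  # outputs values in (0, 1)
X = torch.randn(16, 64)                                    # 16 samples, 64 features each
y = torch.randint(0, 2, (16, 1)).float()                   # binary labels as floats

loss_func = F.binary_cross_entropy
loss = loss_func(toy_model(X), y)

print(loss.item())  # a positive scalar; the exact value depends on the random init
```

If the model produced raw scores instead of probabilities, `F.binary_cross_entropy_with_logits` would be the matching function.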

### Training a neural network using PyTorch

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or weights)

- Iterate over a dataset of inputs

- Process input through the network

- Compute the loss (how far is the output from being correct)

- Propagate gradients back into the network’s parameters

- Update the weights of the network, typically using a simple update rule: `weight = weight - learning_rate * gradient`
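
The steps above can be sketched as a short loop. This is a minimal illustration on synthetic data, assuming a one-layer model, a mean-squared-error loss, and hand-picked values for the learning rate, batch size, and number of epochs; it is not the training code from the book.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic dataset (an assumption for illustration): targets follow y = 3x + 1.
inputs = torch.linspace(-1, 1, 32).unsqueeze(1)
targets = 3 * inputs + 1

# 1. Define a network with learnable parameters.
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
learning_rate = 0.1
batch_size = 8

initial_loss = loss_fn(model(inputs), targets).item()

for epoch in range(200):
    # 2. Iterate over the dataset in mini-batches.
    for start in range(0, len(inputs), batch_size):
        batch_x = inputs[start:start + batch_size]
        batch_y = targets[start:start + batch_size]

        # 3. Process the input through the network.
        prediction = model(batch_x)

        # 4. Compute the loss.
        loss = loss_fn(prediction, batch_y)

        # 5. Propagate gradients back into the parameters.
        model.zero_grad()
        loss.backward()

        # 6. Update: weight = weight - learning_rate * gradient
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad

final_loss = loss_fn(model(inputs), targets).item()
print(initial_loss, final_loss)
print(model.weight.item(), model.bias.item())  # should approach 3 and 1
```

In practice the manual update in step 6 is usually replaced by an optimizer such as `torch.optim.SGD`, which applies the same rule (optionally with momentum) through `optimizer.step()`.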

![architecture](https://pytorch.org/tutorials/_images/mnist.png)

The network shown above (LeNet) is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.

#### Network Breakdown:
**Input Layer (32x32)** 

The input to the network is a 32x32 pixel grayscale image. The image in the diagram appears to be a handwritten letter "A", but LeNet-5 was typically used on the MNIST dataset, which contains 28x28 pixel images of handwritten digits that are zero-padded to 32x32. (This detail comes from the reference; the breakdown below is based on the diagram only.)

**C1: First Convolutional Layer**

- Feature Maps: 6. The image shows "C1: feature maps 6@28x28," which tells you that the first convolutional layer (C1) generates 6 feature maps.

- Kernel Size: 5x5. Although the kernel size is not explicitly mentioned in the image, you can infer it from the input size and the output size of the convolutional layer. The input is 32x32, and after applying the convolution, the output is 28x28. For a convolution with a stride of 1 and no padding, the output size is given by:

`Output Size = Input Size − Kernel Size + 1`

Rearranging for the kernel size:

`Kernel Size = Input Size − Output Size + 1`

Substituting the values:

`Kernel Size = 32 − 28 + 1 = 5`

So, the kernel size is 5x5.

- Output Size: 28x28. The image directly labels the output size of the first convolutional layer as 28x28, which you can see under "C1: feature maps 6@28x28."
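
The output-size formula for C1 can be checked against PyTorch directly. The helper `conv_output_size` below is a hypothetical name introduced for this sketch, and it only covers the stride-1, no-padding case used here.

```python
import torch
import torch.nn as nn

def conv_output_size(input_size: int, kernel_size: int) -> int:
    """Output size of a convolution with stride 1 and no padding."""
    return input_size - kernel_size + 1

# C1 as described above: 1 input channel, 6 feature maps, 5x5 kernel.
c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
image = torch.zeros(1, 1, 32, 32)  # (batch, channels, height, width)
feature_maps = c1(image)

print(conv_output_size(32, 5))  # 28
print(feature_maps.shape)       # torch.Size([1, 6, 28, 28])
```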

**S2: First Subsampling (Pooling) Layer**

- Feature Maps: 6
- Subsampling Method: Typically, average pooling with a stride of 2
- Output Size: 14x14


Calculation:
The output size of a pooling layer can be calculated using the formula:

`Output Size = (Input Size − Pool Size)/Stride + 1`

For simplicity, in most standard implementations of LeNet-5:

- Pool Size: typically 2x2
- Stride: typically 2

`Output Size = (28 − 2)/2 + 1 = 14`
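
The pooling formula can be verified the same way. The helper `pool_output_size` is a hypothetical name for this sketch, and `F.avg_pool2d` is used because the breakdown describes average pooling.

```python
import torch
import torch.nn.functional as F

def pool_output_size(input_size: int, pool_size: int, stride: int) -> int:
    """Output size of a pooling layer without padding."""
    return (input_size - pool_size) // stride + 1

# S2: 2x2 average pooling with stride 2 applied to the 6@28x28 maps from C1.
c1_maps = torch.zeros(1, 6, 28, 28)
s2_maps = F.avg_pool2d(c1_maps, kernel_size=2, stride=2)

print(pool_output_size(28, 2, 2))  # 14
print(s2_maps.shape)               # torch.Size([1, 6, 14, 14])
```

Pooling changes only the spatial size; the number of feature maps stays at 6.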

Similarly, for the remaining layers:

**C3: Second Convolutional Layer**
- Feature Maps: 16
- Kernel Size: 5x5
- Output Size: 10x10

**S4: Second Subsampling (Pooling) Layer**
- Feature Maps: 16
- Subsampling Method: Average pooling with a stride of 2
- Output Size: 5x5

**C5: Fully Connected Convolutional Layer**

Here, a fully connected layer (also known as a dense layer) takes the flattened output of the previous layer as input. With 400 (16x5x5) inputs from S4 and 120 neurons, each neuron receives all 400 features. Because the 5x5 kernel covers the whole 5x5 feature map, this layer can equally be viewed as a convolution.
- Neurons: 120
- Kernel Size: 5x5

**F6: Fully Connected Layer**
- Neurons: 84

**Output Layer**
- Neurons: 10
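
C5 is listed with both a neuron count and a 5x5 kernel size. The sketch below illustrates why both views describe the same layer: a 5x5 convolution over the 16@5x5 output of S4 has the same number of parameters as a 400-to-120 linear layer and, once the weights are copied over, produces the same values. The random input and the weight-copying step are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two views of C5: a linear layer on the flattened input, and a 5x5 convolution.
fc = nn.Linear(16 * 5 * 5, 120)
conv = nn.Conv2d(16, 120, kernel_size=5)

# Copy the linear weights into the convolution so both compute the same function.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(120, 16, 5, 5))
    conv.bias.copy_(fc.bias)

s4_maps = torch.randn(2, 16, 5, 5)           # hypothetical S4 output for a batch of 2
fc_out = fc(torch.flatten(s4_maps, 1))       # (2, 400) -> (2, 120)
conv_out = torch.flatten(conv(s4_maps), 1)   # (2, 120, 1, 1) -> (2, 120)

fc_params = sum(p.numel() for p in fc.parameters())
conv_params = sum(p.numel() for p in conv.parameters())

print(fc_params, conv_params)  # 48120 48120
print(torch.allclose(fc_out, conv_out, atol=1e-4))
```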

In [5]:
# Import dependencies
import torch
import torch.nn as nn
import torch.nn.functional as F

In [6]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        self.cn1 = nn.Conv2d(1, 6, 5)
        self.cn2 = nn.Conv2d(6, 16, 5)
        
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        
    def forward(self, input):
        # Convolution layer C1: uses the ReLU activation function, and
        # outputs a Tensor with size (N, 6, 28, 28), where N is the size of the batch
        c1 = F.relu(self.cn1(input))

        # Subsampling layer S2: 2x2 grid, purely functional (max pooling here,
        # whereas the breakdown above lists average pooling);
        # this layer does not have any parameter, and outputs a (N, 6, 14, 14) Tensor
        s2 = F.max_pool2d(c1, (2, 2))

        # Convolution layer C3: uses the ReLU activation function, and
        # outputs a (N, 16, 10, 10) Tensor
        c3 = F.relu(self.cn2(s2))

        # Subsampling layer S4: 2x2 grid, purely functional,
        # this layer does not have any parameter, and outputs a (N, 16, 5, 5) Tensor
        s4 = F.max_pool2d(c3, 2)

        # Flatten operation: purely functional, outputs a (N, 400) Tensor
        s4 = torch.flatten(s4, 1)

        # Fully connected layer F5 (C5 in the breakdown above): (N, 400) Tensor input,
        # outputs a (N, 120) Tensor, and uses the ReLU activation function
        f5 = F.relu(self.fc1(s4))

        # Fully connected layer F6: (N, 120) Tensor input,
        # outputs a (N, 84) Tensor, and uses the ReLU activation function
        f6 = F.relu(self.fc2(f5))

        # Output layer (called the Gaussian layer in the PyTorch tutorial):
        # (N, 84) Tensor input, outputs a (N, 10) Tensor with no activation applied
        output = self.fc3(f6)
        return output
    
net = Net()
print(net)

Net(
  (cn1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (cn2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


We only have to define the forward function; the backward function (where gradients are computed) is automatically defined for us by autograd. We can use any of the Tensor operations in the forward function.

The learnable parameters of a model are returned by `net.parameters()`.

In [7]:
params = list(net.parameters())
print(len(params))
print(params[0].size()) # cn1's weight

10
torch.Size([6, 1, 5, 5])
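
The count of 10 corresponds to one weight tensor and one bias tensor for each of the five layers. The sketch below lists them by name and adds up the number of individual values. To keep it self-contained, it re-declares only the layers of `Net` (the forward function is omitted because it is not needed for counting parameters).

```python
import torch.nn as nn

class Net(nn.Module):
    """Layers only, mirroring the Net defined above; forward is omitted in this sketch."""
    def __init__(self):
        super().__init__()
        self.cn1 = nn.Conv2d(1, 6, 5)
        self.cn2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

net = Net()

# named_parameters() yields (name, tensor) pairs: a weight and a bias per layer.
for name, param in net.named_parameters():
    print(f"{name:12s} shape={tuple(param.shape)} values={param.numel()}")

num_tensors = len(list(net.parameters()))
total_params = sum(p.numel() for p in net.parameters())
print(num_tensors)   # 10
print(total_params)  # 61706
```

Most of the 61,706 values sit in `fc1` (400 x 120 weights plus 120 biases), which is typical for networks that flatten feature maps into a dense layer.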


The expected input size of this network (LeNet) is 32x32, so inputs of any other size need to be resized first.

In [8]:
# Let's try some random 32x32 input.
input = torch.randn(1,1,32,32) # (batch_size, no_of_channels, height, width)
out = net(input)
print(out)

tensor([[-0.0311, -0.0800, -0.0268,  0.0915, -0.0534,  0.0446, -0.0026,  0.0173,
         -0.0836,  0.0089]], grad_fn=<AddmmBackward0>)
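
Since the network expects 32x32 inputs, images of another size have to be adapted first. The sketch below shows two possible ways, assuming random stand-in images: zero-padding 28x28 MNIST-sized images with `F.pad`, and downscaling larger images with `F.interpolate`. Which option is appropriate depends on the data; both are illustrations rather than the book's preprocessing.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of 4 MNIST-sized grayscale images.
mnist_like = torch.rand(4, 1, 28, 28)

# Zero-pad 2 pixels on every side: F.pad takes (left, right, top, bottom)
# for the last two dimensions, so 28 + 2 + 2 = 32 in both height and width.
padded = F.pad(mnist_like, (2, 2, 2, 2), mode="constant", value=0.0)

# Alternative for larger images: resize them down to 32x32.
larger = torch.rand(4, 1, 64, 64)
resized = F.interpolate(larger, size=(32, 32), mode="bilinear", align_corners=False)

print(padded.shape)   # torch.Size([4, 1, 32, 32])
print(resized.shape)  # torch.Size([4, 1, 32, 32])
```

Padding keeps the original pixels untouched in the center of the image, whereas interpolation changes every pixel value.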
