In [80]:
#hide
from fastai import *

In the previous chapters we learnt about almost every aspect of training a model: callbacks,
different optimizers, creating custom models and so on. In this chapter we focus on building
things from scratch. Most of what we do here has already been covered, but this time the emphasis
is on the implementation rather than on the practical meaning.

# A Neural Net from the Foundations

We start with basic tensor indexing, then move on to neural nets, then implement
backpropagation and see how the loss is calculated with PyTorch. We also learn about PyTorch's
autograd package, which calculates gradients automatically.

## Building a Neural Net Layer from Scratch

Since we are building neural nets from scratch and implementing the loss calculation and the
forward and backward propagation ourselves, we start by revising some linear algebra basics.
Initially we implement everything in plain Python; later we replace it with PyTorch functionality
and see how many lines of code a single call can replace. Some of this will feel like repetition,
but this chapter focuses on the implementation. Let's start with some theory for now...

### Modeling a Neuron

A neuron receives a number of inputs and has one weight for each of them. Each input is multiplied
by its weight, the products are summed, and a bias is added. Mathematically this can be written as:
    $$ out = \sum_{i=1}^{n} x_{i} w_{i} + b$$

Here $x_{1},\dots,x_{n}$ are the inputs, $w_{1},\dots,w_{n}$ are the weights and $b$ is the bias.
In code this can be written as:
    
output = sum([x*w for x,w in zip(inputs,weights)]) + bias


A nonlinear function known as the "activation function" is then applied to this output before
it is passed on to another neuron. The most common choice is the rectified linear unit, usually
called "ReLU", which sets every value below zero to zero. It can be written as:

def relu(x): 
    return x if x >= 0 else 0
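
As a minimal sketch, the two snippets above can be combined into a single neuron. The function name `neuron` and the sample numbers are illustrative assumptions, not part of the original notebook.

```python
def relu(x):
    # zero out negative values, pass non-negative values through
    return x if x >= 0 else 0

def neuron(inputs, weights, bias):
    # weighted sum of the inputs plus the bias, followed by the activation
    output = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(output)

inputs = [1.0, 2.0, 3.0]
pos = neuron(inputs, [0.5, 1.0, 0.25], 0.1)   # 0.5 + 2.0 + 0.75 + 0.1 = 3.35
neg = neuron(inputs, [0.5, -1.0, 0.25], 0.1)  # 0.5 - 2.0 + 0.75 + 0.1 = -0.65, clipped to 0
print(pos, neg)
```

The second call shows the effect of the activation: the weighted sum is negative, so the neuron outputs zero.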

A deep learning model consists of many such neurons stacked in layers. A first layer is created
with a certain number of neurons (the hidden size), and every input is connected to each of these
neurons. Such a layer is called a fully connected layer or a linear layer. For each neuron we then
compute the dot product of the inputs and that neuron's weights:

sum([x*w for x,w in zip(input,weight)])

Doing this for every neuron and every input in a batch is a matrix multiplication. Say the input x
has size batch_size X n_inputs and the weight matrix w has size n_neurons X n_inputs, since every
neuron has as many weights as there are inputs. The biases are in a vector (a 1D tensor) of size
n_neurons, one bias per neuron. The output is then:
    
    y=x @ w.t()+b
    
Here @ is the matrix product and .t() takes the transpose of the matrix w. y then has size
batch_size X n_neurons. Mathematically, with n = n_inputs and w standing for the transposed weight
matrix, this is:
    
$$y_{i,j} = \sum_{k=1}^{n} x_{i,k} w_{k,j} + b_{j}$$

The transpose is needed because, mathematically, the coefficient at position (i,j) of the product
m @ n is given by:

sum([a * b for a,b in zip(m[i,:],n[:,j])])
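
A small check of the shapes involved may help here. The sizes below (a batch of 4, 3 inputs, 5 neurons) are arbitrary assumptions chosen for illustration; the sketch recomputes one entry of `y` by hand to show that row `i` of `x` is paired with the weights of neuron `j`.

```python
import torch

torch.manual_seed(0)
batch_size, n_inputs, n_neurons = 4, 3, 5
x = torch.randn(batch_size, n_inputs)
w = torch.randn(n_neurons, n_inputs)   # one row of weights per neuron
b = torch.randn(n_neurons)             # one bias per neuron

y = x @ w.t() + b                      # shape: batch_size x n_neurons

# Recompute a single entry by hand: row i of x against the weights of neuron j
i, j = 2, 3
manual = sum(xk * wk for xk, wk in zip(x[i, :], w[j, :])) + b[j]
print(y.shape, torch.isclose(y[i, j], manual, atol=1e-6))
```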

So the basic operation behind every layer is matrix multiplication. Let's look at it in more
detail...

### Matrix Multiplication from Scratch

We write a function that multiplies two tensors and returns the product using plain Python.
PyTorch can do this directly, but writing it ourselves shows how it works. The only PyTorch
feature we use here is tensor indexing.

In [5]:
#Importing PyTorch
import torch
from torch import tensor

We define a function matmul that takes the two matrices a and b to be multiplied. We store a's
numbers of rows and columns in ar and ac, and b's in br and bc, and check that a has as many
columns as b has rows. We then create a matrix of zeros c of size ar X bc to hold the product.

The function uses three nested loops: one over the row indices, one over the column indices and
one for the inner sum. Inside the loops, element c[i,j] accumulates the products of the
corresponding elements of row i of a and column j of b. Finally we return the product.

In [6]:
#Function for matrix multiplication
def matmul(a,b):
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    assert ac==br#check if the a's columns are equal to b's rows
    c = torch.zeros(ar, bc)#initialize a zero matrix for product
    for i in range(ar):#looping through a's rows
        for j in range(bc):#b's columns
            for k in range(ac): #a's columns
                c[i,j] += a[i,k] * b[k,j]#matrix multiplication
    return c #return product
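
Before timing the function, it is worth checking that it returns the right values. The sketch below repeats the function so that it runs on its own and multiplies two small matrices whose product is easy to verify by hand; the matrices are made-up examples.

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br                      # inner dimensions must agree
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac):
                c[i, j] += a[i, k] * b[k, j]
    return c

a = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])   # 3 x 2
b = torch.tensor([[1., 0., 2.], [0., 1., 3.]])     # 2 x 3
product = matmul(a, b)                              # 3 x 3
expected = torch.tensor([[1., 2., 8.], [3., 4., 18.], [5., 6., 28.]])
print(torch.equal(product, expected), torch.equal(product, a @ b))
```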

To try this function out we create two random matrices. Their sizes mimic a batch of 5 MNIST
images of 28 X 28 pixels, flattened to 784 values each, and a linear model that maps them to 10
activations.

In [7]:
#creating random matrices for matrix multiplication
m1 = torch.randn(5,28*28)#5 MNIST images of size 28 X 28 pixels
m2 = torch.randn(784,10)

Next we pass both matrices through matmul to multiply them. We use Jupyter's magic command
%time, which reports the time taken by the computation.

In [8]:
%time t1=matmul(m1, m2) #time taken for multiplication through function

Wall time: 1.44 s


We can now use PyTorch's built-in @ operator and time the same multiplication.

In [9]:
%timeit -n 20 t2=m1@m2 #multiplication using @ method

The slowest run took 12.52 times longer than the fastest. This could mean that an intermediate result is being cached.
194 µs ± 291 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


With our own function the multiplication takes over a second, but with @ it takes a few hundred
microseconds. Three nested Python loops are therefore a poor way to multiply matrices: in this
run PyTorch's @ is several thousand times faster, and it can be faster still on a GPU.

This is because PyTorch implements its operations in C++ to make them fast. To benefit from that
speed we need to express our computations as operations on whole tensors, which is where
elementwise arithmetic and broadcasting come in.

Let us learn more about the elementwise operations on matrices in PyTorch.

### Elementwise Arithmetic

All the basic operators such as +, -, *, /, %, ==, < and > are applied elementwise in PyTorch.
So a+b adds the corresponding elements of a and b. The two tensors should therefore have the
same shape, and the result is a tensor of that shape containing the sums.

In [10]:
#addition of two tensors
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a + b

tensor([12., 14.,  3.])

< and > are comparison operators. Applied to two tensors of equal shape, they return a tensor
whose elements are True or False.

In [11]:
#Comparing elements using Boolean operators
a < b

tensor([False,  True,  True])

Some operations reduce a tensor with several elements to a tensor with a single element. Examples
are all(), sum() and mean(). To get that element out of the rank-0 tensor as a plain Python
value we use .item().

In [12]:
#Reduction operations on tensors
(a < b).all(), (a==b).all()

(tensor(False), tensor(False))

In [13]:
#Getting single element from reduction operation using .item() method.
(a + b).mean().item()

9.666666984558105

Elementwise operations can be applied to tensors of any rank, provided they have the same shape.
Let's also see what happens when the shapes differ.

In [14]:
#multiplying the tensors of same rank and shape
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
m*m

tensor([[ 1.,  4.,  9.],
        [16., 25., 36.],
        [49., 64., 81.]])

In [17]:
#Defining another tensor of a different shape
n = tensor([[1., 2, 3], [4,5,6]])
m*n #doesn't work as the shapes do not match

RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

In [18]:
m.shape,n.shape

(torch.Size([3, 3]), torch.Size([2, 3]))

m * n is not possible because the shapes of the tensors do not match. Elementwise operations
still let us speed up matmul: for each pair (i,j) we multiply the ith row of a by the jth column
of b elementwise and sum the result. This runs faster because the innermost loop is now executed
by PyTorch instead of Python.

To access one column or one row of a tensor we use a[:,j] or a[i,:]. The : means all elements
along that dimension. We can also select a range with i:j, which takes the elements from i up to
but not including j. Instead of a[i,:] we can simply write a[i]. Let's use this to remove the
third loop from the function and see how much time it saves...

In [19]:
#Function matmul after removing the inner for loop
def matmul(a,b):#pass two matrices
    ar,ac = a.shape#row and columns of a 
    br,bc = b.shape#row and columns of b
    assert ac==br#test if a's columns are equal to b's rows
    c = torch.zeros(ar, bc)#initalize product matrix c
    for i in range(ar):#iterate through a's rows
        for j in range(bc):#iterate through b's columns
            c[i,j] = (a[i] * b[:,j]).sum()#multiplying and summing up using.sum()
    return c #return c

Next we use the magic command %timeit to measure how long this function takes on m1 and m2.

In [20]:
%timeit -n 20 t3 = matmul(m1,m2)

1.85 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


This is already several hundred times faster than the version with the extra for loop. Using
broadcasting we can remove the remaining loops and speed the function up further.

### Broadcasting

Broadcasting is a term we came across earlier. It describes how operations are performed between
tensors of different shapes. We obviously cannot add a 3 X 3 matrix to a 4 X 5 matrix, but we can
add a scalar to a matrix, and a vector of size 3 can be added to a matrix of size 4 X 3 (or to a
3 X 4 matrix once the vector is given shape 3 X 1). Let's see how this works..

Broadcasting provides rules for checking whether the shapes of two tensors are compatible for an
operation. The tensor with the smaller shape is, in effect, expanded to match the tensor with the
bigger shape. Let's understand these rules through some examples..

#### Broadcasting with a scalar

This is the simplest case of broadcasting. When one operand is a scalar and the other is a
tensor, the scalar is treated as a tensor of the same shape as the given tensor, filled with that
scalar, and the operation is then performed elementwise.

In [21]:
#Comparing tensor with a scalar
a = tensor([10., 6, -4])
a > 0

tensor([ True,  True, False])

Here 0 is broadcast to the shape of a and then compared with each element of a, so the output has
the same shape as a. This is also very useful when normalizing a dataset by subtracting the mean
and dividing by the standard deviation.

In [22]:
#Normalizing a tensor using broadcasting
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
(m - 5) / 2.73

tensor([[-1.4652, -1.0989, -0.7326],
        [-0.3663,  0.0000,  0.3663],
        [ 0.7326,  1.0989,  1.4652]])

If the means differ from row to row, we need to broadcast a vector to a matrix.

#### Broadcasting a vector to a matrix

In [23]:
#broadcasting vector to a matrix
c = tensor([10.,20,30])#vector
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])#matrix
m.shape,c.shape#their shapes

(torch.Size([3, 3]), torch.Size([3]))

In [24]:
#adding them using broadcasting
m + c

tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])

c has shape 3, which is treated as 1 X 3, and m has shape 3 X 3. So c is expanded as if it had
three identical rows. No actual copy is created; the addition simply behaves that way. The
expand_as method shows what the expanded tensor looks like.

In [25]:
#Expanding vector using expand_as
c.expand_as(m)

tensor([[10., 20., 30.],
        [10., 20., 30.],
        [10., 20., 30.]])

The .storage() method shows the data a tensor actually holds in memory, so we can check that no
extra copies were made.

In [26]:
#expanding the tensor
t = c.expand_as(m)
t.storage()#getting storage using the storage method

 10.0
 20.0
 30.0
[torch.FloatStorage of size 3]

Though t is of size 3 X 3, it takes up the space of only 3 floats. This is because its first
dimension has a stride of zero: moving to the next row does not move in memory. We can check
the stride using the .stride() method.

In [27]:
#stride and shape of expanded tensor
t.stride(), t.shape

((0, 1), torch.Size([3, 3]))
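
One way to convince ourselves that the expanded tensor is only a view is to modify the original vector and watch the view change. This is a small sketch that uses `data_ptr()` rather than `.storage()` to compare memory addresses.

```python
import torch

c = torch.tensor([10., 20, 30])
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])
t = c.expand_as(m)

same_memory = t.data_ptr() == c.data_ptr()   # the view starts at the same address as c
c[0] = 99.                                    # modifying c in place ...
first_column = t[:, 0].tolist()               # ... changes every "row" of the view
print(same_memory, t.stride(), first_column)
```

Because the stride of the first dimension is 0, all three rows of `t` read the same three floats.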

Here c was matched against the last dimension of m. That is a convention of the broadcasting
rules and does not depend on the order of the operands, so swapping them gives the same result.

In [28]:
#Broadcasting other side
c + m

tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])

A vector of size n can be broadcast with a matrix of size m X n.

In [29]:
#Broadcasting 
c = tensor([10.,20,30])#vector
m = tensor([[1., 2, 3], [4,5,6]])#multidimensional tensor
c+m

tensor([[11., 22., 33.],
        [14., 25., 36.]])

Above, c has size 3 and m has size 2 X 3, so broadcasting is possible.

In [30]:
#Broadcasting not possible
c = tensor([10.,20])
m = tensor([[1., 2, 3], [4,5,6]])
c+m
#c shape is 2,m shape is 2 X 3,so not compatible for broadcasting

RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1

To broadcast along the other dimension, the vector must have shape 3 X 1. To add a unit
dimension we use PyTorch's unsqueeze method, which we also used earlier when implementing a CNN.

In [31]:
#Adding unit dimension to a vector 
c = tensor([10.,20,30])# vector
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])#tensor
c = c.unsqueeze(1)#unsqueeze method
m.shape,c.shape#shape of the tensors

(torch.Size([3, 3]), torch.Size([3, 1]))

The vector now has shape 3 X 1, so it is expanded across the columns.

In [32]:
#applying broadcasting during addition
c+m

tensor([[11., 12., 13.],
        [24., 25., 26.],
        [37., 38., 39.]])

As before, the expanded tensor has 9 elements but only 3 elements of storage.

In [33]:
#checking the storage of the expanded tensor
t = c.expand_as(m)
t.storage()#3 float elements stored

 10.0
 20.0
 30.0
[torch.FloatStorage of size 3]

This time the expanded tensor has a stride of 0 in the column dimension, which we can check with
the .stride() method.

In [34]:
#checking the stride and shape of the expanded tensor
t.stride(), t.shape

((1, 0), torch.Size([3, 3]))

When broadcasting needs extra dimensions, they are added at the beginning by default, which is
why c was treated as 1 X 3 earlier. With unsqueeze we choose the index at which the unit dimension
is added. Let's see how it works...

In [35]:
#How unsqueeze method adds unit dimension at the required position
c = tensor([10.,20,30])
c.shape, c.unsqueeze(0).shape,c.unsqueeze(1).shape

(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

The unsqueeze method can also be replaced by None indexing: we put None at the position where
we want the new dimension when indexing the tensor, for example a[None,:].

In [36]:
#Adding None indexing to add dimension at the beginning or end.
c.shape, c[None,:].shape,c[:,None].shape

(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

When indexing, trailing colons can be omitted, so c[None] is enough. The ellipsis ... stands for
all the preceding dimensions.

In [37]:
#Omitting the trailing colons
c[None].shape,c[...,None].shape

(torch.Size([1, 3]), torch.Size([3, 1]))
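
As a small illustration of why these unit dimensions are useful, the sketch below builds an outer product with nothing but None indexing and broadcasting. The variable name `outer` is an assumption made for this example.

```python
import torch

c = torch.tensor([10., 20, 30])

# shapes (3, 1) and (1, 3) are broadcast against each other to give (3, 3)
outer = c[:, None] * c[None, :]
print(outer)
# tensor([[100., 200., 300.],
#         [200., 400., 600.],
#         [300., 600., 900.]])
```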

To make the multiplication faster we can now remove one more for loop: instead of multiplying
a[i] with a single column b[:,j], we multiply a[i] with the whole of b via broadcasting.

In [38]:
#Removing one more for loop using the unsqueeze method
def matmul(a,b):# a and b are passed
    ar,ac = a.shape#a's rows and columns
    br,bc = b.shape#b's rows and columns
    assert ac==br#check if the a's columns and b's rows are equal
    c = torch.zeros(ar, bc)#initialze the final product to zero matrix.
    for i in range(ar):
#       c[i,j] = (a[i,:]          * b[:,j]).sum() # previous
        c[i]   = (a[i  ].unsqueeze(-1) * b).sum(dim=0)
    return c#returning the product

Next we time matmul again to see how much faster it has become after removing another for loop..

In [39]:
#time taken for multiplication
%timeit -n 20 t4 = matmul(m1,m2)

496 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


After removing one more for loop, the function has become several times faster again. Let's learn
more about the broadcasting rules...
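
Speed only matters if the faster versions still give the right answer. The sketch below places the two-loop and one-loop versions side by side and compares both against @. The names `matmul_rows_cols` and `matmul_broadcast` are assumptions introduced here so that both versions can coexist; the notebook calls each of them matmul.

```python
import torch

def matmul_rows_cols(a, b):
    # two Python loops; the inner sum is done by PyTorch
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] = (a[i] * b[:, j]).sum()
    return c

def matmul_broadcast(a, b):
    # one Python loop; each row of a is broadcast against the whole of b
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        c[i] = (a[i].unsqueeze(-1) * b).sum(dim=0)
    return c

torch.manual_seed(0)
m1 = torch.randn(5, 28 * 28)
m2 = torch.randn(784, 10)
reference = m1 @ m2

# float32 sums are accumulated in a different order, so compare with a tolerance
ok_loops = torch.allclose(matmul_rows_cols(m1, m2), reference, rtol=1e-4, atol=1e-3)
ok_broadcast = torch.allclose(matmul_broadcast(m1, m2), reference, rtol=1e-4, atol=1e-3)
print(ok_loops, ok_broadcast)
```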

#### Broadcasting rules

Whenever an operation is performed between two tensors, PyTorch compares their shapes dimension
by dimension, starting from the trailing dimensions. Two dimensions are compatible when they are
equal, or when one of them is 1, in which case it is broadcast to match the other one.

The tensors do not need to have the same number of dimensions. For example, take an image stored
as a 256 X 256 X 3 array of RGB values. To scale each color channel we can multiply the image by a
vector with 3 values, and broadcasting gives a result of shape 256 X 256 X 3. A vector with 2
elements, however, is not compatible. This is also why, earlier, a 1D tensor of size 3 combined
with a 3 X 3 matrix was broadcast over the rows.
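
The rules above can be sketched in a few lines of plain Python. The helper `broadcast_shape` is hypothetical (it is not a PyTorch function); it only mirrors the rule as stated here, and the last lines compare it with what PyTorch actually does.

```python
import torch
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rules to two shapes."""
    result = []
    # compare dimensions from the trailing end; missing dimensions count as 1
    for da, db in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"dimensions {da} and {db} are not compatible")
    return tuple(reversed(result))

image_case = broadcast_shape((256, 256, 3), (3,))   # image times per-channel vector
column_case = broadcast_shape((3, 1), (1, 4))       # both operands get expanded
actual = tuple((torch.zeros(256, 256, 3) * torch.zeros(3)).shape)
print(image_case, column_case, image_case == actual)
```

Calling `broadcast_shape((2, 3), (2,))` raises a ValueError, matching the RuntimeError PyTorch raised for the size-2 vector earlier.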

Let's learn about Einstein summation now..


### Einstein Summation

Apart from @ and torch.matmul, PyTorch offers one more way to implement matrix multiplication:
Einstein summation, or einsum. In this notation a matrix product is written as:
    
    ik,kj -> ij

Here we multiply two tensors of size (i X k) and (k X j), and the product has size (i X j). This
is the same rule for matrix sizes that we know from high school mathematics. In PyTorch the
notation is available as torch.einsum. The rules of Einstein summation are as follows:

1. Repeated indices on the left side are implicitly summed over if they do not appear on the right side.
2. Each index can appear at most twice in any term on the left side.

Here k is repeated, so we sum over k. The element of the product at position (i,j) is the sum over
k of the coefficients (i,k) of the first tensor multiplied by the coefficients (k,j) of the second,
which is exactly the matrix product. In PyTorch it is implemented as follows:
    

In [40]:
#Matrix multiplication using PyTorch's Einstein summation
def matmul(a,b): 
    #the first argument gives the indices of the operands and of the product,
    #a and b are the matrices to be multiplied
    return torch.einsum('ik,kj->ij', a, b)

Let's see how much time it takes for computation.

In [41]:
#Time taken for einstein summation
%timeit -n 20 t5 = matmul(m1,m2)

The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached.
180 µs ± 168 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


This is far faster than the loop-based functions we wrote earlier and, in this run, comparable to
the built-in @ operator. einsum is thus a fast and flexible way to write custom tensor operations
in PyTorch. We have learnt enough about matrix multiplication to build the neural net now, so
let's implement the forward and backward passes...
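
The same notation covers more than plain matrix products. As a brief sketch, the index strings below express a transpose and a batched matrix multiplication; the tensor names and sizes are arbitrary assumptions for illustration.

```python
import torch

torch.manual_seed(0)
a = torch.randn(5, 4)
b = torch.randn(4, 3)

# plain matrix product: k is repeated on the left and absent on the right, so it is summed
same_as_at = torch.allclose(torch.einsum('ik,kj->ij', a, b), a @ b, atol=1e-5)

# no repeated index: the output simply reorders the dimensions (a transpose)
transposed = torch.einsum('ij->ji', a)

# batched matrix product: b appears on both sides, so it is kept, while k is summed
xb = torch.randn(2, 5, 4)
yb = torch.randn(2, 4, 3)
batched = torch.einsum('bik,bkj->bij', xb, yb)
print(same_as_at, transposed.shape, batched.shape)
```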

## The Forward and Backward Passes

A forward pass followed by a backward pass makes up one training step. The forward pass is the
computation in which the output is calculated from the input values while passing through the
layers. A loss is then calculated from the outputs and the targets. During the backward pass the
gradients of the loss are computed from the last layer back to the first, and gradient descent
uses them to update the weights.

Thus we calculate the outputs with matrix products during the forward pass, and the gradients in
the backward pass. Next we define the neural net and also learn how to initialize the weights
properly.

### Defining and Initializing a Layer

In the previous chapter, while working with the MNIST sample, we tried a simple two-layer model.
We start with that here as well. The first layer is a simple linear layer computing wx+b, where
x is the input, w the weights and b the bias. So we define a function lin which takes the
input (x), weights (w) and bias (b) as arguments and returns x @ w + b, where @ is matrix
multiplication, since x and w are 2D tensors and b is a 1D vector.

In [42]:
#defining first linear layer of the neural net
def lin(x, w, b): #inputs,weights,bias
    return x @ w + b #return wx+b

Earlier we saw that the output is passed through a nonlinear "activation function" to add
nonlinearity to the model. One of the most common activation functions for inner layers in deep
learning is ReLU, which returns the maximum of 0 and x. Before passing inputs through the model,
let's create a random set of inputs and targets. Random values are enough because we will not
actually train the model here.
We create a tensor of size 200 X 100 as input and a vector of 200 values as targets.

In [43]:
#Creating random inputs and output
x = torch.randn(200, 100)#input 200 X 100
y = torch.randn(200)#output

Since we are creating a two-layer model, we have two sets of parameters: 2 weight matrices and 2
bias vectors. The first weight matrix has size 100 X 50, since the hidden size is 50, and the
second has size 50 X 1, since the output size is 1 (the output is a single float). The weight
matrices are initialized randomly and the biases are initialized to zero.

In [44]:
#Initializing weight matrices and bias vectors for two layers
w1 = torch.randn(100,50)#weight matrix for layer 1(100 X hidden size)
b1 = torch.zeros(50)#bias vector for layer 1
w2 = torch.randn(50,1)#weight matrix for layer 2
b2 = torch.zeros(1)#bias vector for layer 2

Let's pass the inputs,weight matrix and bias through the first linear layer by calling the 
function lin.

In [45]:
#passing input,weight matrix and bias through the first linear layer 
l1 = lin(x, w1, b1)#input,weight matrix,bias
l1.shape 

torch.Size([200, 50])

The first layer returns an output of size 200 (the batch size) by 50 (the hidden size). There is
a problem with the way we initialized the parameters, though. To see it, let's look at the mean
and standard deviation of the output of the first layer.

In [46]:
#mean and standard deviation of the output from first layer
l1.mean(), l1.std()

(tensor(0.0284), tensor(10.0478))

The mean is close to zero, which is expected since the inputs and weights both have means close
to zero. The standard deviation, which measures how far the values spread from the mean, has gone
from 1 to about 10 in just one layer. This is a problem. Networks can have hundreds of layers,
and if each one multiplies the scale of the activations by 10, the values quickly exceed the range
of numbers a computer can represent.

Let's multiply the input by a random 100 X 100 matrix 50 times and see whether the result is
still representable..

In [47]:
#Repeatedly multiplying the input by random 100 x 100 matrices
x = torch.randn(200, 100)#initalizing input
for i in range(50):#iterating 50 times
    x = x @ torch.randn(100,100)#multiply by 100 X 100 random matrix
x[0:5,0:5]

tensor([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])

The resulting matrix is full of nans. We could use smaller weights instead, but then the
activations keep shrinking, and after repeated multiplication through many layers they become
zero.

Let's try using a small scale then and see what we have in output..

In [48]:
#Using smaller scale of parameters 
x = torch.randn(200, 100)
for i in range(50): #iterating 50 times assuming 50 layers
    x = x @ (torch.randn(100,100) * 0.01)#multiplying by 0.01 for lower scale 
x[0:5,0:5]

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

With the smaller scale we got all zeros. So the weight matrices must be scaled such that the
activations keep a standard deviation close to 1. Xavier Glorot and Yoshua Bengio showed in their
research that the right scale for a layer is 1/sqrt(Nin), where Nin is the number of inputs to the
layer.
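
A brief sketch of where this scale comes from, under the simplifying assumptions (not stated in the notebook) that the inputs $x_{k}$ and weights $w_{k}$ are independent, have zero mean, and that the inputs have unit variance:

$$\mathrm{Var}\left(\sum_{k=1}^{N_{in}} x_{k} w_{k}\right) = \sum_{k=1}^{N_{in}} \mathrm{Var}(x_{k})\,\mathrm{Var}(w_{k}) = N_{in}\,\sigma_{w}^{2}$$

Requiring this variance to equal 1 gives $\sigma_{w} = 1/\sqrt{N_{in}}$. It is also consistent with the unscaled first layer above, where $\sigma_{w} = 1$ and $N_{in} = 100$ gave a standard deviation of about 10.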

Here the number of inputs to the layer is 100, so the scale is 1/sqrt(100) = 0.1. Let's try
scaling by 0.1.

In [49]:
#scaling the weight matrix  by 0.1
x = torch.randn(200, 100)#initializing the input matrix
for i in range(50): #iterating over 50 layers
    x = x @ (torch.randn(100,100) * 0.1)#scaling the weight matrix by 0.1
x[0:5,0:5]

tensor([[ 1.3485e+00,  1.1395e+00,  1.5915e-03, -6.0329e-02, -5.0502e-01],
        [ 2.7586e+00, -1.9528e-01,  1.1850e-02, -1.3599e+00,  8.3320e-02],
        [ 3.2456e+00,  5.3009e-01, -1.3298e-01, -7.4117e-01, -6.1467e-01],
        [-1.4938e+00, -2.0137e-02,  2.0974e-02,  1.9033e-01,  4.9550e-01],
        [ 3.5250e-01,  1.2939e-01, -7.8426e-02,  1.8014e-01, -2.8594e-01]])

Now after scaling as per the scaling rule we don't have any zeros or nans in the output.Let's see
what is the standard deviation of these activations..

In [50]:
#standard deviation of the activations
x.std()

tensor(0.8335)

So finally the activations stay in a stable range, with a standard deviation close to 1. We saw
that changing the scale from 0.1 by a factor of 10 in either direction gives either vanishing or
exploding activations, so proper initialization of the weight matrices is very important. Let's
define our inputs again and proceed with the neural net.

In [51]:
#Initializing the inputs and outputs
x = torch.randn(200, 100)
y = torch.randn(200)

Next we also initialize the weight matrices and bias vectors for both layers, scaling the weight
matrices according to this rule..

In [52]:
#initializing weight matrix and bias vectors for both layers
from math import sqrt
w1 = torch.randn(100,50) / sqrt(100)#scaling by dividing by sqrt(100), i.e. multiplying by 0.1
b1 = torch.zeros(50)
w2 = torch.randn(50,1) / sqrt(50)#scaling by dividing by sqrt(50)
b2 = torch.zeros(1)

Let's pass it through the first layer then.We also see the mean and standard deviation of the 
activations from the layer..

In [53]:
#passing the input,weight and bias through the first layer..
l1 = lin(x, w1, b1)#input(x),weight matrix(w1),bias(b1)
l1.mean(),l1.std()#mean and standard deviation of the activations

(tensor(0.0085), tensor(1.0155))

The mean of the activations is now nearly zero and the standard deviation is close to 1, which
is what we wanted. As mentioned earlier, the linear activations are then passed through a
nonlinear activation function. We use ReLU here, the most common activation function in deep
learning, which replaces the negatives with zero.

In [54]:
#function for relu activation
def relu(x): 
    return x.clamp_min(0.)#clamp_min zeros all negatives

Next we pass the linear activations through relu function to get ReLU activation.Let's also check
the mean and standard deviation of these activations..

In [55]:
#Passing linear activations through relu
l2 = relu(l1)
l2.mean(),l2.std()#mean and standard deviation of non-linear activations

(tensor(0.4074), tensor(0.5965))

The mean of the activations has moved up to about 0.4 and the standard deviation has dropped to
about 0.6. The mean increased because ReLU removed the negative values. Let's check whether,
after many layers, the activations again shrink to zero.

In [56]:
#Passing the non-linear activations through some layers to see if they decrease to zero.
x = torch.randn(200, 100)#Initializing the input matrix
for i in range(50): #passing through 50 layers
    x = relu(x @ (torch.randn(100,100) * 0.1))#Multiplying by weight matrix scaled by 0.1 and
    #relu activated
x[0:5,0:5]

tensor([[2.5217e-09, 1.5451e-08, 2.6013e-08, 2.1594e-08, 1.0393e-08],
        [4.4534e-09, 2.3102e-08, 3.9754e-08, 3.3959e-08, 1.6925e-08],
        [0.0000e+00, 1.5987e-08, 2.6886e-08, 1.8555e-08, 1.2870e-08],
        [2.5492e-09, 2.6981e-08, 4.7825e-08, 3.4117e-08, 1.7306e-08],
        [2.7717e-09, 1.6132e-08, 3.1019e-08, 2.1887e-08, 1.2364e-08]])

Again the activations are nearly zero. The scaling rule above was derived for tanh as the
activation function, not ReLU. For ReLU, later research by Kaiming He et al. showed that the right
scale is sqrt(2/Nin), where Nin is the number of inputs. Let's scale the model according to this
rule..

In [57]:
#scaling the ReLU activated model with different rule.
x = torch.randn(200, 100)#initializing inputs
for i in range(50): #through the multiple layers
    x = relu(x @ (torch.randn(100,100) * sqrt(2/100)))#relu activation and weight matrix scaled 
    #by sqrt(2/n)
x[0:5,0:5]

tensor([[0.0000, 0.2721, 0.6038, 0.0147, 0.0000],
        [0.0000, 0.3515, 0.5730, 0.0000, 0.0000],
        [0.0000, 0.2293, 0.5150, 0.0621, 0.0000],
        [0.0000, 0.2389, 0.7302, 0.0000, 0.0000],
        [0.0000, 0.7179, 1.2054, 0.0035, 0.0000]])

This time the activations have not collapsed to zero. So let's use this rule to initialize our
weight matrices. As usual we also initialize the inputs and targets again.

In [58]:
#Initializing inputs and outputs again
x = torch.randn(200, 100)
y = torch.randn(200)

Initializing weight matrix and bias vectors scaled using the sqrt(2/Nin) rule.

In [53]:
#Initializing the weight matrices and bias vectors
w1 = torch.randn(100,50) * sqrt(2 / 100)#scaled by sqrt(2/100)
b1 = torch.zeros(50)#bias for layer1
w2 = torch.randn(50,1) * sqrt(2 / 50)#scaled by sqrt(2/50)
b2 = torch.zeros(1)#bias for layer 2

Let's pass the weight,input and bias vector through the first linear layer.Then we pass the 
returned linear activations through relu function.Next we take the mean and standard deviation of
the relu activations.

In [54]:
#Get the activations
l1 = lin(x, w1, b1)#Linear layer(input,weight,bias)
l2 = relu(l1)#relu activation
l2.mean(), l2.std()#mean and standard deviation of relu activations

(tensor(0.5567), tensor(0.8187))

This is better than before. So let's put the model together. We define a function model which
takes x as input and passes it through the first linear layer, then passes the linear
activations through relu, and finally passes the ReLU activations through the second linear layer.

In [55]:
#Whole model definition
def model(x):
    l1 = lin(x, w1, b1)#first linear layer
    l2 = relu(l1)#relu activatons
    l3 = lin(l2, w2, b2)#second linear layer
    return l3

This is the part of the forward pass in which the output is computed from the input through the
layers. Next we have to calculate the loss from the predictions and the labels. Since the targets
here are random floats, we use the mean squared error. Let's first pass the input through the
model to get the predictions.

Note that the outputs and the targets may not have the same shape.

In [56]:
#Passing inputs through the whole model definition
out = model(x)
out.shape#shape of the output

torch.Size([200, 1])

Our labels are a 1D vector of 200 values, but the output of the model has shape 200 X 1. We
remove the extra dimension from the output using the squeeze method. So we define a function mse
that takes the outputs and the targets, squeezes the outputs, and returns the mean of the squared
differences between outputs and targets.

In [57]:
#defining the mse function for loss 
def mse(output, targ): #pass outputs and targets
    #return the mean of the squared differences between the outputs and the targets
    return (output.squeeze(-1) - targ).pow(2).mean()

Let's pass the output and the random output labelling vector y through mse function to calculate
the loss.

In [58]:
#calculating loss by passing the outputs and labels 
loss = mse(out, y)
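
As a quick sanity check, the hand-written loss can be compared with PyTorch's built-in
`torch.nn.functional.mse_loss`. The sketch below is self-contained and uses hypothetical random
stand-ins (`out`, `y`) with the shapes seen above, not the chapter's actual data.

```python
import torch
import torch.nn.functional as F

def mse(output, targ):
    # Drop the trailing unit dimension, then average the squared differences
    return (output.squeeze(-1) - targ).pow(2).mean()

torch.manual_seed(0)
out = torch.randn(200, 1)  # stand-in for the model predictions, shape [200, 1]
y = torch.randn(200)       # stand-in for the targets, shape [200]

ours = mse(out, y)
builtin = F.mse_loss(out.squeeze(-1), y)
print(ours.item(), builtin.item())  # the two values should agree
```

Without the squeeze, subtracting a [200] tensor from a [200, 1] tensor would broadcast to a
200 X 200 matrix, which is why the extra dimension has to be removed first.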

Thus we have calculated the loss and completed the forward pass. Let's now calculate the
gradients and look at the backward pass.

### Gradients and the Backward Pass

We have already seen PyTorch's autograd package calculate gradients through loss.backward().
Let's look at the mathematics behind it.

We have to calculate the gradients of the loss with respect to the weights of the model. Because
the loss is a composition of several functions, we use the chain rule, the rule of calculus that
tells us how to compute the derivative of a composite function:
  $$(g \circ f)'(x) = g'(f(x)) f'(x)$$

The loss is composed of several functions: the mean squared error, the second linear layer, a
ReLU layer and the first linear layer. For example, if we want the gradient of the loss with
respect to b2, the loss can be written as:
    
    loss = mse(out,y) = mse(lin(l2, w2, b2), y)

The chain rule thus says that:
  $$\frac{d\,loss}{d\,b_{2}} = \frac{d\,loss}{d\,out} \times \frac{d\,out}{d\,b_{2}} = \frac{d\,mse(out, y)}{d\,out} \times \frac{d\,lin(l_{2}, w_{2}, b_{2})}{d\,b_{2}}$$

To get the gradient of the loss with respect to b2, we calculate the gradient of the loss with
respect to the output and multiply it by the gradient of the output with respect to b2. To train
the model we need the gradients of the loss with respect to all the parameters, that is w1, b1,
w2 and b2.
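
As a worked instance of this product, assume n predictions out_i with targets y_i (this index
notation is introduced here only for illustration). The two factors and their product are then:

$$\frac{\partial\, loss}{\partial\, out_{i}} = \frac{2}{n}\,(out_{i} - y_{i}), \qquad \frac{\partial\, out_{i}}{\partial\, b_{2}} = 1$$

$$\frac{\partial\, loss}{\partial\, b_{2}} = \sum_{i=1}^{n} \frac{\partial\, loss}{\partial\, out_{i}}\,\frac{\partial\, out_{i}}{\partial\, b_{2}} = \frac{2}{n}\sum_{i=1}^{n} (out_{i} - y_{i})$$

The first factor is the gradient of the mean squared error, and the sum over i corresponds to
summing the output gradient over the batch dimension.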

To calculate the gradients, we start from the output of the last layer and work backwards towards
the first layer. This is called backpropagation. It can be implemented by writing a backward
function for each of the functions we used: relu, mse and lin. We start with mse_grad, which
calculates the gradient of the loss with respect to the output of the model, which is the input
to the loss function. The derivative of x^2 is 2x, and the mean contributes a factor of 1/n, so
the gradient is 2 X (input - target) divided by the number of elements in the input.

In [59]:
#function for calculating gradient of loss with respect to output of the previous layer
def mse_grad(inp, targ): #inputs and targets passed
    # grad of loss with respect to output of previous layer
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]#2 X(inp-target)/n
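
One way to gain confidence in this formula is to compare it with what autograd computes. The
following is a minimal sketch on hypothetical random data (`out`, `y` are stand-ins, not the
chapter's tensors); it stores the manual gradient in `.g` and the autograd one in `.grad`.

```python
import torch

def mse(output, targ):
    return (output.squeeze(-1) - targ).pow(2).mean()

def mse_grad(inp, targ):
    # 2 * (inp - targ) / n, stored on the tensor as the attribute .g
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]

torch.manual_seed(0)
out = torch.randn(200, 1, requires_grad=True)  # hypothetical predictions
y = torch.randn(200)                           # hypothetical targets

mse(out, y).backward()           # autograd fills out.grad
out_manual = out.detach().clone()
mse_grad(out_manual, y)          # our formula fills out_manual.g

print(torch.allclose(out_manual.g, out.grad))
```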

Next come the gradients of ReLU and the linear layer. For each of them we assume the gradient of
the loss with respect to the layer's output (out.g) is already known, and apply the chain rule to
get the gradient of the loss with respect to the layer's input: inp.g = relu'(inp) X out.g. The
derivative of ReLU is either 0 or 1, so it can be calculated by checking which input values are
positive and multiplying the resulting mask by out.g.

In [60]:
#Function for calculating gradient of the loss with respect to the input of the ReLU layer
def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp>0).float() * out.g#mask of values>0 multiplied by the upstream gradient
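
The mask-and-multiply rule can be checked against autograd in the same way. This is only a
sketch: the tensor sizes and the upstream gradient `l2.g` below are made-up values for
illustration.

```python
import torch

def relu_grad(inp, out):
    inp.g = (inp > 0).float() * out.g

torch.manual_seed(0)
l1 = torch.randn(5, 3)         # hypothetical pre-activations
l2 = l1.clamp_min(0.)          # ReLU output
l2.g = torch.randn(5, 3)       # pretend gradient arriving from the next layer

relu_grad(l1, l2)              # manual gradient, stored in l1.g

# Same computation through autograd
l1_auto = l1.clone().requires_grad_(True)
l1_auto.clamp_min(0.).backward(l2.g)

print(torch.allclose(l1.g, l1_auto.grad))
```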

To calculate the gradients of the loss with respect to the inputs, weights and bias of a linear
layer, we define a function lin_grad that takes the input, output, weights and bias of that
layer.

In [61]:
#Function for calculating gradient of linear layer with respect to weight,bias and the inputs.
def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    # grad with respect to the weights
    w.g = inp.t() @ out.g
    # grad with respect to the bias: sum over the batch dimension
    b.g = out.g.sum(0)
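
It helps to look at the shapes: each gradient must have the same shape as the tensor it belongs
to. The sketch below uses small, arbitrary sizes (4 samples, 3 inputs, 2 outputs, chosen only for
illustration) and compares the three results with autograd.

```python
import torch

def lin_grad(inp, out, w, b):
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

torch.manual_seed(0)
inp, w, b = torch.randn(4, 3), torch.randn(3, 2), torch.zeros(2)
out = inp @ w + b
out.g = torch.randn(4, 2)      # pretend gradient arriving from the next layer

lin_grad(inp, out, w, b)

# Same computation through autograd on copies of the tensors
inp_a, w_a, b_a = (t.clone().requires_grad_(True) for t in (inp, w, b))
(inp_a @ w_a + b_a).backward(out.g)

print(inp.g.shape, w.g.shape, b.g.shape)
print(torch.allclose(inp.g, inp_a.grad),
      torch.allclose(w.g, w_a.grad),
      torch.allclose(b.g, b_a.grad))
```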

We went through the calculus behind the forward and backward passes to understand them well; it
is not necessary to remember all the formulas. Let's now look at an interesting library called
SymPy.

### Sidebar: SymPy

SymPy is a library for symbolic computation, which is handy when working with calculus. Let us
install it first using pip.


In [84]:
#install sympy library using pip
pip install sympy



For symbolic computation, we first define symbols and then perform calculations on them. Let's
see how SymPy works.

In [85]:
#Import symbols and differentiation from sympy
from sympy import symbols,diff
sx,sy = symbols('sx sy')#get the symbols
diff(sx**2, sx)#differentiate the expression

2*sx

Thus SymPy has calculated the derivative of sx**2 as 2*sx. It can also handle more complex,
compound expressions. Most of the gradients we saw in this chapter were derived by hand, but
PyTorch (and SymPy) can calculate them for us.
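
As a small illustration of a compound expression, SymPy can differentiate a composite function
and so apply the chain rule for us. The expression below is an arbitrary example chosen for this
sketch.

```python
from sympy import symbols, diff, simplify

sx = symbols('sx')
expr = (3*sx + 1)**2           # a composite function: square of a linear term
d = diff(expr, sx)             # SymPy applies the chain rule symbolically
print(d)

# By hand the chain rule gives 2*(3*sx + 1)*3; the difference should simplify to 0
print(simplify(d - 6*(3*sx + 1)))
```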

### End sidebar

We have discussed the forward pass and the backward pass in detail. Now we put both into a
single function. The backward pass is simply the forward pass executed in reverse order, with each
gradient function storing its result in the .g attribute of the corresponding tensor. Let's define
the forward_and_backward function, which takes the inputs and the targets. Note that the value of
the loss itself is not needed in the backward pass.

In the forward pass, we pass the input through the first linear layer, the ReLU activation and
then the second linear layer. In the backward pass, we call the gradient functions defined
earlier in reverse order: mse_grad for the loss, lin_grad for the second linear layer, relu_grad
for the ReLU and lin_grad for the first linear layer. Afterwards the gradients of all the model
parameters are available as w1.g, b1.g, w2.g and b2.g.

In [2]:
#Function for forward pass and backward pass
def forward_and_backward(inp, targ):#Passing inputs and targets
    # forward pass:
    
    l1 = inp @ w1 + b1#first linear layer
    
    l2 = relu(l1)#ReLU activation
    
    out = l2 @ w2 + b2#second linear layer
    # we don't actually need the loss in backward!
    loss = mse(out, targ)
    
    # backward pass:
    mse_grad(out, targ)#gradient for loss
    lin_grad(l2, out, w2, b2)#gradient for second linear layer
    relu_grad(l1, l2)#gradient for ReLU activation
    lin_grad(inp, l1, w1, b1)#gradient for first linear layer
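
To check the whole manual backward pass, its results can be compared with autograd. The sketch
below is self-contained; the sizes (200 samples, 100 inputs, 50 hidden units) and the random
tensors are assumptions made for illustration, not necessarily the values used earlier in the
chapter.

```python
import torch

def relu(x): return x.clamp_min(0.)
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

def mse_grad(inp, targ):
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]
def relu_grad(inp, out):
    inp.g = (inp > 0).float() * out.g
def lin_grad(inp, out, w, b):
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

torch.manual_seed(0)
x, y = torch.randn(200, 100), torch.randn(200)   # hypothetical data
w1, b1 = torch.randn(100, 50) * (2 / 100) ** 0.5, torch.zeros(50)
w2, b2 = torch.randn(50, 1) * (2 / 50) ** 0.5, torch.zeros(1)

def forward_and_backward(inp, targ):
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)

forward_and_backward(x, y)     # manual gradients end up in w1.g, b1.g, w2.g, b2.g

# The same model on copies of the parameters, differentiated by autograd
params = [t.clone().requires_grad_(True) for t in (w1, b1, w2, b2)]
aw1, ab1, aw2, ab2 = params
mse(relu(x @ aw1 + ab1) @ aw2 + ab2, y).backward()

matches = [torch.allclose(m.g, a.grad, rtol=1e-4, atol=1e-5)
           for m, a in zip((w1, b1, w2, b2), params)]
print(matches)
```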

Let's now structure the model like a PyTorch module.

### Refactoring the Model

In our model we used three functions (lin, relu and mse), each with an associated gradient
function, and called them separately in the forward and backward passes. This can be made simpler
by creating a class for each of them. Each class stores its inputs and outputs during the forward
pass, so for the backward pass we only have to call backward. We create three classes: Relu, Lin
and Mse.

In [87]:
#Class for Relu layer
class Relu():
    def __call__(self, inp):#input
        self.inp = inp#input
        self.out = inp.clamp_min(0.)#clamped input=output
        return self.out#return output
    
    def backward(self): 
        self.inp.g = (self.inp>0).float() * self.out.g#gradient for Relu layer

"__call__" is a magic method in Python that makes the instances of a class callable. It is
executed when an instance is called like a function, for example y = relu(x) where relu = Relu(),
and not when the instance is created. In the same way we create classes for the linear layer and
the mse loss.
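
The difference between creating an object and calling it can be seen in a tiny, PyTorch-free
sketch. The class name `Doubler` is hypothetical and only serves this illustration.

```python
class Doubler:
    def __init__(self):
        # Runs once, when the object is created with Doubler()
        self.calls = 0

    def __call__(self, value):
        # Runs every time the object is used like a function
        self.calls += 1
        return 2 * value

d = Doubler()        # __init__ runs here, __call__ does not
result = d(21)       # __call__ runs here
print(result, d.calls)
```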

In [88]:
#class for linear layer
class Lin():
    def __init__(self, w, b): #input parameters
        self.w,self.b = w,b
        
    def __call__(self, inp):#executed when the instance is called
        self.inp = inp#input
        self.out = inp@self.w + self.b#output
        return self.out#return output
    
    def backward(self):#gradient calculation steps
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

In [89]:
#class for Mean Squared error(Loss function)
class Mse():
    def __call__(self, inp, targ):#input and targets passed
        self.inp = inp#inputs
        self.targ = targ#target
        self.out = (inp.squeeze() - targ).pow(2).mean()#output calculation
        return self.out#return output
    
    def backward(self):#gradient calculation
        x = (self.inp.squeeze()-self.targ).unsqueeze(-1)
        self.inp.g = 2.*x/self.targ.shape[0]

Next we can put everything together in a model like last time, but this time as a class and not
as a function.

In [90]:
#Model class together including the forward pass and backward pass
class Model():
    def __init__(self, w1, b1, w2, b2):#weights and bias passed
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]#list of layers
        self.loss = Mse()#loss function
        
    def __call__(self, x, targ):#executed when the class is called
        for l in self.layers:#iterating through all layers
            x = l(x)#pass the input through layers,store output
        return self.loss(x, targ)#pass output and targets through the loss and return it
    
    def backward(self):#function for backward pass
        self.loss.backward()#calculate gradient for loss function
        for l in reversed(self.layers): #iterating through reversed layers
            l.backward()#compute gradient for each layer

Next we create the model object by instantiating the Model class, passing it the weights and
biases of both layers.

In [91]:
#Instantiating model
model = Model(w1, b1, w2, b2)

The forward pass is executed by calling the model object with x (inputs) and y (targets).

In [92]:
#Forward pass in the model
loss = model(x, y)

And the backward pass is executed by calling model.backward().

In [93]:
#Backward pass in the model
model.backward()

In this section we wrote the classes Relu, Lin and Mse separately. Next we create a base class
that holds the code they have in common, and make each of them inherit from it.

### Going to PyTorch

Since the Relu, Lin and Mse classes we created earlier are very similar, we create a parent class
and make the three of them inherit from it.

We create LayerFunction as the base class. It declares forward and bwd methods, which the
subclasses override to implement the forward and backward passes.

In [94]:
#Creating base class to inherit the function classes
class LayerFunction():
    def __call__(self, *args):#arguments
        self.args = args
        self.out = self.forward(*args)#arguments passed in forward pass
        return self.out#return output of forward pass
    
    def forward(self): #forward pass 
        raise Exception('not implemented')
    def bwd(self):      
        raise Exception('not implemented')
    def backward(self): #backward pass
        self.bwd(self.out, *self.args)

Let's recreate the Lin, Mse and Relu classes we wrote earlier, but this time inheriting from the
LayerFunction base class. Each subclass implements its own forward and bwd methods.

In [95]:
#Inherited class for Relu
class Relu(LayerFunction):#inherited from layerFunction class
    def forward(self, inp): #input passed
        return inp.clamp_min(0.)#returns the clamped input at zero as output
    def bwd(self, out, inp):#pass input,output through backward 
        inp.g = (inp>0).float() * out.g#calculate the gradient

In [96]:
#Inherited class for Linear layer
class Lin(LayerFunction):#inherited from LayerFunction class
    def __init__(self, w, b): #pass weight and bias
        self.w,self.b = w,b
        
    def forward(self, inp): 
        return inp@self.w + self.b#return Wx+b
    
    def bwd(self, out, inp):#pass input,output through backward
        #calculate gradient for linear layer
        inp.g = out.g @ self.w.t()
        self.w.g = inp.t() @ out.g#use the arguments: LayerFunction does not store self.inp
        self.b.g = out.g.sum(0)

In [97]:
#Inherited class for mse(loss function)
class Mse(LayerFunction):#inherited from LayerFunction class
    def forward (self, inp, targ):#pass predictions and targets
        return (inp.squeeze() - targ).pow(2).mean()#return mean((prediction-target)^2)
    def bwd(self, out, inp, targ): #pass output,input,target
        inp.g = 2*(inp.squeeze()-targ).unsqueeze(-1) / targ.shape[0]#calculate gradient

We have now refactored the three classes so that they share a common base class; the rest of the
model stays the same. This is close to how PyTorch's autograd package works: every basic function
to be differentiated is written as a torch.autograd.Function object that has a forward and a
backward method. PyTorch then keeps a record of the computations we perform so that it can run
the backward pass, unless the requires_grad attribute of the tensors involved is set to False.

Let's write the same thing in PyTorch. It is almost the same as writing the earlier classes. The
difference is that we choose what to save for the backward pass in a context variable (ctx), and
the gradients are returned from the backward method instead of being stored in an attribute.
We very rarely need to write our own Function, but let's see how it can be done.

In [98]:
#creating function object 
from torch.autograd import Function

class MyRelu(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.clamp_min(0.)
        ctx.save_for_backward(i)
        return result
    
    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return grad_output * (i>0).float()
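
A custom Function is not called directly; it is used through its `apply` method. The sketch below
repeats the class so that it runs on its own, and applies it to a small hand-picked tensor
(the input values are chosen only for illustration).

```python
import torch
from torch.autograd import Function

class MyRelu(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.clamp_min(0.)
        ctx.save_for_backward(i)     # keep the input for the backward pass
        return result

    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return grad_output * (i > 0).float()

x = torch.tensor([-2., -0.5, 1., 3.], requires_grad=True)
y = MyRelu.apply(x)      # forward pass through the custom Function
y.sum().backward()       # autograd calls MyRelu.backward for us

print(y)                 # negative entries are clamped to 0
print(x.grad)            # 0 where the input was negative, 1 where it was positive
```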

We rarely need to write a Function like the one above. To build more complex models on top of
such Functions we use torch.nn.Module. It is the base class for all models and for all the
neural nets we have studied so far. Its main feature is that it registers the trainable
parameters for us, so that they can be used later in the training loop.

To implement an nn.Module we follow these steps:
    
1. Call the superclass "__init__" first when the module is initialized.

2. Define the parameters of the model as attributes wrapped in nn.Parameter.

3. Define a forward function that returns the output of the model when an input is passed.

Let's construct a linear layer using PyTorch's nn.Module.

In [6]:
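import torch
import torch.nn as nn #import torch.nn module
from math import sqrt #needed for the weight scaling below

class LinearLayer(nn.Module):#creating linear layer
    def __init__(self, n_in, n_out):#number of inputs and outputs
        super().__init__()#call the superclass __init__ first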
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * sqrt(2/n_in))#initialize weight
    
        self.bias = nn.Parameter(torch.zeros(n_out))#Initialize bias vector 
    
    def forward(self, x):#forward function
        return x @ self.weight.t() + self.bias#return Wx+b

Since we used nn.Parameter to define the parameters, the module keeps a record of them. We can
easily access them through the parameters() method of the object.

In [100]:
#accessing the parameters of the linear layer
lin = LinearLayer(10,2)#Instantiate the object
p1,p2 = lin.parameters()#store the parameters
p1.shape,p2.shape#shape of the weights and bias

(torch.Size([2, 10]), torch.Size([2]))

It is thanks to this feature of nn.Module that an optimizer can loop through the parameters and
update each of them with just opt.step(). Note that PyTorch stores the weights as an n_out X n_in
matrix, which is why the forward pass contains the transpose w.t() to make the dimensions
compatible. Instead of writing the layers and initializing the parameters ourselves, we can now
build the model from PyTorch's own Linear, ReLU and Linear layers, chained with nn.Sequential.
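
To make the role of opt.step() concrete, here is a minimal sketch of a single update. The
optimizer (`torch.optim.SGD`), the learning rate, the layer sizes and the toy loss are all
assumptions chosen for illustration; the chapter itself does not fix them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 2)                        # weight is stored as n_out X n_in
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

before = layer.weight.detach().clone()
loss = layer(torch.randn(4, 10)).pow(2).mean()  # toy loss on random inputs
loss.backward()                                 # fills .grad of every registered parameter
opt.step()                                      # updates every registered parameter
opt.zero_grad()                                 # clears the gradients for the next step

changed = not torch.equal(before, layer.weight.detach())
print(tuple(layer.weight.shape), changed)
```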

In [101]:
#Using PyTorch's nn.Module layers to build the model
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):#Pass number of inputs,hidden states and number of output
        super().__init__()
        self.layers = nn.Sequential(#define the sequential layers
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse#loss function
        
    def forward(self, x, targ):#forward pass 
        return self.loss(self.layers(x).squeeze(), targ)#calculating the loss

fastai provides its own Module as a replacement for nn.Module. It works the same way, except that
we don't need to call super().__init__() as we had to with nn.Module. The rest of the code remains
the same.

In [103]:
#Using fastai's Module to build the model
class Model(Module):
    def __init__(self, n_in, nh, n_out):#Pass number of inputs,hidden states and number of output
        self.layers = nn.Sequential(#define the sequential layers
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse#loss function
        
    def forward(self, x, targ): #forward pass 
        return self.loss(self.layers(x).squeeze(), targ)#calculating the loss

In the next chapter we will start from this model and learn how to write the training loop from
scratch and refactor it.

## Conclusion

In this chapter we concentrated on building the forward and backward steps of the training loop
in different ways. We started from the very basics of deep learning, matrix multiplication, and
step by step reached the point of implementing the forward and backward passes from scratch. We
also used pure PyTorch to build the layers.

Some important takeaways of this chapter are as follows:

1. A neural net consists of matrix multiplications with nonlinear layers in between.

2. Python is slow, so we use techniques such as elementwise arithmetic and broadcasting, which
PyTorch executes in fast compiled code.

3. Initializing a neural net with properly scaled weights is very important, otherwise the
activations can explode or vanish as they pass through the layers.

4. The backward pass involves applying the chain rule many times to calculate the gradients,
starting from the output and moving back one layer at a time.

5. When we subclass PyTorch's nn.Module we must call the superclass "__init__" method inside our
own "__init__", and we must define a forward function that takes the input and returns the
predictions after passing the input through the layers.

## Questionnaire

1. Write the Python code to implement a single neuron.
1. Write the Python code to implement ReLU.
1. Write the Python code for a dense layer in terms of matrix multiplication.
1. Write the Python code for a dense layer in plain Python (that is, with list comprehensions and functionality built into Python).
1. What is the "hidden size" of a layer?
1. What does the `t` method do in PyTorch?
1. Why is matrix multiplication written in plain Python very slow?
1. In `matmul`, why is `ac==br`?
1. In Jupyter Notebook, how do you measure the time taken for a single cell to execute?
1. What is "elementwise arithmetic"?
1. Write the PyTorch code to test whether every element of `a` is greater than the corresponding element of `b`.
1. What is a rank-0 tensor? How do you convert it to a plain Python data type?
1. What does this return, and why? `tensor([1,2]) + tensor([1])`
1. What does this return, and why? `tensor([1,2]) + tensor([1,2,3])`
1. How does elementwise arithmetic help us speed up `matmul`?
1. What are the broadcasting rules?
1. What is `expand_as`? Show an example of how it can be used to match the results of broadcasting.
1. How does `unsqueeze` help us to solve certain broadcasting problems?
1. How can we use indexing to do the same operation as `unsqueeze`?
1. How do we show the actual contents of the memory used for a tensor?
1. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)
1. Do broadcasting and `expand_as` result in increased memory use? Why or why not?
1. Implement `matmul` using Einstein summation.
1. What does a repeated index letter represent on the left-hand side of einsum?
1. What are the three rules of Einstein summation notation? Why?
1. What are the forward pass and backward pass of a neural network?
1. Why do we need to store some of the activations calculated for intermediate layers in the forward pass?
1. What is the downside of having activations with a standard deviation too far away from 1?
1. How can weight initialization help avoid this problem?
1. What is the formula to initialize weights such that we get a standard deviation of 1 for a plain linear layer, and for a linear layer followed by ReLU?
1. Why do we sometimes have to use the `squeeze` method in loss functions?
1. What does the argument to the `squeeze` method do? Why might it be important to include this argument, even though PyTorch does not require it?
1. What is the "chain rule"? Show the equation in either of the two forms presented in this chapter.
1. Show how to calculate the gradients of `mse(lin(l2, w2, b2), y)` using the chain rule.
1. What is the gradient of ReLU? Show it in math or code. (You shouldn't need to commit this to memory—try to figure it using your knowledge of the shape of the function.)
1. In what order do we need to call the `*_grad` functions in the backward pass? Why?
1. What is `__call__`?
1. What methods must we implement when writing a `torch.autograd.Function`?
1. Write `nn.Linear` from scratch, and test it works.
1. What is the difference between `nn.Module` and fastai's `Module`?

### Further Research

1. Implement ReLU as a `torch.autograd.Function` and train a model with it.
1. If you are mathematically inclined, find out what the gradients of a linear layer are in mathematical notation. Map that to the implementation we saw in this chapter.
1. Learn about the `unfold` method in PyTorch, and use it along with matrix multiplication to implement your own 2D convolution function. Then train a CNN that uses it.
1. Implement everything in this chapter using NumPy instead of PyTorch. 

# Answers:-

Ans-1 output = sum([x*w for x,w in zip(inputs,weights)]) + bias

Ans-2 def relu(x): 
    return x if x >= 0 else 0

Ans-3 def lin(x,w,b):
        return x@w+b

Ans-4 y[i,j] = sum([xi * wj for xi,wj in zip(x[i,:],w[:,j])]) + b[j]

Ans-5 Hidden size is the number of neurons in a hidden layer of a neural net; every input is
linked to each of these neurons.

Ans-6 It returns the transpose of a matrix (2-D tensor).

Ans-7 Plain Python executes the nested for loops of a matrix multiplication one element at a
time in the interpreter, which is very slow. PyTorch runs the same operations in optimized C++
code, which is much faster.

Ans-8 For the matrix multiplication of 2-D matrices a and b to be defined, the number of columns
of a (ac) must be equal to the number of rows of b (br).

Ans-9 Using the magic command %time (or %timeit for an average over several runs).

Ans-10 The basic mathematical operators can be applied elementwise to tensors. If two tensors a
and b have the same shape, the operation is performed between the elements at the same position
in both tensors.

Ans-11 (a>b).all()

Ans-12 A rank-0 tensor is a tensor containing a single scalar value, with no dimensions. The
.item() method can be used to convert it into a plain Python number.

Ans-13 tensor([2,3]). The tensor([1]) is broadcast to tensor([1,1]) to match the size of the
other tensor, and then the elementwise addition takes place.

Ans-14 An error, as the sizes 2 and 3 do not match and neither of them is 1, so broadcasting is
not possible.
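
Both cases from the last two answers can be tried directly. This is a small sketch; the variable
names `ok` and `failed` are introduced here only for the illustration.

```python
import torch

a = torch.tensor([1, 2])

# Size 1 is compatible with size 2: the single element is broadcast
ok = a + torch.tensor([1])
print(ok)

# Sizes 2 and 3 are incompatible, so PyTorch raises an error
try:
    a + torch.tensor([1, 2, 3])
    failed = False
except RuntimeError as err:
    failed = True
    print("broadcasting failed:", err)
```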

Ans-15 Using elementwise arithmetic we can remove the innermost of the three nested loops: a row
of a and a column of b are multiplied directly as tensors and the result is summed, so the loop
over their elements is replaced by PyTorch elementwise arithmetic.

Ans-16 Shapes are compared dimension by dimension, starting from the trailing dimension. Two
dimensions are compatible for broadcasting when:
1. They are equal, or
2. One of them is 1, in which case it is expanded to match the other.

Ans-17 The expand_as method expands a tensor to the size of another tensor with more dimensions,
giving the same result that broadcasting would produce.
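
Since the question asks for an example, here is a minimal sketch with a hand-picked vector `c` and
matrix `m` (both hypothetical) showing that adding the expanded vector matches plain broadcasting.

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.ones(3, 3)

expanded = c.expand_as(m)      # c repeated along the rows, shape [3, 3]
print(expanded)

# Adding the expanded tensor gives the same result as letting PyTorch broadcast c
same = torch.equal(m + expanded, m + c)
print(same)
```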
 
Ans-18 By default a vector is broadcast along the rows of a matrix. To broadcast it along another
dimension, we change the shape of the vector by adding a unit axis with the unsqueeze method.

Ans-19 The unsqueeze method can be replaced by indexing with None. To add an extra unit
dimension, we put None at that position while indexing the tensor, for example a[None,:].

Ans-20 Using the storage method of the tensor.

Ans-21 Elements of vector are added to each row of matrix by default.

Ans-22 No, the memory use stays the same. The expanded dimension is given a stride of 0, so
when PyTorch moves to the next row by adding the stride it stays on the same data, and no copy
is made.
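
The zero stride can be inspected directly. In this sketch `data_ptr()` is used to compare memory
addresses; the tensors are arbitrary examples.

```python
import torch

c = torch.tensor([10., 20., 30.])
e = c.expand_as(torch.ones(3, 3))

print(e.shape)       # looks like a 3 X 3 matrix
print(e.stride())    # the row stride is 0: every row points at the same 3 numbers

# The expanded tensor starts at the same memory address as the original vector
shares_memory = e.data_ptr() == c.data_ptr()
print(shares_memory)
```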

Ans-23 def matmul(a,b): 
    return torch.einsum('ik,kj->ij', a, b)

Ans-24 The products are summed over the axis of that repeated index.

Ans-25 The Einstein summation rules are as follows:

1. Repeated indices are implicitly summed over.

2. Each index can appear at most twice in any term.

3. Each term must contain identical non-repeated indices.

Ans-26 Training a model is divided into two parts. In the forward pass the output is calculated
by the model from the given input. In the backward pass we calculate the gradients of the loss
with respect to the parameters, which are then used to update the parameters.

Ans-27 Because they are required for calculating the gradients during the backward pass.

Ans-28 If the activations have a standard deviation far from 1, every layer rescales them again,
so with many layers the values either grow so large that they can no longer be represented by the
computer (nan) or shrink towards zero.

Ans-29 Proper weight initialization scales the weights so that the activations of each layer stay
in a reasonable range, with a standard deviation close to 1, instead of collapsing to zeros.

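Ans-30 1/sqrt(n_in) for a plain linear layer, and sqrt(2/n_in) for a linear layer followed by
ReLU, where n_in is the number of inputs.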
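
The effect of the two scalings can be observed numerically. The sketch below uses arbitrary sizes
(1000 samples, 100 inputs and outputs) and random data chosen for illustration. Note that after
ReLU it is the mean of the squared activations that stays close to 1, while the standard deviation
itself comes out somewhat lower.

```python
import torch

torch.manual_seed(0)
n_in, n_out = 100, 100
x = torch.randn(1000, n_in)                  # inputs with mean 0 and std 1

# Plain linear layer: scale the weights by 1/sqrt(n_in)
w_plain = torch.randn(n_in, n_out) / n_in ** 0.5
lin_std = (x @ w_plain).std()

# Linear layer followed by ReLU: scale the weights by sqrt(2/n_in)
w_relu = torch.randn(n_in, n_out) * (2 / n_in) ** 0.5
relu_out = (x @ w_relu).clamp_min(0.)
relu_rms = relu_out.pow(2).mean().sqrt()     # root of the mean squared activation

print(lin_std.item(), relu_rms.item(), relu_out.std().item())
```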

Ans-31 The squeeze method removes extra unit dimensions. In the loss function we calculate the
difference between the targets and the outputs of the model. Sometimes the output has one extra
unit dimension, and squeeze is used to get rid of it so that the shapes match.

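Ans-32 It is the index of the dimension to be removed, that is the axis index in the shape of the
tensor. Without it, squeeze removes every unit dimension, including the batch dimension when the
batch size is 1, so it is safer to pass it.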

Ans-33 Chain rule is a calculus rule used for calculating the derivative of a composite function.
dy/dx = dy/du * du/dx

Ans-35 The gradient of ReLU is 0 for negative inputs and 1 for positive inputs. In code:
(inp>0).float() * out.g

Ans-36 mse_grad, lin_grad, relu_grad, lin_grad (the reverse order of the layers), because each
gradient needs the gradient of the layer that comes after it.

Ans-37 A magic method in Python that makes the instances of a class callable.

Ans-38 Forward and backward

Ans-40 nn.Module requires call to the superclass init after defining it.Fastai's Module doesn't