<a href="https://colab.research.google.com/github/jindaldisha/Deep-Learning-and-Neural-Networks/blob/main/Neural-Networks-with-Tensorflow/00_03_gradient_descent_and_linear_regression_with_pytorch_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Descent and Linear Regression with PyTorch from scratch

## Introduction to Linear Regression
Linear Regression is one of the foundational algorithms in Machine Learning.

We'll create a model that predicts crop yields for apples and oranges (target variables) by looking at the average temperature, rainfall, and humidity (input variables or features) in a region.

Data:

| Region | Temp (F) | Rainfall (mm) | Humidity (%) | Apples (ton) | Oranges (ton) |
| -- | -- | -- | -- | -- | -- |
| Kanto | 73 | 67 | 43 | 56 | 70 |
| Johto | 91 | 88 | 64 | 81 | 101 | 
| Hoenn | 87 | 134 | 58 | 119 | 133 | 
| Sinnoh | 102 | 43 | 37 | 22 | 37 | 
| Unova | 69 | 96 | 70 | 103 | 119 |  

In a linear regression model, each target variable is estimated as a weighted sum of the input variables (with weights `w`), offset by a constant `b` known as the bias:

`yield_apples = w11 * temp + w12 * rainfall + w13 * humidity + b1`

`yield_oranges = w21 * temp + w22 * rainfall + w23 * humidity + b2`

The learning part of linear regression is to figure out the weights (`w11, w12, ... w23`) and the biases (`b1, b2`) from the training data, so that the model makes accurate predictions for new data. The learned weights and biases are then used to predict the yields of apples and oranges in new regions from the average temperature, rainfall and humidity of each region.
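To make the two formulas concrete, here is a minimal sketch that evaluates them for the Kanto row of the table. The weight and bias values are made up purely for illustration (they are not learned values), and the helper name `predict_yield` is hypothetical.

```python
# Hypothetical weights and biases, chosen only to illustrate the arithmetic
w11, w12, w13, b1 = 0.3, 0.5, 0.2, 1.0   # apples
w21, w22, w23, b2 = 0.4, 0.6, 0.1, 2.0   # oranges

def predict_yield(temp, rainfall, humidity, weights, bias):
    """Weighted sum of the three input features, offset by a bias."""
    w_temp, w_rain, w_hum = weights
    return w_temp * temp + w_rain * rainfall + w_hum * humidity + bias

# Kanto: temp = 73 F, rainfall = 67 mm, humidity = 43 %
kanto_apples = predict_yield(73, 67, 43, (w11, w12, w13), b1)
kanto_oranges = predict_yield(73, 67, 43, (w21, w22, w23), b2)
print(kanto_apples, kanto_oranges)
```

With these made-up weights the results are far from the true yields (56 and 70), which is exactly the gap that training is meant to close.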

We train our model by adjusting the weights and biases slightly, many times over, so that its predictions keep improving. To do this, we use an optimization technique called **Gradient Descent**.


In [57]:
#Import Libraries
import numpy as np
import torch

## Training data

We'll represent the training data using two matrices, `x_train` and `y_train`.

- `x_train` holds the temperature, rainfall and humidity values, one row per region.
- `y_train` holds the crop yields of apples and oranges, one row per region.

In [58]:
#Input Features (Temp, Rainfall, Humidity)
x_train = np.array([[73, 67, 43], #Kanto
                    [91, 88, 64], #Johto
                    [87, 134, 58], #Hoenn
                    [102, 43, 37], #Sinnoh
                    [69, 96, 70]], #Unova
                    dtype= 'float32')
x_train

array([[ 73.,  67.,  43.],
       [ 91.,  88.,  64.],
       [ 87., 134.,  58.],
       [102.,  43.,  37.],
       [ 69.,  96.,  70.]], dtype=float32)

In [59]:
#Targets (Apples, Oranges)
y_train = np.array([[56, 70], #Kanto
                    [81, 101], #Johto
                    [119, 133], #Hoenn
                    [22, 37], #Sinnoh
                    [103, 119]], #Unova
                    dtype='float32')
y_train

array([[ 56.,  70.],
       [ 81., 101.],
       [119., 133.],
       [ 22.,  37.],
       [103., 119.]], dtype=float32)

In [60]:
#Convert the NumPy arrays (inputs and targets) to PyTorch tensors
x_train = torch.from_numpy(x_train)
y_train = torch.from_numpy(y_train)

In [61]:
x_train, type(x_train), x_train.shape

(tensor([[ 73.,  67.,  43.],
         [ 91.,  88.,  64.],
         [ 87., 134.,  58.],
         [102.,  43.,  37.],
         [ 69.,  96.,  70.]]), torch.Tensor, torch.Size([5, 3]))

In [62]:
y_train, type(y_train), y_train.shape

(tensor([[ 56.,  70.],
         [ 81., 101.],
         [119., 133.],
         [ 22.,  37.],
         [103., 119.]]), torch.Tensor, torch.Size([5, 2]))

## Linear regression model from scratch

The weights and biases (`w11, w12,... w23, b1 & b2`) can also be represented as matrices, initialized as random values. The first row of `w` and the first element of `b` are used to predict the first target variable, i.e., yield of apples, and similarly, the second for oranges.

In [63]:
#Weights: one row per target (apples, oranges), one column per input feature
w = torch.randn(2, 3, requires_grad=True)
#Biases: one per target
b = torch.randn(2, requires_grad=True)
w, b

(tensor([[ 0.3892,  0.7033,  0.5034],
         [-0.6309, -1.4198, -0.3539]], requires_grad=True),
 tensor([-1.0438, -0.5648], requires_grad=True))

`torch.randn` creates a tensor with the given shape, with elements picked randomly from a normal distribution with mean 0 and standard deviation 1.

Our model is a function that performs a matrix multiplication of the input features `x_train` and the weights `w` (transposed) and then adds the biases `b`. The bias vector is broadcast, so it is added to every row of the result.
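As a minimal sketch of how the matrix form relates to the written-out formulas, the snippet below computes the predictions both ways and compares them. The weights and biases are small fixed values chosen only for illustration, and only two regions from the table are used.

```python
import torch

# Small fixed weights and biases, chosen only for illustration
w = torch.tensor([[0.1, 0.2, 0.3],    # row 1: w11, w12, w13 (apples)
                  [0.4, 0.5, 0.6]])   # row 2: w21, w22, w23 (oranges)
b = torch.tensor([1.0, 2.0])          # b1, b2

# Two regions from the table (Kanto, Johto)
x = torch.tensor([[73., 67., 43.],
                  [91., 88., 64.]])

# Matrix form: (2 regions x 3 features) @ (3 features x 2 targets) -> (2 x 2);
# b has shape (2,) and is broadcast across the rows
y_matrix = x @ w.t() + b

# The same numbers from the written-out formula, one element at a time
y_manual = torch.empty(2, 2)
for i in range(2):        # region
    for j in range(2):    # target
        y_manual[i, j] = (w[j] * x[i]).sum() + b[j]

print(y_matrix.shape)
print(torch.allclose(y_matrix, y_manual))
```

Each row of the result corresponds to a region and each column to a target, which is why `w` has to be transposed before the multiplication.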

In [64]:
#Function for model
def model(x):
  return x @ w.t() + b

In [65]:
#Calling the model on our input features
y_pred = model(x_train)

In [66]:
#Comparing y_pred with y_train
y_pred, y_train

(tensor([[  96.1316, -156.9660],
         [ 128.4767, -205.5705],
         [ 156.2524, -266.2344],
         [  87.5176, -139.0632],
         [ 128.5618, -205.1727]], grad_fn=<AddBackward0>),
 tensor([[ 56.,  70.],
         [ 81., 101.],
         [119., 133.],
         [ 22.,  37.],
         [103., 119.]]))

There is a big difference between our model's predicted labels and the true labels. This is because we've initialized our model with random weights and biases.

## Loss Function

Before we make changes to our model to improve its predictions, we need a way to evaluate it. We can compare the predicted labels with the true labels using the following steps:
- Calculate the difference between `y_pred` and `y_train`.
- Square each of the differences.
- Calculate the average of the squared elements.

The result is a single number known as the **mean squared error (MSE)**: the mean of the squared differences between the predicted values and the true values.

The result is called the loss because it indicates how bad the model is at predicting the target variables: the lower the loss, the better the model. As a rough interpretation, each element in the prediction differs from the actual target by about the square root of the loss (the root mean squared error).
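The three steps can be checked by hand on a tiny example. The sketch below uses made-up 2x2 tensors (not the crop data) and compares the result with PyTorch's built-in `torch.nn.functional.mse_loss`, which computes the same mean by default.

```python
import torch
import torch.nn.functional as F

# Made-up targets and predictions, small enough to verify by hand
y_true = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])
y_hat = torch.tensor([[2.0, 2.0],
                      [1.0, 8.0]])

diff = y_hat - y_true          # [[1, 0], [-2, 4]]
squared = diff ** 2            # [[1, 0], [4, 16]]
loss = squared.mean()          # (1 + 0 + 4 + 16) / 4

print(loss.item())                        # 5.25
print(F.mse_loss(y_hat, y_true).item())   # same value from the built-in
print(loss.sqrt().item())                 # root mean squared error
```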

In [67]:
#Loss Function (MSE)
def mse(y_true, y_pred):
  error = y_true - y_pred
  square = error * error
  number_of_elements = error.numel()
  mean = torch.sum(square) / number_of_elements
  return mean

In [68]:
#Compute Loss
loss = mse(y_train, y_pred)
print(loss)

tensor(45117.1680, grad_fn=<DivBackward0>)


## Compute gradients

With PyTorch, we can automatically compute the gradient (derivative) of the loss w.r.t. the weights and biases, because they have `requires_grad` set to `True`.

The gradients are stored in the `.grad` property of the respective tensors. Note that the derivative of the loss w.r.t. the weights matrix is itself a matrix with the same dimensions.
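To see what autograd computes here, the gradients can also be derived by hand. For a mean squared error over `N` elements, the chain rule gives `d(loss)/dw = (2/N) * (y_pred - y)^T @ x` and `d(loss)/db = (2/N) *` the column sums of `(y_pred - y)`. The sketch below checks this against `.grad` on two rows of the data; the seed and the variable names are arbitrary choices made for this example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the sketch is reproducible

x = torch.tensor([[73., 67., 43.],
                  [91., 88., 64.]])
y = torch.tensor([[56., 70.],
                  [81., 101.]])
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

y_hat = x @ w.t() + b
loss = ((y_hat - y) ** 2).mean()
loss.backward()                      # fills w.grad and b.grad

# Hand-derived gradients of the mean squared error
n = y.numel()
residual = (y_hat - y).detach()
grad_w_manual = (2.0 / n) * (residual.t() @ x)
grad_b_manual = (2.0 / n) * residual.sum(dim=0)

print(w.grad.shape, b.grad.shape)    # same shapes as w and b
print(torch.allclose(w.grad, grad_w_manual, rtol=1e-4))
print(torch.allclose(b.grad, grad_b_manual, rtol=1e-4))
```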

In [69]:
#Compute Gradients
loss.backward()

In [70]:
#Gradients for weights
print(w)
print(w.grad) 

tensor([[ 0.3892,  0.7033,  0.5034],
        [-0.6309, -1.4198, -0.3539]], requires_grad=True)
tensor([[  3787.5012,   3425.9556,   2227.6567],
        [-23905.2402, -26874.7285, -16348.4160]])


Because we called `loss.backward()`, `w.grad` now holds the derivative of the `loss` w.r.t. `w`. Likewise, `b.grad` holds the derivative of the `loss` w.r.t. `b`.

In [71]:
print(b)
print(b.grad)

tensor([-1.0438, -0.5648], requires_grad=True)
tensor([  43.1880, -286.6014])


## Adjust Weights and Biases to reduce the Loss

The loss is a function of our weights and biases, and we want to find the set of weights and biases for which the loss is lowest.

The gradients indicate the rate of change of the loss, i.e. the loss function's slope w.r.t the weights and biases.

If the gradient element is positive:
 - Increasing the weight element's value slightly will increase the loss.
 - Decreasing the weight element's value will decrease the loss.

If the gradient element is negative:
 - Increasing the weight element's value slightly will decrease the loss.
 - Decreasing the weight element's value will increase the loss.


The increase or decrease in the loss by changing a weight element is proportional to the gradient of the loss w.r.t. that element. This observation forms the basis of the gradient descent optimization algorithm that we'll use to improve our model (by descending along the gradient).

We can subtract from each weight element a small quantity proportional to the derivative of the loss w.r.t. that element to reduce the loss slightly.


In simple words, if the gradient is negative, the slope is negative, so we need to increase the weight. If the gradient is positive, the slope is positive, so we need to decrease the weight. In both cases we move the weight in the direction opposite to the gradient, i.e. downhill on the loss, which is why the method is called gradient descent.
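The sign rule is easiest to see with a single weight. The following sketch uses a toy loss `(w - 3)^2`, which is not the crop model, and an assumed step size of `0.1`. Starting left of the minimum the gradient is negative and the weight increases; starting right of it the gradient is positive and the weight decreases.

```python
# Toy loss with a single weight: loss(w) = (w - 3)^2, minimized at w = 3
def loss_fn(w):
    return (w - 3.0) ** 2

def grad_fn(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # assumed step size for this toy problem
final = {}

for start in (0.0, 6.0):          # one start left of the minimum, one right
    w = start
    for _ in range(50):
        w -= learning_rate * grad_fn(w)   # move against the gradient
    final[start] = w
    print(start, "->", round(w, 4), "loss:", round(loss_fn(w), 8))
```

Both runs end up near `w = 3`, even though one had to increase the weight and the other had to decrease it.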

In [72]:
with torch.no_grad():
  w -= w.grad * 1e-5
  b -= b.grad * 1e-5

We multiply the gradients by a very small number (`1e-5` in this case) so that we don't change the weights by a large amount. This ensures that we take small steps in the downhill direction rather than giant leaps, which could overshoot and diverge from the optimal solution. This number is known as the learning rate.

We use `torch.no_grad` to tell PyTorch that it shouldn't track, calculate, or modify gradients while we update the weights and biases.
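To illustrate why the learning rate has to be small for this data, the sketch below runs a few updates with two different learning rates. It starts from zero-initialized parameters instead of random ones so that the comparison is repeatable; the helper name `run` and the value `1e-3` are choices made for this example.

```python
import torch

x = torch.tensor([[73., 67., 43.], [91., 88., 64.], [87., 134., 58.],
                  [102., 43., 37.], [69., 96., 70.]])
y = torch.tensor([[56., 70.], [81., 101.], [119., 133.],
                  [22., 37.], [103., 119.]])

def run(learning_rate, steps=5):
    """Return the MSE before any update and after `steps` updates."""
    # Zero initialization (instead of random) keeps the comparison repeatable
    w = torch.zeros(2, 3, requires_grad=True)
    b = torch.zeros(2, requires_grad=True)
    losses = []
    for _ in range(steps + 1):
        loss = ((x @ w.t() + b - y) ** 2).mean()
        losses.append(loss.item())
        loss.backward()
        with torch.no_grad():          # the update itself is not tracked
            w -= learning_rate * w.grad
            b -= learning_rate * b.grad
            w.grad.zero_()
            b.grad.zero_()
    return losses[0], losses[-1]

small = run(1e-5)
large = run(1e-3)
print("lr=1e-5:", small)   # loss goes down
print("lr=1e-3:", large)   # loss blows up because the steps overshoot
```

The input features here are in the range of roughly 40 to 130, which makes the gradients large; that is why a step size as small as `1e-5` is needed.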

In [73]:
#Evaluate the model again
y_pred = model(x_train)
loss = mse(y_train, y_pred)
print(loss)

tensor(30621., grad_fn=<DivBackward0>)


Before we proceed, we reset the gradients to zero by invoking the `.zero_()` method. We need to do this because PyTorch accumulates gradients. Otherwise, the next time we invoke `.backward` on the loss, the new gradient values are added to the existing gradients, which may lead to unexpected results.
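The accumulation is easy to observe on a single scalar. In this minimal sketch (a toy loss `w * w`, not the crop model), calling `.backward()` twice without a reset doubles the stored gradient, and `.zero_()` restores the expected value.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

loss = w * w          # d(loss)/dw = 2 * w = 4
loss.backward()
first = w.grad.item()

loss = w * w          # same loss again, gradients were not reset
loss.backward()
accumulated = w.grad.item()   # 4 + 4: the new gradient is added to the old one

w.grad.zero_()        # reset in place
loss = w * w
loss.backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)   # 4.0 8.0 4.0
```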

In [74]:
w.grad.zero_()
b.grad.zero_()
print(w.grad)
print(b.grad)

tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0., 0.])


## Train the model using gradient descent

We reduce the loss and improve our model using the gradient descent optimization algorithm. Thus, we can train the model using the following steps:

1. Generate predictions

2. Calculate the loss

3. Compute gradients w.r.t the weights and biases

4. Adjust the weights by subtracting a small quantity proportional to the gradient

5. Reset the gradients to zero
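The five steps can be sketched as one function. The name `gradient_descent_step` and its signature are hypothetical, the MSE is written inline, and the usage example runs on two rows of the table with zero-initialized parameters so that it is repeatable.

```python
import torch

def gradient_descent_step(x, y, w, b, learning_rate=1e-5):
    """Run steps 1-5 once and return the loss measured before the update."""
    y_hat = x @ w.t() + b                      # 1. generate predictions
    loss = ((y_hat - y) ** 2).mean()           # 2. calculate the loss (MSE)
    loss.backward()                            # 3. compute gradients
    with torch.no_grad():
        w -= learning_rate * w.grad            # 4. adjust weights and biases
        b -= learning_rate * b.grad
        w.grad.zero_()                         # 5. reset the gradients
        b.grad.zero_()
    return loss.item()

# Tiny usage example on two rows of the table, with zero-initialized parameters
x = torch.tensor([[73., 67., 43.], [91., 88., 64.]])
y = torch.tensor([[56., 70.], [81., 101.]])
w = torch.zeros(2, 3, requires_grad=True)
b = torch.zeros(2, requires_grad=True)

before = gradient_descent_step(x, y, w, b)
after = gradient_descent_step(x, y, w, b)
print(before, after)
```

Because `w` and `b` are updated in place, the second call starts from the improved parameters and reports a lower loss.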


In [75]:
# Step 1: Generate Predictions
y_pred = model(x_train)
y_pred

tensor([[  90.1130, -114.4765],
        [ 120.5891, -149.7011],
        [ 147.0740, -199.9397],
        [  81.3565,  -97.0720],
        [ 121.0997, -151.4316]], grad_fn=<AddBackward0>)

In [76]:
# Step 2: Calculate Loss
loss = mse(y_train, y_pred)
loss

tensor(30621., grad_fn=<DivBackward0>)

In [77]:
# Step 3: Compute Gradients
loss.backward()

In [78]:
print(w.grad)
print(b.grad)

tensor([[  3167.7095,   2764.2473,   1818.4055],
        [-19516.2930, -22152.4160, -13435.7480]])
tensor([  35.8465, -234.5242])


In [79]:
# Steps 4 and 5: Adjust Weights and Biases, then Reset Gradients
with torch.no_grad():
  w -= w.grad * 1e-5
  b -= b.grad * 1e-5
  w.grad.zero_()
  b.grad.zero_()

In [80]:
#View new weights and biases
w, b

(tensor([[ 0.3196,  0.6414,  0.4629],
         [-0.1967, -0.9295, -0.0561]], requires_grad=True),
 tensor([-1.0446, -0.5596], requires_grad=True))

In [81]:
#Evaluate the Model again
y_pred = model(x_train)
loss = mse(y_train, y_pred)
print(loss)

tensor(20849.5312, grad_fn=<DivBackward0>)


## Train for multiple epochs

To reduce the loss further, we repeat the process of adjusting the weights and biases using the gradient descent algorithm multiple times. Each pass over the training data is called an epoch.

In [82]:
#Train for 100 epochs
for i in range(100):
  #Make Prediction
  y_pred = model(x_train)
  #Calculate Loss
  loss = mse(y_train, y_pred)
  #Print Loss at each epoch
  print(f'Loss at Epoch {i}: {loss}')
  #Calculate Gradients
  loss.backward()
  #Gradient Descent
  with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    #Reset gradients to zero
    w.grad.zero_()
    b.grad.zero_()


Loss at Epoch 0: 20849.53125
Loss at Epoch 1: 14261.994140625
Loss at Epoch 2: 9820.1064453125
Loss at Epoch 3: 6824.16796875
Loss at Epoch 4: 4802.66943359375
Loss at Epoch 5: 3437.864501953125
Loss at Epoch 6: 2515.627197265625
Loss at Epoch 7: 1891.662353515625
Loss at Epoch 8: 1468.729736328125
Loss at Epoch 9: 1181.3009033203125
Loss at Epoch 10: 985.2171630859375
Loss at Epoch 11: 850.7191162109375
Loss at Epoch 12: 757.7528076171875
Loss at Epoch 13: 692.80322265625
Loss at Epoch 14: 646.7631225585938
Loss at Epoch 15: 613.4943237304688
Loss at Epoch 16: 588.8599853515625
Loss at Epoch 17: 570.0724487304688
Loss at Epoch 18: 555.2525634765625
Loss at Epoch 19: 543.1334838867188
Loss at Epoch 20: 532.8614501953125
Loss at Epoch 21: 523.8606567382812
Loss at Epoch 22: 515.7425537109375
Loss at Epoch 23: 508.2452697753906
Loss at Epoch 24: 501.19189453125
Loss at Epoch 25: 494.46282958984375
Loss at Epoch 26: 487.97723388671875
Loss at Epoch 27: 481.68035888671875
...

As the output above shows, the loss decreases with each epoch.

In [83]:
y_pred

tensor([[ 60.0975,  75.7934],
        [ 82.7421, 103.0963],
        [112.7583, 118.6221],
        [ 38.9984,  69.2988],
        [ 92.1892, 104.4102]], grad_fn=<AddBackward0>)

In [84]:
y_train

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])

After training, the predicted values are much closer to the true values than they were before.
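Once trained, the model can be applied to a region it has not seen. The sketch below retrains from scratch with an arbitrary seed so that it runs on its own, then predicts the yields for a hypothetical region; the feature values in `x_new` are invented for illustration and do not come from the table.

```python
import torch

x_train = torch.tensor([[73., 67., 43.], [91., 88., 64.], [87., 134., 58.],
                        [102., 43., 37.], [69., 96., 70.]])
y_train = torch.tensor([[56., 70.], [81., 101.], [119., 133.],
                        [22., 37.], [103., 119.]])

torch.manual_seed(42)                      # arbitrary seed for repeatability
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

def model(x):
    return x @ w.t() + b

# Same training loop as above: 100 epochs of gradient descent
for _ in range(100):
    loss = ((model(x_train) - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= w.grad * 1e-5
        b -= b.grad * 1e-5
        w.grad.zero_()
        b.grad.zero_()

# Hypothetical new region: temp 75 F, rainfall 63 mm, humidity 44 %
x_new = torch.tensor([[75., 63., 44.]])
with torch.no_grad():                      # inference only, no gradients needed
    y_new = model(x_new)
print(y_new)                               # predicted [apples, oranges] in tons
```

The exact numbers depend on the random initialization and on how long the model is trained, so they should be read as rough estimates.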