<a href="https://colab.research.google.com/github/jindaldisha/Deep-Learning-and-Neural-Networks/blob/main/00_3_gradient_descent_and_linear_regression_with_pytorch_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Descent and Linear Regression with PyTorch from scratch

## Introduction to Linear Regression
Linear Regression is one of the foundational algorithms in Machine Learning.

We'll create a model that predicts crop yields for apples and oranges (target variables) by looking at the average temperature, rainfall, and humidity (input variables or features) in a region.

Data:

| Region | Temp (F) | Rainfall (mm) | Humidity (%) | Apples (ton) | Oranges (ton) |
| -- | -- | -- | -- | -- | -- |
| Kanto | 73 | 67 | 43 | 56 | 70 |
| Johto | 91 | 88 | 64 | 81 | 101 | 
| Hoenn | 87 | 134 | 58 | 119 | 133 | 
| Sinnoh | 102 | 43 | 37 | 22 | 37 | 
| Unova | 69 | 96 | 70 | 103 | 119 |  

In a linear regression model, each target variable is estimated as a weighted sum of the input variables (with weights `w`), offset by a constant `b` known as the bias:

`yield_apple = w11 * temp + w12 * rainfall + w13 * humidity + b1`

`yield_oranges = w21 * temp + w22 * rainfall + w23 * humidity + b2`

The learning part of linear regression is to figure out the weights (w11, w12..) and the biases (b1, b2..) using the training data and make accurate predictions for the test data that we pass to it. The learned weights will be used to predict the yields for apples and oranges in new regions using the average temperature, rainfall and humidity for that region.
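
To make the weighted-sum formula concrete, here is a minimal sketch that evaluates `yield_apple` for Kanto. The weights and bias below are hypothetical values chosen only for illustration; they are not learned from the data.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration (not learned values).
w_apple = np.array([0.3, 0.5, 0.2], dtype=np.float32)  # w11, w12, w13
b_apple = 1.0                                           # b1

# Kanto's inputs from the table: temp, rainfall, humidity.
kanto = np.array([73, 67, 43], dtype=np.float32)

# Weighted sum of the inputs plus the bias.
yield_apple = float(w_apple @ kanto + b_apple)
print(yield_apple)
```

With these made-up values the estimate comes out at about 65 tons, against a true yield of 56 tons for Kanto. Training is the process of replacing such guesses with weights that fit all regions well.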

We train our model by adjusting the weights and biases slightly, many times over, so that its predictions keep improving. To do this, we use an optimization technique called **Gradient Descent**.


In [197]:
#Import Libraries
import numpy as np
import torch

## Training data

We'll represent the training data using the matrices `x_train` and `y_train`.

x_train will have the values of temperature, rainfall and humidity for every region as a single row each.
y_train will have the crop yields of apples and oranges for every region as a single row each.

In [198]:
#Input Features (Temp, Rainfall, Humidity)
x_train = np.array([[73, 67, 43], #Kanto
                    [91, 88, 64], #Johto
                    [87, 134, 58], #Hoenn
                    [102, 43, 37], #Sinnoh
                    [69, 96, 70]], #Unova
                    dtype= 'float32')
x_train

array([[ 73.,  67.,  43.],
       [ 91.,  88.,  64.],
       [ 87., 134.,  58.],
       [102.,  43.,  37.],
       [ 69.,  96.,  70.]], dtype=float32)

In [199]:
# Input Labels (Apples, Oranges)
y_train = np.array([[56, 70], 
                    [81, 101], 
                    [119, 133], 
                    [22, 37], 
                    [103, 119]], dtype='float32')
y_train

array([[ 56.,  70.],
       [ 81., 101.],
       [119., 133.],
       [ 22.,  37.],
       [103., 119.]], dtype=float32)

In [200]:
#Convert input numpy array to pytorch tensors
x_train = torch.from_numpy(x_train)
y_train = torch.from_numpy(y_train)

In [201]:
x_train, type(x_train), x_train.shape

(tensor([[ 73.,  67.,  43.],
         [ 91.,  88.,  64.],
         [ 87., 134.,  58.],
         [102.,  43.,  37.],
         [ 69.,  96.,  70.]]), torch.Tensor, torch.Size([5, 3]))

In [202]:
y_train, type(y_train), y_train.shape

(tensor([[ 56.,  70.],
         [ 81., 101.],
         [119., 133.],
         [ 22.,  37.],
         [103., 119.]]), torch.Tensor, torch.Size([5, 2]))

## Linear regression model from scratch

The weights and biases (`w11, w12,... w23, b1 & b2`) can also be represented as matrices, initialized as random values. The first row of `w` and the first element of `b` are used to predict the first target variable, i.e., yield of apples, and similarly, the second for oranges.

In [203]:
w = torch.randn(2,3, requires_grad = True)
b = torch.randn(2, requires_grad = True)
w, b

(tensor([[-0.5492, -0.3743,  0.9817],
         [ 0.0911,  0.8713,  0.9355]], requires_grad=True),
 tensor([-0.5123,  1.4734], requires_grad=True))

Our Model is going to perform matrix multiplication of input features `x_train` and weights `w` (transposed) and add biases `b`.

torch.randn creates a tensor with the given shape, with elements picked randomly from a normal distribution with mean 0 and standard deviation 1.

In [204]:
#Function for model
def model(x):
  return x @ w.t() + b
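
The shapes involved in `x @ w.t() + b` can be checked with a small sketch. The tensors `x`, `w_demo`, and `b_demo` below are placeholders introduced for this illustration, not the notebook's actual data or parameters.

```python
import torch

x = torch.ones(5, 3)         # 5 regions, 3 features (placeholder values)
w_demo = torch.randn(2, 3)   # one row of weights per target
b_demo = torch.randn(2)      # one bias per target

# (5, 3) @ (3, 2) -> (5, 2); b_demo is broadcast across the 5 rows.
out = x @ w_demo.t() + b_demo
print(out.shape)

# The same value computed by hand for the first region and the first target.
manual = (x[0] * w_demo[0]).sum() + b_demo[0]
print(torch.allclose(out[0, 0], manual))
```

Each output row holds the two predictions (apples, oranges) for one region, which is why the weights are transposed before the multiplication.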

In [205]:
#Calling the model on our input features
y_pred = model(x_train)

In [206]:
#Comparing y_pred with y_train
y_pred, y_train

(tensor([[-23.4729, 106.7221],
         [-20.6040, 146.3027],
         [-41.5157, 180.4041],
         [-36.3078,  82.8404],
         [ -5.6249, 156.8817]], grad_fn=<AddBackward0>), tensor([[ 56.,  70.],
         [ 81., 101.],
         [119., 133.],
         [ 22.,  37.],
         [103., 119.]]))

There is a big difference between our model's predicted labels and the true labels. This is because we've initialized our model with random weights and biases.

## Loss Function

Before we make changes to our model to improve its predictions, we need a way to evaluate it. We can compare the predicted labels with the true labels using the following steps:
- Calculate the difference between `y_pred` and `y_train`
- Take a square of all the differences
- Calculate the average of the elements.

The result is a single number known as the `mean squared error (MSE)`. It is the mean of the squared differences between the predicted values and the true values.

Roughly speaking, each element in the prediction differs from the actual target by the square root of the loss (the root mean squared error). The result is called the loss because it indicates how bad the model is at predicting the target variables. It represents information loss in the model: the lower the loss, the better the model.

In [207]:
#Loss Function (MSE)
def mse(y_true, y_pred):
  error = y_true - y_pred
  square = error * error
  number_of_elements = error.numel()
  mean = torch.sum(square) / number_of_elements
  return mean

In [208]:
#Compute Loss
loss = mse(y_train, y_pred)
print(loss)

tensor(6678.8149, grad_fn=<DivBackward0>)
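
As a sanity check, a hand-written MSE can be compared against PyTorch's built-in `torch.nn.functional.mse_loss`. This is a minimal sketch: the small `targets` and `preds` tensors are made-up values for illustration, and `mse` is redefined here only so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

def mse(y_true, y_pred):
    # Mean of the squared element-wise differences.
    error = y_true - y_pred
    return torch.sum(error * error) / error.numel()

# Made-up values for illustration only.
targets = torch.tensor([[56., 70.], [81., 101.]])
preds = torch.tensor([[50., 75.], [90., 100.]])

manual = mse(targets, preds)
builtin = F.mse_loss(preds, targets)  # default reduction is 'mean'
rmse = torch.sqrt(manual)             # typical size of a single error
print(manual, builtin, rmse)
```

The two losses should agree, and the square root gives a value in the same units as the targets (tons), which is often easier to interpret.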


## Compute gradients

With PyTorch, we can automatically compute the gradient (derivative) of the loss w.r.t. the weights and biases because they have `requires_grad` set to `True`.

The gradients are stored in the `.grad` property of the respective tensors. Note that the derivative of the loss w.r.t. the weights matrix is itself a matrix with the same dimensions.

In [209]:
#Compute Gradients
loss.backward()

In [210]:
#Gradients for weights
print(w)
print(w.grad) 

tensor([[-0.5492, -0.3743,  0.9817],
        [ 0.0911,  0.8713,  0.9355]], requires_grad=True)
tensor([[-8490.9727, -9742.0332, -5798.2070],
        [ 3643.3960,  3681.3899,  2315.1355]])


Since we called `.backward()` on the loss, the gradients are now populated:
`w.grad` is the derivative of the `loss` w.r.t. `w`, and `b.grad` is the derivative of the `loss` w.r.t. `b`.

In [211]:
print(b)
print(b.grad)

tensor([-0.5123,  1.4734], requires_grad=True)
tensor([-101.7051,   42.6302])
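
The gradients that autograd returns can be cross-checked against the closed-form derivative of the MSE for a linear model. The sketch below assumes randomly generated stand-in data (`x`, `y`) and parameters (`w_demo`, `b_demo`), all hypothetical names, and uses the fact that for `loss = mean((x @ w.T + b - y)^2)` the gradient w.r.t. `w` is `(2 / N) * (pred - y).T @ x`, where `N` is the number of elements in `y`.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)   # stand-in inputs
y = torch.randn(5, 2)   # stand-in targets
w_demo = torch.randn(2, 3, requires_grad=True)
b_demo = torch.randn(2, requires_grad=True)

pred = x @ w_demo.t() + b_demo
loss = ((pred - y) ** 2).mean()
loss.backward()  # autograd fills w_demo.grad and b_demo.grad

with torch.no_grad():
    diff = pred - y
    # Closed-form gradients of the mean squared error.
    grad_w = 2.0 / diff.numel() * diff.t() @ x
    grad_b = 2.0 / diff.numel() * diff.sum(dim=0)

print(torch.allclose(w_demo.grad, grad_w, atol=1e-5))
print(torch.allclose(b_demo.grad, grad_b, atol=1e-5))
```

Note that `grad_w` has shape `(2, 3)`, the same as the weights matrix, matching the statement above about the gradient's dimensions.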


## Adjust Weights and Biases to reduce the Loss

The loss is a function of our weights and biases, and we want to find the set of weights and biases for which the loss is lowest.

The gradients indicate the rate of change of the loss, i.e. the loss function's slope w.r.t the weights and biases.

If the gradient element is positive:
 - Increasing the weight element's value slightly will increase the loss.
 - Decreasing the weight element's value will decrease the loss.

If the gradient element is negative:
 - Increasing the weight element's value slightly will decrease the loss.
 - Decreasing the weight element's value will increase the loss.


The increase or decrease in the loss by changing a weight element is proportional to the gradient of the loss w.r.t. that element. This observation forms the basis of the gradient descent optimization algorithm that we'll use to improve our model (by descending along the gradient).

We can subtract from each weight element a small quantity proportional to the derivative of the loss w.r.t. that element to reduce the loss slightly.


In simple words, if the gradient is negative, the slope is negative so we need to increase the weight. If the gradient is positive, the slope is positive so we need to decrease the weight.

It is called gradient descent, because we're descending along the gradient.
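
The sign rule above can be seen on a one-dimensional toy problem. This sketch is not part of the crop-yield model: it minimizes the made-up function `f(w) = (w - 3)**2`, whose minimum is at `w = 3`, using a hypothetical step factor of `0.1`.

```python
def grad(w):
    # Derivative of f(w) = (w - 3)**2.
    return 2 * (w - 3)

w_val = 0.0        # start to the left of the minimum, where the slope is negative
step_factor = 0.1  # hypothetical small multiplier for the gradient
history = []

for _ in range(50):
    # Subtracting a negative gradient increases w; a positive one decreases it.
    w_val -= step_factor * grad(w_val)
    history.append((w_val - 3) ** 2)

print(w_val, history[0], history[-1])
```

Starting at `w = 0` the gradient is negative, so each update increases `w` toward 3, and the recorded values of `f(w)` shrink along the way.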

In [212]:
with torch.no_grad():
  w -= w.grad * 1e-5
  b -= b.grad * 1e-5

We multiply the gradients by a very small number (`1e-5` in this case) to ensure that we don't modify the weights by a very large amount. This ensures that we take only small steps in the downhill direction rather than giant leaps, which keeps us from overshooting and diverging away from the optimal solution. This number is known as the learning rate.
We use `torch.no_grad` to tell PyTorch that it shouldn't track, calculate, or modify gradients while we update the weights and biases.
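
The role of `torch.no_grad` can be illustrated with a small sketch. The tensor `p` is a hypothetical parameter used only here; the example relies on PyTorch rejecting in-place updates of a leaf tensor that requires gradients when gradient tracking is active.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (p ** 2).sum()
loss.backward()  # p.grad is now 2 * p

# Outside no_grad, an in-place update of a leaf tensor that requires grad fails.
try:
    p -= 0.1 * p.grad
    raised = False
except RuntimeError:
    raised = True

# Inside no_grad, the same update is allowed and is not recorded by autograd.
with torch.no_grad():
    p -= 0.1 * p.grad

print(raised, p)
```

After the update, `p` still has `requires_grad=True`, so later forward passes continue to track gradients as usual.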

In [213]:
#Evaluate the model again
y_pred = model(x_train)
loss = mse(y_train, y_pred)
print(loss)

tensor(4558.9028, grad_fn=<DivBackward0>)


Before we proceed, we reset the gradients to zero by invoking the `.zero_()` method. We need to do this because PyTorch accumulates gradients. Otherwise, the next time we invoke `.backward` on the loss, the new gradient values are added to the existing gradients, which may lead to unexpected results.

In [214]:
w.grad.zero_()
b.grad.zero_()
print(w.grad)
print(b.grad)

tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0., 0.])
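
Gradient accumulation is easy to observe directly. In this minimal sketch, `p` is a hypothetical parameter and the "loss" is simply the sum of its squares.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)

(p ** 2).sum().backward()
first = p.grad.clone()         # gradient after one backward pass

(p ** 2).sum().backward()      # no reset: the new gradient is added to the old one
accumulated = p.grad.clone()

p.grad.zero_()                 # reset, then compute again
(p ** 2).sum().backward()
after_reset = p.grad.clone()

print(first, accumulated, after_reset)
```

Without the reset, the second call doubles the stored gradient, which would make the next weight update twice as large as intended.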


## Train the model using gradient descent

We reduce the loss and improve our model using the gradient descent optimization algorithm. Thus, we can train the model using the following steps:

1. Generate predictions

2. Calculate the loss

3. Compute gradients w.r.t the weights and biases

4. Adjust the weights by subtracting a small quantity proportional to the gradient

5. Reset the gradients to zero


In [215]:
# Step 1: Generate Predictions
y_pred = model(x_train)
y_pred

tensor([[ -8.2531, 100.6000],
        [ -0.5924, 138.2655],
        [-17.7103, 170.9580],
        [-21.3115,  76.6842],
        [ 13.6460, 149.2126]], grad_fn=<AddBackward0>)

In [216]:
# Step 2: Calculate Loss
loss = mse(y_train, y_pred)
loss

tensor(4558.9028, grad_fn=<DivBackward0>)

In [217]:
#Step 3:Compute Gradients
loss.backward()

In [218]:
print(w.grad)
print(b.grad)

tensor([[-6918.4756, -8048.9282, -4754.2598],
        [ 3011.9531,  3004.5544,  1897.1111]])
tensor([-83.0443,  35.1441])


In [219]:
#Step 4, 5: Adjust Weights and Reset Gradients
with torch.no_grad():
  w -= w.grad * 1e-5
  b -= b.grad * 1e-5
  w.grad.zero_()
  b.grad.zero_()

In [220]:
#View new weights and biases
w, b

(tensor([[-0.3952, -0.1964,  1.0872],
         [ 0.0245,  0.8044,  0.8933]], requires_grad=True),
 tensor([-0.5104,  1.4727], requires_grad=True))

In [221]:
#Evaluate the Model again
y_pred = model(x_train)
loss = mse(y_train, y_pred)
print(loss)

tensor(3129.6816, grad_fn=<DivBackward0>)


## Train for multiple epochs

To reduce the loss further, we repeat the process of adjusting the weights and biases using the gradient descent algorithm multiple times. Each iteration is called an epoch.

In [222]:
#Train for 100 epochs
for i in range(100):
  #Make Prediction
  y_pred = model(x_train)
  #Calculate Loss
  loss = mse(y_train, y_pred)
  #Print Loss at each epoch
  print(f'Epoch [{i+1}/100], Loss: {loss}')
  #Calculate Gradients
  loss.backward()
  #Gradient Descent
  with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    #Reset gradients to zero
    w.grad.zero_()
    b.grad.zero_()


Epoch [1/100], Loss: 3129.681640625
Epoch [2/100], Loss: 2165.91455078125
Epoch [3/100], Loss: 1515.819580078125
Epoch [4/100], Loss: 1077.10986328125
Epoch [5/100], Loss: 780.858154296875
Epoch [6/100], Loss: 580.6143188476562
Epoch [7/100], Loss: 445.07611083984375
Epoch [8/100], Loss: 353.14971923828125
Epoch [9/100], Loss: 290.61968994140625
Epoch [10/100], Loss: 247.90695190429688
Epoch [11/100], Loss: 218.5556182861328
Epoch [12/100], Loss: 198.2154083251953
Epoch [13/100], Loss: 183.95474243164062
Epoch [14/100], Loss: 173.79769897460938
Epoch [15/100], Loss: 166.41268920898438
Epoch [16/100], Loss: 160.90249633789062
Epoch [17/100], Loss: 156.66229248046875
Epoch [18/100], Loss: 153.28428649902344
Epoch [19/100], Loss: 150.493896484375
Epoch [20/100], Loss: 148.10568237304688
Epoch [21/100], Loss: 145.9948272705078
Epoch [22/100], Loss: 144.07696533203125
Epoch [23/100], Loss: 142.2953338623047
Epoch [24/100], Loss: 140.61154174804688
Epoch [25/100], Loss: 138.99951171875
Epoch

As the output above shows, the loss decreases with each epoch.
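
One way to confirm this without reading every printed line is to record the loss at each epoch. The sketch below wraps the same five steps in a function; the name `fit`, its parameters, and the fixed seed are assumptions made for this illustration, so its starting weights (and therefore its loss values) differ from the run above.

```python
import torch

x_train = torch.tensor([[73., 67., 43.],
                        [91., 88., 64.],
                        [87., 134., 58.],
                        [102., 43., 37.],
                        [69., 96., 70.]])
y_train = torch.tensor([[56., 70.],
                        [81., 101.],
                        [119., 133.],
                        [22., 37.],
                        [103., 119.]])

def fit(x, y, epochs=100, lr=1e-5, seed=0):
    """Train a linear model with plain gradient descent and return the loss history."""
    torch.manual_seed(seed)  # fixed seed so the sketch is repeatable
    w = torch.randn(2, 3, requires_grad=True)
    b = torch.randn(2, requires_grad=True)
    losses = []
    for _ in range(epochs):
        y_pred = x @ w.t() + b                 # 1. generate predictions
        loss = ((y - y_pred) ** 2).mean()      # 2. calculate the loss (MSE)
        losses.append(loss.item())
        loss.backward()                        # 3. compute gradients
        with torch.no_grad():
            w -= w.grad * lr                   # 4. adjust weights and biases
            b -= b.grad * lr
            w.grad.zero_()                     # 5. reset gradients
            b.grad.zero_()
    return w, b, losses

w_fit, b_fit, losses = fit(x_train, y_train)
print(f"first loss: {losses[0]:.2f}, last loss: {losses[-1]:.2f}")
```

Keeping the history in a list also makes it straightforward to plot the loss curve or to stop early once the improvement becomes negligible.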

In [223]:
y_pred

tensor([[ 59.2773,  71.6623],
        [ 88.9440, 100.8018],
        [ 99.9737, 130.4883],
        [ 34.3872,  43.3723],
        [105.7756, 115.9469]], grad_fn=<AddBackward0>)

In [224]:
y_train

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])

The predictions are quite close to our targets. We have trained a reasonably good model to predict crop yields for apples and oranges by looking at the average temperature, rainfall, and humidity in a region. We can use it to make predictions of crop yields for new regions by passing a batch containing a single row of input.
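
Passing a single-row batch can be sketched as follows. The parameter values in `w_demo` and `b_demo` are hypothetical stand-ins for trained weights, and the new region's measurements are made up for illustration.

```python
import torch

# Hypothetical parameters standing in for trained values.
w_demo = torch.tensor([[-0.4, 0.8, 0.7],
                       [-0.3, 0.8, 0.9]])
b_demo = torch.tensor([-0.5, 1.5])

def predict(x):
    # Same computation as the model: inputs times transposed weights, plus biases.
    return x @ w_demo.t() + b_demo

# A batch with one row: temp, rainfall, humidity for a made-up region.
new_region = torch.tensor([[75., 63., 44.]])
pred = predict(new_region)
print(pred.shape)  # one row of two predictions: apples, oranges
```

The input keeps its two-dimensional shape `(1, 3)` so that the same matrix multiplication works for one region or for many.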

The approach in machine learning is very different from classical programming. Usually we write programs that take some inputs, perform some operations and return the result.
However, here we've defined a 'model' that assumes a specific relation between the inputs and outputs, expressed using some unknown parameters, i.e. weights and biases, that start out as random values. We then show the model some known inputs and outputs and train it to come up with good values for those parameters. Once trained, the model can be used to compute the outputs for new inputs.

Deep learning is a branch of machine learning that uses matrix operations, non-linear activation functions and gradient descent to build and train models.