---

<center> <h1> Gradient Descent Algorithm in Python </h1>

---

## Data Description

Consider the following 3 datapoints: 

| **X1  (Feature)** | **Y  (Target)** |
|:-----------------:|:---------------:|
|         1         |       4.8       |
|         3         |       12.4      |
|         5         |       15.5      |

Here,

**X1** is the independent variable (also called a feature or attribute in Machine Learning).

**Y** is the dependent variable (also known as the target variable in ML).

<img src="best_fit_line.JPG">

- The plot above shows these 3 datapoints as blue circles. Also shown is the red line (with squares), which we claim is the **“best-fit line”**.
- The claim is that this best-fit line has the minimum prediction error. The predicted values are the red squares, so the vertical distance between each red square and the corresponding blue circle is the error in prediction.
- This total error across all the datapoints is expressed as the Mean Squared Error function, which will be minimized using the Gradient Descent Algorithm, discussed below.
- Minimizing or maximizing any quantity is mathematically referred to as an Optimization Problem, and the solution (the point where the minimum/maximum occurs) is referred to as the **“optimal values”**.
- You can easily see that the yellow line (a poor-fit line), which has “non-optimal” values of slope and intercept, fits the data very badly. The exact equation of the yellow line is Y = X1 + 6, so its slope is 1 and its intercept is 6 units.

The overall objective is to find the equation of the best-fitting straight line through these 3 data points (listed in the table above and shown as blue circles in the plot).

---

$$
\hat{Y} = w_0 + w_1X_1 \quad \text{is the equation of the best-fit line (red-line in the plot) where}
$$ 

---



$$ 
w_1 = \quad \text{slope of the line;} 
$$ 

$$ 
w_0  = \quad \text{intercept of the line} 
$$

$$ 
w_0 , w_1 \quad \text{are also called model weights}
$$

$$ 
\hat{Y} \quad \text {is the predicted values of Y, given by the “best-fit line”.}
$$ 

These predicted values are represented by red squares on the red line. Of course, the predicted values are NOT exactly the same as the actual values of Y (blue circles). The vertical difference between them is the error in the prediction, given by:

$$ 
Error_i = \hat{Y}_i - Y_i \quad \text{for the ith data point}
$$ 

$$
MSE = \frac{1}{N}\sum_{i=1}^{N}(Error_i)^2= \frac{1}{N}\sum_{i=1}^{N}(\hat{Y}_i - Y_i)^2
\quad \text{where N = Total no. of data points. For this question, N=3}
$$
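To make the MSE definition concrete, here is a minimal sketch that evaluates it for one candidate line on the 3 datapoints from the table. The helper names `predict` and `mean_squared_error` are illustrative choices, not part of the original material, and the candidate weights are those of the yellow line (intercept 6, slope 1).

```python
# Datapoints from the table above
X = [1.0, 3.0, 5.0]
Y = [4.8, 12.4, 15.5]


def predict(w0, w1, xs):
    """Return the predictions Y_hat = w0 + w1 * X1 for every x in xs."""
    return [w0 + w1 * x for x in xs]


def mean_squared_error(w0, w1, xs, ys):
    """Average of the squared errors (Y_hat_i - Y_i)^2 over all datapoints."""
    preds = predict(w0, w1, xs)
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


# Yellow (poor-fit) line: intercept w0 = 6, slope w1 = 1
yellow_mse = mean_squared_error(6.0, 1.0, X, Y)
print(f"MSE of the yellow line: {yellow_mse:.4f}")  # prints approximately 12.2167
```

Any other pair of weights can be scored the same way, which is what the gradient descent procedure does repeatedly while searching for the pair with the lowest MSE.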

## Problem Statement

Find the **“optimal values”** of the slope and intercept of this best-fit line, such that the **“Mean Squared Error” (MSE)** is minimized.

Also, **Plot the following:**
- 1. MSE Loss function (y-axis) vs w0 (x-axis)
- 2. MSE Loss function (y-axis) vs w1 (x-axis)
- 3. 3D-plot of Loss function w.r.t. w0 & w1
- 4. w0 (y-axis) vs Iteration (x-axis)
- 5. w1 (y-axis) vs Iteration (x-axis)
- 6. Loss function (y-axis) vs iteration (x-axis)
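Plots 1 and 2 in the list above are one-dimensional slices of the loss: one weight is varied while the other is held fixed. The sketch below computes the values for such slices in plain Python. The grid ranges and the choice to hold the other weight at 0 are assumptions made for illustration, and the names `mse`, `w0_grid`, and `w1_grid` are hypothetical.

```python
# Datapoints from the table in the Data Description section
X = [1.0, 3.0, 5.0]
Y = [4.8, 12.4, 15.5]


def mse(w0, w1):
    """Mean squared error of the line Y_hat = w0 + w1 * X1 on the 3 datapoints."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(X, Y)) / len(X)


# Assumed grids: w0 from -5 to 15 and w1 from 0 to 6, both in steps of 0.1
w0_grid = [i * 0.1 for i in range(-50, 151)]
w1_grid = [i * 0.1 for i in range(0, 61)]

# Slice 1: loss vs w0 with w1 held at 0 (assumption)
loss_vs_w0 = [mse(w0, 0.0) for w0 in w0_grid]

# Slice 2: loss vs w1 with w0 held at 0 (assumption)
loss_vs_w1 = [mse(0.0, w1) for w1 in w1_grid]

# Grid point with the lowest loss along each slice
best_w0 = w0_grid[loss_vs_w0.index(min(loss_vs_w0))]
best_w1 = w1_grid[loss_vs_w1.index(min(loss_vs_w1))]
print(f"lowest loss along the w0 slice at w0 = {best_w0:.1f}")
print(f"lowest loss along the w1 slice at w1 = {best_w1:.1f}")
```

Each pair of lists (for example `w0_grid` and `loss_vs_w0`) can be passed directly to a plotting library to draw the corresponding curve. Both slices are parabolas, which is why each has a single lowest point.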

## Approach
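#### How will I get the optimal values of the slope and intercept?

This is where the Gradient Descent Algorithm comes in! Starting from an initial guess, it repeatedly adjusts the weights in the direction that reduces the loss: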

$$
w^{k+1}_0 = w^{k}_0 - \alpha\sum_{i=1}^{N}(\hat{Y}_i - Y_i)
$$

$$
w^{k+1}_1 = w^{k}_1 - \alpha\sum_{i=1}^{N}\left[(\hat{Y}_i - Y_i)\,X_{1i}\right]
$$

where 
$$ 
w^{k}_0 , w^{k}_1 \quad \text{represent the values of the intercept and the slope of the linear-fit in the kth iteration}
$$
and
$$
w^{k+1}_0, w^{k+1}_1 \quad \text{represent the values of the intercept and the slope of the linear-fit in the (k+1)th iteration (next iteration)}
$$

$$
w_0 , w_1 \quad \text{are also called model weights or model coefficients and}
$$

$$
\alpha \quad \text{represents the Learning Rate}
$$
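For reference, the sums in the update rules come from differentiating the MSE with respect to each weight. A short derivation, using only the definitions of the MSE and of the predicted values given earlier:

$$
\frac{\partial\, MSE}{\partial w_0} = \frac{2}{N}\sum_{i=1}^{N}(\hat{Y}_i - Y_i)
$$

$$
\frac{\partial\, MSE}{\partial w_1} = \frac{2}{N}\sum_{i=1}^{N}\left[(\hat{Y}_i - Y_i)\,X_{1i}\right]
$$

The update rules above use these sums without the constant factor 2/N. Since that factor is the same for both weights and for every iteration, it only rescales the step size and can be regarded as absorbed into the learning rate α.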

<img src="gradient_descent.JPG">

### Gradient Descent Algorithm

- 1. Initialize the algorithm by choosing a learning rate α and starting values (for example, random values) for the weights (w0 , w1)

- 2. Calculate predictions 
$$
\hat{Y} = w_0 + w_1X_1
$$ 
- 3. Calculate Error terms & MSE Loss Function (L).
  > Error Terms are:
  $$
  \sum_{i=1}^{N}(\hat{Y}_i - Y_i) \quad \text{and}
  $$

  $$
  \sum_{i=1}^{N}\left[(\hat{Y}_i - Y_i)\,X_{1i}\right]
  $$

  $$
  \quad \text{for data points i=1 to N. Here N=3}
  $$
  
  > and Loss Function as:
  $$
  MSE = \frac{1}{N}\sum_{i=1}^{N}(Error_i)^2= \frac{1}{N}\sum_{i=1}^{N}(\hat{Y}_i - Y_i)^2
  $$

  $$
  \quad \text{where N = Total no. of data points. Here N=3}
  $$
  
- 4. Update the weights using the update equations for w0 and w1 given above
- 5. Repeat steps 2-4 until convergence.

Based on the above-mentioned steps, we can calculate the weights. Let the **learning rate (α)** be **0.01** and initialize the weights **w0** and **w1** as **0**.
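The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `gradient_descent` and the fixed budget of 5000 iterations (used here in place of an explicit convergence test) are assumptions, while the learning rate of 0.01, the zero initial weights, and the update rules come from the text.

```python
# Datapoints from the table in the Data Description section
X = [1.0, 3.0, 5.0]
Y = [4.8, 12.4, 15.5]


def gradient_descent(xs, ys, alpha=0.01, w0=0.0, w1=0.0, n_iters=5000):
    """Run the update rules for a fixed number of iterations.

    n_iters is an assumed stopping rule, standing in for "until convergence".
    Returns the final weights and a history of (iteration, w0, w1, loss).
    """
    n = len(xs)
    history = []
    for k in range(n_iters):
        # Step 2: predictions and per-point errors (Y_hat_i - Y_i)
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Step 3: MSE loss and the two error sums used in the updates
        loss = sum(e * e for e in errors) / n
        history.append((k, w0, w1, loss))
        sum_err = sum(errors)
        sum_err_x = sum(e * x for e, x in zip(errors, xs))
        # Step 4: update both weights from the same set of errors
        w0 = w0 - alpha * sum_err
        w1 = w1 - alpha * sum_err_x
    return w0, w1, history


w0_final, w1_final, history = gradient_descent(X, Y)
print(f"intercept w0 = {w0_final:.4f}, slope w1 = {w1_final:.4f}")
print(f"loss at first iteration = {history[0][3]:.4f}")
print(f"loss at last iteration  = {history[-1][3]:.4f}")
```

The `history` list records the weights and the loss at every iteration, so its columns provide the data for the three plots against iteration number listed in the Problem Statement.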