## Ordinary Least Squares with Simple Linear Regression


The simple linear regression model is:

 $$\mathbf{y} = \beta_0 +\beta_1\mathbf{x}$$


where the parameters we need to estimate are the intercept ($\beta_0$) and the slope ($\beta_1$).


Let's recall the Advertising dataset and the simple linear regression we fit to the scatter plot of _sales_ vs. _TV_. With the help of `Scikit-Learn`, we were able to fit the best regression line among all the possibilities. Here is a snapshot:

<figure align="center">
       <img src="./fig1.png" height="350" width="600">
       <figcaption>Figure 1: Simple Linear Regression </figcaption>
   </figure>



The line in the figure is the simple linear regression line, with `sales` as the output $\mathbf{y}$ and `TV` as the input $\mathbf{x}$. The residual or error, $\epsilon_i$, is the difference between the observed value, $y_i$, and the predicted value, $\hat{y_i}$. The observed values are the actual output data points, shown as blue dots in the figure, and the predicted values are the corresponding points on the regression line. The error for each data point is the vertical distance from the observed point to the predicted point on the regression line.

The predicted output value is:

$$\hat{y_i} = \beta_0 + \beta_1x_i$$

The observed (actual) output value is:

$$y_i = \beta_0 + \beta_1x_i + \epsilon_i$$

where $\epsilon_i$ is a random error, not a parameter. The error $\epsilon_i = y_{i}-\hat{y_{i}}$ can be positive, negative, or zero, which is why the vertical lines in the figure fall on both sides of the regression line. If we simply summed the errors, positive and negative values would cancel. To avoid this, we square each error and sum the squares; the result is called the _Residual Sum of Squares (RSS)_ or _Sum of Squared Errors (SSE)_.

$$\text{Sum of Squared Errors (SSE)} = \sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^2$$
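The cancellation that motivates squaring can be seen in a minimal sketch. The toy data and the candidate line below are hypothetical values chosen for illustration; they are not taken from the Advertising dataset.

```python
# Hypothetical toy data (not the Advertising dataset).
x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 4.0, 8.0, 8.0]

# A candidate line chosen for illustration, not a fitted one.
beta0, beta1 = 1.0, 2.0

predictions = [beta0 + beta1 * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predictions)]

plain_sum = sum(residuals)                    # positive and negative errors cancel
squared_sum = sum(r ** 2 for r in residuals)  # squaring removes the cancellation

print(residuals)    # [1.0, -1.0, 1.0, -1.0]
print(plain_sum)    # 0.0
print(squared_sum)  # 4.0
```

The plain sum suggests a perfect fit even though every point misses the line, while the sum of squares reports the total miss.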

The summation is indexed from $1$ to $n$ because we have $n$ samples. The Sum of Squared Errors (SSE) is a function of $\beta_0$ and $\beta_1$, and we can treat it as a _loss function_. The main principle of Least Squares is to choose the intercept ($\beta_0$) and slope ($\beta_1$) that make this sum as small as possible.
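To make the idea of SSE as a function of the two parameters concrete, here is a small sketch. The helper name `sse`, the toy data, and the two candidate lines are assumptions made for illustration only.

```python
def sse(beta0, beta1, x, y):
    """Sum of squared errors of the line beta0 + beta1 * x on the data (x, y)."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data (not the Advertising dataset).
x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 4.0, 8.0, 8.0]

# Two candidate (intercept, slope) pairs; neither is claimed to be optimal.
sse_a = sse(1.0, 2.0, x, y)
sse_b = sse(0.0, 1.0, x, y)

print(sse_a)  # 4.0
print(sse_b)  # 54.0
```

Of these two candidates, Least Squares prefers the first because its SSE is smaller; the fitted line is the pair with the smallest SSE over all possible pairs.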




Thus, to estimate the parameters, we minimize the Sum of Squared Errors. SSE can also be written as:

$$\text{SSE} = \sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^2 =\sum_{i=1}^{n}(y_{i}-(\beta_0+\beta_1x_i))^2 $$



Here, $\hat{y_i}$ is replaced with the simple linear regression model equation. Since we want to minimize $\text{SSE}$, it is also called an objective function. Because $\text{SSE}$ is a sum of squared terms, it is never negative. If we plot the objective function against $\beta_0$ and $\beta_1$, we get a convex, bowl-shaped surface that opens upward.


<figure align="center">
       <img src="./fig2.png" height="400" width="500">
       <figcaption>Figure 2: Convex cost function </figcaption>
   </figure>
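The bowl shape can be probed numerically without plotting. The sketch below uses hypothetical toy data and an illustrative helper `sse`. It evaluates SSE along one slice of the surface (intercept held fixed, slope varied) and then applies a midpoint test, which any convex function must pass.

```python
def sse(beta0, beta1, x, y):
    """Sum of squared errors of the line beta0 + beta1 * x on the data (x, y)."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data (not the Advertising dataset).
x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 4.0, 8.0, 8.0]

# Slice of the surface: hold the intercept at 2.0 and vary the slope.
slopes = [0.6, 1.1, 1.6, 2.1, 2.6]
slice_values = [sse(2.0, b1, x, y) for b1 in slopes]
print(slice_values)  # values fall, reach a low point, then rise again

# Midpoint test: for a convex function, the value at the midpoint of two
# parameter pairs is no larger than the average of the values at the pairs.
p = (1.0, 2.0)
q = (0.0, 1.0)
mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
at_midpoint = sse(*mid, x, y)
average_of_ends = (sse(*p, x, y) + sse(*q, x, y)) / 2
print(at_midpoint <= average_of_ends)  # True
```

Passing the test for one pair of points does not prove convexity; it only illustrates the property that the figure depicts.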


The parameters at the minimum point are obtained from calculus by setting the first derivatives of the objective function to $0$, since the gradient (slope) is always $0$ at a minimum. We have two unknown parameters, the intercept ($\beta_0$) and the slope ($\beta_1$), so we take the partial derivative of _SSE_ with respect to each of them separately. We then set both partial derivatives to $0$ and solve the resulting pair of equations for $\beta_0$ and $\beta_1$.


Taking the partial derivative with respect to $\beta_0$:

$$\frac{\partial\ \text{SSE}}{\partial \beta_0}  = \frac{\partial }{\partial \beta_0}\sum(y_i-(\beta_0+\beta_1x_i))^2$$

Note that the derivative of the sum is the sum of the derivatives. So, we can take the derivative inside the summation.

$$\frac{\partial }{\partial \beta_0}\sum(y_i-(\beta_0+\beta_1x_i))^2 = \sum\frac{\partial }{\partial \beta_0}(y_i-(\beta_0+\beta_1x_i))^2 $$

Now, applying the power rule and the chain rule (the derivative of the inner term with respect to $\beta_0$ is $-1$), we get:


$$= \sum2(y_i-(\beta_0+\beta_1x_i))(-1) $$

$$=-2\sum(y_i-(\beta_0+\beta_1x_i)) \qquad \text{(1)}$$
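As a sanity check on the derivative with respect to $\beta_0$, the analytic expression can be compared against a central finite difference. The function names, the toy data, the evaluation point, and the step size `h` below are all assumptions made for this sketch.

```python
def sse(beta0, beta1, x, y):
    """Sum of squared errors of the line beta0 + beta1 * x on the data (x, y)."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

def dsse_dbeta0(beta0, beta1, x, y):
    """Analytic partial derivative of SSE with respect to beta0."""
    return -2 * sum(yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y))

# Hypothetical toy data (not the Advertising dataset).
x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 4.0, 8.0, 8.0]

# Arbitrary point at which to compare the two derivatives.
beta0, beta1 = 0.0, 1.0
h = 1e-5  # small step for the central difference

numeric = (sse(beta0 + h, beta1, x, y) - sse(beta0 - h, beta1, x, y)) / (2 * h)
analytic = dsse_dbeta0(beta0, beta1, x, y)

print(analytic)                        # -28.0
print(abs(numeric - analytic) < 1e-4)  # True
```

The two values agree up to floating-point rounding, which is what we expect if the power rule and chain rule were applied correctly.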



Now, with respect to $\beta_1$:


$$\frac{\partial\ {\text{SSE}} }{\partial \beta_1} = \frac{\partial }{\partial \beta_1}\sum(y_i-(\beta_0+\beta_1x_i))^2$$

Again, the derivative of the sum is the sum of the derivatives, so we take the derivative inside the summation.

$$\frac{\partial }{\partial \beta_1}\sum(y_i-(\beta_0+\beta_1x_i))^2 = \sum\frac{\partial }{\partial \beta_1}(y_i-(\beta_0+\beta_1x_i))^2 $$


Applying the power rule, the $2$ comes out front and the exponent becomes $1$. Applying the chain rule, we also multiply by the derivative of the inner term with respect to $\beta_1$, which is $-x_i$.

$$= \sum2(y_i-(\beta_0+\beta_1x_i))(-x_i) $$

Cleaning up a bit, 

$$= -2\sum x_i(y_i-(\beta_0+\beta_1x_i)) \qquad \text{(2)}$$