## Derivation of OLS estimates for simple linear regression

### Summary
Simple derivation of ordinary least squares (OLS) parameter estimates for simple linear regression.

### Why write this?
Some machine learning courses which introduce simple linear regression with ordinary least squares (OLS) estimates may initially gloss over the derivation of the model's parameters. This is written for those interested in the details of that derivation.

Though there are certainly other examples elsewhere, this notebook aims to be as clear as possible and refers to related concepts for convenience. The end goal is a step-by-step derivation of the OLS estimates of simple linear regression for you to reference and use elsewhere.

### Background
Say there is a training set of $m\in\mathbb{N}$ paired training examples of the form $(x^{(i)}, y^{(i)})$, where $x^{(i)},y^{(i)}\in\mathbb{R}$ represent the values of the independent and dependent variable for the $i$-th training example, respectively. Using simple linear regression, the $i$-th value of the dependent variable (the target) can be expressed as $y^{(i)}=\hat y^{(i)}+\epsilon^{(i)}$. Here $\hat y^{(i)}=\hat\beta_0+\hat\beta_1 x^{(i)}$ is the estimated target value for the $i$-th training example, formed with an estimated intercept $\hat\beta_0\in\mathbb{R}$ and an estimated slope $\hat\beta_1\in\mathbb{R}$, and $\epsilon^{(i)} = y^{(i)} - \hat y^{(i)}$ is the residual or error of the $i$-th training example: the difference between the actual target value, $y^{(i)}$, and the estimated target value, $\hat y^{(i)}$.
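
As a concrete illustration of this notation, the following minimal sketch computes estimated targets and residuals. The data values and the parameter guesses are made up for illustration only; they are not least squares estimates.

```python
# Hypothetical toy data and parameter guesses, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
beta0_hat, beta1_hat = 0.5, 1.8

# Estimated targets: y_hat_i = beta0_hat + beta1_hat * x_i
y_hat = [beta0_hat + beta1_hat * xi for xi in x]

# Residuals: eps_i = y_i - y_hat_i
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]

# By construction, y_i = y_hat_i + eps_i for every training example.
reconstructed = [yhi + ei for yhi, ei in zip(y_hat, residuals)]
```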

### Partial derivatives of the cost function
Forming least squares estimates of the regression parameters $\hat\beta_0$ and $\hat\beta_1$ involves minimizing the average squared error across the $m$ examples, $\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)^2$, which is referred to as the cost function, $J(\hat\beta_0, \hat\beta_1)$. As the cost function is convex and differentiable, it is minimized where its partial derivative with respect to each parameter is equal to $0$. These partial derivatives are derived in the following subsections.
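
The cost function can be written directly from its definition. This is a minimal sketch; the function name `cost` and the toy data are assumptions made for illustration.

```python
def cost(beta0_hat, beta1_hat, x, y):
    """Average squared error J(beta0_hat, beta1_hat) over the m examples."""
    m = len(x)
    return sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y)) / m


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

perfect = cost(1.0, 2.0, x, y)  # parameters that reproduce the data: cost is 0.0
worse = cost(0.0, 2.0, x, y)    # intercept off by 1: every residual is 1, cost is 1.0
```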

#### Solve for partial derivative with respect to $\hat\beta_0$
To solve for the partial derivative of the cost function with respect to $\hat\beta_0$, state the partial derivative as
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=\frac{\partial}{\partial\hat\beta_0}\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)^2$ </b></div>
<br>

As the derivative of a sum is the sum of the derivatives and $\frac{1}{m}$ is a constant, move the partial derivative inside the summation
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=\frac{1}{m}\sum\limits^m_{i=1} \frac{\partial}{\partial\hat\beta_0} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)^2$ </b></div>
<br>

Rewrite $\frac{\partial}{\partial\hat\beta_0} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)^2$ above using the [chain rule](https://en.wikipedia.org/wiki/Chain_rule)
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=\frac{1}{m}\sum\limits^m_{i=1} 2\big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big) \frac{\partial}{\partial\hat\beta_0} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big) $ </b></div>
<br>

As $\frac{\partial}{\partial\hat\beta_0} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big) = -1$, rewrite the above as
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=\frac{1}{m}\sum\limits^m_{i=1} -2\big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)$ </b></div>
<br>

Finally, as $-2$ in $\sum\limits^m_{i=1} -2\big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)$ is a constant, rewrite this outside of the summation
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=-2\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)$ </b></div>
<br>

#### Solve for partial derivative with respect to $\hat\beta_1$
To solve for the partial derivative of the cost function with respect to $\hat\beta_1$, state the partial derivative as
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_1}J(\hat\beta_0, \hat\beta_1)=\frac{\partial}{\partial\hat\beta_1}\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)^2$ </b></div>
<br>

As the derivative of a sum is the sum of the derivatives and $\frac{1}{m}$ is a constant, move the partial derivative inside the summation
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_1}J(\hat\beta_0, \hat\beta_1)=\frac{1}{m}\sum\limits^m_{i=1} \frac{\partial}{\partial\hat\beta_1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)^2$ </b></div>
<br>

Rewrite $\frac{\partial}{\partial\hat\beta_1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)^2$ above using the [chain rule](https://en.wikipedia.org/wiki/Chain_rule)
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_1}J(\hat\beta_0, \hat\beta_1)=\frac{1}{m}\sum\limits^m_{i=1} 2\big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big) \frac{\partial}{\partial\hat\beta_1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big) $ </b></div>
<br>

As $\frac{\partial}{\partial\hat\beta_1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big) = -x^{(i)}$, rewrite the above as
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_1}J(\hat\beta_0, \hat\beta_1)=\frac{1}{m}\sum\limits^m_{i=1} -2\big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)x^{(i)}$ </b></div>
<br>

Finally, as $-2$ in $\sum\limits^m_{i=1} -2\big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)x^{(i)}$ is a constant, rewrite this outside of the summation
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_1}J(\hat\beta_0, \hat\beta_1)=-2\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)x^{(i)}$ </b></div>
<br>

#### Partial derivatives
Therefore, the partial derivatives of the cost function with respect to the parameters $\hat\beta_0$ and $\hat\beta_1$ are
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=-2\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)$ </b></div>
<br>
<div align="center"><b> $\frac{\partial}{\partial\hat\beta_1}J(\hat\beta_0, \hat\beta_1)=-2\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)x^{(i)}$ </b></div>
<br>
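
The two partial derivatives can be sanity-checked numerically. The sketch below is not part of the derivation: it assumes a small made-up dataset and hypothetical helper names (`cost`, `analytic_gradient`, `numeric_gradient`), and compares the formulas above against central finite differences of the cost function.

```python
def cost(b0, b1, x, y):
    """Average squared error J(b0, b1)."""
    m = len(x)
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / m


def analytic_gradient(b0, b1, x, y):
    """Partial derivatives of J with respect to b0 and b1, as derived above."""
    m = len(x)
    d_b0 = -2.0 / m * sum(yi - b0 - b1 * xi for xi, yi in zip(x, y))
    d_b1 = -2.0 / m * sum((yi - b0 - b1 * xi) * xi for xi, yi in zip(x, y))
    return d_b0, d_b1


def numeric_gradient(b0, b1, x, y, h=1e-6):
    """Central finite-difference approximation of the same partial derivatives."""
    d_b0 = (cost(b0 + h, b1, x, y) - cost(b0 - h, b1, x, y)) / (2 * h)
    d_b1 = (cost(b0, b1 + h, x, y) - cost(b0, b1 - h, x, y)) / (2 * h)
    return d_b0, d_b1


# Hypothetical data and an arbitrary point at which to compare the gradients.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
analytic = analytic_gradient(0.5, 1.8, x, y)
numeric = numeric_gradient(0.5, 1.8, x, y)
```

If the derivation is correct, `analytic` and `numeric` should agree up to floating-point rounding at any choice of parameters.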

### Least squares estimates
To minimize the cost function, find the value of each parameter when the partial derivative of the cost function with respect to that parameter is equal to $0$.

#### Solve for least squares estimate of $\hat\beta_0$
To solve for the least squares estimate of $\hat\beta_0$, set $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=0$ as
<br>
<div align="center"><b> $0=-2\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)$ </b></div>
<br>

Divide both sides of the above by $-2$
<br>
<div align="center"><b> $0=\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)$ </b></div>
<br>

Multiply both sides of the above by $m$
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)$ </b></div>
<br>

Distribute the summation across the right-hand side of the above
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} y^{(i)} - \sum\limits^m_{i=1} \hat\beta_0 - \sum\limits^m_{i=1} \hat\beta_1 x^{(i)}$ </b></div>
<br>

Add $\sum\limits^m_{i=1} \hat\beta_0$ to both sides of the above
<br>
<div align="center"><b> $\sum\limits^m_{i=1} \hat\beta_0=\sum\limits^m_{i=1} y^{(i)} - \sum\limits^m_{i=1} \hat\beta_1 x^{(i)}$ </b></div>
<br>

As $\hat\beta_0$ is a constant, rewrite the left-hand side of the above as $\sum\limits^m_{i=1} \hat\beta_0=m\hat\beta_0$
<br>
<div align="center"><b> $m\hat\beta_0=\sum\limits^m_{i=1} y^{(i)} - \sum\limits^m_{i=1} \hat\beta_1 x^{(i)}$ </b></div>
<br>

As $\hat\beta_1$ in $\sum\limits^m_{i=1} \hat\beta_1 x^{(i)}$ is a constant, rewrite this outside the summation
<br>
<div align="center"><b> $m\hat\beta_0=\sum\limits^m_{i=1} y^{(i)} - \hat\beta_1\sum\limits^m_{i=1} x^{(i)}$ </b></div>
<br>

Multiply both sides of the above by $\frac{1}{m}$
<br>
<div align="center"><b> $\hat\beta_0=\frac{1}{m}\sum\limits^m_{i=1} y^{(i)} - \hat\beta_1\frac{1}{m}\sum\limits^m_{i=1} x^{(i)}$ </b></div>
<br>

Finally, note that as the average $y$ value is equal to $\bar y=\frac{1}{m}\sum\limits^m_{i=1} y^{(i)}$ and the average $x$ value is equal to $\bar x=\frac{1}{m}\sum\limits^m_{i=1} x^{(i)}$, the above can be simplified as
<br>
<div align="center"><b> $\hat\beta_0=\bar y - \hat\beta_1\bar x$ </b></div>
<br>
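
One consequence of this result is that, for any fixed slope, choosing $\hat\beta_0=\bar y - \hat\beta_1\bar x$ makes the residuals sum to zero, which is exactly the condition $\frac{\partial}{\partial\hat\beta_0}J(\hat\beta_0, \hat\beta_1)=0$. The following minimal sketch checks this on made-up data with an arbitrary slope that is not the least squares slope.

```python
# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.5, 3.0, 8.0]
m = len(x)

x_bar = sum(x) / m
y_bar = sum(y) / m

beta1_hat = 1.5                        # arbitrary slope, not the least squares one
beta0_hat = y_bar - beta1_hat * x_bar  # intercept from the formula above

# Sum of residuals; this should be zero up to floating-point rounding.
residual_sum = sum(yi - beta0_hat - beta1_hat * xi for xi, yi in zip(x, y))
```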

#### Solve for least squares estimate of $\hat\beta_1$
To solve for the least squares estimate of $\hat\beta_1$, set $\frac{\partial}{\partial\hat\beta_1}J(\hat\beta_0, \hat\beta_1)=0$ as
<br>
<div align="center"><b> $0=-2\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)x^{(i)}$ </b></div>
<br>

Divide both sides of the above by $-2$
<br>
<div align="center"><b> $0=\frac{1}{m}\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)x^{(i)}$ </b></div>
<br>

Multiply both sides of the above by $m$
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} \big(y^{(i)}-\hat\beta_0-\hat\beta_1 x^{(i)}\big)x^{(i)}$ </b></div>
<br>

Substitute the estimate $\hat\beta_0=\bar y - \hat\beta_1\bar x$, derived in the previous subsection, into the right-hand side of the above
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} \big(y^{(i)}-\big(\bar y - \hat\beta_1\bar x\big)-\hat\beta_1 x^{(i)}\big)x^{(i)}$ </b></div>
<br>

Expand $\big(y^{(i)}-\big(\bar y - \hat\beta_1\bar x\big)-\hat\beta_1 x^{(i)}\big)x^{(i)}$ in the right-hand side of the above
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} \Big(y^{(i)}x^{(i)}-\big(\bar y - \hat\beta_1\bar x\big)x^{(i)}-\hat\beta_1 x^{(i)}x^{(i)}\Big)$ </b></div>
<br>

Distribute the summation across the right-hand side of the above
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} y^{(i)}x^{(i)}-\sum\limits^m_{i=1} \big(\bar y - \hat\beta_1\bar x\big)x^{(i)}-\sum\limits^m_{i=1}\hat\beta_1 x^{(i)}x^{(i)}$ </b></div>
<br>

Expand $\big(\bar y - \hat\beta_1\bar x\big)x^{(i)}$ in the right-hand side of the above
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} y^{(i)}x^{(i)}-\sum\limits^m_{i=1} \big(\bar y x^{(i)} - \hat\beta_1\bar x x^{(i)}\big) -\sum\limits^m_{i=1}\hat\beta_1 x^{(i)}x^{(i)}$ </b></div>
<br>

Distribute the summation across $\sum\limits^m_{i=1} \big(\bar y x^{(i)} - \hat\beta_1\bar x x^{(i)}\big)$ in the right-hand side of the above, noting that the leading minus sign applies to both terms
<br>
<div align="center"><b> $0=\sum\limits^m_{i=1} y^{(i)}x^{(i)}-\sum\limits^m_{i=1} \bar y x^{(i)} + \sum\limits^m_{i=1} \hat\beta_1\bar x x^{(i)} -\sum\limits^m_{i=1}\hat\beta_1 x^{(i)}x^{(i)}$ </b></div>
<br>

Factor $\sum\limits^m_{i=1} y^{(i)}x^{(i)}-\sum\limits^m_{i=1} \bar y x^{(i)}$ as $\sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) x^{(i)}$ in the right-hand side of the above
<br>
<div align="center"><b> $0= \sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) x^{(i)} + \sum\limits^m_{i=1} \hat\beta_1\bar x x^{(i)} -\sum\limits^m_{i=1}\hat\beta_1 x^{(i)}x^{(i)}$ </b></div>
<br>

Rearrange the terms $+\sum\limits^m_{i=1} \hat\beta_1\bar x x^{(i)}$ and $-\sum\limits^m_{i=1}\hat\beta_1 x^{(i)}x^{(i)}$ to make subsequent factoring clearer
<br>
<div align="center"><b> $0= \sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) x^{(i)} -\sum\limits^m_{i=1}\hat\beta_1 x^{(i)}x^{(i)} + \sum\limits^m_{i=1} \hat\beta_1\bar x x^{(i)} $ </b></div>
<br>

Factor $-\sum\limits^m_{i=1}\hat\beta_1 x^{(i)}x^{(i)} + \sum\limits^m_{i=1} \hat\beta_1\bar x x^{(i)}$ as $-\sum\limits^m_{i=1}\hat\beta_1 \big(x^{(i)}-\bar x\big)x^{(i)}$ in the right-hand side of the above
<br>
<div align="center"><b> $0= \sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) x^{(i)} -\sum\limits^m_{i=1}\hat\beta_1 \big(x^{(i)}-\bar x\big)x^{(i)} $ </b></div>
<br>

Add $\sum\limits^m_{i=1}\hat\beta_1 \big(x^{(i)}-\bar x\big)x^{(i)}$ to both sides of the above
<br>
<div align="center"><b> $\sum\limits^m_{i=1}\hat\beta_1 \big(x^{(i)}-\bar x\big)x^{(i)}= \sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) x^{(i)} $ </b></div>
<br>

As $\hat\beta_1$ in $\sum\limits^m_{i=1}\hat\beta_1 \big(x^{(i)}-\bar x\big)x^{(i)}$ is a constant, rewrite this outside the summation
<br>
<div align="center"><b> $\hat\beta_1\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)x^{(i)}= \sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) x^{(i)} $ </b></div>
<br>

Note that $\sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big)x^{(i)}=\sum\limits^m_{i=1}\big(y^{(i)}-\bar y\big)\big(x^{(i)}-\bar x\big)$. To see this, expand $\sum\limits^m_{i=1}\big(y^{(i)}-\bar y\big)\big(x^{(i)}-\bar x\big)$ to $\sum\limits^m_{i=1} \big(y^{(i)}x^{(i)}-y^{(i)}\bar x-\bar y x^{(i)} + \bar y \bar x\big)$ and distribute the summation to get $\sum\limits^m_{i=1} y^{(i)}x^{(i)}- \sum\limits^m_{i=1} y^{(i)}\bar x- \sum\limits^m_{i=1} \bar y x^{(i)} + \sum\limits^m_{i=1} \bar y \bar x$. As $\sum\limits^m_{i=1} y^{(i)}\bar x = m\bar y\bar x = \sum\limits^m_{i=1} \bar y \bar x$, the second and fourth terms cancel, so the expansion reduces to $\sum\limits^m_{i=1} y^{(i)}x^{(i)}- \sum\limits^m_{i=1} \bar y x^{(i)}$, which factors as $\sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big)x^{(i)}$. It can be shown similarly that $\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)x^{(i)}=\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)\big(x^{(i)}-\bar x\big)=\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)^2$. Using these two equalities, rewrite the previous equation as
<br>
<div align="center"><b> $\hat\beta_1\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)^2= \sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) \big(x^{(i)} - \bar x\big) $ </b></div>
<br>
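
The two equalities used in this step can also be checked numerically. This is a minimal sketch on made-up data; the variable names are assumptions for illustration.

```python
# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.5, 3.0, 8.0]
m = len(x)
x_bar = sum(x) / m
y_bar = sum(y) / m

# sum((y_i - y_bar) * x_i) versus sum((y_i - y_bar) * (x_i - x_bar))
lhs_xy = sum((yi - y_bar) * xi for xi, yi in zip(x, y))
rhs_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))

# sum((x_i - x_bar) * x_i) versus sum((x_i - x_bar) ** 2)
lhs_xx = sum((xi - x_bar) * xi for xi in x)
rhs_xx = sum((xi - x_bar) ** 2 for xi in x)
```

Both pairs should agree up to floating-point rounding, because the deviations from the mean sum to zero.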

Finally, divide both sides of the above by $\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)^2$, which is nonzero provided the $x^{(i)}$ are not all equal
<br>
<div align="center"><b> $\hat\beta_1= \frac{\sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) \big(x^{(i)} - \bar x\big)}{\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)^2} $ </b></div>
<br>

#### Least squares estimates
Therefore, the least squares estimates of the parameters $\hat\beta_0$ and $\hat\beta_1$ of the simple linear regression are
<br>
<div align="center"><b> $\hat\beta_0=\bar y - \hat\beta_1\bar x$ </b></div>
<br>
<div align="center"><b> $\hat\beta_1= \frac{\sum\limits^m_{i=1} \big(y^{(i)}-\bar y\big) \big(x^{(i)} - \bar x\big)}{\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)^2} $ </b></div>
<br>
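
The two estimates translate directly into code. The following is a minimal sketch in plain Python, not a production implementation; the function name `ols_fit` and the toy data are assumptions made for illustration.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) using the closed-form least squares estimates."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m

    # Denominator of the slope: sum of squared deviations of x from its mean.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    # Numerator of the slope: sum of products of deviations of y and x.
    s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))

    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
beta0_hat, beta1_hat = ols_fit(x, y)
print(beta0_hat, beta1_hat)  # prints 1.0 2.0
```

The slope is computed first because the intercept estimate depends on it. The guard on `s_xx` reflects that the slope formula divides by $\sum\limits^m_{i=1} \big(x^{(i)}-\bar x\big)^2$.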