# Linear Regression 

We let the set of valid hypotheses be all functions of the form
$$ f(x | \Theta) = \theta_0 + x_1 \theta_1 + \ldots + x_n \theta_n $$ 


## Goal

We would like to minimize the loss function

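$$ \mathcal{L}(\Theta | X, Y) = \frac{1}{2} \sum_{i = 1}^N (\Theta^T x_i - y_i)^2 $$

where $x_i$ is the $i$-th input (with a leading 1 so that $\theta_0$ acts as the intercept) and $y_i$ is its target.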
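As a rough illustration, the hypothesis and the least-squares loss (half the sum of squared residuals) could be written as below. This is a minimal NumPy sketch: the names `predict` and `loss` and the toy data are assumptions made for the example, not part of the notes.

```python
import numpy as np

def predict(theta, X):
    """Evaluate theta_0 + x_1*theta_1 + ... + x_n*theta_n for each row of X."""
    # theta[0] is the intercept; theta[1:] are the feature weights.
    return theta[0] + X @ theta[1:]

def loss(theta, X, y):
    """Half the sum of squared residuals over all examples."""
    residuals = predict(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# Toy data generated exactly by y = 1 + 2x (assumed for illustration).
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

exact = loss(np.array([1.0, 2.0]), X, y)  # parameters that fit perfectly
off = loss(np.array([0.0, 2.0]), X, y)    # intercept off by one
```

The loss is zero only when every residual vanishes, and it grows quadratically as the parameters move away from that point.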

## Gradient Descent

An iterative method for finding $\Theta$ is gradient descent:

$$ \theta_i := \theta_i - \alpha  \frac{\partial J(\Theta)}{\partial \theta_i}$$

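Here $J(\Theta)$ is the loss being minimized, $\Theta$ is initialized to a starting guess $\Theta_0$, and $\alpha$ is the learning rate. The choice of $\Theta_0$ and $\alpha$ can affect whether the method converges and which $\Theta$ is returned. In general, gradient descent can get trapped in local minima; the least-squares loss is convex, however, so any local minimum it reaches is also a global one.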


### Batch 

Every update iterates over all $m$ training examples, which is computationally expensive for large data sets. Writing $J_i(\Theta)$ for the loss on the $i$-th example, the update is

$$ \theta_j := \theta_j - \alpha  \sum_{i=1}^{m} \frac{\partial J_i(\Theta)}{\partial \theta_j}$$

![alt text](assets/gd.svg "Batch Gradient Descent")
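The batch update can be sketched in NumPy as follows. The function name `batch_gradient_descent`, the learning rate, the iteration count, the zero initial guess, and the toy data are all assumptions chosen so that the example converges; they are not prescribed by the notes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=2000):
    """Minimize 1/2 * sum_i (x_i . theta - y_i)^2 with full-batch updates."""
    # Prepend a column of ones so theta[0] plays the role of the intercept.
    Xb = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(Xb.shape[1])       # initial guess (assumed to be zero)
    for _ in range(n_iters):
        residuals = Xb @ theta - y      # touches every example
        gradient = Xb.T @ residuals     # sum of the per-example gradients
        theta = theta - alpha * gradient
    return theta

# Toy data generated exactly by y = 1 + 2x (assumed for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X[:, 0]
theta = batch_gradient_descent(X, y)
```

Each iteration costs one pass over the whole data set, which is the expense noted above. A learning rate that is too large for the data makes the iterates diverge rather than converge.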



### Stochastic 

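Another approach, which takes a less direct path to convergence but is generally faster, is stochastic gradient descent. The parameters are updated after each individual example:

for  $i = 1, \ldots, m$

$$ \theta_j := \theta_j - \alpha \frac{\partial J_i(\Theta)}{\partial \theta_j}$$

where $J_i(\Theta)$ is the loss on the $i$-th example.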

![alt text](assets/sgd.png "Stochastic Gradient Descent")
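A comparable sketch of the stochastic variant is shown below. Visiting the examples in a freshly shuffled order each epoch is an assumption of this example (the loop above visits them in index order), as are the function name, the hyperparameters, and the toy data.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=1000, seed=0):
    """Update theta after every single example instead of after a full pass."""
    Xb = np.column_stack([np.ones(len(X)), X])  # column of ones for the intercept
    theta = np.zeros(Xb.shape[1])               # initial guess (assumed to be zero)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        # Assumption: shuffle the visiting order on every epoch.
        for i in rng.permutation(len(y)):
            residual = Xb[i] @ theta - y[i]     # error on one example only
            theta = theta - alpha * residual * Xb[i]
    return theta

# Toy data generated exactly by y = 1 + 2x (assumed for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X[:, 0]
theta = stochastic_gradient_descent(X, y)
```

Because this toy data is noise-free, the iterates settle on the exact parameters. With noisy data and a fixed learning rate, the iterates would typically keep fluctuating around the minimum instead.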


## Trace Operator

$tr$ is a linear operator, defined for an $N \times N$ matrix $A$ as
$$ tr(A) = \sum_{i =1}^{N} A_{ii} $$

It has several useful properties.


\begin{align}
tr(AB) &= tr(BA) \\
tr(A) &= tr(A^T) \\
tr(A + B) &= tr(A)  + tr(B)\\
tr(\alpha A) &= \alpha \, tr(A)\\
tr(a) &= a \quad \text{for a scalar } a\\
\nabla_A  tr(AB) &= B^T
\end{align}
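These identities are easy to spot-check numerically. The sketch below uses random $3 \times 3$ matrices and a finite-difference estimate of the gradient; the matrix size, the scalar `c`, and the step `eps` are arbitrary choices made for the example. The scalar rule is checked in the form $tr(cA) = c \, tr(A)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
c = 2.5

checks = {
    "tr(AB) = tr(BA)": np.isclose(np.trace(A @ B), np.trace(B @ A)),
    "tr(A) = tr(A^T)": np.isclose(np.trace(A), np.trace(A.T)),
    "tr(A + B) = tr(A) + tr(B)": np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)),
    "tr(cA) = c tr(A)": np.isclose(np.trace(c * A), c * np.trace(A)),
}

# Finite-difference estimate of the gradient of tr(AB) with respect to A:
# perturb one entry of A at a time and measure the change in tr(AB).
eps = 1e-6
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad[i, j] = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps
checks["grad_A tr(AB) = B^T"] = np.allclose(grad, B.T, atol=1e-4)
```

A numerical check on one random instance is evidence, not a proof, but it is a quick way to catch a misremembered transpose.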


## Closed Form Solution for  Linear Regression


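Let us put $X,Y$ in matrix notation, with one training example per row. Each $x_i$ includes a leading 1 so that $\theta_0$ acts as the intercept.

\begin{align}
X &= \begin{bmatrix}
       x_{1}^T \\
       x_{2}^T \\
       \vdots \\
       x_{m}^T
     \end{bmatrix}
\end{align}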

\begin{align}
Y &= \begin{bmatrix}
       y_{1} \\
       y_{2} \\
       \vdots \\
       y_{m}
     \end{bmatrix}
\end{align}


Let us define the new cost function as


$$ J(\Theta) = \frac{1}{2} (X \Theta - Y) ^T (X \Theta - Y) $$

Now let us minimize the cost function with respect to $\Theta$

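\begin{align}
\nabla_{\Theta} J(\Theta) &= \nabla_\Theta \frac{1}{2} (X \Theta - Y) ^T (X \Theta - Y)\\
&= \frac{1}{2} \nabla_\Theta \left( \Theta^T X^T X \Theta  - \Theta^T X^T Y - Y^T X \Theta + Y^T Y \right)\\
&= \frac{1}{2} \nabla_\Theta \left( \Theta^T X^T X \Theta - 2 Y^T X \Theta + Y^T Y \right)\\
&= \frac{1}{2} \left( 2 X^T X \Theta - 2 X^T Y \right)\\
&= X^T X \Theta - X^T Y
\end{align}

The third line uses the fact that $\Theta^T X^T Y$ is a scalar and therefore equals its own transpose $Y^T X \Theta$.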
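One way to gain confidence in the result $\nabla_\Theta J(\Theta) = X^T X \Theta - X^T Y$ is to compare it with a central finite-difference estimate. The sketch below does this on random data; the helper names `J` and `grad_J`, the data shapes, and the step size `eps` are assumptions made for the example.

```python
import numpy as np

def J(theta, X, Y):
    """Cost in matrix form: 1/2 * (X theta - Y)^T (X theta - Y)."""
    r = X @ theta - Y
    return 0.5 * r @ r

def grad_J(theta, X, Y):
    """Analytic gradient from the derivation: X^T X theta - X^T Y."""
    return X.T @ X @ theta - X.T @ Y

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
theta = rng.normal(size=3)

# Central differences along each coordinate direction e.
eps = 1e-6
numeric = np.array([
    (J(theta + eps * e, X, Y) - J(theta - eps * e, X, Y)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_error = np.max(np.abs(numeric - grad_J(theta, X, Y)))
```

Since $J$ is quadratic in $\Theta$, the central difference is exact up to rounding, so `max_abs_error` should be tiny.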


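Setting this gradient to zero allows us to solve for the optimal $\Theta = (X^T X)^{-1} (X^T Y)$, provided $X^T X$ is invertible.

## Closed Form Solution Single Variable Case

With a single input variable the hypothesis is $f(x) = \theta_0 + \theta_1 x$, and the quantity to minimize is

$$ F(\theta_1,\theta_0) = \frac{1}{2} \sum_{i=1}^{n} (\theta_0 + \theta_1 x_i - y_i)^2 $$

where $n$ is the number of data points.

#### Let's start with $\theta_0$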

$$ \frac{\partial F(\theta_1,\theta_0)}{\partial \theta_0} = n\theta_0  + \theta_1 \sum_i x_i - \sum_i y_i $$
$$ 0 = n\theta_0  + \theta_1 \sum_i x_i - \sum_i y_i $$
$$ \theta_0 = \frac{1}{n} [\sum_i y_i - \theta_1 \sum_i x_i]$$
$$ \theta_0 = \overline{Y} - \theta_1 \overline{X}$$

where $\overline{Y}$ refers to the average value of $y$ and $\overline{X}$ refers to the average value of $x$.

#### Now solve for $\theta_1$

$$ \frac{\partial F(\theta_1,\theta_0)}{\partial \theta_1} = \sum_i \left( - x_i y_i + \theta_0 x_i + \theta_1 x_i^2 \right) $$
$$ 0 = \sum_i \left( -x_i y_i + \theta_0 x_i + \theta_1 x_i^2 \right) $$

Substitute the value calculated for $\theta_0$, using $\sum_i x_i = n \overline{X}$, and solve.

$$ 0 = -\sum_i x_i y_i + (\overline{Y} - \theta_1 \overline{X}) \, n \overline{X} + \theta_1 \sum_i x_i^2 $$

$$ \theta_1 = \frac{\sum_i y_i x_i - n \overline{X}\overline{Y}}{\sum_i x_i^2 - n \overline{X}^2}$$
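The single-variable formulas should agree with the matrix solution $\Theta = (X^T X)^{-1} (X^T Y)$ once $X$ is given a column of ones. The following sketch checks this on a small made-up data set; the data values are purely illustrative. It solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical preference and an assumption of this example.

```python
import numpy as np

# Illustrative data, roughly following y = 1 + 2x (made up for this example).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Single-variable closed form: slope first, then intercept.
theta_1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
theta_0 = y_bar - theta_1 * x_bar

# Matrix form: solve X^T X theta = X^T y, with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x])
theta_matrix = np.linalg.solve(X.T @ X, X.T @ y)
```

Here `theta_matrix[0]` and `theta_matrix[1]` correspond to $\theta_0$ and $\theta_1$, and the two routes should match to within floating-point error.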

