# Week 1: Linear Models for Regression 

ML is commonly divided into supervised learning (fitting a model to labeled training set data and then predicting on test set data) and unsupervised learning (finding structure in data without labels). Linear regression is supervised learning in which the response in the training set is continuous. If the response is discrete, the problem falls under classification.

Given data (training set) $y_{i}$ (response variable) and ${\vec x_{i}}$ (a column vector of data $x_{i1}, \dots, x_{ip}$, where the $p$ components are the features or attributes of the independent variable), we assume $y_{i} = f({\vec x_{i}})$ and want to find an approximation for this function $f$. We then predict $y$ for test set data using the calculated $f$: $y_{j} = f({\vec x_{j}})$. We want to do this because in general $y_{j}$ is expensive to measure otherwise.

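The linear model, which is the simplest assumption, says $y = \beta_{o} + \sum_{j=1}^{p} x_{j} \beta_{j}$, or in vector form $y = {\vec x^{T}} {\vec\beta}$, where ${\vec x}$ is augmented with a leading 1 so that the intercept $\beta_{o}$ is included in ${\vec\beta}$.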
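The vector form only matches the sum form if a constant 1 is prepended to ${\vec x}$. A minimal sketch of that bookkeeping follows; the function name `predict` and the numbers are hypothetical, chosen only for illustration:

```python
def predict(x, beta):
    """Linear model: beta[0] + sum over j of x[j-1] * beta[j]."""
    # Prepend a constant 1 so the intercept beta[0] is absorbed
    # into the dot product x^T beta.
    x_aug = [1.0] + list(x)
    if len(x_aug) != len(beta):
        raise ValueError("beta must have one more entry than x")
    return sum(xj * bj for xj, bj in zip(x_aug, beta))


# Hypothetical coefficients: intercept first, then p = 2 slopes.
beta = [1.0, 2.0, -0.5]
x = [3.0, 4.0]
y_hat = predict(x, beta)  # 1 + 2*3 - 0.5*4
```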


## Linear Regression Example 

Take $y = \beta_{o} + \beta_{1} x + \epsilon$, where $y$ is the dependent variable, $x$ is a scalar independent variable, $\epsilon$ is the noise or error term, $\beta_{o}$ is the intercept, and $\beta_{1}$ is the slope. Here we assume that the variance of the noise does not depend on $x$ (homoscedastic). If it does, we are in the heteroscedastic case.

Given $(x_{1}, y_{1}), (x_{2}, y_{2}), \dots$, we calculate the residual as follows: $\epsilon_{i} = y_{i} - \hat y_{i}$, where $\hat y_{i} = \beta_{o} + \beta_{1} x_{i}$ is the fitted value. The size of the residuals tells us how good the fit is.

We want to test for "goodness of fit" based on the **residual sum of squares (RSS)**: $\phi = RSS = \sum_{i=1}^{N} (y_{i} - \beta_{o} - \beta_{1} x_{i})^{2}$.
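As a rough illustration of the sum above, here is a small sketch that evaluates it for a candidate intercept and slope. The data are made up so that the true line is $y = 1 + 2x$ with no noise, and the function name `rss` is an assumption of this example:

```python
def rss(xs, ys, beta0, beta1):
    """Residual sum of squares for the line beta0 + beta1 * x."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical noise-free data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

rss_exact = rss(xs, ys, 1.0, 2.0)  # the true line leaves no residual
rss_off = rss(xs, ys, 0.0, 2.0)    # wrong intercept: every residual is 1
```

A worse choice of coefficients gives a larger value, which is what the least squares criterion exploits.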

The **least squares** approach chooses $\beta_{o}$ and $\beta_{1}$ to minimize the RSS.

### Minimization 

$\frac{\partial \phi}{\partial \beta_{o}} = \sum_{i=1}^{N} 2(y_{i} - \beta_{o} - \beta_{1} x_{i})(-1) = 0$

$\frac{\partial \phi}{\partial \beta_{1}} = \sum_{i=1}^{N} 2(y_{i} - \beta_{o} - \beta_{1} x_{i})(-x_{i}) = 0$
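One way to sanity-check the two partial derivatives is to compare them against central finite differences of $\phi$ at an arbitrary point. The sketch below does this; the data, the test point, and the step size `h` are all hypothetical choices:

```python
def rss(xs, ys, b0, b1):
    """phi: residual sum of squares for the line b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


def rss_gradient(xs, ys, b0, b1):
    """Analytic partial derivatives of phi with respect to b0 and b1."""
    d0 = sum(2 * (y - b0 - b1 * x) * (-1) for x, y in zip(xs, ys))
    d1 = sum(2 * (y - b0 - b1 * x) * (-x) for x, y in zip(xs, ys))
    return d0, d1


# Hypothetical data and an arbitrary (non-optimal) test point.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
b0, b1 = 0.5, 1.5

d0, d1 = rss_gradient(xs, ys, b0, b1)

# Central differences; phi is quadratic, so these agree with the
# analytic values up to floating-point rounding.
h = 1e-6
num_d0 = (rss(xs, ys, b0 + h, b1) - rss(xs, ys, b0 - h, b1)) / (2 * h)
num_d1 = (rss(xs, ys, b0, b1 + h) - rss(xs, ys, b0, b1 - h)) / (2 * h)
```

At the least squares solution both derivatives vanish; at this test point they do not, which is expected.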

Dividing both conditions by $-2$, the equations become:

(1) $\sum_{i=1}^{N} (y_{i} - \beta_{o} - \beta_{1} x_{i}) = 0$

(2) $\sum_{i=1}^{N} (y_{i} - \beta_{o} - \beta_{1} x_{i}) x_{i} = 0$

Breaking (1) into component parts: 

$\sum_{i=1}^{N} y_{i} - N \beta_{o} - \beta_{1} \sum_{i=1}^{N} x_{i} = 0 $ 

$ N \bar y - N \beta_{o} - \beta_{1} N \bar x = 0 $ 

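Breaking (2) into component parts, where $\langle \cdot \rangle$ denotes the sample mean, e.g. $\langle x_{i} y_{i} \rangle = \frac{1}{N} \sum_{i=1}^{N} x_{i} y_{i}$: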

$\sum_{i=1}^{N} y_{i} x_{i} - \beta_{o} \sum_{i=1}^{N} x_{i} - \beta_{1} \sum_{i=1}^{N} x_{i}^{2} = 0 $ 

$N \langle x_{i} y_{i} \rangle - \beta_{o} N \bar x - \beta_{1} N \langle x_{i}^{2} \rangle = 0$ 

With two equations and two unknowns we can solve for $\beta_{o}$ and $\beta_{1}$: 

$\beta_{1} = \frac{\langle x_{i} y_{i} \rangle - \bar x \bar y}{\langle x_{i}^{2} \rangle - \bar x^{2}}$ 

$ \beta_{o} = \bar y - \beta_{1} \bar x $
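The closed-form solution can be sketched directly from the sample means. The function name `fit_line` and the data are hypothetical; the sketch assumes the $x_{i}$ are not all identical, since otherwise the denominator is zero:

```python
def fit_line(xs, ys):
    """Least squares intercept and slope from the closed-form solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n  # <x_i y_i>
    xx_bar = sum(x * x for x in xs) / n              # <x_i^2>

    # Denominator is the sample variance of x; zero if all x are equal.
    denom = xx_bar - x_bar ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")

    beta1 = (xy_bar - x_bar * y_bar) / denom
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical noisy data scattered around y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
beta0, beta1 = fit_line(xs, ys)

# Condition (1): the residuals of the fitted line sum to zero.
residual_sum = sum(y - beta0 - beta1 * x for x, y in zip(xs, ys))
```

For these numbers the fit comes out near an intercept of 1.09 and a slope of 1.94, close to the line the data were scattered around.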

### Evaluating Estimate

Now we would like to ask: how good is our estimate of $\beta_{o}$ and $\beta_{1}$? You could use the chi-squared method, the bootstrap, or the $R^{2}$ statistic.

$R^{2}$ statistic: define $TSS = \sum_{i=1}^{N} (y_{i} - \bar y)^{2}$, where TSS is the total sum of squares, i.e. the RSS of the zeroth-order model that predicts $\bar y$ for every point.

The statistic is then $R^{2} = \frac{TSS - RSS}{TSS}$. We want RSS to be much smaller than TSS, so that $R^{2} \approx 1$.
* $R^{2}$ close to 1 indicates a good fit.
* $R^{2}$ close to zero means the linear model is not correct, or the errors are too large, or both.
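
The two limiting cases can be checked with a short sketch. The function name `r_squared` and the data are hypothetical, and the sketch assumes the $y_{i}$ are not all equal (otherwise TSS is zero):

```python
def r_squared(ys, y_hats):
    """R^2 = (TSS - RSS) / TSS for observed ys and fitted y_hats."""
    y_bar = sum(ys) / len(ys)
    tss = sum((y - y_bar) ** 2 for y in ys)
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    return (tss - rss) / tss


# Hypothetical observations with mean 4.
ys = [1.0, 3.0, 5.0, 7.0]

r2_perfect = r_squared(ys, ys)                # fitted values match exactly
r2_mean_only = r_squared(ys, [4.0] * len(ys)) # zeroth-order model: predict the mean
```

A perfect fit gives RSS of zero and therefore the largest possible value, while predicting the mean everywhere makes RSS equal to TSS.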