# Linear Models
A linear model assumes that the dependent variable can be written as a linear combination of some collection of independent variables, plus noise. In the setup, we have a data matrix $X\in\mathbb{R}^{n\times p}$, where $n$ is our number of observations and $p$ represents the number of features, as well as a variable we would like to forecast/predict - $Y\in\mathbb{R}^n$ (we'll focus on just one variable in our forecast for the time being). A linear model is simply formulated as $Y=X\beta + \epsilon$, where $\beta\in\mathbb{R}^p$ represents the unknown parameters to estimate and $\epsilon\sim \mathcal{F}$ represents some uncontrollable and unobservable noise present in the process, drawn from some distribution $\mathcal{F}$.


Least squares is the most popular method for estimating the coefficient attached to each variable. The method assumes:
* No collinearity (features are linearly independent / data matrix has full column rank) (check: $X$ having full column rank implies $X^TX$ has full rank)
* The model is linear in the parameters
* Uncorrelated residuals
* $\mathbb{E}(\epsilon |X) = 0$ (strict exogeneity); $\epsilon\sim\mathcal{N}(0,I\sigma^2)$
* No outliers
* $\epsilon_i$ are i.i.d., so the $Y_i$ are independent given $X$

After satisfying these assumptions (hint: they're rarely all satisfied), we can continue on with the algorithm of least squares, which tells us to choose $\beta$ to minimize the squared $L2$ norm of the residuals:

$$\arg\min\limits_{\beta} \lVert Y-X\beta\rVert_2^2.$$ 
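
As a minimal sketch of this minimization, assuming NumPy and simulated data (the sizes, seed, noise level and `beta_true` below are illustrative choices, not taken from the text), we can fit the coefficients with `np.linalg.lstsq` and compare against solving $X^TX\beta=X^TY$ directly:

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
n, p = 200, 3                             # illustrative sizes (assumed)
X = rng.normal(size=(n, p))               # data matrix in R^{n x p}
beta_true = np.array([1.5, -2.0, 0.5])    # hypothetical "true" coefficients
Y = X @ beta_true + rng.normal(scale=0.3, size=n)   # Y = X beta + eps

# lstsq minimizes ||Y - X beta||_2^2 without forming (X^T X)^{-1} explicitly
beta_hat, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)

# Same estimate from the linear system X^T X beta = X^T Y
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

print("lstsq estimate:  ", beta_hat)
print("normal equations:", beta_normal)
```

Both routes agree when $X$ has full column rank; `lstsq` is generally the safer default since it never forms $X^TX$.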

Geometrically, each residual from the plane constructed by our $X\beta$ contributes the area of a square whose side is that residual, and it is easy to see from this picture that outliers can heavily skew our estimates. We can compute moments of our model above as follows (treating $X$ as fixed): $\mathbb{E}(Y)=X\beta$ as $\mathbb{E}(\epsilon)=0.$ Similarly, 

$$\mathbb{V}(Y)=\mathbb{V}(X\beta+\epsilon)=\mathbb{V}(\epsilon)=\sigma^2I$$

and so $Y\sim \mathcal{N}(X\beta,\sigma^2I)$. As the $Y_i$ are independent (treating $X$ as fixed), the likelihood function factorizes:

$$\mathcal{L}_n(\beta,\sigma^2)=\prod\limits_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(Y_i-X_i\beta)^2}{2\sigma^2}}\Rightarrow \ell_n(\beta,\sigma^2)=-n\log(\sqrt{2\pi}\sigma)-\frac{1}{2\sigma^2}\sum\limits_{i=1}^n (Y_i-X_i\beta)^2=-n\log(\sqrt{2\pi}\sigma)-\frac{1}{2\sigma^2}\lVert Y-X\beta\rVert_2^2$$ 

To find the MLE, we simply differentiate the above expression and set the derivative to zero (check: the second derivatives are negative and this is sufficient to guarantee a maximum): 

$$\frac{\partial \ell_n}{\partial\beta} = -\frac{1}{2\sigma^2}\frac{\partial}{\partial\beta}\lVert Y-X\beta \rVert_2^2=\frac{1}{\sigma^2}X^T(Y-X\beta)=0\iff X^TX\hat{\beta}=X^TY\iff \hat{\beta}=(X^TX)^{-1}X^TY.$$

Similarly, we can compute the MLE for $\sigma^2$ (again check the second derivatives) to arrive at

$$\frac{\partial \ell_n}{\partial\sigma}=-n\sigma^{-1}+\sigma^{-3}\lVert Y-X\hat{\beta}\rVert_2^2=0\iff \hat{\sigma}^2=\frac{1}{n}\lVert Y-X\hat{\beta}\rVert_2^2.$$
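
A quick numerical sanity check of these two estimators, as a sketch on simulated data (the helper `log_likelihood`, the seed and the coefficients are assumptions made for illustration): the log-likelihood evaluated at $(\hat{\beta},\hat{\sigma}^2)$ should be at least as large as at nearby perturbed values.

```python
import numpy as np

def log_likelihood(beta, sigma2, X, Y):
    """Gaussian log-likelihood l_n(beta, sigma^2) for Y = X beta + eps."""
    n = len(Y)
    resid = Y - X @ beta
    # -n*log(sqrt(2*pi)*sigma) equals -(n/2)*log(2*pi*sigma^2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

rng = np.random.default_rng(1)
n, p = 100, 2                              # illustrative sizes (assumed)
X = rng.normal(size=(n, p))
Y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Closed-form maximum likelihood estimates
beta_mle = np.linalg.solve(X.T @ X, X.T @ Y)
sigma2_mle = np.sum((Y - X @ beta_mle) ** 2) / n

best = log_likelihood(beta_mle, sigma2_mle, X, Y)

# Log-likelihood at a small grid of perturbed parameter values
perturbed = [
    log_likelihood(beta_mle + shift, sigma2_mle * scale, X, Y)
    for shift in (np.array([0.05, 0.0]), np.array([0.0, -0.05]), np.zeros(p))
    for scale in (0.9, 1.0, 1.1)
]
print("maximum found at the MLE:", all(best >= value for value in perturbed))
```

This is not a proof of optimality, only a spot check that no nearby point on the grid does better.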

We took a detour here to compute the likelihood function and the MLE for the parameters. As another brief exercise, check that this is equivalent to the method of least squares (it should take a second to realize that maximizing $\ell_n$ over $\beta$ and minimizing $\lVert Y-X\beta\rVert_2^2$ are the same problem). Note that with a single centered feature, our estimate for $\beta$ is simply the sample covariance between $X$ and $Y$ over the sample variance of $X$ - does this make sense? Now that we have an estimate for $\beta$, we would like to see if it is unbiased by checking whether its expectation is equal to the true parameter $\beta$:

$$\mathbb{E}(\hat{\beta})=\mathbb{E}[(X^TX)^{-1}X^TY]=(X^TX)^{-1}X^T\mathbb{E}(Y)=(X^TX)^{-1}X^TX\beta=\beta$$

which implies that our estimate is unbiased. If we so desire, we could also compute the variance of our estimate:

$$\mathbb{V}(\hat{\beta})=\mathbb{E}\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T\right]=\mathbb{E}[\hat{\beta}\hat{\beta}^T]-\mathbb{E}[\hat{\beta}]\beta^T-\beta\mathbb{E}[\hat{\beta}^T]+\beta\beta^T$$

$$=(X^TX)^{-1}X^T\mathbb{E}[YY^T]X(X^TX)^{-1}-(X^TX)^{-1}X^T\mathbb{E}[Y]\beta^T-\beta\mathbb{E}[Y^T]X(X^TX)^{-1}+\beta\beta^T$$

$$=^*\beta\beta^T+(X^TX)^{-1}X^T(\sigma^2I)X(X^TX)^{-1}-\beta\beta^T-\beta\beta^T+\beta\beta^T=\sigma^2(X^TX)^{-1}$$

where we used in $=^*$ that $\mathbb{E}(YY^T)=X\beta\beta^TX^T+\sigma^2I$. We see that the variance is directly impacted by the size of the inverse of the matrix $X^TX$. If $X$ were not full rank as we stated in our assumptions, then $X^TX$ is not full rank and its inverse does not exist; when $X^TX$ is merely close to singular, inverting it is numerically unstable, which dramatically increases the variance of our estimate. In practice, we estimate $\sigma^2$ by $\hat{\sigma}^2=(n-p-1)^{-1}(Y-\hat{Y})^T(Y-\hat{Y})$, where we have $n-p-1$ degrees of freedom (this count assumes $p$ slope parameters plus an intercept, as in ESL) and thus $(n-p-1)\hat{\sigma}^2\sim \sigma^2\chi^2_{n-p-1}$. We can form hypothesis tests for individual parameters by noting that, under the null hypothesis $\beta_i=0$, $\frac{\hat{\beta}_i}{\hat{\sigma}\sqrt{diag((X^TX)^{-1})_i}}\sim t_{n-p-1}$ (this follows from our earlier considerations on the distribution of $\hat{\beta}$). More garbage on useless hypothesis testing can be found in Keener chapter 12 and ESL chapter 3. The Gauss-Markov theorem and subset selection can also be found in chapter 3 of ESL, despite not being worth your time IMO... \
\
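
The unbiasedness of $\hat{\beta}$, its covariance $\sigma^2(X^TX)^{-1}$ and the $t$-statistics above can be checked by simulation. This is only a sketch: the design includes an explicit intercept column (so there are $p+1$ coefficients and $n-p-1$ degrees of freedom), and the sizes, seed and `beta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 50, 2, 1.0                       # illustrative sizes (assumed)
# Fixed design: one intercept column plus p features
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([0.5, 2.0, 0.0])               # hypothetical true coefficients
XtX_inv = np.linalg.inv(X.T @ X)

# Monte Carlo: redraw the noise many times while keeping X fixed
reps = 20000
Ys = X @ beta + rng.normal(scale=sigma, size=(reps, n))   # one dataset per row
beta_hats = Ys @ X @ XtX_inv                   # row i holds (X^T X)^{-1} X^T Y_i

empirical_mean = beta_hats.mean(axis=0)        # should be close to beta
empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma**2 * XtX_inv           # sigma^2 (X^T X)^{-1}

# Residual variance estimate with n - p - 1 degrees of freedom
dof = n - p - 1
residuals = Ys - beta_hats @ X.T
sigma2_hats = (residuals**2).sum(axis=1) / dof # averages to about sigma^2

# t-statistics for the first simulated dataset (each tests H0: beta_i = 0)
std_err = np.sqrt(sigma2_hats[0] * np.diag(XtX_inv))
t_stats = beta_hats[0] / std_err

print("mean of beta_hat:", empirical_mean)
print("max |cov error|: ", np.abs(empirical_cov - theoretical_cov).max())
print("t-statistics:    ", t_stats)
```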
Shrinkage Methods: If the number of features is much larger than the number of points we have, or the design matrix is not full rank, then least squares does not do the job well. In the first instance, $p>n$, there are an infinite number of $\beta$ that satisfy the normal equations $X^TX\beta=X^TY$ - i.e. we decompose $\mathbb{R}^p$ into the row space of $X$ and its orthogonal complement (the null space of $X$), and adding any null-space vector to one solution yields another solution with exactly the same fitted values (think about the number of planes passing through two points). In the latter case, $X^TX$ is singular or ill-conditioned (the condition number is the ratio of the largest singular value to the smallest singular value). In the context of regression, an ill-conditioned matrix is generally indicative of correlated columns/features: think home runs and RBIs in baseball or assists and goals in football. As an example, consider the matrix $A=\begin{bmatrix}1&2\\2&4\end{bmatrix}$ - maybe this represents, say, grocery prices before and after taxes (in the EU). Then $|A|=4-4=0$ and so our matrix is singular. As $|A|=\lambda_1\times\lambda_2$ for $2\times 2$ matrices, singularity is equivalent to one or more of the eigenvalues being zero. To verify this in our simplistic example, note that

$$|A-\lambda I|=\begin{vmatrix}1-\lambda&2\\2&4-\lambda\end{vmatrix}=0\iff \lambda(\lambda - 5)=0\iff \lambda\in\{0,5\}.$$
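
The same check can be done numerically. A small sketch, assuming NumPy; the nearly singular matrix `A_near` is a hypothetical perturbation of $A$, added only to illustrate a large condition number:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

eigenvalues = np.linalg.eigvalsh(A)      # symmetric input, ascending order
determinant = np.linalg.det(A)
rank = np.linalg.matrix_rank(A)

# Slightly perturbed copy: invertible, but badly conditioned
A_near = np.array([[1.0, 2.0],
                   [2.0, 4.0001]])
cond_near = np.linalg.cond(A_near)       # largest / smallest singular value

print("eigenvalues of A:", eigenvalues)
print("det(A):", determinant, "rank(A):", rank)
print("condition number of A_near:", cond_near)
```

The exactly singular $A$ has rank one, while the perturbed copy is technically invertible but has a condition number in the hundreds of thousands.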

As $A$ is real and $A=A^T$, the matrix is symmetric (Hermitian), so by the spectral theorem we can find orthonormal vectors $u_1$ and $u_2$ such that $A=\lambda_1u_1u_1^T+\lambda_2u_2u_2^T$. An inverse would have to be $A^{-1}=\lambda_1^{-1}u_1u_1^T+\lambda_2^{-1}u_2u_2^T$, but the RHS is undefined as one of the eigenvalues is zero. In the case of rank deficient matrices, we can instead turn to minimum (norm) least squares to do the job. That is, we define the minimum least squares estimator as 

$$\hat{\beta}_{MLS}=\arg\min\limits_{\beta\in\mathbb{R}^p} \lVert Y-X\beta\rVert_2^2$$

subject to $\lVert \hat{\beta}_{MLS}\rVert_2^2\leq\lVert \beta\rVert_2^2$ for all $\beta$ minimizing the residual L2 norm above. If $X$ has full column rank, we know that $\hat{\beta}_{MLS}=\hat{\beta}_{OLS}$ as the minimizer is unique. Otherwise, every OLS solution can be written as $(X^TX)^{-1}X^TY+v$, with squared norm

$$\lVert (X^TX)^{-1}X^TY+v\rVert_2^2=\lVert (X^TX)^{-1}X^TY\rVert_2^2+2Y^TX(X^TX)^{-1}v+\lVert v\rVert_2^2=\lVert (X^TX)^{-1}X^TY\rVert_2^2+\lVert v\rVert_2^2$$ 

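where $v$ lies in the orthogonal complement of the row space of $X$ (i.e. the null space of $X$), so the cross term vanishes because $(X^TX)^{-1}X^TY$ lies in the row space. Then, the minimum is obtained when $\lVert v\rVert=0$. By $(X^TX)^{-1}$, I simply mean the Moore-Penrose pseudo inverse, which is defined in terms of the eigenvalues $\lambda_i$ and orthonormal eigenvectors $u_i$ of $X^TX$ as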

$$(X^TX)^{-1}=\sum\limits_{i:\,\lambda_i\neq 0} \lambda_i^{-1}u_iu_i^T.$$
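
To make the pseudo-inverse concrete, here is a sketch that builds it from the eigendecomposition of $X^TX$ (inverting only the non-zero eigenvalues) for a deliberately rank-deficient design, then checks that adding a null-space vector $v$ leaves the fit unchanged but increases the norm. The data, the seed and the cutoff `tol` for deciding which eigenvalues count as zero are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
# Second column is exactly twice the first, so X has rank 2 rather than 3
X = np.column_stack([x1, 2.0 * x1, rng.normal(size=n)])
Y = rng.normal(size=n)

G = X.T @ X
lam, U = np.linalg.eigh(G)               # eigenvalues and orthonormal eigenvectors
tol = lam.max() * 1e-10                  # hypothetical cutoff for "zero" eigenvalues
keep = lam > tol

# Sum of u_i u_i^T / lam_i over the non-zero eigenvalues only
G_pinv = (U[:, keep] / lam[keep]) @ U[:, keep].T
beta_mls = G_pinv @ X.T @ Y              # minimum-norm least squares solution

# v is in the null space of X: 2 * x1 - 1 * (2 * x1) = 0
v = np.array([2.0, -1.0, 0.0])
beta_other = beta_mls + v                # same fitted values, larger norm

print("norm of minimum-norm solution:", np.linalg.norm(beta_mls))
print("norm after adding v:          ", np.linalg.norm(beta_other))
```

Both `beta_mls` and `beta_other` give identical residuals; only the first has no component in the null space of $X$.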

The pseudo-inverse essentially collapses the singularities, but instead we could also combat the singularity by using regularization, which basically adds a constant to the diagonal to prevent unstable estimates. For instance, take our matrix $A$ above: if we add $2I$ to the matrix, the eigenvalues are now $2$ and $7$ instead of $0$ and $5$. Hoerl and Kennard (1970) defined the ridge estimator as:

$$\hat{\beta}(\lambda)=(X^TX+\lambda I)^{-1}X^TY.$$
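
A minimal sketch of the ridge estimator on perfectly collinear data, where $X^TX$ itself is singular (the function name `ridge`, the data and the value of `lam` are illustrative assumptions):

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimate (X^T X + lam I)^{-1} X^T Y, computed with a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2.0 * x1])      # perfectly collinear columns
Y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# X^T X is singular here, but X^T X + lam I is invertible for any lam > 0
beta_ridge = ridge(X, Y, lam=1.0)
print("ridge estimate:", beta_ridge)
```

With collinear columns the ridge penalty splits the effect between the two features in proportion to their scale, rather than leaving the split undetermined.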

Intermezzo: ESL Ex. 3.12: Show that the ridge regression estimates can be obtained by OLS on an augmented data set. \
\
Answer: If we consider $X^*=\begin{bmatrix}X\\\sqrt{\lambda}I_p\end{bmatrix}$ and $y^*=\langle y,0,\dots,0\rangle^T$ (that is, $y$ padded with $p$ zeros), then substituting into the normal equations gives 

$$X^{*^T}y^*=X^{*^T}X^*\beta\iff \begin{bmatrix} X^T& \sqrt{\lambda}I_p\end{bmatrix}\begin{bmatrix}y\\0\end{bmatrix}=\begin{bmatrix} X^T& \sqrt{\lambda}I_p\end{bmatrix}\begin{bmatrix} X\\ \sqrt{\lambda}I_p\end{bmatrix}\beta\iff X^Ty=(X^TX+\lambda I_p)\beta\iff \hat{\beta}=(X^TX+\lambda I_p)^{-1}X^Ty.$$
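
The equivalence is easy to confirm numerically. In this sketch the data, seed and `lam` are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 60, 4, 2.5                   # illustrative sizes and penalty (assumed)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Augment: stack sqrt(lam) * I_p under X and p zeros under y
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# Plain OLS on the augmented data
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Ridge estimate on the original data
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS on augmented data:", beta_aug)
print("ridge on original data:", beta_ridge)
```

The two printed vectors should match to numerical precision, since the augmented rows contribute exactly $\lambda\lVert\beta\rVert_2^2$ to the residual sum of squares.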



# Sources

* http://statsmaths.github.io/stat612/lectures/lec15/lecture15.pdf
* https://arxiv.org/pdf/1509.09169.pdf
* "Elements of Statistical Learning" - Hastie et al.
* Quadratic forms: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470192610.ch5