# Solving least squares problems

Formally, if $Y = X\beta + \epsilon$ with $Y, \epsilon \in \Re^n$, $X$ an $n$ by $p$ matrix, $\beta \in \Re^p$, $p \le n$, and $\mathrm{rank}(X) = p$, then

\begin{equation*}  \hat{\beta}_{\mathrm{OLS}} = \arg \min_{\gamma \in \Re^p} \| Y - X\gamma \|^2 =  (X^TX)^{-1} X^T Y.\end{equation*}

We will derive this, then explain why, although the solution is mathematically correct, it is not a good way to find
$\hat{\beta}_{\mathrm{OLS}}$ in practice: computing it from this formula is numerically less stable than other approaches.
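
As a concrete instance of the setup above, the following minimal sketch simulates data from the model $Y = X\beta + \epsilon$ and computes a least squares estimate with NumPy's `np.linalg.lstsq`. The dimensions, seed, coefficient values, and noise level are arbitrary choices made for illustration; they are not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed (arbitrary) for reproducibility
n, p = 200, 3                           # illustrative sizes with p <= n
X = rng.normal(size=(n, p))             # a random Gaussian X has rank p with probability 1
beta = np.array([1.0, -2.0, 0.5])       # hypothetical "true" coefficients
eps = 0.1 * rng.normal(size=n)          # hypothetical noise
Y = X @ beta + eps

# Least squares estimate: minimizes ||Y - X gamma||^2 over gamma
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)


def rss(gamma):
    """Residual sum of squares ||Y - X gamma||^2."""
    r = Y - X @ gamma
    return r @ r


print("estimate:", beta_hat)
print("RSS at estimate:", rss(beta_hat))
print("RSS at true beta:", rss(beta))
```

The estimate should have a residual sum of squares no larger than that of the true $\beta$, because it minimizes that criterion over all of $\Re^p$.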

## The normal equations

Note that $\| Y - X\gamma \|^2 = (Y - X\gamma)^T(Y - X\gamma)$ is a quadratic function of $\gamma$. 
Extrema with respect to $\gamma$ will be at stationary points.

\begin{equation*}  \frac{\partial}{\partial \gamma} \| Y - X\gamma \|^2 = -2 X^T (Y - X\gamma) = 2 X^TX\gamma - 2 X^TY.\end{equation*}

Note that the 2nd derivative with respect to $\gamma$ is $2X^TX$, a square, symmetric matrix.
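
A finite-difference check is one way to confirm the sign and the factor of 2 in the gradient. The sketch below compares the analytic gradient $-2X^T(Y - X\gamma)$ with central differences of $\| Y - X\gamma \|^2$ at an arbitrary point. The data, the seed, and the step size `h` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
n, p = 20, 4
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
gamma = rng.normal(size=p)              # arbitrary point at which to differentiate


def rss(g):
    """Objective ||Y - X g||^2."""
    r = Y - X @ g
    return r @ r


# Analytic gradient of the objective with respect to gamma
grad_analytic = -2 * X.T @ (Y - X @ gamma)

# Central differences along each coordinate direction e_j
h = 1e-6                                # assumed step size
grad_numeric = np.array([
    (rss(gamma + h * e_j) - rss(gamma - h * e_j)) / (2 * h)
    for e_j in np.eye(p)
])

print("max abs difference:", np.abs(grad_analytic - grad_numeric).max())
```

Because the objective is quadratic in $\gamma$, central differences have no truncation error, so any discrepancy comes from floating-point rounding.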

The point $\gamma_0$ is a stationary point of $\| Y - X\gamma \|^2$ if $X^TX \gamma_0 = X^T Y$. These linear relations are called the _normal equations_.

If $X$ has rank $p$, then $X^TX$ is positive definite, and hence invertible, so the normal equations have a unique solution:

\begin{equation*}  \gamma_0 = (X^TX)^{-1}X^TY.\end{equation*}

Since the 2nd derivative is positive definite, this stationary point is a minimum, not a maximum. Therefore, the value of $\beta$ that minimizes $\|Y - X\beta\|^2$ is $\hat{\beta} = (X^TX)^{-1}X^TY$.

**This is mathematically true, but it is not a numerically stable way to solve the problem.**
In general, one should avoid inverting matrices numerically to solve linear problems.
Other approaches, such as Gaussian elimination and matrix factorization (decomposition) methods, are much more stable.
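
The difference in stability can be seen on an ill-conditioned design matrix. The sketch below is an illustration under stated assumptions, not a benchmark: it uses a polynomial (Vandermonde) design on $[0, 1]$, noise-free $Y$ so the exact coefficients are known, and a QR factorization as one example of a factorization method. It compares the error of the explicit-inverse formula with the error of the QR-based solution.

```python
import numpy as np

n, p = 50, 10
t = np.linspace(0.0, 1.0, n)
X = np.vander(t, p, increasing=True)    # columns 1, t, t^2, ..., t^(p-1)
beta = np.ones(p)                       # hypothetical true coefficients
Y = X @ beta                            # noise-free, so beta is the exact minimizer

# (1) Explicit inverse of X^T X, as in the closed-form expression
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# (2) Reduced QR factorization X = QR, then solve R beta = Q^T Y
Q, R = np.linalg.qr(X)
# np.linalg.solve is a general solver; a triangular solver would exploit R's structure
beta_qr = np.linalg.solve(R, Q.T @ Y)

err_inv = np.linalg.norm(beta_inv - beta)
err_qr = np.linalg.norm(beta_qr - beta)

print("cond(X)      :", np.linalg.cond(X))
print("cond(X^T X)  :", np.linalg.cond(X.T @ X))
print("error, explicit inverse:", err_inv)
print("error, QR              :", err_qr)
```

Forming $X^TX$ squares the condition number of $X$, which is why the explicit-inverse route typically loses more digits of accuracy than a method that works with $X$ directly.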

## Characterizing the optimum

### Reminders

Two vectors of the same dimension, $x$ and $y$, are _orthogonal_ if $x^Ty = 0$.

A very useful result is that the residual vector, $e = Y - X\hat{\beta}$, is orthogonal to the subspace spanned by the columns of $X$; that is, $X^Te = 0$.

Let $\mathrm{colspan}(X)$ denote the span of the columns of $X$, that is,

\begin{equation*}  
\mathrm{colspan}(X) \equiv \{ X \gamma : \gamma \in \Re^p \}.
\end{equation*}

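To see why $X^Te = 0$, suppose not. Any vector $e \in \Re^n$ can be decomposed into a component that is contained in the subspace spanned by the columns of $X$ and a component that is orthogonal to that subspace.
Write $e = e_\parallel + e_\perp$, where $e_\parallel \in \mathrm{colspan}(X)$ and $e_\perp \perp \mathrm{colspan}(X)$.
If $X^Te \ne 0$, then $e_\parallel \ne 0$. Because $e_\parallel \in \mathrm{colspan}(X)$, there is some $\delta \in \Re^p$ with $e_\parallel = X\delta$.
Then $Y - X(\hat{\beta} + \delta) = e - e_\parallel = e_\perp$, and since $e_\parallel$ and $e_\perp$ are orthogonal, $\| e_\perp \|^2 = \| e \|^2 - \| e_\parallel \|^2 < \| e \|^2$.
That contradicts the fact that $\hat{\beta}$ minimizes $\| Y - X\gamma \|^2$, so $X^Te = 0$.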
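
The orthogonality of the residual to the columns of $X$ can be checked numerically. The sketch below fits a least squares estimate with `np.linalg.lstsq`, forms $e = Y - X\hat{\beta}$, and evaluates $X^Te$. It also moves the estimate by an arbitrary nonzero vector `delta` to show that the residual sum of squares increases. The data, seed, and `delta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)          # arbitrary seed
n, p = 30, 3
X = rng.normal(size=(n, p))             # rank p with probability 1
Y = rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat                    # residual vector

# X^T e should vanish up to floating-point rounding
print("max |X^T e|:", np.abs(X.T @ e).max())

# Moving away from beta_hat adds a component in colspan(X) to the residual
delta = rng.normal(size=p)              # arbitrary nonzero perturbation
rss_hat = e @ e
rss_other = np.sum((Y - X @ (beta_hat + delta)) ** 2)
print("RSS at beta_hat        :", rss_hat)
print("RSS at beta_hat + delta:", rss_other)
```

The increase in the residual sum of squares equals $\| X\delta \|^2$, which is the Pythagorean relation between the orthogonal pieces of the new residual.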

## Operator norm of a matrix

Suppose $X$ is an $n$ by $p$ matrix. The _operator norm_ of $X$ is

\begin{equation*}  \| X \| \equiv \sup_{\gamma \in \Re^p : \gamma \ne 0} \frac{ \| X\gamma \|}{\|\gamma \|}.\end{equation*}
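
When both vector norms are Euclidean, the supremum is attained and equals the largest singular value of $X$; that fact is not derived in this section and is assumed here. The sketch below computes the operator norm with `np.linalg.norm(X, 2)`, compares it with the ratio $\| X\gamma \| / \| \gamma \|$ at the leading right singular vector, and checks that randomly drawn $\gamma$ never exceed it. The matrix, seed, and number of random draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)          # arbitrary seed
n, p = 6, 3
X = rng.normal(size=(n, p))

# Operator (spectral) norm: the largest singular value of X
op_norm = np.linalg.norm(X, 2)

# Singular value decomposition; rows of Vt are the right singular vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                              # direction that attains the supremum
ratio_at_v1 = np.linalg.norm(X @ v1) / np.linalg.norm(v1)

# Ratios ||X gamma|| / ||gamma|| for many random nonzero gamma (one per row)
gammas = rng.normal(size=(1000, p))
ratios = np.linalg.norm(gammas @ X.T, axis=1) / np.linalg.norm(gammas, axis=1)

print("operator norm          :", op_norm)
print("largest singular value :", s[0])
print("ratio at v1            :", ratio_at_v1)
print("largest random ratio   :", ratios.max())
```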