# Normal Equations

Real numbers only this time.

## Linear Systems of Equations

A linear system has the form $Ax = b$ where $A$ is a matrix and $x$ and $b$ are vectors. Each row of $A$, together with the matching entry of $b$, is one equation, and the entries of $x$ must satisfy all of them simultaneously. If $x,b\in\mathbb{R}^n$, then the solutions to the equation of a single row form an $(n-1)$-dimensional hyperplane. If the rows of $A$ are linearly independent, then the solutions that simultaneously satisfy the equations in $k$ rows form the $(n-k)$-dimensional intersection of $k$ hyperplanes, each of dimension $n-1$. The solution that satisfies all $n$ equations is then an $(n-n) = 0$-dimensional point, and so $x$ is uniquely determined.

If two rows of $A$ are not linearly independent, then one is a multiple of the other and their hyperplanes are parallel. Provided the matching entries of $b$ are consistent, the two hyperplanes overlap exactly, and their intersection is $(n-1)$-dimensional, rather than $(n-2)$-dimensional. In this case, the value of $x$ that satisfies all rows of $A$ is not narrowed down to a single point. The system of equations is said to be *underdetermined*. The same happens when $A$ has $m<n$ rows.
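Counting linearly independent rows can also be done numerically. The sketch below uses NumPy's `matrix_rank`; both matrices are made-up examples, not taken from the text above.

```python
import numpy as np

# Made-up 3x3 example whose rows are linearly independent.
A_independent = np.array([[1.0, 2.0, 3.0],
                          [0.0, 1.0, 4.0],
                          [5.0, 6.0, 0.0]])

# Same matrix, except the second row is now twice the first row.
A_dependent = np.array([[1.0, 2.0, 3.0],
                        [2.0, 4.0, 6.0],
                        [5.0, 6.0, 0.0]])

n = A_independent.shape[1]
rank_independent = np.linalg.matrix_rank(A_independent)
rank_dependent = np.linalg.matrix_rank(A_dependent)

# n - rank is the dimension of the solution set (when a solution exists).
print("independent rows: rank", rank_independent, "solution dimension", n - rank_independent)
print("dependent rows:   rank", rank_dependent, "solution dimension", n - rank_dependent)
```

With dependent rows, one equation adds no new constraint, so a whole line of candidate $x$ values is left over instead of a single point.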

### Square Matrices $A\in\mathbb{R}^{n\times n}$
If $A\in\mathbb{R}^{n\times n}$ and its rows are linearly independent, then the solution to the system is formally $x = A^{-1}b$, where $A^{-1}$ is the matrix inverse, satisfying $A^{-1}A=I$, where $I$ is the identity matrix.
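As a small illustration, here is a sketch of solving a square system with NumPy. The matrix and right-hand side are made up for the example. In practice `np.linalg.solve` is usually preferred over forming $A^{-1}$ explicitly, but both are shown for comparison.

```python
import numpy as np

# Made-up 3x3 system with linearly independent rows.
A = np.array([[2.0, 1.0, -1.0],
              [-3.0, -1.0, 2.0],
              [-2.0, 1.0, 2.0]])
b = np.array([8.0, -11.0, -3.0])

# Solve Ax = b directly, without forming the inverse.
x = np.linalg.solve(A, b)

# The formal expression x = A^{-1} b gives the same answer.
x_via_inverse = np.linalg.inv(A) @ b

print(x)
print(np.allclose(A @ x, b))
```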

Since $Ax=b$ expresses $b$ as a linear combination of the columns of $A$, the entries of $x$ can be interpreted as the coordinates of $b$ in the basis formed by those columns. If the columns are orthonormal, each coordinate is simply the dot product of $b$ with the corresponding column, so $x = A^Tb$. Therefore, for an orthogonal matrix (one with orthonormal columns), $A^{-1}$ is simply $A^T$.
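A quick numerical check of the transpose-as-inverse claim, using a 2D rotation matrix as a convenient example of a matrix with orthonormal columns. The angle and the vector `b` are arbitrary choices for illustration.

```python
import numpy as np

theta = 0.3  # arbitrary rotation angle in radians
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.array([1.0, 2.0])

# Columns are orthonormal, so Q^T Q is the identity.
print(np.allclose(Q.T @ Q, np.eye(2)))

# Each entry of x is the dot product of b with one column of Q.
x = Q.T @ b
print(np.allclose(Q @ x, b))
```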

### Rectangular Matrices $A\in\mathbb{R}^{m\times n}$
If $A\in\mathbb{R}^{m\times n}$ with $m>n$ rows, then there need not be any point $x\in\mathbb{R}^n$ in which the $m$ hyperplanes all intersect. In that case, the system does not have a solution $x\in\mathbb{R}^n$, and the system is considered *overdetermined*. (The intersection of $m$ distinct hyperplanes in $n$-dimensional space would have negative dimension $(n-m)<0$ if $m>n$, which my feeble brain can't make sense of.)
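To make this concrete, here is a sketch with three made-up lines in the plane that share no common point. The rank comparison used to detect this is a standard test that is not part of the discussion above: an exact solution exists only if appending $b$ as an extra column does not raise the rank.

```python
import numpy as np

# Three made-up lines in the plane (m = 3 equations, n = 2 unknowns):
#   x + y = 2,   x - y = 0,   x + 2y = 4
A = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [1.0, 2.0]])
b = np.array([2.0, 0.0, 4.0])

rank_A = np.linalg.matrix_rank(A)
rank_augmented = np.linalg.matrix_rank(np.column_stack([A, b]))

# If b adds a new direction, it lies outside the column space of A.
has_exact_solution = rank_A == rank_augmented
print(rank_A, rank_augmented, has_exact_solution)
```

The first two lines meet at $(1, 1)$, but that point gives $1 + 2 = 3 \neq 4$ in the third equation, so no single $x$ satisfies all three.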


## Normal Equations
In the overdetermined case $A\in\mathbb{R}^{m\times n}$ with $m>n$, the columns of $A$ do not span $\mathbb{R}^m$ and therefore $b\in\mathbb{R}^m$ may have some component $\epsilon$ that lies outside of the column space of $A$. In that case, no linear combination of the columns of $A$ can express $b$ perfectly, but we might look for an approximate solution $\hat{x}$ so that:

\begin{equation}
A\hat{x} + \epsilon = b
\end{equation}

A natural approach for picking an approximate solution $\hat{x}$ is to look for the projection of $b$ onto the column space of $A$, which can be thought of as looking for the shadow of an $m$-dimensional vector in an (at most) $n$-dimensional subspace.

The projection is the point of the column space closest to $b$, and hence minimizes the length of the difference vector $\epsilon$. In turn, the length of the difference vector $\epsilon$ is $\sqrt{\epsilon\cdot\epsilon}$, which is a monotonic function of $\epsilon\cdot\epsilon = \sum_{i=1}^m \epsilon_i^2$. That means that finding the projection of $b$ onto the column space of $A$ minimizes the $L_2$ norm of $\epsilon$, or equivalently the *least squares error*.
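The minimizing property can be checked numerically. The sketch below uses NumPy's `lstsq` to get $\hat{x}$ for a made-up overdetermined system, then compares its squared error against randomly perturbed candidates. The data, the perturbation scale and the number of candidates are arbitrary choices for illustration.

```python
import numpy as np

# Made-up overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [1.0, 2.0]])
b = np.array([2.0, 0.0, 4.0])

# lstsq returns the x that minimizes the squared length of b - A x.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

def squared_error(x):
    """Sum of squared entries of eps = b - A x."""
    eps = b - A @ x
    return eps @ eps

best = squared_error(x_hat)

# Randomly perturbed candidates should never beat x_hat.
rng = np.random.default_rng(0)
candidates = x_hat + rng.normal(scale=0.5, size=(1000, 2))
errors = np.array([squared_error(x) for x in candidates])

print(best, errors.min())
```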

There are two ways to go about finding $\hat{x}$.

### The Quick Way to $\hat{x}$

By construction, the vector $\epsilon = b - A\hat{x}$ is orthogonal to the column space of $A$, which means:

\begin{equation}
\begin{array}{rl}
A^T\epsilon &= 0\\
A^T\left(b-A\hat{x}\right) &= 0\\
A^TA\hat{x} &= A^Tb\\
\hat{x} &= \left(A^TA\right)^{-1}A^Tb
\end{array}
\end{equation}

This makes use of the fact that $A^TA$ is square ($n\times n$). It is invertible exactly when the columns of $A$ are linearly independent.
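The normal equations translate almost directly into NumPy. The sketch below uses a made-up system and solves $A^TA\hat{x} = A^Tb$ with `np.linalg.solve` rather than forming the inverse explicitly. It then checks that the residual is orthogonal to the columns of $A$ and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

# Made-up overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [1.0, 2.0]])
b = np.array([2.0, 0.0, 4.0])

# Solve (A^T A) x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The residual should be orthogonal to every column of A.
eps = b - A @ x_hat
print(A.T @ eps)

# Cross-check against NumPy's least-squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_hat, x_lstsq))
```

For badly conditioned $A$, forming $A^TA$ can lose precision, which is one reason library routines such as `lstsq` take a different route internally.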

### The Long Way to $\hat{x}$

Loss functions play a central role in computational statistics (for example when regularization is introduced), and therefore it is of interest to approach finding $\hat{x}$ by minimizing the least squares error directly. This requires:

\begin{equation}
\frac{d}{d\hat{x}}L_2(\epsilon) = 0
\end{equation}

where

\begin{equation}
\begin{array}{rl}
L_2(\epsilon) &= \left(A\hat{x}-b\right)^T\left(A\hat{x}-b\right)\\
&= \hat{x}^TA^TA\hat{x} - \hat{x}^TA^Tb - b^TA\hat{x} + b^Tb 
\end{array}
\end{equation}

Useful factoids about taking derivatives with respect to vectors include:

\begin{equation}
\begin{array}{l}
\frac{d}{dx} \left(u^Tx\right) = \left[\frac{d}{dx_1}\left(\sum_i u_i x_i\right),...,\frac{d}{dx_n}\left(\sum_i u_i x_i\right)\right] = u^T\\
\\
\frac{d}{dx} \left(x^Tu\right) = \left[\frac{d}{dx_1}\left(\sum_i u_i x_i\right),...,\frac{d}{dx_n}\left(\sum_i u_i x_i\right)\right] = u^T\\
\\
\frac{d}{dx} \left(x^Tx\right) = \left[\frac{d}{dx_1}\left(\sum_i x_i^2\right),...,\frac{d}{dx_n}\left(\sum_i x_i^2\right)\right] = 2x^T\\
\\
\frac{d}{dx} \left(Ax\right) = \left[
\begin{array}{ccc} 
\underbrace{\frac{d}{dx_1}\left(\sum_i A_{1i} x_i\right)}_{A_{11}} &\cdots& \underbrace{\frac{d}{dx_n}\left(\sum_i A_{1i} x_i\right)}_{A_{1n}}\\
\vdots&\ddots&\vdots\\
\underbrace{\frac{d}{dx_1}\left(\sum_i A_{mi} x_i\right)}_{A_{m1}} &\cdots& \underbrace{\frac{d}{dx_n}\left(\sum_i A_{mi} x_i\right)}_{A_{mn}}\\
\end{array}\right] = A\\
\end{array}
\end{equation}
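These identities can be sanity-checked with finite differences. The sketch below defines a small helper, `numerical_jacobian` (a hypothetical name introduced here, not a NumPy function), that approximates the derivative with rows indexing outputs and columns indexing the entries of $x$, matching the row-vector convention used above. The vectors and matrix are random test data.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian of f at x (rows: outputs, columns: inputs)."""
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2 * h)
    return J

rng = np.random.default_rng(0)
m, n = 3, 4
u = rng.normal(size=n)
x = rng.normal(size=n)
A = rng.normal(size=(m, n))

J_ux = numerical_jacobian(lambda v: u @ v, x)  # expected: u^T
J_xx = numerical_jacobian(lambda v: v @ v, x)  # expected: 2 x^T
J_Ax = numerical_jacobian(lambda v: A @ v, x)  # expected: A

print(np.allclose(J_ux, u[None, :], atol=1e-6))
print(np.allclose(J_xx, 2 * x[None, :], atol=1e-6))
print(np.allclose(J_Ax, A, atol=1e-6))
```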

It follows:

\begin{equation}
\begin{array}{l}
\frac{d}{d\hat{x}}\left(\hat{x}^TA^T\underbrace{A\hat{x}}_{u(\hat{x})}\right) = \frac{d}{du}\left(u^Tu\right)\frac{d}{d\hat{x}}u = 2u^T\frac{d}{d\hat{x}}u = 2\hat{x}^TA^TA\\
\\
\frac{d}{d\hat{x}}\hat{x}^TA^Tb = b^TA\\
\\
\frac{d}{d\hat{x}}b^TA\hat{x} = b^TA\\
\\
\frac{d}{d\hat{x}}b^Tb = 0
\end{array}
\end{equation}

So that 

\begin{equation}
\begin{array}{rl}
\frac{d}{d\hat{x}}L_2(\epsilon) &= 2\hat{x}^TA^TA - 2b^TA = 0\\
\hat{x}^TA^TA &= b^TA\\
A^TA\hat{x} &= A^Tb\\
\hat{x} &= \left(A^TA\right)^{-1}A^Tb
\end{array}
\end{equation}
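Having the derivative in hand also allows minimizing the loss iteratively. The sketch below runs plain gradient descent, which is not part of the derivation above and is used here only as an illustration, on a made-up system. It then compares the result with the closed-form solution. The gradient is written in column-vector form, i.e. the transpose of $2\hat{x}^TA^TA - 2b^TA$. The step size and iteration count are assumptions chosen so that this small example converges.

```python
import numpy as np

# Made-up overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [1.0, 2.0]])
b = np.array([2.0, 0.0, 4.0])

def gradient(x):
    """Column-vector form of 2 x^T A^T A - 2 b^T A."""
    return 2 * A.T @ (A @ x - b)

# Step size 1/L, where L = 2 * (largest eigenvalue of A^T A) bounds the curvature.
step = 1.0 / (2 * np.linalg.eigvalsh(A.T @ A).max())

x = np.zeros(A.shape[1])
for _ in range(500):
    x = x - step * gradient(x)

# Closed-form solution from the normal equations, for comparison.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
print(x, x_normal)
```

The iterative route becomes more interesting once the loss gains extra terms, such as a regularization penalty, for which a closed form may not exist.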