# Regression

The linear problems we have studied so far are all square systems, in which the number of equations equals the number of unknowns. In practice, however, overdetermined linear systems are much more common: we almost always take more measurements than there are unknowns in order to reduce the effect of noise. How can we solve an overdetermined system?

## Linear Least Squares Regression

Consider an overdetermined linear system $A x = b$, where $A$ is an $n \times m$ rectangular matrix with $n > m$. We also require $A$ to have full column rank. Because of errors in the measurement $b$, in most cases no $x$ satisfies all the equations in the system exactly.

Instead, we look for an $x$ that minimizes the error between the model prediction ($Ax$) and the measurement ($b$). This error can be represented by the column vector:
$$ e = Ax - b $$

The most common approach is to choose the $x$ that minimizes the squared L-2 norm of the error vector $e$. Since $e$ has one entry per equation, the sum runs over all $n$ rows:
$$
E = \sum_{i=1}^{n} e_i^2 = e^T e
$$
where $e^T$ is the transpose of $e$. 
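
As a small numerical illustration, the error vector and its squared L-2 norm can be computed with NumPy. The matrix `A`, the measurements `b`, and the trial solution `x_trial` below are made up for this example and are not part of the derivation.

```python
import numpy as np

# Hypothetical overdetermined system: 4 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 3.0, 5.0, 8.0])

# An arbitrary candidate solution (not necessarily the best one).
x_trial = np.array([1.0, 2.0])

e = A @ x_trial - b   # error vector, one entry per equation
E = e @ e             # e^T e, the sum of squared errors

print("e =", e)
print("E =", E)
```

Trying a few different values of `x_trial` shows how $E$ changes with $x$; the least squares solution is the one that makes $E$ as small as possible.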

By substituting $e = Ax-b$, we have
$$
E = e^Te = (Ax-b)^T(Ax-b) = x^TA^TAx - b^TAx - x^TA^Tb + b^Tb = x^TA^TAx - 2x^TA^Tb + b^Tb
$$
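The last step uses the fact that $b^TAx$ is a scalar and therefore equals its transpose $x^TA^Tb$. To find the $x$ that minimizes $E$, we set the derivative of $E$ with respect to $x$ to zero: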
$$
\frac{\partial E}{\partial x} = -2 A^Tb + 2A^T A x = 0
$$
which gives us the **normal equations**:
$$
A^T A x = A^T b
$$
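This brings us back to the problem we studied in the last section: $A^\dagger x=b^\dagger$, where $A^\dagger = A^TA$ is a full-rank $m \times m$ square matrix and $b^\dagger = A^T b$ is a column vector. Note that $A^\dagger$ here is only a label for $A^TA$; it does not denote the pseudoinverse.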
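
The normal equations can be solved with any square-system solver. Below is a minimal sketch that fits a straight line to a handful of points, assuming NumPy is available; the data values and the variable names (`t`, `x_normal`, `x_lstsq`) are invented for illustration. The result is compared against `np.linalg.lstsq`, NumPy's built-in least squares routine.

```python
import numpy as np

# Hypothetical measurements: fit b ~ x[0] + x[1] * t.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Each row of A is one equation: [1, t_i] @ x = b_i  (5 equations, 2 unknowns).
A = np.column_stack([np.ones_like(t), t])

# Form and solve the normal equations (A^T A) x = A^T b.
AtA = A.T @ A          # 2 x 2 square matrix
Atb = A.T @ b          # length-2 vector
x_normal = np.linalg.solve(AtA, Atb)

# Cross-check with NumPy's least squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print("normal equations:", x_normal)
print("np.linalg.lstsq :", x_lstsq)
```

For small, well-conditioned problems like this one the two answers agree to rounding error. Forming $A^TA$ squares the condition number of $A$, so for ill-conditioned problems a dedicated least squares routine is usually the safer choice.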

### Exercise
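Using the following matrix differentiation rules, derive the normal equations. The rules below express the derivative of a scalar as a row vector, so the result is the transpose of the expression used in the derivation above.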
$$
\alpha = A x  \ \ \ \Longrightarrow \ \ \ 
\frac{\partial \alpha}{\partial x} = A
$$
$$
\alpha = x^T A  \ \ \ \Longrightarrow \ \ \ 
\frac{\partial \alpha}{\partial x} = A^T
$$
$$
\alpha = x^T A x  \ \ \ \Longrightarrow \ \ \ 
\frac{\partial \alpha}{\partial x} = x^T \left(A + A^T \right)
$$
The proof of the matrix differentiation rules can be found [here](https://atmos.washington.edu/~dennis/MatrixCalculus.pdf).
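
Before or after working through the proof, the gradient $2A^TAx - 2A^Tb$ can be sanity-checked numerically. The sketch below compares it with a central finite-difference approximation on random data; the sizes, the random seed, the step `h`, and the helper name `squared_error` are arbitrary choices made for this example. A numerical check like this does not replace the proof, but it catches sign and transpose mistakes quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))   # 6 equations, 3 unknowns
b = rng.normal(size=6)
x = rng.normal(size=3)        # point at which the gradient is checked


def squared_error(x):
    """Return E = (Ax - b)^T (Ax - b)."""
    e = A @ x - b
    return e @ e


# Analytic gradient, written as a column (1-D) vector.
grad_analytic = 2 * A.T @ A @ x - 2 * A.T @ b

# Central finite differences, one component of x at a time.
h = 1e-6
grad_numeric = np.zeros_like(x)
for j in range(x.size):
    step = np.zeros_like(x)
    step[j] = h
    grad_numeric[j] = (squared_error(x + step) - squared_error(x - step)) / (2 * h)

max_diff = np.max(np.abs(grad_analytic - grad_numeric))
print("largest difference:", max_diff)
```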