# Least Squares Method

---

Consider a system of linear equations with more equations than unknowns:

$$
Ax=b, \quad A\in\mathbb{R}^{m\times n},\quad m>n.
$$

If the system has a solution $x$, then $Ax-b=0$, so $\| Ax-b\|=0$ for every vector norm.

If the system has no solution, then it is natural to seek $x$ such that

$$
\|Ax-b \|_{1,2,\infty}\to \min
$$

for the chosen vector norm.

__Theorem.__ If $\mathop{\mathrm{rank}} A=n$, then the __unique__ $x$ for which 

$$
\|Ax-b \|_{2}\to \min
$$

is the solution of the system of  __normal equations__:

$$
A^T A x=A^T b. \tag{*}
$$

_Proof._  Define

$$
Q(x)=\|Ax-b\|_2^2=(x^TA^T-b^T)(Ax-b)=x^TA^T A x -2x^T A^T b+b^Tb.
$$

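Let $x$ be a solution of (*) (one exists, since $A^TA$ is nonsingular when $\mathop{\mathrm{rank}} A=n$) and let $h$ be an arbitrary vector. Then

\begin{align*}
Q(x+h)&=(x^T+h^T)A^TA(x+h)-2(x^T+h^T)A^Tb+b^Tb \\
&=Q(x) +2h^T(A^TAx-A^Tb)+h^TA^TAh\\
&= Q(x)+\|Ah\|_2^2 \\
&\geq Q(x),
\end{align*}

where the third equality holds because $A^TAx-A^Tb=0$. Hence the minimum is indeed attained at $x$.

The solution is unique: $Q(x+h)=Q(x)$ implies $\|Ah\|_2=0$, that is, $Ah=0$, so either $h=0$ or $\mathop{\mathrm{rank}} A<n$. The latter contradicts the assumption, so the theorem is proved.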
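The key identity of the proof, $Q(x+h)=Q(x)+\|Ah\|_2^2$ at the solution of (*), can be checked numerically. The following is a minimal sketch, assuming the data of the example that follows; the perturbation `h` is an arbitrary choice made only for illustration.

```julia
using LinearAlgebra

# Data of the worked example in the next section
A = [1.0 1 0; 0 1 1; 1 0 1; -1 1 1; -1 0 -1]
b = [0.0, 1, 0, 1, 0]

Q(x) = norm(A*x - b)^2      # objective function from the proof
x = (A'*A) \ (A'*b)         # solution of the normal equations (*)
h = [0.3, -0.2, 0.1]        # arbitrary perturbation (illustrative assumption)

# At the solution of (*) the linear term 2hᵀ(AᵀAx - Aᵀb) vanishes,
# so both sides should agree up to rounding
lhs = Q(x + h)
rhs = Q(x) + norm(A*h)^2
@show lhs ≈ rhs
```

Any other choice of `h` should give the same agreement, which is what makes $x$ the minimizer.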

__Geometrical interpretation.__ The vectors $Ax$ and $Ax-b$ are mutually orthogonal, 

$$
(Ax)^T\cdot (Ax - b)=x^T (A^TAx - A^Tb)=0. 
$$ 

Therefore, $Ax$ is the orthogonal projection of the vector $b$ onto the column space of $A$, that is, the set $\{Ay:\ y \textrm{ arbitrary}\}$.
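The orthogonality and the projection property can be illustrated numerically. This is a sketch under the assumption that $A$ has full column rank, so that the projector can be written as $P=A(A^TA)^{-1}A^T$; the name `P` and the test data (taken from the example below) are choices made for illustration.

```julia
using LinearAlgebra

A = [1.0 1 0; 0 1 1; 1 0 1; -1 1 1; -1 0 -1]
b = [0.0, 1, 0, 1, 0]
x = A \ b                     # least squares solution

r = A*x - b
# The residual is orthogonal to every column of A, hence also to A*x
@show norm(A'*r)
@show dot(A*x, r)

# Orthogonal projector onto the column space of A (full column rank assumed)
P = A * ((A'*A) \ A')
@show norm(P*b - A*x)         # A*x should coincide with the projection of b
```

All three printed quantities should be at the level of rounding errors.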

The solution $x$ is called the __least squares solution__ 
of the system $A x=b$. The __relative residual__

$$
q=\sqrt{\frac{Q(x)}{Q(0)}}=\frac{\|A x - b\|_2}{\|b\|_2 }
$$

measures the quality of the fit: the smaller $q$ is, the better $Ax$ approximates $b$.

## Example

Let us solve the system 
\begin{align*}
x+y&=0\\
y+z&=1\\
x+z&=0\\
-x+y+z&=1\\
-x-z&=0
\end{align*}
in the sense of least squares.

In [1]:
A=[1//1 1 0;0 1 1;1 0 1;-1 1 1;-1 0 -1]

5×3 Array{Rational{Int64},2}:
  1//1  1//1   0//1
  0//1  1//1   1//1
  1//1  0//1   1//1
 -1//1  1//1   1//1
 -1//1  0//1  -1//1

In [2]:
b=collect([0//1,1,0,1,0])

5-element Array{Rational{Int64},1}:
 0//1
 1//1
 0//1
 1//1
 0//1

In [3]:
x=(A'*A)\(A'*b)

3-element Array{Rational{Int64},1}:
 -10//29
  12//29
  11//29

In [4]:
using LinearAlgebra
# Relative residual
q=norm(A*x-b)/norm(b)

0.185695338177

If the system is overdetermined, the standard command `\` computes 
the least squares solution using QR factorization:

In [5]:
x₁=float(A)\float(b)

3-element Array{Float64,1}:
 -0.3448275862068966
  0.41379310344827624
  0.37931034482758635

In [6]:
float(x)

3-element Array{Float64,1}:
 -0.3448275862068966
  0.41379310344827586
  0.3793103448275862
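The QR-based solution can also be computed step by step. The sketch below illustrates the idea ($A=QR$, then solve $Rx=Q^Tb$ using the first $n$ components); it is not necessarily what `\` does internally, and the names `F`, `c`, `x_qr` and `x_ne` are introduced only for this illustration.

```julia
using LinearAlgebra

A = [1.0 1 0; 0 1 1; 1 0 1; -1 1 1; -1 0 -1]
b = [0.0, 1, 0, 1, 0]

F = qr(A)                          # Householder QR factorization, A = Q*R
c = (F.Q' * b)[1:size(A, 2)]       # first n components of Qᵀb
x_qr = UpperTriangular(F.R) \ c    # back substitution with the n×n factor R

x_ne = (A'*A) \ (A'*b)             # normal equations, for comparison
@show norm(x_qr - x_ne)
```

For this small, well-conditioned matrix both approaches should agree to roughly machine precision.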

## Example

In [7]:
import Random
Random.seed!(123)
A=rand(20,10)
b=rand(20);

In [8]:
x=A\b

10-element Array{Float64,1}:
  0.09126520276532515
  0.2325329372697541
 -0.23867707369510557
 -0.16294801609881102
  0.08926547724020212
  0.2631846339836788
  0.5435390650803674
 -0.11240823390574455
 -0.045249764335416116
 -0.01306538784642571

In [9]:
q=norm(A*x-b)/norm(b)

0.463736324615

## Perturbation theory

The __sensitivity of the least squares problem__ is given by the following bounds (see [Matrix Computations, Section 5][GVL13]).

The __condition number__ of a general matrix $A$ of full column rank is:

$$
\kappa_2(A)=\sqrt{\kappa_2(A^TA)}=\|A\|_2 \|(A^TA)^{-1} A^T\|_2.
$$
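The equalities in this definition can be checked numerically. The following sketch assumes a small full-column-rank test matrix (the one from the first example) and uses the fact that, for such a matrix, `pinv(A)` equals $(A^TA)^{-1}A^T$; the variable names are illustrative.

```julia
using LinearAlgebra

A = [1.0 1 0; 0 1 1; 1 0 1; -1 1 1; -1 0 -1]

σ = svdvals(A)                          # singular values in descending order
κ_svd  = σ[1] / σ[end]                  # σ_max / σ_min
κ_pinv = opnorm(A) * opnorm(pinv(A))    # ‖A‖₂ ‖(AᵀA)⁻¹Aᵀ‖₂
κ_gram = sqrt(cond(A'*A))               # √κ₂(AᵀA)

@show κ_svd κ_pinv κ_gram cond(A)
```

All four values should coincide up to rounding, since `cond(A)` uses the 2-norm by default.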

Let $x$ and $\hat x$ be the least squares solutions of the systems $Ax=b$ and 
$(A+\delta A)\hat x=b+\delta b$, respectively. The __residuals__ are defined by

\begin{align*}
r&=Ax-b\\
\hat r&=(A+\delta A)\hat x-(b+\delta b).
\end{align*}

Let

$$
\epsilon=\max \bigg\{ \frac{\|\delta A\|_2}{\|A\|_2},\frac{\|\delta b\|_2}{\|b\|_2}\bigg\}
$$

and

$$
q=\frac {\|r\|_2}{\|b\|_2}\equiv\sin\theta <1.
$$

Then,

\begin{align*}
\frac{\|\hat x-x\|_2}{\|x\|_2}&\leq \epsilon \bigg[\frac{2\,\kappa_2(A)}{\cos \theta} +\tan\theta \,\kappa_2^2(A)\bigg]+O(\epsilon^2),\\
\frac{\|\hat r-r\|_2}{\|b\|_2}&\leq \epsilon\,[1+ 2\,\kappa_2(A)](m-n)+O(\epsilon^2).
\end{align*}

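We see that the residual itself is less sensitive than the solution at which it is attained: the bound for the residual grows with $\kappa_2(A)$, while the bound for the solution grows with $\kappa_2^2(A)$.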

[GVL13]: https://books.google.hr/books?id=X5YfsuCWpxMC&printsec=frontcover&hl=hr#v=onepage&q&f=false "G. H. Golub and C. F. Van Loan, 'Matrix Computations', 4th Edition, Johns Hopkins University Press, Baltimore, 2013"

In [10]:
cond(A)

17.551938062895363

In [11]:
δA=1e-4*(rand(20,10).-0.5)

20×10 Array{Float64,2}:
  3.70804e-5  -1.46483e-5   3.78108e-5  …  -3.40051e-6   2.97237e-7
 -2.92476e-5  -3.01169e-5  -7.11582e-7     -3.10979e-5  -7.3709e-6
 -3.62478e-5  -4.62203e-5  -2.18e-5         3.34273e-5   1.22474e-5
 -1.4055e-5   -2.2101e-5    2.47534e-5     -3.17975e-5  -2.09256e-5
  9.99739e-6  -3.74349e-5   4.32239e-5     -2.72284e-5  -4.57705e-5
 -1.5921e-5   -4.15485e-5   4.53109e-6  …  -1.20715e-5   1.95729e-5
  1.56878e-5  -3.85177e-6   2.79167e-6     -3.39768e-5  -4.07255e-6
 -2.53126e-5   2.53036e-5  -8.00883e-6     -5.53739e-6  -1.31477e-5
  2.30285e-5   2.34625e-5  -1.08889e-5      3.90731e-6  -2.69901e-5
  2.9528e-5   -1.62549e-6   4.34735e-5     -4.67067e-5   3.4356e-5
  4.91776e-6  -1.25858e-5   2.52665e-5  …   1.57212e-5  -3.02322e-5
 -6.00838e-6  -1.4645e-5   -3.7536e-5       1.70556e-5   3.07931e-5
 -3.54482e-5  -4.57917e-5   4.34059e-5     -2.36215e-5  -7.05426e-6
  1.69838e-6  -3.42984e-5   4.95414e-5     -1.88728e-5  -3.29889e-5
  4.96909e-5   4.42417e-5 

In [12]:
x₁=(A+δA)\b

10-element Array{Float64,1}:
  0.09124152035466926
  0.23255326896117623
 -0.23868978883842978
 -0.16295402718226257
  0.08928215497482142
  0.2631596591254607
  0.5434960758646433
 -0.11242053952020316
 -0.045216901317028727
 -0.012995050999963615

In [13]:
r=A*x-b
r₁=(A+δA)*x₁-b

20-element Array{Float64,1}:
  0.28031436336692706
 -0.0016826676004954577
  0.030033122084497266
  0.1518264426892456
  0.13540479424301285
 -0.17325544231692047
  0.23916375817007907
  0.40511688301291676
 -0.06324110867185251
 -0.0376593966430811
 -0.3342630249732057
 -0.05243589518832148
  0.16805905196775228
 -0.024219833108866218
 -0.20719778289290064
 -0.2142518706320662
 -0.0990985874134982
  0.1059955463423751
 -0.2700643963526772
 -0.457295064717957

In [14]:
norm(x₁-x)/norm(x), norm(r₁-r)/norm(b)

(0.0001376041299271579, 5.1862224431428776e-5)
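The observed errors can be compared with the first-order bounds stated above. The sketch below is self-contained and regenerates random data, so its numbers need not match the cells above (random streams differ between Julia versions); it assumes $\delta b=0$, as in the experiment, and drops the $O(\epsilon^2)$ terms. The names `bound_x`, `bound_r`, `err_x` and `err_r` are introduced here for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(123)                   # arbitrary seed (assumption)
m, n = 20, 10
A = rand(m, n); b = rand(m)
δA = 1e-4 * (rand(m, n) .- 0.5)     # perturbation of A only, so δb = 0

x  = A \ b;         r  = A*x - b
x₁ = (A + δA) \ b;  r₁ = (A + δA)*x₁ - b

ϵ = opnorm(δA) / opnorm(A)          # relative perturbation (the δb term is zero)
κ = cond(A)
sinθ = norm(r) / norm(b)            # relative residual q
cosθ = sqrt(1 - sinθ^2)
tanθ = sinθ / cosθ

# First-order bounds in the form given in the text
bound_x = ϵ * (2κ / cosθ + tanθ * κ^2)
bound_r = ϵ * (1 + 2κ) * (m - n)

err_x = norm(x₁ - x) / norm(x)
err_r = norm(r₁ - r) / norm(b)
@show err_x bound_x
@show err_r bound_r
```

The bounds are worst-case estimates, so the observed errors are typically well below them.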

## Error analysis and accuracy

If $\mathop{\mathrm{rank}}A =n$, the matrix $A^TA$ is symmetric and positive definite, so the system 
(*) can be solved using Cholesky factorization.

The computed solution $\hat x$ satisfies

$$
(A^TA +E)\hat x=A^Tb,
$$

where

$$ 
\|E\|_2\approx \varepsilon \| A^TA\|_2,
$$

and $\varepsilon$ denotes the machine precision, so the relative error is approximately

$$
\frac{\|\hat x -x\|_2}{\|x\|_2}\approx \varepsilon \kappa_2(A^TA) =\varepsilon \kappa^2_2(A).
$$
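The identity $\kappa_2(A^TA)=\kappa_2^2(A)$ used here follows from the singular value decomposition. A short derivation, assuming $A=U\Sigma V^T$ with singular values $\sigma_1\geq\cdots\geq\sigma_n>0$ (notation not used elsewhere in this text):

$$
A^TA=V\Sigma^TU^TU\Sigma V^T=V\,\Sigma^T\Sigma\, V^T,
\qquad
\kappa_2(A^TA)=\frac{\sigma_1^2}{\sigma_n^2}=\bigg(\frac{\sigma_1}{\sigma_n}\bigg)^2=\kappa_2^2(A).
$$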

Therefore, the relative error of the solution obtained using the normal equations depends on the __square of the condition number__, so it is better to use QR factorization.
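The effect of the squared condition number can be observed on an ill-conditioned problem. The sketch below uses a hypothetical test case that is not part of the text above: polynomial fitting with a Vandermonde matrix on $[0,1]$ and a right-hand side constructed so that the exact solution is known. The sizes and names are illustrative assumptions.

```julia
using LinearAlgebra

# Hypothetical ill-conditioned test problem (not from the text above)
m, n = 50, 8
t = range(0, stop=1, length=m)
A = [tj^(k-1) for tj in t, k in 1:n]    # Vandermonde matrix
x_true = ones(n)
b = A * x_true                           # consistent system, exact solution known

# Normal equations solved by Cholesky factorization
C = cholesky(Symmetric(A'*A))
x_chol = C \ (A'*b)

# Least squares via backslash (QR-based)
x_qr = A \ b

err_chol = norm(x_chol - x_true) / norm(x_true)
err_qr   = norm(x_qr - x_true) / norm(x_true)
@show cond(A)
@show err_chol
@show err_qr
```

One would expect `err_chol` to be noticeably larger than `err_qr`, roughly in line with $\varepsilon\kappa_2^2(A)$ versus $\varepsilon\kappa_2(A)$, although the exact figures depend on the machine and the Julia version.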