---
# Section 3.1: The Discrete Least Squares Problem
---

## Over-determined linear systems

Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. 

If $m > n$, we say the linear system 

$$
Ax = b
$$ 

is **over-determined**.

Over-determined linear systems often have no solution, typically because of measurement errors in $b$.

---

## Example

In [None]:
using LinearAlgebra

In [None]:
A = rand(5, 2)

In [None]:
b = rand(5)

In [None]:
x = A\b

In [None]:
r = b - A*x

In [None]:
norm(r)

In [None]:
# Compare with a randomly chosen x: its residual 2-norm cannot be smaller than the one above
x = randn(2)
r = b - A*x
norm(r)

---

## Minimizing the error

Since we cannot solve $Ax=b$ exactly, we want to find $x$ such that the **residual**

$$
r = b - Ax
$$

is as small as possible:

$$
\min_x \|b - Ax\|
$$

We can consider many different norms, such as:

$$
\|b - Ax\|_1 = \sum_{i=1}^m \big| b_i - (Ax)_i \big| = \sum_{i=1}^m \big| b_i - a_i^Tx \big|
$$

$$
\|b - Ax\|_2 = \sqrt{\sum_{i=1}^m \big( b_i - (Ax)_i \big)^2} = \sqrt{\sum_{i=1}^m \big( b_i - a_i^Tx \big)^2}
$$

$$
\|b - Ax\|_\infty = \max_{1\leq i\leq m} \big| b_i - (Ax)_i \big| = \max_{1\leq i\leq m} \big| b_i - a_i^Tx \big|
$$
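
As a small illustration of the three formulas above, the following sketch evaluates each norm of a residual with `norm` from the `LinearAlgebra` standard library. The matrix, right-hand side, and trial vector are hypothetical values chosen only so the arithmetic is easy to follow.

```julia
using LinearAlgebra

# Hypothetical data, chosen only for illustration
A = [1.0 0.0; 0.0 1.0; 1.0 1.0]
b = [1.0, 2.0, 4.0]
x = [1.0, 1.0]

r = b - A*x        # residual vector [0, 1, 2]

norm(r, 1)         # 1-norm: sum of absolute values
norm(r)            # 2-norm (the default): square root of the sum of squares
norm(r, Inf)       # ∞-norm: largest absolute value
```

The second argument of `norm` selects $p$; omitting it gives the 2-norm.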





---

## Least squares

When the errors in the entries of $b$ are believed to be **independent and identically [normally distributed](https://en.wikipedia.org/wiki/Normal_distribution)** with **zero mean** and **constant variance**, the best choice is to minimize $\|b - Ax\|_2$. 

In this case, the $x$ that minimizes $\|b - Ax\|_2$ is the **maximum likelihood estimate** of the true solution.

Minimizing $\|b - Ax\|_2$ is equivalent to minimizing

$$
\|b - Ax\|_2^2 = \sum_{i=1}^m \big( b_i - (Ax)_i \big)^2.
$$


The $x$ that minimizes $\|b - Ax\|_2$ (or equivalently $\|b - Ax\|_2^2$) is called the **least-squares solution** because it minimizes the **sum of the squares** of the entries of the residual.
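
The maximum likelihood claim can be checked with a short calculation. This is a sketch under the assumption that the data follow the model $b_i = a_i^T x_{\mathrm{true}} + \varepsilon_i$ with independent $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$. The symbols $x_{\mathrm{true}}$, $\varepsilon_i$, $\sigma$, and $L$ are introduced here for illustration. The likelihood of observing $b$ for a candidate $x$ is

$$
L(x) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\big( b_i - a_i^Tx \big)^2}{2\sigma^2} \right),
$$

so that

$$
\log L(x) = -\frac{m}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2} \sum_{i=1}^m \big( b_i - a_i^Tx \big)^2 = -\frac{m}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2} \|b - Ax\|_2^2.
$$

The first term does not depend on $x$, so maximizing $L(x)$ is the same as minimizing $\|b - Ax\|_2^2$.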

---

## `randn`

In [None]:
?randn

In [None]:
v = randn(10^6)

In [None]:
using Statistics

In [None]:
mean(v)

In [None]:
cov(v)

In [None]:
using Plots

In [None]:
histogram(v, normalize=true, bins=100, label=nothing)

---

## Solving the least-squares problem

In Julia (or MATLAB), we can solve the least-squares problem

$$
\min_x \|b - Ax\|_2
$$

by using the same **backslash** function that we used to solve $n \times n$ linear systems:

```julia
x = A\b
```

Julia detects that $A$ is not square and solves the least-squares problem with an algorithm based on the $QR$-factorization of $A$.
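
One way to gain confidence that backslash returns a minimizer is to check that the residual is orthogonal to the columns of $A$, that is, $A^T(b - Ax) \approx 0$. This optimality condition (the normal equations) is a standard fact that is not derived in this section, and the data below are hypothetical values used only for the check.

```julia
using LinearAlgebra

# Hypothetical over-determined system: 4 equations, 2 unknowns
A = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
b = [2.1, 3.9, 6.2, 7.8]

x = A\b                  # least-squares solution via backslash
r = b - A*x              # residual at the solution

norm(A'*r)               # close to zero: r is orthogonal to the columns of A

# Any perturbed x should give a residual that is no smaller
dx = [1e-3, -1e-3]
norm(b - A*(x + dx)) >= norm(r)
```

Forming $A^TA$ explicitly is used here only as a check; backslash itself does not solve the problem that way.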

---

## Example

In [None]:
A = rand(5, 2)

In [None]:
xtrue = rand(2)
b = A*xtrue + 0.01*randn(5)  # Add random noise to b
[A*xtrue b]

In [None]:
# Solve the least-squares problem:  minimize norm(b - A*x)
x = A\b

In [None]:
xtrue

In [None]:
b - A*x

In [None]:
norm(b - A*x)

In [None]:
# The true coefficients cannot give a smaller residual 2-norm than the least-squares solution
norm(b - A*xtrue)

---

## The $QR$-factorization

The $QR$-factorization of $A$ is

$$
A = QR
$$

where 

- $Q$ is an $m \times m$ orthogonal matrix ($Q^TQ = QQ^T = I$),
- $R$ is an $m \times n$ "upper-triangular" matrix (all entries below the main diagonal are zero).

Alternatively, we can obtain a more compact (thin) $QR$-factorization $A = QR$, where

- $Q$ is an $m \times n$ matrix with orthonormal columns ($Q^TQ = I$),
- $R$ is an $n \times n$ upper-triangular matrix.
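
The compact factorization gives a direct way to solve the least-squares problem. Since $Q^TQ = I$, we have $\|b - Ax\|_2^2 = \|Q^Tb - Rx\|_2^2 + \|(I - QQ^T)b\|_2^2$, and the second term does not depend on $x$. Assuming $A$ has full column rank (so that $R$ is invertible), the minimizer therefore solves the triangular system $Rx = Q^Tb$. The following is a minimal sketch of that idea on hypothetical data. It is not necessarily the exact algorithm that backslash uses.

```julia
using LinearAlgebra

# Hypothetical over-determined system with full column rank
A = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
b = [2.1, 3.9, 6.2, 7.8]

F = qr(A)
Q = Matrix(F.Q)          # 4×2 matrix with orthonormal columns
R = F.R                  # 2×2 upper-triangular matrix

# Solve R*x = Q'*b by back substitution
x_qr = UpperTriangular(R) \ (Q'*b)

# Compare with the built-in least-squares solve
x_bs = A\b
x_qr ≈ x_bs
```

Wrapping `R` in `UpperTriangular` tells Julia to use back substitution rather than a general solver.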

---

## `qr`

In [None]:
A = rand(5, 2)

In [None]:
?qr

In [None]:
# Q, R = qr(A)

F = qr(A)

In [None]:
Q = F.Q

In [None]:
R = F.R

In [None]:
Q*R - A

In [None]:
dump(Q)

In [None]:
# Convert to an explicit 5×2 matrix with orthonormal columns
Qthin = Matrix(Q)

In [None]:
Qthin*R - A

In [None]:
# Multiply by the 5×5 identity to form the full 5×5 orthogonal matrix
Qfull = Q*Matrix(I, 5, 5)

In [None]:
[R; zeros(3,2)]

In [None]:
Qfull*[R; zeros(3,2)] - A

In [None]:
Qfull'*Qfull

In [None]:
Qfull*Qfull'

In [None]:
Qthin'*Qthin

In [None]:
# Not the identity: this is the orthogonal projector onto the column space of A
Qthin*Qthin'

---

## Data-fitting

Suppose we are trying to approximate the true function

$$
y(t) = 1 + e^t + 3e^{-t}
$$

given a number of noisy **data points**

$$
(t_1, y_1), \ldots, (t_m, y_m)
$$

where

$$
y_i = y(t_i) + \varepsilon_i, \quad i = 1,\ldots,m,
$$

and each $\varepsilon_i$ is drawn from a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) with mean $0$ and variance $\sigma^2$:

$$
\varepsilon_i \sim \mathcal{N}(0,\sigma^2).
$$

In [None]:
using Plots

y(t) = 1 + exp(t) + 3*exp(-t)

m = 100
tt = range(0, 1, length=m)

σ = 0.1
err = σ*randn(m)
yy = y.(tt) .+ err

plot(tt, y.(tt), label="True solution")
scatter!(tt, yy, label="Noisy data")

We let our approximation be given by

$$
p(t) = x_1 + x_2 e^t + x_3 e^{-t}
$$

and we want to find the **maximum likelihood estimate** of the coefficients $x_1, x_2, x_3$.

We want to minimize

$$
\sum_{i=1}^m \big(y_i - p(t_i)\big)^2.
$$

The residual $y_i - p(t_i)$ can be written as

$$
y_i - \left(x_1 + x_2 e^{t_i} + x_3 e^{-t_i}\right) = y_i - \begin{bmatrix} 1 & e^{t_i} & e^{-t_i} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}.
$$

Therefore, the $i^\mathrm{th}$ row of the matrix $A$ is $\begin{bmatrix} 1 & e^{t_i} & e^{-t_i} \end{bmatrix}$:

$$
A = 
\begin{bmatrix} 
1 & e^{t_1} & e^{-t_1} \\
1 & e^{t_2} & e^{-t_2} \\
\vdots&\vdots&\vdots\\
1 & e^{t_m} & e^{-t_m} \\
\end{bmatrix}
$$

and $b \in \mathbb{R}^m$ has entries $b_i = y_i$ for $i = 1,\ldots,m$.

Then 

$$
\sum_{i=1}^m \big(y_i - p(t_i)\big)^2 = \|b - Ax\|_2^2.
$$
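
The row-by-row construction of $A$ works for any model that is linear in its coefficients. The sketch below uses a hypothetical helper, `design_matrix` (not part of Julia or of this section), that evaluates a list of basis functions at each $t_i$. The small grid of five points is chosen only for illustration.

```julia
# Hypothetical helper: row i of the result is [f_1(t_i) f_2(t_i) ... f_n(t_i)]
design_matrix(fs, t) = [f(ti) for ti in t, f in fs]

# Basis functions of the model p(t) = x_1 + x_2*exp(t) + x_3*exp(-t)
basis = [t -> 1.0, exp, t -> exp(-t)]

tt = range(0, 1, length=5)
A = design_matrix(basis, tt)

size(A)   # (5, 3): one row per data point, one column per coefficient
```

Changing the model only requires changing `basis`; the least-squares solve `A\b` stays the same.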

In [None]:
## True model:  y = 1 + exp(t) + 3*exp(-t)
A = [ones(m) exp.(tt) exp.(-tt)]
b = yy

## Solve the least squares problem:  min norm(b - A*x)
x = A\b

In [None]:
## Compute the norm of the residual
r = b - A*x
norm(r)

In [None]:
p(t) = x[1] + x[2]*exp(t) + x[3]*exp(-t)

plot(tt, y.(tt), label="True solution")
plot!(tt, p.(tt), label="Least-squares solution", c=3)
scatter!(tt, yy, label="Noisy data", c=2)

---