#### Low-rank approximation overview

If we have a matrix $A\in \mathbf{R}^{m\times n}$, but we don't want to store all $mn$ entries, we can try to approximate $A$ by a product

$$A\approx BX$$

where $B\in \mathbf{R}^{m\times k}$ and $X\in \mathbf{R}^{k\times n}$ with $k\ll m, n$, which means we only need to store $k(m+n)$ entries (for example, $20{,}000$ instead of $10^6$ when $m=n=1000$ and $k=10$)

This is known as low-rank approximation since the product $BX$ has rank at most $k$

Say we randomly choose $k$ columns of $A$ to construct $B$

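To obtain $X$, we solve the least squares problem

$$\min_X \|A-BX\|_F^2$$

which separates over columns, since $\|A-BX\|_F^2=\sum_{i=1}^n\|a_i-Bx_i\|^2$, where $a_i$ and $x_i$ are the $i$-th columns of $A$ and $X$

$$\min_{x_i} \|a_i-Bx_i\|^2, \quad i=1, \cdots, n$$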

Using the normal equations $B^TBx_i=B^Ta_i$ (and assuming $B$ has full column rank, so that $B^TB$ is invertible), we have

$$x_i=(B^TB)^{-1}B^Ta_i$$

and

$$X=(B^TB)^{-1}B^TA$$

Therefore, our low-rank approximation is

$$A\approx B(B^TB)^{-1}B^TA$$

where $(B^TB)^{-1}B^TA\in \mathbf{R}^{k\times n}$
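
The column-sampling recipe above can be sketched in NumPy. This is a minimal illustration, not a tuned implementation: the test matrix (a rank-$k$ signal plus small noise), the sizes, and the seed are assumptions made for the demo. It solves the least squares problem with `np.linalg.lstsq` and compares the result against the normal-equation formula.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 60, 40, 5

# Hypothetical test matrix: rank-k signal plus small noise (an assumption for the demo)
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A += 0.01 * rng.standard_normal((m, n))

# Build B from k randomly chosen columns of A
cols = rng.choice(n, size=k, replace=False)
B = A[:, cols]                               # m x k

# Solve min_X ||A - B X||_F; lstsq handles all n columns at once
X, *_ = np.linalg.lstsq(B, A, rcond=None)    # k x n

# Same answer from the normal equations X = (B^T B)^{-1} B^T A
X_ne = np.linalg.solve(B.T @ B, B.T @ A)

A_hat = B @ X
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(f"stored entries: {k * (m + n)} vs {m * n}, relative error: {rel_err:.3e}")
```

In practice `lstsq` (or a QR factorization) is usually preferred over forming $B^TB$ explicitly, since the normal equations square the condition number of $B$.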

But, what is the `best` low-rank approximation of $A$?

#### Low-rank approximation with SVD

For $A\in \mathbf{R}^{m \times n}$ (tall, square or fat), the (compact) SVD is given by

$$A=U\Sigma V^T=\sum_{i=1}^r\sigma_i u_i v_i^T$$

where
* $A\in \mathbf{R}^{m \times n}$, $\text{rank}(A)=r$
* $U\in \mathbf{R}^{m \times r}$, $U^TU=I$
* $V\in \mathbf{R}^{n \times r}$, $V^TV=I$
* $\Sigma =\text{diag}(\sigma_1, \cdots, \sigma_r)$, $\sigma_1 \geq\cdots\geq \sigma_r > 0$

If we want a matrix $\bar{A}$ with $\text{rank}(\bar{A})\leq p < r$, then the truncated sum below gives the `optimal approximation`, that is, it `minimizes` $\|A-\bar{A}\|$ (the matrix 2-norm) over all matrices of rank at most $p$

$$\bar{A}=\sum_{i=1}^p\sigma_i u_i v_i^T$$

and

$$\|A-\bar{A}\|=\left\|\sum_{i=p+1}^r\sigma_i u_i v_i^T\right\|=\boxed{\sigma_{p+1}}$$
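
A quick numerical check of the two formulas above, as a sketch: the random test matrix, its size, and the choice $p=2$ are assumptions for illustration. Note that NumPy indexes from zero, so `s[p]` holds $\sigma_{p+1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))   # hypothetical test matrix
p = 2

# Compact SVD: A = U @ diag(s) @ Vt, with s sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the first p terms of sum_i sigma_i u_i v_i^T
A_bar = (U[:, :p] * s[:p]) @ Vt[:p, :]

# Spectral norm (largest singular value) of the error
err = np.linalg.norm(A - A_bar, 2)
print(f"||A - A_bar||_2 = {err:.6f}, sigma_(p+1) = {s[p]:.6f}")
```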

To show this, assume $B$ is another low-rank approximation with $\text{rank}(B)\leq p$

Then, the dimension of the `nullspace` satisfies $\dim N(B) \boxed{\geq n-p}$, as $N(B)$ and $R(B^T)$ are orthogonal complements in $\mathbf{R}^n$ and $\dim R(B^T)=\text{rank}(B)\leq p$

Also, we know that $\dim \text{span}(v_1, \cdots, v_{p+1})\boxed{=p+1}$

Since $(n-p)+(p+1)=n+1>n$, the subspaces $N(B)$ and $\text{span}(v_1, \cdots, v_{p+1})$ must `intersect` nontrivially, that is, there is a unit vector $z\in \mathbf{R}^n$ such that

$$Bz=0\, \text{and} \, z\in \text{span}(v_1, \cdots, v_{p+1})$$

Then, we can write

$$(A-B)z=Az=\sum_{i=1}^{p+1}\sigma_i u_i v_i^Tz$$

There are only $p+1$ terms as $z$ is orthogonal to $v_{p+2},\cdots,v_r$

Take the squared Euclidean norm of each side; since the $u_i$ are orthonormal, the right hand side becomes the sum of the squared coefficients $\sigma_i v_i^Tz$

$$\|(A-B)z\|^2=\sum_{i=1}^{p+1}\sigma_i^2 (v_i^Tz)^2\geq \sigma_{p+1}^2\sum_{i=1}^{p+1} (v_i^Tz)^2=\sigma_{p+1}^2\|z\|^2$$

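The last equality holds since $z\in \text{span}(v_1, \cdots, v_{p+1})$ and the $v_i$ are orthonormal, so $\sum_{i=1}^{p+1} (v_i^Tz)^2=\|z\|^2$

Since $\|z\|=1$ and, by definition of the `matrix norm`, $\|A-B\|=\max_{\|x\|=1}\|(A-B)x\|\geq \|(A-B)z\|$, we get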

$$\|A-B\|\geq \sigma_{p+1} = \|A-\bar{A}\|$$

This means SVD gives the low-rank approximation with `smallest` $\|A-\bar{A}\|$
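
The key step of the argument, finding a unit vector $z$ in both $N(B)$ and $\text{span}(v_1, \cdots, v_{p+1})$, can be reproduced numerically. The sketch below assumes a random $A$ and a random rank-$p$ competitor $B$ (both hypothetical choices for the demo). It writes $z=V_1c$, where the columns of $V_1$ are $v_1,\cdots,v_{p+1}$, and picks $c$ as a null vector of $BV_1$, which exists because $BV_1$ has $p+1$ columns but rank at most $p$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 8, 6, 2
A = rng.standard_normal((m, n))                 # hypothetical test matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# An arbitrary competitor with rank at most p
B = rng.standard_normal((m, p)) @ rng.standard_normal((p, n))

V1 = Vt[:p + 1, :].T                            # n x (p+1), columns v_1..v_{p+1}

# B @ V1 is m x (p+1) with rank <= p, so its last right singular
# vector c satisfies B V1 c = 0 (up to rounding)
_, _, Wt = np.linalg.svd(B @ V1)
c = Wt[-1, :]
z = V1 @ c                                      # unit vector in N(B) and span(v_1..v_{p+1})

lhs = np.linalg.norm((A - B) @ z)               # lower bound on ||A - B||_2
print(f"||Bz|| = {np.linalg.norm(B @ z):.2e}")
print(f"||(A-B)z|| = {lhs:.4f} >= sigma_(p+1) = {s[p]:.4f}")
print(f"||A-B||_2 = {np.linalg.norm(A - B, 2):.4f}")
```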

Essentially, SVD ranks all the terms $u_iv_i^T$ in order of `importance` as determined by $\sigma_i$; we then keep as many as we can afford or need, in this case the first $p$ of them

#### Singular value as distance to singularity

With low-rank approximation, we can interpret singular value $\sigma_i$ as

$$\sigma_i=\min \{\|A-B\| \, | \, \text{rank}(B)\leq i-1\}$$

That is, the `distance` (measured by matrix norm) to the nearest matrix of rank at most $i-1$

For example

* if $\sigma_3$ is small, then, it means the matrix $A$ is very close to a `rank 2` matrix
* for a square matrix $A\in \mathbf{R}^{n\times n}$, if $\sigma_n=\sigma_{\min}$ is small, then $A$ is very close to a matrix of rank $n-1$ (which would be `singular`)
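
The second case can be illustrated with a small sketch, assuming a random square test matrix (size and seed are arbitrary choices). Subtracting the last term $\sigma_n u_n v_n^T$ from $A$ produces a singular matrix at distance exactly $\sigma_n$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))        # hypothetical square test matrix
U, s, Vt = np.linalg.svd(A)

# Remove the smallest singular component sigma_n * u_n * v_n^T
E = s[-1] * np.outer(U[:, -1], Vt[-1, :])
A_sing = A - E

dist = np.linalg.norm(A - A_sing, 2)   # spectral-norm distance from A
s_sing = np.linalg.svd(A_sing, compute_uv=False)

print(f"sigma_min(A)      = {s[-1]:.6f}")
print(f"||A - A_sing||_2  = {dist:.6f}")
print(f"sigma_min(A_sing) = {s_sing[-1]:.2e}")
```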

#### Model simplification

For $y=Ax+v$, if we know

* $A\in \mathbf{R}^{100 \times 30}$ has singular values: $10, 7, 2, 0.5, 0.01,\cdots, 0.0001$
* $\|x\|$ is on the order of 1
* noise $v$ has norm on the order of 0.1

Then, the terms $\sigma_i u_i v_i^T x$ for $i\geq 5$ each have norm at most $\sigma_i\|x\|$, which is on the order of $0.01$ or less, so they are `substantially smaller than the noise term`

So we can simplify the model to

$$y=\sum_{i=1}^4\sigma_i u_i v_i^T x+v$$
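
The simplification can be checked on synthetic data. In this sketch the singular values between $0.01$ and $0.0001$, which the example above does not list, are assumed to be geometrically spaced; the singular vectors, $x$, and $v$ are random and scaled to the stated norms. All of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, keep = 100, 30, 4

# ASSUMPTION: the values between 0.01 and 0.0001 are not given in the text;
# geometric spacing is used here purely for illustration
s = np.concatenate(([10.0, 7.0, 2.0, 0.5], np.geomspace(0.01, 0.0001, n - keep)))

# Random orthonormal singular vectors via QR
U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # m x n
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # n x n

A = (U * s) @ V.T                                  # full model
A4 = (U[:, :keep] * s[:keep]) @ V[:, :keep].T      # first 4 terms only

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                             # ||x|| = 1
v = rng.standard_normal(m)
v *= 0.1 / np.linalg.norm(v)                       # ||v|| = 0.1

dropped = np.linalg.norm((A - A4) @ x)             # contribution of terms i >= 5
print(f"dropped terms: {dropped:.4f}, noise: {np.linalg.norm(v):.4f}")
```

The dropped contribution is bounded by $\sigma_5\|x\|=0.01$, an order of magnitude below the noise norm of $0.1$.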