## Singular Value Decomposition (SVD)

Notation/Settings

| Symbol        | meaning                                                                             |
| ------------- | ----------------------------------------------------------------------------------- |
| $\delta_{ij}$ | Kronecker delta, that is $\delta_{ij}=1$ if $i=j$, and $\delta_{ij}=0$ if $i\neq j$ |

- We restrict most discussions to real matrices.
  - All results stay true for complex matrices upon replacing transpose (e.g., $A^T$, $V^T$) with conjugate transpose (e.g., $A^H$, $V^H$).


**Idea**

- Any matrix, viewed as a linear map, is essentially stretching along orthogonal directions.
  - A linear map is completely determined by its actions on basis vectors.
    - Recall how to construct matrix representation of a linear map $T:V\to W$.
  - It is remarkable that there exist (a) orthonormal bases, one on the domain and the other on the codomain, that (b) make any linear map nothing but scalar multiplications with respect to those bases.
  - The name *singular* comes from the surprise that the pioneering mathematicians felt: too good to be normal.

**Geometric Intuition** (Sauer (2017) p. 579)

$v_i$'s form the basis of a rectangular coordinate system on which $A$ acts in a simple way: It produces the basis vectors of a new coordinate system, the $u_i$’s, with some stretching quantified by the scalars $s_i$'s. The stretched basis vectors $s_i u_i$ are the semimajor axes of the ellipse.

![SVD geometry](https://www.researchgate.net/profile/Gowtham-Sivaraman/publication/312040021/figure/fig1/AS:654753159716872@1533116726789/Geometric-interpretation-of-the-SVD-s-1-and-s-2-denotes-the-principal-radii-of-the.png)

Figure: Gowtham Sivaraman (Geometry of 2-by-2 SVD)

### Fundamentals of SVD

**Theorem** (Spectral theorem for real symmetric matrix; Rephrase of Horn and Johnson (2013) Matrix analysis 2ed. Theorem 4.1.5. p. 229)

If $A$ is a real symmetric $n$-by-$n$ matrix, then there exists an orthonormal basis of $R^n$ consisting of eigenvectors of $A$. Each eigenvalue of $A$ is real.


**Lemma** 

Let $A$ be an $m \times n$ matrix. The eigenvalues of $A^T A$ are nonnegative.



Proof

Let $v$ be a unit eigenvector of $A^T A$, and $A^T A v=\lambda v$. Then
$$
0 \leq\|A v\|^2=v^T A^T A v=\lambda v^T v=\lambda .
$$

**Theorem** (Sauer (2017) p. 581)

Let $A$ be an $m \times n$ matrix where $m \geq n$. Then there exist two orthonormal bases $\left\{v_1, \ldots, v_n\right\}$ of $R^n$, and $\left\{u_1, \ldots, u_m\right\}$ of $R^m$, and real numbers $s_1 \geq \cdots \geq s_n \geq 0$ such that $A v_i=s_i u_i$ for $1 \leq i \leq n$. The columns of $V=\left[v_1|\ldots| v_n\right]$, the right singular vectors, are the set of orthonormal eigenvectors of $A^T A$; and the columns of $U=\left[u_1|\ldots| u_m\right]$, the left singular vectors, are the set of orthonormal eigenvectors of $A A^T$. That is, we have $A=USV^T$.

Constructive version (Human-friendly; Sauer (2017) p. 581)

1. $s_i$'s (singular values): Find eigenvalues (nonnegative) of $A^T A$ ($n$-by-$n$) in the decreasing order $s_1^2 \ge s_2^2 \ge \cdots \ge s_n^2 \ge 0$ along with
1. $v_i$'s (right singular vectors): corresponding eigenvectors $v_i$ ($i=1,2,\cdots, n$).
1. $u_i$'s (left singular vectors): If $s_i \neq 0$, define $u_i$ by the equation $s_i u_i=A v_i$. Choose each remaining $u_i$ as an arbitrary unit vector subject to being orthogonal to $u_1, \ldots, u_{i-1}$ ($i=1,2,\cdots, m$).

**Remark** 

- $u_i$'s are automatically mutually orthogonal. (Why?)
- The SVD is not unique. 
  - Replacing $v_1$ by $-v_1$ and $u_1$ by $-u_1$ does not change the equality, but changes the matrices $U$ and $V$.

**Example** (Sauer (2017) p. 581)

Find the singular value decomposition of the $4 \times 2$ matrix
$$
A=\left[\begin{array}{rr}
3 & 3 \\
-3 & -3 \\
-1 & 1 \\
1 & -1
\end{array}\right] .
$$

0. Preliminary

$$
A^T A=\left[\begin{array}{ll}
20 & 16 \\
16 & 20
\end{array}\right]
$$

1. Eigenvectors and eigenvalues 

$$
v_1=\begin{bmatrix}1 / \sqrt{2} \\ 1 / \sqrt{2}\end{bmatrix}, 
\quad 
v_2=\begin{bmatrix}1 / \sqrt{2} \\ -1 / \sqrt{2}\end{bmatrix},
\quad
\begin{array}{l}
s_1^2=36 \\ 
s_2^2=4
\end{array}
$$



2. Singular values

$$
\begin{array}{l}
s_1=6 \\ 
s_2=2
\end{array}
$$

3. Right singular vectors

$v_1, v_2$ (same as eigenvectors of $A^T A$)

4. Left singular vectors

From 

$$
6 u_1=A v_1=\left[\begin{array}{r}
3 \sqrt{2} \\
-3 \sqrt{2} \\
0 \\
0
\end{array}\right] \quad 2 u_2=A v_2=\left[\begin{array}{r}
0 \\
0 \\
-\sqrt{2} \\
\sqrt{2}
\end{array}\right]
$$

we have

$$
u_1=\left[\begin{array}{r}
\frac{1}{\sqrt{2}} \\
-\frac{1}{\sqrt{2}} \\
0 \\
0
\end{array}\right] \quad u_2=\left[\begin{array}{r}
0 \\
0 \\
-\frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}}
\end{array}\right] .
$$

For $i = 3, 4$, choose

$$
u_3=\left[\begin{array}{c}
\frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} \\
0 \\
0
\end{array}\right] \quad u_4=\left[\begin{array}{c}
0 \\
0 \\
\frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}}
\end{array}\right]
$$

If such vectors are not easy to guess, we can use Gram-Schmidt starting with $\{u_1, u_2, e_3, e_4\}$, where $e_i = [\delta_{ij}]_{1\le j \le 4}^T$ and $\delta_{ij}$ is Kronecker delta.



5. SVD

$$
A=\left[\begin{array}{rr}
3 & 3 \\
-3 & -3 \\
-1 & 1 \\
1 & -1
\end{array}\right]=U S V^T=\left[\begin{array}{rrrr}
\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} & 0 \\
-\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} & 0 \\
0 & -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\
0 & \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}}
\end{array}\right]\left[\begin{array}{ll}
6 & 0 \\
0 & 2 \\
0 & 0 \\
0 & 0
\end{array}\right]\left[\begin{array}{cc}
\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}}
\end{array}\right] .
$$
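
The constructive procedure above can be sketched in NumPy. This is a minimal illustration, not how library routines compute the SVD (forming $A^T A$ squares the condition number, which libraries avoid). The function name `constructive_svd` and the tolerance `tol` are assumptions made for this sketch, and the remaining $u_i$ are obtained from a QR factorization instead of guessing or running Gram-Schmidt by hand.

```python
import numpy as np

def constructive_svd(A, tol=1e-12):
    """Build a full SVD of an m-by-n matrix (m >= n) from the eigenpairs of A^T A."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Steps 1-2: eigenpairs of the symmetric matrix A^T A, sorted in decreasing order.
    eigvals, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    s = np.sqrt(np.clip(eigvals, 0.0, None))  # clip guards against tiny negative round-off
    # Step 3: u_i = A v_i / s_i whenever s_i is nonzero.
    r = int(np.sum(s > tol))
    U_r = (A @ V[:, :r]) / s[:r]
    # Complete to an orthonormal basis of R^m: the trailing columns of the full Q
    # are orthonormal and orthogonal to the columns of U_r.
    Q, _ = np.linalg.qr(np.hstack([U_r, np.eye(m)]), mode="complete")
    U = np.hstack([U_r, Q[:, r:]])
    S = np.zeros((m, n))
    S[:n, :n] = np.diag(s)
    return U, S, V

A = np.array([[3.0, 3.0], [-3.0, -3.0], [-1.0, 1.0], [1.0, -1.0]])
U, S, V = constructive_svd(A)
print(np.diag(S))                    # singular values 6 and 2 (up to round-off)
print(np.allclose(U @ S @ V.T, A))
```

The signs of the computed $u_i$, $v_i$ and the choice of $u_3, u_4$ may differ from the hand computation, which reflects the non-uniqueness of the SVD.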

**Reduced/Economic SVD**

- The lower block of zeros in $S$ and the corresponding number of left singular vectors do not contribute to $A$.
- Remove them, making it *reduced SVD* or *economic SVD*.


Reduced SVD

$$
A=\left[\begin{array}{rr}
3 & 3 \\
-3 & -3 \\
-1 & 1 \\
1 & -1
\end{array}\right]=\hat U \hat S V^T=\left[\begin{array}{rr}
\frac{1}{\sqrt{2}} & 0 \\
-\frac{1}{\sqrt{2}} & 0 \\
0 & -\frac{1}{\sqrt{2}}\\
0 & \frac{1}{\sqrt{2}} 
\end{array}\right]\left[\begin{array}{ll}
6 & 0 \\
0 & 2
\end{array}\right]\left[\begin{array}{cc}
\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}}
\end{array}\right] .
$$

Clicker question

Case 2: $m\le n$

- Find SVD of $A^T$ to get $A^T=U S V^T$. 
- Then, $A=\left(U S V^T\right)^T=V S^T U^T$ is the SVD of $A$. [Sauer (2017) p. 582]

**Remark** (Sizes of matrices; Sauer (2017) p. 582)

- SVD of $m$-by-$n$ matrix $A=USV^T$
  - $S$ has the same size as $A$.
- Reduced SVD of $m$-by-$n$ matrix $A=\hat U \hat S V^T$ with $m\ge n$
  - $\hat U$ has the same size as $A$.
- Reduced SVD of $m$-by-$n$ matrix $A=U \hat S \hat V^T$ with $m\le n$
  - $\hat V^T$ has the same size as $A$.
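
The sizes can be checked with NumPy's `numpy.linalg.svd`, whose `full_matrices` flag switches between the full and the reduced SVD. Note that NumPy returns the singular values as a 1-D array and returns $V^T$ rather than $V$; the variable names below are chosen only for this illustration.

```python
import numpy as np

A = np.array([[3.0, 3.0], [-3.0, -3.0], [-1.0, 1.0], [1.0, -1.0]])  # 4-by-2, so m >= n

U, s, Vt = np.linalg.svd(A)                          # full SVD
Uh, sh, Vth = np.linalg.svd(A, full_matrices=False)  # reduced SVD

print(U.shape, s.shape, Vt.shape)     # (4, 4) (2,) (2, 2)
print(Uh.shape, sh.shape, Vth.shape)  # (4, 2) (2,) (2, 2)
```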

**Remark** 

- The geometric intuition of SVD is indeed carried out in a literal sense.
  - $V^T=V^{-1}$ is the change-of-basis matrix from the standard basis $E=\{e_i \in R^n \ : \ [e_i]_j = \delta_{ij} \}$ to the basis $\{v_1, v_2, \cdots, v_n \}$, whose vectors are the columns of $V=[v_1 | v_2 | \cdots | v_n]$.
    - An input vector given in the standard basis is converted to its coordinates with respect to the $v_i$'s.
  - The coordinate along each direction $v_i$ is scaled by $s_i$.
  - The scaled values are the coordinates of the output with respect to the $u_i$'s; taking the corresponding linear combination of the $u_i$'s recovers the output in the standard basis.
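
The three stages in this remark can be traced numerically. The sketch below uses a 2-by-2 matrix and an arbitrarily chosen input vector `x`; both are only for illustration.

```python
import numpy as np

A = np.array([[1.0, 0.3], [0.45, 1.2]])
U, s, Vt = np.linalg.svd(A)

x = np.array([2.0, -1.0])  # arbitrary input, given in the standard basis
y = Vt @ x                 # coordinates of x with respect to v_1, v_2
z = s * y                  # stretch the i-th coordinate by s_i
out = U @ z                # linear combination of u_1, u_2 with weights z

print(np.allclose(out, A @ x))  # True
```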

### Properties of SVD

In the following, suppose $A=USV^T$ is an SVD of an $m$-by-$n$ matrix $A$, where $s_r$ is the smallest nonzero singular value: $s_1 \ge \cdots \ge s_r > s_{r+1} = \cdots = s_{\min\{m,n\}} = 0$.

**Property 1** (Sauer (2017) p. 585)

The rank of the matrix $A=U S V^T$ is the number of nonzero entries in $S$.



Proof. Since $U$ and $V^T$ are invertible matrices, $\operatorname{rank}(A)=\operatorname{rank}(S)$, and the latter is the number of nonzero diagonal entries.



**Property 2** (Sauer (2017) p. 585)

If $A$ is an $n \times n$ matrix, $|\operatorname{det}(A)|=s_1 \cdots s_n$.

Proof. Since $U^T U=I$ and $V^T V=I$, the determinants of $U$ and $V^T$ are $1$ or $-1$, due to the fact that the determinant of a product equals the product of the determinants. Property 2 follows from the factorization $A=U S V^T$.


**Property 3** (Sauer (2017) p. 585)

If $A$ is an invertible $m \times m$ matrix, then $A^{-1}=V S^{-1} U^T$.

Proof. By Property 1, $S$ is invertible, meaning all $s_i>0$. Now Property 3 follows from the fact that if $A_1, A_2$, and $A_3$ are invertible matrices, then $\left(A_1 A_2 A_3\right)^{-1}=A_3^{-1} A_2^{-1} A_1^{-1}$, and that $U,V$ are orthogonal.

**Remark**

- Property 3 says that obtaining $A^{-1}$ is simple once its SVD is known
  - $V$ and $U$ are just transposed and $S^{-1}=\mathrm{diag}(s_i^{-1})$.
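
Properties 1 to 3 can be checked numerically on a small invertible matrix. The tolerance `tol` used to decide which singular values count as nonzero is an assumption of this sketch (a common choice, but not the only one).

```python
import numpy as np

A = np.array([[1.0, 0.3], [0.45, 1.2]])
U, s, Vt = np.linalg.svd(A)

tol = max(A.shape) * np.finfo(float).eps * s[0]
rank = int(np.sum(s > tol))            # Property 1: number of nonzero singular values
abs_det = np.prod(s)                   # Property 2: |det(A)| = s_1 * ... * s_n
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T  # Property 3: A^{-1} = V S^{-1} U^T

print(rank, abs_det, abs(np.linalg.det(A)))
print(np.allclose(A_inv @ A, np.eye(2)))
```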

**Property 4** (Sauer (2017) p. 586)

The $m \times n$ matrix $A$ can be written as the sum of rank-one matrices

$$
A=\sum_{i=1}^r s_i u_i v_i^T,
$$

where $r$ is the rank of $A$, and $u_i$ and $v_i$ are the $i$th columns of $U$ and $V$, respectively.

Proof. (sketch)

Given $A=USV^T$, 

1. Split $S$ into sum of $r$ matrices of a single nonzero entry.
2. Expand the result and carry out block multiplication.

**Remark**

- Each summand in Property 4 is called a *rank-one* matrix.
  - Each column of $u_i v_i^T$ is a scalar multiple of $u_i$.
  - If you haven't, write it out.
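
As a numerical counterpart to writing it out by hand, the sketch below forms each term $s_i u_i v_i^T$ for the $4 \times 2$ example matrix and checks that the terms have rank one and sum to $A$. The names `terms` and `A_sum` are illustrative.

```python
import numpy as np

A = np.array([[3.0, 3.0], [-3.0, -3.0], [-1.0, 1.0], [1.0, -1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# One rank-one matrix per singular value: s_i * u_i * v_i^T.
terms = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]
A_sum = sum(terms)

print([np.linalg.matrix_rank(T) for T in terms])  # [1, 1]
print(np.allclose(A_sum, A))                      # True
```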

**Property 5**

$\operatorname{range}(A)=\left\langle u_1, \ldots, u_r\right\rangle$ and $\operatorname{null}(A)=\left\langle v_{r+1}, \ldots, v_n\right\rangle$, where $\left\langle u_1, \ldots, u_r\right\rangle = \mathrm{span}\{u_1, \ldots, u_r\}$.


Proof. This is a consequence of the fact that $\operatorname{range}(S)=\left\langle e_1, \ldots, e_r\right\rangle \subseteq R^m$ and $\operatorname{null}(S)=\left\langle e_{r+1}, \ldots, e_n\right\rangle \subseteq R^n$.
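
A minimal sketch of Property 5 in NumPy, using a hypothetical rank-2 matrix whose null space is spanned by $[1, 1, -1]^T$. The tolerance for deciding which singular values are treated as zero is an assumption of this example.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])  # second row is twice the first, so rank 2
U, s, Vt = np.linalg.svd(A)

tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

range_basis = U[:, :r]    # u_1, ..., u_r as columns
null_basis = Vt[r:, :].T  # v_{r+1}, ..., v_n as columns

print(r)
print(np.allclose(A @ null_basis, 0.0))                  # null vectors are mapped to zero
print(np.allclose(range_basis @ range_basis.T @ A, A))   # columns of A lie in the range
```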

**Remark** (Computational application of SVD; Trefethen and Bau Numerical Linear Algebra p. 36)

- Once one can compute it, the SVD can be used as a tool for all kinds of problems. 
  - Rank: The best method for determining the rank of a matrix is to count the number of singular values greater than a judiciously chosen tolerance (Property 1). 
  - Range and Null space: The most accurate method for finding an orthonormal basis of a range or a nullspace is via Property 5. 
    - QR factorization provides alternative algorithms that are faster but not always as accurate.
  - Low rank approximation: Property 4 is a basis of low-rank approximations (this is the next topic). 
  - Besides these examples, the SVD is also an ingredient in robust algorithms for least squares fitting, intersection of subspaces, regularization, and numerous other problems.

### Applications of SVD

#### Low-rank approximation

Idea: Recall the rank-one sum expansion of $A$, with unit vectors $u_i$ and $v_i$,

$$
A=\sum_{i=1}^r s_i u_i v_i^T.
$$

The larger $s_j$ is, the more the $j$th term contributes to $A$. 
$\longrightarrow$ Truncate at an index $p$ beyond which the $s_j$ are small.


$$
A\approx \sum_{i=1}^p s_i u_i v_i^T = A_p,
$$


**Theorem** (Low rank approximation; Trefethen and Bau Numerical Linear Algebra p. 35)

For any $\nu$ with $0 \leq \nu \leq r$, define
$$
A_\nu=\sum_{j=1}^\nu \sigma_j u_j v_j^T
$$
if $\nu=p=\min \{m, n\}$, define $\sigma_{\nu+1}=0$. Then
$$
\left\|A-A_\nu\right\|_2=\inf _{\substack{B \in R^{m \times n} \\ \operatorname{rank}(B) \leq \nu}}\|A-B\|_2=\sigma_{\nu+1} .
$$

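In other words, $A_\nu$ is the best approximation of $A$ in the matrix 2-norm among all matrices of rank at most $\nu$. Here $\sigma_j$ denotes the singular value written $s_j$ elsewhere in these notes.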

**Remark**

- We do not discuss the matrix 2-norm in detail here. The point is that the low-rank approximation is the best approximation with respect to a norm that measures the distance between two matrices.
- The matrix 2-norm, also called the spectral norm, equals the largest singular value of the matrix.
  - This measures the maximum stretch that a matrix multiplication results in.
  - $\Vert A \Vert_2:=\max_{\Vert x \Vert_2=1} \Vert Ax \Vert_2$.
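
The statement of the theorem can be observed numerically. The sketch below uses a random 6-by-4 matrix (an arbitrary choice) and the helper name `truncate`, which is introduced only for this illustration. With 0-based indexing, `s[nu]` plays the role of $\sigma_{\nu+1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(nu):
    """Return A_nu, the sum of the first nu rank-one terms."""
    return (U[:, :nu] * s[:nu]) @ Vt[:nu, :]

for nu in range(len(s)):
    err = np.linalg.norm(A - truncate(nu), 2)  # spectral norm of the error
    print(nu, np.isclose(err, s[nu]))          # error equals sigma_{nu+1}

# Any other matrix of rank at most 2 does no better than A_2.
B = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
print(np.linalg.norm(A - B, 2) >= s[2])
```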

#### Compression

One can use low rank approximation to compress data such as images.

1. Load an image as matrix of color scales.
2. Take SVD.
3. Take a low rank approximation.
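
The three steps can be sketched as follows. Since no image file accompanies these notes, the sketch substitutes a synthetic grayscale array for step 1; the array, the helper names `compress` and `stored_numbers`, and the storage count $p(m+n+1)$ (the entries of $p$ left vectors, $p$ right vectors, and $p$ singular values) are assumptions of this illustration.

```python
import numpy as np

# Step 1 (stand-in): a synthetic 64-by-48 grayscale "image".
rows = np.linspace(0.0, 1.0, 64)[:, None]
cols = np.linspace(0.0, 1.0, 48)[None, :]
rng = np.random.default_rng(0)
image = (np.sin(6 * rows) * np.cos(4 * cols) + rows * cols
         + 0.01 * rng.standard_normal((64, 48)))

# Step 2: reduced SVD.
U, s, Vt = np.linalg.svd(image, full_matrices=False)

def compress(p):
    """Step 3: keep only the p largest singular triplets."""
    return (U[:, :p] * s[:p]) @ Vt[:p, :]

def stored_numbers(p):
    """Numbers needed to store the rank-p factors instead of the full array."""
    m, n = image.shape
    return p * (m + n + 1)

for p in (1, 2, 5):
    rel_err = np.linalg.norm(image - compress(p), 2) / s[0]
    print(p, stored_numbers(p), image.size, rel_err)
```

For a color image, one option is to apply the same procedure to each color channel separately.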

| | |
|---|---|
| ![original rank (480)](https://www.mathworks.com/help/examples/matlab/win64/ImageCompressionWithLowRankSVDExample_01.png) |  ![original rank (288)](https://www.mathworks.com/help/examples/matlab/win64/ImageCompressionWithLowRankSVDExample_02.png) | 
| ![rank (48)](https://www.mathworks.com/help/examples/matlab/win64/ImageCompressionWithLowRankSVDExample_03.png) | ![rank (15)](https://www.mathworks.com/help/examples/matlab/win64/ImageCompressionWithLowRankSVDExample_04.png) |

Figure: MathWorks

#### Dimension reduction

##### Settings

- Abundance of data: $a_j\in R^m$ ($j=1,2,\cdots,n$) with $m\ll n$.
- Data are centered: $\frac 1 n \sum_{j=1}^n [a_j]_i = 0$ for $i=1,2,\cdots,m$.
  - Each component is mean zero across the data.
  - If not, subtract the average.

##### Want

- Find the $p$-dimensional subspace of $R^m$, spanned by $p$ orthonormal vectors, that minimizes the total squared error incurred by projecting the data $a_j$ onto it, among all such subspaces.
- Also, find the projected data onto that space.

![PCA](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/1280px-GaussianScatterPCA.svg.png)

![dimension reduction](https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/GUID-D7896DEA-1569-4F15-B83D-21BF41CB3511-web.png)

Figure: Wikipedia (top), ArcGIS Pro (bottom)

**Idea**: Low rank approximation suggests choosing the first $p$ terms of rank one expansion since 
$$
A \approx A_p=\sum_{i=1}^p s_i u_i v_i^T
$$



##### Summary


- The projection of a collection of vectors $a_1, \ldots, a_n$ to their best least squares $p$-dimensional subspace is precisely the best rank- $p$ approximation matrix $A_p$.
  - The space $\left\langle u_1, \ldots, u_p\right\rangle$ spanned by the left singular vectors $u_1, \ldots, u_p$ is the best approximating dimension-$p$ subspace to $a_1, \ldots, a_n$ in the sense of least squares
  - The orthogonal projections of the columns $a_j$ of $A$ into this space are the columns of $A_p$. (Exercise: Prove this.)
  - The vectors $u_i$ are often called the *principal components* of the data set.


**Example**

Find the best one-dimensional subspace fitting the data vectors $[-4,-4.5],[0.8,1.9]$, $[2.6,-0.7],[0.6,3.3]$.


1. Put
$$
A=\left[\begin{array}{rrrr}
-4 & 0.8 & 2.6 & 0.6 \\
-4.5 & 1.9 & -0.7 & 3.3
\end{array}\right]
$$
2. Find its reduced SVD
$$
U S V^T=\left[\begin{array}{rr}
0.6 & -0.8 \\
0.8 & 0.6
\end{array}\right]\left[\begin{array}{cc}
5 \sqrt{2} & 0 \\
0 & 3
\end{array}\right]\left[\begin{array}{cccc}
-0.6 \sqrt{2} & 0.2 \sqrt{2} & 0.1 \sqrt{2} & 0.3 \sqrt{2} \\
1 / 6 & 1 / 6 & -5 / 6 & 1 / 2
\end{array}\right] .
$$

3. The best one-dimensional subspace: $\mathrm{span}\{u_1=[0.6,0.8]^T\}$. 
4. Projected data onto this subspace: $s_1 u_1 v_1^T$, which is also equal to the following by zeroing $s_2$ from the reduced SVD:
$$
\begin{aligned}
& A_1=\left[\begin{array}{rr}
0.6 & -0.8 \\
0.8 & 0.6
\end{array}\right]\left[\begin{array}{cc}
5 \sqrt{2} & 0 \\
0 & 0
\end{array}\right]\left[\begin{array}{cccc}
-0.6 \sqrt{2} & 0.2 \sqrt{2} & 0.1 \sqrt{2} & 0.3 \sqrt{2} \\
1 / 6 & 1 / 6 & -5 / 6 & 1 / 2
\end{array}\right] \\
& =\left[\begin{array}{llll}
-3.6 & 1.2 & 0.6 & 1.8 \\
-4.8 & 1.6 & 0.8 & 2.4
\end{array}\right]
\end{aligned}
$$
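
The example can be reproduced with NumPy. The computed singular vectors may differ from the ones above by a sign, so the sketch compares sign-independent quantities: the singular values, the rank-one term $A_1$, and the orthogonal projection $u_1 u_1^T A$. Variable names are illustrative.

```python
import numpy as np

A = np.array([[-4.0, 0.8, 2.6, 0.6],
              [-4.5, 1.9, -0.7, 3.3]])
print(A.mean(axis=1))  # each row has mean zero (up to round-off), so the data are centered

U, s, Vt = np.linalg.svd(A, full_matrices=False)
u1 = U[:, 0]                        # first principal component, +/- [0.6, 0.8]
A1 = s[0] * np.outer(u1, Vt[0, :])  # best rank-one approximation
P = np.outer(u1, u1)                # orthogonal projector onto span{u1}

print(s)                        # approximately [5*sqrt(2), 3]
print(np.round(A1, 10))
print(np.allclose(P @ A, A1))   # projecting the columns of A gives the columns of A1
```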


### Penrose Pseudoinverse

**Definition** (Pseudoinverse; Kincaid and Cheney (2002) p. 290)

For an $m$-by-$n$ matrix
$$
S=\left[\begin{array}{ccccccc}
\sigma_1 & \\
 & \sigma_2 & \\
 & & \ddots \\
 & & & \sigma_r & \\
 & & & & 0 & \\
 & & & & & \ddots & \\
 & & & & & & 0 
\end{array}
\right],
$$

where $\sigma_i > 0$ for $i=1,2,\cdots, r$, its *pseudoinverse* is defined to be the $n$-by-$m$ matrix

$$
S^+=\left[\begin{array}{ccccccc}
\sigma_1^{-1} & \\
 & \sigma_2^{-1} & \\
 & & \ddots \\
 & & & \sigma_r^{-1} & \\
 & & & & 0 & \\
 & & & & & \ddots & \\
 & & & & & & 0 
\end{array}
\right].
$$

For a general $m$-by-$n$ matrix with an SVD 

$$A=USV^T,$$ 

its pseudoinverse is defined by

$$A^+=VS^+ U^T.$$


**Remark** (Kincaid and Cheney (2002) p. 290)

- Given a matrix, pseudoinverse is unique while SVD is not.

**Example** (Kincaid and Cheney (2002) p. 291)

Find the pseudoinverse of the following matrix $A$ with its SVD:
$$
A=\left[\begin{array}{rrr}
0 & -1.6 & 0.6 \\
0 & 1.2 & 0.8 \\
0 & 0 & 0 \\
0 & 0 & 0
\end{array}\right]=\left[\begin{array}{rrrr}
0.6 & 0.8 & 0 & 0 \\
0.8 & -0.6 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{array}\right]\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0
\end{array}\right]\left[\begin{array}{rrr}
0 & 0 & 1 \\
0 & -1 & 0 \\
1 & 0 & 0
\end{array}\right]
$$

$$
\begin{aligned}
A^{+} & =\left[\begin{array}{rrr}
0 & 0 & 1 \\
0 & -1 & 0 \\
1 & 0 & 0
\end{array}\right]\left[\begin{array}{rrrr}
1 & 0 & 0 & 0 \\
0 & 0.5 & 0 & 0 \\
0 & 0 & 0 & 0
\end{array}\right]\left[\begin{array}{rrrr}
0.6 & 0.8 & 0 & 0 \\
0.8 & -0.6 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{array}\right] \\
& =\left[\begin{array}{rrrr}
0 & 0 & 0 & 0 \\
-0.4 & 0.3 & 0 & 0 \\
0.6 & 0.8 & 0 & 0
\end{array}\right]
\end{aligned}
$$
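
The definition translates directly into code. The sketch below builds $A^+ = V S^+ U^T$ from a reduced SVD and compares it with NumPy's `numpy.linalg.pinv`. The function name `pinv_from_svd` and the tolerance used to decide which singular values are treated as zero are assumptions of this illustration.

```python
import numpy as np

def pinv_from_svd(A, tol=None):
    """Pseudoinverse via A^+ = V S^+ U^T, inverting only the nonzero singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    if tol is None:
        tol = max(A.shape) * np.finfo(float).eps * s[0]
    s_plus = np.zeros_like(s)
    nonzero = s > tol
    s_plus[nonzero] = 1.0 / s[nonzero]
    return (Vt.T * s_plus) @ U.T  # scales the columns of V by the entries of S^+

A = np.array([[0.0, -1.6, 0.6],
              [0.0, 1.2, 0.8],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
A_plus = pinv_from_svd(A)
print(np.round(A_plus, 10))
print(np.allclose(A_plus, np.linalg.pinv(A)))
```

NumPy orders the singular values decreasingly (2, then 1), unlike the factorization displayed above, yet the resulting pseudoinverse is the same matrix.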

#### Inconsistent and Underdetermined Systems


The pseudoinverse gives a unified way to solve least-squares-type problems for both overdetermined and underdetermined systems of linear equations. 

**Definition** (Minimal solution 1; Kincaid and Cheney (2002) p. 291)

Consider a linear system

$$
A x=b
$$

where $A$ is $m \times n$, $x$ is $n \times 1$, and $b$ is $m \times 1$.

The *minimal solution* of this problem is defined as follows:
1. If the system is consistent and has a unique solution, $x$, then the minimal solution is defined to be $x$.
2. If the system is consistent and has a set of solutions, then the minimal solution is the element of this set having the least Euclidean norm.
3. If the system is inconsistent and has a unique least-squares solution, $x$, then the minimal solution is defined to be $x$.
4. If the system is inconsistent and has a set of least-squares solutions, then the minimal solution is the element of this set having the least Euclidean norm.


**Definition** (Minimal solution 2; Kincaid and Cheney (2002) p. 291)

The following definition is equivalent to the previous one.

Let
$$
\rho=\min \left\{\|A x-b\|_2: x \in R^n\right\}
$$

Then the minimal solution of equation $A x=b$ is the element of least norm in the set $K=\left\{x:\|A x-b\|_2=\rho\right\}$. 


**Remark** (Kincaid and Cheney (2002) p. 291)

- The second definition encompasses all four cases described earlier. 
  - For example, if $\rho=0$, we have Cases 1 and 2, whereas Cases 3 and 4 correspond to $\rho>0$.


**Theorem** (Pseudoinverse Minimal Solution; Kincaid and Cheney (2002) p. 291)

The minimal solution of the equation $A x=b$ is given by the pseudoinverse

$$
x=A^{+} b
$$

Proof 

Let a singular-value decomposition of $A$ be $A=U S V^T$. Let
$$
c=U^T b \quad \text { and } \quad y=V^T x
$$

As $x$ runs over $R^n$, so does $y$ because $V^T$ is surjective; that is, it maps $R^n$ onto $R^n$. Therefore,
$$
\begin{aligned}
\rho & =\inf _x\|A x-b\|_2=\inf _x\|U S V^T x-b\|_2=\inf _x\left\|U^T(U S V^T x-b)\right\|_2 \\
& =\inf _x\left\|S V^T x-U^T b\right\|_2=\inf _y\|S y-c\|_2
\end{aligned}
$$

From the special nature of the matrix $S$, we now have
$$
\|S y-c\|_2^2=\sum_{i=1}^r\left(\sigma_i y_i-c_i\right)^2+\sum_{i=r+1}^m c_i^2,
$$

where $r$ is the index of the smallest nonzero singular value. This quantity is minimized by letting $y_i=c_i / \sigma_i$ for $1 \leq i \leq r$ and by permitting $y_{r+1}, y_{r+2}, \ldots, y_n$ to be arbitrary. Thus, we have
$$
\rho=\left(\sum_{i=r+1}^m c_i^2\right)^{1 / 2}
$$

Among all the $y$-vectors that yield this minimum value $\rho$, the vector of least norm has $y_{r+1}=y_{r+2}=\cdots=y_n=0$. This vector is given by
$$
y=S^{+} c
$$

Since $V$ preserves the 2-norm, the minimality of 2-norm of $y$ carries over to $x$. Therefore, the minimal solution of our problem is, 
$$
x=V y=V S^{+} c=V S^{+} U^T b=A^{+} b
$$

**Remark** (Kincaid and Cheney (2002) p. 292)

- The pseudoinverse plays the same role for inconsistent or underdetermined systems as the inverse does for invertible systems. 
- The minimal solution of any equation $Ax = b$ is unique.
  - The set $K=\left\{x:\|A x-b\|_2=\rho\right\}$ is convex and has a unique element of least norm. (HW problem)
  - This is not trivial because, while the pseudoinverse is determined by SVD, SVD itself is not unique.

**Theorem** (Uniqueness of pseudoinverse; Kincaid and Cheney (2002) p. 293)

The pseudoinverse of a matrix has the four Penrose properties. Hence, each matrix has a unique pseudoinverse.


Proof: See the Appendix

**Example** (Kincaid and Cheney (2002) p. 292)

Find the minimal solution of the system
$$
\left\{\begin{array}{l}
0 x-1.6 y+0.6 z=5 \\
0 x+1.2 y+0.8 z=7 \\
0 x+0 y+0 z=3 \\
0 x+0 y+0 z=-2
\end{array}\right.
$$


Note that the coefficient matrix is the same as in the previous example.

$$
A^{+} b=\left[\begin{array}{rrrr}
0 & 0 & 0 & 0 \\
-0.4 & 0.3 & 0 & 0 \\
0.6 & 0.8 & 0 & 0
\end{array}\right]\left[\begin{array}{r}
5 \\
7 \\
3 \\
-2
\end{array}\right]=\left[\begin{array}{l}
0.0 \\
0.1 \\
8.6
\end{array}\right]
$$
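
This system is both inconsistent (the last two equations cannot hold) and underdetermined ($x$ is free), so it falls under Case 4. The sketch below checks the minimal solution with NumPy. Here `numpy.linalg.lstsq` is used as an independent check because it returns the least-squares solution of smallest 2-norm; the vector `x_other` is a hypothetical alternative least-squares solution added for comparison.

```python
import numpy as np

A = np.array([[0.0, -1.6, 0.6],
              [0.0, 1.2, 0.8],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
b = np.array([5.0, 7.0, 3.0, -2.0])

x_min = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Adding a null-space vector keeps the residual but increases the norm.
x_other = x_min + np.array([1.0, 0.0, 0.0])

print(np.round(x_min, 10))                                                   # [0, 0.1, 8.6]
print(np.linalg.norm(A @ x_min - b), np.linalg.norm(A @ x_other - b))        # both sqrt(13)
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))                       # True
```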

### Appendix

#### More comprehensive citations

##### Spectral theorem

**Theorem** (Horn and Johnson (2013) Matrix analysis 2ed. Theorem 4.1.5. p. 229) 

A matrix $A \in M_n$ is Hermitian if and only if there is a unitary $U \in M_n$ and a real diagonal $\Lambda \in M_n$ such that $A=U \Lambda U^*$, where $M_n$ is the set of $n$-by- $n$ complex matrices. Moreover, $A$ is real and Hermitian (that is, real symmetric) if and only if there is a real orthogonal $P \in M_n$ and a real diagonal $\Lambda \in M_n$ such that $A=P \Lambda P^T$.

**Remark**

- Observe the subtlety of the statement: If $A$ is symmetric as a complex matrix, then the conclusion is different. (See e.g., [Wikipedia - Complex symmetric matrices](https://en.wikipedia.org/wiki/Symmetric_matrix#Complex_symmetric_matrices))

##### Uniqueness of pseudoinverse

The discussions on the uniqueness of pseudoinverse are borrowed from Kincaid and Cheney (2002) pp. 293--296, including remarks.


**Remark** (Kincaid and Cheney (2002) p. 293)

- The pseudoinverse has some (but not all) of the properties of an inverse. 
  - For example, we cannot expect $A^{+} A=I$ to be true if $n>m$, because the ranks of $A^{+}, A$, and $A^{+} A$ are at most $m$, whereas $I$ is $n \times n$. 
  - However, equations such as $A A^{+} A=A$ are true for arbitrary $A$. 


**Theorem** (Penrose properties; R. Penrose [1955], according to Kincaid and Cheney (2002) p. 293)

Corresponding to any matrix $A$, there exists at most one matrix $X$ having these four properties:
1. $A X A=A$
2. $X A X=X$
3. $(A X)^*=A X$
4. $(X A)^*=X A$


Proof: (Kincaid and Cheney (2002) p. 293)

Let $X$ and $Y$ be two matrices having Properties $1-4$. Then by systematic use of these properties as indicated, we have
$$
\begin{aligned}
X & =X A X       & \text{(property 2)}\\
& =X A Y A X       & \text{(property 1)} \\
& =X A Y A Y A Y A X       & \text{(property 1)} \\
& =(X A)^*(Y A)^* Y(A Y)^*(A X)^*       & \text{(property 4, 3)} \\
& =A^* X^* A^* Y^* Y Y^* A^* X^* A^* & \\
& =(A X A)^* Y^* Y Y^*(A X A)^* & \\
& =A^* Y^* Y Y^* A^*       & \text{(property 1)}\\
& =(Y A)^* Y(A Y)^*  & \\
& =Y A Y A Y         & \text{(property 4, 3)}\\
& =Y A Y         & \text{(property 2)}\\
& =Y          & \text{(property 2)}\\
\end{aligned}
$$
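
The four properties are easy to test numerically for a candidate $X$. In the sketch below, the function name `satisfies_penrose` and the tolerance `atol` are assumptions of this illustration; the matrix is the one from the pseudoinverse example. The pseudoinverse passes all four tests, while a different candidate such as $A^T$ does not.

```python
import numpy as np

def satisfies_penrose(A, X, atol=1e-10):
    """Check the four Penrose properties for the pair (A, X) up to round-off."""
    AX, XA = A @ X, X @ A
    return bool(
        np.allclose(AX @ A, A, atol=atol)            # 1. A X A = A
        and np.allclose(XA @ X, X, atol=atol)        # 2. X A X = X
        and np.allclose(AX.conj().T, AX, atol=atol)  # 3. (A X)^* = A X
        and np.allclose(XA.conj().T, XA, atol=atol)  # 4. (X A)^* = X A
    )

A = np.array([[0.0, -1.6, 0.6],
              [0.0, 1.2, 0.8],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

print(satisfies_penrose(A, np.linalg.pinv(A)))  # True
print(satisfies_penrose(A, A.T))                # False
```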


**Theorem** (Uniqueness of pseudoinverse; Kincaid and Cheney (2002) p. 293)

The pseudoinverse of a matrix has the four Penrose properties. Hence, each matrix has a unique pseudoinverse.


Proof: (Kincaid and Cheney (2002) p. 293)

We address only property 1.

Let $A$ be any matrix, and let its singular-value decomposition be
$$
A=P D Q
$$

Then
$$
A^{+}=Q^* D^{+} P^*
$$

If $A$ is $m \times n$, then so is $D$, and $D$ has the form
$$
D_{i j}= \begin{cases}\sigma_i & \text { if } i=j \leq r \\ 0 & \text { otherwise }\end{cases}
$$

From this we can prove that
$$
D D^{+} D=D
$$

To do so, we write
$$
\left(D D^{+} D\right)_{i j}=\sum_{\nu=1}^n D_{i \nu} \sum_{\mu=1}^m D_{\nu \mu}^{+} D_{\mu j}
$$

The right-hand side will be 0 unless $i \leq r$ and $j \leq r$ because of the presence of the terms $D_{i \nu}$ and $D_{\mu j}$. Thus, we assume $i \leq r$ and $j \leq r$ and continue, simplifying the right-hand side to
$$
\sum_{\nu=1}^r D_{i \nu} \sum_{\mu=1}^r D_{\nu \mu}^{+} D_{\mu j}=\sigma_i \sum_{\mu=1}^r D_{i \mu}^{+} D_{\mu j}=\sigma_i \sigma_i^{-1} D_{i j}=D_{i j}
$$

By similar reasoning, we prove that $D^{+}$ has the remaining three Penrose properties relative to $D$. Then it is a simple matter to prove these four properties for $A^{+}$. For example, the first property is proved as follows:
$$
\begin{aligned}
A A^{+} A & =P D Q Q^* D^{+} P^* P D Q \\
& =P D D^{+} D Q \\
& =P D Q=A
\end{aligned}
$$

#### Additional examples

**Example** (Visualization of SVD)

- $x=\left[\begin{array}{cccc}-10 & -10 & 20 & 20 \\ -10 & 20 & 20 & -10\end{array}\right]$ 
- $A=\left[\begin{array}{cc}1 & 0.3 \\ 0.45 & 1.2\end{array}\right]$.
- $A=USV^T = \begin{bmatrix} -0.5819 & -0.8133 \\ -0.8133 & 0.5819 \end{bmatrix} \begin{bmatrix} 1.4907 & 0 \\ 0 & 0.7144 \end{bmatrix} \begin{bmatrix} -0.6359 & -0.7718 \\ -0.7718 & 0.6359 \end{bmatrix}$.

Example and figures: Alyssa Quek ([SVD visualization](https://alyssaq.github.io/2015/singular-value-decomposition-visualisation/))

- In the main text, this example is replaced by the picture of a circle transformed into an ellipse.
  - That choice favors simplicity and focuses on the intuition.
  - This example is kept in the appendix because it lets you track the numerical values through each step of the transformation.

| | |
|---|---|
| $$Ax$$ <br> ![Figure 1](https://alyssaq.github.io/blog/images/eigens-transformation_matrix.png) | | 
| $$V^Tx$$ <br> ![Figure 2](https://alyssaq.github.io/blog/images/svd_Vx.png) | $$SV^Tx$$ <br> ![Figure 3](https://alyssaq.github.io/blog/images/svd_SVx.png) | 
| $$USV^Tx$$ <br> ![Figure 4](https://alyssaq.github.io/blog/images/svd_USVx.png) | |
