# Linear Algebra Bootcamp

This post provides a quick overview of linear algebra, focusing on the
mathematical ideas and definitions that are foundational for the rest 
of this series of posts. The post follows the book by Darve, E., & Wootters, M. (2021),
_Numerical Linear Algebra with Julia_. However, it will use [JAX](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html) to implement essential algorithms. 

The book by Darve and Wootters is very good but expensive. I recommend [buying
it](https://amzn.eu/d/8o3qbUI) if you are rich. Otherwise, just read my post. 

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/darve-wootters.webp" alt="darve-wootters" width="30%">

Many key concepts are omitted from this post, as it assumes 
readers have some basic knowledge of linear algebra. For instance,
you should know the definitions of _linearly independent_, _basis_, and
_dot product_. 

The angle $\theta$ between two vectors $x, y \in \mathbb{R}^n$ satisfies

$$x^T y = ||x||_2 ||y||_2 \cos \theta $$

or, equivalently, for nonzero $x$ and $y$,

$$\cos \theta = \frac{x^T y }{||x||_2 ||y||_2}$$
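The angle formula can be checked numerically. The snippet below is a minimal JAX sketch; the two vectors are illustrative choices (not from the book), picked so that the angle is 45 degrees.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

# Illustrative vectors (an assumption for this example).
x = jnp.array([1.0, 0.0])
y = jnp.array([1.0, 1.0])

cos_theta = jnp.dot(x, y) / (jnp.linalg.norm(x) * jnp.linalg.norm(y))
# Clip to [-1, 1] so rounding error cannot push arccos out of its domain.
theta = jnp.arccos(jnp.clip(cos_theta, -1.0, 1.0))

print(jnp.degrees(theta))  # close to 45 for these two vectors
```

The clipping step is a defensive choice: for nearly parallel vectors the computed cosine can land slightly outside $[-1, 1]$.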

A set of vectors $x_1, \cdots, x_k$ is __orthogonal__ if the vectors are
pairwise orthogonal, that is, if $x_i^T x_j = 0$ for all $i \neq j$. For 
instance, if $x$ and $y$ are orthogonal, then 

$$||x+y||_2^2 = ||x||_2^2 + ||y||_2^2$$

For the subspace $S$ of $\mathbb{R}^n$, we define the __orthogonal complement__
of $S$, denoted $S^{\perp}$, to be the set of vectors in $\mathbb{R}^n$ which
are orthogonal to $S$.

This means we have 

$$\text{dim}(S) + \text{dim}(S^{\perp}) = n $$

We can use the __Cauchy-Schwarz__ inequality (and the more general __Hölder's inequality__)
to bound a dot product of vectors by the norms of the vectors. 

Hölder's inequality says that for any vectors $x, y$ and positive numbers
$p, q$ so that $1/p + 1/q = 1$, 

$$|x^Ty| \leq ||x||_p ||y||_q$$

Specializing to $p = q = 2$, we get the __Cauchy-Schwarz inequality__: for
any vectors $x$ and $y$, 

$$|x^Ty| \leq ||x||_2 ||y||_2 $$
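Both inequalities are easy to sanity-check on random vectors. This is a small JAX sketch, not a proof; the helper `p_norm` and the choice $p = 3$, $q = 3/2$ are assumptions made for the example.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

def p_norm(v, p):
    """Vector p-norm, (sum_i |v_i|^p)^(1/p), for finite p >= 1 (hypothetical helper)."""
    return jnp.sum(jnp.abs(v) ** p) ** (1.0 / p)

kx, ky = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (6,))
y = jax.random.normal(ky, (6,))

lhs = jnp.abs(jnp.dot(x, y))
cauchy_schwarz = p_norm(x, 2.0) * p_norm(y, 2.0)  # p = q = 2
holder = p_norm(x, 3.0) * p_norm(y, 1.5)          # 1/3 + 1/1.5 = 1

print(lhs <= cauchy_schwarz, lhs <= holder)  # both bounds should hold
```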

For a complex matrix $A$, we define the __conjugate transpose__ of $A$, denoted 
$A^H$, to be the matrix that we get when we take the entry-wise complex conjugate
of $A^T$. That is, the $(i, j)$ entry of $A^H$ is $\overline{a_{ji}}$.

Suppose we want to calculate the conjugate transpose of the 
following matrix $\boldsymbol{A}$.

$$
\boldsymbol{A}=\left[\begin{array}{ccc}
1 & -2-i & 5 \\
1+i & i & 4-2 i
\end{array}\right]
$$
We first transpose the matrix:
$$
\boldsymbol{A}^{\top}=\left[\begin{array}{cc}
1 & 1+i \\
-2-i & i \\
5 & 4-2 i
\end{array}\right]
$$
Then we conjugate every entry of the matrix:
$$
\boldsymbol{A}^{\mathrm{H}}=\left[\begin{array}{cc}
1 & 1-i \\
-2+i & -i \\
5 & 4+2 i
\end{array}\right]
$$
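The same computation in JAX, as a minimal sketch. The array `expected` simply transcribes the matrix $\boldsymbol{A}^{\mathrm{H}}$ shown above; the final check that $A A^H$ is Hermitian is an extra illustration, not part of the worked example.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64/complex128 instead of the 32-bit defaults

A = jnp.array([[1, -2 - 1j, 5],
               [1 + 1j, 1j, 4 - 2j]])

A_H = A.conj().T  # conjugate every entry, then transpose

expected = jnp.array([[1, 1 - 1j],
                      [-2 + 1j, -1j],
                      [5, 4 + 2j]])
print(jnp.allclose(A_H, expected))  # compares with the matrix written out above

# A @ A^H is always Hermitian: it equals its own conjugate transpose.
G = A @ A_H
print(jnp.allclose(G, G.conj().T))
```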

A __symmetric__ matrix is such that 

$$A^T = A$$

A __skew-symmetric matrix__ is such that 

$$A^T = - A$$

A __Hermitian__ matrix is such that 

$$A^H = A$$

A __skew-Hermitian__ matrix has 

$$A^H = -A$$

## Algebra of matrices

$$(AB)C = A(BC)$$

$$(AB)^T = B^T A^T$$

$$(A \pm B)^T = A^T \pm B^T$$

$$A^{-1} A = AA^{-1} = I $$

If $A$ is square and not invertible, we say that it is __singular__. 

## Sherman-Morrison-Woodbury formula

We will learn different ways to calculate matrix inverse later, but here is
one useful tool that you can think about now. 

If $A \in \mathbb{R}^{n \times n}$, and $U, V \in \mathbb{R}^{n \times k}$:

$$(A + UV^T)^{-1} = A^{-1} - A^{-1} U(I + V^TA^{-1}U)^{-1} V^T A^{-1}$$

where we assume that $A$ and $I + V^T A^{-1} U$ are non-singular.
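A quick numerical check of the formula, written as a hedged JAX sketch. The sizes `n`, `k` and the random test matrices are assumptions chosen so that both $A$ and $I + V^T A^{-1} U$ are comfortably invertible.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

n, k = 5, 2
kA, kU, kV = jax.random.split(jax.random.PRNGKey(0), 3)
A = 4.0 * jnp.eye(n) + 0.1 * jax.random.normal(kA, (n, n))  # diagonally dominated, invertible
U = 0.3 * jax.random.normal(kU, (n, k))
V = 0.3 * jax.random.normal(kV, (n, k))

A_inv = jnp.linalg.inv(A)
small = jnp.eye(k) + V.T @ A_inv @ U  # only a k x k system has to be solved
woodbury = A_inv - A_inv @ U @ jnp.linalg.solve(small, V.T @ A_inv)

direct = jnp.linalg.inv(A + U @ V.T)
print(jnp.max(jnp.abs(woodbury - direct)))  # should be at rounding-error level
```

The point of the formula is visible in the code: once $A^{-1}$ is known, updating it for a rank-$k$ change requires solving only a $k \times k$ system.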

## Matrix norms 

The Frobenius norm of $A$ is defined by
$$
\|A\|_F=\left(\sum_{i j} |a_{i j}|^2\right)^{1 / 2}=\sqrt{\operatorname{tr}\left(A A^H\right)},
$$
where $A^H$ is the conjugate transpose of $A$.

The trace of a square matrix $A$ is defined as $\text{tr}(A) = \sum_i a_{ii}$.

For any invertible matrix $P$,

$$\text{tr}(A) = \text{tr}(P^{-1}AP)$$

We also have 

$$\text{tr}(AB) = \text{tr}(BA)$$

$$\text{tr}(A(BC)) = \text{tr}((BC)A)$$

$$\text{tr}(A + \alpha B) = \text{tr}(A) + \alpha \ \text{tr}(B)$$
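These identities can be verified on a small example. The matrices below are arbitrary illustrative choices; `P` is just some invertible matrix.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64/complex128 instead of the 32-bit defaults

A = jnp.array([[1 + 2j, 0.5], [-1j, 3.0]])
B = jnp.array([[2.0, 1.0], [0.0, 1.0]])
P = jnp.array([[1.0, 2.0], [3.0, 4.0]])  # invertible: det = -2

# Three ways to get the Frobenius norm.
fro_entries = jnp.sqrt(jnp.sum(jnp.abs(A) ** 2))
fro_trace = jnp.sqrt(jnp.trace(A @ A.conj().T).real)
fro_builtin = jnp.linalg.norm(A, "fro")
print(fro_entries, fro_trace, fro_builtin)

# Trace identities: cyclic property and invariance under similarity.
tr_AB, tr_BA = jnp.trace(A @ B), jnp.trace(B @ A)
tr_similar = jnp.trace(jnp.linalg.inv(P) @ A @ P)
print(jnp.allclose(tr_AB, tr_BA), jnp.allclose(tr_similar, jnp.trace(A)))
```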

## The determinant of a matrix 

The determinant of matrix $A$ is equal to
$$
\operatorname{det}(A)=\sum_{\sigma \in S_n} \operatorname{sign}(\sigma) \prod_{i=1}^n a_{i, \sigma_i},
$$
where $S_n$ is the set of all permutations of $\{1, \ldots, n\}$, and $\operatorname{sign}(\sigma)$ is the signature of the permutation $\sigma$. Here, the signature of a permutation $\sigma$ is $+1$ if $\sigma$ can be realized by doing an even number of pairwise swaps, and it is $-1$ otherwise; it turns out (though it's not obvious!) that this is well-defined.

The determinant satisfies the following three fundamental properties:
1. $\operatorname{det}\left(I_n\right)=1$.

2. The determinant is an $n$-linear function. That is, if we "fix" all the columns 
of $A$ except column $i, \operatorname{det}(A)$ is a linear function of $a_i$.

3. The determinant is an alternating form. That is, when two columns of $A$ are identical, then $\operatorname{det}(A)=0$.

In fact, these three properties uniquely define the determinant 
(that is, the determinant is the only function satisfying these three properties).

The determinant also has the following useful properties:

1. For any square matrix $A$ and scalar $\alpha \in \mathbb{R}$, $\operatorname{det}(\alpha A)=\alpha^n \operatorname{det}(A)$.
2. For any square matrices $A$ and $B, \operatorname{det}(A B)=\operatorname{det}(A) \operatorname{det}(B)$.
3. A square matrix $A$ is singular (that is, not invertible) if and only if $\operatorname{det}(A)=0$.
4. $\operatorname{det}(A)=\operatorname{det}\left(A^T\right)$.
5. $\operatorname{det}\left(A^{-1}\right)=\operatorname{det}(A)^{-1}$.

For any square matrix $A$:

$$\operatorname{det}(\text{exp}A) = \text{exp}(\text{tr} A)$$
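The properties above can be spot-checked numerically. This is a sketch with two arbitrary $3 \times 3$ matrices (illustrative values, not from the book); it uses `jax.scipy.linalg.expm` for the matrix exponential.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import expm

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

A = jnp.array([[0.5, 1.0, 0.0],
               [-0.2, 0.3, 0.4],
               [0.1, 0.0, -0.6]])  # det(A) = -0.17, so A is invertible
B = jnp.array([[1.0, 2.0, 0.0],
               [0.0, 1.0, 1.0],
               [1.0, 0.0, 2.0]])
n, alpha = A.shape[0], 2.0
det = jnp.linalg.det

checks = {
    "det(alpha A) = alpha^n det(A)": jnp.allclose(det(alpha * A), alpha**n * det(A)),
    "det(AB) = det(A) det(B)": jnp.allclose(det(A @ B), det(A) * det(B)),
    "det(A) = det(A^T)": jnp.allclose(det(A), det(A.T)),
    "det(A^-1) = 1 / det(A)": jnp.allclose(det(jnp.linalg.inv(A)), 1.0 / det(A)),
    "det(exp A) = exp(tr A)": jnp.allclose(det(expm(A)), jnp.exp(jnp.trace(A))),
}
for name, ok in checks.items():
    print(name, bool(ok))
```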

## Orthogonal matrices and projectors 

An orthogonal matrix $Q \in \mathbb{R}^{m \times n}$ is a matrix whose columns are orthonormal. That is, the columns $q_1, \ldots, q_n$ of $Q$ satisfy
$$
q_i^T q_j=\delta_{i j} .
$$
Equivalently,
$$
Q^T Q=I
$$

A __unitary matrix__ $U \in \mathbb{C}^{n \times n}$ is a matrix so that

$$U^HU = UU^H = I $$

Multiplication by an orthogonal matrix $Q$ preserves 2-norms:

$$||Qx||_2^2 = x^T Q^T Q x = x^Tx = ||x||_2^2 $$

__Square orthogonal__ matrices are always invertible by definition, and 

$$Q^{-1} = Q^{T}$$

As a result, for square matrices, we have 

$$\text{det}(Q) = \pm 1$$


Note that if $Q$ is square orthogonal, then 

$$Q^TQ = QQ^T = I$$

If $Q$ is rectangular orthogonal, we only have 

$$Q^TQ = I$$
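One convenient way to obtain a matrix with orthonormal columns is the QR factorization of a random matrix. The sketch below uses that trick (an assumption of this example, not something defined in the text) to illustrate the square and rectangular cases.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

k_rect, k_sq = jax.random.split(jax.random.PRNGKey(0))

# Rectangular case: Q is 5 x 3 with orthonormal columns.
Q, _ = jnp.linalg.qr(jax.random.normal(k_rect, (5, 3)))
x = jnp.array([1.0, -2.0, 0.5])

print(jnp.allclose(Q.T @ Q, jnp.eye(3)))   # Q^T Q = I holds
print(jnp.allclose(Q @ Q.T, jnp.eye(5)))   # Q Q^T = I fails: Q Q^T has rank 3
print(jnp.linalg.norm(Q @ x), jnp.linalg.norm(x))  # the 2-norm is preserved

# Square case: Q^{-1} = Q^T and det(Q) = +1 or -1.
Q_sq, _ = jnp.linalg.qr(jax.random.normal(k_sq, (4, 4)))
print(jnp.allclose(jnp.linalg.inv(Q_sq), Q_sq.T), jnp.abs(jnp.linalg.det(Q_sq)))
```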

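A projector $P$ is a square matrix satisfying $P^2 = P$. For example, in
least-squares regression, if the vector of response values is denoted by $y$ and the vector of 
fitted values by $\hat{y}$, then

$$\hat{y} = Py$$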

The vector of residuals $r$ can also be expressed compactly using the 
__projection matrix__:

$$r = y-\hat{y} = y - Py = (I-P)y$$
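As an illustration, here is a hedged JAX sketch for a least-squares fit. It assumes that $P$ is the orthogonal projector onto the column space of a design matrix, built as $P = QQ^T$ from a reduced QR factorization; the names `X_design` and `y` and the random data are hypothetical.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

# Hypothetical data: 6 observations, 2 predictors.
kX, ky = jax.random.split(jax.random.PRNGKey(0))
X_design = jax.random.normal(kX, (6, 2))
y = jax.random.normal(ky, (6,))

Q, _ = jnp.linalg.qr(X_design)  # orthonormal basis for the column space
P = Q @ Q.T                     # orthogonal projector onto that column space

y_hat = P @ y                   # fitted values
r = (jnp.eye(6) - P) @ y        # residuals

print(jnp.allclose(P @ P, P))                          # projecting twice changes nothing
print(jnp.allclose(X_design.T @ r, 0.0, atol=1e-10))   # residual is orthogonal to the columns
print(jnp.allclose(y_hat + r, y))                      # y splits into fitted part + residual
```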

## Eigenvalues, eigenvectors, and related matrix decompositions 

In this section, we will introduce __eigenvalues__ and __eigenvectors__,
fundamental tools in linear algebra. Using these, we will define a number
of different __matrix decompositions__, that is, ways to decompose a 
matrix as a product of easy-to-understand matrices. Later you will realize
that matrix decompositions are an essential part of numerical linear
algebra. 

Easy-to-understand matrices include:

- __Diagonal matrices__, which have non-zero entries only on the diagonal. 
These are very easy to work with!

- __Orthogonal matrices__, which are easier to invert (take the transpose!).
As we will see, orthogonal matrices are also less prone to inaccuracies in 
computation. 

- __Triangular matrices__, which can be either __lower-triangular__ 
(with non-zero entries on or below the diagonal only) or __upper-triangular__
(non-zero entries on or above the diagonal). As we will see, it can be more 
computationally efficient to deal with triangular matrices than with 
general matrices. 

Let $A \in \mathbb{C}^{n \times n}$ be a __square__ matrix, and let 
$x \in \mathbb{C}^n$ be non-zero and $\lambda \in \mathbb{C}$.
We say that $x$ is an __eigenvector__ of $A$ with __eigenvalue__ $\lambda$
if 

$$Ax = \lambda x$$

The determinant and trace can be expressed in terms of the eigenvalues:

$$\text{det}(A) = \prod_{i=1}^n \lambda_i, \ \text{tr}(A) = \sum_{i=1}^n \lambda_i$$

Some matrices $A$ have $n$ linearly independent eigenvectors $x_1, \ldots, x_n$. If this is the case, we say that $A$ is diagonalizable. The reason for the name is that if $A$ has $n$ linearly independent eigenvectors, we can decompose it as follows. Say that $\lambda_1, \ldots, \lambda_n$ are the corresponding eigenvalues, and let $\Lambda$ be the diagonal matrix whose $i$ th diagonal entry is $\lambda_i$, and let $X$ be the matrix with the $x_i$ as columns. In matrix notation,

$$
A X=X \Lambda
$$
Since the columns of $X$ are linearly independent, $X$ is invertible, and we can write
$$
A=X \Lambda X^{-1}
$$
This is called the eigendecomposition of $A$.


Suppose that $A$ is diagonalizable, and let $x_1, \ldots, x_n$ be linearly independent eigenvectors with corresponding eigenvalues $\lambda_1, \ldots, \lambda_n$. Then
$$
A=X \Lambda X^{-1}
$$
where $\Lambda$ is a diagonal matrix with $\lambda_i$ on the diagonal, and $X$ is the matrix with the $x_i$ as columns.

If $A$ is a square matrix with the eigendecomposition $A=X \Lambda X^{-1}$,
then for any non-negative integer $k$ (and for negative $k$ as well when $A$ is invertible), we have 

$$A^k = X \Lambda^k X^{-1}$$
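The eigendecomposition and the power formula can be tried out in JAX. This is a minimal sketch with an illustrative $2 \times 2$ matrix whose eigenvalues are 2 and 5; note that `jnp.linalg.eig` returns complex arrays and, depending on the JAX version, may only be available on the CPU backend.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64/complex128 instead of the 32-bit defaults

A = jnp.array([[4.0, 1.0],
               [2.0, 3.0]])  # illustrative matrix, eigenvalues 2 and 5

lam, X = jnp.linalg.eig(A)   # columns of X are eigenvectors
X_inv = jnp.linalg.inv(X)

# A = X Lambda X^{-1}
print(jnp.allclose(X @ jnp.diag(lam) @ X_inv, A))

# A^k = X Lambda^k X^{-1}: only the diagonal entries are raised to the power k.
k = 5
A_k = X @ jnp.diag(lam**k) @ X_inv
print(jnp.allclose(A_k.real, jnp.linalg.matrix_power(A, k)))

# det(A) is the product and tr(A) the sum of the eigenvalues.
print(jnp.prod(lam).real, jnp.linalg.det(A), jnp.sum(lam).real, jnp.trace(A))
```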

Let $A \in M_n(\mathbb{F})$. The polynomial 

$$p_A(x) = \text{det} (A -xI)$$

is called the characteristic polynomial of $A$. The polynomial $p_A(x)$
has degree $n$. It has $n$ complex roots (possibly repeated), and each 
root is an eigenvalue of $A$.

If all the eigenvalues of $A$ are distinct, then indeed $A$ has $n$ linearly
independent eigenvectors and hence is diagonalizable. However, the situation 
gets more complicated if $p_A(x)$ has repeated roots. For example, if 
$p_A(x) = (x-1)^2(x-3)^4$, then we say that $p_A$ has a root at $x=1$ 
of _multiplicity_ $2$ and a root at $x=3$ of _multiplicity_ 4.

Let $A$ be an $n \times n$ matrix and let $\lambda$ be a root of $p_A$
with multiplicity $k$. Then we say that the __algebraic multiplicity__
of $\lambda$ as an eigenvalue of $A$ is $k$.

Let $A$ be an $n \times n$ matrix and let $\lambda$ be an eigenvalue of $A$.
The __eigenspace__ $E_{\lambda}$ is the set of all vectors $x$ which are
eigenvectors of $A$ with eigenvalue $\lambda$. The dimension of $E_{\lambda}$
is called the __geometric multiplicity__ of $\lambda$.

We say a matrix $A$ is __defective__ if, for one of its eigenvalues, 
the algebraic multiplicity strictly exceeds the geometric multiplicity. Thus,
"defective" is just another way of saying "not diagonalizable".

## Gershgorin disc theorem 

Although it is difficult in general to make simple statements about the 
eigenvalues of a matrix by looking at its entries, there is one major 
exception: the Gershgorin disc theorem. 

Please watch the following video to get a feel for it. 

<iframe width="600" height="325" src="https://www.youtube.com/embed/rla9Q4E6hVI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

Let $A \in \mathbb{C}^{n \times n}$. For $1 \leq i \leq n$, the __Gershgorin disc__
$D_i$ is the disc in the complex plane with center $a_{ii}$ and 
radius $r_i=\sum_{j \neq i}|a_{ij}|$:

$$D_i = \{ z \in \mathbb{C} : |z - a_{ii}| \leq \sum_{j \neq i} |a_{ij}| \}$$

With this definition, we can state the following theorem:

_Every eigenvalue of $A \in \mathbb{C}^{n \times n}$ lies in at least one of 
its Gershgorin discs_. 

For example, if
$$
A=\left(\begin{array}{ccc}
3 & i & 1 \\
-1 & 4+5 i & 2 \\
2 & 1 & -1
\end{array}\right)
$$
then the three Gershgorin discs have:
- centre 3 and radius $|i|+|1|=2$,
- centre $4+5 i$ and radius $|-1|+|2|=3$,
- centre $-1$ and radius $|2|+|1|=3$.

<img src="https://www.maths.ed.ac.uk/~tl/images/discs400.png" width="30%">

Gershgorin’s theorem says that every eigenvalue lies in the union of these three discs.
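The example can be reproduced in a few lines of JAX. This sketch computes the centres and radii of the three discs for the matrix above and checks that each computed eigenvalue falls in at least one disc; the small tolerance `1e-9` is an assumption added to absorb rounding error.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64/complex128 instead of the 32-bit defaults

A = jnp.array([[3, 1j, 1],
               [-1, 4 + 5j, 2],
               [2, 1, -1]])

centers = jnp.diag(A)
# Row sums of |a_ij| minus the diagonal term give the off-diagonal sums r_i.
radii = jnp.sum(jnp.abs(A), axis=1) - jnp.abs(centers)
print(centers, radii)  # radii are 2, 3, 3 as listed above

eigvals = jnp.linalg.eigvals(A)
# dist[i, j] = |lambda_i - a_jj|; eigenvalue i is in disc j if dist[i, j] <= r_j.
dist = jnp.abs(eigvals[:, None] - centers[None, :])
in_some_disc = jnp.any(dist <= radii[None, :] + 1e-9, axis=1)
print(in_some_disc)  # every eigenvalue should be covered
```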

Proof. Consider an eigenvector:

$$Ax = \lambda x $$

For instance, we could have 

$$
\begin{bmatrix}
1 & 2 & 3  \\
a & b & c \\
3 & 2 & 1 \end{bmatrix} \begin{bmatrix}
1 \\
2 \\
3 \end{bmatrix} = \lambda \begin{bmatrix}
1 \\
2 \\
3 \end{bmatrix}
$$

Now, take the component $x_i$ with the largest magnitude in $x$ (in the above
example, $i=3$). Writing $A_i$ for the $i$th row of $A$, we have 

$$ A_i x = \sum_j a_{ij} x_j = \lambda x_i $$

In the specific example, we have 

$$
\begin{bmatrix}
3 & 2 & 1 \end{bmatrix} \begin{bmatrix}
1 \\
2 \\
3 \end{bmatrix} = \lambda x_3
$$

Now, isolating the diagonal term $a_{ii} x_i$, we have

$$a_{ii} x_i = \lambda x_i - \sum_{j \neq i} a_{ij} x_j$$

Divide by $x_i \neq 0$:

$$a_{ii} - \lambda = - \sum_{j \neq i} a_{ij} \frac{x_j}{x_i}, \ \frac{|x_j|}{|x_i|} \leq 1$$

Therefore, 

$$|\lambda - a_{ii} | \leq \sum_{j \neq i} |a_{ij}|$$

Alternatively, we can rearrange the same equation as

$$
\left(\lambda-a_{i i}\right) x_i=\sum_{j \neq i} a_{i j} x_j .
$$
Now take the modulus of each side:
$$
\left|\lambda-a_{i i}\right|\left|x_i\right|=\left|\sum_{j \neq i} a_{i j} x_j\right| \leq \sum_{j \neq i}\left|a_{i j}\right|\left|x_j\right| \leq\left(\sum_{j \neq i}\left|a_{i j}\right|\right)\left|x_i\right|=r_i\left|x_i\right|
$$
where to get the inequalities, we used the triangle inequality and then the maximal property of $\left|x_i\right|$. Cancelling $\left|x_i\right|$ gives $\left|\lambda-a_{i i}\right| \leq r_i$. And that's it!

<iframe width="600" height="325" src="https://www.youtube.com/embed/19FXch2X7sQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

If $\lambda$ is an eigenvalue of $A$ (which means $A$ is a square matrix), 
show that it is an eigenvalue of $A^T$.

$$
\begin{aligned}
\text{det}(A-\lambda I) & = 0 \\ 
\text{det}((A-\lambda I)) & = \text{det}((A-\lambda I)^T) \\
\text{det}(A^T - \lambda I) & = 0 
\end{aligned}
$$

Since the eigenvalues of $A$ and $A^T$ are the same, we also  have the 
eigenvalues of $A$ lie within the Gershgorin discs of $A^T$.

This theorem shows that if the off-diagonal entries of a matrix $A$ 
are small, the eigenvalues cannot be too far from the diagonal entries of 
$A$. Therefore, if we have an algorithm that reduces the magnitude of 
the off-diagonal entries, it can be used to approximate the eigenvalues 
of $A$. Of course, diagonal entries will change in the process of minimizing 
the off-diagonal entries. 

To see what happens as the off-diagonal entries get larger and larger, 
consider the matrix:

$$C(t) = (1-t)D + tA$$

where $D$ is the diagonal of $A$, and $t$ is a parameter between $0$ and $1$.
When $t = 0, C(0) = D$ and all the Gershgorin discs are centered at the 
diagonal entries $a_{ii}$ and have radius zero. When $t=1, C(1)$ is simply 
$A$. 

As $t$ increases from $0$ to $1$, the radii of the Gershgorin
discs increase linearly with $t$, from $0$ to 

$$\sum_{j \neq i} |a_{ij}|$$

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/gershgorin.png" width="79%">

For the formula,

$$C(t) = (1-t)D + tA$$

when $t$ is small, all of the discs are still disjoint, and each disc contains 
exactly one eigenvalue, which is close to the corresponding diagonal entry of 
$A$. As $t$ grows, it could happen that two discs merge. In this case,
something special happens. After the merge, the union of the two discs
contains exactly two eigenvalues (possibly a single eigenvalue with algebraic multiplicity 2).
These two eigenvalues may lie in both discs or only one of them: one eigenvalue
may move from one disc to another after the discs merge. 


## Unitarily diagonalizable matrices 

The eigendecomposition is very useful, but, as we will see later, it can 
become difficult to compute with when $X$ is ill-conditioned (that is,
if the eigenvectors are nearly linearly dependent). One reason is that 
computing $X^{-1}$ becomes difficult and can lead to large numerical errors. 

Thus, we might ask ourselves when $X$ is guaranteed to be _well-conditioned_.
The best case is when $X$ is orthogonal or unitary. 

With this in mind, we say that a matrix $A \in \mathbb{C}^{n \times n}$ is
__unitarily diagonalizable__ if there is a unitary matrix $Q$ and 
a diagonal matrix $\Lambda$ so that 

$$A = Q \Lambda Q^H$$

A matrix $A \in \mathbb{C}^{n \times n}$ is unitarily diagonalizable if and only if 

$$A^HA = AA^H$$

If this condition holds, we say that $A$ is __normal__.

If $A$ is complex Hermitian, that is, $A^H = A$, then the eigenvalues of 
$A$ are real, and 

$$A = Q \Lambda Q^H$$

with $\Lambda$ real and $Q$ unitary. If $A$ is real symmetric, then in 
addition we have that $Q$ is real orthogonal, and 

$$A = Q \Lambda Q^T$$
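For Hermitian (or real symmetric) matrices, JAX provides `jnp.linalg.eigh`, which returns real eigenvalues and a unitary matrix of eigenvectors. The matrix below is an illustrative Hermitian example with eigenvalues 1 and 4.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64/complex128 instead of the 32-bit defaults

A = jnp.array([[2.0, 1 - 1j],
               [1 + 1j, 3.0]])  # Hermitian: A equals its conjugate transpose

A_H = A.conj().T
print(jnp.allclose(A, A_H))              # Hermitian
print(jnp.allclose(A_H @ A, A @ A_H))    # hence normal

lam, Q = jnp.linalg.eigh(A)  # eigenvalues are real and sorted in ascending order
print(lam.dtype, lam)

print(jnp.allclose(Q.conj().T @ Q, jnp.eye(2)))           # Q is unitary
print(jnp.allclose(Q @ jnp.diag(lam) @ Q.conj().T, A))    # A = Q Lambda Q^H
```

Because $Q$ is unitary, no explicit inverse is needed: the conjugate transpose plays the role of $X^{-1}$.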

## Jordan form 

What happens when a square $n \times n$ matrix $A$ is not diagonalizable?
We know that $A$ still has $n$ eigenvalues (counting multiplicities) because
the characteristic polynomial $p_A$ has $n$ complex roots. It turns out 
that even if we don't have a full basis of eigenvectors, we can 
still use the eigenvalues, and the eigenvectors we do have, to decompose
$A$ in a useful way. This is called the _Jordan canonical form_.

Before we define the Jordan form, we define a Jordan block. A Jordan block is either a matrix of size 1 or of size greater than 1; if greater than 1, it is a matrix of the form
$$
J=\left(\begin{array}{ccccc}
\lambda & 1 & 0 & \cdots & 0 \\
0 & \lambda & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda & 1 \\
0 & 0 & \cdots & 0 & \lambda
\end{array}\right)
$$
with ones on the super-diagonal (that is, ones in the entries $J_{i, i+1}$ ).
All square matrices can be written in the Jordan canonical form:

For any matrix $A \in \mathbb{C}^{n \times n}$, there exist $X$ and $J$ such that
$$
A=X J X^{-1} \text {. }
$$
The matrix $J$ is a block diagonal matrix,
$$
J=\left(\begin{array}{lll}
J_1 & & \\
& \ddots & \\
& & J_k
\end{array}\right)
$$
with Jordan blocks $J_1, \ldots, J_k$ on the diagonal.

Each block $J_i$ corresponds to an eigenvalue $\lambda$ of $A$, which appears on its diagonal. If an eigenvalue $\lambda$ has geometric multiplicity $g$, then there are $g$ distinct Jordan blocks with $\lambda$ on the diagonal. Thus, the Jordan form can be seen as a generalization of the eigendecomposition: if the algebraic multiplicity is equal to the geometric multiplicity for each eigenvalue, then each Jordan block has size 1. In this case, $J$ is diagonal, and we recover the eigendecomposition.

We will see several algorithms later in this series for computing eigendecompositions accurately. However, the Jordan decomposition is very difficult to compute accurately on a computer. To see why, suppose we have a matrix $A$ which has an eigenvalue $\lambda$ with algebraic multiplicity 2 and geometric multiplicity 1. It turns out that even a small change in the matrix could turn this single eigenvalue into two distinct eigenvalues. The perturbed matrix is then diagonalizable, so its Jordan form consists of two $1 \times 1$ blocks rather than a single $2 \times 2$ block: the Jordan form does not depend continuously on the entries of $A$.
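This sensitivity is easy to observe numerically. The sketch below takes a $2 \times 2$ Jordan block with eigenvalue 1 and perturbs a single entry by `eps` (an illustrative value); the characteristic polynomial becomes $(1-\lambda)^2 - \varepsilon$, so the double eigenvalue splits into $1 \pm \sqrt{\varepsilon}$.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

eps = 1e-8  # size of the perturbation (illustrative)

# Jordan block: eigenvalue 1 with algebraic multiplicity 2, geometric multiplicity 1.
J = jnp.array([[1.0, 1.0],
               [0.0, 1.0]])
J_pert = J.at[1, 0].set(eps)  # change one entry by eps

eig_J = jnp.linalg.eigvals(J)
eig_pert = jnp.linalg.eigvals(J_pert)
split = jnp.abs(eig_pert[0] - eig_pert[1])

print(eig_J)     # both eigenvalues equal 1
print(eig_pert)  # two distinct eigenvalues near 1
print(split, 2 * jnp.sqrt(eps))  # gap is about 2 * sqrt(eps), far larger than eps
```

A perturbation of size $10^{-8}$ moves the eigenvalues by about $10^{-4}$, which is why rounding errors alone can change the Jordan structure that an algorithm sees.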

## Schur decomposition 

We have seen that if $A$ is a normal matrix (a very strong property!), then 
it is unitarily diagonalizable: $A = Q \Lambda Q^H$ for some unitary $Q$.
When $A$ is an arbitrary matrix, all we get is the Jordan canonical form:
$A = XJX^{-1}$. The Jordan canonical form is more difficult to work 
with, both because $X$ might be poorly conditioned and because this 
decomposition isn't very robust to small errors. 

The _Schur decomposition_ gets (almost) the benefits of unitary diagonalization
but for arbitrary matrices. 

For any complex matrix $A \in \mathbb{C}^{n \times n}$, there is a unitary 
matrix $Q$ and an upper triangular matrix $T$ so that 

$$A = Q T Q^H$$

This is called the __Schur decomposition__. The eigenvalues of $A$ 
appear along the diagonal of $T$.

A $2 \times 2$ __block upper-triangular matrix__ is a block upper-triangular matrix in which all of 
the blocks on the diagonal are either $2 \times 2$ or $1 \times 1$. 

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/block-matrix.webp" alt="block-matrix" width="20%">

For any real matrix $A \in \mathbb{R}^{n \times n}$, there is a real 
orthogonal matrix $Q$ and a $2 \times 2$ block upper-triangular matrix
$T$ so that 

$$A = Q T Q^T$$

The $2 \times 2$ blocks on the diagonal of $T$ contain two complex conjugate
eigenvalues of $A$, and the $1 \times 1$ blocks contain a single real eigenvalue.
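Both variants can be compared with `jax.scipy.linalg.schur`. The sketch below uses a $90^\circ$ rotation matrix (an illustrative choice) whose eigenvalues are $\pm i$: the complex Schur form is upper triangular with $\pm i$ on the diagonal, while the real Schur form has to keep a $2 \times 2$ block. Depending on the JAX version, this routine may only be available on the CPU backend.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import schur

jax.config.update("jax_enable_x64", True)  # float64/complex128 instead of the 32-bit defaults

A = jnp.array([[0.0, 1.0],
               [-1.0, 0.0]])  # rotation by 90 degrees, eigenvalues +i and -i

# Complex Schur form: T_c is upper triangular, Z_c is unitary.
T_c, Z_c = schur(A.astype(jnp.complex128), output="complex")
print(jnp.diag(T_c))                                   # eigenvalues on the diagonal
print(jnp.allclose(jnp.tril(T_c, -1), 0.0))            # nothing below the diagonal
print(jnp.allclose(Z_c @ T_c @ Z_c.conj().T, A))       # A = Q T Q^H

# Real Schur form: Z_r is real orthogonal, T_r keeps a 2 x 2 diagonal block.
T_r, Z_r = schur(A, output="real")
print(T_r)                                             # entry below the diagonal is non-zero
print(jnp.allclose(Z_r @ T_r @ Z_r.T, A))              # A = Q T Q^T
```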

The decompositions we've seen so far - the eigendecomposition, Jordan form, 
and (real) Schur decomposition - are all __eigenvalue revealing__ in the 
sense that the eigenvalues of the matrix $A$ lie on the diagonal of 
some matrix involved in the decomposition. To summarize what we have seen:

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/matrix-decomposition.webp" alt="matrix-decomposition" width="70%">

## Singular value decomposition 

In the previous section we saw several matrix decompositions based on eigenvalues. Now we'll see another decomposition based on singular values instead of eigenvalues. This is called the singular value decomposition (SVD), and it is an extremely important tool for understanding and computing with matrices.

To motivate the SVD, consider the action of a matrix $A$ on the unit sphere. It turns out that $A$ maps the sphere to some hyperellipsoid $E$ (that is, a high-dimensional ellipsoid). Let's define the following quantities:

- The lengths of the semi-axes of $E$ are denoted $\sigma_{1}, \ldots, \sigma_{n}$. These are called the singular values of $A$. By convention, they are ordered so that $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{n}$.

- The directions of the semi-axes are denoted by unit vectors $u_{1}, \ldots, u_{n}$ (so that the $i$ th semi-axis is $\sigma_{i} u_{i}$ ). The vectors $u_{i}$ are called the __left singular vectors__ of $A$. 

- For each $u_{i}$, there is some unit vector $v_{i}$ so that $A v_{i}=\sigma_{i} u_{i}$. The vectors $v_{i}$ are called the __right singular vectors__.

Here's a picture of these quantities for some $2 \times 2$ matrix $A$ :

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/svd-illustration1.webp" alt="svd-illustration1" width="70%">

Notice that $A$ doesn't have to be a square matrix for the definitions above to make sense. The singular vectors and singular values can be used to form the singular value decomposition:

Consider any matrix $A \in \mathbb{C}^{m \times n}$ and $p=\min (m, n)$. There exist two unitary matrices $U \in \mathbb{C}^{m \times m}$ and $V \in \mathbb{C}^{n \times n}$ and a diagonal matrix $\Sigma \in \mathbb{R}^{m \times n}$ with real non-negative entries such that

$$
A=U \Sigma V^{H} \text {. }
$$

If $m>n, \Sigma$ has zeros in rows $p+1$ to $m$. If $n>m, \Sigma$ has zeros in columns $p+1$ to $n$. This is called the singular value decomposition.

The diagonal entries of $\Sigma$ are $\sigma_{1}, \ldots, \sigma_{p}$, and are called the singular values. By convention we order the singular values so that $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{p} \geq 0$. The columns of $U$ are called the __left singular vectors__, and the columns of $V$ are called the __right singular vectors__.

If $A$ is real, then $U$ and $V$ are real orthogonal.

The singular values are uniquely determined. In general, $U$ and $V$ are not unique. However, if all the singular values are distinct and ordered from large to small, $U$ and $V$ are uniquely determined (up to the sign of their columns if they are real, or a multiplication by $e^{\imath \theta}$ if they are complex). If $A$ has rank $r$, then $\sigma_{i}=0$ for $r<i \leq p$. In other words, the number of non-zero singular values is equal to the rank of the matrix. 

To see why the decomposition in the theorem above lines up with the geometric intuitions, consider a matrix

$$
A=U \Sigma V^{H} \text {. }
$$

We can view the action of $A$ on the unit sphere as a composition of three linear maps, $V^{H}$, $\Sigma$, and $U$. Since $U$ and $V$ are unitary, their actions are rotations and reflections. Since $\Sigma$ is diagonal, all it does is re-scale the coordinate axes. As a result, we can represent the action of $A$ as follows:

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/svd-illustration2.webp" alt="svd-illustration2" width="70%">

Since $U$ and $V$ are just rotations or reflections, the singular values $\sigma_{1}, \ldots, \sigma_{n}$ capture everything that $A$ does to the 2-norm of vectors. In particular, both the 2-norm of $A$ and the __Frobenius norm__ of $A$ can be written in terms of the singular values:

For $A \in \mathbb{R}^{m \times n}$ :

$$
\|A\|_{2}=\sigma_{1}(A), \quad \|A\|_{F}=\sqrt{\sum_{i=1}^{\min \{m, n\}} \sigma_{i}^{2}},
$$

where $\sigma_{1}$ is the largest singular value.
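A short JAX check of both identities, using a random $4 \times 3$ matrix as an illustrative input.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

A = jax.random.normal(jax.random.PRNGKey(0), (4, 3))

s = jnp.linalg.svd(A, compute_uv=False)  # singular values, largest first

two_norm = jnp.linalg.norm(A, 2)      # matrix 2-norm
fro_norm = jnp.linalg.norm(A, "fro")  # Frobenius norm

print(two_norm, s[0])                      # 2-norm equals the largest singular value
print(fro_norm, jnp.sqrt(jnp.sum(s**2)))   # Frobenius norm from the singular values
```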

### Condition number and the SVD

For a non-singular $n \times n$ matrix $A$, the __condition number__ of $A$, denoted $\kappa(A)$, is defined as

$$
\kappa(A)=\|A\|_{2}\left\|A^{-1}\right\|_{2} .
$$

The condition number is also given by

$$
\kappa(A)=\frac{\sigma_{1}}{\sigma_{n}} .
$$

As we will see later in this series, the condition number of a matrix is a key factor when determining the accuracy of a numerical calculation involving $A$, for example, when solving a linear system $A x=b$.
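The two expressions for $\kappa(A)$ can be compared directly. The nearly singular $2 \times 2$ matrix below is an illustrative choice that gives a condition number in the tens of thousands.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

A = jnp.array([[1.0, 1.0],
               [1.0, 1.0001]])  # rows are almost parallel, so A is close to singular

s = jnp.linalg.svd(A, compute_uv=False)
kappa_svd = s[0] / s[-1]  # sigma_1 / sigma_n

kappa_def = jnp.linalg.norm(A, 2) * jnp.linalg.norm(jnp.linalg.inv(A), 2)

print(kappa_svd, kappa_def)  # the two values should agree closely
```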

### Singular values vs. eigenvalues 

When $A$ is a symmetric matrix, the singular values and the eigenvalues are the same, up to a sign:


For a symmetric $n \times n$ matrix $A$ with eigenvalues $\lambda_{i}$ and singular values $\sigma_{i}$ so that $\left|\lambda_{1}\right| \geq\left|\lambda_{2}\right| \geq \cdots \geq\left|\lambda_{n}\right|$ and $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{n}$,

$$
\sigma_{i}=\left|\lambda_{i}\right|
$$

for all $i$.

However, when $A$ is not symmetric, the singular values and the eigenvalues can be very different. As discussed above, the eigenvalues capture how $A^{k}$ behaves for large $k$, while the singular values capture the geometry of $A$ as a linear transformation. As an example of how these can be different, consider the matrix

$$
A=X\left(\begin{array}{cc}
1 & 0 \\
0 & -1
\end{array}\right) X^{-1}
$$

for some invertible matrix $X$.

It is clear that the eigenvalues of $A$ are $\pm 1$. Further, you can check that $A^{2}=I$, so $A^{k}$ is equal to $I$ if $k$ is even and equal to $A$ if $k$ is odd. Thus, the behavior of $A^{k}$ is bounded as $k$ grows in the sense that it just oscillates between two matrices $I$ and $A$; it does not blow up with $k$.

On the other hand, the singular values of $A$ might get quite large. If $X$ is orthogonal, then $\sigma_{1}=\sigma_{2}=1$. But as $X$ gets further and further from orthogonal, the singular values of $A$ can become larger and larger. To see why, we can imagine multiplying by $A$ one factor at a time and seeing what happens to the unit circle. Multiplying by $X^{-1}$ "stretches" the unit circle out. Then the matrix $\left(\begin{array}{cc}1 & 0 \\ 0 & -1\end{array}\right)$ flips the space across the $x$-axis (it negates the second coordinate). Finally, $X$ "undoes" the stretching that $X^{-1}$ did, but because of the flip, things don't get put back where they came from, and some points remain stretched. For example, the situation might look like this:

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/svd-illustration3.webp" alt="svd-illustration3" width="70%">

In this case, the unit ball gets mapped to some stretched-out ellipsoid, and there is some point on the unit ball which gets mapped to a much longer vector. This means that $\sigma_{1}$ is large.

However, even if $A$ is not symmetric, $A^{T} A$ and $A A^{T}$ are symmetric, and it turns out that the eigenvalues of these matrices are related to the singular values of $A$.

Let $A$ be a square matrix of size $n$ with singular values $\sigma_{i}$. Then both $A^{T} A$ and $A A^{T}$ have eigenvalues $\sigma_{1}^{2}, \ldots, \sigma_{n}^{2}$. Moreover, the right singular vectors are the eigenvectors of $A^{T} A$, while the left singular vectors are the eigenvectors of $A A^{T}$ :

$$
A^{T} A v_{i}=\sigma_{i}^{2} v_{i}, A A^{T} u_{i}=\sigma_{i}^{2} u_{i} .
$$
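The relationship between the SVD of $A$ and the eigenvalues of $A^T A$ and $A A^T$ can be confirmed numerically. This sketch uses a random real $4 \times 4$ matrix as an illustrative input.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

A = jax.random.normal(jax.random.PRNGKey(0), (4, 4))

U, s, Vh = jnp.linalg.svd(A)  # A = U diag(s) Vh, singular values in descending order
V = Vh.T                      # columns of V are the right singular vectors

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order, so reverse them.
eig_AtA = jnp.linalg.eigvalsh(A.T @ A)[::-1]
eig_AAt = jnp.linalg.eigvalsh(A @ A.T)[::-1]
print(eig_AtA, s**2)

# Column i of V (resp. U) is an eigenvector with eigenvalue sigma_i^2.
# Multiplying by s**2 scales column i by sigma_i^2 via broadcasting.
print(jnp.allclose(A.T @ A @ V, V * s**2))
print(jnp.allclose(A @ A.T @ U, U * s**2))
```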

## Different shapes of SVD 

For square matrices, the three factors $U, \Sigma$, and $V$ have the same shape. For rectangular matrices, we can work with two different decompositions: the full SVD (the one defined above) and the thin SVD. 

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/svd-full.webp" alt="svd-full" width="50%">

However, all of those zeros in $\Sigma$ make parts of either $U$ or $V^{H}$ extraneous. By trimming off these extra parts, we end up with the thin SVD:

For any matrix $A \in \mathbb{C}^{m \times n}$ there are matrices $\hat{U}$ and $\hat{V}$ so that $\hat{U}^{H} \hat{U}=I$ and $\hat{V}^{H} \hat{V}=I$, and a diagonal matrix $\hat{\Sigma}$ so that

$$
A=\hat{U} \hat{\Sigma} \hat{V}^{H} .
$$

Let $p=\min (m, n)$. Then $\hat{\Sigma}$ is $p \times p, \hat{U}$ is $m \times p$, and $\hat{V}$ is $n \times p$.
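In JAX, the two variants are selected with the `full_matrices` argument of `jnp.linalg.svd`. The sketch below compares the shapes for an illustrative $5 \times 3$ matrix and checks that both reconstruct $A$.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 instead of JAX's default float32

m, n = 5, 3
A = jax.random.normal(jax.random.PRNGKey(0), (m, n))

# Full SVD: U is m x m, V^H is n x n, and Sigma must be padded to m x n.
U, s, Vh = jnp.linalg.svd(A, full_matrices=True)
Sigma = jnp.zeros((m, n)).at[:n, :n].set(jnp.diag(s))
print(U.shape, Sigma.shape, Vh.shape)

# Thin SVD: U_hat is m x p, Sigma_hat is p x p, V_hat^H is p x n, with p = min(m, n).
U_hat, s_hat, Vh_hat = jnp.linalg.svd(A, full_matrices=False)
print(U_hat.shape, jnp.diag(s_hat).shape, Vh_hat.shape)

print(jnp.allclose(U @ Sigma @ Vh, A))
print(jnp.allclose(U_hat @ jnp.diag(s_hat) @ Vh_hat, A))
```

The thin version stores fewer numbers and is usually what you want when $m$ and $n$ differ a lot.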

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/svd-thin.webp" alt="svd-thin" width="50%">


__That's it!__ Now you know all the linear algebra you need for implementing
all those famous algorithms we will learn in this series of posts, such as:

<img src="https://github.com/oceanumeric/Applied-Mathematics/raw/main/images/top-10.webp" alt="top-10-algorithms" width="70%">

## References 

1. [In Praise of the Gershgorin Disc Theorem](https://golem.ph.utexas.edu/category/2016/08/in_praise_of_the_gershgorin_di.html)
2. [Gershgorin circle theorem](https://www.geogebra.org/m/wDEj3Xg9)