# Unsupervised Learning
Author: Bingchen Wang

Last Updated: 24 Sep, 2022

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a>
</nav>

---

In [1]:
%%html
<link rel='stylesheet' type='text/css' media='screen' href='../styles/custom.css'>

<section class = "section--outline">
    <div class = "outline--header">Outline </div>
    <div class = "outline--content">
        <b>Concepts:</b>
        <ul>
            <li> <a href = "#KMC">K-means Clustering</a>
            <li> <a href = "#MoG">Mixture of Gaussians</a>
                <ul>
                    <li> <a href = "#AD">Anomaly Detection</a>
                    <li> <a href = "#MoGM">Mixture of Gaussians Model</a>
                    <li> <a href = "#EMA">Expectation Maximization Algorithm</a></ul>
            <li> <a href = "#FA">Factor Analysis</a>
            <li> <a href = "#PCA">Principal Component Analysis</a>
            <li> <a href = "#ICA">Independent Component Analysis</a>
        </ul>
        <b>Implementation:</b>
        <ul>
            <li> K-means Clustering
                <ul>
                    <li> <a href = "./K-means Clustering/Numpy Implementation.ipynb">Numpy Implementation</a>
                    <li> <a href = "./K-means Clustering/Sklearn Implementation.ipynb">Sklearn Implementation</a>
                </ul>
            <li> Mixture of Gaussians
                <ul>
                    <li> <a href = "./Mixture of Gaussians/Numpy Implementation.ipynb">Numpy Implementation</a>
                    <li> <a href = "./Mixture of Gaussians/Sklearn Implementation.ipynb">Sklearn Implementation</a>
                </ul>
            <li> Principal Component Analysis
                <ul>
                    <li> <a href = "./Principal Component Analysis/Sklearn Implementation.ipynb">Sklearn Implementation</a>
                </ul>
            <li> Independent Component Analysis
                <ul>
                    <li> <a href = "./Independent Component Analysis/Sklearn Implementation.ipynb">Sklearn Implementation</a>
                </ul>
        </ul>    
    </div>
</section>

<a name = "KMC"></a>
## K-means Clustering
### Cost function
$$
J(\mathbf{c},\mathbf{\mu}) = \sum^m_{i=1}\left\Vert x^{(i)} - \mu_{c^{(i)}}\right\Vert^2
$$

### Algorithm
<section class = "section--algorithm">
    <div class = "algorithm--header"> K-means Clustering Algorithm</div>
    <div class = "algorithm--content">
        Data $ \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$.
        <blockquote>
            Initialize cluster centroids $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$ <br>
            <div class = "alert alert-block alert-success"><b>Note:</b> Usually randomly pick $k$ examples from the dataset to be the initial cluster centroids. </div>
            Repeat until convergence:
            <blockquote>
                (a) <b>(colour the points)</b> Set $c^{(i)} := \arg\min_j \left\Vert x^{(i)} - \mu_j \right\Vert_2$ <br>
                (b) <b>(move the cluster centroids)</b> For $j = 1, 2, \dots, k$, 
                $$
                \mu_j := \frac{\sum^m_{i=1} \mathbb{1}\{c^{(i)} = j \} x^{(i)}}{\sum^m_{i=1} \mathbb{1}\{c^{(i)} = j \}}
                $$
            </blockquote>
        </blockquote>
    </div>
</section>

### Local minima
**Q: Worry about local minima?** <br>
A: Run the algorithm several times, say 10, 100, 1000 times, with different random initializations of cluster centroids. Pick the one that results in the lowest value for the cost function.
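
The two alternating steps and the restart strategy can be sketched in NumPy. This is a minimal illustration, not the linked notebook implementation: the function names `kmeans` and `kmeans_restarts`, the toy data, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """One K-means run from a random initialization; returns (centroids, labels, cost)."""
    m = X.shape[0]
    # Initialize the centroids as k distinct examples from the dataset.
    mu = X[rng.choice(m, size=k, replace=False)].copy()
    for _ in range(max_iter):
        # (a) Colour the points: squared distance of every point to every centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # (b) Move each centroid to the mean of the points assigned to it.
        new_mu = mu.copy()
        for j in range(k):
            if np.any(c == j):  # assumption: an empty cluster keeps its old centroid
                new_mu[j] = X[c == j].mean(axis=0)
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    # Final assignment and cost J(c, mu) for the returned centroids.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    c = d2.argmin(axis=1)
    cost = d2[np.arange(m), c].sum()
    return mu, c, cost

def kmeans_restarts(X, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the run with the lowest cost."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])

# Toy data: three blobs in the plane.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
mu, c, cost = kmeans_restarts(X, k=3, n_init=10)
print(mu, cost)
```

Because every restart only adds a candidate solution, the best cost over 10 runs can never be worse than the cost of the first run alone.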

<a name = "MoG"></a>
## Mixture of Gaussians

<a name = "AD"></a>
### Anomaly Detection
<img align="right" src="./images/Anomaly Detection.jpeg" style="width:300px;" >

#### Supervised Learning vs Unsupervised Learning
Supervised Learning when:
- Lots of labelled data of both classes (normal and anomalous)
- Future anomalies similar to the ones seen in the training set

Unsupervised Learning when:
- Unlabelled data or labelled data with few or no anomalies
- Various types of anomalies, with unseen future anomalies

<a name = "MoGM"></a>
### Mixture of Gaussians Model
Suppose there is a latent (hidden/unobserved) random variable $z$, and $x^{(i)}, z^{(i)}$ are jointly distributed as
$$
P(x^{(i)}, z^{(i)}) = P(x^{(i)}\vert z^{(i)})P(z^{(i)})
$$
where
$$
\begin{align}
z^{(i)} \sim & \; \text{Multinomial}(\phi), z^{(i)} \in \{1, \dots, k\} \\
x^{(i)}\vert z^{(i)} = j \sim & \; \mathcal{N}(\mu_j, \Sigma_j)
\end{align}
$$
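
The generative story can be made concrete by sampling from it: first draw the latent label, then draw the observation from the Gaussian that the label selects. The parameter values below are hypothetical, chosen only for illustration, and the labels run over $0, \dots, k-1$ rather than $1, \dots, k$ to match NumPy indexing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters (k = 3 components in n = 2 dimensions).
phi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
Sigmas = np.stack([
    np.eye(2),
    0.5 * np.eye(2),
    np.array([[1.0, 0.8], [0.8, 1.0]]),
])

m = 1000
# z ~ Multinomial(phi): one latent component label per example.
z = rng.choice(len(phi), size=m, p=phi)
# x | z = j ~ N(mu_j, Sigma_j): fill in the rows belonging to each component.
x = np.empty((m, 2))
for j in range(len(phi)):
    idx = z == j
    x[idx] = rng.multivariate_normal(mus[j], Sigmas[j], size=idx.sum())

print(np.bincount(z, minlength=len(phi)) / m)  # empirical mixing proportions
```

In the unsupervised setting only `x` is observed; `z` is exactly the information that the EM algorithm has to guess.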

<a name = "EMA"></a>
### Expectation Maximization Algorithm
<div class = "alert alert-block alert-info"><b>Intuition:</b> Like K-means but with soft assignments.</div>

<div style = "text-align: center;">
    <img src="./images/Expectation maximization.jpeg" style="width:50%;" >
</div>
        
<section class = "section--algorithm">
    <div class = "algorithm--header"> Expectation Maximization Algorithm (for mixture of Gaussians)</div>
    <div class = "algorithm--content">
        <b>E-step</b> (Guess the value of $z^{(i)}$'s)
        <blockquote>
            Set $$ 
            \begin{align}
            w_j^{(i)} =& Q_i(z^{(i)}=j) = P(z^{(i)} = j | x^{(i)}; \mathbf{\phi}, \mathbf{\mu}, \mathbf{\Sigma}) \\
            =& \frac{P(x^{(i)}\vert z^{(i)} = j)P(z^{(i)} = j)}{\sum^k_{l=1}P(x^{(i)}\vert z^{(i)} = l)P(z^{(i)} = l)}
            \end{align}
            $$
            where
            $$
            \begin{align}
            P(z^{(i)} = j) = & \phi_j  \\
            P(x^{(i)}\vert z^{(i)} = j) = & \frac{1}{{(2\pi)}^{n/2}\vert\Sigma_j\vert^{1/2}} \exp{\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T \Sigma_j^{-1}(x^{(i)}-\mu_j)\right)}
            \end{align}
            $$
        </blockquote>
        <b>M-step</b> (Update the Gaussians)
        <blockquote>
        <div class= "alert alert-block alert-success">
        $$
        \begin{align}
        \max_{\phi, \mu, \Sigma}& \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log\left[\frac{P(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})}\right] \\
        &= \sum_i \sum_{j} w^{(i)}_j \log\left[\frac{\frac{1}{{(2\pi)}^{n/2}\vert\Sigma_j\vert^{1/2}} \exp{\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T \Sigma_j^{-1}(x^{(i)}-\mu_j)\right)}\phi_j}{w^{(i)}_j}\right]
        \end{align}
        $$
        </div> <br>
        $$
        \begin{align}
        \phi_j :=& \frac{1}{m} \sum^m_{i=1} w_j^{(i)} \\
        \mu_j :=& \frac{\sum^m_{i=1}w_j^{(i)}x^{(i)}}{\sum^m_{i=1}w_j^{(i)}} \\
        \Sigma_j :=& \frac{\sum^m_{i=1}w_j^{(i)}(x^{(i)} -\mu_j)(x^{(i)} -\mu_j)^T}{\sum^m_{i=1}w_j^{(i)}}
        \end{align}
        $$
        </blockquote>
    </div>
</section>
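
The E-step and M-step updates above translate almost line by line into NumPy. The sketch below is illustrative only: the names `gaussian_pdf` and `em_mog`, the initialization (uniform $\phi$, random examples as means, the overall data covariance for every $\Sigma_j$), and the fixed iteration count are assumptions. A more careful implementation would work in log-space and guard against singular covariances.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at each row of X."""
    n = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T            # Sigma^{-1} (x - mu), row by row
    quad = np.einsum("ij,ij->i", diff, sol)           # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_mog(X, k, n_iter=50, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)].copy()
    Sigma = np.stack([np.cov(X, rowvar=False)] * k)
    log_liks = []
    for _ in range(n_iter):
        # E-step: w[i, j] = P(z = j | x_i) by Bayes' rule.
        joint = np.column_stack(
            [phi[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(k)]
        )
        px = joint.sum(axis=1)
        log_liks.append(np.log(px).sum())             # log-likelihood before the update
        w = joint / px[:, None]
        # M-step: weighted versions of the usual Gaussian estimates.
        Nj = w.sum(axis=0)
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j]
    return phi, mu, Sigma, w, np.array(log_liks)

# Toy data: three blobs in the plane.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc, 1.0, size=(100, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
phi, mu, Sigma, w, log_liks = em_mog(X, k=3)
print(phi, log_liks[-1])
```

Each row of `w` is a soft assignment that sums to one, and the recorded log-likelihood should never decrease from one iteration to the next.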

#### Prediction

$$ P(x) = \sum_{j=1}^k P(x,z = j) $$

Criteria:

$$
\left\{ \begin{array}{c c}
P(x) \geq \epsilon & \text{(ok)} \\
P(x) < \epsilon & \text{(anomaly)}
\end{array}\right.
$$

Choose $\epsilon$ by evaluating candidate thresholds on a labelled cross-validation (CV) dataset.
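
A sketch of the scoring and threshold selection follows. The text does not say which metric to use on the labelled set, so the F1 score below is an assumption, as are the function names `mog_density` and `choose_epsilon`, the mixture parameters (taken as already fitted), and the synthetic validation data.

```python
import numpy as np

def mog_density(X, phi, mus, Sigmas):
    """P(x) = sum_j phi_j N(x; mu_j, Sigma_j) for each row of X."""
    n = X.shape[1]
    total = np.zeros(X.shape[0])
    for phi_j, mu_j, Sigma_j in zip(phi, mus, Sigmas):
        diff = X - mu_j
        quad = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma_j, diff.T).T)
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma_j))
        total += phi_j * np.exp(-0.5 * quad) / norm
    return total

def choose_epsilon(p_val, y_val):
    """Try each observed density as a threshold; keep the one with the best F1 (y = 1 is anomaly)."""
    best_eps, best_f1 = None, -1.0
    for eps in np.sort(p_val):
        pred = p_val < eps                       # flagged as anomaly
        tp = np.sum(pred & (y_val == 1))
        fp = np.sum(pred & (y_val == 0))
        fn = np.sum(~pred & (y_val == 1))
        if tp == 0:
            continue
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical fitted parameters and a small labelled validation set.
rng = np.random.default_rng(0)
phi = np.array([0.6, 0.4])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
Sigmas = np.stack([np.eye(2), np.eye(2)])
z = rng.choice(2, size=200, p=phi)
X_normal = mus[z] + rng.standard_normal((200, 2))
X_anom = np.array([10.0, -5.0]) + 0.5 * rng.standard_normal((10, 2))
X_val = np.vstack([X_normal, X_anom])
y_val = np.concatenate([np.zeros(200, dtype=int), np.ones(10, dtype=int)])

p_val = mog_density(X_val, phi, mus, Sigmas)
eps, f1 = choose_epsilon(p_val, y_val)
print(eps, f1)
```

With heavily imbalanced labels, accuracy is a poor guide, which is why a precision/recall-based score such as F1 is used in this sketch.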

<details>
    <summary><font size="3"><b>Why EM works</b></font></summary>
    <br>
    <section class = "section--concept">
        <div class = "concept--header"> Jensen's Inequality</div>
        <div class = "concept--content">
        Let $f$ be a convex function (e.g. $f^{\prime\prime}(x)\geq 0$). Let $X$ be a random variable. Then,
        $$
        f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]
        $$
        Further, if $f^{\prime\prime}(x) > 0$ ($f$ is strictly convex), then 
            $$\mathbb{E}[f(X)] = f(\mathbb{E}[X]) \iff X \text{ is constant (with probability 1).}$$ <br>
        <div style = "text-align: center;">
        <img src="./images/Jensen's inequality.jpeg" style="width:300px;" >
        </div>
        </div>
    </section>
    <br>
    <b>Idea:</b> construct lower bounds and optimize lower bounds. <br>
    <b>Construct lower bounds that are tight at the current $\theta$:</b>
    MLE: $$
    \begin{align}
    \max_\theta  \log (\Pi_i P(x^{(i)};\theta)) = &
    \Sigma_i \log P(x^{(i)};\theta) \\
    = & \Sigma_i \log \Sigma_{z^{(i)}}  P(x^{(i)}, z^{(i)};\theta) \\
    = & \Sigma_i \log \Sigma_{z^{(i)}} Q_i(z^{(i)})  \left[ \frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} \right] \; \text{where $Q_i(z^{(i)})$ is a probability distribution (i.e., $\Sigma_{z^{(i)}}Q_i(z^{(i)}) = 1$)} \\
    = & \Sigma_i \log \mathbb{E}_{z^{(i)} \sim Q_i} \left[\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}\right] \\
    \geq & \Sigma_i \mathbb{E}_{z^{(i)} \sim Q_i} \log \left[\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}\right] \; \text{(Jensen's Inequality with the sign reversed, since $\log$ is concave)} \\
    = & \Sigma_i \Sigma_{z^{(i)}} Q_i(z^{(i)}) \log \left[\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}\right]
    \end{align}
    $$
    Want lower bounds to be tight at the current $\theta$:
    $$
    \log \mathbb{E} \left[\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}\right] = \mathbb{E} \log \left[\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}\right]
    $$
    Since $\log(\cdot)$ is strictly concave, then it follows that:
    $$
    \frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} \; \text{is constant} \implies Q_i(z^{(i)}) \propto P(x^{(i)}, z^{(i)};\theta)
    $$
    Since $Q_i$ is a probability distribution over the discrete $z^{(i)}$, it follows that $\Sigma_{z^{(i)}}Q_i(z^{(i)}) = 1$. Then,
    $$
    Q_i(z^{(i)}) = \frac{P(x^{(i)}, z^{(i)};\theta)}{\Sigma_{z^{(i)}}P(x^{(i)}, z^{(i)};\theta)} = \frac{P(x^{(i)}, z^{(i)};\theta)}{P(x^{(i)};\theta)} = P(z^{(i)} \vert x^{(i)};\theta)
    $$
    <b>Expectation Maximization:</b><br>
    E-step: Set $Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)};\theta)$ <br>
    M-step: $\theta = \arg\max_\theta \Sigma_i\Sigma_{z^{(i)}}Q_i(z^{(i)})\log\left[\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}\right]$ <br>
    This shows that the EM algorithm is a maximum likelihood estimation algorithm: each iteration constructs a lower bound on the log-likelihood that is tight at the current $\theta$, then maximises that bound.
</details>
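
The lower-bound argument can be checked numerically for a single example with a discrete latent variable. In the sketch below the joint probabilities are hypothetical numbers standing in for $P(x^{(i)}, z^{(i)} = j; \theta)$; the helper name `elbo` is also an assumption.

```python
import numpy as np

def elbo(Q, joint):
    """sum_z Q(z) log[ P(x, z) / Q(z) ], the lower bound for one example."""
    return np.sum(Q * np.log(joint / Q))

rng = np.random.default_rng(0)

# Hypothetical values of P(x, z = j; theta) for one x and k = 3 components.
joint = np.array([0.02, 0.10, 0.05])
log_px = np.log(joint.sum())          # log P(x; theta)
posterior = joint / joint.sum()       # P(z | x; theta)

# The bound holds for any distribution Q over z ...
random_Qs = rng.dirichlet(np.ones(3), size=1000)
gaps = np.array([log_px - elbo(Q, joint) for Q in random_Qs])
print(gaps.min())                     # never negative

# ... and is tight exactly when Q is the posterior.
print(log_px - elbo(posterior, joint))
```

The gap equals the Kullback-Leibler divergence from $Q_i$ to the posterior, which is why setting $Q_i$ to the posterior in the E-step closes it.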

<a name = "FA"></a>
## Factor Analysis

<blockquote>
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.  -- Wikipedia
</blockquote>

<blockquote>
    <b>Factor Pricing Models in Arbitrage Pricing Theory (APT)</b> (Financial Economics) <br>
    $$
    \begin{align}
    x^i =& \alpha_i + \Sigma_{j=1}^M \beta_{ij}f_j + \epsilon^i \\
    =& \alpha_i + \mathbf{\beta}_i^\prime \mathbf{f} + \epsilon^i \\
    =& \mathbb{E}[x^i] + \Sigma_{j=1}^M \beta_{ij}\tilde{f_j} + \epsilon^i \; \text{(by convention, $\mathbb{E}[\tilde{f}] = 0$)}
    \end{align}
    $$
    where $x^i$ is the return to asset $i$ and $f_j$'s are common factors, such as the market portfolio ("the market"), industry portfolios, or size and book-to-market portfolios, etc.
    -- For more, see <a href ="https://press.princeton.edu/books/hardcover/9780691121376/asset-pricing">Cochrane (2005)</a>.
</blockquote>    
    
### Factor Analysis Model
    
$\mathbf{X}$ are **observed** variables of shape $(m,n)$ and $\mathbf{Z}$ are **unobserved/latent** factors of shape $(m,d)$, where $d < n$. Denote a single vector of factors as $z \in \mathbb{R}^d$ and a single vector of variables as $x \in \mathbb{R}^n$.
    $$
    \begin{align}
    z \sim & \mathcal{N}(0, I) \\
    x = & \mathbf{\mu} +  \Lambda z + \epsilon, \; \epsilon \sim \mathcal{N}(0, \Psi)
    \end{align}
    $$
    where $\mu \in \mathbb{R}^n, \;\Lambda \in \mathbb{R}^{n\times d},\; \Psi \in \mathbb{R}^{n \times n}$ diagonal.
    $$
    \left[\begin{array}{c}
    z \\
    x
    \end{array}\right] \sim
    \mathcal{N}\left(\left[\begin{array}{c}
    0 \\
    \mu
    \end{array}\right],
    \left[\begin{array}{cc}
    I & \Lambda^T \\
    \Lambda & \Psi + \Lambda\Lambda^T
    \end{array}\right]
    \right)
    $$
    
Conventionally, it is easier to work with the demeaned variables $\tilde x = x - \mathbb{E}[x]$ such that:<br>
    $$
    \begin{align}
    z \sim & \mathcal{N}(0, I) \\
    \tilde x = &  \Lambda z + \epsilon, \; \epsilon \sim \mathcal{N}(0, \Psi)
    \end{align}
    $$<br>
    $$
    \left[\begin{array}{c}
    z \\
    \tilde x
    \end{array}\right] \sim
    \mathcal{N}\left(\left[\begin{array}{c}
    0 \\
    0
    \end{array}\right],
    \left[\begin{array}{cc}
    I & \Lambda^T \\
    \Lambda & \Psi + \Lambda\Lambda^T
    \end{array}\right]
    \right)
    $$
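
The marginal covariance $\Psi + \Lambda\Lambda^T$ of $x$ can be verified by simulation. The dimensions and parameter values in this sketch are hypothetical, chosen only to illustrate the model.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 100_000, 5, 2

# Hypothetical parameters: loadings, diagonal noise covariance, and mean.
Lambda = rng.normal(size=(n, d))
Psi = np.diag(rng.uniform(0.1, 0.5, size=n))
mu = rng.normal(size=n)

# z ~ N(0, I),  eps ~ N(0, Psi),  x = mu + Lambda z + eps  (one example per row).
Z = rng.standard_normal((m, d))
eps = rng.standard_normal((m, n)) * np.sqrt(np.diag(Psi))
X = mu + Z @ Lambda.T + eps

implied_cov = Lambda @ Lambda.T + Psi
sample_cov = np.cov(X, rowvar=False)
max_abs_err = np.abs(sample_cov - implied_cov).max()
print(max_abs_err)
```

The off-diagonal structure of the covariance comes entirely from $\Lambda\Lambda^T$; the diagonal $\Psi$ only adds variable-specific noise.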

<details>
    <summary><font size="3"><b>Conditional Normal Distribution</b></font></summary>
    <br>
    <section class = "section--concept">
        <div class = "concept--header"> Conditional Normal Distribution</div>
        <div class = "concept--content">
            Given a multivariate normal distribution
            $$
                \left[\begin{array}{c}
                X_1 \\
                X_2
                \end{array}\right] \sim
                \mathcal{N}\left(\left[\begin{array}{c}
                \mu_1\\
                \mu_2
                \end{array}\right],
                \left[\begin{array}{cc}
                \Sigma_{11} & \Sigma_{12} \\
                \Sigma_{21} & \Sigma_{22}
                \end{array}\right]
                \right),
            $$
            the conditional distribution of $X_1 \vert X_2$ is normal with:
            $$
            \begin{align}
            \mu_{1\vert2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_2) \\
            \Sigma_{1\vert2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
            \end{align}
            $$
            (Derivation : <a href = "https://stats.stackexchange.com/questions/30588/deriving-the-conditional-distributions-of-a-multivariate-normal-distribution">here</a>)
        </div>
    </section>
    <br>
    It follows that: $z \vert x$ is normal with:
    $$
    \begin{align}
    \mu_{z\vert x} = \Lambda^T {(\Psi + \Lambda\Lambda^T)}^{-1}(x - \mu) \\
    \Sigma_{z\vert x} = I - \Lambda^T{(\Psi + \Lambda\Lambda^T)}^{-1}\Lambda
    \end{align}
    $$
    If we see $z \vert x$ through the lens of <b>linear projection</b>, then
    $$
    \mathbb{E}[z\vert x] = \beta (x - \mu)
    $$
    where $\beta \equiv \Lambda^T {(\Psi + \Lambda\Lambda^T)}^{-1}$.
</details>

### Expectation Maximization for Factor Analysis
Recall **EM**: (construct and optimise lower bounds)
<blockquote>
    <b>E-step</b>: compute $w_j^{(i)}= Q_i(z^{(i)}=j) = P(z^{(i)} = j\vert x^{(i)})$. <br>
    <b>M-step</b>: solve for $\theta$ that maximises the lower bound on the log-likelihood $\log(\Pi_i^m P(x^{(i)}; \theta))$:
    $$
    \theta := \arg\max_\theta \sum_i^m \mathbb{E}_{z^{(i)}\sim Q_i}\left[\log\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]
    $$
</blockquote>
Here, $z$ is continuous and $z \vert x$ follows a normal distribution, so $Q_i$ is fully described by its mean and covariance, and the E-step only needs these moments. <br>
<br>
<section class = "section--algorithm">
    <div class = "algorithm--header"> Expectation Maximization Algorithm (for factor analysis)</div>
    <div class = "algorithm--content">
        <b>E-step</b> (Estimate the first and second moments of the conditional distribution, i.e. $E[z^{(i)}\vert  x^{(i)}]$ and $E[z^{(i)}z^{(i)\prime} \vert  x^{(i)}]$ using $\Lambda$ and $\Psi$)
        <blockquote>
        The variance is given by
            $$
            \mathrm{Var}[z^{(i)} \vert x^{(i)}] = \Sigma_{z\vert x^{(i)}} = I - \Lambda^T{(\Psi + \Lambda\Lambda^T)}^{-1}\Lambda.
            $$
            <br>
        <div class = "alert alert-block alert-info">
            Recall that $\mathrm{Var}[z] = \mathbb{E}[zz^\prime] - \mathbb{E}[z]\mathbb{E}[z^\prime]$, which implies that:
            $$
            E[z^{(i)}z^{(i)\prime} \vert x^{(i)}] = \mathrm{Var}[z^{(i)} \vert x^{(i)}] + \mathbb{E}[z^{(i)}\vert  x^{(i)}]\mathbb{E}[z^{(i)\prime}\vert  x^{(i)}]
            $$
        </div>
        Thus, only need to estimate the conditional mean $\mathbb{E}[z^{(i)}\vert  x^{(i)}]$:
            $$
            \hat \mu_{z\vert x^{(i)}} = \Lambda^T {(\Psi + \Lambda\Lambda^T)}^{-1}(x^{(i)} - \hat \mu).
            $$
        Estimate the mean $\mu$ once before the EM as the sample mean:
            $$
            \hat \mu = \frac{1}{m}\sum_{i=1}^m x^{(i)}.
            $$
        </blockquote>
        <b>M-step</b> (Update $\Lambda$ and $\Psi$)
        <blockquote>
        <div class= "alert alert-block alert-success">
        $$
        \begin{align}
        \max_{\Lambda, \Psi}& \sum_i \mathbb{E}_{z^{(i)}\vert x^{(i)}} \log\left[P(x^{(i)} \vert z^{(i)}; \Lambda, \Psi)\right] \\
        &= \sum_i \mathbb{E}_{z^{(i)}\vert x^{(i)}} \log\left[ (2\pi)^{-n/2} {\vert\Psi\vert}^{-1/2} \exp\left\{-\frac{1}{2}{(x^{(i)} - \Lambda z^{(i)})}^T\Psi^{-1}(x^{(i)} - \Lambda z^{(i)}) \right\}\right] \\
        &= -\frac{mn}{2}\log 2\pi - \frac{m}{2}\log \vert \Psi \vert - \frac{1}{2}\sum_i x^{(i)T}\Psi^{-1}x^{(i)} + \sum_i x^{(i)T}\Psi^{-1}\Lambda \mathbb{E}[z^{(i)}\vert x^{(i)}] - \frac{1}{2}\sum_i \mathrm{tr}\left[\Lambda^T\Psi^{-1}\Lambda\mathbb{E}[z^{(i)}z^{(i)T}\vert x^{(i)}]\right]
        \end{align}
        $$
        Solve the first order conditions (FOCs) w.r.t. $\Lambda$ and $\Psi$.
        </div> <br>
        $$
        \begin{align}
        \Lambda :=& \left(\sum^m_{i=1}x^{(i)}\mathbb{E}[z^{(i)T}\vert x^{(i)}]\right){\left(\sum^m_{i=1}\mathbb{E}[z^{(i)}z^{(i)T}\vert x^{(i)}]\right)}^{-1}  \\
        \Psi  :=& \mathrm{diag}\left(\frac{1}{m}\sum_i^m x^{(i)}x^{(i)T} - \frac{1}{m}\underbrace{\left(\sum_i^m x^{(i)}\mathbb{E}[z^{(i)T}\vert x^{(i)}]\right){\left(\sum^m_{i=1}\mathbb{E}[z^{(i)}z^{(i)T}\vert x^{(i)}]\right)}^{-1}}_{\Lambda^{\mathrm{new}}}\left(\sum_i^m \mathbb{E}[z^{(i)}\vert x^{(i)}]x^{(i)T}\right)\right)
        \end{align}
        $$
        where the $\mathrm{diag}$ operator sets all of the off-diagonal elements of a matrix to $0$.
        </blockquote>
        (Original paper: <a href = "https://mlg.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf">here</a>)
    </div>
</section>
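
The E-step moments and the closed-form M-step updates can be sketched in NumPy as follows. This is a minimal illustration: the function name `em_factor_analysis`, the random initialization of $\Lambda$, the initialization of $\Psi$ from the sample variances, and the fixed iteration count are assumptions. The data are demeaned once up front, so `Xc` plays the role of $x^{(i)}$ in the updates.

```python
import numpy as np

def em_factor_analysis(X, d, n_iter=500, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)                      # estimated once, before EM
    Xc = X - mu
    S = Xc.T @ Xc / m                        # (1/m) sum_i x x^T
    Lambda = rng.normal(size=(n, d))
    Psi = np.diag(np.diag(S))
    for _ in range(n_iter):
        # E-step: conditional moments of z given x.
        beta = Lambda.T @ np.linalg.inv(Psi + Lambda @ Lambda.T)   # (d, n)
        Ez = Xc @ beta.T                     # row i is E[z | x_i]
        cov_z = np.eye(d) - beta @ Lambda    # Var[z | x], the same for every i
        sum_Ezz = m * cov_z + Ez.T @ Ez      # sum_i E[z z^T | x_i]
        # M-step: closed-form updates for Lambda and Psi.
        sum_xEz = Xc.T @ Ez                  # sum_i x_i E[z | x_i]^T, shape (n, d)
        Lambda = sum_xEz @ np.linalg.inv(sum_Ezz)
        Psi = np.diag(np.diag(S - Lambda @ sum_xEz.T / m))
    return mu, Lambda, Psi

# Synthetic data drawn from a hypothetical factor model.
rng = np.random.default_rng(1)
m, n, d = 5000, 6, 2
Lambda_true = rng.normal(size=(n, d))
psi_true = rng.uniform(0.1, 0.5, size=n)
X = (rng.standard_normal((m, d)) @ Lambda_true.T
     + rng.standard_normal((m, n)) * np.sqrt(psi_true))

mu_hat, Lambda_hat, Psi_hat = em_factor_analysis(X, d)
model_cov = Lambda_hat @ Lambda_hat.T + Psi_hat
sample_cov = np.cov(X, rowvar=False, bias=True)
print(np.abs(model_cov - sample_cov).max())
```

The fitted $\Lambda$ is only determined up to a rotation of the factors, so the comparison is made on the implied covariance $\Lambda\Lambda^T + \Psi$ rather than on $\Lambda$ itself.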

<a name = "PCA"></a>
## Principal Component Analysis
<blockquote>
    The principal components of a set of data in $\mathbb{R}^p$ provide a sequence of best linear approximations to that data, of all ranks $q \leq p$. -- The Elements of Statistical Learning, p.534
    <div style = "text-align: center;">
    <img src="./images/PCA.jpeg" style="width:80%;" > <br>
    Source: The Elements of Statistical Learning, p.536
    </div>
</blockquote>


### Parameterization

Denote the observations by $x_1, x_2, \dots, x_N$, and consider the rank-$q$ linear model for representing them:
$$
f(\lambda) = \mu + \mathbf{V}_q\lambda
$$
where $\mu$ is a location vector in $\mathbb{R}^p$, $\mathbf{V}_q$ is a $p \times q$ matrix with $q$ orthogonal unit vectors as columns, and $\lambda$ is a $q$-vector of parameters (the $q$ principal components). $f(\lambda)$ is a point on the affine hyperplane of rank $q$ defined by $f(\cdot)$.

### Reconstruction error
The cost function of PCs:
$$
\min_{\mu,\{\lambda_i\}, \mathbf{V}_q} \sum_{i=1}^N \Vert x_i - \mu - \mathbf{V}_q\lambda_i\Vert^2
$$
Partial optimization:
- optimise for $\mu$ and $\lambda_i$ keeping $\mathbf{V}_q$ fixed.
- plug in solutions for $\mu$ and $\lambda_i$ and optimise for $\mathbf{V}_q$.

Solutions:
$$
\begin{align}
\hat \mu =& \bar x \\
\hat \lambda_i =& \mathbf{V}_q^T(x_i - \bar x)
\end{align}
$$
and
$$
\mathbf{H}_q = \mathbf{V}_q \mathbf{V}_q^T
$$
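is a projection matrix, and maps each point $x_i$ onto the subspace spanned by the columns of $\mathbf{V}_q$. $\mathbf{V}_q$ consists of the first $q$ columns of $\mathbf{V}$ in the singular value decomposition $\mathbf{X}=\mathbf{U}\mathbf{D}\mathbf{V}^T$, where the rows of $\mathbf{X}$ are the centered observations $x_i - \bar x$.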

### Singular value decomposition
Consider an $N \times p$ matrix $\mathbf{X}$:
$$\mathbf{X} =\mathbf{U}\mathbf{D}\mathbf{V}^T $$
where $\mathbf{U}$ is an $N \times p$ orthogonal matrix ($\mathbf{U}^T\mathbf{U} = \mathbf{I}_p$) whose columns $\mathbf{u}_j$ are called the *left singular vectors*; $\mathbf{V}$ is a $p \times p$ orthogonal matrix with columns $\mathbf{v}_j$ called the *right singular vectors*, and $\mathbf{D}$ is a $p \times p$ diagonal matrix, with diagonal elements $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$ known as the *singular values*. The columns of $\mathbf{UD}$ are called the principal components of $\mathbf{X}$.

Conveniently, the $k$-th principal component is:
$$
\mathbf{X} \mathbf{v}_k = \mathbf{u}_k d_k
$$
(using the orthogonality of $\mathbf{V}$).
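
The relationships between the SVD, the principal components, and the rank-$q$ reconstruction can be checked with NumPy. The data below are synthetic and the variable names are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 200, 5, 2

# Synthetic correlated data; the rows are observations.
X_raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
x_bar = X_raw.mean(axis=0)
X = X_raw - x_bar                          # centered data matrix

# Thin SVD: U is (N, p), d holds the singular values, Vt is V^T with shape (p, p).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

pcs = U * d                                # columns of UD: the principal components
V_q = V[:, :q]
lam = X @ V_q                              # row i is V_q^T (x_i - x_bar)
X_hat = x_bar + lam @ V_q.T                # rank-q reconstruction of each x_i
H_q = V_q @ V_q.T                          # projection matrix

recon_error = np.sum((X_raw - X_hat) ** 2)
print(recon_error, np.sum(d[q:] ** 2))     # the two numbers agree
```

The reconstruction error equals the sum of the squared singular values that were discarded, which is what makes the singular values a natural measure of how much each component contributes.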

### Evaluation metrics
1. variation of data captured by the first few PCs.
2. first few singular values in comparison with those obtained for equivalent uncorrelated data.
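
Both metrics can be computed from the singular values alone. In this sketch the "equivalent uncorrelated data" are produced by permuting each column independently, which keeps every marginal distribution but destroys the correlations; that construction, the synthetic data, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 6

# Synthetic data with two strong latent directions plus noise.
latent = rng.normal(size=(N, 2))
X = latent @ rng.normal(size=(2, p)) + 0.3 * rng.normal(size=(N, p))
X = X - X.mean(axis=0)

# Metric 1: share of total variation captured by each principal component.
d = np.linalg.svd(X, compute_uv=False)
explained = d ** 2 / np.sum(d ** 2)
cumulative = np.cumsum(explained)
print(cumulative)

# Metric 2: singular values of a column-wise permuted (uncorrelated) copy.
X_perm = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
d_perm = np.linalg.svd(X_perm, compute_uv=False)
print(d[:3], d_perm[:3])
```

Permuting within columns leaves the total sum of squares unchanged, so any excess of the leading singular values over the permuted ones reflects correlation structure rather than scale.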

<div style = "text-align: center;">
    <img src="./images/Handwritten digits.png" style="width:80%;" > <br>
    Applying PCA to the MNIST handwritten digits dataset.
</div>

<a name = "ICA"></a>
## Independent Component Analysis

Survey paper: <a href = "https://www.cs.jhu.edu/~ayuille/courses/Stat161-261-Spring14/HyvO00-icatut.pdf">Independent Component Analysis: A Tutorial (Hyvarinen and Oja)</a> 

### A latent variables model
Recall the singular value decomposition:
$$
\mathbf{X} = \mathbf{UDV}^T
$$
Write $\mathbf{S} = \sqrt{N}\mathbf{U}$ and $\mathbf{A}^T = \mathbf{DV}^T/\sqrt{N}$ and we have a latent variables model:
$$
\mathbf{X} = \mathbf{SA}^T
$$
Assuming that the columns of $\mathbf{X}$ (and hence $\mathbf{U}$) have mean zero, this implies that the columns of $\mathbf{S}$ have *zero mean*, are *uncorrelated* and have *unit variance*. Consider a single observation $X$ of size $p \times 1$ (i.e. $X = \left[\begin{array}{ccc} X_1 & \cdots & X_p\end{array}\right]^T$).
$$
X = \mathbf{A}S
$$
where $S$ is $p \times 1$. Alternatively, it can be expressed as a system of equations:
$$
\begin{array}{ccc}
X_1 &=& a_{11} S_1 + a_{12} S_2 + \cdots + a_{1p}S_p \\
X_2 &=& a_{21} S_1 + a_{22} S_2 + \cdots + a_{2p}S_p \\
\vdots & &\vdots \\
X_p &=& a_{p1} S_1 + a_{p2} S_2 + \cdots + a_{pp}S_p \\
\end{array}
$$

### Identifiability issue
Consider any orthogonal $p \times p$ matrix $\mathbf{R}$.
$$
\begin{align}
 X =& \mathbf{A}S \\
   =& \mathbf{AR}^T\mathbf{R}S \\
   =& \mathbf{A}^*S^*
\end{align}
$$
and
$$
\mathrm{Cov}(S^*) = \mathbf{R}\mathrm{Cov}(S)\mathbf{R}^T = \mathbf{I} = \mathrm{Cov}(S)
$$
Therefore, there are many such decompositions and it is impossible to identify any particular latent variables as unique underlying sources.
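
A small numerical check makes the rotation argument concrete. The mixing matrix, the rotation, and the Gaussian sources below are hypothetical; the random orthogonal matrix is obtained from a QR decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_samples = 3, 10_000

A = rng.normal(size=(p, p))                     # hypothetical mixing matrix
R, _ = np.linalg.qr(rng.normal(size=(p, p)))    # a random orthogonal matrix
S = rng.standard_normal((p, n_samples))         # sources with Cov(S) = I, one draw per column
X = A @ S

# Rotated decomposition: A* = A R^T and S* = R S reproduce the same X.
A_star = A @ R.T
S_star = R @ S

print(np.abs(A_star @ S_star - X).max())        # zero up to rounding
print(np.round(np.cov(S_star), 2))              # still close to the identity
```

From second moments alone, `(A, S)` and `(A_star, S_star)` are indistinguishable, which is the sense in which the latent variables are not identified.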

### Classical factor analysis model
$$
X = \mathbf{A}_{p \times q} S_{q \times 1} + \mathbf{\epsilon}_{p \times 1}
$$
where the $\epsilon_{j}$ are uncorrelated zero-mean disturbances. The identifiability issue remains.

### The independent component analysis model
$$
X = \mathbf{A}S
$$
where $S_i$ are assumed to be **statistically independent** rather than uncorrelated. To avoid the identifiability issue, we need to assume that the $S_i$ are also **non-Gaussian**. (Note: multivariate Gaussian is determined up to its second moments.)
<blockquote>
ICA is able to perform <em>blind source separation</em> by exploiting the independence and non-Gaussianity of the original sources. -- The Elements of Statistical Learning, p. 561
</blockquote>

### Approaches to ICA

#### Differential entropy and mutual information
Goals:
- Minimize the mutual information (*Kullback-Leibler distance* between the density of a random vector and its independence counterpart).
- Minimize the differential entropy (maximise the non-Gaussianity).

Differential entropy of a random variable $Y$:

$$
H(Y) = - \int g(y)\log g(y) dy
$$

<div class = "alert alert-block alert-info"><b>Theorem (from information theory):</b> For a given variance, differential entropy is maximized by the normal distribution: a Gaussian random variable has the largest entropy amongst all random variables of equal variance. Equivalently, the maximum entropy distribution under constraints on the mean and variance is the Gaussian. <br> Proof: <a href="https://en.wikipedia.org/wiki/Differential_entropy">here</a>.</div>
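
The theorem can be illustrated with the closed-form differential entropies (in nats) of three distributions that share the same variance $\sigma^2$. The choice of the uniform and Laplace distributions as comparisons is an assumption made for this example.

```python
import numpy as np

sigma = 1.0

# Gaussian with variance sigma^2: H = 0.5 * log(2 * pi * e * sigma^2).
h_gaussian = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

# Uniform with variance sigma^2 has width sqrt(12) * sigma: H = log(width).
h_uniform = np.log(np.sqrt(12.0) * sigma)

# Laplace with variance sigma^2 has scale b = sigma / sqrt(2): H = 1 + log(2b).
h_laplace = 1.0 + np.log(2.0 * sigma / np.sqrt(2.0))

print(h_gaussian, h_laplace, h_uniform)   # approximately 1.419, 1.347, 1.242
```

Both non-Gaussian distributions fall short of the Gaussian entropy, and the size of the shortfall is exactly what negentropy measures.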

Mutual information between the components of a random vector $Y$:
$$
I(Y) = \sum_{j=1}^p H(Y_j) - H(Y)
$$
If $X$ has covariance $\mathbf{I}$, and $S = \mathbf{A}^T X$ with $\mathbf{A}$ orthogonal (orthogonality is implied by $X$ having covariance $\mathbf{I}$; see ESLII, p.560), then
$$
\begin{align}
I(S) =& \sum_{j=1}^p H(S_j) - H(X) - \log \vert \mathrm{det} \mathbf{A}\vert \\
 =& \sum_{j=1}^p H(S_j) - H(X)
\end{align}
$$
- Finding an $\mathbf{A}$ to minimise $I(S) = I(\mathbf{A}^T X)$ looks for the orthogonal transformation that leads to the most independence between its components.
- This is equivalent to minimising the sum of the entropies of the separate components of $S$.
- This amounts to maximising their departures from Gaussianity.

#### Negentropy
Goal:
- Maximise the departure of $S_j$ from Gaussianity.

The negentropy measure:

$$
J(Y_j) = H(Z_j) - H(Y_j)
$$

where $Z_j$ is a Gaussian random variable with the same variance as $Y_j$. Note that negentropy is non-negative, and measures the departure of $Y_j$ from Gaussianity. A simple approximation to negentropy was proposed by <a href = "https://www.cs.jhu.edu/~ayuille/courses/Stat161-261-Spring14/HyvO00-icatut.pdf">Hyvarinen and Oja</a>:

$$
J(Y_j) \approx [\mathrm{E}G(Y_j) - \mathrm{E}G(Z_j)]^2,
$$
where $G(x) = \frac{1}{a}\log\cosh(ax)$ for $1\leq a \leq 2$. When applying this approximation to a sample of data, replace the expectations with sample averages. 
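
The approximation can be sketched with sample averages as follows. The function name `negentropy_approx` is hypothetical, and two choices are assumptions of this sketch rather than part of the formula: the sample is standardized so that the Gaussian reference $Z_j$ is standard normal, and $\mathrm{E}G(Z_j)$ is estimated by Monte Carlo from a large reference draw.

```python
import numpy as np

def negentropy_approx(y, a=1.0, n_ref=200_000, seed=0):
    """[E G(Y) - E G(Z)]^2 with G(u) = log(cosh(a u)) / a, using sample averages."""
    y = (y - y.mean()) / y.std()                 # zero mean, unit variance
    z = np.random.default_rng(seed).standard_normal(n_ref)   # Gaussian reference sample

    def G(u):
        return np.log(np.cosh(a * u)) / a

    return (G(y).mean() - G(z).mean()) ** 2

rng = np.random.default_rng(1)
y_gauss = rng.standard_normal(50_000)            # Gaussian sample
y_laplace = rng.laplace(size=50_000)             # heavier-tailed, non-Gaussian sample

J_gauss = negentropy_approx(y_gauss)
J_laplace = negentropy_approx(y_laplace)
print(J_gauss, J_laplace)
```

The Gaussian sample scores close to zero while the Laplace sample scores clearly higher, which is the contrast an ICA procedure seeks to maximise for each recovered component.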

<blockquote>
More classical (and less robust) measures are based on fourth moments, and hence look for departures from the Gaussian via kurtosis. See <a href = "https://www.cs.jhu.edu/~ayuille/courses/Stat161-261-Spring14/HyvO00-icatut.pdf">Hyvarinen and Oja</a> for more details. -- The Elements of Statistical Learning, p. 562
</blockquote>