# Unsupervised Learning
Unsupervised learning is most often associated with clustering, but it also covers density estimation and spatial transformation (feature extraction).  
- Clustering : Group samples that lie close together in the feature space.
- Density estimation : Estimate the probability distribution from data.
- Spatial transformation : Convert the original feature space in which the data is defined into a lower- or higher-dimensional space.


## Clustering
$$
\begin{array}{lcl}
c_i \ne \varnothing, \quad i = 1, 2, ..., k \\
\bigcup_{i=1}^k c_i = \mathbb{X} \\
c_i \cap c_j = \varnothing, \quad i \ne j
\end{array}
$$  
Clustering is the task of finding a set of clusters $C = \{c_1, c_2, ..., c_k\}$ that satisfies the above conditions given a training set $\mathbb{X} = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\}$. The first condition says that every cluster must contain at least one sample. The second and third together say that every sample belongs to exactly one cluster. This is called hard clustering. The number of clusters $k$ may be given, but in many cases it has to be estimated as well. Clustering is also called a class discovery task because the number of clusters can be regarded as the number of classes.  
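
The three conditions can be checked mechanically. The following is a minimal sketch, assuming each cluster is stored as a Python set of sample indices `0..n-1`; the function name `is_hard_clustering` is made up for this illustration.

```python
def is_hard_clustering(clusters, n):
    """Check the three hard-clustering conditions for clusters of sample indices 0..n-1."""
    union = set().union(*clusters)
    non_empty = all(len(c) > 0 for c in clusters)           # every c_i is non-empty
    covers_all = union == set(range(n))                     # the union of all c_i is X
    disjoint = sum(len(c) for c in clusters) == len(union)  # no sample is in two clusters
    return non_empty and covers_all and disjoint

valid = is_hard_clustering([{0, 1}, {2}], n=3)
overlapping = is_hard_clustering([{0, 1}, {1, 2}], n=3)
```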


### k-means algorithm
The k-means algorithm has the advantage of being intuitively easy to understand and easy to implement. However, the disadvantage is that you must specify the number of clusters $k$. 
```
1. Initialize k cluster centers. If you have prior knowledge of the data distribution, use it.
2. Assign each of the n samples to the nearest cluster center.
3. Update each cluster center with the average of the samples newly assigned to it.
4. Repeat 2~3 until the cluster assignment is the same as in the previous iteration.
```  
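
The four steps above can be sketched in plain Python. This is an illustrative version, not a tuned implementation: it assumes samples are tuples of floats, uses Euclidean distance, and keeps the old center when a cluster ends up empty (a choice the steps above do not specify).

```python
import math

def kmeans(samples, initial_centers, max_iter=100):
    """k-means on a list of equal-length tuples; returns (centers, assignment)."""
    centers = list(initial_centers)          # step 1: initial centers supplied by the caller
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each sample to the nearest center
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in samples
        ]
        # Step 4: stop when the assignment no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned samples
        for j in range(len(centers)):
            members = [x for x, a in zip(samples, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assignment = kmeans(samples, [(0.0, 0.0), (10.0, 10.0)])
```
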
The k-medoids algorithm instead draws a representative from the samples themselves and uses it as the updated cluster center. There are several ways to select the representative; a common one is to choose the sample whose sum of distances to the other samples in the cluster is smallest. As a result, the k-medoids algorithm is less sensitive to noise than the k-means algorithm.  
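
A small sketch of the representative selection described above, assuming Euclidean distance and a cluster given as a list of tuples (the name `medoid` is chosen here for illustration). The example cluster contains one outlier so the medoid can be compared with the mean.

```python
import math

def medoid(members):
    """Return the member whose summed distance to all other members is smallest."""
    return min(members, key=lambda c: sum(math.dist(c, x) for x in members))

cluster = [(0.0,), (1.0,), (2.0,), (3.0,), (100.0,)]   # 100.0 acts as a noisy sample
representative = medoid(cluster)
mean = sum(x[0] for x in cluster) / len(cluster)
```

Here the mean is dragged towards the outlier, while the medoid stays inside the dense group of samples.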


$$
J(Z,\mathbf{A}) = \sum_{i=1}^n \sum_{j=1}^k a_{ji} dist(\mathbf{x}_i, \mathbf{z}_j)
$$  
This is the objective function of the k-means algorithm. $Z = \{\mathbf{z}_1, \mathbf{z}_2, ..., \mathbf{z}_k\}$ is the set of cluster centers, and $\mathbf{A}$ is a $k \times n$ matrix holding the sample assignment information. If the $i^{th}$ sample is assigned to the $j^{th}$ cluster, then $a_{ji}$ is 1, otherwise 0. $dist(\mathbf{x}_i, \mathbf{z}_j)$ is a function that measures the distance between $\mathbf{x}_i$ and $\mathbf{z}_j$. The k-means algorithm can be seen as an algorithm for solving an optimization problem: each pass through the loop updates the solution in a direction that decreases the value of the objective function. It converges from any initial cluster centers, but different initial centers can lead to different final results, so the solution found is only a local optimum. The multi-start k-means algorithm, which runs k-means from several initializations and keeps the best result, mitigates this problem.  
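
To make the dependence on initialization concrete, the sketch below evaluates $J$ for two different converged solutions on the same four samples and keeps the better one, as a multi-start scheme would. It assumes $dist$ is the squared Euclidean distance (the choice for which the mean update minimizes $J$) and encodes the matrix $\mathbf{A}$ as a list where `assignment[i] = j` stands for $a_{ji} = 1$.

```python
def squared_distance(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))

def kmeans_objective(samples, centers, assignment, dist=squared_distance):
    """J(Z, A): sum of distances between each sample and its assigned center."""
    return sum(dist(x, centers[j]) for x, j in zip(samples, assignment))

samples = [(0, 0), (0, 2), (8, 0), (8, 2)]
# Two solutions that k-means cannot improve further, reached from different starts
left_right = ([(0, 1), (8, 1)], [0, 0, 1, 1])
top_bottom = ([(4, 0), (4, 2)], [0, 1, 0, 1])

j_left_right = kmeans_objective(samples, *left_right)
j_top_bottom = kmeans_objective(samples, *top_bottom)
# Multi-start: keep the run with the smallest objective value
best = min([left_right, top_bottom], key=lambda run: kmeans_objective(samples, *run))
```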


### Affinity propagation algorithm
The affinity propagation algorithm computes two kinds of affinity matrices, the responsibility matrix $\mathbf{R}$ and the availability matrix $\mathbf{A}$, from the similarity between samples, and finds clusters using this affinity information. The input is a training set $\mathbb{X} = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\}$.   


$$
s_{ik} = - \lVert \mathbf{x}_i - \mathbf{x}_k \rVert_2^2, \quad i \ne k \text{ and } i, k = 1, 2, ..., n
$$
$s_{ik}$ is the similarity of two samples $i$ and $k$. The closer the two samples are, the larger its value.  


$$r_{ik} = s_{ik} - \max_{k' \ne k} (a_{ik'} + s_{ik'})$$
$r_{ik}$ is an element of the responsibility matrix $\mathbf{R}$. It grows with the similarity between $i$ and $k$, and the closer $i$ is to samples other than $k$, the larger the value that is subtracted.  


$$a_{ik} = \min \left(0, r_{kk} + \sum_{i' \notin \{i, k\}} \max (0, r_{i'k})\right), \quad i \ne k$$
$a_{ik}$ is an element of the availability matrix $\mathbf{A}$. $r_{kk}$ can be interpreted as how strongly sample $k$ qualifies itself as a representative. The larger this value, the larger $a_{ik}$. Moreover, the larger $r_{i'k}$, the larger $a_{ik}$.  


In the affinity propagation algorithm, the self-similarity $s_{kk}$ and the self-affinities $r_{kk}$ and $a_{kk}$, where both indices refer to the same sample, play an important role. After all the similarities between different samples have been calculated, $s_{kk}$ is typically set to the minimum, median, or maximum of those similarities. The minimum value produces fewer clusters, while the maximum value produces more clusters, so $s_{kk}$ is a hyperparameter. $r_{kk}$ is obtained with the same equation as for $r_{ik}$: $r_{kk} = s_{kk} - \max_{k' \ne k} (a_{kk'} + s_{kk'})$. $a_{kk}$ is given by $a_{kk} = \sum_{i' \ne k} \max(0, r_{i'k})$.  
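
The update rules for $\mathbf{R}$ and $\mathbf{A}$ can be sketched as an iterative loop. Several details below are assumptions that the text above does not state: the matrices start at zero, the updates are damped (`damping=0.5`) to reduce oscillation, a fixed number of iterations is used instead of a convergence test, and each sample $i$ finally picks the representative $k$ that maximizes $a_{ik} + r_{ik}$. The median of the similarities is used for $s_{kk}$.

```python
from statistics import median

def affinity_propagation(S, n_iter=200, damping=0.5):
    """S is an n x n similarity matrix whose diagonal holds the self-similarity s_kk."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for _ in range(n_iter):
        # r_ik = s_ik - max_{k' != k} (a_ik' + s_ik'), also used for i == k
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][q] + S[i][q] for q in range(n) if q != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a_ik = min(0, r_kk + sum of positive r_i'k), and a_kk = sum of positive r_i'k
        for i in range(n):
            for k in range(n):
                support = sum(max(0.0, R[p][k]) for p in range(n) if p not in (i, k))
                new = support if i == k else min(0.0, R[k][k] + support)
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Each sample chooses the representative with the largest a_ik + r_ik
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
n = len(points)
S = [[-(points[i] - points[k]) ** 2 for k in range(n)] for i in range(n)]
preference = median(S[i][k] for i in range(n) for k in range(n) if i != k)
for k in range(n):
    S[k][k] = preference          # self-similarity s_kk (hyperparameter)
labels = affinity_propagation(S)
```

Each entry of `labels` is the index of the representative chosen by that sample, so samples sharing a value belong to the same cluster.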


## Density estimation
Density estimation is the task of finding a probability density function $P(\mathbf{x})$ from a given dataset $\mathbb{X}$.

### Kernel density estimation
The simplest method is the histogram method: divide each axis into sections so that the feature space is split into a set of bins, and then count the frequency of the samples in each bin.  
$$P(\mathbf{x}) = {bin(\mathbf{x}) \over n}$$  
Here $bin(\mathbf{x})$ is the number of samples in the bin that contains $\mathbf{x}$. The histogram method is simple to understand but has several serious problems. First, the probability density function $P(\mathbf{x})$ has a step shape. Second, it is very sensitive to the size and position of the bins.  
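
The sensitivity to bin position is easy to reproduce. This sketch assumes a 1-dimensional feature space with bins of equal width starting at a chosen `origin`; the names `histogram_estimate`, `origin`, and `width` are illustrative.

```python
def histogram_estimate(x, samples, origin, width):
    """P(x) = bin(x) / n for 1-D bins [origin + m*width, origin + (m+1)*width)."""
    def bin_index(v):
        return int((v - origin) // width)
    target = bin_index(x)
    return sum(bin_index(s) == target for s in samples) / len(samples)

samples = [0.1, 0.2, 0.9, 1.5]
p_a = histogram_estimate(0.5, samples, origin=0.0, width=1.0)    # bins [0, 1), [1, 2), ...
p_b = histogram_estimate(0.5, samples, origin=-0.5, width=1.0)   # same bins shifted by 0.5
```

The query point and the data are identical in both calls; only the bin position differs, yet the estimate changes from 0.75 to 0.25.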

Kernel density estimation can solve these problems of the histogram method.  


$$
P_h(\mathbf{x}) = {1 \over n} \sum_{i=1}^n K_h(\mathbf{x} - \mathbf{x}_i) = {1 \over nh^d} \sum_{i=1}^n K({\mathbf{x} - \mathbf{x}_i \over h}), \quad
K_h(\mathbf{x}) = {1 \over h^d} K({\mathbf{x} \over h})
$$  

$K$ is a standard kernel function, and $K_h$ is the kernel rescaled by $h$. The probability density is written $P_h$ because the result depends on the bandwidth $h$.  
<img src="./img/4_K_and_Kh.png" width="35%" height="35%">  
The figure above is for a 1-dimensional feature space. A uniform function with a value of 1 on the range $[-0.5, 0.5]$ was taken as the standard kernel $K$. The wider kernel in the figure corresponds to a bandwidth $h > 1$. If $h < 1$, the kernel would be narrow and tall. The bandwidth $h$ is a hyperparameter, and it is important to set it to a proper value.  

Kernel density estimation still has some fundamental problems. First, it needs memory proportional to the size of the training set $\mathbb{X}$, because every sample must be kept. Second, each time a new sample is given, the equation must be evaluated again, which takes $\Theta(nd)$ time, where $d$ is the number of dimensions of the data. Last, the higher the dimension of the feature space, the sparser the data (curse of dimensionality).  
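
A minimal sketch of the estimator $P_h$ for $d = 1$, using the uniform kernel on $[-0.5, 0.5]$ described above. The function names are illustrative. Note that every query loops over all stored samples, which is the memory and time cost mentioned in the text.

```python
def uniform_kernel(u):
    """Standard kernel K: value 1 on [-0.5, 0.5], 0 elsewhere."""
    return 1.0 if -0.5 <= u <= 0.5 else 0.0

def kde(x, samples, h, kernel=uniform_kernel):
    """P_h(x) = 1 / (n * h^d) * sum_i K((x - x_i) / h) with d = 1."""
    return sum(kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

samples = [0.0, 1.0, 2.0]
wide = kde(1.0, samples, h=2.0)     # all three samples fall inside the kernel
narrow = kde(1.0, samples, h=1.0)   # only the sample at 1.0 contributes
```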

### Gaussian Mixture
Here we assume that the data follow a distribution of a particular shape, usually a Gaussian distribution.  


$$
P(\mathbf{x}) = N(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma}) = {1 \over \sqrt{\left\vert \mathbf{\Sigma} \right\vert} \sqrt{(2\pi)^d}}\exp\left(-{1 \over 2}(\mathbf{x} - \mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu})\right) \\
\mathbf{\mu} = {1 \over n} \sum_{i=1}^n \mathbf{x}_i, \quad \mathbf{\Sigma} = {1 \over n} \sum_{i=1}^n (\mathbf{x}_i - \mathbf{\mu})(\mathbf{x}_i - \mathbf{\mu})^T
$$  
The Gaussian method calculates the mean vector $\mathbf{\mu}$ and the covariance matrix $\mathbf{\Sigma}$ once from the training set, and then evaluates the probability density without the training set. Computing $\mathbf{\mu}$ takes $\Theta(nd)$ time and computing $\mathbf{\Sigma}$ takes $\Theta(nd^2)$ time. Also, if the data is $d$-dimensional, only $d^2+d$ values need to be stored, because $\mathbf{\mu}$ has $d$ elements and $\mathbf{\Sigma}$ has $d^2$ elements. The Gaussian method is a parametric method because it defines the probability distribution with a small number of parameters. Kernel density estimation, on the other hand, is a nonparametric method because it does not describe the distribution with such a fixed set of parameters.  
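
As a small illustration of the parametric approach, the sketch below fits a Gaussian for the special case $d = 1$, where $\mathbf{\Sigma}$ reduces to a scalar variance. The restriction to one dimension and the function names are assumptions made to keep the example short. After fitting, the density is evaluated from two stored numbers only.

```python
import math

def fit_gaussian(samples):
    """Estimate the mean and the variance (1/n normalisation) of 1-D samples."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    """N(x; mu, var) for d = 1; note the negative sign in the exponent."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

mu, var = fit_gaussian([1.0, 2.0, 3.0, 4.0, 5.0])
density_at_mean = gaussian_pdf(mu, mu, var)
```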

With a single Gaussian, there are many situations where the estimation error is large. A mixture of several Gaussians is therefore used to represent the probability density.  
$$
P(\mathbf{x}) = \sum_{j=1}^k \pi_j N(\mathbf{x};\mathbf{\mu}_j, \mathbf{\Sigma}_j)
$$  
Each $N(\mathbf{x};\mathbf{\mu}_j, \mathbf{\Sigma}_j)$ is called a component distribution, and the coefficient $\pi_j$ is called a mixture coefficient. The coefficients satisfy $0 \le \pi_j \le 1$ and $\textstyle \sum_{j=1}^k \pi_j = 1$.  


Therefore,  
* Given data : training set $\mathbb{X} = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\}$, the number of Gaussians $k$.  
* Parameter set to be estimated : $\Theta = \{\mathbf{\pi} = (\pi_1, \pi_2, ..., \pi_k), (\mathbf{\mu}_1, \mathbf{\Sigma}_1, \mathbf{\mu}_2, \mathbf{\Sigma}_2, ..., \mathbf{\mu}_k, \mathbf{\Sigma}_k)\}$  

Use maximum likelihood to optimize the problem.  
$$
P(\mathbb{X}\mid\Theta) = \prod_{i=1}^n P(\mathbf{x}_i\mid\Theta) = \prod_{i=1}^n \left(\sum_{j=1}^k \pi_j N(\mathbf{x}_i;\mathbf{\mu}_j, \mathbf{\Sigma}_j)\right) \\
\log P(\mathbb{X}\mid\Theta) = \sum_{i=1}^n \log \left(\sum_{j=1}^k \pi_j N(\mathbf{x}_i;\mathbf{\mu}_j, \mathbf{\Sigma}_j)\right)
$$  

So,  
$$\widehat{\Theta} = \underset{\Theta}{\text{argmax}}\log P(\mathbb{X}\mid\Theta)$$  
It can be solved by the EM algorithm.  

### EM algorithm
<img src="./img/4_EM_algorithm.png" width="40%" height="40%">
Once the Gaussians have been improved, the improved Gaussians can in turn improve the membership information of the samples, and the improved membership information updates the Gaussians more accurately. The method of reaching a convergence point by alternating these two processes is called the EM algorithm. The E step computes the membership $z_{ji}$ of sample $i$ in Gaussian $j$, and the M step re-estimates the parameters from these memberships.  


$$
z_{ji} = {\pi_j N(\mathbf{x}_i;\mathbf{\mu}_j, \mathbf{\Sigma}_j) \over \textstyle \sum_{q=1}^k \pi_q N(\mathbf{x}_i;\mathbf{\mu}_q, \mathbf{\Sigma}_q)}
$$  
$$
\begin{cases}
\mathbf{\mu}_j = {1 \over n_j} \sum_{i=1}^n z_{ji}\mathbf{x}_i \\
\mathbf{\Sigma}_j = {1 \over n_j} \sum_{i=1}^n z_{ji}(\mathbf{x}_i - \mathbf{\mu}_j)(\mathbf{x}_i - \mathbf{\mu}_j)^T \\
\pi_j = {n_j \over n} \\
n_j = \sum_{i=1}^n z_{ji}
\end{cases}
$$  
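
The two update equations can be sketched for the 1-dimensional case, where each $\mathbf{\Sigma}_j$ is a scalar variance. The restriction to $d = 1$, the hand-picked initial parameters, and the fixed number of 20 iterations are assumptions of this example. The log-likelihood is recorded after every step to show that it does not decrease.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(samples, pis, mus, variances):
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in samples
    )

def em_step(samples, pis, mus, variances):
    k, n = len(pis), len(samples)
    # E step: z[j][i] is the membership of sample i in Gaussian j
    z = [[0.0] * n for _ in range(k)]
    for i, x in enumerate(samples):
        weighted = [pis[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
        total = sum(weighted)
        for j in range(k):
            z[j][i] = weighted[j] / total
    # M step: re-estimate n_j, mu_j, Sigma_j (a variance here) and pi_j
    n_j = [sum(z[j]) for j in range(k)]
    new_mus = [sum(zi * x for zi, x in zip(z[j], samples)) / n_j[j] for j in range(k)]
    new_vars = [
        sum(zi * (x - new_mus[j]) ** 2 for zi, x in zip(z[j], samples)) / n_j[j]
        for j in range(k)
    ]
    new_pis = [n_j[j] / n for j in range(k)]
    return new_pis, new_mus, new_vars

samples = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
pis, mus, variances = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
history = [log_likelihood(samples, pis, mus, variances)]
for _ in range(20):
    pis, mus, variances = em_step(samples, pis, mus, variances)
    history.append(log_likelihood(samples, pis, mus, variances))
```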

## Spatial transformation
Spatial transformation consists of encoding and decoding processes. The process of converting the original space into another space is called encoding ($f$), and the process of mapping the transformed space back to the original one is called decoding ($g$).  


$$\hat{\mathbf{x}} = g(f(\mathbf{x}))$$  

In a linear factor model, a factor (also called a latent variable or hidden variable) is a variable that is not directly observed. The linear factor model uses linear operations to convert observed data into factors.  


$$
\begin{array}{lcl}
f: \mathbf{z} = \mathbf{W}_{enc}\mathbf{x} + \mathbf{\alpha}_{enc} \\
g: \hat{\mathbf{x}} = \mathbf{W}_{dec}\mathbf{z} + \mathbf{\alpha}_{dec}
\end{array}
$$  

Because the linear factor model uses only linear operations, it can be written with matrix products. $\mathbf{W}_{enc}$ is a $q \times d$ matrix, $\mathbf{W}_{dec}$ is $d \times q$, $\mathbf{\alpha}_{enc}$ is $q \times 1$, and $\mathbf{\alpha}_{dec}$ is $d \times 1$.  

### PCA(Principal Component Analysis)
First, move the data so that it is centered at the origin. $\mathbf{\mu}$ is the mean of the samples in the training set.  


$$
\mathbf{x}_i = \mathbf{x}_i - \mathbf{\mu}, \quad i = 1, 2, ..., n \\
\mathbf{\mu} = {1 \over n} \sum_{i=1}^n \mathbf{x}_i
$$  
Then transform the data. Each element of $\mathbf{z}$ is a dot product that projects $\mathbf{x}$ onto the axis pointed to by $\mathbf{u}_j$.  


$$
\mathbf{z} = \mathbf{W}_{enc}\mathbf{x} \\
\mathbf{W}_{enc} = (\mathbf{u}_1, \mathbf{u}_2, ..., \mathbf{u}_q)^T, \text{ and }\mathbf{u}_j = (u_{1j}, u_{2j}, ..., u_{dj})^T
$$  
Converting high-dimensional data to lower dimensions results in information loss. The less information is lost, the better the axis. The purpose of PCA is therefore to find a transformation matrix $\mathbf{W}_{enc}$ that transforms to the lower dimension with minimal information loss. PCA takes the view that the greater the variance ($\sigma^2$) of the transformed training set $\mathbb{Z} = \{\mathbf{z}_1, \mathbf{z}_2, ..., \mathbf{z}_n\}$, the smaller the information loss.  


$$
\sigma^2 = {1 \over n} \sum_{i=1}^n (z_i - \bar{z})^2 = {1 \over n} \sum_{i=1}^n z_i^2 = {1 \over n}\sum_{i=1}^n (\mathbf{u}^T\mathbf{x}_i)^2 = \mathbf{u}^T\mathbf{\Sigma}\mathbf{u}
$$  
Because $\mathbf{u}$ is a unit vector, $\mathbf{u}^T\mathbf{u} = 1$, the constrained problem can be written with a Lagrange function $L(\mathbf{u})$. The problem becomes finding the $\mathbf{u}$ that maximizes $L(\mathbf{u}) = \mathbf{u}^T\mathbf{\Sigma}\mathbf{u} + \lambda(1 - \mathbf{u}^T\mathbf{u})$.  


$$
\frac {\partial L(\mathbf{u})}{\partial \mathbf{u}} = 2\mathbf{\Sigma}\mathbf{u} - 2\lambda\mathbf{u} = 0 \\
\mathbf{\Sigma}\mathbf{u} = \lambda\mathbf{u}
$$  
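Solving the above equation gives $d$ eigenvalues and eigenvectors of $\mathbf{\Sigma}$. These eigenvectors, the principal components, are orthogonal to each other. The larger the eigenvalue, the more information is retained (larger variance), so the $q$ eigenvectors with the largest eigenvalues are used as $\mathbf{u}_1, ..., \mathbf{u}_q$.  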
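
The sketch below walks through the PCA steps on a tiny 2-dimensional dataset: centering, building $\mathbf{\Sigma}$, and finding the eigenvector with the largest eigenvalue. To stay within the standard library it uses power iteration instead of a full eigendecomposition; that is a substitute technique chosen for this example, it only returns the first principal component, and it assumes the starting vector is not orthogonal to that component.

```python
import math

def covariance(samples):
    """Center the samples and return (mu, Sigma) with 1/n normalisation."""
    n, d = len(samples), len(samples[0])
    mu = [sum(x[a] for x in samples) / n for a in range(d)]
    centered = [[x[a] - mu[a] for a in range(d)] for x in samples]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    return mu, cov

def top_eigenvector(cov, n_iter=200):
    """Power iteration: repeatedly apply Sigma and rescale u to unit length."""
    d = len(cov)
    u = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        v = [sum(cov[a][b] * u[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        u = [c / norm for c in v]
    # The variance along u is u^T Sigma u, which equals the eigenvalue
    eigenvalue = sum(u[a] * sum(cov[a][b] * u[b] for b in range(d)) for a in range(d))
    return eigenvalue, u

samples = [(1.0, -3.0), (2.0, -1.0), (3.0, 1.0), (4.0, 3.0), (5.0, 5.0)]
mu, cov = covariance(samples)
eigenvalue, u = top_eigenvector(cov)
z = [sum(u[a] * (x[a] - mu[a]) for a in range(len(mu))) for x in samples]  # z = u^T x
```

Because these samples lie exactly on a line, the first principal component keeps all of the variance: the eigenvalue equals the sum of the diagonal of `cov`.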

### ICA(Independent Component Analysis)
Real-world measurements are often a mixture of several 'independent' signals. The problem of restoring the original signals from a given mixed signal is called the blind source separation problem. According to the physical laws of sound, most mixed signals $\mathbf{x}$ are linear combinations of the original signals $\mathbf{z}$.  


$$
\mathbf{x} = \mathbf{A}\mathbf{z} \\
\tilde{\mathbf{z}} = \mathbf{W}\mathbf{x}, \quad \mathbf{W} = \mathbf{A}^{-1}
$$  
The matrix $\mathbf{A}$ is related to the distances between the original sound sources and the microphones. If you knew $\mathbf{A}$, you could solve the blind source separation problem directly. Since $\mathbf{A}$ is unknown, the problem is underdetermined, but a solution can be obtained under the proper conditions: the original signals are independent and non-Gaussian. ICA finds the $\mathbf{w}_j$ that maximizes the degree to which $z_j$ is non-Gaussian, where $z_j = \mathbf{w}_j\mathbf{x}$ and $\check{G}$ measures non-Gaussianity.  


$$\hat{\mathbf{w}}_j = \underset{\mathbf{w}_j}{\text{argmax}} \check{G}(z_j)$$  
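
A toy version of this search can be written for two signals. The sketch makes several assumptions that go beyond the text: the non-Gaussianity measure $\check{G}$ is taken to be the absolute excess kurtosis, the two sources are uniform and have equal variance, and the mixing matrix $\mathbf{A}$ is a pure rotation, so that $\mathbf{w}_j$ is a unit vector and can be found by scanning over an angle. In general the mixed signals would first have to be whitened.

```python
import math

def excess_kurtosis(values):
    """Fourth standardised moment minus 3; it is 0 for a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# Two independent, non-Gaussian (uniform) sources on a full product grid
grid = [-1.0 + 0.2 * i for i in range(11)]
sources = [(s1, s2) for s1 in grid for s2 in grid]

# Mix with a rotation by 30 degrees (hypothetical mixing matrix A)
theta = math.radians(30)
mixed = [(math.cos(theta) * s1 - math.sin(theta) * s2,
          math.sin(theta) * s1 + math.cos(theta) * s2) for s1, s2 in sources]

def non_gaussianity(angle_deg):
    """|excess kurtosis| of z = w x for the unit row vector w = (cos, sin)."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return abs(excess_kurtosis([c * x1 + s * x2 for x1, x2 in mixed]))

# Scan candidate directions in 1-degree steps and keep the most non-Gaussian one
best_angle = max(range(180), key=non_gaussianity)
```

The projection is most non-Gaussian when it lines up with one of the two original sources, so the best angle matches the mixing rotation up to a multiple of 90 degrees. This reflects the usual ambiguity of ICA regarding the order of the recovered signals.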

### Autoencoder
<img src="./img/4_autoencoder.png" width="25%" height="25%">
An autoencoder is a neural network that takes in a feature vector $\mathbf{x}$ and outputs the same or a similar vector $\mathbf{x}'$. The input layer and the output layer have the same number of nodes. An autoencoder is meaningful when the number of nodes in the hidden layer ($m$) is smaller than the number of nodes in the input layer ($d$): the hidden vector $\mathbf{h}$ then has to hold the core information, the most prominent features. With the advent of sparse coding, overcomplete methods have also performed well, and various structures have been tried in autoencoders: $m<d, m=d, m>d$.  


$$
\mathbf{h} = f(\mathbf{x}) = \tau_{encode}(\mathbf{W}\mathbf{x}) \\
\mathbf{x}' = g(\mathbf{h}) = \tau_{decode}(\mathbf{V}\mathbf{h})
$$  
$\tau$ is an activation function, which may be linear or non-linear. The functions $f$ and $g$ are determined by the weights $\mathbf{W}$ and $\mathbf{V}$, which are the parameters to be learned. Let $\Theta = \{\mathbf{W}, \mathbf{V}\}$.  


$$\hat{\Theta} = \underset{\Theta}{\text{argmin}} \sum_{i=1}^n L \left(\mathbf{x}_i, g(f(\mathbf{x}_i))\right)$$  
$L$ is a loss function used as the objective function.  
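
The encoder, the decoder, and one term of the objective can be sketched as a forward pass. The example assumes linear activations, a squared-error loss for $L$, and hand-picked weights with $d = 3$ and $m = 2$; nothing is trained here, it only shows how $f$, $g$, and $L$ fit together.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def identity(a):
    return a

def autoencoder(x, W, V, tau_encode=identity, tau_decode=identity):
    h = [tau_encode(a) for a in matvec(W, x)]       # h = f(x) = tau_encode(W x)
    x_rec = [tau_decode(a) for a in matvec(V, h)]   # x' = g(h) = tau_decode(V h)
    return h, x_rec

def squared_error(x, x_rec):
    """One possible loss L(x, g(f(x)))."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # m x d encoder weights
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # d x m decoder weights
x = [1.0, 2.0, 3.0]
h, x_rec = autoencoder(x, W, V)
loss = squared_error(x, x_rec)
```

With these weights the hidden layer keeps the first two features and drops the third, so the loss is exactly the squared value of the lost feature.
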
In the $m>d$ case, the capacity of the autoencoder is too large, so it needs proper regularization. There are several types of regularized autoencoder: SAE (sparse autoencoder), DAE (denoising autoencoder), and CAE (contractive autoencoder).  
- **SAE**  
SAE achieves a regularization effect by forcing the output of the hidden layer $\mathbf{h}$ to be sparse.  
$$
SAE: \hat{\Theta} = \underset{\Theta}{\text{argmin}} \sum_{i=1}^n L \left(\mathbf{x}_i, g(f(\mathbf{x}_i))\right) + \lambda\phi(\mathbf{h}_i)$$  


In contrast, DAE and CAE keep the extracted feature vectors as constant as possible even if the input changes slightly.
- **DAE**  
DAE takes a noisy input $\tilde{\mathbf{x}}_i$ and is trained to return the original pattern $\mathbf{x}_i$, so its output stays constant even when the input changes because of noise. This noise removal provides a regularization effect.  
$$
DAE: \hat{\Theta} = \underset{\Theta}{\text{argmin}} \sum_{i=1}^n L \left(\mathbf{x}_i, g(f(\tilde{\mathbf{x}}_i))\right)$$  
- **CAE**  
CAE makes the derivative of the encoder function $f$ small. It regularizes by keeping the Frobenius norm of the Jacobian matrix of $f$ small.  
$$
CAE: \hat{\Theta} = \underset{\Theta}{\text{argmin}} \sum_{i=1}^n L \left(\mathbf{x}_i, g(f(\mathbf{x}_i))\right) + \lambda\phi(\mathbf{x}_i, \mathbf{h}_i) \\
\phi(\mathbf{x}_i, \mathbf{h}_i) = \left\Vert \frac {\partial f}{\partial \mathbf{x}} \right\Vert_F^2
$$  
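
For a concrete encoder the CAE penalty has a closed form. The sketch below assumes a sigmoid encoder $\mathbf{h} = \text{sigmoid}(\mathbf{W}\mathbf{x})$, which the text does not prescribe. In that case the Jacobian is $\text{diag}(h_j(1 - h_j))\mathbf{W}$, so the squared Frobenius norm is $\sum_j (h_j(1 - h_j))^2 \sum_a W_{ja}^2$. A finite-difference version is included only as a check.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def encode(W, x):
    """h = f(x) = sigmoid(W x)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

def contractive_penalty(W, x):
    """Squared Frobenius norm of df/dx for the sigmoid encoder, in closed form."""
    h = encode(W, x)
    return sum((hj * (1 - hj)) ** 2 * sum(w * w for w in row) for hj, row in zip(h, W))

def numerical_penalty(W, x, eps=1e-6):
    """Same quantity from central finite differences, used as a sanity check."""
    total = 0.0
    for a in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[a] += eps
        x_minus[a] -= eps
        diffs = zip(encode(W, x_plus), encode(W, x_minus))
        total += sum(((p - m) / (2 * eps)) ** 2 for p, m in diffs)
    return total

W = [[0.5, -1.0], [2.0, 0.0]]
penalty_at_origin = contractive_penalty(W, [0.0, 0.0])
penalty_elsewhere = contractive_penalty(W, [0.3, -0.2])
```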