# Autoencoders : Unsupervised Learning of Nonlinear Manifolds

## Keywords
- **```Unsupervised learning```**
- **```Nonlinear Dimensionality reduction``` = ```Representation learning``` = ```Efficient coding learning``` = ```Feature Extraction``` = ```Manifold learning```**
- **```Generative model learning (recent, since VAE)```**
- **```Maximum likelihood density estimation```** <br/><br/>
- **To sum up, an ```Autoencoder``` is an unsupervised learning method whose loss function is a negative log-likelihood. In a trained autoencoder, the encoder performs dimensionality reduction (manifold learning), while the decoder acts as a generative model.**

## Overview
**01. Revisiting DNNs**
- Loss function viewpoint 1 : Backpropagation
- Loss function viewpoint 2 : Maximum likelihood
- Maximum likelihood for autoencoders

**02. Manifold Learning**
- Four objectives of manifold learning
- Dimensionality reduction
- Density estimation

**03. Autoencoder**
- Autoencoder(AE)
- Denoising AE(DAE)
- Contractive AE(CAE)

**04. Variational autoencoder**
- Variational AE(VAE)
- Conditional VAE(CVAE)
- Adversarial AE(AAE)

**05. Applications**
- Retrieval
- Generation
- Regression
- GAN + VAE

### 01. Revisiting DNNs
#### Classical Machine Learning
1. **Collect training data**
2. **Define functions : model inputs, outputs, and loss functions**
    - Generally, DNN training is performed with backpropagation.
    - Backpropagation relies on two assumptions about the loss function:
        - Total loss of the DNN over the training set = sum of the losses for each training sample
        - Loss for each training sample is a function of the *final* output of the DNN
    - These two conditions constrain the form of the loss function.
3. **Learning/Training : find the parameter $\theta$ that minimizes the loss function over all training samples.**
    - Generally, Gradient Descent (GD) is applied; GD is one of the simplest optimization methods.
    - GD is an iterative method for finding the best parameter, so the update rule for $\theta$ and a terminating condition must be defined.
    - $L(\theta + \Delta\theta) = L(\theta) + \nabla{L}\cdot\Delta\theta + \text{ second-order term }+ ...$
    - $\Delta\theta = -\eta\nabla{L}$ : update rule, $\eta$ is the learning rate
    - The learning rate should generally be small, because we approximate $L(\theta + \Delta\theta) - L(\theta)$ with only the first-order term of the Taylor expansion.
    - With a large learning rate, the neglected higher-order terms are no longer small, so the approximation error grows and the loss may fail to decrease.
    - The mini-batch approach updates the parameters once per batch.
    - Backpropagation is an efficient algorithm for computing the full gradient of the loss function. <br/>
  
4. **Evaluation and Inference**
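
The update rule and terminating condition from step 3 can be illustrated with a minimal sketch. The quadratic toy loss, the function name `gradient_descent`, and the tolerance `tol` are assumptions chosen for illustration; they are not part of the original notes.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Plain gradient descent: theta <- theta - lr * grad L(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = -lr * grad_fn(theta)       # update rule: delta_theta = -eta * grad L
        theta = theta + step
        if np.linalg.norm(step) < tol:    # terminating condition: the update is negligible
            break
    return theta

# Hypothetical toy loss L(theta) = ||theta - target||^2, with gradient 2 * (theta - target).
target = np.array([3.0, -1.0])
theta_star = gradient_descent(lambda th: 2.0 * (th - target), theta0=[0.0, 0.0])
```

For this particular loss, each step multiplies the remaining error by `1 - 2 * lr`, so any `lr` above 1.0 makes the iterates diverge, which is one concrete way a large learning rate breaks the first-order approximation.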

#### Viewpoint 1 : Backpropagation
- Backpropagation is an algorithm that propagates the error signal from the output layer back toward the earlier layers.

- <p align = "center">
$\text{Backpropagation Algorithm : }$ <br/>
$\bullet\text{ Error of the output layer: }\delta^L = \nabla_aC\odot\sigma'(z^L) \text{,  C : Cost (Loss)}$ <br/>
$\bullet\text{ Error relationship between two adjacent layers:  }\delta^l = \sigma'(z^l)\odot((w^{l+1})^T\delta^{l+1})$ <br/>
$\bullet\text{ Gradient of bias : } \nabla_{b^l}C = \delta^l$ <br/>
$\bullet\text{ Gradient of weight : } \nabla_{w^l}C = \delta^l(a^{l-1})^T$ <br/>
$\bullet\text{ a : output (activation) of a layer, z : pre-activation, b : bias, w : weight} , \odot : \text{elementwise product} , \mathcal{D} : \text{Dataset}$ </p>
$\bullet\text{ Weight, bias update rule : }w_{k+1}^l = w_{k}^l - \eta\nabla_{w_{k}^l}L(\theta_k, \mathcal{D}) \text{ and } b_{k+1}^l = b_{k}^l - \eta\nabla_{b_{k}^l}L(\theta_k, \mathcal{D})$
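
As a rough illustration of the backpropagation equations above, here is a minimal NumPy sketch for a small fully connected network. The sigmoid activation, the quadratic cost $C = \frac{1}{2}||a^L - y||^2$, the layer sizes, and all function names are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def forward(x, weights, biases):
    """Return pre-activations z^l and activations a^l for every layer (a^0 = x)."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations

def backprop(x, y, weights, biases):
    """Gradients of C = 0.5 * ||a^L - y||^2 with respect to all weights and biases."""
    zs, activations = forward(x, weights, biases)
    # Output-layer error: delta^L = grad_a C (elementwise) sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_w, grads_b = [None] * len(weights), [None] * len(biases)
    for l in reversed(range(len(weights))):
        grads_b[l] = delta                             # grad_b C = delta^l
        grads_w[l] = np.outer(delta, activations[l])   # grad_w C = delta^l (a^{l-1})^T
        if l > 0:
            # Error of the previous layer: sigma'(z) (elementwise) (w^T delta)
            delta = sigmoid_prime(zs[l - 1]) * (weights[l].T @ delta)
    return grads_w, grads_b

rng = np.random.default_rng(0)
sizes = [4, 3, 2]                                      # hypothetical layer sizes
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
x, y = rng.normal(size=4), np.array([0.0, 1.0])
grads_w, grads_b = backprop(x, y, weights, biases)
```

A finite-difference check on a single weight is a simple way to confirm that gradients computed this way match the definition of the derivative.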

#### Viewpoint 2 : Maximum likelihood
- The maximum likelihood approach finds the parameter $\theta$ of an assumed distribution that best explains the observed data points.
- i.e. $\theta^* = \arg\min_\theta[-\log{(p(y|f_\theta(x)))}]$
- Once the parameter $\theta$ is optimized, the output of the ML-based model defines the **optimal distribution** for the dataset.
- Therefore we can **sample** data from the optimal distribution for a generation task, or take the mean of the distribution for a regression or classification task.
- Especially for generation tasks, we can argue that the model has learned the **distribution of a specific domain** through the ML approach, not a single fixed output.
- Because of the constraints of the backpropagation algorithm (```Constraint : total loss of DNN over training samples = sum of loss for each training sample```), we use the **negative log likelihood**, which turns the product of per-sample likelihoods into a sum.
- Additionally, we have to assume that the samples are **i.i.d. (independent and identically distributed)** to satisfy this constraint. (*If the samples are not i.i.d., the joint likelihood does not factorize, so the total loss cannot be written as a sum of per-sample losses.*)
- In general, we assume that the data points follow a **Gaussian distribution** or a **Bernoulli distribution**, which correspond to the **MSE loss** and the **Cross-Entropy loss** respectively.
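
The correspondence between the assumed distribution and the loss can be checked numerically. The sketch below assumes a Gaussian with fixed unit variance and independent Bernoulli outputs; the function names and the toy numbers are made up for illustration.

```python
import numpy as np

def gaussian_nll(y, mu, sigma=1.0):
    """-log N(y | mu, sigma^2 I), summed over dimensions."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2))

def bernoulli_nll(y, p):
    """-log prod_i p_i^y_i (1 - p_i)^(1 - y_i), i.e. the binary cross-entropy."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
out_a = np.array([0.9, 0.2, 0.7])      # hypothetical network outputs
out_b = np.array([0.6, 0.4, 0.5])

# With sigma fixed, the Gaussian NLL differs from half the squared error only by a constant,
# so comparing two outputs by NLL is the same as comparing them by squared error.
nll_gap = gaussian_nll(y, out_a) - gaussian_nll(y, out_b)
sse_gap = 0.5 * (np.sum((y - out_a) ** 2) - np.sum((y - out_b) ** 2))

# Under the i.i.d. assumption the joint likelihood is a product,
# so the negative log turns it into a sum of per-sample losses.
per_sample_lik = np.array([0.9, 0.8, 0.7])
joint_nll = -np.log(np.prod(per_sample_lik))
summed_nll = np.sum(-np.log(per_sample_lik))
```

Here `nll_gap` equals `sse_gap` and `joint_nll` equals `summed_nll` up to floating-point error, which is the sense in which minimizing MSE is maximum likelihood under a fixed-variance Gaussian.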

### 02. Manifold Learning
#### Introduction
- We assume that there is a **manifold** that closely covers the data in the original data space.
- If we find a **d-dimensional manifold** embedded in m-dimensional space, there is an explicit mapping $f : \mathcal R^d \to \mathcal R^m$, and projecting the data points onto the **d-dimensional manifold** gives their low-dimensional coordinates.
- $f$ is called the embedding function.

#### Four objectives of manifold learning
1. **Data compression** via encoder-decoder network
2. **Data visualization** : t-SNE mapping (t-distributed stochastic neighbor embedding)
3. **Curse of dimensionality, manifold hypothesis**
    - As the dimension of the data increases, the data become sparsely spread over the space, which degrades the performance of learned networks.
    - Because the density of the data decreases as the dimension increases, more samples are needed for model estimation.
    - The manifold hypothesis (assumption) is the **assumption that there is a low-dimensional subspace (manifold) that contains the high-dimensional dataset well.**
    - In other words, there is a **high-density region (the so-called manifold) in the high-dimensional space on which the data are concentrated.**
    - **Data distribution in high-dimensional space** is never homogeneous!
4. **Discovering most important features** : Feature Extraction with manifold learning
    - While distance in Euclidean space does not provide useful information, distance on the manifold can give us useful features.
    - Interpolation on the manifold can be used to sample plausible images, such as interpolating between frames of a golf swing.
    - In general, a learned manifold is **entangled**; a disentangled manifold is more interpretable.
    
#### Dimensionality Reduction Approaches
1. PCA
- Spread the raw data in the space and find the hyperplane that preserves the most variance (equivalently, minimizes the squared projection error).
- k principal axes are obtained, and the mapping is given by $h = f_\theta(x) = W(x-\mu)$, which has the form of a neural network layer (weight, bias).
- A linear AE with squared loss is known to learn the same subspace as PCA, so AE includes PCA as a special case.
- PCA is a linear manifold (hyperplane) projection, so it cannot disentangle an S-shaped distribution in high-dimensional space.
- **LLE and Isomap** are nonlinear manifold projections, so they can produce a disentangled manifold.
2. Isomap, LLE
- Construct a neighborhood graph with edge weights $d_x(i,j)$ given by Euclidean distance
- $\epsilon$-Isomap : neighbors within a radius $\epsilon$.
- $K$-Isomap : $K$ nearest neighbors
- Assume that points in the same "neighborhood" (i.e. very close in the data space) are also neighbors on the manifold.
- *Search for more details.*
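
The PCA mapping $h = W(x-\mu)$ from item 1 can be sketched with a singular value decomposition. This is one common way to compute it, not necessarily the one the notes have in mind; the helper names and the toy data are hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """Return the mean mu and a (k, m) matrix W whose rows are the top-k principal axes."""
    mu = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def pca_encode(X, mu, W):
    return (X - mu) @ W.T          # h = W (x - mu), applied to every row

def pca_decode(H, mu, W):
    return H @ W + mu              # linear reconstruction on the hyperplane

# Hypothetical toy data: 3-D points lying close to a 2-D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
basis = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]])
X = latent @ basis + 0.01 * rng.normal(size=(200, 3))

mu, W = pca_fit(X, k=2)
H = pca_encode(X, mu, W)
X_hat = pca_decode(H, mu, W)
reconstruction_mse = np.mean((X - X_hat) ** 2)
```

Because the toy data lie near a plane, two principal axes reconstruct them almost exactly; data on a curved surface such as an S-shape would not be recovered by any such hyperplane.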

#### Density estimation
- Parzen windows
- Isotropic Parzen window
- Manifold Parzen window
- Non-local manifold Parzen window
- Dimensionality reduction approaches like PCA, LLE, and Isomap, and density estimation approaches including Parzen windows and their variations, are **neighborhood-based training** methods, which **explicitly use distance-based neighborhoods**.
- They also typically use Euclidean neighbors, so these nearest-neighbor approaches can be **inaccurate when the data are sparse and high-dimensional.**
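
As a concrete instance of a distance-based estimator, the following sketch implements an isotropic Gaussian Parzen window. The kernel width `sigma`, the function name, and the 1-D toy data are assumptions made for illustration.

```python
import numpy as np

def parzen_density(query, samples, sigma):
    """Isotropic Gaussian Parzen estimate: p(query) = mean_i N(query | x_i, sigma^2 I)."""
    _, dim = samples.shape
    sq_dists = np.sum((samples - query) ** 2, axis=1)   # squared Euclidean distances
    norm = (2 * np.pi * sigma**2) ** (dim / 2)
    return np.mean(np.exp(-sq_dists / (2 * sigma**2)) / norm)

rng = np.random.default_rng(0)
samples = rng.normal(size=(2000, 1))    # hypothetical 1-D training data drawn from N(0, 1)
p_center = parzen_density(np.array([0.0]), samples, sigma=0.3)
p_tail = parzen_density(np.array([4.0]), samples, sigma=0.3)
```

The estimate depends only on Euclidean distances to the training samples, which is why it degrades when the samples are sparse relative to the dimension.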

### 03. Autoencoder
#### Vanilla Autoencoder
- To briefly summarize, an autoencoder makes its output layer the same size as its input layer.
- $x \to z\text{(latent variable)} \to y$
- $x \to z : \text{encoder, h(.)} \text { and } z \to y :  \text{decoder, g(.)}$
- It solves an unsupervised learning problem in a supervised learning fashion, using the input itself as the target.
- The loss function measures the discrepancy between the input $x$ and the output $y = g(h(x))$.
- If the output $y$ is similar to $x$, the AE network is learning well, and the latent variable $z$ is a compressed representation of $x$.
- After training, the encoder and decoder can be used independently for other tasks; both are at least well trained for the **training DB distribution.**
- Minimal performance is guaranteed in AE : the encoder compresses well at least for the data in the training DB, and the decoder generates well at least for the data in the training DB.
- **GAN** has no such guaranteed minimal performance and is hard to train, but GAN can create out-of-distribution outputs.
- $\text{Reconstruction Error : } L_{AE} = \sum_{x \in D}{L(x,y)} = \sum_{x \in D}{L(x,g(h(x)))}$
- AE connection to PCA & RBM (Restricted Boltzmann Machine)
    - PCA : for a bottleneck AE architecture (i.e. $d_z < d$) with linear neurons and squared loss, the autoencoder learns the same subspace as PCA (but not necessarily the same basis).
    - RBM : the connection applies to an AE with a single hidden layer with sigmoid non-linearity and a sigmoid output non-linearity (not used nowadays).
- We can also use an autoencoder as a pretrained network, to initialize the weights and biases of a network for another downstream task.
![image-2.png](attachment:image-2.png)
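
The encoder $h(\cdot)$, decoder $g(\cdot)$, and reconstruction error described in this section can be sketched in plain NumPy. The tanh encoder, linear decoder, toy data, learning rate, and number of steps are all assumptions made to keep the example small; a practical implementation would use a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2-D latent factors embedded nonlinearly in 5-D.
t = rng.uniform(-1, 1, size=(256, 2))
X = np.column_stack([t[:, 0], t[:, 1], t[:, 0] * t[:, 1], t[:, 0] ** 2, t[:, 1] ** 2])

d, d_z = X.shape[1], 2                       # bottleneck: d_z < d
W1, b1 = 0.5 * rng.normal(size=(d, d_z)), np.zeros(d_z)
W2, b2 = 0.5 * rng.normal(size=(d_z, d)), np.zeros(d)

def encode(X):                               # h(.) : x -> z
    return np.tanh(X @ W1 + b1)

def decode(Z):                               # g(.) : z -> y
    return Z @ W2 + b2

def reconstruction_loss(X):
    """Mean over samples of the squared error between x and g(h(x))."""
    return np.mean(np.sum((X - decode(encode(X))) ** 2, axis=1))

initial_loss = reconstruction_loss(X)
lr = 0.05
for _ in range(2000):
    Z = encode(X)
    Y = decode(Z)
    dY = 2.0 * (Y - X) / len(X)              # gradient of the loss w.r.t. the output
    dZ_pre = (dY @ W2.T) * (1.0 - Z ** 2)    # backpropagate through tanh
    grad_W2, grad_b2 = Z.T @ dY, dY.sum(axis=0)
    grad_W1, grad_b1 = X.T @ dZ_pre, dZ_pre.sum(axis=0)
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2
final_loss = reconstruction_loss(X)
```

Note that the input `X` serves as its own target, which is what it means to solve the unsupervised problem in a supervised fashion. After training, `encode` and `decode` can be called separately.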
    
#### Denoising Autoencoder
![image.png](attachment:image.png)
- The key concept of the denoising autoencoder is that **'visually identical' images should be mapped to the same point on the manifold**.
- Researchers added salt-and-pepper noise to the raw input image to create a noisy image.
- The loss function was designed to measure the discrepancy between the AE output for the noisy image and the raw input (which is not noisy).
- Therefore, the trained AE performs a denoising task.
- Stacked denoising autoencoder (SDAE) : stack several layers in the contraction and expansion paths.
- **Generation task with DAE**
    - In the expansion path, we can sample the hidden layers assuming a Bernoulli distribution.
    - Therefore, we can obtain sampled outputs that vary slightly from the input images.
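
The corruption step and the denoising loss can be sketched as follows. The noise ratio, the assumption that pixel values lie in $[0, 1]$, the function names, and the identity function standing in for an untrained network are all illustrative choices.

```python
import numpy as np

def salt_and_pepper(x, noise_ratio, rng):
    """Set a random fraction of pixels to 0 (pepper) or 1 (salt); x is assumed to lie in [0, 1]."""
    corrupted = x.copy()
    mask = rng.random(x.shape) < noise_ratio     # which pixels to corrupt
    salt = rng.random(x.shape) < 0.5             # salt or pepper, with equal probability
    corrupted[mask & salt] = 1.0
    corrupted[mask & ~salt] = 0.0
    return corrupted

def dae_loss(x_clean, x_noisy, autoencoder):
    """Reconstruct from the noisy input, but measure the error against the clean input."""
    return np.mean((x_clean - autoencoder(x_noisy)) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=(8, 8))           # hypothetical 8x8 "image"
x_tilde = salt_and_pepper(x, noise_ratio=0.3, rng=rng)

def identity(v):                                  # placeholder for an untrained network
    return v

loss_identity = dae_loss(x, x_tilde, identity)
```

The only difference from the plain AE loss is that the network sees `x_tilde` while the target stays `x`, so a network that merely copies its input is penalized.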
    
#### Contractive Autoencoder
- DAE encourages the **reconstruction to be insensitive** to input corruption.
- Alternatively, CAE encourages the **representation (latent variable)** to be insensitive.
- $L_{SCAE} = \sum_{x \in D}{L(x,g(h(x)))} + \lambda E_{q(\tilde{x}|x)}{[||h(x) - h(\tilde{x})||^2]}$ (SCAE : stochastic contractive AE, $\tilde{x}$ : corrupted input)
- The first term is the **reconstruction error,** and the second term is the **stochastic regularization term**, which requires the representation to be preserved under a slight perturbation of the input.
- However, the SCAE stochastic regularization term $\lambda E_{q(\tilde{x}|x)}{[||h(x) - h(\tilde{x})||^2]}$ is hard to compute, so a first-order Taylor expansion of $h$ around $x$ is applied to approximate it as follows.
- $E_{q(\tilde{x}|x)}{[||h(x) - h(\tilde{x})||^2]} \approx \sigma^2 ||{\partial{h} \over \partial{x}} (x)||_F^2 \text{ : Analytic Regularization}$ (for small isotropic noise of variance $\sigma^2$; the constant $\sigma^2$ can be absorbed into $\lambda$)
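
The relation between the stochastic and the analytic regularizer can be checked numerically for a simple encoder. The sketch assumes a single-layer sigmoid encoder and small isotropic Gaussian corruption with standard deviation `sigma`; under that assumption the Monte Carlo estimate divided by `sigma**2` should be close to the squared Frobenius norm of the Jacobian. All names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encoder(x, W, b):
    return sigmoid(W @ x + b)                     # h(x)

def analytic_penalty(x, W, b):
    """||dh/dx||_F^2 for a sigmoid encoder, where dh/dx = diag(h * (1 - h)) @ W."""
    h = encoder(x, W, b)
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def stochastic_penalty(x, W, b, sigma, n_samples, rng):
    """Monte Carlo estimate of E ||h(x) - h(x_tilde)||^2 with x_tilde = x + N(0, sigma^2 I)."""
    h = encoder(x, W, b)
    noise = sigma * rng.normal(size=(n_samples, x.size))
    h_tilde = sigmoid((x + noise) @ W.T + b)      # encoder applied to every corrupted copy
    return np.mean(np.sum((h_tilde - h) ** 2, axis=1))

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=5)
sigma = 1e-2
analytic = analytic_penalty(x, W, b)
stochastic = stochastic_penalty(x, W, b, sigma, n_samples=20_000, rng=rng) / sigma**2
```

The analytic form needs one pass over the encoder instead of many noisy samples, which is the practical motivation for replacing the expectation with the Jacobian norm.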

### 04. Variational Autoencoder
#### Variational AE(VAE)

#### Conditional VAE(CVAE)

#### Adversarial AE(AAE)
