# Autoencoder

### Autoencoders: Motivation and History 

* Unsupervised learning:
    - uses **only** the inputs $\mathbf{x}_i$ for learning
    - automatically learns *meaningful* features for data (a.k.a **representation learning**)
    - makes the best use of unlabeled data (a.k.a **semi-supervised learning**)
    - models data generating distribution (a.k.a **Generative modeling**)


* An autoencoder is a feedforward network trained to reconstruct its input at the output layer


* History:
    - Restricted Boltzmann Machines (RBMs) were quite popular neural networks prior to autoencoders in 2008
    - RBMs consist of an input layer (raw input) and a hidden layer (representation to be learned), with the transformation weights from input to hidden layer learned via stochastic sampling methods like MCMC
    - Pre-trained weights of RBMs were used to initialize autoencoders' weights, which were then optimized via backpropagation
    - *Status Quo*: Autoencoders are fully trained using backpropagation.
    
    
<table>
    <tr>
        <th style="text-align:center">RBMs</th>
        <th style="text-align:center">Autoencoders</th>
    </tr>
    <tr>
        <td><img src="img/rbm.png" width="500px"><br><a href="https://medium.com/datatype/restricted-boltzmann-machine-a-complete-analysis-part-1-introduction-model-formulation-1a4404873b3" target="_blank">(source)</a></td>
        <td><img src="img/ae1.png" width="500px"><br><a href="https://towardsdatascience.com/unsupervised-learning-part-2-b1c130b8815d" target="_blank">(source)</a></td>
    </tr>
</table>
    

### Autoencoders

* It is a feed-forward network with input $\mathbf{x}$ and output $\hat{\mathbf{x}} \approx \mathbf{x}$


* We will use $f_{\theta}: \mathbb{R}^{d_{in}} \rightarrow \mathbb{R}^{d_{latent}}$ to denote the **encoder** that maps input to a lower dimension space, and


* We will use $g_{\phi}: \mathbb{R}^{d_{latent}} \rightarrow \mathbb{R}^{d_{in}}$ to denote the **decoder** that maps the *compressed representation of the input in the latent space* back to the input space


* A typical example of $f_{\theta}$ (or $g_{\phi}$) implementing a 1-layer feedforward network is $f_{\theta}(\mathbf{x}) = a(\mathbf{x}\mathbf{\theta})$, where $\theta \in \mathbb{R}^{d_{in} \times d_{latent}}$, and $a(.)$ is an activation function of choice



<img src="img/ae.png" width="500">
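
The encoder/decoder notation above can be sketched in a few lines of NumPy. This is only an illustrative forward pass under assumptions that are not in the slides: a sigmoid activation for $a(.)$, small random (untrained) weights, no bias terms, and the row-vector convention $\mathbf{x}\theta$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, theta):
    # f_theta(x) = a(x theta); x: (n, d_in), theta: (d_in, d_latent)
    return sigmoid(x @ theta)

def decode(h, phi):
    # g_phi(h) = a(h phi); h: (n, d_latent), phi: (d_latent, d_in)
    return sigmoid(h @ phi)

rng = np.random.default_rng(0)
d_in, d_latent = 8, 3  # illustrative sizes (undercomplete: d_latent < d_in)

# Assumption: random initial weights; a real autoencoder would learn these.
theta = rng.normal(scale=0.1, size=(d_in, d_latent))
phi = rng.normal(scale=0.1, size=(d_latent, d_in))

x = rng.random((5, d_in))   # 5 inputs with values in [0, 1)
h = encode(x, theta)        # latent representation
x_hat = decode(h, phi)      # reconstruction, same shape as x

print(h.shape, x_hat.shape)
```

With untrained weights the reconstruction is of course poor; the point is only the shapes and the composition $\hat{\mathbf{x}} = g_{\phi}(f_{\theta}(\mathbf{x}))$.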

### Autoencoders: Loss function

* The objective is to make the reconstruction $\hat{\mathbf{x}}_i$ representative of the input $\mathbf{x}_i$, so we want to maximize the conditional data likelihood, i.e.,  

$$\max_{\theta, \phi}\prod_{i=1}^{N}p(\mathbf{x}_i | \hat{\mathbf{x}}_i)$$


* In the standard form, we want to minimize negative log likelihood, i.e., 
$$\min_{\theta, \phi}-\sum_{i=1}^{N}\log p(\mathbf{x}_i | \hat{\mathbf{x}}_i)$$


* Therefore, the loss function is $l(\mathbf{x}_i, \hat{\mathbf{x}}_i) = -\log p(\mathbf{x}_i | \hat{\mathbf{x}}_i)$



* Depending on the input data type, we can simplify the above loss functions

    - if $\mathbf{x} \in \mathbb{R}^{d_{in}}$, we assume a Gaussian model, i.e., $p(\mathbf{x}_i | \hat{\mathbf{x}}_i) = \mathcal{N}(\mathbf{x}_i ; \hat{\mathbf{x}}_i, \sigma^2 I)$, yielding the **reconstruction loss as the squared $l2$-norm** (up to a scale and an additive constant that do not depend on $\theta$ and $\phi$)
    $$ l(\mathbf{x}_i, \hat{\mathbf{x}}_i) = || \hat{\mathbf{x}}_i - \mathbf{x}_i ||_2^2 $$
    
    - if $\mathbf{x} \in \{0, 1\}^{d_{in}}$, we assume a Bernoulli model, i.e., $p(\mathbf{x}_i | \hat{\mathbf{x}}_i) = \mathcal{B}(\mathbf{x}_i ; \hat{\mathbf{x}}_i)$, yielding the reconstruction loss as the **cross-entropy loss**
    
     $$ l(\mathbf{x}_i, \hat{\mathbf{x}}_i) = -\sum_{k=1}^{d_{in}} \Big[ \mathbf{x}_{i,k} \log \hat{\mathbf{x}}_{i,k} + (1- \mathbf{x}_{i,k}) \log (1 - \hat{\mathbf{x}}_{i,k}) \Big] $$
     
    - For other input types, pick the likelihood model that matches the data and derive the loss from its negative log likelihood in the same way


* The parameters $\theta$ and $\phi$ are learned by minimizing the reconstruction loss function $\mathcal{L}$

$$\mathcal{L} = \frac{1}{N}\sum\limits_{i=1}^{N} l(\mathbf{x}_i, g_{\phi}(f_{\theta}(\mathbf{x}_i))) $$
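
As a rough sketch, the two per-example reconstruction losses and the averaged objective $\mathcal{L}$ could be computed as follows. The function names and the `eps` clipping (used to avoid `log(0)`) are illustrative choices, not part of the slides.

```python
import numpy as np

def squared_l2_loss(x, x_hat):
    # Gaussian model: ||x_hat - x||_2^2 for each example (row)
    return np.sum((x_hat - x) ** 2, axis=1)

def bernoulli_cross_entropy(x, x_hat, eps=1e-12):
    # Bernoulli model: cross entropy summed over the d_in features.
    # Assumption: clip x_hat away from 0 and 1 for numerical safety.
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=1)

def mean_loss(per_example_loss):
    # L = (1/N) * sum_i l(x_i, x_hat_i)
    return per_example_loss.mean()

# Real-valued inputs
x_real = np.array([[1.0, 2.0], [0.0, -1.0]])
x_real_hat = np.array([[1.0, 1.0], [0.5, -1.0]])
print(squared_l2_loss(x_real, x_real_hat))             # per-example losses
print(mean_loss(squared_l2_loss(x_real, x_real_hat)))  # averaged objective

# Binary inputs with a maximally uncertain reconstruction
x_bin = np.array([[1.0, 0.0]])
x_bin_hat = np.array([[0.5, 0.5]])
print(bernoulli_cross_entropy(x_bin, x_bin_hat))       # equals 2 * log(2)
```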



### Undercomplete vs Overcomplete Autoencoder

* Depending on the dimension of the latent space, an autoencoder could be *undercomplete* or *overcomplete*


<table>
    <tr>
        <th style="text-align:center">Undercomplete ($d_{latent} < d_{in}$)</th>
        <th style="text-align:center">Overcomplete ($d_{latent} > d_{in}$)</th>
    </tr>
    <tr>
        <td><img src="img/uc_ae.png" width="500px"> </td>
        <td><img src="img/oc_ae.png" width="500px"> </td>
    </tr>
</table>

<a href="https://towardsdatascience.com/unsupervised-learning-part-2-b1c130b8815d" target="_blank">(image source)</a>

### Undercomplete Autoencoder

- $d_{latent} < d_{in}$

- prevents the network from simply learning an identity function, thereby discouraging it from memorizing the input

- encoder performs a lossy compression

- useful to learn **most important features** or compressed representation of the training data

- not useful to represent variations outside of training data (e.g. translations or distortions in images)

<img src="img/uc_ae.png" width="250px">

* (optional) Linear undercomplete autoencoder

    - A linear autoencoder, i.e., $f_{\theta}(\mathbf{x}) = \theta \mathbf{x}$ and $g_{\phi}(\mathbf{h}) = \phi \mathbf{h}$, with *normalized inputs* and the *squared $l2$-norm as loss* learns the same subspace as PCA
    - Thus, a nonlinear autoencoder is a powerful nonlinear generalization of PCA
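
The PCA connection can be illustrated numerically. The sketch below does not train an autoencoder; it assumes centered synthetic data, builds the linear encoder/decoder directly from the top principal directions (tied weights, row-vector convention), and checks that this reconstruction is at least as good as projecting onto an arbitrary random subspace of the same dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_latent = 200, 5, 2

# Synthetic correlated data, centered (a simple form of input normalization)
X = rng.normal(size=(n, d_in)) @ rng.normal(size=(d_in, d_in))
X = X - X.mean(axis=0)

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:d_latent].T              # (d_in, d_latent)

def encode(x):
    return x @ V_k                 # linear encoder

def decode(h):
    return h @ V_k.T               # linear decoder (tied weights)

X_hat = decode(encode(X))
pca_error = np.sum((X - X_hat) ** 2)

# The squared error equals the energy in the discarded singular values
discarded = np.sum(S[d_latent:] ** 2)

# Comparison: projection onto a random orthonormal basis of the same size
Q, _ = np.linalg.qr(rng.normal(size=(d_in, d_latent)))
random_error = np.sum((X - X @ Q @ Q.T) ** 2)

print(pca_error, discarded, random_error)
```

A linear autoencoder trained with the squared error would be expected to reach the same reconstruction error as `pca_error`, although its learned weights need not be orthonormal or ordered like the principal components.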

### Overcomplete Autoencoder

* $d_{latent} > d_{in}$

* There are no guarantees that the autoencoder will extract meaningful features unless different *regularization techniques* are used to encourage this

* There are two ways to regularize these autoencoders
    - Implicit regularization: Loss function is left unchanged e.g. Denoising Autoencoder
    - Explicit regularization: Loss function is augmented with a penalty term, e.g., Sparse Autoencoders, Contractive Autoencoder

<img src="img/oc_ae.png" width="500px">

### Overcomplete Autoencoder:  Implicit regularization

* **Denoising Autoencoder (DAE)** (refer Module 2, L08 for detailed description)
    - Corrupts the input through a noise process 
    - The task is to reconstruct the uncorrupted input, which is enforced through an appropriate loss function (e.g, l2-norm or cross-entropy loss)
    - Encourages the **extraction of higher level features** of the input in the hidden layer
    - By forcing it to reconstruct the original input, the autoencoder is discouraged from learning an identity function
    - For example, in the figure below, 
        - the original input $X$ is corrupted to $\hat{X}$ such that the feature $X_2$ is dropped out in $\hat{X}$
        - the autoencoder is trained to minimize the reconstruction loss defined between the output $Y$ and the original input $X$
        - thus, the autoencoder needs to learn the relation of the dropped feature $X_2$ to the other features $\hat{X}_1, \hat{X}_3, \hat{X}_4$
    

<img src="img/denoising_ae.png" width="500px">
<a href="https://www.udemy.com/course/deeplearning/" target="_blank">(image source)</a>

[[1] Stacked Denoising Autoencoder](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf)
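
A minimal sketch of the denoising setup, assuming masking noise (randomly zeroing features) as the corruption process and the squared error as the loss. The `reconstruct` argument is a hypothetical stand-in for the full autoencoder $g_{\phi}(f_{\theta}(\cdot))$.

```python
import numpy as np

def mask_corrupt(x, drop_prob, rng):
    # Masking noise: each feature is zeroed independently with prob. drop_prob
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def denoising_loss(x_clean, x_corrupt, reconstruct):
    # The network sees the corrupted input, but the loss compares its
    # output against the clean input.
    y = reconstruct(x_corrupt)
    return np.mean(np.sum((y - x_clean) ** 2, axis=1))

rng = np.random.default_rng(0)
x = rng.random((4, 6)) + 0.5           # strictly positive, so zeros mark drops
x_corrupt = mask_corrupt(x, drop_prob=0.3, rng=rng)

# An identity "autoencoder" just copies the corrupted input to the output,
# so it pays a penalty for every dropped feature.
identity = lambda v: v
loss_identity = denoising_loss(x, x_corrupt, identity)
print(loss_identity)
```

Because the identity mapping cannot restore the dropped features, it is penalized, which is the sense in which the loss function itself is unchanged but the task still regularizes the model.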

### Overcomplete Autoencoder: Explicit Regularization

* **Sparse Autoencoder**
    - Explicitly penalizes hidden layer activations to encourage a sparse representation of the input
    - Encourages **learning unique statistical features** of the training data by allowing only a few hidden layer neurons to be active at any time

<img src="img/sparse_ae.png" width="250px">
<a href="https://www.wikiwand.com/en/Autoencoder" target="_blank">(image source)</a>


### Overcomplete Autoencoder: Explicit Regularization

* **Sparse Autoencoder**
    - **L1 penalty** on hidden layer activations: The augmented loss function can be written as
        $$ l_{aug}(\mathbf{x}_i, \hat{\mathbf{x}}_i) = l(\mathbf{x}_i, \hat{\mathbf{x}}_i) + \lambda ||h(\mathbf{x}_i)||_1 $$
        
        where $h_i = h(\mathbf{x}_i) = f_{\theta}(\mathbf{x}_i)$ is the latent representation of the input $\mathbf{x}_i$

    - **Average activation** of the hidden layer activations *across the training samples*: Let $\hat{\rho}_j$ be the average activation of hidden neuron $j$ across the training samples. To keep most of the hidden layer neurons inactive, the loss function is augmented with a penalty that pushes $\hat{\rho}_j$ toward $\rho$, where $\rho$ is the sparsity hyperparameter (typically a small value)
        $$ \hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} h(\mathbf{x}_i)_j $$
        
        $$\mathcal{L}_{aug} = \mathcal{L} + \sum_{j=1}^{k}KL(\rho || \hat{\rho}_j) = \mathcal{L} + \sum_{j=1}^{k} \Big[\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \Big]$$
        
        where $k$ is the number of hidden layer neurons

[[1] Why Regularized Auto-Encoders learn Sparse Representation?](https://arxiv.org/pdf/1505.05561.pdf)
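
Both sparsity penalties can be sketched directly from the formulas above. The sketch assumes hidden activations in $(0, 1)$ (e.g., sigmoid units) so that the KL term is well defined; the `eps` clipping and the function names are illustrative.

```python
import numpy as np

def l1_penalty(h, lam):
    # lambda * ||h(x_i)||_1, one value per example (row of h)
    return lam * np.sum(np.abs(h), axis=1)

def kl_sparsity_penalty(h, rho, eps=1e-12):
    # rho_hat_j: average activation of hidden unit j over the N examples
    rho_hat = np.clip(h.mean(axis=0), eps, 1.0 - eps)
    # sum_j KL(rho || rho_hat_j) between two Bernoulli distributions
    return np.sum(
        rho * np.log(rho / rho_hat)
        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    )

# Two examples, two hidden units: unit 0 is mostly inactive, unit 1 is not
h = np.array([[0.1, 0.9],
              [0.1, 0.5]])

print(l1_penalty(h, lam=0.5))            # per-example L1 penalties
print(kl_sparsity_penalty(h, rho=0.1))   # positive: unit 1 is too active
```

The KL penalty vanishes only when every $\hat{\rho}_j$ equals $\rho$, and grows as a unit's average activation drifts away from the target.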

### Overcomplete Autoencoder: Explicit Regularization

* **Contractive Autoencoder (CAE)**
    - Explicitly penalizes gradient of hidden layer activations w.r.t input features
    - Encourages **robustness against small perturbations** to the input
    - (optional) CAEs are connected to DAEs in the limit of small Gaussian noise. While DAEs learn a robust manifold that projects the corrupted input back to its original form, the features extracted by CAEs are unaffected by this level of noise
    - The augmented loss function can be written as 
    
    $$l_{aug}(\mathbf{x}_i, \hat{\mathbf{x}}_i) = l(\mathbf{x}_i, \hat{\mathbf{x}}_i) + \lambda ||\nabla_{\mathbf{x}_i} h(\mathbf{x}_i)||_F$$

    where $F$ represents the Frobenius norm. Note that $\nabla_{\mathbf{x}_i} h(\mathbf{x}_i)$ is the Jacobian matrix representing partial derivatives of *each of the hidden layer activations w.r.t each of the input features*.
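
For a concrete case, the sketch below assumes a 1-layer sigmoid encoder $h(\mathbf{x}) = a(\mathbf{x}\theta)$ without bias, for which the Jacobian has the closed form $\partial h_j / \partial x_k = h_j (1 - h_j)\, \theta_{k,j}$. It computes the contractive penalty from that Jacobian and compares the Jacobian against central finite differences. All names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder(x, theta):
    # h(x) = sigmoid(x theta); x: (d_in,), theta: (d_in, d_latent)
    return sigmoid(x @ theta)

def encoder_jacobian(x, theta):
    # J[j, k] = d h_j / d x_k = h_j * (1 - h_j) * theta[k, j]
    h = encoder(x, theta)
    return (h * (1.0 - h))[:, None] * theta.T    # (d_latent, d_in)

def contractive_penalty(x, theta, lam):
    # lambda * ||J||_F, following the formula above
    return lam * np.linalg.norm(encoder_jacobian(x, theta), ord="fro")

def numerical_jacobian(x, theta, step=1e-6):
    # Central finite differences, used here only as a sanity check
    J = np.zeros((theta.shape[1], x.shape[0]))
    for k in range(x.shape[0]):
        e = np.zeros_like(x)
        e[k] = step
        J[:, k] = (encoder(x + e, theta) - encoder(x - e, theta)) / (2 * step)
    return J

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))
x = rng.normal(size=4)

print(contractive_penalty(x, theta, lam=0.1))
print(np.max(np.abs(encoder_jacobian(x, theta) - numerical_jacobian(x, theta))))
```

For deeper encoders there is no such simple closed form, and the Jacobian would usually be obtained through automatic differentiation instead.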


### Deep / Stacked Autoencoders 

* In the early days, training autoencoders was a non-trivial task. Therefore, autoencoders were limited to single-layer encoders and decoders


* Multi-layered autoencoders were trained via **layer-wise training** with the help of RBMs, which earned them the name **Stacked Autoencoders** or **Deep Autoencoders**


* However, with recent advances in deep learning such as improved activation functions and normalization layers, training a deep autoencoder has become easy


<img src="img/deep_ae.png" width="500px">
<a href="https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798" target="_blank">(image source)</a>

### Variational Autoencoders (VAE)

* AEs are used for **data compression** 
    - The latent representations do not carry any special spatial meaning
    
    - Sampling a random point in the latent space does not have a meaning unless it is obtained by encoding a point in the original input space
    
    - The figure shows the clusters obtained by AEs on the MNIST dataset. It is evident that the latent encodings do not span the entire subspace, thereby losing the meaning of interpolation between points (1 and 7 in the illustration)

<table>
    <tr>
        <th style="text-align:center">AE on MNIST</th>
        <th style="text-align:center">VAE on MNIST</th>
    </tr>
    <tr>
        <td><img src="img/mnist_ae.png" width="500px"></td>
        <td><img src="img/mnist_vae.png" width="500px"></td>      
    </tr>
</table>

<a href="https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798" target="_blank">(image source)</a>


### Variational Autoencoders (VAE)

* Recall that the decoder takes a point in the latent space to reconstruct the input


* Thus, the decoder can be used in isolation to generate new datapoints in the original input space


* This requires us to sample points in the latent space


* VAEs impose a spatial structure on the latent space, thereby improving the **data generation** and **data interpolation** aspects of AEs


[[Reference] Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)

### Variational Autoencoders (VAE):  Modification 1/2

* VAEs impose a spatial structure on the latent space, thereby improving the **data generation** and **data interpolation** aspects of AEs


* This is done using two modifications to the original AE 

    - **Sampling the point in the latent space** instead of using the deterministic encodings of an AE
    
    - Encoder outputs the mean $\boldsymbol{\mu} \in \mathbb{R}^{d_{latent}}$ and the standard deviation $\boldsymbol{\sigma} \in \mathbb{R}^{d_{latent}}$ of a multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}, diag(\boldsymbol{\sigma}^2))$
    
    - Decoder takes as its input a point sampled from the distribution defined by $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$
    
    - Therefore, even though the same input is always encoded to the same $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$, the decoder sees it as a different point in the latent space each time

    - This encourages the decoder to treat the points surrounding the encoded means as similar, thereby improving the **data generation** capability of VAEs


<img src="img/sample_ae.jpeg" width="750px">
<a href="blog.bayeslabs.co/2019/06/04/All-you-need-to-know-about-Vae" target="_blank">(image source)</a>

[[Reference] Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)

### Variational Autoencoders (VAE):  Modification 1/2

* VAEs impose a spatial structure on the latent space, thereby improving the **data generation** and **data interpolation** aspects of AEs


* This is done using two modifications to the original AE 

    - **Sampling the point in the latent space** instead of using the deterministic encodings of an AE
    


<table style="width:500px">
    <tr>
        <td style="text-align:left; width: 100px; height:50px"><strong>Encode</strong></td>
        <td style="text-align:left">$\mu_i = f_{\theta_1}(\mathbf{x}_i) \qquad \sigma_i = f_{\theta_2}(\mathbf{x}_i)$</td>
    </tr>
    <tr>
        <td style="text-align:left; width: 100px; height:50px"><strong>Sample</strong></td>
        <td style="text-align:left">$ h(\mathbf{x}_i) = \mu_i + \sigma_i \odot \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $</td>
    </tr>
    <tr>
        <td style="text-align:left; width: 100px; height:50px"><strong>Decode</strong></td>
        <td style="text-align:left">$ \hat{\mathbf{x}}_i = g_{\phi}(h(\mathbf{x}_i)) $</td>    
    </tr>
</table>

[[Reference] Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)
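
The encode, sample, decode steps can be sketched as below. This assumes that $\sigma_i$ denotes the standard deviation of $\mathcal{N}(\mu_i, diag(\sigma_i^2))$, so a sample is obtained as the mean plus the standard deviation times standard Gaussian noise. The linear heads, the `exp` used to keep $\sigma$ positive, and all weight names are hypothetical choices for illustration.

```python
import numpy as np

def encode(x, theta_mu, theta_sigma):
    mu = x @ theta_mu
    # Assumption: exp() keeps the standard deviation strictly positive
    sigma = np.exp(x @ theta_sigma)
    return mu, sigma

def sample_latent(mu, sigma, rng):
    # h = mu + sigma * eps with eps ~ N(0, I): the randomness is isolated
    # in eps, so h stays a differentiable function of mu and sigma
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def decode(h, phi):
    return h @ phi   # linear decoder, for illustration only

rng = np.random.default_rng(0)
d_in, d_latent = 6, 2
theta_mu = rng.normal(scale=0.1, size=(d_in, d_latent))
theta_sigma = rng.normal(scale=0.1, size=(d_in, d_latent))
phi = rng.normal(scale=0.1, size=(d_latent, d_in))

x = rng.random((3, d_in))
mu, sigma = encode(x, theta_mu, theta_sigma)
h = sample_latent(mu, sigma, rng)
x_hat = decode(h, phi)

# Sanity check of the sampler on fixed parameters
mu_fixed = np.array([1.0, -2.0])
sigma_fixed = np.array([0.5, 2.0])
samples = np.stack([sample_latent(mu_fixed, sigma_fixed, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

The empirical mean and standard deviation of `samples` should be close to `mu_fixed` and `sigma_fixed`, which would not be the case if the noise were scaled by the variance instead.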

### Variational Autoencoders (VAE): Modification 2/2

* VAEs impose a spatial structure on the latent space, thereby improving the **data generation** and **data interpolation** aspects of AEs


* This is done using two modifications to the original AE 
  
    - **Imposing the structure to the sampled distribution** via KL divergence 
    
    - Without such a penalty, the model could place the encoded distributions (defined by $\mu$ and $\sigma$) so far apart that the points in between their means lose their meaning
    
    - This results in the same problem as in AEs, where latent points in between two clusters lose their meaning
    
    - Thus, we enforce a standard Gaussian prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$ on the distribution defined by the encoded $\mu$ and $\sigma$
    
    - This encourages the VAE to distribute its latent representations around the origin of the latent space, thereby improving the **data interpolation** aspect of VAEs


[[Reference] Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)

### Variational Autoencoders (VAE): Modification 2/2

* VAEs impose a spatial structure on the latent space, thereby improving the **data generation** and **data interpolation** aspects of AEs


* This is done using two modifications to the original AE 
  
    - **Imposing the structure on the sampled distribution** via KL divergence 
    
    - KL divergence between $q(z) \sim \mathcal{N}(\mu, \sigma^2)$ and $p(z) \sim \mathcal{N}(0, 1)$ has a simple form derived in the next slide
    
    $$ \mathcal{L}_{VAE} = \mathcal{L} + \sum_{i=1}^{N} KL \big(\mathcal{N}(\mu_i, diag(\sigma_i^2)) \,||\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big) $$
    
    $$ \mathcal{L}_{VAE} = \mathcal{L} - \frac{1}{2}\sum_{i=1}^{N} \sum_{k=1}^{d_{latent}}\big(\log \sigma_{i,k}^2 - (\mu_{i,k}^2 + \sigma_{i,k}^2) + 1 \big) $$

    
    
    
 We will implement the above in our practical session
 
 [[Reference] Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)
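
Ahead of the practical session, here is a small sketch of the KL term in the VAE loss. It takes the encoder outputs `mu` and `sigma` (standard deviations) as arrays of shape `(N, d_latent)` and a precomputed reconstruction loss; the function names are illustrative and the practical session may structure this differently.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu_i, diag(sigma_i^2)) || N(0, I) ) for each example i:
    # -1/2 * sum_k ( log sigma_k^2 - (mu_k^2 + sigma_k^2) + 1 )
    var = sigma ** 2
    return -0.5 * np.sum(np.log(var) - (mu ** 2 + var) + 1.0, axis=1)

def vae_loss(reconstruction_loss, mu, sigma):
    # L_VAE = L + sum_i KL_i
    return reconstruction_loss + np.sum(kl_to_standard_normal(mu, sigma))

mu = np.array([[0.0, 0.0],    # matches the prior exactly
               [1.0, 0.0],    # shifted mean
               [0.0, 0.0]])   # inflated variance in the first dimension
sigma = np.array([[1.0, 1.0],
                  [1.0, 1.0],
                  [2.0, 1.0]])

print(kl_to_standard_normal(mu, sigma))
print(vae_loss(reconstruction_loss=0.25, mu=mu, sigma=sigma))
```

The first example incurs no penalty because its encoded distribution already equals the prior; shifting the mean or changing the variance makes the term positive.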

### KL(q || p) derivation (optional) 

Let $q(z) \sim \mathcal{N}(\mu, \sigma^2)$ and $p(z) \sim \mathcal{N}(0, 1)$

$$ \log q(z) \quad= -\frac{1}{2}\log 2\pi \sigma^2 - \frac{1}{2}\Big(\frac{z - \mu}{\sigma}\Big)^2$$

$$ \log \frac{q}{p} \quad= \log q - \log p \quad= -\frac{1}{2}\log 2\pi \sigma^2 - \frac{1}{2}\Big(\frac{z - \mu}{\sigma}\Big)^2 + \frac{1}{2} \log 2\pi + \frac{1}{2}z^2 \quad= -\frac{1}{2}\log \sigma^2 + \frac{1}{2} z^2 - \frac{1}{2}\Big(\frac{z-\mu}{\sigma}\Big)^2$$
    
    
$$KL (q || p ) \quad= \int q(z) \log\frac{q(z)}{p(z)} dz  \quad= -\frac{1}{2}\int  \log \sigma^2 \, q(z) \, dz + \frac{1}{2} \int z^2 q(z) \, dz - \frac{1}{2} \int \Big(\frac{z-\mu}{\sigma}\Big)^2 q(z) \, dz \quad= -\frac{1}{2} \big(\log \sigma^2 - (\mu^2 + \sigma^2) + 1 \big)$$

where the above follows from $\int q(z) dz = 1$, $\int \big(\frac{z-\mu}{\sigma}\big)^2 q(z) dz = \frac{Var(z)}{\sigma^2} = 1$, and the following [identity](https://en.wikipedia.org/wiki/Variance#Definition)

$$\int z^2 q(z)dz = E_{q}[z^2] = E_q[z]^2 + Var(z) = \mu^2 + \sigma^2$$ 

For a multivariate independent Gaussian, we get

$$ KL(q || p) = -\frac{1}{2}\sum_{k=1}^{d_{latent}}\big(\log \sigma_k^2 - (\mu_k^2 + \sigma_k^2) + 1 \big)$$
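
The univariate closed form can be checked numerically by integrating $q(z) \log \frac{q(z)}{p(z)}$ on a fine grid. The grid width and resolution below are arbitrary choices that should be ample for a Gaussian integrand.

```python
import numpy as np

def kl_closed_form(mu, sigma):
    # -1/2 * ( log sigma^2 - (mu^2 + sigma^2) + 1 )
    return -0.5 * (np.log(sigma ** 2) - (mu ** 2 + sigma ** 2) + 1.0)

def kl_numerical(mu, sigma, num_points=200001, width=12.0):
    # Riemann sum of q(z) * (log q(z) - log p(z)) over mu +/- width * sigma
    z = np.linspace(mu - width * sigma, mu + width * sigma, num_points)
    log_q = -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((z - mu) / sigma) ** 2
    log_p = -0.5 * np.log(2 * np.pi) - 0.5 * z ** 2
    integrand = np.exp(log_q) * (log_q - log_p)
    dz = z[1] - z[0]
    return np.sum(integrand) * dz

mu, sigma = 1.5, 0.7
print(kl_closed_form(mu, sigma), kl_numerical(mu, sigma))
```

The two values should agree to several decimal places, and both are zero when $\mu = 0$ and $\sigma = 1$.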

### Other VAEs

- **$\beta$-VAE [1]**: Weights the KL divergence term by a factor $\beta$, such that $\beta > 1$ encourages an efficient and disentangled latent representation that supports better generalization to unseen data


- **Vector-Quantized VAE (VQ-VAE) [2]**: The encoder learns a discrete latent variable, a more natural fit for problems like language, speech, reasoning, etc. 


- And many others...

[[1] beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework](https://openreview.net/forum?id=Sy2fzU9gl)

[[2] Neural Discrete Representation Learning](https://proceedings.neurips.cc/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf)

[[3] Generating Diverse High-Fidelity Images with VQ-VAE-2](https://arxiv.org/abs/1906.00446)

### Autoencoder (Applications)

* dimensionality reduction 

* visualization

* feature extraction

* anomaly detection 

* semi-supervised learning 