# Autoencoders

<center><img width=800 src="https://github.com/jordanott/DeepLearning/blob/master/Figures/f_tensorflow.png?raw=true">

# Learning a Generative Model
* We are given a set of examples, e.g. images of dogs

<center><img src="https://github.com/jordanott/DeepLearning/blob/master/Figures/model_family.png?raw=true">

* **Goal:** learn a probability distribution $p(x)$ over images $x$ such that:

* **Generation:** If we sample $x_{new} \sim p(x)$, $x_{new}$ should look like a dog (sampling)

* **Density estimation:** $p(x)$ should be high if $x$ looks like a dog, and low otherwise (anomaly detection)

* **Unsupervised representation learning:** Learn what these images have in common, e.g., ears, tail, etc. (features)

# Generative vs Discriminative Models

<center><img width=70% src="https://i.stack.imgur.com/Xrmqg.png">

Both are used in supervised learning where you want to learn a rule that maps input $x$ to output $y$, given a number of training examples of the form $\{(x_i,y_i)\}$. A generative model (e.g., naive Bayes) explicitly models the joint probability distribution $p(x,y)$ and then uses the Bayes rule to compute $p(y|x)$. On the other hand, a discriminative model (e.g., logistic regression) directly models $p(y|x)$.

Some people argue that the discriminative model is better in the sense that it directly models the quantity you care about ($y$), so you don't have to spend your modeling efforts on the input $x$ (you need to compute $p(x|y)$ as well in a generative model). However, the generative model has its own advantages such as the capability of dealing with missing data, etc. For some comparison, you can take a look at this paper: [On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes](http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf)

There can be cases when one model is better than the other (e.g., discriminative models usually tend to do better if you have lots of data; generative models may be better if you have some extra unlabeled data). In fact, there exist hybrid models too that try to bring in the best of both worlds. See this paper for an example: [Principled hybrids of generative and discriminative models](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.8245&rep=rep1&type=pdf)

# Discriminative Model (Logistic Regression)

* $p(y = 1 | x, \theta) = \sigma(\theta_0 + \sum_{i=1}^n \theta_i x_i)$
* Directly models $p(y|x)$

# Example: Naive Bayes for Classification
* Classify e-mails as spam $(Y = 1)$ or not spam $(Y = 0)$
    * Let 1 : n index the words in our vocabulary (e.g., English)
    * $x_i = 1$ if word $i$ appears in an e-mail, and 0 otherwise
    * E-mails are drawn according to some distribution: $p(Y, x_1, ..., x_n)$
* Suppose that words are conditionally independent given $y$:
    *  $p(y, x_1, ..., x_n) = p(y)\prod_{i=1}^n p(x_i|y)$
    * **Estimate** parameters from training data. **Predict** with Bayes rule:


\begin{equation}
p(Y=1|x_1, ..., x_n) = \frac{p(Y=1)\prod_{i=1}^n p(x_i|Y=1)}{\sum_{y \in \{0,1\}} p(Y=y) \prod_{i=1}^n p(x_i|Y=y)}
\end{equation}
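The estimate-then-predict recipe can be sketched on a toy data set. This is only an illustration: the three-word vocabulary, the six e-mails, and the Laplace smoothing constant `alpha` are assumptions that do not come from the text above.

```python
def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(Y=c) and p(x_i=1 | Y=c) from binary features.

    alpha is a Laplace smoothing constant (an assumption, used so that
    no estimated probability is exactly 0 or 1).
    """
    n_words = len(X[0])
    prior, word_prob = {}, {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior[c] = len(rows) / len(X)
        word_prob[c] = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_words)
        ]
    return prior, word_prob

def joint(x, c, prior, word_prob):
    """p(Y=c) * prod_i p(x_i | Y=c)."""
    p = prior[c]
    for xi, q in zip(x, word_prob[c]):
        p *= q if xi == 1 else 1.0 - q
    return p

def posterior_spam(x, prior, word_prob):
    """p(Y=1 | x) via Bayes rule, normalizing over both classes."""
    num = joint(x, 1, prior, word_prob)
    return num / (num + joint(x, 0, prior, word_prob))

# Hypothetical vocabulary: ["free", "meeting", "winner"]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

prior, word_prob = fit_naive_bayes(X, y)
p_spam = posterior_spam([1, 0, 1], prior, word_prob)  # "free" and "winner"
p_ham = posterior_spam([0, 1, 0], prior, word_prob)   # only "meeting"
```

On this toy data the first e-mail gets a posterior spam probability of 0.96 and the second 0.04.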

# Generative Models (Naive Bayes)

* $p(Y|x_1, ..., x_n) = \frac{p(Y, x_1, ..., x_n)}{p(x_1, ..., x_n)}$
* Explicitly models the joint probability distribution $p(x,y)$ 
* Uses the Bayes rule to compute $p(y|x)$

# Maximum Likelihood
* Data: $p_{data}(x)$
* Parameters: $\theta$
* Model: $p_\theta(x)$
* Samples: $x \sim p_{data}(x)$
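Maximum likelihood picks the parameters $\theta$ under which the observed samples are most probable, i.e. it maximizes $\frac{1}{n}\sum_i \log p_\theta(x_i)$. A minimal sketch, assuming a 1-D Gaussian model family and a synthetic stand-in for $p_{data}$ (both are illustrative choices, not part of the slides):

```python
import math
import random

def gaussian_log_pdf(x, mu, sigma):
    """log N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def avg_log_likelihood(data, mu, sigma):
    """(1/n) * sum_i log p_theta(x_i) with theta = (mu, sigma)."""
    return sum(gaussian_log_pdf(x, mu, sigma) for x in data) / len(data)

rng = random.Random(0)
# Stand-in for samples x ~ p_data(x); the true mean and std are assumptions.
data = [rng.gauss(2.0, 0.5) for _ in range(2000)]

# For a Gaussian the maximum likelihood estimate has a closed form:
# the sample mean and the (biased) sample standard deviation.
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

best = avg_log_likelihood(data, mu_hat, sigma_hat)
worse = avg_log_likelihood(data, mu_hat + 0.3, sigma_hat)  # any other theta scores lower
```

For richer model families there is no closed form, and the same objective is maximized by gradient-based optimization.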

<center><img src='https://github.com/jordanott/DeepLearning/blob/master/Figures/data_distributions.png?raw=true' width=800>

# Latent Variables
* **Latent:** hidden or concealed

# Latent Variable Example
* Your **health** is a latent variable
* There is no single measurement of “health”; it is a rather abstract concept
* Measure physical properties from our bodies
    * Blood pressure
    * Cholesterol level
    * Weight
    * Blood sugar
    * Temperature
    
* These **measurements/observations** give us a clue about a person's health

# Latent Variables for Images

<center><img src="https://github.com/jordanott/DeepLearning/blob/master/Figures/latent_variable_model_images.png?raw=true">

* Only shaded variables x are observed in the data (pixel values)
* Latent variables **z** correspond to high level features
    * If $z$ is chosen properly, $p(x|z)$ could be much simpler than $p(x)$
    * If we had trained this model, then we could identify features via $p(z | x)$, e.g., $p(EyeColor = Blue|x)$
* Challenge: Very difficult to specify these conditionals by hand

# Latent Variable Models

model:  
$p_\theta(x, z) = p_\theta(x|z)p_\theta(z)$

* joint $p_\theta(x, z)$
* conditional likelihood $p_\theta(x|z)$
* prior $p_\theta(z)$

marginalization:  
$p_\theta(x) = \int p_\theta(x,z)dz$
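The marginalization integral can be approximated by sampling from the prior, since $p_\theta(x) = \textbf{E}_{p_\theta(z)}[p_\theta(x|z)]$. The sketch below assumes a toy model that is not part of the slides, $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(z, 1)$, chosen because its exact marginal $\mathcal{N}(0, 2)$ is known and can be used as a check.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_mc(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x | z) = N(x; z, 1)
    return total / n_samples

x = 1.0
estimate = marginal_mc(x)
exact = normal_pdf(x, 0.0, 2.0)  # closed-form marginal of the toy model
```

In one dimension this works well, but for high-dimensional $z$ most prior samples explain $x$ poorly, so the naive estimator becomes very noisy.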


# Encoder network 
* $z = g_\phi(x)$
* Translates the original high-dimension input, $x$, into the latent low-dimensional code, $z$
* The input size is larger than the output size


# Decoder network
* $x' = f_\theta(g_\phi(x)) = f_\theta(z)$
* Recovers the data from the code
* Typically built from progressively larger layers, up to the size of the input

# Architecture 
<center><img src="https://lilianweng.github.io/lil-log/assets/images/autoencoder-architecture.png" width="1000">

# Training
* $(\theta, \phi)$ are learned together
* $\mathbf{x} \approx f_\theta(g_\phi(\mathbf{x}))$

\begin{equation}
    L_\text{AE}(\theta, \phi) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\mathbf{x}^{(i)})))^2
\end{equation}
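Joint training of $(\theta, \phi)$ on the reconstruction loss can be sketched with the smallest possible autoencoder: a linear encoder from 2-D inputs to a 1-D code and a linear decoder back. The data, the initial weights, and the learning rate are illustrative assumptions, and the gradients are written out by hand so that the example has no dependencies.

```python
import random

rng = random.Random(0)
# Toy data (assumed): 2-D points lying close to the line x2 = 2 * x1.
data = []
for _ in range(200):
    t = rng.gauss(0.0, 1.0)
    data.append((t + rng.gauss(0.0, 0.05), 2.0 * t + rng.gauss(0.0, 0.05)))

def encode(x, phi):
    """g_phi: 2-D input -> 1-D code z."""
    return phi[0] * x[0] + phi[1] * x[1]

def decode(z, theta):
    """f_theta: 1-D code -> 2-D reconstruction x'."""
    return (theta[0] * z, theta[1] * z)

def loss(data, theta, phi):
    """L_AE: mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        x_rec = decode(encode(x, phi), theta)
        total += (x[0] - x_rec[0]) ** 2 + (x[1] - x_rec[1]) ** 2
    return total / len(data)

phi = [0.1, 0.1]    # encoder weights
theta = [0.1, 0.1]  # decoder weights
lr = 0.02
losses = [loss(data, theta, phi)]

for step in range(300):
    g_theta = [0.0, 0.0]
    g_phi = [0.0, 0.0]
    for x in data:
        z = encode(x, phi)
        x_rec = decode(z, theta)
        r = (x_rec[0] - x[0], x_rec[1] - x[1])           # residual x' - x
        dz = 2.0 * (r[0] * theta[0] + r[1] * theta[1])   # dL/dz via the chain rule
        for j in range(2):
            g_theta[j] += 2.0 * r[j] * z / len(data)
            g_phi[j] += dz * x[j] / len(data)
    # Encoder and decoder are updated together in every step.
    theta = [theta[j] - lr * g_theta[j] for j in range(2)]
    phi = [phi[j] - lr * g_phi[j] for j in range(2)]
    losses.append(loss(data, theta, phi))
```

Because the data is essentially one-dimensional, a single latent unit is enough to bring the loss close to the noise level.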

# Example
* Latent dimension is 2

<center><img src="https://github.com/jordanott/DeepLearning/blob/master/Figures/ae.png?raw=true" width="1000">

# Visualize Latent Space
<center><img src="https://www.researchgate.net/profile/Ehsan_Hosseini_Asl2/publication/275960143/figure/fig3/AS:392026551013379@1470477821195/Visualization-of-MNIST-handwritten-digits-196-higher-representation-of-digits-computed.png">

In [2]:
from IPython.display import HTML

# Interpolate Latent Variable

In [3]:
HTML("""<video alt="test" controls width=100% height=500><source src="https://gertjanvandenburg.com/figures/autoencoder/latent_circle.mp4" type="video/mp4"></video>""")
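Interpolating a latent variable means generating a sequence of codes $z$ and decoding each one. The file name of the video suggests a circular path through a 2-D latent space, but that is an assumption; the sketch below only builds such paths, and the trained decoder that would turn each code into an image is not included. The helper names are hypothetical.

```python
import math

def latent_circle(n_points, radius=1.0):
    """Return n_points 2-D latent codes evenly spaced on a circle."""
    return [
        (radius * math.cos(2 * math.pi * k / n_points),
         radius * math.sin(2 * math.pi * k / n_points))
        for k in range(n_points)
    ]

def interpolate(z_a, z_b, n_points):
    """Return n_points codes on the straight line from z_a to z_b (n_points >= 2)."""
    path = []
    for k in range(n_points):
        t = k / (n_points - 1)
        path.append(tuple((1 - t) * a + t * b for a, b in zip(z_a, z_b)))
    return path

circle_path = latent_circle(8, radius=2.0)
line_path = interpolate((0.0, 0.0), (1.0, 2.0), 5)
# Each code in either path would be passed through the decoder f_theta(z).
```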

# Generate Samples
<center><img src="https://blog.keras.io/img/ae/vae_digits_manifold.png" width="500">

<center><img src="https://scontent-sjc3-1.xx.fbcdn.net/v/t1.0-9/66808349_748631585591344_5618925140046774272_n.jpg?_nc_cat=105&_nc_oc=AQkAVDvPGtM3JS6kLpptRRy1jd4AR_ujeLo_1rLvvb7JGHuf-nI6G5gi4rDFsvC9dI5t-W8fw-A7f5Vzp4bS996M&_nc_ht=scontent-sjc3-1.xx&oh=e898657eff72ab037e870590a117688d&oe=5DB70125" width=400>

# Denoising Autoencoder
* Risk of overfitting: with enough capacity, an AE can simply learn the identity function
* The risk is highest when the network has more parameters than there are data points

* Partially corrupt the input by adding noise
* $\mathcal{M}_\mathcal{D}$ adds noise to the original input
* $\tilde{\mathbf{x}} \sim \mathcal{M}_\mathcal{D}(\tilde{\mathbf{x}} \vert \mathbf{x})$

# Training

\begin{aligned}
\tilde{\mathbf{x}}^{(i)} &\sim \mathcal{M}_\mathcal{D}(\tilde{\mathbf{x}}^{(i)} \vert \mathbf{x}^{(i)})\\
L_\text{DAE}(\theta, \phi) &= \frac{1}{n} \sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\tilde{\mathbf{x}}^{(i)})))^2
\end{aligned}
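The key detail of the denoising objective is that the network sees the corrupted input while the error is measured against the clean one. A minimal sketch, assuming additive Gaussian noise as the corruption process $\mathcal{M}_\mathcal{D}$ (masking noise is another common choice) and using the identity map as a stand-in for a trained network:

```python
import random

def corrupt(x, rng, noise_std):
    """Sample x_tilde ~ M_D(x_tilde | x) by adding Gaussian noise to each component."""
    return [xi + rng.gauss(0.0, noise_std) for xi in x]

def dae_loss(data, reconstruct, rng, noise_std):
    """L_DAE: reconstruct from the corrupted input, compare with the clean input."""
    total = 0.0
    for x in data:
        x_tilde = corrupt(x, rng, noise_std)
        x_rec = reconstruct(x_tilde)
        total += sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return total / len(data)

def identity(v):
    """Stand-in 'network' that copies its input."""
    return list(v)

rng = random.Random(0)
data = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]]  # toy inputs (assumed)

clean_loss = dae_loss(data, identity, rng, noise_std=0.0)  # plain AE objective
noisy_loss = dae_loss(data, identity, rng, noise_std=0.3)  # denoising objective
```

Without noise the identity map reaches zero loss, while under corruption it no longer does, so the network has to learn structure in the data in order to undo the noise.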

# Denoising AE Architecture
![](https://lilianweng.github.io/lil-log/assets/images/denoising-autoencoder-architecture.png)

# MNIST Results
<center><img src="https://cdn-images-1.medium.com/max/1600/1*hfzos8xmCGjrgpTW78PFLg@2x.png" width="700">

# Variational Autoencoder
---
<center><img src="https://jaan.io/images/encoder-decoder.png"></center>

* We want to know how well the variational posterior (i.e. encoder) $q(z|x)$ approximates the true posterior $p(z|x)$ (which is unknown)
* To do this we'll use the [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)  

\begin{align}
\textbf{KL}(q_\lambda(z | x) || p(z|x)) &= \textbf{E}_{q_\lambda(z | x)} [\log \frac{q_\lambda(z | x)}{p(z|x)}] && \text{Definition of KL} \\
&= \textbf{E}_{q_\lambda(z | x)} [\log q_\lambda(z | x)] - \textbf{E}_{q_\lambda(z | x)} [\log p(x,z)] + \log p(x) && \text{Expanding log terms;} p(z|x) = \frac{p(x,z)}{p(x)}
\end{align}

* **GOAL:** Find the variational parameters $\lambda$ that minimize $\textbf{KL}(q_\lambda(z | x) || p(z|x))$

\begin{equation}
\lambda^* = \arg\min_\lambda \textbf{KL}(q_\lambda(z | x) || p(z|x))
\end{equation}

* **PROBLEM:** Calculating the KL is intractable because of the $\log p(x)$ term; this would require an integral over all $z$: $p(x) = \int_z p(x,z) dz$

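* **SOLUTION:** $\log p(x)$ does not depend on $\lambda$, so it is a constant in this optimization and we can optimize the remaining terms instead
    * Rearranging the expansion of the KL above and using $\textbf{KL} \geq 0$ gives a lower bound on $\log p(x)$:

\begin{align}
\log p(x) &= \textbf{KL}(q_\lambda(z | x) || p(z|x)) + \textbf{E}_{q_\lambda(z | x)} [\log p(x,z)] - \textbf{E}_{q_\lambda(z | x)} [\log q_\lambda(z | x)] \\
&\geq \textbf{E}_{q_\lambda(z | x)} [\log p(x,z)] - \textbf{E}_{q_\lambda(z | x)} [\log q_\lambda(z | x)] && \text{Evidence lower bound (ELBO)}
\end{align}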

\begin{align}
\textbf{ELBO} &= - \textbf{E}_{q_\lambda(z | x)} [\log q_\lambda(z | x)] + \textbf{E}_{q_\lambda(z | x)} [\log p(x,z)]
\end{align}

* Because $\log p(x) = \textbf{KL} + \textbf{ELBO}$ does not depend on $\lambda$, minimizing the KL is equivalent to maximizing the ELBO
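The relationship $\log p(x) = \textbf{ELBO} + \textbf{KL}(q_\lambda(z|x) || p(z|x))$, which follows from the expansion of the KL above, can be checked numerically. The sketch assumes a toy model that is not in the slides, $z \sim \mathcal{N}(0,1)$ and $x|z \sim \mathcal{N}(z,1)$, for which the evidence $\mathcal{N}(x; 0, 2)$ and the true posterior $\mathcal{N}(x/2, 1/2)$ are known in closed form, and a Gaussian $q$ with mean `m` and variance `v`.

```python
import math

def log_normal_pdf(x, mean, var):
    """log N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def kl_normal(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v):
    """ELBO = E_q[log p(x, z)] - E_q[log q(z)] for q = N(m, v) in the toy model."""
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + v)  # E_q[log p(x|z)]
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + v)      # E_q[log p(z)]
    entropy = 0.5 * math.log(2 * math.pi * math.e * v)                   # -E_q[log q(z)]
    return e_log_lik + e_log_prior + entropy

x = 1.0
log_px = log_normal_pdf(x, 0.0, 2.0)   # exact log evidence
post_mean, post_var = x / 2.0, 0.5     # exact posterior p(z | x)

m, v = 0.3, 0.64                       # an arbitrary, imperfect q
gap = kl_normal(m, v, post_mean, post_var)
elbo_q = elbo(x, m, v)
elbo_at_posterior = elbo(x, post_mean, post_var)
```

Here `elbo_q + gap` equals `log_px`, and the bound becomes tight when $q$ is set to the true posterior.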

\begin{align}
\textbf{ELBO} &= \textbf{E}_{q_\lambda(z | x)} [\log p(x |z)] - KL(q_\lambda(z | x) || p(z)) && \text{Expanding this gives the original definition of ELBO above} 
\end{align}

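### Loss
\begin{align}
\mathcal{L}_i(\theta, \phi) &= \textbf{E}_{q_\phi(z | x_i)} [\log p_\theta(x_i |z)] - KL(q_\phi(z | x_i) || p(z))
\end{align}
* $\mathcal{L}_i$ is the ELBO for a single data point $x_i$, with encoder parameters $\phi$ and decoder parameters $\theta$; training maximizes it, so the loss that is minimized is $-\mathcal{L}_i$
* The first term $\textbf{E}_{q_\phi(z | x_i)} [\log p_\theta(x_i |z)]$ measures reconstruction quality (its negative is the reconstruction loss)
* The second term $KL(q_\phi(z | x_i) || p(z))$ acts as a regularizer
    * Encourages the encoder distribution $q_\phi(z | x_i)$ to stay close to the prior $p(z)$
    * Keeps the codes of similar inputs $x$ close together in latent space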
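When the encoder outputs a diagonal Gaussian and the prior is a standard normal, the KL term has the closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. Both distributional choices are assumptions here (they are the usual ones for a Gaussian VAE), and the example values of `mu` and `log_var` are made up. The sketch checks the closed form against a sampling estimate.

```python
import math
import random

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, n_samples=20000, seed=0):
    """Estimate the same KL as E_q[log q(z) - log p(z)] using samples z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, log_var):
            var = math.exp(lv)
            z = rng.gauss(m, math.sqrt(var))
            log_q = -0.5 * (math.log(2 * math.pi) + lv) - (z - m) ** 2 / (2 * var)
            log_p = -0.5 * math.log(2 * math.pi) - 0.5 * z * z
            total += log_q - log_p
    return total / n_samples

# Hypothetical encoder outputs for one data point (2 latent dimensions).
mu = [0.5, -1.0]
log_var = [0.0, -1.0]

kl_exact = kl_diag_gaussian_to_standard(mu, log_var)
kl_sampled = kl_monte_carlo(mu, log_var)
```

The term is zero exactly when the encoder output matches the prior (`mu = 0`, `log_var = 0`), which is why it pulls codes toward the origin.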

# VAE Architecture
![](https://lilianweng.github.io/lil-log/assets/images/vae-gaussian.png)

# Reparameterization Trick
<center><img src="https://lilianweng.github.io/lil-log/assets/images/reparameterization-trick.png" width=800>

\begin{equation}
    \mathcal{L}(x; \theta, \phi) = \textbf{E}_{q_\phi(z|x)}[\log p(z,x;\theta)-\log q_\phi(z|x)]
\end{equation}

\begin{equation}
    = \textbf{E}_{q_\phi(z|x)}[\log p(z,x;\theta)-\log p(z) + \log p(z) - \log q_\phi(z|x)]
\end{equation}

\begin{equation}
    = \textbf{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)||p(z))
\end{equation}

1. Take data point $x_i$
2. Map it to $\hat{z}$ by sampling from $q_\phi(z|x_i)$ (encoder)
3. Reconstruct $\hat{x}$ by sampling from $p(x|\hat{z}; \theta)$ (decoder)

What does the training objective $\mathcal{L}(x; \theta, \phi)$ do?

* First term encourages $\hat{x}\approx x_i$ ($x_i$ likely under $p(x|\hat{z} ; \theta)$)

* Second term encourages $\hat{z}$ to be likely under the prior $p(z)$
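The reparameterization trick writes the sample as $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, so the randomness no longer depends on the parameters and gradients can flow through $\mu$ and $\sigma$. The sketch below uses $z^2$ as a simple stand-in for the decoder term (an assumption made so the result can be checked): since $\textbf{E}[z^2] = \mu^2 + \sigma^2$, the exact gradient is $(2\mu, 2\sigma)$.

```python
import random

def sample_z(mu, sigma, eps):
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + sigma * eps

def grad_expected_square(mu, sigma, n_samples=50000, seed=0):
    """Monte Carlo gradient of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. (mu, sigma)."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)        # noise drawn independently of the parameters
        z = sample_z(mu, sigma, eps)
        # Chain rule through z: dz/dmu = 1 and dz/dsigma = eps.
        g_mu += 2.0 * z
        g_sigma += 2.0 * z * eps
    return g_mu / n_samples, g_sigma / n_samples

mu, sigma = 1.5, 0.5                     # hypothetical encoder outputs
g_mu, g_sigma = grad_expected_square(mu, sigma)
# Exact values for comparison: (2 * mu, 2 * sigma) = (3.0, 1.0)
```

In a VAE the same construction is applied with the decoder log-likelihood in place of $z^2$, and automatic differentiation supplies the chain rule.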

# References
* [AE in Keras](https://blog.keras.io/building-autoencoders-in-keras.html)
* [From AE to Beta VAE](https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html)