In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

$$
\def\prs#1#2{\mathcal{p}_{#2}(#1)}
\def\qr#1{\mathcal{q}(#1)}
\def\qrs#1#2{\mathcal{q}_{#2}(#1)}
$$

In [2]:
from IPython.display import Image

# Autoencoders

An *Autoencoder* (AE) is a Neural Network composed of two parts:
- an *Encoder*, which takes the input $\x$ and produces an intermediate ("latent") representation $\z$ as output
- a *Decoder*, which takes $\z$ as input and attempts to reproduce $\x$ as output

Both the Encoder and Decoder are Neural Networks
- their weights are learned by training them in tandem
- on training set $\langle \X, \y \rangle = \langle \X, \X \rangle$
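
As a minimal sketch of training the two parts in tandem on $\langle \X, \X \rangle$, the following uses NumPy with a single linear layer for each of the Encoder and Decoder. The toy data, the layer sizes, the learning rate, and names such as `W_enc` and `W_dec` are assumptions for illustration; they are not taken from the notebooks referenced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m examples of length n that lie on an r-dimensional subspace
m, n, r = 200, 8, 2
X = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))

# Encoder and Decoder are each a single linear layer (no bias, no activation)
W_enc = rng.normal(scale=0.1, size=(n, r))   # maps x to z
W_dec = rng.normal(scale=0.1, size=(r, n))   # maps z to x_tilde

def reconstruction_mse(X, W_enc, W_dec):
    """Mean squared difference between X and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_loss = reconstruction_mse(X, W_enc, W_dec)

lr = 0.01   # assumed learning rate
for step in range(2000):
    Z = X @ W_enc          # latent representations
    X_tilde = Z @ W_dec    # reconstructions
    err = X_tilde - X      # the target is the input itself
    # Gradients of the mean squared reconstruction error (up to a constant factor)
    grad_dec = Z.T @ err / m
    grad_enc = X.T @ (err @ W_dec.T) / m
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = reconstruction_mse(X, W_enc, W_dec)
```

Both weight matrices are updated from the same reconstruction error, which is what training "in tandem" amounts to in this sketch.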

<table>
    <tr>
        <th><center>Autoencoder</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_vanilla.png"></td>
    </tr>
</table>

A non-trivial Autoencoder (i.e., one in which the parts are not merely the Identity transformation)
- has latent representation $\z$ of dimension less than input $\x$
- $\z$ is a *bottleneck*
- forcing dimensionality reduction, like PCA
- causing the Decoder's inversion of the Encoder to be imperfect


## Comparison of Autoencoders and PCA

Both the AE and PCA are methods to create representations of an input of length $n$
via reduced dimensionality vectors of length $r \le n$

They are similar in *purpose* but different in *detail*
- PCA creates $n$ vectors (of length $n$) called *components*
    - Each $\x^\ip$ of length $n$ is represented as a linear combination of $r \le n$ components
        - The reduced dimensionality representation is a vector of length $r \le n$: the weights used in the linear combination
    - The components are common to all inputs $\x^\ip$
- Autoencoder
    - the reduced dimensionality representation is a vector of length $r \le n$
    - the representation is unique to $\x^\ip$: not shared "components"
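
The PCA side of this comparison can be sketched with NumPy's SVD. The random data and the names `components` and `weights` are illustrative assumptions; the point is only that the components are shared while the weights are per-example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 100, 5, 2
X = rng.normal(size=(m, n))

# Center the data, then take the SVD: the rows of Vt are the n components (each of length n)
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:r]            # r retained components, shared by every example
weights = Xc @ components.T    # shape (m, r): reduced representation of each example

# Each example is approximated by a linear combination of the shared components
X_approx = weights @ components + mean
```

An Autoencoder's Encoder would also map each row of `X` to a vector of length `r`, but there is no analogous shared `components` matrix to read off from it.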
 

Our interest in Autoencoders
- Study Functional architecture
    - [TensorFlow Tutorial on Autoencoders](https://www.tensorflow.org/tutorials/generative/autoencoder)
- Generative
    - Create *synthetic* examples $\x'$
    - By sampling $\z'$ from the space of latent representations
    - And inverting them
    
<table>
    <tr>
        <th><center>Generator</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_decoder.jpg" width=80%></td>
    </tr>
</table>

# Uses

As an aside, we mention other use cases


## Dimensionality reduction and Transfer learning

Once the Autoencoder has been trained, we can discard the Decoder
- Use the Encoder to create reduced dimension representations of large and high dimension inputs
    - e.g., image search: replace 3D megapixel images with much shorter 1D vectors
- Transfer to another task

<table>
    <tr>
        <th><center>Autoencoder: Encoder + New head</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_encoder_new_head.jpg" width=90%></td>
    </tr>
</table>

## De-noising Autoencoder

Using an AE for dimensionality reduction is similar to using PCA
- **But** unlike PCA, there is no **explicit** "relative importance" associated with the retained dimensions

But we can *hope* that the information lost through the bottleneck is less important.

A *De-noising Autoencoder* is an Autoencoder trained on a slightly corrupted "noisy" input
- $\langle \X, \y \rangle = \langle \X + \epsilon, \X \rangle$
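
A sketch of how such training pairs might be built with NumPy is shown here. The random stand-in images, the value of `noise_factor`, and the clipping to $[0,1]$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a batch of images with pixel values in [0, 1]
X = rng.uniform(0.0, 1.0, size=(16, 28, 28))

noise_factor = 0.2   # assumed noise level
X_noisy = X + noise_factor * rng.normal(size=X.shape)
X_noisy = np.clip(X_noisy, 0.0, 1.0)   # keep pixels in the valid range

# Training pair for the De-noising Autoencoder: noisy input, clean target
inputs, targets = X_noisy, X
```

The only change from the plain Autoencoder's training set is the input side of the pair; the target remains the uncorrupted $\X$.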

<table>
    <tr>
        <th><center>Autoencoder: Denoising</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_denoising.png" width=90%></td>
    </tr>
</table>

De-noising may be useful as a pre-processing step for cleaning noisy data.

<table>
    <tr>
            <th><center>De-noising autoencoder: noisy inputs, de-noised outputs</center></th>
    </tr>
    <tr>
        <td><img src="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/generative/images/intro_autoencoder_result.png?raw=1" width=1200></td>
    </tr>
</table>

## Autoencoder as Anomaly Detector

By forcing the input $\x$ through a bottleneck, the reconstructed input hopefully has "less important" information stripped away.

We may choose to characterize this lost information as an *anomaly* if the magnitude of the reconstruction error is larger than some threshold.
- Error: noise to be removed
- Signal: something unusual to be flagged for attention
- Signal: a source of alpha
    - Reconstructed input is our "model"'s prediction
    - The noise is divergence from our model
        - trading opportunity ?
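
The thresholding idea can be sketched as follows. The reconstructions are simulated stand-ins rather than the output of a trained Autoencoder, and both the mean absolute error and the "mean plus one standard deviation" threshold rule are assumptions, not the only reasonable choices.

```python
import numpy as np

def reconstruction_errors(X, X_tilde):
    """Per-example mean absolute reconstruction error."""
    return np.mean(np.abs(X - X_tilde), axis=1)

def flag_anomalies(errors, threshold):
    """True where the reconstruction error exceeds the threshold."""
    return errors > threshold

rng = np.random.default_rng(0)

# Stand-ins for normal inputs and their (good) reconstructions
X_normal = rng.normal(size=(100, 20))
X_tilde_normal = X_normal + 0.05 * rng.normal(size=X_normal.shape)

train_errors = reconstruction_errors(X_normal, X_tilde_normal)
# Assumed threshold rule: one standard deviation above the mean error on normal data
threshold = train_errors.mean() + train_errors.std()

# Stand-ins for new inputs that are reconstructed poorly
X_new = rng.normal(size=(5, 20))
X_tilde_new = X_new + 1.0 * rng.normal(size=X_new.shape)
is_anomaly = flag_anomalies(reconstruction_errors(X_new, X_tilde_new), threshold)
```

Whether a flagged example is treated as noise to remove or as a signal to act on is a decision made outside this code.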
        


<table>
    <tr>
            <th>Anomaly Detector</th>
    </tr>
    <tr>
        <td><img src="images/autoencoder_anomaly_normal.png" width=80%></td>
        <td><img src="images/autoencoder_anomaly_anomalous.png" width=80%></td>
    </tr>
    <tr>
        <td><center><img src="images/autoencoder_anomaly_error.png" width=100%></center></td>
    </tr>
</table>

# Details

**Notation summary**

term | dimension &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | meaning 
:---|:---|:---
$\x$ | $n$ | Input
$\tilde\x$ | $n$ | Output: reconstructed $\x$
$\z$ | $n' \ll n$ | Latent representation
$E$  | $\mathbb{R}^n \rightarrow \mathbb{R}^{n'}$ | Encoder
            | | $E(\x) = \z $
$D$  | $\mathbb{R}^{n'} \rightarrow \mathbb{R}^n$ | Decoder
            | | $\tilde\x = D(\z) $
            | | $\tilde\x = D( E(\x) )$
            | | $\tilde\x \approx \x$

## Loss function

The obvious loss functions compare the original $\x^\ip$ and reconstructed $\tilde\x^\ip$ feature by feature:


### Mean Squared Error (MSE)
$$
\loss^\ip = \frac{1}{|\x|} \sum_{j=1}^{|\x|} { (\x^\ip_j - \tx^\ip_j)^2 }
$$

### Binary Cross Entropy

For the special case where *each* original feature is in the range $[0,1]$ (e.g., an image)

$$
\loss^\ip = - \sum_{j=1}^{|\x|} {  \left( \x^\ip_j    \log(\tx^\ip_j) + ( 1 - \x^\ip_j ) \log(1 - \tx^\ip_j) \right) }
$$
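
Both per-example losses can be sketched in NumPy as below. The function names and the `eps` clipping constant are assumptions. The cross entropy carries a leading minus sign so that it is non-negative and smallest for a perfect reconstruction; dividing the squared error by the number of features only rescales the sum.

```python
import numpy as np

def mse_loss(x, x_tilde):
    """Mean over features of the squared differences, for one example."""
    x = np.asarray(x, dtype=float)
    x_tilde = np.asarray(x_tilde, dtype=float)
    return float(np.mean((x - x_tilde) ** 2))

def bce_loss(x, x_tilde, eps=1e-7):
    """Binary cross entropy summed over features, for one example with features in [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_tilde = np.clip(np.asarray(x_tilde, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    return float(-np.sum(x * np.log(x_tilde) + (1.0 - x) * np.log(1.0 - x_tilde)))

x = [0.0, 1.0, 1.0, 0.0]        # original features
good = [0.1, 0.9, 0.8, 0.2]     # a close reconstruction
bad = [0.9, 0.1, 0.2, 0.8]      # a poor reconstruction
```

Under either loss, `good` should score lower than `bad` when compared against `x`.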

# Generative Limitations

We propose to create synthetic examples $\x'$ by sampling $\z$.

Although the synthetic $\x'$ created by this inversion seems appealing, there are some issues
- Assuming we need labeled examples $\langle \x, \y \rangle$
    - we have no control as to the class $\y'$ of the synthetic $\x'$
- Our method of sampling $\z$ is not dependent on the distribution of $\z$
    - In general, the distribution is unknown
    - In particular, the sample may not be representative of any known (e.g., training) true example
    - Even if we obtain $\z$ by slight modification of a particular $\x^\ip$
    > $\z = E( \x^\ip) + \epsilon$
    
    there is no guarantee as to the label or fidelity of $\x' = D(\z)$

To illustrate, we
- create an [autoencoder](autoencoder.ipynb) for Fashion MNIST
    - 10 classes
    - latent representations are vectors of length 64
- obtain the latent representations for a set of test inputs
- create a scatter plot of the latents
    - using PCA to project the high dimensionality latents to 2D
    
<img src="images/autoencoder_latents.png">

As you can see
- the latents are not uniformly distributed
- latents of particular classes (each class depicted with a unique color) form clusters

We can illustrate the latter point via a separate plot of the latents for each class


<img src="images/autoencoder_latents_by_target.png">

Thus, sampling latents uniformly will not necessarily find a latent "in the neighborhood"
- of any of the classes
- of any particular class


We can emphasize the latter point.

Let's explore the neighborhood around the latent representation of a single input
- add random normal noise with varying increments of standard deviation

We might expect to obtain images similar to the original.

<img src="images/autoencoder_perturb_single_img.png" width=90%>

As you can see from the above, even moving in a small radius from the
latent of the original does not guarantee a realistic decoded output.
- So we can't generate a synthetic example of a particular class by a small perturbation of the latent from a genuine image of the class

Next, we conduct an experiment in interpolating between the latents associated with 2 inputs.
- interpolate between the latents and decode
- first plot: 100% first image; 0% second image
- last plot: 100% second image; 0% first image
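
A sketch of the interpolation, assuming straight-line (linear) interpolation between the two latents, is given here. The helper name `interpolate_latents` and the stand-in latents are hypothetical.

```python
import numpy as np

def interpolate_latents(z_first, z_second, num_steps):
    """Latents on the straight line from z_first (first row) to z_second (last row)."""
    z_first = np.asarray(z_first, dtype=float)
    z_second = np.asarray(z_second, dtype=float)
    alphas = np.linspace(0.0, 1.0, num_steps)   # fraction of the second latent
    return np.stack([(1.0 - a) * z_first + a * z_second for a in alphas])

z_a = np.zeros(64)   # stand-ins for the latents of the two inputs
z_b = np.ones(64)
path = interpolate_latents(z_a, z_b, num_steps=5)
```

Decoding each row of `path` in order would give the sequence of plots, from the first image to the second.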

<img src="images/autoencoder_interpolate_2_imgs.png">

As you can see from the intermediate outputs
- not all latents correspond to recognizable classes

Thus, we see issues associated with generating synthetic examples by
simple-minded sampling of the latent space.

# Experiments with Autoencoders

The plots in this notebook were generated by
this [notebook](Autoencoder_practice.ipynb)
- derived from the [TensorFlow tutorial on Autoencoders](https://www.tensorflow.org/tutorials/generative/autoencoder)
- illustrates Latent representation, Denoising, Anomaly Detection
- (secondary objective: study the code)



In [3]:
print("Done")

Done
