# Autoencoders

An **autoencoder** is a neural network that is trained to produce a copy of its input. Internally, the autoencoder contains a hidden layer $h$ which serves as a code to represent the input. The autoencoder can be viewed as two functions: 

* an encoder $h = f(x),$ which encodes the input in the hidden neurons 
* a decoder $\rho = g(h),$ which decodes the hidden layer to produce an output.
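
The two functions above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the `tanh` nonlinearity, and the randomly initialized weights `W_enc`, `b_enc`, `W_dec`, `b_dec` are all assumptions made here, not choices taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 8, 3  # hypothetical sizes chosen for illustration

# Hypothetical parameters; a real model would learn these by training.
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))
b_dec = np.zeros(n_in)

def f(x):
    """Encoder: map the input x to the code h."""
    return np.tanh(W_enc @ x + b_enc)

def g(h):
    """Decoder: map the code h back to input space."""
    return W_dec @ h + b_dec

x = rng.normal(size=n_in)
h = f(x)               # code with n_hidden entries
reconstruction = g(h)  # output with the same shape as x
```

With untrained weights the reconstruction will be poor; training adjusts the parameters so that `g(f(x))` moves closer to `x`.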

If an autoencoder simply learns to set $g(f(x)) = x$ everywhere, it is not very useful. We instead place constraints on the autoencoder so that it can only copy its input approximately, which forces it to learn the most useful properties of the input.

Historically, these models have been used for dimensionality reduction and feature learning. More recently, they have also taken an important role in generative modeling. 

## Undercomplete Autoencoders

It is seldom the output of an autoencoder that we care about. Rather, we are interested in training the model so that $\textbf h$ takes on useful properties. A common way of doing this is to make the autoencoder *undercomplete*. An undercomplete autoencoder is one whose hidden layer has fewer dimensions than its input. This forces the model to capture only the aspects of the input that are most important for reconstructing it.

The learning process is the same as with a feedforward network. A loss function $L(x, g(f(x)))$ is defined, penalizing $g(f(x))$ for being dissimilar from $x$. 

When the decoder is linear and $L$ is the mean squared error (MSE), an undercomplete autoencoder learns to span the same subspace as PCA. Autoencoders with nonlinear encoder and decoder functions can learn a more powerful nonlinear generalization of PCA.
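
The relationship to PCA can be explored numerically. The sketch below trains a linear undercomplete autoencoder by plain gradient descent on the MSE and compares its reconstruction error with that of a rank-$k$ PCA projection. The synthetic data, the sizes `n`, `d`, `k`, the learning rate, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 5-D points lying near a 2-D subspace.
n, d, k = 200, 5, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
X += 0.05 * rng.normal(size=(n, d))
X -= X.mean(axis=0)  # center the data, as PCA does

# Linear encoder f(x) = x W_enc and linear decoder g(h) = h W_dec (no biases).
W_enc = 0.1 * rng.normal(size=(d, k))
W_dec = 0.1 * rng.normal(size=(k, d))

def mse(X, W_enc, W_dec):
    """Mean squared reconstruction error over all entries of X."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

lr, steps = 0.02, 5000  # hypothetical hyperparameters
losses = [mse(X, W_enc, W_dec)]
for _ in range(steps):
    H = X @ W_enc                        # codes, shape (n, k)
    G = 2.0 * (H @ W_dec - X) / X.size   # gradient of the MSE w.r.t. the reconstruction
    grad_dec = H.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
    losses.append(mse(X, W_enc, W_dec))

# PCA baseline: project onto the top-k right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T
pca_loss = np.mean((X @ V_k @ V_k.T - X) ** 2)

print(f"autoencoder MSE: {losses[-1]:.6f}, PCA MSE: {pca_loss:.6f}")
```

No rank-$k$ linear reconstruction can beat the PCA projection, so `pca_loss` is a lower bound on the autoencoder's error; if training has converged, the two values should be close.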

# Regularized Autoencoders

Regularization allows us to train autoencoders whose hidden layers have dimension equal to or greater than that of the input layer.

### Sparse Autoencoders

A sparse autoencoder is an autoencoder whose training criterion involves a sparsity penalty $\Omega(\textbf h)$ on the code layer $\textbf h$, in addition to the reconstruction error:

$$ L(\textbf x, g(f(\textbf x))) + \Omega(\textbf h), $$

where $g(\textbf h)$ is the decoder output, and typically we have $\textbf h = f(\textbf x) $, the encoder output. These models are commonly used to learn features for another task, such as classification. 
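
A minimal sketch of this training criterion follows. It assumes a squared-error reconstruction loss and an absolute value penalty $\Omega(\textbf h) = \lambda \sum_i |h_i|$ as the sparsity term; the function name `sparse_autoencoder_loss`, the weight `lam`, and the toy encoder and decoder are hypothetical.

```python
import numpy as np

def sparse_autoencoder_loss(x, f, g, lam=0.1):
    """Reconstruction error plus an L1 sparsity penalty on the code h = f(x)."""
    h = f(x)
    reconstruction_error = np.sum((x - g(h)) ** 2)
    sparsity_penalty = lam * np.sum(np.abs(h))
    return reconstruction_error + sparsity_penalty

# Toy usage with a hypothetical ReLU encoder and linear decoder.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(6, 4))   # overcomplete: 6 code units, 4 inputs
encoder = lambda x: np.maximum(W @ x, 0.0)
decoder = lambda h: W.T @ h
x = rng.normal(size=4)
loss = sparse_autoencoder_loss(x, encoder, decoder, lam=0.1)
```

Increasing `lam` trades reconstruction accuracy for codes with more entries at or near zero.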

One can view the sparse autoencoder framework as approximating maximum likelihood training of a generative model with latent variables. We have visible variables $\textbf x$ and latent variables $\textbf h$, with joint distribution $ p_{model}(\textbf x, \textbf h) = p_{model}(\textbf h)p_{model}(\textbf x | \textbf h).$ $p_{model}(\textbf h)$ is the model's prior distribution over the latent variables, representing the model's beliefs prior to seeing $\textbf x$. The log likelihood can be decomposed as:

$$ \log p_{model}(\textbf x) = \log \sum_{\textbf h} p_{model}(\textbf h, \textbf x). $$

We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value of $\textbf h $. With this chosen $\textbf h$ we are maximizing

$$ \log p_{model}(\textbf h, \textbf x) = \log p_{model}(\textbf h) + \log p_{model}(\textbf x | \textbf h), $$

where the $\log p_{model}(\textbf h)$ term can be sparsity-inducing. For example, a Laplace prior on $\textbf h$ corresponds to an absolute value sparsity penalty.
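
As a worked example of this correspondence, assume a factorized Laplace prior with scale parameter $\lambda$ (the specific form is an assumption made here for illustration):

$$ p_{model}(h_i) = \frac{\lambda}{2} e^{-\lambda |h_i|}. $$

Taking the negative logarithm of the product over all code units gives

$$ -\log p_{model}(\textbf h) = \sum_i \left( \lambda |h_i| - \log \frac{\lambda}{2} \right) = \Omega(\textbf h) + \text{const}, \qquad \Omega(\textbf h) = \lambda \sum_i |h_i|. $$

The constant depends only on $\lambda$ and not on $\textbf h$, so maximizing $\log p_{model}(\textbf h)$ is equivalent to minimizing the absolute value penalty $\Omega(\textbf h)$.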

### Denoising Autoencoders

Rather than adding a penalty to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function. 

A standard autoencoder minimizes a loss function 

$$ L(\textbf x, g(f(\textbf x))), $$

where $L$ is a loss function penalizing the output for being different from the input. An example is the $L^2$ norm of the difference between $\textbf x$ and $g(f(\textbf x)).$ This encourages the model to learn the identity function, mapping $\textbf x$ to itself.

A **denoising autoencoder** instead minimizes a loss function

$$ L(\textbf x, g(f(\mathbf{\tilde{x}}))),$$

where $\mathbf{\tilde{x}}$ is a copy of $\textbf x$ that has been corrupted by some form of noise. A denoising autoencoder, then, is tasked with learning to undo this corruption rather than forming an identity map.
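
The difference from the standard loss can be made concrete with a short sketch. It assumes additive Gaussian noise as the corruption process and squared error as $L$; both are illustrative choices, and the names `corrupt`, `denoising_loss`, and `noise_std` are hypothetical.

```python
import numpy as np

def corrupt(x, rng, noise_std=0.1):
    """Return a noisy copy of x (additive Gaussian noise, an assumed corruption)."""
    return x + rng.normal(scale=noise_std, size=x.shape)

def denoising_loss(x, f, g, rng, noise_std=0.1):
    """Squared error between the clean x and the reconstruction of its corrupted copy."""
    x_tilde = corrupt(x, rng, noise_std)
    return np.sum((x - g(f(x_tilde))) ** 2)

# Toy usage: an identity "autoencoder" cannot undo the noise, so its loss is nonzero.
rng = np.random.default_rng(0)
identity = lambda v: v
x = np.ones(4)
loss = denoising_loss(x, identity, identity, rng, noise_std=0.5)
```

Note that the target in the loss is the clean `x`, not `x_tilde`, which is why the identity map is no longer a perfect solution.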

### Regularizing by Penalizing Derivatives

Another regularization strategy for autoencoders is to penalize the derivatives of $\textbf h$ rather than its norm. Here, we still have

$$ L(\textbf x, g(f(\textbf x))) + \Omega (\textbf h, \textbf x), $$

except now

$$ \Omega (\textbf h, \textbf x) = \lambda \sum_{i} \| \nabla_{\textbf x} h_{i} \|^2. $$

This forces the autoencoder to find a model that is insensitive to slight changes in $\textbf x$. This type of autoencoder is referred to as a **contractive autoencoder**.

# Representational Power, Layer Size, and Depth

Autoencoders do not have to be single-layer. In fact, they can benefit from depth in the same ways feedforward networks do. Depth can reduce both computational cost and the amount of training data needed to effectively represent the input distribution. A common strategy for training deep autoencoders is to greedily pretrain a stack of shallow autoencoders before training the full network. 

# Denoising Autoencoders

A denoising autoencoder is an autoencoder whose input has been corrupted. The goal of this model, then, is to learn to reconstruct the original input given the corrupted input, that is, to model the distribution $p(\textbf x | \mathbf{\tilde{x}}).$ The corruption acts as a regularizer, preventing the model from learning a simple identity function. This model can be trained the same way as a standard autoencoder.

# Learning Manifolds with Autoencoders

Autoencoders exploit the idea that data concentrates around a low-dimensional manifold, or a small set of such manifolds, by attempting to learn the structure of that manifold.

An important characterization of a manifold is the set of its tangent planes. The tangent plane at a point $\textbf x$ on a $d$-dimensional manifold is given by $d$ basis vectors that span the local directions of variation allowed on the manifold.

An autoencoder's training procedure represents a compromise between the following two forces:

* learning a representation $\textbf h$ of input $\textbf x$ from the training distribution such that $\textbf x$ can be approximately recovered from $\textbf h$ through a decoder
* satisfying the regularization constraint, whether architectural or cost-function-based, which makes the model less sensitive to its input.

Together these forces cause the model to learn a hidden representation that captures information about the data-generating distribution. The important principle here is that the autoencoder can only afford to represent the variations that are needed in order to reconstruct the training examples. Without regularization, it will learn unnecessary details about the training examples.

# Contractive Autoencoders

The contractive autoencoder, mentioned earlier, introduces a regularizer on the code $\textbf h = f(\textbf x),$ encouraging the derivatives of $f$ to be as small as possible: 

$$ \Omega (\textbf h) = \lambda \left\| \frac{\partial f(\textbf x)}{\partial \textbf x} \right\|_{F}^2. $$

The penalty $\Omega(\textbf h)$ is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives of the encoder function with respect to $\textbf x$. The result of this penalty is a model whose code is insensitive to small changes in $\textbf x$.
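
For a concrete case, the penalty has a closed form when the encoder is a single sigmoid layer. The sketch below assumes $f(\textbf x) = \sigma(W \textbf x + b)$, for which the Jacobian is $\mathrm{diag}(\textbf h \odot (1 - \textbf h)) W$; the encoder form and the helper names `encoder`, `contractive_penalty`, and `numerical_jacobian` are assumptions for illustration. A finite-difference Jacobian is included as a sanity check.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encoder(x, W, b):
    """Assumed single-layer sigmoid encoder h = sigmoid(W x + b)."""
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b, lam=1.0):
    """lam times the squared Frobenius norm of the encoder Jacobian at x."""
    h = encoder(x, W, b)
    # For a sigmoid layer, row i of the Jacobian is h_i * (1 - h_i) * W[i].
    J = (h * (1.0 - h))[:, None] * W
    return lam * np.sum(J ** 2)

def numerical_jacobian(x, W, b, eps=1e-6):
    """Central finite-difference estimate of dh/dx, used only as a check."""
    J = np.zeros((W.shape[0], x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (encoder(x + step, W, b) - encoder(x - step, W, b)) / (2.0 * eps)
    return J

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
analytic = contractive_penalty(x, W, b)
numeric = np.sum(numerical_jacobian(x, W, b) ** 2)
```

The two values should agree to within finite-difference error. For deeper encoders the Jacobian has no such simple form and would typically be obtained with automatic differentiation.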

# Predictive Sparse Decomposition

# Applications