# Denoising Autoencoders

Following: http://deeplearning.net/tutorial/dA.html

Also see: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/217 for more on dA.

## Autoencoders

See Section 4.6 in http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/239 for a discussion of autoencoders. Essentially, an autoencoder takes an input $x \in [0, 1]^d$ and first maps it (with an _encoder_) to a hidden representation, $y \in [0, 1]^{d'}$ through a deterministic mapping, e.g.

\begin{equation}
y = s(W x + b)
\end{equation}

where $s$ is a non-linearity (e.g. sigmoid). The latent representation, or code, is then mapped back (with a _decoder_) into a reconstruction $z$ that is the same shape as $x$:

\begin{equation}
z = s(W' y + b')
\end{equation}

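where the prime on $W'$ does not denote transposition: in general $W'$ is a separate matrix with its own parameters. $z$ is a _prediction_ of $x$, given the code $y$. Optionally, we _may_ constrain $W' = W^T$, which is known as _tied weights_. The parameters of the model ($W$, $b$, $b'$ and, if the weights are not tied, $W'$) are optimized to minimize the average reconstruction error over the training set.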
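The encoder and decoder mappings above can be sketched in a few lines of NumPy. This is only an illustration: the helper names (`sigmoid`, `encode`, `decode`), the layer sizes, and the random initialization are assumptions made for the example, not something the tutorial prescribes.

```python
import numpy as np

def sigmoid(a):
    """Logistic non-linearity s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """Map an input x of shape (d,) to a code y of shape (d_hidden,)."""
    return sigmoid(W @ x + b)

def decode(y, W_prime, b_prime):
    """Map a code y of shape (d_hidden,) to a reconstruction z of shape (d,)."""
    return sigmoid(W_prime @ y + b_prime)

rng = np.random.default_rng(0)
d, d_hidden = 6, 3                       # illustrative sizes (assumed)
x = rng.uniform(0.0, 1.0, size=d)        # an input in [0, 1]^d
W = rng.normal(0.0, 0.1, size=(d_hidden, d))
b = np.zeros(d_hidden)
b_prime = np.zeros(d)

y = encode(x, W, b)

# Untied weights: W' is a separate (d, d_hidden) matrix.
W_prime = rng.normal(0.0, 0.1, size=(d, d_hidden))
z_untied = decode(y, W_prime, b_prime)

# Tied weights: W' is constrained to be the transpose of W.
z_tied = decode(y, W.T, b_prime)
```

In both cases the reconstruction has the same shape as $x$ and, because of the sigmoid, lies in $(0, 1)^d$. Tying the weights simply removes $W'$ from the set of free parameters.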

The reconstruction error may be measured in a few different ways. The traditional _squared error_ $L(x, z) = \| x - z \|^2$ may be used. If the input is interpreted as either bit vectors or vectors of bit probabilities, the _cross-entropy_ of the reconstruction may be used instead:

\begin{equation}
L_H(x, z) = - \sum_{k=1}^d \left[ x_k \log z_k + (1 - x_k) \log (1 - z_k) \right]
\end{equation}
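Both reconstruction errors are straightforward to compute. The sketch below uses hypothetical function names; the clipping constant `eps` is an added assumption to avoid evaluating `log(0)` and is not part of the definition of $L_H$.

```python
import numpy as np

def squared_error(x, z):
    """Squared Euclidean distance between input x and reconstruction z."""
    return float(np.sum((x - z) ** 2))

def cross_entropy(x, z, eps=1e-12):
    """Cross-entropy L_H(x, z), summed over the d input dimensions.

    z is clipped away from 0 and 1 (an assumption for numerical safety).
    """
    z = np.clip(z, eps, 1.0 - eps)
    return float(-np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z)))

x = np.array([1.0, 0.0, 1.0, 0.0])        # a bit vector
z_good = np.array([0.9, 0.1, 0.8, 0.2])   # close to x
z_bad = np.array([0.1, 0.9, 0.2, 0.8])    # far from x

print(squared_error(x, z_good), squared_error(x, z_bad))
print(cross_entropy(x, z_good), cross_entropy(x, z_bad))
```

Under either measure, `z_good` scores lower than `z_bad`. The cross-entropy penalizes confident wrong predictions much more heavily than the squared error does.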

The hope is that the code $y$ is a distributed representation that captures the coordinates along the main factors of variation in the data. This is similar to the way the projection on principal components captures the main factors of variation in the data. Indeed, if there is a single linear hidden layer and the mean squared error is used to train the network, the $k$ hidden units learn to project the input onto the span of the first $k$ principal components of the data. If the hidden layer is non-linear, the autoencoder behaves differently from PCA and may capture multi-modal aspects of the input distribution. Stacking multiple encoders and decoders (building a deep autoencoder) leads to even further divergence from PCA. See: http://www.cs.toronto.edu/~rsalakhu/papers/science.pdf
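The connection to PCA can be checked numerically. The sketch below does not train a network. It builds the PCA solution directly from an SVD and treats it as a linear autoencoder with tied weights, which is the optimum a linear autoencoder under squared error is expected to reach (up to a change of basis within the subspace). The synthetic data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2

# Synthetic data with different variance along each axis, then centered.
X = rng.normal(size=(n, d)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by singular value.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k]                      # linear encoder weights, shape (k, d)

Y = X @ W.T                     # codes, shape (n, k)
Z = Y @ W                       # reconstructions with tied weights W' = W^T
pca_error = np.sum((X - Z) ** 2)

# For comparison: projection onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns
W_rand = Q.T
rand_error = np.sum((X - X @ W_rand.T @ W_rand) ** 2)

print(pca_error, np.sum(s[k:] ** 2), rand_error)
```

The squared reconstruction error of the PCA projection equals the sum of the discarded squared singular values, and no other $k$-dimensional linear projection achieves a lower error.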

Because $y$ acts like a lossy compression of $x$, it cannot be a good (small-loss) compression for all $x$. Optimization makes it perform well on the training examples, but not on arbitrary inputs. Autoencoders generalize in the sense that they give low reconstruction error on test samples drawn from the same distribution as the training examples, while generally giving high error on samples chosen randomly from the input space.
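This behaviour can be illustrated with a toy example. As a stand-in for a trained autoencoder, the sketch below fits a linear tied-weight encoder by SVD on data that lie on a low-dimensional subspace, so the outcome is deterministic. The data distribution, the `sample` helper, and the use of a linear model are all assumptions for illustration; a trained non-linear autoencoder would show the same qualitative gap, not the same numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 2

# Hypothetical data distribution: points on a k-dimensional subspace of R^d.
basis = rng.normal(size=(k, d))

def sample(n):
    """Draw n points from the assumed data distribution."""
    return rng.normal(size=(n, k)) @ basis

X_train, X_test = sample(200), sample(50)
X_random = rng.normal(size=(50, d))    # arbitrary points in the input space

# "Train" a linear tied-weight autoencoder on the training set via SVD.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
W = Vt[:k]

def reconstruction_error(X):
    """Mean squared reconstruction error over the rows of X."""
    Xc = X - mean
    Z = Xc @ W.T @ W
    return float(np.mean(np.sum((Xc - Z) ** 2, axis=1)))

print(reconstruction_error(X_train))
print(reconstruction_error(X_test))
print(reconstruction_error(X_random))
```

Held-out samples from the same distribution are reconstructed almost perfectly, whereas random points in the input space lose everything outside the learned subspace.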

We will implement an autoencoder in such a fashion as to make it stackable. Because we are using tied weights here, we will use $W^T$ for $W'$.
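A minimal NumPy sketch of such a tied-weight autoencoder is shown below. It is not the tutorial's implementation: the class name `TiedAutoencoder`, the initialization scale, the learning rate, and the use of plain full-batch gradient descent on the cross-entropy are assumptions chosen to keep the example short. The gradients are derived by hand; note that $W$ receives a contribution from both the encoder and the decoder because the weights are tied.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TiedAutoencoder:
    """Hypothetical minimal autoencoder with tied weights (W' = W^T)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(n_hidden, n_visible))
        self.b = np.zeros(n_hidden)          # encoder bias
        self.b_prime = np.zeros(n_visible)   # decoder bias

    def encode(self, X):
        """Rows of X are inputs; returns codes of shape (n, n_hidden)."""
        return sigmoid(X @ self.W.T + self.b)

    def decode(self, Y):
        """Rows of Y are codes; returns reconstructions of shape (n, n_visible)."""
        return sigmoid(Y @ self.W + self.b_prime)

    def loss(self, X):
        """Mean cross-entropy reconstruction error over the batch."""
        Z = np.clip(self.decode(self.encode(X)), 1e-12, 1.0 - 1e-12)
        return float(np.mean(-np.sum(X * np.log(Z) + (1 - X) * np.log(1 - Z), axis=1)))

    def train_step(self, X, learning_rate=0.1):
        """One full-batch gradient descent step on the mean cross-entropy."""
        n = X.shape[0]
        Y = self.encode(X)
        Z = self.decode(Y)
        delta_out = Z - X                                  # dL/d(decoder pre-activation)
        delta_hid = (delta_out @ self.W.T) * Y * (1 - Y)   # dL/d(encoder pre-activation)
        # Tied weights: sum the encoder and decoder contributions to dL/dW.
        grad_W = (delta_hid.T @ X + Y.T @ delta_out) / n
        self.W -= learning_rate * grad_W
        self.b -= learning_rate * delta_hid.mean(axis=0)
        self.b_prime -= learning_rate * delta_out.mean(axis=0)

# Toy binary data (an assumption for the example).
rng = np.random.default_rng(1)
X = (rng.uniform(size=(64, 8)) > 0.5).astype(float)

model = TiedAutoencoder(n_visible=8, n_hidden=4)
loss_before = model.loss(X)
for _ in range(200):
    model.train_step(X)
loss_after = model.loss(X)
print(loss_before, loss_after)
```

Keeping `encode` as a separate method is what makes the model stackable in this sketch: the codes returned by one autoencoder can be passed as the input `X` of the next.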