Commit 668aca3: denoising auto-encoder tutorial (more)

pluskid committed Dec 30, 2014
1 parent 2b85682

Showing 2 changed files with 115 additions and 2 deletions.
107 changes: 107 additions & 0 deletions docs/tutorial/mnist-sDA.rst
@@ -6,8 +6,115 @@ auto-encoders to do pre-training for a deep neural network. We will work with
the MNIST dataset. Please see :doc:`the LeNet tutorial on
MNIST </tutorial/mnist>` on how to prepare the HDF5 dataset.

Unsupervised pre-training is a way to initialize the weights when training
a deep neural network. Initialization with pre-training can yield better
convergence than training from a simple random initialization, especially when
the number of (labeled) training points is not very large.

In the following two figures, we show the results generated from this tutorial.
Specifically, the first figure shows the softmax loss on the training set at
different training iterations with and without pre-training initialization.

.. image:: images/mnist-sDA/obj-val.*

The second plot is similar, except that it shows the prediction accuracy of the
trained model on the test set.

.. image:: images/mnist-sDA/test-accuracy-accuracy.*

As we can see, faster convergence is observed when we initialize with
pre-training.

(stacked) Denoising Auto-encoders
---------------------------------

We provide a brief introduction to (stacked) denoising auto-encoders in this
section. See also the `deep learning tutorial on Denoising Auto-encoders
<http://deeplearning.net/tutorial/dA.html>`_.

An **auto-encoder** takes an input :math:`\mathbf{x}\in \mathbb{R}^p`, maps it to
a latent representation (encoding) :math:`\mathbf{y}\in\mathbb{R}^q`, and then
maps it back to the original space as :math:`\mathbf{z}\in\mathbb{R}^p` (decoding
/ reconstruction). The mappings are typically linear maps, optionally followed
by an element-wise nonlinearity:

.. math::

    \begin{aligned}
    \mathbf{y} &= s\left(\mathbf{W}\mathbf{x} + \mathbf{b}\right) \\
    \mathbf{z} &= s\left(\tilde{\mathbf{W}}\mathbf{y} + \tilde{\mathbf{b}}\right)
    \end{aligned}

Typically, we constrain the weights in the decoder to be the transpose of the
weights in the encoder. This is referred to as *tied weights*:

.. math::

    \tilde{\mathbf{W}} = \mathbf{W}^T

Note that the biases :math:`\mathbf{b}` and :math:`\tilde{\mathbf{b}}` are still
different even when the weights are *tied*. An auto-encoder is trained by
minimizing the reconstruction error, typically with the square loss
:math:`\ell(\mathbf{x},\mathbf{z})=\|\mathbf{x}-\mathbf{z}\|^2`.
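
To make the mapping concrete, here is a minimal sketch in plain Julia
(independent of Mocha) of the encode/decode pass with tied weights and the
square reconstruction loss; the dimensions and random input are illustrative.

.. code-block:: julia

    # Minimal tied-weights auto-encoder pass in plain Julia (illustrative sizes).
    p, q = 8, 4                      # input and latent dimensions
    s(v) = 1 / (1 + exp(-v))         # element-wise sigmoid, applied via broadcasting

    W  = 0.1 * randn(q, p)           # encoder weights
    b  = zeros(q)                    # encoder bias
    bt = zeros(p)                    # decoder bias (not tied, unlike the weights)

    x = rand(p)                      # an input vector
    y = s.(W * x .+ b)               # encoding:  y = s(Wx + b)
    z = s.(W' * y .+ bt)             # decoding:  z = s(W̃y + b̃) with W̃ = Wᵀ (tied)
    loss = sum((x .- z) .^ 2)        # square reconstruction loss ℓ(x, z) = ‖x − z‖²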

A **denoising auto-encoder** is an auto-encoder with noise corruptions. More
specifically, the encoder takes a corrupted version :math:`\tilde{\mathbf{x}}`
of the original input. A typical corruption is to randomly set elements of
:math:`\mathbf{x}` to zero. Note that the reconstruction error is still measured
against the original uncorrupted input :math:`\mathbf{x}`.
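
Continuing the sketch above (and reusing ``s``, ``W``, ``b``, ``bt``, and ``x``
from it), masking corruption can be illustrated as follows; the corruption
``ratio`` is illustrative, and the loss still compares against the
uncorrupted ``x``.

.. code-block:: julia

    # Masking corruption (reusing s, W, b, bt, and x from the sketch above).
    ratio   = 0.2                       # probability of zeroing each element
    x_tilde = x .* (rand(p) .> ratio)   # corrupted input x̃ fed to the encoder
    y = s.(W * x_tilde .+ b)            # encode the corrupted input
    z = s.(W' * y .+ bt)                # reconstruct
    loss = sum((x .- z) .^ 2)           # error measured against the original x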

After training, we can take the weights and bias of the encoder layer in
a (denoising) auto-encoder as the initialization of a hidden (inner-product)
layer of a DNN. When there are multiple hidden layers, layer-wise pre-training
of stacked (denoising) auto-encoders can be used to obtain initializations for
all the hidden layers.

Layer-wise pre-training of stacked auto-encoders consists of the following
steps:

1. Train the bottom-most auto-encoder.
2. After training, remove the decoder layer and construct a new auto-encoder
   that takes the *latent representation* of the existing auto-encoder as input.
3. Train the new auto-encoder. Note that the weights and biases of the encoders
   from the previously trained auto-encoders are **fixed** when training the
   newly constructed auto-encoder.
4. Repeat steps 2 and 3 until enough layers are pre-trained; a toy sketch of
   this loop in plain Julia is given below.
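
The sketch below is a self-contained toy illustration of this loop in plain
Julia, not the Mocha implementation used in this tutorial: it uses untied
weights, a tiny random dataset, and plain stochastic gradient descent, and all
sizes and rates are made up.

.. code-block:: julia

    # Toy greedy layer-wise pre-training (untied weights, made-up sizes/rates).
    s(v) = 1 / (1 + exp(-v))                   # element-wise sigmoid

    # Train one denoising auto-encoder on X (columns are samples); return the encoder.
    function train_dae(X, q; ratio=0.2, lr=0.1, epochs=20)
        p, n = size(X)
        W1, b1 = 0.1 * randn(q, p), zeros(q)   # encoder (kept)
        W2, b2 = 0.1 * randn(p, q), zeros(p)   # decoder (discarded afterwards)
        for _ in 1:epochs, j in 1:n
            x  = X[:, j]
            xt = x .* (rand(p) .> ratio)       # masking corruption
            y  = s.(W1 * xt .+ b1)             # encode
            z  = s.(W2 * y  .+ b2)             # reconstruct
            dz = 2 .* (z .- x) .* z .* (1 .- z)  # gradient at decoder pre-activation
            dy = (W2' * dz) .* y .* (1 .- y)     # back-propagate to the encoder
            W2 .-= lr .* (dz * y');  b2 .-= lr .* dz
            W1 .-= lr .* (dy * xt'); b1 .-= lr .* dy
        end
        return W1, b1
    end

    # Steps 1-4: train a DAE per hidden layer; each layer's codes feed the next.
    function pretrain_stack(X, hidden_dims)
        encoders = Tuple{Matrix{Float64},Vector{Float64}}[]
        for q in hidden_dims
            W, b = train_dae(X, q)             # steps 1 and 3
            push!(encoders, (W, b))            # step 2: keep encoder, drop decoder
            X = s.(W * X .+ b)                 # latent codes become the next input
        end
        return encoders                        # initialization for the hidden layers
    end

    encoders = pretrain_stack(rand(16, 200), (8, 8, 8))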

Next we will show how to train denoising auto-encoders in Mocha and use them to
initialize DNNs.

Experiment Configuration
------------------------

We will train a DNN with 3 hidden layers using sigmoid nonlinearity. All the
parameters are listed below:

.. literalinclude:: ../../examples/unsupervised-pretrain/denoising-autoencoder/denoising-autoencoder.jl
:start-after: --start-config--
:end-before: --end-config--

As we can see, we will run 15 epochs of pre-training for each layer, and 1000
epochs of fine-tuning.

In Mocha, parameters (weights and bias) can be shared among different layers
by specifying the ``param_key`` parameter when constructing layers. The
``param_keys`` variable defined above contains a unique identifier for each of
the hidden layers. We will use those identifiers to indicate that the encoders
in pre-training share parameters with the hidden layers in DNN fine-tuning.
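
For instance, in a schematic fragment like the following (the layer names, the
key, and the blob names are illustrative, not taken from the tutorial script),
the two inner-product layers resolve to the same underlying parameters because
they are constructed with the same ``param_key``:

.. code-block:: julia

    using Mocha

    # Two layers sharing weights/bias via the same param_key (illustrative names).
    shared_key = "pretrained-ip1"

    encoder_layer = InnerProductLayer(name="pretrain-ip1", param_key=shared_key,
                                      output_dim=1000, neuron=Neurons.Sigmoid(),
                                      bottoms=[:ip0], tops=[:ip1])

    finetune_layer = InnerProductLayer(name="ip1", param_key=shared_key,
                                       output_dim=1000, neuron=Neurons.Sigmoid(),
                                       bottoms=[:ip0], tops=[:ip1])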

Here we define several basic layers that will be used in both pre-training and
fine-tuning.

.. literalinclude:: ../../examples/unsupervised-pretrain/denoising-autoencoder/denoising-autoencoder.jl
:start-after: --start-basic-layers--
:end-before: --end-basic-layers--

Note that ``rename_layer`` is defined to rename the ``:data`` blob to the
``:ip0`` blob. This makes it easier to define the hidden layers in a unified
manner.

Pre-training
------------

.. literalinclude:: ../../examples/unsupervised-pretrain/denoising-autoencoder/denoising-autoencoder.jl
:start-after: --start-pre-train--
:end-before: --end-pre-train--

10 changes: 8 additions & 2 deletions examples/unsupervised-pretrain/denoising-autoencoder/denoising-autoencoder.jl

@@ -4,6 +4,7 @@
ENV["MOCHA_USE_CUDA"] = "true"
using Mocha

# --start-config--
n_hidden_layer = 3
n_hidden_unit = 1000
neuron = Neurons.Sigmoid()
@@ -17,6 +18,7 @@ pretrain_lr = 0.001
finetune_lr = 0.1

param_keys = ["$param_key_prefix-$i" for i = 1:n_hidden_layer]
# --end-config--

################################################################################
# Construct the Net
@@ -26,6 +28,7 @@ srand(12345678)
backend = GPUBackend()
init(backend)

# --start-basic-layers--
data_layer = HDF5DataLayer(name="train-data", source="data/train.txt",
batch_size=batch_size, shuffle=@windows ? false : true)
rename_layer = IdentityLayer(bottoms=[:data], tops=[:ip0])
@@ -35,10 +38,12 @@ hidden_layers = [
bottoms=[symbol("ip$(i-1)")], tops=[symbol("ip$i")])
for i = 1:n_hidden_layer
]
# --end-basic-layers--

################################################################################
# Layerwise pre-training for hidden layers
################################################################################
# --start-pre-train--
for i = 1:n_hidden_layer
ae_data_layer = SplitLayer(bottoms=[symbol("ip$(i-1)")], tops=[:orig_data, :corrupt_data])
corrupt_layer = RandomMaskLayer(ratio=corruption_rates[i], bottoms=[:corrupt_data])
@@ -48,8 +53,8 @@ for i = 1:n_hidden_layer
tops=[:recon], bottoms=[symbol("ip$i")])
recon_loss_layer = SquareLossLayer(bottoms=[:recon, :orig_data])

da_layers = [data_layer, rename_layer, ae_data_layer, corrupt_layer,
hidden_layers[1:i-1]..., encode_layer, recon_layer, recon_loss_layer]
da = Net("Denoising-Autoencoder-$i", backend, da_layers)
println(da)

@@ -69,6 +74,7 @@ for i = 1:n_hidden_layer

destroy(da)
end
# --end-pre-train--

################################################################################
# Fine-tuning