.. _SdA:

Stacked Denoising Autoencoders (SdA)
====================================

.. note::
  This section assumes you have already read through :doc:`logreg`
  and :doc:`mlp`. Additionally, it uses the following Theano functions
  and concepts: `T.tanh`_, `shared variables`_, `basic arithmetic ops`_, `T.grad`_, `Random numbers`_, `floatX`_. If you intend to run the code on GPU, also read `GPU`_.

.. _T.tanh: http://deeplearning.net/software/theano/tutorial/examples.html?highlight=tanh
.. _shared variables: http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables
.. _basic arithmetic ops: http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars
.. _T.grad: http://deeplearning.net/software/theano/tutorial/examples.html#computing-gradients
.. _floatX: http://deeplearning.net/software/theano/library/config.html#config.floatX
.. _GPU: http://deeplearning.net/software/theano/tutorial/using_gpu.html
.. _Random numbers: http://deeplearning.net/software/theano/tutorial/examples.html#using-random-numbers

.. note::
  The code for this section is available for download `here`_.

.. _here: http://deeplearning.net/tutorial/code/SdA.py

The Stacked Denoising Autoencoder (SdA) is an extension of the stacked
autoencoder [Bengio07]_ and was introduced in [Vincent08]_.

This tutorial builds on the previous tutorial, :ref:`dA`.
If you have no experience with autoencoders, we recommend reading it
before going any further.

.. _stacked_autoencoders:

Stacked Autoencoders
++++++++++++++++++++

Denoising autoencoders can be stacked to form a deep network by
feeding the latent representation (output code)
of the denoising autoencoder found on the layer
below as input to the current layer. The **unsupervised pre-training** of such an
architecture is done one layer at a time. Each layer is trained as
a denoising autoencoder by minimizing the error in reconstructing its input
(which is the output code of the previous layer).
Once the first :math:`k` layers
are trained, we can train the :math:`k+1`-th layer because we can now
compute the code or latent representation from the layer below.
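
The layer-wise scheme described above can be sketched in plain NumPy. This is
only an illustration under stated assumptions, not the tutorial's Theano code:
``train_dA``, ``encode`` and ``pretrain_stack`` are hypothetical helpers, the
autoencoder uses tied weights with a cross-entropy reconstruction cost, and
training uses full-batch gradient steps for brevity.

.. code-block:: python

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_dA(X, n_hidden, corruption, lr, epochs, rng):
        """Train one tied-weight denoising autoencoder on X; return (W, b)."""
        n_visible = X.shape[1]
        W = rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden))
        b = np.zeros(n_hidden)          # encoder (hidden) bias
        b_prime = np.zeros(n_visible)   # decoder (visible) bias
        for _ in range(epochs):
            # Corrupt the input by zeroing a random subset of its entries.
            mask = rng.binomial(1, 1.0 - corruption, size=X.shape)
            X_tilde = X * mask
            H = sigmoid(X_tilde @ W + b)        # code
            Z = sigmoid(H @ W.T + b_prime)      # reconstruction
            # Gradients of the mean cross-entropy between Z and the clean X.
            d_out = (Z - X) / X.shape[0]
            d_hid = (d_out @ W) * H * (1.0 - H)
            W -= lr * (X_tilde.T @ d_hid + d_out.T @ H)
            b -= lr * d_hid.sum(axis=0)
            b_prime -= lr * d_out.sum(axis=0)
        return W, b

    def encode(X, layers):
        """Propagate X through the encoders trained so far."""
        for W, b in layers:
            X = sigmoid(X @ W + b)
        return X

    def pretrain_stack(X, hidden_sizes, corruption_levels,
                       lr=0.1, epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        layers = []
        for n_hidden, corruption in zip(hidden_sizes, corruption_levels):
            # The k layers already trained give the input of layer k+1.
            layer_input = encode(X, layers)
            layers.append(train_dA(layer_input, n_hidden, corruption,
                                   lr, epochs, rng))
        return layers

    # Toy binary data, made up for the example.
    X = (np.random.default_rng(1).random((20, 8)) > 0.5).astype(float)
    layers = pretrain_stack(X, hidden_sizes=[5, 3],
                            corruption_levels=[0.1, 0.2])
    print([W.shape for W, b in layers])

Each autoencoder only ever sees the code produced by the layers below it,
which is what allows the layers to be trained one at a time.
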

Once all layers are pre-trained, the network goes through a second stage
of training called **fine-tuning**. Here we consider **supervised fine-tuning**,
where we want to minimize prediction error on a supervised task.
For this, we first add a logistic regression
layer on top of the network (more precisely, on the output code of the
output layer). We then
train the entire network as we would train a multilayer
perceptron. At this point, we only consider the encoding parts of
each autoencoder.
This stage is supervised, since we now use the target class during
training. (See the :ref:`mlp` for details on the multilayer perceptron.)

This can be easily implemented in Theano, using the class defined
previously for a denoising autoencoder. We can see the stacked denoising
autoencoder as having two facades: a list of
autoencoders, and an MLP. During pre-training we use the first facade, i.e., we treat our model
as a list of autoencoders, and train each autoencoder separately. In the
second stage of training, we use the second facade. These two facades are linked because:

* the autoencoders and the sigmoid layers of the MLP share parameters, and

* the latent representations computed by intermediate layers of the MLP are fed as input to the autoencoders.

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-1
  :end-before: end-snippet-1

``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade, while
``self.dA_layers`` will store the denoising autoencoder associated with the layers of the MLP.

Next, we construct ``n_layers`` sigmoid layers and ``n_layers`` denoising
autoencoders, where ``n_layers`` is the depth of our model. We use the
``HiddenLayer`` class introduced in :ref:`mlp`, with one
modification: we replace the ``tanh`` non-linearity with the
logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`.
We link the sigmoid layers to form an MLP, and construct
the denoising autoencoders such that each shares the weight matrix and the
bias of its encoding part with its corresponding sigmoid layer.

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-2
  :end-before: end-snippet-2

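
To illustrate what sharing parameters between the two facades means, here is a
small NumPy analogy. It is a sketch only: ``SigmoidLayer`` and ``TinyDA`` are
hypothetical stand-ins for the tutorial's classes, and NumPy arrays updated in
place play the role of Theano shared variables.

.. code-block:: python

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class SigmoidLayer:
        """Hypothetical stand-in for a layer of the MLP facade."""
        def __init__(self, W, b):
            self.W, self.b = W, b

        def output(self, x):
            return sigmoid(x @ self.W + self.b)

    class TinyDA:
        """Hypothetical stand-in for a dA that reuses the encoder's W and b."""
        def __init__(self, W, bhid):
            self.W, self.b = W, bhid              # same array objects, no copy
            self.b_prime = np.zeros(W.shape[0])   # decoder bias, owned by the dA

        def encode(self, x):
            return sigmoid(x @ self.W + self.b)

    rng = np.random.default_rng(0)
    layer = SigmoidLayer(rng.uniform(-0.1, 0.1, size=(4, 3)), np.zeros(3))
    da = TinyDA(layer.W, layer.b)

    x = rng.random((2, 4))
    before = layer.output(x)
    da.W -= 0.5                  # in-place update, as a training step would do
    after = layer.output(x)

    print(da.W is layer.W)               # the two facades hold one array
    print(np.allclose(before, after))    # the MLP layer sees the update

Because both objects refer to the same arrays, anything learned while training
the autoencoder is immediately visible to the corresponding MLP layer.
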

All we need now is to add a logistic layer on top of the sigmoid
layers such that we have an MLP. We will
use the ``LogisticRegression`` class introduced in :ref:`logreg`.

.. literalinclude:: ../code/SdA.py
  :start-after: end-snippet-2
  :end-before: def pretraining_functions

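
As a rough picture of the resulting MLP facade, the following NumPy sketch
runs a forward pass through a stack of sigmoid layers followed by a softmax
output layer, then computes a negative log-likelihood and an error rate. The
function names, the layer sizes and the zero-initialised output weights are
assumptions made for this example.

.. code-block:: python

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)   # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def forward(x, hidden_params, W_out, b_out):
        """Encoders of the stack, then a logistic regression layer on top."""
        h = x
        for W, b in hidden_params:
            h = sigmoid(h @ W + b)
        return softmax(h @ W_out + b_out)

    def negative_log_likelihood(p, y):
        return -np.mean(np.log(p[np.arange(len(y)), y]))

    def error_rate(p, y):
        return np.mean(p.argmax(axis=1) != y)

    rng = np.random.default_rng(0)
    x = rng.random((4, 6))
    y = np.array([0, 2, 1, 2])
    hidden_params = [(rng.uniform(-0.1, 0.1, (6, 5)), np.zeros(5)),
                     (rng.uniform(-0.1, 0.1, (5, 4)), np.zeros(4))]
    W_out, b_out = np.zeros((4, 3)), np.zeros(3)

    p = forward(x, hidden_params, W_out, b_out)
    print(negative_log_likelihood(p, y), error_rate(p, y))

With zero output weights every class receives probability 1/3, so the cost
starts at :math:`\log 3`; fine-tuning then adjusts all layers to lower it.
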

The ``SdA`` class also provides a method that generates training functions for
the denoising autoencoders in its layers.
They are returned as a list, where element :math:`i` is a function that
implements one step of training the ``dA`` corresponding to layer
:math:`i`.

.. literalinclude:: ../code/SdA.py
  :start-after: self.errors = self.logLayer.errors(self.y)
  :end-before: corruption_level = T.scalar('corruption')

To be able to change the corruption level or the learning rate
during training, we associate Theano variables with them.

.. literalinclude:: ../code/SdA.py
  :start-after: index = T.lscalar('index')
  :end-before: def build_finetune_functions

Now any function ``pretrain_fns[i]`` takes as arguments ``index`` and,
optionally, ``corruption`` (the corruption level) or ``lr`` (the
learning rate). Note that the names of the parameters are the names given
to the Theano variables when they are constructed, not the names of the
Python variables (``learning_rate`` or ``corruption_level``). Keep this
in mind when working with Theano.

In the same fashion, we build a method for constructing the functions required
during fine-tuning (``train_fn``, ``valid_score`` and
``test_score``).

.. literalinclude:: ../code/SdA.py
  :pyobject: SdA.build_finetune_functions

Note that ``valid_score`` and ``test_score`` are not Theano
functions, but rather Python functions that loop over the entire
validation set and the entire test set, respectively, producing a list of the losses
over these sets.
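
The shape of such a scoring function can be sketched as follows. This is a
plain-Python illustration, not the tutorial's code: ``make_score_function``
and ``batch_error`` are hypothetical names, and the labels and predictions are
made up for the example.

.. code-block:: python

    import numpy as np

    def make_score_function(batch_error, n_examples, batch_size):
        """Return a function that evaluates batch_error on every minibatch."""
        n_batches = n_examples // batch_size

        def score():
            return [batch_error(i) for i in range(n_batches)]

        return score

    # Made-up labels and predictions standing in for a validation set.
    y_true = np.array([0, 1, 1, 0, 1, 0])
    y_pred = np.array([0, 1, 0, 0, 1, 1])
    batch_size = 2

    def batch_error(i):
        rows = slice(i * batch_size, (i + 1) * batch_size)
        return float(np.mean(y_true[rows] != y_pred[rows]))

    valid_score = make_score_function(batch_error, len(y_true), batch_size)
    losses = valid_score()
    print(losses, np.mean(losses))

Since the result is a list with one entry per minibatch, a caller would
typically average it to obtain a single number for the whole set.
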

Putting it all together
+++++++++++++++++++++++

The few lines of code below construct the stacked denoising
autoencoder:

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-3
  :end-before: end-snippet-3

There are two stages of training for this network: layer-wise pre-training
followed by fine-tuning.

For the pre-training stage, we will loop over all the layers of the
network. For each layer, we will use the compiled Theano function that
implements an SGD step towards optimizing the weights to reduce
the reconstruction cost of that layer. This function will be applied
to the training set for a fixed number of epochs given by
``pretraining_epochs``.

.. literalinclude:: ../code/SdA.py
  :start-after: start-snippet-4
  :end-before: end-snippet-4

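
The control flow of the pre-training stage can be summarised by the sketch
below. It assumes a list of per-layer step functions that accept ``index``,
``corruption`` and ``lr`` as keyword arguments; the stand-in functions built
by ``make_stub`` are hypothetical and return a placeholder value instead of a
real reconstruction cost, so only the loop structure is meaningful here.

.. code-block:: python

    import numpy as np

    def run_pretraining(pretrain_fns, n_train_batches, pretraining_epochs,
                        corruption_levels, pretrain_lr):
        """Apply each layer's step function to every minibatch, epoch by epoch."""
        log = []
        for i, step in enumerate(pretrain_fns):
            for epoch in range(pretraining_epochs):
                costs = [step(index=batch_index,
                              corruption=corruption_levels[i],
                              lr=pretrain_lr)
                         for batch_index in range(n_train_batches)]
                log.append((i, epoch, float(np.mean(costs))))
        return log

    def make_stub(layer):
        """Hypothetical stand-in for a compiled training function."""
        def step(index, corruption, lr):
            return float(layer)   # placeholder, not a real cost
        return step

    pretrain_fns = [make_stub(layer) for layer in range(3)]
    log = run_pretraining(pretrain_fns, n_train_batches=4,
                          pretraining_epochs=2,
                          corruption_levels=[0.1, 0.2, 0.3],
                          pretrain_lr=0.001)
    print(len(log))

Each layer is trained to completion before the next one starts, and the mean
cost over the minibatches is recorded once per epoch.
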

The fine-tuning loop is very similar to the one in the :ref:`mlp`. The
only difference is that it uses the functions given by
``build_finetune_functions``.

Running the Code
++++++++++++++++

The user can run the code by calling:

.. code-block:: bash

  python code/SdA.py

By default the code runs 15 pre-training epochs for each layer, with a batch
size of 1. The corruption levels are 0.1 for the first layer, 0.2 for the second,
and 0.3 for the third. The pre-training learning rate is 0.001 and
the fine-tuning learning rate is 0.1. Pre-training takes 585.01 minutes, with
an average of 13 minutes per epoch. Fine-tuning is completed after 36 epochs
in 444.2 minutes, with an average of 12.34 minutes per epoch. The final
validation score is 1.39% with a testing score of 1.3%.
These results were obtained on a machine with an Intel
Xeon E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
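
The per-epoch averages quoted above can be checked with a few lines of
arithmetic. The only assumption is that the network has three layers, one per
listed corruption level, so that pre-training runs 15 epochs three times.

.. code-block:: python

    pretraining_epochs = 15
    n_layers = 3                  # assumption: one layer per corruption level
    pretraining_minutes = 585.01
    finetuning_epochs = 36
    finetuning_minutes = 444.2

    pretrain_avg = pretraining_minutes / (pretraining_epochs * n_layers)
    finetune_avg = finetuning_minutes / finetuning_epochs
    print(round(pretrain_avg, 2), round(finetune_avg, 2))   # 13.0 12.34
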

Tips and Tricks
+++++++++++++++

One way to improve the running time of your code (assuming you have
sufficient memory available) is to precompute how the network, up to layer
:math:`k-1`, transforms your data. Namely, you start by training your first
layer dA. Once it is trained, you can compute the hidden unit values for
every datapoint in your dataset and store them as a new dataset that you will
use to train the dA corresponding to layer 2. Once you have trained the dA for
layer 2, you compute, in a similar fashion, the dataset for layer 3, and so on.
At this point the dAs are trained individually, and
each simply provides the next with a non-linear transformation of the input.
Once all dAs are trained, you can start fine-tuning the model.
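
A minimal NumPy sketch of this caching idea is given below. The names
``pretrain_with_cache`` and ``train_layer`` are hypothetical, and the
``train_layer`` used in the example is only a placeholder that returns
untrained random parameters where real dA training would go.

.. code-block:: python

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def pretrain_with_cache(X, layer_sizes, train_layer):
        """Train layer by layer, storing each layer's output as a new dataset."""
        dataset = X
        layers = []
        for n_hidden in layer_sizes:
            # The dA for this layer only ever sees the cached dataset.
            W, b = train_layer(dataset, n_hidden)
            layers.append((W, b))
            # Computed once here, instead of once per minibatch per epoch.
            dataset = sigmoid(dataset @ W + b)
        return layers, dataset

    rng = np.random.default_rng(0)

    def train_layer(dataset, n_hidden):
        # Placeholder for real dA training (an assumption of this sketch).
        W = rng.uniform(-0.1, 0.1, size=(dataset.shape[1], n_hidden))
        return W, np.zeros(n_hidden)

    X = rng.random((10, 8))
    layers, cached = pretrain_with_cache(X, [5, 3], train_layer)
    print(cached.shape)

The cached dataset is identical to what a full forward pass through the lower
layers would produce; the saving comes from not repeating that pass for every
training step of the upper layers.
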