forked from lisa-lab/DeepLearningTutorials
-
Notifications
You must be signed in to change notification settings - Fork 27
/
lstm.txt
246 lines (170 loc) · 10.3 KB
/
lstm.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
.. _lstm:
LSTM Networks for Sentiment Analysis
**********************************************
Summary
+++++++
This tutorial aims to provide an example of how a Recurrent Neural Network
(RNN) using the Long Short Term Memory (LSTM) architecture can be implemented
using Theano. In this tutorial, this model is used to perform sentiment
analysis on movie reviews from the `Large Movie Review Dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`_, sometimes known as the
IMDB dataset.
In this task, given a movie review, the model attempts to predict whether it
is positive or negative. This is a binary classification task.
Data
++++
As previously mentioned, the provided scripts are used to train a LSTM
recurrent neural network on the Large Movie Review Dataset dataset.
While the dataset is public, in this tutorial we provide a copy of the dataset
that has previously been preprocessed according to the needs of this LSTM
implementation. Running the code provided in this tutorial will automatically
download the data to the local directory.
Model
+++++
LSTM
====
In a *traditional* recurrent neural network, during the gradient
back-propagation phase, the gradient signal can end up being multiplied a
large number of times (as many as the number of timesteps) by the weight
matrix associated with the connections between the neurons of the recurrent
hidden layer. This means that, the magnitude of weights in the transition
matrix can have a strong impact on the learning process.
If the weights in this matrix are small (or, more formally, if the leading
eigenvalue of the weight matrix is smaller than 1.0), it can lead to a
situation called *vanishing gradients* where the gradient signal gets so small
that learning either becomes very slow or stops working altogether. It can
also make more difficult the task of learning long-term dependencies in the
data. Conversely, if the weights in this matrix are large (or, again, more
formally, if the leading eigenvalue of the weight matrix is larger than 1.0),
it can lead to a situation where the gradient signal is so large that it can
cause learning to diverge. This is often referred to as *exploding gradients*.
These issues are the main motivation behind the LSTM model which introduces a
new structure called a *memory cell* (see Figure 1 below). A memory cell is
composed of four main elements: an input gate, a neuron with a self-recurrent
connection (a connection to itself), a forget gate and an output gate. The
self-recurrent connection has a weight of 1.0 and ensures that, barring any
outside interference, the state of a memory cell can remain constant from one
timestep to another. The gates serve to modulate the interactions between the
memory cell itself and its environment. The input gate can allow incoming
signal to alter the state of the memory cell or block it. On the other hand,
the output gate can allow the state of the memory cell to have an effect on
other neurons or prevent it. Finally, the forget gate can modulate the memory
cell’s self-recurrent connection, allowing the cell to remember or forget its
previous state, as needed.
.. figure:: images/lstm_memorycell.png
:align: center
**Figure 1** : Illustration of an LSTM memory cell.
The equations below describe how a layer of memory cells is updated at every
timestep :math:`t`. In these equations :
* :math:`x_t` is the input to the memory cell layer at time :math:`t`
* :math:`W_i`, :math:`W_f`, :math:`W_c`, :math:`W_o`, :math:`U_i`,
:math:`U_f`, :math:`U_c`, :math:`U_o` and :math:`V_o` are weight
matrices
* :math:`b_i`, :math:`b_f`, :math:`b_c` and :math:`b_o` are bias vectors
First, we compute the values for :math:`i_t`, the input gate, and
:math:`\widetilde{C_t}` the candidate value for the states of the memory
cells at time :math:`t` :
.. math::
:label: 1
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
.. math::
:label: 2
\widetilde{C_t} = tanh(W_c x_t + U_c h_{t-1} + b_c)
Second, we compute the value for :math:`f_t`, the activation of the memory
cells' forget gates at time :math:`t` :
.. math::
:label: 3
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
Given the value of the input gate activation :math:`i_t`, the forget gate
activation :math:`f_t` and the candidate state value :math:`\widetilde{C_t}`,
we can compute :math:`C_t` the memory cells' new state at time :math:`t` :
.. math::
:label: 4
C_t = i_t * \widetilde{C_t} + f_t * C_{t-1}
With the new state of the memory cells, we can compute the value of their
output gates and, subsequently, their outputs :
.. math::
:label: 5
o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_1)
.. math::
:label: 6
h_t = o_t * tanh(C_t)
Our model
=========
The model we used in this tutorial is a variation of the standard LSTM model.
In this variant, the activation of a cell’s output gate does not depend on the
memory cell’s state :math:`C_t`. This allows us to perform part of the
computation more efficiently (see the implementation note, below, for
details). This means that, in the variant we have implemented, there is no
matrix :math:`V_o` and equation :eq:`5` is replaced by equation :eq:`5-alt` :
.. math::
:label: 5-alt
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_1)
Our model is composed of a single LSTM layer followed by an average pooling
and a logistic regression layer as illustrated in Figure 2 below. Thus, from
an input sequence :math:`x_0, x_1, x_2, ..., x_n`, the memory cells in the
LSTM layer will produce a representation sequence :math:`h_0, h_1, h_2, ...,
h_n`. This representation sequence is then averaged over all timesteps
resulting in representation h. Finally, this representation is fed to a
logistic regression layer whose target is the class label associated with the
input sequence.
.. figure:: images/lstm.png
:align: center
**Figure 2** : Illustration of the model used in this tutorial. It is
composed of a single LSTM layer followed by mean pooling over time and
logistic regression.
**Implementation note** : In the code included this tutorial, the equations
:eq:`1`, :eq:`2`, :eq:`3` and :eq:`5-alt` are performed in parallel to make
the computation more efficient. This is possible because none of these
equations rely on a result produced by the other ones. It is achieved by
concatenating the four matrices :math:`W_*` into a single weight matrix
:math:`W` and performing the same concatenation on the weight matrices
:math:`U_*` to produce the matrix :math:`U` and the bias vectors :math:`b_*`
to produce the vector :math:`b`. Then, the pre-nonlinearity activations can
be computed with :
.. math::
z = \sigma(W x_t + U h_{t-1} + b)
The result is then sliced to obtain the pre-nonlinearity activations for
:math:`i`, :math:`f`, :math:`\widetilde{C_t}`, and :math:`o` and the
non-linearities are then applied independently for each.
Code - Citations - Contact
++++++++++++++++++++++++++
Code
====
The LSTM implementation can be found in the two following files :
* `lstm.py <http://deeplearning.net/tutorial/code/lstm.py>`_ : Main script. Defines and train the model.
* `imdb.py <http://deeplearning.net/tutorial/code/imdb.py>`_ : Secondary script. Handles the loading and preprocessing of the IMDB dataset.
After downloading both scripts and putting both in the same folder, the user
can run the code by calling:
.. code-block:: bash
THEANO_FLAGS="floatX=float32" python lstm.py
The script will automatically download the data and decompress it.
**Note** : The provided code supports the Stochastic Gradient Descent (SGD),
AdaDelta and RMSProp optimization methods. You are advised to use AdaDelta or
RMSProp because SGD appears to performs poorly on this task with this
particular model.
Papers
======
If you use this tutorial, please cite the following papers.
Introduction of the LSTM model:
* `[pdf] <http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf>`_ Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Addition of the forget gate to the LSTM model:
* `[pdf] <http://www.mitpressjournals.org/doi/pdf/10.1162/089976600300015015>`_ Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural computation, 12(10), 2451-2471.
More recent LSTM paper:
* `[pdf] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ Graves, Alex. Supervised sequence labelling with recurrent neural networks. Vol. 385. Springer, 2012.
Papers related to Theano:
* `[pdf] <http://www.iro.umontreal.ca/~lisa/pointeurs/nips2012_deep_workshop_theano_final.pdf>`_ Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
* `[pdf] <http://www.iro.umontreal.ca/~lisa/pointeurs/theano_scipy2010.pdf>`_ Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.
Thank you!
Contact
=======
Please email `Kyunghyun Cho <http://www.kyunghyuncho.me/>`_ for any
problem report or feedback. We will be glad to hear from you.
References
++++++++++
* Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
* Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural computation, 12(10), 2451-2471.
* Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
* Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
* Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2), 157-166.
* Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 142-150). Association for Computational Linguistics.