## <font color='darkblue'><b>Preface</b></font>
([article source](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/)) <font size='3ptx'><b>Sequence classification is a predictive modeling problem where you have some sequence of inputs over space or time, and the task is to predict a category for the sequence</b></font>.

This problem is difficult because the sequences can vary in length, comprise a very large vocabulary of input symbols, and may require the model to learn the long-term context or dependencies between symbols in the input sequence.

<b>In this post, you will discover how you can develop LSTM recurrent neural network models for sequence classification problems in Python using the [Keras deep learning library](https://keras.io/getting_started/).</b>

After reading this post, you will know:
* How to develop an LSTM model for a sequence classification problem
* How to reduce overfitting in your LSTM models through the use of dropout
* How to combine LSTM models with Convolutional Neural Networks that excel at learning spatial relationships

### <font color='darkgreen'>Problem Description</font>
<b><font size='3ptx'>The problem that you will use to demonstrate sequence learning in this tutorial is the [IMDB movie review sentiment classification problem](http://ai.stanford.edu/~amaas/data/sentiment/)</font>. Each movie review is a variable sequence of words, and the sentiment of each movie review must be classified</b>.

The Large Movie Review Dataset (<font color='brown'>often referred to as the IMDB dataset</font>) contains 25,000 highly polar movie reviews (<font color='brown'>good or bad</font>) for training and the same amount again for testing. <b>The problem is to determine whether a given movie review has a positive or negative sentiment</b>.

The data was collected by [**Stanford researchers and used in a 2011 paper**](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) where a 50/50 split of the data was used for training and testing. <b>An accuracy of 88.89% was achieved</b>.

Keras provides built-in access to the IMDB dataset. The <font color='blue'>imdb.load_data()</font> function allows you to load the dataset in a format ready for use in neural networks and deep learning models.

<b>The words have been replaced by integers that indicate the ordered frequency of each word in the dataset. The sentences in each review are therefore comprised of a sequence of integers</b>.

### <b><font color='darkgreen'>Word Embedding</font></b>
<b>You will map each movie review into a real vector domain, a popular technique when working with text—called word embedding</b>. This is a technique where words are encoded as real-valued vectors in a high dimensional space, where <b>the similarity between words in terms of meaning translates to closeness in the vector space</b>.

Keras provides a convenient way to <b>convert positive integer representations of words into a word embedding by an Embedding layer</b>.

You will <b>map each word onto a 32-length real valued vector</b>. You will also <b>limit the total number of words that you are interested in modeling to the 5000 most frequent words and zero out the rest</b>. Finally, the sequence length (<font color='brown'>number of words</font>) in each review varies, so you will <b>constrain each review to be 500 words</b>, truncating long reviews and padding the shorter reviews with zero values.

Now that you have defined your problem and how the data will be prepared and modeled, you are ready to develop an LSTM model to classify the sentiment of movie reviews.

<a id='lstm'></a>
## <font color='darkblue'>LSTM models</font>
* <b><font size='3ptx'><a href='#lstm_1'>Simple LSTM for Sequence Classification</a></font></b>
* <b><font size='3ptx'><a href='#lstm_2'>LSTM for Sequence Classification with Dropout</a></font></b>
* <b><font size='3ptx'><a href='#lstm_3'>Bidirectional LSTM for Sequence Classification</a></font></b>
* <b><font size='3ptx'><a href='#lstm_4'>LSTM and Convolutional Neural Network for Sequence Classification</a></font></b>

<a id='lstm_1'></a>
### <font color='darkgreen'>Simple LSTM for Sequence Classification</font>
<b><font size='3ptx'>You can quickly develop a small LSTM for the IMDB problem and achieve good accuracy.</font></b>

Let’s start by importing the classes and functions required for this model and initializing the random number generator to a constant value to ensure you can easily reproduce the results.

In [28]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence
# fix random seed for reproducibility
tf.random.set_seed(7)

You need to load the IMDB dataset. <b>You are constraining the dataset to the top 5,000 words. You will also split the dataset into train (50%) and test (50%) sets</b>.

In [2]:
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

Next, you need to <b>truncate and pad the input sequences, so they are all the same length for modeling</b>. The model will learn that the zero values carry no information. <b>The sequences are not the same length in terms of content, but same-length vectors are required to perform the computation in Keras</b>.

In [3]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [4]:
# 25,000 records of movie reviews; each review has at most 500 words.
print(f'Shape of X_train={X_train.shape}')

Shape of X_train=(25000, 500)


You can now define, compile and fit your LSTM model.

The first layer is the Embedded layer that uses 32-length vectors to represent each word. The next layer is the LSTM layer with 100 memory units (<font color='brown'>smart neurons</font>). Finally, because this is a classification problem, you will use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (<font color='brown'>good and bad</font>) in the problem.

Because it is a binary classification problem, log loss is used as the loss function (<font color='brown'>[binary_crossentropy](https://keras.io/api/losses/probabilistic_losses/#binarycrossentropy-function) in Keras</font>). The efficient ADAM optimization algorithm is used. The model is fit for only two epochs because it quickly overfits the problem. <b>A large batch size of 64 reviews is used to space out weight updates</b>.

In [5]:
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

2023-08-24 15:00:14.090798: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-08-24 15:00:14.090852: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-08-24 15:00:14.090897: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ubuntu): /proc/driver/nvidia/version does not exist
2023-08-24 15:00:14.091183: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 500, 32)           160000    
                                                                 
 lstm (LSTM)                 (None, 100)               53200     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


In [6]:
%%time
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 16min 59s, sys: 28min 8s, total: 45min 7s
Wall time: 17min 15s


<keras.callbacks.History at 0x7fe4c9e286d0>

Once fit, you can estimate the performance of the model on unseen reviews.

In [7]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.37%


You can see that this simple LSTM with little tuning achieves near state-of-the-art results on the IMDB problem. Importantly, this is a template that you can use to apply LSTM networks to your own sequence classification problems.

Now, let’s look at some extensions of this simple model that you may also want to bring to your own problems.

<a id='lstm_2'></a>
### <font color='darkgreen'>LSTM for Sequence Classification with Dropout</font>
<b><font size='3ptx'>Recurrent neural networks like LSTM generally have the problem of overfitting.</font></b>

[**Dropout**](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/) can be applied between layers using the Dropout Keras layer. You can do this easily by adding new Dropout layers between the Embedding and LSTM layers and the LSTM and Dense output layers. For example:
```python
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
```

The full code listing example above with the addition of Dropout layers is as follows:

In [12]:
def get_lstm_model_with_dropout():
  model = Sequential()
  model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
  model.add(Dropout(0.2))
  model.add(LSTM(100))
  model.add(Dropout(0.2))
  model.add(Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  return model

In [13]:
model_with_dropout = get_lstm_model_with_dropout()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 500, 32)           160000    
                                                                 
 dropout_2 (Dropout)         (None, 500, 32)           0         
                                                                 
 lstm_2 (LSTM)               (None, 100)               53200     
                                                                 
 dropout_3 (Dropout)         (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


In [14]:
%%time
model_with_dropout.fit(X_train, y_train, epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 13min 49s, sys: 21min 31s, total: 35min 20s
Wall time: 12min 50s


<keras.callbacks.History at 0x7fe4cd69a6a0>

In [15]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.37%


You can see dropout having the desired impact on training with a slightly slower trend in convergence and, in this case, a lower final accuracy. The model could probably use a few more epochs of training and may achieve a higher skill (<font color='brown'>try it and see</font>).

Alternately, dropout can be applied to the input and recurrent connections of the memory units with the LSTM precisely and separately.

Keras provides this capability with parameters on the LSTM layer, <b>the <font color='violet'>dropout</font> for configuring the input dropout, and <font color='violet'>recurrent_dropout</font> for configuring the recurrent dropout</b>. For example, you can modify the first example to add dropout to the input and recurrent connections as follows:
```python
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
```

The full code listing with more precise LSTM dropout is listed below for completeness.

In [16]:
def get_lstm_model_with_lstm_dropout(dropout=0.2, recurrent_dropout=0.2):
  model = Sequential()
  model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
  model.add(LSTM(100, dropout=dropout, recurrent_dropout=recurrent_dropout))
  model.add(Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  return model

In [17]:
model_with_lstm_dropout = get_lstm_model_with_lstm_dropout()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 500, 32)           160000    
                                                                 
 lstm_3 (LSTM)               (None, 100)               53200     
                                                                 
 dense_3 (Dense)             (None, 1)                 101       
                                                                 
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


In [18]:
model_with_lstm_dropout.fit(X_train, y_train, epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fe4c9f61fa0>

In [19]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.37%


<b>You can see that the LSTM-specific dropout has a more pronounced effect on the convergence of the network than the layer-wise dropout</b>. Like above, the number of epochs was kept constant and could be increased to see if the skill of the model could be further lifted.

Dropout is a powerful technique for combating overfitting in your LSTM models, and it is a good idea to try both methods. Still, you may get better results with the gate-specific dropout provided in Keras.

<a id='lstm_3'></a>
### <font color='darkgreen'><b>Bidirectional LSTM for Sequence Classification</b></font> ([back](#lstm))
Sometimes, a sequence is better used in reversed order. In those cases, you can simply reverse a vector `x` using the Python syntax `x[::-1]` before using it to train your LSTM network. Sometimes, neither the forward nor the reversed order works perfectly, but combining them will give better results. In this case, you will need a bidirectional LSTM network.

<b>A bidirectional LSTM network is simply two separate LSTM networks; one feeds with a forward sequence and another with reversed sequence. Then the output of the two LSTM networks is concatenated together before being fed to the subsequent layers of the network</b>. In Keras, you have the function [Bidirectional()](https://keras.io/api/layers/recurrent_layers/bidirectional/) to clone an LSTM layer for forward-backward input and concatenate their output. For example:
```python
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(1, activation='sigmoid'))
```

Since you created not one, but two LSTMs with 100 units each, <b>this network will take twice the amount of time to train. Depending on the problem, this additional cost may be justified</b>.

The full code listing with adding the bidirectional LSTM to the last example is listed below for completeness:

In [22]:
def get_bidir_lstm_model_with_lstm_dropout(dropout=0.2, recurrent_dropout=0.2):
  model = Sequential()
  model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
  model.add(Bidirectional(LSTM(100, dropout=dropout, recurrent_dropout=recurrent_dropout)))
  model.add(Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  return model

In [23]:
model_with_bidir_lstm_dropout = get_bidir_lstm_model_with_lstm_dropout()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 500, 32)           160000    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              106400    
 l)                                                              
                                                                 
 dense_4 (Dense)             (None, 1)                 201       
                                                                 
Total params: 266,601
Trainable params: 266,601
Non-trainable params: 0
_________________________________________________________________
None


In [24]:
%%time
model_with_bidir_lstm_dropout.fit(X_train, y_train, epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 1h 15min 30s, sys: 13min 55s, total: 1h 29min 25s
Wall time: 23min 47s


<keras.callbacks.History at 0x7fe4c99a50a0>

In [25]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.37%


It seems you can only get a slight improvement but with a significantly longer training time.

<a id='lstm_4'></a>
### <font color='darkgreen'>LSTM and Convolutional Neural Network for Sequence Classification</font> ([back](#lstm))
<b><font size='3ptx'>Convolutional neural networks excel at learning the spatial structure in input data.</font></b>

The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and <b>the CNN may be able to pick out invariant features for the good and bad sentiment. This learned spatial feature may then be learned as sequences by an LSTM layer</b>.

<b>You can easily add a one-dimensional CNN and max pooling layers after the Embedding layer, which then feeds the consolidated features to the LSTM</b>. You can use a smallish set of 32 features with a small filter length of 3. The pooling layer can use the standard length of 2 to halve the feature map size. For example, you would create the model as follows:
```python
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
```

The full code listing with CNN and LSTM layers is listed below for completeness.

In [26]:
def get_cnn_then_lstm_model():
  model = Sequential()
  model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
  model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
  model.add(MaxPooling1D(pool_size=2))
  model.add(LSTM(100))
  model.add(Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  return model

In [29]:
cnn_lstm_model = get_cnn_then_lstm_model()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 500, 32)           160000    
                                                                 
 conv1d (Conv1D)             (None, 500, 32)           3104      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 250, 32)          0         
 )                                                               
                                                                 
 lstm_5 (LSTM)               (None, 100)               53200     
                                                                 
 dense_5 (Dense)             (None, 1)                 101       
                                                                 
Total params: 216,405
Trainable params: 216,405
Non-trainable params: 0
________________________________________________

In [30]:
%%time
model.fit(X_train, y_train, epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 13min 20s, sys: 22min 43s, total: 36min 3s
Wall time: 12min 55s


<keras.callbacks.History at 0x7fe4c8dfc880>

In [31]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.20%


You can see that you achieve slightly better results than the first example, although with fewer weights and faster training time. You might expect that even better results could be achieved if this example was further extended to use dropout.

## <font color='darkblue'>Resources</font>
Below are some resources if you are interested in diving deeper into sequence prediction or this specific example.
* [Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/)
* [Medium - LSTMs for regression](https://bobrupakroy.medium.com/lstms-for-regression-cc9b6677697f)