# Using RNNs for movie reviews

Keras has an imdb_lstm.py example of using RNNs. The dataset consists of user-generated movie reviews, each labeled according to whether the user liked the movie or not.

More info on the dataset is here:
https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification

The objective is to use an RNN to do sentiment analysis on full-text movie reviews. An artificial neural network will be trained to read movie reviews and guess whether the author liked the movie or not.

Understanding written language requires keeping track of all the words in a sentence, so an RNN is used to keep a "memory" of the words that have come before as it "reads" sentences over time. In particular, an LSTM (Long Short-Term Memory) will be used because the network shouldn't "forget" words too quickly: words early on in a sentence can affect the meaning of that sentence significantly.
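To make the idea of "memory" concrete, here is a minimal sketch of a plain recurrent update. It is not an LSTM and not what Keras runs internally; the function name and the weights `w_x` and `w_h` are arbitrary choices made for illustration. The only point is that the hidden state after the last step still depends on the first input.

```python
import math

def run_simple_rnn(inputs, w_x=1.0, w_h=0.5):
    """Run a one-unit recurrent cell over a sequence and return the final state.

    w_x and w_h are illustrative weights, not learned values.
    """
    h = 0.0  # hidden state: the network's "memory"
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h)
    return h

# Two sequences that differ only in their first element.
state_a = run_simple_rnn([1.0, 0.0, 0.0])
state_b = run_simple_rnn([0.0, 0.0, 0.0])
print(state_a, state_b)
```

The two final states differ, so the first input is still "remembered", but its influence shrinks at every step. That fading is the kind of forgetting an LSTM is designed to reduce.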

In [1]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

# Import training and test data

Only the 20,000 most frequent words in the dataset will be used.
- Each word has to be converted to a number for the RNN to work with it. Luckily, this has already been done.

The dataset includes 25,000 training reviews and 25,000 testing reviews.

In [2]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)

# What a movie review looks like

In [3]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,

# Preprocessing

That doesn't look like a movie review. The dataset has spared us a lot of trouble: the words have already been converted to integer indices. The actual letters that make up a word don't matter as far as the model is concerned; what matters are the words themselves, and the model needs numbers to work with, not letters.

Each number in the training features represents one specific word. Words are indexed by how frequently they occur in the dataset, so lower numbers correspond to more common words.
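The mapping from numbers back to words can be sketched as follows, assuming the default `imdb.load_data` settings (index 1 marks the start of a review, 2 stands for an out-of-vocabulary word, and real word indices are shifted by 3; 0 is the padding value). The helper `decode_review` and the toy vocabulary are made up for illustration.

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer indices back into a readable string."""
    # Invert the word -> index mapping and apply the offset.
    index_to_word = {idx + index_from: word for word, idx in word_index.items()}
    # Indices below index_from are reserved markers, not real words.
    special = {0: "<pad>", 1: "<start>", 2: "<unk>"}
    return " ".join(
        special.get(i) or index_to_word.get(i, "<unk>") for i in encoded
    )

# Toy vocabulary, made up for illustration; the real one is far larger.
toy_word_index = {"this": 11, "film": 19, "was": 13}
decoded = decode_review([1, 14, 22, 16, 2], toy_word_index)
print(decoded)  # <start> this film was <unk>
```

To decode a real review, the toy dictionary would be replaced with the one returned by `imdb.get_word_index()`.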

# What do the labels look like?

In [4]:
y_train[0]

1

The label is either 0 or 1, which indicates whether the reviewer liked the movie (1) or not (0).

# Limiting the features

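The cost of training an RNN grows quickly with sequence length, so to keep things manageable, let's limit each review to 80 words. Note that pad_sequences truncates from the beginning by default (truncating='pre'), so longer reviews keep their last 80 words, and shorter ones are padded with zeros at the front: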

In [5]:
x_train = sequence.pad_sequences(x_train, maxlen=80)
x_test = sequence.pad_sequences(x_test, maxlen=80)
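What `pad_sequences` does to a single review under its default settings (`padding='pre'`, `truncating='pre'`) can be sketched in plain Python. The helper name `pad_or_truncate` is hypothetical and this is only an approximation of the behaviour for one list, not the library's implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic pad_sequences' default behaviour for a single sequence."""
    if len(seq) >= maxlen:
        # truncating='pre': drop tokens from the front, keep the tail.
        return seq[-maxlen:]
    # padding='pre': fill the front with the padding value.
    return [value] * (maxlen - len(seq)) + seq

long_review = list(range(1, 11))   # 10 tokens: 1..10
short_review = [7, 8, 9]

print(pad_or_truncate(long_review, 5))   # [6, 7, 8, 9, 10]
print(pad_or_truncate(short_review, 5))  # [0, 0, 7, 8, 9]
```

Passing `truncating='post'` to the real `pad_sequences` keeps the beginning of each review instead of the end.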

# Setting up the RNN

Considering how complicated an LSTM recurrent neural network is under the hood, it's really amazing how easy this is to do with Keras.

1. An Embedding layer, which converts the integer word indices into dense vectors of fixed size that are better suited to a neural network. It is generally used with index-based text data like we have here. The 20000 is the vocabulary size and 128 is the length of the vector produced for each word.

2. An LSTM layer for the RNN itself, with 128 units (the same as the Embedding output size here, although the two are not required to match) and dropout terms to reduce overfitting, which RNNs are particularly prone to.

3. A single neuron with a sigmoid activation function, which outputs a value between 0 and 1 for the binary sentiment classification.

In [6]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
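As a rough check on the size of this model, the number of trainable parameters can be worked out by hand. The arithmetic below assumes the standard LSTM formulation with four gate/transform blocks and one bias vector per block, which is the Keras default; the variable names are just for illustration.

```python
vocab_size = 20000   # first argument to Embedding
embed_dim = 128      # second argument to Embedding
lstm_units = 128     # argument to LSTM

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: four blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense(1): one weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2560000 131584 129 2691713
```

Almost all of the parameters sit in the Embedding layer. Calling `model.summary()` on the model is a way to compare these figures against what Keras reports.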

# Loss function

Because this is a binary classification problem, we use the binary_crossentropy loss function. The Adam optimizer is usually a good choice.

In [7]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
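For intuition about what the loss measures, here is a small sketch of binary cross-entropy for a single prediction. The function below is a hand-written illustration, not the Keras implementation, and the clipping constant `eps` is an assumption made to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for a single label/prediction pair."""
    # Clip so that log(0) can never occur.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

confident_right = binary_crossentropy(1, 0.9)
confident_wrong = binary_crossentropy(1, 0.1)
print(round(confident_right, 4), round(confident_wrong, 4))  # 0.1054 2.3026
```

A confident wrong answer is penalized far more heavily than a confident right answer is rewarded, which is why the loss can rise sharply even when accuracy barely changes.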

# Training

RNNs, like CNNs, are very resource heavy. Keeping the batch size relatively small is the key to enabling this to run. In the real world, of course, you'd be taking advantage of GPUs installed across many computers on a cluster to make this scale a lot better.
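The batch size also determines how many weight updates happen per epoch. With the 25,000 training reviews in this dataset and the batch size of 32 used in this notebook's `model.fit` call, the count works out as below (variable names are illustrative):

```python
import math

num_train_reviews = 25000  # size of the IMDB training split
batch_size = 32            # value passed to model.fit

# The last batch is smaller than the rest, hence the ceiling.
steps_per_epoch = math.ceil(num_train_reviews / batch_size)
print(steps_per_epoch)  # 782
```

This matches the `782/782` step counter shown in the training output.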

In [None]:
model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          verbose=2,
          validation_data=(x_test, y_test))

Epoch 1/15
782/782 - 209s - loss: 0.4362 - accuracy: 0.7948 - val_loss: 0.3741 - val_accuracy: 0.8326
Epoch 2/15
782/782 - 200s - loss: 0.2607 - accuracy: 0.8963 - val_loss: 0.3742 - val_accuracy: 0.8401
Epoch 3/15
782/782 - 202s - loss: 0.1708 - accuracy: 0.9360 - val_loss: 0.4487 - val_accuracy: 0.8236
Epoch 4/15
782/782 - 200s - loss: 0.1127 - accuracy: 0.9582 - val_loss: 0.5914 - val_accuracy: 0.8196
Epoch 5/15
782/782 - 200s - loss: 0.0668 - accuracy: 0.9767 - val_loss: 0.6577 - val_accuracy: 0.8184
Epoch 6/15
782/782 - 205s - loss: 0.0507 - accuracy: 0.9835 - val_loss: 0.6496 - val_accuracy: 0.8084
Epoch 7/15
782/782 - 205s - loss: 0.0393 - accuracy: 0.9866 - val_loss: 0.7526 - val_accuracy: 0.8212
Epoch 8/15
782/782 - 199s - loss: 0.0298 - accuracy: 0.9904 - val_loss: 0.8798 - val_accuracy: 0.8174
Epoch 9/15
782/782 - 152s - loss: 0.0261 - accuracy: 0.9920 - val_loss: 0.8043 - val_accuracy: 0.8171
Epoch 10/15
782/782 - 149s - loss: 0.0175 - accuracy: 0.9945 - val_loss: 1.1897 - 

# Results

In [None]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=32,
                            verbose=2)
print('Test score:', score)
print('Test accuracy:', acc)

About 80%, eh? Not too bad, considering that each review was cut down to just 80 words.

Note that the validation accuracy peaked at about 84% in the second epoch and then drifted down, while the training accuracy kept climbing towards 99% and the validation loss rose from about 0.37 to over 1.1. The model is overfitting. This is a case where early stopping would have been beneficial.
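To see where early stopping would have cut this run short, here is a minimal sketch of a patience rule applied to the logged validation losses. The helper `early_stopping_epoch` and the choice of `patience=2` are assumptions for illustration, not part of the original notebook.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values copied from the training log (epochs 1 to 10).
logged_val_loss = [0.3741, 0.3742, 0.4487, 0.5914, 0.6577,
                   0.6496, 0.7526, 0.8798, 0.8043, 1.1897]
stop_epoch, best_epoch = early_stopping_epoch(logged_val_loss, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

Under this rule training would have ended after the third epoch, with the first epoch's weights being the best by validation loss. In Keras, the `EarlyStopping` callback from `tensorflow.keras.callbacks` (with arguments such as `monitor='val_loss'`, `patience`, and `restore_best_weights=True`) can be passed to `model.fit` through its `callbacks` argument to get this behaviour.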