# Sentiment Analysis using RNNs


## Data 
This notebook shows how to use RNNs to perform sentiment analysis in Keras. Keras has a built-in [IMDb movie reviews dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) that we can use.

In [1]:
from keras.datasets import imdb  # import the built-in imdb dataset in Keras

# Set the vocabulary size
vocabulary_size = 5000

# Load in training and test data (note the difference in convention compared to scikit-learn)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
print("Loaded dataset with {} training samples, {} test samples".format(len(X_train), len(X_test)))

Using TensorFlow backend.


Loaded dataset with 25000 training samples, 25000 test samples


In [2]:
# Inspect a sample review and its label
print("--- Review ---")
print(X_train[7])
print("--- Label ---")
print(y_train[7])

--- Review ---
[1, 4, 2, 716, 4, 65, 7, 4, 689, 4367, 2, 2343, 4804, 2, 2, 2, 2, 2315, 2, 2, 2, 2, 4, 2, 628, 2, 37, 9, 150, 4, 2, 4069, 11, 2909, 4, 2, 847, 313, 6, 176, 2, 9, 2, 138, 9, 4434, 19, 4, 96, 183, 26, 4, 192, 15, 27, 2, 799, 2, 2, 588, 84, 11, 4, 3231, 152, 339, 2, 42, 4869, 2, 2, 345, 4804, 2, 142, 43, 218, 208, 54, 29, 853, 659, 46, 4, 882, 183, 80, 115, 30, 4, 172, 174, 10, 10, 1001, 398, 1001, 1055, 526, 34, 3717, 2, 2, 2, 17, 4, 2, 1094, 871, 64, 85, 22, 2030, 1109, 38, 230, 9, 4, 4324, 2, 251, 2, 1034, 195, 301, 14, 16, 31, 7, 4, 2, 8, 783, 2, 33, 4, 2945, 103, 465, 2, 42, 845, 45, 446, 11, 1895, 19, 184, 76, 32, 4, 2, 207, 110, 13, 197, 4, 2, 16, 601, 964, 2152, 595, 13, 258, 4, 1730, 66, 338, 55, 2, 4, 550, 728, 65, 1196, 8, 1839, 61, 1546, 42, 2, 61, 602, 120, 45, 2, 6, 320, 786, 99, 196, 2, 786, 2, 4, 225, 4, 373, 1009, 33, 4, 130, 63, 69, 72, 1104, 46, 1292, 225, 14, 66, 194, 2, 1703, 56, 8, 803, 1004, 6, 2, 155, 11, 4, 2, 3231, 45, 853, 2029, 8, 30, 6, 117, 430

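The label is an integer (0 for negative, 1 for positive), and the review itself is stored as a sequence of integers. These are word IDs that have been preassigned to individual words. To map them back to the original words, we can use the dictionary returned by `imdb.get_word_index()`. Note that `imdb.load_data()` by default shifts the word IDs by 3 (0, 1, and 2 are reserved for padding, start-of-sequence, and unknown words), so the direct lookup in the next cell prints misaligned words.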

In [3]:
# Map word IDs back to words
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print("--- Review (with words) ---")
print([id2word.get(i, " ") for i in X_train[7]])
print("--- Label ---")
print(y_train[7])

--- Review (with words) ---
['the', 'of', 'and', 'local', 'of', 'their', 'br', 'of', 'attention', 'widow', 'and', 'captures', 'parties', 'and', 'and', 'and', 'and', 'excitement', 'and', 'and', 'and', 'and', 'of', 'and', 'english', 'and', 'like', 'it', 'years', 'of', 'and', 'unintentional', 'this', 'hitchcock', 'of', 'and', 'learn', 'everyone', 'is', 'quite', 'and', 'it', 'and', 'such', 'it', 'bonus', 'film', 'of', 'too', 'seems', 'he', 'of', 'enough', 'for', 'be', 'and', 'editing', 'and', 'and', 'please', 'great', 'this', 'of', 'shoots', 'thing', '3', 'and', "it's", 'mentioning', 'and', 'and', 'given', 'parties', 'and', 'back', 'out', 'interesting', 'times', 'no', 'all', 'average', 'talking', 'some', 'of', 'nor', 'seems', 'into', 'best', 'at', 'of', 'every', 'cast', 'i', 'i', 'inside', 'keep', 'inside', 'large', 'viewer', 'who', 'obscure', 'and', 'and', 'and', 'movie', 'of', 'and', 'entirely', "you've", 'see', 'because', 'you', 'deals', 'successful', 'her', 'anything', 'it', 'of', 'ded
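
The decoded review above starts with 'the', 'of', 'and' because the raw IDs 1, 4, and 2 were looked up without accounting for the offset. The following is a minimal sketch of an offset-aware lookup, assuming the default `start_char=1`, `oov_char=2`, and `index_from=3` settings of `imdb.load_data()`. The helper names `build_id2word` and `decode_review`, the placeholder tokens, and the toy dictionary are hypothetical and only stand in for the real `imdb.get_word_index()` result.

```python
def build_id2word(word_index, index_from=3):
    """Invert a word -> rank mapping, shifting every rank by `index_from`."""
    id2word = {rank + index_from: word for word, rank in word_index.items()}
    id2word[0] = "<PAD>"    # padding value (assumed placeholder token)
    id2word[1] = "<START>"  # start-of-sequence marker
    id2word[2] = "<UNK>"    # word outside the kept vocabulary
    return id2word


def decode_review(ids, id2word):
    """Turn a list of integer IDs back into a list of tokens."""
    return [id2word.get(i, "<UNK>") for i in ids]


# Toy stand-in for imdb.get_word_index(); the real dictionary is much larger.
toy_word_index = {"the": 1, "and": 2, "a": 3, "of": 4}
toy_id2word = build_id2word(toy_word_index)
decoded = decode_review([1, 4, 2, 7, 0], toy_id2word)
print(decoded)
```

With the real dataset, the same idea would be applied as `decode_review(X_train[7], build_id2word(imdb.get_word_index()))`.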

In [4]:
import numpy as np

# Length of the longest review across the training and test sets
np.max([np.max([len(review) for review in X_train]),
        np.max([len(review) for review in X_test])])

2494

In [5]:
# Length of the shortest review in the training set
np.min([len(review) for review in X_train])

11

In [6]:
# Length of the longest review in the training set
np.max([len(review) for review in X_train])

2494

Unlike our Bag-of-Words approach, where we simply summarized the counts of each word in a document, this representation essentially retains the entire sequence of words (minus punctuation, stopwords, etc.). This is critical for RNNs to function. But it also means that now the features can be of different lengths.


The maximum review length in the training set is 2494 words, while the minimum is only 11.


In order to feed this data into the RNN, all input documents must have the same length. Let's limit the maximum review length to `max_words` by truncating longer reviews and **padding** shorter reviews with a null value (0). This can be done using the [`pad_sequences()`](https://keras.io/preprocessing/sequence/#pad_sequences) function in Keras. For now, we set `max_words` to 500.

In [7]:
from keras.preprocessing import sequence

# Set the maximum number of words per document (for both training and testing)
max_words = 500

# Pad (or truncate) the sequences in X_train and X_test to max_words
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
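
To make the padding and truncation step concrete, here is a small NumPy sketch that mimics what `pad_sequences()` does with its default settings (`padding='pre'` and `truncating='pre'`): zeros are added at the front of short sequences, and long sequences keep only their last `maxlen` items. The helper `pad_or_truncate` is hypothetical and written only for illustration; it is not the Keras implementation and assumes `maxlen` is at least 1.

```python
import numpy as np


def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq`, left-padding with `value`."""
    seq = list(seq)[-maxlen:]                     # drop items from the front
    out = np.full(maxlen, value, dtype="int32")   # start with all padding
    if seq:
        out[-len(seq):] = seq                     # data sits at the end
    return out


short = pad_or_truncate([5, 6, 7], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long_)
```

One consequence of these defaults is that, for reviews longer than `max_words`, the model sees the end of the review rather than its beginning.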

### Design an RNN model for sentiment analysis

Build your model architecture in the code cell below. 

The input is a sequence of words (technically, integer word IDs) of maximum length = `max_words`, and the output is a binary sentiment label (0 or 1).

In [8]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Activation

# Design your model
embedding_size = 50  # dimensionality of the word embedding vectors
use_dropout = True   # whether to add dropout after the LSTM layer

model = Sequential()
# Map each word ID to a dense vector of length embedding_size
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
# A single LSTM layer with 50 hidden units; only its final output is passed on
model.add(LSTM(50))
if use_dropout:
    model.add(Dropout(0.25))
# A single sigmoid unit outputs the probability of a positive review
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 50)           250000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51        
Total params: 270,251
Trainable params: 270,251
Non-trainable params: 0
_________________________________________________________________
None


#### Architecture and parameters

The first layer is an embedding layer that maps each of the 5,000 vocabulary words to a 50-dimensional vector. It is followed by a single LSTM layer with 50 hidden units, a dropout layer (rate 0.25), and a dense layer of size 1 with a sigmoid activation.
The model has 270,251 trainable parameters.
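
The per-layer counts in the model summary above can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias vector); the variable names are chosen here for illustration.

```python
vocabulary_size = 5000
embedding_size = 50
lstm_units = 50

# One embedding vector per vocabulary entry
embedding_params = vocabulary_size * embedding_size

# 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The dropout layer contributes no parameters, which is why it shows 0 in the summary.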


### Train and evaluate model


In [9]:
# Compile your model, specifying a loss function, optimizer, and metrics
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [10]:
# Specify training parameters: batch size and number of epochs
batch_size = 64
num_epochs = 5

# Hold out the first batch_size (64) training samples for validation (not used for training)
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]  # rest for training

# Train the model
model.fit(x=X_train2, y=y_train2, validation_data=(X_valid, y_valid),
          batch_size=batch_size, epochs=num_epochs)

Train on 24936 samples, validate on 64 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f5be56cb6d8>
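
The run above validates on only 64 samples, so the validation metrics are likely to be noisy. One alternative is a larger, shuffled hold-out set. The sketch below is an illustration rather than what this notebook ran: the helper `holdout_split`, its default fraction, and the dummy arrays are all hypothetical. Keras' `fit()` also accepts a `validation_split` argument that serves a similar purpose.

```python
import numpy as np


def holdout_split(X, y, valid_fraction=0.1, seed=0):
    """Shuffle the samples and split off a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))            # random order of sample indices
    n_valid = int(len(X) * valid_fraction)     # size of the validation set
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return (X[train_idx], y[train_idx]), (X[valid_idx], y[valid_idx])


# Dummy arrays standing in for the padded X_train and y_train
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2
(X_tr, y_tr), (X_va, y_va) = holdout_split(X_demo, y_demo, valid_fraction=0.2)
print(X_tr.shape, X_va.shape)
```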

In [11]:
# Save your model, so that you can quickly load it in future (and perhaps resume training)
import os

model_file = "rnn_model.h5"  # HDF5 file
cache_dir = os.path.join("cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

model.save(os.path.join(cache_dir, model_file))

# To load it again later:
# from keras.models import load_model
# model = load_model(os.path.join(cache_dir, model_file))

Once you have trained your model, it's time to see how well it performs on unseen test data.

In [12]:
# Evaluate the model on the test set
scores = model.evaluate(X_test, y_test, verbose=0)  # returns loss and other metrics specified in model.compile()
print("Test accuracy:", scores[1])  # scores[1] should correspond to accuracy if we passed in metrics=['accuracy']

Test accuracy: 0.87664
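
The accuracy reported above is the fraction of test reviews whose predicted label matches the true label. A minimal sketch of that calculation follows, assuming the model's sigmoid outputs are thresholded at 0.5 to obtain a label. The helper `binary_accuracy` and the demo arrays are hypothetical; with the real model, `y_prob` would come from `model.predict(X_test)`.

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    y_pred = (np.asarray(y_prob).ravel() > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true).ravel()))


# Made-up labels and probabilities, shaped like model.predict() output (n, 1)
y_true_demo = np.array([1, 0, 1, 1])
y_prob_demo = np.array([[0.9], [0.2], [0.4], [0.7]])
acc = binary_accuracy(y_true_demo, y_prob_demo)
print(acc)
```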


In [15]:
# Scratch cell (left commented out): predict the sentiment of a single test review
# n = 5
# print(model.predict(X_test[n:n+1]))  # predicted probability of the positive class
# print([id2word.get(i, " ") for i in X_test[n]])  # the padded review as words

#### Comparing RNNs and Traditional Methods

How well does your RNN model perform compared to the BoW + Gradient-Boosted Decision Trees?

[GaussianNB] Accuracy: train = 0.8198, test = 0.72768

[GradientBoostingClassifier] Accuracy: train = 0.79472, test = 0.79004

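RNN Test accuracy: 0.87664

On the test set, the RNN outperforms both bag-of-words baselines: it is about 8.7 percentage points more accurate than the gradient-boosted classifier and about 14.9 points more accurate than Gaussian naive Bayes.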

## Extensions

Possible extensions include experimenting with different architectures, layers, and parameters.