# Hands-on Movie Review Sentiment Classification

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.



In [1]:
# LSTM and CNN for sequence classification in the IMDB dataset
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Embedding
from keras.preprocessing import sequence


In [2]:
# fix random seed for reproducibility
np.random.seed(7)

### Loading data 

Load the IMDB dataset, keeping only the 5,000 most frequent words. The dataset ships already split into equal halves: 25,000 reviews for training (50%) and 25,000 for testing (50%).

In [3]:
# load the dataset but only keep the top n words;
# rarer words are replaced by the out-of-vocabulary index (2)
top_words = 5000
(train_data, train_label), (test_data, test_label) = imdb.load_data(num_words=top_words)

# load the word index so the encoded reviews can be decoded back to text later
index = imdb.get_word_index()
reverse_index = {value: key for key, value in index.items()}


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
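
As a rough illustration of what limiting the vocabulary means, the sketch below replaces every word index at or above the cutoff with the reserved "unknown word" index 2. The helper `cap_vocabulary` and the toy review are made up for this example; this is not the Keras source, only a sketch of the behaviour assumed here.

```python
# Hypothetical helper illustrating vocabulary capping; not the Keras implementation.
OOV_INDEX = 2  # index assumed to be reserved for "unknown word"

def cap_vocabulary(encoded_review, num_words, oov_index=OOV_INDEX):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in encoded_review]

toy_review = [1, 14, 22, 7000, 43, 12000, 530]  # made-up word indices
capped = cap_vocabulary(toy_review, num_words=5000)
print(capped)  # [1, 14, 22, 2, 43, 2, 530]
```

This is why the encoded reviews in this notebook contain so many 2s: each one marks a word that fell outside the top 5,000.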


In [4]:
# view data shape
print('train shape: ', train_data.shape)
print('test shape: ', test_data.shape)


train shape:  (25000,)
test shape:  (25000,)


### Explore data

Let's look at the first 3 entries of each array.


In [5]:
print(train_data[:3])
print(train_label[:3])
print(test_data[:3])
print(test_label[:3])


[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32])
 list([1, 194, 1153, 194, 2, 78, 228, 5, 6, 1463, 4369, 2, 134, 26, 

Let's decode the first training review to see the original text.

In [6]:
def decode_text(encoded_text):
    # indices are offset by 3: 0, 1 and 2 are reserved (padding, start, unknown) and decode to '#'
    return " ".join([reverse_index.get(i - 3, '#') for i in encoded_text])

print("Original sentence")
print(decode_text(train_data[0]))

Original sentence
# this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly # was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little # that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big # for the whole film but these children are amazing and should be # for what they have done don't you thin
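
The `i - 3` in `decode_text` can look arbitrary, so here is a small self-contained sketch of the same lookup on a toy vocabulary. The dictionary `toy_word_index` and the function `decode_toy` are invented for illustration. The assumption is that the encoded reviews shift every word rank by 3 to make room for the reserved indices 0 (padding), 1 (start of review) and 2 (unknown word).

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

INDEX_OFFSET = 3  # assumed offset: 0 = padding, 1 = start of review, 2 = unknown word

def decode_toy(encoded_review):
    """Map encoded indices back to words; reserved indices become '#'."""
    return " ".join(
        toy_reverse_index.get(i - INDEX_OFFSET, "#") for i in encoded_review
    )

encoded = [1, 4, 5, 6, 2, 7]  # start, the, film, was, unknown, brilliant
decoded = decode_toy(encoded)
print(decoded)  # # the film was # brilliant
```

The leading `#` in every decoded review is the start marker, and the `#` characters inside the text are words outside the vocabulary.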

In [7]:
print("Encoded sentence")
print(test_data[0])

Encoded sentence
[1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4, 177, 2, 394, 354, 4, 123, 9, 1035, 1035, 1035, 10, 10, 13, 92, 124, 89, 488, 2, 100, 28, 1668, 14, 31, 23, 27, 2, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 2, 38, 32, 25, 2, 451, 202, 14, 6, 717]


In [8]:
print("Number of words")
len(train_data[0])

Number of words


218

In [9]:
print("Label (Expected result)")
print(train_label[0])

Label (Expected result)
1


In [10]:
print("Original sentence")
print(decode_text(test_data[0]))

Original sentence
# please give this one a miss br br # # and the rest of the cast # terrible performances the show is flat flat flat br br i don't know how michael # could have allowed this one on his # he almost seemed to know this wasn't going to work out and his performance was quite # so all you # fans give this a miss


In [11]:
print("Encoded sentence")
print(test_data[0])

Encoded sentence
[1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4, 177, 2, 394, 354, 4, 123, 9, 1035, 1035, 1035, 10, 10, 13, 92, 124, 89, 488, 2, 100, 28, 1668, 14, 31, 23, 27, 2, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 2, 38, 32, 25, 2, 451, 202, 14, 6, 717]


In [12]:
print("Label (Expected result)")
print(test_label[0])

Label (Expected result)
0


### Statistical Summary

Let's compute some statistics of the dataset we've loaded

In [13]:
#Training data
print("=== Training data ===")
# print  number of unique words
print("Number of words: %d " % len(np.unique(np.hstack(train_data))))


# print the average review length
print("Average review length:")
result = [len(x) for x in train_data]
print("Mean %.2f words (%f)" % (np.mean(result), np.std(result)))

#Frequency of the label
print("Label frequency:")
train_label_1, train_freq = np.unique(train_label, return_counts=True)
print(np.asarray((train_label_1, train_freq)))

=== Training data ===
Number of words: 4998 
Average review length:
Mean 238.71 words (176.493674)
Label frequency:
[[    0     1]
 [12500 12500]]


In [14]:
#Test data
print("\n\n=== Test data ===")
# print  number of unique words
print("Number of words: %d " % len(np.unique(np.hstack(test_data))))

# print the average review length
print("Average review length:")
result = [len(x) for x in test_data]
print("Mean %.2f words (%f)" % (np.mean(result), np.std(result)))

#Frequency of the label
print("Label frequency: ")
test_label_1, test_freq = np.unique(test_label, return_counts=True)
print(np.asarray((test_label_1, test_freq)))



=== Test data ===
Number of words: 4997 
Average review length:
Mean 230.80 words (169.161087)
Label frequency: 
[[    0     1]
 [12500 12500]]


### Prepare data 

Truncate and pad the input sequences so that they all have the same length, because Keras requires equal-length vectors to batch the reviews into a single array for computation.

In [15]:
# truncate and pad input sequences
max_review_length = 500
train_data = sequence.pad_sequences(train_data, maxlen=max_review_length)
test_data = sequence.pad_sequences(test_data, maxlen=max_review_length)
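
To make the effect of padding concrete, here is a minimal NumPy sketch of front padding and front truncation, which is the behaviour assumed for `pad_sequences` with its default settings. The helper `pad_pre` is hypothetical and is not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]  # drop items from the front if too long
        if trimmed:                    # guard: an empty review stays all padding
            out[row, -len(trimmed):] = trimmed
    return out

toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9, 10]]  # made-up encoded reviews
padded = pad_pre(toy_reviews, maxlen=5)
print(padded)
# [[ 0  0  1 14 22]
#  [ 6  7  8  9 10]]
```

Padding at the front keeps the actual words next to the end of the sequence, which is the last thing the LSTM reads before producing its output.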


Let's print the padded vectors

In [16]:
print(train_data[0])
print(test_data[0])



[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

Decoded text after padding (each padding zero decodes to '#')

In [17]:
print(decode_text(train_data[0])) 
print(decode_text(test_data[0])) 


# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film

### Create The Model

- The first layer is the Embedding layer, which represents each word as a 32-dimensional vector.
- Next, a Conv1D layer with 32 filters of width 3 extracts local features, and a MaxPooling1D layer with pool size 2 halves the sequence length from 500 to 250.
- The next layer is the LSTM layer with 100 memory units.
- Finally, because this is a classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (good and bad) in the problem.

In [18]:
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

### Compile Model

Because it is a binary classification problem, log loss is used as the loss function (binary_crossentropy in Keras). The Adam optimization algorithm is used to update the weights.
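
For intuition about what log loss measures, the following NumPy sketch computes it by hand for a couple of made-up predictions. The function name `binary_crossentropy` and the clipping constant `eps` are choices made for this example, not a reproduction of the Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean log loss for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # clip to avoid log(0) for predictions of exactly 0 or 1
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

# made-up labels and probabilities
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong.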

In [19]:
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# summary() prints the table itself; wrapping it in print() would also print "None"
model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 500, 32)           160000    
                                                                 
 conv1d (Conv1D)             (None, 500, 32)           3104      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 250, 32)          0         
 )                                                               
                                                                 
 lstm (LSTM)                 (None, 100)               53200     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 216,405
Trainable params: 216,405
Non-trainable params: 0
__________________________________________________
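
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the standard formulas for each layer type; the variable names are chosen for this example.

```python
# Hyperparameters taken from the model definition above
top_words = 5000
embedding_dim = 32
conv_filters = 32
kernel_size = 3
lstm_units = 100

# Embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = top_words * embedding_dim

# Conv1D: kernel_size * input_channels weights per filter, plus one bias per filter
conv_params = kernel_size * embedding_dim * conv_filters + conv_filters

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (conv_filters + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units + 1

total_params = embedding_params + conv_params + lstm_params + dense_params
print(embedding_params, conv_params, lstm_params, dense_params, total_params)
# 160000 3104 53200 101 216405
```

MaxPooling1D has no weights, which is why it shows 0 parameters. Most of the model's capacity sits in the embedding table.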

### Train Model

The model is fit for only 3 epochs because it quickly overfits the problem. A large batch size of 64 reviews is used to space out weight updates.
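
To see how the batch size spaces out weight updates, here is a quick calculation of how many updates the call below performs, assuming one weight update per batch, the 25,000 training reviews loaded earlier, and the batch_size and epochs values passed to model.fit.

```python
import math

num_train_reviews = 25000
batch_size = 64
epochs = 3

# the last batch of each epoch is smaller (25000 is not a multiple of 64)
steps_per_epoch = math.ceil(num_train_reviews / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 391 1173
```

A smaller batch size would give more, noisier updates per epoch; a larger one would give fewer, smoother updates.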

In [20]:
#train model
model.fit(train_data, train_label, epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f3a25bce510>

Evaluate the model on unseen data (the test set).

In [21]:
# Final evaluation of the model
scores = model.evaluate(test_data, test_label, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.94%
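
The accuracy figure is the fraction of reviews whose predicted class matches the label. As a sketch of that calculation, the example below thresholds made-up sigmoid outputs at 0.5 (the cutoff assumed here for binary accuracy) and compares them with made-up labels; none of these numbers come from the trained model.

```python
import numpy as np

# made-up sigmoid outputs and true labels for four reviews
probabilities = np.array([0.92, 0.15, 0.48, 0.71])
labels = np.array([1, 0, 1, 1])

# probabilities above 0.5 are read as "positive" (1), the rest as "negative" (0)
predictions = (probabilities > 0.5).astype(int)
accuracy = float(np.mean(predictions == labels))
print(predictions.tolist(), accuracy)  # [1, 0, 0, 1] 0.75
```

With the real model, the probabilities would come from model.predict(test_data), and the same comparison against test_label should give approximately the reported accuracy.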
