# Sentiment Analysis of IMDB movie reviews

In this exercise, we will classify IMDB movie reviews as positive or negative. To do that, we will use a recurrent network with an LSTM module inside it.

## Imports

In [1]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
import numpy as np

Using TensorFlow backend.


## IMDB Movie Reviews Dataset
[The dataset we will use](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) comprises 50,000 movie reviews from IMDB (25,000 for training and 25,000 for testing), labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data.

A common approach in Natural Language Processing (NLP) is to use a dictionary/vocabulary to encode the words present in the text being processed. There are different ways of building a dictionary, but the goal is for it to contain the most significant words in the training data, on the assumption that they generalize well to the test set. **In this exercise, our dictionary will be composed of the 20,000 most frequent words in our training set.** Each word in a movie review will be encoded (transformed) into the integer associated with that word in our dictionary/vocabulary, indexed by frequency (the most frequent words receive the lowest integers).

For example, suppose our dictionary is: 

`{"movie":1, "actor": 2, "actress": 3, "cool":4, "bad":5, "action":6 ... "awesome": 100 ...}`

This associates the word `movie` (the most common word in the training set) with the number `1`, `bad` with the number `5`, and so on. Now suppose we have the following two reviews (ignoring words that are not in the vocabulary):

> **Review 1:** "movie awesome. Cool actor."
>
> **Review 2:** "movie bad. Awesome actor."

They will be encoded as:

> **Encode 1:** [1,100,4,2] 
>
> **Encode 2:** [1,5,100,2]
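
The encoding above can be reproduced with a few lines of Python. This is a minimal sketch using the toy dictionary from the example, not the real IMDB index; the names `toy_vocab` and `encode_review` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical toy vocabulary taken from the example above (not the real IMDB index).
toy_vocab = {"movie": 1, "actor": 2, "actress": 3, "cool": 4,
             "bad": 5, "action": 6, "awesome": 100}

def encode_review(text, vocab):
    """Lowercase the text, split it into words, and map known words to integers.

    Words that are not in the vocabulary are simply dropped, as in the example.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [vocab[w] for w in words if w in vocab]

encoded_1 = encode_review("movie awesome. Cool actor.", toy_vocab)
encoded_2 = encode_review("movie bad. Awesome actor.", toy_vocab)
print(encoded_1)  # [1, 100, 4, 2]
print(encoded_2)  # [1, 5, 100, 2]
```

Note that the two reviews share three of their four integers; only the order and one word differ, which is why a sequence model is a natural fit for this task.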




In [2]:
vocabulary_size = 20000 #The size of our vocabulary/dictionary is the 20k most frequent words

print('Loading data...')
(x_train, y_train), (x_val, y_val) = imdb.load_data(num_words = vocabulary_size)

print(len(x_train), 'train sequences: \t', sum(y_train == 1), " positives \t", sum(y_train == 0), " negatives")
print(len(x_val), 'val sequences: \t', sum(y_val == 1), " positives \t", sum(y_val == 0), " negatives")

Loading data...
25000 train sequences: 	 12500  positives 	 12500  negatives
25000 val sequences: 	 12500  positives 	 12500  negatives


Let's examine one review and its encoded array:

(Just change `review_idx` to another number to see other reviews.)

In [3]:
# Download the word -> index mapping. The indices in the loaded data are shifted by 3
# because 0, 1 and 2 are reserved for padding, start-of-review and unknown words.
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
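
To make the index shift and the role of `<UNK>` concrete, here is a small sketch of how a new piece of text could be encoded in the same way. It assumes the Keras defaults for `imdb.load_data` (`index_from=3`, `start_char=1`, `oov_char=2`); the toy index and the helper `encode_text` are hypothetical and stand in for the real downloaded mapping.

```python
# Hypothetical raw index standing in for imdb.get_word_index()
# (rank 1 = most frequent word).
toy_raw_index = {"the": 1, "film": 2, "great": 3, "obscure": 4}

INDEX_FROM = 3   # offset added to every raw index
START_ID = 1     # marks the beginning of a review
UNK_ID = 2       # replaces words outside the vocabulary (0 is kept for padding)

def encode_text(text, raw_index, num_words):
    """Encode text as a list of ids, mapping rare or unseen words to UNK_ID."""
    ids = [START_ID]
    for word in text.lower().split():
        word_id = raw_index.get(word)
        if word_id is not None:
            word_id += INDEX_FROM
        # Unseen words, and words whose shifted id falls outside the
        # vocabulary, become <UNK>.
        if word_id is None or word_id >= num_words:
            word_id = UNK_ID
        ids.append(word_id)
    return ids

encoded = encode_text("The film was great obscure", toy_raw_index, num_words=7)
print(encoded)  # [1, 4, 5, 2, 6, 2]
```

Here "was" is missing from the toy index and "obscure" falls outside the 7-id vocabulary, so both are encoded as `2`.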

In [4]:
review_idx = 0
print("REVIEW:\n", ' '.join(id_to_word[id] for id in x_train[review_idx]), "\n")
print("ENCODED:\n", x_train[review_idx],"\n")
print("CLASS (0 = negative, 1 = positive): ", y_train[review_idx])

REVIEW:
 <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and sh

Reviews have a variable number of words, so before processing them we will make sure they all have the same length. Reviews longer than 80 words will be truncated to 80 words, and reviews with fewer than 80 words will be padded with zeros to length 80.

Note that `pad_sequences` by default both truncates and pads at the beginning of the sequence, so a long review keeps its last 80 words.

In [5]:
maxlen = 80  # keep at most this many words per review (among the top vocabulary_size most common words)

print('Pad sequences (samples x timesteps/words)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = sequence.pad_sequences(x_val, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_val shape:', x_val.shape)

Pad sequences (samples x timesteps/words)
x_train shape: (25000, 80)
x_val shape: (25000, 80)
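
The effect of padding and truncation on a single sequence can be sketched in plain Python. The helper `pad_or_truncate` below is hypothetical and is not the Keras implementation; it only mimics what I understand to be the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) on one list.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly maxlen items.

    Long sequences keep their LAST maxlen items; short sequences are
    left-padded with pad_value.
    """
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]
    return [pad_value] * (maxlen - len(seq)) + seq

short_review = pad_or_truncate([1, 2, 3], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], maxlen=5)
print(short_review)  # [0, 0, 1, 2, 3]
print(long_review)   # [4, 5, 6, 7, 8]
```

Keeping the end of a review rather than its beginning is a design choice; passing `truncating='post'` to `pad_sequences` would keep the first words instead.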


## Model Definition
We will define a simple model composed of:
- [Embedding layer](https://keras.io/layers/embeddings/) mapping each word index in our vocabulary to a 128-dimensional feature vector;
- [LSTM layer](https://keras.io/layers/recurrent/#lstm) with 128 units and 40% dropout (both `dropout` and `recurrent_dropout`);
- [Dense layer](https://keras.io/layers/core/#dense) with 1 neuron (because it is a binary problem) and sigmoid activation.

Besides that, we will use the `Adam` optimizer and the `binary_crossentropy` loss.

In [6]:
model = Sequential()
model.add(Embedding(vocabulary_size, 128))
model.add(LSTM(128, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
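
As a rough check on the size of the model defined above, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard Keras parameterization: an embedding matrix with one 128-dimensional row per vocabulary entry, an LSTM with four gates that each have input weights, recurrent weights and a bias, and a dense layer with one weight per LSTM unit plus a bias. Calling `model.summary()` should report matching numbers.

```python
vocabulary_size = 20000
embedding_dim = 128
lstm_units = 128

# Embedding: one vector of size embedding_dim per word index.
embedding_params = vocabulary_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2560000 131584 129 2691713
```

Almost all of the parameters sit in the embedding layer, which is one reason the vocabulary size has a large influence on how easily this model overfits.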

## Training and evaluation
Now let's train our model and monitor the loss and accuracy on the validation data.

In [7]:
batch_size = 64
model.fit(x_train, y_train, 
          batch_size=batch_size, epochs=3, 
          validation_data=(x_val, y_val))

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f73cf63bef0>

You may want to train your model for more epochs, but you should be careful about overfitting. As we can see, our training loss keeps decreasing while the validation loss does not. Try adding more regularization to the model to deal with that.
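
One common guard when training for more epochs is early stopping: stop once the validation loss has not improved for a given number of epochs. The sketch below shows only the logic, in plain Python; the function `early_stopping_epoch` is hypothetical and the loss values are made up for illustration, not results from this model. Keras offers this behaviour through the `EarlyStopping` callback (with `monitor` and `patience` arguments), which can be passed to `model.fit` via `callbacks`.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has failed
    to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through every epoch.
    return len(val_losses) - 1, best_epoch

# Made-up validation losses, for illustration only.
illustrative_val_losses = [0.45, 0.38, 0.40, 0.44, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(illustrative_val_losses, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```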

Let's see some of the wrongly classified reviews:

In [10]:
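# predict() returns one sigmoid probability per review; threshold at 0.5 to get the class.
# (Older Keras versions offered model.predict_classes(), which later releases removed.)
predicted_classes = (model.predict(x_val, verbose=1) > 0.5).astype("int32").ravel()

# Check which items we got wrong
incorrect_indices = np.nonzero(predicted_classes != y_val)[0]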



In [11]:
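# Split the misclassified reviews by predicted class:
# posIdx = predicted positive but labeled negative (false positives)
# negIdx = predicted negative but labeled positive (false negatives)
posIdx = [idx for idx in incorrect_indices if predicted_classes[idx] == 1]
negIdx = [idx for idx in incorrect_indices if predicted_classes[idx] == 0]

# Show the first false positive, then the first false negative
idx = posIdx[0]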
print("CLASS = ", predicted_classes[idx])
print("REVIEW:\n", ' '.join(id_to_word[id] for id in x_val[idx]))

idx = negIdx[0]
print("\n\nCLASS = ", predicted_classes[idx])
print("REVIEW:\n", ' '.join(id_to_word[id] for id in x_val[idx]))

CLASS =  1
REVIEW:
 poor it keeps shaking all the time in a completely tasteless framing br br its really painful to see this very interesting film in a cinema you got quickly <UNK> and you have to make some huge effort not to puke on your neighbor 's seat br br it's really a shame <UNK> the story is edited in a non linear way which is quite rare and a very good idea for a documentary br br watch this at home


CLASS =  0
REVIEW:
 the <UNK> this sequence is shown repeatedly from various angles thus drawing out what probably was only a five second event br br <UNK> is a film that the revolutionary spirit celebrates it for those already committed and it for the <UNK> it <UNK> of fire and <UNK> with the senseless injustices of the decadent <UNK> regime its greatest impact has been on film students who have borrowed and only slightly improved on techniques invented in russia several generations ago
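
The split of the errors into `posIdx` and `negIdx` above corresponds to false positives and false negatives. The following self-contained sketch shows the same bookkeeping on a handful of made-up labels; the lists `toy_true` and `toy_pred` are hypothetical and are not outputs of the model.

```python
# Made-up labels and predictions, for illustration only.
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 1]

# Indices where the prediction disagrees with the label.
wrong = [i for i, (t, p) in enumerate(zip(toy_true, toy_pred)) if t != p]

# Predicted positive but actually negative.
false_positives = [i for i in wrong if toy_pred[i] == 1]
# Predicted negative but actually positive.
false_negatives = [i for i in wrong if toy_pred[i] == 0]

accuracy = 1 - len(wrong) / len(toy_true)
print(false_positives, false_negatives, accuracy)  # [1, 5] [2] 0.5
```

Comparing the sizes of the two lists on the real validation set gives a quick indication of whether the model leans toward predicting one class over the other.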
