# Lab assignment: analyzing movie reviews with Recurrent Neural Networks

<img src="img/cinemaReviews.png" style="width:600px;">

In this assignment we will analyze the sentiment, positive or negative, expressed in a set of movie reviews from IMDB. To do so we will make use of word embeddings and recurrent neural networks.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed.</td></tr>
 <tr><td><img src="img/exclamation.png" style="width:80px;height:80px;"></td><td>This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to them to understand the assignment.</td></tr>
 <tr><td><img src="img/pro.png" style="width:80px;height:80px;"></td><td>This is an advanced and voluntary exercise that can help you gain a deeper knowledge of the topic. Good luck!</td></tr>
</table>

During the assignment you will make use of several Python packages that might not be installed on your machine. If that is the case, you can install new Python packages with

    conda install PACKAGENAME
    
if you are using Python Anaconda. Else you should use

    pip install PACKAGENAME

You will need the following packages for this particular assignment. Make sure they are available before proceeding:

* **numpy**
* **keras**
* **matplotlib**

The following code will embed any plots into the notebook instead of generating a new window:

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
import pandas as pd

In [17]:
import numpy as np

Lastly, if you need any help on the usage of a Python function you can place the cursor on its name and press Shift+Tab to produce a pop-up with the related documentation. This will only work inside code cells.

Let's go!

## The Keras library

In this lab we will make use of the <a href=http://keras.io/>keras</a> Deep Learning library for Python. This library allows building several kinds of shallow and deep networks, following either a sequential or a graph architecture.

## Data loading

We will make use of a subset of the IMDB database of movie reviews. IMDB rates movies with a score ranging 0-10, but for simplicity we will consider a dataset of good and bad reviews, where a review is considered bad if its score is smaller than 4, and good if its score is larger than 7. The data is available under the *data* folder.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Load the data into two variables, a list **text** with each of the movie reviews and a list **y** of the class labels.
 </td></tr>
</table>

In [4]:
####### INSERT YOUR CODE HERE
ingesta = pd.read_csv("./data/data.csv", sep="\t")
print(type(ingesta))
print(ingesta.shape)

<class 'pandas.core.frame.DataFrame'>
(2500, 2)


In [12]:
ingesta.head()

Unnamed: 0,sentiment,text
0,0,I simply cant understand why all these relics ...
1,1,Director Raoul Walsh was like the Michael Bay ...
2,1,It could have been a better film. It does drag...
3,1,It is very hard to rate this film. As entertai...
4,1,I've read some terrible things about this film...


In [13]:
text = ingesta['text']
y = ingesta['sentiment']

In [20]:
print(type(text))

<class 'pandas.core.series.Series'>


For convenience in what follows we will also split the data into training and test subsets.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Split the list of texts into **texts_train** and **texts_test** lists, keeping 25% of the texts for test. Split in the same way the labels, obtaining lists **y_train** and **y_test**.
 </td></tr>
</table>

In [19]:
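# train_test_split lives in sklearn.model_selection (scikit-learn >= 0.18);
# the older sklearn.cross_validation module has been removed.
from sklearn.model_selection import train_test_split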



In [21]:
####### INSERT YOUR CODE HERE
texts_train, texts_test, y_train, y_test = train_test_split(text, y, test_size=0.25)

In [24]:
print(type(texts_train))
print(len(texts_train))
print(type(texts_test))
print(len(texts_test))

<class 'pandas.core.series.Series'>
1875
<class 'pandas.core.series.Series'>
625
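
The split above is random, so the exact reviews that end up in each subset change from run to run. As a rough illustration of what a shuffled hold-out split does, here is a minimal pure-Python sketch. The helper `split_train_test` and its `seed` argument are hypothetical names introduced only for this example; it is not how scikit-learn implements `train_test_split`.

```python
import random

def split_train_test(items, labels, test_fraction=0.25, seed=0):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Texts and labels are selected with the same indices, so they stay aligned.
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

reviews = ["review %d" % i for i in range(8)]
sentiments = [i % 2 for i in range(8)]
tr_x, te_x, tr_y, te_y = split_train_test(reviews, sentiments)
print(len(tr_x), len(te_x))  # 6 2
```

With scikit-learn itself, passing `random_state` to `train_test_split` should give the same kind of reproducibility.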


## Data processing

We can't feed text directly into the network, so we will have to transform it into a vector representation. To do so, we will first **tokenize** the text into words (or tokens), and assign a unique identifier to each word found in the text. Doing this will allow us to perform the encoding. We can do this easily by making use of the **Tokenizer** class in keras:

In [23]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


A Tokenizer offers convenient methods to split texts into tokens. At construction time we need to tell the Tokenizer the maximum number of different words we are willing to represent. If our texts have greater word variety than this number, the least frequent words will be discarded. We will choose a number large enough for our purpose.

In [25]:
maxwords = 1000
tokenizer = Tokenizer(nb_words = maxwords)
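
To make the idea of a size-limited vocabulary more concrete, the following pure-Python sketch counts words, ranks them by frequency and keeps only the most frequent ones. The function `build_word_index` is a hypothetical helper written for this example. It splits on whitespace only, whereas the real Tokenizer also filters punctuation. Reserving index 0 and keeping only indexes strictly below the limit are assumptions about how keras behaves, not something taken from its source.

```python
from collections import Counter

def build_word_index(texts, max_words):
    """Count words and assign indexes by decreasing frequency, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # Index 0 is left unused so that it can later serve as the padding symbol.
    word_index = {word: rank for rank, word in enumerate(ranked, start=1)}
    # Assumption: only words whose index is below max_words are kept.
    kept = {word: idx for word, idx in word_index.items() if idx < max_words}
    return counts, kept

toy_texts = ["a good film", "a bad film", "a good good cast"]
counts, kept = build_word_index(toy_texts, max_words=4)
print(counts["good"])  # 3
print(sorted(kept))    # ['a', 'film', 'good']
```

The rare words `bad` and `cast` fall outside the limit and are discarded, which is the same trade-off we accept when choosing `maxwords`.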

We now need to **fit** the Tokenizer to the training texts.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Find in the keras documentation the appropriate Tokenizer method to fit the tokenizer on a list of texts, then use it to fit the tokenizer on the training data.
 </td></tr>
</table>

In [26]:
####### INSERT YOUR CODE HERE
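# fit_on_texts updates the tokenizer in place and returns None,
# so the fitted state lives in `tokenizer`, not in the returned value.
texts_train_tokens = tokenizer.fit_on_texts(texts_train)
print(type(texts_train_tokens))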

<class 'NoneType'>


If done correctly, the following should show the number of times the tokenizer has found each word in the input texts.

In [27]:
tokenizer.word_counts

{'waving': 1,
 'switching': 1,
 'whatever': 54,
 'drill': 1,
 'compensated': 3,
 'choreographed': 8,
 'secrets': 9,
 'genders': 1,
 'cliché': 21,
 'doers': 1,
 'crude': 14,
 'ambitions': 6,
 "doesen't": 1,
 'model': 20,
 'basicaly': 1,
 'catdog': 1,
 "'burning": 1,
 "sleeve's": 2,
 'motivated': 2,
 'derivitive': 1,
 "'bad": 1,
 'repugnant': 3,
 'executives': 9,
 'arrives': 19,
 'integral': 1,
 'bubble': 5,
 'efficiency': 3,
 'articles': 1,
 'rabbit': 7,
 'say\x85': 1,
 'mutated': 1,
 "'beyond": 2,
 "'slight'": 1,
 'misanthropic': 1,
 'ambush': 1,
 'deliberately': 8,
 'psycho': 27,
 'wishes': 12,
 'antidote': 1,
 'imprezza': 1,
 'quebecois': 1,
 'bourgeoise': 1,
 'christophe': 1,
 'bewitching': 1,
 'thurman': 2,
 'lake': 20,
 'natural': 41,
 'saturates': 1,
 'deny': 13,
 'payments': 1,
 'skinheads': 1,
 'sullied': 1,
 'viewpoint': 1,
 'dinosaur': 5,
 'pace': 42,
 'stuntmen': 1,
 'utilize': 2,
 'specified': 1,
 'archaic': 3,
 'picky': 3,
 'paraphrase': 1,
 'gossiping': 1,
 'breakdance': 

Now that we have trained the tokenizer we can use it to vectorize the texts. In particular, we would like to transform the texts into sequences of word indexes.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Find in the keras documentation the appropriate Tokenizer method to transform a list of texts to a sequence. Apply it to both the training and test data to obtain matrices **X_train** and **X_test**.
 </td></tr>
</table>

In [50]:
####### INSERT YOUR CODE HERE
X_train = tokenizer.texts_to_sequences(texts_train)
X_test = tokenizer.texts_to_sequences(texts_test)
print(type(X_train))
print(len(X_train))
print(len(X_test))
print(len(X_train[0]),len(X_train[1]),len(X_train[2]))

<class 'list'>
1875
625
157 167 98
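
The three lengths printed above differ because each review has a different number of words, and because words outside the vocabulary do not produce an index at all. A minimal sketch of this encoding step, using a hypothetical `texts_to_index_sequences` helper and a made-up three-word vocabulary, could look like this:

```python
def texts_to_index_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [
        [word_index[word] for word in text.lower().split() if word in word_index]
        for text in texts
    ]

# Toy vocabulary, assumed for illustration only.
toy_index = {"a": 1, "good": 2, "film": 3}
encoded = texts_to_index_sequences(["A good film", "a dreadful film"], toy_index)
print(encoded)  # [[1, 2, 3], [1, 3]]
```

The second text loses the word `dreadful`, so its sequence is shorter than the text itself.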


In [39]:
X_train[0]

[9,
 280,
 2,
 52,
 323,
 273,
 4,
 438,
 5,
 86,
 2,
 18,
 12,
 6,
 37,
 3,
 17,
 86,
 9,
 8,
 2,
 92,
 12,
 75,
 167,
 9,
 490,
 3,
 1,
 4,
 8,
 29,
 4,
 24,
 39,
 46,
 3,
 329,
 91,
 142,
 12,
 6,
 8,
 2,
 183,
 12,
 42,
 456,
 11,
 19,
 197,
 5,
 1,
 475,
 4,
 3,
 400,
 3,
 290,
 7,
 7,
 192,
 119,
 1,
 1,
 19,
 280,
 20,
 2,
 265,
 45,
 46,
 62,
 6,
 2,
 31,
 29,
 3,
 9,
 16,
 1,
 4,
 8,
 4,
 2,
 3,
 336,
 2,
 128,
 16,
 44,
 265,
 537,
 17,
 45,
 21,
 66,
 1,
 15,
 1,
 2,
 95,
 1,
 9,
 632,
 53,
 2,
 4,
 12,
 924,
 21,
 745,
 11,
 6,
 400,
 38,
 43,
 1,
 8,
 324,
 21,
 263,
 1,
 12,
 11,
 19,
 81,
 978,
 9,
 81,
 102,
 26,
 23,
 55,
 30,
 1,
 419,
 4,
 1,
 183,
 34,
 955,
 41,
 1,
 389,
 21,
 213,
 23,
 35,
 9,
 17,
 9,
 6,
 429,
 142,
 255,
 140]

In [49]:
[print(X_train[elemento]) for elemento in range(3)]

[9, 280, 2, 52, 323, 273, 4, 438, 5, 86, 2, 18, 12, 6, 37, 3, 17, 86, 9, 8, 2, 92, 12, 75, 167, 9, 490, 3, 1, 4, 8, 29, 4, 24, 39, 46, 3, 329, 91, 142, 12, 6, 8, 2, 183, 12, 42, 456, 11, 19, 197, 5, 1, 475, 4, 3, 400, 3, 290, 7, 7, 192, 119, 1, 1, 19, 280, 20, 2, 265, 45, 46, 62, 6, 2, 31, 29, 3, 9, 16, 1, 4, 8, 4, 2, 3, 336, 2, 128, 16, 44, 265, 537, 17, 45, 21, 66, 1, 15, 1, 2, 95, 1, 9, 632, 53, 2, 4, 12, 924, 21, 745, 11, 6, 400, 38, 43, 1, 8, 324, 21, 263, 1, 12, 11, 19, 81, 978, 9, 81, 102, 26, 23, 55, 30, 1, 419, 4, 1, 183, 34, 955, 41, 1, 389, 21, 213, 23, 35, 9, 17, 9, 6, 429, 142, 255, 140]
[10, 11, 107, 32, 51, 10, 13, 2, 694, 10, 518, 332, 96, 71, 41, 9, 43, 2, 152, 159, 41, 1, 98, 15, 48, 262, 10, 9, 108, 62, 75, 9, 13, 20, 62, 62, 352, 8, 1, 53, 8, 10, 368, 95, 266, 165, 1, 15, 2, 202, 56, 259, 33, 359, 1, 78, 359, 59, 334, 11, 107, 13, 37, 565, 9, 13, 506, 43, 91, 5, 107, 351, 404, 3, 256, 2, 80, 56, 1, 107, 16, 929, 24, 54, 23, 139, 72, 99, 22, 976, 5, 112, 351, 404, 5,

[None, None, None]

This is enough to train a Sequential Network. However, for efficiency reasons it is recommended that all sequences in the data have the same number of elements. Since this is not the case for our data, we should **pad** the sequences to ensure the same length. The padding procedure adds a special *null* symbol to short sequences, and clips out parts of long sequences, thus enforcing a common size.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Find in the keras documentation the appropriate text preprocessing method to pad a sequence. Then pad all sequences to have a maximum of 300 words, both in the training and test data.
 </td></tr>
</table>

In [51]:
####### INSERT YOUR CODE HERE
from keras.preprocessing import sequence
max_length = 300
X_train_pad = sequence.pad_sequences(X_train, maxlen=max_length)
X_test_pad = sequence.pad_sequences(X_test, maxlen=max_length)

print(type(X_train_pad))
print(len(X_train_pad))
print(len(X_test_pad))
print(len(X_train_pad[0]),len(X_train_pad[1]),len(X_train_pad[2]))

<class 'numpy.ndarray'>
1875
625
300 300 300
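
As a rough picture of what padding does with its default settings, the sketch below pads short sequences with zeros at the front and, for long sequences, keeps only the last `maxlen` elements. The helper `pad_to_length` is a hypothetical name used for illustration; it mimics the default "pre" padding and truncation as we understand them, and is not the keras implementation.

```python
def pad_to_length(seq, maxlen, pad_value=0):
    """Left-pad with pad_value, or keep only the last maxlen items (maxlen > 0)."""
    clipped = seq[-maxlen:]
    return [pad_value] * (maxlen - len(clipped)) + clipped

short = pad_to_length([9, 280, 2], 5)
long_ = pad_to_length([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [0, 0, 9, 280, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

This matches the padded arrays printed in the notebook, where the zeros appear before the word indexes.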


In [52]:
[print(X_train_pad[elemento]) for elemento in range(3)]

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   9
 280   2  52 323 273   4 438   5  86   2  18  12   6  37   3  17  86   9
   8   2  92  12  75 167   9 490   3   1   4   8  29   4  24  39  46   3
 329  91 142  12   6   8   2 183  12  42 456  11  19 197   5   1 475   4
   3 400   3 290   7   7 192 119   1   1  19 280  20   2 265  45  46  62
   6   2  31  29   3   9  16   1   4   8   4   2   3 336   2 128  16  44
 265 537  17  45  21  66   1  15   1   2  95   1   

[None, None, None]

## Pure indexes model

We will first try to build a model based just on word indexes. Since keras expects sequential inputs as a 3-dimensional array with dimensions NUMBER_SEQUENCES x SEQUENCE_LENGTH x FEATURES, and we will be using only the indexes, our features dimension is one.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Create new variables **X_train_idx** and **X_test_idx**, reshaped versions of the padded **X_train** and **X_test**, in which each index has been transformed into a 1-element list.
 </td></tr>
</table>

In [53]:
####### INSERT YOUR CODE HERE
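# Add a trailing feature axis: (n_sequences, max_length) -> (n_sequences, max_length, 1)
X_train_idx = X_train_pad.reshape(len(X_train_pad), max_length, 1)
X_test_idx = X_test_pad.reshape(len(X_test_pad), max_length, 1)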
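
The reshape only adds a trailing axis of size one; no values change. The same transformation can be written with plain nested lists, which makes the "1-element list" wording of the exercise explicit. The helper `add_feature_axis` and the tiny `batch` below are assumptions made for this sketch.

```python
def add_feature_axis(padded):
    """Wrap every index in its own 1-element list: (N, L) -> (N, L, 1)."""
    return [[[idx] for idx in seq] for seq in padded]

batch = [[0, 9, 280], [0, 0, 2]]
shaped = add_feature_axis(batch)
print(shaped[0])                                       # [[0], [9], [280]]
print(len(shaped), len(shaped[0]), len(shaped[0][0]))  # 2 3 1
```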

Now we can train the model.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Build, compile and train a keras network with an LSTM layer of 32 units and dropout 0.9, followed by a Dense layer of 1 unit with sigmoid activation. Use the binary crossentropy loss function for training, together with the adam optimizer. Train for 10 epochs. After training, measure the accuracy on the test set.
 </td></tr>
</table>

In [54]:
####### INSERT YOUR CODE HERE
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers.core import Dense
from keras.layers.core import Activation
from keras.layers.core import Dropout

modelo = Sequential()
modelo.add(LSTM(32,input_shape =(300,1)))
modelo.add(Dropout(0.9))
modelo.add(Dense(1))
modelo.add(Activation('sigmoid'))

In [55]:
modelo.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lstm_1 (LSTM)                    (None, 32)            4352        lstm_input_1[0][0]               
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 32)            0           lstm_1[0][0]                     
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             33          dropout_1[0][0]                  
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 1)             0           dense_1[0][0]                    
Total params: 4,385
Trainable params: 4,385
Non-trainable params: 0
_______________________
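
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four weight blocks (input, forget and output gates, plus the cell candidate), each with input weights, recurrent weights and a bias. The sketch below, with hypothetical helper names, reproduces the numbers reported by `summary()` under that assumption.

```python
def lstm_params(input_dim, units):
    """Four blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair, plus one bias per output unit."""
    return input_dim * units + units

print(lstm_params(1, 32))                       # 4352
print(dense_params(32, 1))                      # 33
print(lstm_params(1, 32) + dense_params(32, 1)) # 4385
```

The same formula gives 12416 for an LSTM of 32 units fed with 64-dimensional inputs, and 8320 when the inputs have 32 dimensions.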

In [56]:
modelo.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])

In [57]:
modelo.fit(
    X_train_idx,
    y_train,
    batch_size=64,
    nb_epoch=10,
    verbose=2)

Epoch 1/10
54s - loss: 0.9978 - acc: 0.4832
Epoch 2/10
47s - loss: 0.9294 - acc: 0.5083
Epoch 3/10
49s - loss: 0.9189 - acc: 0.4795
Epoch 4/10
47s - loss: 0.8506 - acc: 0.4965
Epoch 5/10
47s - loss: 0.8273 - acc: 0.5195
Epoch 6/10
46s - loss: 0.8005 - acc: 0.5131
Epoch 7/10
46s - loss: 0.7451 - acc: 0.4896
Epoch 8/10
47s - loss: 0.7393 - acc: 0.5061
Epoch 9/10
46s - loss: 0.7386 - acc: 0.4976
Epoch 10/10
46s - loss: 0.7245 - acc: 0.4981


<keras.callbacks.History at 0x1042e780>

In [58]:
score = modelo.evaluate(X_test_idx, y_test)
print("")
print("Test loss", score[0])
print("Test accuracy", score[1])


Test loss 0.692267370224
Test accuracy 0.505600000572
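
A test accuracy close to 0.5 together with a test loss close to 0.693 suggests that the network is doing little better than guessing. The reason is that a classifier that always predicts a probability of 0.5 obtains a binary crossentropy of ln 2, whatever the labels are. The short check below uses a hand-written `binary_crossentropy` function and made-up labels, both introduced only for this illustration.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, y_pred)
    )
    return -total / len(y_true)

toy_labels = [0, 1, 1, 0]
chance_loss = binary_crossentropy(toy_labels, [0.5] * len(toy_labels))
print(round(chance_loss, 4))  # 0.6931
```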


## Learning an embedding

Using raw indexes as a representation of words is a very poor approach, since the numeric value of an index says nothing about the meaning of the word. We can easily improve on that by using an **Embedding** layer at the very beginning of the network. This layer transforms word indexes into vector representations that are learned jointly with the rest of the network weights.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Create a new network similar to the previous one, but adding an Embedding as the first layer of the network. Configure the Embedding layer to produce a vector representation of 64 elements. Then train the network with settings similar to the previous one. Has the test accuracy improved?
 </td></tr>
</table>

<table>
 <tr><td><img src="img/exclamation.png" style="width:80px;height:80px;"></td><td>
The Embedding layer accepts lists of indexes as inputs, so you don't need to use the **X_train_idx** representation you created for the previous network.
 </td></tr>
</table>
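
Conceptually, an embedding layer is a lookup table with one row per word index. The pure-Python sketch below illustrates the lookup only: the table is filled with random values and is never trained, and the names `make_embedding_table` and `embed` are hypothetical. In the real layer the rows are weights updated during training.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One row of `dim` small random values per word index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sequence, table):
    """Replace every index in the sequence by its row in the table."""
    return [table[idx] for idx in sequence]

table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([0, 9, 2], table)
print(len(vectors), len(vectors[0]))  # 3 4
```

A table with 1000 rows and 64 columns holds 1000 x 64 = 64000 values, which is the parameter count to expect for an embedding of our vocabulary size into 64 dimensions.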

In [59]:
####### INSERT YOUR CODE HERE
from keras.layers.embeddings import Embedding
embedding_vector =64

model_emb = Sequential()

model_emb.add(Embedding(maxwords,embedding_vector, input_length=max_length))
model_emb.add(LSTM(32))
model_emb.add(Dropout(0.9))
model_emb.add(Dense(1))
model_emb.add(Activation('sigmoid'))

model_emb.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 300, 64)       64000       embedding_input_1[0][0]          
____________________________________________________________________________________________________
lstm_2 (LSTM)                    (None, 32)            12416       embedding_1[0][0]                
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 32)            0           lstm_2[0][0]                     
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 1)             33          dropout_2[0][0]                  
___________________________________________________________________________________________

In [61]:
model_emb.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])
model_emb.fit(
    X_train_pad,
    y_train,
    batch_size=64,
    nb_epoch=10,
    verbose=2)

Epoch 1/10
66s - loss: 0.6936 - acc: 0.4965
Epoch 2/10
59s - loss: 0.6832 - acc: 0.5595
Epoch 3/10
56s - loss: 0.6493 - acc: 0.6469
Epoch 4/10
55s - loss: 0.5514 - acc: 0.7541
Epoch 5/10
57s - loss: 0.4923 - acc: 0.7925
Epoch 6/10
57s - loss: 0.4679 - acc: 0.8171
Epoch 7/10
57s - loss: 0.4398 - acc: 0.8501
Epoch 8/10
56s - loss: 0.3817 - acc: 0.8571
Epoch 9/10
56s - loss: 0.3627 - acc: 0.8715
Epoch 10/10
56s - loss: 0.3327 - acc: 0.8864


<keras.callbacks.History at 0x12c57a58>

In [63]:
score_emb = model_emb.evaluate(X_test_pad, y_test)
print("")
print("Test loss", score_emb[0])
print("Test accuracy", score_emb[1])


Test loss 0.482038615704
Test accuracy 0.777600000286


<span style="color:blue">The test *accuracy* improves enormously with the embedding layer.</span>

## Stacked LSTMs

Much like other neural layers, LSTM layers can be stacked on top of each other to produce more complex models. Care must be taken, however, that the LSTM layers before the last one generate a whole sequence of outputs for the following LSTM to process.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Repeat the training of the previous network, but using 2 LSTM layers. Make sure to configure the first LSTM layer in a way that it outputs a whole sequence for the next layer.
 </td></tr>
</table>
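
The difference between returning a whole sequence and returning only the last output can be seen with a toy recurrent layer. The sketch below uses a single scalar state updated with `tanh`, which is far simpler than an LSTM; the function `toy_recurrent_layer` and its weights `w` and `u` are assumptions made for illustration.

```python
import math

def toy_recurrent_layer(inputs, return_sequences=False, w=0.5, u=0.8):
    """Scalar recurrence: state_t = tanh(w * x_t + u * state_{t-1})."""
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w * x + u * state)
        states.append(state)
    # Either one output per time step, or only the final one.
    return states if return_sequences else states[-1]

seq = [1.0, 0.0, -1.0, 2.0]
all_states = toy_recurrent_layer(seq, return_sequences=True)
last_state = toy_recurrent_layer(seq)
print(len(all_states))               # 4
print(all_states[-1] == last_state)  # True

# Stacking: the second layer consumes the full sequence produced by the first.
stacked_output = toy_recurrent_layer(all_states)
```

If the first layer returned only its last state, the second layer would receive a single value instead of a sequence, which is why the first LSTM needs to output one vector per time step.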

In [64]:
####### INSERT YOUR CODE HERE
modelo_2lstm = Sequential()

modelo_2lstm.add(Embedding(maxwords,embedding_vector, input_length=max_length))
modelo_2lstm.add(LSTM(32, return_sequences =True))
modelo_2lstm.add(Dropout(0.9))
modelo_2lstm.add(LSTM(32))
modelo_2lstm.add(Dropout(0.9))
modelo_2lstm.add(Dense(1))
modelo_2lstm.add(Activation('sigmoid'))

modelo_2lstm.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_2 (Embedding)          (None, 300, 64)       64000       embedding_input_2[0][0]          
____________________________________________________________________________________________________
lstm_3 (LSTM)                    (None, 300, 32)       12416       embedding_2[0][0]                
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 300, 32)       0           lstm_3[0][0]                     
____________________________________________________________________________________________________
lstm_4 (LSTM)                    (None, 32)            8320        dropout_3[0][0]                  
___________________________________________________________________________________________

In [65]:
modelo_2lstm.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])
modelo_2lstm.fit(
    X_train_pad,
    y_train,
    batch_size=64,
    nb_epoch=10,
    verbose=2)

Epoch 1/10
128s - loss: 0.6962 - acc: 0.5003
Epoch 2/10
111s - loss: 0.6923 - acc: 0.5349
Epoch 3/10
110s - loss: 0.6862 - acc: 0.5381
Epoch 4/10
113s - loss: 0.6583 - acc: 0.5973
Epoch 5/10
111s - loss: 0.5715 - acc: 0.7211
Epoch 6/10
111s - loss: 0.4800 - acc: 0.7904
Epoch 7/10
111s - loss: 0.4210 - acc: 0.8373
Epoch 8/10
116s - loss: 0.3644 - acc: 0.8683
Epoch 9/10
112s - loss: 0.3107 - acc: 0.8949
Epoch 10/10
113s - loss: 0.2883 - acc: 0.9104


<keras.callbacks.History at 0x1b659390>

In [66]:
score_2lstm = modelo_2lstm.evaluate(X_test_pad, y_test)
print("")
print("Test loss", score_2lstm[0])
print("Test accuracy", score_2lstm[1])


Test loss 0.722634818172
Test accuracy 0.724800000668


<span style="color:blue">The test *accuracy* does not improve when adding another layer. In fact, it gets noticeably worse.</span>