# Lab assignment: analyzing movie reviews with Recurrent Neural Networks

<img src="img/cinemaReviews.png" style="width:600px;">

In this assignment we will analyze the sentiment, positive or negative, expressed in a set of movie reviews from IMDB. To do so we will make use of word embeddings and recurrent neural networks.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed.</td></tr>
 <tr><td><img src="img/exclamation.png" style="width:80px;height:80px;"></td><td>This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to them to understand the assignment.</td></tr>
 <tr><td><img src="img/pro.png" style="width:80px;height:80px;"></td><td>This is an advanced and voluntary exercise that can help you gain a deeper knowledge of the topic. Good luck!</td></tr>
</table>

During the assignment you will make use of several Python packages that might not be installed on your machine. If that is the case, you can install new Python packages with

    conda install PACKAGENAME
    
if you are using Python Anaconda. Otherwise, you should use

    pip install PACKAGENAME

You will need the following packages for this particular assignment. Make sure they are available before proceeding:

* **numpy**
* **keras**
* **matplotlib**

The following code will embed any plots into the notebook instead of generating a new window:

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Shift+Tab to produce a pop-up with related documentation. This will only work inside code cells.

Let's go!

## The Keras library

In this lab we will make use of the <a href=http://keras.io/>keras</a> Deep Learning library for Python. This library allows building several kinds of shallow and deep networks, following either a sequential or a graph architecture.

## Data loading

We will make use of a part of the IMDB database of movie reviews. IMDB rates movies with a score ranging from 0 to 10, but for simplicity we will consider a dataset of good and bad reviews: a review is considered bad if its score is smaller than 4, and good if its score is larger than 7. The data is available under the *data* folder.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Load the data into two variables, a list **text** with each of the movie reviews and a list **y** of the class labels.
 </td></tr>
</table>

In [3]:
import pandas as pd
import numpy as np
dataset=pd.read_csv('./data/data.csv',sep='\t')



For convenience in what follows we will also split the data into training and test subsets.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Split the list of texts into **texts_train** and **texts_test** lists, keeping 25% of the texts for test. Split in the same way the labels, obtaining lists **y_train** and **y_test**.
 </td></tr>
</table>

In [4]:
from sklearn.model_selection import train_test_split

datos_spliteados = train_test_split(dataset,
                                    train_size=0.75, # 75% training
                                    test_size=0.25   # 25% testing
                                   )
texts_train_df=datos_spliteados[0]
texts_test_df=datos_spliteados[1]

texts_train=texts_train_df['text'].tolist()
texts_test=texts_test_df['text'].tolist()

y_train=np.array(texts_train_df['sentiment'].tolist())
y_test=np.array(texts_test_df['sentiment'].tolist())


## Data processing

We can't feed text directly into the network, so we will have to transform it to a vector representation. To do so, we will first **tokenize** the text into words (or tokens), and assign a unique identifier to each word found in the text. Doing this will allow us to perform the encoding. We can do this easily by making use of the **Tokenizer** class in keras:

In [5]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


A Tokenizer offers convenient methods to split texts down to tokens. At construction time we need to supply the Tokenizer with the maximum number of different words we are willing to represent. If our texts have greater word variety than this number, the least frequent words will be discarded. We will choose a number large enough for our purpose.

In [6]:
maxwords = 1000
tokenizer = Tokenizer(num_words=maxwords)  # `nb_words` was renamed to `num_words` in Keras 2
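
To make the effect of the word limit more concrete, here is a minimal pure-Python sketch of the idea. It is not the keras implementation: the helper names `build_word_index` and `encode`, the toy texts, and the whitespace-only splitting are assumptions made for illustration. Words are ranked by frequency, indexed starting from 1, and in this sketch only indexes below `max_words` survive the encoding.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, breaking ties alphabetically for determinism
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(text, word_index, max_words):
    """Replace words by their indexes, dropping unknown words and
    words whose index is >= max_words (the least frequent ones)."""
    indexes = (word_index.get(w) for w in text.lower().split())
    return [i for i in indexes if i is not None and i < max_words]

# Hypothetical toy corpus
toy_texts = ["a good movie", "a bad movie", "a good plot"]
word_index = build_word_index(toy_texts)

# "plot" is among the rarest words, so it is discarded with max_words=4
encoded = encode("a good plot", word_index, max_words=4)
print(encoded)  # [1, 2]
```

With a larger `max_words` the rare word would be kept, at the cost of a bigger vocabulary for the network to handle.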

We now need to **fit** the Tokenizer to the training texts.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Find in the keras documentation the appropriate Tokenizer method to fit the tokenizer on a list of text, then use it to fit it on the training data.
 </td></tr>
</table>

In [7]:
tokenizer.fit_on_texts(texts_train)

If done correctly, the following should show the number of times the tokenizer has found each word in the input texts.

In [8]:
tokenizer.word_counts

{'rayburn': 1,
 'assistant': 15,
 'neeson': 3,
 'retrouvé': 1,
 'guy': 239,
 'strait': 2,
 'diversity': 6,
 'depraved': 2,
 'minutes\x85': 1,
 'sens': 1,
 'ugc': 1,
 'says': 81,
 'sleaziest': 1,
 'well': 842,
 'thrilled': 4,
 'pejorative': 1,
 'blasters': 1,
 'trendier': 1,
 "p'tite": 1,
 "'nam": 2,
 'rely': 3,
 'justice': 37,
 'demonstrate': 2,
 'dodge': 6,
 'framing': 2,
 'consideration': 4,
 'superhuman': 2,
 '1984': 5,
 'suave': 4,
 'pen': 2,
 'awhile': 8,
 'fussy': 3,
 'honorary': 1,
 "kudos'": 1,
 'colorado': 1,
 'bloodthirsty': 2,
 'forward': 55,
 "something's": 1,
 'oversight': 4,
 'miserable': 9,
 'tutee': 1,
 'mike': 23,
 'keanu': 4,
 "cliché's": 3,
 'away': 246,
 'fiendish': 1,
 'allergic': 1,
 'manifested': 1,
 'densely': 1,
 'golan': 1,
 'reductionism': 1,
 'multiculturalism': 1,
 'hour': 95,
 'titillate': 1,
 'calamari': 1,
 'dreadfully': 2,
 'raped': 12,
 'infiltration': 1,
 'hide': 21,
 'transient': 1,
 'telescope': 1,
 'whorehouse': 1,
 'venturing': 2,
 'bastards': 2,


Now that we have trained the tokenizer we can use it to vectorize the texts. In particular, we would like to transform the texts to sequences of word indexes.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Find in the keras documentation the appropriate Tokenizer method to transform a list of texts to a sequence. Apply it to both the training and test data to obtain matrices **X_train** and **X_test**.
 </td></tr>
</table>

In [9]:
# Sequences have different lengths, so keep them as plain lists of lists
X_train = tokenizer.texts_to_sequences(texts_train)
X_test = tokenizer.texts_to_sequences(texts_test)


We can see now how a text has been transformed to a list of word indexes.

In [10]:
X_train[0]

[84,
 65,
 35,
 34,
 684,
 14,
 8,
 1,
 88,
 829,
 5,
 14,
 8,
 1,
 334,
 829,
 123,
 124,
 35,
 81,
 12,
 16,
 42,
 73,
 152,
 50,
 3,
 33,
 5,
 35,
 121,
 4,
 9,
 73,
 1,
 88,
 829,
 9,
 13,
 684,
 14,
 2,
 12,
 28,
 4,
 1,
 158,
 10,
 447,
 5,
 61,
 8,
 1,
 334,
 829,
 7,
 7,
 10,
 315,
 11,
 17,
 3,
 630,
 39,
 4,
 154,
 2,
 10,
 315,
 9,
 3,
 305,
 39,
 4,
 154,
 46,
 1,
 63,
 27,
 236,
 5,
 81,
 15,
 107,
 18,
 9,
 124,
 2,
 12,
 123,
 10,
 67,
 315,
 11,
 17,
 3,
 359,
 39,
 154,
 7,
 7,
 10,
 12,
 271,
 36,
 6,
 20,
 140,
 11,
 10,
 83,
 2,
 104,
 1,
 88,
 829,
 308]

This is enough to train a Sequential Network. However, for efficiency reasons it is recommended that all sequences in the data have the same number of elements. Since this is not the case for our data, we should **pad** the sequences to ensure the same length. The padding procedure adds a special *null* symbol to short sequences and clips out parts of long sequences, thus enforcing a common size.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Find in the keras documentation the appropriate text preprocessing method to pad a sequence. Then pad all sequences to have a maximum of 300 words, both in the training and test data.
 </td></tr>
</table>

In [30]:
from keras.preprocessing.sequence import pad_sequences

X_train_pad = pad_sequences(X_train, maxlen=300)
X_test_pad = pad_sequences(X_test, maxlen=300)
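
As a rough illustration of what padding does, the following pure-Python sketch pads at the front with zeros and clips from the front, which to our knowledge matches the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`). The helper name `pad_sequence` and the toy sequences are hypothetical.

```python
def pad_sequence(seq, maxlen, value=0):
    """Return a list of exactly `maxlen` items (maxlen > 0 assumed).

    Short sequences are left-padded with `value`; long sequences keep
    only their last `maxlen` items.
    """
    clipped = seq[-maxlen:]
    return [value] * (maxlen - len(clipped)) + clipped

short_padded = pad_sequence([5, 8, 2], maxlen=5)
long_clipped = pad_sequence([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_padded)  # [0, 0, 5, 8, 2]
print(long_clipped)  # [3, 4, 5, 6, 7]
```

Padding at the front keeps the actual words close to the end of the sequence, which is where a recurrent layer produces its final output.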


## Pure indexes model

We will first try to build a model based just on word indexes. Since keras expects sequential inputs as a 3-dimensional array with dimensions NUMBER_SEQUENCES x SEQUENCE_LENGTH x FEATURES and we will be using only the indexes, our features dimension is one.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Create new variables **X_train_idx** and **X_test_idx**, reshaped versions of the padded **X_train** and **X_test**, in which each index has been transformed into a 1-element list.
 </td></tr>
</table>

In [31]:
X_train_idx = np.reshape(X_train_pad, (len(X_train_pad),300,1))
X_test_idx = np.reshape(X_test_pad, (len(X_test_pad),300,1))

Now we can train the model.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Build, compile and train a keras network with an LSTM layer of 32 units and dropout 0.9, followed by a Dense layer of 1 unit with sigmoid activation. Use the binary crossentropy loss function for training, together with the adam optimizer. Train for 10 epochs. After training, measure the accuracy on the test set.
 </td></tr>
</table>

In [23]:
from keras.models import Sequential
from keras.layers import Dense, Activation,LSTM,Dropout

model = Sequential()
model.add(LSTM(32, input_shape=(X_train_idx.shape[1], X_train_idx.shape[2])))
model.add(Dropout(0.9))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train_idx, y_train, batch_size=128, epochs=10)  # `nb_epoch` was renamed to `epochs` in Keras 2
score, acc = model.evaluate(X_test_idx, y_test,batch_size=128)


print('Test score:', score)
print('Test accuracy:', acc)

Train...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.695407935905
Test accuracy: 0.484799999666
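
To put these numbers in context, a model that ignores its input and always predicts a probability of 0.5 obtains a binary cross-entropy of ln 2 ≈ 0.693, whatever the labels are. The short check below computes this value; the helper name `binary_crossentropy` and the toy labels are hypothetical. The loss above is very close to that baseline, and the accuracy is close to 50%, which is what random guessing would give if the two classes are roughly balanced (an assumption we have not verified here).

```python
import math

def binary_crossentropy(y_true, p_pred):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, p_pred)]
    return sum(losses) / len(losses)

# Hypothetical labels; the result does not depend on them when p = 0.5
labels = [1, 0, 1, 1, 0]
constant_loss = binary_crossentropy(labels, [0.5] * len(labels))

print(round(constant_loss, 4))  # 0.6931
```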


## Learning an embedding

Using indexes as a representation of words is a very poor approach, since the numeric value of an index carries no information about the meaning of the word. We can easily improve over that by using an **Embedding** layer at the very beginning of the network. This layer will transform word indexes to a vector representation that is learned together with the rest of the network weights.

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Create a new network similar to the previous one, but adding an Embedding as the first layer of the network. Configure the Embedding layer to produce a vector representation of 64 elements. Then train the network with settings similar to the previous one. Has the test accuracy improved?
 </td></tr>
</table>

<table>
 <tr><td><img src="img/exclamation.png" style="width:80px;height:80px;"></td><td>
The Embedding layer accepts lists of indexes as inputs, so you don't need to use the **X_train_idx** representation you created for the previous network.
 </td></tr>
</table>
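
Conceptually, an embedding is a lookup table with one row of weights per word index. The NumPy sketch below illustrates only the lookup step, with random weights standing in for the learned ones; the sizes and variable names are toy values chosen for illustration, not those used in the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4   # toy sizes, for illustration only

# One row per word index; in a real layer these weights are learned
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 padded sequences with 3 word indexes each
batch = np.array([[0, 3, 7],
                  [0, 0, 2]])

# Indexing with an integer array looks up one row per index
embedded = embedding_matrix[batch]
print(embedded.shape)  # (2, 3, 4)
```

The result already has the NUMBER_SEQUENCES x SEQUENCE_LENGTH x FEATURES layout that a recurrent layer expects, which is why no manual reshaping is needed.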

In [27]:
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(20000,64))
model.add(LSTM(32))
model.add(Dropout(0.9))
model.add(Dense(1))
model.add(Activation('sigmoid'))


model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train_pad, y_train, batch_size=128, epochs=10)  # `nb_epoch` was renamed to `epochs` in Keras 2
score, acc = model.evaluate(X_test_pad, y_test,batch_size=128)


print('Test score:', score)
print('Test accuracy:', acc)

Train...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.471828553772
Test accuracy: 0.78399999876


## Stacked LSTMs

Much like other neural layers, LSTM layers can be stacked on top of each other to produce more complex models. Care must be taken, however, that the LSTM layers before the last one generate a whole sequence of outputs for the following LSTM to process.
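
The difference between returning a whole sequence and returning only the last output can be sketched with a plain tanh recurrence in NumPy. This is a deliberately simplified stand-in, not an LSTM and not the keras implementation; the function `simple_rnn`, the random weights and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, return_sequences=False):
    """Plain tanh recurrence over inputs of shape (timesteps, features)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    # Either every hidden state (one per timestep) or only the final one
    return np.stack(states) if return_sequences else h

rng = np.random.default_rng(0)
timesteps, features, units = 5, 3, 4   # toy sizes
x = rng.normal(size=(timesteps, features))
W_x1, W_h1 = rng.normal(size=(features, units)), rng.normal(size=(units, units))
W_x2, W_h2 = rng.normal(size=(units, units)), rng.normal(size=(units, units))

# The first layer must emit one output per timestep...
first = simple_rnn(x, W_x1, W_h1, return_sequences=True)
# ...so that the second layer has a sequence to iterate over
second = simple_rnn(first, W_x2, W_h2)

print(first.shape, second.shape)  # (5, 4) (4,)
```

If the first layer returned only its final state, the second layer would receive a single vector instead of a sequence and would have nothing to recur over.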

<table>
 <tr><td><img src="img/question.png" style="width:80px;height:80px;"></td><td>
Repeat the training of the previous network, but using 2 LSTM layers. Make sure to configure the first LSTM layer in a way that it outputs a whole sequence for the next layer.
 </td></tr>
</table>

In [29]:
model = Sequential()
model.add(Embedding(20000,64))
model.add(LSTM(32,return_sequences=True))
model.add(LSTM(32))
model.add(Dropout(0.9))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

print('Train...')
model.fit(X_train_pad, y_train, batch_size=128, epochs=10)  # `nb_epoch` was renamed to `epochs` in Keras 2
score, acc = model.evaluate(X_test_pad, y_test,
                            batch_size=128)


print('Test score:', score)
print('Test accuracy:', acc)

Train...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.522992933273
Test accuracy: 0.775999997425
