# Sentiment analysis with LSTM on IMDB review dataset

In this lab, you will work on the dataset of IMDB movie reviews available from the [Stanford Page](http://ai.stanford.edu/~amaas/data/sentiment/aclimdb.tar.gz).

This dataset is already preprocessed in keras, one of the simplest ways to build neural networks with Python.
Check it out!

The goal here is to perform the same task as before, but using a special type of Recurrent Neural Network called a Long Short-Term Memory (LSTM) network.

In [2]:
import glob
import logging
import os.path
logging.getLogger().setLevel(logging.INFO)

import debug_utils

In [4]:
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(42)

# Load the data

This time, we will use the keras loader directly.

In [17]:
num_words = 5000 # only keep the 5000 most frequent words
index_from = 3   # word index offset (ids 0, 1, 2 are reserved)

(X_train, y_train), (X_test, y_test) = imdb.load_data(
    num_words=num_words, index_from=index_from)

# get_word_index() returns raw frequency ranks, so apply the same offset
word_to_id = imdb.get_word_index()
word_to_id = {k:(v + index_from) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value:key for key,value in word_to_id.items()}
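The offset logic above can be illustrated on a toy vocabulary. This is a minimal sketch, not the real IMDB index: `raw_index`, `encode` and `decode` are hypothetical names introduced here, and the three-word dictionary stands in for what `imdb.get_word_index()` returns (word to frequency rank, starting at 1).

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
raw_index = {"the": 1, "movie": 2, "great": 3}
index_from = 3  # same offset as in the loader call

# Shift every rank by the offset, then reserve the three special ids.
word_to_id = {word: rank + index_from for word, rank in raw_index.items()}
word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
id_to_word = {i: w for w, i in word_to_id.items()}


def encode(text):
    """Turn a sentence into ids, starting with <START>; unknown words become <UNK>."""
    unk = word_to_id["<UNK>"]
    return [word_to_id["<START>"]] + [word_to_id.get(w, unk) for w in text.lower().split()]


def decode(ids):
    """Turn ids back into a readable string."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in ids)


encoded = encode("the movie was great")
decoded = decode(encoded)
print(encoded)
print(decoded)
```

Note that with an offset of 3 the most frequent word receives id 4, so id 3 is never produced by this mapping.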

Make sure we can read the data by converting a sequence of integers back into a sequence of words.

In [54]:
def debugEntry(idx):
    print("Label: ", y_train[idx])
    print(' '.join(id_to_word[i] for i in X_train[idx]))
    
debugEntry(120)

Label:  1
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

# Prepare the data

We want every input sequence to have the same length (500), so longer reviews are truncated and shorter ones are padded.

In [18]:
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
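To make the effect of this step concrete, here is a pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, pad value 0). The helper `pad_or_truncate` is a hypothetical name used only for illustration; it is not part of keras.

```python
def pad_or_truncate(seq, maxlen, pad_id=0):
    """Mimic the default pad_sequences behaviour for a single sequence."""
    # Truncation happens at the front: only the last `maxlen` tokens are kept.
    kept = list(seq)[-maxlen:]
    # Padding also happens at the front, so real tokens sit at the end.
    return [pad_id] * (maxlen - len(kept)) + kept


short_review = pad_or_truncate([11, 12, 13], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)
print(long_review)
```

Because padding is added at the front, a short review decoded after this step begins with a long run of `<PAD>` tokens.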

# Create a Network: Embedding + LSTM + Classifier

In [None]:
embedding_vector_length = 32
memory_units = 100

model = Sequential()
# !!! Implement here !!!
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)
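One possible way to fill in the placeholder is sketched below, assuming a recent Keras release in which `Input`, `Embedding`, `LSTM` and `Dense` can all be imported from `keras.layers`. The names `build_lstm_classifier` and `count_parameters` are hypothetical helpers introduced for this sketch, and the defaults simply mirror the values used in this lab (5000 words, 32-dimensional embeddings, 100 memory units, reviews of length 500). Keras is imported inside the builder so that the parameter arithmetic can be run on its own.

```python
def count_parameters(vocab_size=5000, embedding_dim=32, memory_units=100):
    """Trainable parameters of Embedding -> LSTM -> Dense(1), layer by layer."""
    embedding = vocab_size * embedding_dim
    # An LSTM has 4 gates, each with input weights, recurrent weights and a bias.
    lstm = 4 * (memory_units * (embedding_dim + memory_units) + memory_units)
    dense = memory_units + 1  # one output unit plus its bias
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}


def build_lstm_classifier(vocab_size=5000, embedding_dim=32,
                          memory_units=100, max_len=500):
    """Sketch of the Embedding + LSTM + classifier stack (assumes a recent Keras)."""
    from keras.layers import Dense, Embedding, Input, LSTM
    from keras.models import Sequential

    model = Sequential([
        Input(shape=(max_len,)),               # padded sequence of word ids
        Embedding(vocab_size, embedding_dim),  # word id -> dense vector
        LSTM(memory_units),                    # final hidden state summarizes the review
        Dense(1, activation="sigmoid"),        # probability of a positive review
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


params = count_parameters()
print(params)
```

Calling `build_lstm_classifier().summary()` should report the same per-layer counts as `count_parameters()`, which is a quick way to check that the layers were wired as intended.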

## Evaluate the model!

In [1]:
# Final evaluation of the model
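Later cells in this lab call `model.evaluate(X_test, y_test, verbose=0)` and read the accuracy from `scores[1]`. As a rough illustration of what those two numbers mean, the sketch below computes binary cross-entropy and accuracy by hand on made-up predictions. The function names and the toy values are assumptions for illustration only; they are not results from the model.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels and predicted probabilities, purely for illustration.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]

loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print("Loss: {0:.4f}".format(loss))
print("Accuracy: {0:.2f}%".format(acc * 100))
```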

# Use the network on new data

In [45]:
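def getSentiment(text, model, word_to_id):
    # Map each word to its id; unknown words and words outside the
    # num_words vocabulary become <UNK> (id 2), as in the training data.
    ids = (word_to_id.get(w.lower().strip(), 2) for w in text.split())
    seq = [i if i < num_words else 2 for i in ids]
    # pad_sequences returns an array of shape (1, max_review_length)
    seq = sequence.pad_sequences([seq], maxlen=max_review_length)
    return model.predict(seq)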

getSentiment("i love it best ttr? best movie so good", model, word_to_id)

array([[ 0.94469827]], dtype=float32)

# Using DropOut for regularization

In [None]:
model = Sequential()
# !!! Implement here !!!
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: {0:.2f}%".format(scores[1]*100))
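As background for this exercise, the sketch below shows the mechanism behind dropout at training time: each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected value is unchanged. It is a pure-Python illustration, and `inverted_dropout` is a hypothetical helper rather than a keras function.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the rest by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(42)  # fixed seed so the mask is reproducible
activations = [1.0] * 10
dropped = inverted_dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

In keras, one option for the placeholder is to add `Dropout(rate)` layers from `keras.layers` around the LSTM; another is to pass the `dropout` and `recurrent_dropout` arguments to `LSTM` itself. At inference time dropout is inactive, so predictions are deterministic.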

# Using Convolutional Layer to extract structures

In [None]:
model = Sequential()
# !!! Implement here !!!
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: {0:.2f}%".format(scores[1]*100))
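To give an intuition for what a convolution followed by pooling does to a sequence, here is a single-channel, pure-Python sketch. The helpers `conv1d_same`, `relu` and `max_pool1d` and the toy signal are assumptions made for illustration; a real `Conv1D` layer applies many filters across all embedding dimensions at once.

```python
def conv1d_same(signal, kernel):
    """Slide the kernel over the signal, zero-padding so the length is preserved."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(signal) + [0.0] * (k - 1 - left)
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]


def relu(xs):
    """Keep positive responses, clamp the rest to zero."""
    return [max(0.0, x) for x in xs]


def max_pool1d(xs, pool_size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(xs[i:i + pool_size])
            for i in range(0, len(xs) - pool_size + 1, pool_size)]


signal = [0, 1, 2, 3, 2, 1, 0, 0]  # toy one-dimensional input
kernel = [1, 0, -1]                # responds where the signal is falling

features = relu(conv1d_same(signal, kernel))
pooled = max_pool1d(features, pool_size=2)
print(features)
print(pooled)
```

The same idea applies to the placeholder above: with `Conv1D(..., padding='same')` followed by `MaxPooling1D(pool_size=2)` placed between the embedding and the LSTM, the 500-step input is reduced to 250 steps, so the LSTM processes a shorter sequence of local features.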