# Sentiment Analysis for IMDB movie reviews using Keras and CNN

In [1]:
from __future__ import print_function
import numpy as np
import os

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Convolution1D, MaxPooling1D
from keras.datasets import imdb
from keras import backend as K
from keras.utils import data_utils


Using TensorFlow backend.


## Download and explore data

Keras ships the IMDB dataset as part of its distribution, but that dataset is already preprocessed. I want to cover the preprocessing steps because they are an important part of machine learning.

In [3]:

imdb_path = data_utils.get_file('aclImdb', 
                                'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', untar=True)
    

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


TODO: show histogram for review length and some graph with word count


In [4]:
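def load_reviews(path):
    """Read every file in `path` (in sorted order) and return the contents as a list of strings."""
    reviews = []
    for fname in sorted(os.listdir(path)):
        fpath = os.path.join(path, fname)
        # the with-statement closes the file even if reading fails
        with open(fpath) as f:
            reviews.append(f.read())
    return reviews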

positive_reviews = load_reviews(os.path.join(imdb_path, 'train/pos')) 
negative_reviews = load_reviews(os.path.join(imdb_path, 'train/neg')) 

# Positive reviews come first, then negative ones; the labels built later rely on this order.
all_reviews = positive_reviews + negative_reviews

## Data transformation

The original data is plain text. Before it can be fed into a neural network, it needs to be converted into tensors.

The first step is tokenizing the reviews. The tokenizer converts each review into a sequence of integers, with each integer representing the index of a word in a dictionary. Next, the sequences are padded so that all of them have the same length.
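To make the two steps concrete, here is a minimal sketch of tokenizing and padding written in plain Python and NumPy. It is not the Keras implementation: the helper names (`build_word_index`, `to_sequences`, `pad`), the whitespace splitting, and the exact vocabulary cutoff are assumptions made for illustration. It does follow the default `pad_sequences` behaviour of padding and truncating at the start of a sequence.

```python
from collections import Counter

import numpy as np


def build_word_index(texts):
    """Hypothetical helper: map words to integers, most frequent word -> 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; break ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def to_sequences(texts, word_index, max_words):
    """Hypothetical helper: replace words by their index and drop out-of-vocabulary words.

    Assumption of this sketch: only words with index < max_words are kept.
    """
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < max_words]
            for text in texts]


def pad(seqs, maxlen):
    """Hypothetical helper: left-pad with 0; keep only the last maxlen items of long sequences."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]
        if trimmed:  # an empty sequence stays all zeros
            out[row, -len(trimmed):] = trimmed
    return out


toy_reviews = ["good movie", "bad movie", "good good plot"]
toy_index = build_word_index(toy_reviews)
toy_sequences = to_sequences(toy_reviews, toy_index, max_words=4)
toy_padded = pad(toy_sequences, maxlen=3)

print(toy_index)
print(toy_sequences)
print(toy_padded)
```

In this toy corpus "plot" gets the highest index and falls outside the vocabulary, so it is dropped from the third review, in the same way infrequent words are dropped from the real reviews.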


In [5]:
# Only the MAX_NB_WORDS most frequently used words are kept. The words are indexed such that lower indexes
# correspond to more frequently used words
MAX_NB_WORDS = 5000
SEQUENCE_LENGTH = 500

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(all_reviews)

# Unknown words or words that are not frequently used are ignored.
sequences = tokenizer.texts_to_sequences(all_reviews)

# Pad sequences with 0 so they all have the same length. Longer sequences are truncated to SEQUENCE_LENGTH.
# By default, both padding and truncation happen at the start of the sequence.
reviews_vectors = pad_sequences(sequences, maxlen=SEQUENCE_LENGTH)
labels = np.zeros(len(all_reviews), dtype=int)
labels[0:len(positive_reviews)] = 1

def sentence_info(index):
    print('ORIGINAL SENTENCE\n---------------')
    print(all_reviews[index])
    print('TOKENIZED SEQUENCE\n---------------')
    print(sequences[index])
    print('PADDED SEQUENCE\n---------------')
    print(reviews_vectors[index])

sentence_info(1000)

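# Shuffle the data. The reviews are ordered positives first, and validation_split later takes
# the last samples, so without shuffling the validation set would contain a single class.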
indices = np.arange(reviews_vectors.shape[0])
np.random.shuffle(indices)
reviews_vectors = reviews_vectors[indices]
labels = labels[indices]


ORIGINAL SENTENCE
---------------
This was a must see documentary for me when I missed the opportunity in 2004, so I was definitely going to watch the repeat. I really sympathised with the main character of the film, because, this is true, I have a milder condition of the skin problem he had, Dystrophic Epidermolysis Bullosa (EB). This is a sad, sometimes amusing and very emotional documentary about a boy with a terrible skin disorder. Jonny Kennedy speaks like a kid (because of wasting vocal muscle) and never went through puberty, but he is 36 years old. Most sympathising moments are seeing his terrible condition, and pealing off his bandages. Jonny had quite a naughty sense of humour, he even narrated from beyond the grave when showing his body in a coffin. He tells his story with the help of his mother, Edna Kennedy, his older brother and celebrity model, and Jonny's supporter, Nell McAndrew. It won the BAFTAs for Best Editing and Best New Director (Factual), and it was nominated fo

## Build the model

In [6]:
#model parameters
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250

model = Sequential()

# The first layer is an embedding layer. It transforms each integer in the input sequence into a dense vector
# of size embedding_dims.
model.add(Embedding(MAX_NB_WORDS,
                    embedding_dims,
                    input_length=SEQUENCE_LENGTH,
                    dropout=0.2))

# we add a Convolution1D, which will learn nb_filter
# word group filters of size filter_length:
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
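# we use max pooling over the whole sequence (pool_length equals the convolution output length),
# which keeps a single value per filter: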
model.add(MaxPooling1D(pool_length=model.output_shape[1]))

# Output from the Convolution layer is flattened so it can be fed into a Dense layer
model.add(Flatten())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# Since this is a binary classification problem, the output from the Dense layer is fed into another layer with
# a single neuron. This in turn is activated by a sigmoid. A sigmoid makes sense in this case because its output
# lies between 0 and 1 and can be interpreted as a probability.
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 500, 50)       250000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 498, 250)      37750       embedding_1[0][0]                
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 1, 250)        0           convolution1d_1[0][0]            
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 250)           0           maxpooling1d_1[0][0]             
___________________________________________________________________________________________
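The parameter counts and output shapes in the summary can be checked by hand. The short sketch below redoes the arithmetic for the two layers whose rows carry parameters in the output above, using the values from the notebook's parameter cells. The variable names ending in `_params` and `conv_output_length` are introduced here for illustration only.

```python
# Values copied from the notebook's parameter cells.
MAX_NB_WORDS = 5000
SEQUENCE_LENGTH = 500
embedding_dims = 50
nb_filter = 250
filter_length = 3

# Embedding: one vector of size embedding_dims per vocabulary index.
embedding_params = MAX_NB_WORDS * embedding_dims

# Convolution1D: each filter has filter_length * embedding_dims weights plus one bias.
conv_params = nb_filter * (filter_length * embedding_dims + 1)

# A 'valid' convolution with stride 1 shortens the sequence by filter_length - 1.
conv_output_length = SEQUENCE_LENGTH - filter_length + 1

print(embedding_params)    # 250000
print(conv_params)         # 37750
print(conv_output_length)  # 498
```

These values match the `Param #` and `Output Shape` columns for `embedding_1` and `convolution1d_1`.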

## Train the model

In [7]:
#training parameters
batch_size = 32
nb_epoch = 2

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(reviews_vectors, labels,
                    batch_size=batch_size,
                    nb_epoch=nb_epoch,
                    validation_split=0.2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2
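The 20000/5000 split comes from `validation_split=0.2`, which holds out the last 20% of the samples passed to `fit`. The sketch below, in NumPy only, illustrates why the data is shuffled beforehand. It assumes the 25000 training reviews are half positive and half negative and uses a fixed random seed; neither the seed nor the variable names come from the notebook.

```python
import numpy as np

n_samples = 25000        # 20000 training + 5000 validation samples, as reported by fit
validation_split = 0.2

# Labels in their original order: positives (1) first, then negatives (0).
# Assumption: the two classes are of equal size.
ordered_labels = np.zeros(n_samples, dtype=int)
ordered_labels[:n_samples // 2] = 1

# The validation set is the last 20% of the samples.
split_at = int(round(n_samples * (1 - validation_split)))

# Without shuffling, the held-out tail contains only negative reviews.
unshuffled_val = ordered_labels[split_at:]

# With shuffling (fixed seed here only for reproducibility), both classes are represented.
rng = np.random.RandomState(0)
shuffled_labels = ordered_labels[rng.permutation(n_samples)]
shuffled_val = shuffled_labels[split_at:]

print(split_at, n_samples - split_at)
print('positive share without shuffling:', unshuffled_val.mean())
print('positive share with shuffling:', shuffled_val.mean())
```

Without the shuffle, the validation accuracy would only measure how well the model recognises negative reviews.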



## Predict sentiment

In [18]:
def process_data(tokenizer, reviews):
    sequences = tokenizer.texts_to_sequences(reviews)
    return pad_sequences(sequences, maxlen=SEQUENCE_LENGTH)

review1 = """This movie is horrendous on so many levels - patronising to Christians, incredibly bad plot and direction, the acting and dialogue is as comfortable as a dentist's visit. I suspect the whole thing is purely a desperate attempt by Cameron to pander to under-educated Christians in an effort to earn their hard earned dollars. Either that, or it is just a giant ego trip for him. I had no preconceptions before I went to watch this, and left confused as to the amount of utter garbage wedged into what seemed like an eternity of boredom. Even worse, he is asking his Christian supporters to bear false witness by skewing the votes here and on rotten tomatoes. Hypocritical? I'll let you decide."""
review2 = """I had the rather intense privilege to view James Cameron's much anticipated $400 million budget return to the directing scene, Avatar, at the Empire Leicester Square in London.
Where to begin! The visuals in this pieces was groundbreaking. He did it with the Terminator series and then Titanic, so one would expect Cameron to deliver... and HE DID! The visual are by far some of the sharpest CGI I have seen. You could almost say that there is a disquiet that follows Cameron's soul, as there is no other possibility of this strong and intensified quality. Its production design and visual effects are both noteworthy and it will get its praise upon official release.
What it was lacking that really should have shaped the movie is its character/story. I was expecting a complex and believable plot, but was left with a movie with mostly strong visuals. What most sci-fi lovers desire is mind-bending philosophies, fantasy and exploration and limitations of our or outer species. If it was not for this factor, I would give this a 9.5 vote.
Avatar will be a success, not only because of Cameron's legacy, but by very intelligent and viral marketing. Avatar have had a powerful marketing technique that assembles other successful blockbusters, such as The Blair Witch project (you all remember it), The Dark Knight (Joker invades the world) and also, the current production The Artifice (the-artifice.com) that is intelligently targeting the market.
Kudos to Cameron, Avatar is one of the (if not The) movie of the year. I could get in trouble for sharing this with you guys so early, so please click Yes on "Was the above comment useful to you?" as a thank you."""
data = process_data(tokenizer,[review1,review2])

print(model.predict(data))


[[ 0.00883079]
 [ 0.71472728]]
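Each row of the output is the sigmoid score for one review, where values close to 1 indicate a positive review and values close to 0 a negative one. A minimal sketch of turning the scores into labels follows. The 0.5 threshold is an assumption (the notebook does not fix one), and `THRESHOLD` and `predicted_labels` are names introduced here for illustration.

```python
import numpy as np

# Scores copied from the output above; one row per review.
predictions = np.array([[0.00883079],
                        [0.71472728]])

# Assumption: 0.5 is used as the decision threshold.
THRESHOLD = 0.5
predicted_labels = (predictions[:, 0] > THRESHOLD).astype(int)

for score, label in zip(predictions[:, 0], predicted_labels):
    sentiment = 'positive' if label == 1 else 'negative'
    print('score=%.3f -> %s' % (score, sentiment))
```

Under this threshold the first review is classified as negative and the second as positive, which agrees with their content.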
