# Sentiment analysis of movie reviews

Sentiment analysis (positive or negative) is done using the 'Large Movie Review Dataset', available at http://ai.stanford.edu/~amaas/data/sentiment/.

In [39]:
import os
import numpy as np
import spacy
import re

In [4]:
PATH = '/Users/nis89mad/data/'

In [31]:
# read every file in a folder; readline() keeps only the first line of each file
def read_files(loc, PATH=PATH):
    files = os.listdir(PATH+loc)
    text = []
    for f in files:
        with open(PATH+loc+f) as ff:
            text.append(ff.readline())
    return text
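
A variant of the reader above, as a minimal sketch rather than the code that produced the outputs in this notebook. `os.listdir` returns names in arbitrary order, so sorting makes the review order reproducible; the explicit `encoding="utf-8"` and the `.txt` filter are assumptions about the dataset files, and `read_reviews` is a hypothetical name. Using it would change which review appears at index 0.

```python
import os
import tempfile

def read_reviews(folder):
    """Read every .txt file in `folder`, in sorted filename order."""
    texts = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue  # skip anything that is not a review file
        with open(os.path.join(folder, name), encoding="utf-8") as fh:
            texts.append(fh.read())
    return texts

# Tiny self-contained demo on a temporary folder with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("1_8.txt", "fine film"), ("0_9.txt", "great film"), ("notes.md", "skip me")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(body)
    print(read_reviews(tmp))
```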

In [25]:
train_pos = read_files('aclImdb/train/pos/')

In [29]:
len(train_pos)

12500

In [26]:
train_pos[0]

'For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.'

In [32]:
train_neg = read_files('aclImdb/train/neg/')

In [38]:
train_neg[1]

'Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question "why in Gods name would they create another one of these dumpster dives of a movie?" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we\'re from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to 

In [34]:
test_pos = read_files('aclImdb/test/pos/')

In [35]:
test_pos[0]

'Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with her sister, and was forced to stay back until she could get I.D. papers from the American embassy. To fill in a day before she could fly out, she took a trip into the countryside with a tour guide. "I tried finding something in those stone statues, but nothing stirred in me. I was stone myself." <br /><br />Suddenly all hell broke loose and she was caught in a political revolt. Just when it looked like she had escaped and safely boarded a train, she saw her tour guide get beaten and shot. In a split second she decided to jump from the moving train and try to rescue him, with no thought of herself. Continually her life was in danger. <br /><br />Here is a woman 

In [36]:
test_neg = read_files('aclImdb/test/neg/')

In [37]:
test_neg[0]

"Alan Rickman & Emma Thompson give good performances with southern/New Orleans accents in this detective flick. It's worth seeing for their scenes- and Rickman's scene with Hal Holbrook. These three actors mannage to entertain us no matter what the movie, it seems. The plot for the movie shows potential, but one gets the impression in watching the film that it was not pulled off as well as it could have been. The fact that it is cluttered by a rather uninteresting subplot and mostly uninteresting kidnappers really muddles things. The movie is worth a view- if for nothing more than entertaining performances by Rickman, Thompson, and Holbrook."

In [43]:
# replace HTML line breaks (<br />) with newlines
re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)
def sub_br(x): return re_br.sub("\n", x)

train_pos = [sub_br(r) for r in train_pos]
train_neg = [sub_br(r) for r in train_neg]
test_pos = [sub_br(r) for r in test_pos]
test_neg = [sub_br(r) for r in test_neg]

In [44]:
test_pos[0]

'Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with her sister, and was forced to stay back until she could get I.D. papers from the American embassy. To fill in a day before she could fly out, she took a trip into the countryside with a tour guide. "I tried finding something in those stone statues, but nothing stirred in me. I was stone myself." \n\nSuddenly all hell broke loose and she was caught in a political revolt. Just when it looked like she had escaped and safely boarded a train, she saw her tour guide get beaten and shot. In a split second she decided to jump from the moving train and try to rescue him, with no thought of herself. Continually her life was in danger. \n\nHere is a woman who demonstrated
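
To see which tag spellings the `<br />` pattern covers, here is a small standalone check. It reuses the same regular expression; the sample strings are made up for illustration.

```python
import re

# Same pattern as in the cell above.
re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)

def sub_br(x):
    return re_br.sub("\n", x)

samples = [
    "one<br />two",    # the form used in the dataset
    "one<br>two",      # no slash
    "one<BR/>two",     # upper case, no space
    "one< br / >two",  # space between '/' and '>' is NOT matched
]
for s in samples:
    print(repr(s), "->", repr(sub_br(s)))
```

The last sample is left unchanged because the pattern allows whitespace before `br` and before the slash, but not between the slash and `>`.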

In [45]:
# tokenizing (the 'en' shortcut is from spaCy 2.x; spaCy 3 needs a full model name such as 'en_core_web_sm')
my_tok = spacy.load('en')
def spacy_tok(x): return [tok.text for tok in my_tok.tokenizer(x)]

train_pos = [spacy_tok(r) for r in train_pos]
train_neg = [spacy_tok(r) for r in train_neg]
test_pos = [spacy_tok(r) for r in test_pos]
test_neg = [spacy_tok(r) for r in test_neg]

In [46]:
train_pos[0]

['For',
 'a',
 'movie',
 'that',
 'gets',
 'no',
 'respect',
 'there',
 'sure',
 'are',
 'a',
 'lot',
 'of',
 'memorable',
 'quotes',
 'listed',
 'for',
 'this',
 'gem',
 '.',
 'Imagine',
 'a',
 'movie',
 'where',
 'Joe',
 'Piscopo',
 'is',
 'actually',
 'funny',
 '!',
 'Maureen',
 'Stapleton',
 'is',
 'a',
 'scene',
 'stealer',
 '.',
 'The',
 'Moroni',
 'character',
 'is',
 'an',
 'absolute',
 'scream',
 '.',
 'Watch',
 'for',
 'Alan',
 '"',
 'The',
 'Skipper',
 '"',
 'Hale',
 'jr',
 '.',
 'as',
 'a',
 'police',
 'Sgt',
 '.']

In [49]:
def load_word_embedings(file):
    embeddings = {}
    with open(file, 'r') as infile:
        for line in infile:
            values = line.split()
            embeddings[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings

In [52]:
# GloVe embeddings as a dictionary (word -> 300-dimensional vector)
embeddings = load_word_embedings(PATH+'glove/glove.6B.300d.txt')

In [56]:
# remove stopwords from sentences
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))

In [65]:
def get_non_stopwords(sentence):
    return [w.lower() for w in sentence if w.lower() not in stops]

In [66]:
train_pos = [get_non_stopwords(r) for r in train_pos]
train_neg = [get_non_stopwords(r) for r in train_neg]
test_pos = [get_non_stopwords(r) for r in test_pos]
test_neg = [get_non_stopwords(r) for r in test_neg]

In [95]:
train_pos[0]

['movie',
 'gets',
 'respect',
 'sure',
 'lot',
 'memorable',
 'quotes',
 'listed',
 'gem',
 '.',
 'imagine',
 'movie',
 'joe',
 'piscopo',
 'actually',
 'funny',
 '!',
 'maureen',
 'stapleton',
 'scene',
 'stealer',
 '.',
 'moroni',
 'character',
 'absolute',
 'scream',
 '.',
 'watch',
 'alan',
 '"',
 'skipper',
 '"',
 'hale',
 'jr',
 '.',
 'police',
 'sgt',
 '.']

# Using word embeddings from the GloVe project

In [101]:
# get average feature embedding
def avg_feat_emb(sentence):
    e = [embeddings[w] for w in sentence if w.isalpha() and w in embeddings]
    e = np.array(e)
    return e.mean(axis=0)

In [103]:
train_pos_ae = [avg_feat_emb(r) for r in train_pos]
train_neg_ae = [avg_feat_emb(r) for r in train_neg]
test_pos_ae = [avg_feat_emb(r) for r in test_pos]
test_neg_ae = [avg_feat_emb(r) for r in test_neg]
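
One edge case worth keeping in mind: if a review has no alphabetic token with a GloVe vector, the mean of an empty array is `nan` and is no longer a 300-dimensional row. The sketch below shows one possible guard, assuming a zero vector is an acceptable fallback. `avg_feat_emb_safe`, `toy_embeddings` and the 3-dimensional vectors are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for the 300-dimensional GloVe vectors.
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "movie": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def avg_feat_emb_safe(sentence, embeddings, dim):
    """Mean vector of the known alphabetic tokens; zeros if there are none."""
    vecs = [embeddings[w] for w in sentence if w.isalpha() and w in embeddings]
    if not vecs:
        return np.zeros(dim, dtype="float32")  # assumed fallback for empty reviews
    return np.mean(vecs, axis=0)

print(avg_feat_emb_safe(["good", "movie", "!"], toy_embeddings, 3))
print(avg_feat_emb_safe(["zzz", "..."], toy_embeddings, 3))
```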

In [108]:
# train and test data
x_train = np.vstack((train_pos_ae, train_neg_ae))
x_test = np.vstack((test_pos_ae, test_neg_ae))

In [110]:
# labels: 12500 ones (positive) followed by 12500 zeros (negative), same order as the stacked rows
y_train = np.repeat((1,0), 12500)
y_test = np.repeat((1,0), 12500)
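
The labels only line up with the features because `np.repeat` emits all copies of the first label before the second, which matches stacking the positive rows above the negative rows. A scaled-down sketch with three made-up rows per class:

```python
import numpy as np

n_per_class = 3  # stand-in for the 12500 reviews per class

pos = np.ones((n_per_class, 2))   # pretend features of positive reviews
neg = np.zeros((n_per_class, 2))  # pretend features of negative reviews

x = np.vstack((pos, neg))           # positives on top, negatives below
y = np.repeat((1, 0), n_per_class)  # 1,1,1 then 0,0,0

print(y.tolist())                             # [1, 1, 1, 0, 0, 0]
print(np.tile((1, 0), n_per_class).tolist())  # [1, 0, 1, 0, 1, 0], which would NOT match
```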

In [112]:
import xgboost as xgb

xgb_pars = {"min_child_weight": 50, "eta": 0.05, "max_depth": 8,
            "subsample": 0.8, "silent" : 1, "nthread": 4,
            "eval_metric": "logloss", "objective": "binary:logistic"}

d_train = xgb.DMatrix(x_train, label=y_train)
d_val = xgb.DMatrix(x_test, label=y_test)

watchlist = [(d_train, 'train'), (d_val, 'valid')]

bst = xgb.train(xgb_pars, d_train, 400, watchlist, early_stopping_rounds=50, verbose_eval=50)

[0]	train-logloss:0.679942	valid-logloss:0.681813
Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping.

Will train until valid-logloss hasn't improved in 50 rounds.
[50]	train-logloss:0.415095	valid-logloss:0.476104
[100]	train-logloss:0.330598	valid-logloss:0.426798
[150]	train-logloss:0.284898	valid-logloss:0.406132
[200]	train-logloss:0.254756	valid-logloss:0.394055
[250]	train-logloss:0.233699	valid-logloss:0.387104
[300]	train-logloss:0.215828	valid-logloss:0.382148
[350]	train-logloss:0.201557	valid-logloss:0.379127
[399]	train-logloss:0.188821	valid-logloss:0.376886


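With averaged GloVe embeddings, the validation log loss reaches 0.3769 after 400 boosting rounds. The validation set here is the test split, and the log shows the loss still decreasing at round 399, so more rounds might lower it further.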
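
For reference, the `logloss` metric is the mean negative log-likelihood of the true labels under the predicted probabilities. The pure-Python sketch below is for intuition only and is not XGBoost's implementation; the `eps` clipping is an assumption added to avoid `log(0)`.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Predicting 0.5 for everything gives ln(2), about 0.693.
print(log_loss([1, 0], [0.5, 0.5]))
# More confident correct predictions give a lower loss.
print(log_loss([1, 0], [0.9, 0.1]))
```

The value near 0.68 at round 0 in the training log is close to this ln(2) baseline.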

# Using a bag of words representation

In [128]:
# keep only alphabetic tokens
def only_alpha(sentence):
    return [w for w in sentence if w.isalpha()]

train_pos = [only_alpha(r) for r in train_pos]
train_neg = [only_alpha(r) for r in train_neg]
test_pos = [only_alpha(r) for r in test_pos]
test_neg = [only_alpha(r) for r in test_neg]

In [130]:
train_pos[0]

['movie',
 'gets',
 'respect',
 'sure',
 'lot',
 'memorable',
 'quotes',
 'listed',
 'gem',
 'imagine',
 'movie',
 'joe',
 'piscopo',
 'actually',
 'funny',
 'maureen',
 'stapleton',
 'scene',
 'stealer',
 'moroni',
 'character',
 'absolute',
 'scream',
 'watch',
 'alan',
 'skipper',
 'hale',
 'jr',
 'police',
 'sgt']

In [114]:
from sklearn.feature_extraction.text import CountVectorizer

In [141]:
# join tokens back into one string per review: positive reviews first, then negative
corpus = [' '.join(s) for s in train_pos] + [' '.join(s) for s in train_neg] 

In [144]:
corpus[0]

'movie gets respect sure lot memorable quotes listed gem imagine movie joe piscopo actually funny maureen stapleton scene stealer moroni character absolute scream watch alan skipper hale jr police sgt'

In [149]:
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(corpus)

In [150]:
len(vectorizer.vocabulary_)

72688

In [154]:
x_train.shape

(25000, 72688)
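
Each of the 25000 rows above holds one review, and each of the 72688 columns holds the count of one vocabulary word. The idea can be sketched in pure Python on two made-up documents. This is an illustration of the representation, not scikit-learn's implementation, which stores the counts in a sparse matrix and applies its own tokenization.

```python
from collections import Counter

docs = ["movie gets respect", "movie movie funny"]

# Vocabulary: every distinct word in the training documents, in sorted order.
vocab = sorted({w for d in docs for w in d.split()})

def to_counts(doc):
    """Count vector for one document; words outside the vocabulary are ignored."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

matrix = [to_counts(d) for d in docs]
print(vocab)                     # ['funny', 'gets', 'movie', 'respect']
print(matrix)                    # [[0, 1, 1, 1], [1, 0, 2, 0]]
print(to_counts("great movie"))  # [0, 0, 1, 0]
```

The last line mirrors what happens to the test reviews: words never seen during fitting contribute nothing to the vector.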

In [155]:
x_test = vectorizer.transform([' '.join(s) for s in test_pos] + [' '.join(s) for s in test_neg])

In [156]:
xgb_pars = {"min_child_weight": 50, "eta": 0.05, "max_depth": 8,
            "subsample": 0.8, "silent" : 1, "nthread": 4,
            "eval_metric": "logloss", "objective": "binary:logistic"}

d_train = xgb.DMatrix(x_train, label=y_train)
d_val = xgb.DMatrix(x_test, label=y_test)

watchlist = [(d_train, 'train'), (d_val, 'valid')]

bst = xgb.train(xgb_pars, d_train, 400, watchlist, early_stopping_rounds=50, verbose_eval=50)

[0]	train-logloss:0.68157	valid-logloss:0.681532
Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping.

Will train until valid-logloss hasn't improved in 50 rounds.
[50]	train-logloss:0.486753	valid-logloss:0.492489
[100]	train-logloss:0.425526	valid-logloss:0.437459
[150]	train-logloss:0.389329	valid-logloss:0.406999
[200]	train-logloss:0.364406	valid-logloss:0.386846
[250]	train-logloss:0.346354	valid-logloss:0.372644
[300]	train-logloss:0.331647	valid-logloss:0.362599
[350]	train-logloss:0.320378	valid-logloss:0.354942
[399]	train-logloss:0.311154	valid-logloss:0.349298


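With bag of words counts, the validation log loss reaches 0.3493 after 400 boosting rounds, lower than the 0.3769 obtained with averaged GloVe embeddings. As before, the loss was still decreasing at round 399.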