# Backpropagation for Sentiment Analysis

The algorithm below uses backpropagation with a stochastic gradient descent (SGD) optimizer.

### The IMDb Movie Review Dataset

In this section, we will train a simple neural network to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), with 25,000 positive and 25,000 negative reviews. For simplicity, I assembled the reviews in a single CSV file.
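The script used to assemble the CSV file is not shown. As a rough sketch only, the step could look like the following, assuming the archive has been unpacked so that reviews sit in `<basepath>/{train,test}/{pos,neg}/*.txt`. The function name `assemble_reviews` and its parameters are hypothetical, and the resulting file may differ in column layout from the `shuffled_movie_data.csv` used below.

```python
import csv
import os
import random


def assemble_reviews(basepath, out_csv, seed=0):
    """Collect reviews from <basepath>/{train,test}/{pos,neg}/*.txt into one CSV.

    Hypothetical helper: the directory layout and the column names are assumptions.
    """
    labels = {'pos': 1, 'neg': 0}  # same label encoding as described above
    rows = []
    for split in ('train', 'test'):
        for name, label in labels.items():
            folder = os.path.join(basepath, split, name)
            for fname in sorted(os.listdir(folder)):
                with open(os.path.join(folder, fname), encoding='utf-8') as f:
                    rows.append((f.read(), label))

    # Shuffle so that positive and negative reviews are interleaved.
    random.Random(seed).shuffle(rows)

    with open(out_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['review', 'sentiment'])
        writer.writerows(rows)
    return len(rows)
```

Shuffling before saving matters because the raw folders group all positive and all negative reviews together.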

In [1]:
import re
import pandas as pd
import numpy as np

from gensim.models import Word2Vec
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

### Preprocessing text data

In [6]:
stop = stopwords.words('english')
porter = PorterStemmer()

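def tokenizer(text):
    # Strip HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) or ;-( before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace non-word characters with spaces, then append the emoticons without their "nose".
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    # Drop English stop words.
    text = [w for w in text.split() if w not in stop]
    # Tokens are returned unstemmed; the outputs shown below (e.g. 'murdered', 'moves') rely on this.
    return text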

Let's give it a try:

In [7]:
tokenizer('This :) is a <a> test! :-)</br>')

['test', ':)', ':)']
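To see what the emoticon pattern captures on its own, the short sketch below applies the same regular expression to a made-up sentence (the sample string and the name `EMOTICON_RE` are illustrative only). It also shows a side effect worth knowing: because the tokenizer lowercases the text before searching, emoticons ending in `D` or `P` are not matched there.

```python
import re

# Same emoticon pattern as in the tokenizer above, written as a raw string.
EMOTICON_RE = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'

sample = 'Great movie :-) but the ending =( was odd ;D'

found = re.findall(EMOTICON_RE, sample)
found_lower = re.findall(EMOTICON_RE, sample.lower())

print(found)        # [':-)', '=(', ';D']
print(found_lower)  # [':-)', '=(']  (';d' no longer matches the uppercase D)
print([e.replace('-', '') for e in found])  # [':)', '=(', ';D']
```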

### Importing the dataset and preparing it with word2vec (Exercise 1)

In [8]:
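def featureVecMethod(words, model, num_features):
    # Accumulate the vectors of all in-vocabulary words, then average them.
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0

    # A set gives constant-time membership tests
    # (gensim 3.x attribute; gensim 4.0 and later use `index_to_key`).
    index2word_set = set(model.wv.index2word)

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            # Index through `model.wv`; indexing the model directly is deprecated.
            featureVec = np.add(featureVec, model.wv[word])

    # Guard against reviews with no in-vocabulary words (avoids division by zero).
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec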


def getAvgFeatureVecs(reviews, model, num_features):
    # One row per review: the average word vector of its tokens.
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for counter, review in enumerate(reviews):
        # Report progress every 1000 reviews.
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))

        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)

    return reviewFeatureVecs
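The idea behind these two helpers, representing a review by the mean of its word vectors, can be illustrated without gensim. The sketch below uses a toy dictionary in place of a trained model; `average_vector` and the toy embeddings are hypothetical names introduced only for this illustration.

```python
import numpy as np


def average_vector(tokens, embeddings, num_features):
    """Mean of the vectors of all tokens found in `embeddings` (a plain dict)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known token: return zeros instead of dividing by zero.
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0).astype("float32")


# Toy 3-dimensional "embeddings" standing in for a trained word2vec model.
toy = {
    'good': np.array([1.0, 0.0, 2.0]),
    'movie': np.array([3.0, 2.0, 0.0]),
}

print(average_vector(['good', 'movie', 'unknownword'], toy, 3))  # [2. 1. 1.]
print(average_vector(['unknownword'], toy, 3))                   # [0. 0. 0.]
```

Out-of-vocabulary tokens are simply skipped, so they do not pull the average toward zero.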

In [9]:
df = pd.read_csv('shuffled_movie_data.csv')

In [10]:
df.tail()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 49995 | OK, lets start with the best. the building. al... | 0 |
| 49996 | The British 'heritage film' industry is out of... | 0 |
| 49997 | I don't even know where to begin on this one. ... | 0 |
| 49998 | Richard Tyler is a little boy who is scared of... | 0 |
| 49999 | I waited long to watch this movie. Also becaus... | 1 |


We need to tokenize each review. This will take approximately 3 minutes.

In [11]:
X = list(df['review'])
y = list(df['sentiment'])
xx = []
for i in X:
    xx.append(tokenizer(i))

Let's check some data:


In [12]:
xx[0]

['1974',
 'teenager',
 'martha',
 'moxley',
 'maggie',
 'grace',
 'moves',
 'high',
 'class',
 'area',
 'belle',
 'greenwich',
 'connecticut',
 'mischief',
 'night',
 'eve',
 'halloween',
 'murdered',
 'backyard',
 'house',
 'murder',
 'remained',
 'unsolved',
 'twenty',
 'two',
 'years',
 'later',
 'writer',
 'mark',
 'fuhrman',
 'christopher',
 'meloni',
 'former',
 'la',
 'detective',
 'fallen',
 'disgrace',
 'perjury',
 'j',
 'simpson',
 'trial',
 'moved',
 'idaho',
 'decides',
 'investigate',
 'case',
 'partner',
 'stephen',
 'weeks',
 'andrew',
 'mitchell',
 'purpose',
 'writing',
 'book',
 'locals',
 'squirm',
 'welcome',
 'support',
 'retired',
 'detective',
 'steve',
 'carroll',
 'robert',
 'forster',
 'charge',
 'investigation',
 '70',
 'discover',
 'criminal',
 'net',
 'power',
 'money',
 'cover',
 'murder',
 'murder',
 'greenwich',
 'good',
 'tv',
 'movie',
 'true',
 'story',
 'murder',
 'fifteen',
 'years',
 'old',
 'girl',
 'committed',
 'wealthy',
 'teenager',
 'whose',


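Create the word2vec model and train it on all tokenized reviews. The `size` argument below is the gensim 3.x name for the embedding dimension; gensim 4.0 and later call it `vector_size`.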

In [13]:
model = Word2Vec(xx, size=100, window=5, min_count=1, workers=4)

Check the nearest neighbors of a few words to verify that the model was trained sensibly.

In [14]:
model.wv.similar_by_word('paris')

[('aime', 0.8245323300361633),
 ('je', 0.7812187671661377),
 ('berlin', 0.7514671683311462),
 ('italy', 0.746889591217041),
 ('london', 0.7405767440795898),
 ('france', 0.7285119295120239),
 ('north', 0.6963350772857666),
 ('england', 0.6910803318023682),
 ('2054', 0.6908023953437805),
 ('san', 0.686427116394043)]

In [15]:
model.wv.similar_by_word('dog')

[('cat', 0.7755012512207031),
 ('puppy', 0.7531970739364624),
 ('chicken', 0.6783939599990845),
 ('rat', 0.6774106621742249),
 ('freak', 0.6763298511505127),
 ('bugs', 0.6654431223869324),
 ('bite', 0.6646356582641602),
 ('bird', 0.6585801839828491),
 ('monkey', 0.6538642048835754),
 ('pet', 0.6523016691207886)]
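The scores returned by `similar_by_word` are cosine similarities between word vectors. As a minimal sketch of that ranking, assuming a toy dictionary of hand-made vectors rather than the trained model (`most_similar` and `toy` are names made up for this example):

```python
import numpy as np


def most_similar(word, embeddings, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = embeddings[word]
    scores = []
    for other, vec in embeddings.items():
        if other == word:
            continue
        cos = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((other, float(cos)))
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made vectors: 'dog' and 'cat' point in nearly the same direction.
toy = {
    'dog': np.array([1.0, 1.0, 0.0]),
    'cat': np.array([1.0, 0.9, 0.1]),
    'car': np.array([0.0, 0.1, 1.0]),
}

print(most_similar('dog', toy))  # 'cat' is ranked before 'car'
```

Cosine similarity depends only on the angle between vectors, not on their length, which is why frequent and rare words can still be compared.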

Prepare the data with the word2vec model. This will take 15 to 20 minutes, depending on your computer.

In [16]:
x_data = getAvgFeatureVecs(xx, model, 100)

Review 0 of 50000
Review 1000 of 50000
Review 2000 of 50000
Review 3000 of 50000
Review 4000 of 50000
Review 5000 of 50000
Review 6000 of 50000
Review 7000 of 50000
Review 8000 of 50000
Review 9000 of 50000
Review 10000 of 50000
Review 11000 of 50000
Review 12000 of 50000
Review 13000 of 50000
Review 14000 of 50000
Review 15000 of 50000
Review 16000 of 50000
Review 17000 of 50000
Review 18000 of 50000
Review 19000 of 50000
Review 20000 of 50000
Review 21000 of 50000
Review 22000 of 50000
Review 23000 of 50000
Review 24000 of 50000
Review 25000 of 50000
Review 26000 of 50000
Review 27000 of 50000
Review 28000 of 50000
Review 29000 of 50000
Review 30000 of 50000
Review 31000 of 50000
Review 32000 of 50000
Review 33000 of 50000
Review 34000 of 50000
Review 35000 of 50000
Review 36000 of 50000
Review 37000 of 50000
Review 38000 of 50000
Review 39000 of 50000
Review 40000 of 50000
Review 41000 of 50000
Review 42000 of 50000
Review 43000 of 50000
Review 44000 of 50000
Review 45000 of 50000
Review 46000 of 500

In [17]:
x_data[0]

array([-2.60379612e-01, -2.64186233e-01, -2.53893062e-03,  3.41735072e-02,
       -2.85584480e-01,  8.62184241e-02,  8.96588862e-02, -2.35976726e-02,
        6.79201409e-02,  1.34501606e-01, -1.96452677e-01, -1.52094781e-01,
        4.47102606e-01, -5.54411672e-04, -5.46162605e-01,  3.55192304e-01,
        3.39316428e-02, -8.76795053e-02,  4.94601399e-01,  5.14558852e-01,
       -1.66057527e-01, -5.62029123e-01, -5.45754880e-02,  7.16634467e-02,
       -1.84396118e-01, -2.16080338e-01,  9.07994956e-02,  2.66660064e-01,
       -5.74254274e-01,  2.00708911e-01,  2.09721312e-01, -2.03400284e-01,
       -1.79200754e-01,  8.54769349e-02,  3.16235870e-01, -2.53821462e-01,
        3.49349678e-02, -1.05141506e-01, -9.36906114e-02, -3.21814865e-01,
        4.50330526e-01, -5.60665056e-02,  3.21082503e-01,  2.02802762e-01,
       -3.37554291e-02,  2.02241272e-01, -6.31508291e-01, -4.84763294e-01,
        3.45244855e-02,  1.22525327e-01, -1.11612365e-01, -3.12163770e-01,
        1.27353191e-01, -

In [18]:
x_data = [list(i) for i in x_data]

In [19]:
from copy import deepcopy

In [20]:
dataset = deepcopy(x_data)

In [21]:
for x, yy in zip(dataset, y):
    x.append(yy)

In [23]:
len(dataset[0])

101

## Implement a backpropagation algorithm using SGD (Exercise 2)

The algorithm is implemented in the file `backpropagation.py`.

In [28]:
from backpropagation import *
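The contents of `backpropagation.py` are not reproduced here. To make the technique concrete, the following is a minimal sketch of backpropagation with per-sample SGD for a network with one sigmoid hidden layer and a cross-entropy loss. It is an assumption-laden illustration, not the `Backpropagation` class itself: every function name, the loss, and the network shape are choices made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_inputs, n_hidden, rng):
    """Small random weights for a network with one hidden layer."""
    return {
        'W1': rng.normal(0.0, 0.5, size=(n_hidden, n_inputs)),
        'b1': np.zeros(n_hidden),
        'W2': rng.normal(0.0, 0.5, size=n_hidden),
        'b2': 0.0,
    }


def forward(params, x):
    """Return the hidden activations and the predicted probability for one sample."""
    h = sigmoid(params['W1'] @ x + params['b1'])
    p = sigmoid(params['W2'] @ h + params['b2'])
    return h, p


def gradients(params, x, y):
    """Backpropagate the cross-entropy loss of a single sample (x, y)."""
    h, p = forward(params, x)
    delta_out = p - y  # derivative w.r.t. the output pre-activation
    # Chain rule through the output weights and the hidden sigmoid.
    delta_hidden = delta_out * params['W2'] * h * (1.0 - h)
    return {
        'W1': np.outer(delta_hidden, x),
        'b1': delta_hidden,
        'W2': delta_out * h,
        'b2': delta_out,
    }


def loss(params, X, y):
    """Mean binary cross-entropy over a dataset."""
    probs = np.array([forward(params, x)[1] for x in X])
    probs = np.clip(probs, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs)))


def train_sgd(params, X, y, learning_rate=0.1, num_iterations=10, rng=None):
    """Plain SGD: one parameter update per training sample, in shuffled order."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(num_iterations):
        for i in rng.permutation(len(X)):
            grads = gradients(params, X[i], y[i])
            for name in params:
                params[name] = params[name] - learning_rate * grads[name]
    return params
```

Updating after every sample, rather than after a full pass over the data, is what distinguishes SGD from batch gradient descent.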

Initialize the parameters

In [33]:
learning_rate = 0.1
num_iterations = 10
hidden_layers = 6
num_folds = 2

In [34]:
# Use a separate name so that `model` keeps referring to the Word2Vec model.
bp_model = Backpropagation(
        learning_rate, num_iterations, hidden_layers, num_folds)

Train on the data using cross-validation. The accuracy should be higher than approximately 80%.

In [35]:
bp_model.run(dataset)

iter 0
iter 1
iter 2
iter 3
iter 4
iter 5
iter 6
iter 7
iter 8
iter 9
iter 0
iter 1
iter 2
iter 3
iter 4
iter 5
iter 6
iter 7
iter 8
iter 9
accuracy 86.11999999999999
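How `run` splits the data internally is not shown. As a hedged sketch of k-fold cross-validation in general, assuming rows whose last element is the label (as in `dataset` above), the procedure could look like this. `k_fold_split`, `cross_validate`, and the `fit`/`predict` callables are hypothetical names, and averaging the per-fold accuracies is an assumption about how the reported number is formed.

```python
import random


def k_fold_split(dataset, num_folds, seed=0):
    """Shuffle row indices and deal them into `num_folds` folds."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return [[dataset[i] for i in indices[k::num_folds]] for k in range(num_folds)]


def cross_validate(dataset, num_folds, fit, predict):
    """Average accuracy (in percent) over folds; the last column of each row is the label."""
    folds = k_fold_split(dataset, num_folds)
    scores = []
    for k, test_fold in enumerate(folds):
        # Train on every fold except the k-th, then evaluate on the k-th.
        train_rows = [row for j, fold in enumerate(folds) if j != k for row in fold]
        fitted = fit(train_rows)
        correct = sum(predict(fitted, row[:-1]) == row[-1] for row in test_fold)
        scores.append(100.0 * correct / len(test_fold))
    return sum(scores) / len(scores)


# Toy usage with a trivial classifier that thresholds the first feature.
toy = [[-1.0, 0], [-2.0, 0], [1.5, 1], [0.5, 1], [-0.5, 0], [2.0, 1]]
fit = lambda rows: 0.0  # "training" just fixes a threshold
predict = lambda threshold, features: int(features[0] > threshold)
print(cross_validate(toy, 2, fit, predict))  # 100.0
```

With `num_folds = 2`, each half of the data is used once for training and once for evaluation, which matches the two runs of ten iterations in the log above.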


In [None]:
example = 'I loved this movie'
example = tokenizer(example)


example = featureVecMethod(example, model, 100)
print(example)