# Movie Review Sentiment Analysis Code Report

This IPython notebook contains all the code that was run to produce the results discussed in the report.

## Common utility functions

Once loaded, the data is split into 67% training data and 33% test data. This common split is used by every technique evaluated here.

In [9]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
from sklearn.metrics import roc_curve,roc_auc_score
from sklearn.cross_validation import train_test_split

# Read the data
originalTrainData = pd.read_csv(os.path.join('data', 'labeledTrainData.tsv'), header=0, delimiter="\t", quoting=3);
[trainData, testData] = train_test_split(originalTrainData, test_size=0.33, random_state=42);

extraWordVecData = pd.read_csv(os.path.join('data', 'unlabeledTrainData.tsv'), header=0, delimiter="\t", quoting=3);

print 'Text Data loaded'

Text Data loaded
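
As a rough illustration of what a seeded 67/33 split does, here is a minimal standard-library sketch. It is not scikit-learn's algorithm, and `split_rows` and `row_ids` are hypothetical names used only for this example. Separately, in scikit-learn 0.18 and later `train_test_split` is imported from `sklearn.model_selection`; the `sklearn.cross_validation` module used above was removed in 0.20.

```python
import random

def split_rows(rows, test_fraction=0.33, seed=42):
    """Shuffle rows with a fixed seed, then cut off the test fraction."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Toy row ids standing in for the review rows (made up for illustration).
row_ids = list(range(100))
train_ids, test_ids = split_rows(row_ids)
```

Because the seed is fixed, every technique that calls the split sees the same training and test rows, which is what makes the later AUC numbers comparable.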


To calculate the area under the ROC curve we use the following utility function.

In [35]:
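def plotROC(result, algorithm):
    # Compute the ROC curve and print the area under it (AUC).
    # result needs the columns "expected_sentiment" (true labels)
    # and "sentiment" (predictions).
    # TODO : Should we plot this
    [fpr, tpr, threshold] = roc_curve(result["expected_sentiment"], result["sentiment"])

    score = roc_auc_score(result["expected_sentiment"], result["sentiment"])
    # NOTE: this label is printed for every algorithm, not only bag of words;
    # the line printed after the score names the algorithm actually evaluated.
    print "ROC under curve for bag of Words:"
    print score
    print " Algorithm used :" + algorithm

    # Store the predictions in an output file (reused later by the ensemble)
    result.to_csv(os.path.join('logs', algorithm + ".csv"), index=False, quoting=3)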
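
One caveat about the AUC numbers: if `result["sentiment"]` holds hard 0/1 predictions rather than probabilities (an assumption, since the classifier modules are not shown here), the ROC curve has a single interior point and the AUC reduces to the average of the true positive and true negative rates. The sketch below shows this with a pure-Python pairwise AUC; `auc_from_scores` and the toy data are hypothetical and made up for illustration.

```python
def auc_from_scores(y_true, y_score):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Counts, over all (positive, negative) pairs, how often the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data, not taken from the notebook.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]            # classifier scores
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]    # thresholded labels

auc_prob = auc_from_scores(y_true, y_prob)   # 8/9, uses the full ranking
auc_hard = auc_from_scores(y_true, y_hard)   # 2/3 = (TPR + TNR) / 2
```

Scoring with probabilities (where a model exposes them) uses the whole ranking, so it usually gives a more informative AUC than scoring thresholded labels.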

## Bag of Words
We use the traditional bag-of-words approach to classify sentiment. It ignores sentence structure and word order and relies on "anchor" words to classify correctly. For the implementation I use one-hot K encoding for each word.
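
To make the encoding concrete, here is a minimal standard-library sketch of a bag-of-words feature vector, which is the sum of the one-hot vectors of the words in a review. It is not the code in the `BagOfWords` module (which is not shown here); `build_vocabulary`, `bag_of_words`, and the toy reviews are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: i for i, word in enumerate(words)}

def bag_of_words(document, vocabulary):
    """Count vector: the sum of the one-hot vectors of the document's words."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:           # ignore out-of-vocabulary words
            vector[vocabulary[word]] = count
    return vector

reviews = ["a great great movie", "a dull movie"]   # toy reviews, made up
vocab = build_vocabulary(reviews)                    # a, dull, great, movie
features = [bag_of_words(r, vocab) for r in reviews]
# features == [[1, 0, 2, 1], [1, 1, 0, 1]]
```

Each review becomes one row with one column per vocabulary word, which is why the feature size grows with the vocabulary.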

In [15]:
from BagOfWords import BagOfWords
plotROC(BagOfWords(trainData, testData), "Bag of Words");

Cleaning and parsing the training set movie reviews...

Creating the bag of words...

Training the random forest (this may take a while)...
Cleaning and parsing the test set movie reviews...

Predicting test labels...

ROC under curve for bag of Words:
0.845485587173
 Algorithm used :Bag of Words


## Word Vector Approach

One-hot K encoding needs one dimension per vocabulary word, which makes the problem much larger. A lot of information about word relations (for example, man is to woman as king is to queen) is also lost with traditional one-hot K encoding. As suggested in the Kaggle competition, we train a word vector model to reduce the dimensionality of the problem. We then build features for each review in two ways, by averaging its word vectors and by K-means clustering of the word vectors (bag of centroids), and use those features for sentiment classification.

In [4]:
from WordVectorModelUtility import LoadWordVectorModel, TrainWordVectorModel
from Word2Vec_AverageVectors import AverageWordVector
from Word2Vec_BagOfCentroids import BagOfCentroids

wordVecModelName = "ComparisonModelv2";
TrainWordVectorModel(wordVecModelName, originalTrainData, extraWordVecData);

Word vector model already exists


First, try the averaged word vector approach, which uses a much smaller feature size than bag of words.

In [5]:
plotROC(AverageWordVector(LoadWordVectorModel(wordVecModelName), trainData,testData),"Word vector with averaging");

Creating average feature vecs for training reviews
Creating average feature vecs for test reviews
Fitting a random forest to labeled training data...
ROC under curve for bag of Words:
0.843399220404
 Algorithm used :Word vector with averaging
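
The averaging step can be sketched as follows. This is a simplified illustration that uses a plain dict in place of the trained word vector model; `average_word_vector` and `toy_vectors` are hypothetical and are not the code in `Word2Vec_AverageVectors`.

```python
def average_word_vector(words, word_vectors, num_features):
    """Mean of the vectors of the in-vocabulary words of one review."""
    total = [0.0] * num_features
    n_known = 0
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:                 # skip words the model never saw
            continue
        total = [t + v for t, v in zip(total, vec)]
        n_known += 1
    if n_known == 0:                    # no known words: return the zero vector
        return total
    return [t / n_known for t in total]

# Toy 3-dimensional vectors, made up for illustration.
toy_vectors = {"good": [1.0, 0.0, 2.0], "film": [3.0, 2.0, 0.0]}
review_vec = average_word_vector(["good", "film", "zzz"], toy_vectors, 3)
# review_vec == [2.0, 1.0, 1.0]
```

Every review maps to a vector of the same small size regardless of its length, at the cost of discarding word order.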


Now try a slightly different way of using the word vectors: cluster them with K-means and represent each review as a bag of centroids.

In [6]:
plotROC(BagOfCentroids(LoadWordVectorModel(wordVecModelName),trainData,testData),"Word vector with bag of centroids");

Running K means
Time taken for K Means clustering:  2857.75741005 seconds.

Cluster 0
[u'publicity', u'paycheck']

Cluster 1
[u'prototype', u'hannibal']

Cluster 2
[u'nigh', u'foe', u'consultant', u'embassy', u'challenger', u'ordered', u'renegade', u'commission', u'grants', u'volunteer', u'mercury', u'proposed', u'brass', u'chairman', u'tabloid', u'shipping']

Cluster 3
[u'basics', u'screamers']

Cluster 4
[u'passages', u'happenings', u'glimpses', u'scattered']

Cluster 5
[u'controversy', u'globalization', u'accustomed', u'promotes', u'sentiment', u'evolved', u'clearer']

Cluster 6
[u'widmark', u'jacob', u'louis', u'dreyfuss', u'hines', u'chamberlain', u'foley', u'cardinal', u'dreyfus', u'crenna', u'farnsworth']

Cluster 7
[u'exorcism', u'whipping', u'torturing', u'lovemaking', u'stabs', u'organ', u'foxes', u'bursts', u'stall']

Cluster 8
[u'macho', u'manipulative', u'vulnerable', u'compassionate', u'egotistical', u'neurotic', u'shy', u'smug', u'bratty', u'arrogant', u'naive', u'insecu
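
Once every word is assigned to a cluster, a review can be turned into a bag of centroids by counting how many of its words fall in each cluster. The sketch below assumes a word-to-cluster dict is already available; `bag_of_centroids` here is a hypothetical stand-in, and the toy assignment only borrows a few words from the output above with the clusters renumbered.

```python
def bag_of_centroids(words, word_to_cluster, num_clusters):
    """Count, per cluster, how many words of the review belong to it."""
    counts = [0] * num_clusters
    for word in words:
        cluster = word_to_cluster.get(word)
        if cluster is not None:         # words without a cluster are ignored
            counts[cluster] += 1
    return counts

# Hypothetical assignment with renumbered clusters, for illustration only.
toy_clusters = {"macho": 0, "arrogant": 0, "stabs": 1, "torturing": 1, "louis": 2}
review = "a macho arrogant villain stabs louis".split()
centroid_features = bag_of_centroids(review, toy_clusters, 3)
# centroid_features == [2, 1, 1]
```

This is the same idea as bag of words, except that semantically related words share a column.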

## Paragraph Vector based classifier
Unlike word vectors, which lose the notion of sentence structure, a paragraph vector retains some of it, so sequence information is preserved to a degree. For example, the meaning of "I am not not interested" may be lost by any classification model that ignores sentence structure.
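
A small illustration of why word order matters: the two toy sentences below (made up for this example) have opposite sentiment but identical word counts, so any model that sees only the counts cannot tell them apart.

```python
from collections import Counter

def word_counts(sentence):
    """Order-free representation: how often each word occurs."""
    return Counter(sentence.lower().split())

positive = "the plot was good not bad"
negative = "the plot was bad not good"

same_bag = word_counts(positive) == word_counts(negative)   # True
same_sequence = positive.split() == negative.split()        # False
```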

In [18]:
from ParagraphVector import TrainReviewVectorModel, LoadReviewVectorModel
from sklearn import svm

sentenceVecModelName = "ReviewComparisonModel";
TrainReviewVectorModel(sentenceVecModelName, originalTrainData, \
                       extraWordVecData);

TrainReviewVectorModel(sentenceVecModelName, originalTrainData, \
                       extraWordVecData);

def svcClassifier(paragraphVectorModel, trainData, testData, suffix):
    # The scikit-learn estimator map suggests a linear SVC below 100k samples.
    # NOTE: svm.SVC() as called here uses the default RBF kernel, not a linear one.
    # REFERENCE: 
    # http://scikit-learn.org/stable/tutorial/machine_learning_map/
    classifier = svm.SVC();
    # Use the vector representation from paragraph vector

    classifier.fit(paragraphVectorModel.docvecs[trainData["id"]], \
                   trainData["sentiment"])

    result = pd.DataFrame( data=\
                           {"id":testData["id"], \
                            "sentiment": classifier.predict(paragraphVectorModel.docvecs[testData["id"]]),\
                            "expected_sentiment": \
                           testData["sentiment"]} );
    plotROC(result, "Paragraph Vector Result" + suffix +" : ");

svcClassifier(LoadReviewVectorModel(sentenceVecModelName, True), trainData, testData,"_dm");
svcClassifier(LoadReviewVectorModel(sentenceVecModelName, False), trainData, testData,"_dbow");    


Doc vector model already exists
Doc vector model already exists
ROC under curve for bag of Words:
0.820424061392
 Algorithm used :Paragraph Vector Result_dm : 
ROC under curve for bag of Words:
0.87692742823
 Algorithm used :Paragraph Vector Result_dbow : 


## Recurrent Neural Network Classifier
Training the RNN inside IPython took far too long on my machine: about 3 days, compared with about 6-7 hours when the same code ran as a plain Python script.

So I switched to plain Python to finish the training and testing: I ran RNNClassifier_passage as a script and cached the results in the logs in the required format. Generating the plotROC results from the logs gives the following.

In [2]:
import os,sys
Passage_Lib = os.path.join(os.getenv('HOME'),'anaconda2', 'lib','python2.7','site-packages');
sys.path.append(Passage_Lib)

In [5]:
from RNNClassifier_passage import TrainTestRNN

plotROC(TrainTestRNN(trainData, testData),"RNN with GRU over 10 epochs");

Loading data ...
Training ...
Epoch 0 Seen 16391 samples Avg cost 0.6914 Time elapsed 16697 seconds
Epoch 1 Seen 32782 samples Avg cost 0.6775 Time elapsed 50671 seconds
Epoch 2 Seen 49173 samples Avg cost 0.6791 Time elapsed 67278 seconds
Epoch 3 Seen 65564 samples Avg cost 0.5879 Time elapsed 88099 seconds
Epoch 4 Seen 81955 samples Avg cost 0.4913 Time elapsed 104853 seconds
Epoch 5 Seen 98346 samples Avg cost 0.3822 Time elapsed 132006 seconds
Epoch 6 Seen 114737 samples Avg cost 0.3186 Time elapsed 148636 seconds
Epoch 7 Seen 131128 samples Avg cost 0.2729 Time elapsed 169448 seconds
Epoch 8 Seen 147519 samples Avg cost 0.2515 Time elapsed 188591 seconds
Epoch 9 Seen 163910 samples Avg cost 0.2296 Time elapsed 205305 seconds
Prediction error on training set
Training Prediction Accuracy 0.934865671642
Predicting ...
ROC under curve for bag of Words:
0.901849520062
 Algorithm used :RNN with GRU over 10 epochs


As seen so far, the Recurrent Neural Network has the best area under the ROC curve (0.902) of the individual models evaluated.

## Ensemble Training 

In [38]:

print "Using the results of the various models identified so far--"
for filename in os.listdir('logs'):
    if filename.endswith('csv'):
        print(filename) ;

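# NOTE: every CSV in logs/ gets a vote, including stale or misnamed files
# and the output of a previous majority-voting run.
results = [pd.read_csv(os.path.join('logs',filename)) for filename in os.listdir('logs') if filename.endswith('csv')];

# Sum the predictions row by row (assumes all files share the same row order)
sum_result = np.zeros(len(results[0].sentiment))
for result in results:
    sum_result += np.array(result.sentiment);

# Predict positive when at least half of the models voted positive
majority_voting_result = sum_result >=(len(results)*1.0/2)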

majority_result = pd.DataFrame( data=\
                           {"id":results[0]["id"], \
                            "sentiment": majority_voting_result,\
                            "expected_sentiment": \
                           results[0]["expected_sentiment"]} );

plotROC(majority_result,"Majority voting results")

Using the results of the various models identified so far--
Majority voting results.csv
Word vector with averaging.csv
Paragraph Vector Resut_dbow : .csv
Paragraph Vector Result_dm : .csv
Paragraph Vector Result_dbow : .csv
Paragraph Vector Resut_dm : .csv
RNN with GRU over 10 epochs.csv
Bag of Words.csv
Paragraph Vector SVM classification.csv
Word vector with bag of centroids.csv
ROC under curve for bag of Words:
0.897815191982
 Algorithm used :Majority voting results



I tried a simple majority-vote ensemble (similar in spirit to bagging). The intuition was that boosting or ensemble learning techniques should improve the overall accuracy of the system. However, simple majority voting did not beat the best single model: it reached an AUC of 0.898, against 0.902 for the RNN.

From what I have seen in lectures (about boosting and ensemble learning) and learnt from the Kaggle discussion forums, I do believe ensemble learning, if applied correctly, should give a better result. For example, take three classifiers that are each correct with probability 0.7 and that make their errors independently. Even with simple majority voting, the probability of error is 0.3 x 0.3 x 0.3 + 3 x 0.3 x 0.3 x 0.7 = 0.216, which is lower than the individual error of 0.3.
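
The arithmetic above generalises to any odd number of models as a binomial sum. The sketch below assumes the models err independently, which real models trained on the same data rarely do, so it is an optimistic bound rather than a prediction; `majority_error` is a hypothetical helper.

```python
from math import factorial

def n_choose_k(n, k):
    """Binomial coefficient C(n, k)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def majority_error(p_correct, n_models):
    """P(majority vote is wrong) for n independent models, n odd."""
    p_wrong = 1.0 - p_correct
    min_wrong = n_models // 2 + 1       # wrong votes needed to lose the vote
    return sum(
        n_choose_k(n_models, k) * p_wrong ** k * p_correct ** (n_models - k)
        for k in range(min_wrong, n_models + 1)
    )

error_3 = majority_error(0.7, 3)   # 0.216, matching the calculation above
error_5 = majority_error(0.7, 5)   # 0.16308
```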

The simple approach adopted here might have worked better if the individual techniques had produced closer results. A better approach might be to combine the models with weighted averaging.
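
A weighted average could look roughly like the sketch below. This was not run as part of this notebook; `weighted_vote`, the toy predictions, and the weights are all hypothetical. In practice the weights would have to come from held-out validation performance, not from the test set.

```python
def weighted_vote(predictions, weights):
    """Weighted average of 0/1 predictions from several models.

    predictions: one list of 0/1 labels per model, all in the same row order
    weights:     one non-negative weight per model
    Returns one score in [0, 1] per row.
    """
    total_weight = float(sum(weights))
    n_rows = len(predictions[0])
    scores = []
    for row in range(n_rows):
        score = sum(w * preds[row] for preds, w in zip(predictions, weights))
        scores.append(score / total_weight)
    return scores

# Toy predictions from three models on four rows (made up for illustration).
model_preds = [[1, 0, 1, 0],
               [0, 1, 1, 0],
               [0, 1, 0, 0]]
model_weights = [3.0, 1.0, 1.0]         # hypothetical: trust the first model most

scores = weighted_vote(model_preds, model_weights)   # [0.6, 0.4, 0.8, 0.0]
labels = [1 if s >= 0.5 else 0 for s in scores]      # [1, 0, 1, 0]
```

In this toy case the strong model overrides the other two on the first two rows, which plain majority voting would not allow. The continuous scores could also be passed to the AUC computation directly instead of the thresholded labels.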

However, since that is beyond the scope of the original proposal, I have marked it as a task to pursue separately to avoid scope creep.

# Improvements



### Higher dimensions for word vector, paragraph vector
Naively increase the dimensions considered for the problem (i.e. the number of words considered).

### Better Ensembles
Be more selective about which models are combined, and take the accuracy of each model into account.

### RNN models using word vectors
Apart from reducing the time taken to train, feeding word vectors into the RNN should also capture semantic relations between words.

### RNTN based model for sentiment classification
The Stanford tree parser method has shown promising results, but the setup effort looked too expensive for an optional exercise.