## Bag of Words

In this notebook, I'll be attempting to predict movie review sentiments using a bag of words model. This model does not look at the context of words in reviews, only their presence and frequency. Hopefully, classifying reviews based on features corresponding to the frequency of each word will yield somewhat accurate results.
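To make "presence and frequency, not context" concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example phrases are made up for illustration and are not part of the dataset or of scikit-learn. It shows that two phrases with opposite meanings can collapse to the same bag.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two made-up phrases with the same words in a different order
a = bag_of_words("good not bad")
b = bag_of_words("bad not good")

# Word order is discarded, so both phrases produce identical counts
same = (a == b)
print(a)
print(same)
```

This is the main limitation to keep in mind when reading the scores later in the notebook: any signal carried by word order is invisible to the model.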

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
train = pd.read_csv('../../data/train.tsv', sep='\t')
test = pd.read_csv('../../data/test.tsv', sep='\t')
train.shape



(156060, 4)

### Count Vectorization

Our first attempt at feature extraction will be using the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from [scikit-learn](http://scikit-learn.org/stable/index.html). This tool will look at each review and count the number of times each word occurs. The number of times a word appears in a review will hopefully correlate with the sentiment of the review.
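As a rough illustration of what the vectorizer produces, the sketch below runs `CountVectorizer` with its default settings on three made-up phrases (not taken from the dataset). With the default tokenizer, single-character tokens such as "a" are dropped, and the columns follow the alphabetically sorted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up phrases, used only for illustration
docs = ["a good movie", "not a good movie", "good good fun"]

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()

# Recover the column order from the fitted term -> column index mapping
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(vocab)
print(counts)
```

Each row is one phrase and each column is one word, so the third row records "good" twice and "fun" once.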

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

subset = train[:200]

# Create the vectorizer
# Ignore common English words and keep at most 2000 unique words
vectorizer = CountVectorizer(stop_words='english', max_features=2000)
X = vectorizer.fit_transform(subset.Phrase).toarray()

### Learning

Now that we've extracted our features, we can learn from them.

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

forest = RandomForestClassifier(n_estimators=500)
boost = AdaBoostClassifier()
svc = SVC()

# cross_val_score lives in sklearn.model_selection in current scikit-learn
# releases; the older sklearn.cross_validation module has been removed
from sklearn.model_selection import cross_val_score
import time

# Report the mean cross-validated accuracy and wall-clock time per classifier
t0 = time.time()
forest_score = cross_val_score(forest, X, subset.Sentiment).mean()
print("Random Forest: %2.2f" % forest_score)
print("dt: %f" % (time.time() - t0))

t0 = time.time()
boost_score = cross_val_score(boost, X, subset.Sentiment).mean()
print("AdaBoost:      %2.2f" % boost_score)
print("dt: %f" % (time.time() - t0))

t0 = time.time()
svc_score = cross_val_score(svc, X, subset.Sentiment).mean()
print("SVC:           %2.2f" % svc_score)
print("dt: %f" % (time.time() - t0))

Random Forest: 0.61
dt: 2.040056
AdaBoost:      0.66
dt: 0.184423
SVC:           0.66
dt: 0.012639


### TF-IDF Vectorization

Vectorizing with raw word counts alone did not perform very well. By incorporating the inverse document frequency, we might better evaluate the significance of words in each review. This works by weighing how often a word occurs within a review against the number of reviews that contain it. If every review contains a certain word, then that word probably carries little significance, and the TF-IDF weighting reflects that by scaling it down. We will use the [TF-IDF Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from [scikit-learn](http://scikit-learn.org/stable/index.html).
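The effect can be seen on a small made-up corpus. The sketch below assumes `TfidfVectorizer` with its default settings, which, as far as I know, use a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row. The three phrases are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up phrases: "movie" and "was" appear in every one of them
docs = ["movie was great", "movie was dull", "movie was great fun"]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()

# Map each term to its learned inverse document frequency
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
idf = dict(zip(terms, vec.idf_))

for term in terms:
    print("%-6s idf=%.3f" % (term, idf[term]))

# Within the second phrase, the rare word outweighs the ubiquitous one
dull_weight = weights[1][vec.vocabulary_["dull"]]
movie_weight = weights[1][vec.vocabulary_["movie"]]
print(dull_weight > movie_weight)
```

Words that appear in every phrase receive the smallest IDF, while a word unique to one phrase receives the largest, which is the behaviour described above.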

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

subset = train[:200]

# Create the vectorizer
# Ignore common English words (no cap on the vocabulary size this time)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(subset.Phrase).toarray()

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

forest = RandomForestClassifier(n_estimators=500)
boost = AdaBoostClassifier()
svc = SVC()

# cross_val_score lives in sklearn.model_selection in current scikit-learn
# releases; the older sklearn.cross_validation module has been removed
from sklearn.model_selection import cross_val_score
import time

# Report the mean cross-validated accuracy and wall-clock time per classifier
t0 = time.time()
forest_score = cross_val_score(forest, X, subset.Sentiment).mean()
print("Random Forest: %2.2f" % forest_score)
print("dt: %f" % (time.time() - t0))

t0 = time.time()
boost_score = cross_val_score(boost, X, subset.Sentiment).mean()
print("AdaBoost:      %2.2f" % boost_score)
print("dt: %f" % (time.time() - t0))

t0 = time.time()
svc_score = cross_val_score(svc, X, subset.Sentiment).mean()
print("SVC:           %2.2f" % svc_score)
print("dt: %f" % (time.time() - t0))

Random Forest: 0.61
dt: 2.058766
AdaBoost:      0.62
dt: 0.191025
SVC:           0.66
dt: 0.012861
