## Bag of Words

In this notebook, I'll attempt to predict movie review sentiment using a bag-of-words model. This model ignores the context in which words appear in a review and considers only their presence and frequency. Hopefully, classifying reviews on features that correspond to the frequency of each word will yield reasonably accurate results.
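
To make the "no context" point concrete, here is a minimal sketch of the idea in plain Python. The helper `bag_of_words` and the two toy reviews are hypothetical and only for illustration; the notebook itself relies on scikit-learn for the real feature extraction.

```python
from collections import Counter


def bag_of_words(text):
    """Map a review to its word counts, discarding word order."""
    return Counter(text.lower().split())


# Two toy reviews with opposite meanings but exactly the same words.
a = bag_of_words("good movie , not bad")
b = bag_of_words("bad movie , not good")

# The model cannot tell them apart, because only the counts survive.
print(a == b)

# Repeated words show up as higher counts.
print(bag_of_words("good good movie")["good"])
```

Both reviews collapse to the same counts, which is exactly the information this model gives up in exchange for simplicity.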

In [43]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
train = pd.read_csv('../../data/train.tsv', sep='\t')
test = pd.read_csv('../../data/test.tsv', sep='\t')

### Count Vectorization

Our first attempt at feature extraction will be using the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from [scikit-learn](http://scikit-learn.org/stable/index.html). This tool will look at each review and count the number of times each word occurs. The number of times a word appears in a review will hopefully correlate with the sentiment of the review.
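
As a quick illustration of what the vectorizer produces, here is a small sketch on two made-up reviews (the `toy_*` names and the sentences are assumptions for the example, not part of the dataset). It uses the same `stop_words` and `max_features` options that the next cell passes, just with tiny values.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, not taken from train.tsv.
toy_reviews = ["the movie was good good", "the movie was bad"]

# Drop common English words ("the", "was") before counting.
toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_reviews).toarray()

# Column order follows the learned vocabulary indices.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)    # ['bad', 'good', 'movie']
print(toy_counts)   # one row per review, one column per word

# max_features keeps only the most frequent words across all reviews.
capped = CountVectorizer(stop_words='english', max_features=2)
capped_counts = capped.fit_transform(toy_reviews).toarray()
capped_vocab = sorted(capped.vocabulary_, key=capped.vocabulary_.get)
print(capped_vocab)  # ['good', 'movie']
```

Each row is a review and each column is a word, so the first toy review becomes `[0, 2, 1]` for `bad`, `good`, `movie`. Capping the vocabulary drops the rarest word (`bad` here), which is the trade-off `max_features` makes on the real data as well.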

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

subset = train[:50000]

# Create the vectorizer
# Ignore common English words and keep at most 3000 unique words
vectorizer = CountVectorizer(stop_words='english', max_features=3000)
X = vectorizer.fit_transform(subset.Phrase).toarray()

### Learning

Now that we've extracted our features, we can train classifiers on them and estimate their accuracy with cross-validation.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

forest = RandomForestClassifier(n_estimators=500)
boost = AdaBoostClassifier()
svc = SVC()

from sklearn.model_selection import cross_val_score
import time

# t0 = time.time()
# forest_score = cross_val_score(forest, X, subset.Sentiment).mean()
# print("Random Forest: %2.2f" % forest_score)
# print("dt: %f" % (time.time() - t0))

# t0 = time.time()
# boost_score = cross_val_score(boost, X, subset.Sentiment).mean()
# print("AdaBoost:      %2.2f" % boost_score)
# print("dt: %f" % (time.time() - t0))

# Cross-validate the SVC on the same subset used to build X, and time it
t0 = time.time()
svc_score = cross_val_score(svc, X, subset.Sentiment).mean()
print("SVC:           %2.2f" % svc_score)
print("dt: %f" % (time.time() - t0))