# Bag of Words with LDA

Latent Dirichlet Allocation (LDA) is a topic model: it groups words that tend to occur in the same documents into a small number of topics, and it describes each document as a mixture of those topics. Replacing thousands of word counts with a handful of topic weights reduces the dimensionality of our data. For example, the words **movie**, **film**, and **show** might end up in one topic that we could call **MOVIE_related**. Collapsing related words into one feature instead of many greatly shrinks the feature space. Hopefully this will improve our learner's score.

In [11]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

# Load the data
train = pd.read_csv('../../data/train.tsv', sep='\t')
test = pd.read_csv('../../data/test.tsv', sep='\t')
train.shape

(156060, 4)

## Count Vectorization

Now that the data is loaded, we can vectorize it with `CountVectorizer`. This gives us a matrix of word counts, with one row per phrase and one column per word.
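
To make the shape of that matrix concrete, here is a minimal sketch on three made-up phrases (the `toy_*` names and the phrases are hypothetical and not part of the dataset). It uses the default tokenizer, which drops one-letter tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy phrases, not taken from the dataset
toy_phrases = [
    "a good movie",
    "a good good film",
    "a dull film",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_phrases)

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)

# Each row is a phrase, each column a word, each cell a count
print(toy_counts.toarray())
```

The second phrase gets a 2 in the "good" column, which is all a bag-of-words representation records: how often each word occurs, not where.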

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

subset = train

# Create the vectorizer.
# We ignore common English words, words that appear in fewer than 2 phrases,
# and words that appear in more than 95% of phrases.
vectorizer = CountVectorizer(stop_words='english', min_df=2, max_df=0.95)
X = vectorizer.fit_transform(subset.Phrase)
print(X.shape)

(156060, 14933)


## LDA Dimension Reduction

Now we can put LDA to use and reduce dimensionality.
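
Before running LDA on the full matrix, it may help to see what it returns on a tiny input. The sketch below uses a made-up random count matrix (`toy_X` and the other `toy_*` names are hypothetical) and assumes scikit-learn 0.19 or later, where the number of topics is set with `n_components`; older releases call this parameter `n_topics`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 6 documents x 8 words, small random counts
rng = np.random.RandomState(0)
toy_X = rng.randint(0, 5, size=(6, 8))

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
toy_L = toy_lda.fit_transform(toy_X)

# One row per document, one column per topic
print(toy_L.shape)

# Each row is that document's mixture over topics, so it sums to 1
print(toy_L.sum(axis=1))
```

Each document goes from 8 word counts to 3 topic weights, which is the same reduction we apply to the real data at a larger scale.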

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

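# n_topics was renamed to n_components in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=10)

t0 = time.time()
# LDA is unsupervised, so the sentiment labels are not needed here
L = lda.fit_transform(X)
print(L.shape)
print("dt: %f" % (time.time() - t0))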

(156060, 10)
dt: 239.463410


In [None]:
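def print_top_words(model, feature_names, n_top_words):
    # For each topic, print its n_top_words highest-weighted words
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
print_top_words(lda, vectorizer.get_feature_names_out(), 50)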

Topic #0:
movie story new love hollywood seen cinema entertaining kids dark star feels ca history compelling sweet actually ultimately fascinating culture tv amusing gives romance minute simply ending middle project likely despite particularly eye honest budget getting pace moment reality kid issues mood certainly game comedies york dry air tell terrific
Topic #1:
does way director world character drama performance thriller special half care hour feature nearly modern screenplay clever narrative camera title debut live french instead probably leave fine believe viewers impossible room production quality michael writing city final pop animation process boring ii spy turns latest acted mean mess tension fiction
Topic #2:
good little humor really sense performances fun kind end picture high video works comes women entertainment want dull pretty say cinematic filmmakers quirky goes school summer worst ways head surprisingly laugh solid act reason stuff left concept satisfying attention cut
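
The slice `topic.argsort()[:-n_top_words - 1:-1]` used to pick these words is compact but easy to misread. Here is a small sketch of what it does, using a made-up five-word topic (the `toy_*` names and weights are hypothetical).

```python
import numpy as np

# Hypothetical topic-word weights for a five-word vocabulary
toy_words = np.array(["film", "dull", "good", "plot", "fun"])
toy_topic = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
n_top = 3

# argsort returns indices in ascending order of weight; the slice walks
# backwards from the end and stops after n_top items, so the indices of
# the largest weights come first
top_idx = toy_topic.argsort()[:-n_top - 1:-1]
print(top_idx)
print(toy_words[top_idx])
```

The result lists "plot", "dull", and "fun", the three highest-weighted words in descending order of weight.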

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
# sklearn.cross_validation was replaced by sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=500, max_depth=5)
boost = AdaBoostClassifier()
svc = SVC()

# Cross-validate each classifier on the topic features and report mean accuracy
t0 = time.time()
forest_score = cross_val_score(forest, L, subset.Sentiment).mean()
print("Random Forest: %2.2f" % forest_score)
print("dt: %f" % (time.time() - t0))

t0 = time.time()
boost_score = cross_val_score(boost, L, subset.Sentiment).mean()
print("AdaBoost:      %2.2f" % boost_score)
print("dt: %f" % (time.time() - t0))

t0 = time.time()
svc_score = cross_val_score(svc, L, subset.Sentiment).mean()
print("SVC:           %2.2f" % svc_score)
print("dt: %f" % (time.time() - t0))

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

# Vectorize the training phrases, keeping at most 2000 words
subset = train
vectorizer = CountVectorizer(stop_words='english', min_df=2, max_df=0.95, max_features=2000)
X = vectorizer.fit_transform(subset.Phrase)
print(X.shape)

# Reduce the word counts to 10 topic weights per phrase
# (n_topics was renamed to n_components in scikit-learn 0.19)
lda = LatentDirichletAllocation(n_components=10)

t0 = time.time()
L = lda.fit_transform(X)
print(L.shape)
print("dt: %f" % (time.time() - t0))

# Train the random forest on the topic features
forest = RandomForestClassifier(n_estimators=500, max_depth=5)
forest.fit(L, subset.Sentiment)

# Apply the fitted vectorizer and LDA to the test phrases
subset = test
X = vectorizer.transform(subset.Phrase)
print(X.shape)

t0 = time.time()
L = lda.transform(X)
print(L.shape)
print("dt: %f" % (time.time() - t0))
y = forest.predict(L)

# Write the predictions in submission format
df = pd.DataFrame({
    'PhraseId': test.PhraseId,
    'Sentiment': y
})

df.to_csv('results.csv', index=False)

(156060, 2000)
(156060, 10)
dt: 104.314473
(66292, 2000)
(66292, 10)
dt: 1.638128
