## Movie Review Classification

![This is getting exciting](https://i.kinja-img.com/gawker-media/image/upload/s--hIgTSFEs--/c_fit,fl_progressive,q_80,w_320/17j2zn73qxdlfgif.jpg)

Using everything we have learned so far, we will now combine our techniques to perform some basic classification! We'll use the NLTK movie reviews data set to classify reviews as positive or negative. Here's some code to get you started:

In [1]:
# NLP libraries
import nltk
from nltk.corpus import stopwords, movie_reviews as reviews
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Misc
import re
import string
import numpy as np
import pandas as pd
import time

# Preprocessing
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split

# Models
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Model evaluation
from sklearn import metrics

In [2]:
X = [reviews.raw(fileid) for fileid in reviews.fileids()]
y = [reviews.categories(fileid)[0] for fileid in reviews.fileids()]

# Recode positive reviews as 1 and negative as 0
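
The recoding step above is left blank; the later cells handle it with `LabelBinarizer`. As a minimal sketch, assuming the labels are the strings `'pos'` and `'neg'`, a list comprehension does the same job. The toy `labels` list and the name `y_binary` are illustrative, and the result is kept in a separate variable so that later cells that filter on `'pos'`/`'neg'` still work.

```python
# Toy stand-in for the `y` list built above (hypothetical values)
labels = ['neg', 'pos', 'pos', 'neg']

# Recode positive reviews as 1 and negative as 0
y_binary = [1 if label == 'pos' else 0 for label in labels]
print(y_binary)
```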


1 - Print a positive and negative review:

In [3]:
import numpy as np

def print_first_review(rev_type, reviews, sentiment):
    rev_array = np.array(reviews)
    sent_array = np.array(sentiment)
    print(rev_array[sent_array == rev_type][0])

print('A positive review:\n')
print_first_review('pos', X, y)

print('A negative review:\n')
print_first_review('neg', X, y)

A positive review:

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct t

2 - Using the scikit-learn train_test_split function (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), split the data into a training set and a test set.

In [4]:
# Random state: np.random.seed() returns None, so keep the integer itself
seed = 10
np.random.seed(seed)

# Split data

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = .2,
                                                    random_state = seed)
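
To see why a fixed integer seed matters, here is a minimal pure-Python sketch of a seeded 80/20 shuffle split. It is not scikit-learn's implementation; `simple_split` is a hypothetical helper used only to illustrate that the same seed reproduces the same partition.

```python
import random

def simple_split(items, test_size=0.2, seed=10):
    """Shuffle indices with a seeded RNG and hold out the first share as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# Hypothetical stand-in documents
docs = ['review %d' % i for i in range(10)]
train_a, test_a = simple_split(docs, seed=10)
train_b, test_b = simple_split(docs, seed=10)

print(len(train_a), len(test_a))
print(test_a == test_b)  # same seed, same held-out documents
```

In the notebook itself, passing an integer as `random_state` to `train_test_split` plays this role.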

3 - Then lemmatize or stem the reviews, and transform the documents to tf-idf.

In [5]:
# Stemmers/Lemmatizers
porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()

In [6]:
# Define a tokenizer, plus stemming/lemmatizing wrappers, that return the
# list of tokens (or stems) in the text they are passed. Tokens with no
# letters are dropped here; stop words are removed later by TfidfVectorizer.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def tokenize(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # keep only tokens that contain at least one letter (drops numeric tokens and raw punctuation)
    for token in tokens:
        # strip punctuation only for the letter check; the original token is what gets kept
        noPunc_token = regex.sub(u'', token)
        if re.search('[a-zA-Z]', noPunc_token):
            filtered_tokens.append(token)
    return filtered_tokens

def tokenize_and_porter(text):
    tokenized = tokenize(text)
    stems = [porter.stem(t) for t in tokenized]
    return stems

def tokenize_and_snowball(text):
    tokenized = tokenize(text)
    stems = [snowball.stem(t) for t in tokenized]
    return stems

def tokenize_and_wordnet(text):
    tokenized = tokenize(text)
    lemmas = [wordnet.lemmatize(t) for t in tokenized]
    return lemmas

def tokenize_only(text):
    return tokenize(text)
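
The letter-based filter in `tokenize` can be tried in isolation. The sketch below assumes a simple whitespace split in place of NLTK's tokenizers (so it runs without downloading any NLTK data); `keep_letter_tokens` is a hypothetical name.

```python
import re
import string

# Same punctuation-stripping pattern as in the cell above
punct_re = re.compile('[%s]' % re.escape(string.punctuation))

def keep_letter_tokens(text):
    """Split on whitespace and keep tokens that still contain a letter
    once punctuation is stripped."""
    kept = []
    for token in text.split():
        stripped = punct_re.sub('', token)
        if re.search('[a-zA-Z]', stripped):
            kept.append(token)  # the original token is kept, punctuation included
    return kept

sample = "a 12-part series , released in the mid '80s ( watchmen ) !"
print(keep_letter_tokens(sample))
```

Note that tokens such as `12-part` survive because they contain letters, while bare punctuation tokens are dropped.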

In [7]:
# Tokenizer to use
tokenizer = tokenize_and_porter

# Tokenization and Stemming/Lemmatization
tfidf_vectorizer = TfidfVectorizer(max_df=.8, min_df=.01,
                                   stop_words='english', 
                                   use_idf=True, ngram_range=(1, 3),
                                   tokenizer=tokenizer)

# Train and transform documents
vectorized_train_docs = tfidf_vectorizer.fit_transform(X_train)
vectorized_test_docs = tfidf_vectorizer.transform(X_test)

# Transform labels
lb = LabelBinarizer()
train_labels = lb.fit_transform(y_train)
test_labels = lb.transform(y_test)
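
To make the tf-idf transform concrete, here is a small worked computation in plain Python. It assumes scikit-learn's default `TfidfVectorizer` weighting (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and uses a hypothetical three-document corpus; it is a sketch of the weighting, not the library's code.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents
docs = [
    ['great', 'film', 'great', 'cast'],
    ['bad', 'film'],
    ['great', 'plot'],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))
# Smoothed idf: rarer terms get larger weights
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc):
    """Raw term counts times idf, scaled to unit L2 norm."""
    counts = Counter(doc)
    weights = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in docs:
    print({term: round(w, 3) for term, w in tfidf(doc).items()})
```

In the second document, `bad` ends up with a larger weight than `film` because it appears in fewer documents.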

4 - Finally, build a model. To start, use a logistic regression (which we will review in detail in the coming lectures) (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [8]:
clf = linear_model.LogisticRegressionCV(cv=5, random_state=seed)
clf.fit(vectorized_train_docs, train_labels.ravel())
preds = clf.predict(vectorized_test_docs)

5 - Measure the efficacy of your model using the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC). Report this metric on the test set of your data.

For more info on this, see: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py

In [9]:
aucScore_1 = metrics.roc_auc_score(test_labels, preds)
aucScore_1

0.82245153220762979
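
ROC AUC is a ranking metric: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The plain-Python sketch below, on made-up labels and scores, shows that feeding hard 0/1 predictions instead of continuous scores reduces the ROC curve to a single operating point and generally gives a different value. For a fitted scikit-learn classifier, continuous scores would come from `decision_function` or `predict_proba`, where the model provides them.

```python
def auc_from_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical test labels and classifier scores
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]
# Thresholded predictions, analogous to the output of clf.predict
hard = [1 if s >= 0.5 else 0 for s in scores]

print(auc_from_pairs(y_true, scores))  # AUC from continuous scores
print(auc_from_pairs(y_true, hard))    # AUC from thresholded predictions
```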

6 - Change a parameter in your model (introduce regularization) or change a parameter in your word vector transformation (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Try introducing the use of stop words, or employing a cutoff on terms with min or max df.

In [10]:
# Helper functions so that each vectorizer/model combination can be built and scored with sensible defaults

# Tokenization and Stemming/Lemmatization
def build_tfidfvectorizer(tokenizer, maxdf=.8, mindf=.01,
                          ngramRange=(1, 3)):
    return TfidfVectorizer(tokenizer=tokenizer, max_df=maxdf, 
                           min_df=mindf, stop_words='english', 
                           use_idf=True, ngram_range=ngramRange)


def buildModelAndGetAUCScore(tfidf_vectorizer, model):
    # Train and transform documents
    vectorized_train_docs = tfidf_vectorizer.fit_transform(X_train)
    vectorized_test_docs = tfidf_vectorizer.transform(X_test)

    # Transform labels
    lb = LabelBinarizer()
    train_labels = lb.fit_transform(y_train)
    test_labels = lb.transform(y_test)
    
    clf = model
    clf.fit(vectorized_train_docs, train_labels.ravel())
    preds = clf.predict(vectorized_test_docs)
    
    return metrics.roc_auc_score(test_labels, preds)

7 - Make four models in total, changing parameters and comparing the AUC results. Report your findings in tabular form.

Make dictionaries of the different tokenizers and models

In [11]:
tokenizers = {
    'tokenOnly': tokenize,
    'tokenAndPorter': tokenize_and_porter,
    'tokenAndSnowball': tokenize_and_snowball,
    'tokenAndWordnet': tokenize_and_wordnet
}

mods = {
    'logRegNorm': linear_model.LogisticRegression(random_state=seed),
    'logRegCV5Fold': linear_model.LogisticRegressionCV(cv=5, random_state=seed),
    'SVM': LinearSVC(random_state=seed),
    'randomForrest': RandomForestClassifier(max_depth=5, random_state=seed),
    # Note: scikit-learn 1.1 renamed loss='log' to 'log_loss'; 'log' was removed in 1.3
    'SGDLogistic': linear_model.SGDClassifier(loss='log', penalty='elasticnet', random_state=seed),
    'SGDsvm': linear_model.SGDClassifier(penalty='elasticnet', random_state=seed) 
}

Create a function to loop through all models and tokenizations

In [12]:
def runAll():
    res = []
    for tokenName, token in tokenizers.items():
        for modName, mod in mods.items():
            stTime = time.time()
            aucScore = buildModelAndGetAUCScore(build_tfidfvectorizer(token),
                                                mod)
            elapsedTime = time.time() - stTime
            res.append((tokenName, modName, aucScore, elapsedTime))
    return res            

In [13]:
results = runAll()



Print out final results

In [14]:
resCols = ['token_and_stem', 'model', 'auc_score', 'timeInSecs']
res = pd.DataFrame(results, columns=resCols)
res.sort_values(by = 'auc_score', ascending=False)

| | token_and_stem | model | auc_score | timeInSecs |
|---|---|---|---|---|
| 4 | tokenOnly | SGDLogistic | 0.848343 | 17.772624 |
| 1 | tokenOnly | logRegCV5Fold | 0.848093 | 18.751401 |
| 10 | tokenAndPorter | SGDLogistic | 0.845403 | 33.951274 |
| 2 | tokenOnly | SVM | 0.843215 | 17.399514 |
| 13 | tokenAndSnowball | logRegCV5Fold | 0.832833 | 29.38388 |
| 22 | tokenAndWordnet | SGDLogistic | 0.830519 | 21.118517 |
| 6 | tokenAndPorter | logRegNorm | 0.830269 | 33.681931 |
| 8 | tokenAndPorter | SVM | 0.830269 | 33.61962 |
| 12 | tokenAndSnowball | logRegNorm | 0.830144 | 27.714912 |
| 16 | tokenAndSnowball | SGDLogistic | 0.830019 | 27.504364 |
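
One way to read these results is to pull out the best-scoring model for each tokenizer. The sketch below copies the ten rows shown above into a plain list and groups them by hand; `best_per_tokenizer` is an illustrative name, and timings are left out for brevity.

```python
# (tokenizer, model, auc_score) rows copied from the results shown above
rows = [
    ('tokenOnly', 'SGDLogistic', 0.848343),
    ('tokenOnly', 'logRegCV5Fold', 0.848093),
    ('tokenAndPorter', 'SGDLogistic', 0.845403),
    ('tokenOnly', 'SVM', 0.843215),
    ('tokenAndSnowball', 'logRegCV5Fold', 0.832833),
    ('tokenAndWordnet', 'SGDLogistic', 0.830519),
    ('tokenAndPorter', 'logRegNorm', 0.830269),
    ('tokenAndPorter', 'SVM', 0.830269),
    ('tokenAndSnowball', 'logRegNorm', 0.830144),
    ('tokenAndSnowball', 'SGDLogistic', 0.830019),
]

# Keep the highest-AUC (model, score) pair seen for each tokenizer
best_per_tokenizer = {}
for tokenizer_name, model_name, auc in rows:
    current = best_per_tokenizer.get(tokenizer_name)
    if current is None or auc > current[1]:
        best_per_tokenizer[tokenizer_name] = (model_name, auc)

for tokenizer_name, (model_name, auc) in best_per_tokenizer.items():
    print(tokenizer_name, model_name, auc)
```

Only the ten rows shown are included, so combinations missing from the output are not represented in this summary.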
