### Rotten Tomatoes Sentiment Analysis 
#### Patrick Huston and James Jang

This notebook explores a revised model for the Kaggle Rotten Tomatoes sentiment analysis competition. Building on what we learned from our exploration and first-iteration model, we hope to improve our techniques and better validate our modeling choices in pursuit of a higher score.

In [24]:
import pandas as pd
import re
import math
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
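# sklearn.cross_validation was removed in scikit-learn 0.20; fall back to it only on older versions
try:
    from sklearn import model_selection as cross_validation
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn import cross_validation
    from sklearn.cross_validation import train_test_split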
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

from scipy import sparse

%matplotlib inline

### TODO

1. Write an easily replicable testing/validation technique so we can quickly verify new decisions/choices
2. Write more analysis on why a technique is working better/worse
3. Documentation as we go - write more markdown cells
4. Specific cleaning exploration
    - Unigram vs. bigram
    - Negations
5. Models
    - Logistic regression
    - SVM
    - Tuning, tuning, tuning!
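
For the unigram vs. bigram item above, a minimal sketch of what the two feature sets look like, in plain Python. The helper `ngrams` and the example phrase are hypothetical, introduced only for illustration; in scikit-learn the corresponding knob is the `ngram_range` parameter of `CountVectorizer` and `TfidfVectorizer`.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not good at all".split()
unigrams = ngrams(tokens, 1)   # ['not', 'good', 'at', 'all']
bigrams = ngrams(tokens, 2)    # ['not good', 'good at', 'at all']
```

Unigrams alone contain 'good' but lose the fact that a negation precedes it. Bigrams keep some local word order, at the cost of a much larger vocabulary.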
    

### Data Cleaning Techniques/Creating Features

In natural language processing, the main 'feature' that models use is the text itself, and there are several ways to extract numerical values from text. Additionally, there are cleaning steps that can improve the representation of the text fed into the model.

One of the first cleaning techniques we tried was removing all punctuation and converting all words to lower case, to normalize the dataset a bit. However, capitalization and punctuation can sometimes affect the sentiment of a sentence, so this technique might not always be the best choice.
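
A small sketch of this normalization and of the information it can discard. The function name `normalize` and the sample phrases are made up for illustration; the regex mirrors the `\w+` pattern used in the cleaning functions further down.

```python
import re

def normalize(phrase):
    """Keep only runs of word characters and lowercase everything."""
    return ' '.join(re.findall(r'\w+', phrase)).lower()

a = normalize("GREAT movie!!!")
b = normalize("Great movie?")
c = normalize("isn't")
# a and b both collapse to 'great movie', so the emphasis and the doubt are gone.
# c becomes 'isn t': the apostrophe is not a word character, so contractions split.
```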

Another cleaning technique we used was removing stopwords. Stopwords are very common words that carry little meaning on their own. English words like 'the', 'and', and 'of' are typical stopwords that contribute little to the overall sentiment of a sentence.
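
As an illustration, stopword removal is just a filter against a fixed word list. The tiny `STOPWORDS` set below is a hypothetical stand-in so the example runs without downloading anything. NLTK's `stopwords.words('english')` list is much longer and also contains negations such as 'not', which is one reason stopword removal can hurt on sentiment tasks.

```python
STOPWORDS = {'the', 'and', 'of', 'a', 'is', 'not'}  # hypothetical, for illustration only

def remove_stopwords(phrase, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return ' '.join(w for w in phrase.lower().split() if w not in stopwords)

kept = remove_stopwords("The plot is not good")
# 'plot good': the negation was filtered out along with the filler words
```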

Next, we tried two further normalization techniques: Porter stemming and lemmatization. Porter stemming removes common morphological and inflectional endings from English words. It does this with a fixed set of suffix-rewriting rules and no dictionary knowledge of the language. Lemmatization, on the other hand, uses an English dictionary (WordNet, in our case) to reduce each word to its base form according to its part of speech. Unfortunately, to work well, lemmatization needs every word tagged with its part of speech, which is an additional data processing step that, in the end, offered no real improvement in accuracy. For this reason, we decided to stick with the simpler algorithm, Porter stemming.
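
To make the contrast concrete, here is a toy sketch. It is neither the Porter algorithm nor the WordNet lemmatizer: `toy_stem` applies three made-up suffix rules, and `TOY_LEMMAS` is a hypothetical four-entry dictionary keyed by word and part of speech. It only shows why rule-based stripping needs no language knowledge, while dictionary lookup depends on a part-of-speech tag.

```python
def toy_stem(word):
    """Strip the first matching suffix; purely rule-based, no dictionary."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lemma dictionary: the same surface form can map to different
# lemmas depending on its part of speech ('v' = verb, 'n' = noun, 'a' = adjective).
TOY_LEMMAS = {
    ('meeting', 'v'): 'meet',
    ('meeting', 'n'): 'meeting',
    ('better', 'a'): 'good',
    ('was', 'v'): 'be',
}

def toy_lemmatize(word, pos):
    """Look the (word, pos) pair up; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

stems = [toy_stem(w) for w in ['meeting', 'acted', 'movies', 'was']]
# ['meet', 'act', 'movie', 'was']
lemmas = [toy_lemmatize('meeting', 'n'), toy_lemmatize('was', 'v')]
# ['meeting', 'be']
```

The stemmer cannot tell the noun 'meeting' from the verb form and cannot relate 'was' to 'be'; the lookup can, but only when it is given the right tag.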

In [2]:
# Load in the dataset
train = pd.read_csv("data/train.tsv", sep='\t')
test = pd.read_csv("data/test.tsv", sep='\t')

In [5]:
# negations = ['no', 'never', 'not']

def clean_phrase_simple(phrase):
    # Grab only words and lower them
    # (re.LOCALE dropped: it conflicts with re.UNICODE and is rejected for str patterns on Python 3)
    clean_str = re.findall(r'\w+', phrase, flags=re.UNICODE)
    return ' '.join(clean_str).lower()

porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

def clean_phrase_porter(phrase):
    # Grab only words, Porter-stem each one, then lowercase
    clean_str = re.findall(r'\w+', phrase, flags=re.UNICODE)
    stemmed = [porter_stemmer.stem(word) for word in clean_str]
    return ' '.join(stemmed).lower()
    
# I tried something with negations here - didn't seem to offer any real improvement
    
#     for i, word in enumerate(meaningful_words):
#         if word in negations or word.endswith('n\'t'):
#             try:
#                 meaningful_words[i+1] = "!" + meaningful_words[i+1]
#             except:
#                 pass
#             try:
#                 meaningful_words[i-1] = "!" + meaningful_words[i-1]
#             except:
#                 pass      
#     return(" ".join( meaningful_words))   

def clean_phrase_lemmatizer(phrase):
    # Keep letters only, lowercase, drop stopwords, then lemmatize what is left
    letters_only = re.sub("[^a-zA-Z]", " ", phrase)
    lower_case = letters_only.lower()

    words = lower_case.split()
    stops = set(stopwords.words("english"))
    meaningful_words = [wordnet_lemmatizer.lemmatize(w) for w in words if w not in stops]
    return " ".join(meaningful_words)
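
A cleaned-up sketch of the negation idea from the commented-out block above, written as a standalone helper. `mark_negations` and the `NOT_` prefix are illustrative choices, not what the experiment above used. One caveat worth checking: the default `token_pattern` of `TfidfVectorizer` only keeps runs of word characters, so a `!` prefix would be discarded at vectorization time, whereas an underscore-based prefix survives.

```python
NEGATIONS = {'no', 'never', 'not'}

def mark_negations(words, prefix='NOT_'):
    """Prefix the token that follows each negation word."""
    marked = list(words)
    for i, word in enumerate(words):
        is_negation = word in NEGATIONS or word.endswith("n't")
        # Bounds check instead of try/except: nothing to mark after the last token
        if is_negation and i + 1 < len(words):
            marked[i + 1] = prefix + marked[i + 1]
    return marked

result = mark_negations("this is not funny".split())
# ['this', 'is', 'not', 'NOT_funny']
```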

In [6]:
def apply_transform(data):
    # Adds the cleaned columns to `data` in place
    data['CleanPhrase'] = data['Phrase'].apply(clean_phrase_porter)  # Porter-stemmed
    data['CleanPhraseSimple'] = data['Phrase'].apply(clean_phrase_simple)  # words only, lowercased

In [7]:
apply_transform(train)
apply_transform(test)

In [8]:
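def add_word_length(X, data):
    # Append the number of words in each phrase as one extra feature column
    # (a list comprehension instead of map() so this also works on Python 3)
    num_words_feature = np.asarray([len(phrase.split()) for phrase in data.Phrase])
    num_words_feature = num_words_feature[:, np.newaxis]
    return sparse.hstack((X, num_words_feature))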

In [25]:
# Cross-validates the model within the training set using 'cv' folds (default 3)
def cross_validate(model, X, y, cv=3):
    return cross_validation.cross_val_score(model, X, y, cv=cv).mean()
 
# Performs train-test split on data, trains on train, tests on test, returns score
def train_test_splitter(model, X, y, train_size=0.5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Trains the model on the whole training set, predicts on the test set, and writes a Kaggle submission file
# (relies on the global `test` DataFrame for the PhraseId column)
def train_submit(model, X_train, y_train, X_test, filename = "submission.csv"):
    print "fitting"
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    output = pd.DataFrame( data={"PhraseId":test["PhraseId"], "Sentiment":prediction} )

    # Use pandas to write the comma-separated output file (quoting=3 is csv.QUOTE_NONE)
    output.to_csv(filename, index=False, quoting=3 )
    print "done"
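
To illustrate what `cv=3` does in `cross_validate`, here is a plain-Python sketch of k-fold index splitting. `k_fold_indices` is a hypothetical helper, and it is simpler than what scikit-learn does: for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(7, k=3))
# Every sample lands in exactly one test fold; fold sizes here are 3, 2, 2.
```

The model is fit once per fold on the training indices and scored on the held-out ones; the `.mean()` in `cross_validate` averages those k scores.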

### Creating Models

Here we set up the models we compare: multinomial logistic regression, a random forest, multinomial naive Bayes, and a linear SVM. We'll describe what they are, why we chose them, and their strengths, weaknesses, and tunability.

In [51]:
vectorizer = TfidfVectorizer()
vectorizer.fit(train.Phrase)

vectorizer_clean = TfidfVectorizer()
vectorizer_clean.fit(train.CleanPhrase)

X = vectorizer.transform(train.Phrase)
X_cleaned = vectorizer_clean.transform(train.CleanPhrase)

# Transform the test set with the vectorizer that matches each training matrix
X_test = vectorizer.transform(test.Phrase)
X_test_cleaned = vectorizer_clean.transform(test.CleanPhrase)
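
For intuition about what `TfidfVectorizer` produces, a hand-rolled sketch of TF-IDF on a three-phrase toy corpus. The formula follows what we understand to be scikit-learn's defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row), but treat it as an approximation for illustration: tokenization here is a plain `split()`, and the corpus and the function name `tfidf` are made up.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, L2-normalized."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(["good movie", "bad movie", "good good plot"])
# In the first phrase 'good' and 'movie' each appear in two of the three
# documents, so they get equal weight. In the second, the rarer 'bad'
# outweighs 'movie'.
```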

In [52]:
logistic = LogisticRegression(multi_class='multinomial', solver='newton-cg')
random = RandomForestClassifier()
multinomial = MultinomialNB()
SVM = svm.LinearSVC(penalty = 'l2', dual = False, tol = 1e-3)

models = {'Logistic': logistic, 'RandomForest': random, 'Multinomial' : multinomial, 'SVM': SVM}

In [53]:
# Iterates over all models and prints each one's train_test_splitter score
def test_models(models, X):
    for modelName, model in models.iteritems():
        print modelName
        print train_test_splitter(model, X, train.Sentiment, train_size=0.5)

# test one specific model with train_test_splitter
def test_model(model, X):
    return train_test_splitter(model, X, train.Sentiment, train_size=0.5)        

In [55]:
print "------ Not Cleaned ------"
test_models(models, X)
print "------ Cleaned ------"
test_models(models, X_cleaned)

------ Not Cleaned ------
Multinomial
0.574996796104
RandomForest
0.606625656799
SVM
0.631205946431
Logistic
0.625477380495
------ Cleaned ------
Multinomial
0.572190183263
RandomForest
0.60553633218
SVM
0.625861848007
Logistic
0.624490580546


In [41]:
X_with_word_length = add_word_length(X, train)
X_test = add_word_length(X_test, test)
# test_models(models, X_with_word_length)

In [38]:
logisticTune = LogisticRegressionCV(Cs=[math.e**v for v in range(-5,5)],
                             multi_class='multinomial',
                             solver='newton-cg')
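
The `Cs` grid above is a log-spaced sweep of the inverse regularization strength (smaller `C` means stronger regularization). A quick sketch of the values it contains; note that `range(-5, 5)` stops at 4, so the largest candidate is `e**4`, not `e**5`.

```python
import math

Cs = [math.e ** v for v in range(-5, 5)]
smallest, largest = Cs[0], Cs[-1]
# 10 candidates, from e**-5 (about 0.0067) up to e**4 (about 54.6);
# each one is e times the previous, so the grid is evenly spaced in log(C).
```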

In [60]:
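def compare_model_improvment(model1, model2, X):
    # Note: each test_model call draws its own random train/test split,
    # so small score differences include some split-to-split noise
    first = test_model(model1, X)
    second = test_model(model2, X)
    print "model 1", first
    print "model 2", second
    print "difference in the score", second - first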

In [61]:
compare_model_improvment(logistic, logisticTune, X)

model 1 0.625541458413
model 2 0.637908496732
difference in the score 0.0123670383186
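
Because `train_test_split` is called without a fixed `random_state`, the two scores above come from different random splits, so it helps to compare the observed difference against the sampling noise of an accuracy estimate. A rough sketch using the binomial standard error follows. The test-set size `n_test` is a placeholder, not the real number of held-out phrases, and the independence assumption is optimistic for this dataset, since many phrases are fragments of the same sentence.

```python
import math

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1.0 - acc) / n)

def diff_in_std_errors(acc1, acc2, n):
    """Difference between two accuracies, in units of its standard error.

    Treats the two estimates as independent, which is only an approximation.
    """
    se = math.sqrt(accuracy_std_error(acc1, n) ** 2 + accuracy_std_error(acc2, n) ** 2)
    return (acc2 - acc1) / se

n_test = 10000  # placeholder test-set size; substitute the real one
z = diff_in_std_errors(0.6255, 0.6379, n_test)
```

A value well above 2 suggests the gain is larger than split noise. Rerunning both models with the same fixed `random_state` in `train_test_split` is a simpler check.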


In [50]:
train_submit(logisticTune, X_with_word_length, train.Sentiment, X_test, filename = "submission.csv")

fitting
done


In [14]:
import pickle

# Pickle files should be opened in binary mode; the with-block closes the file for us
with open('rotten_tomatoes_train.pickle', 'rb') as f:
    X = pickle.load(f)
    y = pickle.load(f)