### Rotten Tomatoes Sentiment Analysis 
#### Patrick Huston and James Jang

This notebook aims to explore a revised model for the sentiment analysis Kaggle Rotten Tomatoes competition. Taking what we've learned from our exploration and first iteration model, we hope to improve our techniques and validation of modeling choices in the pursuit of a higher score.

In [2]:
import pandas as pd
import re
import math
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn import cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

from scipy import sparse

%matplotlib inline



### TODO

1. Write an easily replicable testing/validation technique so we can quickly verify new decisions/choices
2. Write more analysis on why a technique is working better/worse
3. Documentation as we go - write more markdown cells
4. Specific cleaning exploration
    - Unigram vs. bigram
    - Negations
5. Models
    - Logistic regression
    - SVM
    - Tuning, tuning, tuning!
    

### Data Cleaning Techniques/Creating Features

In natural language processing, the main 'feature' a model uses is the text itself, and there are several ways to extract numerical values from text. Additionally, there are cleaning steps that can improve the representation of the text fed into the model.

One of the first cleaning techniques we tried was removing all punctuation and converting all words to lower case, which normalizes the dataset somewhat. However, capitalization and punctuation can sometimes affect the sentiment of a sentence, so this cleaning technique might not always be the best.

Another cleaning technique we used was removing stopwords. Stopwords are very common words that carry little meaning on their own. Words like 'the', 'and', and 'of' are common English stopwords that contribute little to the overall sentiment of a sentence.
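As a rough illustration of the two cleaning steps described so far, the sketch below lowercases a phrase, drops punctuation, and filters out stopwords. The `STOPWORDS` set and the `normalize` helper are hypothetical and deliberately tiny; the notebook itself relies on NLTK's English stopword list.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
STOPWORDS = {"the", "and", "of", "a", "is"}

def normalize(phrase, remove_stopwords=True):
    """Lowercase a phrase, keep only word characters, optionally drop stopwords."""
    words = re.findall(r"\w+", phrase.lower())
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return " ".join(words)

print(normalize("The plot of the film is GREAT!"))
```

Note that a fuller stopword list may also contain negations such as 'not', which is one reason stopword removal can hurt rather than help in sentiment tasks.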

Next, we implemented some additional cleaning techniques to further normalize the data - Porter stemming and lemmatization. Porter stemming removes common morphological and inflectional endings from English words. It does this with a fixed set of suffix-stripping rules rather than any inherent knowledge of the English language. Lemmatization, on the other hand, uses a dictionary (here, WordNet) to break words down more intelligently based on part of speech. Unfortunately, lemmatization works best when every word is tagged with its part of speech, which is an additional data processing step that, in the end, offered no real improvement in accuracy. For this reason, we decided to stick with the simpler algorithm, Porter stemming.
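To give a feel for what "rules without knowledge of the language" means, here is a toy suffix stripper. It is only a sketch and is NOT the Porter algorithm, which applies several ordered rule phases with extra conditions; the `SUFFIXES` list and `toy_stem` name are made up for this example.

```python
# Ordered suffix rules for a toy stemmer (an illustration only, NOT the Porter algorithm)
SUFFIXES = ["ing", "ed", "ly", "es", "s"]

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["jumped", "cats", "glasses", "is"]])
```

Because the rules never consult a dictionary, the result is not guaranteed to be a real word, which is the main trade-off against lemmatization.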

In [3]:
# Load in the dataset
train = pd.read_csv("data/train.tsv", sep= '\t')
test = pd.read_csv("data/test.tsv", sep= '\t')

In [4]:
# negations = ['no', 'never', 'not']

def clean_phrase_simple(phrase):
    # Grab only word characters (dropping punctuation) and lowercase them
    clean_str = re.findall(r'\w+', phrase, flags=re.UNICODE)
    return ' '.join(clean_str).lower()

porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

def clean_phrase_porter(phrase):
    # Grab only word characters, Porter-stem each word, then lowercase the result
    clean_str = re.findall(r'\w+', phrase, flags=re.UNICODE)
    stemmed = [porter_stemmer.stem(word) for word in clean_str]
    return ' '.join(stemmed).lower()
    
# I tried something with negations here - didn't seem to offer any real improvement
    
#     for i, word in enumerate(meaningful_words):
#         if word in negations or word.endswith('n\'t'):
#             try:
#                 meaningful_words[i+1] = "!" + meaningful_words[i+1]
#             except:
#                 pass
#             try:
#                 meaningful_words[i-1] = "!" + meaningful_words[i-1]
#             except:
#                 pass      
#     return(" ".join( meaningful_words))   

# Build the stopword set once rather than on every call
stops = set(stopwords.words("english"))

def clean_phrase_lemmatizer(phrase):
    # Keep letters only, lowercase, drop stopwords, then lemmatize what remains
    letters_only = re.sub("[^a-zA-Z]", " ", phrase)
    words = letters_only.lower().split()
    meaningful_words = [wordnet_lemmatizer.lemmatize(w) for w in words if w not in stops]
    return " ".join(meaningful_words)
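The commented-out negation experiment in the cell above can be sketched as a standalone helper. This is a minimal version under our own assumptions: it marks only the word after a negation, and it uses a hypothetical `NOT_` prefix. The names `NEGATIONS` and `mark_negations` are illustrative.

```python
NEGATIONS = {"no", "never", "not"}

def mark_negations(words, prefix="NOT_"):
    """Prefix the word that follows a negation so it becomes a distinct token."""
    marked = list(words)
    for i, word in enumerate(words):
        is_negation = word in NEGATIONS or word.endswith("n't")
        # Guard the index explicitly instead of relying on try/except
        if is_negation and i + 1 < len(words):
            marked[i + 1] = prefix + marked[i + 1]
    return marked

print(mark_negations("this is not good".split()))
```

One design note: a prefix made of word characters (like `NOT_`) survives the default `TfidfVectorizer` token pattern, whereas a punctuation prefix such as `!` would be stripped during tokenization. That may be one reason the original experiment showed no improvement, though we have not verified this.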

In [5]:
def apply_transform(data):
    data['CleanPhrase'] = data['Phrase'].apply(clean_phrase_porter)
    data['CleanPhraseSimple'] = data['Phrase'].apply(clean_phrase_simple)

In [6]:
apply_transform(train)
apply_transform(test)

### Creating Numerical Features from Text Data

Now that we've explored some different methods of preprocessing the text, it's time to get into the machine learning. This involves representing our text-based data numerically and then fitting a model to that representation.

One such way of representing text as numerical features is TFIDF - term frequency-inverse document frequency. Term frequency counts the number of times each token appears in a given document. Inverse document frequency is an additional weighting step that accounts for how often each token appears across the overall corpus: it diminishes the weight of terms that occur in many documents and increases the weight of terms that occur rarely.

To compute the TFIDF for our data, we use scikit-learn's TfidfVectorizer, a convenient tool that takes care of the details. In reality, however, TFIDF is relatively simple to implement.
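To back up the claim that TFIDF is simple to implement, here is a minimal pure-Python sketch of the textbook weighting `tf * log(N / df)`. It is an assumption-laden illustration: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ. The `tfidf` helper and the toy `docs` are made up for this example.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each token
    df = Counter(token for tokens in tokenized for token in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return weights

docs = ["a good movie", "a bad movie", "a good good script"]
for doc, w in zip(docs, tfidf(docs)):
    print(doc, w)
```

In this toy corpus the token "a" appears in every document, so its weight is zero, while the rarer "script" outweighs the repeated "good".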

In [98]:
def vectorize(train, test, column):
    vectorizer = TfidfVectorizer()
    vectorizer.fit(train[column])
    
    X = vectorizer.transform(train[column])
    X_test = vectorizer.transform(test[column])
    
    return X, X_test


X, X_test = vectorize(train, test, 'Phrase')

X_cleaned, X_test_clean = vectorize(train, test, 'CleanPhraseSimple')

### Creating Models

To make it easier to test a bunch of models, we've defined some convenience methods that iterate over a dictionary of models that we create. Through this process, we'll be able to compare each model's performance against the others and make an informed decision.

Here are all of the models we'll be trying:

1. Logistic Regression
    - We'll start with a simple multi-class logistic regression model. A common mistake in machine learning, especially in natural language processing, is to ignore the simpler models in favor of the sexier new techniques - the likes of deep learning and neural networks. If nothing else, this will serve as a good benchmark for the rest of our models.

2. Logistic Regression - Tuned
    - Moving on from the logistic regression benchmark, let's also create a tuned version, which is given a list of penalty weights to choose from. This will give us a better idea of how much tuning can help.
    
3. Random Forest Model
    - Random forests can be great - it mostly depends on what the data looks like. In this case, we're a little hesitant about their ability to model the data, as the relationship between the features and the sentiment seems closer to linear than to the irregular, non-linear structure that a set of decision trees could model more accurately.
    
4. Multinomial Naive Bayes
    - In a lot of text analysis, it is somewhat common practice to start with a Naive Bayes model as a good baseline benchmark. Additionally, it is useful when there are limited resources in terms of CPU and Memory - it can be trained very quickly. 
    
5. Support Vector Machine
    - We also chose to test a linear implementation of scikit-learn's Support Vector Machine (SVM). SVM models are generally known to be good choices in cases of very high dimensionality, and the TFIDF vectors we have created fit the bill.

In [25]:
# Logistic regression
logistic = LogisticRegression(multi_class='multinomial', solver='newton-cg')

# Tuned logistic regression
logisticTune = LogisticRegressionCV(Cs=[math.e**v for v in range(-5,5)],
                                    multi_class='multinomial',
                                    solver='newton-cg')

# Random forest model
random = RandomForestClassifier()

# Multinomial Naive Bayes
multinomial = MultinomialNB()

# Support Vector Machine Linear SVC
SVM = svm.LinearSVC(penalty = 'l2', dual = False, tol = 1e-3)

# Compile dictionary of models for later use
models = {'Logistic': logistic, 'TunedLogistic': logisticTune, 'RandomForest': random, 'Multinomial' : multinomial, 'SVM': SVM}
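For reference, the `Cs` grid passed to `LogisticRegressionCV` above is exponentially spaced. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean a stronger penalty. The snippet below simply lists the candidates; it is an illustration, not part of the pipeline.

```python
import math

# Same expression as in the cell above: e^-5, e^-4, ..., e^4
Cs = [math.e ** v for v in range(-5, 5)]

for v, c in zip(range(-5, 5), Cs):
    print("e^%d = %.4f" % (v, c))
```

The grid spans roughly four orders of magnitude, which is a reasonable first sweep before narrowing in on the best region.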

Before we get to the testing, let's define some convenience methods that we can use to break down the tasks involved in getting some validation. 

In [120]:
# Cross-validates model within the training set using 'cv' folds - default value of 3
def cross_validate(model, X, y, cv=3):
    return cross_validation.cross_val_score(model, X, y, cv=cv).mean()

# Performs train-test split on data, fits the model on the train portion,
# and returns the four splits along with the fitted model
def train_test_splitter(model, X, y, train_size=0.5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size)
    model.fit(X_train, y_train)
    return X_train, X_test, y_train, y_test, model

# Splits the data, fits one specific model, and returns its accuracy on the held-out portion
def test_model(model, X, y=train.Sentiment):
    X_train, X_test, y_train, y_test, model = train_test_splitter(model, X, y, train_size=0.5)
    return model.score(X_test, y_test)

# Iterates over all the models and prints each one's held-out accuracy
def test_models(models, X, y=train.Sentiment):
    for modelName, model in models.items():
        print(modelName)
        print(test_model(model, X, y))

# Trains the model on the whole training set, predicts on the test set,
# and writes a submission file for Kaggle
def train_submit(model, X_train, y_train, X_test, filename="submission.csv"):
    print("fitting")
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    output = pd.DataFrame(data={"PhraseId": test["PhraseId"], "Sentiment": prediction})

    # Use pandas to write the comma-separated output file (quoting=3 is csv.QUOTE_NONE)
    output.to_csv(filename, index=False, quoting=3)
    print("done")
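One caveat with the helpers above is that each call draws a fresh random split, so small score differences between runs can be noise. A seeded split makes comparisons repeatable. The sketch below shows the idea with the standard library only; `holdout_indices` and `accuracy` are hypothetical helpers, and in the notebook the same effect should be available by passing `random_state` to `train_test_split`.

```python
from random import Random

def holdout_indices(n_samples, train_size=0.5, seed=0):
    """Split range(n_samples) into shuffled train/test index lists, reproducibly."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle independent of global state
    Random(seed).shuffle(indices)
    cut = int(n_samples * train_size)
    return indices[:cut], indices[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

train_idx, test_idx = holdout_indices(10, train_size=0.5, seed=42)
print(train_idx, test_idx)
```

With a fixed seed, two models are scored on exactly the same held-out rows, so a difference in accuracy reflects the models rather than the split.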

### Testing Models

Now that we have a good sampling of models defined and some good functions ready to do some intelligent testing, let's get right to it! All we have to do at this point is call our `test_models` function, and wait while the results come in.

In [12]:
print "------ No Preprocessing ------"
test_models(models, X)
print "-------- Preprocessed --------"
test_models(models, X_cleaned)

------ Not Cleaned ------
Multinomial
0.574394463668
RandomForest
0.606100217865
SVM
0.62891195694
Logistic
0.623926694861
------ Cleaned ------
Multinomial
0.574727668845
RandomForest
0.610085864411
SVM
0.629719338716
Logistic
0.624144559785


### Back to the Exploration

Back in our data exploration, we noticed a strong positive correlation between the length of the phrase and the standard deviation of the sentiment. In other words, a high proportion of short phrases ended up with a neutral sentiment score of 2, while longer phrases were much more spread out - scoring many more 0s, 1s, 3s, and 4s. For this reason, let's look into whether adding the number of words as a feature improves the model's accuracy. Below, let's define an `add_word_length` function that will add an extra column to the TFIDF feature vector for the model to learn from.

In [28]:
def add_word_length(X, data):
    # Number of whitespace-separated words in each phrase, appended as an extra feature column
    num_words_feature = np.asarray([len(x.split()) for x in data.Phrase])
    num_words_feature = num_words_feature[:, np.newaxis]
    return sparse.hstack((X, num_words_feature))
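The length/spread relationship that motivates this feature can be checked with a small grouping helper. The sketch below uses made-up phrases and labels purely for illustration; `sentiment_spread_by_length` is a hypothetical name, and on the real data the two arguments would be `train.Phrase` and `train.Sentiment`.

```python
from collections import defaultdict
from statistics import pstdev

def sentiment_spread_by_length(phrases, sentiments):
    """Map each word count to the standard deviation of its sentiment labels."""
    groups = defaultdict(list)
    for phrase, sentiment in zip(phrases, sentiments):
        groups[len(phrase.split())].append(sentiment)
    return {n: pstdev(values) for n, values in sorted(groups.items())}

# Toy data (an assumption, not taken from the competition set)
phrases = ["good", "film", "a truly awful film", "a truly wonderful film"]
sentiments = [2, 2, 0, 4]
print(sentiment_spread_by_length(phrases, sentiments))
```

If the spread grows with the word count on the real training data, that supports giving the model the length as an explicit feature.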

Now, let's put this function to use and do some validation.

In [29]:
X_with_word_length = add_word_length(X, train)
X_test = add_word_length(X_test, test)
# test_models(models, X_with_word_length)

In [15]:
def compare_model_improvment(model1, model2, X):
    first = test_model(model1, X)
    second = test_model(model2, X)
    print "model 1", first
    print "model 2", second
    print "difference in the score", second - first

In [16]:
compare_model_improvment(logistic, logisticTune, X)

model 1 0.625797770088
model 2 0.637959759067
difference in the score 0.0121619889786


In [17]:
train_submit(logisticTune, X_with_word_length, train.Sentiment, X_test, filename = "submission.csv")

fitting
done




In [105]:
def add_phrase_length(X, data):
    # Number of characters in each phrase, appended as an extra feature column
    num_chars_feature = np.asarray([len(x) for x in data.Phrase])
    num_chars_feature = num_chars_feature[:, np.newaxis]
    return sparse.hstack((X, num_chars_feature))

In [110]:
X_add_phrase = add_phrase_length(X, train)
X_add_two = add_word_length(X_add_phrase, train)
# X_add_phrase_t is defined in cell In [108], which was executed before this one
X_add_two_t = add_word_length(X_add_phrase_t, test)

In [116]:
print X_add_phrase
print "-----"
print X_add_two
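# This call raises the ValueError shown below: X_add_two_t holds the test set's 66292 rows,
# while test_models scores against train.Sentiment (156060 rows). X_add_two is the matching matrix.
test_models(models, X_add_two_t)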

  (0, 14888)	0.287019277845
  (0, 14871)	0.135441541297
  (0, 13681)	0.0761528502645
  (0, 13505)	0.176900059578
  (0, 13503)	0.0898250803699
  (0, 12857)	0.127856375603
  (0, 12424)	0.138159296701
  (0, 11837)	0.176199420482
  (0, 9227)	0.270616837728
  (0, 9204)	0.193013325922
  (0, 9085)	0.189851541708
  (0, 8807)	0.135387954365
  (0, 7217)	0.175229216774
  (0, 5837)	0.228838071385
  (0, 5821)	0.262530286253
  (0, 5595)	0.265796263189
  (0, 5323)	0.20344769269
  (0, 4577)	0.278538658923
  (0, 3490)	0.248505909562
  (0, 1879)	0.110344377348
  (0, 602)	0.263418778638
  (0, 593)	0.220689028838
  (0, 529)	0.161438191432
  (0, 288)	0.2511340968
  (1, 14871)	0.223886118638
  :	:
  (156035, 15240)	34.0
  (156036, 15240)	31.0
  (156037, 15240)	15.0
  (156038, 15240)	15.0
  (156039, 15240)	137.0
  (156040, 15240)	128.0
  (156041, 15240)	126.0
  (156042, 15240)	23.0
  (156043, 15240)	21.0
  (156044, 15240)	102.0
  (156045, 15240)	97.0
  (156046, 15240)	8.0
  (156047, 15240)	88.0
  (156048, 15

ValueError: Found arrays with inconsistent numbers of samples: [ 66292 156060]

In [103]:
test_models(models, X_add_phrase)

RandomForest
0.605626041266
SVM
0.60420351147
TunedLogistic
0.643713956171
Logistic
0.631436626938


In [121]:
X_train, X_test, y_train, y_test, model = train_test_splitter(logisticTune, X, train.Sentiment)

In [122]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)

target_names = ["Negative", "Somewhat Negative", "Neutral", "Somewhat Positive", "Positive"]
print(classification_report(y_test, y_pred, target_names=target_names))

                   precision    recall  f1-score   support

         Negative       0.55      0.25      0.34      3572
Somewhat Negative       0.52      0.41      0.46     13563
          Neutral       0.69      0.85      0.76     39841
Somewhat Positive       0.56      0.48      0.52     16427
         Positive       0.57      0.31      0.41      4627

      avg / total       0.62      0.64      0.62     78030
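The per-class figures in a report like the one above follow directly from counts of true positives, false positives, and false negatives. Here is a minimal sketch of that arithmetic; `precision_recall_f1` and the toy labels are our own illustration, not scikit-learn's implementation.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Guard against empty denominators when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (an assumption for illustration)
toy_true = [2, 2, 2, 0, 4]
toy_pred = [2, 2, 0, 0, 2]
print(precision_recall_f1(toy_true, toy_pred, label=0))
```

Read this way, the recall of 0.25 for Negative in the report means that only about a quarter of the truly negative phrases were labeled Negative, while Neutral's high recall reflects how often the model falls back to the majority class.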



In [124]:
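# Collapse the five sentiment classes into three by folding the extremes
# into their neighbours: 0 -> 1 and 4 -> 3
def transform(y):
    three_class = []
    for pred in y:
        if pred == 0:
            three_class.append(1)
        elif pred == 4:
            three_class.append(3)
        else:
            three_class.append(pred)
    return three_class

three_class = transform(y_pred)
print(three_class)
# three_class_r = []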
            
        

[2, 1, 3, 1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 2, ...

In [108]:
X_add_phrase_t = add_phrase_length(X_test, test)
train_submit(logisticTune, X_add_phrase, train.Sentiment, X_add_phrase_t)

fitting
done


In [None]:
train_submit(logisticTune, X_add_two, train.Sentiment, X_add_two_t)

We decided to experiment with the Word2Vec features that Paul created, so we fed the matrix that he came up with into our pipeline.

In [70]:
import pickle
# The pickle holds the Word2Vec feature matrix X followed by the labels y
with open('rotten_tomatoes_train.pickle', 'rb') as f:
    X = pickle.load(f)
    y = pickle.load(f)

In [78]:
model = LogisticRegression(multi_class='multinomial', solver='newton-cg')
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [84]:
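# train_test_splitter returns the splits and the fitted model; score on the held-out half
X_tr, X_val, y_tr, y_val, fitted = train_test_splitter(LogisticRegression(multi_class='multinomial', solver='newton-cg'), X, y)
fitted.score(X_val, y_val)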

0.56043829296424452

In [91]:
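# add_word_length was written for the sparse TFIDF matrix; with the dense Word2Vec matrix X
# the stacking step likely fails, producing the ValueError shown below
X_with_word = add_word_length(X, train)
# test_models accepts the label vector as its third argument
test_models(models, X_with_word, y)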

ValueError: could not broadcast input array from shape (156060,300) into shape (156060)

In [96]:
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(max_depth=10)
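# Score the fitted tree on the held-out half
X_tr, X_val, y_tr, y_val, fitted = train_test_splitter(tree_model, X, y)
fitted.score(X_val, y_val)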

0.51481481481481484

In [88]:
models = {'Logistic': logistic, 'TunedLogistic': logisticTune, 'RandomForest': random, 'SVM': SVM}

In [89]:
# test_models accepts the label vector as its third argument
test_models(models, X, y)

RandomForest
0.570306292452
SVM
0.554235550429
TunedLogistic
0.557503524286
Logistic
0.559451493016
