## Rotten Tomatoes Model Iteration 1
#### Patrick Huston and James Jang

This notebook aims to make a first pass at producing a model capable of predicting sentiment on given phrases taken from movie reviews on Rotten Tomatoes. 

First, imports! We'll be attempting to tackle this problem from a wide array of angles.

In [17]:
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

from sklearn.naive_bayes import MultinomialNB
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

%matplotlib inline

Next, we'll import the data.

In [28]:
train = pd.read_csv("data/train.tsv", sep= '\t')
test = pd.read_csv("data/test.tsv", sep= '\t')
# unlabeled = del train["Sentiment"]

### Attempt 1

As a very basic first pass, we'll try and create a model using the features we built in our data exploration notebook. While some of these features did seem to have positive correlations with the data, we don't have high hopes for the performance of this model - this mainly serves as a point of reference.

We'll start by dropping in the helper functions we created in the data exploration to facilitate the creation of our new features. These new features are:

1. The number of words in a given phrase -- We noticed a strong correlation between the number of words and the standard deviation of the sentiment. While this wouldn't help us classify long phrases, it gives a pretty good indication that short phrases will most likely be of sentiment score 2. 

2. The length of the phrase in total -- This will be included for the same reason that the number of words is being used. Including both probably won't do much, but hey, let's try it anyway and see what happens.

3. The average word length in a given phrase -- In our exploration, we came across some strange trends in the data relating to the average word length and its effect on the sentiment of a given phrase. We're not sure what predictive power this feature will end up having, but we'll give it a try nonetheless.

4. Whether the phrase contains one or more of the most positvely correlated words in the corpus -- Pretty self-explanatory. This seems like it could help, but there are also negations and other patterns of language that could diminish this feature's predictive power.

5. Whether the phrase contains one or more of the most negatively correlated words in the corpus -- same reasoning as above.

In [7]:
def clean_phrase(phrase):
    letters_only = re.sub("[^a-zA-Z]", " ", phrase)
    lower_case = letters_only.lower()
    
    words = lower_case.split()
    stops = set(stopwords.words("english")) 
    meaningful_words = [w for w in words if not w in stops]
    return(" ".join( meaningful_words))   

def num_words(phrase):
    return len(phrase.split())

def length_phrase(phrase):
    return len(phrase)

def avg_word_length(phrase):
    if(phrase != ''):
        return sum(map(len, phrase.split()))/len(phrase.split())
    else:
        return 0

most_positive = ['remarkable', 'brilliant', 'terrific', 'excellent', 'finest', 'extraordinary', 'masterful', 
                 'hilarious', 'beautiful', 'wonderful', 'breathtaking', 'powerful', 'wonderfully', 'delightful', 
                 'masterfully', 'fantastic', 'dazzling', 'funniest', 'interference', 'refreshing']
most_negative = ['worst', 'failure', 'lacks', 'waste', 'bore', 'depressing', 'lacking', 'stupid', 'disappointment', 
                 'unfunny', 'lame', 'devoid', 'trash', 'lousy', 'junk', 'poorly', 'mess', 'sleep', 'unappealing', 'fails']

def contains_positive(phrase):
    for word in phrase.split():
        if word in most_positive:
            return 1 
    return 0
        
def contains_negative(phrase):
    for word in phrase.split():
        if word in most_negative:
            return 1
    return 0

For even further convenience, we've wrapped the functionality of our feature creation functions into one helper function that does everything. 

In [8]:
def apply_transform(data):
    data['CleanPhrase'] = data['Phrase'].apply(clean_phrase)
    data['NumWords'] = data['CleanPhrase'].apply(num_words)
    data['LengthPhrase'] = data['CleanPhrase'].apply(length_phrase)
    data['AvgWordLength'] = data['CleanPhrase'].apply(avg_word_length)
    data['ContainPositive'] = data['CleanPhrase'].apply(contains_positive)
    data['ContainNegative'] = data['CleanPhrase'].apply(contains_negative)

Let's apply our transformation functions to the training and testing datasets now.

In [9]:
apply_transform(train)
apply_transform(test)

Now that the data has been cleaned and features have been created, let's try running some models on the dataset. We'll try a couple of different options - first a logistic regression, and then perhaps a random forest.

In [11]:
predictors = ["ContainPositive", "ContainNegative", "NumWords", "LengthPhrase", "AvgWordLength"]
# predictors = ["ContainPositive", "ContainNegative"]
logisticReg = LogisticRegression(random_state=1)
randomForest = RandomForestClassifier(random_state=1, n_estimators=1000, min_samples_split=8, min_samples_leaf=4)
mean_score_logistic = cross_validation.cross_val_score(logisticReg, train[predictors], train["Sentiment"], cv=3).mean()
mean_score_forest = cross_validation.cross_val_score(randomForest, train[predictors], train["Sentiment"], cv=3).mean()

print "Logistic regression mean score: {}".format(mean_score_logistic)
print "Random forest mean score: {}".format(mean_score_forest)

Logistic regression mean score: 0.525015974421
Random forest mean score: 0.519191245532


Around the 50% mark isn't far from where we expected such a simple model would fall. Clearly our heroic efforts at initial attempts of creating numerical features from text data haven't gotten us very far. We're going to need to call in a bigger, badder, model.

### Attempt 2: Enter bag of words/tfidf/count vectorizer! 

We started by doing some research on how scikit-learn can deal with text data, and after some poking around, we stumbled up on a [Tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) that goes over the process of combining the powers of [tfidf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and a supervised learning model, like some of the models mentionted in the scikit-learn [Multiclass and Multilabel](http://scikit-learn.org/stable/modules/multiclass.html) docs. 

As suggested by the scikit-learn docs mentioned above, we started with a MultinomailNB naive Bayes model --

```
Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant.
```

Additionally, we implemented this all using the awesome scikit-learn Pipeline class that behaves like a compound classifier. 

Essentially what it does is this ~ vectorizer => transformer => classifier

In [23]:
pipelineBayes = Pipeline([('vect', CountVectorizer()),
                          ('tfidf', TfidfTransformer()),
                          ('clf', MultinomialNB()),])

We also created another pipeline at the same time to try out a different training model - the [OneVsOneClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html#sklearn.multiclass.OneVsOneClassifier) strategy.

In [24]:
pipelineOneVOne = Pipeline([('vect', CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000)),
                            ('tfidf', TfidfTransformer()),
                            ('clf', OneVsOneClassifier(LinearSVC())),])

Now that we have our pipelines constructed, let's get some idea to see if all of our work has paid off - some simple scikit-learn cross validation experiments.

In [29]:
predictors = "Phrase"

bayes_mean = cross_validation.cross_val_score(pipelineBayes, train[predictors], train["Sentiment"], cv=3).mean()
onevone_mean = cross_validation.cross_val_score(pipelineOneVOne, train[predictors], train["Sentiment"], cv=3).mean()

print "Mean score for Bayes model: {}".format(bayes_mean)
print "Mean score for OnevOne Model: {}".format(onevone_mean)

Mean score for Bayes model: 0.553844611375
Mean score for OnevOne Model: 0.592489915764


Not bad! Nearly 60% is a marked improvement over previous models, and with some additional tuning, we were able to get up to ~62% on a Kaggle submission. Speaking of Kaggle submissions, let's create some.

In [30]:
pipelineBayes = pipelineBayes.fit(train.Phrase, train.Sentiment)
pipelineOneVOne = pipelineOneVOne.fit(train.Phrase, train.Sentiment)

predictionBayes = pipelineBayes.predict(test.Phrase)
predictionOnevOne = pipelineOneVOne.predict(test.Phrase)

When submitted to Kaggle, these submissions receive scores of around 57% and 60%. We have more work to do, but this feels like a pretty good benchmark to start from.

In [13]:
output = pd.DataFrame( data={"PhraseId":test["PhraseId"], "Sentiment":predictionOnevOne} )

# Use pandas to write the comma-separated output file
output.to_csv("submission.csv", index=False, quoting=3 )

## Future Iterations

In future iterations, we aim to explore some of the following methods more.

1. Training Word2Vec on a larger dataset
2. Implement a porter-stemmer algorithm on the data
3. Utilize the fact that the test data does include the SentenceId - we're currently analyzing only based on the raw text, and including
4. See if we can figure out how to include negations 
5. Look around at existing APIs - like pattern - that could help us improve our submission