# Modeling Models for Movie Reviews

This is a working notebook in which Casey explores several modeling techniques, including logistic regression, TF-IDF, and stochastic gradient descent.

In [43]:
import numpy as np 
import pandas 
from pattern.en import *
import thinkstats2
import thinkplot
import pattern
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.feature_extraction.text import TfidfVectorizer as TFIV


%matplotlib inline

In [44]:
import seaborn as sns
sns.set(color_codes=True)

In [45]:
train = pandas.read_csv('../train.tsv', sep = '\t') 
test = pandas.read_csv('../test.tsv', sep = '\t')

train.head(5)

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


# Training a model with Logistic Regression
This approach was inspired by this article: https://jessesw.com/NLP-Movie-Reviews/ 
Logistic Regression is a scikit-learn model; its documentation can be found here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html 

First, convert every word to lowercase and split each phrase into tokens.

In [47]:
def splitPhrases(text):
    """Lowercase text, replace selected punctuation with spaces, and split on whitespace."""
    punctuation = {".", "/", "\\", ","}
    text = text.lower()
    for mark in punctuation:
        text = text.replace(mark, " ")
    return text.split()

In [46]:
# Training labels: the sentiment value for each phrase
y_train = train['Sentiment']

Next, we lowercase and tokenize every phrase in both the training and test sets so that the text is ready for vectorization.

In [None]:
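# Normalize every training phrase and rejoin its tokens with single spaces.
traindata = []
for phrase in train['Phrase']:
    traindata.append(" ".join(splitPhrases(phrase)))

# Do the same for the test phrases, in a separate loop rather than nested inside the first.
testdata = []
for phrase in test['Phrase']:
    testdata.append(" ".join(splitPhrases(phrase)))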
# for i in xrange(0,len(train['Phrase'])):
#     print (" ".join(splitPhrases(train['Phrase'][i])))
#     traindata.append(" ".join(splitPhrases(train['Phrase'][i])))
#     testdata = []
#     for i in xrange(0,len(test['Phrase'])):
#         testdata.append(" ".join(splitPhrases(test['Phrase'][i])))
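
As a quick sanity check of the normalization step, here is a minimal sketch run on two made-up phrases. The helper name `split_phrases` and the sample strings are assumptions for illustration only; they are not part of the dataset or the notebook's own code.

```python
def split_phrases(text):
    """Lowercase text, replace a few punctuation marks with spaces, split on whitespace."""
    for mark in (".", "/", "\\", ","):
        text = text.replace(mark, " ")
    return text.lower().split()

# Hypothetical sample phrases, not taken from train.tsv.
sample_phrases = ["A series of escapades, demonstrating the adage.", "Good/bad"]

# Rejoin the tokens with single spaces, as the cell above does for the real data.
cleaned = [" ".join(split_phrases(p)) for p in sample_phrases]
print(cleaned)
# ['a series of escapades demonstrating the adage', 'good bad']
```

Rejoining the tokens into one string per phrase matters because the vectorizer used later expects a list of strings, not a list of token lists.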

In [None]:
tfv = TFIV(min_df=3, max_features=None,
           strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
           ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
           stop_words='english')

Next, we combine the training data and the test data so that the vectorizer is fit on the full vocabulary. 
Once `tfv` has been fit, it transforms the documents into a document-term matrix. 
We then separate that matrix back into the training set and the test set. 

Finally, we can apply Logistic Regression to the vectorized document-term matrix.
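
The combine, fit, transform, and split sequence described above can be sketched on a toy corpus. The phrases and the variable names (`toy_train`, `toy_test`, `vectorizer`) are made up for illustration, and the vectorizer uses only a subset of the settings from the notebook so that the tiny vocabulary is not filtered away.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for traindata and testdata.
toy_train = ["a good movie", "a bad movie", "good fun"]
toy_test = ["bad fun", "a good good movie"]

# Combine both sets so the vocabulary covers every document.
toy_all = toy_train + toy_test
n_train = len(toy_train)

vectorizer = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}",
                             ngram_range=(1, 2), sublinear_tf=True)
vectorizer.fit(toy_all)                  # learn vocabulary and IDF weights
matrix = vectorizer.transform(toy_all)   # sparse document-term matrix

# Slice the rows back into training and test parts.
X_toy = matrix[:n_train]
X_toy_test = matrix[n_train:]
print(X_toy.shape, X_toy_test.shape)
```

Each row is one phrase and each column is one unigram or bigram from the shared vocabulary, so both slices have the same number of columns.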

In [None]:
X_all = traindata + testdata # Combine both to fit the TFIDF vectorization.
lentrain = len(traindata)

tfv.fit(X_all) # This is the slow part!
X_all = tfv.transform(X_all)

X = X_all[:lentrain] # Separate back into training and test sets. 
X_test = X_all[lentrain:]

In [None]:
X.shape

In [None]:
grid_values = {'C': [30]}  # Hyperparameter values for the grid search to try.

# dual=True is only supported by the liblinear solver, and the penalty name must be lowercase.
model_LR = GridSearchCV(LR(penalty='l2', dual=True, solver='liblinear', random_state=0),
                        grid_values, scoring='roc_auc', cv=20)
# Try to match the scoring to what the contest asks for.
# The contest says scoring is area under the ROC curve, so use that.
# Note: 'roc_auc' expects binary labels; Sentiment has more than two classes.
# y_train = label_binarize(y_train, classes=[0, 1, 2, 3])

model_LR.fit(X, y_train)  # Fit the model.
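
Once a grid search has been fit, its `best_params_` and `best_score_` attributes report the winning setting and its mean cross-validated score. The sketch below shows this on synthetic binary data. The dataset from `make_classification`, the candidate `C` values, and the name `toy_search` are assumptions chosen for illustration, not results from the movie review data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem (an assumption; the real labels are multiclass).
X_demo, y_demo = make_classification(n_samples=60, n_features=10, random_state=0)

# Try a few regularization strengths, scored by area under the ROC curve.
toy_search = GridSearchCV(LogisticRegression(solver="liblinear", random_state=0),
                          {"C": [0.1, 1, 10]}, scoring="roc_auc", cv=3)
toy_search.fit(X_demo, y_demo)

print(toy_search.best_params_)  # the C value with the highest mean CV score
print(toy_search.best_score_)   # that mean score, between 0 and 1
```

With a single candidate value, as in `grid_values = {'C': [30]}`, the search reduces to plain cross-validation of that one setting.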