# Natural Language Processing - Sentiment Analysis of Rotten Tomatoes quotes

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [3]:
# load data from rt_critics.csv (in the data folder of our DAT2 repo, at '../data/rt_critics.csv')
# the absolute path below is specific to one machine; adjust it to your own copy

url = '/Users/jennawhite/Documents/DS-SEA-4/data/rt_critics.csv'

rt = pd.read_csv(url)


In [4]:
# look at first 5 rows
rt.head(5)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story
2,David Ansen,fresh,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story
3,Leonard Klady,fresh,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story
4,Jonathan Rosenbaum,fresh,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story


In [5]:
# Check the shape of dataframe
rt.shape

(14072, 8)

In [6]:
# Fresh is the column with ratings.  Count the number of each value in column 'fresh'
rt.fresh.value_counts()

fresh     8613
rotten    5436
none        23
Name: fresh, dtype: int64

In [7]:
# vectorize the quotes and store the result in a variable named Xcv
vect = CountVectorizer()
Xcv = vect.fit_transform(rt['quote'])

In [8]:
# Check the shape of the document-term matrix Xcv (a sparse matrix, not a dataframe)
Xcv.shape

(14072, 21544)

But wait! We have more features (21,544) than samples (14,072), which makes overfitting very likely. Let's trim the vocabulary down to the top 5000 terms, ranked by term frequency across all documents.
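To make the effect of `max_features` concrete, here is a minimal sketch on a toy corpus (the three sentences are invented for illustration and are not from `rt_critics.csv`). It shows that the vectorizer keeps only the terms with the highest total count across all documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "a great great film",
    "a great cast",
    "a dull film",
]

# Full vocabulary (the default tokenizer drops one-letter tokens such as "a")
full = CountVectorizer().fit(docs)

# Keep only the 2 terms with the highest total count across the corpus
top2 = CountVectorizer(max_features=2).fit(docs)

full_vocab = sorted(full.vocabulary_)
top2_vocab = sorted(top2.vocabulary_)
print(full_vocab)
print(top2_vocab)
```

In this toy case "great" occurs three times and "film" twice, so those two survive while "cast" and "dull" are dropped.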

In [9]:
# Create a vectorizer object named vect2 that keeps only the top 5000 features
# Hint: check the documentation for CountVectorizer if needed

vect2 = CountVectorizer(max_features=5000)



In [10]:
# Create a new vectorized feature matrix named Xcv2 with the new vectorizer
Xcv2 = vect2.fit_transform(rt['quote'])
Xcv2.shape

(14072, 5000)

In [11]:
# Create the response vector y where the value is 1 if "fresh" and 0 if any other value than fresh
rt['y'] = np.where(rt.fresh =='fresh',1,0)
rt.head()

y=rt.y

In [12]:
# Determine the null accuracy
max(rt['y'].value_counts()/len(y))


0.61206651506537801
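Null accuracy is the accuracy you would get by always predicting the most frequent class. As a quick check, the sketch below recomputes it by hand from the `value_counts()` output shown earlier. It assumes, as in the response vector above, that 'rotten' and 'none' are both coded as 0.

```python
# Class counts copied from the value_counts() output above
counts = {"fresh": 8613, "rotten": 5436, "none": 23}

total = sum(counts.values())
n_fresh = counts["fresh"]          # coded as y = 1
n_not_fresh = total - n_fresh      # 'rotten' and 'none' are both coded as y = 0

# Accuracy of always predicting the majority class
null_accuracy = max(n_fresh, n_not_fresh) / float(total)
print(null_accuracy)
```

Any model worth keeping should beat this baseline of roughly 61%.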

In [13]:
# split the data into training and test sets



X_train, X_test, y_train, y_test = train_test_split(rt.quote,y,random_state=1)


In [14]:
# Evaluate model performance using the train/test split

from sklearn import metrics

# a large C means very weak regularization
log = LogisticRegression(C=1000)
# learn the vocabulary from the training quotes only, then apply it to the test quotes
X_train_dtm = vect2.fit_transform(X_train)
X_test_dtm = vect2.transform(X_test)

log.fit(X_train_dtm, y_train)

y_pred_class = log.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

0.696702671973


In [15]:
# Tune the logistic regression regularization parameter "C" to improve performance.
# Evaluate performance of models using the train/test split
from collections import defaultdict

# vectorize once, outside the loop: the document-term matrices do not depend on C
X_train_dtm = vect2.fit_transform(X_train)
X_test_dtm = vect2.transform(X_test)

# maps each candidate C to its test accuracy (named scores to avoid shadowing the built-in dict)
scores = defaultdict(int)

for i in range(1,3):  # only tries C=1 and C=2
    log = LogisticRegression(C=i)
    log.fit(X_train_dtm, y_train)

    y_pred = log.predict(X_test_dtm)
    scores[i] = metrics.accuracy_score(y_test,y_pred)


In [16]:
# report the C with the highest test accuracy
maximum = max(scores, key=scores.get)

print(maximum, scores[maximum])

(1, 0.75810119386014785)


In [17]:
#Bonus: Create a for loop to find the C value
# that produces the most accurate model 
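
One way to approach this bonus is to loop over C values spaced on a log scale rather than just 1 and 2. The following is a minimal sketch on invented toy quotes, not the Rotten Tomatoes data; the candidate grid and the names `quotes`, `labels` and `scores` are assumptions made for illustration. Note that picking C by test-set accuracy gives an optimistic estimate; cross-validation is the more careful option.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Toy data, invented for illustration only (1 = fresh, 0 = rotten)
quotes = [
    "a joyful and inventive comedy", "brilliant acting and a sharp script",
    "a moving, beautiful story", "funny, warm and charming",
    "an absolute delight to watch", "smart and thrilling throughout",
    "a wonderful cast shines", "fresh, clever and fun",
    "a dull and lifeless mess", "boring plot and wooden acting",
    "a tedious, predictable story", "painfully unfunny and flat",
    "an absolute waste of time", "clumsy and forgettable throughout",
    "a terrible cast flounders", "stale, lazy and dreary",
]
labels = [1] * 8 + [0] * 8

X_train, X_test, y_train, y_test = train_test_split(
    quotes, labels, test_size=0.25, random_state=1, stratify=labels)

# Vectorize once; the document-term matrices do not depend on C
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Candidate values on a log scale (an assumed grid, adjust as needed)
scores = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C)
    model.fit(X_train_dtm, y_train)
    scores[C] = metrics.accuracy_score(y_test, model.predict(X_test_dtm))

best_C = max(scores, key=scores.get)
print(best_C)
print(scores[best_C])
```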


# Stop Words

The performance isn't bad, but it's not great. Let's see if we can improve things by [using stop words](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer).
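
As a quick illustration of what `stop_words='english'` does, the sketch below compares the vocabulary learned with and without scikit-learn's built-in English stop word list. The two sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, invented for illustration only
docs = ["the film is a triumph", "this is not the worst film"]

plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words='english').fit(docs)

plain_vocab = sorted(plain.vocabulary_)
no_stop_vocab = sorted(no_stop.vocabulary_)
print(plain_vocab)
print(no_stop_vocab)
```

Common function words such as "the" disappear, while content words such as "film" remain. The built-in list also contains negations such as "not", which can matter for sentiment, so removing stop words is not guaranteed to help.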

In [32]:
# Modify your vectorizer to also remove stop words (still allowing only 5000 features)
vect3 = CountVectorizer(stop_words='english', max_features=5000)


In [28]:
# Create a new X called Xcvs


In [34]:
# split the raw quotes into training and test sets, then vectorize them with vect3

from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(rt.quote,y,random_state=1)


log=LogisticRegression(C=1)
# learn the vocabulary from the training quotes only, then apply it to the test quotes
X_train_dtm = vect3.fit_transform(X_train)
X_test_dtm = vect3.transform(X_test)

log.fit(X_train_dtm, y_train)

y_pred_class = log.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))


0.749005116543


In [37]:
# Evaluate performance of models using the test data
# Tune the regularization parameter, C, to improve performance.
from collections import defaultdict

# vectorize once, outside the loop: the document-term matrices do not depend on C
X_train_dtm = vect3.fit_transform(X_train)
X_test_dtm = vect3.transform(X_test)

# maps each candidate C to its test accuracy (named scores to avoid shadowing the built-in dict)
scores = defaultdict(int)

for i in range(1,3):  # only tries C=1 and C=2
    log = LogisticRegression(C=i)
    log.fit(X_train_dtm, y_train)

    y_pred = log.predict(X_test_dtm)
    scores[i] = metrics.accuracy_score(y_test,y_pred)


In [38]:
# Report the C with the highest test accuracy
maximum = max(scores, key=scores.get)

print(maximum, scores[maximum])

(1, 0.74900511654349067)


In [23]:
#Alternate tuning of C using for loop


# tf-idf

If that didn't work, how about using tf-idf weighting?

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
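
To see what tf-idf weighting changes, here is a minimal sketch on three invented documents. With `TfidfVectorizer`'s default settings, a word that appears in every document ("good") gets a lower weight than a word that appears in only one ("film"), and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only
docs = ["good film", "good cast", "good plot twist"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_

# Weights for the first document, "good film"
row0 = X[0].toarray().ravel()
weight_good = row0[vocab["good"]]   # appears in all three documents
weight_film = row0[vocab["film"]]   # appears in one document only

# Rows are L2-normalized by default
row0_norm = np.linalg.norm(row0)
print(weight_good)
print(weight_film)
```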

In [49]:
# edit this cell to create a TfidfVectorizer instead of a simple CountVectorizer
# or start with your own model with CountVectorizer from the cells above

#create vectorizer object
vectorizer = TfidfVectorizer(stop_words='english',max_features=5000)

#Create Xti and y
# note: fitting on every quote before splitting lets test-set text influence the vocabulary and idf weights
Xti = vectorizer.fit_transform(rt['quote'])
y = (rt['fresh'] == 'fresh').values.astype(np.int8)

#split the converted data into training and test sets
# (Xti must exist before it is split; no random_state is set, so the split changes on each run)
xtrainti, xtestti, ytrainti, ytestti = train_test_split(Xti, y)


In [50]:
# Evaluate performance of the new model
# (`log` is the LogisticRegression object left over from the tuning loop above)
log.fit(xtrainti,ytrainti)
y_pred_ti = log.predict(xtestti)

metrics.accuracy_score(ytestti,y_pred_ti)

0.75326890278567371

In [None]:
# Tune the regularization parameter, C, to improve performance.

# maps each candidate C to its test accuracy
scores = defaultdict(int)

for i in range(1,3):  # only tries C=1 and C=2
    log = LogisticRegression(C=i)
    # fit and score on the tf-idf split, not on the CountVectorizer matrices
    log.fit(xtrainti, ytrainti)

    y_pred = log.predict(xtestti)
    scores[i] = metrics.accuracy_score(ytestti,y_pred)

maximum = max(scores, key=scores.get)
print(maximum, scores[maximum])



In [None]:
#Bonus: if you have time find the best value of C using a for loop


# tf-idf and stop words

Does using both together help?

In [None]:
# edit this cell to create a TfidfVectorizer that uses stop words

# create vectorizer object
#vectorizer = CountVectorizer(max_features=5000)

# convert our documents and their labels into numpy arrays
#Xtis = vectorizer.fit_transform(rt['quote'])
#y = (rt['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
#xtraintis, xtesttis, ytraintis, ytesttis = train_test_split(Xtis, y)

In [None]:
# Evaluate performance of models
# Tune the regularization parameter, C, to improve performance.


In [None]:
# Tune the regularization parameter, C, to improve performance.


# Next steps

Are you satisfied with these results? Why might you be less than satisfied? How can you explain the observed behavior? What are the next steps you would need to do to improve this classifier? If you have time remaining, try a few strategies out below.

In [None]:
# continue playing here

# Use pipeline to evaluate accuracy with cross validation
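
A pipeline bundles the vectorizer and the classifier so that, inside cross-validation, the vocabulary is re-learned on each training fold only. Below is a minimal sketch on invented toy quotes, not the Rotten Tomatoes data; the names `quotes`, `labels` and `pipe`, the choice of `cv=4`, and the vectorizer settings are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only (1 = fresh, 0 = rotten)
quotes = [
    "a joyful and inventive comedy", "brilliant acting and a sharp script",
    "a moving, beautiful story", "funny, warm and charming",
    "an absolute delight to watch", "smart and thrilling throughout",
    "a wonderful cast shines", "fresh, clever and fun",
    "a dull and lifeless mess", "boring plot and wooden acting",
    "a tedious, predictable story", "painfully unfunny and flat",
    "an absolute waste of time", "clumsy and forgettable throughout",
    "a terrible cast flounders", "stale, lazy and dreary",
]
labels = [1] * 8 + [0] * 8

# Vectorizer and classifier in one estimator; settings mirror the cells above
pipe = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=5000),
    LogisticRegression(C=1),
)

# 4-fold cross-validated accuracy (stratified folds are used for classifiers)
scores = cross_val_score(pipe, quotes, labels, cv=4, scoring='accuracy')
mean_accuracy = scores.mean()
print(mean_accuracy)
```

On the real data you would pass `rt.quote` and `y` in place of the toy lists.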

# More Next Steps

The hardest part of creating a sentiment model is finding good training data. Googling 'sentiment analysis training data' or 'sentiment analysis test data' turns up a few freely available sources. Most of them are hosted by universities.

But notice that determining the judgment of a movie review isn't the same task as determining the emotional content of a tweet. And yet, it kind of is. The computer doesn't know anything about the nature of the text. All it knows is that there are documents with one label (fresh/happy) and documents with another label (rotten/sad), and it needs to fit a model to discriminate between the two. This can be extended to more classes (look into the 20 newsgroups dataset in scikit-learn) and to proprietary corpora.

One application you might use at work is classifying support emails from users. The classes might be 'ranting', 'mischarge', 'lost order', 'gushing', or whatever is common. Even if the classifier isn't perfect, it could help streamline the process of getting the right emails to the right support personnel.
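
The same vectorize-then-classify recipe carries over to more than two classes. Here is a minimal sketch of the support email idea; the emails are invented for illustration, and with so few examples the prediction should not be taken seriously.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical support emails, invented for illustration only
emails = [
    "this is the worst service I have ever used",
    "I am furious, your app is terrible",
    "you charged my card twice this month",
    "I was billed the wrong amount",
    "my order never arrived",
    "the package is missing, where is my order",
    "I love this product, thank you so much",
    "fantastic support, you are wonderful",
]
classes = ["ranting", "ranting", "mischarge", "mischarge",
           "lost order", "lost order", "gushing", "gushing"]

# LogisticRegression handles more than two classes without extra setup
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, classes)

# Route a new (also invented) email to one of the four classes
predicted = model.predict(["where is my order, it never arrived"])[0]
print(predicted)
```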

In [None]:
from IPython.display import HTML
HTML('''
<style>
.text_cell_render {
  background-color: silver
}
</style>
''')