
# Sentiment Analysis and Naive Bayes


---

In the sentiment analysis lesson we used a predefined dictionary of positive and negative valences for words. This lab inverts the process: you'll find which words are most likely to appear in positive or negative reviews by using the rotten vs. fresh binary label.

### Naive Bayes

A practical and common way to do this is with the Naive Bayes algorithm. Naive Bayes classifiers are covered in more depth in another lecture – for this lab you'll just be leveraging the sklearn implementation.

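Given a feature $x_i$ and target $y_i$, Naive Bayes classifiers estimate $P(x_i \;|\; y_i)$: the probability of a feature/predictor _given_ the class of the target (for example, given that the target is 1). The classifier then combines these per-feature probabilities with Bayes' rule to predict the class.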

We'll use this to figure out which words are more likely to appear when the target is 1 ("fresh") vs when the target is 0 ("rotten").
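
For the Bernoulli variant used in this lab, the estimated quantity can be written out explicitly. The following is a sketch of the standard smoothed estimate; the symbols $N_{ic}$, $N_c$ and $\alpha$ are notation introduced here for illustration and do not appear elsewhere in the lab.

$$
\hat{P}(x_i = 1 \;|\; y = c) = \frac{N_{ic} + \alpha}{N_c + 2\alpha}
$$

Here $N_c$ is the number of training reviews in class $c$, $N_{ic}$ is the number of those reviews that contain word $i$, and $\alpha$ is a smoothing constant (sklearn's `BernoulliNB` defaults to `alpha=1.0`). The "naive" part is the assumption that words are independent given the class, so a review with binary word indicators $x_1, \dots, x_p$ is scored as:

$$
P(y = c \;|\; x_1, \dots, x_p) \;\propto\; P(y = c) \prod_{i=1}^{p} \hat{P}(x_i = 1 \;|\; y = c)^{x_i} \, \bigl(1 - \hat{P}(x_i = 1 \;|\; y = c)\bigr)^{1 - x_i}
$$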

---

### 1. Load packages and movie data

Do any cleaning you deem necessary.

In [15]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [16]:
rt = pd.read_csv('../datasets/rt_critics.csv')

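# Filter out the 23 "None" values in our target, the "fresh" column, then encode it as 1s and 0s.
# .copy() avoids pandas' SettingWithCopyWarning when assigning to the filtered frame.
rt = rt[rt.fresh.isin(['fresh','rotten'])].copy()
rt['fresh'] = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)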

In [17]:
rt.head(10)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,1,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story
2,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story
3,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story
4,Jonathan Rosenbaum,1,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story
5,Michael Booth,1,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story
6,Geoff Andrew,1,114709.0,Time Out,The film will probably be more fully appreciat...,2006-06-24,9559.0,Toy story
7,Janet Maslin,1,114709.0,New York Times,Children will enjoy a new take on the irresist...,2003-05-20,9559.0,Toy story
8,Kenneth Turan,1,114709.0,Los Angeles Times,Although its computer-generated imagery is imp...,2001-02-13,9559.0,Toy story
9,Susan Wloszczyna,1,114709.0,USA Today,How perfect that two of the most popular funny...,2000-01-01,9559.0,Toy story


---

### 2. Create a predictor matrix of words from the quotes with CountVectorizer

It is up to you what ngram range you want to select. **Make sure that `binary=True`**

In [18]:
# CountVectorizer's fit_transform creates a feature for each word (or n-gram) in our rt.quote column.
# The ngram_range param sets the n-gram lengths to vectorize, e.g. (2,4) would use all 2-, 3- and 4-word sequences in each string.
# The max_features param caps the number of columns that can be created (the most frequent terms are kept).
# The binary param tells CountVectorizer to mark all non-zero counts as 1 (not the actual count, e.g. 2, 3, 4, 5). 0's stay 0.
# The stop_words param tells CountVectorizer to ignore "common" words: conjunctions ("for", "and", "nor", "but", "or", "yet", "so"), "the", etc.
cv = CountVectorizer(ngram_range=(1,2), max_features=2500, binary=True, stop_words='english')
words = cv.fit_transform(rt.quote)

In [19]:
words.shape

(14049, 2500)
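
To make the effect of `ngram_range` and `binary` concrete, here is a minimal sketch on two made-up sentences. The names `toy_docs`, `binary_cv` and `count_cv` are illustrative and not part of the lab; `max_features` and `stop_words` are left out to keep the vocabulary small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "reviews"; the word "good" appears twice in the first one.
toy_docs = ["good good movie", "bad movie"]

# Same idea as the lab's vectorizer: unigrams and bigrams, binary encoded.
binary_cv = CountVectorizer(ngram_range=(1, 2), binary=True)
binary_X = binary_cv.fit_transform(toy_docs).toarray()

# Identical vectorizer, but keeping raw counts for comparison.
count_cv = CountVectorizer(ngram_range=(1, 2), binary=False)
count_X = count_cv.fit_transform(toy_docs).toarray()

# vocabulary_ maps each n-gram to its column index; sort n-grams by that index.
features = sorted(binary_cv.vocabulary_, key=binary_cv.vocabulary_.get)

print(features)
print(binary_X)  # every entry is 0 or 1
print(count_X)   # "good" is counted twice in the first row
```

With `binary=True` the repeated word is recorded only as present, which is the encoding `BernoulliNB` expects.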

In [20]:
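# todense() converts the sparse matrix to a dense one so it can be wrapped in a DataFrame.
# Note: on scikit-learn 1.0 and later, use cv.get_feature_names_out() instead of get_feature_names().
words = pd.DataFrame(words.todense(), columns=cv.get_feature_names())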

In [21]:
words.head()

Unnamed: 0,10,100,13,1961,1998,20,2001,30,40,50s,...,year,year old,years,years ago,yes,york,young,younger,youth,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
print(words.shape)

(14049, 2500)


---

### 3. Split data into training and testing splits

You should keep 25% of the data in the test set.

In [23]:
Xtrain, Xtest, ytrain, ytest = train_test_split(words.values, rt.fresh.values, test_size=0.25)

In [24]:
print(Xtrain.shape, Xtest.shape)

(10536, 2500) (3513, 2500)
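
The split above is random, so the scores in later cells will vary slightly from run to run. As a hedged sketch on toy data (the arrays and the seed value are made up for illustration), `random_state` makes the split repeatable and `stratify` keeps the fresh/rotten balance the same in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the word matrix and the fresh/rotten labels.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([1] * 12 + [0] * 8)  # 60% "fresh", 40% "rotten"

# random_state fixes the shuffle; stratify preserves the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```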


---

### 4. Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.

In [25]:
nb = BernoulliNB()

In [26]:
nb.fit(Xtrain, ytrain)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [27]:
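# 5-fold cross-validated accuracy, computed on the training data only.
nb_scores = cross_val_score(BernoulliNB(), Xtrain, ytrain, cv=5)
print(nb_scores)
print(np.mean(nb_scores))
# Baseline: the share of "fresh" labels, i.e. the accuracy of always predicting fresh.
print(np.mean(ytrain))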

[ 0.74478178  0.74051233  0.72520171  0.72994779  0.71604938]
0.731298600405
0.614464692483
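
The last number printed is the baseline: because fresh is the majority class, always predicting "fresh" would be right about 61% of the time. One way to check this reasoning is with sklearn's `DummyClassifier`, sketched here on made-up labels (`toy_X` and `toy_y` are illustrative, not the lab data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Toy labels with a majority of 1s ("fresh"), similar in spirit to the lab.
toy_y = np.array([1] * 60 + [0] * 40)
toy_X = np.zeros((100, 1))  # the dummy model ignores the features

# A classifier that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, toy_X, toy_y, cv=5)

# The same number, computed directly from the class proportions.
majority_share = max(toy_y.mean(), 1 - toy_y.mean())

print(baseline_scores.mean(), majority_share)
```

A model is only useful if its cross-validated accuracy clearly exceeds this figure.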


---

### 5. Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the Naive Bayes model contains the log probabilities of a feature appearing given a target class.

The rows correspond to the class of the target, and the columns correspond to the features. The first row is the 0 "rotten" class, and the second is the 1 "fresh" class.
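
The row order can be confirmed from the model's `classes_` attribute, and the probabilities themselves can be reproduced by hand. The sketch below uses a tiny made-up word matrix (`toy_X`, `toy_y` and `manual` are illustrative names) and assumes the default add-one smoothing, `alpha=1.0`:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary word matrix: 4 reviews x 3 words.
toy_X = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 1, 1]])
toy_y = np.array([1, 1, 0, 0])  # 1 = "fresh", 0 = "rotten"

toy_nb = BernoulliNB(alpha=1.0).fit(toy_X, toy_y)

# classes_ gives the row order of feature_log_prob_.
print(toy_nb.classes_)

# Recompute P(word | class) by hand:
# (reviews in the class containing the word + 1) / (reviews in the class + 2)
manual = np.vstack([
    (toy_X[toy_y == c].sum(axis=0) + 1.0) / ((toy_y == c).sum() + 2.0)
    for c in toy_nb.classes_
])

print(np.exp(toy_nb.feature_log_prob_))
print(manual)
```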

#### 5.1 Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [54]:
# Here we pull out the natural log probabilities, which aren't in 0-1 space.
feat_lp = nb.feature_log_prob_

In [56]:
feat_lp.shape

(2, 2500)

In [32]:
# Here we exponentiate the log probabilities to put them in 0-1 space: i.e. 2.71828182846^(log_prob) = prob in 0-1 space.
fresh_p = np.exp(feat_lp[1])

In [33]:
rotten_p = np.exp(feat_lp[0])

#### 5.2 Make a dataframe with the probabilities and features

In [34]:
feat_probs = pd.DataFrame({'fresh_p':fresh_p, 'rotten_p':rotten_p, 'feature':words.columns.values})

In [35]:
# The probability of finding the feature given fresh vs rotten.
feat_probs

Unnamed: 0,feature,fresh_p,rotten_p
0,10,0.002162,0.004183
1,100,0.000926,0.001722
2,13,0.001544,0.000984
3,1961,0.000926,0.000246
4,1998,0.000926,0.000984
5,20,0.002162,0.001230
6,2001,0.001390,0.000738
7,30,0.001390,0.000984
8,40,0.000463,0.001230
9,50s,0.001390,0.001476


#### 5.3 Create a column that is the difference between fresh probability of appearance and rotten

In [36]:
feat_probs['fresh_diff'] = feat_probs.fresh_p - feat_probs.rotten_p

#### 5.4 Look at the most likely words for fresh and rotten reviews

In [37]:
# Features with the largest fresh_diff are the strongest indicators
# of a fresh review within our vocabulary: they have the largest gap
# between the probability of appearing in a fresh review and the
# probability of appearing in a rotten one.
feat_probs.sort_values('fresh_diff', ascending=False, inplace=True)
feat_probs.head(20)

Unnamed: 0,feature,fresh_p,rotten_p,fresh_diff
825,film,0.153335,0.113927,0.039408
193,best,0.044781,0.020177,0.024604
965,great,0.029339,0.00935,0.019989
693,entertaining,0.022545,0.005167,0.017377
1584,performance,0.023008,0.006152,0.016856
1585,performances,0.020846,0.009104,0.011742
694,entertainment,0.01714,0.005413,0.011727
837,films,0.02517,0.01378,0.01139
900,fun,0.025015,0.014026,0.01099
87,american,0.020383,0.009843,0.01054


In [38]:
feat_probs.sort_values('fresh_diff', ascending=True, inplace=True)
feat_probs.head(20)

Unnamed: 0,feature,fresh_p,rotten_p,fresh_diff
1269,like,0.043082,0.069144,-0.026062
1754,really,0.006022,0.022146,-0.016123
157,bad,0.008493,0.024606,-0.016113
1280,little,0.016677,0.03248,-0.015803
1445,movie,0.127857,0.142717,-0.01486
1629,plot,0.012044,0.026821,-0.014776
1894,script,0.010037,0.0219,-0.011863
606,doesn,0.015596,0.026821,-0.011225
1164,isn,0.01189,0.022638,-0.010748
810,feels,0.003552,0.013533,-0.009982
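
Note that a plain difference favors words that are common in both classes: "film" tops the fresh list and "movie" appears on the rotten list even though each shows up often under both labels. One possible alternative, offered only as a sketch, is to rank by the log of the probability ratio. The column name `fresh_log_ratio` is hypothetical; the numbers are copied from the tables above.

```python
import numpy as np
import pandas as pd

# A few rows copied from the fresh and rotten tables above.
sample = pd.DataFrame({
    'feature':  ['film', 'entertaining', 'movie', 'feels'],
    'fresh_p':  [0.153335, 0.022545, 0.127857, 0.003552],
    'rotten_p': [0.113927, 0.005167, 0.142717, 0.013533],
})

# The lab's measure: absolute difference in probability.
sample['fresh_diff'] = sample.fresh_p - sample.rotten_p

# Hypothetical alternative: log of the probability ratio.
sample['fresh_log_ratio'] = np.log(sample.fresh_p / sample.rotten_p)

print(sample.sort_values('fresh_log_ratio', ascending=False))
```

Under the ratio, "entertaining" ranks above "film" and "feels" ranks below "movie", because the ratio measures relative rather than absolute change.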


---

### 6. Examine how your model performs on the test set

In [39]:
# Accuracy on the held-out test set, followed by the baseline (share of "fresh" labels).
print(nb.score(Xtest, ytest))
print(np.mean(ytest))

0.736407628807
0.608881298036


---

### 7. Look at the top 10 movies and reviews likely to be fresh and top 10 likely to be rotten

You can fit the model on the full set of data for this.

> **Note:** Naive Bayes, while good at classifying, is known to be somewhat bad at giving accurate predicted probabilities (beyond getting it on the correct side of 50%). It is a good classifier but a bad estimator. 
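
In practice this means treating `predict_proba` as a ranking score and a 0.5 cutoff, not as a calibrated probability. The sketch below, on randomly generated toy data (all names and the seed are illustrative), checks that thresholding the second column of `predict_proba` at 0.5 gives the same labels as `predict`:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary features and labels, generated at random for illustration.
rng = np.random.RandomState(0)
toy_X = rng.randint(0, 2, size=(200, 5))
toy_y = toy_X[:, 0] | toy_X[:, 1]  # label depends on the first two "words"

toy_nb = BernoulliNB().fit(toy_X, toy_y)

# Column 1 of predict_proba is the predicted probability of class 1.
prob_pos = toy_nb.predict_proba(toy_X)[:, 1]

# Thresholding at 0.5 should reproduce the hard class predictions.
thresholded = (prob_pos > 0.5).astype(int)
print(np.array_equal(thresholded, toy_nb.predict(toy_X)))
```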

In [40]:
# As the note mentioned, don't fixate on the predicted probability; just treat > 0.5 as fresh and anything else as rotten.
X = words.values
y = rt.fresh

In [41]:
#repeating our last model, just fitting on the whole dataset.
nbfull = BernoulliNB().fit(X,y)

In [42]:
pp = pd.DataFrame({
        'prob_fresh':nbfull.predict_proba(X)[:,1],
        'movie':rt.title,
        'quote':rt.quote
    })

In [43]:
pp.head()

Unnamed: 0,movie,prob_fresh,quote
0,Toy story,0.68813,"So ingenious in concept, design and execution ..."
1,Toy story,0.687331,The year's most inventive comedy.
2,Toy story,0.970846,A winning animated feature that has something ...
3,Toy story,0.995701,The film sports a provocative and appealing st...
4,Toy story,0.984896,"An entertaining computer-generated, hyperreali..."


In [44]:
pp.sort_values('prob_fresh', ascending=False, inplace=True)
for movie, quote in zip(pp.movie[0:10], pp.quote[0:10]):
    print(movie, '\t', quote)
    print('--------------------------------------------------\n')

Kundun 	Stunning, odd, glorious, calm and sensationally absorbing, director Martin Scorsese's Kundun is a remarkable piece of work with vital colors and a wrenching message.
--------------------------------------------------

The Wild Bunch 	The Wild Bunch is Peckinpah's most complex inquiry into the metamorphosis of man into myth. Not incidentally, it is also a raucous, violent, powerful feat of American film making.
--------------------------------------------------

Witness 	Powerful, assured, full of beautiful imagery and thankfully devoid of easy moralising, it also offers a performance of surprising skill and sensitivity from Ford.
--------------------------------------------------

The English Patient 	This is one of the year's most unabashed and powerful love stories, using flawless performances, intelligent dialogue, crisp camera work, and loaded glances to attain a level of eroticism and emotional connection that many similar films miss.
--------------------------------------

In [45]:
pp.sort_values('prob_fresh', ascending=True, inplace=True)
for movie, quote in zip(pp.movie[0:10], pp.quote[0:10]):
    print(movie, '\t', quote)
    print('--------------------------------------------------\n')

Pokémon: The First Movie 	With intentionally stilted animation, uninspired music and lame jokes, Pokemon is basically an ultralong version of the phenomenon's own boring TV 'toon.
--------------------------------------------------

Joe's Apartment 	There's not enough story here for something half that length, so we're subjected to numerous pointless and irritating song-and-dance numbers designed to nudge the lame plot towards its conclusion.
--------------------------------------------------

Kazaam 	As fairy tale, buddy comedy, family drama, thriller or rap revue, Kazaam is simply uninspired and unconvincing, and Mr. O'Neal, who can carry a basketball team, lacks the charisma to rescue this misguided effort.
--------------------------------------------------

Gung Ho 	A disappointment, a movie in which the Japanese are mostly used for the mechanical requirements of the plot, and the Americans are constructed from durable but boring stereotypes.
----------------------------------------

---

### 8. Find the movies most likely to be fresh and rotten among those with at least 10 reviews

In [46]:
# subset to movies with at least 10 reviews:
movie_counts = pp.movie.value_counts().reset_index()
movie_counts.columns = ['movie','counts']
movie_counts.head()

Unnamed: 0,movie,counts
0,The Hurricane,20
1,Fever Pitch,20
2,The Truman Show,20
3,The Green Mile,20
4,The Sixth Sense,20


In [47]:
# Average the predicted probabilities over each movie title.
pp_movies = pp[['movie','prob_fresh']].groupby('movie').mean().reset_index()
# Keep only movies with at least 10 reviews.
pp_movies = pp_movies[pp_movies.movie.isin(movie_counts[movie_counts.counts >= 10].movie)]
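
The same result can also be reached in a single groupby by computing the mean and the review count together. This is a sketch on a made-up stand-in for `pp`; `toy_pp`, `summary` and `min_reviews` are hypothetical names, and the cutoff is lowered to 2 so the toy data has something to show.

```python
import pandas as pd

# Toy stand-in for the pp dataframe of per-review predictions.
toy_pp = pd.DataFrame({
    'movie': ['A', 'A', 'A', 'B', 'B', 'C'],
    'prob_fresh': [0.9, 0.8, 0.7, 0.2, 0.4, 0.99],
})

# Mean predicted probability and number of reviews per movie, in one pass.
summary = (
    toy_pp.groupby('movie')['prob_fresh']
    .agg(['mean', 'count'])
    .reset_index()
)

# The lab uses a cutoff of 10 reviews; 2 is used here for the toy data.
min_reviews = 2
filtered = summary[summary['count'] >= min_reviews]
print(filtered)
```

Movie "C" has the highest score but only one review, so it is dropped, which is the reason for requiring a minimum number of reviews.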

In [48]:
# Best movies
pp_movies.sort_values('prob_fresh', ascending=False, inplace=True)
pp_movies.head(10)

Unnamed: 0,movie,prob_fresh
1417,The Iron Giant,0.979857
862,Midnight Run,0.938485
209,Boogie Nights,0.933018
830,Manhattan,0.923313
1447,The Little Mermaid,0.91311
1058,Raging Bull,0.909063
652,Il conformista,0.899021
1055,Quiz Show,0.897883
1615,Toy story,0.894362
298,Cookie's Fortune,0.889969


In [49]:
# Worst movies
pp_movies.sort_values('prob_fresh', ascending=True, inplace=True)
pp_movies.head(10)

Unnamed: 0,movie,prob_fresh
152,Basic Instinct 2,0.172296
1205,Spy Hard,0.178794
1661,Vegas Vacation,0.178997
1669,Virus,0.198833
288,Color of Night,0.210623
139,Bad Boys,0.211745
9,3 Strikes,0.216798
391,Dracula: Dead and Loving It,0.220098
627,House Arrest,0.223951
1296,The Bachelor,0.224735


### Knowledge Check

What are some of the parameters we tuned in building this model?
- `ngram_range`
- `binary`
- `max_features`
- `stop_words`

Decisions made when building model:
- Towards the end, we refit the model on the full dataset and ranked only those movies with at least 10 reviews...

What are some tweaks you would look into given some more time? How do you think these tweaks would impact model performance?