# Sentiment analysis and Naive Bayes

Now that you've seen sentiment analysis using a predefined dictionary of positive and negative valence for different words, this lab will have you do sentiment analysis on the movie review dataset "in reverse": you'll find which words are most likely to appear in positively or negatively valenced reviews.

---

### Naive Bayes

For this lab we're going to use a classifier common in NLP and sentiment analysis called Naive Bayes. We'll be covering Naive Bayes classifiers in more depth in a later lecture – for this lab you'll just be implementing it using sklearn.

Essentially, Naive Bayes inverts the problem you are used to in supervised learning. Given a feature $x_i$ and target $y$, it estimates $P(x_i | y)$. In other words, the probability of a feature/predictor _given_ the class of the target (for example, given that the target is 1).

We'll use this to figure out which words are more likely to appear when the target is 1 ("fresh") vs when the target is 0 ("rotten").
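
To make $P(x_i | y)$ concrete, here is a minimal sketch on a made-up binary word matrix (the data and the helper `word_prob_given_class` are hypothetical, not part of the lab). It estimates the probability by counting, with add-one smoothing, and compares the result with what `BernoulliNB` stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data (made up for illustration): rows are reviews, columns are
# binary indicators for the words "great" and "boring".
X = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1],
              [1, 1]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = fresh, 0 = rotten

alpha = 1.0  # add-one (Laplace) smoothing, the BernoulliNB default


def word_prob_given_class(X, y, label, alpha=1.0):
    """Smoothed estimate of P(x_i = 1 | y = label) for every column i."""
    rows = X[y == label]
    return (rows.sum(axis=0) + alpha) / (rows.shape[0] + 2 * alpha)


p_fresh = word_prob_given_class(X, y, 1, alpha)   # [0.8, 0.4]
p_rotten = word_prob_given_class(X, y, 0, alpha)  # [0.4, 0.8]

# The fitted model stores the same quantities on the log scale
bnb = BernoulliNB(alpha=alpha).fit(X, y)
sk_probs = np.exp(bnb.feature_log_prob_)  # row 0 = rotten, row 1 = fresh
```

Here "great" is twice as likely to appear in a fresh review as in a rotten one, and the reverse holds for "boring".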

---

### Load packages and movie data

Do any cleaning you deem necessary.

In [1]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB
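# sklearn.cross_validation was removed in scikit-learn 0.20; these functions live in model_selection
from sklearn.model_selection import cross_val_score, train_test_split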

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
import string

rt = pd.read_csv('/Users/tlee010/desktop/DSI-SF-2-timdavidlee/datasets/rottentomatoes_critics/rt_critics.csv')
print(rt.head(3))

# Drop rows with missing values and keep only reviews labeled fresh or rotten
rt.dropna(inplace=True)
rt = rt[rt.fresh.isin(['fresh', 'rotten'])]

# Map each critic, title, and publication to an integer id
criticz = {y: x for x, y in enumerate(rt.critic.unique())}
titlez = {y: x for x, y in enumerate(rt.title.unique())}
pubz = {y: x for x, y in enumerate(rt.publication.unique())}
freshz = {y: x for x, y in enumerate(rt.fresh.unique())}

rt.critic = rt.critic.map(lambda x: criticz[x])
rt.title = rt.title.map(lambda x: titlez[x])
rt.publication = rt.publication.map(lambda x: pubz[x])
rt.fresh = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)  # target: 1 = fresh, 0 = rotten
rt.reset_index(inplace=True)

# Lowercase each quote and keep only letters, spaces, hyphens, and apostrophes
rt.quote = rt.quote.map(lambda x: ''.join([c for c in x.lower() if c in string.ascii_lowercase + " -'"]))
rt.head(4)

            critic  fresh      imdb    publication  \
0      Derek Adams  fresh  114709.0       Time Out   
1  Richard Corliss  fresh  114709.0  TIME Magazine   
2      David Ansen  fresh  114709.0       Newsweek   

                                               quote review_date    rtid  \
0  So ingenious in concept, design and execution ...  2009-10-04  9559.0   
1                  The year's most inventive comedy.  2008-08-31  9559.0   
2  A winning animated feature that has something ...  2008-08-18  9559.0   

       title  
0  Toy story  
1  Toy story  
2  Toy story  


| | index | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 114709.0 | 0 | so ingenious in concept design and execution t... | 2009-10-04 | 9559.0 | 0 |
| 1 | 1 | 1 | 1 | 114709.0 | 1 | the year's most inventive comedy | 2008-08-31 | 9559.0 | 0 |
| 2 | 2 | 2 | 1 | 114709.0 | 2 | a winning animated feature that has something ... | 2008-08-18 | 9559.0 | 0 |
| 3 | 3 | 3 | 1 | 114709.0 | 3 | the film sports a provocative and appealing st... | 2008-06-09 | 9559.0 | 0 |


In [3]:
words = rt.quote.values
words[0:10]

array([ 'so ingenious in concept design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm',
       "the year's most inventive comedy",
       'a winning animated feature that has something for everyone on the age spectrum',
       "the film sports a provocative and appealing story that's every bit the equal of this technical achievement",
       "an entertaining computer-generated hyperrealist animation feature  that's also in effect a toy catalog",
       "as lion king did before it toy story revived the art of american children's animation and ushered in a set of smart movies that entertained children and their parents it's a landmark movie and doesn't get old with frequent repetition",
       "the film will probably be more fully appreciated by adults who'll love the snappy knowing verbal gags the vivid deftly defined characters and the overall conceptual sophistication",
       'children will enjoy a new take on the irresistibl

---

### Create a predictor matrix of words from the quotes with CountVectorizer

It is up to you what ngram range you want to select. **Make sure that `binary=True`**
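
As a small illustration of what `binary=True` changes, the sketch below vectorizes two made-up quotes with and without it. The quotes are hypothetical; `vocabulary_` is the fitted mapping from each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good movie", "bad movie"]  # made-up example quotes

binary_vec = CountVectorizer(binary=True)
count_vec = CountVectorizer(binary=False)

binary_matrix = binary_vec.fit_transform(docs).toarray()
count_matrix = count_vec.fit_transform(docs).toarray()

# vocabulary_ maps each term to its column index (columns are in alphabetical order)
good_col = binary_vec.vocabulary_["good"]
print(binary_matrix[0, good_col])  # 1: "good" is present in the first quote
print(count_matrix[0, good_col])   # 2: "good" occurs twice in the first quote
```

`BernoulliNB` models presence or absence only, which is why the lab asks for the binary encoding.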

In [6]:
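from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stop = stopwords.words('english')

# binary=True records whether an n-gram appears in a quote, not how many times
cvec = CountVectorizer(binary=True, stop_words=stop, max_features=1000, ngram_range=(1, 3))
cvec.fit(words)
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
df = pd.DataFrame(cvec.transform(words).toarray(),
                  columns=cvec.get_feature_names())
#df.transpose().sort_values(0, ascending=False).head(10).transpose()
# Reorder the columns so that words present in the first quote come first
word_matrix = df.transpose().sort_values(0, ascending=False).transpose()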


In [7]:
word_matrix.columns = [x.replace(' ','_') for x in word_matrix.columns]
word_matrix.head(3)

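| | watch | still | concept | charm | could | screen | design | psychological | quality | put | ... | formula | formulaic | found | four | frank | free | frequently | fresh | friends | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |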


---

### Split data into training and testing splits

You should keep 25% of the data in the test set.

In [8]:
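# Add the freshness target. The vocabulary already contains the word "fresh",
# so the merge suffixes the two columns: fresh_x is the word feature, fresh_y is the target.
base = word_matrix.merge(rt[['fresh']], how='left', left_index=True, right_index=True)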



In [9]:
base.head(2)

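| | watch | still | concept | charm | could | screen | design | psychological | quality | put | ... | formulaic | found | four | frank | free | frequently | fresh_x | friends | young | fresh_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |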


In [10]:
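# columns.difference() returns the remaining column names in sorted order,
# so the columns of X are in alphabetical order
X = base[base.columns.difference(['fresh_y'])].values
y = base['fresh_y'].values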



In [11]:
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows as the test set
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)
print(Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape)

(10022, 1000) (3341, 1000) (10022,) (3341,)


---

### Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.
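
One common choice of baseline is the accuracy of always predicting the majority class. The sketch below assumes 0/1 labels; `baseline_accuracy` is a hypothetical helper and the labels are made up.

```python
import numpy as np


def baseline_accuracy(y):
    """Accuracy of always predicting the most common class in y (0/1 labels)."""
    p_fresh = np.mean(y)
    return max(p_fresh, 1 - p_fresh)


y_example = np.array([1, 1, 1, 0])  # made-up labels, not the lab data
print(baseline_accuracy(y_example))  # 0.75
```

In the lab, the mean cross-validated score would be compared against `baseline_accuracy(ytrain)`.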

In [None]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
print('start')
bnb.fit(Xtrain, ytrain)
print('all fit')
# cross_val_score clones and refits the model on each fold of the training data
score = cross_val_score(bnb, Xtrain, ytrain, cv=5, n_jobs=-1)

start
all fit


In [None]:
print(score, np.mean(score))

---

### Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the Naive Bayes model contains the log probabilities of a feature appearing given a target class.

The rows correspond to the classes of the target, and the columns correspond to the features. The first row is the 0 "rotten" class, and the second is the 1 "fresh" class.

#### 1. Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [None]:
print(bnb.feature_log_prob_)
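
A minimal sketch of the conversion, using a made-up word matrix and a separate model named `toy_bnb` so it does not touch the lab variables: exponentiating undoes the log, and the rows follow the order of `classes_`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary word matrix: 4 reviews x 3 words
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 0, 0])  # 1 = fresh, 0 = rotten

toy_bnb = BernoulliNB().fit(X_toy, y_toy)

# exp() undoes the log; rows follow the order of toy_bnb.classes_
probs = np.exp(toy_bnb.feature_log_prob_)
rotten_probs = probs[0]  # P(word | rotten)
fresh_probs = probs[1]   # P(word | fresh)
```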


#### 2. Make a dataframe with the probabilities and features

#### 3. Create a column that is the difference between fresh probability of appearance and rotten

#### 4. Look at the most likely words for fresh and rotten reviews
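
Steps 2 through 4 can be sketched together on a toy corpus. Everything here (the quotes, the labels, and the column names `p_rotten`, `p_fresh`, `fresh_minus_rotten`) is made up for illustration and is only one possible layout.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up quotes and labels, for illustration only
quotes = ["a great charming film", "great fun", "a boring mess", "boring and dull"]
labels = np.array([1, 1, 0, 0])  # 1 = fresh, 0 = rotten

vec = CountVectorizer(binary=True)
X_words = vec.fit_transform(quotes)
model = BernoulliNB().fit(X_words, labels)

# Recover the column order from vocabulary_ (term -> column index)
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
probs = np.exp(model.feature_log_prob_)  # row 0 = rotten, row 1 = fresh

# Step 2: one row per word, with its probability under each class
word_probs = pd.DataFrame({
    "word": terms,
    "p_rotten": probs[0],
    "p_fresh": probs[1],
})

# Step 3: difference between fresh and rotten probability of appearance
word_probs["fresh_minus_rotten"] = word_probs["p_fresh"] - word_probs["p_rotten"]

# Step 4: words at either extreme of the difference
most_fresh = word_probs.sort_values("fresh_minus_rotten", ascending=False).head(3)
most_rotten = word_probs.sort_values("fresh_minus_rotten").head(3)
```

In the lab itself, the feature names would come from `base.columns.difference(['fresh_y'])`, since that is the column order of `X`.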

---

### Examine how your model performs on the test set
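
A sketch of one way to do this, on synthetic data standing in for the word matrix (the `_demo` arrays are generated here and are not the lab data): fit on the training split only, then report accuracy and a confusion matrix on the held-out split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the word matrix: 200 reviews x 5 binary features
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=200)
X_demo = rng.randint(0, 2, size=(200, 5))
X_demo[:, 0] = y_demo  # make the first "word" perfectly informative

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = BernoulliNB().fit(X_tr, y_tr)  # fit on the training data only
pred = clf.predict(X_te)

test_accuracy = accuracy_score(y_te, pred)
# Rows are the true class, columns the predicted class, both ordered [0, 1]
cm = confusion_matrix(y_te, pred, labels=[0, 1])
```

With the lab variables this would amount to scoring `bnb` on `Xtest` and `ytest`, and comparing the result with the cross-validated score and the baseline.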

---

### Look at the top 10 movies and reviews likely to be fresh and top 10 likely to be rotten

You can fit the model on the full set of data for this.

Note: Naive Bayes, while good at classifying, is known to be somewhat bad at giving accurate predicted probabilities (beyond landing on the correct side of 50%). It is a good classifier but a bad estimator.
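
A possible sketch using `predict_proba`, on made-up reviews (the DataFrame and its `p_fresh` column are hypothetical): attach the predicted probability of "fresh" to each review and sort.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up reviews, for illustration only
reviews = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie B"],
    "quote": ["great charming fun", "great film", "boring dull mess", "boring film"],
    "fresh": [1, 1, 0, 0],
})

vec = CountVectorizer(binary=True)
X_all = vec.fit_transform(reviews["quote"])
model = BernoulliNB().fit(X_all, reviews["fresh"])

# Column 1 of predict_proba is P(fresh | words), because model.classes_ is [0, 1]
reviews["p_fresh"] = model.predict_proba(X_all)[:, 1]

top_fresh = reviews.sort_values("p_fresh", ascending=False).head(10)
top_rotten = reviews.sort_values("p_fresh").head(10)
```

Given the caveat above, the ranking is more trustworthy than the probability values themselves.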

---

## [Bonus] Find out which critics are likely to give the "freshest" and "rottenest" reviews
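
One simple starting point, which is a descriptive summary and not a Naive Bayes model, is each critic's observed share of fresh reviews. The data, the `min_reviews` threshold, and the output column names below are made up for illustration.

```python
import pandas as pd

# Made-up reviews; the column names mirror the lab's `critic` and `fresh` columns
toy_reviews = pd.DataFrame({
    "critic": ["A", "A", "A", "B", "B", "C"],
    "fresh":  [1,   1,   0,   0,   0,   1],
})

critic_stats = (toy_reviews.groupby("critic")["fresh"]
                .agg(["mean", "count"])
                .rename(columns={"mean": "fresh_rate", "count": "n_reviews"}))

# Require a minimum number of reviews so one-review critics do not dominate
min_reviews = 2  # arbitrary threshold for this toy example
eligible = critic_stats[critic_stats["n_reviews"] >= min_reviews]

freshest = eligible.sort_values("fresh_rate", ascending=False)
rottenest = eligible.sort_values("fresh_rate")
```

Critic C has a perfect fresh rate but only one review, so the threshold excludes them here.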