# Sentiment analysis codealong using spacy and movie reviews

Sentiment analysis is one of the more popular topics in NLP. It is concerned with finding some kind of valence to written text. This could be positivity, negativity, subjectivity and many others. In this lesson we will just be looking at those three. 

First we will load in a dataset of pre-coded sentiment scores for positivity and negativity on words. These words are also divided up by their part of speech in the sentence.

Then we will load snippets of rottentomatoes reviews and explore the sentiment of the writing.

---

### Load packages and sentiment data

In [2]:
import pandas as pd
import numpy as np

In [5]:
sen = pd.read_csv('/Users/tlee010/desktop//DSI-SF-2-timdavidlee/datasets/sentiment_words/sentiment_words_simple.csv')

sen.head()

Unnamed: 0,pos,word,pos_score,neg_score
0,adj,.22-caliber,0.0,0.0
1,adj,.22-calibre,0.0,0.0
2,adj,.22_caliber,0.0,0.0
3,adj,.22_calibre,0.0,0.0
4,adj,.38-caliber,0.0,0.0


---

### Create a sentiment dataset that does not take into account part of speech tags

This will be what we use first, not knowing the part of speech a word is in. Later when we use spacy we will be able to determine the part of speech of each word and pair the scores accordingly.

In [4]:
sen_agg = sen[['word','pos_score','neg_score']].groupby('word').agg(np.mean).reset_index()
sen_agg.head()

Unnamed: 0,word,pos_score,neg_score
0,'hood,0.0,0.375
1,'s_gravenhage,0.0,0.0
2,'tween,0.0,0.0
3,'tween_decks,0.0,0.0
4,.22,0.125,0.0


---

### Create a dictionary version of the sentiment data for both the part of speech and aggregate

The dictionary format of the data will be much easier to index into in our functions later. If we don't do this it's much harder to make those functions run quickly.

In [11]:
sen_dict = {
    'ADJ':{}
    ,'NOUN':{}
    ,'VERB':{}
    , 'ADV':{}
}
for i, row in enumerate(sen.itertuples()): #makes rows into tuples as it goes (much faster)
    if (i % 10000)==0:
        print i
    sen_dict[row[1].upper()][row[2]] = {'pos_score':row[3], 'neg_score':row[4]}

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000


In [12]:
sen_dict['ADJ']['horrible']

{'neg_score': 0.625, 'pos_score': 0.0}

In [13]:
sen_agg_dict={}
for row in sen_agg.itertuples():
    sen_agg_dict[row[1]] = {'pos_score':row[2], 'neg_score':row[3]}

In [15]:
sen_dict['ADJ']['worst']

{'neg_score': 0.75, 'pos_score': 0.25}

---

### Load the rotten tomatoes dataset

This dataset has:
    
    critic: critic's name
    fresh: fresh vs. rotten rating
    imdb: code for imdb
    publication: where the review was published
    quote: the review snippet
    review_date: date of review
    rtid: rottentomatoes id
    title: name of movie

In [16]:
rt = pd.read_csv('/Users/tlee010/desktop/DSI-SF-2-timdavidlee/datasets/rottentomatoes_critics/rt_critics.csv')

In [17]:
rt.head(2)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story


---

### Restrict data to reviews with valid ratings and reviews over 10 words long

Clean up the reviews, making a column with the case and punctuation removed.

In [19]:
rt.fresh.value_counts()

fresh     8613
rotten    5436
none        23
Name: fresh, dtype: int64

In [20]:
# remove none reviews
# reviews with 10 words or fewer

rt = rt[rt.fresh.isin(['fresh','rotten'])]
rt.fresh = rt.fresh.map(lambda x : 1 if x=='fresh' else 0)
rt['quote_len'] = rt.quote.map(lambda x : len(x.split()))
rt = rt[rt.quote_len > 10]

In [21]:
rt.shape

(11215, 9)

In [22]:
for q in rt.quote.values[0:4]:
    print q

So ingenious in concept, design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm.
A winning animated feature that has something for everyone on the age spectrum.
The film sports a provocative and appealing story that's every bit the equal of this technical achievement.
An entertaining computer-generated, hyperrealist animation feature (1995) that's also in effect a toy catalog.


In [33]:
import string

string.ascii_lowercase

rt['qt'] = rt['quote'].map(lambda x : unicode(''.join([i for i in x.lower() if i in string.ascii_lowercase+" -'"])))

for q in rt.qt.values[0:4]:
    print q

rt.head()
#spacy will find all these marks and turn into puncuation, but we will remove for practice sake

so ingenious in concept design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm
a winning animated feature that has something for everyone on the age spectrum
the film sports a provocative and appealing story that's every bit the equal of this technical achievement
an entertaining computer-generated hyperrealist animation feature  that's also in effect a toy catalog


Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...
2,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...
3,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17,the film sports a provocative and appealing st...
4,Jonathan Rosenbaum,1,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story,14,an entertaining computer-generated hyperrealis...
5,Michael Booth,1,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story,40,as lion king did before it toy story revived t...


---

### Write a function to assign positive rating, negative, and objective based on words in review

We'll use the dictionary we constructed above (without the part of speech tags). 

Objectivity is calculated: 

    1. - (positive_score + negative_score)

In [36]:
def agg_scorer(x):
    x = x.split()
    pos_scores , neg_scores, obj_scores = [],[],[]
    for word in x:
        try:
            pos_scores.append(sen_agg_dict[word]['pos_score'])
            neg_scores.append(sen_agg_dict[word]['neg_score'])
            obj_scores.append(1. - (pos_scores[-1] + neg_scores[-1]))
        except:
            pos_scores.append(0.)
            neg_scores.append(0.)
            obj_scores.append(1.)
            continue
    return [pos_scores, neg_scores, obj_scores]

In [38]:
rev = rt.qt[7]
rev

u'children will enjoy a new take on the irresistible idea of toys coming to life adults will marvel at a witty script and utterly brilliant anthropomorphism'

In [39]:
p, n ,o = agg_scorer(rev)

In [40]:
for word, p_ in zip(rev.split(),p):
    print word, p_

children 0.0
will 0.0208333333333
enjoy 0.475
a 0.0178571428571
new 0.068181818182
take 0.0193452380952
on 0.0208333333333
the 0.0
irresistible 0.3125
idea 0.05
of 0.0
toys 0.0
coming 0.09375
to 0.0
life 0.0178571428571
adults 0.0
will 0.0208333333333
marvel 0.34375
at 0.0
a 0.0178571428571
witty 0.5
script 0.0
and 0.0
utterly 0.5
brilliant 0.4375
anthropomorphism 0.0


---

### Calculate the sum and average ratings for positive, negative, and objective for each review

In [41]:
agg_scores = map(agg_scorer,rt.qt)

In [42]:
rt['pos_avg'] = [np.mean(x[0]) for x in agg_scores]
rt['neg_avg'] = [np.mean(x[1]) for x in agg_scores]
rt['obj_avg'] = [np.mean(x[2]) for x in agg_scores]

rt['pos_sum'] = [np.sum(x[0]) for x in agg_scores]
rt['neg_sum'] = [np.sum(x[1]) for x in agg_scores]
rt['obj_sum'] = [np.sum(x[2]) for x in agg_scores]

In [43]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt,pos_avg,neg_avg,obj_avg,pos_sum,neg_sum,obj_sum
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...,0.045647,0.024706,0.929647,1.095524,0.592949,22.311527
2,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...,0.062271,0.021978,0.915751,0.809524,0.285714,11.904762
3,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17,the film sports a provocative and appealing st...,0.057831,0.024271,0.917897,0.983135,0.412608,15.604257
4,Jonathan Rosenbaum,1,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story,14,an entertaining computer-generated hyperrealis...,0.072688,0.042331,0.884982,0.94494,0.550298,11.504762
5,Michael Booth,1,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story,40,as lion king did before it toy story revived t...,0.028408,0.021935,0.949657,1.136316,0.877397,37.986287


In [49]:
print y

[1 1 1 ..., 1 1 1]


---

### Evaluate predictive ability using the sentiment scores

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

X = rt[['pos_avg','neg_avg','obj_avg','quote_len']]
y = rt.fresh.values

lr_scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print np.mean(lr_scores), np.mean(y)
lr = LogisticRegression().fit(X,y)


0.624431608768 0.615069103879


In [53]:
for predictor, coef in zip(X.columns, lr.coef_[0]):
    print predictor, coef

pos_avg 9.0934243294
neg_avg -7.52380496811
obj_avg -0.776342845174
quote_len 0.011594988458


In [54]:
pp = pd.DataFrame({
        'prob_fresh': lr.predict_proba(X)[:,1]
        ,'prob_rotten': lr.predict_proba(X)[:,0]
        ,'quote':rt.quote.values
    })

In [59]:

pp.head(3)

Unnamed: 0,prob_fresh,prob_rotten,quote
1837,0.89304,0.10696,"Appropriately operatic, Chen's visually specta..."
9701,0.875425,0.124575,"From Russia with Love is a preposterous, skill..."
7889,0.859373,0.140627,A very good film with some dazzling moments an...


In [61]:
pp.sort_values('prob_fresh', ascending=False, inplace = True)
for x in pp.quote.values[0:10]:
    print x
    print '='*60

Appropriately operatic, Chen's visually spectacular epic is sumptuous in every respect. Intelligent, enthralling, rhapsodic.
From Russia with Love is a preposterous, skillful slab of hardhitting, sexy hokum.
A very good film with some dazzling moments and one truly outstanding performance!
Remains a beautiful, deftly directed and superbly acted version of a witty and poignant drama.
An ingenious script, excellent special effects and photography, and superior acting, make it an endearing winner.
Part homage, part spoof, the deft balancing act is a clever, engaging adaption.
The Karate Kid exhibits warmth and friendly, predictable humor, its greatest assets.
Improbabilities and all, Simpatico still boasts wonderful scenes and a cast that is truly superb.
An inspiring translation of biblical grandeur, turning the story of one of history's greatest heroes into an entertaining, visually dazzling cartoon.
High Noon combines its points about good citizenship with some excellent picturemaking.

In [66]:
pp['difference'] = abs(pp.prob_fresh - pp.prob_rotten)
pp.head(3)

Unnamed: 0,prob_fresh,prob_rotten,quote,difference
1837,0.89304,0.10696,"Appropriately operatic, Chen's visually specta...",0.78608
9701,0.875425,0.124575,"From Russia with Love is a preposterous, skill...",0.75085
7889,0.859373,0.140627,A very good film with some dazzling moments an...,0.718746


In [68]:
pp.sort_values('difference', ascending=True, inplace = True)
for x in pp.quote.values[0:10]:
    print x
    print '='*60

A shambolic, deafening, intelligence-insulting mess, a crushing failure on almost all counts.
It's like watching the dreckiest of teen puppy courtships trying to pass itself off as 'Annie Hall.
This cockamamy action flick is excruciatingly formulaic -- brimming with spy movie cliches but devoid of the genre's fun, upper-class pretensions.
The film, for all its mayhem and fury, is too distant to be truly disturbing; it treats everything with an impatient, born-too-late shrug.
The story is no more than a thread stitching set pieces of increasing implausibility and ineptitude.
This may work for you if you settle at the outset for a nostalgic, all-American mood piece.
Notorious has a fine time along the way, with Woolard channeling the rapper's sweetness and wit as comfortably as his pathos.
Never manages more than a glib, TV movie-of- the-week glance at their lives.
An old hand at this sort of thing, Pakula goes through the motions, but not much more.
Another soulless, by-the-numbers atte

---

### Import spacy

The spacy package is the current gold standard for parsing text. We are going to use it to find the part of speech tags for the review words. 

Once we have parsed the tags with spacey, we can assign sentiment scores at a more granular level, using the correct part of speech version of the word.

In [72]:
import spacy
en_nlp = spacy.load('en')

In [73]:
txt = en_nlp(rt.qt.values[0])

In [74]:
for x in txt:
    print x.pos_

ADV
ADJ
ADP
NOUN
NOUN
CONJ
NOUN
ADJ
PRON
VERB
VERB
PRON
ADP
DET
NOUN
NOUN
PUNCT
ADJ
NOUN
CONJ
ADV
VERB
VERB
ADP
ADJ
NOUN


In [76]:
token1 = txt[0]
token1

so

In [77]:
str(token1)=='so'

True

In [79]:
type(token1)

spacy.tokens.token.Token

---

### Parse the quotes using spacey's multithreaded parser

In [80]:
parsed_quotes = []
for i, parsed in enumerate(en_nlp.pipe(rt.qt.values, batch_size=50, n_threads=4)):
    if (i%1000)==0:
        print i
    parsed_quotes.append(parsed)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


---

### Create columns for part of speech proportions

For each of the part of speech tags, create a column in the dataset that records the proportion of words in the quote that have that part of speech tag. We can try using these as predictors.

In [84]:
unique_pos =[]
for parsed in parsed_quotes:
    unique_pos.extend([t.pos_ for t in parsed])
unique_pos = np.unique(unique_pos)
print "','".join(unique_pos)

ADJ','ADP','ADV','CONJ','DET','INTJ','NOUN','NUM','PART','PRON','PROPN','PUNCT','SPACE','SYM','VERB','X


In [85]:
useful_grammar = 'ADJ','ADP','ADV','CONJ','DET','INTJ','NOUN','NUM','PART','PRON','PROPN','PUNCT','SPACE','SYM','VERB'



In [86]:
for pos in useful_grammar:
    rt[pos+'_prop'] =0.

In [88]:
rt = rt.reset_index(drop=True)
for i, parsed in enumerate(parsed_quotes):
    if (i%500) ==0:
        print i
    parsed_len = len(parsed)
    for pos in useful_grammar:
        prop = len([x for x in parsed if x.pos_ == pos]) / float(parsed_len)
        rt.ix[i,pos+'_prop'] = prop

0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
7000
7500
8000
8500
9000
9500
10000
10500
11000


---

### Evaluate a model with the new part of speech predictors

---

### Print out the most likely fresh and most likely rotten reviews

Using the predicted probabilities from our model, we can see which reviews are most likely to be fresh or rotten. We can easily validate that our model is doing something that makes sense by looking at these (one of the benefits of doing NLP work!)

---

### Assign sentiment scores using the correct part of speech tag

We need to write another function that will take into account the part of speech tags using the parsed quotes we created earlier and the original sentiment data dictionary.

In [94]:
def scorer(parsed):
    pos_scores, neg_scores, obj_scores =[],[],[]
    for token in [t for t in parsed if t.pos_ in ['NOUN','VERB','ADV','ADJ']]:
        try:
            pos_scores.append(sen_dict[token.pos_][str(token)]['pos_score'])
            neg_scores.append(sen_dict[token.pos_][str(token)]['pos_score'])
            obj_scores.append(1. - (pos_scores[-1] + obj_scores[-1]))
        except:
            pos_scores.append(0.)
            neg_scores.append(0.)
            obj_scores.append(0.)
    return [pos_scores, neg_scores, obj_scores]


In [95]:
scores = map(scorer, parsed_quotes)

In [96]:
rt['pos_part_avg'] = [np.mean(x[0]) for x in scores]
rt['neg_part_avg'] = [np.mean(x[1]) for x in scores]
rt['obj_part_avg'] = [np.mean(x[2]) for x in scores]


In [99]:
rt[[col for col in rt.columns if col.endswith('_avg')]].head(10)

Unnamed: 0,pos_avg,neg_avg,obj_avg,pos_part_avg,neg_part_avg,obj_part_avg
0,0.045647,0.024706,0.929647,0.065343,0.065343,0.388009
1,0.062271,0.021978,0.915751,0.020833,0.020833,0.222222
2,0.057831,0.024271,0.917897,0.078125,0.078125,0.306818
3,0.072688,0.042331,0.884982,0.065972,0.065972,0.363636
4,0.028408,0.021935,0.949657,0.036564,0.036564,0.2587
5,0.119091,0.045225,0.835684,0.097178,0.097178,0.368575
6,0.112158,0.028341,0.859501,0.140578,0.140578,0.356581
7,0.055437,0.019369,0.925193,0.067751,0.067751,0.321453
8,0.025202,0.012446,0.962352,0.082465,0.082465,0.394965
9,0.097777,0.011065,0.891158,0.163904,0.163904,0.396875


---

### Evaluate the new predictors with different models.

Does regularization help? Decision trees?

In [103]:
from sklearn.linear_model import SGDClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import StandardScaler

X = rt[['quote_len'] + [c for c in rt.columns if c.endswith('_avg')] + [c for c in rt.columns if c.endswith('_prop')]] 

In [104]:
X.head(2)

Unnamed: 0,quote_len,pos_avg,neg_avg,obj_avg,pos_part_avg,neg_part_avg,obj_part_avg,ADJ_prop,ADP_prop,ADV_prop,...,INTJ_prop,NOUN_prop,NUM_prop,PART_prop,PRON_prop,PROPN_prop,PUNCT_prop,SPACE_prop,SYM_prop,VERB_prop
0,24,0.045647,0.024706,0.929647,0.065343,0.065343,0.388009,0.153846,0.115385,0.076923,...,0.0,0.269231,0.0,0.0,0.076923,0.0,0.038462,0.0,0.0,0.153846
1,13,0.062271,0.021978,0.915751,0.020833,0.020833,0.222222,0.153846,0.153846,0.0,...,0.0,0.384615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.153846


In [105]:
Xn = StandardScaler().fit_transform(X)

In [108]:
sgd_params = {
    'loss':['log']
    , 'penalty': ['elasticnet']
    , 'alpha':np.logspace(-4,2,75)
    , 'l1_ratio':np.linspace(0.01,1.0,20)
}

sgd_gs = GridSearchCV(SGDClassifier(), sgd_params, cv=5, verbose=1)
sgd_gs.fit(Xn,y)

Fitting 5 folds for each of 1500 candidates, totalling 7500 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.8s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    3.3s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    7.4s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:   13.2s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:   20.7s
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:   29.9s
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed:   40.5s
[Parallel(n_jobs=1)]: Done 3199 tasks       | elapsed:   54.1s
[Parallel(n_jobs=1)]: Done 4049 tasks       | elapsed:  1.2min
[Parallel(n_jobs=1)]: Done 4999 tasks       | elapsed:  1.4min
[Parallel(n_jobs=1)]: Done 6049 tasks       | elapsed:  1.8min
[Parallel(n_jobs=1)]: Done 7199 tasks       | elapsed:  2.2min
[Parallel(n_jobs=1)]: Done 7500 out of 7500 | elapsed:  2.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['elasticnet'], 'loss': ['log'], 'l1_ratio': array([ 0.01   ,  0.06211,  0.11421,  0.16632,  0.21842,  0.27053,
        0.32263,  0.37474,  0.42684,  0.47895,  0.53105,  0.58316,
        0.63526,  0.68737,  0.73947,  0.79158,  0.84368,  0.89579,
        0.94789,  1.     ]), 'alpha': array([  1.00000e-04,   1.20526e-04, ...,   8.29696e+01,   1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [109]:
print sgd_gs.best_score_

0.64752563531


In [110]:
for var, coef in zip(X.columns, sgd_gs.best_estimator_.coef_[0]):
    print var, coef

quote_len 0.0902213767003
pos_avg 0.178256757629
neg_avg -0.273272538607
obj_avg 0.0
pos_part_avg 0.160807152371
neg_part_avg 0.160807152371
obj_part_avg 0.0
ADJ_prop 0.0
ADP_prop 0.0
ADV_prop -0.13516635068
CONJ_prop 0.021514940718
DET_prop -0.0241481810788
INTJ_prop 0.0
NOUN_prop 0.114989536061
NUM_prop 0.0
PART_prop -0.0603798853488
PRON_prop 0.0
PROPN_prop 0.0566200422861
PUNCT_prop -0.0650161194037
SPACE_prop 0.0
SYM_prop 0.0
VERB_prop -0.163907482009
