### City University of New York | IS620 | Web Analytics

# Project 4: Bayesian Movie Review Document Classifier
---
### Team
+ Robert Sellers

### References
+ [Natural Language Processing With Python](http://www.nltk.org/book/)
+ [Text Classification for Sentiment Analysis – Naive Bayes Classifier](http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/)
+ [Text Classification for Sentiment Analysis – Eliminate Low Information Features](http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/)

## Required libraries

+ collections
+ itertools
+ nltk

In [1]:
import collections
import itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.metrics import precision, recall

## Testing and training the Naive Bayes Classifier

**The classifier is evaluated with accuracy, precision, and recall. Higher precision means fewer false positives; higher recall means fewer false negatives.**
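
As a rough illustration of these two definitions, the sketch below computes precision and recall for one label from two sets of item indices, the same bookkeeping that `refsets` and `testsets` do in the classifier function. The helper name `precision_recall` and the toy index sets are assumptions made for this example; they are not part of NLTK.

```python
def precision_recall(reference, predicted):
    """Return (precision, recall) for a single label.

    reference: set of item indices whose true label is this label
    predicted: set of item indices the classifier assigned to this label
    """
    true_positives = len(reference & predicted)
    # Precision: share of predicted items that are correct (penalises false positives)
    precision = float(true_positives) / len(predicted) if predicted else None
    # Recall: share of true items that were found (penalises false negatives)
    recall = float(true_positives) / len(reference) if reference else None
    return precision, recall


# Hypothetical toy data: items 0-4 are truly 'pos';
# the classifier labelled items 0-3, 7 and 8 as 'pos'.
reference_pos = {0, 1, 2, 3, 4}
predicted_pos = {0, 1, 2, 3, 7, 8}

pos_precision, pos_recall = precision_recall(reference_pos, predicted_pos)
print("pos precision: %s" % pos_precision)
print("pos recall: %s" % pos_recall)
```

Here items 7 and 8 are false positives, which lower precision, and item 4 is a false negative, which lowers recall.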

In [2]:
def movie_classifier(featx,output):
    
    #Retrieving the positive and negative reviews
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')
    
    # define the split of % training / % test
    split = 0.8

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 
                 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 
                 'pos') for f in posids]
     
    cutoff = int(len(posfeats) * split)

    trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
    testfeats = negfeats[cutoff:] + posfeats[cutoff:]
 
    #Defining the classifier
    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
 
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)
            
    # Pass output=True to print the metrics and the most informative features
    if output:
        print('Train on %d instances\nTest on %d instances' % (len(trainfeats), len(testfeats)))
        print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
        print('pos precision: %s' % precision(refsets['pos'], testsets['pos']))
        print('pos recall: %s' % recall(refsets['pos'], testsets['pos']))
        print('neg precision: %s' % precision(refsets['neg'], testsets['neg']))
        print('neg recall: %s' % recall(refsets['neg'], testsets['neg']))
        classifier.show_most_informative_features(30)
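
The split logic in the function above can be sketched in isolation. This is a minimal stand-in, assuming 1000 items per class (the size of each `movie_reviews` category); `split_by_class` and the placeholder items are hypothetical names used only for this illustration.

```python
def split_by_class(neg_items, pos_items, split=0.8):
    """Take the first `split` share of each class for training, the rest for testing."""
    # As in the notebook, the cutoff is derived from the positive class only,
    # which assumes both classes have the same number of items.
    cutoff = int(len(pos_items) * split)
    train = neg_items[:cutoff] + pos_items[:cutoff]
    test = neg_items[cutoff:] + pos_items[cutoff:]
    return train, test


# Placeholder items standing in for (features, label) pairs
neg_items = [("neg", i) for i in range(1000)]
pos_items = [("pos", i) for i in range(1000)]

train, test = split_by_class(neg_items, pos_items)
print("Train on %d instances" % len(train))
print("Test on %d instances" % len(test))
```

Because the items are taken in file order and not shuffled, the metrics come from a single fixed hold-out set that contains equal numbers of positive and negative reviews.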


## Scoring and Word Frequencies

In [3]:
word_fd = nltk.probability.FreqDist()
label_word_fd = nltk.probability.ConditionalFreqDist()

### Counting negatives and positives

In [4]:
# Count word occurrences in the positive reviews
for word in movie_reviews.words(categories=['pos']):
    word_fd[word.lower()] += 1
    label_word_fd['pos'][word.lower()] += 1

# Count word occurrences in the negative reviews
for word in movie_reviews.words(categories=['neg']):
    word_fd[word.lower()] += 1
    label_word_fd['neg'][word.lower()] += 1

### Calculating Totals

In [5]:
pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

In [6]:
word_scores = {}

# Score each word by summing its chi-square association with the 'pos' and 'neg' labels
for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score
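
To make the scoring step more concrete, here is a plain-Python sketch of the Pearson chi-square statistic for a 2x2 contingency table, taking the same four counts that are passed to `BigramAssocMeasures.chi_sq` above. It is my reading of what that call computes, not NLTK's own code; the function name `chi_sq_2x2` and the toy counts are assumptions.

```python
def chi_sq_2x2(n_ii, n_ix, n_xi, n_xx):
    """Pearson chi-square for a 2x2 table, given one cell and the marginals.

    n_ii: count of the word inside the label
    n_ix: count of the word across all labels
    n_xi: count of all words inside the label
    n_xx: count of all words across all labels
    """
    n_io = n_ix - n_ii                # this word, other label
    n_oi = n_xi - n_ii                # other words, this label
    n_oo = n_xx - n_ii - n_io - n_oi  # other words, other label
    numerator = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return float(numerator) / denominator


# Hypothetical counts: 1000 'pos' words, 1000 'neg' words, and a word seen 100 times.
pos_score = chi_sq_2x2(90, 100, 1000, 2000)   # 90 of its 100 uses are in 'pos'
neg_score = chi_sq_2x2(10, 100, 1000, 2000)   # the remaining 10 are in 'neg'
even_score = chi_sq_2x2(50, 100, 1000, 2000)  # a word split evenly between labels

print("skewed word: %.2f" % (pos_score + neg_score))
print("even word: %.2f" % even_score)
```

A word used equally often under both labels scores zero, while a word concentrated in one label scores high. With only two labels, the 'pos' and 'neg' tables are mirror images of each other, so in this sketch the two scores are equal and summing them simply doubles the value without changing the ranking.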

**Adding the top 10000 words as features**

In [7]:
best = sorted(word_scores.items(),
              key=lambda ws: ws[1], reverse=True)[:10000]

bestwords = set([w for w, s in best])

def best_word_feats(words):
    return dict([(word, True) for word in words if word in bestwords])
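
The selection step can be tried on a handful of words. The sketch below uses made-up scores and a cutoff of 2 instead of 10000; `toy_scores`, `select_best_words`, and `word_feats` are hypothetical names for this illustration only.

```python
# Made-up scores, not values computed from the corpus
toy_scores = {"outstanding": 40.0, "ludicrous": 35.5, "the": 0.2, "movie": 0.4}


def select_best_words(scores, n):
    """Return the n highest-scoring words as a set."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return set(word for word, score in ranked[:n])


def word_feats(words, vocabulary):
    """Keep only the words that belong to the selected vocabulary."""
    return dict((word, True) for word in words if word in vocabulary)


toy_bestwords = select_best_words(toy_scores, 2)
features = word_feats(["the", "movie", "is", "outstanding"], toy_bestwords)
print(features)
```

Low-scoring words such as "the" and "movie" never reach the classifier, which is the point of eliminating low-information features.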

**Including the top 200 bigrams**

In [8]:
def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))
    return d
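
The shape of the combined feature dictionary may be easier to see on a toy sentence. This sketch ranks bigrams by raw frequency, which is a simplified stand-in for the chi-square ranking done by `BigramCollocationFinder.nbest` above; `toy_bigram_word_feats` and the sample words are assumptions for illustration.

```python
import collections


def toy_bigram_word_feats(words, vocabulary, n=1):
    """Combine the n most frequent bigrams with the selected single words."""
    # Adjacent word pairs; `words` must be a list so it can be sliced
    bigram_counts = collections.Counter(zip(words, words[1:]))
    top_bigrams = [bigram for bigram, count in bigram_counts.most_common(n)]
    feats = dict((bigram, True) for bigram in top_bigrams)
    feats.update((word, True) for word in words if word in vocabulary)
    return feats


words = ["matt", "damon", "is", "outstanding", "and",
         "matt", "damon", "avoids", "cliches"]
feats = toy_bigram_word_feats(words, {"outstanding", "avoids"}, n=1)
print(feats)
```

The dictionary mixes tuple keys (bigrams) with string keys (single words), which is why tuples such as `(u'matt', u'damon')` can appear alongside plain words in the list of most informative features.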

## Results
+ Scoring metrics: accuracy, precision and recall
+ The 30 most informative features

In [9]:
movie_classifier(best_bigram_word_feats,True)

Train on 1600 instances
Test on 400 instances
accuracy: 0.9225
pos precision: 0.912195121951
pos recall: 0.935
neg precision: 0.933333333333
neg recall: 0.91
Most Informative Features
             outstanding = True              pos : neg    =     13.9 : 1.0
               insulting = True              neg : pos    =     13.7 : 1.0
              vulnerable = True              pos : neg    =     13.0 : 1.0
               ludicrous = True              neg : pos    =     12.6 : 1.0
             uninvolving = True              neg : pos    =     12.3 : 1.0
     (u'matt', u'damon') = True              pos : neg    =     12.3 : 1.0
        (u'give', u'us') = True              neg : pos    =     12.3 : 1.0
   (u'saving', u'grace') = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
              astounding = True              pos : neg    =     11.7 : 1.0
             fascination = True              pos : neg    =     11

## Discussion
**Let us analyze the precision and recall. High recall means few false negatives, and high precision means few false positives; here both are above 0.91 for the positive and the negative class. The accuracy of 0.9225 is consistent with these figures.**
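
The reported figures can be cross-checked against each other. The sketch below rebuilds the confusion counts from the two recall values in the Results section, assuming the 400 test instances split into 200 positive and 200 negative reviews (which follows from the 80/20 split of 1000 reviews per class), and then recomputes accuracy and precision from those counts.

```python
test_per_class = 200   # assumed: 20% of 1000 reviews in each class
pos_recall = 0.935     # from the Results section
neg_recall = 0.91      # from the Results section

true_pos = int(round(pos_recall * test_per_class))  # 187 positive reviews labelled 'pos'
true_neg = int(round(neg_recall * test_per_class))  # 182 negative reviews labelled 'neg'
false_neg = test_per_class - true_pos               # 13 positive reviews labelled 'neg'
false_pos = test_per_class - true_neg               # 18 negative reviews labelled 'pos'

accuracy = float(true_pos + true_neg) / (2 * test_per_class)
pos_precision = float(true_pos) / (true_pos + false_pos)
neg_precision = float(true_neg) / (true_neg + false_neg)

print("accuracy: %.4f" % accuracy)
print("pos precision: %.4f" % pos_precision)
print("neg precision: %.4f" % neg_precision)
```

The recomputed values agree with the reported accuracy and precision, so the five metrics describe a single confusion matrix: 31 of the 400 test reviews were misclassified.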

**The most informative features are primarily adjectives, with a scattering of fairly diverse vocabulary describing film in various ways. We recognize that Matt Damon is a popular film actor and that Steven Seagal is generally a terrible actor. These make sense; however, it is definitely surprising to see them float to the top of this list. We also see what I would call "charged" or "leading" words and phrases that are typically associated with the emotional state of a critic. Take, for example, "avoids": the director or production "avoids" doing something terrible. Or "does so", which I consider to be an inverse of "with that said". The remaining features are either unclear or self-explanatory to anyone who has listened to or read a substantial number of film reviews.**