### Honey Berk - Project 4
### Movie Review Document Classifier
Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

In [115]:
import nltk
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import random
import string
import pandas as pd

### Read in the movie review documents and build a list of the 2,000 most frequent words for the feature extractor (capped at 2,000 to limit processing time)

In [116]:
# Reference: http://www.nltk.org/book/ch06.html

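# A set makes the stop-word membership test fast
stop = set(stopwords.words('english'))

# Each document is a (word list, category) pair; the category ('neg' or 'pos')
# is the first path component of the file id
documents = [([w for w in mr.words(i) if w.lower() not in stop
               and w.lower() not in string.punctuation],
              i.split('/')[0]) for i in mr.fileids()]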

random.shuffle(documents)

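all_words = nltk.FreqDist(w.lower() for w in mr.words())
# most_common() orders words by frequency; slicing list(all_words) does not
# guarantee that order in NLTK 3
word_features = [w for w, _ in all_words.most_common(2000)]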
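
The "top 2,000 words" step depends on ordering the vocabulary by frequency. As a minimal sketch of that idea, the snippet below uses Python's `collections.Counter` (which NLTK 3's `FreqDist` subclasses) on a made-up token list. The `tokens` data and the `top_n_words` helper are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def top_n_words(tokens, n):
    """Return the n most frequent lowercased tokens, most frequent first."""
    counts = Counter(t.lower() for t in tokens)
    # most_common(n) yields (word, count) pairs sorted by descending count
    return [word for word, _ in counts.most_common(n)]

# Invented sample tokens, standing in for mr.words()
tokens = ["The", "plot", "was", "dull", "the", "acting", "was", "dull", "the", "end"]
print(top_n_words(tokens, 3))
```

Note that simply slicing the keys of a counter returns them in insertion order, which is not necessarily frequency order.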

### Define a feature extractor that indicates whether each of the top 2,000 words is present in a document, then train and test the document classifier

In [117]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
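
To make the feature extractor concrete, here is a small sketch of what it returns for a single review. The three-word vocabulary and the sample review are invented for illustration, and `toy_vocab` and `toy_document_features` are hypothetical names that mirror `word_features` and `document_features` above.

```python
# Hypothetical three-word vocabulary, standing in for the 2,000-word list
toy_vocab = ["uplifting", "ugh", "plot"]

def toy_document_features(document, vocab):
    """Map each vocabulary word to True/False depending on its presence."""
    document_words = set(document)  # set lookup is faster than list lookup
    return {'contains({})'.format(w): (w in document_words) for w in vocab}

# Invented tokenized review
review = ["an", "uplifting", "film", "with", "a", "thin", "plot"]
features = toy_document_features(review, toy_vocab)
for name, present in features.items():
    print(name, present)
```

Every review ends up with the same set of keys; words that do not occur are recorded as `False` rather than omitted, so their absence also counts as evidence for the classifier.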

### Check accuracy, show 30 most informative features

In [118]:
print('accuracy:', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(30)

accuracy: 0.73
Most Informative Features
          contains(sans) = True              neg : pos    =      8.3 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.6 : 1.0
     contains(dismissed) = True              pos : neg    =      7.1 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0
         contains(wires) = True              neg : pos    =      6.3 : 1.0
     contains(uplifting) = True              pos : neg    =      6.2 : 1.0
           contains(ugh) = True              neg : pos    =      5.8 : 1.0
       contains(topping) = True              pos : neg    =      5.7 : 1.0
        contains(fabric) = True              pos : neg    =      5.7 : 1.0
   contains(overwhelmed) = True              pos : neg    =      5.7 : 1.0
   contains(nonsensical) = True              neg : pos    =      5.6 : 1.0
      contains(attorney) = True              pos : neg    =      5.6 : 1.0
  contains(effortlessly) = True              pos : neg    =
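
Each ratio in the listing compares how likely a feature value is under one label versus the other. The sketch below works through that kind of calculation by hand, assuming add-0.5 (expected likelihood) smoothing over two possible feature values, which is my understanding of NLTK's default estimator. The counts are invented, and `smoothed_prob` and `likelihood_ratio` are hypothetical helpers, not NLTK functions.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive (Lidstone) smoothing."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the smoothed probabilities under label a versus label b."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Invented counts: a word seen in 12 of 950 negative training reviews
# and in 1 of 950 positive training reviews
ratio = likelihood_ratio(12, 950, 1, 950)
print("neg : pos = {:.1f} : 1.0".format(ratio))
```

Under these assumptions, a word that is rare overall can still earn a large ratio if nearly all of its occurrences fall in one class, which helps explain why uncommon words and proper names show up in the list.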

### Sampling of explanations as to why features are informative (classifier1, 73.0% accuracy)

Word | Sentiment | Explanation
-------------|-------------|------------
sans | neg | French for "without"; perhaps refers to a missing positive quality (e.g., "sans good acting")
mediocrity | neg | No surprise here: mediocrity is not a desirable quality in a movie
uplifting | pos | Also not a surprise: uplifting movies make people feel good
overwhelmed | pos | A surprise; it is not clear why this word would signal a positive review
ugh | neg | My favorite review word!
Bruckheimer | neg | After looking up Jerry Bruckheimer's films, this made sense: his early films were good, his more recent ones bad

### Alternate document classifier (negative vs. positive words)

In [119]:
# Reference: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/

def bag_of_words(words):
    # Mark every word in a review as present; the name avoids shadowing
    # the word_features list defined earlier
    return {word: True for word in words}

negrvws = mr.fileids('neg')
posrvws = mr.fileids('pos')

negfeatures = [(bag_of_words(mr.words(fileids=[f])), 'neg') for f in negrvws]
posfeatures = [(bag_of_words(mr.words(fileids=[f])), 'pos') for f in posrvws]

# Integer division keeps the limits usable as slice indices.
# Note the asymmetry: 90% of the negative reviews but only 10% of the
# positive reviews go into the training set.
neglimit = len(negfeatures) * 90 // 100
poslimit = len(posfeatures) * 10 // 100

train_set2 = negfeatures[:neglimit] + posfeatures[:poslimit]
test_set2 = negfeatures[neglimit:] + posfeatures[poslimit:]

classifier2 = NaiveBayesClassifier.train(train_set2)
print('accuracy:', nltk.classify.util.accuracy(classifier2, test_set2))
classifier2.show_most_informative_features(30)

accuracy: 0.912
Most Informative Features
                  evokes = True              pos : neg    =     20.8 : 1.0
                   russo = True              pos : neg    =     20.8 : 1.0
                unshaven = True              pos : neg    =     20.8 : 1.0
                   pesci = True              pos : neg    =     20.8 : 1.0
                  denial = True              pos : neg    =     20.8 : 1.0
                   deeds = True              pos : neg    =     20.8 : 1.0
                  cheeky = True              pos : neg    =     20.8 : 1.0
             particulars = True              pos : neg    =     20.8 : 1.0
                 existed = True              pos : neg    =     20.8 : 1.0
                   tracy = True              pos : neg    =     20.8 : 1.0
                   fiona = True              pos : neg    =     16.1 : 1.0
                 michele = True              pos : neg    =     14.9 : 1.0
                    lore = True              pos : neg    
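
The slice limits in the cell above decide how many reviews of each class are used for training (90/100 of the negative reviews, 10/100 of the positive ones). As a sketch only, the hypothetical helper `split_by_fraction` below (not from the notebook or the referenced post) shows one way to apply the same training fraction to both classes with integer limits. The placeholder lists stand in for `negfeatures` and `posfeatures`.

```python
def split_by_fraction(items, train_fraction):
    """Split a list into (train, test) at an integer index."""
    limit = int(len(items) * train_fraction)  # int() so it can be used in a slice
    return items[:limit], items[limit:]

# Placeholder labeled items, standing in for the real feature sets
neg_items = [("neg_review_{}".format(i), "neg") for i in range(20)]
pos_items = [("pos_review_{}".format(i), "pos") for i in range(20)]

# Use the same fraction for both classes so each is equally represented
neg_train, neg_test = split_by_fraction(neg_items, 0.75)
pos_train, pos_test = split_by_fraction(pos_items, 0.75)

train = neg_train + pos_train
test = neg_test + pos_test
print(len(train), len(test))
```

With equal fractions, both the training and the test set contain the two classes in the same proportion, which makes the accuracy figure and the informative-feature ratios easier to interpret.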

### Sampling of explanations as to why features are informative (classifier2, with 91.2% accuracy)

Word | Sentiment | Explanation
-------------|-------------|------------
evokes | pos | A word used in connection with strong emotion and real-life memories
russo | pos | Model-turned-actress; a beautiful woman is a box-office draw
unshaven | pos | Stereotypically, unshaven men are often considered wild, dangerous, and sexy (a box-office draw)
bankable | pos | Being bankable is a positive quality
binks | pos | Surprising, since Jar Jar Binks was a despised character in Star Wars: Episode I - The Phantom Menace
banned | pos | Possibly surprising, unless it refers to movies that are banned in other countries or are no longer banned in the US
