## Max Wagner
### Data 620 - Week 12 - Project 4 - Q4
***

In [1]:
import nltk, random, sklearn
from nltk.corpus import movie_reviews
import pandas as pd

### Pull Out Words, Create a Naive Bayes Classifier

Most of this week's work comes straight from the course's reading material.

In [10]:
random.seed(65)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [11]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
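# Note: this takes the first 2000 keys, which are not guaranteed to be the
# 2000 most frequent words; all_words.most_common(2000) would select by frequency.
word_features = list(all_words)[:2000]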
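
As far as I can tell, `FreqDist` in NLTK 3 subclasses `collections.Counter`, so slicing its keys does not necessarily return the most frequent words. Here is a minimal sketch of the difference. The plain `Counter` and the made-up token list are stand-ins for illustration, not the notebook's data:

```python
from collections import Counter

# Hypothetical toy tokens standing in for movie_reviews.words()
tokens = ["maxwell", "the", "film", "the", "good", "the", "film", "bad"]
counts = Counter(tokens)

# First N keys: the order is not by frequency
# (it is insertion order on Python 3.7+, arbitrary on older versions)
first_two = list(counts)[:2]

# Top N by frequency: most_common sorts by descending count
top_two = [word for word, _ in counts.most_common(2)]

print(first_two)
print(top_two)
```

If the feature list is built from arbitrary keys, rare words and proper names can end up as features, which may explain some of the oddities in the results further down.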

In [12]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

In [13]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

***
### Test the Classifier's Accuracy and Print Features

In [18]:
print("Accuracy: " + str(nltk.classify.accuracy(classifier, test_set)))

Accuracy: 0.68


In [15]:
classifier.show_most_informative_features(30)

Most Informative Features
        contains(doubts) = True              pos : neg    =      9.6 : 1.0
          contains(sans) = True              neg : pos    =      8.4 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.0 : 1.0
     contains(dismissed) = True              pos : neg    =      7.0 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0
     contains(uplifting) = True              pos : neg    =      6.1 : 1.0
           contains(ugh) = True              neg : pos    =      5.8 : 1.0
     contains(sickening) = True              neg : pos    =      5.7 : 1.0
   contains(overwhelmed) = True              pos : neg    =      5.7 : 1.0
       contains(topping) = True              pos : neg    =      5.7 : 1.0
          contains(wits) = True              pos : neg    =      5.7 : 1.0
          contains(lang) = True              pos : neg    =      5.7 : 1.0
         contains(wires) = True              neg : pos    =      5.0 : 1.0

### Feature Explanations

The basic explanation is that certain words carry a positive or negative connotation in addition to their literal denotation. People are more likely to use words they associate with negativity in a negative review, and likewise words they associate with positivity in a positive one.

An additional trouble with feature-based classifiers is that they depend heavily on a fairly small set of training data. If that data contains some very specific words, they can mislead the classifier. For example, the word **maxwell** came out at 4.3:1 in favor of negative reviews. This could be due to a small number of reviews speaking badly of a character or author by that name.
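
To see how a handful of reviews can produce a ratio like that, here is a rough sketch of a smoothed likelihood ratio. It assumes add-0.5 smoothing (expected likelihood estimation), which I understand to be the default in NLTK's Naive Bayes trainer. The function name, the bin count, and the counts are all hypothetical, since the real counts for **maxwell** are not shown in the output:

```python
def smoothed_ratio(count_a, total_a, count_b, total_b, gamma=0.5, bins=2):
    """Ratio of smoothed P(word present | a) to P(word present | b)."""
    p_a = (count_a + gamma) / (total_a + gamma * bins)
    p_b = (count_b + gamma) / (total_b + gamma * bins)
    return p_a / p_b

# Hypothetical counts: the word appears in 6 of 950 negative training
# reviews and in 1 of 950 positive ones.
ratio = smoothed_ratio(6, 950, 1, 950)
print("neg : pos = %.1f : 1.0" % ratio)
```

With these made-up counts the ratio comes out at roughly 4.3 to 1, even though only seven reviews contain the word at all.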

An accuracy of 68% is not great, but it is about 18 percentage points better than the 50% expected from guessing between two equally common categories. A more comprehensive model that takes sentence structure into account would probably increase the accuracy considerably.
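
A quick back-of-the-envelope check on that number, as a sketch only. It assumes a 50% baseline (the `movie_reviews` corpus has 1000 positive and 1000 negative reviews, though the shuffled 100-review test set may not be exactly balanced) and uses the normal approximation for a proportion:

```python
import math

accuracy = 0.68   # reported above
n_test = 100      # size of test_set (featuresets[:100])
baseline = 0.50   # assumption: roughly balanced classes

# Normal-approximation standard error of a proportion
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

print("gain over baseline: %.2f" % (accuracy - baseline))
print("approx. 95%% interval: %.2f to %.2f" % (low, high))
```

With only 100 test reviews the interval is wide, so the 68% figure should be read as a rough estimate rather than a precise measurement.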

#### A few other oddities:
- *mediocrity* was surprisingly negative for a word that essentially means neutral
- *wires* was also more negative than I expected, possibly due to phrases like "the wires were showing"
- *dismissed* was very positive, even though I figured it could have been used in either category
- the sheer number of names in the top 30 made me wonder how effective this method really is