# DATA620 Project 4
## Justin Hink

### Introduction
In this project we build a sentiment analysis model with nltk and use it to classify the movie reviews in the movie_reviews corpus (which ships with nltk).

#### Step 1: Load and Transform the Data

We load the corpus, convert each review into a dictionary of word-presence features, and split the result into training (90%) and test (10%) sets. This is similar to what we did in group project 3.
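The feature representation built in this step records, for each vocabulary word, whether the review contains it. The following is a minimal sketch on a made-up vocabulary and document; `toy_vocab` and `toy_doc` are illustrative names and values, not taken from the corpus.

```python
def convert_to_features(document, wfs):
    """Return {'contains(word)': bool} for every word in the vocabulary wfs."""
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words) for w in wfs}


# Hypothetical toy inputs, for illustration only.
toy_vocab = ['good', 'bad', 'plot']
toy_doc = ['a', 'good', 'plot', 'with', 'good', 'acting']

toy_features = convert_to_features(toy_doc, toy_vocab)
print(toy_features)
```

Repeated words make no difference here: the encoding only records presence, not counts.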

In [2]:
import nltk
from nltk.corpus import movie_reviews

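# Count every (lowercased) token in the corpus and keep 10,000 words as the vocabulary.
words = movie_reviews.words()
all_words = nltk.FreqDist(w.lower() for w in words)
# list(...) keeps the slice valid on Python 3, where keys() is not sliceable.
# On NLTK 3 the keys are not ordered by frequency; all_words.most_common(10000)
# selects the most frequent words explicitly.
word_features = list(all_words.keys())[:10000]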

tokens = [(list(movie_reviews.words(fileid)), category)
          for category in movie_reviews.categories()
          for fileid in movie_reviews.fileids(category)]

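def convert_to_features(document, wfs):
    """Map a tokenized review to {'contains(word)': bool} for each word in wfs."""
    document_words = set(document)  # a set makes each membership test cheap
    features = {}
    for w in wfs:
        features['contains({})'.format(w)] = (w in document_words)
    return features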

feature_sets = [(convert_to_features(d, word_features), c) for (d, c) in tokens]

# Hold out the first 10% of the feature sets for testing; train on the remaining 90%.
N = int(len(feature_sets) * 0.1)
train_data, test_data = feature_sets[N:], feature_sets[:N]
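One caveat about the split: because `tokens` is built one category at a time, the first 10% of `feature_sets` comes from a single category, so the test set is not representative of both classes. A common remedy is to shuffle before splitting. The sketch below shows the idea on a toy list; `shuffled_split`, `toy_items`, and the fixed seed are assumptions made for illustration, not part of the project code.

```python
import random


def shuffled_split(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of items, then return (train, test)."""
    shuffled = list(items)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for feature_sets: 1000 'neg' items followed by 1000 'pos' items.
toy_items = ([('review %d' % i, 'neg') for i in range(1000)] +
             [('review %d' % i, 'pos') for i in range(1000, 2000)])

toy_train, toy_test = shuffled_split(toy_items)
print(len(toy_train), len(toy_test))
```

Applied to the real data, this would be `train_data, test_data = shuffled_split(feature_sets)`; the accuracy reported in Step 3 was not computed this way.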

#### Step 2: Train the Model on the Training Data

In [3]:
model = nltk.NaiveBayesClassifier.train(train_data)
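To give a sense of what a naive Bayes classifier does with boolean features like ours, here is a from-scratch sketch. It is a simplification written for illustration, not NLTK's implementation: the function names are hypothetical, and it uses add-one smoothing, whereas NLTK's estimator differs in detail.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets):
    """Count labels and, per label, how often each feature is True."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)  # label -> feature name -> count of True
    feature_names = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts, feature_names


def classify(model, features, smoothing=1.0):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, true_counts, feature_names = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label in sorted(label_counts):
        n = label_counts[label]
        score = math.log(n / total)  # log prior
        for name in feature_names:
            # Smoothed estimate of P(feature is True | label).
            p_true = (true_counts[label][name] + smoothing) / (n + 2 * smoothing)
            p = p_true if features.get(name, False) else 1.0 - p_true
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data in the same shape as feature_sets.
toy_train = [
    ({'contains(great)': True, 'contains(dull)': False}, 'pos'),
    ({'contains(great)': True, 'contains(dull)': False}, 'pos'),
    ({'contains(great)': False, 'contains(dull)': True}, 'neg'),
    ({'contains(great)': False, 'contains(dull)': True}, 'neg'),
]
toy_model = train_naive_bayes(toy_train)
prediction = classify(toy_model, {'contains(great)': True, 'contains(dull)': False})
print(prediction)
```

The "naive" part is the inner loop: each feature contributes independently to the score, given the label.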

#### Step 3: Evaluate the Model Out of Sample Against the Test Set

As shown below, the model correctly classifies 77% of the movie reviews in the test set.

In [4]:
# Parenthesized form prints the same text on Python 2 and Python 3.
print('Accuracy: %4.2f' % nltk.classify.accuracy(model, test_data))

Accuracy: 0.77
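The accuracy figure is the fraction of test reviews whose predicted label matches the true label. The sketch below computes that fraction, plus a per-label breakdown, on made-up labels; the function names and the toy lists are illustrative assumptions, not output from our model.

```python
from collections import Counter


def accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError('label lists must have the same length')
    correct = sum(1 for g, p in zip(gold_labels, predicted_labels) if g == p)
    return correct / len(gold_labels)


def per_label_accuracy(gold_labels, predicted_labels):
    """Accuracy computed separately for each gold label."""
    totals = Counter(gold_labels)
    hits = Counter(g for g, p in zip(gold_labels, predicted_labels) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical labels, for illustration only.
toy_gold = ['neg', 'neg', 'neg', 'pos']
toy_pred = ['neg', 'pos', 'neg', 'pos']

toy_accuracy = accuracy(toy_gold, toy_pred)
toy_breakdown = per_label_accuracy(toy_gold, toy_pred)
print(toy_accuracy, toy_breakdown)
```

A per-label breakdown is worth checking whenever the test set is not evenly split between categories.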


#### Step 4: Display the 30 Most Informative Features

In [5]:
model.show_most_informative_features(30)

Most Informative Features
     contains(insulting) = True              neg : pos    =     11.2 : 1.0
          contains(sans) = True              neg : pos    =     10.4 : 1.0
         contains(sucks) = True              neg : pos    =     10.2 : 1.0
       contains(wasting) = True              neg : pos    =      7.9 : 1.0
       contains(admired) = True              pos : neg    =      7.7 : 1.0
          contains(deft) = True              pos : neg    =      7.2 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.1 : 1.0
        contains(suvari) = True              neg : pos    =      7.1 : 1.0
         contains(wires) = True              neg : pos    =      7.1 : 1.0
        contains(turkey) = True              neg : pos    =      7.1 : 1.0
      contains(marginal) = True              neg : pos    =      7.1 : 1.0
  contains(refreshingly) = True              pos : neg    =      6.7 : 1.0
          contains(scum) = True              pos : neg    =      6.7 : 1.0
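Each row reads as a likelihood ratio: `neg : pos = 11.2 : 1.0` for `contains(insulting) = True` means the model estimates that feature value to be about 11 times as likely in a negative review as in a positive one. The sketch below shows how such a ratio can arise from document counts. The smoothing constant (adding 0.5 to each count) is my understanding of NLTK's default estimator and should be treated as an assumption, and the counts are invented.

```python
def smoothed_probability(count, total, gamma=0.5, bins=2):
    """Smoothed estimate of P(feature value | label); bins = number of feature values."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b, gamma=0.5):
    """How many times more likely the feature value is under label a than label b."""
    return (smoothed_probability(count_a, total_a, gamma) /
            smoothed_probability(count_b, total_b, gamma))


# Hypothetical counts: a word found in 7 of 100 'neg' reviews and 0 of 100 'pos' reviews.
toy_ratio = likelihood_ratio(7, 100, 0, 100)
print('neg : pos = %.1f : 1.0' % toy_ratio)
```

With these toy counts the ratio is about 15 : 1 even though the word appears in only seven documents, which is why rare words can rank among the most informative features.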

## Comments on Results

Most of the negative/positive associations pass the sniff test. I would break them down as follows:

### Intuitive Results

##### Negative Classification

- sucks
- insulting
- sans
- mediocrity
- stinks
- wasting
- jokey
- ineptitude
- unbearable
- turkey
- marginal
- amateur

##### Positive Classification

- beings
- wonderfully
- uplifting
- innocence
- admired
- deft
- qui
- refreshingly

### Non-Intuitive Results

##### Negative Classification

- peel
- suvari
- wires
- goldsman
- tremors
- bruckheimer
- noises

##### Positive Classification

- prior
- nigel
- hugo
- tucker
- symbols
- scum
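Several of the non-intuitive features look like proper names (suvari, goldsman, bruckheimer, nigel, hugo, tucker), which suggests they may be tied to a handful of films rather than to sentiment. One way to check is to count how many reviews in each category contain a given word. The sketch below does this on a toy list shaped like `tokens`; `documents_containing` and `toy_tokens` are hypothetical names introduced for illustration.

```python
from collections import Counter


def documents_containing(labeled_documents, word):
    """Count, per category, how many documents contain word at least once."""
    counts = Counter()
    for document_words, category in labeled_documents:
        if word in document_words:
            counts[category] += 1
    return counts


# Toy stand-in for tokens: (list of words, category) pairs.
toy_tokens = [
    (['the', 'wires', 'were', 'visible'], 'neg'),
    (['visible', 'wires', 'again'], 'neg'),
    (['a', 'deft', 'script'], 'pos'),
]

wires_counts = documents_containing(toy_tokens, 'wires')
print(dict(wires_counts))
```

With the real data, `documents_containing(tokens, 'suvari')` would show how many reviews in each category mention the word.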