# Project 4: Book Problem

For this project, we were asked to solve one of the given problems from the book. I chose problem four, which asks us to find classifications within a set of movie reviews.

In [1]:
import nltk
import sklearn as sk
from nltk.corpus import movie_reviews
import pandas as pd

I start by pulling the words from the `nltk` movie reviews corpus, then compute their frequency distribution (lowercasing each word first) and take the distribution's keys as the word features.

In [18]:
words = movie_reviews.words()
distribution = nltk.FreqDist(w.lower() for w in words)
word_features = list(distribution.keys())
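To make the counting step concrete, here is a minimal sketch on a toy token list. It uses `collections.Counter` from the standard library as a stand-in, since `nltk.FreqDist` in NLTK 3 is built on `Counter`. The `tokens` list is made-up data, not part of the corpus.

```python
from collections import Counter

# Toy token list standing in for movie_reviews.words() (hypothetical data).
tokens = ["A", "great", "film", "a", "GREAT", "cast", "a", "dull", "plot"]

# Lowercase each token before counting, as in the cell above.
distribution = Counter(w.lower() for w in tokens)

# The keys are the distinct lowercased tokens.
word_features = list(distribution.keys())

print(distribution.most_common(2))  # [('a', 3), ('great', 2)]
print(len(word_features))           # 6
```

Taking every key means the feature list is as large as the whole vocabulary. If that turns out to be too slow or memory-hungry, `most_common(n)` is one way to keep only the `n` most frequent words.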

Once we have our features, we can pair each review's list of words with its classification:

In [19]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
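Because the comprehension walks through the categories one at a time, `documents` comes out grouped by label, so a small slice taken from the front can end up holding only one category. A minimal sketch of shuffling before any split is below. The toy `documents` list and the seed value are assumptions for illustration only.

```python
import random

# Toy stand-in for the (tokens, category) pairs built above (hypothetical data).
documents = [(["bad", "plot"], "neg"), (["dull"], "neg"),
             (["great", "cast"], "pos"), (["fun"], "pos")]

# A seeded generator keeps the shuffle reproducible; the seed is arbitrary.
rng = random.Random(0)
shuffled = documents[:]  # copy so the original order is left untouched
rng.shuffle(shuffled)

# Labels are now interleaved rather than grouped by category.
print([category for _, category in shuffled])
```

Shuffling a copy rather than the list itself is just a choice made here so the grouped and shuffled orders can be compared side by side.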

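Next, I create a feature set for the dataset, mapping each review to a dictionary that records, for every word feature, whether the review contains it: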

In [14]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
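To show what one of these feature dictionaries looks like, here is a small sketch that applies the same idea to a three-word vocabulary. The vocabulary and the sample review are hypothetical, chosen only to keep the output short.

```python
# Hypothetical three-word vocabulary standing in for word_features.
word_features = ["great", "dull", "plot"]

def document_features(document):
    # A set gives fast membership tests and ignores repeated words.
    document_words = set(document)
    return {'contains({})'.format(word): (word in document_words)
            for word in word_features}

features = document_features(["a", "great", "plot", "great"])
print(features)
# {'contains(great)': True, 'contains(dull)': False, 'contains(plot)': True}
```

Each feature is a boolean, so a word that appears twice in a review counts the same as a word that appears once.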

Once I have my feature set, I hold out the first 100 entries as the test dataset and use the rest as the training dataset. After that, a Naive Bayes classifier is trained, and we finish by showing the 30 most informative features.

In [22]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(30)

Most Informative Features
    contains(recognizes) = True              pos : neg    =      8.1 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.8 : 1.0
    contains(schumacher) = True              neg : pos    =      7.8 : 1.0
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
     contains(atrocious) = True              neg : pos    =      6.4 : 1.0
        contains(shoddy) = True              neg : pos    =      6.3 : 1.0
          contains(mena) = True              neg : pos    =      6.3 : 1.0
        contains(suvari) = True              neg : pos    =      6.3 : 1.0
         contains(kudos) = True              pos : neg    =      5.9 : 1.0
        contains(wasted) = True              neg : pos    =      5.6 : 1.0
        contains(justin) = True              neg : pos    =      5.6 : 1.0
        contains(canyon) = True              neg : pos    =      5.6 : 1.0
  contains(surveillance) = True              neg : pos    =      5.6 : 1.0
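Each row above reports how much more likely a feature value is under one label than under the other. The sketch below shows how a ratio of that kind can arise from simple counts. The counts are invented, and the add-0.5 smoothing over two feature values is my assumption about how the probabilities are estimated, so treat this as an illustration and not as NLTK's actual code.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Lidstone-smoothed estimate of P(feature value | label)."""
    return (count + gamma) / (total + bins * gamma)

# Hypothetical counts: reviews per label that contain some word.
pos_with_word, pos_total = 12, 950
neg_with_word, neg_total = 1, 950

p_pos = smoothed_prob(pos_with_word, pos_total)
p_neg = smoothed_prob(neg_with_word, neg_total)

# Likelihood ratio between the two labels for this feature value.
ratio = p_pos / p_neg
print("pos : neg = {:.1f} : 1.0".format(ratio))  # pos : neg = 8.3 : 1.0
```

A large ratio can come from a word that appears in only a handful of reviews, as long as nearly all of them carry the same label.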

## Conclusion
The features that surprised me were `stellan`, `bronson`, and `turkey`. I would not have expected those words to be indicators, as I assumed they would be sparsely used. On the other hand, I wasn't surprised that words such as `unimaginative`, `fluffy`, and `unfunny` made the list.