# IS620 - Project 4
## Igor Balagula

Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

In [12]:
import nltk
import random
from nltk.corpus import movie_reviews

Create a list of (words, category) pairs, one per movie review, where words is the list of all words in that review and category is its label (pos or neg)

In [13]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

Shuffle the documents so that the reviews are no longer grouped by category

In [3]:
random.shuffle(documents)
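Because `random.shuffle` produces a different order on every run, the train/test split and the reported accuracy will vary between runs. A minimal sketch of a repeatable shuffle, assuming a fixed seed is acceptable for this exercise; the seed value 620, the helper `shuffled_copy`, and the toy `docs` list are illustrative and not part of the original notebook:

```python
import random

# Toy stand-in for the (words, category) pairs built above.
docs = [(["good", "film"], "pos"), (["bad", "plot"], "neg"),
        (["great", "cast"], "pos"), (["dull", "script"], "neg")]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)  # leaves the global random state untouched
    result = list(items)
    rng.shuffle(result)
    return result

first = shuffled_copy(docs, seed=620)
second = shuffled_copy(docs, seed=620)
print(first == second)  # the same seed gives the same order
```

Calling `random.seed(620)` before `random.shuffle(documents)` would have a similar effect on the notebook as written.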

Calculate the frequency of every word (lowercased) across the whole corpus

In [4]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

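Select the 2,000 most frequent words

In [5]:
# most_common(2000) returns (word, count) pairs in descending order of count;
# in NLTK 3, keys() is not ordered by frequency
word_features = [w for (w, _) in all_words.most_common(2000)]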

Build a feature set for each review: for each of the 2,000 words, record whether the review contains it

In [6]:
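def document_features(document):
    """Return a dict mapping 'contains(word)' to True/False for each word feature."""
    document_words = set(document)  # set lookup is faster than scanning the list
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features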
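To make the shape of a feature set concrete, here is a small sketch of the same idea on toy data. The three-word vocabulary, the sample review, and the name `toy_document_features` are hypothetical stand-ins for the 2,000-word list and the real reviews:

```python
# Hypothetical three-word vocabulary standing in for the 2,000-word list.
toy_word_features = ["plot", "uplifting", "ugh"]

def toy_document_features(document, vocabulary):
    """Map each vocabulary word to True/False by its presence in the review."""
    document_words = set(document)
    return {"contains(%s)" % word: (word in document_words)
            for word in vocabulary}

review = ["an", "uplifting", "story", "with", "a", "thin", "plot"]
features = toy_document_features(review, toy_word_features)
print(features)
```

Every review gets the same keys, one per vocabulary word, so reviews differ only in which keys are True.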

In [7]:
featuresets=[(document_features(d),c) for (d,c) in documents]

Split the data: the first 100 feature sets form the test set and the rest form the training set

In [8]:
train_set, test_set = featuresets[100:], featuresets[:100]

Train a Naive Bayes classifier on the training set

In [9]:
classifier=nltk.NaiveBayesClassifier.train(train_set)

Show classification accuracy of our classifier against the test set

In [10]:
print(nltk.classify.accuracy(classifier, test_set))

0.77
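Accuracy here is the fraction of test reviews whose predicted label matches the true label, so 0.77 on 100 reviews means 77 were classified correctly. The sketch below restates that calculation on toy labels and adds a rough binomial standard error as one way to gauge how noisy an estimate from only 100 reviews is. The helper names and the toy labels are illustrative, and the standard-error formula is a common approximation, not something the notebook computes:

```python
import math

def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / float(len(gold))

def binomial_std_error(acc, n):
    """Approximate standard error of an accuracy measured on n test items."""
    return math.sqrt(acc * (1.0 - acc) / n)

gold = ["pos", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
print(accuracy(predicted, gold))       # 3 of 4 labels match
print(binomial_std_error(0.77, 100))   # uncertainty of the 0.77 figure
```

Under this approximation the standard error is about 0.04, so a rerun with a different shuffle could plausibly give an accuracy several points higher or lower.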


Show the 30 most informative features

In [11]:
classifier.show_most_informative_features(30)

Most Informative Features
          contains(sans) = True              neg : pos    =      8.9 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.6 : 1.0
     contains(dismissed) = True              pos : neg    =      7.1 : 1.0
        contains(fabric) = True              pos : neg    =      6.4 : 1.0
   contains(overwhelmed) = True              pos : neg    =      6.4 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.2 : 1.0
     contains(uplifting) = True              pos : neg    =      6.2 : 1.0
        contains(doubts) = True              pos : neg    =      5.9 : 1.0
          contains(wits) = True              pos : neg    =      5.8 : 1.0
  contains(effortlessly) = True              pos : neg    =      5.7 : 1.0
         contains(wires) = True              neg : pos    =      5.6 : 1.0
        contains(beware) = True              neg : pos    =      5.6 : 1.0
           contains(ugh) = True              neg : pos    =      5.3 : 1.0
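Each row above reports how many times more likely a feature value is under one label than under the other, for example P(contains(sans)=True | neg) divided by P(contains(sans)=True | pos). The sketch below shows how such a ratio could arise. It assumes add-0.5 smoothing over the two feature values (True and False), which is my understanding of the classifier's default estimator. The counts are invented for illustration and are not taken from the corpus:

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Smoothed estimate of P(feature=True | label) from raw counts."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the larger to the smaller smoothed probability."""
    p_a = smoothed_prob(count_a, total_a)
    p_b = smoothed_prob(count_b, total_b)
    return max(p_a, p_b) / min(p_a, p_b)

# Invented counts: a rare word seen in 8 of 950 negative training reviews
# and in none of 950 positive ones.
rare = likelihood_ratio(8, 950, 0, 950)

# Invented counts: a common word seen in 600 negative and 500 positive reviews.
common = likelihood_ratio(600, 950, 500, 950)

print(round(rare, 1), round(common, 1))  # 17.0 1.2
```

Under these assumptions a rare word that happens to occur in only one class gets a much larger ratio than a common word with a real but modest skew. This is one plausible reason why proper names and unusual words can appear among the most informative features.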

Features that are expected to be informative, since each of them carries a clearly positive or clearly negative meaning:

sans (neg)

mediocrity (neg)

uplifting (pos)

wits (pos)

effortlessly (pos)

beware (neg)

ugh (neg)

dumbest (neg)

admired (pos)

wiseguy (pos)

existential (pos)

Features that are surprising to find among the most informative:

dismissed

fabric

overwhelmed

bruckheimer

doubts

wires

chad

topping

lang

hugo

wang

snake

quicker

maxwell

cronenberg