NLTK classifiers work with feature sets: dictionaries that map feature names to feature values (see also featstructs, http://www.nltk.org/_modules/nltk/featstruct.html).

For more information about how Naive Bayes works in NLTK, please see here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html 

Here, we will construct a very simple dictionary that maps each feature name (a word) to True if the word occurs in the document. We will not use word counts (the usual bag-of-words representation) for this example because, for sentiment classification, whether a word occurs seems to matter more than how often it occurs. In the context of Naive Bayes, this variant is often called binary (or binarized) multinomial Naive Bayes.
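To make the difference between presence features and frequency features concrete, the sketch below builds both representations for a toy token list. The token list and the helper names presence_feats and count_feats are made up for illustration; they are not part of NLTK.

```python
from collections import Counter

def presence_feats(words):
    """Map each distinct word to True (presence only, frequency ignored)."""
    return {word: True for word in words}

def count_feats(words):
    """Map each distinct word to the number of times it occurs."""
    return dict(Counter(words))

# Hypothetical tokens, not taken from the movie reviews corpus.
tokens = ['great', 'great', 'great', 'plot', 'bad', 'ending']

print(presence_feats(tokens))  # 'great' appears once as a key, with value True
print(count_feats(tokens))     # 'great' keeps its count of 3
```

With presence features, a review that repeats "great" three times looks the same as one that uses it once, which is the behaviour this example relies on.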

In [1]:
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
import math

In [2]:
def extract_word_feats(words):
    # Map every distinct word to True (presence feature).
    return {word: True for word in words}

The movie reviews corpus has 1000 positive files and 1000 negative files.

In [3]:
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

In order to train the classifier, we first need to create feature-label pairs, where the features are a feature dictionary of the form {word: True} and the label is either "pos" or "neg".

In [14]:
negreviews = [(extract_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posreviews = [(extract_word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

For instance, a feature-label pair could be: ({'"': True, 'around': True, ...}, 'neg')

In [15]:
negreviews[1]

({'the': True,
  'happy': True,
  'bastard': True,
  "'": True,
  's': True,
  'quick': True,
  'movie': True,
  'review': True,
  'damn': True,
  'that': True,
  'y2k': True,
  'bug': True,
  '.': True,
  'it': True,
  'got': True,
  'a': True,
  'head': True,
  'start': True,
  'in': True,
  'this': True,
  'starring': True,
  'jamie': True,
  'lee': True,
  'curtis': True,
  'and': True,
  'another': True,
  'baldwin': True,
  'brother': True,
  '(': True,
  'william': True,
  'time': True,
  ')': True,
  'story': True,
  'regarding': True,
  'crew': True,
  'of': True,
  'tugboat': True,
  'comes': True,
  'across': True,
  'deserted': True,
  'russian': True,
  'tech': True,
  'ship': True,
  'has': True,
  'strangeness': True,
  'to': True,
  'when': True,
  'they': True,
  'kick': True,
  'power': True,
  'back': True,
  'on': True,
  'little': True,
  'do': True,
  'know': True,
  'within': True,
  'going': True,
  'for': True,
  'gore': True,
  'bringing': True,
  'few': True,

In order to evaluate our classifier at a later step, we need to split the dataset into a training set and a test set.

Here, we will use the first 75% of each class as the training set, and the rest as the test set.

In [16]:
negsplit = int(len(negreviews)*0.75)
possplit = int(len(posreviews)*0.75)

trainingset = negreviews[:negsplit] + posreviews[:possplit]
testset = negreviews[negsplit:] + posreviews[possplit:]
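As a sanity check on the slicing logic, the same per-class split can be sketched on toy data. The helper name split_by_class and the toy feature dictionaries are assumptions for illustration only. Splitting each class separately keeps the training and test sets balanced between the two labels.

```python
def split_by_class(items, train_fraction=0.75):
    """Return (train, test), cutting the list at train_fraction of its length."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

# Hypothetical feature-label pairs: 8 per class.
neg = [({'w%d' % i: True}, 'neg') for i in range(8)]
pos = [({'w%d' % i: True}, 'pos') for i in range(8)]

neg_train, neg_test = split_by_class(neg)
pos_train, pos_test = split_by_class(pos)

train = neg_train + pos_train
test = neg_test + pos_test

print(len(train), len(test))  # 12 4
```

Note that this split follows the order of the input lists and does not shuffle them.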

The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of ‘pos’ or ‘neg’.

In [17]:
print('train on %d instances, test on %d instances' % (len(trainingset), len(testset)))
classifier = NaiveBayesClassifier.train(trainingset)

train on 1500 instances, test on 500 instances


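For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard. It classifies every feature dictionary in the test set and compares each predicted label with the gold label.

The underlying notion of accuracy is described as follows (taken from the NLTK documentation of nltk.metrics.accuracy, which compares two lists of labels): Given a list of reference values and a corresponding list of test values, return the fraction of corresponding values that are equal.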

In particular, return the fraction of indices
    ``0<i<=len(test)`` such that ``test[i] == reference[i]``.

    :type reference: list
    :param reference: An ordered list of reference values.
    :type test: list
    :param test: A list of values to compare against the corresponding  reference values.
    :raise ValueError: If ``reference`` and ``length`` do not have the same length.
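The quoted definition can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: the function names, the toy classifier toy_classify and the gold data are all hypothetical.

```python
def accuracy(reference, test):
    """Fraction of positions where test[i] == reference[i]."""
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    return sum(r == t for r, t in zip(reference, test)) / len(test)

def classifier_accuracy(classify, gold):
    """Classify each feature dict in gold and compare with its gold label."""
    predicted = [classify(feats) for feats, _ in gold]
    reference = [label for _, label in gold]
    return accuracy(reference, predicted)

def toy_classify(feats):
    # Hypothetical rule: predict 'pos' whenever the word 'good' is present.
    return 'pos' if 'good' in feats else 'neg'

gold = [
    ({'good': True}, 'pos'),                # predicted pos, correct
    ({'bad': True}, 'neg'),                 # predicted neg, correct
    ({'good': True, 'dull': True}, 'neg'),  # predicted pos, wrong
    ({'fine': True}, 'pos'),                # predicted neg, wrong
]

print(classifier_accuracy(toy_classify, gold))  # 0.5
```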

In [18]:
print('accuracy:', nltk.classify.util.accuracy(classifier, testset))

accuracy: 0.728


In addition, NLTK allows us to see the most useful features:

According to the NLTK documentation, most_informative_features() returns a list of the 'most informative' features used by the classifier. For the purpose of this function, the informativeness of a feature ``(fname,fval)`` is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:

        max[ P(fname=fval|label1) / P(fname=fval|label2) ]
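The ratio can be illustrated with a short sketch. The probabilities below are made up for illustration and do not come from the trained classifier; the function name informativeness is likewise an assumption, not an NLTK function.

```python
def informativeness(prob_by_label):
    """Return (most likely label, least likely label, max/min probability ratio)."""
    hi_label = max(prob_by_label, key=prob_by_label.get)
    lo_label = min(prob_by_label, key=prob_by_label.get)
    return hi_label, lo_label, prob_by_label[hi_label] / prob_by_label[lo_label]

# Hypothetical values of P(word=True | label) for a single word.
probs = {'pos': 0.03, 'neg': 0.002}

hi, lo, ratio = informativeness(probs)
print('%s : %s = %.1f : 1.0' % (hi, lo, ratio))
```

A large ratio means the feature is far more likely under one label than under the other, which is why such features rank highest.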

In [19]:
classifier.show_most_informative_features()

Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
