In [1]:
import nltk
#nltk.download()

In [6]:
from nltk.corpus import movie_reviews
import random

In [24]:
type(movie_reviews.words)

method

In [7]:
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

In [28]:
len(documents)
type(documents[1999][0])
documents[1999][1]

'pos'

To limit the number of features that the classifier needs to process, we construct a list of the 2000 most frequent words in the overall corpus. We then define a feature extractor that simply checks if each of these words is present in a given document.

The reason that we compute the set of all words in a document *document_words = set(document)*, rather than just checking if the word in the document, is that checking whether a word occurs in a set is much faster than checking whether it happens in a list.

In [11]:
# Define the feature extractor

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [13]:
# Train Naive Bayes classifier
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [16]:
len(featuresets)
featuresets[0]

({'contains(plot)': False,
  'contains(:)': False,
  'contains(two)': False,
  'contains(teen)': False,
  'contains(couples)': False,
  'contains(go)': True,
  'contains(to)': True,
  'contains(a)': True,
  'contains(church)': False,
  'contains(party)': False,
  'contains(,)': True,
  'contains(drink)': False,
  'contains(and)': True,
  'contains(then)': False,
  'contains(drive)': False,
  'contains(.)': True,
  'contains(they)': False,
  'contains(get)': False,
  'contains(into)': False,
  'contains(an)': True,
  'contains(accident)': False,
  'contains(one)': True,
  'contains(of)': True,
  'contains(the)': True,
  'contains(guys)': False,
  'contains(dies)': False,
  'contains(but)': True,
  'contains(his)': True,
  'contains(girlfriend)': True,
  'contains(continues)': False,
  'contains(see)': False,
  'contains(him)': False,
  'contains(in)': True,
  'contains(her)': True,
  'contains(life)': False,
  'contains(has)': True,
  'contains(nightmares)': False,
  'contains(what)': T

In [17]:
# Test the classifier
print(nltk.classify.accuracy(classifier, test_set))

0.785


In [7]:
# Show the most important features as interpreted by Naive Bayes
classifier.show_most_informative_features(5)

Most Informative Features
     contains(atrocious) = True              neg : pos    =      6.6 : 1.0
    contains(schumacher) = True              neg : pos    =      6.6 : 1.0
        contains(shoddy) = True              neg : pos    =      6.3 : 1.0
        contains(turkey) = True              neg : pos    =      6.1 : 1.0
      contains(explores) = True              pos : neg    =      5.8 : 1.0


### Task: Gender Identification by name

We know that male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.


In [8]:
from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

import random
random.shuffle(labeled_names)

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. Let us just look at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

In [9]:
def gender_features(word):
     return {'last_letter': word[-1]}
gender_features('Fabian')

{'last_letter': 'k'}

### Build a Naive Bayes Classifier 
- Add a few more features
- Split into train/test
- Report its accuracy on the test set  
- determine which features it found most effective for distinguishing the names' genders

In [None]:
%load NBMR_sol1.py
