# Project 3 - Gender Classifier
#### Authors: John Mazon, LeTicia Cancel, Bharani Nitalla

**Assignment:** Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


In [14]:
# Libraries
import nltk
from nltk.corpus import names
import random
from nltk.classify import apply_features

In [15]:
# Split male and female names
names = ([(name, 'male') for name in names.words('male.txt')] +
        [(name, 'female') for name in names.words('female.txt')])

In [16]:
random.shuffle(names)

In [17]:
names[:10]

[('Hynda', 'female'),
 ('Shir', 'female'),
 ('Nicol', 'female'),
 ('Mufinella', 'female'),
 ('Ichabod', 'male'),
 ('Andrew', 'male'),
 ('Paton', 'male'),
 ('Harmonie', 'female'),
 ('Nisa', 'female'),
 ('Dominick', 'male')]

In [18]:
len(names)

7944

In [19]:
def gender_features(word):
    return {'last_letter':word[-1]}

def gender_features2(word):
    return{'last_letter': word[-1],'word_len': len(word), 'first_letter': word[0]}

In [20]:
# create sets
featuresets = [(gender_features(n), g) for n,g in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# second set
featuresets2 = [(gender_features2(n), g) for n,g in names]
train_set2, test_set2 = featuresets2[500:], featuresets2[:500]
classifier2 = nltk.NaiveBayesClassifier.train(train_set2)

In [21]:
# test set 1 from the book, last letter only
print (nltk.classify.accuracy(classifier, test_set))

# test set 2 using last letter, length of word, and first letter
print(nltk.classify.accuracy(classifier2, test_set2))

0.756
0.774


In [22]:
classifier.show_most_informative_features(5)
classifier2.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.7 : 1.0
             last_letter = 'k'              male : female =     31.3 : 1.0
             last_letter = 'f'              male : female =     28.9 : 1.0
             last_letter = 'p'              male : female =     21.0 : 1.0
             last_letter = 'v'              male : female =     18.7 : 1.0
Most Informative Features
             last_letter = 'a'            female : male   =     35.7 : 1.0
             last_letter = 'k'              male : female =     31.3 : 1.0
             last_letter = 'f'              male : female =     28.9 : 1.0
             last_letter = 'p'              male : female =     21.0 : 1.0
             last_letter = 'v'              male : female =     18.7 : 1.0


In [23]:
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[500:])

In [29]:
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [30]:
gender_features3('John')

{'firstletter': 'j',
 'lastletter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [34]:
train_names = names[6900:]
devtest_names = names[500:1000]
test_names = names[:500]

In [32]:
featuresets = [(gender_features3(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.774


In [36]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.75
