D Stern

In this method, we apply Naive Bayes again. The process is to shuffle the names list, split it into training, test, and validation sets, and then construct several classifiers, each using a different combination of features. The combination of first letter, last letter, first four letters, and last three letters yields about 80% accuracy on the test and validation sets.

In [2]:
from __future__ import division  # keep first: gives true division for the ratios below
import random
import nltk
from nltk.corpus import names

In [3]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [4]:
names = nltk.corpus.names
maleNames = names.words('male.txt')
femaleNames = names.words('female.txt')
maleNames = [(each, 'male') for each in maleNames] # add labels
femaleNames = [(each, 'female') for each in femaleNames]
random.shuffle(maleNames) #lists are alphabetical, randomize before splitting
random.shuffle(femaleNames)

In [5]:
len(maleNames)/(len(maleNames)+len(femaleNames)) # pct of male names

0.3704682779456193

In [6]:
0.3704*500 # number of male names to take for validation and test sets

185.20000000000002

In [7]:
0.3704*6944 # number of male names to take for training set

2572.0576

In [8]:
len(femaleNames)/(len(maleNames)+len(femaleNames)) # pct of female names

0.6295317220543807

In [9]:
0.6295*500 # number of female names to take for validation and test sets

314.75

In [10]:
0.6295*6944 # number of female names to take for training set

4371.248

In [11]:
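# Male names: 2572 training + 185 test + 185 validation.
# Female names: 4370 training + 315 test + 315 validation.
training = maleNames[0:2572] + femaleNames[0:4370]
test = maleNames[2572:2757] + femaleNames[4370:4685]
validation = maleNames[2757:2942] + femaleNames[4685:5000]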
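
The hand-computed slice boundaries above can also be derived programmatically. The following is a minimal sketch, assuming Python 3 and the standard library only; `stratified_split`, `holdout_size`, and the toy data are hypothetical names introduced for illustration and are not part of the notebook or of NLTK.

```python
import random

def stratified_split(items_by_label, holdout_size, seed=0):
    """Split labelled items into training, test and validation lists,
    keeping each label's share of the two holdout sets close to its
    share of the whole data set."""
    rng = random.Random(seed)
    total = sum(len(items) for items in items_by_label.values())
    training, test, validation = [], [], []
    for label, items in items_by_label.items():
        shuffled = list(items)
        rng.shuffle(shuffled)  # source lists may be sorted, so shuffle first
        # Holdout count for this label, proportional to its frequency.
        k = int(round(holdout_size * len(shuffled) / float(total)))
        labelled = [(item, label) for item in shuffled]
        test.extend(labelled[:k])
        validation.extend(labelled[k:2 * k])
        training.extend(labelled[2 * k:])
    return training, test, validation

# Toy data standing in for the corpus: 30 "male" and 70 "female" items.
toy = {
    'male': ['m%d' % i for i in range(30)],
    'female': ['f%d' % i for i in range(70)],
}
training, test, validation = stratified_split(toy, holdout_size=10)
print(len(training), len(test), len(validation))
```

With the real corpus, `items_by_label` would hold the two name lists and `holdout_size` would be 500, which should reproduce the 185/315 split worked out by hand, up to rounding.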

In [12]:
from nltk.classify import apply_features
training1 = apply_features(gender_features, training)
test1 = apply_features(gender_features, test)
classifier1 = nltk.NaiveBayesClassifier.train(training1)
print(nltk.classify.accuracy(classifier1, test1))

0.772
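
To make the training and accuracy steps less of a black box, here is a minimal sketch of a Naive Bayes classifier over feature dictionaries. It is not NLTK's implementation: the functions `train`, `classify`, and `accuracy`, the additive smoothing constant of 0.5, and the toy names are all assumptions made for illustration, and the code targets Python 3.

```python
import math
from collections import Counter, defaultdict

def train(featuresets, smoothing=0.5):
    """Collect the counts needed to estimate P(label) and P(value | label)."""
    label_counts = Counter(label for _, label in featuresets)
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values_seen = defaultdict(set)       # feature name -> values seen in training
    for features, label in featuresets:
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen, smoothing

def classify(model, features):
    """Return the label with the highest log posterior score."""
    label_counts, value_counts, values_seen, smoothing = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for fname, fval in features.items():
            if fname not in values_seen:
                continue  # feature never seen in training: ignore it
            count = value_counts[(label, fname)][fval]
            bins = len(values_seen[fname])
            score += math.log((count + smoothing) / (n + smoothing * bins))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, featuresets):
    """Fraction of featuresets whose predicted label matches the gold label."""
    correct = sum(classify(model, f) == label for f, label in featuresets)
    return correct / len(featuresets)

def last_letter(name):
    return {'last_letter': name[-1].lower()}

train_names = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
               ('Mark', 'male'), ('Jack', 'male'), ('Erik', 'male')]
held_out = [('Sophia', 'female'), ('Frank', 'male')]

model = train([(last_letter(n), g) for n, g in train_names])
print(classify(model, last_letter('Sophia')))
print(accuracy(model, [(last_letter(n), g) for n, g in held_out]))
```

Because each feature contributes an independent log-probability term, adding a feature to the dictionary simply adds one more term to every label's score.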


In [13]:
classifier1.show_most_informative_features(10)

Most Informative Features
             last_letter = u'a'           female : male   =     37.3 : 1.0
             last_letter = u'k'             male : female =     29.3 : 1.0
             last_letter = u'f'             male : female =     14.6 : 1.0
             last_letter = u'v'             male : female =     10.5 : 1.0
             last_letter = u'p'             male : female =     10.5 : 1.0
             last_letter = u'd'             male : female =      9.7 : 1.0
             last_letter = u'o'             male : female =      8.4 : 1.0
             last_letter = u'm'             male : female =      8.1 : 1.0
             last_letter = u'r'             male : female =      6.6 : 1.0
             last_letter = u'z'             male : female =      6.4 : 1.0
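
Each ratio in the table above compares how likely a feature value is under one label versus the other. The sketch below shows that calculation on toy data, assuming additive smoothing with a constant of 0.5; `last_letter_ratios` and the sample names are hypothetical, and the numbers are not expected to match NLTK's output for the real corpus.

```python
from collections import Counter

def last_letter_ratios(labelled_names, smoothing=0.5):
    """Return (ratio, letter, favoured_label) tuples, strongest first."""
    counts = {}
    for name, label in labelled_names:
        counts.setdefault(label, Counter())[name[-1].lower()] += 1
    label_a, label_b = sorted(counts)  # exactly two labels are expected
    letters = set(counts[label_a]) | set(counts[label_b])

    def prob(letter, label):
        # Smoothed estimate of P(last letter | label).
        total = sum(counts[label].values())
        return (counts[label][letter] + smoothing) / (total + smoothing * len(letters))

    ratios = []
    for letter in letters:
        p_a, p_b = prob(letter, label_a), prob(letter, label_b)
        if p_a >= p_b:
            ratios.append((p_a / p_b, letter, label_a))
        else:
            ratios.append((p_b / p_a, letter, label_b))
    return sorted(ratios, reverse=True)

toy = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
       ('Ruth', 'female'), ('Mark', 'male'), ('Jack', 'male'),
       ('Erik', 'male'), ('Joshua', 'male')]
for ratio, letter, label in last_letter_ratios(toy):
    print('last_letter = %r favours %s by %.1f : 1.0' % (letter, label, ratio))
```

Smoothing keeps the ratio finite for letters that never occur under one label, such as 'k' among the female toy names.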


For the second feature definition, I experimented with combinations of just the first and last letter. Although results vary somewhat from run to run, the last letter alone is more indicative than the first letter alone, and using the last letter alone gives roughly the same accuracy as using both the first and last letter (0.772 versus 0.782 on the test set here).
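
The run-to-run variability comes from the unseeded shuffle, so each run trains and tests on a different split. One way to make a run repeatable is to shuffle with a seeded generator. This is a minimal sketch using only the standard library; `shuffled_copy` and the sample names are hypothetical.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    copy = list(items)
    rng.shuffle(copy)
    return copy

sample = ['Aaron', 'Beth', 'Carl', 'Dana', 'Evan', 'Faye']
run_a = shuffled_copy(sample, seed=42)
run_b = shuffled_copy(sample, seed=42)
run_c = shuffled_copy(sample, seed=7)

print(run_a == run_b)                   # same seed, same order
print(sorted(run_c) == sorted(sample))  # another seed reorders the same members
```

Repeating the experiment over several seeds and comparing the accuracies would show how much of a difference between feature sets is due to the split rather than the features.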

In [14]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    return features

In [15]:
training2 = [(gender_features2(n), gender) for (n, gender) in training]
test2 = [(gender_features2(n), gender) for (n, gender) in test]
validation2 = [(gender_features2(n), gender) for (n, gender) in validation]
classifier2 = nltk.NaiveBayesClassifier.train(training2)
print(nltk.classify.accuracy(classifier2, test2))

0.782


In [16]:
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
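    # name[:4] is the first four letters and name[-3:] the last three.
    # The slices name[0:3] and name[-3:-1] yield only three and two letters;
    # the accuracies recorded below were produced with those shorter slices.
    features["first_4"] = name[:4]
    features["last_3"] = name[-3:]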
    return features
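
Because Python slices exclude the end index, the slice bounds decide how many letters a feature such as `first_4` or `last_3` actually captures. A quick check on a sample name, chosen here purely for illustration:

```python
name = 'Jonathan'

print(name[0:3])    # Jon  (three letters: index 3 is excluded)
print(name[:4])     # Jona (first four letters)
print(name[-3:-1])  # ha   (two letters: the final letter is excluded)
print(name[-3:])    # han  (last three letters)

# Names shorter than the slice are returned whole rather than raising an error.
print('Al'[:4])     # Al
```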

In [17]:
training3 = [(gender_features3(n), gender) for (n, gender) in training]
test3 = [(gender_features3(n), gender) for (n, gender) in test]
validation3 = [(gender_features3(n), gender) for (n, gender) in validation]
classifier3 = nltk.NaiveBayesClassifier.train(training3)
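# Score against test3, which carries all four features. The 0.782 recorded
# below came from scoring against test2, which lacks first_4 and last_3.
print(nltk.classify.accuracy(classifier3, test3))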

0.782


In [18]:
print(nltk.classify.accuracy(classifier3, validation3))

0.828
