# Name Gender Identifier

## 1. Building a feature extractor

One idea is to use the last letter of the name to predict the gender. For instance, names ending in *a*, *e* and *i* are likely to be female, while names ending in *k*, *o*, *r*, *s* and *t* are likely to be male.

In [1]:
# Feature extractor
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('John')

{'last_letter': 'n'}

The returned dictionary is known as a **feature set**.
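To see why the last letter might carry signal, it can help to tally it per label. The following is only a minimal sketch: `toy_names` is a hypothetical hand-picked sample, not data from the corpus, and the tallies are there to illustrate the idea rather than to prove anything.

```python
from collections import Counter, defaultdict

def gender_features(word):
    # Same extractor as above: the feature set has a single entry.
    return {'last_letter': word[-1]}

# Hypothetical hand-picked sample, not drawn from any corpus.
toy_names = [
    ('Anna', 'female'), ('Maria', 'female'), ('Sophie', 'female'),
    ('Mark', 'male'), ('Frank', 'male'), ('Leo', 'male'),
]

# Tally how often each last letter occurs under each label.
tallies = defaultdict(Counter)
for name, label in toy_names:
    tallies[label][gender_features(name)['last_letter']] += 1

print(dict(tallies['female']))
print(dict(tallies['male']))
```

A classifier trained on such feature sets learns essentially these per-label frequencies, only from far more examples.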

## 2. Exploring the `names` corpus

In [2]:
from nltk.corpus import names

names.readme().replace('\n', ' ')

'Names Corpus, Version 1.3 (1994-03-29) Copyright (C) 1991 Mark Kantrowitz Additions by Bill Ross  This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line.  You may use the lists of names for any purpose, so long as credit is given in any published work. You may also redistribute the list if you provide the recipients with a copy of this README file. The lists are not in the public domain (I retain the copyright on the lists) but are freely redistributable.  If you have any additions to the lists of names, I would appreciate receiving them.  Mark Kantrowitz <mkant+@cs.cmu.edu> http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/'

In [3]:
names.fileids()

['female.txt', 'male.txt']

In [4]:
names.words('female.txt')[:5]

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi']

## 3. Building the classifier

We need to prepare a list of examples and corresponding class labels.

In [5]:
labeled_names = (
    [(name, 'female') for name in names.words('female.txt')]
    + [(name, 'male') for name in names.words('male.txt')]
)
labeled_names[:5]

[('Abagael', 'female'),
 ('Abagail', 'female'),
 ('Abbe', 'female'),
 ('Abbey', 'female'),
 ('Abbi', 'female')]

In [6]:
import random
random.shuffle(labeled_names)  # Shuffle so that slicing by index gives a random training/test split.
labeled_names[:5]

[('Riannon', 'female'),
 ('Burton', 'male'),
 ('Obadiah', 'male'),
 ('Alston', 'male'),
 ('Ferdy', 'male')]
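Because the shuffle is random, the split (and every accuracy figure that follows) changes from run to run. If reproducible numbers are wanted, one option is to shuffle with a seeded generator. This is a sketch under assumptions: `seeded_shuffle`, `toy_labeled_names` and the seed value 42 are hypothetical and not part of the notebook.

```python
import random

# Hypothetical stand-in for labeled_names so the snippet runs on its own.
toy_labeled_names = [('Anna', 'female'), ('Mark', 'male'),
                     ('Maria', 'female'), ('Leo', 'male')]

def seeded_shuffle(items, seed=42):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled

first = seeded_shuffle(toy_labeled_names)
second = seeded_shuffle(toy_labeled_names)
print(first == second)  # both calls use the same seed
```

Working on a copy also leaves the original list in corpus order, which can be handy when debugging.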

In [7]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
featuresets[:5]

[({'last_letter': 'n'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'h'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'y'}, 'male')]

In [8]:
len(featuresets)

7944

In [9]:
from nltk import NaiveBayesClassifier

train_names, test_names = labeled_names[500:], labeled_names[:500]

train_set, test_set = featuresets[500:], featuresets[:500]
classifier = NaiveBayesClassifier.train(train_set)

# With large corpora, building one list that holds the features of every instance
# can use a lot of memory. In that case, nltk.classify.apply_features returns an
# object that acts like a list but does not store all the feature sets in memory:
# from nltk.classify import apply_features
# train_set = apply_features(gender_features, labeled_names[500:])
# test_set = apply_features(gender_features, labeled_names[:500])
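To make "acts like a list but does not store all the feature sets" concrete, here is a minimal sketch of the idea. `LazyFeatureList` is a hypothetical class written for illustration; it is not NLTK's implementation, and it only supports length, indexing, slicing and iteration.

```python
class LazyFeatureList:
    """Hypothetical list-like view that computes feature sets on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._feature_func = feature_func
        self._labeled_tokens = labeled_tokens

    def __len__(self):
        return len(self._labeled_tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Materialize only the requested slice.
            return [self[i] for i in range(*index.indices(len(self)))]
        token, label = self._labeled_tokens[index]
        # The feature set is built here, on demand, and never cached.
        return (self._feature_func(token), label)


def gender_features(word):
    return {'last_letter': word[-1]}

# Hypothetical sample data.
toy = [('Anna', 'female'), ('Mark', 'male'), ('Leo', 'male')]
lazy = LazyFeatureList(gender_features, toy)

print(len(lazy))
print(lazy[0])
```

The trade-off is time for memory: each access recomputes the features, so this pays off mainly when the feature sets are much larger than the raw tokens.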

In [10]:
classifier.show_most_informative_features(10) # Prints likelihood ratios for most informative features

Most Informative Features
             last_letter = 'a'            female : male   =     34.2 : 1.0
             last_letter = 'k'              male : female =     32.6 : 1.0
             last_letter = 'f'              male : female =     17.5 : 1.0
             last_letter = 'p'              male : female =     10.6 : 1.0
             last_letter = 'd'              male : female =     10.1 : 1.0
             last_letter = 'v'              male : female =     10.0 : 1.0
             last_letter = 'm'              male : female =      9.3 : 1.0
             last_letter = 'o'              male : female =      8.2 : 1.0
             last_letter = 'w'              male : female =      6.7 : 1.0
             last_letter = 'r'              male : female =      6.6 : 1.0
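Each row above compares how likely a feature value is under one label versus the other, that is, P(value | label A) / P(value | label B). The sketch below computes such a ratio from raw counts. The counts are hypothetical, and the additive smoothing with `gamma=0.5` over 26 possible letters is an assumption made for illustration; NLTK's own estimate may differ in its details.

```python
# Hypothetical counts: names ending in 'a', and total names, per label.
count_with_feature = {'female': 30, 'male': 1}
count_total = {'female': 60, 'male': 40}

def smoothed_prob(count, total, bins, gamma=0.5):
    """Estimate P(value | label) with additive smoothing (gamma is an assumption)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(label_a, label_b, bins=26):
    """Ratio P(value | label_a) / P(value | label_b) for the counted value."""
    p_a = smoothed_prob(count_with_feature[label_a], count_total[label_a], bins)
    p_b = smoothed_prob(count_with_feature[label_b], count_total[label_b], bins)
    return p_a / p_b

ratio = likelihood_ratio('female', 'male')
print('female : male = {:.1f} : 1.0'.format(ratio))
```

Smoothing keeps the ratio finite even when a letter never occurs under one of the labels.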


## 4. Testing the classifier

In [11]:
classifier.labels()

['male', 'female']

In [12]:
from nltk.classify import accuracy

accuracy(classifier, test_set)

0.73

In [13]:
classifier.classify(gender_features('Aphrodite'))

'female'

In [14]:
classifier.classify(gender_features('Zeus'))

'male'

## 5. Building a classifier with more features

In [15]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features2('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'last_letter': 'n'}

In [16]:
featuresets2 = [(gender_features2(n), gender) for (n, gender) in labeled_names]
featuresets2[0]

({'count(a)': 1,
  'count(b)': 0,
  'count(c)': 0,
  'count(d)': 0,
  'count(e)': 0,
  'count(f)': 0,
  'count(g)': 0,
  'count(h)': 0,
  'count(i)': 1,
  'count(j)': 0,
  'count(k)': 0,
  'count(l)': 0,
  'count(m)': 0,
  'count(n)': 3,
  'count(o)': 1,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 1,
  'count(s)': 0,
  'count(t)': 0,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'first_letter': 'r',
  'has(a)': True,
  'has(b)': False,
  'has(c)': False,
  'has(d)': False,
  'has(e)': False,
  'has(f)': False,
  'has(g)': False,
  'has(h)': False,
  'has(i)': True,
  'has(j)': False,
  'has(k)': False,
  'has(l)': False,
  'has(m)': False,
  'has(n)': True,
  'has(o)': True,
  'has(p)': False,
  'has(q)': False,
  'has(r)': True,
  'has(s)': False,
  'has(t)': False,
  'has(u)': False,
  'has(v)': False,
  'has(w)': False,
  'has(x)': False,
  'has(y)': False,
  'has(z)': False,
  'last_letter': 'n'},
 'female')

In [17]:
train_set2, test_set2 = featuresets2[500:], featuresets2[:500]
classifier2 = NaiveBayesClassifier.train(train_set2)
accuracy(classifier2, test_set2)

0.76

I expected that so many specific features on a small dataset would lead to overfitting, but accuracy on the test set rose slightly (from 0.73 to 0.76), so the classifier seems to have avoided it.

In [18]:
classifier2.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     34.2 : 1.0
             last_letter = 'k'              male : female =     32.6 : 1.0
             last_letter = 'f'              male : female =     17.5 : 1.0
             last_letter = 'p'              male : female =     10.6 : 1.0
             last_letter = 'd'              male : female =     10.1 : 1.0
             last_letter = 'v'              male : female =     10.0 : 1.0
             last_letter = 'm'              male : female =      9.3 : 1.0
                count(v) = 2              female : male   =      8.7 : 1.0
             last_letter = 'o'              male : female =      8.2 : 1.0
             last_letter = 'w'              male : female =      6.7 : 1.0
             last_letter = 'r'              male : female =      6.6 : 1.0
                count(w) = 2                male : female =      5.2 : 1.0
             last_letter = 'g'              male : female =      5.2 : 1.0

Indeed, it seems the classifier is mainly using the last letter, along with some other features that happen to improve the accuracy.

## 6. Comparing the two classifiers using `nltk.metrics`

Before we start, here's a useful function for comparing strings:

In [19]:
from nltk.metrics import edit_distance

edit_distance("John", "Joan")

1
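Edit distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The sketch below is a plain dynamic-programming version written for illustration; `levenshtein` is a hypothetical name, and it covers only the basic case where every operation costs 1.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(levenshtein("John", "Joan"))  # → 1
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.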

The NLTK metrics module provides functions for calculating metrics beyond mere accuracy. To use them, we need to build two sets for each classification label: a reference set of correct values, and a test set of observed values.
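With sets of item indices, the metrics reduce to set arithmetic: precision is the share of predicted items that are correct, recall is the share of correct items that were predicted, and the F1 score is their harmonic mean. The following is a minimal sketch of those definitions; the `set_*` function names and the toy index sets are hypothetical.

```python
def set_precision(reference, test):
    """Fraction of test items that also appear in reference."""
    return len(reference & test) / len(test) if test else None

def set_recall(reference, test):
    """Fraction of reference items that also appear in test."""
    return len(reference & test) / len(reference) if reference else None

def set_f1(reference, test):
    """Harmonic mean of precision and recall."""
    p, r = set_precision(reference, test), set_recall(reference, test)
    if not p or not r:
        return 0.0 if p is not None and r is not None else None
    return 2 * p * r / (p + r)

# Hypothetical indices: items truly labeled 'female' vs. items predicted 'female'.
reference = {0, 1, 2, 3}
test = {2, 3, 4}

print(set_precision(reference, test))
print(set_recall(reference, test))
print(set_f1(reference, test))
```

Here two of the three predicted items are correct, and two of the four true items were found.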

In [20]:
import collections

# Classifier 1
refsets = collections.defaultdict(set) # For what this is: https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
    
# Classifier 2
refsets2 = collections.defaultdict(set)
testsets2 = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set2):
    refsets2[label].add(i)
    observed = classifier2.classify(feats)
    testsets2[observed].add(i)

In [21]:
refsets

defaultdict(set,
            {'female': {0,
              5,
              10,
              11,
              12,
              13,
              14,
              19,
              20,
              21,
              22,
              23,
              26,
              27,
              28,
              30,
              31,
              33,
              34,
              36,
              37,
              38,
              39,
              43,
              45,
              46,
              48,
              49,
              50,
              51,
              54,
              55,
              56,
              57,
              58,
              59,
              63,
              65,
              66,
              67,
              68,
              69,
              72,
              73,
              74,
              75,
              77,
              78,
              79,
              86,
              87,
              92,
              93,
              94,
   

In [22]:
testsets

defaultdict(set,
            {'female': {2,
              4,
              5,
              6,
              10,
              11,
              12,
              13,
              17,
              19,
              20,
              21,
              22,
              23,
              26,
              28,
              29,
              30,
              31,
              32,
              33,
              34,
              35,
              36,
              37,
              39,
              43,
              45,
              46,
              48,
              49,
              50,
              54,
              55,
              56,
              57,
              58,
              59,
              63,
              65,
              66,
              67,
              68,
              69,
              72,
              73,
              74,
              75,
              77,
              79,
              81,
              82,
              84,
              86,
     

In [23]:
from nltk.metrics.scores import (precision, recall, f_measure)

# Print the metrics for each classifier. Accuracy cannot be computed from these sets:
# nltk.metrics.scores.accuracy(reference, test) compares test[i] == reference[i], so it
# needs two label sequences aligned by position rather than sets of indices.
# The same holds for the confusion matrix.
args = (
    round(precision(refsets['female'], testsets['female']), 2),
    round(precision(refsets['male'], testsets['male']), 2),
    round(recall(refsets['female'], testsets['female']), 2),
    round(recall(refsets['male'], testsets['male']), 2),
    round(f_measure(refsets['female'], testsets['female']), 2),
    round(f_measure(refsets['male'], testsets['male']), 2)
)

args2 = (
    round(precision(refsets2['female'], testsets2['female']), 2),
    round(precision(refsets2['male'], testsets2['male']), 2),
    round(recall(refsets2['female'], testsets2['female']), 2),
    round(recall(refsets2['male'], testsets2['male']), 2),
    round(f_measure(refsets2['female'], testsets2['female']), 2),
    round(f_measure(refsets2['male'], testsets2['male']), 2)
)

print('''
CLASSIFIER 1
------------ 
Female precision: {0}
Male precision: {1}
Female recall: {2}
Male recall: {3}
Female F1 score: {4}
Male F1 score: {5}

CLASSIFIER 2
------------ 
Female precision: {6}
Male precision: {7}
Female recall: {8}
Male recall: {9}
Female F1 score: {10}
Male F1 score: {11}
'''.format(*args, *args2))


CLASSIFIER 1
------------ 
Female precision: 0.76
Male precision: 0.68
Female recall: 0.78
Male recall: 0.66
Female F1 score: 0.77
Male F1 score: 0.67

CLASSIFIER 2
------------ 
Female precision: 0.79
Male precision: 0.71
Female recall: 0.8
Male recall: 0.71
Female F1 score: 0.8
Male F1 score: 0.71
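Accuracy and a confusion matrix both need the gold and predicted labels as two sequences aligned by position. A minimal sketch of that layout follows; the label lists are hypothetical toy data, and the confusion matrix is represented simply as a counter of (gold, predicted) pairs.

```python
from collections import Counter

# Hypothetical gold and predicted labels, aligned by position.
reference_labels = ['female', 'male', 'male', 'female', 'female']
predicted_labels = ['female', 'female', 'male', 'female', 'male']

# Accuracy: share of positions where the two sequences agree.
matches = sum(r == p for r, p in zip(reference_labels, predicted_labels))
toy_accuracy = matches / len(reference_labels)

# Confusion counts: how often each (gold, predicted) pair occurs.
confusion = Counter(zip(reference_labels, predicted_labels))

print(toy_accuracy)
for (gold, guess), n in sorted(confusion.items()):
    print('gold = {:8} guess = {:8} count = {}'.format(gold, guess, n))
```

In the notebook, such sequences could be collected in the same loop that fills `refsets` and `testsets`.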



## 7. Error analysis

In [24]:
errors = []
for (name, tag) in test_names:
    guess = classifier.classify(gender_features(name))  # classifier was trained on gender_features, so the features match
    if guess != tag:
        errors.append((tag, guess, name))

errors[:5]

[('female', 'male', 'Riannon'),
 ('male', 'female', 'Obadiah'),
 ('male', 'female', 'Ferdy'),
 ('male', 'female', 'Pierre'),
 ('female', 'male', 'Christin')]

In [25]:
for (tag, guess, name) in sorted(errors):
    print('Correct = {:8} guess = {:8} name = {}'.format(tag, guess, name))  # {:8} pads each field to width 8 so the columns line up.

Correct = female   guess = male     name = Aileen
Correct = female   guess = male     name = Alleen
Correct = female   guess = male     name = Allyn
Correct = female   guess = male     name = Anabel
Correct = female   guess = male     name = Angel
Correct = female   guess = male     name = Aurel
Correct = female   guess = male     name = Brittan
Correct = female   guess = male     name = Cal
Correct = female   guess = male     name = Carleen
Correct = female   guess = male     name = Caryl
Correct = female   guess = male     name = Cathyleen
Correct = female   guess = male     name = Catlin
Correct = female   guess = male     name = Christal
Correct = female   guess = male     name = Christel
Correct = female   guess = male     name = Christian
Correct = female   guess = male     name = Christin
Correct = female   guess = male     name = Coral
Correct = female   guess = male     name = Dareen
Correct = female   guess = male     name = Delores
Correct = female   guess = male     name = 

Looking through this list of errors makes it clear that some suffixes longer than one letter can indicate the gender of a name. For example, names ending in *yn* appear to be predominantly female, even though names ending in *n* tend to be male; and names ending in *ch* are usually male, even though names ending in *h* tend to be female.

## 8. Building a classifier with even more features

In [26]:
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["suffix1"] = name[-1].lower()
    features["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features3('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'suffix1': 'n',
 'suffix2': 'hn'}

In [27]:
featuresets3 = [(gender_features3(n), gender) for (n, gender) in labeled_names]
featuresets3[0]

({'count(a)': 1,
  'count(b)': 0,
  'count(c)': 0,
  'count(d)': 0,
  'count(e)': 0,
  'count(f)': 0,
  'count(g)': 0,
  'count(h)': 0,
  'count(i)': 1,
  'count(j)': 0,
  'count(k)': 0,
  'count(l)': 0,
  'count(m)': 0,
  'count(n)': 3,
  'count(o)': 1,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 1,
  'count(s)': 0,
  'count(t)': 0,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'first_letter': 'r',
  'has(a)': True,
  'has(b)': False,
  'has(c)': False,
  'has(d)': False,
  'has(e)': False,
  'has(f)': False,
  'has(g)': False,
  'has(h)': False,
  'has(i)': True,
  'has(j)': False,
  'has(k)': False,
  'has(l)': False,
  'has(m)': False,
  'has(n)': True,
  'has(o)': True,
  'has(p)': False,
  'has(q)': False,
  'has(r)': True,
  'has(s)': False,
  'has(t)': False,
  'has(u)': False,
  'has(v)': False,
  'has(w)': False,
  'has(x)': False,
  'has(y)': False,
  'has(z)': False,
  'suffix1': 'n',
  'suffix2': 'on'},
 'female')

In [28]:
train_set3, test_set3 = featuresets3[500:], featuresets3[:500]
classifier3 = NaiveBayesClassifier.train(train_set3)
accuracy(classifier3, test_set3)

0.766

In [29]:
classifier3.show_most_informative_features(15)

Most Informative Features
                 suffix2 = 'na'           female : male   =     99.2 : 1.0
                 suffix2 = 'la'           female : male   =     76.1 : 1.0
                 suffix2 = 'ia'           female : male   =     40.2 : 1.0
                 suffix2 = 'us'             male : female =     38.1 : 1.0
                 suffix2 = 'sa'           female : male   =     35.4 : 1.0
                 suffix1 = 'a'            female : male   =     34.2 : 1.0
                 suffix1 = 'k'              male : female =     32.6 : 1.0
                 suffix2 = 'rd'             male : female =     32.1 : 1.0
                 suffix2 = 'ra'           female : male   =     26.0 : 1.0
                 suffix2 = 'do'             male : female =     25.3 : 1.0
                 suffix2 = 'ta'           female : male   =     25.1 : 1.0
                 suffix2 = 'ld'             male : female =     24.6 : 1.0
                 suffix2 = 'rt'             male : female =     24.3 : 1.0

## 9. Trying to use a Maximum Entropy classifier

The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge is the one with largest entropy.

The principle of maximum entropy is invoked when we have some information about a probability distribution, but not enough to characterize it completely, often because we lack the means or resources to do so. For example, if all we know about a distribution is its average, infinitely many distributions share that average. The principle says that we should humbly choose the one that maximizes the amount of unpredictability.

Taking the idea to the extreme, it wouldn’t be scientific to choose a distribution that simply yields the average value 100% of the time.
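A small numeric sketch can make this concrete. The three candidate distributions below are hypothetical; they all live on the outcomes 1, 2 and 3 and share the average 2, but they differ in entropy (measured here in bits).

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def average(outcomes, probs):
    return sum(o * p for o, p in zip(outcomes, probs))

outcomes = [1, 2, 3]

# Hypothetical candidates, all with average 2.
candidates = {
    'always the average': [0.0, 1.0, 0.0],
    'extremes only': [0.5, 0.0, 0.5],
    'uniform': [1 / 3, 1 / 3, 1 / 3],
}

for name, probs in candidates.items():
    print('{:20} average = {:.2f}  entropy = {:.3f} bits'.format(
        name, average(outcomes, probs), entropy(probs)))

best = max(candidates, key=lambda name: entropy(candidates[name]))
print(best)
```

The distribution that always yields the average has zero entropy, which is why it is the least defensible choice when the average is all we know.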

Of all the models that fit our training data, the Maximum Entropy classifier selects the one with the largest entropy. Because it makes minimal assumptions, it is commonly used when we know nothing about the prior distributions and it would be unsafe to assume anything about them. It is also used when we can't assume that the features are conditionally independent.

In [37]:
from nltk import MaxentClassifier

me_classifier = MaxentClassifier.train(train_set3, max_iter=20) # max_iter has default value 100

LookupError: 

===========================================================================
NLTK was unable to find the megam file!
Use software specific configuration paramaters or set the MEGAM environment variable.

  For more information on megam, see:
    <http://www.umiacs.umd.edu/~hal/megam/index.html>
===========================================================================

In [31]:
accuracy(me_classifier, test_set3)

0.798

In [35]:
me_classifier.show_most_informative_features(10)

  -1.709 suffix2=='na' and label is 'male'
  -1.678 suffix2=='la' and label is 'male'
  -1.352 suffix2=='sa' and label is 'male'
  -1.291 suffix2=='ia' and label is 'male'
  -1.214 suffix1=='a' and label is 'male'
  -1.180 suffix2=='us' and label is 'female'
  -1.162 suffix2=='ra' and label is 'male'
  -1.128 suffix1=='k' and label is 'female'
  -1.100 suffix2=='ta' and label is 'male'
  -1.072 suffix2=='do' and label is 'female'


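The Maximum Entropy classifier reaches 0.798 accuracy on the same test set, against 0.766 for the Naive Bayes classifier trained on the same features. A plausible reason is that these features overlap heavily (for example, `suffix1` is always the last letter of `suffix2`), and the Maximum Entropy classifier does not assume that they are independent.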