# Name Gender Identifier

## 1. Building a feature extractor

An idea is to use the last letter of the name to predict the gender. For instance, names ending in *a*, *e* and *i* are likely to be female, while names ending in *k*, *o*, *r*, *s* and *t* are likely to be male.

In [1]:
# Feature extractor
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('John')

{'last_letter': 'n'}

The returned dictionary is known as a **feature set**.

## 2. Exploring the `names` corpus

In [2]:
from nltk.corpus import names

names.readme().replace('\n', ' ')

'Names Corpus, Version 1.3 (1994-03-29) Copyright (C) 1991 Mark Kantrowitz Additions by Bill Ross  This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line.  You may use the lists of names for any purpose, so long as credit is given in any published work. You may also redistribute the list if you provide the recipients with a copy of this README file. The lists are not in the public domain (I retain the copyright on the lists) but are freely redistributable.  If you have any additions to the lists of names, I would appreciate receiving them.  Mark Kantrowitz <mkant+@cs.cmu.edu> http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/'

In [3]:
names.fileids()

['female.txt', 'male.txt']

In [4]:
names.words('female.txt')[:5]

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi']

## 3. Building the classifier

We need to prepare a list of examples and corresponding class labels.

In [5]:
labeled_names = ([(name, 'female') for name in names.words('female.txt')] + [(name, 'male') for name in names.words('male.txt')])
labeled_names[:5]

[('Abagael', 'female'),
 ('Abagail', 'female'),
 ('Abbe', 'female'),
 ('Abbey', 'female'),
 ('Abbi', 'female')]

In [6]:
import random
random.shuffle(labeled_names) # We shuffle the data so that we can split it by index into training and test data.
labeled_names[:5]

[('Lilli', 'female'),
 ('Jeff', 'male'),
 ('Kaspar', 'male'),
 ('Celia', 'female'),
 ('Renell', 'female')]

In [7]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
featuresets[:5]

[({'last_letter': 'i'}, 'female'),
 ({'last_letter': 'f'}, 'male'),
 ({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'l'}, 'female')]

In [8]:
len(featuresets)

7944

In [56]:
from nltk import NaiveBayesClassifier

# We split the data into a training (80%) and test (20%) set:
TRAIN_SET_SIZE = round(len(featuresets) * .8)
train_set, test_set = featuresets[:TRAIN_SET_SIZE], featuresets[TRAIN_SET_SIZE:]

# We also get the names in the test set, to be used later:
test_names = labeled_names[TRAIN_SET_SIZE:]

classifier = NaiveBayesClassifier.train(train_set)

# When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory: 
# from nltk.classify import apply_features
# train_names, test_names = labeled_names[:round(len(featuresets) * .8)], labeled_names[round(len(featuresets) * .8):]
# train_set = apply_features(gender_features, labeled_names[500:])
# test_set = apply_features(gender_features, labeled_names[:500])

In [38]:
classifier.show_most_informative_features(10) # Prints likelihood ratios for most informative features

Most Informative Features
             last_letter = 'a'            female : male   =     34.1 : 1.0
             last_letter = 'k'              male : female =     28.0 : 1.0
             last_letter = 'v'              male : female =     16.5 : 1.0
             last_letter = 'f'              male : female =     12.6 : 1.0
             last_letter = 'm'              male : female =     10.7 : 1.0
             last_letter = 'p'              male : female =      9.2 : 1.0
             last_letter = 'd'              male : female =      8.9 : 1.0
             last_letter = 'o'              male : female =      8.6 : 1.0
             last_letter = 'r'              male : female =      7.2 : 1.0
             last_letter = 'g'              male : female =      5.9 : 1.0


## 4. Testing the classifier

In [39]:
classifier.labels()

['female', 'male']

In [40]:
from nltk.classify import accuracy

round(accuracy(classifier, test_set), 2)

0.76

In [41]:
classifier.classify(gender_features('Aphrodite'))

'female'

In [42]:
classifier.classify(gender_features('Zeus'))

'male'

## 5. Building a classifier with more features

In [43]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features2('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'last_letter': 'n'}

In [44]:
featuresets2 = [(gender_features2(n), gender) for (n, gender) in labeled_names]
featuresets2[0]

({'count(a)': 0,
  'count(b)': 0,
  'count(c)': 0,
  'count(d)': 0,
  'count(e)': 0,
  'count(f)': 0,
  'count(g)': 0,
  'count(h)': 0,
  'count(i)': 2,
  'count(j)': 0,
  'count(k)': 0,
  'count(l)': 3,
  'count(m)': 0,
  'count(n)': 0,
  'count(o)': 0,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 0,
  'count(s)': 0,
  'count(t)': 0,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'first_letter': 'l',
  'has(a)': False,
  'has(b)': False,
  'has(c)': False,
  'has(d)': False,
  'has(e)': False,
  'has(f)': False,
  'has(g)': False,
  'has(h)': False,
  'has(i)': True,
  'has(j)': False,
  'has(k)': False,
  'has(l)': True,
  'has(m)': False,
  'has(n)': False,
  'has(o)': False,
  'has(p)': False,
  'has(q)': False,
  'has(r)': False,
  'has(s)': False,
  'has(t)': False,
  'has(u)': False,
  'has(v)': False,
  'has(w)': False,
  'has(x)': False,
  'has(y)': False,
  'has(z)': False,
  'last_letter': 'i'},
 'female')

In [45]:
train_set2, test_set2 = featuresets2[:TRAIN_SET_SIZE], featuresets2[TRAIN_SET_SIZE:]
classifier2 = NaiveBayesClassifier.train(train_set2)
round(accuracy(classifier2, test_set2), 2)

0.78

We would have expected that having too many specific features on a small dataset would lead to overfitting, but it seems the classifier was good at avoiding that since its performance is slightly better.

In [46]:
classifier2.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     34.1 : 1.0
             last_letter = 'k'              male : female =     28.0 : 1.0
             last_letter = 'v'              male : female =     16.5 : 1.0
             last_letter = 'f'              male : female =     12.6 : 1.0
             last_letter = 'm'              male : female =     10.7 : 1.0
             last_letter = 'p'              male : female =      9.2 : 1.0
             last_letter = 'd'              male : female =      8.9 : 1.0
             last_letter = 'o'              male : female =      8.6 : 1.0
             last_letter = 'r'              male : female =      7.2 : 1.0
                count(v) = 2              female : male   =      6.8 : 1.0
             last_letter = 'g'              male : female =      5.9 : 1.0
             last_letter = 'w'              male : female =      5.9 : 1.0
                count(a) = 3              female : male   =      5.4 : 1.0

Indeed, it seems the classifier is mainly using the last letter, along with some other features that happen to improve the accuracy.

## 6. Comparing the two classifiers using `nltk.metrics`

Before we start, here's a useful function for comparing strings:

In [47]:
from nltk.metrics import edit_distance

edit_distance("John", "Joan")

1

The NLTK metrics module provides functions for calculating metrics beyond mere accuracy. But in order to do so, we need to build 2 sets for each classification label: a reference set of correct values, and a test set of observed values.

In [48]:
import collections

# Classifier 1
refsets = collections.defaultdict(set) # For what this is: https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
    
# Classifier 2
refsets2 = collections.defaultdict(set)
testsets2 = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set2):
    refsets2[label].add(i)
    observed = classifier2.classify(feats)
    testsets2[observed].add(i)

In [49]:
refsets

defaultdict(set,
            {'female': {2,
              3,
              4,
              6,
              7,
              8,
              9,
              10,
              11,
              12,
              17,
              19,
              21,
              22,
              23,
              24,
              25,
              28,
              29,
              32,
              33,
              34,
              38,
              40,
              41,
              44,
              47,
              48,
              50,
              51,
              53,
              56,
              57,
              60,
              61,
              62,
              66,
              67,
              68,
              69,
              70,
              72,
              74,
              75,
              78,
              79,
              80,
              82,
              83,
              85,
              87,
              88,
              89,
              90,
        

In [50]:
testsets

defaultdict(set,
            {'female': {1,
              2,
              3,
              4,
              6,
              8,
              10,
              12,
              13,
              17,
              20,
              21,
              22,
              24,
              25,
              26,
              27,
              28,
              29,
              31,
              32,
              33,
              34,
              36,
              38,
              40,
              41,
              43,
              46,
              47,
              50,
              52,
              53,
              54,
              57,
              60,
              61,
              63,
              64,
              65,
              66,
              67,
              68,
              69,
              70,
              71,
              72,
              74,
              77,
              79,
              80,
              82,
              85,
              87,
       

In [51]:
from nltk.metrics.scores import (precision, recall, f_measure)

# We can proceed to print the metrics for each classifier. However, we cannot get the accuracy in this manner because nltk.metrics.scores.accuracy(reference, test) works by comparing test[i] == reference[i] and our reference and test are not formatted in a way that allows for this. It's the same for the confusion matrix.
args = (
    round(precision(refsets['female'], testsets['female']), 2),
    round(precision(refsets['male'], testsets['male']), 2),
    round(recall(refsets['female'], testsets['female']), 2),
    round(recall(refsets['male'], testsets['male']), 2),
    round(f_measure(refsets['female'], testsets['female']), 2),
    round(f_measure(refsets['male'], testsets['male']), 2)
)

args2 = (
    round(precision(refsets2['female'], testsets2['female']), 2),
    round(precision(refsets2['male'], testsets2['male']), 2),
    round(recall(refsets2['female'], testsets2['female']), 2),
    round(recall(refsets2['male'], testsets2['male']), 2),
    round(f_measure(refsets2['female'], testsets2['female']), 2),
    round(f_measure(refsets2['male'], testsets2['male']), 2)
)

print('''
CLASSIFIER 1
------------ 
Female precision: {0}
Male precision: {1}
Female recall: {2}
Male recall: {3}
Female F1 score: {4}
Male F1 score: {5}

CLASSIFIER 2
------------ 
Female precision: {6}
Male precision: {7}
Female recall: {8}
Male recall: {9}
Female F1 score: {10}
Male F1 score: {11}
'''.format(*args, *args2))


CLASSIFIER 1
------------ 
Female precision: 0.78
Male precision: 0.73
Female recall: 0.87
Male recall: 0.58
Female F1 score: 0.82
Male F1 score: 0.65

CLASSIFIER 2
------------ 
Female precision: 0.82
Male precision: 0.71
Female recall: 0.83
Male recall: 0.69
Female F1 score: 0.83
Male F1 score: 0.7



## 7. Error analysis

In [52]:
errors = []
for (name, tag) in test_names:
    guess = classifier2.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

errors[:5]

[('male', 'female', 'Hersh'),
 ('female', 'male', 'Ros'),
 ('female', 'male', 'Kerrin'),
 ('female', 'male', 'Shaylynn'),
 ('male', 'female', 'Tully')]

In [53]:
for (tag, guess, name) in sorted(errors):
    print('Correct = {:8} guess = {:8} name = {}'.format(tag, guess, name)) # :8 creates spaces between columns.

Correct = female   guess = male     name = Adriaens
Correct = female   guess = male     name = Alis
Correct = female   guess = male     name = Allsun
Correct = female   guess = male     name = Alys
Correct = female   guess = male     name = Anet
Correct = female   guess = male     name = Ann
Correct = female   guess = male     name = Ashlen
Correct = female   guess = male     name = Barb
Correct = female   guess = male     name = Beitris
Correct = female   guess = male     name = Berget
Correct = female   guess = male     name = Bev
Correct = female   guess = male     name = Bidget
Correct = female   guess = male     name = Brenn
Correct = female   guess = male     name = Brett
Correct = female   guess = male     name = Brigit
Correct = female   guess = male     name = Brynn
Correct = female   guess = male     name = Carlyn
Correct = female   guess = male     name = Carmen
Correct = female   guess = male     name = Caro
Correct = female   guess = male     name = Cass
Correct = female  

Looking through this list of errors, it seems that some suffixes that are more than one letter long can be indicative of name genders. For example, names ending in *yn* appear to be predominantly female, despite the fact that names ending in *n* tend to be male; and names ending in *ch* are usually male, even though names that end in *h* tend to be female.

## 8. Building a classifier with even more features

In [54]:
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["suffix1"] = name[-1].lower()
    features["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features3('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'suffix1': 'n',
 'suffix2': 'hn'}

In [55]:
featuresets3 = [(gender_features3(n), gender) for (n, gender) in labeled_names]
featuresets3[0]

({'count(a)': 0,
  'count(b)': 0,
  'count(c)': 0,
  'count(d)': 0,
  'count(e)': 0,
  'count(f)': 0,
  'count(g)': 0,
  'count(h)': 0,
  'count(i)': 2,
  'count(j)': 0,
  'count(k)': 0,
  'count(l)': 3,
  'count(m)': 0,
  'count(n)': 0,
  'count(o)': 0,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 0,
  'count(s)': 0,
  'count(t)': 0,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'first_letter': 'l',
  'has(a)': False,
  'has(b)': False,
  'has(c)': False,
  'has(d)': False,
  'has(e)': False,
  'has(f)': False,
  'has(g)': False,
  'has(h)': False,
  'has(i)': True,
  'has(j)': False,
  'has(k)': False,
  'has(l)': True,
  'has(m)': False,
  'has(n)': False,
  'has(o)': False,
  'has(p)': False,
  'has(q)': False,
  'has(r)': False,
  'has(s)': False,
  'has(t)': False,
  'has(u)': False,
  'has(v)': False,
  'has(w)': False,
  'has(x)': False,
  'has(y)': False,
  'has(z)': False,
  'suffix1': 'i',
  'suffix2': 'li'},
 'fem

In [80]:
train_set3, test_set3 = featuresets3[:TRAIN_SET_SIZE], featuresets3[TRAIN_SET_SIZE:]
classifier3 = NaiveBayesClassifier.train(train_set3)
round(accuracy(classifier3, test_set3), 2)

0.79

In [59]:
classifier3.show_most_informative_features(15)

Most Informative Features
                 suffix2 = 'na'           female : male   =     83.4 : 1.0
                 suffix2 = 'ia'           female : male   =     79.6 : 1.0
                 suffix1 = 'a'            female : male   =     34.1 : 1.0
                 suffix2 = 'sa'           female : male   =     31.7 : 1.0
                 suffix2 = 'rd'             male : female =     28.9 : 1.0
                 suffix1 = 'k'              male : female =     28.0 : 1.0
                 suffix2 = 'us'             male : female =     25.5 : 1.0
                 suffix2 = 'ra'           female : male   =     23.0 : 1.0
                 suffix2 = 'ld'             male : female =     21.1 : 1.0
                 suffix2 = 'ta'           female : male   =     20.8 : 1.0
                 suffix2 = 'do'             male : female =     20.6 : 1.0
                 suffix2 = 'rt'             male : female =     19.3 : 1.0
                 suffix1 = 'v'              male : female =     16.5 : 1.0

## 9. Trying to use a maximum entropy classifier

The principle of **maximum entropy** states that the probability distribution which best represents the current state of knowledge is the one with largest entropy.

The principle of maximum entropy is invoked when we have some piece(s) of information about a probability distribution, but not enough to characterize it completely—likely because we do not have the means or resources to do so. As an example, if all we know about a distribution is its average, we can imagine infinite shapes that yield a particular average. The principle of maximum entropy says that we should humbly choose the distribution that maximizes the amount of unpredictability contained in the distribution.

Taking the idea to the extreme, it wouldn’t be scientific to choose a distribution that simply yields the average value 100% of the time.

From all the models that fit our training data, the Maximum Entropy classifier selects the one which has the largest entropy. Due to the minimum assumptions that the Maximum Entropy classifier makes, it is usually used when we don’t know anything about the prior distributions and when it is unsafe to make any assumptions. Also, the maximum entropy classifier is used when we can’t assume the conditional independence of the features.

In [76]:
from nltk import MaxentClassifier

me_classifier = MaxentClassifier.train(train_set3, max_iter=25) # max_iter has default value 100. In this example, the performance in terms of accuracy on the test set starts significantly improving beyond the previous model's at around 25 iterations.

  ==> Training (25 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.60190        0.631
             3          -0.58085        0.631
             4          -0.56149        0.635
             5          -0.54373        0.663
             6          -0.52748        0.698
             7          -0.51261        0.729
             8          -0.49902        0.752
             9          -0.48659        0.765
            10          -0.47520        0.774
            11          -0.46475        0.784
            12          -0.45514        0.789
            13          -0.44630        0.794
            14          -0.43815        0.797
            15          -0.43062        0.800
            16          -0.42364        0.804
            17          -0.41716        0.805
            18          -0.41115        0.806
            19          -0.40554        0.806
  

In [78]:
round(accuracy(me_classifier, test_set3), 2) # The accuracies above were on the training set so this is what matters.

0.8

In [81]:
me_classifier.show_most_informative_features(10)

  -1.978 suffix2=='ia' and label is 'male'
  -1.921 suffix2=='na' and label is 'male'
  -1.515 suffix2=='sa' and label is 'male'
  -1.463 suffix1=='a' and label is 'male'
  -1.290 suffix2=='ra' and label is 'male'
  -1.278 suffix1=='k' and label is 'female'
  -1.197 suffix2=='rd' and label is 'female'
  -1.169 suffix2=='do' and label is 'female'
  -1.167 suffix2=='us' and label is 'female'
  -1.166 suffix2=='ta' and label is 'male'


## 10. More classifiers

Scikit-learn (sklearn) is a popular library which features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN.

NLTK provides an API to quickly use sklearn classifiers in `nltk.classify.scikitlearn`. The other option is to import and use sklearn directly.

For an example of integrating sklearn with NLTK, you can check out [this](https://www.kaggle.com/alvations/basic-nlp-with-nltk) notebook on Kaggle. Kaggle is a great website for NLP and machine learning in general, creating an account is highly recommended.