# Name Gender Identifier

## 1. Building a feature extractor

We can use the last letter of the name to predict the gender. For instance, English names ending in “a”, “e”, and “i” are likely to be female, while those ending in “k”, “o”, “r”, “s”, and “t” are likely to be male.

We start by building a feature extractor.

In [1]:
def extract_gender_features_1(word):
    return {'last_letter': word[-1]}


extract_gender_features_1('Samantha')

{'last_letter': 'a'}

## 2. Exploring the `names` corpus

In [2]:
from nltk.corpus import names

print(names.readme()[:195])

Names Corpus, Version 1.3 (1994-03-29)
Copyright (C) 1991 Mark Kantrowitz
Additions by Bill Ross

This corpus contains 5001 female names and 2943 male names, sorted
alphabetically, one per line.



In [3]:
names.fileids()

['female.txt', 'male.txt']

In [4]:
names.words('female.txt')[:5]

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi']

## 3. Building a name gender classifier

We need to prepare a list of examples and corresponding class labels.

In [5]:
labeled_names = [(name, 'female') for name in names.words('female.txt')] + [
    (name, 'male') for name in names.words('male.txt')
]
labeled_names[:5]

[('Abagael', 'female'),
 ('Abagail', 'female'),
 ('Abbe', 'female'),
 ('Abbey', 'female'),
 ('Abbi', 'female')]

We proceed by shuffling the data so that we can split it by index into training and test data.

In [6]:
import random

random.shuffle(labeled_names)
labeled_names[:5]

[('Nelle', 'female'),
 ('Lorie', 'female'),
 ('Shelley', 'male'),
 ('Anna-Diana', 'female'),
 ('Corbin', 'male')]

In real research applications, consider shuffling with a seeded generator instead, e.g. `random.Random(4).shuffle(labeled_names)`, so that your results are reproducible.
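
As a minimal sketch of why seeding matters, the snippet below shuffles two copies of a small, hypothetical list of names with the same seed. The names and variable names are made up for illustration; only the standard library's `random.Random` is used.

```python
import random

# Hypothetical toy data; in the notebook this would be `labeled_names`.
names_a = ['Ann', 'Bob', 'Cleo', 'Dev', 'Eve']
names_b = list(names_a)

# Two independent generators created with the same seed
# produce the same sequence of shuffling decisions.
random.Random(4).shuffle(names_a)
random.Random(4).shuffle(names_b)

print(names_a == names_b)  # True: both copies end up in the same order
```

Using a dedicated `random.Random` instance also leaves the global random state untouched, which can help when other code relies on it.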

In [7]:
feature_sets_1 = [(extract_gender_features_1(n), gender) for (n, gender) in labeled_names]
feature_sets_1[:5]

[({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'n'}, 'male')]

In [8]:
len(feature_sets_1)

7944

We will use an 80–20 split into training and test data.

In [9]:
from nltk import NaiveBayesClassifier

TRAIN_TEST_SPLIT = 0.8
TRAIN_SET_SIZE = round(len(feature_sets_1) * TRAIN_TEST_SPLIT)
train_set_1, test_set_1 = feature_sets_1[:TRAIN_SET_SIZE], feature_sets_1[TRAIN_SET_SIZE:]

test_names = labeled_names[TRAIN_SET_SIZE:]  # For later use

classifier_1 = NaiveBayesClassifier.train(train_set_1)

**PS:** When working with large corpora, constructing a single list that contains the features of every instance, as we did above, can use a lot of memory. In these cases, it is better to use `nltk.classify.apply_features()`, which returns an object that acts like a list but does not store all the feature sets in memory:

```py
from nltk.classify import apply_features

# Split the raw (name, label) pairs at the index computed earlier
train_names = labeled_names[:TRAIN_SET_SIZE]
test_names = labeled_names[TRAIN_SET_SIZE:]

train_set = apply_features(extract_gender_features_1, train_names)
test_set = apply_features(extract_gender_features_1, test_names)
```
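
To illustrate the idea of a list-like object that does not store its feature sets, here is a rough pure-Python sketch. It is not NLTK's implementation; the class `LazyFeatureSets` and the helper `last_letter_features` are hypothetical names, and the sketch supports integer indices only.

```python
class LazyFeatureSets:
    """List-like view that computes (features, label) pairs on demand."""

    def __init__(self, extractor, labeled_items):
        self._extractor = extractor
        self._items = labeled_items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Features are computed at access time, never stored.
        item, label = self._items[index]
        return (self._extractor(item), label)


def last_letter_features(word):
    return {'last_letter': word[-1]}


# Hypothetical toy data
toy_names = [('Samantha', 'female'), ('Corbin', 'male')]
lazy_set = LazyFeatureSets(last_letter_features, toy_names)

print(len(lazy_set))  # 2
print(lazy_set[0])    # ({'last_letter': 'a'}, 'female')
```

The trade-off is that features are recomputed on every access, so this saves memory at the cost of some extra computation.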

Now that we have our classifier, we can print the likelihood ratios for the most informative features:

In [10]:
classifier_1.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     35.1 : 1.0
             last_letter = 'a'            female : male   =     34.2 : 1.0
             last_letter = 'f'              male : female =     24.4 : 1.0
             last_letter = 'p'              male : female =     16.5 : 1.0
             last_letter = 'd'              male : female =     11.7 : 1.0
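
To give a feel for what such a ratio expresses, the sketch below computes an unsmoothed likelihood ratio, P(feature | label A) / P(feature | label B), from hypothetical toy counts. It is a simplification under stated assumptions: the data is made up, and NLTK smooths its probability estimates, so its numbers would not match this calculation exactly.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (last_letter, label) pairs.
toy_data = [
    ('a', 'female'), ('a', 'female'), ('a', 'female'), ('e', 'female'),
    ('a', 'male'), ('k', 'male'), ('k', 'male'), ('e', 'male'),
]

label_counts = Counter(label for _, label in toy_data)
feature_counts = defaultdict(Counter)
for letter, label in toy_data:
    feature_counts[label][letter] += 1


def conditional_prob(letter, label):
    """Unsmoothed estimate of P(last_letter == letter | label)."""
    return feature_counts[label][letter] / label_counts[label]


def likelihood_ratio(letter, label_a, label_b):
    """How much more likely the letter is under label_a than under label_b."""
    return conditional_prob(letter, label_a) / conditional_prob(letter, label_b)


print(likelihood_ratio('a', 'female', 'male'))  # 3.0
```

Without smoothing, a letter that never occurs with one label (here, "k" with `female`) would cause a division by zero, which is one reason smoothed estimates are preferred in practice.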


## 4. Testing the classifier

In [11]:
classifier_1.labels()

['female', 'male']

In [12]:
from nltk.classify import accuracy

round(accuracy(classifier_1, test_set_1), 2)

0.76

In [13]:
classifier_1.classify(extract_gender_features_1('Aphrodite'))

'female'

In [14]:
classifier_1.classify(extract_gender_features_1('Zeus'))

'male'

## 5. Building a classifier with more features

Next, we will try a classifier with more features: the first and last letters of the name, plus, for each letter of the alphabet, whether it occurs in the name and, as a separate feature, how many times it occurs.

While there is an overlap between the presence and count of letters in a name, it is wise to try both as features. It could turn out that the simpler presence feature or the more nuanced count feature can better predict gender.

In [15]:
def extract_gender_features_2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["has({})".format(letter)] = letter in name.lower()
        features["count({})".format(letter)] = name.lower().count(letter)
    return features


john_gender_features = extract_gender_features_2('John')
print('EXAMPLE GENDER FEATURES FOR "JOHN"')
print('first_letter:', john_gender_features['first_letter'])
print('last_letter:', john_gender_features['last_letter'])
print('has(a):', john_gender_features['has(a)'])
print('count(a):', john_gender_features['count(a)'])

EXAMPLE GENDER FEATURES FOR "JOHN"
first_letter: j
last_letter: n
has(a): False
count(a): 0


In [16]:
feature_sets_2 = [(extract_gender_features_2(n), gender) for (n, gender) in labeled_names]
train_set_2, test_set_2 = feature_sets_2[:TRAIN_SET_SIZE], feature_sets_2[TRAIN_SET_SIZE:]

classifier_2 = NaiveBayesClassifier.train(train_set_2)
round(accuracy(classifier_2, test_set_2), 2)

0.76

We might have expected so many specific features on a small dataset to cause overfitting. That does not seem to have happened here: the accuracy on the test set is comparable to the previous classifier’s.

In [17]:
classifier_2.show_most_informative_features(10)

Most Informative Features
             last_letter = 'k'              male : female =     35.1 : 1.0
             last_letter = 'a'            female : male   =     34.2 : 1.0
             last_letter = 'f'              male : female =     24.4 : 1.0
             last_letter = 'p'              male : female =     16.5 : 1.0
             last_letter = 'd'              male : female =     11.7 : 1.0
             last_letter = 'z'              male : female =      9.7 : 1.0
             last_letter = 'v'              male : female =      8.5 : 1.0
                count(v) = 2              female : male   =      8.4 : 1.0
             last_letter = 'm'              male : female =      7.7 : 1.0
             last_letter = 'o'              male : female =      6.8 : 1.0


It appears the classifier is mainly using the last letter feature.

## 6. Comparing the two classifiers using `nltk.metrics`

Before we start, here’s a useful function for comparing strings:

In [18]:
from nltk.metrics import edit_distance

edit_distance("John", "Joan")

1

The `nltk.metrics` module provides functions for calculating metrics beyond just accuracy. To use it, we need to build two sets for each classification label: a reference set of correct values and a test set of observed values.

Starting with the reference sets, we will first build a dictionary with two keys, `male` and `female`. The value for each key will be a set containing all the indices of male and female names, respectively. As for the test sets, we will build a similar dictionary for each classifier, where, this time, the values for the `male` and `female` keys will be the set of indices that the classifier predicted to be male or female.

In [19]:
import collections

ref_sets = collections.defaultdict(set)
test_sets_1 = collections.defaultdict(set)
test_sets_2 = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set_1):
    ref_sets[label].add(i)

    observed_1 = classifier_1.classify(feats)
    test_sets_1[observed_1].add(i)

for i, (feats, label) in enumerate(test_set_2):
    observed_2 = classifier_2.classify(feats)
    test_sets_2[observed_2].add(i)
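
As a rough illustration of what set-based precision, recall, and F1 amount to, here is a pure-Python sketch on hypothetical index sets. The function names `set_precision`, `set_recall`, and `set_f1` are made up for this example, and the sketch assumes both sets are non-empty.

```python
def set_precision(reference, test):
    """Fraction of the predicted indices that are actually correct."""
    return len(reference & test) / len(test)


def set_recall(reference, test):
    """Fraction of the correct indices that were predicted."""
    return len(reference & test) / len(reference)


def set_f1(reference, test):
    """Harmonic mean of precision and recall."""
    p = set_precision(reference, test)
    r = set_recall(reference, test)
    return 2 * p * r / (p + r)


# Hypothetical example: indices 0-3 are truly female,
# while the classifier labelled indices 1-5 as female.
reference_female = {0, 1, 2, 3}
predicted_female = {1, 2, 3, 4, 5}

print(set_precision(reference_female, predicted_female))  # 0.6
print(set_recall(reference_female, predicted_female))     # 0.75
print(round(set_f1(reference_female, predicted_female), 2))  # 0.67
```

Because the sets hold indices rather than names, two test items with the same name are still counted separately.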

We can now proceed to print the metrics for each classifier. This will include all metrics except for the accuracy and the confusion matrix. We actually cannot obtain the accuracy in this manner because `nltk.metrics.scores.accuracy(reference, test)` works by comparing `test[i] == reference[i]`, and our reference and test data are not formatted in a way that allows for this. `nltk.metrics.confusionmatrix.ConfusionMatrix(reference, test)` also works the same way.

In [20]:
from nltk.metrics.scores import precision, recall, f_measure

args1 = (
    round(precision(ref_sets['female'], test_sets_1['female']), 2),
    round(precision(ref_sets['male'], test_sets_1['male']), 2),
    round(recall(ref_sets['female'], test_sets_1['female']), 2),
    round(recall(ref_sets['male'], test_sets_1['male']), 2),
    round(f_measure(ref_sets['female'], test_sets_1['female']), 2),
    round(f_measure(ref_sets['male'], test_sets_1['male']), 2),
)

args2 = (
    round(precision(ref_sets['female'], test_sets_2['female']), 2),
    round(precision(ref_sets['male'], test_sets_2['male']), 2),
    round(recall(ref_sets['female'], test_sets_2['female']), 2),
    round(recall(ref_sets['male'], test_sets_2['male']), 2),
    round(f_measure(ref_sets['female'], test_sets_2['female']), 2),
    round(f_measure(ref_sets['male'], test_sets_2['male']), 2),
)

print(
    '''
CLASSIFIER 1
------------
Female precision: {0}
Male precision: {1}
Female recall: {2}
Male recall: {3}
Female F1 score: {4}
Male F1 score: {5}

CLASSIFIER 2
------------
Female precision: {6}
Male precision: {7}
Female recall: {8}
Male recall: {9}
Female F1 score: {10}
Male F1 score: {11}
'''.format(
        *args1, *args2
    )
)


CLASSIFIER 1
------------
Female precision: 0.8
Male precision: 0.68
Female recall: 0.82
Male recall: 0.65
Female F1 score: 0.81
Male F1 score: 0.67

CLASSIFIER 2
------------
Female precision: 0.81
Male precision: 0.69
Female recall: 0.82
Male recall: 0.67
Female F1 score: 0.81
Male F1 score: 0.68
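
If accuracy or a confusion matrix is also wanted, the labels need to be laid out as two parallel lists rather than as index sets. The sketch below shows that layout on hypothetical labels and computes the element-wise accuracy and the confusion counts by hand. In the notebook, the two lists could presumably be built from `test_set_1` and `classifier_1.classify()`; that part is an assumption and is not shown here.

```python
from collections import Counter

# Hypothetical parallel lists: position i of both lists refers to the same test item.
reference_labels = ['female', 'female', 'male', 'male', 'female']
predicted_labels = ['female', 'male', 'male', 'female', 'female']

# Accuracy: share of positions where the prediction equals the reference.
matches = sum(ref == pred for ref, pred in zip(reference_labels, predicted_labels))
list_accuracy = matches / len(reference_labels)
print(list_accuracy)  # 0.6

# Confusion counts keyed by (reference label, predicted label).
confusion_counts = Counter(zip(reference_labels, predicted_labels))
for (ref, pred), count in sorted(confusion_counts.items()):
    print('reference = {:8} predicted = {:8} count = {}'.format(ref, pred, count))
```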



## 7. Error analysis

Let’s look for patterns in the classification errors made by the second classifier.

In [21]:
errors = []
for name, tag in test_names:
    guess = classifier_2.classify(extract_gender_features_2(name))
    if guess != tag:
        errors.append((tag, guess, name))

errors[:5]

[('female', 'male', 'Windy'),
 ('female', 'male', 'Joey'),
 ('male', 'female', 'Israel'),
 ('male', 'female', 'Dean'),
 ('female', 'male', 'Sheree')]

Let’s sort the errors and print a few of them in a more readable format.

In [22]:
for tag, guess, name in sorted(errors)[:10]:
    print('Correct = {:8} guess = {:8} name = {}'.format(tag, guess, name))

Correct = female   guess = male     name = Aubry
Correct = female   guess = male     name = Audrey
Correct = female   guess = male     name = Audry
Correct = female   guess = male     name = Bab
Correct = female   guess = male     name = Bell
Correct = female   guess = male     name = Beret
Correct = female   guess = male     name = Bert
Correct = female   guess = male     name = Berte
Correct = female   guess = male     name = Berty
Correct = female   guess = male     name = Beryl


If we go through the whole list of errors, we can observe that suffixes that are more than one letter long can be indicative of name genders. E.g., names ending in “yn” appear to be predominantly female, even though names that end in “n” tend to be male; and names ending in “ch” are predominantly male, even though names that end in “h” tend to be female.
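
One way to check a suffix pattern like this is to count labels per suffix. The sketch below does so on a small, hand-picked list of names that is purely illustrative and not drawn from the corpus; the function `suffix_label_counts` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-picked examples; the real analysis would use `labeled_names`.
toy_labeled_names = [
    ('Carolyn', 'female'), ('Evelyn', 'female'), ('Jocelyn', 'female'),
    ('Aldrich', 'male'), ('Erich', 'male'), ('Rich', 'male'),
    ('Sarah', 'female'), ('Deborah', 'female'), ('Hannah', 'female'),
    ('John', 'male'), ('Brian', 'male'), ('Martin', 'male'),
]


def suffix_label_counts(labeled, length):
    """Map each lowercase suffix of the given length to a Counter of labels."""
    counts = defaultdict(Counter)
    for name, label in labeled:
        counts[name[-length:].lower()][label] += 1
    return counts


one_letter = suffix_label_counts(toy_labeled_names, 1)
two_letter = suffix_label_counts(toy_labeled_names, 2)

# The one-letter suffixes are ambiguous in this toy list...
print(dict(one_letter['n']), dict(one_letter['h']))
# ...while the two-letter suffixes separate the labels.
print(dict(two_letter['yn']), dict(two_letter['ch']))
```

Running the same counts over the training portion of the corpus, rather than the test names, would avoid tuning features on the data used for evaluation.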

## 8. Building a classifier with even more features

In [23]:
def extract_gender_features_3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["suffix_1"] = name[-1].lower()
    features["suffix_2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["has({})".format(letter)] = letter in name.lower()
        features["count({})".format(letter)] = name.lower().count(letter)
    return features


jacqueline_gender_features = extract_gender_features_3('Jacqueline')

print('EXAMPLE GENDER FEATURES FOR "JACQUELINE"')
print('first_letter:', jacqueline_gender_features['first_letter'])
print('suffix_1:', jacqueline_gender_features['suffix_1'])
print('suffix_2:', jacqueline_gender_features['suffix_2'])
print('has(a):', jacqueline_gender_features['has(a)'])
print('count(a):', jacqueline_gender_features['count(a)'])

EXAMPLE GENDER FEATURES FOR "JACQUELINE"
first_letter: j
suffix_1: e
suffix_2: ne
has(a): True
count(a): 1


In [24]:
feature_sets_3 = [(extract_gender_features_3(n), gender) for (n, gender) in labeled_names]
train_set_3, test_set_3 = feature_sets_3[:TRAIN_SET_SIZE], feature_sets_3[TRAIN_SET_SIZE:]

classifier_3 = NaiveBayesClassifier.train(train_set_3)
round(accuracy(classifier_3, test_set_3), 2)

0.78

In [25]:
classifier_3.show_most_informative_features(10)

Most Informative Features
                suffix_2 = 'na'           female : male   =     85.8 : 1.0
                suffix_2 = 'la'           female : male   =     64.4 : 1.0
                suffix_2 = 'rt'             male : female =     46.2 : 1.0
                suffix_2 = 'rd'             male : female =     41.8 : 1.0
                suffix_2 = 'ta'           female : male   =     37.7 : 1.0
                suffix_1 = 'k'              male : female =     35.1 : 1.0
                suffix_1 = 'a'            female : male   =     34.2 : 1.0
                suffix_2 = 'ia'           female : male   =     32.9 : 1.0
                suffix_2 = 'ra'           female : male   =     32.4 : 1.0
                suffix_2 = 'us'             male : female =     26.5 : 1.0


## 9. Trying a maximum entropy classifier

The principle of **maximum entropy** states that the probability distribution that best represents the current state of knowledge is the one with the largest entropy.

The principle of maximum entropy is invoked when we have some information about a probability distribution but not enough to characterize it completely, likely because we do not have the means or resources to do so. For example, if all we know about a distribution is its average, there are infinitely many shapes that yield that average. The principle of maximum entropy says that we should humbly opt for the one that maximizes the unpredictability contained in the distribution.

Taking the idea to the extreme, it wouldn’t be scientific to choose a distribution that yields the average value 100% of the time.
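
The small sketch below makes this concrete under simple assumptions: three hypothetical candidate distributions over the values 1, 2, and 3 all have the same average (2), but very different entropies. The helper names and the candidates are made up for illustration.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


def expected_value(values, probs):
    return sum(v * p for v, p in zip(values, probs))


values = [1, 2, 3]

# Hypothetical candidates that all average to 2.
candidates = {
    'always the average': [0.0, 1.0, 0.0],
    'only the extremes': [0.5, 0.0, 0.5],
    'uniform': [1 / 3, 1 / 3, 1 / 3],
}

for label, probs in candidates.items():
    print('{:20} mean = {:.2f}  entropy = {:.3f} bits'.format(
        label, expected_value(values, probs), entropy_bits(probs)))

# Among these candidates, the uniform distribution has the largest entropy.
best = max(candidates, key=lambda label: entropy_bits(candidates[label]))
print(best)  # uniform
```

The distribution that always yields the average has zero entropy: it claims complete certainty, which is far more than knowing the average alone would justify.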

From all the models that fit our training data, the maximum entropy classifier selects the one with the largest entropy. Because it makes minimal assumptions, the maximum entropy classifier is typically used when we know nothing about the prior distributions and it is unsafe to assume anything. It is also used when we can’t assume that the features are conditionally independent.

NLTK’s `MaxentClassifier.train()` method takes an argument called `max_iter` with a default value of 100. Running all 100 iterations would take a long time. In this example, the accuracy on the test set starts to improve on the previous model’s at around 15 iterations, so we will set `max_iter` to `20`.

In [26]:
from nltk import MaxentClassifier

me_classifier = MaxentClassifier.train(train_set_3, max_iter=20)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.60103        0.631
             3          -0.57930        0.631
             4          -0.55934        0.639
             5          -0.54105        0.670
             6          -0.52434        0.705
             7          -0.50908        0.736
             8          -0.49516        0.757
             9          -0.48244        0.767
            10          -0.47081        0.780
            11          -0.46016        0.789
            12          -0.45039        0.794
            13          -0.44141        0.797
            14          -0.43315        0.802
            15          -0.42552        0.804
            16          -0.41847        0.806
            17          -0.41193        0.807
            18          -0.40587        0.808
            19          -0.40023        0.811
  

The accuracies printed above were on the training set, so now we will check the accuracy on the test set.

In [27]:
round(accuracy(me_classifier, test_set_3), 2)

0.79

In [28]:
me_classifier.show_most_informative_features(10)

  -1.642 suffix_2=='na' and label is 'male'
  -1.603 suffix_2=='la' and label is 'male'
  -1.343 suffix_2=='rt' and label is 'female'
  -1.315 suffix_2=='ta' and label is 'male'
  -1.281 suffix_2=='ra' and label is 'male'
  -1.222 suffix_1=='a' and label is 'male'
  -1.198 suffix_2=='ia' and label is 'male'
  -1.196 suffix_1=='k' and label is 'female'
  -1.186 suffix_2=='rd' and label is 'female'
  -1.089 suffix_1=='f' and label is 'female'


## 10. More classifiers

`scikit-learn` (sklearn) is a popular library that features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN.

NLTK provides a wrapper around sklearn classifiers, `nltk.classify.scikitlearn`, which is useful for quick experiments. The other option is to import and use sklearn directly.

For an example of integrating sklearn with NLTK, you can check out Liling Tan’s [Basic NLP with NLTK](https://www.kaggle.com/code/alvations/basic-nlp-with-nltk) notebook on Kaggle.