<a href="https://colab.research.google.com/github/mkane968/Text-Mining-Experiments/blob/main/NLTK/Tutorial%206%3A%20Name%20Gender%20Identifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 6: Name Gender Identifier

***Building a feature extractor***

One idea is to use the last letter of a name to predict its gender. For instance, names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male.

In [None]:
# Feature extractor which returns the last letter of a word
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('John')

{'last_letter': 'n'}

The returned dictionary is known as a feature set.
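
To see why the last letter might carry signal, one can tally last letters per label before training anything. The sketch below does this on a tiny hand-made list; the names and labels in `toy_names` are illustrative assumptions, not drawn from the corpus.

```python
from collections import Counter, defaultdict

def gender_features(word):
    """Return the feature set used above: the last letter of the word."""
    return {'last_letter': word[-1]}

# Hypothetical toy data for illustration only (not taken from the names corpus).
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Sophie', 'female'),
             ('Mark', 'male'), ('Peter', 'male'), ('Hugo', 'male')]

# Count how often each last letter occurs under each label.
last_letter_counts = defaultdict(Counter)
for name, label in toy_names:
    last_letter_counts[label][gender_features(name)['last_letter']] += 1

print(dict(last_letter_counts['female']))
print(dict(last_letter_counts['male']))
```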

In [None]:
#Import and open the names corpus
import nltk
nltk.download('names')
from nltk.corpus import names

names.readme().replace('\n', ' ')

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


'Names Corpus, Version 1.3 (1994-03-29) Copyright (C) 1991 Mark Kantrowitz Additions by Bill Ross  This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line.  You may use the lists of names for any purpose, so long as credit is given in any published work. You may also redistribute the list if you provide the recipients with a copy of this README file. The lists are not in the public domain (I retain the copyright on the lists) but are freely redistributable.  If you have any additions to the lists of names, I would appreciate receiving them.  Mark Kantrowitz <mkant+@cs.cmu.edu> http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/'

In [None]:
#Get the file ids in the names corpus
names.fileids()

['female.txt', 'male.txt']

In [None]:
#Get the first five words in the female text file in corpus
names.words('female.txt')[:5]

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi']

To build the classifier, we need to prepare a list of examples and corresponding class labels.

In [None]:
#Create list of labeled names: names in female.txt are labeled female and names in male.txt are labeled male; print first five in labeled names list
labeled_names = ([(name, 'female') for name in names.words('female.txt')] + [(name, 'male') for name in names.words('male.txt')])
labeled_names[:5]

[('Abagael', 'female'),
 ('Abagail', 'female'),
 ('Abbe', 'female'),
 ('Abbey', 'female'),
 ('Abbi', 'female')]

In [None]:
# We shuffle the data so that we can split it by index into training and test data.
import random
random.shuffle(labeled_names) 
labeled_names[:5]

[('Norbert', 'male'),
 ('Stoddard', 'female'),
 ('Silvan', 'male'),
 ('Joete', 'female'),
 ('Nance', 'female')]
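
Because the shuffle above is unseeded, the first five names, the train/test split and every accuracy reported later will differ from run to run. A minimal sketch of one way to make the shuffle repeatable follows; the helper `shuffled_copy`, the toy list and the seed value 42 are assumptions for illustration.

```python
import random

# Hypothetical stand-in for labeled_names, for illustration only.
toy_labeled = [('Anna', 'female'), ('Mark', 'male'), ('Maria', 'female'),
               ('Peter', 'male'), ('Sophie', 'female'), ('Hugo', 'male')]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items, using a private seeded generator."""
    rng = random.Random(seed)   # separate generator; global random state is untouched
    copy = list(items)
    rng.shuffle(copy)
    return copy

first = shuffled_copy(toy_labeled, seed=42)
second = shuffled_copy(toy_labeled, seed=42)
print(first == second)  # same seed, same order
```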

In [None]:
#Create list of the last letter of each name in labeled names and corresponding gender, print first five
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
featuresets[:5]

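[({'last_letter': 't'}, 'male'),
 ({'last_letter': 'd'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'e'}, 'female')]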

In [None]:
#print length of feature sets
len(featuresets)

7944

In [None]:
from nltk import NaiveBayesClassifier

# We split the data into a training (80%) and test (20%) set:
TRAIN_SET_SIZE = round(len(featuresets) * .8)
train_set, test_set = featuresets[:TRAIN_SET_SIZE], featuresets[TRAIN_SET_SIZE:]

# We also get the names in the test set, to be used later:
test_names = labeled_names[TRAIN_SET_SIZE:]

classifier = NaiveBayesClassifier.train(train_set)

# When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory.
# In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:
# from nltk.classify import apply_features
# train_names, test_names = labeled_names[:TRAIN_SET_SIZE], labeled_names[TRAIN_SET_SIZE:]
# train_set = apply_features(gender_features, train_names)
# test_set = apply_features(gender_features, test_names)
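
The idea behind a list-like object that does not store all feature sets can be sketched in plain Python. The class `LazyFeatureList` below is a hypothetical simplification written for this tutorial, not NLTK's implementation; it supports only integer indexing, `len()` and iteration.

```python
class LazyFeatureList:
    """List-like view that computes features on access instead of storing them.

    A simplified illustration of the idea only; this is not NLTK's implementation.
    """
    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        # Features are computed here, each time an item is requested.
        token, label = self._tokens[index]
        return (self._func(token), label)

def gender_features(word):
    return {'last_letter': word[-1]}

# Hypothetical toy data for illustration only.
toy_labeled = [('Anna', 'female'), ('Mark', 'male')]
lazy = LazyFeatureList(gender_features, toy_labeled)
print(len(lazy), lazy[1])
```

The trade-off is time for memory: features are recomputed on every access rather than kept around.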

In [None]:
# Prints likelihood ratios for most informative features
classifier.show_most_informative_features(10) 

Most Informative Features
             last_letter = 'a'            female : male   =     31.8 : 1.0
             last_letter = 'k'              male : female =     27.6 : 1.0
             last_letter = 'v'              male : female =     10.4 : 1.0
             last_letter = 'p'              male : female =      9.7 : 1.0
             last_letter = 'd'              male : female =      8.8 : 1.0
             last_letter = 'o'              male : female =      8.6 : 1.0
             last_letter = 'm'              male : female =      7.6 : 1.0
             last_letter = 'r'              male : female =      6.9 : 1.0
             last_letter = 'g'              male : female =      4.9 : 1.0
             last_letter = 'z'              male : female =      4.6 : 1.0
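
Each row above compares how likely a feature value is under one label with how likely it is under the other. The sketch below computes such a ratio from raw counts. The counts are hypothetical, and the unsmoothed relative frequencies are a simplification: NLTK smooths its probability estimates, so its figures will not match this arithmetic exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature | label A) / P(feature | label B) from raw counts (no smoothing)."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    return p_a / p_b

# Hypothetical counts: 300 of 1000 names under label A end in 'a',
# versus 10 of 1000 names under label B.
ratio = likelihood_ratio(300, 1000, 10, 1000)
print('{:.1f} : 1.0'.format(ratio))
```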


Testing the classifier:

In [None]:
#Get labels from classifier
classifier.labels()

['male', 'female']

In [None]:
#Get accuracy of classifier
from nltk.classify import accuracy

round(accuracy(classifier, test_set), 2)

0.76

In [None]:
#Test classifier on female name based on last letter of name
classifier.classify(gender_features('Aphrodite'))

'female'

In [None]:
#Test classifier on male name based on last letter of name
classifier.classify(gender_features('Zeus'))

'male'

Building a classifier with more features:

In [None]:
#Define a feature extractor which records the lowercased first and last letters of a name, which letters the name contains, and how often each occurs
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features2('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'last_letter': 'n'}

In [None]:
#Get features above for list of gendered names and put in list, print first item in list
featuresets2 = [(gender_features2(n), gender) for (n, gender) in labeled_names]
featuresets2[0]

({'count(a)': 0,
  'count(b)': 1,
  'count(c)': 0,
  'count(d)': 0,
  'count(e)': 1,
  'count(f)': 0,
  'count(g)': 0,
  'count(h)': 0,
  'count(i)': 0,
  'count(j)': 0,
  'count(k)': 0,
  'count(l)': 0,
  'count(m)': 0,
  'count(n)': 1,
  'count(o)': 1,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 2,
  'count(s)': 0,
  'count(t)': 1,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'first_letter': 'n',
  'has(a)': False,
  'has(b)': True,
  'has(c)': False,
  'has(d)': False,
  'has(e)': True,
  'has(f)': False,
  'has(g)': False,
  'has(h)': False,
  'has(i)': False,
  'has(j)': False,
  'has(k)': False,
  'has(l)': False,
  'has(m)': False,
  'has(n)': True,
  'has(o)': True,
  'has(p)': False,
  'has(q)': False,
  'has(r)': True,
  'has(s)': False,
  'has(t)': True,
  'has(u)': False,
  'has(v)': False,
  'has(w)': False,
  'has(x)': False,
  'has(y)': False,
  'has(z)': False,
  'last_letter': 't'},
 'male')

In [None]:
#Train new classifier on same set of male and female names above and get accuracy
train_set2, test_set2 = featuresets2[:TRAIN_SET_SIZE], featuresets2[TRAIN_SET_SIZE:]
classifier2 = NaiveBayesClassifier.train(train_set2)
round(accuracy(classifier2, test_set2), 2)

0.79

One might expect so many specific features on a small dataset to cause overfitting, but the classifier appears to avoid it here: its accuracy on the test set is slightly higher (0.79 versus 0.76).



In [None]:
#Show the most informative features for the new classifier
classifier2.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     31.8 : 1.0
             last_letter = 'k'              male : female =     27.6 : 1.0
             last_letter = 'v'              male : female =     10.4 : 1.0
             last_letter = 'p'              male : female =      9.7 : 1.0
                count(v) = 2              female : male   =      8.9 : 1.0
             last_letter = 'd'              male : female =      8.8 : 1.0
             last_letter = 'o'              male : female =      8.6 : 1.0
             last_letter = 'm'              male : female =      7.6 : 1.0
             last_letter = 'r'              male : female =      6.9 : 1.0
            first_letter = 'w'              male : female =      4.9 : 1.0
             last_letter = 'g'              male : female =      4.9 : 1.0
             last_letter = 'z'              male : female =      4.6 : 1.0
             last_letter = 'b'              male : female =      4.4 : 1.0

Indeed, the classifier appears to rely mainly on the last letter, with a few other features contributing the small gain in accuracy.

***Comparing the two classifiers using nltk.metrics***

Before we start, here's a useful function for comparing strings:

In [None]:
#Edit distance is the number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2.
from nltk.metrics import edit_distance

edit_distance("John", "Joan")

1
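
For readers curious how such a distance can be computed, here is a minimal dynamic-programming sketch with unit cost for each substitution, insertion and deletion. The function name `levenshtein` is an assumption for illustration; this is not NLTK's source code.

```python
def levenshtein(s1, s2):
    """Edit distance with unit cost for substitution, insertion and deletion."""
    previous = list(range(len(s2) + 1))      # distances from '' to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]                        # distance from s1[:i] to ''
        for j, c2 in enumerate(s2, start=1):
            substitute = previous[j - 1] + (c1 != c2)
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            current.append(min(substitute, insert, delete))
        previous = current
    return previous[-1]

print(levenshtein("John", "Joan"))       # 1
print(levenshtein("kitten", "sitting"))  # 3
```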

The NLTK metrics module provides functions for calculating metrics beyond mere accuracy. But in order to do so, we need to build 2 sets for each classification label: a reference set of correct values, and a test set of observed values.

In [None]:
import collections

# Classifier 1
refsets = collections.defaultdict(set) # For what this is: https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
    
# Classifier 2
refsets2 = collections.defaultdict(set)
testsets2 = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set2):
    refsets2[label].add(i)
    observed = classifier2.classify(feats)
    testsets2[observed].add(i)

In [None]:
refsets

In [None]:
testsets

In [None]:
from nltk.metrics.scores import (precision, recall, f_measure)

# We can now print the metrics for each classifier.
# However, we cannot get the accuracy this way: nltk.metrics.scores.accuracy(reference, test) compares test[i] == reference[i],
# and our reference and test sets are not aligned lists. The same applies to the confusion matrix.
args = (
    round(precision(refsets['female'], testsets['female']), 2),
    round(precision(refsets['male'], testsets['male']), 2),
    round(recall(refsets['female'], testsets['female']), 2),
    round(recall(refsets['male'], testsets['male']), 2),
    round(f_measure(refsets['female'], testsets['female']), 2),
    round(f_measure(refsets['male'], testsets['male']), 2)
)

args2 = (
    round(precision(refsets2['female'], testsets2['female']), 2),
    round(precision(refsets2['male'], testsets2['male']), 2),
    round(recall(refsets2['female'], testsets2['female']), 2),
    round(recall(refsets2['male'], testsets2['male']), 2),
    round(f_measure(refsets2['female'], testsets2['female']), 2),
    round(f_measure(refsets2['male'], testsets2['male']), 2)
)

print('''
CLASSIFIER 1
------------ 
Female precision: {0}
Male precision: {1}
Female recall: {2}
Male recall: {3}
Female F1 score: {4}
Male F1 score: {5}

CLASSIFIER 2
------------ 
Female precision: {6}
Male precision: {7}
Female recall: {8}
Male recall: {9}
Female F1 score: {10}
Male F1 score: {11}
'''.format(*args, *args2))


CLASSIFIER 1
------------ 
Female precision: 0.81
Male precision: 0.67
Female recall: 0.82
Male recall: 0.66
Female F1 score: 0.81
Male F1 score: 0.67

CLASSIFIER 2
------------ 
Female precision: 0.83
Male precision: 0.72
Female recall: 0.85
Male recall: 0.68
Female F1 score: 0.84
Male F1 score: 0.7
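
The set-based metrics reported above reduce to a few set operations. A minimal sketch, with hypothetical index sets and helper names (`set_precision`, `set_recall`, `set_f1`) chosen for illustration; it does not guard against empty sets.

```python
def set_precision(reference, test):
    """Fraction of items predicted for a label that truly carry it."""
    return len(reference & test) / len(test)

def set_recall(reference, test):
    """Fraction of items that truly carry a label that were predicted for it."""
    return len(reference & test) / len(reference)

def set_f1(reference, test):
    """Harmonic mean of precision and recall."""
    p, r = set_precision(reference, test), set_recall(reference, test)
    return 2 * p * r / (p + r)

# Hypothetical index sets for one label: items 0-3 are truly 'female',
# and the classifier labelled items 0, 1, 2 and 5 as 'female'.
reference = {0, 1, 2, 3}
test = {0, 1, 2, 5}
print(set_precision(reference, test), set_recall(reference, test), set_f1(reference, test))
```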



***Error analysis:*** Investigating errors of classifier (names whose gender was misclassified)

In [None]:
#Make list for errors and load in classifications where guess does not equal gender tag, print first five
errors = []
for (name, tag) in test_names:
    # classifier2 was trained on gender_features2, so use the same extractor here
    guess = classifier2.classify(gender_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

errors[:5]

[('female', 'male', 'Christean'),
 ('female', 'male', 'Charis'),
 ('male', 'female', 'Cody'),
 ('male', 'female', 'Micah'),
 ('male', 'female', 'Tracie')]

In [None]:
#Print three columns (correct gender of name, guessed gender, and name itself)
for (tag, guess, name) in sorted(errors):
    print('Correct = {:8} guess = {:8} name = {}'.format(tag, guess, name)) # :8 creates spaces between columns.

Correct = female   guess = male     name = Abagael
Correct = female   guess = male     name = Abagail
Correct = female   guess = male     name = Abigael
Correct = female   guess = male     name = Aidan
Correct = female   guess = male     name = Ailyn
Correct = female   guess = male     name = Aimil
Correct = female   guess = male     name = Allis
Correct = female   guess = male     name = Amabel
Correct = female   guess = male     name = Amber
Correct = female   guess = male     name = Ambur
Correct = female   guess = male     name = Ann
Correct = female   guess = male     name = Anne-Mar
Correct = female   guess = male     name = Arden
Correct = female   guess = male     name = Ariel
Correct = female   guess = male     name = Arleen
Correct = female   guess = male     name = Arlyn
Correct = female   guess = male     name = Aryn
Correct = female   guess = male     name = Avril
Correct = female   guess = male     name = Beatriz
Correct = female   guess = male     name = Beret
Correct = 

Looking through this list of errors, it seems that some suffixes that are more than one letter long can be indicative of name genders. For example, names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female.
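
Rather than scanning the list by eye, one could count misclassified names per two-letter suffix. The sketch below does this for a handful of triples copied from the error output above; the variable names are assumptions, and with so few items the counts are only illustrative.

```python
from collections import Counter

# A few (correct, guess, name) triples copied from the error listing above.
sample_errors = [('female', 'male', 'Ailyn'), ('female', 'male', 'Arlyn'),
                 ('female', 'male', 'Aryn'), ('female', 'male', 'Ann'),
                 ('male', 'female', 'Cody'), ('male', 'female', 'Micah')]

# Count misclassified names per (two-letter suffix, correct label).
suffix_counts = Counter((name[-2:].lower(), tag) for tag, _, name in sample_errors)

for (suffix, tag), n in suffix_counts.most_common(3):
    print('suffix={} correct={} errors={}'.format(suffix, tag, n))
```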

Building a classifier with even more features in response to errors

In [None]:
#Define new feature extractor which adds the last letter (suffix1) and last two letters (suffix2) of the name to the first letter and letter counts
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["suffix1"] = name[-1].lower()
    features["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features3('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'suffix1': 'n',
 'suffix2': 'hn'}

In [None]:
#Get features above for list of gendered names and put in list, print first item in list
featuresets3 = [(gender_features3(n), gender) for (n, gender) in labeled_names]
featuresets3[0]

({'count(a)': 0,
  'count(b)': 1,
  'count(c)': 0,
  'count(d)': 0,
  'count(e)': 1,
  'count(f)': 0,
  'count(g)': 0,
  'count(h)': 0,
  'count(i)': 0,
  'count(j)': 0,
  'count(k)': 0,
  'count(l)': 0,
  'count(m)': 0,
  'count(n)': 1,
  'count(o)': 1,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 2,
  'count(s)': 0,
  'count(t)': 1,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'first_letter': 'n',
  'has(a)': False,
  'has(b)': True,
  'has(c)': False,
  'has(d)': False,
  'has(e)': True,
  'has(f)': False,
  'has(g)': False,
  'has(h)': False,
  'has(i)': False,
  'has(j)': False,
  'has(k)': False,
  'has(l)': False,
  'has(m)': False,
  'has(n)': True,
  'has(o)': True,
  'has(p)': False,
  'has(q)': False,
  'has(r)': True,
  'has(s)': False,
  'has(t)': True,
  'has(u)': False,
  'has(v)': False,
  'has(w)': False,
  'has(x)': False,
  'has(y)': False,
  'has(z)': False,
  'suffix1': 't',
  'suffix2': 'rt'},
 'male')

In [None]:
#Train new classifier on same set of male and female names above and get accuracy
train_set3, test_set3 = featuresets3[:TRAIN_SET_SIZE], featuresets3[TRAIN_SET_SIZE:]
classifier3 = NaiveBayesClassifier.train(train_set3)
round(accuracy(classifier3, test_set3), 2)

0.8

In [None]:
#Get 15 most informative features for classifier3
classifier3.show_most_informative_features(15)

Most Informative Features
                 suffix2 = 'na'           female : male   =     84.0 : 1.0
                 suffix2 = 'la'           female : male   =     67.8 : 1.0
                 suffix2 = 'ra'           female : male   =     53.7 : 1.0
                 suffix2 = 'ia'           female : male   =     49.4 : 1.0
                 suffix2 = 'us'             male : female =     33.3 : 1.0
                 suffix1 = 'a'            female : male   =     31.8 : 1.0
                 suffix2 = 'rd'             male : female =     29.9 : 1.0
                 suffix1 = 'k'              male : female =     27.6 : 1.0
                 suffix2 = 'sa'           female : male   =     27.3 : 1.0
                 suffix2 = 'ta'           female : male   =     21.9 : 1.0
                 suffix2 = 'do'             male : female =     21.4 : 1.0
                 suffix2 = 'ld'             male : female =     20.7 : 1.0
                 suffix2 = 'rt'             male : female =     16.7 : 1.0

***Maximum entropy classifier:*** The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge is the one with largest entropy.

The principle of maximum entropy is invoked when we have some information about a probability distribution, but not enough to characterize it completely, typically because we lack the means or resources to do so. For example, if all we know about a distribution is its average, infinitely many shapes yield that average. The principle of maximum entropy says that we should humbly choose, among them, the distribution that maximizes the unpredictability it contains.

Taking the idea to the extreme, it wouldn’t be scientific to choose a distribution that simply yields the average value 100% of the time.

From all the models that fit our training data, the Maximum Entropy classifier selects the one with the largest entropy. Because it makes minimal assumptions, it is usually used when we know nothing about the prior distributions and it is unsafe to assume anything. It is also used when we cannot assume that the features are conditionally independent.
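
To make the principle concrete, the sketch below compares three distributions over the values 1, 2 and 3 that all have the same average, 2, but very different entropies. The distributions and helper names are illustrative assumptions.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

def mean(values, probabilities):
    """Expected value of a discrete distribution."""
    return sum(v * p for v, p in zip(values, probabilities))

values = [1, 2, 3]
candidates = {
    'always the average': [0.0, 1.0, 0.0],
    'only the extremes':  [0.5, 0.0, 0.5],
    'uniform':            [1/3, 1/3, 1/3],
}

for name, probs in candidates.items():
    print('{:20} mean={:.2f} entropy={:.3f} bits'.format(
        name, mean(values, probs), entropy(probs)))
```

Knowing only the average, the principle would pick the uniform distribution here, since it has the largest entropy of the three.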

In [None]:
from nltk import MaxentClassifier

# max_iter defaults to 100.
# In this example, accuracy on the test set starts to improve noticeably beyond the previous model's at around 25 iterations.
me_classifier = MaxentClassifier.train(train_set3, max_iter=25) 

  ==> Training (25 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.373
             2          -0.60435        0.627
             3          -0.58273        0.627
             4          -0.56287        0.633
             5          -0.54470        0.668
             6          -0.52810        0.703
             7          -0.51296        0.730
             8          -0.49913        0.752
             9          -0.48651        0.767
            10          -0.47497        0.779
            11          -0.46440        0.787
            12          -0.45471        0.792
            13          -0.44580        0.795
            14          -0.43760        0.795
            15          -0.43004        0.798
            16          -0.42304        0.799
            17          -0.41656        0.801
            18          -0.41055        0.802
            19          -0.40495        0.805
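
One way to read the first row: a log likelihood of -0.69315 is what a model would produce if it had learned nothing yet and split probability evenly between the two labels. This is an interpretation of the log, assuming natural logarithms, and can be checked with a single line of arithmetic.

```python
import math

# With two labels and no learned weights, a model assigns probability 0.5 to each.
uniform_probability = 1 / 2
print(round(math.log(uniform_probability), 5))  # -0.69315
```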
  

In [None]:
#Get the accuracy of the maximum entropy classifier on the test set. The accuracies printed during training were measured on the training set, so this is the figure that matters.
round(accuracy(me_classifier, test_set3), 2) 

0.81

In [None]:
#Get 10 most informative features for me classifier
me_classifier.show_most_informative_features(10)

  -1.938 suffix2=='na' and label is 'male'
  -1.922 suffix2=='la' and label is 'male'
  -1.886 suffix2=='ra' and label is 'male'
  -1.658 suffix2=='ia' and label is 'male'
  -1.430 suffix2=='sa' and label is 'male'
  -1.387 suffix1=='a' and label is 'male'
  -1.346 suffix2=='us' and label is 'female'
  -1.277 suffix1=='k' and label is 'female'
  -1.217 suffix2=='ta' and label is 'male'
  -1.213 suffix2=='rd' and label is 'female'
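
Each line above pairs a weight with a (feature, label) combination; a negative weight counts against that label when the feature is present. The sketch below shows, in the textbook softmax form, how summed weights could turn into label probabilities. Using the natural exponent and ignoring all other weights are assumptions made for illustration; NLTK's internal representation may differ, so the resulting numbers are not the classifier's actual output.

```python
import math

def label_probabilities(scores):
    """Normalize exponentiated per-label scores into probabilities (softmax)."""
    exponentiated = {label: math.exp(score) for label, score in scores.items()}
    total = sum(exponentiated.values())
    return {label: value / total for label, value in exponentiated.items()}

# Weights copied from the listing above for a name ending in 'na':
# both active features count against the label 'male'.
scores = {
    'male': -1.938 + -1.387,   # suffix2=='na' and suffix1=='a'
    'female': 0.0,             # assumption: no other active weights
}
probs = label_probabilities(scores)
print({label: round(p, 3) for label, p in probs.items()})
```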


***More classifiers***

Scikit-learn (sklearn) is a popular library which features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN.

NLTK provides an API to quickly use sklearn classifiers in nltk.classify.scikitlearn. The other option is to import and use sklearn directly.
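
A minimal sketch of the wrapper route, assuming both NLTK and scikit-learn are installed. `SklearnClassifier` wraps a scikit-learn estimator so that it can be trained on the same (feature set, label) pairs used throughout this tutorial; the choice of `LogisticRegression` and the toy data are assumptions for illustration.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression

def gender_features(word):
    return {'last_letter': word[-1]}

# Hypothetical toy data for illustration only (not the names corpus).
toy_labeled = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
               ('Mark', 'male'), ('Frank', 'male'), ('Erik', 'male')]
toy_train = [(gender_features(name), label) for name, label in toy_labeled]

# Wrap a scikit-learn estimator so it accepts NLTK-style (featureset, label) pairs.
sk_classifier = SklearnClassifier(LogisticRegression()).train(toy_train)

print(sk_classifier.classify(gender_features('Sara')))
```

In the notebook itself, `toy_train` would be replaced by `train_set3` or one of the other training sets built earlier.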

For an example of integrating sklearn with NLTK, you can check out [this notebook on Kaggle.](https://www.kaggle.com/alvations/basic-nlp-with-nltk) Kaggle is a great website for NLP and machine learning in general, and creating an account is highly recommended.