#### **DATA 620 - Project #3**

Author: Kory Martin  
Date: 3/20/2024  

Instructions:

- Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
- Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
- Then, starting with the example name gender classifier, make incremental improvements. 
- Use the dev-test set to check your progress. 
- Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? 
- Is this what you'd expect?
- Project is due 3/26.
- Source: Natural Language Processing with Python, exercise 6.10.2.


#### **1. Import Libraries**

In [291]:
from nltk.corpus import names
import nltk
import random
import pandas as pd
import re

#### **2. Import the names corpus**

In [292]:
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     /Users/korymartin/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

Create a list that combines the male and female names

In [293]:
names = ([(name,'male') for name in names.words('male.txt')] + [(name,'female') for name in names.words('female.txt')])

In [294]:
pd.DataFrame(names)[1].value_counts()

1
female    5001
male      2943
Name: count, dtype: int64

Shuffle the names

In [295]:
random.seed(1211)
random.shuffle(names)

Since there are more female names than male names, we will start by creating a balanced dataset that has an equal number of male names and female names

In [296]:
female_names = pd.DataFrame(names).loc[pd.DataFrame(names)[1] == 'female'].copy()
male_names = pd.DataFrame(names).loc[pd.DataFrame(names)[1] == 'male'].copy()

In [297]:
num_male_names = len(male_names)

In [298]:
female_names_sample = female_names.sample(num_male_names, random_state=1211).copy()

In [299]:
balanced_names = pd.concat([male_names,female_names_sample])

This results in the following balanced dataset

In [300]:
balanced_names[1].value_counts()

1
male      2943
female    2943
Name: count, dtype: int64
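
The downsampling step can also be sketched without pandas. This is a minimal illustration on toy data, not the notebook's actual code; the helper name `downsample_to_balance` and the toy names are assumptions made for the example.

```python
import random

def downsample_to_balance(labeled, seed=1211):
    """Randomly drop items from larger classes until every class matches the smallest."""
    by_label = {}
    for item, label in labeled:
        by_label.setdefault(label, []).append((item, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        # Sample without replacement so no name is duplicated.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): 2 male names and 4 female names.
toy = [("Adam", "male"), ("Bob", "male"), ("Ann", "female"),
       ("Beth", "female"), ("Cara", "female"), ("Dana", "female")]
balanced = downsample_to_balance(toy)

counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

Sampling without replacement keeps every retained name unique, which mirrors what `DataFrame.sample` does by default.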

In [301]:
balanced_names = balanced_names.sample(frac=1).reset_index(drop=True)

Split the names data into training data, test data and dev-test data

In [302]:
len(balanced_names)

5886

In [303]:
test_pct = 500/7900

In [304]:
test_size = int(round((len(balanced_names)) * test_pct,0))
dev_size = int(round((len(balanced_names)) * test_pct,0))
train_size = int(len(balanced_names) - (test_size + dev_size))


In [305]:
train_size

5140
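
The arithmetic behind these sizes can be checked in isolation. The sketch below hard-codes the 5886 rows reported for `balanced_names` above and repeats the same rounding.

```python
total = 5886                 # len(balanced_names) reported above
test_pct = 500 / 7900        # same proportion as the 500 / 500 / 6900 split in the exercise

test_size = int(round(total * test_pct, 0))
dev_size = int(round(total * test_pct, 0))
train_size = total - (test_size + dev_size)

print(test_size, dev_size, train_size)  # 373 373 5140
```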

In [306]:
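# Note: these slices are taken from the shuffled `names` list (7,944 names),
# not from `balanced_names`, so the class balancing above is not applied here.
test = names[:test_size]
dev_test = names[dev_size:2*dev_size]
train = names[2*dev_size:]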
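
The cell above slices the shuffled `names` list rather than `balanced_names`. A minimal sketch of cutting a balanced list of (name, label) pairs into three non-overlapping subsets might look like the following. The helper `three_way_split` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import random

def three_way_split(items, test_size, dev_size, seed=1211):
    """Shuffle a copy of `items` and cut it into test, dev-test, and train lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    dev_test = shuffled[test_size:test_size + dev_size]
    train = shuffled[test_size + dev_size:]
    return test, dev_test, train

# Toy stand-in for the balanced data: 10 male and 10 female placeholder names.
toy = [(f"m{i}", "male") for i in range(10)] + [(f"f{i}", "female") for i in range(10)]
test, dev_test, train = three_way_split(toy, test_size=3, dev_size=3)
print(len(test), len(dev_test), len(train))  # 3 3 14
```

With a DataFrame such as `balanced_names`, the rows would first need to be converted to a list of tuples before being passed to a helper like this.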



#### **3. Name Classifier**

Here we start by building a function that takes in a name and extracts specific features that will be used to train our model

##### **3.1 Round 1**

For our first classifier, we will replicate the classifier created in our text. This classifier will be trained using the last letter of the name as the only feature to classify the name as **male** or **female**

In [307]:
def gender_features(word):
    features = {'last_letter':word[-1]}
    return features

Next we will create the different data sets - along with their features - to train and evaluate our classifier

In [308]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

Next we will train the classifier using the training set

In [309]:
classifier_a = nltk.NaiveBayesClassifier.train(train_set)
classifier_b = nltk.DecisionTreeClassifier.train(train_set)

Here we see that both the Naive Bayes and Decision Tree classifiers have an accuracy of about 77.7% on the test set

In [310]:
print('Naive Bayes Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_a,test_set)))
print('Decision Tree Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_b,test_set)))

Naive Bayes Classifier Accuracy: 0.7774798927613941
Decision Tree Classifier Accuracy: 0.7774798927613941
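
`nltk.classify.accuracy` reports the fraction of labeled examples whose predicted label matches the gold label. The sketch below reproduces that calculation in plain Python. The rule-based `last_letter_guess` is a hypothetical stand-in for a trained classifier, and the four names are toy data.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

def last_letter_guess(features):
    # Hypothetical stand-in rule, not the trained model:
    # names ending in a, e, i, or y are guessed to be female.
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

def accuracy(classify, gold):
    """Fraction of (features, label) pairs where classify(features) == label."""
    correct = sum(1 for features, label in gold if classify(features) == label)
    return correct / len(gold)

toy = [('Anna', 'female'), ('Mark', 'male'), ('Kelly', 'female'), ('Jesse', 'male')]
toy_set = [(gender_features(n), g) for (n, g) in toy]
print(accuracy(last_letter_guess, toy_set))  # 0.75
```

The same function can be pointed at either `dev_test_set` or `test_set`, which makes it easy to compare the two scores side by side.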


We will now use our dev_test set to examine the errors generated by our classifier and use this to identify other features that can be used to improve upon our classifier

In [311]:
errors = []
for (name,tag) in dev_test:
    guess = classifier_a.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))

In [312]:
errors_df = pd.DataFrame(errors).rename(columns={0:'correct', 1:'guess', 2:'name'})

Looking at the errors log, we can begin to explore some additional features that may be helpful in improving our model. 

In [313]:
r1_errors = errors_df.sort_values(by=['correct','name'], ascending=[False,True])

In [314]:
errors_info = [{'first_letter': r1_errors.loc[i,'name'][0].lower(),
                'correct': r1_errors.loc[i,'correct'],
                'guess': r1_errors.loc[i,'guess']} for i in range(len(r1_errors))]

In [315]:
x = pd.DataFrame(errors_info).sort_values(by='first_letter')
pd.crosstab(x['first_letter'], x['correct'])

| first_letter | female | male |
|---|---|---|
| a | 2 | 1 |
| b | 3 | 3 |
| c | 3 | 3 |
| d | 2 | 4 |
| e | 4 | 2 |
| f | 2 | 1 |
| g | 2 | 0 |
| h | 1 | 2 |
| j | 5 | 3 |
| k | 4 | 0 |


Reviewing the errors by the first letter of the name, it appears that the classifier is misclassifying names that begin with the letters a, c, d, m, p, and s. This makes it worth enhancing the features function to include the first letter. For starters, we will simply add the first letter of the name as a feature; then, depending on what the errors show, it may be worth being more explicit and creating a feature that flags whether the name begins with one of the letters mentioned above
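
Raw error counts per first letter can be skewed by how common each initial is in the dev-test set. A minimal sketch, assuming toy data and a hypothetical helper `error_rate_by_first_letter`, divides each letter's error count by the number of dev-test names that start with that letter:

```python
from collections import Counter

def error_rate_by_first_letter(dev_names, errors):
    """dev_names: (name, label) pairs; errors: (correct, guess, name) triples."""
    totals = Counter(name[0].lower() for name, _ in dev_names)
    wrong = Counter(name[0].lower() for _, _, name in errors)
    return {letter: wrong[letter] / totals[letter] for letter in sorted(totals)}

# Toy data (hypothetical), using the same tuple layouts as dev_test and errors.
dev = [('Anna', 'female'), ('Adam', 'male'), ('Abby', 'female'),
       ('Abe', 'male'), ('Zed', 'male')]
errs = [('male', 'female', 'Abe'), ('male', 'female', 'Zed')]

rates = error_rate_by_first_letter(dev, errs)
print(rates)  # {'a': 0.25, 'z': 1.0}
```

In this toy case both letters have one error each, but the rate shows that "z" is misclassified far more often relative to how many such names exist.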

##### **3.2 Round 2 - Include first letter** 

As mentioned at the end of the previous step, we will include a feature for the first letter of the name 

In [316]:
def gender_features(word):
    features = {'last_letter':word[-1], 'first_letter':word[0].lower()}
    return features

In [317]:
featuresets = [(gender_features(n),g) for (n,g) in names]

In [318]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

In [319]:
classifier_a = nltk.NaiveBayesClassifier.train(train_set)
classifier_b = nltk.DecisionTreeClassifier.train(train_set)

We see that including the first letter of the name improved the accuracy from 77.7% to 78.8% for the Naive Bayes classifier and from 77.7% to 79.6% for the Decision Tree classifier.

In [320]:
print('Naive Bayes Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_a,test_set)))
print('Decision Tree Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_b,test_set)))

Naive Bayes Classifier Accuracy: 0.7882037533512064
Decision Tree Classifier Accuracy: 0.7962466487935657


We will now use our dev_test set to examine the errors generated by our classifier and use this to identify other features that can be used to improve upon our classifier

In [321]:
dev_test_set

[({'last_letter': 'a', 'first_letter': 'm'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'a'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'e'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'm'}, 'female'),
 ({'last_letter': 't', 'first_letter': 'e'}, 'male'),
 ({'last_letter': 'c', 'first_letter': 'f'}, 'male'),
 ({'last_letter': 'e', 'first_letter': 'j'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'r'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'f'}, 'male'),
 ({'last_letter': 'y', 'first_letter': 'f'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'd'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'n'}, 'female'),
 ({'last_letter': 'y', 'first_letter': 'a'}, 'female'),
 ({'last_letter': 'n', 'first_letter': 'c'}, 'female'),
 ({'last_letter': 'y', 'first_letter': 'j'}, 'male'),
 ({'last_letter': 'a', 'first_letter': 'r'}, 'female'),
 ({'last_letter': 'n', 'first_letter': 's'}, 'male'),
 ({'last_letter': 'a', 'first_letter': 'm'}, 'female'),
 (

In [322]:
errors = []
for (name,tag) in dev_test:
    guess = classifier_b.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))

In [323]:
errors_df = pd.DataFrame(errors).rename(columns={0:'correct', 1:'guess', 2:'name'})

Looking at the errors log, we can begin to explore some additional features that may be helpful in improving our model. 

In [324]:
r2_errors = errors_df.sort_values(by=['correct','name'], ascending=[False,True])

In [325]:
r2_errors.head()

| | correct | guess | name |
|---|---|---|---|
| 79 | male | female | Anatol |
| 16 | male | female | Arnie |
| 6 | male | female | Benjie |
| 50 | male | female | Blare |
| 21 | male | female | Brooke |


I've decided to try the combination of the first and last letter as a potential feature. When the errors are tabulated by this combination, many of the counts are 0 for one of the two classes, suggesting that the combination may separate the classes fairly cleanly (a roughly linear decision boundary). I will therefore incorporate it in the next update to our feature function

In [326]:
errors_info = [{'first_letter': r2_errors.loc[i,'name'][0].lower(),
                'first_last': r2_errors.loc[i,'name'][0].lower() + r2_errors.loc[i,'name'][-1:].lower(),
                'correct': r2_errors.loc[i,'correct'],
                'guess': r2_errors.loc[i,'guess']} for i in range(len(r2_errors))]

In [327]:
x = pd.DataFrame(errors_info).sort_values(by='first_letter')
pd.crosstab(x['first_last'], x['correct'])

| first_last | female | male |
|---|---|---|
| ae | 0 | 1 |
| al | 0 | 1 |
| as | 1 | 0 |
| bb | 1 | 0 |
| bd | 1 | 0 |
| be | 0 | 3 |
| bn | 1 | 0 |
| ce | 0 | 2 |
| co | 0 | 1 |
| cy | 0 | 1 |


##### **3.3 Round 3 - First and Last Letter combination** 

In [328]:
def gender_features(word):
    features = {'last_letter':word[-1], 'first_letter':word[0].lower(), 'first_last':word[0].lower()+word[-1:]}
    return features

In [329]:
featuresets = [(gender_features(n),g) for (n,g) in names]

In [330]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

In [331]:
classifier_a = nltk.NaiveBayesClassifier.train(train_set)
classifier_b = nltk.DecisionTreeClassifier.train(train_set)

We see that this new feature improved our Naive Bayes classifier from 78.8% to 79.6%, while the Decision Tree classifier stayed at 79.6%.

In [332]:
print('Naive Bayes Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_a,test_set)))
print('Decision Tree Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_b,test_set)))

Naive Bayes Classifier Accuracy: 0.7962466487935657
Decision Tree Classifier Accuracy: 0.7962466487935657


In [333]:
dev_test_set

[({'last_letter': 'a', 'first_letter': 'm', 'first_last': 'ma'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'a', 'first_last': 'aa'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'e', 'first_last': 'ea'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'm', 'first_last': 'me'}, 'female'),
 ({'last_letter': 't', 'first_letter': 'e', 'first_last': 'et'}, 'male'),
 ({'last_letter': 'c', 'first_letter': 'f', 'first_last': 'fc'}, 'male'),
 ({'last_letter': 'e', 'first_letter': 'j', 'first_last': 'je'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'r', 'first_last': 're'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'f', 'first_last': 'fe'}, 'male'),
 ({'last_letter': 'y', 'first_letter': 'f', 'first_last': 'fy'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'd', 'first_last': 'de'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'n', 'first_last': 'na'}, 'female'),
 ({'last_letter': 'y', 'first_letter': 'a', 'first_last': 'ay'}, 'female'),
 ({'last_letter': 

In [334]:
errors = []
for (name,tag) in dev_test:
    guess = classifier_a.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))

In [335]:
errors_df = pd.DataFrame(errors).rename(columns={0:'correct', 1:'guess', 2:'name'})

Looking at the errors log, we can begin to explore some additional features that may be helpful in improving our model. 

In [336]:
r3_errors = errors_df.sort_values(by=['correct','name'], ascending=[False,True])

In [337]:
r3_errors.head()

| | correct | guess | name |
|---|---|---|---|
| 77 | male | female | Anatol |
| 16 | male | female | Arnie |
| 4 | male | female | Benjie |
| 48 | male | female | Blare |
| 21 | male | female | Brooke |



##### **3.4 Round 4 - Flagged First Character**

Since we're still seeing some strong misclassifications based on the first letter of the name, I'm going to expand the feature function to include a binary flag indicating whether the first letter is one of the letters identified at the end of Round 1

Here I created a lambda function to set a flag if the first letter of the name is in the special set of characters a,c,d,m,p,s.

In [338]:
# Returns 1 if the string starts with one of a, c, d, m, p, s; otherwise 0
x = lambda a: 1 if re.search(r'^[acdmps]', a) is not None else 0

I updated the features function to include this new flag, and also removed the first_letter feature, since it is expected to be highly correlated with the new flag, which is also derived from the first character

In [339]:
def gender_features(word):
    features = {'last_letter': word[-1],
                'first_last': word[0].lower() + word[-1:],
                'flagged_character': x(word[0].lower())}
    return features

In [340]:
featuresets = [(gender_features(n),g) for (n,g) in names]

In [341]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

In [342]:
classifier_a = nltk.NaiveBayesClassifier.train(train_set)
classifier_b = nltk.DecisionTreeClassifier.train(train_set)

When we run this, we see that the new feature actually reduced the accuracy of the Naive Bayes classifier from 79.6% to 78.0%, bringing it back close to the first classifier (77.7%), while once again there is no change in our Decision Tree classifier (79.6%)

In [343]:
print('Naive Bayes Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_a,test_set)))
print('Decision Tree Classifier Accuracy: {}'.format(nltk.classify.accuracy(classifier_b,test_set)))

Naive Bayes Classifier Accuracy: 0.7801608579088471
Decision Tree Classifier Accuracy: 0.7962466487935657


#### **4. Conclusion**

In this example, I was able to see how we can use text-based data and natural language processing to develop a supervised classifier. I think the biggest challenge was identifying features that are useful and predictive enough to inform our classifier. While the process I followed was mainly manual and iterative, there are probably less obvious aspects of a name that could also be used as features. At the same time, reviewing some of the misclassified names in the dev_test data makes it evident that some names are not exclusively female or male. The classifier is probably effective at classifying names that are less ambiguous, but it has a harder time classifying this more ambiguous subset correctly.
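
As one illustration of what additional name features could look like, the sketch below extracts a few more properties from a name. The function `richer_gender_features` and its feature choices (last two letters, length, vowel count) are assumptions made for the example. They were not evaluated in this project, so whether they improve accuracy is untested.

```python
VOWELS = set('aeiou')

def richer_gender_features(word):
    """Hypothetical feature extractor; returns a dict usable as an NLTK featureset."""
    lowered = word.lower()
    return {
        'last_letter': lowered[-1],
        'last_two': lowered[-2:],                             # two-letter suffix
        'first_letter': lowered[0],
        'length': len(lowered),
        'vowel_count': sum(ch in VOWELS for ch in lowered),
        'ends_in_vowel': lowered[-1] in VOWELS,
    }

print(richer_gender_features('Brooke'))
```

Each candidate feature could be added one at a time and checked against the dev-test set, following the same round-by-round process used above.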