#### **DATA 620 - Project #3**

Author: Kory Martin  
Date: 3/20/2024  

Instructions:

- Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
- Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
- Then, starting with the example name gender classifier, make incremental improvements. 
- Use the dev-test set to check your progress. 
- Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? 
- Is this what you'd expect?
- Project is due 3/26.
- Source: Natural Language Processing with Python, exercise 6.10.2.


#### **1. Import Libraries**

In [None]:
from nltk.corpus import names
import nltk
import random
import pandas as pd
import re

#### **2. Import the names corpus**

In [8]:
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     /Users/korymartin/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


True

Create a list that combines the male and female names

In [14]:
names = ([(name,'male') for name in names.words('male.txt')] + [(name,'female') for name in names.words('female.txt')])

Shuffle the names

In [15]:
random.shuffle(names)

In [17]:
len(names)

7944

Split the names data into training data, test data and dev-test data

In [38]:
train = names[1000:]
test = names[:500]
dev_test = names[500:1000]
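Because the shuffle above is unseeded, the three subsets (and every accuracy score that follows) change on each run. A minimal sketch of a reproducible split is shown below. It assumes a hypothetical stand-in list `labeled` instead of the real corpus, and the seed value is arbitrary.

```python
import random

# Hypothetical stand-in for the labeled (name, gender) list built above.
labeled = [(f"name{i}", "male" if i % 2 else "female") for i in range(2000)]

# A dedicated, seeded generator makes the shuffle repeatable.
# The seed value 620 is an arbitrary assumption.
rng = random.Random(620)
rng.shuffle(labeled)

# Same slicing scheme as above: 500 test, 500 dev-test, the rest for training.
test_names = labeled[:500]
dev_test_names = labeled[500:1000]
train_names = labeled[1000:]

print(len(train_names), len(dev_test_names), len(test_names))
```

With a fixed seed, rerunning the notebook gives the same subsets, so accuracy changes between rounds can be attributed to the features rather than to a different split.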

#### **3. Name Classifier**

Here we start by building a function that takes in a name and extracts the features that will be used to train our model

##### **3.1 Round 1**

For our first classifier, we will replicate the classifier created in our text. This classifier uses the last letter of the name as its only feature to classify the name as **male** or **female**

In [86]:
def gender_features(word):
    features = {'last_letter':word[-1]}
    return features

Next we will create the different data sets - along with their features - to train and evaluate our classifier

In [88]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

Next we will train the classifier using the training set

In [89]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
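To make the training step less of a black box, here is a rough pure-Python sketch of what a naive Bayes classifier does with a single last-letter feature. It is an illustration only, not NLTK's implementation: the function names are hypothetical, the toy names are made up, and add-one smoothing is an assumption that may differ from the estimator NLTK uses.

```python
from collections import Counter, defaultdict
import math

def train_last_letter_nb(labeled_names):
    """Count labels and last letters per label from (name, gender) tuples."""
    label_counts = Counter()
    letter_counts = defaultdict(Counter)
    for name, gender in labeled_names:
        label_counts[gender] += 1
        letter_counts[gender][name[-1].lower()] += 1
    return label_counts, letter_counts

def classify_last_letter(name, label_counts, letter_counts, alphabet_size=26):
    """Pick the label maximising log P(label) + log P(last_letter | label)."""
    total = sum(label_counts.values())
    letter = name[-1].lower()
    scores = {}
    for label, n in label_counts.items():
        # Add-one smoothing (an assumption) keeps unseen letters from
        # driving a label's probability to zero.
        likelihood = (letter_counts[label][letter] + 1) / (n + alphabet_size)
        scores[label] = math.log(n / total) + math.log(likelihood)
    return max(scores, key=scores.get)

# Toy training data, invented for illustration.
toy = [("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
       ("John", "male"), ("Mark", "male"), ("Peter", "male")]
label_counts, letter_counts = train_last_letter_nb(toy)
print(classify_last_letter("Julia", label_counts, letter_counts))
```

In the toy data, names ending in "a" appear only among the female examples, so a new name ending in "a" scores higher for that label.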

Here we see that the classifier based on the last letter of the name reaches an accuracy of roughly 74% on the test set

In [90]:
print(nltk.classify.accuracy(classifier,test_set))

0.742


We will now use our dev_test set to examine the errors generated by our classifier and use this to identify other features that can be used to improve upon our classifier

In [91]:
errors = []
for (name,tag) in dev_test:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))
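The same error list also yields the dev-test accuracy directly, which is the number the assignment asks us to track between rounds. The sketch below assumes a hypothetical helper `collect_errors` and a hand-written stand-in rule in place of the trained classifier, so it runs without NLTK.

```python
def collect_errors(classify, feature_fn, labeled_names):
    """Return (correct, guess, name) tuples for every misclassified name."""
    errors = []
    for name, tag in labeled_names:
        guess = classify(feature_fn(name))
        if guess != tag:
            errors.append((tag, guess, name))
    return errors

def last_letter_features(word):
    return {'last_letter': word[-1]}

def stand_in_classify(features):
    # Hypothetical rule standing in for classifier.classify:
    # names ending in a, e, or i are guessed to be female.
    return 'female' if features['last_letter'] in 'aei' else 'male'

# Toy dev-test data, invented for illustration.
toy_dev_test = [("Anna", "female"), ("Kyle", "male"),
                ("John", "male"), ("Alice", "female")]

toy_errors = collect_errors(stand_in_classify, last_letter_features, toy_dev_test)
dev_accuracy = 1 - len(toy_errors) / len(toy_dev_test)
print(toy_errors, dev_accuracy)
```

In the notebook itself, `classifier.classify`, `gender_features`, and `dev_test` would take the place of the stand-ins.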

In [92]:
errors_df = pd.DataFrame(errors).rename(columns={0:'correct', 1:'guess', 2:'name'})

In [93]:
pd.set_option('display.max_rows',150)

Looking at the errors log, we can begin to explore some additional features that may be helpful in improving our model. 

In [95]:
r1_errors = errors_df.sort_values(by=['correct','name'], ascending=[False,True])

In [111]:
errors_info = [{'first_letter': r1_errors.loc[i,'name'][0].lower(),
                'correct': r1_errors.loc[i,'correct'],
                'guess': r1_errors.loc[i,'guess']} for i in range(len(r1_errors))]

In [117]:
x = pd.DataFrame(errors_info).sort_values(by='first_letter')
pd.crosstab(x['first_letter'], x['correct'])

| first_letter | correct = female | correct = male |
|---|---|---|
| a | 3 | 6 |
| b | 0 | 1 |
| c | 1 | 4 |
| d | 2 | 6 |
| e | 3 | 2 |
| f | 1 | 1 |
| g | 4 | 5 |
| h | 1 | 3 |
| i | 0 | 1 |
| j | 4 | 5 |


Reviewing the errors by the first letter of the name, it appears that the classifier is misclassifying many names that begin with the letters a, c, d, m, p, and s. This makes it worth extending the feature function to include the first letter. For starters, we will simply add the first letter of the name as a feature; then, depending on what the errors show, it may be worth being more explicit and creating a feature that flags whether the name begins with one of the letters listed above
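One caveat with raw error counts is that common first letters will collect more errors simply because more names start with them. A possible refinement, sketched below with a hypothetical helper and invented toy data, is to divide the error count for each first letter by the number of names that start with it.

```python
from collections import Counter

def error_rate_by_first_letter(errors, labeled_names):
    """errors: (correct, guess, name) tuples; labeled_names: (name, gender) tuples."""
    totals = Counter(name[0].lower() for name, _ in labeled_names)
    wrong = Counter(name[0].lower() for _, _, name in errors)
    # Share of names starting with each letter that were misclassified.
    return {letter: wrong[letter] / totals[letter] for letter in sorted(totals)}

# Toy data, invented for illustration.
toy_names = [("Anna", "female"), ("Adam", "male"), ("Abby", "female"),
             ("Bob", "male"), ("Beth", "female")]
toy_errors = [("male", "female", "Adam")]

rates = error_rate_by_first_letter(toy_errors, toy_names)
print(rates)
```

Applied to the dev-test names, a table like this would show whether a letter such as "a" is really harder to classify or just more frequent.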

##### **3.2 Round 2**

As mentioned at the end of the previous step, we will include a feature for the first letter of the name 

In [128]:
def gender_features(word):
    features = {'last_letter':word[-1], 'first_letter':word[0].lower()}
    return features

In [129]:
featuresets = [(gender_features(n),g) for (n,g) in names]

In [130]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

In [131]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

We see that including the first letter of the name improved the accuracy score on the test set from 74% to 78%. 

In [132]:
print(nltk.classify.accuracy(classifier,test_set))

0.784


We will now use our dev_test set to examine the errors generated by our classifier and use this to identify other features that can be used to improve upon our classifier

In [133]:
dev_test_set

[({'last_letter': 'e', 'first_letter': 's'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'e'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'd'}, 'male'),
 ({'last_letter': 'a', 'first_letter': 's'}, 'female'),
 ({'last_letter': 'n', 'first_letter': 'w'}, 'male'),
 ({'last_letter': 'l', 'first_letter': 'd'}, 'female'),
 ({'last_letter': 'y', 'first_letter': 'p'}, 'female'),
 ({'last_letter': 's', 'first_letter': 'g'}, 'female'),
 ({'last_letter': 'h', 'first_letter': 'a'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'n'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'd'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'a'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'f'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'r'}, 'female'),
 ({'last_letter': 'n', 'first_letter': 'a'}, 'male'),
 ({'last_letter': 'l', 'first_letter': 's'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'p'}, 'female'),
 ({'last_letter': 'd', 'first_letter': 'h'}, 'male'),


In [134]:
errors = []
for (name,tag) in dev_test:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))

In [135]:
errors_df = pd.DataFrame(errors).rename(columns={0:'correct', 1:'guess', 2:'name'})

In [136]:
pd.set_option('display.max_rows',150)

Looking at the errors log, we can begin to explore some additional features that may be helpful in improving our model. 

In [138]:
r2_errors = errors_df.sort_values(by=['correct','name'], ascending=[False,True])

In [140]:
r2_errors.head()

| | correct | guess | name |
|---|---|---|---|
| 55 | male | female | Alaa |
| 2 | male | female | Anton |
| 52 | male | female | Antonin |
| 25 | male | female | Artie |
| 50 | male | female | Ashby |


I've decided to try the combination of the first and last letter as a potential feature. When the errors are cross-tabulated by this combination, most rows have a 0 in one of the two columns, which suggests that the first/last letter pair separates the misclassified male and female names fairly cleanly. I will therefore incorporate it in the next update to our feature set function

In [144]:
# Use the Round 2 errors (r2_errors) for this round's analysis
errors_info = [{'first_letter': r2_errors.loc[i,'name'][0].lower(),
                'first_last': r2_errors.loc[i,'name'][0].lower() + r2_errors.loc[i,'name'][-1:].lower(),
                'correct': r2_errors.loc[i,'correct'],
                'guess': r2_errors.loc[i,'guess']} for i in range(len(r2_errors))]

In [146]:
x = pd.DataFrame(errors_info).sort_values(by='first_letter')
pd.crosstab(x['first_last'], x['correct'])

| first_last | correct = female | correct = male |
|---|---|---|
| aa | 0 | 1 |
| ae | 0 | 3 |
| al | 0 | 1 |
| an | 0 | 2 |
| as | 3 | 0 |
| ay | 0 | 1 |
| be | 0 | 1 |
| bl | 1 | 0 |
| ce | 0 | 1 |
| cl | 0 | 1 |


##### **3.3 Round 3**

In [147]:
def gender_features(word):
    features = {'last_letter':word[-1], 'first_letter':word[0].lower(), 'first_last':word[0].lower()+word[-1:]}
    return features

In [148]:
featuresets = [(gender_features(n),g) for (n,g) in names]

In [149]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

In [150]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

We see that this new feature improved our classifier's test-set accuracy by about 0.6 percentage points (from 78.4% to 79.0%). Not much of an improvement, but a slightly higher accuracy score

In [151]:
print(nltk.classify.accuracy(classifier,test_set))

0.79
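To put the size of these gains in context, the short calculation below converts the reported test-set accuracies into percentage-point changes and into a count of names. It also computes a rough binomial standard error, which assumes the 500 test names are independent draws; that is an approximation, not something established in this notebook.

```python
import math

# Test-set accuracies reported in Rounds 1 to 3 of this notebook.
accuracy = {"round_1": 0.742, "round_2": 0.784, "round_3": 0.79}
n_test = 500  # size of the test set

def gain_in_points(before, after):
    """Accuracy change expressed in percentage points."""
    return round((after - before) * 100, 1)

gain_r2 = gain_in_points(accuracy["round_1"], accuracy["round_2"])
gain_r3 = gain_in_points(accuracy["round_2"], accuracy["round_3"])

# Number of additional test names classified correctly in Round 3.
extra_correct = round((accuracy["round_3"] - accuracy["round_2"]) * n_test)

# Rough binomial standard error of the Round 3 accuracy estimate.
std_error = math.sqrt(accuracy["round_3"] * (1 - accuracy["round_3"]) / n_test)

print(gain_r2, gain_r3, extra_correct, round(std_error * 100, 1))
```

The Round 3 gain corresponds to only a handful of names out of 500 and is smaller than the standard error of the estimate, so it may not reflect a real improvement.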


In [152]:
dev_test_set

[({'last_letter': 'e', 'first_letter': 's', 'first_last': 'se'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'e', 'first_last': 'ea'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'd', 'first_last': 'de'}, 'male'),
 ({'last_letter': 'a', 'first_letter': 's', 'first_last': 'sa'}, 'female'),
 ({'last_letter': 'n', 'first_letter': 'w', 'first_last': 'wn'}, 'male'),
 ({'last_letter': 'l', 'first_letter': 'd', 'first_last': 'dl'}, 'female'),
 ({'last_letter': 'y', 'first_letter': 'p', 'first_last': 'py'}, 'female'),
 ({'last_letter': 's', 'first_letter': 'g', 'first_last': 'gs'}, 'female'),
 ({'last_letter': 'h', 'first_letter': 'a', 'first_last': 'ah'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'n', 'first_last': 'na'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'd', 'first_last': 'da'}, 'female'),
 ({'last_letter': 'a', 'first_letter': 'a', 'first_last': 'aa'}, 'female'),
 ({'last_letter': 'e', 'first_letter': 'f', 'first_last': 'fe'}, 'female'),
 ({'last_letter'

In [159]:
errors = []
for (name,tag) in dev_test:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))

In [160]:
errors_df = pd.DataFrame(errors).rename(columns={0:'correct', 1:'guess', 2:'name'})

In [161]:
pd.set_option('display.max_rows',150)

Looking at the errors log, we can begin to explore some additional features that may be helpful in improving our model. 

In [162]:
r3_errors = errors_df.sort_values(by=['correct','name'], ascending=[False,True])

In [163]:
r3_errors.head()

| | correct | guess | name |
|---|---|---|---|
| 51 | male | female | Alaa |
| 27 | male | female | Artie |
| 46 | male | female | Ashby |
| 76 | male | female | Augustine |
| 69 | male | female | Ave |



##### **3.4 Round 4**

Since we're still seeing many misclassifications tied to the first letter of the name, I'm going to expand the feature function to include a categorical variable indicating whether the first letter is one of the letters identified at the end of Round 1

Here I create a lambda function that sets a flag if the first letter of the name is in the set a, c, d, m, p, s.

In [172]:
# Returns 1 if the string starts with one of the flagged letters, else 0
x = lambda a: 1 if re.search('^[acdmps]', a) is not None else 0

Next, I updated the features function to include this new flag. I also removed the first_letter feature, since it would be highly correlated with the new flag, which is derived from the same first character

In [195]:
def gender_features(word):
    features = {'last_letter': word[-1],
                'first_last': word[0].lower() + word[-1:],
                'flagged_character': x(word[0].lower())}
    return features
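As an aside, the same flag can be computed without a regular expression by testing membership in a set of letters. The sketch below uses hypothetical names (`FLAGGED_LETTERS`, `gender_features_flagged`) and is meant to be equivalent to the function above, not a replacement that was actually evaluated.

```python
# Letters flagged at the end of Round 1.
FLAGGED_LETTERS = frozenset("acdmps")

def gender_features_flagged(word):
    """Hypothetical variant of gender_features using set membership for the flag."""
    first = word[0].lower()
    return {
        'last_letter': word[-1],
        'first_last': first + word[-1:],
        # 1 if the first letter is in the flagged set, otherwise 0.
        'flagged_character': int(first in FLAGGED_LETTERS),
    }

print(gender_features_flagged('Maria'))
print(gender_features_flagged('John'))
```

Note that the flag collapses 26 possible first letters into two groups, so it carries less information than the `first_letter` feature it replaces.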

In [191]:
featuresets = [(gender_features(n),g) for (n,g) in names]

In [192]:
train_set = [(gender_features(n),g) for (n,g) in train]
test_set = [(gender_features(n),g) for (n,g) in test]
dev_test_set = [(gender_features(n),g) for (n,g) in dev_test]

In [193]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

When we run this, we see that the new feature set improves on our first model (which was 74%). However, at 77.6% its performance is below the Round 2 classifier that used the first letter as a feature (78.4%)

In [194]:
print(nltk.classify.accuracy(classifier,test_set))

0.776


#### **4. Conclusion**

In this project, I was able to see how text-based data and natural language processing can be used to develop a supervised classifier. The biggest challenge was identifying features that are useful and predictive enough to inform the classifier. The process I followed was mainly manual and iterative, and there are probably less obvious aspects of a name that could also be used as features. At the same time, reviewing the misclassified names in the dev_test data makes it evident that some names are not exclusively female or male. The classifier is probably effective on less ambiguous names, but it has a harder time classifying this more ambiguous subset correctly
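As one possible direction for the less obvious features mentioned above, the sketch below extracts name endings of different lengths along with a few simple shape features. The function name and the feature choices are hypothetical, and none of them were evaluated in this notebook, so their effect on accuracy is unknown.

```python
def richer_gender_features(word):
    """Hypothetical feature extractor; not evaluated in this project."""
    lowered = word.lower()
    return {
        'suffix1': lowered[-1:],       # last letter
        'suffix2': lowered[-2:],       # last two letters
        'first_letter': lowered[0],
        'length': len(lowered),
        'vowel_count': sum(ch in 'aeiou' for ch in lowered),
    }

print(richer_gender_features('Leslie'))
```

Any such extension would need to be checked against the dev-test set first, leaving the test set for a single final evaluation.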