# Project 3: Gender Classifier
## Josh Iden
### 3/21/23

## Assignment Overview

![](PJ3.png)

## Step 1: Load and Split the Names

In [1]:
import nltk 
from nltk.corpus import names
import random

labeled_names = ([(name.lower().strip(), 'male') for name in names.words('male.txt')] + [(name.lower().strip(), 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

In [2]:
len(labeled_names)

7944

In [3]:
# split the names
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

In [4]:
print("testing set words: {}".format(len(test_names)))
print("dev test set word: {}".format(len(dev_names)))
print("training set words: {}".format(len(train_names)))

testing set words: 500
dev test set word: 500
training set words: 6944


## Steps 2-3: Make Improvements to Gender Classifier

For this assignment, I am using the `nltk.NaiveBayesClassifier()` function. From NLTK [Chapter 6](https://www.nltk.org/book/ch06.html):

*In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is the highest is then assigned to the input value.*

Before we can use the classifier, however, we'll need to define and encode a set of relevant features. We'll start with a "kitchen sink" approach:

### **Kitchen Sink Iteration: v1**

In [5]:
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [6]:
# create the training and testing sets for the classifier
from nltk.classify import apply_features

train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

In [7]:
# train the classifier and test using devtest set 
v1_classifier = nltk.NaiveBayesClassifier.train(train_set)
v1_score = nltk.classify.accuracy(v1_classifier, devtest_set)
print("v1:",v1_score)

v1: 0.776


In [8]:
# this function will keep a running tally of version accuracies
total = []

def running_tally(version, list):
    '''appends a value to a provided list and provides a running tally'''
    list.append(version)
    for i in range(1,len(list)+1):
        print("v{}: {}".format(i, list[i-1]))
     
    if len(list) > 1:
        print("best version: v{}, {}".format(list.index(max(list))+1, max(list)))

In [9]:
running_tally(v1_score, total)

v1: 0.776


Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders:

In [10]:
from tabulate import tabulate

errors = []

for (name, tag) in dev_names:
    guess = v1_classifier.classify(gender_features(name))
    if guess != tag:
         errors.append( (tag, guess, name) )

In [11]:
header = ['Correct','Guess','Name']
print(tabulate(errors, header, tablefmt="grid"))

+-----------+---------+------------+
| Correct   | Guess   | Name       |
| male      | female  | corrie     |
+-----------+---------+------------+
| male      | female  | earle      |
+-----------+---------+------------+
| female    | male    | rosette    |
+-----------+---------+------------+
| female    | male    | suellen    |
+-----------+---------+------------+
| male      | female  | keil       |
+-----------+---------+------------+
| female    | male    | trude      |
+-----------+---------+------------+
| female    | male    | phyllys    |
+-----------+---------+------------+
| male      | female  | obie       |
+-----------+---------+------------+
| male      | female  | luke       |
+-----------+---------+------------+
| female    | male    | pier       |
+-----------+---------+------------+
| male      | female  | nealy      |
+-----------+---------+------------+
| male      | female  | cammy      |
+-----------+---------+------------+
| female    | male    | zorah      |
+

In [12]:
v1_classifier.show_most_informative_features()

Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     30.4 : 1.0
             last_letter = 'f'              male : female =     14.0 : 1.0
             last_letter = 'd'              male : female =     10.6 : 1.0
             last_letter = 'v'              male : female =      9.9 : 1.0
             last_letter = 'p'              male : female =      9.2 : 1.0
             last_letter = 'm'              male : female =      8.8 : 1.0
                count(v) = 2              female : male   =      8.8 : 1.0
             last_letter = 'o'              male : female =      8.6 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0


Let's add single and multiple letter suffixes to the features and see if it improves the classifier accuracy. As the book states, *Each time the error analysis procedure is repeated, we should select a different dev-test/training split, to ensure that the classifier does not start to reflect idiosyncrasies in the dev-test set.* - so we will incorporate this into our workflow. 

### **Iteration: v2**

In [13]:
random.shuffle(labeled_names)
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

In [14]:
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [15]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

# train the classifier and test using devtest set 
v2_classifier = nltk.NaiveBayesClassifier.train(train_set)
v2_score = nltk.classify.accuracy(v2_classifier, devtest_set)
diff = round(v2_score - v1_score, 6)
print("v2:",v2_score)
print("Difference from previous result: {}".format(diff))

v2: 0.76
Difference from previous result: -0.016


In [16]:
running_tally(v2_score, total)

v1: 0.776
v2: 0.76
best version: v1, 0.776


We see that this improves the model by nearly two percentage points. There were a few errors that had names with hyphens in them. Let's see if we can add those to a list of features. 

Some names have hyphens. Let's see if adding feature `hyphen` improves the model:

### **Iteration: v3**

In [17]:
random.shuffle(labeled_names)
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

In [18]:
import re

def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    for name in name:
        if re.match('(?:\w+-)+\w+', name):
            features["hyphen"] = "yes"
        else:
            features["hyphen"] = "no"

    return features

In [19]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

# train the classifier and test using devtest set 
v3_classifier = nltk.NaiveBayesClassifier.train(train_set)
v3_score = nltk.classify.accuracy(v3_classifier, devtest_set)
diff = round(v3_score - v2_score, 6)
print("v3:",v3_score)
print("Difference from previous result: {}".format(diff))

v3: 0.79
Difference from previous result: 0.03


In [20]:
running_tally(v3_score, total)

v1: 0.776
v2: 0.76
v3: 0.79
best version: v3, 0.79


This returns the same accuracy score as the previous model. I'm going to remove as it doesn't seem to have significance towards the model and I don't want to overfit. 

Let's add the name length as a feature and try again:

### **Iteration: v4**

In [21]:
random.shuffle(labeled_names)
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

In [22]:
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    features["total_letters"] = len(name) 
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())

    return features

In [23]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

# train the classifier and test using devtest set 
v4_classifier = nltk.NaiveBayesClassifier.train(train_set)
v4_score = nltk.classify.accuracy(v4_classifier, devtest_set)
diff = round(v4_score - v3_score, 6)
print(v4_score)
print("Difference from previous result: {}".format(diff))

0.78
Difference from previous result: -0.01


In [24]:
running_tally(v4_score, total)

v1: 0.776
v2: 0.76
v3: 0.79
v4: 0.78
best version: v3, 0.79


This produces a *worse* model. This might be because total letters might be redundant considering we are already tallying individual letters. Let's try using the `nltk` SyllableTokenizer to see if we can create a feature on number of syllables.

### **Iteration: v5**

In [25]:
random.shuffle(labeled_names)
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

In [26]:
from nltk.tokenize.sonority_sequencing import SyllableTokenizer

# instantiate the syllable tokenizer
st = SyllableTokenizer()

def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    features["syllables"] = len(st.tokenize(name)) 
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())

    return features

In [27]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

# train the classifier and test using devtest set 
v5_classifier = nltk.NaiveBayesClassifier.train(train_set)
v5_score = nltk.classify.accuracy(v5_classifier, devtest_set)
diff = round(v5_score - v4_score, 6)
print(v5_score)
print("diff from previous: {}".format(diff))



0.778
diff from previous: -0.002


In [28]:
running_tally(v5_score, total)

v1: 0.776
v2: 0.76
v3: 0.79
v4: 0.78
v5: 0.778
best version: v3, 0.79


Adding `syllables` improves the model, although this fluctuates every time I run the model.  

In [29]:
v5_classifier.show_most_informative_features()

Most Informative Features
                 suffix2 = 'na'           female : male   =     95.9 : 1.0
                 suffix2 = 'la'           female : male   =     67.9 : 1.0
                 suffix2 = 'ia'           female : male   =     50.8 : 1.0
             last_letter = 'a'            female : male   =     34.5 : 1.0
                 suffix1 = 'a'            female : male   =     34.5 : 1.0
                 suffix2 = 'sa'           female : male   =     32.5 : 1.0
             last_letter = 'k'              male : female =     29.5 : 1.0
                 suffix1 = 'k'              male : female =     29.5 : 1.0
                 suffix2 = 'rd'             male : female =     28.9 : 1.0
                 suffix2 = 'rt'             male : female =     24.2 : 1.0


### **Iteration: v6**

In [30]:
random.shuffle(labeled_names)
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

In [31]:
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    features["syllables"] = len(st.tokenize(name)) 
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    
    for name in name:
        if re.match('.+[aeiou]$', name):
            features["ends_in_vowel"] = True
        else:
            features["ends_in_vowel"] = False

    return features

In [32]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

# train the classifier and test using devtest set 
v6_classifier = nltk.NaiveBayesClassifier.train(train_set)
v6_score = nltk.classify.accuracy(v6_classifier, devtest_set)
diff = round(v6_score - v5_score, 6)
print(v6_score)
print("diff from previous: {}".format(diff))

0.822
diff from previous: 0.044


In [33]:
running_tally(v6_score, total)

v1: 0.776
v2: 0.76
v3: 0.79
v4: 0.78
v5: 0.778
v6: 0.822
best version: v6, 0.822


An improvement. Although I have to note that this score has changed every time I have run the notebook. Let's see what happens if we change the feature to ending in two vowels. 

### Iteration: v7

In [34]:
random.shuffle(labeled_names)
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    features["syllables"] = len(st.tokenize(name)) 
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    
    for name in name:
        if re.match('.+[aeiou][aeiou]$', name):
            features["ends_in_dblvwl"] = True
        else:
            features["ends_in_dblvwl"] = False

    return features

In [36]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

# train the classifier and test using devtest set 
v7_classifier = nltk.NaiveBayesClassifier.train(train_set)
v7_score = nltk.classify.accuracy(v7_classifier, devtest_set)
diff = round(v7_score - v6_score, 6)
print(v7_score)
print("diff from previous: {}".format(diff))

0.792
diff from previous: -0.03


In [37]:
running_tally(v7_score, total)

v1: 0.776
v2: 0.76
v3: 0.79
v4: 0.78
v5: 0.778
v6: 0.822
v7: 0.792
best version: v6, 0.822


At this point I need to stress that every time I run these programs, I get different accuracy scores. In some iterations, v4 scores the highest. In others, v5. For this iteration I am going to use the v6 model on the test data to see how it performs. 

## Step 4: Check Performance Against Test Data

Now we'll check the model performance against the v4 training features:

In [38]:
random.shuffle(labeled_names)
test_names, dev_names, train_names = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]

def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    features["syllables"] = len(st.tokenize(name)) 
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    
    for name in name:
        if re.match('.+[aeiou]$', name):
            features["ends_in_vowel"] = True
        else:
            features["ends_in_vowel"] = False

    return features

train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, dev_names)
test_set = apply_features(gender_features, test_names)

# train the classifier and test using devtest set 
final_classifier = nltk.NaiveBayesClassifier.train(train_set)
final_score = nltk.classify.accuracy(v7_classifier, test_set)

In [41]:
print("Test score: {} \nFinal score: {}".format(v6_score, final_score))

Test score: 0.822 
Final score: 0.794


## Conclusions

In each iteration, I observed fluctuating accuracy scores each time I shuffled and split the data, which leads me to conclude that each model variation performs roughly the same, with about the same variance. If I were to continue to improve the model I would focus on removing some of the features to rely on the highest value predictors -- fewer predictors with higher significance will usually return predictions with lower variance. 