# Project 3

Authors: Ari and Lucas

In the following exercise, we're going to analyze the Names corpus and build a gender classifier. First step is to load the Names corpus and split the data into the three sections we need: 500 words for the test set, 500 words for the dev-test set, 6944 words for the training set.

In [45]:
import nltk
#nltk.download('names')
from nltk.corpus import names
import random
from collections import defaultdict
import soundex

soundex_instance = soundex.Soundex()
#Load and shuffle the names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

#Split the data
test_names = labeled_names[:500]
dev_test_names = labeled_names[500:1000]
train_names = labeled_names[1000:]

For our classifiers, we used Naive Bayes and Decision Trees to compare their performances as well as trying to improve past a basic version of these. Below are the initial classifiers with only one extracting feature: name length.

In [16]:
def gender_features_basic(name):
    features = {
        'length': len(name)
        }
    return features

#Prepare the feature sets with the one feature extractor
train_set_basic = [(gender_features_basic(name), gender) for (name, gender) in train_names]
dev_test_set_basic = [(gender_features_basic(name), gender) for (name, gender) in dev_test_names]
test_set_basic = [(gender_features_basic(name), gender) for (name, gender) in test_names]

#Train and evaluate the Naive Bayes classifier
classifier_nb = nltk.NaiveBayesClassifier.train(train_set_basic)
accuracy_nb = nltk.classify.accuracy(classifier_nb, dev_test_set_basic)
print(f"Naive Bayes accuracy with a basic extractor on dev-test set: {accuracy_nb:.4f}")

#Train and evaluate the Decision Tree classifier
classifier_dt = nltk.DecisionTreeClassifier.train(train_set_basic)
accuracy_dt = nltk.classify.accuracy(classifier_dt, dev_test_set_basic)
print(f"Decision Tree accuracy with a basic extractor on dev-test set: {accuracy_dt:.4f}")

Naive Bayes accuracy with a basic extractor on dev-test set: 0.6280
Decision Tree accuracy with a basic extractor on dev-test set: 0.6280


We can see that the using name length as our only feature does make the classifier do better than the average (since it's just predicting from two values, male and female, the classifier should be right 50% of the time so the average accuracy is 50% by randomly guessing alone). Both classifiers provided the same accuracies meaning that their differences in methods as not stray away from each other in this simple one feature version. Accuracies in the 60s is not enough to be used so it's time to add some improvements to increase it.

To make improvements, we need to add more extracting features to our function (gender_features_soundex_package). Some new features we added are:

*   length: Name length (used previously)
*   first_letter: First letter
*   last_letter: Last letter
*   Prefix2: First 2 letters
*   Prefix3: First 3 letters
*   Suffix2: Last 2 letters
*   Suffix3: Last 3 letters
*   vowel_count: Vowel count
*   consonant_count: Consonant count
*   soundex: phonetic code to group similar names

The last two are for loops to create features for every 2 letter combination (bigram) and 3 letter combination (trigram) in the names.





In [43]:
def gender_features_soundex_package(name):
    name_lower = name.lower()
    features = {
        'length': len(name),
        'first_letter': name_lower[0],
        'last_letter': name_lower[-1],
        'prefix2': name_lower[:1],
        'prefix3': name_lower[:2],
        'suffix2': name_lower[-2:],
        'suffix3': name_lower[-3:],
        'vowel_count': sum(1 for char in name_lower if char in 'aeiou'),
        'consonant_count': sum(1 for char in name_lower if char not in 'aeiou'),
        'soundex': soundex_instance.soundex(name)
    }
    #Add character n-grams
    for i in range(len(name_lower) - 1):
        features[f'bigram_{name_lower[i:i+2]}'] = True
    for i in range(len(name_lower) - 2):
        features[f'trigram_{name_lower[i:i+3]}'] = True

    return features

#Prepare the feature sets with the new feature extractor
train_set_pkg = [(gender_features_soundex_package(name), gender) for (name, gender) in train_names]
dev_test_set_pkg = [(gender_features_soundex_package(name), gender) for (name, gender) in dev_test_names]
test_set_pkg = [(gender_features_soundex_package(name), gender) for (name, gender) in test_names]

#Train and evaluate the Naive Bayes classifier
classifier_nb_pkg = nltk.NaiveBayesClassifier.train(train_set_pkg)
accuracy_nb_pkg = nltk.classify.accuracy(classifier_nb_pkg, dev_test_set_pkg)
print(f"Naive Bayes accuracy with soundex package on dev-test set: {accuracy_nb_pkg:.4f}")

#Train and evaluate the Decision Tree classifier
classifier_dt_pkg = nltk.DecisionTreeClassifier.train(train_set_pkg)
accuracy_dt_pkg = nltk.classify.accuracy(classifier_dt_pkg, dev_test_set_pkg)
print(f"Decision Tree accuracy with soundex package on dev-test set: {accuracy_dt_pkg:.4f}")

Naive Bayes accuracy with soundex package on dev-test set: 0.8560
Decision Tree accuracy with soundex package on dev-test set: 0.7060


We can see that both classifiers improved significantly from adding more features with Naive Bayes going from .628 to .856 and Decision Tree .628 to .706.

Since our Naive Bayes outperformed our Decision Tree, we will focus on that one and use this trained classifier on our test set.

In [44]:
#Final evaluation on the test set with the Naive Bayes classifier
final_accuracy_test_pkg = nltk.classify.accuracy(classifier_nb_pkg, test_set_pkg)
print(f"\nFinal accuracy of the best classifier on the test set: {final_accuracy_test_pkg:.4f}")
print(f"Accuracy of the same classifier on the dev-test set: {accuracy_nb_pkg:.4f}")


Final accuracy of the best classifier on the test set: 0.8380
Accuracy of the same classifier on the dev-test set: 0.8560


Above is the accuracy for using the Naive Bayes classifier on the test set. We can see that the accuracy actually decreased a bit from the dev-test set. However, this is kind of expected as both the dev-test and test sets only have 500 names each which is a pretty small sample. It could also reflect the effects of a wide train to test ratio gap: the dev-test and test sets combined make up just 12.5% of the data, where we typically use a train to test ratio in the range, 70:30 to 80:20. With limited data, even a few misclassifications can shift the accuracy by a couple of percentage points. The difference we saw (around 2%) is minor and not unusual in machine learning, especially with small datasets like this one. It also shows that male and female names likely don’t differ that drastically in the features we used, so variation between the dev-test and test set results is minimal. I think this shows that while the model does well on a smaller validation set, real-world performance can dip slightly which is something you'd expect.


