# DATA620 Project 3 - Name Gender Classifier
Team: Mia Chen / Wei Zhou

Date: 7/2/2020

Recording [link](https://www.youtube.com/watch?v=QOZDLexcjqk)

## Task
Using any of the three classifiers described in chapter 6 of <b><i>Natural Language Processing with Python</i></b>, and any features you can think of, build the best name gender classifier you can.
Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.
How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: <i>Natural Language Processing with Python</i>, exercise 6.10.2.

### Import Data

In [1]:
# Prepare a list of examples and corresponding class labels
from nltk.corpus import names as names_corpus  # alias so the list below does not shadow the corpus reader
import random

# Combine male and female names into (name, gender) pairs
names = ([(name, 'male') for name in names_corpus.words('male.txt')] +
         [(name, 'female') for name in names_corpus.words('female.txt')])

# Shuffle the list
random.shuffle(names)

In [5]:
# View 10 random names and corresponding gender
names[:10]

[('Herold', 'male'),
 ('Paula-Grace', 'female'),
 ('Howard', 'male'),
 ('Mae', 'female'),
 ('Noelle', 'female'),
 ('Meagan', 'female'),
 ('Gussy', 'female'),
 ('Paloma', 'female'),
 ('Agna', 'female'),
 ('Blinni', 'female')]

In [6]:
len(names)

7944

There are 7,944 names in the corpus. We'll begin by splitting them into three subsets:

* test set: 500 words
* dev-test set: 500 words
* training set: the remaining 6,944 names (the exercise text gives 6,900)

We start with the example name gender classifier, make incremental improvements while checking accuracy on the dev-test set, and finally score the chosen classifier on the test set to compare the two results.

In [33]:
test_names = names[:500]
devtest_names = names[500:1000]
train_names = names[1000:]
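
Because `random.shuffle` is called without a seed, the three subsets (and every accuracy reported below) change from run to run. The following is a minimal sketch of a reproducible split, not the code used in this notebook: the helper name `split_names`, the seed value 620, and the placeholder data are all assumptions made for illustration.

```python
import random

def split_names(labeled_names, test_size=500, devtest_size=500, seed=620):
    """Shuffle a copy of the data with a fixed seed and cut it into three parts.

    `split_names` and the seed value are illustrative, not part of the notebook.
    """
    shuffled = list(labeled_names)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # local RNG, so global random state is unaffected
    test = shuffled[:test_size]
    devtest = shuffled[test_size:test_size + devtest_size]
    train = shuffled[test_size + devtest_size:]
    return train, devtest, test

# Placeholder data with the same size as the Names Corpus (7,944 entries)
placeholder = [("name%d" % i, "female" if i % 2 else "male") for i in range(7944)]
train, devtest, test = split_names(placeholder)
print(len(train), len(devtest), len(test))  # 6944 500 500
```

Calling the helper twice with the same seed returns the same three lists, which makes accuracy figures comparable across runs.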

## Book Example 1

The first step is to define a feature extractor function that builds a dictionary containing relevant information about a given name. In this example, we choose the last letter of the name as the only feature.

In [17]:
def gender_features(word):
    return {'last_letter': word[-1]}

Next, we use the feature extractor to process the names data and divide the resulting feature sets into a training set and a test set. The training set is used to train a Naive Bayes classifier. Note that in this first example the training set is names[500:], so it also includes the dev-test names.

In [34]:
from nltk import NaiveBayesClassifier

# Eager alternative from the book (builds every feature set in memory):
# feature_sets = [(gender_features(n), gender) for (n, gender) in names]
# train_set, test_set = feature_sets[500:], feature_sets[:500]

# apply_features builds the feature sets lazily, which saves memory
from nltk.classify import apply_features

# Split into train and test sets
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[:500])

# Train a Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(train_set)

# Accuracy of Example 1
from nltk.classify import accuracy
print(accuracy(classifier, test_set))

0.75


Finally, we can examine the classifier to determine which features are most effective for distinguishing the name genders.

In [23]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     34.2 : 1.0
             last_letter = 'k'              male : female =     30.5 : 1.0
             last_letter = 'f'              male : female =     16.0 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0
             last_letter = 'v'              male : female =     11.3 : 1.0


This listing shows that names ending in 'a' are 34.2 times more likely to be female than male, and names ending in 'k' are 30.5 times more likely to be male than female.
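
To make the ratios in this listing concrete, here is a rough sketch of how such a number can be derived: divide the share of one gender's names that end in a given letter by the same share for the other gender. The function `likelihood_ratio` and the toy data are assumptions for illustration. The sketch uses plain relative frequencies, whereas NLTK's Naive Bayes trainer smooths its probability estimates by default, so its figures will differ somewhat.

```python
from collections import Counter

def likelihood_ratio(labeled_names, letter, label_a, label_b):
    """Return P(last letter == letter | label_a) / P(last letter == letter | label_b).

    Unsmoothed relative frequencies; raises ZeroDivisionError if no
    label_b name ends in `letter`.
    """
    totals = Counter(label for _, label in labeled_names)
    hits = Counter(label for name, label in labeled_names
                   if name[-1].lower() == letter)
    p_a = hits[label_a] / totals[label_a]
    p_b = hits[label_b] / totals[label_b]
    return p_a / p_b

# Toy data (hypothetical labels, for illustration only)
toy = [("Anna", "female"), ("Paula", "female"), ("Mia", "female"), ("Mae", "female"),
       ("Joshua", "male"), ("Mark", "male"), ("Erik", "male"), ("Howard", "male")]

# 3 of 4 female names end in 'a' versus 1 of 4 male names
print(likelihood_ratio(toy, "a", "female", "male"))  # 3.0
```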

## Book Example 2

Next, we modify the gender_features() function to give the classifier more features: the first letter, the last letter, the number of times each letter occurs, and whether each letter is present.

In [40]:
def gender_features2(name):
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = (letter in name.lower())
    return features

# Apply the feature extractor to the train, test and dev-test names
train_set = apply_features(gender_features2, train_names)
test_set = apply_features(gender_features2, test_names)
devtest_set = apply_features(gender_features2, devtest_names)

# Train Naive Bayes Classifier2
classifier2 = NaiveBayesClassifier.train(train_set)

# Accuracy of example 2
print(accuracy(classifier2, devtest_set))

0.778


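Accuracy improves by almost 3 percentage points (from 75% to 77.8%), although the two figures are not strictly comparable: Example 1 was scored on the test set, and this example is scored on the dev-test set.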

## Book Example 3
Next, we change the feature function to use only suffixes: the last letter and the last two letters of the name. Accuracy on the dev-test set improves again, from 77.8% to 79.4%.

In [36]:
def gender_features3(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

# Apply the feature extractor to the train, test and dev-test names

train_set = apply_features(gender_features3, train_names)
test_set = apply_features(gender_features3, test_names)
devtest_set = apply_features(gender_features3, devtest_names)

# Train Naive Bayes Classifier3
classifier3 = NaiveBayesClassifier.train(train_set)

# Accuracy of example 3
print(accuracy(classifier3, devtest_set))

0.794
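
Between iterations, the dev-test set can also be used for error analysis: listing the names the classifier gets wrong often suggests which feature to try next. The following is a sketch only. The helper `collect_errors`, the `last_letter` feature, and the rule-based `rule_classifier` are hypothetical stand-ins so that the snippet runs without NLTK.

```python
def collect_errors(classify, extract_features, labeled_names):
    """Return sorted (true_label, guess, name) triples for misclassified names."""
    errors = []
    for name, label in labeled_names:
        guess = classify(extract_features(name))
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)

# Hypothetical stand-ins: a last-letter feature and a crude rule-based classifier
def last_letter(name):
    return {"last_letter": name[-1].lower()}

def rule_classifier(features):
    return "female" if features["last_letter"] in "aeiy" else "male"

sample = [("Paloma", "female"), ("Howard", "male"),
          ("Meagan", "female"), ("Joshua", "male")]

for label, guess, name in collect_errors(rule_classifier, last_letter, sample):
    print("correct=%-8s guess=%-8s name=%s" % (label, guess, name))
```

With the objects defined in this notebook, the equivalent call would be along the lines of `collect_errors(classifier3.classify, gender_features3, devtest_names)`.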


## Combining the features

Combining the above, we use four features: first letter, last letter, prefix and suffix. The prefix and suffix are 2 letters long for names of up to 4 letters and 3 letters long for names of 5 or more letters. Accuracy on the dev-test set improves from 79.4% to 82.4%.

In [39]:
def gender_features4(name):
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    features['prefix'] = name[:2].lower() if len(name) <= 4 else name[:3].lower()
    features['suffix'] = name[-2:].lower() if len(name) <= 4 else name[-3:].lower()
    return features

# Apply the feature extractor to the train, test and dev-test names
train_set = apply_features(gender_features4, train_names)
test_set = apply_features(gender_features4, test_names)
devtest_set = apply_features(gender_features4, devtest_names)

# Train Naive Bayes Classifier4
classifier4 = NaiveBayesClassifier.train(train_set)

# Accuracy of example 4
print(accuracy(classifier4, devtest_set))

0.824
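
To illustrate the length rule for prefix and suffix, the snippet below calls the extractor on one short and one long name. The function body is reproduced from the cell above only so that the snippet runs on its own.

```python
def gender_features4(name):
    # Same logic as the notebook cell above, repeated so this snippet is standalone
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    features['prefix'] = name[:2].lower() if len(name) <= 4 else name[:3].lower()
    features['suffix'] = name[-2:].lower() if len(name) <= 4 else name[-3:].lower()
    return features

# 'Mae' has 3 letters, so prefix and suffix are 2 letters long
print(gender_features4('Mae'))
# {'firstletter': 'm', 'lastletter': 'e', 'prefix': 'ma', 'suffix': 'ae'}

# 'Howard' has 6 letters, so prefix and suffix are 3 letters long
print(gender_features4('Howard'))
# {'firstletter': 'h', 'lastletter': 'd', 'prefix': 'how', 'suffix': 'ard'}
```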


## Final Performance
When we check the final performance of our classifier on the test set, we get 77.4% accuracy, which is lower than the 82.4% we saw on the dev-test set. This is somewhat expected: the dev-test set was used to guide feature selection, so the chosen features may be tuned to it and the dev-test accuracy is likely optimistic. Instead of a single dev-test set, we could use k-fold cross-validation to tune the model.

In [41]:
print(accuracy(classifier4, test_set))

0.774
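
The k-fold idea mentioned above can be sketched as follows. This is an illustration under stated assumptions, not something that was run for this project: `k_fold_accuracy` is a hypothetical helper, and the majority-label "model" and placeholder data are stand-ins so that the snippet runs without NLTK.

```python
import random
from collections import Counter

def k_fold_accuracy(labeled_items, train_fn, accuracy_fn, k=5, seed=0):
    """Average held-out accuracy over k folds.

    train_fn(train_items) -> model; accuracy_fn(model, test_items) -> float.
    """
    items = list(labeled_items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]   # k disjoint, roughly equal folds
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one held out for scoring
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        scores.append(accuracy_fn(model, held_out))
    return sum(scores) / k

# Stand-in model: always predict the most common training label
def train_majority(train_items):
    return Counter(label for _, label in train_items).most_common(1)[0][0]

def majority_accuracy(model, test_items):
    return sum(label == model for _, label in test_items) / len(test_items)

# Placeholder data: 200 'female' and 100 'male' entries
data = [("n%d" % i, "female" if i % 3 else "male") for i in range(300)]
print(round(k_fold_accuracy(data, train_majority, majority_accuracy), 3))  # 0.667
```

With the notebook's objects, `train_fn` could wrap `NaiveBayesClassifier.train(apply_features(gender_features4, items))` and `accuracy_fn` could wrap `accuracy(model, apply_features(gender_features4, items))`, applied to all names outside the test set.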
