## Project 3 -- Name Gender Classifier

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier,
make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

In [1]:
import nltk
import random
import pandas as pd
from nltk.corpus import names
from gender import GenderClassifier


### GenderClassifier Class
For this project I decided to wrap the lower level nltk code into a class in order to avoid a redundent amount of scripting and easily test different feature functions on the test data.

On instantiation with a names corpus, the <i> GenderClassifier </i>  class automatically splits the data and builds test and training sets. The public method train() is called with a custom feature function. Each time it is called it reshuffles and creates a development test set from the training data. The test data from instantiation remains untouched and seperate from the fine-tuning process.  

Wrappers for informative features, error analysis and accuracy are then used to fine tune the feature function. This new function can then be passed in and the BayesClassifier retrained without affecting the initial test data.


In [2]:
# load names into GenderClassifier object
gclass = GenderClassifier(names)

In [3]:
# show test and training name fields
gclass.test_names[0:5]

[(u'Bettye', 'female'),
 (u'Flin', 'male'),
 (u'Emelia', 'female'),
 (u'Dyna', 'female'),
 (u'Nikkie', 'female')]

In [4]:
gclass.train_names[0:5]

[(u'Lorene', 'female'),
 (u'Aziz', 'male'),
 (u'Electra', 'female'),
 (u'Albatros', 'male'),
 (u'Pyotr', 'male')]

We start with a basic feature function as outlined in Chapter 6 of Natural Language Processing in Python. This function computes the frequency of each letter and letter count for each name in the set. For each fine tuning round we will re-train the set 3 times to estimate accuracy. 

In [5]:
def get_freq(name):
    # helper func for frequencies
    d = {'a':0,'b':0,'c':0,'d':0,'e':0,'f':0,'g':0,'h':0,'i':0,
        'j':0,'k':0,'l':0,'m':0,'n':0,'o':0,'p':0,'q':0,'r':0,
         's':0,'t':0,'u':0,'v':0,'w':0,'x':0,'y':0,'z':0, " ":0,
         "'":0,"-":0
    }
    name = name.lower().strip()
    for n in name:
        
        d[n] += 1
    return tuple(d.items())
    

def feature1(name):
    return {
        'freq': get_freq(name),
        'length': len(name)
    }


In [6]:
gclass.train(feature1) # train
gclass.report_dev_accuracy() 

0.594

In [7]:
gclass.train(feature1) # train
gclass.report_dev_accuracy() 

0.624

In [8]:
gclass.train(feature1) # train
gclass.report_dev_accuracy() # low performance

0.606

Frequency and length alone produce a consistently low accuracy rate. We will therefore fine-tune by adding the last letter of the name to the feature function. This seems to be widely accepted as one of the all around defining features for gender classification. Names that end in vowels, especially "a" tend to be female, so by simply looking at the last letter, accuracy should improve a quite a bit.

In [9]:
def feature2(name):
    return {
        'freq': get_freq(name),
        'length': len(name),
        'last': name[-1:].lower()
    }

In [10]:
gclass.train(feature2) # train
gclass.report_dev_accuracy()

0.73

In [11]:
gclass.train(feature2) # train
gclass.report_dev_accuracy()

0.744

In [12]:
gclass.train(feature2) # train
gclass.report_dev_accuracy()

0.73

This improves the accuracy of the feature function on the training set by about ten points. In order to improve the function further we will look at a portion of the error analysis table to try to intuitively spot other characteristics.

In [13]:
gclass.error_analysis_table()[:15]

Unnamed: 0,correct,guess,names
0,male,female,Sherlocke
1,male,female,Torrence
2,male,female,Pate
3,male,female,Emory
4,male,female,Riley
5,female,male,Revkah
6,male,female,Casey
7,female,male,Katharyn
8,male,female,Tony
9,male,female,Dannie


The table shows that ngrams of last letters might have a more important role than the single last letter. The next function will look at both the last two and last three letters of the name. 

We can also look at pairs of letters throughout the entire word which might also help to improve the classifier.

In [14]:
def group_pairs(name):
    pairs = []
    for i in range(len(name)-1):
        pairs.append(name[i] + name[i+1])
    return tuple(set(pairs))  # need to cast set into a tuple for nltk

In [15]:
# add pairs
# add last2 and last3 

def feature3(name):
    return {
        'pairs': group_pairs(name),
        'last': name[-1:].lower(),
        'last2': name[-2:].lower(),
        'last3': name[-3:].lower(),
    }

In [16]:
gclass.train(feature3)
gclass.report_dev_accuracy()

0.802

In [17]:
gclass.train(feature3)
gclass.report_dev_accuracy()

0.762

In [18]:
gclass.train(feature3)
gclass.report_dev_accuracy()

0.756

In [19]:
gclass.informative_features()

Most Informative Features
                   last2 = u'na'          female : male   =     74.9 : 1.0
                    last = u'a'           female : male   =     41.7 : 1.0
                    last = u'k'             male : female =     24.4 : 1.0
                   last2 = u'ra'          female : male   =     21.9 : 1.0
                   last2 = u'us'            male : female =     21.6 : 1.0
                   last2 = u'ia'          female : male   =     20.8 : 1.0
                    last = u'f'             male : female =     14.3 : 1.0
                    last = u'g'             male : female =     14.3 : 1.0
                   last2 = u'do'            male : female =     13.8 : 1.0
                   last3 = u'nne'         female : male   =     13.4 : 1.0


These three runs scored several points higher then the set of feature2() runs. We can see from the informative features list that last, last2 and last3, played the most crucial role. However the combination of these with the freq, pairs, and count features seem to help boost the overall accuracy.

Finally we test this fine-tuned version against the unseen test data. This was stored in the class upon initialization and not a part of the training or dev_test sets.

In [20]:
gclass.report_test_accuracy()

0.6

We see that the algorithm does not perform as well on unseen data, with an accuracy rate of only 60%. This lower accuracy is expected because the training data has been fine tuned to the training and dev test sets. However it is almost 20 points lower then our best run with the feature3 set, so there is a good chance we have overfit the data. 

I will produce one more feature function this time removing frequency and count which played a negligible role in the original set.

Since the test data has already been used from the first class, I will make a new GenderClassifier object. This will make a completely new test set. If the dev test accuracy is above 0.75 I will immediately use the test set.


In [33]:
def feature4(name):
    return {
        'pairs': group_pairs(name),
        'last': name[-1:].lower(),
        'last2': name[-2:].lower(),
        'last3': name[-3:].lower(),
    }

In [34]:
# make new class and train on the new feature function
gclass2 = GenderClassifier(names)
gclass2.train(feature4)

In [35]:
gclass2.report_dev_accuracy()

0.788

In [36]:
gclass2.report_test_accuracy()

0.814

Removing those two components vastly improved the performance of the classifier. We now have an accuracy score on unseen data of over 81%.

We will try one more feature test to see if accuracy can be pushed further. A new function that calculates 3-grams will be included.

In [49]:
def group_threes(name):
    pairs = []
    for i in range(len(name)-2):
        pairs.append(name[i] + name[i+1] + name[i+2])
    return tuple(set(pairs))  # need to cast set into a tuple for nltk


def feature5(name):
    return {
        'pairs': group_pairs(name),
        'threes': group_threes(name),
        'last': name[-1:].lower(),
        'last2': name[-2:].lower(),
        'last3': name[-3:].lower(),
    }


In [46]:
gclass3 = GenderClassifier(names)
gclass3.train(feature5)

In [47]:
gclass3.report_dev_accuracy()

0.744

In [48]:
gclass3.report_test_accuracy()

0.748

With the added three-grams, the test data accuracy seems to have decreased. We will therefore select feature4() as our best classifier set.