# IS620 Group Project

<b>Group project: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?</b>

<font color= blue> <b>Group Members: Aaron Palumbo, Brian Chu, David Stern, Partha Banerjee, Rohan Fray, Tulasi Ramarao</b></font>

[**Partho**] I am reworking this project with permission from Prof. Amit, because I participated little in the original, awesome group work.

## Environment

In [1]:
import nltk
import numpy as np
import pandas as pd
import string
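try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import train_test_split
except ImportError:
    # older releases, where the function lived in cross_validation
    from sklearn.cross_validation import train_test_split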

## Data Preparation

In [2]:
names = nltk.corpus.names
maleNames = names.words('male.txt')
femaleNames = names.words('female.txt')

print "Number of male names:   {}".format(len(maleNames))
print "Number of female names: {}".format(len(femaleNames))
print "Total names:            {}".format(len(names.words()))

Number of male names:   2943
Number of female names: 5001
Total names:            7944


Per the instructions, we split the Names Corpus into three subsets: 500 names for the test set, 500 names for the dev-test (validation) set, and the remaining names for the training set.

We split the male and female lists separately, so that the test and validation sets keep the same male-to-female ratio as the full corpus.

In [3]:
perMale = 1.0 * len(maleNames) / (len(maleNames) + len(femaleNames))
print "Male: {}%".format(round(perMale,4)*100)

Male: 37.05%


In [4]:
# numbers for data splitting
numTestMale = int(perMale * 500)
numTestFemale = 500 - numTestMale

print "Number of male names in each of our test and validation sets will be:   {}".format(numTestMale)
print "Number of female names in each of our test and validation sets will be: {}".format(numTestFemale)

Number of male names in each of our test and validation sets will be:   185
Number of female names in each of our test and validation sets will be: 315


In [5]:
maleTrain, maleTest = train_test_split(
    maleNames, test_size=numTestMale, random_state=50)
maleTrain, maleVal = train_test_split(
    maleTrain, test_size=numTestMale, random_state=50)

femaleTrain, femaleTest = train_test_split(
    femaleNames, test_size=numTestFemale, random_state=50)
femaleTrain, femaleVal = train_test_split(
    femaleTrain, test_size=numTestFemale, random_state=50)

# Check numbers
print "Train Set      = {:<5} + {:<5} = {}\t(Should be > 6900)".format(len(maleTrain), len(femaleTrain), len(maleTrain) + len(femaleTrain))
print "Validation Set = {:<5} + {:<5} = {}\t(Should be 500)".format(len(maleVal), len(femaleVal), len(maleVal) + len(femaleVal))
print "Test Set       = {:<5} + {:<5} = {}\t(Should be 500)".format(len(maleTest), len(femaleTest), len(maleTest) + len(femaleTest))

Train Set      = 2573  + 4371  = 6944	(Should be > 6900)
Validation Set = 185   + 315   = 500	(Should be 500)
Test Set       = 185   + 315   = 500	(Should be 500)
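The same three-way split can be sketched without scikit-learn, which may help when `train_test_split` is not available. This is only an illustration: `three_way_split` is a hypothetical helper, the toy lists stand in for the real name lists, and the resulting partition will not match the one produced by `train_test_split` with `random_state=50`.

```python
import random


def three_way_split(items, n_holdout, seed=50):
    """Shuffle a copy of items and cut it into (train, validation, test).

    The validation and test parts each get n_holdout items; train gets the rest.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_holdout]
    validation = shuffled[n_holdout:2 * n_holdout]
    train = shuffled[2 * n_holdout:]
    return train, validation, test


# Toy stand-ins with the same sizes as the corpus lists (not real names).
toy_male = ["m{}".format(i) for i in range(2943)]
toy_female = ["f{}".format(i) for i in range(5001)]

m_train, m_val, m_test = three_way_split(toy_male, 185)
f_train, f_val, f_test = three_way_split(toy_female, 315)

print(len(m_train) + len(f_train))  # 6944
print(len(m_val) + len(f_val))      # 500
print(len(m_test) + len(f_test))    # 500

# The three male subsets must not share any item.
overlap = (set(m_train) & set(m_val)) | (set(m_train) & set(m_test)) | (set(m_val) & set(m_test))
print(len(overlap))                 # 0
```

Splitting each gender on its own, as the notebook does, is what keeps the 185/315 ratio in both held-out sets.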


Finally, label each name with its gender and prepare the 3 separate data sets.

In [6]:
train = ([(name, 'male') for name in maleTrain]+[(name, 'female') for name in femaleTrain])
validation = ([(name, 'male') for name in maleVal]+[(name, 'female') for name in femaleVal])
test = ([(name, 'male') for name in maleTest]+[(name, 'female') for name in femaleTest])

## Naive Bayes

In [7]:
def gender_features(name):
    # Reads the module-level `options` dict, which must be defined before the first call.
    features = {}

    if options['first_letter']:
        features["first_letter"] = name[0].lower()
    if options['last_letter']:
        features["last_letter"] = name[-1].lower()
    if options['first_two']:
        features["first_two"] = name[:2].lower()
    if options['last_two']:
        features["last_two"] = name[-2:].lower()
    if options['first_three']:
        features["first_three"] = name[:3].lower()
    if options['last_three']:
        features["last_three"] = name[-3:].lower()
    if options['length']:
        features["length"] = str(len(name))

    return features
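To see what the extractor produces for a single name, here is a small sketch of the same idea that takes the enabled features as an argument instead of reading a global. The names `SLICES` and `gender_features_explicit` are hypothetical and not part of the notebook; for non-empty names the feature values are intended to match those of `gender_features`.

```python
# Which part of the lower-cased name each feature looks at.
SLICES = {
    'first_letter': slice(0, 1),
    'last_letter': slice(-1, None),
    'first_two': slice(0, 2),
    'last_two': slice(-2, None),
    'first_three': slice(0, 3),
    'last_three': slice(-3, None),
}


def gender_features_explicit(name, enabled):
    """Return a feature dict for name using only the features listed in enabled."""
    lowered = name.lower()
    features = {}
    for key in enabled:
        if key in SLICES:
            features[key] = lowered[SLICES[key]]
    if 'length' in enabled:
        features['length'] = str(len(name))
    return features


print(gender_features_explicit('Astrid', ['first_two', 'last_two']))
# first_two is 'as', last_two is 'id'

print(gender_features_explicit('Al', ['first_three', 'last_three']))
# both features are 'al': a short name is shorter than the slice
```

The second call shows that for very short names a 3-letter prefix and suffix both collapse to the whole name.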

Let us start with the last character alone.

In [8]:
options = {'first_letter': False,
           'last_letter':  True,
           'first_two':    False,
           'last_two':     False,
           'first_three':  False,
           'last_three':   False,
           'length':       False}

train_set = [(gender_features(n), g) for (n,g) in train]
valid_set = [(gender_features(n), g) for (n,g) in validation]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, valid_set)

0.742


74.2% accuracy is good, but not great. Let us change the feature extractor to use the first and last letters together.

In [9]:
options = {'first_letter': True,
           'last_letter':  True,
           'first_two':    False,
           'last_two':     False,
           'first_three':  False,
           'last_three':   False,
           'length':       False}

train_set = [(gender_features(n), g) for (n,g) in train]
valid_set = [(gender_features(n), g) for (n,g) in validation]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, valid_set)

0.75


Better, but not by much. Let us try the last 2 letters.

In [10]:
options = {'first_letter': False,
           'last_letter':  False,
           'first_two':    False,
           'last_two':     True,
           'first_three':  False,
           'last_three':   False,
           'length':       False}

train_set = [(gender_features(n), g) for (n,g) in train]
valid_set = [(gender_features(n), g) for (n,g) in validation]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, valid_set)

0.788


Looks promising. Let us try first 2 letters and last 2 letters together.

In [11]:
options = {'first_letter': False,
           'last_letter':  False,
           'first_two':    True,
           'last_two':     True,
           'first_three':  False,
           'last_three':   False,
           'length':       False}

train_set = [(gender_features(n), g) for (n,g) in train]
valid_set = [(gender_features(n), g) for (n,g) in validation]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, valid_set)

0.804


Let us see whether an even better result is possible, this time with the last 3 letters alone.

In [12]:
options = {'first_letter': False,
           'last_letter':  False,
           'first_two':    False,
           'last_two':     False,
           'first_three':  False,
           'last_three':   True,
           'length':       False}

train_set = [(gender_features(n), g) for (n,g) in train]
valid_set = [(gender_features(n), g) for (n,g) in validation]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, valid_set)

0.776


No good: this is worse than the last 2 letters alone (0.776 vs. 0.788). Let us try the first 3 and last 3 letters together.

In [13]:
options = {'first_letter': False,
           'last_letter':  False,
           'first_two':    False,
           'last_two':     False,
           'first_three':  True,
           'last_three':   True,
           'length':       False}

train_set = [(gender_features(n), g) for (n,g) in train]
valid_set = [(gender_features(n), g) for (n,g) in validation]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, valid_set)

0.808


There is not much difference between using 2 characters and 3 characters at the beginning and end (0.804 vs. 0.808). Now let us check whether adding the length of the name to the first 3 and last 3 letters has any influence.

In [14]:
options = {'first_letter': False,
           'last_letter':  False,
           'first_two':    False,
           'last_two':     False,
           'first_three':  True,
           'last_three':   True,
           'length':       True}

train_set = [(gender_features(n), g) for (n,g) in train]
valid_set = [(gender_features(n), g) for (n,g) in validation]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, valid_set)

0.8
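To compare the experiments so far side by side, the validation accuracies printed by the cells above can be collected and ranked. This is just a bookkeeping sketch: the numbers are copied from the outputs above, and the labels are informal descriptions of each `options` setting.

```python
# Validation accuracies copied from the cell outputs above.
results = {
    'last_letter': 0.742,
    'first_letter + last_letter': 0.75,
    'last_two': 0.788,
    'first_two + last_two': 0.804,
    'last_three': 0.776,
    'first_three + last_three': 0.808,
    'first_three + last_three + length': 0.8,
}
N_VALIDATION = 500  # size of the validation set

# Best configuration first.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for label, accuracy in ranked:
    n_correct = int(round(accuracy * N_VALIDATION))
    print('{:<36} {:.3f}  ({} of {} correct)'.format(
        label, accuracy, n_correct, N_VALIDATION))

best_label, best_accuracy = ranked[0]
runner_label, runner_accuracy = ranked[1]
gap_in_names = int(round((best_accuracy - runner_accuracy) * N_VALIDATION))
print('Gap between the top two: {} names'.format(gap_in_names))
```

On a 500-name validation set, the top two configurations differ by only 2 correctly classified names.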


Adding the length does not help (0.8 vs. 0.808 without it). I will go with the first 2 and last 2 letters, which reach 80.4% accuracy. Now let us look at the error cases, where the classifier guesses wrong.

In [15]:
options = {'first_letter': False,
           'last_letter':  False,
           'first_two':    True,
           'last_two':     True,
           'first_three':  False,
           'last_three':   False,
           'length':       False}

train_set = [(gender_features(n), g) for (n,g) in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [16]:
errors = []
for (name, tag) in validation:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

print "Number of error cases: {} of {}.\n\nAnd they are:".format(len(errors),len(validation))

for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

Number of error cases: 98 of 500.

And they are:
correct=female   guess=male     name=Alex                          
correct=female   guess=male     name=Astrid                        
correct=female   guess=male     name=Avril                         
correct=female   guess=male     name=Brook                         
correct=female   guess=male     name=Buffy                         
correct=female   guess=male     name=Charlean                      
correct=female   guess=male     name=Clary                         
correct=female   guess=male     name=Devin                         
correct=female   guess=male     name=Dulcy                         
correct=female   guess=male     name=Easter                        
correct=female   guess=male     name=Ethelin                       
correct=female   guess=male     name=Francesmary                   
correct=female   guess=male     name=Frank                         
correct=female   guess=male     name=Franky                        
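A quick way to read an error list like this is to count the errors by direction and by name ending. The sketch below uses only a handful of rows copied from the listing above, so its counts are illustrative; applied to the full `errors` list it would also show the male names misread as female, which the listing shown here (sorted by correct label) does not reach.

```python
from collections import Counter

# A few (correct, guess, name) rows copied from the listing above.
sample_errors = [
    ('female', 'male', 'Alex'),
    ('female', 'male', 'Astrid'),
    ('female', 'male', 'Brook'),
    ('female', 'male', 'Frank'),
    ('female', 'male', 'Franky'),
]

# How many errors go in each direction (correct label, guessed label).
by_direction = Counter((tag, guess) for tag, guess, _ in sample_errors)

# Which 2-letter endings show up among the misclassified names.
by_suffix = Counter(name[-2:].lower() for _, _, name in sample_errors)

for (tag, guess), count in by_direction.most_common():
    print('{} misread as {}: {}'.format(tag, guess, count))
print(by_suffix.most_common(3))
```

Endings that appear often in such a count are candidates for names whose last 2 letters point to the wrong gender.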

Since many names are only 3 or 4 letters long, longer prefixes and suffixes would cover almost the whole name, so we stop here to avoid the chance of overfitting. Let us apply this feature set to the test set.

In [17]:
test_set = [(gender_features(n), g) for (n,g) in test]

print nltk.classify.accuracy(classifier, test_set)

0.81
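To judge whether 0.81 on the test set differs meaningfully from 0.804 on the validation set, a rough margin of error for an accuracy measured on 500 names can be estimated. This is a sketch that assumes the normal approximation to the binomial and treats the names as independent draws; `accuracy_margin` is a hypothetical helper, not part of NLTK.

```python
import math


def accuracy_margin(accuracy, n, z=1.96):
    """Half-width of an approximate 95% interval for an accuracy measured on n items."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


dev_accuracy = 0.804   # first 2 + last 2 letters on the validation set
test_accuracy = 0.81   # same classifier on the test set
n = 500                # size of each held-out set

margin = accuracy_margin(dev_accuracy, n)
difference = abs(test_accuracy - dev_accuracy)

print('validation: {:.3f} +/- {:.3f}'.format(dev_accuracy, margin))
print('difference: {:.3f}'.format(difference))
print('within margin: {}'.format(difference < margin))
```

Under these assumptions the margin is about 3.5 percentage points, so a 0.6-point gap between the two sets is well within what sampling noise alone would produce.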


### Conclusion

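Based on the above results, using the first 2 and last 2 letters looks like the best choice: it reaches 80.4% accuracy on the dev-test set, within 0.4 percentage points of the highest score (first 3 and last 3 letters), and 81% on the test set. The test and dev-test figures are close, which suggests the chosen features are not overfitted to the dev-test set.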