# IS620 Group Project

<b>Group project: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?</b>

<font color= blue> <b>Group Members: Aaron Palumbo, Brian Chu, David Stern, Partha Banerjee, Rohan Fray, Tulasi Ramarao</b></font>

## Dependencies

In [1]:
import nltk
import numpy as np
import pandas as pd
import string
import networkx as nx
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import BernoulliNB
%matplotlib inline

In [4]:
# %load qtutil.py
# silly utility to launch a qtconsole if one doesn't exist
import psutil

def returnPyIDs():
    pyids = set()
    for pid in psutil.pids():
        try:
            if "python" in psutil.Process(pid).name():
                pyids.add(pid)
        except psutil.Error:
            # the process exited or is not accessible; skip it
            pass
    return pyids

def launchConsole():
    before_pyids = returnPyIDs()
    %qtconsole
    after_pyids = returnPyIDs()
    newid = after_pyids.difference(before_pyids)
    assert len(newid) == 1
    return list(newid)[0]

try:
    qtid
except NameError:
    qtid = launchConsole()
    
if qtid not in returnPyIDs():
    qtid = launchConsole()
    
qtid




## Split Data

In [5]:
names = nltk.corpus.names
maleNames = names.words('male.txt')
femaleNames = names.words('female.txt')

* Training Data

    * Train Set: used to train the model

    * Validation Set: used for model selection (the "dev-test" set in the assignment)

* Test Set: used to measure final model performance (only use this once, at the end)

There are different numbers of male and female names:

In [6]:
print "Number of male names: {}".format(len(maleNames))
print "Number of female names: {}".format(len(femaleNames))

Number of male names: 2943
Number of female names: 5001


Because of this imbalance, we split the male and female names separately, so that the train, validation, and test sets all keep the same ratio of male to female names.
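
As a minimal sketch of the idea, using only the standard library: each class is shuffled and split on its own, so the held-out set receives a fixed number of names from each class. The helper `split_class` and the toy names are illustrative assumptions; the notebook itself uses scikit-learn's `train_test_split`.

```python
import random

def split_class(items, n_holdout, seed):
    """Shuffle one class's items and hold out n_holdout of them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# toy data standing in for the male and female name lists
male = ["m%d" % i for i in range(30)]
female = ["f%d" % i for i in range(50)]

# hold out the same fraction (20%) of each class
male_train, male_held = split_class(male, 6, seed=5)
female_train, female_held = split_class(female, 10, seed=7)

# the male share is the same in the training and held-out sets
train_ratio = len(male_train) / float(len(male_train) + len(female_train))
held_ratio = len(male_held) / float(len(male_held) + len(female_held))
print("train: %.3f  held out: %.3f" % (train_ratio, held_ratio))
```

Splitting the classes jointly with a single shuffle would only preserve the ratio approximately; splitting per class makes it exact up to rounding.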

In [7]:
# fraction of all names that are male
perMale = 1.0 * len(maleNames) / (len(maleNames) + len(femaleNames))

# total names
numNames = len(names.words())
# number of names held out (validation + test)
numTesting = 1000
# split between final test and validation
perTest = 0.5

# numbers for data splitting
numTestingMale = int(perMale * numTesting)
numTestingFemale = numTesting - numTestingMale

numTestMale = int(numTesting * perTest * perMale)
numTestFemale = int(numTesting *  perTest - numTestMale)

In [8]:
maleTrain, maleTesting = train_test_split(
    maleNames, test_size=numTestingMale, random_state=5)
maleVal, maleTest = train_test_split(
    maleTesting, test_size=numTestMale, random_state=6)

femaleTrain, femaleTesting = train_test_split(
    femaleNames, test_size=numTestingFemale, random_state=7)
femaleVal, femaleTest = train_test_split(
    femaleTesting, test_size=numTestFemale, random_state=8)

# Check numbers
print "Val Set   = {} (Should be 500)".format(len(maleVal) + len(femaleVal))
print "Test Set  = {} (Should be 500)".format(len(maleTest) + len(femaleTest))
print "Train Set = {} (Should be >6900)".format(len(maleTrain) + len(femaleTrain))


Val Set   = 500 (Should be 500)
Test Set  = 500 (Should be 500)
Train Set = 6944 (Should be >6900)
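
The counts above can be reproduced by hand from the corpus sizes. The sketch below repeats the notebook's arithmetic with the reported totals (2943 male and 5001 female names) hard-coded; the snake_case variable names are only for illustration.

```python
# counts reported by the notebook
n_male, n_female = 2943, 5001

per_male = 1.0 * n_male / (n_male + n_female)
num_testing = 1000   # names held out for validation + test
per_test = 0.5       # share of the held-out names used for the final test

# held-out names per class
num_testing_male = int(per_male * num_testing)
num_testing_female = num_testing - num_testing_male

# final test names per class
num_test_male = int(num_testing * per_test * per_male)
num_test_female = int(num_testing * per_test - num_test_male)

# whatever is held out but not in the test set is the validation set
val_male = num_testing_male - num_test_male
val_female = num_testing_female - num_test_female
n_train = (n_male - num_testing_male) + (n_female - num_testing_female)

print("held out:   %d male, %d female" % (num_testing_male, num_testing_female))
print("test:       %d male, %d female" % (num_test_male, num_test_female))
print("validation: %d male, %d female" % (val_male, val_female))
print("train:      %d names" % n_train)
# held out:   370 male, 630 female
# test:       185 male, 315 female
# validation: 185 male, 315 female
# train:      6944 names
```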


In [9]:
train = pd.DataFrame({'name': maleTrain + femaleTrain,
                      'sex': (['male'] * len(maleTrain) + 
                              ['female'] * len(femaleTrain))})
validation = pd.DataFrame({'name': maleVal + femaleVal,
                           'sex': (['male'] * len(maleVal) +
                                   ['female'] * len(femaleVal))})
test = pd.DataFrame({'name': maleTest + femaleTest,
                     'sex': (['male'] * len(maleTest) +
                             ['female'] * len(femaleTest))})

In [10]:
# Just to make sure we all see the same thing
print train.loc[56, :]
print
print validation.loc[38, :]
print
print test.loc[486, :]

name    Kristos
sex        male
Name: 56, dtype: object

name    Orton
sex      male
Name: 38, dtype: object

name      Shea
sex     female
Name: 486, dtype: object


Names should be Kristos, Orton, and Shea

----

Use the above code to start another notebook to explore an algorithm. Make sure to use the splits as defined above, and do not use the final test set to tune your model. =-)

-----

## Naive Bayes

Let's create some features ...

In [38]:
def addEdge(G, source, target):
    G.add_edge(source, target)
    try:
        # raises KeyError if the edge was just created (it has no weight yet)
        G.edge[source][target]['weight'] += 1
    except KeyError:
        G.edge[source][target]['weight'] = 1

def addFeature(G, rowNum, description, feature, trainingMode=True):
    # create unique feature identifier
    ftext = description + " " + feature
    # in training mode, always add the feature
    # outside training mode, only link the name to features
    # that already exist in the graph
    # note: the membership test is only evaluated when trainingMode == False
    if trainingMode or ftext in [f for f in returngen(G, 'feature')]:
        G.add_node(ftext, label='feature')
        addEdge(G, rowNum, ftext)

def featuresFromName(G, rowNum, name, sex, options):
    trainingMode = options['trainingMode']
    # We're going to use a Bernoulli classifier, so all our features
    # will be binary
    
    # add the name node
    G.add_node(rowNum, # using rowNum because name is not unique
               label = 'name',
               name = name,
               sex = sex) 
    n = len(name)
        
    # first letter
    if options['first_letter']:
        addFeature(G, rowNum, "1st letter", name[0], trainingMode)
    
    # last letter
    if options['last_letter']:
        addFeature(G, rowNum, "last letter", name[-1], trainingMode)
    
    # two-grams
    if options['two_grams']:
        for n_gram in [name[i:i+2] for i in range(n - 1)]:
            addFeature(G, rowNum, "contains", n_gram, trainingMode)
    
    # length of names
    if options['length']:
        addFeature(G, rowNum, "length is", str(n), trainingMode)
    
    # first two letters
    if options['first_two']:
        addFeature(G, rowNum, "first two", name[:2], trainingMode)
    
    # last two letters
    if options['last_two']:
        addFeature(G, rowNum, "last_two", name[-2:], trainingMode)
        
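    # labelled "first_four", but as written this is gated on the
    # 'first_two' option and name[0:3] takes the first three letters
    if options['first_two']:
        addFeature(G, rowNum, "first_four", name[0:3], trainingMode)
    
    # labelled "last_three", but as written this is gated on the
    # 'last_two' option and name[-3:-1] takes two letters (it omits the last)
    if options['last_two']:
        addFeature(G, rowNum, "last_three", name[-3:-1], trainingMode)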
    
def returngen(G, nodeType):
    # generator for nodes of type nodeType
    return (n for n in G.nodes() if G.node[n]['label'] == nodeType)

options = {'first_letter': True,
           'last_letter':  True,
           'two_grams':    True,
           'length':       True,
           'first_two':    True,
           'last_two':     True,
           'first_four':   True,
           'last_three':   True}
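
For comparison, here is a graph-free sketch of the same feature scheme that returns a plain set of feature labels. It is an assumption about the intended behavior, not the code that produced the results in this notebook: each option gates its own feature, `first_four` takes four characters, and `last_three` includes the final letter. The function name `name_features` is hypothetical.

```python
def name_features(name, options):
    """Return the set of binary feature labels that apply to `name`."""
    name = name.lower()
    features = set()
    if options.get('first_letter'):
        features.add("1st letter " + name[0])
    if options.get('last_letter'):
        features.add("last letter " + name[-1])
    if options.get('two_grams'):
        for i in range(len(name) - 1):
            features.add("contains " + name[i:i + 2])
    if options.get('length'):
        features.add("length is " + str(len(name)))
    if options.get('first_two'):
        features.add("first two " + name[:2])
    if options.get('last_two'):
        features.add("last_two " + name[-2:])
    if options.get('first_four'):
        features.add("first_four " + name[:4])    # four characters
    if options.get('last_three'):
        features.add("last_three " + name[-3:])   # includes the last letter
    return features

example = name_features("Kristos", {'first_letter': True,
                                    'last_letter': True,
                                    'first_four': True,
                                    'last_three': True})
print(sorted(example))
# ['1st letter k', 'first_four kris', 'last letter s', 'last_three tos']
```

Using `options.get` means an options dictionary that omits a key simply leaves that feature switched off.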


Now we run a small-scale trial to make sure we're getting what we expect.

In [39]:
tg = nx.Graph()
options['trainingMode'] = True
for (i, nm, sx) in train.itertuples():
    featuresFromName(tg, i, nm.lower(), sx, options)
    if i > 5:
        break

m = nx.bipartite.biadjacency_matrix(
    tg,
    [n for n in tg.nodes() if tg.node[n]['label'] == 'name'],
    [n for n in tg.nodes() if tg.node[n]['label'] == 'feature']
).toarray()

print type(m)
print m

print "rows:"
for x in returngen(tg, 'name'):
    print "    {}".format(tg.node[x]['name'])

print "columns:"
for x in returngen(tg, 'feature'):
    print "    {}".format(x)

<type 'numpy.ndarray'>
[[0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
  1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0]
 [0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0
  0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0
  0 0 0 0 0 0]
 [1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0
  0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 

Looks good. Now we process the entire training set.

### Train Model

In [40]:
tg = nx.Graph()
options['trainingMode'] = True

for (i, nm, sx) in train.itertuples():
    featuresFromName(tg, i, nm.lower(), sx, options)

X = nx.bipartite.biadjacency_matrix(
    tg,
    [n for n in returngen(tg, 'name')],
    [n for n in returngen(tg, 'feature')]
).toarray()

# Now we need Y, the results column
Y = []
for x in returngen(tg, 'name'):
    Y.append(tg.node[x]['sex'])
Y = np.array(Y)

# Time to train our model
bnb = BernoulliNB()
bnb.fit(X, Y)

Ytrain = bnb.predict(X)

1.0 * sum(Y == Ytrain) / len(Y)



0.8597350230414746
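
To make clearer what a Bernoulli naive Bayes model does with a 0/1 feature matrix, here is a minimal pure-Python sketch of the standard formulation with add-one smoothing. It is not scikit-learn's implementation, and the function names and toy data are illustrative assumptions.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit class priors and per-feature probabilities with add-alpha smoothing."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / float(len(y))
        # P(feature j is present | class c), smoothed so it is never 0 or 1
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2.0 * alpha)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model

def predict_bernoulli_nb(model, x):
    """Return the class with the highest log posterior for binary vector x."""
    best_class, best_score = None, None
    for c in sorted(model):
        prior, probs = model[c]
        score = math.log(prior)
        for present, p in zip(x, probs):
            # absent features contribute too, which is the Bernoulli part
            score += math.log(p) if present else math.log(1.0 - p)
        if best_score is None or score > best_score:
            best_class, best_score = c, score
    return best_class

# toy data: columns are ["last letter a", "last letter o"]
X = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0]]
y = ['female', 'female', 'male', 'male', 'female']

model = train_bernoulli_nb(X, y)
predictions = [predict_bernoulli_nb(model, x) for x in X]
accuracy = 1.0 * sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)
# 1.0
```

Because every column contributes to the score whether the feature is present or absent, the columns used at prediction time must mean the same thing as the columns used in training.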

### Validation Set

To build the validation feature matrix, we need to do things a little differently. First we extract the feature set learned from the training data, then we link each validation name only to features in that set, so features never seen in training are ignored.
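
The same two steps can be sketched without a graph, assuming each name has already been turned into a set of feature labels. Keeping an explicit feature-to-column mapping also guarantees that validation columns appear in the same order as the training columns. The helper names `build_vocabulary` and `encode` are hypothetical.

```python
def build_vocabulary(feature_sets):
    """Fix one column per feature seen in training, in a stable order."""
    vocab = sorted(set().union(*feature_sets))
    return {feature: col for col, feature in enumerate(vocab)}

def encode(feature_sets, vocabulary):
    """Turn feature sets into 0/1 rows using the training columns only."""
    rows = []
    for features in feature_sets:
        row = [0] * len(vocabulary)
        for f in features:
            col = vocabulary.get(f)   # None for features unseen in training
            if col is not None:
                row[col] = 1
        rows.append(row)
    return rows

train_features = [{"last letter a", "1st letter m"},
                  {"last letter o", "1st letter m"}]
val_features = [{"last letter a", "1st letter z"}]   # "1st letter z" is unseen

vocabulary = build_vocabulary(train_features)
X_train = encode(train_features, vocabulary)
X_val = encode(val_features, vocabulary)
print(X_val)
# [[0, 1, 0]]
```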

In [41]:
# Extract our feature set
vg = nx.Graph()
for x in returngen(tg, 'feature'):
    vg.add_node(x, label='feature')

options['trainingMode'] = False
# Connect validation names to existing features
for (i, nm, sx) in validation.itertuples():
    featuresFromName(vg, i, nm.lower(), sx, options)

Xval = nx.bipartite.biadjacency_matrix(
    vg,
    [n for n in returngen(vg, 'name')],
    [n for n in returngen(vg, 'feature')]
)

Yval = []
for x in returngen(vg, 'name'):
    Yval.append(vg.node[x]['sex'])
Yval = np.array(Yval)

Ypredicted =  bnb.predict(Xval)

1.0 * sum(Yval == Ypredicted) / len(Yval)

0.394
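
To put an accuracy figure in context, it helps to compare it with a majority-class baseline. The sketch below uses the validation class counts implied by the split code (185 male and 315 female names) and computes the accuracy of always predicting the more common class; the `accuracy` helper is illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return 1.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# validation class counts implied by the split code: 185 male, 315 female
y_val = ['male'] * 185 + ['female'] * 315

# baseline: always predict the most common class
majority = max(set(y_val), key=y_val.count)
baseline = accuracy(y_val, [majority] * len(y_val))
print("%s %.2f" % (majority, baseline))
# female 0.63
```

Any feature set whose validation accuracy falls below this baseline is doing worse than ignoring the name entirely.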

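That is far below the training accuracy of 0.86, and lower than always guessing the more common class would score, so the drop is probably not just overfitting. It is worth checking that the columns of Xval line up with the columns of X.

Now let's put this all together so we can run through different options.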

In [42]:
def runNaiveBayes(train, validation, options):
    tg = nx.Graph()
    options['trainingMode'] = True

    for (i, nm, sx) in train.itertuples():
        featuresFromName(tg, i, nm.lower(), sx, options)

    X = nx.bipartite.biadjacency_matrix(
        tg,
        [n for n in returngen(tg, 'name')],
        [n for n in returngen(tg, 'feature')]
    ).toarray()

    # Now we need Y, the results column
    Y = []
    for x in returngen(tg, 'name'):
        Y.append(tg.node[x]['sex'])
    Y = np.array(Y)

    # Time to train our model
    bnb = BernoulliNB()
    bnb.fit(X, Y)

    Ytrain = bnb.predict(X)

    trainacc = 1.0 * sum(Y == Ytrain) / len(Y)
    
    # Extract our feature set
    vg = nx.Graph()
    for x in returngen(tg, 'feature'):
        vg.add_node(x, label='feature')

    options['trainingMode'] = False
    # Connect validation names to existing features
    for (i, nm, sx) in validation.itertuples():
        featuresFromName(vg, i, nm.lower(), sx, options)

    Xval = nx.bipartite.biadjacency_matrix(
        vg,
        [n for n in returngen(vg, 'name')],
        [n for n in returngen(vg, 'feature')]
    )

    Yval = []
    for x in returngen(vg, 'name'):
        Yval.append(vg.node[x]['sex'])
    Yval = np.array(Yval)

    Ypredicted =  bnb.predict(Xval)

    valacc = 1.0 * sum(Yval == Ypredicted) / len(Yval)
    return trainacc, valacc

def printResults(options, trainacc, valacc):
    print "The training accuracy was           {}".format(round(trainacc, 3))
    print "and the validation accuracy was     {}".format(round(valacc, 3))

## Feature Comparison

In [43]:
options = {'first_letter': True,
           'last_letter':  True,
           'two_grams':    True,
           'length':       True,
           'first_two':    True,
           'last_two':     True}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.86
and the validation accuracy was     0.394


In [44]:
options = {'first_letter': True,
           'last_letter':  False,
           'two_grams':    False,
           'length':       False,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.649
and the validation accuracy was     0.666


In [45]:
options = {'first_letter': False,
           'last_letter':  True,
           'two_grams':    False,
           'length':       False,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.764
and the validation accuracy was     0.768


In [46]:
options = {'first_letter': False,
           'last_letter':  False,
           'two_grams':    True,
           'length':       False,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.765
and the validation accuracy was     0.41


In [47]:
options = {'first_letter': False,
           'last_letter':  False,
           'two_grams':    False,
           'length':       True,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.631
and the validation accuracy was     0.646


In [48]:
options = {'first_letter': False,
           'last_letter':  False,
           'two_grams':    False,
           'length':       False,
           'first_two':    True,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.777
and the validation accuracy was     0.524


In [49]:
options = {'first_letter': False,
           'last_letter':  False,
           'two_grams':    False,
           'length':       False,
           'first_two':    False,
           'last_two':     True}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.815
and the validation accuracy was     0.468
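
The single-feature runs above repeat the same options dictionary with one flag changed each time. As a sketch, those dictionaries could be generated in a loop instead; the helper `single_feature_options` is hypothetical and only builds the dictionaries.

```python
FEATURES = ['first_letter', 'last_letter', 'two_grams',
            'length', 'first_two', 'last_two']

def single_feature_options(features):
    """Yield (name, options) pairs with exactly one feature switched on."""
    for chosen in features:
        yield chosen, {f: (f == chosen) for f in features}

for feature_name, opts in single_feature_options(FEATURES):
    enabled = [f for f in FEATURES if opts[f]]
    print("%s -> %s" % (feature_name, enabled))
    # in the notebook, each `opts` could then be passed to
    # runNaiveBayes(train, validation, opts)
```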


The single features with the highest validation accuracy are last letter (0.768), first letter (0.666), and length (0.646). Combining these features, we have:

In [50]:
options = {'first_letter': True,
           'last_letter':  True,
           'two_grams':    False,
           'length':       True,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.781
and the validation accuracy was     0.73


In [51]:
options = {'first_letter': True,
           'last_letter':  True,
           'two_grams':    False,
           'length':       False,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.778
and the validation accuracy was     0.798


In [52]:
options = {'first_letter': True,
           'last_letter':  False,
           'two_grams':    False,
           'length':       True,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.652
and the validation accuracy was     0.474


In [53]:
options = {'first_letter': False,
           'last_letter':  True,
           'two_grams':    False,
           'length':       True,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.759
and the validation accuracy was     0.784


In [55]:
options = {'first_letter': True,
           'last_letter':  True,
           'two_grams':    False,
           'length':       False,
           'first_four':   True,
           'last_three':   True,
           'first_two':    False,
           'last_two':     False}

ta, va = runNaiveBayes(train, validation, options)
printResults(options, ta, va)

The training accuracy was           0.778
and the validation accuracy was     0.798


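Based on the above results, using just the first and last letter is the best choice among the combinations tried, with a validation accuracy of 0.798. The In [55] run reports identical numbers because, as featuresFromName is written, the first_four and last_three options are never consulted.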
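
To make the comparison easier to scan, the accuracies reported in the runs above can be collected and ranked by validation accuracy. The numbers below are copied from the outputs in this notebook; the labels and layout are only for illustration.

```python
# (label, training accuracy, validation accuracy) as reported above
results = [
    ('all six options',                    0.860, 0.394),
    ('first_letter',                       0.649, 0.666),
    ('last_letter',                        0.764, 0.768),
    ('two_grams',                          0.765, 0.410),
    ('length',                             0.631, 0.646),
    ('first_two',                          0.777, 0.524),
    ('last_two',                           0.815, 0.468),
    ('first_letter + last_letter + length', 0.781, 0.730),
    ('first_letter + last_letter',         0.778, 0.798),
    ('first_letter + length',              0.652, 0.474),
    ('last_letter + length',               0.759, 0.784),
]

# highest validation accuracy first
ranked = sorted(results, key=lambda r: r[2], reverse=True)

for label, train_acc, val_acc in ranked:
    print("%-36s train %.3f  val %.3f  gap %+.3f"
          % (label, train_acc, val_acc, train_acc - val_acc))
```

The gap column highlights the runs where training accuracy is much higher than validation accuracy, which are the ones most in need of a second look.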