# IS620 Group Project

<b>Group project: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?</b>

<font color= blue> <b>Group Members: Aaron Palumbo, Brian Chu, David Stern, Partha Banerjee, Rohan Fray, Tulasi Ramarao</b></font>

## Dependencies

In [24]:
import nltk
import numpy as np
import pandas as pd
import string
import networkx as nx
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import BernoulliNB
%matplotlib inline

In [2]:
# %load qtutil.py
# silly utility to launch a qtconsole if one doesn't exist
import psutil

def returnPyIDs():
    pyids = set()
    for pid in psutil.pids():
        try:
            if "python" in psutil.Process(pid).name():
                pyids.add(pid)
        except psutil.Error:
            # the process exited or is not accessible; skip it
            pass
    return pyids

def launchConsole():
    before_pyids = returnPyIDs()
    %qtconsole
    after_pyids = returnPyIDs()
    newid = after_pyids.difference(before_pyids)
    assert len(newid) == 1
    return list(newid)[0]

try:
    qtid
except NameError:
    qtid = launchConsole()
    
if qtid not in returnPyIDs():
    qtid = launchConsole()
    
qtid

7332

## Split Data

In [3]:
names = nltk.corpus.names
maleNames = names.words('male.txt')
femaleNames = names.words('female.txt')

* Training Data

    * Train Set: used to train the model

    * Validation Set: used for model selection

* Test Set: measures final model performance (use this only once, at the end)

There are different numbers of male and female names:

In [4]:
print "Number of male names: {}".format(len(maleNames))
print "Number of female names: {}".format(len(femaleNames))

Number of male names: 2943
Number of female names: 5001


Because the classes are imbalanced, we split the male and female names separately so that the train, validation, and test sets all keep the same male-to-female ratio.

In [5]:
# fraction of all names that are male
perMale = 1.0 * len(maleNames) / (len(maleNames) + len(femaleNames))

# total names
numNames = len(names.words())
# number used for testing
numTesting = 1000
# split between final test and validation (fraction that goes to the final test set)
perTest = 0.5

# numbers for data splitting
numTestingMale = int(perMale * numTesting)
numTestingFemale = numTesting - numTestingMale

numTestMale = int(numTesting * perTest * perMale)
numTestFemale = int(numTesting *  perTest - numTestMale)

In [6]:
maleTrain, maleTesting = train_test_split(
    maleNames, test_size=numTestingMale, random_state=5)
maleVal, maleTest = train_test_split(
    maleTesting, test_size=numTestMale, random_state=6)

femaleTrain, femaleTesting = train_test_split(
    femaleNames, test_size=numTestingFemale, random_state=7)
femaleVal, femaleTest = train_test_split(
    femaleTesting, test_size=numTestFemale, random_state=8)

# Check numbers
print "Val Set   = {} (Should be 500)".format(len(maleVal) + len(femaleVal))
print "Test Set  = {} (Should be 500)".format(len(maleTest) + len(femaleTest))
print "Train Set = {} (Should be >6900)".format(len(maleTrain) + len(femaleTrain))


Val Set   = 500 (Should be 500)
Test Set  = 500 (Should be 500)
Train Set = 6944 (Should be >6900)
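
The set sizes printed above follow directly from the corpus counts. As a rough check, the arithmetic can be reproduced without NLTK or scikit-learn. This is only a sketch: the variable names below are ours and are not used elsewhere in the notebook, and the corpus counts are hard-coded from the output printed earlier.

```python
# Corpus counts as printed earlier in this notebook.
n_male, n_female = 2943, 5001
n_held_out = 1000      # validation + test combined
test_fraction = 0.5    # share of the held-out names reserved for the final test

per_male = float(n_male) / (n_male + n_female)

# Held-out names per class, keeping the corpus male/female ratio.
held_out_male = int(per_male * n_held_out)
held_out_female = n_held_out - held_out_male

# Final test set per class.
test_male = int(n_held_out * test_fraction * per_male)
test_female = int(n_held_out * test_fraction - test_male)

# Whatever is held out but not in the test set is the validation set.
val_male = held_out_male - test_male
val_female = held_out_female - test_female

train_total = (n_male - held_out_male) + (n_female - held_out_female)

print((val_male, val_female))    # (185, 315)
print((test_male, test_female))  # (185, 315)
print(train_total)               # 6944
```

Both held-out sets contain 185 male names out of 500, which is 37%, close to the roughly 37% share of male names in the full corpus.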


In [7]:
train = pd.DataFrame({'name': maleTrain + femaleTrain,
                      'sex': (['male'] * len(maleTrain) + 
                              ['female'] * len(femaleTrain))})
validation = pd.DataFrame({'name': maleVal + femaleVal,
                           'sex': (['male'] * len(maleVal) +
                                   ['female'] * len(femaleVal))})
test = pd.DataFrame({'name': maleTest + femaleTest,
                     'sex': (['male'] * len(maleTest) +
                             ['female'] * len(femaleTest))})

In [8]:
# Just to make sure we all see the same thing
print train.loc[56, :]
print
print validation.loc[38, :]
print
print test.loc[486, :]

name    Kristos
sex        male
Name: 56, dtype: object

name    Orton
sex      male
Name: 38, dtype: object

name      Shea
sex     female
Name: 486, dtype: object


If your splits match ours, the names should be Kristos, Orton, and Shea.

----

Use the above code to start another notebook to explore an algorithm. Make sure to use the splits as defined above and to not use the final test set to tune your model. =-)

-----

## Naive Bayes

Let's create some features ...

In [15]:
def addEdge(source, target):
    G.add_edge(source, target)
    try:
        G.edge[source][target]['weight'] += 1
    except KeyError:
        G.edge[source][target]['weight'] = 1

def addFeature(rowNum, description, feature):
    ftext = description + " " + feature
    G.add_node(ftext)
    G.node[ftext]['label'] = 'feature'
    addEdge(rowNum, ftext)

def featuresFromName(rowNum, name, sex):
    # We're going to use a Bernoulli classifier, so all our features
    # will be binary
    G.add_node(rowNum)
    G.node[rowNum]['label'] = 'name'
    G.node[rowNum]['name'] = name
    G.node[rowNum]['sex'] = sex
    n = len(name)
    
    # first letter
    addFeature(rowNum, "1st letter", name[0])
    
    # last letter
    addFeature(rowNum, "last letter", name[-1])
    
    # bigrams (every pair of adjacent characters)
    for n_gram in [name[i:i+2] for i in range(n - 1)]:
        addFeature(rowNum, "containts", n_gram)
    
    # length of names
    addFeature(rowNum, "length is", str(n))
    
def returngen(nodeType):
    # generator for nodes of type nodeType
    return (n for n in G.nodes() if G.node[n]['label'] == nodeType)
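
To see which features a single name produces, it can help to look at them outside the graph. The following is a minimal sketch, not the code the notebook actually runs: `name_features` is a hypothetical helper that returns the same label strings `featuresFromName` attaches to the graph (spelling of the labels kept identical), collected in a plain set.

```python
def name_features(name):
    """Return the set of binary feature labels for one lowercase name."""
    n = len(name)
    features = set()
    features.add("1st letter " + name[0])
    features.add("last letter " + name[-1])
    # every pair of adjacent characters; label spelling mirrors the notebook
    for i in range(n - 1):
        features.add("containts " + name[i:i + 2])
    features.add("length is " + str(n))
    return features

example = sorted(name_features("kristos"))
for label in example:
    print(label)
```

For "kristos" this yields nine labels: the first letter, the last letter, six bigrams, and the length. Using a set collapses repeated bigrams into one entry, which is all a presence/absence feature needs.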

Now we run a small-scale trial on the first few training names to make sure we're getting what we expect.

In [31]:
G = nx.Graph()

for (i, nm, sx) in train.itertuples():
    featuresFromName(i, nm.lower(), sx)
    if i > 10:
        break

m = nx.bipartite.biadjacency_matrix(
    G,
    [n for n in G.nodes() if G.node[n]['label'] == 'name'],
    [n for n in G.nodes() if G.node[n]['label'] == 'feature']
).toarray()

print type(m)
print m

print "rows:"
for x in returngen('name'):
    print G.node[x]['name']

print "columns:"
for x in returngen('feature'):
    print x

<type 'numpy.ndarray'>
[[0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
  0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0

Looks good. Now we process the entire training set.

### Train Model

In [27]:
G = nx.Graph()

for (i, nm, sx) in train.itertuples():
    featuresFromName(i, nm.lower(), sx)

X = nx.bipartite.biadjacency_matrix(
    G,
    [n for n in G.nodes() if G.node[n]['label'] == 'name'],
    [n for n in G.nodes() if G.node[n]['label'] == 'feature']
).toarray()

# Now we need Y, the results column
Y = []
for x in returngen('name'):
    Y.append(G.node[x]['sex'])
Y = np.array(Y)

# Time to train our model
bnb = BernoulliNB()
bnb.fit(X, Y)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
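
For intuition about what a Bernoulli naive Bayes model estimates, here is a small pure-Python sketch of the textbook algorithm with Laplace smoothing (`alpha=1.0`, matching the default shown above). It is an illustration under our own assumptions, not scikit-learn's implementation: the function names and the toy data are made up for this example.

```python
import math

def fit_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed P(feature present | class)."""
    n_features = len(rows[0])
    model = {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        n_c = len(class_rows)
        # Laplace smoothing keeps every probability strictly between 0 and 1
        probs = [(sum(r[j] for r in class_rows) + alpha) / (n_c + 2.0 * alpha)
                 for j in range(n_features)]
        model[c] = (math.log(float(n_c) / len(rows)), probs)
    return model

def predict_bernoulli_nb(model, row):
    """Return the class with the highest joint log-probability."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for x, p in zip(row, probs):
            # an absent feature contributes log(1 - p) instead of being ignored
            score += math.log(p) if x else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data (made up). Columns: "last letter a", "last letter o", "length is 4"
toy_X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
toy_y = ["female", "female", "male", "male"]

toy_model = fit_bernoulli_nb(toy_X, toy_y)
prediction = predict_bernoulli_nb(toy_model, [1, 0, 1])
print(prediction)  # female
```

The point to notice is that absent features also carry evidence, which is why every feature in our matrix is treated as a yes/no indicator.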

### Validation Set

To build the validation matrix, we need to do things a little differently, because its columns must match the features seen during training. First we extract the training feature set, then we apply that feature set to the validation names.

In [29]:
# Extract our feature set
featureSet = []
for x in returngen('feature'):
    featureSet.append(x)


featureSet[:10]

[u'containts vo',
 u'containts vi',
 u'containts ve',
 u'containts iv',
 u'containts iw',
 u'containts it',
 u'containts iu',
 u'containts ir',
 u'containts is',
 u'containts ip']
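
Applying the training feature set to a new name amounts to checking, column by column, whether the name has that feature. The sketch below shows the idea with plain lists. The helper `vectorize`, the short `feature_set`, and the validation labels are hypothetical and chosen only for illustration; in the notebook the column order would have to be the same one used to build `X`.

```python
def vectorize(name_feature_labels, feature_set):
    """Binary row: 1 where the name has the feature for that column, else 0."""
    return [1 if f in name_feature_labels else 0 for f in feature_set]

# Hypothetical column order taken from a training run.
feature_set = ["last letter a", "last letter n", "containts an", "length is 4"]

# Labels for one validation name; "containts zz" never occurred in training.
val_labels = {"last letter n", "containts an", "containts zz", "length is 4"}

row = vectorize(val_labels, feature_set)
print(row)  # [0, 1, 1, 1]
```

Features that never appeared in training, such as "containts zz" here, have no column and are simply dropped, so every validation row has the same length as a training row.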