# IS620 Group Project

<b>Group project: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?</b>

<font color= blue> <b>Group Members: Aaron Palumbo, Brian Chu, David Stern, Partha Banerjee, Rohan Fray, Tulasi Ramarao</b></font>

## Dependencies

In [1]:
import nltk
import numpy as np
import pandas as pd
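# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer versions provide train_test_split in sklearn.model_selection.
from sklearn.cross_validation import train_test_split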
%matplotlib inline

## Split Data

In [2]:
names = nltk.corpus.names
maleNames = names.words('male.txt')
femaleNames = names.words('female.txt')

* Training Data

    * Train Set: Used to train the model

    * Validation Set: Model Selection

* Test Set: Measure Final Model performance (Only use this once at the end)

There are different numbers of male and female names:

In [3]:
print "Number of male names: {}".format(len(maleNames))
print "Number of female names: {}".format(len(femaleNames))

Number of male names: 2943
Number of female names: 5001


We will have to do our splits separately. We will split the data with the goal of maintaining the same ratio of male to female in our train, validation, and test sets.

In [4]:
perMale = 1.0 * len(maleNames) / (len(maleNames) + len(femaleNames))
perMale

# total names
numNames = len(names.words())
# number used for testing
numTesting = 1000
# split between final test and validation
perTest = 0.5

# numbers for data splitting
numTestingMale = int(perMale * numTesting)
numTestingFemale = numTesting - numTestingMale

numTestMale = int(numTesting * perTest * perMale)
numTestFemale = int(numTesting *  perTest - numTestMale)

In [5]:
maleTrain, maleTesting = train_test_split(
    maleNames, test_size=numTestingMale, random_state=5)
maleVal, maleTest = train_test_split(
    maleTesting, test_size=numTestMale, random_state=6)

femaleTrain, femaleTesting = train_test_split(
    femaleNames, test_size=numTestingFemale, random_state=7)
femaleVal, femaleTest = train_test_split(
    femaleTesting, test_size=numTestFemale, random_state=8)

# Check numbers
print "Val Set   = {} (Should be 500)".format(len(maleVal) + len(femaleVal))
print "Test Set  = {} (Should be 500)".format(len(maleTest) + len(femaleTest))
print "Train Set = {} (Should be >6900)".format(len(maleTrain) + len(femaleTrain))


Val Set   = 500 (Should be 500)
Test Set  = 500 (Should be 500)
Train Set = 6944 (Should be >6900)
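The counts above can be checked by hand. The sketch below repeats the split arithmetic using only the corpus sizes printed earlier (2943 male, 5001 female) and shows the male share in each subset. The lowercase variable names are ours, introduced only for this illustration.

```python
from __future__ import division, print_function

# Corpus sizes reported above.
n_male, n_female = 2943, 5001
n_testing = 1000   # validation + test combined
per_test = 0.5     # share of the held-out names that goes to the final test set

per_male = n_male / (n_male + n_female)

# Same arithmetic as the notebook cell above.
n_testing_male = int(per_male * n_testing)
n_testing_female = n_testing - n_testing_male
n_test_male = int(n_testing * per_test * per_male)
n_test_female = int(n_testing * per_test - n_test_male)

# What is left over after each split.
n_val_male = n_testing_male - n_test_male
n_val_female = n_testing_female - n_test_female
n_train_male = n_male - n_testing_male
n_train_female = n_female - n_testing_female

splits = {
    'train': (n_train_male, n_train_female),
    'validation': (n_val_male, n_val_female),
    'test': (n_test_male, n_test_female),
}
for label in ('train', 'validation', 'test'):
    m, f = splits[label]
    print("{:<10} male={:<5} female={:<5} male share={:.3f}".format(
        label, m, f, m / (m + f)))
```

The male share stays close to 0.37 in all three subsets, which is the goal of splitting male and female names separately.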


In [6]:
train = pd.DataFrame({'name': maleTrain + femaleTrain,
                      'sex': (['male'] * len(maleTrain) + 
                              ['female'] * len(femaleTrain))})
validation = pd.DataFrame({'name': maleVal + femaleVal,
                           'sex': (['male'] * len(maleVal) +
                                   ['female'] * len(femaleVal))})
test = pd.DataFrame({'name': maleTest + femaleTest,
                     'sex': (['male'] * len(maleTest) +
                             ['female'] * len(femaleTest))})

In [7]:
# Just to make sure we all see the same thing
print train.loc[56, :]
print
print validation.loc[38, :]
print
print test.loc[486, :]

name    Kristos
sex        male
Name: 56, dtype: object

name    Orton
sex      male
Name: 38, dtype: object

name      Shea
sex     female
Name: 486, dtype: object


Names should be Kristos, Orton, and Shea

## Maximum Entropy

**Feature set to train the classifier on. Start with the book example: the last letter and the last two letters of the name.**

In [8]:
def gender_features(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

In [9]:
features = [(gender_features(x), gender) for x, gender in train.values]
features[:5]

[({'suffix1': u'l', 'suffix2': u'el'}, 'male'),
 ({'suffix1': u's', 'suffix2': u'us'}, 'male'),
 ({'suffix1': u'n', 'suffix2': u'on'}, 'male'),
 ({'suffix1': u'd', 'suffix2': u'rd'}, 'male'),
 ({'suffix1': u'n', 'suffix2': u'in'}, 'male')]

Train classifier.  
  
There are a few Maximum Entropy options available. NLTK ships its own MaxentClassifier, which seems to be the most popular choice, and NLTK's SklearnClassifier wrapper lets us plug in scikit-learn estimators. The maximum entropy code in SciPy appears to be deprecated now. Maximum entropy is also essentially logistic regression, so we can use sklearn's LogisticRegression as well.
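To see why maximum entropy and logistic regression coincide for two labels, here is a minimal sketch with made-up weights (they are not taken from any trained model in this notebook). The helper names `maxent_prob` and `sigmoid` are ours. The maxent probability of a label is a normalized exponential of summed feature weights, and with two labels this reduces to a sigmoid of the weight differences.

```python
from __future__ import division, print_function
import math

def maxent_prob(weights, active_features, label):
    """P(label | features) for a conditional maxent model with indicator features.

    `weights` maps (feature, label) pairs to real-valued weights.
    """
    def score(lab):
        return sum(weights.get((feat, lab), 0.0) for feat in active_features)
    labels = sorted({lab for (_, lab) in weights})
    z = sum(math.exp(score(lab)) for lab in labels)
    return math.exp(score(label)) / z

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical weights, chosen only for illustration.
weights = {
    ('suffix1=a', 'female'): 1.2, ('suffix1=a', 'male'): -0.9,
    ('suffix2=na', 'female'): 0.7, ('suffix2=na', 'male'): -1.5,
}
active = ['suffix1=a', 'suffix2=na']

p_maxent = maxent_prob(weights, active, 'female')
# Logistic regression only needs the per-feature weight differences.
diff = sum(weights[(f, 'female')] - weights[(f, 'male')] for f in active)
p_logistic = sigmoid(diff)
print(p_maxent, p_logistic)
```

Both numbers agree up to floating point error. The trained models can still differ in practice because the optimizers and regularization differ.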

In [10]:
# NLTK's built-in MaxentClassifier, trained with its GIS algorithm

from nltk.classify import MaxentClassifier
me_sp = MaxentClassifier.train(features,
                               trace = 10,          # higher values = more verbose
                               max_iter = 10,       # maximum number of iterations
                               min_lldelta = 0.05,  # stop once the log likelihood improves by less than this;
                                                    # lower = more fine-tuning, higher = fewer iterations
                               algorithm = 'gis')   # generally preferred to iis

  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.40233        0.791
         Final          -0.35426        0.795


In [11]:
# Scikit-learn Logistic Regression implementation

from sklearn.linear_model import LogisticRegression
from nltk.classify.scikitlearn import SklearnClassifier
sk_classifier = SklearnClassifier(LogisticRegression())
me_lr = sk_classifier.train(features)
# Accuracy on the training set
print("Accuracy:")
nltk.classify.accuracy(me_lr, features)

Accuracy:


0.7985311059907834

In [12]:
# Same NLTK MaxentClassifier as in cell 10, accessed via nltk.classify.maxent
# (hence the identical training trace)

from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import maxent

me = nltk.classify.maxent.MaxentClassifier.train(features, 
                                                  trace = 10, 
                                                  max_iter = 10, 
                                                  min_lldelta = 0.05, 
                                                  algorithm = 'gis')

  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.40233        0.791
         Final          -0.35426        0.795


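All three runs reached about the same training accuracy: 0.795 for both MaxentClassifier runs, which use the same NLTK class and settings, and 0.799 for logistic regression. Let's stick with maxent for now.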

In [13]:
me.show_most_informative_features(n=10)

  -3.204 suffix2==u'na' and label is 'male'
  -3.169 suffix2==u'la' and label is 'male'
  -2.782 suffix1==u'k' and label is 'female'
  -2.568 suffix2==u'sa' and label is 'male'
  -2.450 suffix2==u'ia' and label is 'male'
  -2.407 suffix2==u'ra' and label is 'male'
  -2.293 suffix1==u'a' and label is 'male'
  -2.053 suffix2==u'ta' and label is 'male'
  -1.872 suffix2==u'ti' and label is 'male'
  -1.863 suffix2==u'ka' and label is 'male'


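> Most of the strongest weights are negative weights for the label 'male' on suffixes ending in 'a' (mostly consonant + 'a'). In other words, names with these endings are strongly associated with female.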

Error Analysis - Validation Set

In [14]:
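def error_analysis(maxentropy, dataset, feature_func=None):
    # feature_func should be the extractor the classifier was trained with;
    # it defaults to gender_features, as in the original calls
    if feature_func is None:
        feature_func = gender_features
    errors = []
    for name, tag in dataset.values:
        guess = maxentropy.classify(feature_func(name))
        if guess != tag:
            errors.append((tag, guess, name))
    return errors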

In [15]:
# number and fraction of validation errors
e = error_analysis(me, validation)
print(len(e))
print(1.0 * len(e) / len(validation))

95
0.19


Inspect errors

In [16]:
e

[('male', 'female', u'Mortie'),
 ('male', 'female', u'Jeth'),
 ('male', 'female', u'Bealle'),
 ('male', 'female', u'Nickie'),
 ('male', 'female', u'Glynn'),
 ('male', 'female', u'Tonnie'),
 ('male', 'female', u'Tymothy'),
 ('male', 'female', u'Blayne'),
 ('male', 'female', u'Lex'),
 ('male', 'female', u'Gabriele'),
 ('male', 'female', u'Darien'),
 ('male', 'female', u'Dougie'),
 ('male', 'female', u'Wye'),
 ('male', 'female', u'Vite'),
 ('male', 'female', u'Len'),
 ('male', 'female', u'Barnaby'),
 ('male', 'female', u'Haley'),
 ('male', 'female', u'Karl'),
 ('male', 'female', u'Patel'),
 ('male', 'female', u'Meryl'),
 ('male', 'female', u'Alex'),
 ('male', 'female', u'Tobie'),
 ('male', 'female', u'Neel'),
 ('male', 'female', u'Burnaby'),
 ('male', 'female', u'Barnie'),
 ('male', 'female', u'Iggie'),
 ('male', 'female', u'Nevile'),
 ('male', 'female', u'Witty'),
 ('male', 'female', u'Sandy'),
 ('male', 'female', u'Emmanuel'),
 ('male', 'female', u'Timothee'),
 ('male', 'female', u'Gare

> Eyeballing the errors, it seems a few male names ending in 'e' were mislabeled, as were female names ending in 'l'.
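Rather than eyeballing, the errors can be tallied by ending. The sketch below uses a handful of tuples copied from the output above as sample data. With the notebook objects, the full list `e` could be passed instead. The helper name `errors_by_suffix` is ours.

```python
from __future__ import print_function
from collections import Counter

# A few (actual, guess, name) tuples copied from the output above.
sample_errors = [
    ('male', 'female', 'Mortie'),
    ('male', 'female', 'Jeth'),
    ('male', 'female', 'Bealle'),
    ('male', 'female', 'Nickie'),
    ('male', 'female', 'Karl'),
    ('male', 'female', 'Patel'),
    ('male', 'female', 'Meryl'),
    ('male', 'female', 'Alex'),
]

def errors_by_suffix(errors, n=1):
    """Count misclassified names by (actual label, last n letters)."""
    return Counter((actual, name[-n:].lower()) for actual, _, name in errors)

for (actual, suffix), count in errors_by_suffix(sample_errors).most_common(3):
    print(actual, suffix, count)
```

Calling `errors_by_suffix(e, n=2)` would group by the last two letters instead, which lines up with the `suffix2` feature.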

Evaluate test set

In [17]:
test_features = [(gender_features(x), gender) for x, gender in test.values]
print(nltk.classify.accuracy(me, test_features))

0.776


> Test accuracy (0.776) is slightly worse than on the training (0.795) and validation (0.81) sets, but the gap is small, so the model doesn't seem to be overfitting.

**Feature Set 2: Add the combination of the first two and last two letters of the name.**

In [18]:
def gender_features2(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:], 'firstlast':(word[:2],word[-2:])}

In [19]:
features2 = [(gender_features2(x), gender) for x, gender in train.values]
me2 = nltk.classify.maxent.MaxentClassifier.train(features2, 
                                                  trace = 10, 
                                                  max_iter = 10, 
                                                  min_lldelta = 0.05, 
                                                  algorithm = 'gis')

  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.38022        0.864
             3          -0.30843        0.878
         Final          -0.27411        0.886


In [20]:
e2 = error_analysis(me2, validation)
print(len(e2))
print(1.0 * len(e2) / len(validation))

95
0.19


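Training accuracy was much better (0.886 vs. 0.795), but the validation error count is exactly the same as before (95). Note that `error_analysis`, as called here, extracts features with `gender_features`, so `me2` was scored without the `firstlast` feature it was trained on. The identical count is therefore not a fair measure of feature set 2. We may also simply have overfitted the training set.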
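A classifier should be scored with the same feature extractor it was trained with. One way to keep the two paired is to pass the extractor explicitly. This is a minimal sketch: `accuracy_with` is a hypothetical helper, and `LastLetterRule` is a toy stand-in for a trained classifier so the example runs on its own.

```python
from __future__ import division, print_function

def accuracy_with(classifier, feature_func, rows):
    """Fraction of (name, label) rows classified correctly, using the
    feature extractor that matches the classifier."""
    hits = sum(1 for name, label in rows
               if classifier.classify(feature_func(name)) == label)
    return hits / len(rows)

class LastLetterRule(object):
    """Hypothetical stand-in for a trained classifier (illustration only)."""
    def classify(self, featureset):
        return 'female' if featureset.get('suffix1') in ('a', 'e', 'i') else 'male'

def gender_features(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

# The three sanity-check names printed earlier in the notebook.
rows = [('Kristos', 'male'), ('Orton', 'male'), ('Shea', 'female')]
print(accuracy_with(LastLetterRule(), gender_features, rows))
```

With the notebook objects, the equivalent call would be something like `accuracy_with(me2, gender_features2, validation.values)`.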

In [21]:
test_features2 = [(gender_features2(x), gender) for x, gender in test.values]
print(nltk.classify.accuracy(me2, test_features2))

0.794


Test accuracy improved only slightly (0.794 vs. 0.776) while training accuracy rose to 0.886. It definitely seems like we overfitted the training set.

In [22]:
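test_predict = [me2.classify(feature) for feature, gender in test_features2]
test_actual = [gender for name, gender in test.values]
# nltk.ConfusionMatrix takes (reference, test). The predictions are passed
# first here, so in the printed table rows are predicted labels and
# columns are actual labels.
cm = nltk.ConfusionMatrix(test_predict, test_actual)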
print(cm)

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<269> 57 |
  male |  46<128>|
-------+---------+
(row = reference; col = test)
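The table can be turned into per-label rates. `nltk.ConfusionMatrix` takes the reference labels first, but the cell above passes `test_predict` first, so we read rows as predictions and columns as actual labels. The column totals (315 female, 185 male) match the test set sizes implied by the split, which supports that reading. A sketch using only the four counts printed above:

```python
from __future__ import division, print_function

# Counts copied from the table above, keyed as (row label, column label).
counts = {
    ('female', 'female'): 269, ('female', 'male'): 57,
    ('male', 'female'): 46, ('male', 'male'): 128,
}
labels = ('female', 'male')
total = sum(counts.values())

accuracy = sum(counts[(lab, lab)] for lab in labels) / total
row_totals = {r: sum(counts[(r, c)] for c in labels) for r in labels}
col_totals = {c: sum(counts[(r, c)] for r in labels) for c in labels}

print("accuracy = {:.3f}".format(accuracy))
for lab in labels:
    print("{:<6} diagonal/row total = {:.3f}  diagonal/column total = {:.3f}".format(
        lab, counts[(lab, lab)] / row_totals[lab], counts[(lab, lab)] / col_totals[lab]))
```

Under the reading above, diagonal/row total is precision and diagonal/column total is recall. The accuracy reproduces the 0.794 reported for the test set.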



In [23]:
me2.show_most_informative_features(n=10)

  -3.103 suffix2==u'na' and label is 'male'
  -3.071 suffix2==u'la' and label is 'male'
  -2.724 suffix1==u'k' and label is 'female'
  -2.448 suffix2==u'sa' and label is 'male'
  -2.325 suffix2==u'ia' and label is 'male'
  -2.281 suffix2==u'ra' and label is 'male'
  -2.166 suffix1==u'a' and label is 'male'
  -1.896 suffix2==u'ta' and label is 'male'
  -1.796 suffix2==u'ti' and label is 'male'
   1.786 firstlast==(u'Ez', u'ra') and label is 'male'


**Feature Set 3: Add first 3 letters of name**

In [24]:
def gender_features3(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:], 'prefix3':word[:3]}

features3 = [(gender_features3(x), gender) for x, gender in train.values]
me3 = nltk.classify.maxent.MaxentClassifier.train(features3, 
                                                  trace = 10, 
                                                  max_iter = 10, 
                                                  min_lldelta = 0.05, 
                                                  algorithm = 'gis')

  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.39845        0.839
             3          -0.32913        0.853
         Final          -0.29756        0.860


In [25]:
validation_features3 = [(gender_features3(x), gender) for x, gender in validation.values]
print(nltk.classify.accuracy(me3, validation_features3))
test_features3 = [(gender_features3(x), gender) for x, gender in test.values]
print(nltk.classify.accuracy(me3, test_features3))

0.832
0.8


Getting closer. Baby steps ;)

In [26]:
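# Note: error_analysis extracts features with gender_features (not gender_features3),
# so this error rate need not equal 1 - 0.832 from the cell above
e3 = error_analysis(me3, validation)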
print(len(e3))
print(1.0 * len(e3) / len(validation))
e3

91
0.182


[('male', 'female', u'Mortie'),
 ('male', 'female', u'Jeth'),
 ('male', 'female', u'Bealle'),
 ('male', 'female', u'Nickie'),
 ('male', 'female', u'Glynn'),
 ('male', 'female', u'Tonnie'),
 ('male', 'female', u'Tymothy'),
 ('male', 'female', u'Blayne'),
 ('male', 'female', u'Lex'),
 ('male', 'female', u'Gabriele'),
 ('male', 'female', u'Dougie'),
 ('male', 'female', u'Wye'),
 ('male', 'female', u'Vite'),
 ('male', 'female', u'Haley'),
 ('male', 'female', u'Karl'),
 ('male', 'female', u'Meryl'),
 ('male', 'female', u'Alex'),
 ('male', 'female', u'Tobie'),
 ('male', 'female', u'Barnie'),
 ('male', 'female', u'Iggie'),
 ('male', 'female', u'Nevile'),
 ('male', 'female', u'Witty'),
 ('male', 'female', u'Sandy'),
 ('male', 'female', u'Timothee'),
 ('male', 'female', u'Garey'),
 ('male', 'female', u'Bennie'),
 ('male', 'female', u'Isaiah'),
 ('male', 'female', u'Richie'),
 ('male', 'female', u'Che'),
 ('male', 'female', u'Wesley'),
 ('male', 'female', u'Hari'),
 ('male', 'female', u'Kingsly'

**Feature Set 4: First and last letter only**  
*Note: this was one of the optimal Naive Bayes implementations*

In [27]:
def gender_features4(word):
    return {'suffix1': word[-1:], 'prefix1': word[:1]}

features4 = [(gender_features4(x), gender) for x, gender in train.values]
me4 = nltk.classify.maxent.MaxentClassifier.train(features4, 
                                                  trace = 10, 
                                                  max_iter = 10, 
                                                  min_lldelta = 0.05, 
                                                  algorithm = 'gis')

  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.49096        0.768
             3          -0.42445        0.778
         Final          -0.39486        0.781


In [28]:
validation_features4 = [(gender_features4(x), gender) for x, gender in validation.values]
print(nltk.classify.accuracy(me4, validation_features4))
test_features4 = [(gender_features4(x), gender) for x, gender in test.values]
print(nltk.classify.accuracy(me4, test_features4))

0.792
0.74


**Feature Set 5: Last letter, length of name, number of vowels**  
*Note: this was one of the optimal Decision Tree implementations*

In [29]:
def numVowels(word):
    # counts lowercase vowels only (case-sensitive)
    vowels = ['a','e','i','o','u']
    return sum(word.count(v) for v in vowels)


def gender_features5(word):
    return {'length': len(word), 'suffix1': word[-1:], 'vowels': numVowels(word)}

features5 = [(gender_features5(x), gender) for x, gender in train.values]
me5 = nltk.classify.maxent.MaxentClassifier.train(features5, 
                                                  trace = 10, 
                                                  max_iter = 10, 
                                                  min_lldelta = 0.05, 
                                                  algorithm = 'gis')



  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.51806        0.722
             3          -0.45559        0.760
         Final          -0.42421        0.767
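One detail about the `vowels` feature: `numVowels` compares against lowercase letters, so a capitalized leading vowel is not counted. The sketch below contrasts that behavior with a case-insensitive variant. `num_vowels_ci` is a hypothetical name, and switching to it would change the feature values and therefore the accuracies reported here.

```python
from __future__ import print_function

VOWELS = 'aeiou'

def num_vowels(word):
    """Same logic as numVowels above: counts lowercase vowels only."""
    return sum(word.count(v) for v in VOWELS)

def num_vowels_ci(word):
    """Hypothetical case-insensitive variant."""
    return sum(word.lower().count(v) for v in VOWELS)

# Names taken from the outputs earlier in the notebook.
for name in ('Orton', 'Alex', 'Isaiah'):
    print(name, num_vowels(name), num_vowels_ci(name))
```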


In [30]:
validation_features5 = [(gender_features5(x), gender) for x, gender in validation.values]
print(nltk.classify.accuracy(me5, validation_features5))
test_features5 = [(gender_features5(x), gender) for x, gender in test.values]
print(nltk.classify.accuracy(me5, test_features5))

0.78
0.746
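To wrap up the comparison, the sketch below collects the accuracies reported above and selects a feature set on validation accuracy only, which is the role we gave the validation set at the start. The dictionary keys are our own labels. Feature set 2 is left out because its only validation number came from `error_analysis`, which extracts features with `gender_features`.

```python
from __future__ import print_function

# (validation accuracy, test accuracy) as reported in the cells above.
# Feature set 1's validation accuracy is 1 - 0.19 from the error analysis.
results = {
    'features1: suffix1 + suffix2': (0.81, 0.776),
    'features3: suffix1 + suffix2 + prefix3': (0.832, 0.8),
    'features4: suffix1 + prefix1': (0.792, 0.74),
    'features5: length + suffix1 + vowels': (0.78, 0.746),
}

# Select on validation accuracy only; the test score is reported, not optimized.
best = max(results, key=lambda k: results[k][0])
print("selected:", best)
print("validation = {:.3f}, test = {:.3f}".format(*results[best]))
```

In every row the test accuracy is a few points below the validation accuracy, which is roughly what we would expect after choosing features by looking at validation results.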
