In [64]:
import nltk
import random
import numpy as np
from itertools import repeat
from sklearn.model_selection import train_test_split
from nltk.corpus import names
import re

In [65]:
# confirm male and female txt files exist
names.fileids()

[u'female.txt', u'male.txt']

In [66]:
# load male and female  name files from nltk.names; store in people list
males = [n for n in names.words('male.txt')] 
females = [n for n in names.words('female.txt')] 
people = males + females

# make gender list
gender = list(repeat('male',len(males))) + \
list(repeat('female',len(females)))


## Feature Generation

Our raw dataset contains only one predictor variable.  This feature indicates the first name of each observed individual in our data.  Unfortunately, the name variable--in its unprocessed form--will not be very useful for building an accurate gender classification model.  A model constructed using first names only will struggle to predict gender for names not explicitly identified in the training data.  Also, our dataset contains a unique collection of first names within each gender class, but has a small number of overlapping names between gender types.  Without additional feature engineering, our models will not be able to make reasonable guesses when encountering gender-neutral names.  

The textbook, *Natural Language Processing with Python*, provides a number of helpful suggestions for extracting new features from first names for gender classification purposes:
* isolate the first letter of each name
* isolate the last letter of each name
* isolate the last two letters of each name.

We decided to include these features as possible predictors for our models.  These extracted features can reveal common patterns in the prefixes and suffixes of first names that are often associated with a particular gender.  For instance, many female first names end with the letter "a".  

Building on the text's suggestions, we also extracted the following features:
* the first two letters of each name
* the first three letters of each name
* the last three letters of each name

There were a handful of additional features that we thought might be relevant for gender identification:
* the number of vowels in each name
* The first two non-contiguous letters of each name 
* The first three non-contiguous letters
* The last two non-contiguous letters
* the last three non-contiguous letters

We also found an [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4446333/) from an academic journal that characterizes certain letters as "round" and others as "sharp".  The authors contend that round letters tend to be associated with female names, while sharp letters are more often associated with male names.  Using this information, we created the following potential features:
* the number of round, consonant letters ("b","m", "l", and "n") in each name
* the number of sharp, consant letters ("k", "p", and "t") per name
* the number of round vowels ("u" and "o") in each first name.

Finally, we found a Wordpress [blog entry](https://debuk.wordpress.com/tag/feminine-suffixes/) that discusses common female suffixes in first names.  These include names ending in ending in "a", "y", "ie", and "ah".  We used this information to create a new binary variable that identifies if one of these suffixes is present in a given name.

In the script below, we create a function, *gender_features()*, that returns a dictionary of the extracted features described in this section.  The function has multiple arguments:
* a first name to use for extracting features
* a list of desired features.  This argument can be used to vary the features that are returned by the function 

  
   

In [67]:
# produce features
def gender_features(word, *args):
    """
    function returns dictionary of features
        word: name to extract features from
        args:  one or more strings to specify desired features, including:
                'length','first','first2','first3', 'last', 'last2', 'last3',
                'every_other2_beg','every_other3_beg', 'every_other2_end', 'every_other3_end',
                'vowel_ct', 'round_cons_ct', 'sharp_cons_ct','round_vowel_ct',
                trad_female_end'
               
    """
    
    gf = {}
    
    # word length
    gf['length'] = len(word)
   
    # first letters
    gf['first'] = word[0].lower()
    gf['first2'] = word[0:2].lower()
    gf['first3'] = word[0:3].lower() if gf['length'] >2  else word[0:2].lower()
    
    gf['two_letters'] = 'y' if len(word) == 2 else 'n'
    
    # last letters
    gf['last'] = word[-1].lower()
    gf['last2'] = word[-2:].lower()
    gf['last3'] = word[-3:].lower() if gf['length'] >2  else word[-2:].lower()
    
    # every other beg
    gf['every_other2_beg'] = word[0]+word[2] if gf['length'] > 2 else word[0]
    gf['every_other3_beg'] = gf['every_other2_beg']+word[4]  if gf['length'] > 4 else \
    gf['every_other2_beg']
    
    # every other end
    gf['every_other2_end'] = word[-3]+word[-1] if gf['length'] > 2 else word[-1]
    gf['every_other3_end'] = word[-5]+gf['every_other2_end']  if gf['length'] > 4 else \
    gf['every_other2_end']
    
    # count: vowels, rounded consonants, sharp consonants
    for letter in word:
        # count vowels
        if letter in 'aeiou':
            gf['vowel_ct'] = gf.get('vowel_ct',0) + 1
        # count rounded consonants
        if letter in 'bmln':
            gf['round_cons_ct'] = gf.get('round_cons_ct',0) + 1
        # count sharp consonants
        if letter in 'k,p,t':
            gf['sharp_cons_ct'] = gf.get('sharp_cons_ct',0) + 1
        # count rounded vowels
        if letter in 'uo':
            gf['round_vowel_ct'] = gf.get('round_vowel_ct',0) + 1
            
    # traditional feminine ending, 'y' or 'n'
    gf['trad_female_end'] = 'y' if gf['last2'] in ['ie','ah'] or \
    gf['last'] in ['a','y'] else 'n'
    
    ## patterns: double consonant ends in y: Binny, Daffy...
    #gf['consonant_y'] = 'y' if bool(re.search(r"([b-df-hj-np-tv-z])\1{1,}y$", word)) else 'n'
    
    # generate dictionary subset
    return(dict((k, gf[k]) for k in args if k in gf))
    
       

Here is example output from the gender_features() function:

In [68]:
# specify which features to use
myargs = ['length','first','first2','first3', 'last', 'last2', 'last3', \
          'every_other2_beg','every_other3_beg', 'every_other2_end', 'every_other3_end', \
          'vowel_ct', 'round_cons_ct', 'sharp_cons_ct','round_vowel_ct', \
          'trad_female_end']

# specify name, and argument list 
gender_features('Binny', *myargs)

{'every_other2_beg': 'Bn',
 'every_other2_end': 'ny',
 'every_other3_beg': 'Bny',
 'every_other3_end': 'Bny',
 'first': 'b',
 'first2': 'bi',
 'first3': 'bin',
 'last': 'y',
 'last2': 'ny',
 'last3': 'nny',
 'length': 5,
 'round_cons_ct': 2,
 'trad_female_end': 'y',
 'vowel_ct': 1}

In [69]:
# split into test and train, with test file containing 1000 samples
people_train, people_test, gender_train, gender_test =  \
train_test_split(people, gender, test_size=1000, random_state=4)

# split test into two separate components of 500 each: test and devtest
people_test, people_devtest, gender_test, gender_devtest = \
train_test_split(people_test, gender_test, test_size=500, random_state=4)

# list of tuples, gender features, gender
train_set = list(zip(map(lambda d: gender_features(d, *myargs), people_train),gender_train))
devtest_set = list(zip(map(lambda d: gender_features(d, *myargs), people_devtest),gender_devtest))
test_set = list(zip(map(lambda d: gender_features(d, *myargs), people_test),gender_test))


# list of tuples, names, gender
train_names = list(zip(people_train,gender_train))
devtest_names = list(zip(people_devtest,gender_devtest))
test_names = list(zip(people_test, gender_test))

# train naive bayes classifier 
classifier = nltk.NaiveBayesClassifier.train(train_set)


In [70]:
# look at most informative features
classifier.show_most_informative_features(50)

Most Informative Features
                   last2 = u'na'          female : male   =     94.0 : 1.0
        every_other2_end = u'la'          female : male   =     77.7 : 1.0
                   last2 = u'la'          female : male   =     68.3 : 1.0
        every_other2_end = u'ea'          female : male   =     62.7 : 1.0
        every_other2_end = u'ia'          female : male   =     54.2 : 1.0
                    last = u'a'           female : male   =     36.9 : 1.0
                   last2 = u'ia'          female : male   =     36.2 : 1.0
                   last2 = u'ra'          female : male   =     33.9 : 1.0
                    last = u'k'             male : female =     30.6 : 1.0
                   last2 = u'us'            male : female =     29.1 : 1.0
                   last2 = u'ta'          female : male   =     28.9 : 1.0
                   last2 = u'rd'            male : female =     27.2 : 1.0
        every_other3_end = u'aia'         female : male   =     27.0 : 1.0

In [71]:
# classifer accuracy on validation set
print(nltk.classify.accuracy(classifier, devtest_set))

0.84


In [72]:
# look at names that were mis-classified
errors = []
for (name, tag) in devtest_names:
    #print(name)
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

print('actual, guess, name: \n')
for x in errors:
    print(x)
        
        

actual, guess, name: 

('male', 'female', u'Abbie')
('male', 'female', u'Westbrooke')
('male', 'female', u'Clayborne')
('male', 'female', u'Leo')
('male', 'female', u'Webster')
('male', 'female', u'Ivor')
('male', 'female', u'Reese')
('male', 'female', u'Bartlett')
('male', 'female', u'Randi')
('male', 'female', u'Orton')
('male', 'female', u'Lucian')
('male', 'female', u'Spud')
('male', 'female', u'Adolpho')
('male', 'female', u'Aguste')
('male', 'female', u'Matthew')
('male', 'female', u'Willdon')
('male', 'female', u'Barnard')
('male', 'female', u'Silvester')
('male', 'female', u'Ernest')
('male', 'female', u'Niles')
('male', 'female', u'Garfield')
('male', 'female', u'Lucien')
('male', 'female', u'Dimitrios')
('male', 'female', u'Jeffry')
('male', 'female', u'Davide')
('male', 'female', u'Parry')
('male', 'female', u'Damien')
('male', 'female', u'Ephrem')
('male', 'female', u'Dawson')
('male', 'female', u'Walker')
('male', 'female', u'Tarrant')
('male', 'female', u'Clarance')
('mal

### References
http://www.nltk.org/howto/corpus.html

In [73]:
# show number of mislabeled names 
print "Mislabeled names: ", len(errors)

Mislabeled names:  202
