# Project 3 - Gender Classifier
### Authors: John Mazon, LeTicia Cancel, Bharani Nitalla

**Assignment:** Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


In [6]:
# Libraries
import nltk
from nltk.corpus import names
import random
from nltk.classify import apply_features

## Load Data

Using the code found in chapter 6 of Natural Language Processing with Python we are going to do Supervised Classification using the Names corpus. We labeled each name by gender when loading the names files into a list and then shuffuled the list to make the list order random. We then check the length of the list and we can see that we have a total of 7,944 names and each name is labeled 'male' or 'female' based on which text file it was imported from. 

In [7]:
# Split male and female names
names = ([(name, 'male') for name in names.words('male.txt')] +
        [(name, 'female') for name in names.words('female.txt')])

In [8]:
# shuffle the names list in random order
random.shuffle(names)

In [9]:
names[:10]

[('Kayle', 'female'),
 ('Whitman', 'male'),
 ('Deryl', 'male'),
 ('Alden', 'male'),
 ('Mace', 'male'),
 ('Denyse', 'female'),
 ('Aretha', 'female'),
 ('Dianemarie', 'female'),
 ('Fitz', 'male'),
 ('Pacifica', 'female')]

In [10]:
len(names)

7944

## Classifier Function

When creating our names classifier, we have to decide which name features are relevent in predicting if a name belongs to a male or female. We begin first by using exactly what is in the text book to see how this performs on our dataset before exploring features of our own. 

The gender_features function takes each name and takes three features we specified and loads them to a dictionary. The three features we will look at are last letter, length of name, and first name. We then test the funtion using the name John to make sure the features dictionary returns the correct information. 

In [11]:
def gender_features(word):
    features = {}
    features['last_letter'] = word[-1]
    features['word_len'] = len(word) 
    features['first_letter'] = word[0]
    return features

In [12]:
gender_features('John')

{'last_letter': 'n', 'word_len': 4, 'first_letter': 'J'}

Now that we see the function works for our test name John, we run the function for every name in the names list and save this dictionary to the featuresets variable. So every name in the list will have dictionary values similar to the above test we did with the name John. The data is split into a training and test set using an equal number of names for each set. We then use the training set to train a Naive Bayes classifier. When we print the accuracy of this classifier, we can see that it has a 77% accuracy. We will continue with this training set and then in part 2 we will add more features to the classifier so we can raise the accuracy score.  

In [13]:
featuresets = [(gender_features(n), g) for n,g in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [14]:
print (nltk.classify.accuracy(classifier, test_set))

0.79


Using the show_most_imformative_features function we can see that the last letter is the most effective in distinguishing the gender. We can see the likelihood ratios for 5 letters. Names that end in the letter K are 43.5 times more likely to be a male name than female and names ending in the letter V are more likely to be female. 

In [15]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     43.5 : 1.0
             last_letter = 'a'            female : male   =     35.6 : 1.0
             last_letter = 'f'              male : female =     17.2 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =     10.4 : 1.0


We modify the training and test sets using the apply_features function. According to the textbook this is the best way to store the results of the gender_features function when you have a large corpora. We will will apply the features in a different way later, but this was good practice. 

In [167]:
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[500:])

Now we split the names list according to the project guidlines, 500 test names, 500 devtest_names, and the remainder as training names. 

In [168]:
train_names = names[6900:]
devtest_names = names[500:1000]
test_names = names[:500]

We run the gender_features for each of the names sets, run the Naives Bayes classifier and then check the accuracy. The accuracy for the devtest_set is 77% which is slightly lower than the test_set classifier we inspected earlier. 

In [169]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.77


In [173]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
print(len(errors))

115


In [171]:
for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
    

correct=female   guess=male     name=Beilul                        
correct=female   guess=male     name=Bren                          
correct=female   guess=male     name=Carol                         
correct=female   guess=male     name=Carol-Jean                    
correct=female   guess=male     name=Carolin                       
correct=female   guess=male     name=Cass                          
correct=female   guess=male     name=Catlin                        
correct=female   guess=male     name=Charil                        
correct=female   guess=male     name=Chrystel                      
correct=female   guess=male     name=Clem                          
correct=female   guess=male     name=Corliss                       
correct=female   guess=male     name=Cris                          
correct=female   guess=male     name=Crystal                       
correct=female   guess=male     name=Eadith                        
correct=female   guess=male     name=Em         

In [174]:
errors = []
for (name, tag) in test_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
print(len(errors))

122


In [None]:
def gender_features(word):
    features = {}
    if any(vow in word for vow in 'aeiou'):
        features['vowel'] = True
    else:
        features['vowel'] = False
    features['last_letter'] = word[-1]
    features['letter_two'] = word[-2]
    #features['word_len'] = len(word) 
    #features['first_letter'] = word[0]
    features['second_letter'] = word[1]
    return features