# Project 3 - Gender Classifier
### Authors: John Mazon, LeTicia Cancel, Bharani Nitalla

**Video:**

**Assignment:** Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


In [51]:
# Libraries
import nltk
from nltk.corpus import names
import random
from nltk.classify import apply_features

## Load Data

Using the code found in chapter 6 of Natural Language Processing with Python we are going to do Supervised Classification using the Names corpus. We labeled each name by gender when loading the names files into a list and then shuffled the list to make the list order random. We then check the length of the list, and we can see that we have a total of 7,944 names and each name is labeled 'male' or 'female' based on which text file it was imported from.

In [52]:
# Split male and female names
names = ([(name, 'male') for name in names.words('male.txt')] +
        [(name, 'female') for name in names.words('female.txt')])

In [53]:
# shuffle the names list in random order
random.shuffle(names)

In [54]:
names[:10]

[('Westleigh', 'male'),
 ('Randolph', 'male'),
 ('Wayland', 'male'),
 ('Tabitha', 'female'),
 ('Rees', 'male'),
 ('Pippy', 'female'),
 ('Sarette', 'female'),
 ('French', 'male'),
 ('Neilla', 'female'),
 ('Silva', 'female')]

In [55]:
len(names)

7944

## Classify Names Dataset

When creating our names classifier, we must decide which name features are relevant in predicting if a name belongs to a male or female. We begin first by using exactly what is in the textbook to see how this performs on our dataset before exploring features of our own. 

The gender_features function takes each name and takes three features we specified and loads them to a dictionary. The three features we will look at are last letter, length of name, and first name. We then test the function using the name John to make sure the features dictionary returns the correct information.

In [56]:
def gender_features(word):
    features = {}
    features['last_letter'] = word[-1]
    features['word_len'] = len(word) 
    features['first_letter'] = word[0]
    return features

In [57]:
gender_features('John')

{'last_letter': 'n', 'word_len': 4, 'first_letter': 'J'}

Now that we see the function works for our test name John, we run the function for every name in the names list and save this dictionary to the featuresets variable. So, every name in the list will have dictionary values like the above test we did with the name John. The data is split into a training and test set using an equal number of names for each set. We then use the training set to train a Naive Bayes classifier. When we print the accuracy of this classifier, we can see that it has a 79% accuracy. We will continue with this training set and then in part 2 we will add more features to the classifier so we can raise the accuracy score.  

In [58]:
featuresets = [(gender_features(n), g) for n,g in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, test_set))

0.794


Using the show_most_imformative_features function we can see that the last letter is the most effective in distinguishing the gender. We can see the likelihood ratios for 5 letters. Names that end in the letter A are 34 times more likely to be a female name than male and names ending in the letter V are also more likely to be female. 

In [59]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     34.4 : 1.0
             last_letter = 'k'              male : female =     30.9 : 1.0
             last_letter = 'f'              male : female =     15.1 : 1.0
             last_letter = 'p'              male : female =     11.1 : 1.0
             last_letter = 'v'              male : female =     11.1 : 1.0


We modify the training and test sets using the apply_features function. According to the textbook this is the best way to store the results of the gender_features function when you have a large corpora. We will apply the features in a different way later, but this was good practice. 

In [60]:
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[500:])

Now we split the names list according to the project guidelines, 500 test names, 500 devtest_names, and the remainder as training names. 

In [61]:
train_names = names[6900:]
devtest_names = names[500:1000]
test_names = names[:500]

We run the gender_features for each of the names sets, run the Naives Bayes classifier and then check the accuracy. The accuracy for the devtest_set is 77% which is slightly lower than the test_set classifier we inspected earlier. 

In [62]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.77


## Error Analysis

Using the devtest_names set we do an error analysis to see how many times the classifier incorrectly predicts name genders. All of the classifier errors are stored in the error variable and we can see that 115 incorrect predictions were made which is 23% of the devtest set. 

In [63]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
print('Number of errors: ',len(errors))
print('Number of devtest names: ',len(devtest_names))
print('Error rate: ', len(errors)/len(devtest_names))

Number of errors:  115
Number of devtest names:  500
Error rate:  0.23


We examine each name in the error set so we can try to find patterns that can be used in the new gender_features function.

In [64]:
for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Beitris                       
correct=female   guess=male     name=Bel                           
correct=female   guess=male     name=Carolann                      
correct=female   guess=male     name=Carolin                       
correct=female   guess=male     name=Cathyleen                     
correct=female   guess=male     name=Charmian                      
correct=female   guess=male     name=Clovis                        
correct=female   guess=male     name=Doris                         
correct=female   guess=male     name=Dot                           
correct=female   guess=male     name=Edith                         
correct=female   guess=male     name=Eilis                         
correct=female   guess=male     name=Elyn                          
correct=female   guess=male     name=Ester                         
correct=female   guess=male     name=Esther                        
correct=female   guess=male     name=Ethelyn    

We modify the gender_features function to also look at vowels in each name, the first two letters, the last two letters, and the length of the name. Now we will perform all the steps to see if we get better results.

In [65]:
def gender_features(word):
    features = {}
    if any(vow in word for vow in 'aeiou'):
        features['vowel'] = True
    else:
        features['vowel'] = False
    features['last_letter'] = word[-1]
    features['letter_two'] = word[-2]
    features['word_len'] = len(word) 
    features['first_letter'] = word[0]
    features['second_letter'] = word[1]
    return features

The accuracy remains the same at 77%.

In [66]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.772


When we do our error analysis the number of errors is also very similar to the first test.

In [67]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
print('Number of errors: ',len(errors))
print('Number of devtest names: ',len(devtest_names))
print('Error rate: ', len(errors)/len(devtest_names))

Number of errors:  114
Number of devtest names:  500
Error rate:  0.228


## Final Performance Test

Using the test_set we will check the accuracy and do an error analysis.

In [68]:
print(nltk.classify.accuracy(classifier, test_set))

0.798


In [69]:
errors = []
for (name, tag) in test_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
print('Number of errors: ',len(errors))
print('Number of devtest names: ',len(test_names))
print('Error rate: ', len(errors)/len(test_names))

Number of errors:  101
Number of devtest names:  500
Error rate:  0.202


In [70]:
for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Aileen                        
correct=female   guess=male     name=Ajay                          
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Alis                          
correct=female   guess=male     name=Avivah                        
correct=female   guess=male     name=Beret                         
correct=female   guess=male     name=Brandice                      
correct=female   guess=male     name=Carley                        
correct=female   guess=male     name=Christean                     
correct=female   guess=male     name=Chrystel                      
correct=female   guess=male     name=Cristal                       
correct=female   guess=male     name=Delilah                       
correct=female   guess=male     name=Edin                          
correct=female   guess=male     name=Esme                          
correct=female   guess=male     name=Evaleen    

## Conclusion

The testset had a higher accuracy score than the devtest and the error percentage also decreased. By modifying the features function the expectation is that it will increase the accuracy and lower the error showing us that it is able to make good predictions. I think it is possible to improve on this even more. I would start by modifying the feature method to look at the number of vowels present in the name instead of just a True/False if a vowel exists. 