In [201]:
import nltk
import random
import re
from nltk.corpus import names
from nltk.metrics import ConfusionMatrix

### Load and Format Data

In [202]:
#nltk.download('names')

The dataset is imbalanced: approximately 63% of the names are female (5,001 of 7,944). However, the imbalance is not severe enough to warrant downsampling.

In [273]:
print("Number of male names:", len(names.words('male.txt')))
print("Number of female names:", len(names.words('female.txt')))

Number of male names: 2943
Number of female names: 5001
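
As a quick check of the class balance, the shares can be worked out directly from the two counts printed above. This is a minimal sketch that hard-codes those counts rather than reloading the corpus; the variable names are ours.

```python
# Counts copied from the output above.
n_male = 2943
n_female = 5001

total = n_male + n_female
female_share = n_female / total
male_share = n_male / total

print(f"Total names:  {total}")
print(f"Female share: {female_share:.1%}")
print(f"Male share:   {male_share:.1%}")
```

A classifier that always predicts 'female' would therefore score roughly the female share in accuracy, which is a useful baseline to keep in mind when reading the results later on.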


We convert the text into a list of tuples, with each tuple containing a name and the associated gender. This format will make it easier to extract features and eventually fit nltk models.

In [274]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

Because labeled_names lists all male names followed by all female names, we shuffle the list of tuples so that the training and test sets each contain an approximate balance of genders.

In [275]:
random.shuffle(labeled_names)
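
Note that the shuffle above is unseeded, so the split, and therefore every accuracy figure that follows, will differ slightly from run to run. If reproducibility matters, one option is to shuffle with a seeded generator. The sketch below uses a small stand-in list and an arbitrary seed; neither comes from the original notebook.

```python
import random

# Hypothetical stand-in for labeled_names; the real list comes from the names corpus.
toy_labeled_names = [
    ("Adam", "male"), ("Bob", "male"), ("Carl", "male"),
    ("Dana", "female"), ("Erica", "female"), ("Fiona", "female"),
]

SEED = 42  # arbitrary value chosen for illustration


def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)
    result = list(items)
    rng.shuffle(result)
    return result


first = shuffled_copy(toy_labeled_names, SEED)
second = shuffled_copy(toy_labeled_names, SEED)

# The same seed yields the same order, so downstream splits are repeatable.
print(first == second)
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the seeding local and avoids affecting other code that relies on the global generator.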

### Create Word Features

We specify all the features we want to extract from the names and include in our models. The only feature we tested but did not include is the number of vowels in the name, since it consistently decreased accuracy. The features included are: last letter, last two letters, last three letters, first letter, first two letters, first three letters, and word length. The general form of this function comes from examples in [chapter 6](https://www.nltk.org/book/ch06.html) of *Natural Language Processing with Python*.

In [276]:
def gender_features(word):
    return {'last_letter': word[-1],
            'last_two_letters': word[-2:],
            'last_three_letters': word[-3:],
            'first_letter': word[:1],
            'first_two_letters': word[:2],
            'first_three_letters': word[:3],
            'word_length': len(word)}
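
To make the feature dictionary concrete, here is a small sketch that applies the same function to a single example name. The `count_vowels` helper is hypothetical: the notebook does not show how the discarded vowel-count feature was computed, so this is only one plausible way it might have been done.

```python
def gender_features(word):
    # Same feature extractor as in the cell above.
    return {'last_letter': word[-1],
            'last_two_letters': word[-2:],
            'last_three_letters': word[-3:],
            'first_letter': word[:1],
            'first_two_letters': word[:2],
            'first_three_letters': word[:3],
            'word_length': len(word)}


def count_vowels(word):
    # Hypothetical helper for the feature that was tested and then dropped.
    return sum(1 for ch in word.lower() if ch in "aeiou")


features = gender_features("Amanda")
for key, value in features.items():
    print(f"{key:>20}: {value!r}")

print(f"{'vowel_count (unused)':>20}: {count_vowels('Amanda')}")
```

For names shorter than three letters, the slices simply return the whole name, so the function does not raise on short inputs as long as the name is non-empty.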

Using the function above, for each name in the labeled_names list we create a tuple with a dictionary of features and the gender.

In [277]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

### Create Train, Dev Test, and Test Sets

Create dev_test_set, test_set, train_set datasets as specified in the project instructions.

In [278]:
dev_test_set, test_set, train_set = featuresets[:500], featuresets[500:1000], featuresets[1000:]

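Confirm that the length of each set is correct. We note a small discrepancy between the expected 6900 words in the train_set and the actual length of 6944: the corpus contains 2943 + 5001 = 7944 names, and removing the 1000 names reserved for the dev-test and test sets leaves 6944.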

In [279]:
print(len(dev_test_set))
print(len(test_set))
print(len(train_set))

500
500
6944


### Naive Bayes Classifier (final model)

In [280]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

#### Accuracy of Dev Test Set

In [281]:
nltk.classify.accuracy(classifier, dev_test_set)

0.854

#### Accuracy of Test Set (final predictions)

We use accuracy as our evaluation metric since there is no preference between type I and type II errors. Further, the confusion matrix below shows no strong bias towards either kind of error: 47 female names are misclassified as male and 35 male names are misclassified as female.

In [282]:
nltk.classify.accuracy(classifier, test_set)

0.836

#### Confusion Matrix

In [283]:
test_genders = [gender for (_, gender) in test_set]
test_features = [features for (features, _) in test_set]

In [284]:
classify_test = [classifier.classify(features) for features in test_features]

In [285]:
cm = ConfusionMatrix(test_genders, classify_test)
print(cm)

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<269> 47 |
  male |  35<149>|
-------+---------+
(row = reference; col = test)
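
The four cells of the matrix are enough to recover the test accuracy and to look at per-class precision and recall. The sketch below hard-codes the counts printed above instead of reusing the `cm` object; the dictionary layout and variable names are ours.

```python
# Counts copied from the confusion matrix above: (reference, predicted) -> count.
cm_counts = {
    ("female", "female"): 269,
    ("female", "male"): 47,
    ("male", "female"): 35,
    ("male", "male"): 149,
}
labels = ["female", "male"]

total = sum(cm_counts.values())
correct = sum(cm_counts[(label, label)] for label in labels)
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.3f}")

metrics = {}
for label in labels:
    true_pos = cm_counts[(label, label)]
    reference_total = sum(cm_counts[(label, pred)] for pred in labels)  # row sum
    predicted_total = sum(cm_counts[(ref, label)] for ref in labels)    # column sum
    metrics[label] = {
        "recall": true_pos / reference_total,
        "precision": true_pos / predicted_total,
    }
    print(f"{label:>6}: precision = {metrics[label]['precision']:.3f}, "
          f"recall = {metrics[label]['recall']:.3f}")
```

The accuracy computed this way matches the 0.836 reported for the test set. Recall is somewhat lower for male names than for female names, which is consistent with male being the minority class.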



#### Most Important Features

Here we review the most informative features and see that the last two letters of a name dominate: seven of the ten most informative features are last_two_letters values. Nevertheless, we keep all features (except the number of vowels) in the model, since iterative testing showed that each contributes to accuracy. We also uncover some interesting patterns: names ending in 'na' and 'la' have female-to-male likelihood ratios of 93.6 and 69.3, respectively, meaning these endings occur proportionally that many times more often among female names than among male names in the training set.

In [295]:
classifier.show_most_informative_features(10)

Most Informative Features
        last_two_letters = 'na'           female : male   =     93.6 : 1.0
        last_two_letters = 'la'           female : male   =     69.3 : 1.0
        last_two_letters = 'ld'             male : female =     39.5 : 1.0
             last_letter = 'a'            female : male   =     37.1 : 1.0
        last_two_letters = 'ia'           female : male   =     36.4 : 1.0
        last_two_letters = 'ra'           female : male   =     34.2 : 1.0
             last_letter = 'k'              male : female =     29.9 : 1.0
        last_two_letters = 'rt'             male : female =     29.7 : 1.0
        last_two_letters = 'us'             male : female =     27.4 : 1.0
      last_three_letters = 'ana'          female : male   =     25.2 : 1.0
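
Each ratio in this listing compares how likely a feature value is under one label versus the other. The sketch below illustrates the idea with smoothed relative frequencies. To our understanding NLTK's Naive Bayes trainer uses expected likelihood estimation (add 0.5) by default, but treat that as an assumption here. All counts in the example are hypothetical and are not taken from the names corpus.

```python
def ele_prob(count, total, bins, gamma=0.5):
    """Smoothed estimate of P(feature value | label): add gamma to every bin."""
    return (count + gamma) / (total + gamma * bins)


# Hypothetical training counts, for illustration only.
female_total = 4000   # female names in a made-up training set
male_total = 2000     # male names in a made-up training set
female_with_na = 300  # female names ending in 'na'
male_with_na = 1      # male names ending in 'na'
bins = 200            # distinct last_two_letters values seen (made up)

p_na_given_female = ele_prob(female_with_na, female_total, bins)
p_na_given_male = ele_prob(male_with_na, male_total, bins)

likelihood_ratio = p_na_given_female / p_na_given_male
print(f"P('na' | female) = {p_na_given_female:.4f}")
print(f"P('na' | male)   = {p_na_given_male:.4f}")
print(f"female : male    = {likelihood_ratio:.1f} : 1.0")
```

Because the ratio is driven by the rarer class's count, values that almost never occur for one gender produce very large ratios even if they are not especially common overall.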


### Other Classifiers Tested (but not chosen for final model)

#### Decision Trees Classifier

In [287]:
dt_classifier = nltk.DecisionTreeClassifier.train(train_set)

In [288]:
nltk.classify.accuracy(dt_classifier, dev_test_set)

0.74

In [289]:
nltk.classify.accuracy(dt_classifier, test_set)

0.726

In [290]:
#print(dt_classifier.pseudocode(depth=1))

#### Maximum Entropy Classifier

In [291]:
me_classifier = nltk.MaxentClassifier.train(train_set, trace=1)

  ==> Training (100 iterations)


In [292]:
nltk.classify.accuracy(me_classifier, dev_test_set)

0.834

In [293]:
nltk.classify.accuracy(me_classifier, test_set)

0.828

### How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Generally, a model performs worse on the test set than on the training set, and the size of the gap depends on how prone the particular model and dataset are to overfitting. When comparing dev-test performance with test performance, the difference in evaluation metrics is expected to be smaller, since the model was not fitted on either set. However, if one continually tweaks a model and validates those tweaks on the dev-test set, the choices become tuned to that set and one would expect degraded performance on the test set. For this project, little feature or parameter tweaking was needed because the classification task is relatively simple, so the accuracies on the two sets do not deviate significantly (0.854 on dev-test versus 0.836 on test for the Naive Bayes classifier), and the direction in which they deviate appears to be random.
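
One rough way to judge whether the gap between the two accuracies is meaningful is to compare it with the sampling noise expected from sets of 500 names. The sketch below uses the Naive Bayes accuracies reported above and a normal approximation to the binomial. It assumes the two sets are independent random samples, which is a simplification rather than something the notebook verifies.

```python
import math

n = 500            # size of both the dev-test set and the test set
acc_dev = 0.854    # Naive Bayes accuracy on the dev-test set (from above)
acc_test = 0.836   # Naive Bayes accuracy on the test set (from above)


def standard_error(accuracy, sample_size):
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1 - accuracy) / sample_size)


se_dev = standard_error(acc_dev, n)
se_test = standard_error(acc_test, n)

# Standard error of the difference between two independent estimates.
se_diff = math.sqrt(se_dev ** 2 + se_test ** 2)
z = (acc_dev - acc_test) / se_diff

print(f"difference     = {acc_dev - acc_test:.3f}")
print(f"se(difference) = {se_diff:.3f}")
print(f"z              = {z:.2f}")
```

The difference is smaller than one standard error of the difference, so under these assumptions it is well within what chance alone would produce.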

### References

*Natural Language Processing with Python* by Steven Bird, Ewan Klein, and Edward Loper<br> 
nltk.classify.MaxentClassifier.train: https://tedboy.github.io/nlps/generated/generated/nltk.classify.MaxentClassifier.train.html