# Gender classifier

This is adapted from the NLTK Book, Chapter 6: [Gender classification](http://www.nltk.org/book/ch06.html)

For this exercise, I load the data in a notebook, train the classifier, and pickle the model. The data is fetched with the NLTK downloader.

Downloading at run time is acceptable here because the data only needs to be available for this session, but an improvement would be to download the NLTK data files once and save them in the Domino project.

In [3]:
import nltk
# opens the interactive NLTK downloader; select the `names` corpus there
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
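
The improvement mentioned earlier (keeping the data files inside the project) could be sketched roughly as follows. This is a minimal sketch, not what the notebook ran: the helper name `fetch_names_corpus` and the `nltk_data` directory are assumptions for illustration. It relies on `nltk.download` accepting a package id plus `download_dir` and `quiet`, and on `nltk.data.path` being the list of directories NLTK searches.

```python
from pathlib import Path


def fetch_names_corpus(data_dir="nltk_data"):
    """Download only the `names` corpus into data_dir and register that path.

    `data_dir` is a hypothetical project-local folder, not one the notebook uses.
    """
    import nltk  # imported lazily so the helper can be defined before NLTK is needed

    target = Path(data_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)

    # Non-interactive download of a single package instead of the GUI downloader.
    ok = nltk.download("names", download_dir=str(target), quiet=True)

    # Tell NLTK to search the project-local folder as well.
    if str(target) not in nltk.data.path:
        nltk.data.path.append(str(target))
    return ok


# Usage (needs network access the first time):
# fetch_names_corpus()
```

Once the files are stored with the project, later sessions would only need the `nltk.data.path` entry and no download.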

In [4]:
from nltk.corpus import names
import random

# load the examples and label them

labeled_names = (
    [(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')]
)

random.shuffle(labeled_names)
total = len(labeled_names)

In [5]:
# create train, dev-test, and test sets:
# first 500 names for test, next 1000 for dev-test, the rest for training

train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
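
Because the shuffle above is unseeded, the split (and therefore the reported accuracy) changes from run to run. A reproducible variant might look like the sketch below. The function name `split_names`, its parameters, and the stand-in data are assumptions for illustration; the slice sizes mirror the cell above.

```python
import random


def split_names(labeled_names, n_test=500, n_devtest=1000, seed=0):
    """Shuffle a copy with a fixed seed, then slice into train/dev-test/test."""
    shuffled = list(labeled_names)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, so global state is unaffected
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test


# Stand-in data so the sketch runs without the corpus.
demo = [(f"name{i}", "male" if i % 2 else "female") for i in range(2000)]
train, devtest, test = split_names(demo)
print(len(train), len(devtest), len(test))  # 500 1000 500
```

With the notebook's data this would be called as `split_names(labeled_names)`, replacing the separate shuffle and slicing steps.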

In [6]:
# choose features
# per the tutorial, use the last letter and the last two letters
# also added a feature for the length of the word,
# which improved accuracy

def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'length' : len(word)
           }

train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
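
To make the feature dictionary concrete, here is what `gender_features` returns for a few inputs. The function body is copied from the cell above; the example names are only illustrations.

```python
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'length': len(word)}


print(gender_features("Josiah"))
# {'suffix1': 'h', 'suffix2': 'ah', 'length': 6}

# Slices never raise on short input, so a one-letter name still works.
print(gender_features("A"))
# {'suffix1': 'A', 'suffix2': 'A', 'length': 1}
```

Since none of the three features looks at the first letter, `"josiah"` and `"Josiah"` produce the same dictionary.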

In [7]:
print(nltk.classify.accuracy(classifier, test_set))

0.776
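
The notebook builds `devtest_set` but only reports test accuracy. The NLTK book uses the dev-test split for error analysis while features are being tuned, keeping the test set for the final number. A rough sketch of that loop is below. `collect_errors` and `LastLetterStub` are hypothetical names; the stub stands in for the trained classifier so the example runs without NLTK, and it only assumes the object has a `classify(featureset)` method, as NLTK classifiers do.

```python
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'length': len(word)}


def collect_errors(classifier, labeled_names):
    """Return sorted (true_label, guess, name) tuples for misclassified names."""
    errors = []
    for name, label in labeled_names:
        guess = classifier.classify(gender_features(name))
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)


class LastLetterStub:
    """Hypothetical stand-in: guesses 'female' when the name ends in a vowel."""

    def classify(self, features):
        return 'female' if features['suffix1'] in 'aeiou' else 'male'


sample = [('Anna', 'female'), ('John', 'male'),
          ('Joshua', 'male'), ('Megan', 'female')]
for label, guess, name in collect_errors(LastLetterStub(), sample):
    print(f"correct={label:<8} guess={guess:<8} name={name}")
```

With the notebook's objects the call would be `collect_errors(classifier, devtest_names)`, and `nltk.classify.accuracy(classifier, devtest_set)` gives the matching dev-test accuracy.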


In [8]:
# once features have been chosen, train on all names

full_set = [(gender_features(n), gender) for (n, gender) in labeled_names]
final_classifier = nltk.NaiveBayesClassifier.train(full_set)

In [31]:
import pickle
with open('gender_classifier.pickle', 'wb') as out:
    pickle.dump(final_classifier, out)
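
Reading the model back in a later session could look like this. The helper name `load_classifier` is an assumption; the default file name is the one written above.

```python
import pickle


def load_classifier(path='gender_classifier.pickle'):
    """Load a pickled model. Only unpickle files from a source you trust."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Usage in a new session (NLTK must be importable for unpickling to succeed):
# model = load_classifier()
# model.classify(gender_features('Josiah'))
```

The pickle stores only the classifier, not `gender_features`, so the loading code has to define the same feature function that was used for training.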

In [25]:
final_classifier.classify_many([gender_features(n) for n in ["josiah"]])

['female']
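
A bare label hides how confident the model is. NLTK classifiers also expose `prob_classify`, which returns a probability distribution over labels. The helper below is a small sketch of how that could be inspected; `label_probabilities` is a hypothetical name, and no output is shown because it depends on the shuffled training data.

```python
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'length': len(word)}


def label_probabilities(classifier, name):
    """Map each label to the classifier's estimated probability for `name`."""
    dist = classifier.prob_classify(gender_features(name))
    return {label: dist.prob(label) for label in dist.samples()}


# Usage with the trained model from the notebook:
# label_probabilities(final_classifier, "josiah")
# final_classifier.show_most_informative_features(5)
```

A probability close to 0.5 would suggest the suffix and length features carry little signal for that particular name.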