## DATA 620 Project 3

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Project is due 4/11.

Source: Natural Language Processing with Python, exercise 6.10.2.

In [1]:
import nltk
from nltk.corpus import names
import random
import pandas as pd
from nltk import NaiveBayesClassifier
from nltk import DecisionTreeClassifier
from nltk import classify

In [2]:
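# get the labeled names data
# note: this rebinds `names` from the corpus reader to a list of (name, gender)
# pairs, so re-running this cell requires re-running the import cell first
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
# shuffle so that the slices taken later are random samples
random.shuffle(names)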
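
The shuffle above is unseeded, so the split and every accuracy figure below will vary from run to run. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for this project; `shuffled_copy`, the seed value 620, and the toy name list are illustrative and not part of the original notebook:

```python
import random

def shuffled_copy(items, seed=620):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, leaves the global one untouched
    copy = list(items)
    rng.shuffle(copy)
    return copy

# Toy stand-in for the labeled Names Corpus list built above.
toy_names = [('Ann', 'female'), ('Bob', 'male'), ('Cleo', 'female'), ('Dan', 'male')]
first = shuffled_copy(toy_names)
second = shuffled_copy(toy_names)
print(first == second)   # True
```

Shuffling a copy also leaves the original list in its corpus order, which makes it easier to re-split with a different seed later.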

## Gender features with NaiveBayesClassifier

### 1. gender_features : Last letter only

In [4]:
# gender_features returns the last letter of the name
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('David')

{'last_letter': 'd'}

In [5]:
# extracts the features
featuresets = [(gender_features(n), g) for (n,g) in names]

# split data to train and test sets
train_set, test_set = featuresets[500:], featuresets[:500]

# train a Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [6]:
classifier.classify(gender_features('Bridget'))

'male'

In [7]:
# evaluate the classifier
print(nltk.classify.accuracy(classifier, test_set))

0.778


In [9]:
# get the most informative gender features
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     43.1 : 1.0
             last_letter = 'a'            female : male   =     33.1 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0
             last_letter = 'd'              male : female =      9.7 : 1.0
             last_letter = 'm'              male : female =      8.7 : 1.0
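
Each row above is a likelihood ratio: how much more often a last letter occurs with one gender than with the other. A rough sketch of that calculation on a hypothetical toy list, using raw relative frequencies; NLTK smooths its probability estimates, so its figures will not match this unsmoothed version exactly:

```python
from collections import Counter

def last_letter_ratio(labeled_names, letter, label_a, label_b):
    """Ratio P(last letter | label_a) / P(last letter | label_b), unsmoothed."""
    totals = Counter(label for _, label in labeled_names)
    hits = Counter(label for name, label in labeled_names
                   if name[-1].lower() == letter)
    p_a = hits[label_a] / totals[label_a]
    p_b = hits[label_b] / totals[label_b]
    return p_a / p_b   # raises ZeroDivisionError if label_b never has this letter

# Hypothetical toy data, not drawn from the Names Corpus.
toy = [('Anna', 'female'), ('Ella', 'female'), ('Mary', 'female'), ('Ruth', 'female'),
       ('Joshua', 'male'), ('Mark', 'male'), ('John', 'male'), ('Paul', 'male')]
print(last_letter_ratio(toy, 'a', 'female', 'male'))   # 2.0
```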


In [11]:
# use apply_features() to build the feature sets lazily instead of storing them all in memory
from nltk.classify import apply_features
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[:500])

### 2. gender_features : Last letter, First letter

`gender_features2` returns the first and last letter of a name, plus a count and a presence flag for every letter of the alphabet.

In [12]:

def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
gender_features2('John')

{'firstletter': 'j',
 'lastletter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

Training a NaiveBayesClassifier on these features:

In [13]:

featuresets = [(gender_features2(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.794


In [16]:
# split data into training, dev-test, and test sets
train_names = names[1000:]
devtest_names = names[500:1000]
test_names = names[:500]

print(len(test_names), len(devtest_names), len(train_names), len(names))

500 500 6944 7944
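
The slicing above can be wrapped in a small helper so that the three subsets always follow the same layout and cannot overlap. A sketch, assuming the list has already been shuffled; `split_names` and its parameter names are hypothetical:

```python
def split_names(labeled_names, n_test=500, n_devtest=500):
    """Split into (train, devtest, test) using the same slice layout as above."""
    if len(labeled_names) <= n_test + n_devtest:
        raise ValueError("not enough names to leave a training set")
    test = labeled_names[:n_test]
    devtest = labeled_names[n_test:n_test + n_devtest]
    train = labeled_names[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in: 7944 placeholder entries, the corpus size printed above.
toy = [('name%d' % i, 'male') for i in range(7944)]
train, devtest, test = split_names(toy)
print(len(test), len(devtest), len(train))   # 500 500 6944
```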


Having divided the corpus into appropriate datasets, we train a model using the training
set, and then run it on the dev-test set.

In [17]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.774


Using the dev-test set, we can generate a list of the errors that the classifier makes when
predicting name genders:

In [18]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

We can then examine individual error cases where the model predicted the wrong label,
and try to determine what additional pieces of information would allow it to make the
right decision (or which existing pieces of information are tricking it into making the
wrong decision). The feature set can then be adjusted accordingly. With a dev-test accuracy
of 0.774, the names classifier that we have built makes 113 errors on the 500 dev-test names:

In [30]:
for (tag, guess, name) in sorted(errors)[:3]:
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))


correct=female   guess=male     name=Alfie                         
correct=female   guess=male     name=Alison                        
correct=female   guess=male     name=Alix                          
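
Reading the error list line by line gets slow with a hundred or so entries. One way to spot patterns is to count the errors by name ending. A sketch, assuming `errors` holds `(tag, guess, name)` tuples as built above; the helper name is hypothetical, and the sample list mixes the three names printed above with two made-up entries:

```python
from collections import Counter

def errors_by_suffix(errors, n=2):
    """Count misclassified names by (correct tag, last n letters)."""
    return Counter((tag, name[-n:].lower()) for tag, guess, name in errors)

# First three tuples come from the output above; the last two are made up.
sample_errors = [
    ('female', 'male', 'Alfie'),
    ('female', 'male', 'Alison'),
    ('female', 'male', 'Alix'),
    ('female', 'male', 'Madison'),   # hypothetical
    ('male', 'female', 'Joshua'),    # hypothetical
]
for (tag, suffix), count in errors_by_suffix(sample_errors).most_common(2):
    print('correct=%-8s suffix=%-3s errors=%d' % (tag, suffix, count))
```

Endings that account for many errors are candidates for a new feature, such as the two-letter suffix.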


### 3. gender_features : Last letter, Last two letters

In [21]:
def gender_features3(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

In [22]:
gender_features3('Shrek')

{'suffix1': 'k', 'suffix2': 'ek'}

In [59]:
train_set = [(gender_features3(n), g) for (n,g) in train_names]
devtest_set = [(gender_features3(n), g) for (n,g) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.78


### 4. gender_features : First Letter, Last two letters, check last letter is vowel or not

In [23]:
# uses the first letter instead of the last letter; keeps the two-character suffix
# and adds a flag for whether the last letter is a vowel (y counted as a vowel)
def gender_features4(word):
    return {'first_letter': word[0],
            'suffix': word[-2:],
            'last_is_vowel' : (word[-1] in 'aeiouy')}

In [24]:
gender_features4('Shrek')

{'first_letter': 'S', 'suffix': 'ek', 'last_is_vowel': False}

In [25]:
train_set = [(gender_features4(n), g) for (n,g) in train_names]
devtest_set = [(gender_features4(n), g) for (n,g) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.774


## Decision tree classifier

### 1. gender_features : Last letter only

In [26]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
dtclassifier = nltk.classify.DecisionTreeClassifier.train(train_set, entropy_cutoff=0,support_cutoff=0)
print(nltk.classify.accuracy(dtclassifier, devtest_set))

0.772
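
The `entropy_cutoff` argument above is read here as a threshold on how mixed the labels at a tree node may be before the tree keeps splitting; that reading is an assumption based on the parameter name. As a rough illustration of the quantity involved (a sketch of label entropy, not NLTK's internal code), entropy is 0 for a pure set of labels and 1 bit for an even two-way split:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(label_entropy(['male', 'male', 'male', 'male']))       # 0.0
print(label_entropy(['male', 'female', 'male', 'female']))   # 1.0
```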


### 2. gender_features : Last letter, First letter

In [27]:
train_set = [(gender_features2(n), g) for (n,g) in train_names]
devtest_set = [(gender_features2(n), g) for (n,g) in devtest_names]
test_set = [(gender_features2(n), g) for (n,g) in test_names]
dtclassifier2 = nltk.classify.DecisionTreeClassifier.train(train_set, entropy_cutoff=0,support_cutoff=0)
print(nltk.classify.accuracy(dtclassifier2, devtest_set))

0.784


### 3. gender_features : Last letter, Last two letters

In [65]:
train_set = [(gender_features3(n), g) for (n,g) in train_names]
devtest_set = [(gender_features3(n), g) for (n,g) in devtest_names]
test_set = [(gender_features3(n), g) for (n,g) in test_names]
dtclassifier3 = nltk.classify.DecisionTreeClassifier.train(train_set, entropy_cutoff=0,support_cutoff=0)
print(nltk.classify.accuracy(dtclassifier3, devtest_set))

0.768


### 4. gender_features : First Letter, Last two letters, check last letter is vowel or not

In [66]:
train_set = [(gender_features4(n), g) for (n,g) in train_names]
devtest_set = [(gender_features4(n), g) for (n,g) in devtest_names]
test_set = [(gender_features4(n), g) for (n,g) in test_names]
dtclassifier4 = nltk.classify.DecisionTreeClassifier.train(train_set, entropy_cutoff=0,support_cutoff=0)
print(nltk.classify.accuracy(dtclassifier4, devtest_set))

0.754
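
To choose a model for the final test run, it helps to put the dev-test accuracies reported above side by side. A small sketch using those printed figures; the Naive Bayes run with `gender_features2` is left out because it was scored on a different split, and the dictionary layout is just one convenient choice:

```python
# Dev-test accuracies copied from the cell outputs above.
devtest_accuracy = {
    ('NaiveBayes', 'gender_features'): 0.774,
    ('NaiveBayes', 'gender_features3'): 0.78,
    ('NaiveBayes', 'gender_features4'): 0.774,
    ('DecisionTree', 'gender_features'): 0.772,
    ('DecisionTree', 'gender_features2'): 0.784,
    ('DecisionTree', 'gender_features3'): 0.768,
    ('DecisionTree', 'gender_features4'): 0.754,
}

# Print the runs from best to worst.
for (model, features), acc in sorted(devtest_accuracy.items(),
                                     key=lambda item: item[1], reverse=True):
    print('%-12s %-17s %.3f' % (model, features, acc))

best = max(devtest_accuracy, key=devtest_accuracy.get)
print('best on dev-test:', best)
```

The gaps are small: 0.784 versus 0.78 is a difference of two names out of 500, so the ranking could easily change with a different shuffle.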


Error analysis with the decision tree classifier (`dtclassifier2`):


In [31]:
errors = []
for (name, tag) in devtest_names:
    guess = dtclassifier2.classify(gender_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

for (tag, guess, name) in sorted(errors)[:3]:
    print('correct=%-8s guess=%-8s name=%-30s' %(tag, guess, name))

correct=female   guess=male     name=Alfie                         
correct=female   guess=male     name=Alison                        
correct=female   guess=male     name=Alix                          
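
The assignment's last step, scoring the chosen classifier once on the held-out test set and comparing the result with dev-test, is not shown above. A sketch of that comparison, written against plain callables so that it runs without the corpus; `accuracy`, `last_letter`, `toy_classify`, and the toy lists are hypothetical stand-ins. With NLTK one would instead call `nltk.classify.accuracy` on the `devtest_set` and `test_set` built for the chosen feature function.

```python
def accuracy(classify, extract, labeled_names):
    """Fraction of names whose predicted label matches the gold label."""
    correct = sum(1 for name, tag in labeled_names
                  if classify(extract(name)) == tag)
    return correct / len(labeled_names)

def last_letter(name):
    """Feature extractor: lowercase last letter only."""
    return {'last_letter': name[-1].lower()}

def toy_classify(features):
    # Hypothetical rule standing in for a trained classifier.
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

# Hypothetical toy data, not drawn from the Names Corpus.
toy_devtest = [('Anna', 'female'), ('Mark', 'male'), ('Alison', 'female'), ('Joe', 'male')]
toy_test = [('Emily', 'female'), ('John', 'male'), ('Ruth', 'female'), ('Paul', 'male')]

dev_acc = accuracy(toy_classify, last_letter, toy_devtest)
test_acc = accuracy(toy_classify, last_letter, toy_test)
print('dev-test=%.2f test=%.2f gap=%+.2f' % (dev_acc, test_acc, test_acc - dev_acc))
```

Because the feature functions were tuned against the dev-test set, a somewhat lower score on the real test set would not be surprising.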
