# Text Classification Features and NLTK Classification Code #
This example is based on the NLTK book and uses the Names corpus to guess the gender of a name.

In [1]:
%matplotlib inline
import nltk
from nltk.corpus import names
import random

**A feature recognition function**

In [36]:
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('Samantha')

{'last_letter': 'a'}

**Create name datasets**

In [37]:
def create_name_data():
    male_names = [(name, 'male') for name in names.words('male.txt')]
    female_names = [(name, 'female') for name in names.words('female.txt')]
    allnames = male_names + female_names
    
    # Randomize the order of male and female names, and de-alphabetize
    random.shuffle(allnames)
    return allnames

names_data = create_name_data()

**Make Training, Development, and Test Data Sets**

We need a development set to try our features on before testing on the real test set, so let's redo our division of the data. Here the original items are divided in the same way as the feature sets, so we can keep track of the names.

In [38]:
# This function allows experimentation with different feature definitions.
# items is a list of (key, value) pairs from which features are extracted and training sets are made.
# The feature sets returned are lists of (feature dictionary, value) pairs.

# This function also optionally returns the training, development,
# and test items themselves, for the purposes of error checking.

def create_training_sets(feature_function, items, return_items=False):
    # Create the feature sets.  Call the function that was passed in.
    # For names data, key is the name, and value is the gender
    featuresets = [(feature_function(key), value) for (key, value) in items]
    
    # Divide the data into thirds.  Could divide in other proportions instead.
    third = len(featuresets) // 3
    
    train_set, dev_set, test_set = featuresets[0:third], featuresets[third:third*2], featuresets[third*2:]
    train_items, dev_items, test_items = items[0:third], items[third:third*2], items[third*2:]
    if return_items:
        return train_set, dev_set, test_set, train_items, dev_items, test_items
    else:
        return train_set, dev_set, test_set
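
To see what the three-way split does without loading the corpus, here is a minimal sketch on a tiny hand-made list. The names `toy_items` and `split_in_thirds` are hypothetical and not part of the notebook; the slicing is meant to mirror `create_training_sets`.

```python
# Hypothetical toy data: (name, label) pairs standing in for names_data.
toy_items = [('Ann', 'female'), ('Bob', 'male'), ('Cat', 'female'),
             ('Dan', 'male'), ('Eve', 'female'), ('Fay', 'female'),
             ('Gus', 'male')]

def split_in_thirds(items):
    """Slice items into train/dev/test with the same indices as create_training_sets."""
    third = len(items) // 3
    return items[:third], items[third:third * 2], items[third * 2:]

train, dev, test = split_in_thirds(toy_items)
print(len(train), len(dev), len(test))  # 2 2 3
```

When the number of items is not a multiple of three, the leftover items end up in the test slice.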

**Train the NLTK classifier on the training data, with the first definition of features**

In [39]:
# Pass in the feature function itself (not a string)
train_set, dev_set, test_set = create_training_sets(gender_features, names_data)
cl = nltk.NaiveBayesClassifier.train(train_set)

**Test the classifier on some examples**

In [40]:
print ("Carl: " + cl.classify(gender_features('Carl')))
print ("Carla: " + cl.classify(gender_features('Carla')))
print ("Carly: " + cl.classify(gender_features('Carly')))
print ("Carlo: " + cl.classify(gender_features('Carlo')))
print ("Carlos: " + cl.classify(gender_features('Carlos')))


Carl: female
Carla: female
Carly: female
Carlo: male
Carlos: male


In [41]:
print ("Carli: " + cl.classify(gender_features('Carli')))
print ("Carle: " + cl.classify(gender_features('Carle')))
print ("Charles: " + cl.classify(gender_features('Charles')))
print ("Carlie: " + cl.classify(gender_features('Carlie')))
print ("Charlie: " + cl.classify(gender_features('Charlie')))

Carli: female
Carle: female
Charles: male
Carlie: female
Charlie: female


**Run the NLTK evaluation function on the development set**

In [42]:
print ("%.3f" % nltk.classify.accuracy(cl, dev_set))

0.759
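
As a rough illustration of what this number means: accuracy is the fraction of development items whose predicted label matches the correct label. The sketch below computes that by hand. `simple_accuracy`, `LastLetterStub`, and `toy_dev` are hypothetical stand-ins for illustration, not NLTK code.

```python
def simple_accuracy(classifier, gold):
    """Fraction of (features, label) pairs the classifier labels correctly."""
    correct = sum(1 for features, label in gold
                  if classifier.classify(features) == label)
    return correct / len(gold)

class LastLetterStub:
    """Hypothetical stand-in classifier: 'female' if the name ends in a, e, i, o, u, or y."""
    def classify(self, features):
        return 'female' if features['last_letter'] in 'aeiouy' else 'male'

# Made-up development items in the same (features, label) shape as dev_set.
toy_dev = [({'last_letter': 'a'}, 'female'),
           ({'last_letter': 'l'}, 'male'),
           ({'last_letter': 'l'}, 'female'),   # a name like April
           ({'last_letter': 's'}, 'male')]

print("%.3f" % simple_accuracy(LastLetterStub(), toy_dev))  # 0.750
```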


**Run the NLTK feature inspection function on the classifier**

In [43]:
cl.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     36.3 : 1.0
             last_letter = 'f'              male : female =     12.9 : 1.0
             last_letter = 'v'              male : female =     10.6 : 1.0
             last_letter = 'd'              male : female =      9.9 : 1.0
             last_letter = 'r'              male : female =      9.0 : 1.0
             last_letter = 'w'              male : female =      8.4 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'g'              male : female =      7.4 : 1.0
             last_letter = 'm'              male : female =      7.2 : 1.0
             last_letter = 'i'            female : male   =      5.6 : 1.0
             last_letter = 's'              male : female =      3.6 : 1.0
             last_letter = 'z'              male : female =      3.6 : 1.0
             last_letter = 't'              male : female =      3.4 : 1.0
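
Each ratio in this listing compares how likely a feature value is under one label versus the other, that is, P(value | label A) / P(value | label B). The sketch below computes such a ratio from made-up counts. The counts, the function name `likelihood_ratio`, and the add-0.5 smoothing are assumptions for illustration, so the result will not reproduce the figures above.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, n_values=26, gamma=0.5):
    """Ratio P(value | A) / P(value | B), using additive smoothing (an assumption)."""
    p_a = (count_a + gamma) / (total_a + gamma * n_values)
    p_b = (count_b + gamma) / (total_b + gamma * n_values)
    return p_a / p_b

# Made-up counts: 600 of 1700 female names end in 'a', 10 of 1000 male names do.
ratio = likelihood_ratio(600, 1700, 10, 1000)
print("female : male = %.1f : 1.0" % ratio)
```

A large ratio means the feature value is strong evidence for the first label; a ratio near 1 means it says little either way.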

**Let's add some more features to improve results**

In [50]:
def gender_features2(word):
    features = {}
    word = word.lower()
    features['last'] = word[-1]
    features['first'] = word[:1]
    features['second'] = word[1:2] # get the 'h' in Charlie?
    return features

# Same as gender_features2, but without the second-letter feature
def gender_features3(word):
    features = {}
    word = word.lower()
    features['last'] = word[-1]
    features['first'] = word[:1]
    # features['second'] = word[1:2]  # omitted in this variant
    return features

gender_features2('Samantha')

{'first': 's', 'last': 'a', 'second': 'a'}

**We wrote the code so that we can easily pass in the new feature function. Let's see if this improves the results on the development set.**

In [45]:
train_set2, dev_set2, test_set2 = create_training_sets(gender_features2, names_data)
cl2 = nltk.NaiveBayesClassifier.train(train_set2)
print ("%.3f" % nltk.classify.accuracy(cl2, dev_set2))

0.785


In [51]:
train_set3, dev_set3, test_set3 = create_training_sets(gender_features3, names_data)
cl3 = nltk.NaiveBayesClassifier.train(train_set3)
print ("%.3f" % nltk.classify.accuracy(cl3, dev_set3))

0.789


**Let's hand-check some of the harder cases. Some are right but some are wrong; in this run the predictions are the same as with the first feature function.**

In [48]:
print ("Carli: " + cl2.classify(gender_features2('Carli')))
print ("Carle: " + cl2.classify(gender_features2('Carle')))
print ("Charles: " + cl2.classify(gender_features2('Charles')))
print ("Carlie: " + cl2.classify(gender_features2('Carlie')))
print ("Charlie: " + cl2.classify(gender_features2('Charlie')))

Carli: female
Carle: female
Charles: male
Carlie: female
Charlie: female


**We can see the influence of some of the new features**

In [49]:
cl2.show_most_informative_features(20)

Most Informative Features
                    last = 'a'            female : male   =     36.3 : 1.0
                    last = 'f'              male : female =     12.9 : 1.0
                    last = 'v'              male : female =     10.6 : 1.0
                    last = 'd'              male : female =      9.9 : 1.0
                    last = 'r'              male : female =      9.0 : 1.0
                    last = 'w'              male : female =      8.4 : 1.0
                    last = 'o'              male : female =      8.0 : 1.0
                    last = 'g'              male : female =      7.4 : 1.0
                    last = 'm'              male : female =      7.2 : 1.0
                    last = 'i'            female : male   =      5.6 : 1.0
                   first = 'w'              male : female =      5.4 : 1.0
                  second = 'b'              male : female =      4.3 : 1.0
                   first = 'x'              male : female =      3.9 : 1.0

**Below we use code from the NLTK chapter to collect the errors on the development set, in order to inspect the names that were classified wrongly. We use the option of the training set function that lets us get the original names for the training, development, and test sets.**

In [53]:
# Note: this reuses the names train_set3 and cl3, but with gender_features2,
# so that the classifier matches the feature function used in the loop below
train_set3, dev_set3, test_set3, train_items, dev_items, test_items = create_training_sets(gender_features2, names_data, return_items=True)
cl3 = nltk.NaiveBayesClassifier.train(train_set3)
# This is code from the NLTK chapter
errors = []
for (name, label) in dev_items:
    guess = cl3.classify(gender_features2(name))
    if guess != label:
        errors.append((label, guess, name))

**Print out the correct vs. the guessed answer for the errors, in order to inspect those that were wrong.**

In [54]:
for (tag, guess, name) in sorted(errors): 
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Abagael                       
correct=female   guess=male     name=Abigael                       
correct=female   guess=male     name=Abigail                       
correct=female   guess=male     name=Amargo                        
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Anne-Mar                      
correct=female   guess=male     name=April                         
correct=female   guess=male     name=Ardelis                       
correct=female   guess=male     name=Ardith                        
correct=female   guess=male     name=Ardys                         
correct=female   guess=male     name=Ardyth                        
correct=female   guess=male     name=Arlen                         
correct=female   guess=male     name=Aryn                          
correct=female   guess=male     name=Ashleigh                      
correct=female   guess=male     name=Austin     
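
Reading a long error listing by eye is slow, so one option is to tally the errors by direction and by some property of the name. The sketch below does this for a few tuples copied from the listing above. The variable names `sample_errors`, `by_direction`, and `by_last_letter` are hypothetical; in the notebook the same counting could be applied to the full `errors` list.

```python
from collections import Counter

# A few (correct, guess, name) tuples copied from the error listing.
sample_errors = [('female', 'male', 'Abagael'),
                 ('female', 'male', 'Abigail'),
                 ('female', 'male', 'Amber'),
                 ('female', 'male', 'April')]

# Count errors by (correct, guess) pair and by the last letter of the name.
by_direction = Counter((tag, guess) for tag, guess, _ in sample_errors)
by_last_letter = Counter(name[-1].lower() for _, _, name in sample_errors)

print(by_direction.most_common())    # [(('female', 'male'), 4)]
print(by_last_letter.most_common())  # [('l', 3), ('r', 1)]
```

Patterns that show up in such tallies can suggest which features to add or remove.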

**Exercise** Rewrite the feature function above to add some additional features, and then rerun the classifier on the development set to see whether it improves or degrades results. Check the results on the dev items to see where you still make errors, and add or remove features. When you are satisfied with the results, *freeze your algorithm*, **run it one time only on the test collection**, and report the results with the evaluation function.

Ideas for features:
* name length
* pairs of characters
* your idea goes here
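
As a starting point, the first two ideas might be expressed as in the sketch below. The function name `gender_features4` and the feature keys are hypothetical, and using the final two letters is only one possible reading of "pairs of characters".

```python
def gender_features4(word):
    """Hypothetical feature function: last letter, name length, and final letter pair."""
    word = word.lower()
    return {
        'last': word[-1],
        'length': len(word),
        'last_two': word[-2:],   # one possible "pair of characters"
    }

print(gender_features4('Samantha'))
# {'last': 'a', 'length': 8, 'last_two': 'ha'}
```

A function like this can be passed to `create_training_sets` in the same way as `gender_features2`. Whether it helps is something to check on the development set, not to assume.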