**Project 3** 

***Group 1: Adam Gersowitz, Diego Correa, Maria Gironio***

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

**Example Name Gender Classifier**

Page 222 in Natural Processing with Python

In [75]:
import random

random.seed(10)

def gender_features(word):
  return {'last_letter': word[-1]}


import nltk
#nltk.download('names')
from nltk.corpus import names
import random


names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

random.shuffle(names)

featuresets = [(gender_features(n), g) for (n,g) in names] 
train_set, test_set = featuresets[500:], featuresets[:500] 
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

0.77


Improvement from page 227 to help with the second to last letter of a name.

In [76]:
devtest_names = names[500:1000]
test_names = names[:500]
train_names = names[1000:]

def gender_features(word):
  return {'suffix1': word[-1:],
          'suffix2': word[-2:]}

train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, devtest_set))



errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE ...
  print('correct='+tag+' guess='+guess+' name='+name)

0.77
correct=female guess=male name=Amargo
correct=female guess=male name=Blake
correct=female guess=male name=Brooke
correct=female guess=male name=Carey
correct=female guess=male name=Carin
correct=female guess=male name=Carleen
correct=female guess=male name=Caro
correct=female guess=male name=Chloris
correct=female guess=male name=Christal
correct=female guess=male name=Clarey
correct=female guess=male name=Clary
correct=female guess=male name=Clio
correct=female guess=male name=Conney
correct=female guess=male name=Coral
correct=female guess=male name=Cordey
correct=female guess=male name=Coriss
correct=female guess=male name=Cris
correct=female guess=male name=Dallas
correct=female guess=male name=Darsey
correct=female guess=male name=Dolley
correct=female guess=male name=Doro
correct=female guess=male name=Drew
correct=female guess=male name=Eden
correct=female guess=male name=Eilis
correct=female guess=male name=Esther
correct=female guess=male name=Frank
correct=female guess=m

We can see from the output above that there are some names that have been incorectly classified in the same way that start with the same letter (i.e. 13 incorrect guesses of male on female names that start with "C"). We will add the first and second letter of the name as features in our classifier.

This resulted in an improvement from 0.77 accuracy to 0.82 in our devtest set.

In [77]:
def gender_features(word):
  return {'suffix1': word[-1:],
          'suffix2': word[-2:],
          'prefix1': word[:1],
          'prefix2': word[:2]}


train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))



errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE ...
  print('correct='+tag+' guess='+guess+' name='+name)

0.82
correct=female guess=male name=Amargo
correct=female guess=male name=Brooke
correct=female guess=male name=Caro
correct=female guess=male name=Chloris
correct=female guess=male name=Clio
correct=female guess=male name=Coriss
correct=female guess=male name=Cris
correct=female guess=male name=Dallas
correct=female guess=male name=Doro
correct=female guess=male name=Drew
correct=female guess=male name=Eden
correct=female guess=male name=Esther
correct=female guess=male name=Frank
correct=female guess=male name=Gabbey
correct=female guess=male name=Gennifer
correct=female guess=male name=Gill
correct=female guess=male name=Harley
correct=female guess=male name=Harlie
correct=female guess=male name=Heather
correct=female guess=male name=Ines
correct=female guess=male name=Isabeau
correct=female guess=male name=Isabel
correct=female guess=male name=Jackquelin
correct=female guess=male name=Janean
correct=female guess=male name=Joan
correct=female guess=male name=Madlin
correct=female gu

We will try to add vowel count as a possible feature. When added to the classifier vowle count actually decreases our accuracy so it will not be added to the final classifier.

Vowel count coe: https://www.delftstack.com/howto/python/python-syllable-counter/

In [78]:
def v_count(str):
    count = 0
    
    syllables = set("AEIOUaeiou")
    
    for letter in str:
        if letter in syllables:
            count = count + 1
    return count

def gender_features(word):
  vowel_count = v_count(word)
  return {'suffix1': word[-1:],
          'suffix2': word[-2:],
          'prefix1': word[:1],
          'prefix2': word[:2],
          'vowel_count': vowel_count
          }

train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))



errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


#for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE ...
#  print('correct='+tag+' guess='+guess+' name='+name)






0.812


After returning to the list of incorrect guesses we find that there are a few examples of female names that have repeating letters ("Pammy","Gennifer","Gabby") so we will add a feature that will determine if there are back to back letters in a name.

This upped the accuracy a bit to 0.822. 

Code source: https://stackoverflow.com/questions/34443946/count-consecutive-characters

In [79]:
from itertools import groupby




def gender_features(word):
  groups = groupby(word)
  result = [(label, sum(1 for _ in group)) for label, group in groups]
  result=", ".join("{}x{}".format(label, count) for label, count in result)
  rep= "2" in result
  return {'suffix1': word[-1:],
          'suffix2': word[-2:],
          'prefix1': word[:1],
          'prefix2': word[:2],
          'repeat': rep
          }

        

train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))



errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


#for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE ...
#  print('correct='+tag+' guess='+guess+' name='+name)

0.822


We will also add the third letter and the third to last letter of the names in order to better find patterns in the names that we can not identify. We will stop at 3 so we avoid overfitting the model to the test set. This increases our accuracy to 0.848. 

At this point we will also take a deeper look at the most informative features by looking at the top 100. We can see that the suffix and prefix features are much mroe helpful then the repeating letter feature so we will remove the repeating letter feature to avoid overfitting.

In [80]:


def gender_features(word):
  groups = groupby(word)
  result = [(label, sum(1 for _ in group)) for label, group in groups]
  result=", ".join("{}x{}".format(label, count) for label, count in result)
  rep= "2" in result
  return {'suffix1': word[-1:],
          'suffix2': word[-2:],
          'suffix3': word[-3:],
          'prefix1': word[:1],
          'prefix2': word[:2],
          'prefix3': word[:3],
          'repeat': rep
          }


classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))



errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


#for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE ...
#  print('correct='+tag+' guess='+guess+' name='+name)


classifier.show_most_informative_features(100)

0.822
Most Informative Features
                 suffix2 = 'na'           female : male   =     94.0 : 1.0
                 suffix2 = 'la'           female : male   =     69.0 : 1.0
                 suffix1 = 'k'              male : female =     41.8 : 1.0
                 suffix2 = 'ia'           female : male   =     37.6 : 1.0
                 suffix1 = 'a'            female : male   =     34.7 : 1.0
                 suffix2 = 'rd'             male : female =     30.7 : 1.0
                 suffix2 = 'us'             male : female =     27.9 : 1.0
                 suffix2 = 'ra'           female : male   =     25.5 : 1.0
                 suffix2 = 'do'             male : female =     25.0 : 1.0
                 suffix2 = 'ta'           female : male   =     23.3 : 1.0
                 suffix2 = 'rt'             male : female =     22.1 : 1.0
                 suffix2 = 'ld'             male : female =     21.7 : 1.0
                 suffix2 = 'os'             male : female =     19.4

The removal of the repeating letter feature did not reduce our accuract of 0.848.

We will add a general measure of the vowels and consanants that are not the first letter of the name (by only pulling lower case)  to see if a collection of these letters helps us determine the gender. We can see this increases our accuracy to 0.854. We can also see when examining the most impactful features that these measures fall iwthin the top 100 features so we will keep them.

In [81]:
import re

def gender_features(word):
  vowels = "".join(sorted(list(set(re.sub(r'[^aeiou]', '', word)))))
  consonants = "".join(sorted(list(set(re.sub(r'[aeiou]', '', word)))))
  return {'suffix1': word[-1:],
          'suffix2': word[-2:],
          'suffix3': word[-3:],
          'prefix1': word[:1],
          'prefix2': word[:2],
          'prefix3': word[:3],
          'vow':vowels,
          'con': consonants
          }

          
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, devtest_set))



errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


classifier.show_most_informative_features(100)

#for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE ...
#  print('correct='+tag+' guess='+guess+' name='+name)

0.854
Most Informative Features
                 suffix2 = 'na'           female : male   =     94.0 : 1.0
                 suffix2 = 'la'           female : male   =     69.0 : 1.0
                 suffix1 = 'k'              male : female =     41.8 : 1.0
                 suffix2 = 'ia'           female : male   =     37.6 : 1.0
                 suffix1 = 'a'            female : male   =     34.7 : 1.0
                 suffix2 = 'rd'             male : female =     30.7 : 1.0
                 suffix3 = 'ard'            male : female =     28.8 : 1.0
                 suffix2 = 'us'             male : female =     27.9 : 1.0
                 suffix3 = 'ana'          female : male   =     25.6 : 1.0
                 suffix2 = 'ra'           female : male   =     25.5 : 1.0
                 suffix2 = 'do'             male : female =     25.0 : 1.0
                 suffix3 = 'tta'          female : male   =     23.9 : 1.0
                 suffix2 = 'ta'           female : male   =     23.3

To narrow down the scope of the consonants and vowels we will introduce digraphs and trigraphs. Digrpahs and Trigraphs are a string of 2 or 3 letters in english that typically make a unique sound. we will test these at the beginning and end of the name as well as all of the graphs in a name. Some of these digraph features arein our top 100 and it increases our accuracy to 0.858.


https://www.enchantedlearning.com/consonantblends/

In [82]:
digraphs=['bl', 'br', 'ch', 'ck', 'cl', 'cr', 'dr', 'fl', 'fr', 'gh', 'gl', 'gr', 'ng', 'ph', 'pl', 'pr', 'qu', 'sc', 'sh', 'sk', 'sl', 'sm', 'sn', 'sp', 'st', 'sw', 'th', 'tr', 'tw', 'wh', 'wr']
trigraphs= ['nth', 'sch', 'scr', 'shr', 'spl', 'spr', 'squ', 'str', 'thr']
v_digraphs= ['ai', 'au', 'aw', 'ay', 'ea', 'ee', 'ei', 'eu', 'ew', 'ey', 'ie', 'oi', 'oo', 'ou', 'ow', 'oy']





def gender_features(word):
  #vowels = "".join(sorted(list(set(re.sub(r'[^aeiou]', '', word)))))
  #consonants = "".join(sorted(list(set(re.sub(r'[aeiou]', '', word)))))
  resdi = "".join([ele for ele in digraphs if(ele in word)])
  restri = "".join([ele for ele in trigraphs if(ele in word)])
  resvdi = "".join([ele for ele in v_digraphs if(ele in word)])
  endresdi = "".join([ele for ele in digraphs if(ele in word[-2:])])
  endrestri = "".join([ele for ele in trigraphs if(ele in word[-3:])])
  endresvdi = "".join([ele for ele in v_digraphs if(ele in word[-2:])])
  startresdi = "".join([ele for ele in digraphs if(ele in word.lower()[:2])])
  startrestri = "".join([ele for ele in trigraphs if(ele in word.lower()[:3])])
  startresvdi = "".join([ele for ele in v_digraphs if(ele in word.lower()[:2])])
  return {'suffix1': word[-1:],
          'suffix2': word[-2:],
          'suffix3': word[-3:],
          'prefix1': word[:1],
          'prefix2': word[:2],
          'prefix3': word[:3],
          #'vow':vowels,
          #'con': consonants,
          'digraphs': resdi,
          'trigraphs': restri,
          'vowel_digraphs': resvdi,
          'end_digraphs': endresdi,
          'end_trigraphs': endrestri,
          'end_vowel_digraphs': endresvdi,
          'start_digraphs': startresdi,
          'start_trigraphs': startrestri,
          'start_vowel_digraphs': startresvdi
          }

         


train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, devtest_set))



errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


classifier.show_most_informative_features(100)


0.858
Most Informative Features
                 suffix2 = 'na'           female : male   =     94.0 : 1.0
                 suffix2 = 'la'           female : male   =     69.0 : 1.0
                 suffix1 = 'k'              male : female =     41.8 : 1.0
                 suffix2 = 'ia'           female : male   =     37.6 : 1.0
                 suffix1 = 'a'            female : male   =     34.7 : 1.0
                 suffix2 = 'rd'             male : female =     30.7 : 1.0
                 suffix3 = 'ard'            male : female =     28.8 : 1.0
                 suffix2 = 'us'             male : female =     27.9 : 1.0
                 suffix3 = 'ana'          female : male   =     25.6 : 1.0
                 suffix2 = 'ra'           female : male   =     25.5 : 1.0
                 suffix2 = 'do'             male : female =     25.0 : 1.0
                 suffix3 = 'tta'          female : male   =     23.9 : 1.0
                 suffix2 = 'ta'           female : male   =     23.3

Now with this list of features we will evaluate our classifier against the testing set. We get a slightly worse accuracy of 0.826 which implies some possible slight overfitting to the dev_test set. An additional way to expand this classifier is by identifiying and applying different features to the names if they can be bucketed into languages of origin (i.e. "Georgiamay", "Hpephzibah", "Bobby", "Andrei" may be derived from different languages originally and thus may have different rules for the gender idneticifaction of their names). 

However an accuracy of 82.6% is a good classifier and anything much higher would likely suffer from overfitting.

In [87]:
print(nltk.classify.accuracy(classifier, test_set))

errors = []
for (name, tag) in test_names:
  guess = classifier.classify(gender_features(name)) 
  if guess != tag:
    errors.append( (tag, guess, name) )


classifier.show_most_informative_features(100)

for (tag, guess, name) in sorted(errors): # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE ...
  print('correct='+tag+' guess='+guess+' name='+name)

0.826
Most Informative Features
                 suffix2 = 'na'           female : male   =     94.0 : 1.0
                 suffix2 = 'la'           female : male   =     69.0 : 1.0
                 suffix1 = 'k'              male : female =     41.8 : 1.0
                 suffix2 = 'ia'           female : male   =     37.6 : 1.0
                 suffix1 = 'a'            female : male   =     34.7 : 1.0
                 suffix2 = 'rd'             male : female =     30.7 : 1.0
                 suffix3 = 'ard'            male : female =     28.8 : 1.0
                 suffix2 = 'us'             male : female =     27.9 : 1.0
                 suffix3 = 'ana'          female : male   =     25.6 : 1.0
                 suffix2 = 'ra'           female : male   =     25.5 : 1.0
                 suffix2 = 'do'             male : female =     25.0 : 1.0
                 suffix3 = 'tta'          female : male   =     23.9 : 1.0
                 suffix2 = 'ta'           female : male   =     23.3