## Web Analytics DATA 620 - Week 04 - Part 02
## Project 3
## Group - Chris Bloome / Mustafa Telab / Vinayak Kamath
## Date - 4th July 2021

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python,
and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender
classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are
satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2

## Import data

In [1]:
import nltk
#nltk.download('names')

## Example classifier

Let's start with the example from the text:

In [2]:
def gender_features(word):
    return {'last_letter': word[-1]}
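
As a quick check of what this extractor hands to the classifier, the sketch below repeats the function and applies it to one sample name. The name `Shrek` is only an illustration and is not taken from our data.

```python
def gender_features(word):
    """Baseline extractor from the text: only the final character is used."""
    return {'last_letter': word[-1]}

# Hypothetical sample input, chosen only to show the shape of the output.
example = gender_features('Shrek')
print(example)
```

Every name is reduced to a one-entry dictionary, so two names that end in the same letter look identical to the model.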

In [3]:
from nltk.corpus import names
import random

names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

#set seed for consistency 
random.seed(113)
random.shuffle(names)

In [4]:
featuresets = [(gender_features(n), g) for (n,g) in names]

test_set, dev_set, train_set = featuresets[:500],featuresets[500:1000],featuresets[1000:] 
classifier = nltk.NaiveBayesClassifier.train(train_set)
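
The three-way slice above can also be wrapped in a small helper, which makes the sizes easy to verify. This is a minimal sketch: `three_way_split` is a hypothetical name, and the placeholder list of 7,900 items only stands in for the Names Corpus so the block runs without NLTK.

```python
import random

def three_way_split(items, n_test=500, n_devtest=500, seed=113):
    """Shuffle a copy of items and cut it into test, dev-test and training parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return test, devtest, train

# Placeholder data (assumption): 7,900 dummy entries instead of the real names.
toy = [f"name{i}" for i in range(7900)]
test, devtest, train = three_way_split(toy)
print(len(test), len(devtest), len(train))
```

Because the three parts are slices of a single shuffled list, no entry can land in more than one of them.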

In [5]:
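# Helper to evaluate every model the same way on the dev set.
# Note: this name shadows Python's built-in eval(); it is kept because later cells call it.
def eval():
    print(nltk.classify.accuracy(classifier, dev_set))
    classifier.show_most_informative_features(10)

eval()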

0.748
Most Informative Features
             last_letter = 'k'              male : female =     45.1 : 1.0
             last_letter = 'a'            female : male   =     34.4 : 1.0
             last_letter = 'f'              male : female =     26.6 : 1.0
             last_letter = 'v'              male : female =     15.3 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0
             last_letter = 'd'              male : female =      9.0 : 1.0
             last_letter = 'o'              male : female =      8.4 : 1.0
             last_letter = 'z'              male : female =      7.8 : 1.0
             last_letter = 'm'              male : female =      7.6 : 1.0
             last_letter = 'r'              male : female =      6.5 : 1.0
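
Each row above is a likelihood ratio: how much more often a feature value occurs with one label than with the other. The sketch below shows the arithmetic on a tiny invented data set. It is a simplified, unsmoothed version (NLTK smooths its probability estimates, so its figures will differ), and `last_letter_ratio` and the toy names with their labels are assumptions made for illustration.

```python
from collections import Counter

def last_letter_ratio(labeled_names, letter, label_a="male", label_b="female"):
    """Unsmoothed ratio P(last letter | label_a) / P(last letter | label_b)."""
    totals = Counter(label for _, label in labeled_names)
    hits = Counter(label for name, label in labeled_names
                   if name[-1].lower() == letter)
    p_a = hits[label_a] / totals[label_a]
    p_b = hits[label_b] / totals[label_b]  # zero here would fail; smoothing avoids that
    return p_a / p_b

# Invented toy data: 3 of 4 "male" names and 1 of 8 "female" names end in 'k'.
toy = ([(n, "male") for n in ["Mark", "Erik", "Jack", "John"]] +
       [(n, "female") for n in ["Anna", "Maria", "Ella", "Sara",
                                "Lena", "Nina", "Tina", "Annik"]])

ratio = last_letter_ratio(toy, "k")
print(f"{ratio:.1f} : 1.0")
```

On the toy data the ratio is 0.75 / 0.125, so a final `k` is six times as likely under the male label as under the female one.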


## Iteration - Naive Bayes

### Model 1

We will take the "kitchen sink" approach: start with all of our features, then attempt to improve accuracy by removing some of them.

At 74.8% accuracy on the dev set, the example classifier gives us a good starting point.

In [6]:
def gender_features2(name):
    features = {}
    # Last Letters 
    features["last_letter"] = name[-1].lower()
    features["last_2_letters"] = name[-2:].lower()
    features["last_3_letters"] = name[-3:].lower()
    
    # First Letters 
    features["first_letter"] = name[:1].lower()
    features["first_2_letters"] = name[:2].lower()
    features["first_3_letters"] = name[:3].lower()
    
    # Vowels (including y)
    features["has_A"] =  "a" in name.lower()
    features["has_E"] =  "e" in name.lower()
    features["has_I"] =  "i" in name.lower()
    features["has_O"] =  "o" in name.lower()
    features["has_U"] =  "u" in name.lower()
    features["has_Y"] =  "y" in name.lower()
    
    # Other Letters 
    features["has_l"] =  "l" in name.lower()
    features["has_s"] =  "s" in name.lower()
    features["has_t"] =  "t" in name.lower()
    features["has_r"] =  "r" in name.lower()
    features["has_n"] =  "n" in name.lower()
     
    return features


In [7]:
featuresets = [(gender_features2(n), g) for (n,g) in names]

 
test_set, dev_set, train_set = featuresets[:500],featuresets[500:1000],featuresets[1000:] 
classifier = nltk.NaiveBayesClassifier.train(train_set)

eval()

0.828
Most Informative Features
          last_2_letters = 'na'           female : male   =    153.1 : 1.0
          last_2_letters = 'la'           female : male   =     70.2 : 1.0
             last_letter = 'k'              male : female =     45.1 : 1.0
          last_2_letters = 'us'             male : female =     39.6 : 1.0
          last_2_letters = 'ia'           female : male   =     38.2 : 1.0
             last_letter = 'a'            female : male   =     34.4 : 1.0
          last_2_letters = 'sa'           female : male   =     34.2 : 1.0
          last_2_letters = 'rd'             male : female =     30.2 : 1.0
          last_3_letters = 'ard'            male : female =     27.7 : 1.0
             last_letter = 'f'              male : female =     26.6 : 1.0


At 82.8%, we have improved our model quite a bit.

Let's assess all the features to gauge which are predictive:

In [8]:
classifier.show_most_informative_features(-1)

Most Informative Features
          last_2_letters = 'na'           female : male   =    153.1 : 1.0
          last_2_letters = 'la'           female : male   =     70.2 : 1.0
             last_letter = 'k'              male : female =     45.1 : 1.0
          last_2_letters = 'us'             male : female =     39.6 : 1.0
          last_2_letters = 'ia'           female : male   =     38.2 : 1.0
             last_letter = 'a'            female : male   =     34.4 : 1.0
          last_2_letters = 'sa'           female : male   =     34.2 : 1.0
          last_2_letters = 'rd'             male : female =     30.2 : 1.0
          last_3_letters = 'ard'            male : female =     27.7 : 1.0
             last_letter = 'f'              male : female =     26.6 : 1.0
          last_3_letters = 'tta'          female : male   =     25.1 : 1.0
          last_2_letters = 'ra'           female : male   =     24.6 : 1.0
          last_2_letters = 'ta'           female : male   =     24.2 : 1.0

### Model 2

In [9]:
def gender_features2(name):
    features = {}
    # Last Letters 
    features["last_letter"] = name[-1].lower()
    features["last_2_letters"] = name[-2:].lower()
    #features["last_3_letters"] = name[-3:].lower()
    
    # First Letters 
    features["first_letter"] = name[:1].lower()
    features["first_2_letters"] = name[:2].lower()
    #features["first_3_letters"] = name[:3].lower()
    
    # Vowels (including y)
    features["has_A"] =  "a" in name.lower()
    features["has_E"] =  "e" in name.lower()
    features["has_I"] =  "i" in name.lower()
    features["has_O"] =  "o" in name.lower()
    features["has_U"] =  "u" in name.lower()
    features["has_Y"] =  "y" in name.lower()
    
    # Other Letters 
    features["has_l"] =  "l" in name.lower()
    features["has_s"] =  "s" in name.lower()
    features["has_t"] =  "t" in name.lower()
    features["has_r"] =  "r" in name.lower()
    features["has_n"] =  "n" in name.lower()
     
    return features

featuresets = [(gender_features2(n), g) for (n,g) in names]
 
test_set, dev_set, train_set = featuresets[:500],featuresets[500:1000],featuresets[1000:] 
classifier = nltk.NaiveBayesClassifier.train(train_set)

eval()

0.8
Most Informative Features
          last_2_letters = 'na'           female : male   =    153.1 : 1.0
          last_2_letters = 'la'           female : male   =     70.2 : 1.0
             last_letter = 'k'              male : female =     45.1 : 1.0
          last_2_letters = 'us'             male : female =     39.6 : 1.0
          last_2_letters = 'ia'           female : male   =     38.2 : 1.0
             last_letter = 'a'            female : male   =     34.4 : 1.0
          last_2_letters = 'sa'           female : male   =     34.2 : 1.0
          last_2_letters = 'rd'             male : female =     30.2 : 1.0
             last_letter = 'f'              male : female =     26.6 : 1.0
          last_2_letters = 'ra'           female : male   =     24.6 : 1.0


Removing the first/last 3-letter features did not improve our model: dev-set accuracy fell from 82.8% to 80.0%.
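
Since each iteration only switches groups of features on or off, one option is a single extractor with parameters instead of commenting lines in and out. The sketch below is a hypothetical compact variant, not the function used in this notebook: `name_features`, `max_affix` and `letter_flags` are assumed names, and the feature keys (`last_1`, `first_2`, ...) differ from the ones above.

```python
def name_features(name, max_affix=3, letter_flags="aeiouylstrn"):
    """Prefixes/suffixes up to max_affix letters, plus one has_<letter> flag per letter."""
    lowered = name.lower()
    features = {}
    for n in range(1, max_affix + 1):
        features[f"last_{n}"] = lowered[-n:]
        features[f"first_{n}"] = lowered[:n]
    for letter in letter_flags:
        features[f"has_{letter}"] = letter in lowered
    return features

full = name_features("Anna")                                     # all feature groups
no_trigrams = name_features("Anna", max_affix=2)                 # drops 3-letter affixes
affixes_only = name_features("Anna", letter_flags="")            # drops letter flags
print(len(full), len(no_trigrams), len(affixes_only))
```

With this shape, each experiment becomes a different argument list, which makes it harder to evaluate a model with a stale feature function.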

### Model 3

Next, let's remove each of our single-letter searches:

In [10]:
def gender_features2(name):
    features = {}
    # Last Letters 
    features["last_letter"] = name[-1].lower()
    features["last_2_letters"] = name[-2:].lower()
    features["last_3_letters"] = name[-3:].lower()
    
    # First Letters 
    features["first_letter"] = name[:1].lower()
    features["first_2_letters"] = name[:2].lower()
    features["first_3_letters"] = name[:3].lower()
    
    # Vowels (including y)
    #features["has_A"] =  "a" in name.lower()
    #features["has_E"] =  "e" in name.lower()
    #features["has_I"] =  "i" in name.lower()
    #features["has_O"] =  "o" in name.lower()
    #features["has_U"] =  "u" in name.lower()
    #features["has_Y"] =  "y" in name.lower()
    
    # Other Letters 
    #features["has_l"] =  "l" in name.lower()
    #features["has_s"] =  "s" in name.lower()
    #features["has_t"] =  "t" in name.lower()
    #features["has_r"] =  "r" in name.lower()
    #features["has_n"] =  "n" in name.lower()
     
    return features

featuresets = [(gender_features2(n), g) for (n,g) in names]

 
test_set, dev_set, train_set = featuresets[:500],featuresets[500:1000],featuresets[1000:] 
classifier = nltk.NaiveBayesClassifier.train(train_set)

eval()

0.822
Most Informative Features
          last_2_letters = 'na'           female : male   =    153.1 : 1.0
          last_2_letters = 'la'           female : male   =     70.2 : 1.0
             last_letter = 'k'              male : female =     45.1 : 1.0
          last_2_letters = 'us'             male : female =     39.6 : 1.0
          last_2_letters = 'ia'           female : male   =     38.2 : 1.0
             last_letter = 'a'            female : male   =     34.4 : 1.0
          last_2_letters = 'sa'           female : male   =     34.2 : 1.0
          last_2_letters = 'rd'             male : female =     30.2 : 1.0
          last_3_letters = 'ard'            male : female =     27.7 : 1.0
             last_letter = 'f'              male : female =     26.6 : 1.0


Removing the searches for individual letters (and adding back the first/last 3-letter features) gave 82.2%, an improvement over the 80.0% we saw after removing the first/last 3-letter features. That said, this model is still slightly worse than our first attempt at 82.8%.

## Maximum Entropy

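Our current best model, at 82.8%, has several features that overlap. By construction, the 1-, 2- and 3-letter prefixes and suffixes overlap with one another and with our searches for specific letters. Naive Bayes treats every feature as independent, so overlapping evidence can be counted more than once.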
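
To make the overlap concrete, the sketch below adds up evidence the way a Naive Bayes model does: the log-odds are the log prior odds plus one log likelihood ratio per feature. The two ratios are taken from the Model 1 output (`last_letter = 'a'` at 34.4 and `last_2_letters = 'na'` at 153.1); the even prior odds of 1.0 and the helper name `naive_bayes_log_odds` are assumptions for illustration.

```python
import math

def naive_bayes_log_odds(prior_odds, likelihood_ratios):
    """Log-odds under the independence assumption: log prior plus summed log ratios."""
    return math.log(prior_odds) + sum(math.log(r) for r in likelihood_ratios)

# A name ending in "na" fires both features, although the second implies the first.
only_last_letter = naive_bayes_log_odds(1.0, [34.4])
both_features = naive_bayes_log_odds(1.0, [34.4, 153.1])

print(round(only_last_letter, 2), round(both_features, 2))
```

The second feature already tells us the name ends in `a`, yet its ratio is added on top of the first, which is the kind of double counting a Maximum Entropy model can correct for by fitting the weights jointly.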

Let's run a few Maximum Entropy models to see if we can improve on that figure.

### Model 4

In [11]:
def gender_features2(name):
    features = {}
    # Last Letters 
    features["last_letter"] = name[-1].lower()
    features["last_2_letters"] = name[-2:].lower()
    features["last_3_letters"] = name[-3:].lower()
    
    # First Letters 
    features["first_letter"] = name[:1].lower()
    features["first_2_letters"] = name[:2].lower()
    features["first_3_letters"] = name[:3].lower()
    
    # Vowels (including y)
    features["has_A"] =  "a" in name.lower()
    features["has_E"] =  "e" in name.lower()
    features["has_I"] =  "i" in name.lower()
    features["has_O"] =  "o" in name.lower()
    features["has_U"] =  "u" in name.lower()
    features["has_Y"] =  "y" in name.lower()
    
    # Other Letters 
    features["has_l"] =  "l" in name.lower()
    features["has_s"] =  "s" in name.lower()
    features["has_t"] =  "t" in name.lower()
    features["has_r"] =  "r" in name.lower()
    features["has_n"] =  "n" in name.lower()
     
    return features

featuresets = [(gender_features2(n), g) for (n,g) in names]

 
test_set, dev_set, train_set = featuresets[:500],featuresets[500:1000],featuresets[1000:] 
classifier = nltk.classify.MaxentClassifier.train(train_set)

eval()


  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.370
             2          -0.52193        0.665
             3          -0.45443        0.798
             4          -0.40689        0.844
             5          -0.37247        0.856
             6          -0.34666        0.863
             7          -0.32665        0.869
             8          -0.31068        0.873
             9          -0.29760        0.877
            10          -0.28667        0.882
            11          -0.27737        0.883
            12          -0.26933        0.886
            13          -0.26230        0.887
            14          -0.25608        0.890
            15          -0.25052        0.892
            16          -0.24551        0.894
            17          -0.24096        0.895
            18          -0.23680        0.896
            19          -0.23299        0.898
 

We see above that there is significant overfitting: our model was 92.2% accurate on the training set, but only 82.4% accurate on the dev set.
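
The gap between training and dev accuracy (about 9.8 points here, 92.2% versus 82.4%) can be measured with a small helper. This is a sketch under assumptions: `accuracy` and `generalization_gap` are hypothetical names, and the toy "memorizer" below only stands in for a trained model so the block runs on its own.

```python
def accuracy(classify, labeled_featuresets):
    """Share of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for feats, label in labeled_featuresets
                  if classify(feats) == label)
    return correct / len(labeled_featuresets)

def generalization_gap(classify, train_set, dev_set):
    """Training accuracy minus dev accuracy; a large positive value suggests overfitting."""
    return accuracy(classify, train_set) - accuracy(classify, dev_set)

# Toy stand-in: a lookup table that memorizes training suffixes and guesses otherwise.
toy_train = [({"last_3": "ard"}, "male"), ({"last_3": "tta"}, "female")]
toy_dev = [({"last_3": "ert"}, "male"), ({"last_3": "ina"}, "female")]
memory = {feats["last_3"]: label for feats, label in toy_train}

def memorizer(feats):
    return memory.get(feats["last_3"], "female")

gap = generalization_gap(memorizer, toy_train, toy_dev)
print(gap)
```

With the notebook's objects, the same helper could presumably be called as `generalization_gap(classifier.classify, train_set, dev_set)`.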

We will try one more thing. We see above that the most predictive features are sets of 3 letters. We suspect these may be leading to overfitting, because each set of 3 letters appears in a relatively small number of names in the training set. By removing the first/last 3-letter features from the model, we should see lower accuracy on the training set but higher accuracy on the dev set.
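
One way to check that suspicion is to count how many names carry a suffix that almost never repeats. The sketch below does this on a small invented list; `rare_suffix_share` is a hypothetical helper and the toy names are not drawn from our split.

```python
from collections import Counter

def rare_suffix_share(names, n=3, min_count=2):
    """Fraction of names whose last-n-letter suffix occurs fewer than min_count times."""
    suffixes = [name[-n:].lower() for name in names]
    counts = Counter(suffixes)
    rare = sum(1 for suffix in suffixes if counts[suffix] < min_count)
    return rare / len(names)

# Invented toy list, used only to show the calculation.
toy_names = ["Anna", "Hanna", "Joanna", "Richard", "Bernard",
             "Dwight", "Ursula", "Boris"]

print(rare_suffix_share(toy_names, n=3), rare_suffix_share(toy_names, n=1))
```

Longer suffixes split the data into more, smaller groups, so a larger share of names ends up with a value the model has seen only once. Running the same count on the real training names would show how strong the effect is in our data.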

### Model 5

In [12]:
def gender_features2(name):
    features = {}
    # Last Letters 
    features["last_letter"] = name[-1].lower()
    features["last_2_letters"] = name[-2:].lower()
    #features["last_3_letters"] = name[-3:].lower()
    
    # First Letters 
    features["first_letter"] = name[:1].lower()
    features["first_2_letters"] = name[:2].lower()
    #features["first_3_letters"] = name[:3].lower()
    
    # Vowels (including y)
    features["has_A"] =  "a" in name.lower()
    features["has_E"] =  "e" in name.lower()
    features["has_I"] =  "i" in name.lower()
    features["has_O"] =  "o" in name.lower()
    features["has_U"] =  "u" in name.lower()
    features["has_Y"] =  "y" in name.lower()
    
    # Other Letters 
    features["has_l"] =  "l" in name.lower()
    features["has_s"] =  "s" in name.lower()
    features["has_t"] =  "t" in name.lower()
    features["has_r"] =  "r" in name.lower()
    features["has_n"] =  "n" in name.lower()
     
    return features

featuresets = [(gender_features2(n), g) for (n,g) in names]

 
test_set, dev_set, train_set = featuresets[:500],featuresets[500:1000],featuresets[1000:] 

classifier = nltk.classify.MaxentClassifier.train(train_set)

eval()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.370
             2          -0.54504        0.635
             3          -0.49071        0.751
             4          -0.45064        0.790
             5          -0.42079        0.808
             6          -0.39810        0.814
             7          -0.38044        0.819
             8          -0.36639        0.821
             9          -0.35500        0.823
            10          -0.34559        0.823
            11          -0.33771        0.826
            12          -0.33102        0.827
            13          -0.32528        0.828
            14          -0.32030        0.830
            15          -0.31594        0.831
            16          -0.31209        0.831
            17          -0.30867        0.832
            18          -0.30562        0.831
            19          -0.30287        0.832
 

Well, this was actually one of our worst models. Despite less overfitting, it was less accurate than our first model.

## Conclusions / Testing Set Analysis

Despite several attempts to improve our model, our first model was the most accurate. A Naive Bayes model using features based on the first and last 1/2/3 letters, plus the presence of each vowel and the most common consonants, correctly classified 82.8% of our dev set.

Let's see how it does on the test set:

In [13]:
def gender_features2(name):
    features = {}
    # Last Letters 
    features["last_letter"] = name[-1].lower()
    features["last_2_letters"] = name[-2:].lower()
    features["last_3_letters"] = name[-3:].lower()
    
    # First Letters 
    features["first_letter"] = name[:1].lower()
    features["first_2_letters"] = name[:2].lower()
    features["first_3_letters"] = name[:3].lower()
    
    # Vowels (including y)
    features["has_A"] =  "a" in name.lower()
    features["has_E"] =  "e" in name.lower()
    features["has_I"] =  "i" in name.lower()
    features["has_O"] =  "o" in name.lower()
    features["has_U"] =  "u" in name.lower()
    features["has_Y"] =  "y" in name.lower()
    
    # Other Letters 
    features["has_l"] =  "l" in name.lower()
    features["has_s"] =  "s" in name.lower()
    features["has_t"] =  "t" in name.lower()
    features["has_r"] =  "r" in name.lower()
    features["has_n"] =  "n" in name.lower()
     
    return features

featuresets = [(gender_features2(n), g) for (n,g) in names]
test_set, dev_set, train_set = featuresets[:500],featuresets[500:1000],featuresets[1000:] 

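# Note: `classifier` is not retrained in this cell, so it still refers to the
# MaxentClassifier fitted for Model 5. To score Model 1 instead, retrain first:
# classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)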

0.82

We see the classifier was able to correctly predict 82% of the test set, in line with the 82.8% our first model achieved on the dev set.
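
To judge whether 82.0% on the test set really differs from 82.8% on the dev set, we can compare the difference with the sampling noise expected from a 500-name sample. The sketch below uses the usual binomial approximation for the standard error of an accuracy figure, which is an assumption about how the errors are distributed; `accuracy_standard_error` is a hypothetical helper name.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial approximation: sqrt(p * (1 - p) / n) for accuracy p on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

dev_acc, test_acc, n = 0.828, 0.82, 500
se = accuracy_standard_error(test_acc, n)
difference = abs(dev_acc - test_acc)

print(round(se, 3), round(difference, 3))
```

The difference of 0.8 points is smaller than one standard error (about 1.7 points), so under this approximation the two scores are not distinguishable, which is what we would expect when both sets are random slices of the same shuffled corpus.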

In [14]:
classifier.show_most_informative_features(10) 

   5.984 last_2_letters=='ua' and label is 'male'
  -5.723 last_letter=='k' and label is 'female'
   4.860 last_2_letters=='bs' and label is 'female'
   4.810 first_2_letters=='ll' and label is 'male'
  -4.793 last_2_letters=='ko' and label is 'male'
   4.724 last_2_letters=='oz' and label is 'female'
   4.571 last_2_letters=='ru' and label is 'female'
   4.515 last_2_letters=='ok' and label is 'female'
  -4.503 last_letter=='f' and label is 'female'
   4.252 first_2_letters=='dw' and label is 'male'
