# Data 620 Project 3
## Classification
Jit Seneviratne and Sheryl Piechocki 
June 29, 2020

Using any of the three classifiers described in Chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

**Dataset**

The data used in this project is the Names Corpus included with the NLTK package.

**Analysis:**


In [24]:
# Standard library
import pprint
import random
import re
import string

# Third-party
%matplotlib inline
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly as py
import plotly.graph_objs as go
from nltk import word_tokenize
from nltk.classify import apply_features
from nltk.corpus import names
from plotly.offline import init_notebook_mode, plot, iplot
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB

init_notebook_mode(connected=True)

### Investigate the NLTK Names Corpus

In [2]:
# Build (name, gender) pairs; this rebinds `names` from the corpus reader to a list
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

In [3]:
#print(names[:10])
print('Count of total names in the corpus is: ' , (len(names)))

females = [(name, gender) for name, gender in names if gender == 'female']
print('Count of female names in the corpus is: ' , (len(females)))
males = [(name, gender) for name, gender in names if gender == 'male']
print('Count of male names in the corpus is: ' , (len(males)))

Count of total names in the corpus is:  7944
Count of female names in the corpus is:  5001
Count of male names in the corpus is:  2943


### Remove any trailing spaces from names

In [4]:
# split() with no arguments drops all whitespace, so rejoining removes trailing spaces
names = [("".join(name.split()), gender) for name, gender in names]

### Split the names data set into train, test, and devtest

In [5]:
random.shuffle(names)
train_names = names[1000:]
devtest_names = names[500:1000]
test_names = names[:500]

In [12]:
print(len(train_names))
print(len(devtest_names))
print(len(test_names))

6944
500
500
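
Because `random.shuffle` is called without a seed, rerunning the notebook produces a different split and slightly different accuracies. A minimal sketch of a reproducible split follows. The helper name `split_names`, the seed value, and the placeholder name list are illustrative assumptions and are not part of the original notebook.

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of the data with a fixed seed and slice it into three subsets."""
    shuffled = list(labeled_names)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # local RNG: the same seed gives the same order
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Placeholder stand-in for the corpus: 7944 unique dummy names (hypothetical data)
toy_names = [("name%d" % i, "female" if i % 2 else "male") for i in range(7944)]
train, devtest, test = split_names(toy_names)
print(len(train), len(devtest), len(test))
```

Using a local `random.Random` instance keeps the split stable without changing the global random state that other cells may rely on.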


### Create dataframes for later use

In [38]:
train_name_df = pd.DataFrame(train_names)
train_name_df.columns = ['name', 'gender']
test_name_df = pd.DataFrame(test_names)
test_name_df.columns = ['name', 'gender']

### Last letter feature
Use the function provided in the book, which takes a name and returns a feature dictionary holding its last letter.

In [6]:
def gender_features(name):
    return {'last_letter': name[-1]}

### Run 1 - Maximum Entropy Classifier  
Feature: last letter.


In [7]:
train_set = apply_features(gender_features, train_names)
test_set = apply_features(gender_features, test_names)
devtest_set = apply_features(gender_features, devtest_names)

#classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier = nltk.MaxentClassifier.train(train_set, algorithm='iis', trace=0, max_iter=1000)
print('Run 1: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier, test_set)))

In [8]:
print('Run 1: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier, test_set)))

Run 1: Gender correctly identified:  77.4%


### Run 1 - Most informative features  
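Using only the last-letter feature, the classifier gives its largest positive weights (9.966) to names ending in 'c' or 'p' paired with the label male, so those endings are the strongest indicators of a male name. A final 'a' has a large negative weight for male (-4.824), meaning it counts strongly against that label.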

In [9]:
classifier.show_most_informative_features(5)

   9.966 last_letter=='c' and label is 'male'
   9.966 last_letter=='p' and label is 'male'
  -4.824 last_letter=='a' and label is 'male'
  -3.392 last_letter=='k' and label is 'female'
  -2.644 last_letter=='f' and label is 'female'


### Add additional features for first letter and length of the name


In [10]:
def gender_features2(name):
    features = {}
    features["last_letter"] = name[-1].lower()
    features["first_letter"] = name[0].lower()
    features["name_length"] = len(name)
    return features
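
To make the feature set concrete, the sketch below repeats the extractor and prints the dictionary the classifier receives for two sample names. The sample names are chosen only for illustration.

```python
def gender_features2(name):
    # Same three features as the extractor defined above
    features = {}
    features["last_letter"] = name[-1].lower()
    features["first_letter"] = name[0].lower()
    features["name_length"] = len(name)
    return features

for example in ["Ulrich", "Modesty"]:
    print(example, gender_features2(example))
```

NLTK classifiers treat each feature name and value pair as its own indicator, so `name_length` acts as a categorical value (one indicator per distinct length) rather than as a number on a scale.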


### Run 2 - Maximum Entropy Classifier  
Features: last letter, first letter, length of name

In [11]:

train_set2 = apply_features(gender_features2, train_names)
test_set2 = apply_features(gender_features2, test_names)
devtest_set2 = apply_features(gender_features2, devtest_names)

#classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
classifier2 = nltk.MaxentClassifier.train(train_set2, algorithm='iis', trace=0, max_iter=1000)
print('Run 2: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier2, test_set2)))

Run 2: Gender correctly identified:  79.0%


### Run 2 - Create Confusion Matrix

In [61]:
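# Collect reference labels and classifier predictions for the test set
tag2 = []
guess2 = []
for (name, label) in test_names:
    observed2 = classifier2.classify(gender_features2(name))
    tag2.append(label)
    guess2.append(observed2)

# Rows are reference labels; columns are predicted labels
print(nltk.ConfusionMatrix(tag2, guess2))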

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<253> 53 |
  male |  52<142>|
-------+---------+
(row = reference; col = test)
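
The counts in this matrix are enough to recompute the accuracy and to derive per-class precision and recall. The sketch below does that arithmetic by hand. The dictionary layout and variable names are illustrative choices, and only the four counts come from the matrix above.

```python
# Counts read from the Run 2 confusion matrix: keys are (reference, predicted)
matrix = {
    ("female", "female"): 253, ("female", "male"): 53,
    ("male", "female"): 52,    ("male", "male"): 142,
}
labels = ["female", "male"]

total = sum(matrix.values())
accuracy = sum(matrix[(label, label)] for label in labels) / total
print("accuracy: {:.1%}".format(accuracy))

metrics = {}
for label in labels:
    predicted = sum(matrix[(ref, label)] for ref in labels)   # column total
    actual = sum(matrix[(label, pred)] for pred in labels)    # row total
    precision = matrix[(label, label)] / predicted
    recall = matrix[(label, label)] / actual
    metrics[label] = (precision, recall)
    print("{:6s} precision={:.3f} recall={:.3f}".format(label, precision, recall))
```

The recomputed accuracy, 395 correct out of 500, matches the 79.0% reported for Run 2. Recall is lower for male names (142 of 194) than for female names (253 of 306).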



### Run 2 - Check the errors

In [14]:
# Collect the dev-test names that classifier2 labels incorrectly
errors = []
for (name, tag) in devtest_names:
    guess = classifier2.classify(gender_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

# List each error, then print the total number of errors
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
print(len(errors))

correct=female   guess=male     name=Arlyn                         
correct=female   guess=male     name=Astrix                        
correct=female   guess=male     name=Bert                          
correct=female   guess=male     name=Bird                          
correct=female   guess=male     name=Bo                            
correct=female   guess=male     name=Brandais                      
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Cher                          
correct=female   guess=male     name=Consuelo                      
correct=female   guess=male     name=Deb                           
correct=female   guess=male     name=Demeter                       
correct=female   guess=male     name=Devin                         
correct=female   guess=male     name=Drew                          
correct=female   guess=male     name=Ester                         
correct=female   guess=male     name=Gabriell   
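
One way to turn an error listing like this into feature ideas is to tally the endings of the misclassified names. The sketch below does so for only the fifteen names visible in the truncated output above, so the counts are illustrative and do not summarize the full error set. The variable names are hypothetical.

```python
from collections import Counter

# Female names that classifier2 labeled male, copied from the visible output
misclassified = ["Arlyn", "Astrix", "Bert", "Bird", "Bo", "Brandais", "Charo",
                 "Cher", "Consuelo", "Deb", "Demeter", "Devin", "Drew", "Ester",
                 "Gabriell"]

# Tally the final letter and the final two letters of each name
last_letter_counts = Counter(name[-1].lower() for name in misclassified)
suffix2_counts = Counter(name[-2:].lower() for name in misclassified)

print(last_letter_counts.most_common(3))
print(suffix2_counts.most_common(3))
```

Endings that recur among the errors, such as a final 'o' or 'r' in this sample, are candidates for longer suffix features in a later run.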

### Run 2 - Most informative features

In [15]:
classifier2.show_most_informative_features(5)

   9.351 name_length==14 and label is 'female'
   8.777 name_length==13 and label is 'female'
   8.742 last_letter=='c' and label is 'male'
   8.513 last_letter=='p' and label is 'male'
  -5.240 last_letter=='a' and label is 'male'


### Add additional features - counts of "a", "i", "o", "y" and create dummy columns for first and last letter features

In [39]:
# Per-name features for the training set
train_name_df['last_letter'] = train_name_df['name'].apply(lambda x: x[-1])
train_name_df['first_letter'] = train_name_df['name'].apply(lambda x: x[0])
train_name_df['len_name'] = train_name_df['name'].apply(len)
# Letter counts match lowercase occurrences only, so a leading capital is not counted
train_name_df['a_count'] = train_name_df['name'].apply(lambda x: len(re.findall('a',x)))
train_name_df['i_count'] = train_name_df['name'].apply(lambda x: len(re.findall('i',x)))
train_name_df['o_count'] = train_name_df['name'].apply(lambda x: len(re.findall('o',x)))
train_name_df['y_count'] = train_name_df['name'].apply(lambda x: len(re.findall('y',x)))
# One-hot encode the first and last letters
train_name_df = pd.get_dummies(train_name_df, columns=['last_letter','first_letter'])
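
The letter counts above use a case-sensitive pattern, so a capitalized first letter is not included in the count. The sketch below contrasts that behavior with a case-folded variant. The function names are hypothetical, and the variant is an alternative to consider; the results reported in this notebook were not produced with it.

```python
import re

def letter_count_case_sensitive(name, letter):
    # Mirrors the notebook's counting: matches the lowercase letter only
    return len(re.findall(letter, name))

def letter_count_any_case(name, letter):
    # Hypothetical variant: fold case first so a leading capital is counted
    return name.lower().count(letter.lower())

for name, letter in [("Orville", "o"), ("Astrix", "a"), ("Wyatan", "a")]:
    print(name, letter,
          letter_count_case_sensitive(name, letter),
          letter_count_any_case(name, letter))
```

The two counts differ only for names that begin with the counted letter, which the `first_letter` dummy columns already capture in part.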

In [40]:
train_name_df.head()

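| | name | gender | len_name | a_count | i_count | o_count | y_count | last_letter_a | last_letter_b | last_letter_c | ... | first_letter_Q | first_letter_R | first_letter_S | first_letter_T | first_letter_U | first_letter_V | first_letter_W | first_letter_X | first_letter_Y | first_letter_Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wyatan | male | 6 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | Woodman | male | 7 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | Chuck | male | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | James | male | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Ulrich | male | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |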


In [41]:
# Per-name features for the test set, built the same way as for the training set
test_name_df['last_letter'] = test_name_df['name'].apply(lambda x: x[-1])
test_name_df['first_letter'] = test_name_df['name'].apply(lambda x: x[0])
test_name_df['len_name'] = test_name_df['name'].apply(len)
# Letter counts match lowercase occurrences only, so a leading capital is not counted
test_name_df['a_count'] = test_name_df['name'].apply(lambda x: len(re.findall('a',x)))
test_name_df['i_count'] = test_name_df['name'].apply(lambda x: len(re.findall('i',x)))
test_name_df['o_count'] = test_name_df['name'].apply(lambda x: len(re.findall('o',x)))
test_name_df['y_count'] = test_name_df['name'].apply(lambda x: len(re.findall('y',x)))
# One-hot encode the first and last letters
test_name_df = pd.get_dummies(test_name_df, columns=['last_letter','first_letter'])

In [42]:
test_name_df = test_name_df.reindex(columns = train_name_df.columns, fill_value=0)

In [43]:
test_name_df.head()

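| | name | gender | len_name | a_count | i_count | o_count | y_count | last_letter_a | last_letter_b | last_letter_c | ... | first_letter_Q | first_letter_R | first_letter_S | first_letter_T | first_letter_U | first_letter_V | first_letter_W | first_letter_X | first_letter_Y | first_letter_Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Orville | male | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Edeline | female | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Shandie | female | 7 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Modesty | female | 7 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Netti | female | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |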


### Run 3 - Logistic Classifier with additional features

In [62]:
def logistic_classifier(train_df, test_df):
    """Fit a logistic regression on train_df and report accuracy on test_df."""
    # Features are every column except the label and the raw name
    X_train = train_df.drop(labels=['gender', 'name'], axis=1)
    y_train = train_df['gender']

    X_test = test_df.drop(labels=['gender', 'name'], axis=1)
    y_test = test_df['gender']

    lr = LogisticRegression(penalty='l2',
                            dual=False,
                            max_iter=1000,
                            tol=.0001,
                            C=1)

    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    print('Run 3: Gender correctly identified: ', "{:.1%}".format(lr.score(X_test, y_test)))
    #print('Train Accuracy',lr.score(X_train, y_train))
    print("")
    # Rows are true labels and columns are predictions, in sorted label order (female, male)
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))

In [63]:
logistic_classifier(train_name_df, test_name_df)

Run 3: Gender correctly identified:  80.0%

Confusion Matrix
[[257  49]
 [ 51 143]]


### Run 4 - Naive Bayes Classifier with additional features

In [64]:
def m_naive_bayes_classifier(train_df, test_df):
    """Fit a multinomial naive Bayes model on train_df and report accuracy on test_df."""
    # Features are every column except the label and the raw name
    X_train = train_df.drop(labels=['gender', 'name'], axis=1)
    y_train = train_df['gender']

    X_test = test_df.drop(labels=['gender', 'name'], axis=1)
    y_test = test_df['gender']

    nb = MultinomialNB()

    nb.fit(X_train, y_train)
    y_pred = nb.predict(X_test)
    print('Run 4: Gender correctly identified: ', "{:.1%}".format(nb.score(X_test, y_test)))
    #print('Train Accuracy',nb.score(X_train, y_train))
    print("")
    # Rows are true labels and columns are predictions, in sorted label order (female, male)
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))

In [65]:
m_naive_bayes_classifier(train_name_df, test_name_df)

Run 4: Gender correctly identified:  79.2%

Confusion Matrix
[[254  52]
 [ 52 142]]


### Run 5 - Random Forest Classifier with additional features

In [66]:
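def random_forest_classifier(train_df, test_df):
    """Fit a random forest on train_df and report accuracy on test_df."""
    # Features are every column except the label and the raw name
    X_train = train_df.drop(labels=['gender', 'name'], axis=1)
    y_train = train_df['gender']

    X_test = test_df.drop(labels=['gender', 'name'], axis=1)
    y_test = test_df['gender']

    # No random_state is set, so results vary slightly from run to run
    rf = RandomForestClassifier(n_estimators=20,
                                min_samples_split=10)

    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    print('Run 5: Gender correctly identified: ', "{:.1%}".format(rf.score(X_test, y_test)))
    #print('Train Accuracy',rf.score(X_train, y_train))
    print("")
    # Rows are true labels and columns are predictions, in sorted label order (female, male)
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))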

In [67]:
random_forest_classifier(train_name_df, test_name_df)

Run 5: Gender correctly identified:  78.8%

Confusion Matrix
[[256  50]
 [ 56 138]]
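
When comparing the five runs, it helps to estimate how much an accuracy measured on 500 names can vary by chance. The sketch below computes a rough 95% interval for each reported accuracy using the binomial standard error with a normal approximation. That approximation and the variable names are assumptions made for illustration, and the calculation ignores the fact that all runs share one test set.

```python
import math

n_test = 500  # size of the test set used for every run
reported = {"Run 1": 0.774, "Run 2": 0.790, "Run 3": 0.800,
            "Run 4": 0.792, "Run 5": 0.788}

half_widths = {}
for run, acc in reported.items():
    # Binomial standard error of a proportion (normal approximation, an assumption)
    std_err = math.sqrt(acc * (1 - acc) / n_test)
    half_widths[run] = 1.96 * std_err
    print("{}: {:.1%} +/- {:.1%}".format(run, acc, half_widths[run]))

spread = max(reported.values()) - min(reported.values())
print("spread between best and worst run: {:.1%}".format(spread))
```

Under these assumptions each interval is roughly 3.5 percentage points on either side, which is wider than the 2.6-point gap between the best and worst run. The ranking of the five runs on this test set should therefore be read cautiously.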
