# Data 620 Project 3
## Classification
Jit Seneviratne and Sheryl Piechocki  
June 29, 2020

**Dataset:**
The data used in this project is the `names` corpus included in the NLTK package.

**Analysis:** 
The names corpus is split into train, dev-test, and test subsets, and an initial classification is performed with NLTK's maximum entropy classifier using a single feature.  Two more features are then added to improve the maximum entropy classifier.  Finally, further features are engineered and sklearn's Logistic Regression, Naive Bayes, and Random Forest classifiers are applied.  Accuracy and confusion matrices are reported.

In [1]:
from nltk.corpus import names
import random
from nltk.classify import apply_features
from plotly.offline import init_notebook_mode, plot, iplot
%matplotlib inline
import pandas as pd
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
init_notebook_mode(connected=True)
import nltk
import re
import pprint
from nltk import word_tokenize
import string
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

### Investigate the NLTK Names Corpus

In [2]:
# Build (name, gender) pairs; note that this rebinds `names`, shadowing the imported corpus reader
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

In [3]:
#print(names[:10])
print('Count of total names in the corpus is: ' , (len(names)))

females = [(name, gender) for name, gender in names if gender == 'female']
print('Count of female names in the corpus is: ' , (len(females)))
males = [(name, gender) for name, gender in names if gender == 'male']
print('Count of male names in the corpus is: ' , (len(males)))

Count of total names in the corpus is:  7944
Count of female names in the corpus is:  5001
Count of male names in the corpus is:  2943


The corpus has more female names (~63%) than male names (~37%).
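As a rough point of reference for the accuracies reported later, the class counts above imply a majority-class baseline. The following minimal sketch hard-codes the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the printed output above
female_count = 5001
male_count = 2943
total = female_count + male_count

female_share = female_count / total
male_share = male_count / total

# A classifier that always predicts the majority class is right this often
majority_baseline = max(female_share, male_share)

print(f"female share: {female_share:.1%}")
print(f"male share: {male_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

A model scoring close to this baseline on a similarly distributed sample is doing little better than always guessing 'female'.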

### Remove any trailing spaces from names

In [4]:
# Remove whitespace (including any trailing spaces) from both the name and its label
names = [tuple("".join(field.split()) for field in pair) for pair in names]

### Split the names data set into train, test, and devtest  
The corpus is split into three subsets per the instructions.  
500 dev-test  
500 test  
6,944 train  

In [5]:
random.shuffle(names)
train_names = names[1000:]
devtest_names = names[500:1000]
test_names = names[:500]

In [6]:
print(len(train_names))
print(len(devtest_names))
print(len(test_names))

6944
500
500
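Because `random.shuffle` is called without a seed, the split (and therefore every accuracy below) changes from run to run. One possible way to make the split repeatable is sketched here; the helper `split_names`, its `seed` parameter, and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of the data with a fixed seed, then slice it into three subsets."""
    shuffled = list(labeled_names)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in with the same size as the corpus (hypothetical names and labels)
toy_names = [(f"name{i}", "female" if i % 3 else "male") for i in range(7944)]

train, devtest, test = split_names(toy_names)
print(len(train), len(devtest), len(test))
```

Calling `split_names` twice with the same seed returns identical subsets, which makes results easier to compare across runs.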


### Create dataframes for later use

In [7]:
train_name_df = pd.DataFrame(train_names)
train_name_df.columns = ['name', 'gender']
test_name_df = pd.DataFrame(test_names)
test_name_df.columns = ['name', 'gender']

### Last letter feature
Use the function provided in the book, which takes a name and returns a feature dictionary containing its last letter.

In [8]:
def gender_features(name):
    return {'last_letter': name[-1]}

### Run 1 - Max Entropy Classifier  
Feature: last letter.  
We started with the maximum entropy classifier because it does not assume the features are independent. (The Naive Bayes classifier assumes features are independent, which may be an unreasonable assumption for classifying gender from names.)  The maximum entropy classifier uses an iterative technique to maximize the likelihood of the training corpus.
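A maximum entropy model is a log-linear model: each (feature value, label) pair has a weight, the weights of the active pairs are summed per label, and the sums are exponentiated and normalized into probabilities. The sketch below illustrates only that scoring step, using hypothetical weights invented for the example. It does not reproduce NLTK's training procedure or its internal weight scaling.

```python
import math

# Hypothetical weights for (feature name, feature value, label) triples.
# These are made up for illustration, not taken from the trained model.
weights = {
    ("last_letter", "a", "female"): 1.5,
    ("last_letter", "a", "male"): -1.5,
    ("last_letter", "k", "female"): -2.0,
    ("last_letter", "k", "male"): 2.0,
}

def label_probabilities(features, labels=("female", "male")):
    """Sum the weights of the active features per label, then normalize."""
    scores = {
        label: sum(weights.get((name, value, label), 0.0)
                   for name, value in features.items())
        for label in labels
    }
    normalizer = sum(math.exp(score) for score in scores.values())
    return {label: math.exp(score) / normalizer for label, score in scores.items()}

probs = label_probabilities({"last_letter": "a"})
print(probs)
```

Training consists of adjusting the weights iteratively until the likelihood of the training labels stops improving; a feature value with no learned weight contributes nothing to either label.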

In [9]:
train_set = apply_features(gender_features, train_names)
test_set = apply_features(gender_features, test_names)
devtest_set = apply_features(gender_features, devtest_names)

classifier = nltk.MaxentClassifier.train(train_set, algorithm='iis', trace=0, max_iter=1000)
print('Run 1: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier, devtest_set)))

Run 1: Gender correctly identified:  72.2%


With just a single feature of last letter of the name, the dev-test set yields an accuracy of 72.2%.

### Run 1 - Most informative features  
With only the last-letter feature, the most informative feature is a last letter of 'c', which carries a positive weight for the male label.  A last letter of 'a' is also informative, with a negative weight for the male label.

In [10]:
classifier.show_most_informative_features(5)

   9.966 last_letter=='c' and label is 'male'
  -4.986 last_letter=='a' and label is 'male'
  -3.481 last_letter=='k' and label is 'female'
  -2.585 last_letter=='f' and label is 'female'
  -2.170 last_letter=='p' and label is 'female'


### Add additional features for first letter and length of the name


In [11]:
def gender_features2(name):
    features = {}
    features["last_letter"] = name[-1].lower()
    features["first_letter"] = name[0].lower()
    features["name_length"] = len(name)
    return features
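To make the feature dictionaries concrete, the sketch below repeats the function above in compact form and applies it to two names chosen purely for illustration.

```python
def gender_features2(name):
    """Same three features as the function above."""
    return {
        "last_letter": name[-1].lower(),
        "first_letter": name[0].lower(),
        "name_length": len(name),
    }

for example in ["Sheryl", "Jit"]:
    print(example, gender_features2(example))
```

Each dictionary is what the classifier sees in place of the raw name, so two names with the same first letter, last letter, and length are indistinguishable to it.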


### Run 2 - Max Entropy Classifier  
Features: last letter, first letter, length of name

In [12]:
train_set2 = apply_features(gender_features2, train_names)
test_set2 = apply_features(gender_features2, test_names)
devtest_set2 = apply_features(gender_features2, devtest_names)

#classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
classifier2 = nltk.MaxentClassifier.train(train_set2, algorithm='iis', trace=0, max_iter=1000)
print('Run 2: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier2, devtest_set2)))

Run 2: Gender correctly identified:  76.4%


The additional features of first letter and length of the name have increased the accuracy to 76.4%.

### Run 2 - Create Confusion Matrix

In [32]:
tag2 = []
guess2 = []
for (name, label) in devtest_names:
    observed2 = classifier2.classify(gender_features2(name))
    tag2.append(label)
    guess2.append(observed2)

print(nltk.ConfusionMatrix(tag2, guess2))

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<272> 55 |
  male |  63<110>|
-------+---------+
(row = reference; col = test)



The classifier identifies female names better than male names: 272 of 327 female names (83.2%) are classified correctly, compared with 110 of 173 male names (63.6%).  This could be because the corpus is skewed toward female names.
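The gap between the two classes can be quantified as per-class recall. The sketch below rebuilds label lists from the counts in the confusion matrix above and computes recall for each class; the helper `per_class_recall` is a hypothetical name introduced for this example.

```python
from collections import Counter

def per_class_recall(reference, predicted):
    """Fraction of each reference class that was predicted correctly."""
    totals = Counter(reference)
    correct = Counter(ref for ref, pred in zip(reference, predicted) if ref == pred)
    return {label: correct[label] / totals[label] for label in totals}

# Label lists rebuilt from the Run 2 confusion matrix (rows = reference, columns = test)
reference = ["female"] * 327 + ["male"] * 173
predicted = (["female"] * 272 + ["male"] * 55      # true female names
             + ["female"] * 63 + ["male"] * 110)   # true male names

recall = per_class_recall(reference, predicted)
for label, value in recall.items():
    print(f"{label} recall: {value:.1%}")
```

In the notebook itself, the same function could be applied directly to the `tag2` and `guess2` lists.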

### Run 2 - Check the errors

In [33]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier2.classify(gender_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
print(len(errors))

correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Alleen                        
correct=female   guess=male     name=Aryn                          
correct=female   guess=male     name=Bridgett                      
correct=female   guess=male     name=Clem                          
correct=female   guess=male     name=Cloris                        
correct=female   guess=male     name=Denys                         
correct=female   guess=male     name=Dian                          
correct=female   guess=male     name=Diann                         
correct=female   guess=male     name=Ellyn                         
correct=female   guess=male     name=Fawn                          
correct=female   guess=male     name=Francis                       
correct=female   guess=male     name=Gertrudis                     
correct=female   guess=male     name=Glenn                         
correct=female   guess=male     name=Harriett   

### Run 2 - Most informative features

In [15]:
classifier2.show_most_informative_features(5)

   9.097 name_length==15 and label is 'female'
   8.835 last_letter=='c' and label is 'male'
  -5.392 last_letter=='a' and label is 'male'
  -4.085 last_letter=='k' and label is 'female'
  -2.907 last_letter=='f' and label is 'female'


The most informative feature is now a name length of 15, which is positive for the female label.  A last letter of 'c' remains important and is positive for the male label, while a last letter of 'a' is negative for the male label.

### Test the Maximum Entropy Classifier  
Now use the test set to get the accuracy of the classifier.

In [16]:
print('Maximum Entropy test: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier2, test_set2)))

Maximum Entropy test: Gender correctly identified:  75.8%


On the test set, the maximum entropy classifier with first letter, last letter, and name length as features achieves an accuracy of 75.8%.

### Add additional features - counts of "a", "i", "o", "y" and create dummy columns for first and last letter features

In [17]:
# Categorical features: last and first letter (case is preserved, so first letters are uppercase)
train_name_df['last_letter'] = train_name_df['name'].apply(lambda x: x[-1])
train_name_df['first_letter'] = train_name_df['name'].apply(lambda x: x[0])
# Numeric features: name length and counts of selected letters
# (the counts are case-sensitive, so an initial capital such as 'A' is not counted)
train_name_df['len_name'] = train_name_df['name'].apply(lambda x: len(x))
train_name_df['a_count'] = train_name_df['name'].apply(lambda x: len(re.findall('a',x)))
train_name_df['i_count'] = train_name_df['name'].apply(lambda x: len(re.findall('i',x)))
train_name_df['o_count'] = train_name_df['name'].apply(lambda x: len(re.findall('o',x)))
train_name_df['y_count'] = train_name_df['name'].apply(lambda x: len(re.findall('y',x)))
# One-hot encode the two categorical features
train_name_df = pd.get_dummies(train_name_df, columns=['last_letter','first_letter'])

In [18]:
train_name_df.head()

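|  | name | gender | len_name | a_count | i_count | o_count | y_count | last_letter_a | last_letter_b | last_letter_c | ... | first_letter_Q | first_letter_R | first_letter_S | first_letter_T | first_letter_U | first_letter_V | first_letter_W | first_letter_X | first_letter_Y | first_letter_Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lucila | female | 6 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Therine | female | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Reyna | female | 5 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Tim | female | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Marcia | female | 6 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |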


In [19]:
test_name_df['last_letter'] = test_name_df['name'].apply(lambda x: x[-1])
test_name_df['first_letter'] = test_name_df['name'].apply(lambda x: x[0])
test_name_df['len_name'] = test_name_df['name'].apply(lambda x: len(x))
test_name_df['a_count'] = test_name_df['name'].apply(lambda x: len(re.findall('a',x)))
test_name_df['i_count'] = test_name_df['name'].apply(lambda x: len(re.findall('i',x)))
test_name_df['o_count'] = test_name_df['name'].apply(lambda x: len(re.findall('o',x)))
test_name_df['y_count'] = test_name_df['name'].apply(lambda x: len(re.findall('y',x)))
test_name_df = pd.get_dummies(test_name_df, columns=['last_letter','first_letter'])

In [20]:
test_name_df = test_name_df.reindex(columns = train_name_df.columns, fill_value=0)
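The `reindex` call matters because `pd.get_dummies` only creates columns for letters that actually occur, so the test frame can end up with different dummy columns than the training frame. The toy example below, with made-up names, shows a training-only column being filled with 0 and a test-only column being dropped.

```python
import pandas as pd

# Toy frames with made-up names; 'Q' and 'Z' occur only in train, 'X' only in test
train = pd.DataFrame({"name": ["Anna", "Zed", "Quinn"]})
test = pd.DataFrame({"name": ["Abe", "Xena"]})

for frame in (train, test):
    frame["first_letter"] = frame["name"].str[0]

train_dummies = pd.get_dummies(train, columns=["first_letter"])
test_dummies = pd.get_dummies(test, columns=["first_letter"])

# Align the test columns to the training columns, as in the cell above
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(aligned.columns))
```

After alignment both frames have the same columns in the same order, which is what the sklearn estimators used below expect at prediction time.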

In [21]:
test_name_df.head()

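|  | name | gender | len_name | a_count | i_count | o_count | y_count | last_letter_a | last_letter_b | last_letter_c | ... | first_letter_Q | first_letter_R | first_letter_S | first_letter_T | first_letter_U | first_letter_V | first_letter_W | first_letter_X | first_letter_Y | first_letter_Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Worth | male | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | Caty | female | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Patrizia | female | 8 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Artur | male | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Curtis | male | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |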


### Run 3 - Logistic Classifier with additional features

In [22]:
def logistic_classifier(train_df, test_df):
    
    X_train = train_df.drop(labels=['gender','name'],
                axis=1)
        
    y_train = train_df['gender']
    
    X_test = test_df.drop(labels=['gender','name'],
                axis=1)
        
    y_test = test_df['gender']
    
    lr = LogisticRegression(penalty='l2',
                            dual=False,
                            max_iter=1000,
                            tol=.0001)
    
    pipeline = Pipeline(steps=[('logistic', lr)])
    
    param_grid = {
    'logistic__C': np.arange(1, 50, 10)}

    cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)

    search = GridSearchCV(pipeline, 
                          param_grid,
                          cv=cv,
                          return_train_score=True
                          )

    search.fit(X_train,y_train)

    y_pred = search.predict(X_test)
    
    print('Run 3: Logistic Classifier')
    print('Train: Gender correctly identified: ', "{:.1%}".format(search.score(X_train, y_train)))
    print('Test: Gender correctly identified: ', "{:.1%}".format(search.score(X_test, y_test)))
    print('')
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))
    

In [23]:
logistic_classifier(train_name_df, test_name_df)

Run 3: Logistic Classifier
Train: Gender correctly identified:  79.0%
Test: Gender correctly identified:  76.0%

Confusion Matrix
[[259  43]
 [ 77 121]]


### Run 4 - Naive Bayes Classifier with additional features

In [24]:
def m_naive_bayes_classifier(train_df, test_df):
    
    X_train = train_df.drop(labels=['gender','name'],
                axis=1)
        
    y_train = train_df['gender']
    
    X_test = test_df.drop(labels=['gender','name'],
                axis=1)
        
    y_test = test_df['gender']
    
    nb = MultinomialNB()
    
    # Wrap the classifier in a single-step pipeline so the grid can address it as nb__alpha
    pipeline = Pipeline(steps=[('nb', nb)])
    
    param_grid = {
   'nb__alpha': np.arange(.1, 1, .2)}

    cv = ShuffleSplit(n_splits=10, test_size=.3)

    search = GridSearchCV(pipeline, 
                          param_grid,
                          cv=cv,
                          return_train_score=True
                          )

    search.fit(X_train,y_train)

    y_pred = search.predict(X_test)
    
    
    print('Run 4: Naive Bayes Classifier')
    print('Train: Gender correctly identified: ', "{:.1%}".format(search.score(X_train, y_train)))
    print('Test: Gender correctly identified: ', "{:.1%}".format(search.score(X_test, y_test)))
    print('')
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))

In [25]:
m_naive_bayes_classifier(train_name_df, test_name_df)

Run 4: Naive Bayes Classifier
Train: Gender correctly identified:  78.1%
Test: Gender correctly identified:  74.4%

Confusion Matrix
[[250  52]
 [ 76 122]]


### Run 5 - Random Forest Classifier with additional features

In [26]:
def random_forest_classifier(train_df, test_df):
    
    X_train = train_df.drop(labels=['gender','name'],
                axis=1)
        
    y_train = train_df['gender']
    
    X_test = test_df.drop(labels=['gender','name'],
                axis=1)
        
    y_test = test_df['gender']
    
    rf = RandomForestClassifier()

    # Wrap the classifier in a single-step pipeline so the grid can address its rf__ parameters
    pipeline = Pipeline(steps=[('rf', rf)])
    
    param_grid = {
    'rf__min_samples_split': np.arange(2, 10, 2),
    'rf__n_estimators': np.arange(10, 20, 5)}

    cv = ShuffleSplit(n_splits=10, test_size=.3)

    search = GridSearchCV(pipeline, 
                          param_grid,
                          cv=cv,
                          return_train_score=True
                          )

    search.fit(X_train,y_train)
    y_pred = search.predict(X_test)
    print('Run 5: Random Forest Classifier')
    print('Train: Gender correctly identified: ', "{:.1%}".format(search.score(X_train, y_train)))
    print('Test: Gender correctly identified: ', "{:.1%}".format(search.score(X_test, y_test)))
    print('')
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))

In [27]:
random_forest_classifier(train_name_df, test_name_df)

Run 5: Random Forest Classifier
Train: Gender correctly identified:  87.1%
Test: Gender correctly identified:  76.0%

Confusion Matrix
[[246  56]
 [ 64 134]]


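In all three sklearn runs, the minority class (male) is misclassified more often than the majority class: between 64 and 77 of the 198 male test names are predicted as female.  To address this, we upsample the minority class in the training data and rerun the three classifiers.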

In [50]:
df_minority = train_name_df[train_name_df['gender']=='male']

df_minority_upsampled = resample(df_minority, 
                                 replace=True,
                                 n_samples=2000,
                                 random_state=123) 

df_upsampled = pd.concat([train_name_df, 
                          df_minority_upsampled])
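Upsampling draws extra minority-class rows with replacement and appends them to the training data, leaving the test data untouched. The sketch below shows the idea on a toy list using only the standard library; `random.choices` stands in for `sklearn.utils.resample`, and the data and sample sizes are made up for illustration.

```python
import random
from collections import Counter

rng = random.Random(123)

# Toy training set: 60 'female' rows and 40 'male' rows (hypothetical data)
train = ([(f"f{i}", "female") for i in range(60)]
         + [(f"m{i}", "male") for i in range(40)])

minority = [row for row in train if row[1] == "male"]

# Draw 20 extra minority rows with replacement and append them
extra = rng.choices(minority, k=20)
upsampled = train + extra

print(Counter(label for _, label in train))
print(Counter(label for _, label in upsampled))
```

The added rows are duplicates of existing names, so upsampling changes how heavily the minority class is weighted during fitting without adding new information.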

In [51]:
logistic_classifier(df_upsampled, test_name_df)

Run 3: Logistic Classifier
Train: Gender correctly identified:  78.4%
Test: Gender correctly identified:  73.4%

Confusion Matrix
[[213  89]
 [ 44 154]]


In [52]:
m_naive_bayes_classifier(df_upsampled, test_name_df)

Run 4: Naive Bayes Classifier
Train: Gender correctly identified:  77.5%
Test: Gender correctly identified:  71.0%

Confusion Matrix
[[201 101]
 [ 44 154]]


In [53]:
random_forest_classifier(df_upsampled, test_name_df)

Run 5: Random Forest Classifier
Train: Gender correctly identified:  87.9%
Test: Gender correctly identified:  73.0%

Confusion Matrix
[[216  86]
 [ 49 149]]
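To compare the runs before and after upsampling, the sketch below recomputes accuracy and per-class recall from the confusion matrices printed above. It assumes the usual sklearn layout, with rows as true labels and columns as predicted labels in alphabetical order (female, male); the helper `summarize` is a hypothetical name introduced here.

```python
# Confusion matrices copied from the outputs above: [[ff, fm], [mf, mm]]
results = {
    "Logistic Regression": {"original": [[259, 43], [77, 121]],
                            "upsampled": [[213, 89], [44, 154]]},
    "Naive Bayes": {"original": [[250, 52], [76, 122]],
                    "upsampled": [[201, 101], [44, 154]]},
    "Random Forest": {"original": [[246, 56], [64, 134]],
                      "upsampled": [[216, 86], [49, 149]]},
}

def summarize(matrix):
    """Accuracy and per-class recall for a 2x2 matrix (rows = true, columns = predicted)."""
    (ff, fm), (mf, mm) = matrix
    total = ff + fm + mf + mm
    return {
        "accuracy": (ff + mm) / total,
        "female_recall": ff / (ff + fm),
        "male_recall": mm / (mf + mm),
    }

summary = {model: {run: summarize(matrix) for run, matrix in runs.items()}
           for model, runs in results.items()}

for model, runs in summary.items():
    for run, stats in runs.items():
        print(f"{model:20s} {run:10s} "
              f"accuracy={stats['accuracy']:.1%} "
              f"female recall={stats['female_recall']:.1%} "
              f"male recall={stats['male_recall']:.1%}")
```

On these matrices, male recall rises for all three classifiers after upsampling (for logistic regression, from 121 to 154 of 198 male names), while female recall and overall test accuracy fall.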
