# Data 620 Project 3
## Classification
Jit Seneviratne and Sheryl Piechocki  
June 29, 2020

**Dataset:** 
The data used in this project is the names corpus included in the NLTK package.

**Analysis:** 
After the names corpus is split into train, dev-test, and test subsets, initial classification using NLTK's maximum entropy classifier is performed with one feature.  Additional features are added to improve the maximum entropy classifier.  Further features are added and sklearn's Logistic Regression, Naive Bayes, and Random Forest classification techniques are attempted.  Accuracy and confusion matrices are produced.  

In [1]:
from nltk.corpus import names
import random
from nltk.classify import apply_features
from plotly.offline import init_notebook_mode, plot, iplot
%matplotlib inline
import pandas as pd
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
init_notebook_mode(connected=True)
import nltk, re, pprint
from nltk import word_tokenize
import string
import re
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

### Investigate the NLTK Names Corpus

In [2]:
# Build (name, gender) pairs; note that this rebinds `names`, shadowing the imported corpus module
names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [3]:
#print(names[:10])
print('Count of total names in the corpus is: ' , (len(names)))

females = [(name, gender) for name, gender in names if gender == 'female']
print('Count of female names in the corpus is: ' , (len(females)))
males = [(name, gender) for name, gender in names if gender == 'male']
print('Count of male names in the corpus is: ' , (len(males)))

Count of total names in the corpus is:  7944
Count of female names in the corpus is:  5001
Count of male names in the corpus is:  2943


The corpus is skewed toward female names (~63%) over male names (~37%).
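
The class shares follow directly from the printed counts. A minimal check, using only the numbers reported in the output above (the variable names are illustrative and not part of the notebook):

```python
# Counts copied from the notebook output above.
counts = {"female": 5001, "male": 2943}
total = sum(counts.values())

# Share of each class in the corpus.
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a baseline that always predicts the majority class.
majority_baseline = max(shares.values())
print(f"majority-class baseline: {majority_baseline:.1%}")
```

The majority-class baseline is a useful floor: any classifier in the runs below should beat it to be worth using.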

### Remove any trailing spaces from names

In [4]:
# Remove all whitespace from each name (this also covers trailing spaces); labels are unchanged
names = [("".join(name.split()), gender) for name, gender in names]

### Split the names data set into train, test, and devtest  
The corpus is split into three subsets per the instructions.  
500 dev-test  
500 test  
6,944 train  

In [5]:
random.shuffle(names)
train_names = names[1000:]
devtest_names = names[500:1000]
test_names = names[:500]

In [6]:
print(len(train_names))
print(len(devtest_names))
print(len(test_names))

6944
500
500
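
Because `random.shuffle` is called without a seed, each run of the notebook produces a different split, so accuracy figures can vary from run to run. One possible way to make the split repeatable is sketched below. The helper `split_names`, its `seed` argument, and the stand-in data are assumptions for illustration, not part of the original notebook.

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=0):
    """Shuffle a copy of the data with a fixed seed and split it three ways."""
    rng = random.Random(seed)          # local generator, global state untouched
    shuffled = list(labeled_names)     # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Stand-in data with the same size as the corpus (7,944 labeled items).
demo = [(f"name{i}", "female" if i % 3 else "male") for i in range(7944)]
train, devtest, test = split_names(demo, seed=0)
print(len(train), len(devtest), len(test))
```

Calling `split_names` again with the same seed returns the same three subsets, which keeps reported accuracies and the surrounding commentary in sync.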


### Create dataframes for later use

In [7]:
train_name_df = pd.DataFrame(train_names)
train_name_df.columns = ['name', 'gender']
test_name_df = pd.DataFrame(test_names)
test_name_df.columns = ['name', 'gender']

### Last letter feature
Use the function provided in the NLTK book, which takes a name and returns its last letter as a feature dictionary.

In [8]:
def gender_features(name):
    return {'last_letter': name[-1]}
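
NLTK's `apply_features` maps a feature function lazily over labeled items and yields `(feature dict, label)` pairs. The sketch below is an eager, plain-Python equivalent on a small hypothetical sample, meant only to show the shape of the data the classifier receives.

```python
def gender_features(name):
    """Same feature function as above: the last letter of the name."""
    return {"last_letter": name[-1]}

# Hypothetical sample; the notebook uses the shuffled corpus instead.
sample = [("Neo", "male"), ("Trinity", "female"), ("Anna", "female")]

# Eager equivalent of apply_features(gender_features, sample).
featuresets = [(gender_features(name), gender) for name, gender in sample]

for features, gender in featuresets:
    print(features, gender)
```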

### Run 1 - Max Entropy Classifier  
Feature: last letter.  
We started with the maximum entropy classifier because it does not assume the features are independent. (The Naive Bayes classifier assumes features are independent, which may be an unreasonable assumption for classifying gender from names.)  The maximum entropy classifier uses an iterative technique (here improved iterative scaling, `algorithm='iis'`) to maximize the likelihood of the training corpus.

In [9]:
train_set = apply_features(gender_features, train_names)
test_set = apply_features(gender_features, test_names)
devtest_set = apply_features(gender_features, devtest_names)

classifier = nltk.MaxentClassifier.train(train_set, algorithm='iis', trace=0, max_iter=1000)
print('Run 1: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier, devtest_set)))

Run 1: Gender correctly identified:  76.8%


With just a single feature, the last letter of the name, the classifier reaches an accuracy of 76.8% on the dev-test set.

### Run 1 - Most informative features  
Using the last letter feature, the classifier ranks names ending in 'j' and 'c' as the most informative features (tied), both positive for males.  Last letter 'a' is also important, but is negative for males.

In [10]:
classifier.show_most_informative_features(5)

   9.966 last_letter=='j' and label is 'male'
   9.966 last_letter=='c' and label is 'male'
  -4.852 last_letter=='a' and label is 'male'
  -3.503 last_letter=='k' and label is 'female'
  -2.907 last_letter=='v' and label is 'female'


### Add additional features for first letter and length of the name


In [11]:
def gender_features2(name):
    features = {}
    features["last_letter"] = name[-1].lower()
    features["first_letter"] = name[0].lower()
    features["name_length"] = len(name)
    return features


### Run 2 - Max Entropy Classifier  
Features: last letter, first letter, length of name

In [12]:
train_set2 = apply_features(gender_features2, train_names)
test_set2 = apply_features(gender_features2, test_names)
devtest_set2 = apply_features(gender_features2, devtest_names)

#classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
classifier2 = nltk.MaxentClassifier.train(train_set2, algorithm='iis', trace=0, max_iter=1000)
print('Run 2: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier2, devtest_set2)))

Run 2: Gender correctly identified:  80.6%


The additional features of first letter and length of the name have increased the dev-test accuracy from 76.8% to 80.6%.

### Run 2 - Create Confusion Matrix

In [13]:
tag2 = []
guess2 = []
for  (name, label) in devtest_names:
    observed2 = classifier2.classify(gender_features2(name))
    tag2.append(label)
    guess2.append(observed2)

print(nltk.ConfusionMatrix(tag2, guess2))

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<273> 48 |
  male |  49<130>|
-------+---------+
(row = reference; col = test)



The classifier is not as good at identifying male names as female names: it labels about 85% of female names correctly but only about 73% of male names.  This could be because the corpus is skewed toward female names.
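
The per-class rates can be read off the confusion matrix. A small sketch using the Run 2 counts printed above (the dictionary layout and the `recall` helper are illustrative choices):

```python
# Run 2 dev-test confusion matrix, copied from the output above.
# Keys are (reference label, predicted label).
matrix = {
    ("female", "female"): 273, ("female", "male"): 48,
    ("male", "female"): 49,    ("male", "male"): 130,
}

def recall(matrix, label):
    """Fraction of names whose true label is `label` that were predicted as `label`."""
    row_total = sum(n for (ref, _), n in matrix.items() if ref == label)
    return matrix[(label, label)] / row_total

accuracy = (matrix[("female", "female")] + matrix[("male", "male")]) / sum(matrix.values())
print(f"accuracy:      {accuracy:.1%}")
print(f"female recall: {recall(matrix, 'female'):.1%}")
print(f"male recall:   {recall(matrix, 'male'):.1%}")
```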

### Run 2 - Check the errors

In [14]:
errors = []
for (name, tag) in devtest_names:
    guess =  classifier2.classify(gender_features2(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

for (tag, guess, name) in sorted(errors): 
    print('correct=%-8s guess=%-8s name=%-30s'  %
          (tag, guess, name))
print(len(errors))

correct=female   guess=male     name=Adrian                        
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Alyson                        
correct=female   guess=male     name=Ariel                         
correct=female   guess=male     name=Bab                           
correct=female   guess=male     name=Barb                          
correct=female   guess=male     name=Blondell                      
correct=female   guess=male     name=Cameo                         
correct=female   guess=male     name=Cat                           
correct=female   guess=male     name=Cher                          
correct=female   guess=male     name=Delores                       
correct=female   guess=male     name=Doreen                        
correct=female   guess=male     name=Doris                         
correct=female   guess=male     name=Elyn                          
correct=female   guess=male     name=Faun       

### Run 2 - Most Important Features

In [15]:
classifier2.show_most_informative_features(5)

   9.267 name_length==14 and label is 'female'
   8.752 last_letter=='c' and label is 'male'
   8.324 last_letter=='j' and label is 'male'
  -5.297 last_letter=='a' and label is 'male'
  -4.052 last_letter=='k' and label is 'female'


Now the most informative feature is the length of the name: a name length of 14 is positive for females.  A last letter of 'c' is still important and is positive for males, while a last letter of 'a' is negative for males.

### Test the Maximum Entropy Classifier  
Now use the test set to get the accuracy of the classifier.

In [16]:
print('Maximum Entropy test: Gender correctly identified: ', "{:.1%}".format(nltk.classify.accuracy(classifier2, test_set2)))

Maximum Entropy test: Gender correctly identified:  78.8%


On the test set, the maximum entropy classifier with first letter, last letter, and name length as features achieves an accuracy of 78.8%.

### Add additional features - counts of "a", "i", "o", "y" and create dummy columns for first and last letter features

In [17]:
train_name_df['last_letter'] = train_name_df['name'].apply(lambda x: x[-1])
train_name_df['first_letter'] = train_name_df['name'].apply(lambda x: x[0])
train_name_df['len_name'] = train_name_df['name'].apply(lambda x: len(x))
train_name_df['a_count'] = train_name_df['name'].apply(lambda x: len(re.findall('a',x)))
train_name_df['i_count'] = train_name_df['name'].apply(lambda x: len(re.findall('i',x)))
train_name_df['o_count'] = train_name_df['name'].apply(lambda x: len(re.findall('o',x)))
train_name_df['y_count'] = train_name_df['name'].apply(lambda x: len(re.findall('y',x)))
train_name_df = pd.get_dummies(train_name_df, columns=['last_letter','first_letter'])

In [18]:
train_name_df.head()

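| | name | gender | len_name | a_count | i_count | o_count | y_count | last_letter_a | last_letter_b | last_letter_c | ... | first_letter_Q | first_letter_R | first_letter_S | first_letter_T | first_letter_U | first_letter_V | first_letter_W | first_letter_X | first_letter_Y | first_letter_Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Halvard | male | 7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Kirstie | female | 7 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Betsey | female | 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Marni | female | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Lucie | female | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |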
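
To make the engineered columns concrete, here is a plain-Python sketch of the same features for a single name, before the first and last letters are turned into dummy columns. The helper name `name_features` is hypothetical. Note that the letter counts are case-sensitive, as in the `re.findall` calls in the feature cell, so an initial capital "A" is not counted.

```python
def name_features(name):
    """Plain-Python version of the engineered columns for a single name."""
    return {
        "last_letter": name[-1],
        "first_letter": name[0],
        "len_name": len(name),
        # str.count is case-sensitive, like re.findall('a', name) in the feature cell
        "a_count": name.count("a"),
        "i_count": name.count("i"),
        "o_count": name.count("o"),
        "y_count": name.count("y"),
    }

print(name_features("Halvard"))
print(name_features("Anna"))
```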


In [19]:
test_name_df['last_letter'] = test_name_df['name'].apply(lambda x: x[-1])
test_name_df['first_letter'] = test_name_df['name'].apply(lambda x: x[0])
test_name_df['len_name'] = test_name_df['name'].apply(lambda x: len(x))
test_name_df['a_count'] = test_name_df['name'].apply(lambda x: len(re.findall('a',x)))
test_name_df['i_count'] = test_name_df['name'].apply(lambda x: len(re.findall('i',x)))
test_name_df['o_count'] = test_name_df['name'].apply(lambda x: len(re.findall('o',x)))
test_name_df['y_count'] = test_name_df['name'].apply(lambda x: len(re.findall('y',x)))
test_name_df = pd.get_dummies(test_name_df, columns=['last_letter','first_letter'])

In [20]:
test_name_df = test_name_df.reindex(columns = train_name_df.columns, fill_value=0)
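
Running `pd.get_dummies` separately on the train and test frames can produce different column sets, since a letter may occur in one subset but not the other. The `reindex` call aligns the test frame to the training columns, filling missing ones with 0 and dropping any that the training data never produced. The same idea in plain Python, on hypothetical toy data, looks roughly like this:

```python
def one_hot(value, vocabulary, prefix):
    """Encode `value` against a fixed vocabulary; unseen values give all zeros."""
    return {f"{prefix}_{v}": int(v == value) for v in vocabulary}

# Hypothetical toy data: last letters seen in training vs. in the test set.
train_last_letters = ["a", "e", "n", "a", "y"]
test_last_letters = ["a", "z"]          # 'z' never appears in training

# The column set is fixed by the training data only.
vocabulary = sorted(set(train_last_letters))

encoded_test = [one_hot(v, vocabulary, "last_letter") for v in test_last_letters]
for row in encoded_test:
    print(row)
```

Every encoded row has the same keys, so a model fitted on the training columns can score the test rows directly.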

In [21]:
test_name_df.head()

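| | name | gender | len_name | a_count | i_count | o_count | y_count | last_letter_a | last_letter_b | last_letter_c | ... | first_letter_Q | first_letter_R | first_letter_S | first_letter_T | first_letter_U | first_letter_V | first_letter_W | first_letter_X | first_letter_Y | first_letter_Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nike | female | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Lonni | female | 5 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Blisse | female | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Ricardo | male | 7 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Celestyna | female | 9 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |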


### SK-Learn Models 

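Let's use SK-Learn's logistic regression, naive Bayes, and random forest classifiers on our newly engineered features. Each model is tuned with a grid search over the specified parameters, scored on ten random 70/30 train/validation splits (`ShuffleSplit` with `n_splits=10` and `test_size=.3`) rather than a strict tenfold cross-validation.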
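
A rough sketch of what this tuning setup amounts to, using the same `ShuffleSplit` settings as the functions in this section. `ShuffleSplit` draws independent random splits, so validation sets can overlap across iterations, unlike the disjoint folds of k-fold cross-validation. `X_demo` is a dummy array introduced only for illustration, and the grid shown mirrors the random forest grid from Run 5.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ShuffleSplit

# Same splitter settings as the functions in this section.
cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)

# 100 dummy rows are enough to see the split sizes.
X_demo = np.zeros((100, 1))
splits = list(cv.split(X_demo))
print(len(splits), "splits")
print("train/validation sizes:", len(splits[0][0]), len(splits[0][1]))

# Candidate settings in the random forest grid used in Run 5.
rf_grid = ParameterGrid({
    "rf__min_samples_split": np.arange(2, 10, 2),
    "rf__n_estimators": np.arange(10, 20, 5),
})
print(len(rf_grid), "candidates,", len(rf_grid) * len(splits), "fits")
```

`GridSearchCV` fits every candidate on every split and, with its default `refit=True`, refits the best candidate once more on the full training data.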

#### Run 3 - Logistic Classifier with Additional Features

The function runs a grid search over `C`, the inverse regularization strength of the L2 penalty.

In [122]:
def logistic_classifier(train_df, test_df):
    
    X_train = train_df.drop(labels=['gender','name'],
                axis=1)
        
    y_train = train_df['gender']
    
    X_test = test_df.drop(labels=['gender','name'],
                axis=1)
        
    y_test = test_df['gender']
    
    lr = LogisticRegression(penalty='l2',
                            dual=False,
                            max_iter=1000,
                            tol=.0001)
    
    pipeline = Pipeline(steps=[('logistic', lr)])
    
    param_grid = {
    'logistic__C': np.arange(1, 50, 10)}

    cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)

    search = GridSearchCV(pipeline, 
                          param_grid,
                          cv=cv,
                          return_train_score=True
                          )

    search.fit(X_train,y_train)

    y_pred = search.predict(X_test)
    
    print('Run 3: Logistic Classifier')
    print('Train: Gender correctly identified: ', "{:.1%}".format(search.score(X_train, y_train)))
    print('Test: Gender correctly identified: ', "{:.1%}".format(search.score(X_test, y_test)))
    print('')
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))
    
    return search
    

In [123]:
lr = logistic_classifier(train_name_df, test_name_df)

Run 3: Logistic Classifier
Train: Gender correctly identified:  78.5%
Test: Gender correctly identified:  81.4%

Confusion Matrix
[[285  45]
 [ 48 122]]


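The logistic classifier reaches 81.4% accuracy on the test set. However, the minority class is often misclassified: specificity (the share of male names correctly identified, treating female as the positive class) is only 71.8% (122 of 170).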

#### Run 4 - Naive Bayes Classifier with additional features

The function runs a grid search over the additive smoothing parameter `alpha`.

In [124]:
def m_naive_bayes_classifier(train_df, test_df):
    
    X_train = train_df.drop(labels=['gender','name'],
                axis=1)
        
    y_train = train_df['gender']
    
    X_test = test_df.drop(labels=['gender','name'],
                axis=1)
        
    y_test = test_df['gender']
    
    nb = MultinomialNB()
    
    # Wrap the classifier in a single-step pipeline so the grid can address it as 'nb'
    pipeline = Pipeline(steps=[('nb', nb)])
    
    param_grid = {
    'nb__alpha': np.arange(.1, 1, .2)}

    cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)

    search = GridSearchCV(pipeline, 
                          param_grid,
                          cv=cv,
                          return_train_score=True
                          )

    search.fit(X_train,y_train)

    y_pred = search.predict(X_test)
    
    
    print('Run 4: Naive Bayes Classifier')
    print('Train: Gender correctly identified: ', "{:.1%}".format(search.score(X_train, y_train)))
    print('Test: Gender correctly identified: ', "{:.1%}".format(search.score(X_test, y_test)))
    print('')
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))
    
    return search

In [125]:
nb = m_naive_bayes_classifier(train_name_df, test_name_df)

Run 4: Naive Bayes Classifier
Train: Gender correctly identified:  77.5%
Test: Gender correctly identified:  79.0%

Confusion Matrix
[[279  51]
 [ 54 116]]


The naive Bayes classifier reaches 79.0% test accuracy, but once again it misclassifies the minority class, with specificity of 68% (116 of 170 male names).

#### Run 5 - Random Forest Classifier with additional features

The function runs a grid search over the number of trees (`n_estimators`) and the minimum number of samples required to split a node (`min_samples_split`).

In [126]:
def random_forest_classifier(train_df, test_df):
    
    X_train = train_df.drop(labels=['gender','name'],
                axis=1)
        
    y_train = train_df['gender']
    
    X_test = test_df.drop(labels=['gender','name'],
                axis=1)
        
    y_test = test_df['gender']
    
    rf = RandomForestClassifier()

    # Wrap the classifier in a single-step pipeline so the grid can address it as 'rf'
    pipeline = Pipeline(steps=[('rf', rf)])
    
    param_grid = {
    'rf__min_samples_split': np.arange(2, 10, 2),
    'rf__n_estimators': np.arange(10, 20, 5)}

    cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)

    search = GridSearchCV(pipeline, 
                          param_grid,
                          cv=cv,
                          return_train_score=True
                          )

    search.fit(X_train,y_train)
    y_pred = search.predict(X_test)
    print('Run 5: Random Forest Classifier')
    print('Train: Gender correctly identified: ', "{:.1%}".format(search.score(X_train, y_train)))
    print('Test: Gender correctly identified: ', "{:.1%}".format(search.score(X_test, y_test)))
    print('')
    print('Confusion Matrix')
    print(confusion_matrix(y_test, y_pred))
    
    return search

In [127]:
rf = random_forest_classifier(train_name_df, test_name_df)

Run 5: Random Forest Classifier
Train: Gender correctly identified:  86.8%
Test: Gender correctly identified:  79.2%

Confusion Matrix
[[274  56]
 [ 48 122]]


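The random forest shows a similar test accuracy (79.2%) and a similar minority-class error rate (specificity of 71.8%) to the other two models. Its noticeably higher training accuracy (86.8%) suggests some overfitting.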

### Upsampling Minority Class

In all cases, the minority class has been misclassified to a fair degree. Let's upsample the minority class.

In [128]:
df_minority = train_name_df[train_name_df['gender']=='male']

df_minority_upsampled = resample(df_minority, 
                                 replace=True,
                                 n_samples=2000,
                                 random_state=123) 

df_upsampled = pd.concat([train_name_df, 
                          df_minority_upsampled])
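
`resample` with `replace=True` draws rows with replacement, and the cell above appends 2,000 such male rows to the original training frame; the test frame is left untouched. A plain-Python sketch of the same idea on a hypothetical toy data set (the `upsample` helper is an illustration, not the notebook's code):

```python
import random
from collections import Counter

def upsample(rows, label, n_extra, seed=123):
    """Return rows plus n_extra copies drawn with replacement from one class."""
    rng = random.Random(seed)
    minority = [row for row in rows if row[1] == label]
    extra = rng.choices(minority, k=n_extra)   # sampling with replacement
    return rows + extra

# Hypothetical toy training set: 6 female names, 3 male names.
rows = [("Ann", "female"), ("Bea", "female"), ("Cara", "female"),
        ("Dina", "female"), ("Eve", "female"), ("Fay", "female"),
        ("Gus", "male"), ("Hal", "male"), ("Ian", "male")]

balanced = upsample(rows, "male", n_extra=3)
print(Counter(gender for _, gender in rows))
print(Counter(gender for _, gender in balanced))
```

Upsampling adds no new information; it only changes how much weight the minority class carries during training.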

### Re-Run Models with Upsampled Data

#### Logistic Rerun

In [129]:
lr = logistic_classifier(df_upsampled, test_name_df)

Run 3: Logistic Classifier
Train: Gender correctly identified:  77.5%
Test: Gender correctly identified:  78.2%

Confusion Matrix
[[253  77]
 [ 32 138]]


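We see a drop in accuracy (from 81.4% to 78.2%), but the error rates of the two classes are more balanced. Specificity is now at 81% (138 of 170 male names), while the share of female names correctly identified falls to 77% (253 of 330).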

#### Logistic Regressor Coefficients

In [130]:
pd.DataFrame.from_dict({feature:(abs(coef), coef) for 
                        feature, coef in zip(test_name_df.iloc[:,2:].columns, 
                                  lr.best_estimator_['logistic'].coef_[0])},
                        orient='index').rename({0:'Abs Coef', 1:'Coef'},
                                                axis=1).sort_values(by='Abs Coef',
                                                        ascending=False)[['Coef']][:5]

| | Coef |
|---|---|
| last_letter_a | -4.663564 |
| last_letter_i | -2.373821 |
| last_letter_k | 1.868934 |
| last_letter_c | 1.860789 |
| last_letter_e | -1.688131 |


The logistic classifier tells us that the features that best differentiate males from females are the indicators for a last letter of a, i, k, c, or e. Negative coefficients (a, i, e) point toward female and positive coefficients (k, c) toward male.
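
One way to read these coefficients is as odds ratios. The sketch below exponentiates the five values from the table. It assumes the positive class is 'male', which follows from scikit-learn ordering class labels alphabetically, and it is an interpretation aid rather than part of the original analysis.

```python
import math

# Coefficients copied from the table above (positive class assumed to be 'male').
coefs = {
    "last_letter_a": -4.663564,
    "last_letter_i": -2.373821,
    "last_letter_k": 1.868934,
    "last_letter_c": 1.860789,
    "last_letter_e": -1.688131,
}

# exp(coef) is the factor by which the odds of 'male' change when the
# indicator switches from 0 to 1, with all other features held fixed.
odds_ratios = {feature: math.exp(coef) for feature, coef in coefs.items()}

for feature, ratio in odds_ratios.items():
    print(f"{feature}: odds of male x {ratio:.3f}")
```

Because the last-letter indicators are mutually exclusive, each ratio is best read relative to a last letter whose coefficient is close to zero.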

#### Naive Bayes Rerun

In [131]:
nb = m_naive_bayes_classifier(df_upsampled, test_name_df)

Run 4: Naive Bayes Classifier
Train: Gender correctly identified:  76.5%
Test: Gender correctly identified:  77.2%

Confusion Matrix
[[245  85]
 [ 29 141]]


Once again, we see a drop in accuracy (from 79.0% to 77.2%), but the error rates of the two classes are more balanced. Specificity is now at 83% (141 of 170 male names).

#### Naive Bayes Feature Log Probabilities

In [168]:
nb_df = pd.DataFrame.from_dict({feature:[coef1, coef2] for 
                        feature, coef1, coef2 in zip(test_name_df.iloc[:,2:].columns, 
                                  nb.best_estimator_['nb'].feature_log_prob_[0],
                                  nb.best_estimator_['nb'].feature_log_prob_[1])},
                        orient='index').rename({0:'Coef1', 1:'Coef2'},
                                                axis=1)

(nb_df['Coef1']/nb_df['Coef2']).sort_values(ascending=False)[:5]

last_letter_c    1.716548
last_letter_k    1.616015
last_letter_d    1.458959
last_letter_o    1.432279
last_letter_f    1.426924
dtype: float64

The ratios above rank the features that are relatively more probable for male names than for female names. By this measure, the naive Bayes classifier's strongest male indicators are the last letters c, k, d, o, and f.
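
`feature_log_prob_` holds log P(feature | class), with one row per class. Both log probabilities are negative, so a ratio above 1 means the feature is more probable under the second class (male). A common alternative ranking is the difference of the two logs, which equals the log of the likelihood ratio. The sketch below contrasts the two measures on hypothetical probabilities that are not taken from the fitted model.

```python
import math

# Hypothetical per-class feature probabilities, not taken from the fitted model.
p_feature_given_female = 0.001
p_feature_given_male = 0.020

log_female = math.log(p_feature_given_female)
log_male = math.log(p_feature_given_male)

# Ratio of log probabilities, as computed in the cell above.
ratio_of_logs = log_female / log_male

# Difference of log probabilities = log of the likelihood ratio.
log_likelihood_ratio = log_male - log_female

print(f"ratio of logs:        {ratio_of_logs:.3f}")
print(f"log-likelihood ratio: {log_likelihood_ratio:.3f}")
print(f"likelihood ratio:     {math.exp(log_likelihood_ratio):.1f}")
```

The difference has a direct interpretation (here the feature is 20 times more likely under male), while the ratio of logs is only a relative ranking.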

#### Random Forest Rerun

In [133]:
rf = random_forest_classifier(df_upsampled, test_name_df)

Run 5: Random Forest Classifier
Train: Gender correctly identified:  89.3%
Test: Gender correctly identified:  76.4%

Confusion Matrix
[[255  75]
 [ 43 127]]


Accuracy has suffered again (from 79.2% to 76.4%), but specificity has risen to 75% (127 of 170 male names).

#### Random Forest Feature Importances

In [134]:
pd.DataFrame.from_dict({feature: coef for 
                        feature, coef in zip(test_name_df.iloc[:,2:].columns, 
                                  rf.best_estimator_['rf'].feature_importances_)},
                        orient='index').rename({0: 'Coef'},
                                                axis=1).sort_values(by='Coef',
                                                        ascending=False)[:5]

| | Coef |
|---|---|
| last_letter_a | 0.188018 |
| len_name | 0.143636 |
| a_count | 0.075006 |
| last_letter_e | 0.048652 |
| i_count | 0.044153 |


The random forest classifier's feature importances give us more variety among the top features than the other two classifiers. It prioritizes the length of the name and the counts of the vowels a and i, along with the features specifying whether the last letter is a or e.

### Conclusions of SK-Learn Models

* All models classified test data with accuracy above 76%, but specificity was low before upsampling (68% to 72%)
* After upsampling the minority class, accuracy fell by roughly 2 to 3 percentage points, but specificity rose to between 75% and 83%
* The most influential features differed across the three models
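
The figures behind these conclusions can be recomputed from the six test-set confusion matrices printed above. The sketch assumes the scikit-learn layout with labels in alphabetical order (female row first, male row second), which matches the row totals of 330 and 170.

```python
# Test-set confusion matrices copied from the outputs above.
# Layout: [[female->female, female->male], [male->female, male->male]].
results = {
    ("logistic", "original"):       [[285, 45], [48, 122]],
    ("logistic", "upsampled"):      [[253, 77], [32, 138]],
    ("naive bayes", "original"):    [[279, 51], [54, 116]],
    ("naive bayes", "upsampled"):   [[245, 85], [29, 141]],
    ("random forest", "original"):  [[274, 56], [48, 122]],
    ("random forest", "upsampled"): [[255, 75], [43, 127]],
}

def summarize(cm):
    """Return (accuracy, female recall, male recall) for a 2x2 matrix."""
    (ff, fm), (mf, mm) = cm
    total = ff + fm + mf + mm
    return (ff + mm) / total, ff / (ff + fm), mm / (mf + mm)

summary = {key: summarize(cm) for key, cm in results.items()}

print(f"{'model':<14}{'data':<11}{'acc':>7}{'female':>8}{'male':>7}")
for (model, data), (acc, female, male) in summary.items():
    print(f"{model:<14}{data:<11}{acc:>7.1%}{female:>8.1%}{male:>7.1%}")
```

The male column is the specificity quoted in the text. The female column shows the cost of upsampling: the share of female names correctly identified drops as the male share rises.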