# Project 3 - Text Mining

Predict Name-Gender Labels with NLTK "names" data.

---

Jeff Shamp, John Kellogg

In [1]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

#nltk.download('names')
from nltk.corpus import names

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2




## Load Data - Extract Features

We first load the data, organize it into a DataFrame, and extract a few features: the last letter of each name and the minimum, mean, median, and maximum name length.

In [2]:
names_dict = {'name':names.words()}
names_df = pd.DataFrame(data= names_dict)

In [3]:
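# Build the lookup sets once; set membership is O(1) per name.
female_names = set(names.words('female.txt'))
male_names = set(names.words('male.txt'))

def label_names(row):
    # A name present in both lists is labeled "female" because that check runs first.
    if row in female_names:
        return "female"
    if row in male_names:
        return "male"
    return -1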
    
def get_last_letter(row):
    return row[-1]

In [4]:
# Not the most efficient use of python or lists or encoders, but it works
names_df['label'] = names_df.name.apply(lambda x: label_names(x))
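
A name can appear in both the female and the male list, and the labeling function above gives it whichever label is checked first. The following is a minimal sketch of a set-based labeler that also flags such names. The lists `female_list` and `male_list` are short hypothetical stand-ins for the NLTK files, and `label_name` is an illustrative helper, not part of the notebook.

```python
# Hypothetical stand-ins for names.words('female.txt') and names.words('male.txt').
female_list = ["Abby", "Jamie", "Leslie", "Mary"]
male_list = ["Jamie", "John", "Leslie", "Tom"]

# Sets give constant-time membership checks.
female_set, male_set = set(female_list), set(male_list)

# Names that occur in both lists are ambiguous under a single-label scheme.
ambiguous = sorted(female_set & male_set)

def label_name(name):
    """Return 'female', 'male', 'both', or 'unknown' for a name."""
    in_female, in_male = name in female_set, name in male_set
    if in_female and in_male:
        return "both"
    if in_female:
        return "female"
    if in_male:
        return "male"
    return "unknown"

labels = {name: label_name(name) for name in ["Mary", "Tom", "Jamie", "Zed"]}
print(ambiguous)
print(labels)
```

Whether to drop ambiguous names, keep one label, or keep both rows is a modeling choice; the sketch only makes the overlap visible.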

In [5]:
names_df['last_letter'] = names_df.name.apply(lambda x: get_last_letter(x))
names_df = names_df.sample(frac=1).reset_index(drop=True)

In [6]:
print(f"Min length name: {names_df.name.apply(len).min()}",
      f"Average length name: {names_df.name.apply(len).mean()}",
      f"Median length name: {names_df.name.apply(len).median()}",
      f"Max length name: {names_df.name.apply(len).max()}",
      sep="\n")

Min length name: 2
Average length name: 6.03285498489426
Median length name: 6.0
Max length name: 15


## Baseline Model

The book uses a Naive Bayes classifier to predict gender from a name using a simple last-letter feature. The model shown in the book achieved an accuracy of 0.782. This is the score we intend to beat.

Names vary considerably in length, so we represent each name by counts of its character n-grams. As shown above, the shortest name has 2 characters and the longest has 15. We build n-grams with `ngram_range=(0, 16)`, which covers every substring length up to the full name. After several tests, we found that this setting is likely to give the best model accuracy and precision.
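
To make the feature concrete, here is a plain-Python sketch of what counting character n-grams means for a single name. It illustrates the idea only and is not scikit-learn's implementation; `char_ngrams` is a hypothetical helper, and the example uses lengths 1 to 3 rather than the notebook's range to keep the output short.

```python
from collections import Counter

def char_ngrams(text, min_n, max_n):
    """Return every character n-gram of `text` with min_n <= n <= max_n."""
    text = text.lower()  # mirrors CountVectorizer's default lowercasing
    grams = []
    # n cannot exceed the length of the name itself.
    for n in range(min_n, min(max_n, len(text)) + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

# Count the 1-, 2-, and 3-character substrings of one name.
counts = Counter(char_ngrams("Anna", 1, 3))
print(counts)
```

Each distinct n-gram becomes one column of the feature matrix, and each name becomes a row of counts, which is why the resulting matrix is large and sparse.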

In [39]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(0, 16))
X = char_vectorizer.fit_transform(names_df.name)

In [40]:
# fit_transform already returns a sparse matrix; convert it to CSC format
X = X.tocsc()

In [41]:
train_dev = X[500:]
test = X[:500]

In [42]:
X_train, X_dev, y_train, y_dev = \
train_test_split(train_dev,
                  names_df['label'][500:],
                  test_size=0.07,
                  random_state=42)

In [43]:
naive_model = MultinomialNB().fit(X_train, y_train)

In [44]:
print(classification_report(y_dev, naive_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.88      0.93      0.90       359
        male       0.82      0.72      0.76       163

    accuracy                           0.86       522
   macro avg       0.85      0.82      0.83       522
weighted avg       0.86      0.86      0.86       522



In [45]:
print(confusion_matrix(y_dev, naive_model.predict(X_dev)))

[[333  26]
 [ 46 117]]
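
The numbers in the classification report can be recovered from this confusion matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, both in the order (female, male); the variable names are illustrative.

```python
# Confusion matrix printed above: rows = true labels, columns = predicted labels.
cm = [[333, 26],
      [46, 117]]
labels = ["female", "male"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

report = {}
for k, label in enumerate(labels):
    true_pos = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as this class
    actual_k = sum(cm[k])              # row sum: the "support" of this class
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(round(accuracy, 2))
print(report)
```

The rounded values agree with the report: the model recovers most female names, while recall on male names is noticeably lower.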


## Initial Results

Using a model from the same family of classifiers as the book, we achieve much higher accuracy at predicting gender from names with character n-grams (0.86) than the book's last-letter model (0.782). Let's now push it further.

In [55]:
svm_model = SVC(gamma='scale').fit(X_train, y_train)

In [56]:
print(classification_report(y_dev, svm_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.85      0.95      0.90       359
        male       0.86      0.64      0.74       163

    accuracy                           0.86       522
   macro avg       0.86      0.80      0.82       522
weighted avg       0.86      0.86      0.85       522



In [57]:
print(confusion_matrix(y_dev, svm_model.predict(X_dev)))

[[342  17]
 [ 58 105]]


The SVM is no better than the baseline model: both reach 0.86 accuracy on the development set.

#### Logistic Regression is the best model

In [49]:
logit_model = LogisticRegression(max_iter=3000).fit(X_train, y_train)

In [50]:
print(classification_report(y_dev, logit_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.91      0.96      0.93       359
        male       0.90      0.79      0.84       163

    accuracy                           0.90       522
   macro avg       0.90      0.87      0.88       522
weighted avg       0.90      0.90      0.90       522



In [51]:
print(confusion_matrix(y_dev, logit_model.predict(X_dev)))

[[344  15]
 [ 35 128]]


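The logistic regression model also makes fewer errors of both kinds than the Naive Bayes and SVM models above: 15 female names predicted as male and 35 male names predicted as female.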

In [52]:
gmb_model = GradientBoostingClassifier().fit(X_train, y_train)

In [53]:
print(classification_report(y_dev, gmb_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.81      0.96      0.88       359
        male       0.84      0.51      0.63       163

    accuracy                           0.82       522
   macro avg       0.82      0.73      0.76       522
weighted avg       0.82      0.82      0.80       522



In [54]:
print(confusion_matrix(y_dev, gmb_model.predict(X_dev)))

[[343  16]
 [ 80  83]]


## Development Results

We see that the best model is also the simplest. The logistic regression model (90% accuracy) clearly outperforms our Naive Bayes baseline (86%), whereas the more complex gradient boosting model (82%) does worse than that baseline. If constrained to a Naive Bayes model only, our NB model trained on character n-grams still gives better results (86% accuracy) than the book's last-letter model (78%).
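
As a compact recap, the development accuracies can be recomputed from the four confusion matrices printed above. The sketch also adds a majority-class reference point (always predicting "female"), which is not a model from the notebook but follows directly from the class supports of 359 and 163; the dictionary keys are illustrative names.

```python
# Development-set confusion matrices copied from the outputs above.
dev_confusion = {
    "naive_bayes": [[333, 26], [46, 117]],
    "svm": [[342, 17], [58, 105]],
    "logistic_regression": [[344, 15], [35, 128]],
    "gradient_boosting": [[343, 16], [80, 83]],
}

def accuracy(cm):
    """Fraction of correct predictions: the diagonal over the total."""
    correct = cm[0][0] + cm[1][1]
    return correct / sum(sum(row) for row in cm)

dev_accuracy = {name: round(accuracy(cm), 3) for name, cm in dev_confusion.items()}

# Reference point: label every name "female" (359 of 522 development names).
majority_baseline = round(359 / 522, 3)

ranking = sorted(dev_accuracy, key=dev_accuracy.get, reverse=True)
print(dev_accuracy)
print(majority_baseline)
print(ranking)
```

Every model clears the majority-class reference, and the gap between the models comes mostly from how many male names each one misclassifies.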

## Test Set Results

Refit the best development model (logistic regression) on the combined training and development data, then evaluate it on the held-out test set.

In [58]:
final_model = LogisticRegression(max_iter=3000).fit(train_dev, names_df.label[500:])

In [59]:
print(classification_report(names_df.label[:500], final_model.predict(test)))

              precision    recall  f1-score   support

      female       0.89      0.94      0.91       341
        male       0.86      0.74      0.79       159

    accuracy                           0.88       500
   macro avg       0.87      0.84      0.85       500
weighted avg       0.88      0.88      0.88       500



In [60]:
print(confusion_matrix(names_df.label[:500], final_model.predict(test)))

[[321  20]
 [ 41 118]]


The test-set accuracy (0.88) is consistent with the development-set accuracy (0.90). Great!
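
One rough way to judge whether the two accuracies agree is to compare their difference with its binomial standard error. This is a sketch under simplifying assumptions, not a formal test: it uses the normal approximation, treats the two sets as independent samples, and ignores that the development set was used for model selection and that the final model saw more training data. `accuracy_and_se` is a hypothetical helper.

```python
import math

def accuracy_and_se(correct, n):
    """Return the accuracy and its binomial standard error for n predictions."""
    p = correct / n
    return p, math.sqrt(p * (1 - p) / n)

# Correct counts are the diagonals of the logistic regression confusion matrices.
dev_acc, dev_se = accuracy_and_se(344 + 128, 522)
test_acc, test_se = accuracy_and_se(321 + 118, 500)

diff = dev_acc - test_acc
diff_se = math.sqrt(dev_se ** 2 + test_se ** 2)
z = diff / diff_se  # difference measured in standard errors

print(round(dev_acc, 3), round(test_acc, 3), round(z, 2))
```

A difference of less than two standard errors is what one would expect from sampling noise alone, which supports reading the two results as consistent.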