# Project 3 - Text Mining

Predict Name-Gender Labels with NLTK "names" data.

---

Jeff Shamp, John Kellogg

In [1]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

# Download the Names Corpus (a no-op if it is already up to date)
nltk.download('names')
from nltk.corpus import names


from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\x\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


## Assignment

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 
* 500 words for the test set, 
* 500 words for the dev-test set 
* 6900 words remaining for the training set. 

Starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress.

Once you are satisfied with your classifier, check its final performance on the test set. 

* How does the performance on the test set compare to the performance on the dev-test set? 
* Is this what you'd expect?

## Data Processing & Extract Features

As with any analysis, the first step is to arrange the data in a usable format, here a pandas DataFrame. From there we can check that nothing needs fixing and extract some features, such as the last letter of each name, along with summary statistics (minimum, maximum, and average name length). Each of these will be useful later.

In [2]:
# import the data
names_dict = {'name':names.words()}
names_df = pd.DataFrame(data= names_dict)

In [3]:
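# Build the lookup sets once; calling names.words() for every row is slow.
female_names = set(names.words('female.txt'))
male_names = set(names.words('male.txt'))

def label_names(row):
    # A name listed in both files is labeled "female" because that check runs first.
    if row in female_names:
        return "female"
    if row in male_names:
        return "male"
    return -1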
    
def get_last_letter(row):
    return row[-1]

In [4]:
# Label each name by looking it up in the female and male name lists
names_df['label'] = names_df.name.apply(label_names)

In [5]:
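names_df['last_letter'] = names_df.name.apply(get_last_letter)
# Shuffle the rows so the first 500 can serve as a random test set
# (no random_state is set, so the split differs between runs)
names_df = names_df.sample(frac=1).reset_index(drop=True)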

In [6]:
print(f"Min length name: {names_df.name.apply(len).min()}",
      f"Average length name: {names_df.name.apply(len).mean()}",
      f"Median length name: {names_df.name.apply(len).median()}",
      f"Max length name: {names_df.name.apply(len).max()}",
      sep="\n")

Min length name: 2
Average length name: 6.03285498489426
Median length name: 6.0
Max length name: 15


Now we have some useful statistics and a clear male/female label for each name. The mean and median name lengths are both about 6 characters.

## Baseline Model (of this report)

The book uses a Naive Bayes classifier to predict gender from a name using a simple last-letter feature. The model shown in the book achieved an accuracy of 0.782. We intend to beat that.
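
To make the last-letter idea concrete, here is a minimal sketch in plain Python. It is not the book's NLTK code: the tiny name list is hypothetical toy data, and the helper names `last_letter` and `predict` are ours. It picks the most frequent label for each last letter, which with a single feature is roughly what an unsmoothed Naive Bayes classifier would choose.

```python
from collections import Counter, defaultdict

# Hypothetical toy data for illustration only; the report uses the NLTK corpus.
toy_names = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Karen", "female"),
    ("John", "male"), ("Kevin", "male"), ("Brian", "male"), ("Marco", "male"),
]

def last_letter(name):
    """Return the single feature: the lowercased final character."""
    return name[-1].lower()

# Count how often each label occurs with each last letter.
counts = defaultdict(Counter)
label_totals = Counter()
for name, label in toy_names:
    counts[last_letter(name)][label] += 1
    label_totals[label] += 1

def predict(name):
    """Pick the most frequent label for the name's last letter.

    Falls back to the overall most common label for unseen letters.
    """
    seen = counts.get(last_letter(name))
    if seen:
        return seen.most_common(1)[0][0]
    return label_totals.most_common(1)[0][0]

print(predict("Sophia"))  # last letter "a" was only seen with female names
print(predict("Ethan"))   # last letter "n" was mostly seen with male names
```

A single letter discards most of the information in a name, which is the motivation for the richer character n-gram features used in this report.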

We will take advantage of the variation in name length by building character n-grams from the names and using their counts as features. From the statistics above, the shortest name has 2 characters and the longest has 15.

Our approach is to count character n-grams over a wide range of lengths, set with `ngram_range=(0, 16)` in the code below. Since no name exceeds 15 characters, the upper bound simply ensures that every substring, including the full name, is counted. In our experiments this range appeared to give the best accuracy and precision in our goal to beat the book.
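
As a rough illustration of what character n-gram counting produces, the sketch below counts substrings of one name in plain Python. It only approximates what `CountVectorizer(analyzer='char')` does. The function name `char_ngrams` and the short 1 to 3 range are assumptions chosen for readability, not the settings used in this report.

```python
from collections import Counter

def char_ngrams(name, min_n=1, max_n=3):
    """Count every contiguous substring of length min_n..max_n."""
    text = name.lower()  # CountVectorizer also lowercases by default
    grams = Counter()
    # Never ask for n-grams longer than the name itself.
    for n in range(min_n, min(max_n, len(text)) + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

features = char_ngrams("Anna")
for gram, count in sorted(features.items()):
    print(f"{gram!r}: {count}")
```

Each distinct n-gram becomes one column of the feature matrix, so endings such as "na" or "nna" get their own weight instead of relying on the last letter alone.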

In [7]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(0, 16))
X = char_vectorizer.fit_transform(names_df.name)

In [8]:
# convert the sparse matrix to compressed sparse column (CSC) format
X = X.tocsc()

In [9]:
# The rows were shuffled earlier, so the first 500 form the held-out test set
train_dev = X[500:]
test = X[:500]

In [10]:
# Hold out 7% of the remaining names as the dev-test set
X_train, X_dev, y_train, y_dev = train_test_split(
    train_dev,
    names_df['label'][500:],
    test_size=0.07,
    random_state=42,
)
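
The resulting subset sizes can be checked with a little arithmetic. This sketch assumes the Names Corpus yields 7,944 rows (a figure not stated in this report, though it is consistent with the dev-test support of 522 shown below) and that scikit-learn rounds a fractional `test_size` up to the next whole row.

```python
import math

total_names = 7944   # assumed number of rows from names.words()
test_size = 500      # first 500 shuffled rows, held out as the test set

remaining = total_names - test_size       # rows passed to train_test_split
dev_size = math.ceil(0.07 * remaining)    # fractional test_size is rounded up
train_size = remaining - dev_size

print(f"test={test_size}, dev-test={dev_size}, train={train_size}")
```

Under these assumptions the split comes out close to the assignment's 500 / 500 / 6900 target, with a slightly larger dev-test set.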

In [11]:
naive_model = MultinomialNB().fit(X_train, y_train)

In [12]:
print(classification_report(y_dev, naive_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.86      0.93      0.89       365
        male       0.80      0.64      0.71       157

    accuracy                           0.84       522
   macro avg       0.83      0.78      0.80       522
weighted avg       0.84      0.84      0.84       522



In [13]:
print(confusion_matrix(y_dev, naive_model.predict(X_dev)))

[[340  25]
 [ 57 100]]
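
The classification report can be reproduced by hand from this confusion matrix, which helps when reading the later models' outputs. The sketch below assumes rows are true labels and columns are predicted labels, both in the order (female, male), which is scikit-learn's convention. The helper name `per_class_metrics` is ours.

```python
# Naive Bayes dev-test confusion matrix from the output above.
cm = [[340, 25],
      [57, 100]]
labels = ["female", "male"]

def per_class_metrics(cm, idx):
    """Return (precision, recall, f1) for the class at position idx."""
    true_positive = cm[idx][idx]
    predicted_as_class = sum(row[idx] for row in cm)  # column total
    actually_class = sum(cm[idx])                     # row total
    precision = true_positive / predicted_as_class
    recall = true_positive / actually_class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

for idx, label in enumerate(labels):
    precision, recall, f1 = per_class_metrics(cm, idx)
    print(f"{label:>6}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The off-diagonal cells show where the errors sit: 57 male names were predicted female, against only 25 female names predicted male.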


### Initial Results

Success! We have beaten the book's result. Accuracy on the dev-test set is about 0.84, compared with the book's 0.782.

Using a model from the same family (Naive Bayes), character n-grams give noticeably higher accuracy on name-gender prediction than the last-letter feature used in the book. Let's push it further. Can we get those numbers higher?

## Potential other models

### SVM Model

In [14]:
svm_model = SVC(gamma='scale').fit(X_train, y_train)

In [15]:
print(classification_report(y_dev, svm_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.83      0.95      0.89       365
        male       0.83      0.55      0.66       157

    accuracy                           0.83       522
   macro avg       0.83      0.75      0.77       522
weighted avg       0.83      0.83      0.82       522



In [16]:
print(confusion_matrix(y_dev, svm_model.predict(X_dev)))

[[347  18]
 [ 71  86]]


#### SVM is no better than the baseline model (0.83 vs. 0.84 accuracy).

### Logistic Regression

In [17]:
logit_model = LogisticRegression(max_iter=3000).fit(X_train, y_train)

In [18]:
print(classification_report(y_dev, logit_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.91      0.95      0.93       365
        male       0.87      0.79      0.83       157

    accuracy                           0.90       522
   macro avg       0.89      0.87      0.88       522
weighted avg       0.90      0.90      0.90       522



In [19]:
print(confusion_matrix(y_dev, logit_model.predict(X_dev)))

[[346  19]
 [ 33 124]]


#### So far the Logistic Regression model is the best in terms of precision and accuracy. It also makes the fewest misclassifications of any model so far (52 of 522).

### Gradient Boosting (GBM) Model

In [20]:
gmb_model = GradientBoostingClassifier().fit(X_train, y_train)

In [21]:
print(classification_report(y_dev, gmb_model.predict(X_dev)))

              precision    recall  f1-score   support

      female       0.80      0.97      0.87       365
        male       0.85      0.42      0.56       157

    accuracy                           0.80       522
   macro avg       0.82      0.69      0.72       522
weighted avg       0.81      0.80      0.78       522



In [22]:
print(confusion_matrix(y_dev, gmb_model.predict(X_dev)))

[[353  12]
 [ 91  66]]


#### The GBM model scores lower than the baseline model (0.80 vs. 0.84 accuracy).

## Development Results

We see that the best model is the Logistic Regression model, which clearly outperforms both the book's model and our baseline (0.90 accuracy versus 0.782 and 0.84). The more complex (and often stronger) gradient-boosted tree model did worse than our Naive Bayes baseline, with 0.80 accuracy and a male recall of only 0.42.

If constrained to a Naive Bayes model only, our model trained on character n-grams showed better results (84% accuracy) than the book's base model (78%).
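
For a side-by-side view, the four dev-test confusion matrices printed above can be ranked directly. This is only a convenience sketch; the dictionary name `dev_confusion` and the display labels are ours.

```python
# Dev-test confusion matrices copied from the outputs above
# (rows = true female/male, columns = predicted female/male).
dev_confusion = {
    "Naive Bayes": [[340, 25], [57, 100]],
    "SVM": [[347, 18], [71, 86]],
    "Logistic Regression": [[346, 19], [33, 124]],
    "Gradient Boosting": [[353, 12], [91, 66]],
}

def accuracy(cm):
    """Share of names on the diagonal, i.e. classified correctly."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

# Best model first.
ranking = sorted(dev_confusion,
                 key=lambda model: accuracy(dev_confusion[model]),
                 reverse=True)

for model in ranking:
    cm = dev_confusion[model]
    errors = cm[0][1] + cm[1][0]
    print(f"{model:<20} accuracy={accuracy(cm):.3f} errors={errors}")
```

Most of the gap between models comes from the second row: male names predicted as female range from 33 for Logistic Regression to 91 for Gradient Boosting.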

## Test Set Results

We will now answer the initial question: "How does the performance on the test set compare to the performance on the dev-test set?" Taking Logistic Regression as our best dev model, we refit it on the full training data (training and dev-test sets combined) and evaluate it on the test set.

In [23]:
final_model = LogisticRegression(max_iter=3000).fit(train_dev, names_df.label[500:])

In [24]:
print(classification_report(names_df.label[:500], final_model.predict(test)))

              precision    recall  f1-score   support

      female       0.91      0.94      0.92       353
        male       0.85      0.77      0.81       147

    accuracy                           0.89       500
   macro avg       0.88      0.86      0.87       500
weighted avg       0.89      0.89      0.89       500



In [25]:
print(confusion_matrix(names_df.label[:500], final_model.predict(test)))

[[333  20]
 [ 34 113]]


The answer to the secondary question, "Is this what you'd expect?", is yes. Results are consistent between the dev-test set and the test set: test accuracy is 0.89 versus 0.90 on dev-test. While the test numbers are slightly lower than the dev numbers, they are still higher than the book's model (0.782) and our Naive Bayes baseline (0.84 on dev-test).
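
One rough way to judge whether the dev/test gap matters is to compare it with the sampling noise of an accuracy measured on 500 names. The sketch below uses a simple binomial standard error. It assumes the test names behave like independent draws, so treat it as a sanity check rather than a formal test.

```python
import math

# Correct predictions are the diagonals of the Logistic Regression matrices above.
dev_correct, dev_n = 346 + 124, 522
test_correct, test_n = 333 + 113, 500

dev_acc = dev_correct / dev_n
test_acc = test_correct / test_n

# Binomial standard error of the test accuracy.
test_se = math.sqrt(test_acc * (1 - test_acc) / test_n)
gap = dev_acc - test_acc

print(f"dev accuracy   = {dev_acc:.3f}")
print(f"test accuracy  = {test_acc:.3f}")
print(f"gap            = {gap:.3f}")
print(f"test std error = {test_se:.3f}")
```

The gap is smaller than one standard error, which supports reading the two results as consistent.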

Great!

[Video Submission](https://youtu.be/pQiAWf8XyEw).