I will build several classifiers in this Jupyter notebook and then select the best one to deploy.

In [1]:
# important dependencies
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
import joblib  # sklearn.externals.joblib is no longer available in newer scikit-learn releases
from scipy import stats
from hpsklearn import HyperoptEstimator, random_forest, knn, svc, sgd
from hyperopt import hp, tpe
import timeit
%matplotlib inline

WARN: OMP_NUM_THREADS=None =>
... If you are using openblas if you are using openblas set OMP_NUM_THREADS=1 or risk subprocess calls hanging indefinitely


### 1. Preprocessing

In [2]:
# read csv
train_df = pd.read_csv('../../data/train_complete.csv', index_col= 0)
test_df = pd.read_csv('../../data/test_complete.csv', index_col= 0)

In [3]:
# check csv dimension

train_df.shape

(30162, 15)

In [4]:
test_df.shape

(15061, 15)

In [5]:
train_df['label'].head(10)

1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     1
9     1
10    1
Name: label, dtype: int64

In [6]:
# convert the discrete columns into dummy variables
dummy_cols = ["workclass", "education",
             "marital_stat", "occupation",
             "relationship", "race",
             "sex", "native_country"]

In [7]:
train_df_with_dummies = pd.get_dummies(train_df, columns= dummy_cols)

In [8]:
# inspect the training set again for dummies

train_df_with_dummies.head(5)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hr_per_wk,label,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Private,...,native_country_ Portugal,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia
1,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
5,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [9]:
train_df_with_dummies.shape

(30162, 105)

In [10]:
test_df_with_dummies = pd.get_dummies(test_df, columns= dummy_cols)
test_df_with_dummies.shape

(15061, 104)

The number of columns in `test` does not match `train`. I will check which column is missing from the test set.

In [11]:
# ref: https://stackoverflow.com/questions/45482755/compare-headers-of-dataframes-in-pandas
train_df_with_dummies.columns.difference(test_df_with_dummies.columns)

Index(['native_country_ Holand-Netherlands'], dtype='object')

In [12]:
test_df_with_dummies.columns.difference(train_df_with_dummies.columns)

Index([], dtype='object')

All the columns present in `test` are also present in `train`, but the column `native_country_ Holand-Netherlands` in `train` is not present in `test`.

I should not dig further because test sets are meant to be locked away. Technically I am not allowed to look at the test set yet.
I will just add this missing column to `test` and move on.

In [13]:
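# add the missing column as zeros and keep the same column order as train,
# so that the feature arrays built with .values later line up column by column
test_df_with_dummies = test_df_with_dummies.reindex(columns=train_df_with_dummies.columns, fill_value=0)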

In [14]:
# check shape again

test_df_with_dummies.shape

(15061, 105)
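
The column fix above can also be written so that it guards against column order as well as missing columns. This is a minimal sketch on two small made-up frames (`toy_train` and `toy_test` are illustrative names, not the notebook's data). It uses `DataFrame.reindex` to give the test frame exactly the training columns and fills any absent column with zeros. Column order matters once the frames are converted with `.values`, because the resulting arrays no longer carry column names.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (hypothetical data)
toy_train = pd.DataFrame({"age": [39, 50],
                          "country_A": [1, 0],
                          "country_B": [0, 1],
                          "country_C": [0, 0]})
toy_test = pd.DataFrame({"age": [28],
                         "country_A": [0],
                         "country_C": [1]})

# Columns that exist in train but not in test
missing = toy_train.columns.difference(toy_test.columns)

# Give test the same columns, in the same order, as train;
# columns absent from test are created and filled with 0
toy_test_aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)

print(list(missing))
print(list(toy_test_aligned.columns))
```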

### Create X and Y arrays for training

In [15]:
# drop NaN in the dataframe
train_df_noNaN = train_df_with_dummies.dropna()

In [16]:
train_df_noNaN.shape

(30162, 105)

As expected, no `NaN` is present in the training set because all the `?` values have been removed.

In [17]:
test_df_noNaN = test_df_with_dummies.dropna()

In [18]:
test_df_noNaN.shape

(15060, 105)

One row is dropped from the test set. It was a stray line of text (`|1x3 Cross validator`) in the original `csv` that was misread as a data row, so dropping it does not affect the quality of `test`.

In [19]:
# create label array 

y = train_df_noNaN['label'].values
y.shape

(30162,)

In [20]:
y_test = test_df_noNaN['label'].values
y_test.shape

(15060,)

In [21]:
# create feature array 

X = train_df_noNaN.drop(['label'], axis=1).values
X.shape

(30162, 104)

In [22]:
X_test = test_df_noNaN.drop(['label'], axis=1).values
X_test.shape

(15060, 104)

The dimensions look right, so I will start creating the validation set.

In [23]:
# create train-validation split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size= 0.2, random_state= 1)

In [24]:
X_train.shape

(24129, 104)

In [25]:
X_val.shape

(6033, 104)

The dimensions are correct.
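
The expected sizes can be checked by hand. As far as I understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the remaining rows go to training. The sketch below reproduces the split on a placeholder array with the same number of rows; `X_dummy` and `y_dummy` are made-up names and contain no real data.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 30162      # rows in the training data, from the shape check above
test_size = 0.2

# Expected sizes: validation rows are rounded up, the remainder is used for training
expected_val = math.ceil(n_rows * test_size)
expected_train = n_rows - expected_val

# Placeholder features and labels with the right number of rows (not the real data)
X_dummy = np.zeros((n_rows, 1))
y_dummy = np.zeros(n_rows)
X_tr, X_va, y_tr, y_va = train_test_split(X_dummy, y_dummy,
                                          test_size=test_size, random_state=1)

print(expected_train, expected_val)
print(X_tr.shape[0], X_va.shape[0])
```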

### 2. Logistic Regression

Logistic regression is a good choice for binary classification.

I can use the coefficients to learn about feature importance as well.

I will use `SGDClassifier` configured as a logistic regression with elastic net regularization to do feature selection and build a model at the same time.

I choose elastic net regularization because it can keep groups of highly correlated features. Lasso may select one out of many correlated features and ignore the others. Ridge regularization does not produce sparse coefficients, so it may give weight to unimportant features. Elastic net is computationally more expensive, but that is acceptable while I am prototyping.

ref: https://stats.stackexchange.com/questions/93181/ridge-lasso-and-elastic-net/225263
ref: https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge
ref: https://stats.stackexchange.com/questions/345343/any-disadvantages-of-elastic-net-over-lasso
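
To make the trade-off concrete, the sketch below computes the elastic net penalty for a small weight vector. It assumes the convention I believe scikit-learn uses for `SGDClassifier` with `penalty='elasticnet'`, where `l1_ratio` mixes the L1 term with half the squared L2 term. The function name `elastic_net_penalty` and the weights are made up for illustration.

```python
def elastic_net_penalty(weights, l1_ratio):
    """Mix of L1 and squared L2 penalties (assumed convention, see the text above)."""
    l1 = sum(abs(w) for w in weights)       # lasso part, can push weights to exactly zero
    l2_sq = sum(w * w for w in weights)     # ridge part, shrinks weights smoothly
    return l1_ratio * l1 + (1.0 - l1_ratio) * 0.5 * l2_sq

weights = [3.0, -4.0, 0.0]

pure_lasso = elastic_net_penalty(weights, 1.0)
pure_ridge = elastic_net_penalty(weights, 0.0)
equal_mix = elastic_net_penalty(weights, 0.5)

print(pure_lasso, pure_ridge, equal_mix)
```

With `l1_ratio` strictly between 0 and 1, the L1 part can still produce sparse coefficients while the L2 part lets correlated features share weight instead of all but one being dropped.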

### 3. Random Forest

I will try a random forest classifier because it usually performs well out of the box with little tuning.

Random forest is sensitive to imbalanced classes. If the hyperparameter optimizer selects it as the best model, I will re-run it with sklearn's built-in class weight balancing to validate the result.
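
As a rough illustration of what class weight balancing does, the sketch below computes "balanced" weights for a made-up label array (`y_toy` is hypothetical, not the notebook's labels). I assume here that the built-in option meant above is `class_weight='balanced'`, which weights each class by `n_samples / (n_classes * class_count)`, so the rarer class counts for more during training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 75 negatives, 25 positives
y_toy = np.array([0] * 75 + [1] * 25)
classes = np.unique(y_toy)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights by hand: n_samples / (n_classes * count_per_class)
by_hand = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
print(by_hand)
```

Passing `class_weight='balanced'` to `RandomForestClassifier` applies weights of this kind internally, so the data itself does not need to be resampled.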

### 4. kNN and SVC

I want to use the `hyperopt-sklearn` package to help me with the model tuning. Since it supports kNN and SVC, I will try them out.

kNN is not sensitive to imbalanced data, so I do not have to adjust anything if I need to re-run the model to validate the result.
SVM, however, needs class balancing similar to random forest.

ref: https://hyperopt.github.io/hyperopt-sklearn/
ref: https://stats.stackexchange.com/questions/122095/does-support-vector-machine-handle-imbalanced-dataset

In [26]:
# set seed so the results are reproducible
np.random.seed(123)

clf = hp.pchoice(label='donor_classifier',
                 p_options= [(0.25, random_forest('RandomForest')),
                             (0.25, sgd('LogisticRegression')),
                             (0.25, knn('knn')),
                             (0.25, svc('svc'))])

In [27]:
estim = HyperoptEstimator(classifier=clf, algo=tpe.suggest, max_evals=150, trial_timeout=60)

In [28]:
# ref: http://www.matthewemery.ca/pdf/hpsklearn-pydata.pdf

clock_in = timeit.default_timer()

estim.fit(X_train, y_train,
          n_folds=5,
          cv_shuffle=True,
          random_state=0)

clock_out = timeit.default_timer()
print("time taken to run hyperopt-sklearn optimizer:", clock_out - clock_in)



time taken to run hyperopt-sklearn optimizer: 4206.714160047


In [29]:
# save the estim object as pickle so I don't have to retrain later
# ref: https://www.thoughtco.com/using-pickle-to-save-objects-2813661
joblib.dump(estim, "hyperopt_estim.pkl")

['hyperopt_estim.pkl']
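
For completeness, here is a minimal sketch of the save and reload round trip. It assumes the standalone `joblib` package is installed and uses a small dictionary (`fitted_stand_in`, a made-up name) in place of the fitted estimator, written to a temporary directory so nothing in the working folder is touched.

```python
import os
import tempfile
import joblib

# Stand-in for the fitted estimator (any picklable Python object works)
fitted_stand_in = {"model": "random_forest", "n_estimators": 65}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in.pkl")
    joblib.dump(fitted_stand_in, path)   # write the object to disk
    restored = joblib.load(path)         # read it back, as a later session would

print(restored == fitted_stand_in)
```

Reloading the real file would follow the same pattern with `joblib.load("hyperopt_estim.pkl")`, provided the same package versions are installed as when it was saved.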

In [30]:
# inspect the result
# hyperopt-sklearn only supports accuracy and R2 score
# I will use accuracy for now

print(estim.score(X_val, y_val))

0.8543013426156141




In [31]:
print(estim.best_model())

{'learner': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.2113512273031174,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=65, n_jobs=1, oob_score=False, random_state=2,
            verbose=False, warm_start=False), 'preprocs': (StandardScaler(copy=True, with_mean=False, with_std=True),), 'ex_preprocs': ()}


The hyperopt-sklearn package identifies random forest as better than SVC, logistic regression and kNN.
The best model keeps `class_weight=None`. I take this as a sign that the class imbalance in this data is not severe enough to make the random forest unreliable, although accuracy alone cannot confirm that.

The validation accuracy of this classifier is about 85%.

Next I will try to train a neural net in a different notebook and compare it with the random forest classifier.

*Evaluation metric to compare Neural Net and Random Forest*

ref: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

Precision answers the question "What proportion of positive identifications was actually correct?".

Recall answers the question "What proportion of actual positives was identified correctly?"

For donor prediction, I assume that a nonprofit has limited resources to launch fundraisers. To maximize fundraising efficiency, both precision and recall are important, so that the nonprofit neither misses potential donors nor misidentifies bad donors as good donors in its list.

F1 score is a combined metric that looks at both precision and recall. I will use F1 score to compare my models.
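
As a small worked example, the sketch below computes precision, recall and F1 on made-up labels (`y_true` and `y_pred` are hypothetical, not results from this notebook) using the scikit-learn metric functions. It also recomputes F1 by hand as the harmonic mean of precision and recall.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = donor, 0 = non-donor
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two

# Same F1 computed by hand from precision and recall
f1_by_hand = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all come out at 0.75.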
