I will build several classifiers in this Jupyter notebook, then select the best one to deploy.

In [1]:
# important dependencies
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import ElasticNet, ElasticNetCV, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from scipy import stats
%matplotlib inline

### 1. Preprocessing

In [2]:
# read csv
train_df = pd.read_csv('../../data/train_complete.csv', index_col= 0)
test_df = pd.read_csv('../../data/test_complete.csv', index_col= 0)

In [3]:
# check csv dimension

train_df.shape

(30162, 15)

In [4]:
test_df.shape

(15061, 15)

In [5]:
train_df['label'].head(10)

1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     1
9     1
10    1
Name: label, dtype: int64

In [6]:
# convert the discrete columns into dummy variables
dummy_cols = ["workclass", "education",
             "marital_stat", "occupation",
             "relationship", "race",
             "sex", "native_country"]

In [7]:
train_df_with_dummies = pd.get_dummies(train_df, columns= dummy_cols)

In [8]:
# inspect the training set again for dummies

train_df_with_dummies.head(5)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hr_per_wk,label,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Private,...,native_country_ Portugal,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia
1,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
5,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [11]:
train_df_with_dummies.shape

(30162, 105)

In [10]:
test_df_with_dummies = pd.get_dummies(test_df, columns= dummy_cols)
test_df_with_dummies.shape

(15061, 104)

The number of columns in `test` does not match `train`. I will check which column is missing from `test`.

In [12]:
# ref: https://stackoverflow.com/questions/45482755/compare-headers-of-dataframes-in-pandas
train_df_with_dummies.columns.difference(test_df_with_dummies.columns)

Index(['native_country_ Holand-Netherlands'], dtype='object')

In [13]:
test_df_with_dummies.columns.difference(train_df_with_dummies.columns)

Index([], dtype='object')

All the columns present in `test` are also present in `train`, but the column `native_country_ Holand-Netherlands` in `train` is missing from `test`.

I should not dig further because test sets are meant to be locked away. Technically I am not allowed to look at the test set yet.
I will just add this missing column into `test` then move on.

In [14]:
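# add the missing column as zeros and match train's column order, so X and X_test line up column by column
test_df_with_dummies = test_df_with_dummies.reindex(columns=train_df_with_dummies.columns, fill_value=0)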

In [15]:
# check shape again

test_df_with_dummies.shape

(15061, 105)
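
A matching shape does not guarantee a matching column order, and the order matters once the frames are converted to arrays with `.values`. A minimal sketch of the difference, using hypothetical toy frames (`toy_train`, `toy_test`) rather than the real data:

```python
import pandas as pd

# Hypothetical toy frames: column "c" exists only in train, like the case above.
toy_train = pd.DataFrame({"a": [1, 2], "c": [0, 1], "z": [5, 6]})
toy_test = pd.DataFrame({"a": [3], "z": [7]})

# Appending the missing column places it last, so the order differs from train.
appended = toy_test.copy()
appended["c"] = 0
print(list(appended.columns))

# Reindexing on train's columns adds "c" as zeros and follows train's order.
aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)
same_order = list(aligned.columns) == list(toy_train.columns)
print(list(aligned.columns), same_order)
```

Comparing `list(test.columns) == list(train.columns)` is a cheap check to run before building the feature arrays.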

### Create X and Y arrays for training

In [16]:
# drop NaN in the dataframe
train_df_noNaN = train_df_with_dummies.dropna()

In [17]:
train_df_noNaN.shape

(30162, 105)

As expected, no `NaN` is present in the training set because all the `?` values have been removed.

In [18]:
test_df_noNaN = test_df_with_dummies.dropna()

In [19]:
test_df_noNaN.shape

(15060, 105)

One row is dropped from the test set. It was not a real record: the phrase `|1x3 Cross validator` in the original `csv` was misread as a row. Dropping it does not affect the quality of `test`.
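
A dropped row can be inspected before it is removed by selecting the rows that contain at least one missing value. A minimal sketch on a hypothetical toy frame; the column in which the stray phrase lands and the numeric values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the last row mimics a stray text line parsed as a record.
toy = pd.DataFrame(
    {
        "age": ["25", "38", "|1x3 Cross validator"],
        "fnlwgt": [226802.0, 89814.0, np.nan],
        "label": [0.0, 0.0, np.nan],
    }
)

# Rows with at least one missing value; these are the rows dropna() removes.
rows_with_nan = toy[toy.isna().any(axis=1)]
print(rows_with_nan)

# Dropping them leaves only the complete rows.
cleaned = toy.dropna()
print(cleaned.shape)
```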

In [20]:
# create label array 

y = train_df_noNaN['label'].values
y.shape

(30162,)

In [22]:
y_test = test_df_noNaN['label'].values
y_test.shape

(15060,)

In [23]:
# create feature array 

X = train_df_noNaN.drop(['label'], axis=1).values
X.shape

(30162, 104)

In [24]:
X_test = test_df_noNaN.drop(['label'], axis=1).values
X_test.shape

(15060, 104)

The dimensions look alright, so I will start creating the validation set.

In [25]:
# create train-validation split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size= 0.2, random_state= 1)

In [26]:
X_train.shape

(24129, 104)

In [27]:
X_val.shape

(6033, 104)

The dimensions are correct.
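
The expected sizes can be worked out by hand. A small sketch of the arithmetic, assuming the validation size is the requested fraction of the rows rounded up and the training set takes the remainder:

```python
import math

n_samples = 30162  # rows in X
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the held-out size is rounded up, the rest goes to training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 24129 6033
```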

### 2. Baseline Logistic Regression as a Binary Classifier

Random forest is often a strong out-of-the-box classifier, so I will use it as a baseline model.

Note: I do not expect random forest to do very well because it is sensitive to imbalanced data, and my data is not an exact 50:50 split of positive and negative labels (ref: https://stats.stackexchange.com/questions/242833/is-random-forest-a-good-option-for-unbalanced-data-classification)
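
The class balance can be measured directly. A minimal sketch that uses only the ten labels printed by `train_df['label'].head(10)` earlier; the real check would run on the full `y`, so these fractions say nothing about the whole training set:

```python
import numpy as np

# First ten training labels shown earlier in the notebook (not the full set).
y_head = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Count each class, then convert the counts to fractions of the sample.
counts = np.bincount(y_head, minlength=2)  # [negatives, positives]
fractions = counts / y_head.size

print(counts, fractions)
```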

*Evaluation metric*: I will use precision, recall and F1 score to evaluate the models.
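
As a reminder of what these metrics measure, here is a small worked sketch in plain Python. `y_true` reuses the ten labels shown earlier, while `y_pred` is a hypothetical set of predictions invented for illustration, not output from any model in this notebook:

```python
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # first ten labels shown earlier
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # hypothetical predictions

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, fp, fn, precision, recall, f1)
```

In the notebook itself, `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics` compute the same quantities from the label and prediction arrays.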