# Hackathon - Predicting Income Level
## A Collaborative Classification Challenge

### Description:

#### Purpose:
Build the best possible model to predict whether an individual's wage is above or below $50,000, based on the given dataset.

Three teams were each randomly assigned a specific constraint:

1. **Team Sample Constraint**: must use the cheap train sample, a smaller sample size
2. **Team Features Constraint**: limited to a maximum of 20 features
3. **Team Algorithm Constraint**: must use a Random Forest

Our team, **Team Alligator**, was assigned the **Algorithm** constraint. The challenge is to build the best possible model to predict whether an individual's wage is above or below $50,000 using the **RANDOM FOREST** algorithm **ONLY**.

#### Team Alligator:
1. Antony Paulson Chazhoor
2. Eli Regen
3. Kai Zhao
4. Kevin Roesch

#### Date and Time
2019-7-15

### Table of Contents

0. [Import Libraries](#0.0---Import-Libraries)
1. [Load Data](#1.0---Load-Data)
2. [EDA & Data Cleaning](#2.0---EDA-&-Data-Cleaning)
3. [Preprocessing & Feature Engineering](#3.0---Preprocessing-&-Feature-Engineering)
4. [Train/Test Split](#4.0---Train/Test-Split)
5. [Modeling with GridSearchCV](#5.0---Modeling-with-GridSearchCV)
6. [Submission](#6.0---Submission)


### 0.0 - Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

### 1.0 - Load Data

In [2]:
# Name the training frame df_sample so it is not confused with the train/test split names used later
# Use the large training data sample
df_sample = pd.read_csv('./data/large_train_sample.csv')
df_test = pd.read_csv('./data/test_data.csv')

In [3]:
# Review the first 3 rows of the training data
df_sample.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country,wage
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,Male,0,0,40,United-States,<=50K


In [4]:
# Review the first 3 rows of the testing data
df_test.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,Male,0,0,40,United-States


### 2.0 - EDA & Data Cleaning

1. The given dataset has no missing (NaN) values.
2. The data types are correct.

In [5]:
# Check sample data missing values
df_sample.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
wage              0
dtype: int64

In [6]:
# Check test data missing values
df_test.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
dtype: int64

In [7]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
wage              32561 non-null object
dtypes: int64(6), object(8)
memory usage: 3.5+ MB
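
`isna()` only counts true NaN values, so a string placeholder for missing data would pass the checks above unnoticed. The sketch below uses a small made-up frame with a hypothetical `' ?'` placeholder to show how such values could be counted. Whether this dataset actually contains any placeholder is not established here; the leading space in the toy values mirrors the one visible in dummy column names such as `wage_ >50K`.

```python
import pandas as pd

# Toy frame (made up for illustration); ' ?' is a hypothetical placeholder
toy = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov'],
    'occupation': [' Sales', ' ?', ' ?'],
    'age': [39, 50, 38],
})

# isna() finds nothing because ' ?' is an ordinary string
nan_counts = toy.isna().sum()
print(nan_counts.sum())

# Count the placeholder in the text columns, ignoring surrounding whitespace
text_cols = ['workclass', 'occupation']
placeholder_counts = toy[text_cols].apply(lambda col: (col.str.strip() == '?').sum())
print(placeholder_counts)
```

If such a check did find placeholders, they would pass through `pd.get_dummies` as a category of their own.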


### 3.0 - Preprocessing & Feature Engineering

**3.1 X and y**

In [8]:
# Dummify the dependent variable y and check the first 5 rows
df_sample = pd.get_dummies(df_sample, columns=['wage'], drop_first=True)
df_sample.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country,wage_ >50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Female,0,0,40,Cuba,0
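
The dummy column name `wage_ >50K` keeps the leading space of the raw label. As an alternative sketch (not what this notebook does), the target could be encoded directly as a 1-D 0/1 Series, assuming the labels equal `<=50K` and `>50K` once whitespace is stripped:

```python
import pandas as pd

# Toy labels (made up) with the leading space seen in the dummy column name
wage = pd.Series([' <=50K', ' >50K', ' <=50K'], name='wage')

# 1 for '>50K', 0 otherwise; the result is a 1-D Series rather than a one-column frame
y_toy = (wage.str.strip() == '>50K').astype(int)
print(y_toy.tolist())  # [0, 1, 0]
```

A 1-D target is the shape scikit-learn estimators expect for a single output.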


In [9]:
# Separate X and y (copy X so a label column can be added later without a pandas warning)
df_sample_X = df_sample.loc[:, df_sample.columns != 'wage_ >50K'].copy()
y = df_sample.loc[:, df_sample.columns == 'wage_ >50K']

In [10]:
# Check the size of the training and testing datasets
print(f'Size of training data: {df_sample.shape}')
print(f'Size of testing data: {df_test.shape}')

Size of training data: (32561, 14)
Size of testing data: (16281, 13)


**3.2 Combine Training X and Testing X**

In [11]:
# Add a label column to the training and testing datasets (for combine -> create dummy -> separate)
df_sample_X['dataset'] = 'train'
df_test['dataset'] = 'test'

In [12]:
# Combine training X and testing X, and check the shape of the combined dataframe
df_X = pd.concat([df_sample_X, df_test], axis=0, sort=False)
print(f'Size of combined data: {df_X.shape}')

Size of combined data: (48842, 14)


**3.3 Dummy Variables**

In [13]:
# Drop the redundant 'education' column: the original dataset already encodes it
# numerically in the separate 'education-num' column
df_X.drop(columns=['education'], inplace=True)

In [14]:
# Check the columns and confirm the deletion
df_X.head(2)

Unnamed: 0,age,workclass,fnlwgt,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country,dataset
0,39,State-gov,77516,13,Never-married,Adm-clerical,Not-in-family,Male,2174,0,40,United-States,train
1,50,Self-emp-not-inc,83311,13,Married-civ-spouse,Exec-managerial,Husband,Male,0,0,13,United-States,train


In [15]:
# Create dummy variables for the categorical columns
df_X = pd.get_dummies(df_X,
                      columns=['workclass',
                               'marital-status',
                               'occupation',
                               'relationship',
                               'sex',
                               'native-country'],
                      drop_first=True
                     )
# Check the top 3 rows
df_X.head(3)

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,dataset,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,77516,13,2174,0,40,train,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,train,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,train,0,0,0,...,0,0,0,0,0,0,0,1,0,0


**3.4 Separate Training X and Testing X**

In [16]:
# Separate using label
X = df_X.loc[df_X['dataset']=='train', :]
X_final_test = df_X.loc[df_X['dataset']=='test', :]

# Check the size of the Training X, Training y and Testing X
print(f'Training X size after dummy: {X.shape}')
print(f'Training y size after dummy: {y.shape}')
print(f'Testing X size after dummy: {X_final_test.shape}')

Training X size after dummy: (32561, 82)
Training y size after dummy: (32561, 1)
Testing X size after dummy: (16281, 82)
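
A likely reason for the combine -> create dummy -> separate steps is that both frames end up with identical dummy columns, even when a category appears in only one of them. A minimal sketch on made-up data, where all values and the `country` column are illustrative only:

```python
import pandas as pd

# Toy data (made up): 'Cuba' appears only in the test rows
train = pd.DataFrame({'age': [39, 50], 'country': ['US', 'US']})
test = pd.DataFrame({'age': [25, 38], 'country': ['US', 'Cuba']})

# Dummifying each frame separately gives different column sets
sep_train = pd.get_dummies(train, columns=['country'])
sep_test = pd.get_dummies(test, columns=['country'])
print(set(sep_test.columns) - set(sep_train.columns))

# Label, combine, dummify, then separate by label
train['dataset'] = 'train'
test['dataset'] = 'test'
both = pd.get_dummies(pd.concat([train, test], axis=0), columns=['country'])
X_tr = both.loc[both['dataset'] == 'train'].drop(columns='dataset')
X_te = both.loc[both['dataset'] == 'test'].drop(columns='dataset')
print(list(X_tr.columns) == list(X_te.columns))  # True
```

The matching 82-column shapes printed above are consistent with this: the training and testing frames share one set of dummy columns.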


In [17]:
# Drop the label column (reassign instead of using inplace to avoid pandas' SettingWithCopyWarning)
X = X.drop(columns='dataset')
X_final_test = X_final_test.drop(columns='dataset')

# Check the size of the Training X, Training y and Testing X
print(f'Training X size after dummy: {X.shape}')
print(f'Training y size after dummy: {y.shape}')
print(f'Testing X size after dummy: {X_final_test.shape}')

Training X size after dummy: (32561, 81)
Training y size after dummy: (32561, 1)
Testing X size after dummy: (16281, 81)


### 4.0 - Train/Test Split

In [18]:
# Train/Test Split the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
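
By default, `train_test_split` holds out 25% of the rows. As an optional variation that this notebook does not use, passing `stratify=y` keeps the class proportions the same in both parts. A sketch on synthetic labels (all values made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (made up): 80 zeros and 20 ones
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# Default test_size is 0.25; stratify keeps the 4:1 class ratio in each part
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)
print(len(X_a), len(X_b))      # 75 25
print(y_a.mean(), y_b.mean())  # 0.2 0.2
```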

### 5.0 - Modeling with GridSearchCV

**5.1 Training Model with the Train/Test Split Data**
> The training and hold-out accuracy scores (see below) are nearly identical (0.8589 vs. 0.8587), which suggests the model is not overfitting.

In [19]:
# Hyperparameter grid to search (3 x 3 x 2 x 2 = 36 combinations)
grid_params = {
   'n_estimators': [114, 350, 500],
   'min_samples_leaf': [35, 50, 100],
   'oob_score': [False, True],
   'class_weight': [None, 'balanced']
}

# Instantiate GridSearch
gs = GridSearchCV(RandomForestClassifier(random_state=42), 
                  param_grid=grid_params, 
                  n_jobs=-1,
                  verbose=1,
                  cv=3)

# Fit the model (ravel y to a 1-D array, the shape scikit-learn expects for a single target)
gs.fit(X_train, y_train.values.ravel())

# Print the best cross-validation score
print(f'The CV score for the best performing model is {gs.best_score_}')

# Print the hyperparameters for the best performing model
print('The hyperparameters for the best performing model are summarized below:')
print(gs.best_params_)

# Print accuracy for the training and testing portions of the train/test split
print(f'The Accuracy for the training portion of the training dataset is {gs.score(X_train, y_train)}')
print(f'The Accuracy for the testing portion of the training dataset is {gs.score(X_test, y_test)}')

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   54.8s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  2.3min finished


The CV score for the best performing model is 0.8531531531531531
The hyperparameters for the best performing model are summarized below:
{'class_weight': None, 'min_samples_leaf': 35, 'n_estimators': 500, 'oob_score': False}
The Accuracy for the training portion of the training dataset is 0.8589271089271089
The Accuracy for the testing portion of the training dataset is 0.8587397125660239
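
The log line `Fitting 3 folds for each of 36 candidates, totalling 108 fits` follows from the grid size: 3 × 3 × 2 × 2 = 36 combinations, each fitted on 3 folds. This can be checked with scikit-learn's `ParameterGrid`, using the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
grid_params = {
    'n_estimators': [114, 350, 500],
    'min_samples_leaf': [35, 50, 100],
    'oob_score': [False, True],
    'class_weight': [None, 'balanced'],
}

n_candidates = len(ParameterGrid(grid_params))  # one candidate per combination
n_folds = 3                                     # cv=3 in GridSearchCV
print(n_candidates, n_candidates * n_folds)     # 36 108
```

Note that `oob_score` only switches out-of-bag scoring on or off; it does not change how the trees are built. Dropping it from the grid would therefore halve the number of fits, presumably without changing which model is selected.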


**5.2 Fit and Model the Full Training Dataset**

In [20]:
# Create a grid search model
gs_final = GridSearchCV(RandomForestClassifier(random_state=42),
                        param_grid=grid_params,
                        n_jobs=-1,
                        verbose=1,
                        cv=3)

# Fit the model with the full training dataset (ravel y to a 1-D array, the shape scikit-learn expects)
gs_final.fit(X, y.values.ravel())

# Print the best cross-validation score
print(f'The CV score for the best performing model is {gs_final.best_score_}')

# Print the hyperparameters for the best performing model, to confirm they match the training model above
print('The hyperparameters for the best performing model are summarized below:')
print(gs_final.best_params_)

# Print accuracy for the full training dataset
print(f'The Accuracy for the full training dataset is {gs_final.score(X, y)}')

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  3.6min finished


The CV score for the best performing model is 0.8560240778845859
The hyperparameters for the best performing model are summarized below:
{'class_weight': None, 'min_samples_leaf': 35, 'n_estimators': 500, 'oob_score': False}
The Accuracy for the full training dataset is 0.8606001044193975


In [21]:
# Calculate the probability for each class
y_prob = gs_final.predict_proba(X_final_test)

# Check the size of the probability table
print(f'The size of the probability table is {y_prob.shape}')

# Review the top 5 rows
pd.DataFrame(y_prob, columns=['<=50K', '>50K']).head()

The size of the probability table is (16281, 2)


Unnamed: 0,<=50K,>50K
0,0.994975,0.005025
1,0.722561,0.277439
2,0.704862,0.295138
3,0.195465,0.804535
4,0.99699,0.00301
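
The column labels `<=50K` and `>50K` above rely on the order of the `predict_proba` columns, which follows the fitted classifier's `classes_` attribute (sorted labels, so 0 comes before 1). A minimal sketch on synthetic data, with toy features and a small forest chosen only for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (made up): the label depends on the first feature
rng = np.random.RandomState(42)
X_toy = rng.rand(60, 3)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Columns of predict_proba are ordered like clf.classes_
print(clf.classes_)
pos_col = list(clf.classes_).index(1)  # column holding P(class == 1)
p_high = proba[:, pos_col]
print(proba.shape, p_high[:3])
```

For the fitted search object, the same attribute should be reachable as `gs_final.best_estimator_.classes_`.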


### 6.0 - Submission

In [22]:
# Generate .csv file for submission (keep only the predicted probability of the '>50K' class)
y_final_df = pd.DataFrame(y_prob, columns=['low_wage', 'wage'])
y_final_df['wage'].to_csv('y_prob_sf_alligator_algorithm_constrain.csv', header=True, index=False)
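
As a sanity check before submitting, the file can be read back to confirm that it holds a single `wage` column of probabilities. The sketch below uses made-up probabilities and a hypothetical temporary file name instead of the real submission file:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up probabilities standing in for y_prob
y_prob_toy = np.array([[0.9, 0.1], [0.3, 0.7]])
out = pd.DataFrame(y_prob_toy, columns=['low_wage', 'wage'])

# Hypothetical file name in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'submission_check.csv')
out[['wage']].to_csv(path, index=False)  # one-column frame, so the header is always written

# Read the file back and verify its layout
check = pd.read_csv(path)
print(list(check.columns), check.shape)  # ['wage'] (2, 1)
```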