# Simple Model Comparisons

This notebook compares baseline versions of several classification models to help direct further optimization. Because time is limited, only a shortlist of models will be taken forward to pipelines and grid search.

Magnus Bigelow

## Contents

- [Imports](#Imports)
- [Useful Functions](#Useful-Functions)
- [Modeling](#Modeling)
- [Scoring Models](#Scoring-Models)

## Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# General Modeling Imports 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC

In [2]:
train = pd.read_csv('../data/clean_train.csv')

In [4]:
train.head()

Unnamed: 0,age,education-num,sex,capital-gain,capital-loss,hours-per-week,wage,marital_status_num,occupation_com_House_Services,occupation_com_Other,occupation_com_Professional,occupation_com_Specialty,occupation_com_Tech/sales,workclass_com_ Government,workclass_com_ Other,workclass_com_ Private,workclass_com_ Self-employed,cap_gain_binary,cap_loss_binary,gdp_pc
0,39,13,1,2174,0,40,0,0,0,1,0,0,0,1,0,0,0,1,0,41524.09
1,50,13,1,0,0,13,0,1,0,0,1,0,0,0,0,0,1,0,0,41524.09
2,38,9,1,0,0,40,0,0,1,0,0,0,0,0,0,1,0,0,0,41524.09
3,53,7,1,0,0,40,0,1,1,0,0,0,0,0,0,1,0,0,0,41524.09
4,28,13,1,0,0,40,0,1,0,0,1,0,0,0,0,1,0,0,0,12492.097


In [5]:
train.isnull().sum()

age                              0
education-num                    0
sex                              0
capital-gain                     0
capital-loss                     0
hours-per-week                   0
wage                             0
marital_status_num               0
occupation_com_House_Services    0
occupation_com_Other             0
occupation_com_Professional      0
occupation_com_Specialty         0
occupation_com_Tech/sales        0
workclass_com_ Government        0
workclass_com_ Other             0
workclass_com_ Private           0
workclass_com_ Self-employed     0
cap_gain_binary                  0
cap_loss_binary                  0
gdp_pc                           0
dtype: int64

## Useful Functions

In [3]:
# Print train and test F1 scores for a fitted binary classifier
def f1(model, X_train, y_train, X_test, y_test):
    """Print the F1 score of an already-fitted model on the train and test sets."""
    y_train_p = model.predict(X_train)
    y_test_p = model.predict(X_test)
    f_train = f1_score(y_train, y_train_p)
    f_test = f1_score(y_test, y_test_p)
    print(f'Train F1: {round(f_train, 3)}')
    print(f'Test F1: {round(f_test, 3)}')
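
If the scores need to be collected for later comparison rather than only printed, a variant of the helper could return them instead. The sketch below is illustrative only: the name `f1_scores` and the synthetic data from `make_classification` are assumptions and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def f1_scores(model, X_train, y_train, X_test, y_test):
    """Return train and test F1 for an already-fitted binary classifier."""
    return {
        "train_f1": f1_score(y_train, model.predict(X_train)),
        "test_f1": f1_score(y_test, model.predict(X_test)),
    }


# Synthetic stand-in data (assumption: this is NOT the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=33)
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=33)

demo_model = LogisticRegression(max_iter=1000).fit(Xa, ya)
demo_scores = f1_scores(demo_model, Xa, ya, Xb, yb)
print(demo_scores)
```

Returning a dictionary makes it easy to build a table of results across models, while the printing helper above stays useful for quick checks.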

## Modeling

We will instantiate and fit all the models at once, then compare train and test F1 scores to determine which models to proceed with.

In [9]:
# Set up X and y (raw capital-gain / capital-loss and the two 'Other' dummy columns are left out)
X = train[['age', 'education-num', 'sex', 
       'hours-per-week', 'marital_status_num',
       'occupation_com_House_Services', 'occupation_com_Professional',
       'occupation_com_Specialty','occupation_com_Tech/sales', 
       'workclass_com_ Government','workclass_com_ Private',
       'workclass_com_ Self-employed', 'cap_gain_binary', 
       'cap_loss_binary','gdp_pc']]
y = train['wage']

# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=33)

# Scaled data for KNN
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)

# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train,y_train);

# KNN - requires scaled X
knn = KNeighborsClassifier()
knn.fit(Z_train,y_train);

# decision tree
dt = DecisionTreeClassifier(random_state=33)
dt.fit(X_train,y_train);

# Bagged decision trees
bt = BaggingClassifier(random_state=33)
bt.fit(X_train,y_train);

# random forest
rf = RandomForestClassifier(random_state=33)
rf.fit(X_train,y_train);

# Adaboost
ab = AdaBoostClassifier(random_state=33)
ab.fit(X_train,y_train);

# Support Vector Machine
svc = SVC()
svc.fit(X_train,y_train);
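
The fit-everything-at-once approach above could also be written as a loop over a dictionary of estimators, which keeps fitting and scoring in one place. This is a minimal sketch on synthetic data, not the notebook's actual code: the names `candidates` and `results` are hypothetical, and wrapping KNN in `make_pipeline` with a `StandardScaler` is one possible way to avoid tracking separate scaled arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: NOT the notebook's dataset).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=33
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=33
)

# Scale-sensitive models carry their own scaler inside a pipeline.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(random_state=33),
}

# Fit each estimator and record (train F1, test F1).
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    results[name] = (
        f1_score(y_tr, estimator.predict(X_tr)),
        f1_score(y_te, estimator.predict(X_te)),
    )

for name, (train_f1, test_f1) in results.items():
    print(f"{name}: train F1 {train_f1:.3f}, test F1 {test_f1:.3f}")
```

Adding or removing a model then only requires editing the dictionary.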

## Scoring Models

In [10]:
# Logistic Regression
print('Logistic Regression')
f1(logreg,X_train, y_train, X_test, y_test)

# KNN
print('\nKNN')
f1(knn,Z_train, y_train, Z_test, y_test)

# Decision Tree
print('\nDecision Tree')
f1(dt,X_train, y_train, X_test, y_test)

# Bagged dt
print('\nBagged Decision Trees')
f1(bt,X_train, y_train, X_test, y_test)

# Random forest
print('\nRandom Forest')
f1(rf,X_train, y_train, X_test, y_test)

# AdaBoost
print('\nAdaBoost')
f1(ab,X_train, y_train, X_test, y_test)

# SVC
print('\nSupport Vector Machine')
f1(svc,X_train, y_train, X_test, y_test)

Logistic Regression
Train F1: 0.572
Test F1: 0.562

KNN
Train F1: 0.721
Test F1: 0.624

Decision Tree
Train F1: 0.907
Test F1: 0.556

Bagged Decision Trees
Train F1: 0.888
Test F1: 0.581

Random Forest
Train F1: 0.91
Test F1: 0.608

AdaBoost
Train F1: 0.641
Test F1: 0.646

Support Vector Machine
Train F1: 0.0
Test F1: 0.0
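
To make the degree of overfitting easier to compare, the scores printed above can be reduced to a train-minus-test gap per model. The snippet below simply copies the reported numbers into a dictionary (the names `reported` and `gaps` are illustrative) and sorts by gap; a larger positive gap suggests more overfitting.

```python
# Train/test F1 pairs copied from the output above.
reported = {
    "Logistic Regression": (0.572, 0.562),
    "KNN": (0.721, 0.624),
    "Decision Tree": (0.907, 0.556),
    "Bagged Decision Trees": (0.888, 0.581),
    "Random Forest": (0.91, 0.608),
    "AdaBoost": (0.641, 0.646),
    "Support Vector Machine": (0.0, 0.0),
}

# Gap = train F1 minus test F1, rounded to match the reported precision.
gaps = {name: round(train - test, 3) for name, (train, test) in reported.items()}

# Print models from largest to smallest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} gap = {gap:+.3f}")
```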


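Based on these baseline, un-optimized models, the Support Vector Machine does not seem worth optimizing further here. It scored an F1 of 0.0 on both the train and test sets, meaning it produced no true positives. The SVC was fit on unscaled features, which may contribute to this result, but given the limited time our effort is better spent on the other models.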
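
If time allowed, one inexpensive check (not performed in this notebook) would be to refit the SVC on standardized features and see whether the score changes. The sketch below uses synthetic data with one added large-magnitude column as a hypothetical stand-in for a feature like `gdp_pc`; the variable names are assumptions and the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption: NOT the notebook's dataset).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=6, weights=[0.75, 0.25], random_state=33
)

# Append an uninformative column on a much larger scale than the others
# (hypothetical stand-in for a feature such as gdp_pc).
rng = np.random.default_rng(33)
big_column = rng.normal(30000, 10000, size=(X_syn.shape[0], 1))
X_syn = np.hstack([X_syn, big_column])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=33
)

# Same default SVC, once on raw features and once behind a StandardScaler.
svc_raw = SVC().fit(X_tr, y_tr)
svc_scaled = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

# zero_division=0 avoids a warning when no positive class is predicted.
f1_raw = f1_score(y_te, svc_raw.predict(X_te), zero_division=0)
f1_scaled = f1_score(y_te, svc_scaled.predict(X_te), zero_division=0)
print(f"Unscaled SVC test F1: {f1_raw:.3f}")
print(f"Scaled SVC test F1:   {f1_scaled:.3f}")
```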

Of the remaining models, the simple decision tree shows the largest gap between train and test F1 (0.907 vs. 0.556), and a single decision tree tends to overfit in most circumstances. We will therefore optimize the following model types using pipelines and grid search:
- K-Nearest Neighbors
- Bagged Decision Trees
- Random Forest
- AdaBoost
- Logistic Regression
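
As a rough idea of what the pipeline and grid search step might look like for one of these models, here is a minimal sketch for K-Nearest Neighbors. The synthetic data, the step names `ss` and `knn`, and the small parameter grid are all assumptions chosen for illustration, not the settings used in the follow-up work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: NOT the notebook's dataset).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=33
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=33
)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = Pipeline([
    ("ss", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid only; parameters are addressed as <step name>__<parameter>.
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}

# Optimize F1, the same metric used for the baseline comparison.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print("Best CV F1:", round(search.best_score_, 3))
print("Test F1:", round(search.score(X_te, y_te), 3))
```

Keeping the scaler inside the pipeline means the cross-validation folds never see statistics computed from their own held-out data.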