# Random Forest

After reading (Liu, V. et al.) and (Mahmoud, M. et al.), I decided to begin experimenting with random forest models instead of logistic regression models.

This means I will not be using the engineered interaction terms, but these models can help analyze the base features through *feature importance*.
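As a rough illustration of what feature importance looks like in practice, the sketch below fits a small forest on synthetic data and ranks the columns by `feature_importances_`. The data, the `feature_i` column names, and the `rf_demo` variable are assumptions made for the example, not part of this project's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are hypothetical.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X, y)

# feature_importances_ holds the impurity-based importance of each column;
# the values are normalized so they sum to 1.
importances = pd.Series(
    rf_demo.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are probably best read as a first-pass ranking rather than a final answer.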

## Read and prep data for modeling

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [18]:
df = pd.read_csv('../data/frame_no_interactions.csv', index_col=0)
df.head()


Unnamed: 0,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,diag1_group_diabetes,diag1_group_digestive,...,glucose_group_normal,admit_type_group_elective,admit_type_group_na,admit_type_group_urgent,diabetesMed_flag,change_flag,metformin_flag,insulin_flag,num_drugs,target
0,1,41,0,1,0,0,0,1,1,0,...,0,0,1,0,0,0,0,0,0,0
1,3,59,0,18,0,0,0,9,0,0,...,0,0,0,0,1,1,0,1,1,0
2,2,11,5,13,2,0,1,6,0,0,...,0,0,0,0,1,0,0,0,1,0
3,2,44,1,16,0,0,0,7,0,0,...,0,0,0,0,1,1,0,1,1,0
4,1,51,0,8,0,0,0,5,0,0,...,0,0,0,0,1,1,0,1,2,0


In [19]:
# split initial train and test sets

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['target']), df['target'], test_size=0.2, stratify=df['target'])
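A quick way to confirm that `stratify` keeps the class balance is to compare the positive rate in each partition. The sketch below does this on a synthetic imbalanced frame; the `demo` frame, its `x` column, and the `random_state=42` seed are assumptions for the example (the split above does not set a seed).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (about 10% positives), standing in for `target`.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "x": rng.normal(size=1000),
    "target": (rng.random(1000) < 0.1).astype(int),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop(columns=["target"]),
    demo["target"],
    test_size=0.2,
    stratify=demo["target"],
    random_state=42,  # fixed seed so the split is reproducible
)

# With stratify, the positive rate should be nearly identical in both partitions.
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```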

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

cv = StratifiedKFold(
    n_splits=10,
    shuffle=True,
    random_state=42
    )

scaler = StandardScaler()

smote = SMOTE(random_state=42)
under_sample = RandomUnderSampler(random_state=42)
rf = RandomForestClassifier(random_state=42)

pipe = Pipeline([
    ('scaler', scaler),
    # ('smote', smote),
    ('under_sample', under_sample),
    ('model', rf)
])

scores = cross_val_score(
    pipe,
    X_train,
    y_train,
    cv=cv,
    # make_scorer(roc_auc_score) scores the hard labels from predict(),
    # not predicted probabilities
    scoring=make_scorer(roc_auc_score)
)

print('roc_auc score per fold: ', scores)
print('mean roc_auc score: ', scores.mean())

roc_auc score per fold:  [0.5865322  0.59842904 0.61576364 0.6030025  0.59377467 0.59653041
 0.60120737 0.58901398 0.58877793 0.58923077]
mean roc_auc score:  0.5962262523855333
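One caveat worth checking: `make_scorer(roc_auc_score)` with default arguments passes the hard 0/1 output of `predict()` to the metric, whereas the built-in `scoring="roc_auc"` string scores the predicted probabilities. The sketch below contrasts the two on synthetic data, so the numbers it prints say nothing about this dataset; `rf_demo`, `cv_demo`, and the data are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, make_scorer, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic imbalanced data (roughly 90/10); not the notebook's dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# Scores the hard 0/1 predictions from predict().
label_scores = cross_val_score(
    rf_demo, X, y, cv=cv_demo, scoring=make_scorer(roc_auc_score)
)
# Scores the ranking given by the predicted probabilities.
proba_scores = cross_val_score(rf_demo, X, y, cv=cv_demo, scoring="roc_auc")

print("label-based mean roc_auc:      ", label_scores.mean())
print("probability-based mean roc_auc:", proba_scores.mean())

# On hard binary predictions, ROC AUC reduces to balanced accuracy.
X_a, X_b, y_a, y_b = train_test_split(X, y, stratify=y, random_state=42)
preds = rf_demo.fit(X_a, y_a).predict(X_b)
same = np.isclose(roc_auc_score(y_b, preds), balanced_accuracy_score(y_b, preds))
print("matches balanced accuracy:", same)
```

If the probability-based score is the quantity of interest, switching the `scoring` argument is likely the smallest change, though the scores reported in this notebook would then need to be re-run.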


The first random forest run, which used SMOTE, scored barely above chance:
- mean roc_auc score:  0.5033108087188244

Swapping to RandomUnderSampler gave a noticeably better score:
- mean roc_auc score:  0.5962262523855333