# Model Selection and Performance Testing<br>
This notebook follows on from the model_selection_v0 Python notebook, which tested several optimized logistic regression models for predicting risk of 7-day mortality. Here, I finalize those models and report their performance on validation data, using 10-fold cross-validation. <br><br>

Additionally, other models will be run (kNN, decision tree, maybe a neural net) to show future options and a performance ceiling given the data inputs, despite their lack of interpretability. I think deep learning methods, or even XGBoost, are a long-term modeling option, but right now the models need to be fully interpretable. Thus, the focus on logistic regression.<br><br>

While the previous notebook identified column-wise median imputation of missing data as the superior method (compared with filling missing values with 0), I'm still going to test each model version against both imputation methods, as I think it is worthwhile to quantify the improvement in model performance from median imputation. Seeing the errors under each imputation method will also be informative. <br><br>

One data processing step I would also like to test is z-scaling (standardizing each variable to mean = 0, SD = 1). With every variable on a common scale, the model coefficients become directly comparable, which should give a better picture of which variables are most informative of the outcome and help with clinical interpretability. Yes, the coefficients would need to be converted back to the original units for direct interpretation, but I don't think this will be necessary, as most people right now just want to know WHAT is informative, not necessarily how much.
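
As a rough illustration of the z-scaling idea, the sketch below standardizes two made-up features that live on very different scales, fits a logistic regression, and converts the standardized coefficients back to original units. The data, the feature names (`age_years`, `lab_value`), and the model settings are all assumptions for illustration only, not the notebook's actual variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data (hypothetical features on very different scales)
n = 500
age_years = rng.normal(75, 10, n)
lab_value = rng.normal(0.5, 0.05, n)
X_demo = np.column_stack([age_years, lab_value])

# Both features contribute equally once their scale is removed
signal = (age_years - 75) / 10 + (lab_value - 0.5) / 0.05
y_demo = (signal + rng.normal(0, 1, n) > 0).astype(int)

# z-scale: each column ends up with mean 0 and SD 1
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)

model = LogisticRegression(C=10, solver='liblinear').fit(X_scaled, y_demo)

# Coefficients on the standardized scale: change in log-odds per 1 SD
coef_scaled = model.coef_[0]
# Back-conversion: change in log-odds per one original unit
coef_original = coef_scaled / scaler.scale_

print("Per-SD coefficients:  ", coef_scaled)
print("Per-unit coefficients:", coef_original)
```

On the standardized scale the two coefficients can be compared directly, whereas the per-unit values differ mostly because of the units each feature is measured in.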

In [2]:
%matplotlib inline

#load libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split #sklearn.cross_validation is deprecated in favor of model_selection
from sklearn.neighbors import KNeighborsClassifier 
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif, RFECV
from sklearn.tree import DecisionTreeClassifier 

from IPython.display import display #displays full dataframe columns
#display all dataframe columns when printed
pd.options.display.max_columns = None

In [None]:
#load data
df = pd.read_csv('C:/Users/Mark.Burghart/Documents/projects/hospice_carepoint/data/transformed/carepoint_transformed_dummied.csv', index_col=0)
df.shape

### Train/Test Split<br>
The same split and random seed from the previous model_selection_v0 notebook are used here.

In [None]:
#separate variables (X) from outcome of interest (y)
cols = df.columns.tolist() #converts column names to list
feature_cols = [x for x in cols if x != 'death_within_7_days'] #removes outcome of interest from list ('death_within_7_days')

#extract rows
#print(feature_cols) #debug
X = df.loc[:, feature_cols]
X.shape #outcome column has been removed

In [None]:
#save outcome variable as y
y = df.death_within_7_days
y.shape

In [None]:
#separate data into training/test (aka holdout) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 23) #random_state for reproducibility

In [None]:
#create objects with specific imputation method
#impute missing values: replacing NaNs with Median Column value for each column
X_train_med = X_train.fillna(X_train.median()) 
y_train_med = y_train.fillna(y_train.median())

#impute missing values: replacing NaNs with 0
X_train_zero = X_train.fillna(0)
y_train_zero = y_train.fillna(0)
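
The cell above fills missing values once, using medians from the whole training set, so during cross-validation each held-out fold has already contributed to the imputed values. One possible alternative, sketched below on synthetic data, is to move the imputation step inside the pipeline so it is re-fit on each training fold. This assumes a newer scikit-learn release that provides `sklearn.impute.SimpleImputer` (older releases expose `Imputer` in `sklearn.preprocessing` instead); `make_model`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # assumption: newer scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(23)

# Synthetic stand-in data: 5 numeric features, outcome driven by x0
n = 400
X_demo = pd.DataFrame(rng.normal(50, 10, size=(n, 5)),
                      columns=["x%d" % i for i in range(5)])
y_demo = (X_demo["x0"] + rng.normal(0, 10, n) > 50).astype(int)

# Blank out roughly 20% of the feature values at random
missing = rng.rand(n, 5) < 0.2
X_demo = X_demo.mask(missing)

def make_model(strategy, fill_value=None):
    """Pipeline whose imputer is re-fit on each training fold."""
    return make_pipeline(
        SimpleImputer(strategy=strategy, fill_value=fill_value),
        StandardScaler(),
        LogisticRegression(solver='liblinear', random_state=0),
    )

auc_median = cross_val_score(make_model("median"), X_demo, y_demo,
                             cv=10, scoring="roc_auc")
auc_zero = cross_val_score(make_model("constant", fill_value=0), X_demo, y_demo,
                           cv=10, scoring="roc_auc")

print("Median impute, mean AUC: %.3f" % auc_median.mean())
print("Zero impute, mean AUC:   %.3f" % auc_zero.mean())
```

Because the raw frame with NaNs is passed to `cross_val_score`, the two imputation strategies are compared on exactly the same folds.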

### Model Training<br>

Below, I have created a scikit-learn pipeline that chains a series of steps. First, the data are standardized to a mean of 0 and an SD of 1 (z-scaling, a.k.a. standard scaling). Next, the top 40 features are selected based on the ANOVA F-test. Then, a logistic regression (L1 or L2 penalty) is fit with the C parameter obtained from the v0 model selection notebook. Finally, the score for each cross-validation fold is printed, along with the mean across the 10 folds. <br><br>
#### L1 LASSO Logistic Regressions

In [None]:
%%time
#pipeline for cross validation, preprocessing, feature selection, and model training of L1 Logistic Regression
#solver = 'liblinear' is set explicitly because not every solver supports the L1 penalty
L1_pipeline = make_pipeline(StandardScaler(), SelectKBest(f_classif, k = 40), LogisticRegression(penalty = 'l1', solver = 'liblinear', random_state = 0, C = 100))
#L1 median imputation data
scores_L1_med = cross_val_score(L1_pipeline, X_train_med, y_train_med, cv = 10, scoring = 'roc_auc')
print("Cross validation AUC scores from each 10-fold run for L1 Regression (Median impute): ")
print(scores_L1_med) #explicit print: only the last bare expression in a cell is displayed
print("Mean cross validation AUC from 10-fold CV for L1 Regression (Median impute): ")
print(scores_L1_med.mean())
#L1 zero imputation data
scores_L1_zero = cross_val_score(L1_pipeline, X_train_zero, y_train_zero, cv = 10, scoring = 'roc_auc')
print("Cross validation AUC scores from each 10-fold run for L1 Regression (Zero impute): ")
print(scores_L1_zero)
print("Mean cross validation AUC from 10-fold CV for L1 Regression (Zero impute): ")
print(scores_L1_zero.mean())
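
When `cv` is given as an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. A minimal sketch of making the folds explicit, and of reporting the spread alongside the mean, is shown below. It runs on synthetic imbalanced data from `make_classification`; the shuffle seed and the data are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with an imbalanced outcome (about 15% positives)
X_demo, y_demo = make_classification(n_samples=600, n_features=60,
                                     n_informative=10, weights=[0.85, 0.15],
                                     random_state=0)

# Same three steps as the L1 pipeline: scale, keep top 40 by F-test, fit
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=40),
    LogisticRegression(penalty='l1', solver='liblinear', C=100, random_state=0),
)

# Explicit stratified folds: each fold keeps roughly the same outcome rate
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=folds, scoring='roc_auc')

print("Per-fold AUC:", np.round(scores, 3))
print("Mean AUC: %.3f (SD %.3f)" % (scores.mean(), scores.std()))
```

Reporting the fold-to-fold SD next to the mean helps judge whether a difference between two imputation methods is larger than the noise across folds.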

#### L2 Ridge Penalty Logistic Regressions

In [None]:
%%time
#pipeline for CV, feature selection, model training of L2 Logistic Regression
L2_pipeline = make_pipeline(StandardScaler(), SelectKBest(f_classif, k = 40), LogisticRegression(penalty = 'l2', random_state = 0, C = 10))
#L2 median imputation data
scores_L2_med = cross_val_score(L2_pipeline, X_train_med, y_train_med, cv = 10, scoring = 'roc_auc')
print("Cross validation AUC scores from each 10-fold run for L2 Regression (Median impute): ")
print(scores_L2_med) #explicit print: only the last bare expression in a cell is displayed
print("Mean cross validation AUC from 10-fold CV for L2 Regression (Median impute): ")
print(scores_L2_med.mean())
#L2 zero imputation data
scores_L2_zero = cross_val_score(L2_pipeline, X_train_zero, y_train_zero, cv = 10, scoring = 'roc_auc') #use the zero-imputed data here, not the median-imputed data
print("Cross validation AUC scores from each 10-fold run for L2 Regression (Zero impute): ")
print(scores_L2_zero)
print("Mean cross validation AUC from 10-fold CV for L2 Regression (Zero impute): ")
print(scores_L2_zero.mean())
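
Since the goal is to see which variables are informative, it can help to look inside a fitted pipeline. The sketch below fits the same three steps on synthetic data, recovers the 40 columns kept by `SelectKBest`, and ranks them by absolute coefficient. The data and the `feat_*` column names are made up; the step names (`selectkbest`, `logisticregression`) are the lowercase class names that `make_pipeline` assigns automatically.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=500, n_features=50,
                                    n_informative=8, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["feat_%d" % i for i in range(50)])

pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=40),
    LogisticRegression(penalty='l2', solver='liblinear', C=10, random_state=0),
)
pipeline.fit(X_demo, y_demo)

# Pull the fitted steps back out of the pipeline by name
selector = pipeline.named_steps['selectkbest']
clf = pipeline.named_steps['logisticregression']

# Boolean mask of kept columns, in the same order as the coefficients
selected = X_demo.columns[selector.get_support()]

coef_table = pd.DataFrame({'feature': selected, 'coef': clf.coef_[0]})
coef_table['abs_coef'] = coef_table['coef'].abs()
coef_table = coef_table.sort_values('abs_coef', ascending=False).reset_index(drop=True)

print(coef_table.head(10))
```

Because the inputs were z-scaled, the `abs_coef` column gives a rough ranking of how strongly each selected variable is associated with the outcome.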