# Overview

## This package helps you compare and deploy models in two steps:

1. Compare models built on most of your data. Some rows have to be held out to check accuracy; *these held-out rows are referred to as the test set*.
2. Pick the best approach, rebuild that model using all of your data, save the model, and deploy predictions on test data to SQL Server.

## Step 1

First we load the data. In this example we read from a simple CSV file; in practice the data usually comes directly from SQL Server.

In [2]:
from healthcareai import DevelopSupervisedModel
import pandas as pd
import time
df = pd.read_csv('../healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                 na_values=['None'])
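
To see what `na_values=['None']` does, here is a minimal sketch that reads a small CSV from memory instead of from disk. The three rows are made up for illustration and only borrow column names from the fixture file; the point is that cells containing the text `None` come back as missing values.

```python
import io

import pandas as pd

# Hypothetical three-row sample; not taken from HCPyDiabetesClinical.csv.
csv_text = """PatientEncounterID,SystolicBPNBR,GenderFLG
1,167.0,M
2,None,F
3,170.0,None
"""

# Treat the literal string 'None' as a missing value while parsing.
sample = pd.read_csv(io.StringIO(csv_text), na_values=['None'])

# Count the missing values per column.
print(sample.isna().sum())
```

Because the missing cell is parsed as NaN, `SystolicBPNBR` stays a numeric column, which is what the imputation step later relies on.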

Let's glance at the first few records we loaded.

In [3]:
df.head()

   PatientEncounterID  PatientID  SystolicBPNBR  LDLNBR  A1CNBR  GenderFLG  ThirtyDayReadmitFLG  InTestWindowFLG
0                   1      10001          167.0   195.0     4.2          M                    N                N
1                   2      10001          153.0   214.0     5.0          M                    N                N
2                   3      10001          170.0   191.0     4.0          M                    N                N
3                   4      10002          187.0   135.0     4.4          M                    N                N
4                   5      10002          188.0   125.0     4.3          M                    N                N


OK, that looks good. What are our column data types?

In [4]:
df.dtypes

PatientEncounterID       int64
PatientID                int64
SystolicBPNBR          float64
LDLNBR                 float64
A1CNBR                 float64
GenderFLG               object
ThirtyDayReadmitFLG     object
InTestWindowFLG         object
dtype: object

Looks pretty good, but let's say we had to cast a column to a **factor** (pandas dtype **object**), which might happen if a factor column is coded as 0, 1, 2, etc. and was read in as an **int**. We'll also change an **int** column to a **float**.

This is how:

*Please note that in this example we are changing an integer ID to a float ID, which doesn't make any sense practically, but is used to illustrate the process.*

In [5]:
df['GenderFLG'] = df['GenderFLG'].astype(object)  # cast to factor (already object here, so nothing changes)
df['PatientEncounterID'] = df['PatientEncounterID'].astype(float)  # int to float
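
The int-to-factor case described above can be sketched on a toy frame. The `RiskLevelCD` column is hypothetical and does not exist in the diabetes dataset; it stands in for any category that happens to be coded as 0, 1, 2.

```python
import pandas as pd

toy = pd.DataFrame({
    'PatientEncounterID': [1, 2, 3],
    'RiskLevelCD': [0, 1, 2],  # hypothetical category stored as integers
})

# Cast the integer-coded category to object so it is treated as a factor.
toy['RiskLevelCD'] = toy['RiskLevelCD'].astype(object)

# Cast the integer ID to float (illustrative only, as noted above).
toy['PatientEncounterID'] = toy['PatientEncounterID'].astype(float)

print(toy.dtypes)
```

Checking `dtypes` again after the cast is a quick way to confirm the change took effect.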

Now that we've cleaned up the data, let's do some preprocessing, split the data into train and test sets, and store the results in an object.

In [7]:
import random
random.seed(43) # <-- used to make results reproducible
o = DevelopSupervisedModel(modeltype='classification',
                           df=df,
                           predictedcol='ThirtyDayReadmitFLG',
                           graincol='',  # optional, but encouraged
                           impute=True,
                           debug=False)
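
For a rough idea of what imputation and a train/test split involve, here is a sketch in plain pandas. It is not healthcareai's actual implementation: the toy data, the `impute` helper, the mean/most-common fill strategy, and the 80/20 split are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ten-row sample with a few missing values.
toy = pd.DataFrame({
    'SystolicBPNBR': [167.0, None, 170.0, 187.0, 188.0,
                      150.0, None, 160.0, 155.0, 172.0],
    'GenderFLG': ['M', 'F', None, 'M', 'F', 'F', 'M', 'M', 'F', 'M'],
    'ThirtyDayReadmitFLG': ['N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y', 'N', 'N'],
})


def impute(frame):
    """Fill numeric gaps with the column mean, other gaps with the most common value."""
    filled = frame.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


clean = impute(toy)

# Hold out 20% of the rows as the test set; the seed makes the split repeatable.
train = clean.sample(frac=0.8, random_state=43)
test = clean.drop(train.index)

print(len(train), len(test))
```

The held-out `test` rows play the role of the test set from the overview: the model never sees them during training, so they give an honest accuracy check.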

Now that we've arranged the data and done imputation, let's create a logistic regression model and see how accurate it is.

In [35]:
o.linear(cores=1,
         debug=False)


 LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
Best hyper-parameters found after tuning:
No hyper-parameter tuning was done.

AUC Score: 0.858630952381 



An AUC above 0.8 is generally considered fairly predictive, so the linear model did well. (Note that the output above also lists the model's settings.)
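
For intuition about what the AUC score measures: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it by brute force over all positive/negative pairs. The `auc_score` helper and the four-row example are illustrative, not part of healthcareai.

```python
def auc_score(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]            # 1 = readmitted, 0 = not readmitted
scores = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities

print(auc_score(labels, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why values above 0.8 are encouraging.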

While we've already done well, let's see how well a random forest does:

In [36]:
o.random_forest(cores=1,
                debug=False)


 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Best hyper-parameters found after tuning:
No hyper-parameter tuning was done.

AUC Score: 0.902529761905 

Variable importance:
1. OrganizationLevel (0.488065)
2. VacationHours (0.239089)
3. SickLeaveHours (0.210164)
4. Gender.M (0.032773)
5. MaritalStatus.S (0.029909)


Interesting: random forest does even better, with an AUC of 0.90. This means we'll choose the random forest model for nightly predictions. Random forest also gives us some guidance as to which variables are most important. As a rule of thumb, features that contribute below 0.1 in the variable importance list can usually be left out of the deploy step (see the next example).
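
The 0.1 rule of thumb can be applied mechanically. The sketch below copies the importance values printed above into a dictionary by hand (an assumption for illustration; it does not read them from the model object) and splits the features at the threshold.

```python
# Importance values copied from the random forest output above.
importances = {
    'OrganizationLevel': 0.488065,
    'VacationHours': 0.239089,
    'SickLeaveHours': 0.210164,
    'Gender.M': 0.032773,
    'MaritalStatus.S': 0.029909,
}

threshold = 0.1  # rule-of-thumb cutoff from the text

kept = [name for name, score in importances.items() if score >= threshold]
dropped = [name for name, score in importances.items() if score < threshold]

print('keep:', kept)
print('drop:', dropped)
```

It is worth re-checking the AUC after dropping features, since a low importance score does not guarantee that removing a feature leaves accuracy unchanged.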

## Feedback? Questions?

Reach out to Levi Thatcher (levi.thatcher@healthcatalyst.com) if you have any questions!