# LR_BAS: Logistic Regression Baseline Model Training

Our baseline model replicates the one built in this paper:
Dugas et al., "An Electronic Emergency Triage System to Improve Patient Distribution by Critical Outcomes",
http://dx.doi.org/10.1016/j.jemermed.2016.02.026, which uses the CDC NHAMCS 2009 file. We replicated its features and matched the reported ROC AUC, which confirmed that we are using the correct fields from the CDC files.

Vital signs are categorized by known danger thresholds and normalized because they are measured on different scales. We also grouped the reasons for ED visit (RFV) into categories that flag complaints known to be dangerous or high-priority. Categorizations and thresholds were taken from our baseline paper.
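The thresholding step can be sketched as follows. This is a minimal illustration, not the project's `vital_signs_features` code: the cut-off values are placeholder assumptions rather than the thresholds from Dugas et al., and `pulse_flags` is a hypothetical helper. Only the flag names are taken from the predictor columns listed later in this notebook.

```python
import pandas as pd

# Illustrative cut-offs only (assumptions); the real thresholds come from the baseline paper.
PULSE_BANDS = {
    "Bradycardia": (0, 60),                  # 0 <= pulse < 60
    "Mild_Tachycardia": (100, 110),          # 100 <= pulse < 110
    "Moderate_Tachycardia": (110, 130),      # 110 <= pulse < 130
    "Severe_Tachycardia": (130, float("inf")),
}

def pulse_flags(pulse):
    """Return one 0/1 indicator column per danger band for a pulse Series."""
    flags = pd.DataFrame(index=pulse.index)
    for name, (low, high) in PULSE_BANDS.items():
        flags[name] = ((pulse >= low) & (pulse < high)).astype(int)
    return flags

toy = pd.Series([55, 90, 105, 120, 140], name="PULSE")
flags = pulse_flags(toy)
print(flags)
```

A pulse inside the normal range sets no flag, so the all-zero row acts as the reference category for the regression.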

To calculate the Critical Outcome variable, we collapsed several patient outcomes into one binary variable: any patient with an outcome of death, admission to the ICU or OR, or cardiac catheterization was considered to have experienced a critical outcome. Critical outcomes represented about 7.1% of the total dataset, an imbalance we accounted for with weighted modeling techniques.
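A minimal sketch of how such a label could be derived is shown below. The `ADMIT` code values are an assumption about the NHAMCS 2009 coding and should be checked against the codebook; `critical_outcome` is a hypothetical helper, not the function in `build_features`.

```python
import pandas as pd

# Assumed NHAMCS 2009 coding (verify in the codebook): ADMIT 1 = critical care unit,
# 3 = operating room, 5 = cardiac catheterization lab; DIEDED 1 = died in the ED.
CRITICAL_ADMIT_CODES = [1, 3, 5]

def critical_outcome(df):
    """Collapse death, ICU, OR and cath-lab admissions into one 0/1 label."""
    died = df["DIEDED"] == 1
    critical_admit = df["ADMIT"].isin(CRITICAL_ADMIT_CODES)
    return (died | critical_admit).astype(int)

toy = pd.DataFrame({
    "DIEDED": [0, 1, 0, 0, 0],
    "ADMIT": [-7, -7, 1, 6, 5],  # -7 marks "not applicable" (patient not admitted)
})
toy["Critical_Outcome"] = critical_outcome(toy)
print(toy)
print("Positive rate: %.3f" % toy["Critical_Outcome"].mean())
```

On the real file, the mean of this column is the positive rate that motivates the class weighting used during training.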

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import sys
import json

sys.path.append("../../src/models/train_model")
import LR_model
sys.path.append("../../src/features")
import build_features, vital_signs_features, age_features, RFV_features

%matplotlib inline


## Model Training

In [16]:
with open('../../fileConfig.json') as config_file:
    fileConfig = json.load(config_file)

In [19]:
#Training model via LR_model.py method
reload(LR_model)
LR_model.LR_BAS_model_training(fileConfig,'ED_TOTAL_2009_2009.csv')

ROC_AUC = 0.8264 
ROC_AUC = 0.8239 
ROC_AUC = 0.8252 
ROC_AUC = 0.8719 
ROC_AUC = 0.8562 
ROC_AUC = 0.8234 
ROC_AUC = 0.8415 
ROC_AUC = 0.8354 
ROC_AUC = 0.8209 
ROC_AUC = 0.8110 
ROC AUC: 0.8336 (+/- 0.02)


## Model Training, step by step

### Reading CDC File

In [4]:
pd.options.mode.chained_assignment = None  # default='warn'

In [6]:
# Read the processed CDC file
processedDirectory = fileConfig['dataDirectory'] + fileConfig['processedDirectory']
cdc_input = pd.read_csv(processedDirectory + 'ED_TOTAL_2009_2009.csv')
print 'Full shape: ', cdc_input.shape

Full shape:  (24321, 93)


In [4]:
list(cdc_input)

['VYEAR',
 'VMONTH',
 'VDAYR',
 'AGE',
 'ARRTIME',
 'WAITTIME',
 'LOV',
 'RESIDNCE',
 'SEX',
 'ETHUN',
 'RACEUN',
 'ARREMS',
 'TEMPF',
 'PULSE',
 'RESPR',
 'BPSYS',
 'BPDIAS',
 'POPCT',
 'ONO2',
 'GCS',
 'IMMEDR',
 'PAIN',
 'SEEN72',
 'RFV1',
 'RFV2',
 'RFV3',
 'EPISODE',
 'INJURY',
 'CHF',
 'DIABETES',
 'DIAGSCRN',
 'CBC',
 'BUNCREAT',
 'CARDENZ',
 'ELECTROL',
 'GLUCOSE',
 'LFT',
 'ABG',
 'PTTINR',
 'BLOODCX',
 'BAC',
 'OTHERBLD',
 'CARDMON',
 'EKG',
 'HIVTEST',
 'FLUTEST',
 'PREGTEST',
 'TOXSCREN',
 'URINE',
 'WOUNDCX',
 'OTHRTEST',
 'ANYIMAGE',
 'XRAY',
 'CATSCAN',
 'CTHEAD',
 'CTNHEAD',
 'CTNUNK',
 'MRI',
 'ULTRASND',
 'OTHIMAGE',
 'TOTDIAG',
 'PROC',
 'IVFLUIDS',
 'CAST',
 'SPLINT',
 'SUTURE',
 'INCDRAIN',
 'FBREM',
 'NEBUTHER',
 'BLADCATH',
 'PELVIC',
 'CENTLINE',
 'CPR',
 'ENDOINT',
 'OTHPROC',
 'TOTPROC',
 'LEFTBMSE',
 'LEFTAMSE',
 'LEFTAMA',
 'DOA',
 'DIEDED',
 'TRANPSYC',
 'TRANOTH',
 'ADMITHOS',
 'OBSHOS',
 'OBSDIS',
 'OTHDISP',
 'ADMIT',
 'HDSTAT',
 'BDATEFL',
 'IMMEDRFL',
 'REGION',
 'MSA']


In [6]:
#sample of how the records look 
cdc_input[0:5]

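| | VYEAR | VMONTH | VDAYR | AGE | ARRTIME | WAITTIME | LOV | RESIDNCE | SEX | ETHUN | ... | ADMITHOS | OBSHOS | OBSDIS | OTHDISP | ADMIT | HDSTAT | BDATEFL | IMMEDRFL | REGION | MSA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | 7 | 6 | 40 | 1904 | 5 | 86 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | -7 | -7 | 0 | 0 | 2 | 1 |
| 1 | 2009 | 7 | 6 | 76 | 1034 | 0 | 86 | 1 | 2 | 2 | ... | 0 | 0 | 0 | 0 | -7 | -7 | 0 | 1 | 2 | 1 |
| 2 | 2009 | 7 | 5 | 27 | 25 | 63 | 190 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | -7 | -7 | 0 | 1 | 2 | 1 |
| 3 | 2009 | 7 | 7 | 48 | 917 | 3 | 268 | 1 | 2 | 2 | ... | 0 | 0 | 0 | 0 | -7 | -7 | 0 | 1 | 2 | 1 |
| 4 | 2009 | 7 | 1 | 89 | 2001 | 99 | 234 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | -7 | -7 | 0 | 1 | 2 | 1 |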


### Feature Engineering

In [7]:
# Feature engineering for the baseline model:
# (1) gets vital signs and fills missing values
# (2) creates categories for vital signs and RFV (reason for visit)
# (3) creates age categories
# (4) creates sex indicators
# (5) creates arrival-mode indicators
predictors, target = build_features.get_baseline_features(cdc_input)
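Step (3) can be illustrated with a short sketch. The band edges are inferred from the indicator names that appear in the predictor list (`Age_18_30` through `Age_81_Above`); `age_indicators` is a hypothetical helper and may differ from what `age_features` actually does.

```python
import numpy as np
import pandas as pd

# Band edges inferred from the indicator names (an assumption about the real bins).
AGE_EDGES = [17, 30, 40, 50, 60, 70, 80, np.inf]
AGE_LABELS = ["Age_18_30", "Age_31_40", "Age_41_50", "Age_51_60",
              "Age_61_70", "Age_71_80", "Age_81_Above"]

def age_indicators(age):
    """One-hot encode integer ages into the decade bands listed above."""
    # Intervals are right-closed: (17, 30], (30, 40], ..., (80, inf)
    bands = pd.cut(age, bins=AGE_EDGES, labels=AGE_LABELS)
    return pd.get_dummies(bands).astype(int)

toy = pd.Series([40, 76, 27, 48, 89], name="AGE")  # ages from the sample rows above
age_flags = age_indicators(toy)
print(age_flags)
```

Because `pd.cut` returns a categorical, `pd.get_dummies` emits a column for every band even when no record falls into it, which keeps the feature matrix shape stable across files.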

In [8]:
#sample of how records look after feature engineering
predictors[0:5]

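| | Temp_Baseline | Pulse_Baseline | Sys_BP_Baseline | Resp_Rate_Baseline | Oxygen_Sat_Baseline | Reason_Chest_Pain | Reason_Abdominal_Pain | Reason_Headache | Reason_Shortness_of_Breath | Reason_Back_Pain | ... | Age_41_50 | Age_51_60 | Age_61_70 | Age_71_80 | Age_81_Above | Male_Flag | Female_Flag | Ambulance_Arrival | Other_Arrival | Unknown_Arrival |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 991.0 | 90.0 | 129.0 | 16.0 | 16.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 975.0 | 71.0 | 167.0 | 16.0 | 16.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 984.0 | 89.0 | 118.0 | 20.0 | 20.0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 980.0 | 87.0 | 136.0 | 18.0 | 18.0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 980.0 | 86.0 | 180.0 | 20.0 | 20.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |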


In [9]:
list (predictors)

['Temp_Baseline',
 'Pulse_Baseline',
 'Sys_BP_Baseline',
 'Resp_Rate_Baseline',
 'Oxygen_Sat_Baseline',
 'Reason_Chest_Pain',
 'Reason_Abdominal_Pain',
 'Reason_Headache',
 'Reason_Shortness_of_Breath',
 'Reason_Back_Pain',
 'Reason_Cough',
 'Reason_Nausea_Vomiting',
 'Reason_Fever_Chills',
 'Reason_Syncope',
 'Reason_Dizziness',
 'Reason_Psychiatric_Complaint',
 'Reason_Nervous_System',
 'Reason_Cardiovascular_Other',
 'Reason_Ears_Eyes_Complaint',
 'Reason_Respiratory_Other',
 'Reason_Gastrointestinal_Other',
 'Reason_Genitourinary_Other',
 'Reason_Skin_Hair_Nails_Complaint',
 'Reason_Musculoskeletal_Other',
 'Reason_Injury_Poisoning',
 'Reason_Other',
 'Hypothermia',
 'Hyperthermia',
 'Bradycardia',
 'Mild_Tachycardia',
 'Moderate_Tachycardia',
 'Severe_Tachycardia',
 'Hypotension',
 'Hypertension',
 'Bradypnea',
 'Moderate_Tachypnea',
 'Severe_Tachypnea',
 'Mild_Hypoxia',
 'Severe_Hypoxia',
 'Age_18_30',
 'Age_31_40',
 'Age_41_50',
 'Age_51_60',
 'Age_61_70',
 'Age_71_80',
 'Age_81_Above',
 'Male_Flag',
 'Female_Flag',
 'Ambulance_Arrival',
 'Other_Arrival',
 'Unknown_Arrival']

### Training Logistic Regression (LR) Model

In [6]:
# Split the data into two sets: one for training and one for development (to tune hyper-parameters)
X_train, X_dev, y_train, y_dev = train_test_split(predictors, target, test_size=0.1)

In [7]:
# Train the logistic regression model with L2 regularization,
# using class_weight='balanced' because only about 7% of the binary labels are positive
LR_model.LR_modeling(X_train, y_train, X_dev, y_dev)

C= 0.0000, ROC_AUC = 0.6573 
C= 0.0001, ROC_AUC = 0.7500 
C= 0.0010, ROC_AUC = 0.8257 
C= 0.0050, ROC_AUC = 0.8406 
C= 0.0100, ROC_AUC = 0.8439 
C= 0.0500, ROC_AUC = 0.8476 
C= 0.1000, ROC_AUC = 0.8484 
C= 0.5000, ROC_AUC = 0.8490 
C= 1.0000, ROC_AUC = 0.8490 
C= 2.0000, ROC_AUC = 0.8491 
C= 4.0000, ROC_AUC = 0.8491 
C= 6.0000, ROC_AUC = 0.8491 
C= 8.0000, ROC_AUC = 0.8491 
C= 10.0000, ROC_AUC = 0.8491 
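
A sweep like the one above could look roughly like the sketch below. The body of `LR_model.LR_modeling` is not shown in this notebook, so this is an assumption about its general shape rather than the project's code. It runs on synthetic data with about 7% positives, and the stratified split is an addition that the notebook's own `train_test_split` call does not use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors/target pair, with roughly 7% positives.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=8,
                           n_clusters_per_class=1, weights=[0.93, 0.07],
                           random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

results = {}
for C in [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]:
    # L2 is LogisticRegression's default penalty; a smaller C means stronger regularization.
    model = LogisticRegression(C=C, class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    dev_scores = model.predict_proba(X_dev)[:, 1]  # probability of the positive class
    results[C] = roc_auc_score(y_dev, dev_scores)
    print("C= %.4f, ROC_AUC = %.4f" % (C, results[C]))

best_C = max(results, key=results.get)
print("Best C on the dev set: %s" % best_C)
```

With `class_weight="balanced"`, each class is weighted inversely to its frequency, so the rare critical outcomes contribute as much to the loss as the far more numerous non-critical visits.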


In [8]:
reload(LR_model)
# 10-fold cross-validation
LR_model.cross_LR_Validation(predictors, target, c=1.0)

ROC_AUC = 0.8264 
ROC_AUC = 0.8239 
ROC_AUC = 0.8252 
ROC_AUC = 0.8719 
ROC_AUC = 0.8562 
ROC_AUC = 0.8234 
ROC_AUC = 0.8415 
ROC_AUC = 0.8354 
ROC_AUC = 0.8209 
ROC_AUC = 0.8110 
ROC AUC: 0.8336 (+/- 0.02)
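
The ten fold scores above average to 0.8336 with a standard deviation of about 0.017, which is consistent with the summary line if the reported spread is one standard deviation rounded to two decimals. A comparable 10-fold evaluation can be sketched with scikit-learn's `cross_val_score`. The internals of `LR_model.cross_LR_Validation` are not shown here, so the stratified, shuffled folds and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the predictors/target pair, with roughly 7% positives.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           n_clusters_per_class=1, weights=[0.93, 0.07],
                           random_state=0)

model = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_aucs = cross_val_score(model, X, y, cv=folds, scoring="roc_auc")

for auc in fold_aucs:
    print("ROC_AUC = %.4f" % auc)
print("ROC AUC: %.4f (+/- %.2f)" % (fold_aucs.mean(), fold_aucs.std()))
```

Stratified folds keep the positive rate similar in every fold, which matters when only a small share of records carry the positive label.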
