In [153]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [154]:
import pandas as pd
from os import listdir
from os.path import isfile, join
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm   

# Summary

The preliminary analysis and modeling report below uses United States and Canada's healthcare data on a hospital level in order to predict mortality-related features and readmission features. The United States datasets required more data clean-up and preprocessing than the Canadian dataset, however both datasets had high rates of missing values. Missing values were addressed using iterative imputation; categorical variables were addressed through encoding into dummy variables. For both datasets, both classification and regression models were attempted. Further exploration has to be done with feature selection, as there is unaddressed multi-colinearity and some features have very high variance inflation factors. In addition, ensemble classification methods and penalized regression methods may improve model performance once the multi-colinearity issue is resolved. 

# Data Collection

## United States Healthcare Data

United States hospital data is available through Medicare Hospital Compare, which provides contextual and performance datasets on hospitals. A fill list of datasets is shown below, however only a portion of them will be used.

In [155]:
us_data_dir = r"C:\Users\mkive\Documents\GitHub\Research Project\US Hospital Data"
df_list = [f.strip('\.csv') for f in listdir(us_data_dir) if isfile(join(us_data_dir, f)) and f.endswith('.csv')]
d = {name: pd.read_csv(us_data_dir + '\\' + name + '.csv', encoding='cp1252', low_memory=False) for name in df_list if 'Hospital' in name}
list(d.keys())

['Complications and Deaths - Hospital',
 'HCAHPS - Hospital',
 'Healthcare Associated Infections - Hospital',
 'Hospital General Information',
 'HOSPITAL_ANNUAL_QUALITYMEASURE_PCH_OCM_Hospital',
 'Medicare Hospital Spending by Claim',
 'Medicare Hospital Spending Per Patient - Hospital',
 'Medicare Hospital Spending Per Patient - National',
 'Medicare Hospital Spending Per Patient - State',
 'Outpatient Imaging Efficiency - Hospital',
 'Payment and Value of Care - Hospital',
 'Structural Measures - Hospital',
 'Timely and Effective Care - Hospital',
 'Unplanned Hospital Visits - Hospital',
 'Unplanned Hospital Visits - National',
 'Unplanned Hospital Visits - State']

### Exploratory Data Analysis

#### Hospital General Information

The hospital general information dataset contains a list of 5,320 hospitals and any available hospital information and metric comparisons.

Most hospitals are either Acute Care or Critical Access Hospitals.

In [156]:
d['Hospital General Information']['Hospital Type'].value_counts()

Acute Care Hospitals                  3263
Critical Access Hospitals             1354
Psychiatric                            573
Childrens                               95
Acute Care - Department of Defense      35
Name: Hospital Type, dtype: int64

The following table shows the distribution of Hospital Ratings (1-5) by hospital type. Hospital rating data is missing entirely for Department of Defence Hospitals, Childrens Hospitals, and Psychiatric Hospitals. Critical Access Hospitals are missing overall rating over half the time. Missing values will have to be addressed. Data is most likely missing not at random (MNAR).

In [157]:
pd.crosstab(index=d['Hospital General Information']['Hospital Type'], 
            columns=d['Hospital General Information']['Hospital overall rating']).apply(lambda r: round(r/r.sum(),2), axis=1)

Hospital overall rating,1,2,3,4,5,Not Available
Hospital Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Acute Care - Department of Defense,0.0,0.0,0.0,0.0,0.0,1.0
Acute Care Hospitals,0.07,0.2,0.27,0.25,0.11,0.1
Childrens,0.0,0.0,0.0,0.0,0.0,1.0
Critical Access Hospitals,0.0,0.03,0.17,0.24,0.04,0.52
Psychiatric,0.0,0.0,0.0,0.0,0.0,1.0


Missing data prevalence also varries by hospital ownership type.

In [158]:
pd.crosstab(index=d['Hospital General Information']['Hospital Ownership'], 
            columns=d['Hospital General Information']['Hospital overall rating']).apply(lambda r: round(r/r.sum(),2), axis=1)

Hospital overall rating,1,2,3,4,5,Not Available
Hospital Ownership,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Department of Defense,0.0,0.0,0.0,0.0,0.0,1.0
Government - Federal,0.02,0.11,0.09,0.06,0.0,0.72
Government - Hospital District or Authority,0.03,0.12,0.26,0.19,0.04,0.36
Government - Local,0.05,0.11,0.18,0.21,0.03,0.42
Government - State,0.03,0.03,0.07,0.04,0.01,0.81
Physician,0.04,0.07,0.03,0.05,0.15,0.66
Proprietary,0.05,0.18,0.17,0.12,0.04,0.44
Tribal,0.0,0.0,0.22,0.0,0.0,0.78
Voluntary non-profit - Church,0.04,0.12,0.26,0.33,0.13,0.12
Voluntary non-profit - Other,0.03,0.16,0.24,0.26,0.1,0.2


In addition to overall rating, the Hospital General Information dataset also provides comparisons on mortality, safety of care, readmission, patient experience, effectiveness of care, timeliness, and efficient use of medical imaging. Below is a summary of the class sizes for each of these comparisons. For all of these features, the classes are imbalanced and are missing data. For all features, either 'Not Available' or 'Same as the national average' are the majority class. 

In [159]:
"""cols = 7
i = 0
fig, axes = plt.subplots(1, cols, figsize=(12, 8))
"""
for col in ['Mortality national comparison',
           'Safety of care national comparison',
           'Readmission national comparison',
           'Patient experience national comparison',
           'Effectiveness of care national comparison',
           'Timeliness of care national comparison',
           'Efficient use of medical imaging national comparison']:
    
    data = d['Hospital General Information'].groupby(col).size()
    print(data)
    """
    left = [k[0] for k in enumerate(data)]
    right = [k[1] for k in enumerate(data)]
    
    axes[i].bar(left,right,label="%s" % (col.replace(" national comparison", "")), 
                color=['blue', 'purple', 'green', 'red'])
    axes[i].set_xticks(left, minor=False)
    axes[i].set_xticklabels([])
    
    axes[i].grid(True)
    i = i + 1
    
    
fig.suptitle('Employment By Industry By Comparison Metric', fontsize=20)
fig.legend(loc='upper right')
fig.tight_layout()

"""

Mortality national comparison
Above the national average       381
Below the national average       346
Not Available                   1977
Same as the national average    2616
dtype: int64
Safety of care national comparison
Above the national average      1218
Below the national average       848
Not Available                   2711
Same as the national average     543
dtype: int64
Readmission national comparison
Above the national average      1451
Below the national average      1303
Not Available                   1589
Same as the national average     977
dtype: int64
Patient experience national comparison
Above the national average      1157
Below the national average      1078
Not Available                   1962
Same as the national average    1123
dtype: int64
Effectiveness of care national comparison
Above the national average       103
Below the national average       269
Not Available                   2038
Same as the national average    2910
dtype: int64
Timeliness of car

#### Complications and Deaths

Apart from the hospital general information, all of the datasets that will be used follow a long format, with a row item for each measure and value. Looking at complications and death data reveals a percentage of missing data and class imbalance. 'Better than the national' and 'worse than the national' never occur in more than 5% of hospitals for all categories. "No different than the national" and "not available" are the majority classes.

In [160]:
pd.crosstab(index=d['Complications and Deaths - Hospital']['Measure Name'], 
            columns=d['Complications and Deaths - Hospital']['Compared to National'].str.replace("Rate|Value", "")).apply(lambda r: round(r/r.sum(),2), axis=1)

Compared to National,Better Than the National,No Different Than the National,Not Available,Number of Cases Too Small,Worse Than the National
Measure Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A wound that splits open after surgery on the abdomen or pelvis,0.0,0.57,0.37,0.06,0.0
Accidental cuts and tears from medical treatment,0.0,0.6,0.36,0.03,0.01
Blood stream infection after surgery,0.0,0.55,0.39,0.04,0.01
Broken hip from a fall after surgery,0.0,0.66,0.33,0.0,0.0
Collapsed lung due to medical treatment,0.0,0.66,0.33,0.0,0.0
Death rate for CABG surgery patients,0.0,0.2,0.77,0.03,0.0
Death rate for COPD patients,0.01,0.71,0.09,0.17,0.02
Death rate for heart attack patients,0.01,0.46,0.16,0.37,0.0
Death rate for heart failure patients,0.05,0.65,0.09,0.18,0.03
Death rate for pneumonia patients,0.05,0.73,0.08,0.08,0.06


#### Timely and Effective Care

The timely and effective care dataset contains both numeric and categorical variables, so I will split the dataset for the purpose of data cleanup and EDA.

In [161]:
pd.crosstab(index=d['Timely and Effective Care - Hospital'][d['Timely and Effective Care - Hospital']['Measure ID'] == 'EDV']['Measure Name'], 
            columns=d['Timely and Effective Care - Hospital'][d['Timely and Effective Care - Hospital']['Measure ID'] == 'EDV']['Score'].str.lower()).apply(lambda r: round(r/r.sum(),2), axis=1)

Score,high,low,medium,not available,very high
Measure Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Emergency department volume,0.13,0.32,0.2,0.2,0.14


In [162]:
d['Timely and Effective Care - Hospital']['Numeric Score'] = pd.to_numeric(d['Timely and Effective Care - Hospital']['Score'], errors = 'coerce')
d['Timely and Effective Care - Hospital'][d['Timely and Effective Care - Hospital']['Measure ID'] != 'EDV'][['Condition', 'Measure Name', 'Numeric Score']].groupby(['Condition', 'Measure Name']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Numeric Score
Condition,Measure Name,Unnamed: 2_level_1
Cancer care,External Beam Radiotherapy for Bone Metastases,88.833939
Cataract surgery outcome,Improvement in Patient's Visual Function within 90 Days Following Cataract Surgery,98.368421
Colonoscopy care,Endoscopy/polyp surveillance: appropriate follow-up interval for normal colonoscopy in average risk patients,88.891327
Colonoscopy care,Endoscopy/polyp surveillance: colonoscopy interval for patients with a history of adenomatous polyps - avoidance of inappropriate use,92.222887
Emergency Department,Average (median) time patients spent in the emergency department before leaving from the visit A lower number of minutes is better,141.459658
Emergency Department,Average (median) time patients spent in the emergency department before leaving from the visit- Psychiatric/Mental Health Patients. A lower number of minutes is better,253.123217
Emergency Department,"Average (median) time patients spent in the emergency department, after the doctor decided to admit them as an inpatient before leaving the emergency department for their inpatient room A lower number of minutes is better",99.285255
Emergency Department,Head CT results,73.668522
Emergency Department,Left before being seen,1.456539
Heart Attack or Chest Pain,Fibrinolytic Therapy Received Within 30 Minutes of ED Arrival,67.127907


#### Medicare Hospital Spending by Claim

Medicare hospital spending by clain data separates spending into period and claim type. The following show mean spending by period and by claim type.  

In [163]:
pd.crosstab(index=d['Medicare Hospital Spending by Claim']['Period'], 
            columns='main',
            values=d['Medicare Hospital Spending by Claim']['Avg Spndg Per EP Hospital'],
            aggfunc=np.mean)

col_0,main
Period,Unnamed: 1_level_1
1 through 30 days After Discharge from Index Hospital Admission,1238.914925
1 to 3 days Prior to Index Hospital Admission,96.680645
Complete Episode,20059.366202
During Index Hospital Admission,1530.021686


In [164]:
pd.crosstab(index=d['Medicare Hospital Spending by Claim']['Claim Type'], 
            columns='main',
            values=d['Medicare Hospital Spending by Claim']['Avg Spndg Per EP Hospital'],
            aggfunc=np.mean)

col_0,main
Claim Type,Unnamed: 1_level_1
Carrier,963.277544
Durable Medical Equipment,37.261001
Home Health Agency,265.60666
Hospice,51.492702
Inpatient,3985.297978
Outpatient,302.9534
Skilled Nursing Facility,1080.550978
Total,20059.366202


#### Payment and Value of Care

In [165]:
pd.crosstab(index = d['Payment and Value of Care - Hospital']['Payment Category'], 
            columns = d['Payment and Value of Care - Hospital']['Payment Measure Name'], 
            values = pd.to_numeric(d['Payment and Value of Care - Hospital']['Payment'].str.replace('$','').str.replace(',',''), errors = 'coerce'),
            aggfunc = np.mean)

Payment Measure Name,Payment for heart attack patients,Payment for heart failure patients,Payment for hip/knee replacement patients,Payment for pneumonia patients
Payment Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Greater Than the National Average Payment,28875.331492,19853.680451,24194.906355,21065.239735
Less Than the National Average Payment,22776.159509,15738.034568,18729.694995,15801.986842
No Different Than the National Average Payment,25632.755604,17530.496751,21042.781731,18286.705933


#### Unplanned Hospital Visits

In [166]:
pd.crosstab(index = d['Unplanned Hospital Visits - Hospital']['Measure Name'], 
            columns = d['Unplanned Hospital Visits - Hospital']['Compared to National'].str.lower().str.replace("than expected|than the national rate", "")).apply(lambda r: round(r/r.sum(),2), axis=1)

Compared to National,average days per 100 discharges,better,fewer days than average per 100 discharges,more days than average per 100 discharges,no different,not available,number of cases too small,worse
Measure Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Acute Myocardial Infarction (AMI) 30-Day Readmission Rate,0.0,0.0,0.0,0.0,0.42,0.19,0.38,0.0
Heart failure (HF) 30-Day Readmission Rate,0.0,0.02,0.0,0.0,0.69,0.09,0.17,0.03
Hospital return days for heart attack patients,0.28,0.0,0.05,0.1,0.0,0.19,0.38,0.0
Hospital return days for heart failure patients,0.49,0.0,0.09,0.16,0.0,0.09,0.17,0.0
Hospital return days for pneumonia patients,0.5,0.0,0.12,0.21,0.0,0.08,0.08,0.0
Pneumonia (PN) 30-Day Readmission Rate,0.0,0.01,0.0,0.0,0.8,0.08,0.08,0.03
Rate of emergency department (ED) visits for patients receiving outpatient chemotherapy,0.0,0.01,0.0,0.0,0.31,0.32,0.37,0.0
Rate of inpatient admissions for patients receiving outpatient chemotherapy,0.0,0.0,0.0,0.0,0.3,0.32,0.37,0.01
Rate of readmission after discharge from hospital (hospital-wide),0.0,0.04,0.0,0.0,0.81,0.05,0.03,0.07
Rate of readmission after hip/knee replacement,0.0,0.01,0.0,0.0,0.54,0.32,0.12,0.0


#### Structural Measures - Hospital

In [167]:
pd.crosstab(index = d['Structural Measures - Hospital']['Measure Name'], 
            columns = d['Structural Measures - Hospital']['Measure Response']).apply(lambda r: round(r/r.sum(),2), axis=1)

Measure Response,No,Not Available,Yes
Measure Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Able to receive lab results electronically,0.04,0.25,0.71
"Able to track patients' lab results, tests, and referrals electronically between visits",0.05,0.25,0.7


### Pre-Processing

In order to get the data ready for modeling, the following steps will be taken:

1. Data Wrangling: Convert datasets from long to wide format so that each facility ID has a unique row
2. Merge: Join all the datasets into one using Facility ID
3. Encode categorical data into dummy variables
4. Address Missing Data: Drop rows that are null in the target variable and majority null columns/rows; impute the rest

In [168]:
#data wrangling
# Hospital General Information Cleanup
d['Hospital General Information']['Facility ID'] = d['Hospital General Information']['Facility ID'].astype('object')
d['Hospital General Information'] = d['Hospital General Information'][['Facility ID', 'Facility Name', 'Hospital Type', 
                                   'Hospital Ownership', 'Emergency Services',
                                   'Meets criteria for promoting interoperability of EHRs', 'Hospital overall rating', 
                                   'Mortality national comparison',
                                   'Safety of care national comparison',
                                   'Readmission national comparison',
                                   'Patient experience national comparison',
                                   'Effectiveness of care national comparison',
                                   'Timeliness of care national comparison',
                                   'Efficient use of medical imaging national comparison']]



# Hospital Surveys
d['HCAHPS - Hospital'] = (d['HCAHPS - Hospital'][d['HCAHPS - Hospital']['HCAHPS Linear Mean Value'].str.contains('Not') == False][['Facility ID', 'HCAHPS Question', 'HCAHPS Linear Mean Value']]
                          .set_index('Facility ID')
                          .pivot(columns = 'HCAHPS Question', values = 'HCAHPS Linear Mean Value')
                          .reset_index())


# Hospital Complications and Deaths Cleanup
d['Complications and Deaths - Hospital'] = (d['Complications and Deaths - Hospital'][['Facility ID', 'Measure Name', 'Score']]
                                            .replace('Not Available', np.nan)
                                            .set_index('Facility ID')
                                            .pivot(columns = 'Measure Name', values = 'Score')
                                            .reset_index())



# Hospital Healthcare Associated Infections Cleanup
d['Healthcare Associated Infections - Hospital'] = (d['Healthcare Associated Infections - Hospital'][['Facility ID', 'Measure Name', 'Score']]
                                                    .replace('Not Available', np.nan)
                                                    .set_index('Facility ID')
                                                    .pivot(columns = 'Measure Name', values = 'Score')
                                                    .reset_index())[['Facility ID',
       'Catheter Associated Urinary Tract Infections (ICU + select Wards)',
       'Catheter Associated Urinary Tract Infections (ICU + select Wards): Number of Urinary Catheter Days',
       'Central Line Associated Bloodstream Infection (ICU + select Wards)',
       'Central Line Associated Bloodstream Infection: Number of Device Days',
       'Clostridium Difficile (C.Diff)',
       'Clostridium Difficile (C.Diff): Patient Days',
       'MRSA Bacteremia', 'MRSA Bacteremia: Patient Days',
       'SSI - Abdominal Hysterectomy',
       'SSI - Colon Surgery']]




# Medicare Hospital Spending by Claim Cleanup
d['Medicare Hospital Spending by Claim']['Period - Claim'] = d['Medicare Hospital Spending by Claim']['Period'].astype(str) + '_' + d['Medicare Hospital Spending by Claim']['Claim Type'].astype(str)
d['Medicare Hospital Spending by Claim'] = (d['Medicare Hospital Spending by Claim'][['Facility ID', 'Period - Claim', 
                                          'Percent of Spndg Hospital']]
                                           .replace('Not Available', np.nan)
                                           .pivot(index = 'Facility ID',
                                                  columns = 'Period - Claim', 
                                                  values = 'Percent of Spndg Hospital')
                                           .reset_index()
                                           .drop(columns = ['Complete Episode_Total']))
d['Medicare Hospital Spending by Claim']['Facility ID'] = d['Medicare Hospital Spending by Claim']['Facility ID'].astype('object')



  
# Hospital Timely and Effective Care Cleanup
d['Timely and Effective Care - Hospital - EDV'] = (d['Timely and Effective Care - Hospital'][d['Timely and Effective Care - Hospital']['Measure ID'] == 'EDV'][['Facility ID', 'Measure Name', 'Score']]
                                                   .pivot(index = 'Facility ID', columns = 'Measure Name', values = 'Score')
                                                   .reset_index())




d['Timely and Effective Care - Hospital']['Numeric Score'] = pd.to_numeric(d['Timely and Effective Care - Hospital']['Score'], errors = 'coerce')
d['Timely and Effective Care - Hospital'] = (d['Timely and Effective Care - Hospital'][d['Timely and Effective Care - Hospital']['Measure ID'] != 'EDV'][['Facility ID', 'Measure Name', 'Numeric Score']] 
                                             .replace('Not Available', np.nan)
                                             .pivot(index = 'Facility ID', columns = 'Measure Name', values = 'Numeric Score')
                                             .reset_index())


  
# Unplanned Hospital Visits Care Cleanup
d['Unplanned Hospital Visits - Hospital'] = (d['Unplanned Hospital Visits - Hospital'][['Facility ID', 'Measure Name', 'Score']]
                                             .replace('Not Available', np.nan)
                                             .pivot(index = 'Facility ID', columns = 'Measure Name', values = 'Score')
                                             .reset_index())



  

# Payment and Value of Care Cleanup
d['Payment and Value of Care - Hospital'] = (d['Payment and Value of Care - Hospital'][['Facility ID', 'Payment Measure Name', 'Payment']]
                                             .replace('Not Available', np.nan)
                                             .pivot(index = 'Facility ID', columns = 'Payment Measure Name', values = 'Payment')
                                             .reset_index())



d['Payment and Value of Care - Hospital']['Facility ID'] = d['Payment and Value of Care - Hospital']['Facility ID'].astype('object')


# Structural Measures Cleanup
d['Structural Measures - Hospital'] = (d['Structural Measures - Hospital'][['Facility ID', 'Measure Name', 'Measure Response']]
                                             .replace('Not Available', np.nan)
                                             .pivot(index = 'Facility ID', columns = 'Measure Name', values = 'Measure Response')
                                             .reset_index())
d['Structural Measures - Hospital']['Facility ID'] = d['Structural Measures - Hospital']['Facility ID'].astype('object')




#merge
us_df = (d['Hospital General Information']
 .merge(d['Complications and Deaths - Hospital'], how = 'left', on = 'Facility ID') 
 .merge(d['HCAHPS - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Healthcare Associated Infections - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Medicare Hospital Spending by Claim'], how = 'left', on = 'Facility ID')
 .merge(d['Payment and Value of Care - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Timely and Effective Care - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Timely and Effective Care - Hospital - EDV'], how = 'left', on = 'Facility ID')
 .merge(d['Unplanned Hospital Visits - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Structural Measures - Hospital'], how = 'left', on = 'Facility ID')
        )

### FOR CAATEGORICAL TARGET
target_var = 'Mortality national comparison'
us_df = us_df[us_df[target_var].isna() == False]
us_df[target_var] = us_df[target_var].astype('category')

# set index
us_df = us_df.set_index(['Facility ID', 'Facility Name'])

# drop columns that are more than half null
us_df = us_df.drop(columns = pd.DataFrame(us_df.isna().sum()).reset_index()[pd.DataFrame(us_df.isna().sum()).reset_index()[0] > us_df.shape[0] * .5]['index'])

# drop rows that are more than half null
us_df['null rate'] = us_df.isna().sum(axis=1) / len(us_df.columns)
us_df = us_df[us_df.isna().sum(axis=1) < len(us_df.columns) * 0.5]


us_df[[col for col in us_df.columns if any(us_df[col].astype(str).str.contains('[0-9]', regex=True))]] = us_df[[col for col in us_df.columns if any(us_df[col].astype(str).str.contains('[0-9]', regex=True))]].apply(pd.to_numeric, errors = 'coerce')
us_df[us_df.filter(regex = '([C|c]ases)|Number|Days|score').columns] = us_df.filter(regex = '([C|c]ases)|Number|Days|score').apply(pd.to_numeric, errors = 'coerce')

In [169]:
# make dummy variables

for col in us_df[pd.DataFrame(us_df.dtypes)[pd.DataFrame(us_df.dtypes)[0] == 'object'].reset_index()['index']].columns:
    us_df = pd.concat([us_df, pd.DataFrame(pd.get_dummies(us_df[col], prefix = col)).iloc[:, :-1]], axis = 1).drop(columns = col)
    
# impute missing data
imputer = IterativeImputer()
# fit on the dataset
imputer.fit(us_df.drop(columns = [target_var]))
# transform the dataset
us_df[us_df.drop(columns = [target_var]).columns] = imputer.transform(us_df[us_df.drop(columns = [target_var]).columns])



### Classification Model

For the first model, I am cross validating multiple classifiers in order to predict 'Mortality national comparison'. Random Forest produces the highest accuracy. I have removed several dependent variables, however further analysis on multicolinearity must be done. The accuracy of each classification model is shown below.

In [170]:
X = us_df.drop(columns = [target_var, 
                          'Hospital overall rating',
                          'Death rate for COPD patients',
                          'Death rate for heart failure patients',
                          'Death rate for pneumonia patients'])
y = us_df[target_var].astype('str')

seed = 42

# prepare models
models = []
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f" % (name, cv_results.mean())
    print(msg)

LDA: 0.735504
KNN: 0.646568
CART: 0.642333
NB: 0.457951
SVM: 0.689614
RF: 0.751640


Random forest produces the highest accuracy, and digging deeper on the confusion matrix shows the need for addressing class imbalance further. the f1-score is decent for the 'Not Available' and 'Same as the national average' classes, however the minority classes have extremely poor performance metrics. 

In [171]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0, stratify = y)
rf = RandomForestClassifier()
model = rf.fit(X_train, y_train)
pred=model.predict(X_test)
print(np.unique(pred, return_counts=True))
print(confusion_matrix(pred, y_test.values))
print(classification_report(y_test, pred, digits=3))

(array(['Above the national average', 'Below the national average',
       'Not Available', 'Same as the national average'], dtype=object), array([ 27,   3,  59, 794], dtype=int64))
[[ 12   1   0  14]
 [  1   1   0   1]
 [  0   0  57   2]
 [ 82  85  37 590]]
                              precision    recall  f1-score   support

  Above the national average      0.444     0.126     0.197        95
  Below the national average      0.333     0.011     0.022        87
               Not Available      0.966     0.606     0.745        94
Same as the national average      0.743     0.972     0.842       607

                    accuracy                          0.747       883
                   macro avg      0.622     0.429     0.452       883
                weighted avg      0.694     0.747     0.682       883



The list of feature importance ranking shows that complications occurance is highest on the list for mortality ranking classification. However, the classification results have to be improved before these results could be accepted. Multicollinearity and class imbalance have to be addressed further.

In [172]:
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(10):
    print("%d. feature %s (%f)" % (f + 1, X.columns[f], importances[indices[f]]))
"""
# Plot the impurity-based feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()"""

Feature ranking:
1. feature A wound that splits open after surgery on the abdomen or pelvis (0.060705)
2. feature Accidental cuts and tears from medical treatment (0.038875)
3. feature Blood stream infection after surgery (0.036048)
4. feature Broken hip from a fall after surgery (0.031396)
5. feature Collapsed lung due to medical treatment (0.030466)
6. feature Perioperative Hemorrhage or Hematoma Rate (0.029279)
7. feature Postoperative Acute Kidney Injury Requiring Dialysis Rate (0.027438)
8. feature Postoperative Respiratory Failure Rate (0.022994)
9. feature Pressure sores (0.022640)
10. feature Rate of complications for hip/knee replacement patients (0.021667)


'\n# Plot the impurity-based feature importances of the forest\nplt.figure()\nplt.title("Feature importances")\nplt.bar(range(X.shape[1]), importances[indices],\n        color="r", align="center")\nplt.xticks(range(X.shape[1]), indices)\nplt.xlim([-1, X.shape[1]])\nplt.show()'

### Regression Model

The following regression model was created in order to predict 'Death rate for pneumonia patients'. This field has the lowest rate of missing data among the numerical indicators of mortality rate. Further correlation analysis must be done, but for now the directly correlated predictors have been removed. The regression model uses backward elimination until the maximum p-value of predictors is less than 0.10. The results of the model are extemely poor, and additional feature engineering must be done. 

In [173]:
#merge
us_df = (d['Hospital General Information']
 .merge(d['Complications and Deaths - Hospital'], how = 'left', on = 'Facility ID') 
 .merge(d['HCAHPS - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Healthcare Associated Infections - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Medicare Hospital Spending by Claim'], how = 'left', on = 'Facility ID')
 .merge(d['Payment and Value of Care - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Timely and Effective Care - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Timely and Effective Care - Hospital - EDV'], how = 'left', on = 'Facility ID')
 .merge(d['Unplanned Hospital Visits - Hospital'], how = 'left', on = 'Facility ID')
 .merge(d['Structural Measures - Hospital'], how = 'left', on = 'Facility ID')
        )

### FOR NUMERIC TARGET
target_var = 'Death rate for pneumonia patients'
us_df = us_df[us_df[target_var].isna() == False]

# set index
us_df = us_df.set_index(['Facility ID', 'Facility Name'])

# drop columns that are more than half null
us_df = us_df.drop(columns = pd.DataFrame(us_df.isna().sum()).reset_index()[pd.DataFrame(us_df.isna().sum()).reset_index()[0] > us_df.shape[0] * .5]['index'])

# drop rows that are more than half null
us_df['null rate'] = us_df.isna().sum(axis=1) / len(us_df.columns)
us_df = us_df[us_df.isna().sum(axis=1) < len(us_df.columns) * 0.5]


us_df[[col for col in us_df.columns if any(us_df[col].astype(str).str.contains('[0-9]', regex=True))]] = us_df[[col for col in us_df.columns if any(us_df[col].astype(str).str.contains('[0-9]', regex=True))]].apply(pd.to_numeric, errors = 'coerce')
us_df[us_df.filter(regex = '([C|c]ases)|Number|Days|score').columns] = us_df.filter(regex = '([C|c]ases)|Number|Days|score').apply(pd.to_numeric, errors = 'coerce')

# make dummy variables
for col in us_df[pd.DataFrame(us_df.dtypes)[pd.DataFrame(us_df.dtypes)[0] == 'object'].reset_index()['index']].columns:
    us_df = pd.concat([us_df, pd.DataFrame(pd.get_dummies(us_df[col], prefix = col)).iloc[:, :-1]], axis = 1).drop(columns = col)
    
# impute missing data
imputer = IterativeImputer()
# fit on the dataset
imputer.fit(us_df.drop(columns = [target_var]))
# transform the dataset
us_df[us_df.drop(columns = [target_var]).columns] = imputer.transform(us_df[us_df.drop(columns = [target_var]).columns])


X = us_df.drop(columns = us_df.filter(regex = '(Mortality)|Death|overall').columns)
#X = us_df[pd.DataFrame(reg.fit().pvalues)[pd.DataFrame(reg.fit().pvalues)[0] != max(reg.fit().pvalues)].index[1:]]
y = us_df[target_var]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

X_endog = sm.add_constant(X_test)

reg = sm.OLS(y_test, X_endog)


while max(reg.fit().pvalues) > 0.1:
    X = us_df[pd.DataFrame(reg.fit().pvalues)[pd.DataFrame(reg.fit().pvalues)[0] != max(reg.fit().pvalues)].index[1:]]
    y = us_df[target_var]

    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

    X_endog = sm.add_constant(X_test)

    reg = sm.OLS(y_test, X_endog)

reg.fit().summary()



0,1,2,3
Dep. Variable:,Death rate for pneumonia patients,R-squared:,0.172
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,6.038
Date:,"Tue, 24 Nov 2020",Prob (F-statistic):,4.23e-19
Time:,08:44:19,Log-Likelihood:,-1684.0
No. Observations:,815,AIC:,3424.0
Df Residuals:,787,BIC:,3556.0
Df Model:,27,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,15.0643,3.981,3.784,0.000,7.251,22.878
Collapsed lung due to medical treatment,3.0593,1.494,2.047,0.041,0.126,5.993
Care transition - linear mean score,-0.1226,0.044,-2.766,0.006,-0.210,-0.036
Quietness - linear mean score,0.0652,0.020,3.328,0.001,0.027,0.104
Staff responsiveness - linear mean score,0.0722,0.028,2.544,0.011,0.016,0.128
Catheter Associated Urinary Tract Infections (ICU + select Wards): Number of Urinary Catheter Days,7.679e-05,2.43e-05,3.163,0.002,2.91e-05,0.000
Clostridium Difficile (C.Diff),-0.3165,0.156,-2.027,0.043,-0.623,-0.010
Clostridium Difficile (C.Diff): Patient Days,-1.213e-05,3.34e-06,-3.627,0.000,-1.87e-05,-5.56e-06
Appropriate care for severe sepsis and septic shock,-0.0215,0.009,-2.327,0.020,-0.040,-0.003

0,1,2,3
Omnibus:,17.362,Durbin-Watson:,1.938
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21.678
Skew:,0.254,Prob(JB):,1.96e-05
Kurtosis:,3.617,Cond. No.,3850000.0


## Canada Health Data

The following indicators are available on a hospital level for Canadian Facilities.

In [174]:
canada_data_dir = r"https://yourhealthsystem.cihi.ca/yhslive/downloads/In%20Depth_All%20Data%20Export%20Report.xlsx"
canada_df = pd.read_excel(canada_data_dir, sheet_name = 3, skiprows = 2)
canada_df = canada_df[canada_df['Reporting level'] == 'Hospital or long-term care organization']
canada_df['Indicator'].unique()

array(['Emergency Department Wait Time for Physician Initial Assessment (90% Spent Less, in Hours)',
       'Hip Fracture Surgery Within 48 Hours',
       'Total Time Spent in Emergency Department for Admitted Patients (90% Spent Less, in Hours)',
       'Falls in the Last 30 Days in Long-Term Care',
       'In-Hospital Sepsis', 'Obstetric Trauma (With Instrument)',
       'Worsened Pressure Ulcer in Long-Term Care',
       'All Patients Readmitted to Hospital', 'Hospital Deaths (HSMR)',
       'Hospital Deaths Following Major Surgery',
       'Low-Risk Caesarean Sections',
       'Medical Patients Readmitted to Hospital',
       'Obstetric Patients Readmitted to Hospital',
       'Pediatric Patients Readmitted to Hospital',
       'Potentially Inappropriate Use of Antipsychotics in Long-Term Care',
       'Restraint Use in Long-Term Care',
       'Surgical Patients Readmitted to Hospital',
       'Corporate Services Expense Ratio',
       'Cost of a Standard Hospital Stay',
       'Ex

### Exploratory Data Analysis

Data is available over time, by fiscal year for a period of five years.

In [175]:
canada_df['Measure'] = canada_df['Indicator'] + ' (' + canada_df['Unit of measurement'].fillna('-') + ')'
pd.crosstab(index = canada_df['Measure'].str.replace('\(Hours\)|\(-\)',''), 
            columns = canada_df['Data year'], 
            values = canada_df['Indicator result'],
            aggfunc = np.mean)

Data year,2014–2015,2015–2016,2016–2017,2017–2018,2018–2019
Measure,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
All Patients Readmitted to Hospital (Percentage),9.272525,9.428487,9.482192,9.574081,9.610547
Corporate Services Expense Ratio (Percentage),7.238596,6.591667,6.667568,6.736842,6.686755
Cost of a Standard Hospital Stay (Dollars),6513.975717,6598.697917,6593.012121,6735.715447,6939.106426
"Emergency Department Wait Time for Physician Initial Assessment (90% Spent Less, in Hours)",2.917143,2.818617,2.818227,2.901961,3.335217
Experiencing Pain in Long-Term Care (Percentage),10.931368,9.522744,9.015379,8.360582,7.664048
Experiencing Worsened Pain in Long-Term Care (Percentage),11.622929,11.18559,11.02638,10.926728,10.986842
Falls in the Last 30 Days in Long-Term Care (Percentage),14.920997,15.115152,15.151394,15.318253,15.757113
Hip Fracture Surgery Within 48 Hours (Percentage),86.278302,88.708333,88.179817,86.242982,88.420175
Hospital Deaths (HSMR),104.853933,100.719626,100.342342,98.121739,96.773913
Hospital Deaths Following Major Surgery (Percentage),1.647059,1.470388,1.371498,1.431401,1.414851


### Data Preprocessing

Convert from long to wide format, so each hospital and year combination has a line item. There are fewer hospitals in Canada than in the United States, and Canadian data is much more sparse. Most fields are missing 50% - 80% of values. For this reason, all years of data will be used in order to have more data to learn from. This will violate the condition of independence between observations, so I will do further research to see what is the best way to proceed. 

In [176]:
canada_clean = (canada_df[['Hospital or long-term care organization',
                                                                  'Type of hospital', 
                                                                  'Region', 
                                                                  'Province/territory',
                                                                  'Indicator',
                                                                  'Indicator result',
                                                                  'Data year'             
                                                                 ]]
                .pivot_table(values='Indicator result', 
                             index=['Hospital or long-term care organization',
                                    'Type of hospital', 
                                    'Region', 
                                    'Province/territory', 
                                    'Data year'],
                             columns= 'Indicator')
               .reset_index()
               .set_index('Hospital or long-term care organization'))

percent_missing = canada_clean.isnull().sum() * 100 / len(canada_clean)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df

Unnamed: 0_level_0,percent_missing
Indicator,Unnamed: 1_level_1
Type of hospital,0.0
Region,0.0
Province/territory,0.0
Data year,0.0
All Patients Readmitted to Hospital,4.611924
Corporate Services Expense Ratio,73.415823
Cost of a Standard Hospital Stay,9.336333
"Emergency Department Wait Time for Physician Initial Assessment (90% Spent Less, in Hours)",62.504687
Experiencing Pain in Long-Term Care,77.915261
Experiencing Worsened Pain in Long-Term Care,78.365204


### Regression Model

The missing data will be imputed (unless missing in the target variable, then it will be dropped). For now, the 'All Patients Readmitted to Hospital' field will be used as the predictor variable, since it has the least amount of missing data. The r-squared value is sufficient, however further analysis of colinearity should be done (VIF, correlation plots).

In [177]:
imputer = IterativeImputer()
# fit on the dataset
target_var = 'All Patients Readmitted to Hospital'
canada_clean = canada_clean.dropna(subset=[target_var])

canada_clean['Data year'] = canada_clean['Data year'].str[-4:].astype(int)

imputer.fit(canada_clean.drop(columns = [target_var, 'Type of hospital', 'Region', 'Province/territory']))
# transform the dataset
canada_clean[canada_clean.drop(columns = [target_var, 'Type of hospital', 'Region', 'Province/territory']).columns] = imputer.transform(canada_clean[canada_clean.drop(columns = [target_var, 'Type of hospital', 'Region', 'Province/territory']).columns])

for col in ['Type of hospital', 'Region', 'Province/territory']:
    canada_clean = pd.concat([canada_clean, pd.DataFrame(pd.get_dummies(canada_clean[col], prefix = col)).iloc[:, :-1]], axis = 1).drop(columns = col)
    
p = 1
X = canada_clean.drop(columns = canada_clean.filter(regex = 'Readmitted').columns)
y = canada_clean[target_var]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

X_endog = sm.add_constant(X_test)

reg = sm.OLS(y_test, X_endog)

while max(reg.fit().pvalues) > 0.1:
    X = canada_clean[pd.DataFrame(reg.fit().pvalues)[pd.DataFrame(reg.fit().pvalues)[0] != max(reg.fit().pvalues)].index[1:]]
    y = canada_clean[target_var]

    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

    X_endog = sm.add_constant(X_test)

    reg = sm.OLS(y_test, X_endog)

reg.fit().summary()



0,1,2,3
Dep. Variable:,All Patients Readmitted to Hospital,R-squared:,0.781
Model:,OLS,Adj. R-squared:,0.769
Method:,Least Squares,F-statistic:,67.21
Date:,"Tue, 24 Nov 2020",Prob (F-statistic):,5.9299999999999996e-176
Time:,08:45:03,Log-Likelihood:,-800.62
No. Observations:,636,AIC:,1667.0
Df Residuals:,603,BIC:,1814.0
Df Model:,32,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-308.6649,97.668,-3.160,0.002,-500.476,-116.854
Data year,0.1530,0.048,3.212,0.001,0.059,0.246
Corporate Services Expense Ratio,1.3863,0.151,9.208,0.000,1.091,1.682
Cost of a Standard Hospital Stay,-0.0002,6.69e-05,-2.803,0.005,-0.000,-5.61e-05
"Emergency Department Wait Time for Physician Initial Assessment (90% Spent Less, in Hours)",-0.5273,0.093,-5.679,0.000,-0.710,-0.345
Experiencing Pain in Long-Term Care,-0.3778,0.021,-17.916,0.000,-0.419,-0.336
Experiencing Worsened Pain in Long-Term Care,0.2435,0.029,8.258,0.000,0.186,0.301
Falls in the Last 30 Days in Long-Term Care,0.0746,0.014,5.341,0.000,0.047,0.102
Hip Fracture Surgery Within 48 Hours,-0.1692,0.015,-11.416,0.000,-0.198,-0.140

0,1,2,3
Omnibus:,185.422,Durbin-Watson:,1.925
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4374.657
Skew:,-0.711,Prob(JB):,0.0
Kurtosis:,15.77,Cond. No.,20100000.0


### Classification Model

Since data over time is available, we can also select an indictor and determine whether it decreased or increased over time. Using the same 'All Patients Readmitted to Hospital' field, these models will attempt to predict whether readmissions will increase or decrease for each hospital from 2015 to 2019. The accuracy of each classification model is shown below.

In [178]:
canada_df['Measure + Year'] = canada_df['Data year'].str[-4:] + ' ' + canada_df['Indicator']
canada_time = (canada_df[(canada_df['Data year'] == '2018–2019') | (canada_df['Data year'] == '2014–2015')][['Hospital or long-term care organization',
                                                                  'Type of hospital', 
                                                                  'Region', 
                                                                  'Province/territory',
                                                                  'Indicator result',
                                                                  'Measure + Year'             
                                                                 ]]
                .pivot_table(values='Indicator result', 
                             index=['Hospital or long-term care organization',
                                    'Type of hospital', 
                                    'Region', 
                                    'Province/territory'],
                             columns= 'Measure + Year')
               .reset_index()
               .set_index('Hospital or long-term care organization'))

target_var = 'All Patients Readmitted to Hospital'

for indicator in canada_df['Indicator'].unique():
    if indicator == target_var:
        canada_time[target_var + ' change'] = np.where(canada_time['2019 ' + target_var] > canada_time['2015 ' + target_var], 'increase', 'decrease')
    
    else:
        try:
            canada_time[indicator + ' change'] = canada_time['2019 ' + indicator] - canada_time['2015 ' + indicator]
            canada_time = canada_time.drop(columns = ['2019 ' + indicator, '2015 ' + indicator])
        except:
            canada_time = canada_time.drop(columns = ['2019 ' + indicator]) 
            
            
imputer = IterativeImputer()
# fit on the dataset
target_var = 'All Patients Readmitted to Hospital'
target_var = target_var + ' change'
canada_time = canada_time.dropna(subset=[target_var])

imputer.fit(canada_time.drop(columns = [target_var, 'Type of hospital', 'Region', 'Province/territory']))
# transform the dataset
canada_time[canada_time.drop(columns = [target_var, 'Type of hospital', 'Region', 'Province/territory']).columns] = imputer.transform(canada_time[canada_time.drop(columns = [target_var, 'Type of hospital', 'Region', 'Province/territory']).columns])

for col in ['Type of hospital', 'Region', 'Province/territory']:
    canada_time = pd.concat([canada_time, pd.DataFrame(pd.get_dummies(canada_time[col], prefix = col)).iloc[:, :-1]], axis = 1).drop(columns = col)

    
X = canada_time.drop(columns = canada_time.filter(regex = 'Readmitted').columns)
y = canada_time[target_var]

#X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)


# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
#models.append(('LR', LogisticRegression(class_weight = 'balanced')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f" % (name, cv_results.mean())
    print(msg)
# boxplot algorithm comparison
#fig = plt.figure()
#fig.suptitle('Algorithm Comparison')
#ax = fig.add_subplot(111)
#plt.boxplot(results)
#ax.set_xticklabels(names)
#plt.show()

LDA: 0.731751
KNN: 0.602727
CART: 0.665522
NB: 0.582727
SVM: 0.550101
RF: 0.713401


LDA had the highest accuracy; the following shows a more detailed classification report for the model. It predicted increase in readmissions better than decrease in readmissions. 

In [179]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0, stratify = y)
lda = LinearDiscriminantAnalysis()
model = lda.fit(X_train, y_train)
pred=model.predict(X_test)
print(confusion_matrix(pred, y_test.values))
print(classification_report(y_test, pred, digits=3))

[[40 10]
 [21 65]]
              precision    recall  f1-score   support

    decrease      0.800     0.656     0.721        61
    increase      0.756     0.867     0.807        75

    accuracy                          0.772       136
   macro avg      0.778     0.761     0.764       136
weighted avg      0.776     0.772     0.769       136



Random forest performed almost as well as LDA, and random forest can reveal feature importance rankings. Using random forst shows that change in wait times are 2 of the top 3 features that were used in prediction. 

In [180]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0, stratify = y)
rf = RandomForestClassifier()
model = rf.fit(X_train, y_train)
pred=model.predict(X_test)

importances = model.feature_importances_

indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(10):
    print("%d. feature %s (%f)" % (f + 1, X.columns[f], importances[indices[f]]))

Feature ranking:
1. feature Emergency Department Wait Time for Physician Initial Assessment (90% Spent Less, in Hours) change (0.084179)
2. feature Hip Fracture Surgery Within 48 Hours change (0.070518)
3. feature Total Time Spent in Emergency Department for Admitted Patients (90% Spent Less, in Hours) change (0.069845)
4. feature Falls in the Last 30 Days in Long-Term Care change (0.057189)
5. feature In-Hospital Sepsis change (0.047023)
6. feature Obstetric Trauma (With Instrument) change (0.045314)
7. feature Worsened Pressure Ulcer in Long-Term Care change (0.045056)
8. feature Hospital Deaths (HSMR) change (0.041762)
9. feature Hospital Deaths Following Major Surgery change (0.041447)
10. feature Potentially Inappropriate Use of Antipsychotics in Long-Term Care change (0.041192)


# Next Steps

- Multi-colinearity analysis: Look into correlation plots, variance inflation factors, feature reduction techniques.
- Condition checks: Ensure that the data meets the conditions of each classification and regression model
- Model improvement: Try out ensemble techniques for classification models / penalized regression for regression models
- Target variable: Attempt to use a different target variable for Canadian data (one more closely related to mortality).