## *DISCLAIMER*
<p style="font-size:16px; color:#117d30;">
 By accessing this code, you acknowledge the code is made available for presentation and demonstration purposes only and that the code: (1) is not subject to SOC 1 and SOC 2 compliance audits; (2) is not designed or intended to be a substitute for the professional advice, diagnosis, treatment, or judgment of a certified financial services professional; (3) is not designed, intended or made available as a medical device; and (4) is not designed or intended to be a substitute for professional medical advice, diagnosis, treatment or judgement. Do not use this code to replace, substitute, or provide professional financial advice or judgment, or to replace, substitute or provide medical advice, diagnosis, treatment or judgement. You are solely responsible for ensuring the regulatory, legal, and/or contractual compliance of any use of the code, including obtaining any authorizations or consents, and any solution you choose to build that incorporates this code in whole or in part.
</p>

# Readmission Prediction using Azure AutoML
<h3><span style="color: #117d30;"> Using Azure AutoML Cognitive Services</span></h3>

$*****$ For demonstration purposes only. Please customize as per your enterprise security needs and compliance requirements. License agreement: https://github.com/microsoft/Azure-Analytics-and-AI-Engagement/blob/main/HealthCare/License.md $*****$

## Legal Notices 

This presentation, demonstration, and demonstration model are for informational purposes only. Microsoft makes no warranties, express or implied, in this presentation, demonstration, and demonstration model. Nothing in this presentation, demonstration, or demonstration model modifies any of the terms and conditions of Microsoft’s written and signed agreements. This is not an offer and applicable terms and the information provided is subject to revision and may be changed at any time by Microsoft.

This presentation, demonstration, and/or demonstration model do not give you or your organization any license to any patents, trademarks, copyrights, or other intellectual property covering the subject matter in this presentation, demonstration, and demonstration model.

The information contained in this presentation, demonstration and demonstration model represent the current view of Microsoft on the issues discussed as of the date of presentation and/or demonstration, and the duration of your access to the demonstration model. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of presentation and/or demonstration and for the duration of your access to the demonstration model.

No Microsoft technology, nor any of its component technologies, including the demonstration model, is intended or made available: (1) as a medical device; (2) for the diagnosis of disease or other conditions, or in the cure, mitigation, treatment or prevention of a disease or other conditions; or (3) as a substitute for the professional clinical advice, opinion, or judgment of a treating healthcare professional. Partners or customers are responsible for ensuring the regulatory compliance of any solution they build using Microsoft technologies.

© 2020 Microsoft Corporation. All rights reserved

## Scenario Overview

Azure AutoML is a cognitive service which helps in building ML models for different problems such as classification, time-series forecasting, and regression.

This notebook provides an end-to-end demo of how to use Azure Cognitive Services AutoML to build a model that predicts whether a patient will be readmitted within 30 days.

In this scenario, we use patient encounter data to predict whether a patient will be readmitted. The model can also explain its predictions, which tells us which factors matter most when deciding whether a patient will be readmitted.

The raw data is stored in an ADLS Gen2 storage container.


## Setting up the workspace

In [1]:
import azureml.core
import pandas as pd
from azureml.core import Workspace, Datastore, Dataset, Experiment
from azureml.data import DataType
from azureml.data.datapath import DataPath
from azureml.core.compute import AmlCompute
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import os


print("SDK Version:", azureml.core.VERSION)

ws = Workspace.from_config()
ws

SDK Version: 1.19.0


Workspace.create(name='mlw-healthcare-dev', subscription_id='6f6a71d2-83bb-42b0-9912-2e243ef214c4', resource_group='rg-healthcare-dev')

#### Create new datastore for Datasets

In [2]:
import GlobalVariables

In [3]:
blob_datastore_name=GlobalVariables.READMISSION_DATASTORE_NAME # Name of the datastore in workspace
container_name=GlobalVariables.GLOBAL_CONTAINER_NAME
account_name=GlobalVariables.STORAGE_ACCOUNT_NAME
account_key=GlobalVariables.STORAGE_ACCOUNT_KEY # Storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                         datastore_name=blob_datastore_name, 
                                                         container_name=container_name, 
                                                         account_name=account_name,
                                                         account_key=account_key)

dstore = Datastore.get(ws, datastore_name=blob_datastore_name)

In [4]:
filepath = GlobalVariables.READMISSION_INPUT_FILE_NAME
print(filepath)

# Set the path to the storage account containing the file
datastore_path = [DataPath(dstore, filepath)]
patientdataset = Dataset.Tabular.from_delimited_files(path=datastore_path)
patientdataset.to_pandas_dataframe()

/pbiPatientPredictiveSetv4.csv


Unnamed: 0,encounter_id,hospital_id,department_id,city,patient_id,patient_age,risk_level,acute_type,patient_category,doctor_id,...,drug_cost,hospital_expense,follow_up,readmitted_patient,payment_type,date,month,year,reason_for_readmission,disease
0,2135731,2,7,Chicago,bb94327b-2f30-11eb-8d56-70b5e8b8edbb,30,2,Non Acute,InPatient,10676,...,816,6086,0,0,Medicare,2018-06-29 08:34:00,Jun,2018,,flu
1,4611669,2,5,Chicago,fa6cbf7d-2f42-11eb-a512-70b5e8b8edbb,46,5,Acute,InPatient,4415,...,760,5234,1,0,Uninsured,2020-01-23 15:37:00,Jan,2020,,oligomenorrhea
2,1921463,2,6,Chicago,89974f6b-2f30-11eb-a2b4-70b5e8b8edbb,26,5,Acute,InPatient,1783,...,653,6340,1,0,Uninsured,2018-05-15 15:45:00,May,2018,,chronic headache
3,1998115,2,4,Chicago,994a316b-2f30-11eb-b0f2-70b5e8b8edbb,52,3,Non Acute,InPatient,12655,...,744,6544,0,0,Uninsured,2018-01-03 08:47:00,Jan,2018,,maternity
4,2238767,2,6,Chicago,fe55cf81-2f30-11eb-a868-70b5e8b8edbb,27,3,Non Acute,InPatient,8460,...,1091,7039,0,0,Uninsured,2019-04-10 00:13:00,Apr,2019,,chronic headache
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4153018,4410743,2,6,Chicago,bd4eacbe-2f42-11eb-b0d1-70b5e8b8edbb,44,4,Acute,Cancelled Appointment,13553,...,0,-100,0,1,,2020-08-29 16:18:00,Aug,2020,Failure to follow hospital discharge,
4153019,4629642,2,6,Chicago,ff411237-2f42-11eb-ab73-70b5e8b8edbb,81,4,Acute,Cancelled Appointment,3581,...,0,-100,1,1,,2020-08-03 07:43:00,Aug,2020,Recurrence of a preexisting condition / infection,
4153020,4147292,9,1,Honolulu,5c11509c-2f42-11eb-889d-70b5e8b8edbb,60,4,Acute,Cancelled Appointment,9535,...,0,-58,0,1,,2020-06-22 13:52:00,Jun,2020,Failure to follow hospital discharge,
4153021,4276403,21,8,Miami,942bdfc3-2f42-11eb-8dfb-70b5e8b8edbb,50,4,Acute,Cancelled Appointment,948,...,0,-48,0,1,,2020-09-16 11:10:00,Sep,2020,Medication errors or lack of accurate medicati...,


#### Convert to Pandas DataFrame to do data preparation

In [4]:
patient_df = patientdataset.to_pandas_dataframe()
patient_df.head()

Unnamed: 0,encounter_id,hospital_id,department_id,city,patient_id,patient_age,risk_level,acute_type,patient_category,doctor_id,...,drug_cost,hospital_expense,follow_up,readmitted_patient,payment_type,date,month,year,reason_for_readmission,disease
0,2135731,2,7,Chicago,bb94327b-2f30-11eb-8d56-70b5e8b8edbb,30,2,Non Acute,InPatient,10676,...,816,6086,0,0,Medicare,2018-06-29 08:34:00,Jun,2018,,flu
1,4611669,2,5,Chicago,fa6cbf7d-2f42-11eb-a512-70b5e8b8edbb,46,5,Acute,InPatient,4415,...,760,5234,1,0,Uninsured,2020-01-23 15:37:00,Jan,2020,,oligomenorrhea
2,1921463,2,6,Chicago,89974f6b-2f30-11eb-a2b4-70b5e8b8edbb,26,5,Acute,InPatient,1783,...,653,6340,1,0,Uninsured,2018-05-15 15:45:00,May,2018,,chronic headache
3,1998115,2,4,Chicago,994a316b-2f30-11eb-b0f2-70b5e8b8edbb,52,3,Non Acute,InPatient,12655,...,744,6544,0,0,Uninsured,2018-01-03 08:47:00,Jan,2018,,maternity
4,2238767,2,6,Chicago,fe55cf81-2f30-11eb-a868-70b5e8b8edbb,27,3,Non Acute,InPatient,8460,...,1091,7039,0,0,Uninsured,2019-04-10 00:13:00,Apr,2019,,chronic headache


In [5]:
# View info to see what the column names and types are
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4153023 entries, 0 to 4153022
Data columns (total 25 columns):
encounter_id              int64
hospital_id               int64
department_id             int64
city                      object
patient_id                object
patient_age               int64
risk_level                int64
acute_type                object
patient_category          object
doctor_id                 int64
length_of_stay            int64
wait_time                 int64
type_of_stay              object
treatment_cost            int64
claim_cost                int64
drug_cost                 int64
hospital_expense          int64
follow_up                 int64
readmitted_patient        int64
payment_type              object
date                      datetime64[ns]
month                     object
year                      int64
reason_for_readmission    object
disease                   object
dtypes: datetime64[ns](1), int64(15), object(9)
memory usage: 792.1+ MB

## Data Preparation for AutoML

Drop encounter_id and patient_id, since they are unique identifiers and give us no predictive power.


In [6]:
patient_df = patient_df.drop(['encounter_id', 'patient_id'], axis=1)
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4153023 entries, 0 to 4153022
Data columns (total 23 columns):
hospital_id               int64
department_id             int64
city                      object
patient_age               int64
risk_level                int64
acute_type                object
patient_category          object
doctor_id                 int64
length_of_stay            int64
wait_time                 int64
type_of_stay              object
treatment_cost            int64
claim_cost                int64
drug_cost                 int64
hospital_expense          int64
follow_up                 int64
readmitted_patient        int64
payment_type              object
date                      datetime64[ns]
month                     object
year                      int64
reason_for_readmission    object
disease                   object
dtypes: datetime64[ns](1), int64(14), object(8)
memory usage: 728.8+ MB


#### Fix Months column

The month column mixes full names such as "January" with abbreviations such as "Jan", so we truncate every value to its first three characters.

In [7]:
# Keep only the three-letter abbreviation of each month name
patient_df['month'] = patient_df['month'].str[:3]
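
To illustrate why truncating to three characters is enough here, the following minimal sketch applies the same idea to a handful of made-up month values (the sample values are assumptions, not rows from the dataset):

```python
import pandas as pd

# Made-up sample mixing full month names and three-letter abbreviations.
months = pd.Series(["January", "Jan", "June", "Jun", "September"])

# Keep only the first three characters of each value.
normalized = months.str[:3]

print(normalized.tolist())
```

Both spellings of a month collapse onto one label, so a later group-by on the month column sees one group per month rather than two.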

#### Create dictionaries to rename columns from snake_case to Capital Camel Case and vice versa

In [8]:
def to_camel_case(col_name):
    # Convert a snake_case column name to Capital Camel Case, e.g. 'patient_age' -> 'PatientAge'
    words = col_name.split('_')
    camel_words = [word.title() for word in words]
    return ''.join(camel_words)
    
def get_snake_to_camel_dicts(df):
    # Build the forward (snake -> camel) and reverse (camel -> snake) column-name mappings
    snake_case_columns = list(df.columns)
    camel_case_columns = [to_camel_case(col_name) for col_name in snake_case_columns]
    snake_to_camel = dict(zip(snake_case_columns, camel_case_columns))
    camel_to_snake = {v: k for k, v in snake_to_camel.items()}
    return snake_to_camel, camel_to_snake

In [9]:
snake_to_camel, camel_to_snake = get_snake_to_camel_dicts(patient_df)
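
As a quick illustration of what the two dictionaries contain, here is a standalone sketch on three of the column names. The helper `camel` is a hypothetical re-implementation written only for this example; it is not part of the notebook.

```python
def camel(col_name):
    # Title-case each underscore-separated word and join without separators.
    return ''.join(word.title() for word in col_name.split('_'))

columns = ['readmitted_patient', 'length_of_stay', 'disease']

# Forward mapping (snake_case -> CamelCase) and its inverse.
forward = {col: camel(col) for col in columns}
backward = {v: k for k, v in forward.items()}

print(forward)
print(backward)
```

The inverse mapping is only safe when no two snake_case names collapse to the same Camel Case name, which holds for the columns in this dataset.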

### Split main DataFrame into separate dataframes for each city

In [10]:
all_cities = dict(patient_df['city'].value_counts())
all_cities

{'Los Angeles': 1416417,
 'Chicago': 1352737,
 'Miami': 603225,
 'Honolulu': 425706,
 'Anchorage': 354938}

In [11]:
all_dfs = {}
for city in all_cities:
    city_df = patient_df[patient_df['city'] == city]
    all_dfs[city] = city_df


In [12]:
for city, df in all_dfs.items():
    print(city, len(df))

Los Angeles 1416417
Chicago 1352737
Miami 603225
Honolulu 425706
Anchorage 354938


## Prepare Training and Testing set

Since we plan to predict whether patients will be readmitted in October, November, or December, we split the training and testing data by date.

While splitting, we also keep only a subset of the columns to get a better model.

#### Split data based on time

In [13]:
columns_to_keep = [
    'department_id',
    'patient_age',
    'risk_level',
    'acute_type',
    'length_of_stay',
    'type_of_stay',
    'treatment_cost',
    'claim_cost',
    'drug_cost',
    'hospital_expense',
    'follow_up',
    'readmitted_patient',
    'disease'
]

In [14]:
date_cutoff = pd.to_datetime('2020-10-01')

train_dfs = {}
for city, df in all_dfs.items():
    train_df = df[df['date'] < date_cutoff]
    train_df = train_df[train_df['patient_category'] == 'InPatient']
    train_df = train_df[columns_to_keep]
    train_dfs[city] = train_df
    
for city, train_df in train_dfs.items():
    print(city, train_df.shape)

Los Angeles (768336, 13)
Chicago (969441, 13)
Miami (424873, 13)
Honolulu (311576, 13)
Anchorage (257843, 13)


In [15]:
test_dfs = {}
for city, df in all_dfs.items():
    test_df = df[df['date'] >= date_cutoff]
    test_df = test_df[test_df['patient_category'] == 'InPatient']
    test_df = test_df[columns_to_keep]
    test_dfs[city] = test_df

for city, test_df in test_dfs.items():
    print(city, len(test_df))


Los Angeles 30163
Chicago 29325
Miami 13556
Honolulu 9726
Anchorage 9983
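
The boundary behaviour of the date-based split can be checked on a toy frame. This is a minimal sketch with made-up timestamps (the rows are assumptions); it shows that an encounter at exactly midnight on the cutoff date lands in the test set and that the two sets never overlap.

```python
import pandas as pd

# Made-up encounters on either side of the cutoff.
toy = pd.DataFrame({
    'date': pd.to_datetime([
        '2020-09-30 23:59:00',
        '2020-10-01 00:00:00',
        '2020-11-15 08:30:00',
    ]),
    'readmitted_patient': [0, 1, 0],
})
cutoff = pd.to_datetime('2020-10-01')

# Strictly-before goes to training, on-or-after goes to testing.
train = toy[toy['date'] < cutoff]
test = toy[toy['date'] >= cutoff]

print(len(train), len(test))
```

Splitting by time rather than at random keeps future encounters out of the training data, which matches the goal of predicting the last three months of 2020.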


#### Balance the training dataset

In [16]:
for city, train_df in train_dfs.items():
    print(city)
    print(train_df['readmitted_patient'].value_counts())

Los Angeles
0    741192
1     27144
Name: readmitted_patient, dtype: int64
Chicago
0    921397
1     48044
Name: readmitted_patient, dtype: int64
Miami
0    375783
1     49090
Name: readmitted_patient, dtype: int64
Honolulu
0    285024
1     26552
Name: readmitted_patient, dtype: int64
Anchorage
0    245435
1     12408
Name: readmitted_patient, dtype: int64


In [17]:
balanced_train_dfs = {}
for city, train_df in train_dfs.items():
    readmitted_df = train_df[train_df['readmitted_patient'] == 1]
    not_readmitted_df = train_df[train_df['readmitted_patient'] == 0]
    # Downsample the non-readmitted class to 1.5x the number of readmitted rows
    sampled_non_readmitted_df = not_readmitted_df.sample(n=int(1.5*len(readmitted_df)), random_state=1)
    balanced_train_df = pd.concat([readmitted_df, sampled_non_readmitted_df]).sample(frac=1, random_state=0)
    balanced_train_dfs[city] = balanced_train_df

for city, train_df in balanced_train_dfs.items():
    print(city)
    print(train_df['readmitted_patient'].value_counts())


Los Angeles
0    40716
1    27144
Name: readmitted_patient, dtype: int64
Chicago
0    72066
1    48044
Name: readmitted_patient, dtype: int64
Miami
0    73635
1    49090
Name: readmitted_patient, dtype: int64
Honolulu
0    39828
1    26552
Name: readmitted_patient, dtype: int64
Anchorage
0    18612
1    12408
Name: readmitted_patient, dtype: int64
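
The class balance that results from the downsampling can be read off the counts printed above. The sketch below copies those counts into a dictionary (the variable names are hypothetical) and computes the negative-to-positive ratio and the share of readmitted rows for each city.

```python
# (readmitted, sampled non-readmitted) counts copied from the output above.
balanced_counts = {
    'Los Angeles': (27144, 40716),
    'Chicago': (48044, 72066),
    'Miami': (49090, 73635),
    'Honolulu': (26552, 39828),
    'Anchorage': (12408, 18612),
}

for city, (pos, neg) in balanced_counts.items():
    ratio = neg / pos                   # non-readmitted rows per readmitted row
    positive_share = pos / (pos + neg)  # fraction of rows that are readmitted
    print(f"{city}: {ratio:.2f} negatives per positive, {positive_share:.0%} readmitted")
```

Only the training sets are resampled; the test sets built earlier keep their natural class balance.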


#### Upload training and testing set to the Storage Account

In [18]:
# Save data locally
local_data_folder = 'readmission_data/'
os.makedirs(local_data_folder, exist_ok=True)

train_base_filename = 'readmission_data_train'
test_base_filename = 'readmission_data_test'
    
train_files = {}
test_files = {}
for city, train_df in balanced_train_dfs.items():
    city_space_removed = '_'.join(city.split(' '))
    print(city_space_removed)
    # Save train DF locally
    train_file = train_base_filename + '_' + city_space_removed + '.csv'
    train_files[city] = train_file
    train_df = train_df.rename(columns=snake_to_camel)
    print(train_df.columns)
    train_df.to_csv(local_data_folder+train_file, index=False)
    
    # Save test DF locally
    test_file = test_base_filename + '_' + city_space_removed + '.csv'
    test_files[city] = test_file
    test_df = test_dfs[city]
    test_df = test_df.rename(columns=snake_to_camel)
    print(test_df.columns)
    test_df.to_csv(local_data_folder+test_file, index=False)
    


Los_Angeles
Index(['DepartmentId', 'PatientAge', 'RiskLevel', 'AcuteType', 'LengthOfStay',
       'TypeOfStay', 'TreatmentCost', 'ClaimCost', 'DrugCost',
       'HospitalExpense', 'FollowUp', 'ReadmittedPatient', 'Disease'],
      dtype='object')
Index(['DepartmentId', 'PatientAge', 'RiskLevel', 'AcuteType', 'LengthOfStay',
       'TypeOfStay', 'TreatmentCost', 'ClaimCost', 'DrugCost',
       'HospitalExpense', 'FollowUp', 'ReadmittedPatient', 'Disease'],
      dtype='object')
Chicago
Index(['DepartmentId', 'PatientAge', 'RiskLevel', 'AcuteType', 'LengthOfStay',
       'TypeOfStay', 'TreatmentCost', 'ClaimCost', 'DrugCost',
       'HospitalExpense', 'FollowUp', 'ReadmittedPatient', 'Disease'],
      dtype='object')
Index(['DepartmentId', 'PatientAge', 'RiskLevel', 'AcuteType', 'LengthOfStay',
       'TypeOfStay', 'TreatmentCost', 'ClaimCost', 'DrugCost',
       'HospitalExpense', 'FollowUp', 'ReadmittedPatient', 'Disease'],
      dtype='object')
Miami
Index(['DepartmentId', 'PatientAge

In [19]:
print(train_files)

{'Los Angeles': 'readmission_data_train_Los_Angeles.csv', 'Chicago': 'readmission_data_train_Chicago.csv', 'Miami': 'readmission_data_train_Miami.csv', 'Honolulu': 'readmission_data_train_Honolulu.csv', 'Anchorage': 'readmission_data_train_Anchorage.csv'}


In [20]:
# Upload the data
local_files = [local_data_folder + file for file in train_files.values()] 
local_files += [local_data_folder + file for file in test_files.values()]
print(local_files)

dstore.upload_files(
    files = local_files,
    relative_root = local_data_folder,
    target_path = '/',
    overwrite=True,
    show_progress=True
)

['readmission_data/readmission_data_train_Los_Angeles.csv', 'readmission_data/readmission_data_train_Chicago.csv', 'readmission_data/readmission_data_train_Miami.csv', 'readmission_data/readmission_data_train_Honolulu.csv', 'readmission_data/readmission_data_train_Anchorage.csv', 'readmission_data/readmission_data_test_Los_Angeles.csv', 'readmission_data/readmission_data_test_Chicago.csv', 'readmission_data/readmission_data_test_Miami.csv', 'readmission_data/readmission_data_test_Honolulu.csv', 'readmission_data/readmission_data_test_Anchorage.csv']
Uploading an estimated of 10 files
Uploading readmission_data/readmission_data_train_Los_Angeles.csv
Uploaded readmission_data/readmission_data_train_Los_Angeles.csv, 1 files out of an estimated total of 10
Uploading readmission_data/readmission_data_train_Honolulu.csv
Uploaded readmission_data/readmission_data_train_Honolulu.csv, 2 files out of an estimated total of 10
Uploading readmission_data/readmission_data_train_Anchorage.csv
Uploade

$AZUREML_DATAREFERENCE_readmission_prediction_store

### Set up AutoML Experiment

#### Set the Data Types for each column. 
This needs to be done explicitly because some ID columns are automatically inferred as integers when they should be treated as strings.

In [21]:
data_types = {
    'hospital_id': DataType.to_string(),
    'department_id': DataType.to_string(),
    'city': DataType.to_string(),
    'patient_age': DataType.to_long(),
    'risk_level': DataType.to_long(),
    'acute_type': DataType.to_string(),
    'patient_category': DataType.to_string(),
    'doctor_id': DataType.to_string(),
    'length_of_stay': DataType.to_long(),
    'wait_time': DataType.to_long(),
    'type_of_stay': DataType.to_string(),
    'treatment_cost': DataType.to_long(),
    'claim_cost': DataType.to_long(),
    'drug_cost': DataType.to_long(),
    'hospital_expense': DataType.to_long(),
    'follow_up': DataType.to_bool(),
    'readmitted_patient': DataType.to_bool(),
    'payment_type': DataType.to_string(),
    'date': DataType.to_datetime("%Y-%m-%d"),
    'month': DataType.to_string(),
    'year': DataType.to_long(),
    'disease': DataType.to_string(), 
    'reason_for_readmission': DataType.to_string()
}

print(len(data_types))

23


In [22]:
data_types_camel_case = {snake_to_camel[key]:value for key, value in data_types.items()}
print(data_types_camel_case.keys())

dict_keys(['HospitalId', 'DepartmentId', 'City', 'PatientAge', 'RiskLevel', 'AcuteType', 'PatientCategory', 'DoctorId', 'LengthOfStay', 'WaitTime', 'TypeOfStay', 'TreatmentCost', 'ClaimCost', 'DrugCost', 'HospitalExpense', 'FollowUp', 'ReadmittedPatient', 'PaymentType', 'Date', 'Month', 'Year', 'Disease', 'ReasonForReadmission'])


#### Load Training data from Storage Blob as a TabularDataSet

In [23]:
train_datasets = {}
for city, filepath in train_files.items():
    datastore_path = [DataPath(dstore, filepath)]
    train_dataset = Dataset.Tabular.from_delimited_files(path=datastore_path, set_column_types=data_types_camel_case)
    train_datasets[city] = train_dataset
    print(city)
    print(train_dataset.to_pandas_dataframe().info())
    print()


Los Angeles
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67860 entries, 0 to 67859
Data columns (total 13 columns):
DepartmentId         67860 non-null object
PatientAge           67860 non-null int64
RiskLevel            67860 non-null int64
AcuteType            67860 non-null object
LengthOfStay         67860 non-null int64
TypeOfStay           67860 non-null object
TreatmentCost        67860 non-null int64
ClaimCost            67860 non-null int64
DrugCost             67860 non-null int64
HospitalExpense      67860 non-null int64
FollowUp             67860 non-null bool
ReadmittedPatient    67860 non-null bool
Disease              67860 non-null object
dtypes: bool(2), int64(7), object(4)
memory usage: 5.8+ MB
None

Chicago
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120110 entries, 0 to 120109
Data columns (total 13 columns):
DepartmentId         120110 non-null object
PatientAge           120110 non-null int64
RiskLevel            120110 non-null int64
AcuteType        

In [24]:
y_variable = "ReadmittedPatient"

#### Set up the Compute Target

In [25]:
compute = AmlCompute(ws, "health-cluster")

#### Configure the AutoML model and run it

In [26]:
for city, train_dataset in train_datasets.items():
    city_space_removed = '_'.join(city.split(' '))
    experiment_name = 'Readmission-Prediction-Experiment-'+city_space_removed
    print(experiment_name)
    experiment = Experiment(ws, experiment_name)

    automl_config = AutoMLConfig(task = 'classification',
                         debug_log = 'automl_errors.log',
                         iteration_timeout_minutes = 15,
                         experiment_timeout_minutes = 15,
                         label_column_name=y_variable,
                         enable_early_stopping=True,
                         primary_metric='precision_score_weighted',
                         compute_target = compute,
                         training_data = train_dataset,
                         model_explainability=True)

    training_run = experiment.submit(automl_config, show_output = False)

Readmission-Prediction-Experiment-Los_Angeles
Running on remote.
Readmission-Prediction-Experiment-Chicago
Running on remote.
Readmission-Prediction-Experiment-Miami
Running on remote.
Readmission-Prediction-Experiment-Honolulu
Running on remote.
Readmission-Prediction-Experiment-Anchorage
Running on remote.


### Retrieve model to predict the test set
The models must finish training before this step can run.

In [5]:
ws = Workspace.from_config()
blob_datastore_name=GlobalVariables.GLOBAL_DATASTORE_NAME
dstore = Datastore.get(ws, datastore_name=blob_datastore_name)
#ws_ds = ws.get_default_datastore()

print('Workspace Name: ' + ws.name, 
      'Resource Group: ' + ws.resource_group,
      'Default Storage Account Name: ' + dstore.account_name,
      'AzureML Core Version: ' + azureml.core.VERSION,
      sep = '\n')

Workspace Name: mlw-healthcare-dev
Resource Group: rg-healthcare-dev
Default Storage Account Name: sthealthcaredev001
AzureML Core Version: 1.19.0


In [6]:
# These need to be looked up from the Experiments page on Azure ML
autoMLRunIds = {
    'Los Angeles': 'AutoML_3d090dd8-c21e-4942-9f53-15bc25f6bd0d',
    'Chicago': 'AutoML_889f215c-4834-49fa-9523-bc664effb77b',
    'Miami': 'AutoML_9402871e-0ed3-4582-ab96-979a44cf789b',
    'Honolulu': 'AutoML_38f800a5-cef5-4259-abef-af7aa33f1e59',
    'Anchorage': 'AutoML_46cd2dea-0b1e-486b-a00b-e86429e45052'
}

all_models = {}
for city, autoMLRunId in autoMLRunIds.items():
    city_space_removed = '_'.join(city.split(' '))
    experiment_name = 'Readmission-Prediction-Experiment-'+city_space_removed
    print(city)
    experiment = Experiment(workspace = ws, name = experiment_name)
    automl_run = AutoMLRun(experiment, autoMLRunId, outputs = None)
    
    display(automl_run)
    # Get the model
    best_run, fitted_model = automl_run.get_output()
    model_name = best_run.properties['model_name']
    print(model_name)
    all_models[city] = fitted_model

Los Angeles


Experiment,Id,Type,Status,Details Page,Docs Page
Readmission-Prediction-Experiment-Los_Angeles,AutoML_3d090dd8-c21e-4942-9f53-15bc25f6bd0d,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


AutoML3d090dd8c12
Chicago


Experiment,Id,Type,Status,Details Page,Docs Page
Readmission-Prediction-Experiment-Chicago,AutoML_889f215c-4834-49fa-9523-bc664effb77b,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


AutoML889f215c425
Miami


Experiment,Id,Type,Status,Details Page,Docs Page
Readmission-Prediction-Experiment-Miami,AutoML_9402871e-0ed3-4582-ab96-979a44cf789b,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


AutoML9402871e012
Honolulu


Experiment,Id,Type,Status,Details Page,Docs Page
Readmission-Prediction-Experiment-Honolulu,AutoML_38f800a5-cef5-4259-abef-af7aa33f1e59,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


AutoML38f800a5c11
Anchorage


Experiment,Id,Type,Status,Details Page,Docs Page
Readmission-Prediction-Experiment-Anchorage,AutoML_46cd2dea-0b1e-486b-a00b-e86429e45052,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


AutoML46cd2dea011


In [31]:
all_models['Honolulu']

PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=None,
                                                                  enable_feature_sweeping=None,
                                                                  feature_sweeping_config=None,
                                                                  feature_sweeping_timeout=None,
                                                                  featurization_config=None,
                                                                  force_text_dnn=None,
                                                                  is_cross_validation=None,
                                                                  is_onnx_compatible=None,
                                                                  logger=None,
                                                              

#### Set up Test Data for prediction

The input data also contains the y_variable, which needs to be dropped before prediction.

In [32]:
all_dfs_inpatient = {}
for city, df in all_dfs.items():
    df = df[df['patient_category'] == 'InPatient']
#     df = df.drop('patient_category', axis=1)
    all_dfs_inpatient[city] = df

In [33]:
all_X_dfs = {}

for city, df in all_dfs_inpatient.items():
    X_df = df[columns_to_keep]
    X_df = X_df.rename(columns=snake_to_camel)
    X_df = X_df.drop(y_variable, axis=1)
    print(city, X_df.shape)
    all_X_dfs[city] = X_df

Los Angeles (798499, 12)
Chicago (998766, 12)
Miami (438429, 12)
Honolulu (321302, 12)
Anchorage (267826, 12)


#### Get Predictions and probabilities

In [34]:
citywise_predictions = {}
citywise_probabilities = {}

for city, X_df in all_X_dfs.items():
    print(city)
    fitted_model = all_models[city]
    print("Getting Predictions")
    predictions = fitted_model.predict(X_df)
    print("Getting Probabilities")
    prediction_probabilities = fitted_model.predict_proba(X_df)
    
    # Prediction Probabilities returns a dataframe with probabilities of True and False as columns
    # Since we only want the probability of True we select the appropriate series
    prediction_probabilities = prediction_probabilities[True]

    citywise_predictions[city] = predictions
    citywise_probabilities[city] = prediction_probabilities

Los Angeles
Getting Predictions
Getting Probabilities
Chicago
Getting Predictions
Getting Probabilities
Miami
Getting Predictions
Getting Probabilities
Honolulu
Getting Predictions
Getting Probabilities
Anchorage
Getting Predictions
Getting Probabilities


In [35]:
citywise_predictions['Honolulu']

array([False, False, False, ..., False,  True, False])

In [36]:
citywise_probabilities['Honolulu']

0        0.41
1        0.27
2        0.45
3        0.27
4        0.45
         ... 
321297   0.44
321298   0.68
321299   0.46
321300   0.56
321301   0.44
Name: True, Length: 321302, dtype: float64

In [37]:
temp_dict = {}
for city, df in all_dfs_inpatient.items():
    print(city)
    df = df.reset_index(drop=True)
    df['Actual_Flag'] = df['readmitted_patient'].astype(bool)
    df['Predicted_Flag'] = citywise_predictions[city]
    df['Prediction_Probability'] = citywise_probabilities[city]

    temp_dict[city] = df
    
all_dfs = temp_dict

Los Angeles
Chicago
Miami
Honolulu
Anchorage


In [38]:
all_dfs['Honolulu']

Unnamed: 0,hospital_id,department_id,city,patient_age,risk_level,acute_type,patient_category,doctor_id,length_of_stay,wait_time,...,readmitted_patient,payment_type,date,month,year,reason_for_readmission,disease,Actual_Flag,Predicted_Flag,Prediction_Probability
0,9,5,Honolulu,40,5,Acute,InPatient,6185,4,45,...,0,Uninsured,2016-06-25 08:44:00,Jun,2016,,ectopic pregnancy,False,False,0.41
1,9,7,Honolulu,36,2,Non Acute,InPatient,7907,4,49,...,0,Uninsured,2016-11-10 21:04:00,Nov,2016,,flu,False,False,0.27
2,9,7,Honolulu,26,5,Acute,InPatient,4357,3,50,...,0,Uninsured,2018-03-10 06:33:00,Mar,2018,,flu,False,False,0.45
3,9,7,Honolulu,27,2,Non Acute,InPatient,12458,4,48,...,0,Uninsured,2018-08-16 01:36:00,Aug,2018,,flu,False,False,0.27
4,9,7,Honolulu,30,5,Acute,InPatient,3882,4,49,...,0,Uninsured,2018-03-07 09:24:00,Mar,2018,,flu,False,False,0.45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321297,9,1,Honolulu,76,5,Acute,InPatient,505,9,31,...,1,Medicaid,2018-11-25 23:24:00,Nov,2018,Recurrence of a preexisting condition / infection,chemotherapy,True,False,0.44
321298,9,8,Honolulu,39,5,Acute,InPatient,458,14,38,...,1,Medicaid,2018-05-21 16:01:00,May,2018,Recurrence of a preexisting condition / infection,psoriasis,True,True,0.68
321299,9,3,Honolulu,73,5,Acute,InPatient,5615,10,39,...,1,Medicaid,2019-09-22 02:31:00,Sep,2019,Medication errors or lack of accurate medicati...,heart attack,True,False,0.46
321300,9,7,Honolulu,86,5,Acute,InPatient,5842,4,48,...,1,Medicaid,2019-05-29 14:54:00,May,2019,Pneumonia,pneumonia,True,True,0.56
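
With Actual_Flag and Predicted_Flag side by side, the two columns can be compared directly. The sketch below counts the confusion-matrix cells for the nine Honolulu rows displayed above. Nine rows are far too few to be representative, and the scored data also includes encounters from the training period, so the numbers only show the mechanics of the comparison.

```python
# (Actual_Flag, Predicted_Flag) pairs copied from the nine rows displayed above.
rows = [
    (False, False), (False, False), (False, False), (False, False), (False, False),
    (True, False), (True, True), (True, False), (True, True),
]

tp = sum(1 for actual, pred in rows if actual and pred)          # readmitted, predicted readmitted
fp = sum(1 for actual, pred in rows if not actual and pred)      # not readmitted, predicted readmitted
fn = sum(1 for actual, pred in rows if actual and not pred)      # readmitted, missed
tn = sum(1 for actual, pred in rows if not actual and not pred)  # not readmitted, predicted not

# Guard against empty denominators when a class is never predicted or never present.
precision = tp / (tp + fp) if (tp + fp) else float('nan')
recall = tp / (tp + fn) if (tp + fn) else float('nan')

print(tp, fp, fn, tn, precision, recall)
```

For a fair estimate, the same counts would be computed only on rows dated on or after the cutoff used for the train/test split.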


### Data Wrangling for PowerBI Reports

#### Limit Data to only 2020

In [76]:
final_dfs = {}

for city, df in all_dfs.items():
    df = df[df['year'] == 2020]
    final_dfs[city] = df

In [77]:
for city, df in final_dfs.items():
    print(df['year'].value_counts())

2020    194108
Name: year, dtype: int64
2020    201371
Name: year, dtype: int64
2020    97976
Name: year, dtype: int64
2020    66534
Name: year, dtype: int64
2020    55821
Name: year, dtype: int64


#### Get Aggregate Readmission Rate - Actual

In [78]:
def get_readmission_rate(readmission_series):
    """Return the percentage of truthy (readmitted) entries in a 0/1 or boolean series."""
    return sum(readmission_series) / len(readmission_series) * 100
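
As a quick sanity check of the helper, the sketch below applies it to a small, made-up boolean series and compares the result with pandas' own `Series.mean()`. The `flags` series is hypothetical and not part of the readmission data.

```python
import pandas as pd


def get_readmission_rate(readmission_series):
    # Same logic as the notebook helper: share of truthy values, as a percentage.
    return sum(readmission_series) / len(readmission_series) * 100


# Hypothetical toy flags: 1 of 4 encounters is a readmission.
flags = pd.Series([True, False, False, False])

rate_from_helper = get_readmission_rate(flags)
rate_from_mean = flags.mean() * 100  # mean of booleans = fraction of True values

print(rate_from_helper, rate_from_mean)
```

Both expressions give the same percentage on clean boolean data. They can differ when the series contains missing values, because `Series.mean()` skips them by default, and the helper raises `ZeroDivisionError` on an empty series.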

In [79]:
monthly_readmission_rate_real_dfs = {}
for city, df in final_dfs.items():
    print(city)
    monthly_readmission_rate_real = df.groupby('month').agg({'Actual_Flag': get_readmission_rate})
    monthly_readmission_rate_real = monthly_readmission_rate_real.reset_index()
    monthly_readmission_rate_real = monthly_readmission_rate_real.rename(columns={"Actual_Flag": "Actual_Readmission_Rate"})
    print(len(monthly_readmission_rate_real))
    monthly_readmission_rate_real_dfs[city] = monthly_readmission_rate_real

Los Angeles
11
Chicago
11
Miami
11
Honolulu
11
Anchorage
11


In [80]:
temp_dict = {}
for city, df in final_dfs.items():
    df = df.merge(monthly_readmission_rate_real_dfs[city], on='month', how='left')
    temp_dict[city] = df

final_dfs = temp_dict

In [81]:
final_dfs['Honolulu']

Unnamed: 0,hospital_id,department_id,city,patient_age,risk_level,acute_type,patient_category,doctor_id,length_of_stay,wait_time,...,payment_type,date,month,year,reason_for_readmission,disease,Actual_Flag,Predicted_Flag,Prediction_Probability,Actual_Readmission_Rate
0,9,7,Honolulu,27,2,Non Acute,InPatient,3421,4,50,...,Uninsured,2020-04-19 01:47:00,Apr,2020,,flu,False,False,0.27,11.24
1,9,7,Honolulu,25,1,Non Acute,InPatient,9757,4,50,...,Uninsured,2020-03-24 16:51:00,Mar,2020,,flu,False,False,0.24,10.55
2,9,3,Honolulu,72,5,Acute,InPatient,13356,9,42,...,Medicaid,2020-06-05 11:28:00,Jun,2020,,heart attack,False,False,0.44,11.99
3,9,3,Honolulu,62,3,Non Acute,InPatient,1569,13,41,...,Medicaid,2020-06-24 09:47:00,Jun,2020,,arrhythmia,False,False,0.28,11.99
4,9,3,Honolulu,85,2,Non Acute,InPatient,12910,12,33,...,Medicaid,2020-01-29 16:32:00,Jan,2020,,arrhythmia,False,False,0.25,8.16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66529,9,7,Honolulu,33,4,Acute,InPatient,2532,4,46,...,Medicaid,2020-10-19 05:06:00,Oct,2020,Recurrence of a preexisting condition / infection,flu,True,False,0.45,8.16
66530,9,2,Honolulu,17,2,Non Acute,InPatient,2443,5,41,...,Medicaid,2020-11-30 02:55:00,Nov,2020,Failure to follow hospital discharge,scoliosis,True,False,0.24,7.83
66531,9,3,Honolulu,82,3,Non Acute,InPatient,9685,12,31,...,Medicaid,2020-11-20 22:07:00,Nov,2020,Medication errors or lack of accurate medicati...,arrhythmia,True,False,0.30,7.83
66532,9,1,Honolulu,85,5,Acute,InPatient,1900,8,34,...,Medicaid,2020-05-21 20:22:00,May,2020,Recurrence of a preexisting condition / infection,chemotherapy,True,False,0.46,13.27


#### Get Aggregate Readmission Rate - Predicted

In [82]:
monthly_readmission_rate_predicted_dfs = {}
for city, df in final_dfs.items():
    print(city)
    monthly_readmission_rate_predicted = df.groupby('month').agg({'Predicted_Flag': get_readmission_rate})
    monthly_readmission_rate_predicted = monthly_readmission_rate_predicted.reset_index()
    monthly_readmission_rate_predicted = monthly_readmission_rate_predicted.rename(columns={"Predicted_Flag": "Predicted_Readmission_Rate"})
    print(len(monthly_readmission_rate_predicted))
    monthly_readmission_rate_predicted_dfs[city] = monthly_readmission_rate_predicted

Los Angeles
11
Chicago
11
Miami
11
Honolulu
11
Anchorage
11


In [83]:
temp_dict = {}
for city, df in final_dfs.items():
    df = df.merge(monthly_readmission_rate_predicted_dfs[city], on='month', how='left')
    temp_dict[city] = df

final_dfs = temp_dict

In [84]:
final_dfs['Honolulu']

Unnamed: 0,hospital_id,department_id,city,patient_age,risk_level,acute_type,patient_category,doctor_id,length_of_stay,wait_time,...,date,month,year,reason_for_readmission,disease,Actual_Flag,Predicted_Flag,Prediction_Probability,Actual_Readmission_Rate,Predicted_Readmission_Rate
0,9,7,Honolulu,27,2,Non Acute,InPatient,3421,4,50,...,2020-04-19 01:47:00,Apr,2020,,flu,False,False,0.27,11.24,6.97
1,9,7,Honolulu,25,1,Non Acute,InPatient,9757,4,50,...,2020-03-24 16:51:00,Mar,2020,,flu,False,False,0.24,10.55,4.81
2,9,3,Honolulu,72,5,Acute,InPatient,13356,9,42,...,2020-06-05 11:28:00,Jun,2020,,heart attack,False,False,0.44,11.99,2.98
3,9,3,Honolulu,62,3,Non Acute,InPatient,1569,13,41,...,2020-06-24 09:47:00,Jun,2020,,arrhythmia,False,False,0.28,11.99,2.98
4,9,3,Honolulu,85,2,Non Acute,InPatient,12910,12,33,...,2020-01-29 16:32:00,Jan,2020,,arrhythmia,False,False,0.25,8.16,8.36
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66529,9,7,Honolulu,33,4,Acute,InPatient,2532,4,46,...,2020-10-19 05:06:00,Oct,2020,Recurrence of a preexisting condition / infection,flu,True,False,0.45,8.16,2.47
66530,9,2,Honolulu,17,2,Non Acute,InPatient,2443,5,41,...,2020-11-30 02:55:00,Nov,2020,Failure to follow hospital discharge,scoliosis,True,False,0.24,7.83,7.52
66531,9,3,Honolulu,82,3,Non Acute,InPatient,9685,12,31,...,2020-11-20 22:07:00,Nov,2020,Medication errors or lack of accurate medicati...,arrhythmia,True,False,0.30,7.83,7.52
66532,9,1,Honolulu,85,5,Acute,InPatient,1900,8,34,...,2020-05-21 20:22:00,May,2020,Recurrence of a preexisting condition / infection,chemotherapy,True,False,0.46,13.27,9.81
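
The two sections above compute a per-month rate, then merge it back onto every row. A possible alternative, sketched here on a made-up frame, is `groupby(...).transform("mean")`, which returns one value per original row and so avoids the `reset_index`/`rename`/`merge` round trip. The toy `df` is hypothetical; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for one city's 2020 data.
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb", "Feb"],
    "Actual_Flag": [True, False, False, False, False, True],
    "Predicted_Flag": [True, True, False, False, True, True],
})

rate_columns = [
    ("Actual_Flag", "Actual_Readmission_Rate"),
    ("Predicted_Flag", "Predicted_Readmission_Rate"),
]

for flag_col, rate_col in rate_columns:
    # Cast flags to float, average within each month, and broadcast the
    # monthly mean back to every row of that month.
    monthly_mean = df[flag_col].astype(float).groupby(df["month"]).transform("mean")
    df[rate_col] = monthly_mean * 100

print(df)
```

The result has the same shape as the merged frames above: every row carries its month's actual and predicted rate.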


#### Upload predictions to storage account

In [89]:
final_dfs_list = list(final_dfs.values())
final_dfs_list[0]

Unnamed: 0,hospital_id,department_id,city,patient_age,risk_level,acute_type,patient_category,doctor_id,length_of_stay,wait_time,...,date,month,year,reason_for_readmission,disease,Actual_Flag,Predicted_Flag,Prediction_Probability,Actual_Readmission_Rate,Predicted_Readmission_Rate
0,1,7,Los Angeles,36,1,Non Acute,InPatient,12958,3,50,...,2020-03-13 13:38:00,Mar,2020,,flu,False,False,0.19,3.06,6.07
1,1,1,Los Angeles,55,5,Acute,InPatient,11834,8,32,...,2020-02-22 10:57:00,Feb,2020,,chemotherapy,False,False,0.40,1.66,0.11
2,1,7,Los Angeles,23,1,Non Acute,InPatient,757,4,49,...,2020-10-04 17:48:00,Oct,2020,,flu,False,False,0.18,1.69,7.66
3,1,7,Los Angeles,28,4,Acute,InPatient,2566,3,47,...,2020-10-21 08:52:00,Oct,2020,,flu,False,False,0.29,1.69,7.66
4,1,7,Los Angeles,37,5,Acute,InPatient,6153,3,50,...,2020-05-13 01:11:00,May,2020,,flu,False,False,0.42,5.57,0.11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194103,1,2,Los Angeles,11,4,Acute,InPatient,2455,6,41,...,2020-07-16 16:57:00,Jul,2020,Recurrence of a preexisting condition / infection,scoliosis,True,True,0.52,3.10,1.35
194104,1,7,Los Angeles,29,4,Acute,InPatient,6585,2,50,...,2020-02-01 16:25:00,Feb,2020,Poor coordination of care post Discharge,flu,True,False,0.44,1.66,0.11
194105,1,8,Los Angeles,38,5,Acute,InPatient,5400,14,40,...,2020-06-28 04:54:00,Jun,2020,Recurrence of a preexisting condition / infection,psoriasis,True,True,0.74,4.87,7.31
194106,1,7,Los Angeles,34,5,Acute,InPatient,8564,3,46,...,2020-03-13 08:38:00,Mar,2020,Failure to follow hospital discharge,sinusitis,True,True,0.56,3.06,6.07


In [90]:
full_df = pd.concat(final_dfs_list)
full_df

Unnamed: 0,hospital_id,department_id,city,patient_age,risk_level,acute_type,patient_category,doctor_id,length_of_stay,wait_time,...,date,month,year,reason_for_readmission,disease,Actual_Flag,Predicted_Flag,Prediction_Probability,Actual_Readmission_Rate,Predicted_Readmission_Rate
0,1,7,Los Angeles,36,1,Non Acute,InPatient,12958,3,50,...,2020-03-13 13:38:00,Mar,2020,,flu,False,False,0.19,3.06,6.07
1,1,1,Los Angeles,55,5,Acute,InPatient,11834,8,32,...,2020-02-22 10:57:00,Feb,2020,,chemotherapy,False,False,0.40,1.66,0.11
2,1,7,Los Angeles,23,1,Non Acute,InPatient,757,4,49,...,2020-10-04 17:48:00,Oct,2020,,flu,False,False,0.18,1.69,7.66
3,1,7,Los Angeles,28,4,Acute,InPatient,2566,3,47,...,2020-10-21 08:52:00,Oct,2020,,flu,False,False,0.29,1.69,7.66
4,1,7,Los Angeles,37,5,Acute,InPatient,6153,3,50,...,2020-05-13 01:11:00,May,2020,,flu,False,False,0.42,5.57,0.11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55816,27,6,Anchorage,27,3,Non Acute,InPatient,7145,3,39,...,2020-10-19 18:10:00,Oct,2020,Too early of hospital discharge,chronic headache,True,False,0.27,3.23,0.81
55817,27,5,Anchorage,71,5,Acute,InPatient,10230,4,44,...,2020-11-18 01:09:00,Nov,2020,Poor coordination of care post Discharge,endometriosis,True,False,0.47,2.83,4.12
55818,27,7,Anchorage,28,4,Acute,InPatient,5647,3,46,...,2020-10-19 02:45:00,Oct,2020,Poor coordination of care post Discharge,flu,True,False,0.44,3.23,0.81
55819,27,4,Anchorage,70,5,Acute,InPatient,11343,5,49,...,2020-09-25 20:10:00,Sep,2020,Failure to follow hospital discharge,maternity,True,False,0.48,4.26,2.18
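
Note that the concatenated frame keeps each city's own row labels, so the index restarts for every city (the last Anchorage label above is 55819, not the total row count). The sketch below, using two made-up frames, shows the difference `ignore_index=True` makes. The `city_frames` dictionary is hypothetical.

```python
import pandas as pd

# Hypothetical per-city frames, each with its own 0-based index.
city_frames = {
    "Los Angeles": pd.DataFrame(
        {"city": ["Los Angeles"] * 3, "Actual_Flag": [False, True, False]}
    ),
    "Anchorage": pd.DataFrame(
        {"city": ["Anchorage"] * 2, "Actual_Flag": [True, False]}
    ),
}

frames = list(city_frames.values())

stacked = pd.concat(frames)                        # labels repeat per city
renumbered = pd.concat(frames, ignore_index=True)  # labels run 0..n-1

print(stacked.index.is_unique)
print(renumbered.index.is_unique)
```

Because the CSV is later written with `index=False`, the repeated labels do not reach the exported file. They mainly matter for label-based lookups such as `.loc` on the combined frame.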


#### Change values for reason_for_readmission column

In [91]:
original_columns = full_df.columns

In [92]:
full_df['reason_for_readmission'].value_counts()

Failure to follow hospital discharge                        9274
Poor coordination of care post Discharge                    9179
Recurrence of a preexisting condition / infection           7178
Medication errors or lack of accurate medication history    7045
Too early of hospital discharge                             4615
Pneumonia                                                   1809
Name: reason_for_readmission, dtype: int64

In [93]:
def change_reason_for_readmission(x):
    """Map a verbose readmission reason to a short label; other values (including NaN) pass through."""
    if x == "Medication errors or lack of accurate medication history":
        return "Medication Errors"
    elif x == "Poor coordination of care post Discharge":
        return "Poor Coordination"
    elif x == "Failure to follow hospital discharge":
        return "Failure to Follow Up"
    elif x == "Recurrence of a preexisting condition / infection":
        return "Unfriendly Staff"
    elif x == "Pneumonia":
        return "High Wait Time"
    else:
        return x

In [94]:
full_df['reason_for_readmission_new'] = full_df['reason_for_readmission'].apply(change_reason_for_readmission)
full_df['reason_for_readmission_new'].value_counts()

Failure to Follow Up               9274
Poor Coordination                  9179
Unfriendly Staff                   7178
Medication Errors                  7045
Too early of hospital discharge    4615
High Wait Time                     1809
Name: reason_for_readmission_new, dtype: int64

In [95]:
full_df['reason_for_readmission'] = full_df['reason_for_readmission_new']
full_df['reason_for_readmission'].value_counts()

Failure to Follow Up               9274
Poor Coordination                  9179
Unfriendly Staff                   7178
Medication Errors                  7045
Too early of hospital discharge    4615
High Wait Time                     1809
Name: reason_for_readmission, dtype: int64

In [96]:
full_df = full_df[original_columns]
full_df.head()

Unnamed: 0,hospital_id,department_id,city,patient_age,risk_level,acute_type,patient_category,doctor_id,length_of_stay,wait_time,...,date,month,year,reason_for_readmission,disease,Actual_Flag,Predicted_Flag,Prediction_Probability,Actual_Readmission_Rate,Predicted_Readmission_Rate
0,1,7,Los Angeles,36,1,Non Acute,InPatient,12958,3,50,...,2020-03-13 13:38:00,Mar,2020,,flu,False,False,0.19,3.06,6.07
1,1,1,Los Angeles,55,5,Acute,InPatient,11834,8,32,...,2020-02-22 10:57:00,Feb,2020,,chemotherapy,False,False,0.4,1.66,0.11
2,1,7,Los Angeles,23,1,Non Acute,InPatient,757,4,49,...,2020-10-04 17:48:00,Oct,2020,,flu,False,False,0.18,1.69,7.66
3,1,7,Los Angeles,28,4,Acute,InPatient,2566,3,47,...,2020-10-21 08:52:00,Oct,2020,,flu,False,False,0.29,1.69,7.66
4,1,7,Los Angeles,37,5,Acute,InPatient,6153,3,50,...,2020-05-13 01:11:00,May,2020,,flu,False,False,0.42,5.57,0.11
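
The relabeling in this section could also be written as a lookup table passed to `Series.replace`, which removes the need for a temporary column and the column reordering step. This is a minimal sketch: the label pairs are copied from `change_reason_for_readmission`, while `REASON_LABELS` and the sample `reasons` series are hypothetical names introduced for illustration.

```python
import pandas as pd

# Short labels copied from change_reason_for_readmission.
REASON_LABELS = {
    "Medication errors or lack of accurate medication history": "Medication Errors",
    "Poor coordination of care post Discharge": "Poor Coordination",
    "Failure to follow hospital discharge": "Failure to Follow Up",
    "Recurrence of a preexisting condition / infection": "Unfriendly Staff",
    "Pneumonia": "High Wait Time",
}

# Hypothetical sample: a mapped value, an unmapped value, and a missing one.
reasons = pd.Series(["Pneumonia", "Too early of hospital discharge", None])

# Values found in the dictionary are replaced; everything else is left as-is.
relabeled = reasons.replace(REASON_LABELS)

print(relabeled)
```

As with the function version, unmapped reasons and missing values pass through unchanged.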


In [122]:
full_df.to_csv(local_data_folder + 'readmission_data_predictedv2.csv', index=False)

In [123]:
# Upload the prediction CSV to the root of the datastore
local_files = [local_data_folder + 'readmission_data_predictedv2.csv']
print(local_files)

dstore.upload_files(
    files=local_files,
    relative_root=local_data_folder,
    target_path='/',
    overwrite=True,
    show_progress=True
)

['readmission_data/readmission_data_predictedv2.csv']
Uploading an estimated of 1 files
Uploading readmission_data/readmission_data_predictedv2.csv
Uploaded readmission_data/readmission_data_predictedv2.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_readmission_prediction_store