# Exploratory Data Analysis

## Goal

The overall goal of this project is to build a binary classifier that predicts survival for patients who have experienced cardiac arrest, using the 2011 Texas hospital discharge public-use data set.

## Merge data into single dataframe, create predictor groups

In [1]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt

%matplotlib inline

First, let's load in the pickled dataframes for each quarter's cardiac arrest patients.

In [2]:
with open('data/cardiac_arrest_dfs.pkl', 'rb') as picklefile:
    [q1_df, q2_df, q3_df, q4_df] = pickle.load(picklefile)

In [12]:
# Let's confirm that all 4 dataframes have the same columns. They do; the symmetric difference between
# the sets of columns is empty for each pair of dataframes (a plain difference only checks one direction).

print(set(q1_df.columns).symmetric_difference(q2_df.columns))
print(set(q1_df.columns).symmetric_difference(q3_df.columns))
print(set(q1_df.columns).symmetric_difference(q4_df.columns))

print(set(q2_df.columns).symmetric_difference(q3_df.columns))
print(set(q2_df.columns).symmetric_difference(q4_df.columns))

print(set(q3_df.columns).symmetric_difference(q4_df.columns))

set()
set()
set()
set()
set()
set()
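The pairwise column check can also be written as a loop over all pairs of frames. The following is a minimal sketch on toy frames; the helper name `column_mismatches` and the toy columns are assumptions for illustration, not part of the original notebook.

```python
from itertools import combinations

import pandas as pd

# Toy stand-ins for the quarterly frames (hypothetical data).
frames = {
    "q1": pd.DataFrame({"record_id": [1], "pat_age": [10]}),
    "q2": pd.DataFrame({"record_id": [2], "pat_age": [11]}),
    "q3": pd.DataFrame({"record_id": [3], "pat_status": ["20"]}),
}


def column_mismatches(named_frames):
    """Return {(name_a, name_b): columns present in only one of the two frames}."""
    mismatches = {}
    for (name_a, df_a), (name_b, df_b) in combinations(named_frames.items(), 2):
        # Symmetric difference catches extra columns on either side.
        diff = set(df_a.columns) ^ set(df_b.columns)
        if diff:
            mismatches[(name_a, name_b)] = diff
    return mismatches


mismatches = column_mismatches(frames)
print(mismatches)
```

An empty result means every pair of frames shares exactly the same set of columns, so the frames can be concatenated without introducing all-missing columns.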


In [16]:
# Let's merge all of the dataframes together to get the dataframe for the year.
master_2011_df = pd.concat([q1_df, q2_df, q3_df, q4_df], ignore_index=True)

master_2011_df.shape

(13183, 194)

In [20]:
# Let's look for any duplicate encounters, which means any rows where the record_id is the same.
# This is an empty dataframe, which means that all of the record_ids are unique.
master_2011_df[master_2011_df.duplicated(subset='record_id')]

Unnamed: 0,record_id,discharge,thcic_id,provider_name,type_of_admission,source_of_admission,spec_unit_1,spec_unit_2,spec_unit_3,spec_unit_4,...,apr_drg,risk_mortality,illness_severity,apr_grouper_version_nbr,apr_grouper_error_code,attending_physician_unif_id,operating_physician_unif_id,encounter_indicator,cert_status,filler_space
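As a complement to eyeballing an empty dataframe, uniqueness can be checked with a boolean. This is a small sketch on toy data (the frame below is hypothetical); `keep=False` marks every copy of a repeated id rather than only the later ones.

```python
import pandas as pd

# Hypothetical toy frame with one repeated record_id.
toy_df = pd.DataFrame({
    "record_id": ["a1", "a2", "a2", "a3"],
    "pat_age": [5, 7, 7, 9],
})

# True only if no record_id value occurs more than once.
all_unique = toy_df["record_id"].is_unique

# keep=False flags all rows that share a record_id, including the first occurrence.
dupes = toy_df[toy_df.duplicated(subset="record_id", keep=False)]
print(all_unique)
print(dupes)
```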


Let's separate out the columns that we will use as predictors, and also the column that we will use as a target.

Target: patient status (at time of discharge)

Predictor group 1: Personal demographics
* Age
* Gender
* Race
* Ethnicity
* Primary payer (Medicare, Medicaid, private insurance, etc.)
* Patient zip code/county

Predictor group 2: Details about hospital stay
* Day of the week the patient was admitted (does mortality differ between weekday and weekend admissions?)
* Type of hospital the patient was admitted to (academic vs private vs community vs critical access)
* Length of stay
* Type of admission (urgent vs emergent vs elective)
* Source of admission

Predictor group 3: Medical/procedural
* In-hospital vs out-of-hospital cardiac arrest. Presumably, patients with an admitting diagnosis of cardiac arrest experienced their arrest out of hospital. This may be inaccurate for patients transferred from another hospital, so the source of admission needs to be checked as well.
* Other associated diagnosis codes, e.g. diabetes, heart failure, etc.
* Other associated procedural codes, e.g. heart surgery, mechanical ventilation, etc.
* Can separate out the associated diagnosis codes to distinguish between the medical conditions that are present on arrival (e.g. chronic conditions) vs medical conditions that develop during the hospital stay.
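The last bullet (separating conditions present on arrival from those that develop during the stay) could be sketched as follows. This assumes each diagnosis column has a matching `poa_`-prefixed indicator column, as the column names suggest, and that the value `"Y"` means present on arrival. Both are assumptions to verify against the data dictionary. The toy data and the helper `split_by_poa` are hypothetical.

```python
import pandas as pd

# Toy layout mimicking the oth_diag_code_N / poa_oth_diag_code_N pairing.
toy = pd.DataFrame({
    "oth_diag_code_1": ["25000", "51881"],
    "poa_oth_diag_code_1": ["Y", "N"],
    "oth_diag_code_2": ["4280", None],
    "poa_oth_diag_code_2": ["Y", None],
})

code_cols = [c for c in toy.columns if c.startswith("oth_diag_code_")]


def split_by_poa(row, present_value="Y"):
    """Split one record's diagnosis codes by the assumed present-on-arrival flag."""
    present, acquired = [], []
    for col in code_cols:
        code = row[col]
        if pd.isna(code):
            continue  # unused diagnosis slot
        if row["poa_" + col] == present_value:
            present.append(code)
        else:
            acquired.append(code)
    return pd.Series({"poa_codes": present, "hospital_codes": acquired})


split = toy.apply(split_by_poa, axis=1)
print(split)
```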

In [107]:
target = 'pat_status'

index = 'record_id'

personal_demographic_predictors = ['pat_age', 'sex_code', 'race', 'ethnicity', 'pat_state', 'pat_zip', 'pat_country', 'county', 'first_payment_src', 'secondary_payment_src']
hospital_stay_predictors = ['provider_name', 'type_of_admission', 'source_of_admission', 'admit_weekday', 'length_of_stay', 'type_of_bill']

diag_codes_predictors = ['admitting_diagnosis']
diag_codes_predictors.extend([col for col in master_2011_df.columns if 'diag_code' in col])

e_code_predictors = [col for col in master_2011_df.columns if 'e_code' in col]

proc_code_predictors = ['princ_surg_proc_code', 'princ_surg_proc_day', 'princ_icd9_code']
proc_code_predictors.extend([col for col in master_2011_df.columns if 'oth_surg' in col or 'oth_icd9' in col])
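Because some predictor lists are typed by hand and others are built by substring matching, it may be worth checking that every listed column exists and that no column lands in two groups. The sketch below uses a hypothetical helper, `check_predictor_groups`, and toy column names.

```python
import pandas as pd


def check_predictor_groups(df, groups):
    """Return (missing, overlapping) for a dict of group name -> list of column names."""
    missing = {}
    seen = set()
    overlapping = set()
    for name, cols in groups.items():
        # Columns named in the group but absent from the dataframe.
        absent = [c for c in cols if c not in df.columns]
        if absent:
            missing[name] = absent
        # Columns already claimed by an earlier group.
        overlapping.update(seen.intersection(cols))
        seen.update(cols)
    return missing, overlapping


# Hypothetical toy frame and groups.
toy_df = pd.DataFrame(columns=["pat_age", "sex_code", "length_of_stay"])
toy_groups = {
    "demographics": ["pat_age", "sex_code", "race"],
    "hospital_stay": ["length_of_stay", "pat_age"],
}

missing, overlapping = check_predictor_groups(toy_df, toy_groups)
print(missing, overlapping)
```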

In [94]:
# Relabel the status codes with their descriptions (pat_status_dict is defined in cell In [93] below).
z = master_2011_df[target].value_counts()
z.index = [pat_status_dict[i] for i in z.index]

In [95]:
z

Expired                                                                                              8183
Discharged to home or self-care (routine discharge)                                                  1643
Discharged/transferred to Medicare-certified long term care hospital                                  794
Discharged to skilled nursing facility                                                                565
Discharged to care of home health service                                                             486
Discharged to hospice–medical facility                                                                425
Discharged to other short term general hospital                                                       368
Discharged/transferred to inpatient rehabilitation facility                                           353
Discharged to hospice–home                                                                            112
Discharged to intermediate care facility      

In [93]:
pat_status_dict = {
    "01": "Discharged to home or self-care (routine discharge)",
    "02": "Discharged to other short term general hospital", 
    "03": "Discharged to skilled nursing facility",
    "04": "Discharged to intermediate care facility",
    "05": "Discharged/transferred to a Designated Cancer Center or Children's Hospital (effective 10-1-2007)",
    "06": "Discharged to care of home health service",
    "07": "Left against medical advice",
    "08": "Discharged to care of Home IV provider",
    "09": "Admitted as inpatient to this hospital",
    "20": "Expired",
    "30": "Still patient",
    "40": "Expired at home",
    "41": "Expired in a medical facility",
    "42": "Expired, place unknown",
    "43": "Discharged/transferred to federal health care facility",
    "50": "Discharged to hospice–home",
    "51": "Discharged to hospice–medical facility",
    "61": "Discharged/transferred within this institution to Medicare-approved swing bed",
    "62": "Discharged/transferred to inpatient rehabilitation facility",
    "63": "Discharged/transferred to Medicare-certified long term care hospital",
    "64": "Discharged/transferred to Medicaid-certified nursing facility",
    "65": "Discharged/transferred to psychiatric hospital or psychiatric distinct part of a hospital",
    "66": "Discharged/transferred to Critical Access Hospital (CAH)",
    "71": "Discharged/transferred to other outpatient service",
    "72": "Discharged/transferred to institution outpatient",
    "`": "Invalid"
}
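A lookup dictionary like this one can also be applied to the counts without an explicit loop, and with a fallback label for codes missing from the dictionary. This is a minimal sketch on toy data; the two-entry `labels` dictionary and the code `"99"` are illustrative only.

```python
import pandas as pd

# Shortened, illustrative version of the code -> description lookup.
labels = {
    "01": "Discharged to home or self-care (routine discharge)",
    "20": "Expired",
}

codes = pd.Series(["20", "01", "20", "99"], name="pat_status")
counts = codes.value_counts()

# dict.get supplies a readable fallback instead of raising KeyError on unknown codes.
labeled = counts.rename(index=lambda c: labels.get(c, f"Unknown code {c}"))
print(labeled)
```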

## Transform columns

We now need to transform the dataframe's features into a form that our models can consume.

### Target variable transformation:

First, we'll tackle the target variable. For a minimum viable product, we need to convert pat_status into a binary target and keep only the records whose pat_status value falls into one of the two classes.

Reference: https://erikrood.com/Python_References/rows_cols_python.html

To select rows whose column value appears in a list of values (called array here), use isin:

    array = ['yellow', 'green']
    df.loc[df['favorite_color'].isin(array)]

In [134]:
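target_binary_dict = {
    # Hospice discharges ("50", "51") are grouped with the expired patients.
    # "42" (Expired, place unknown) is included alongside the other expired codes.
    "Expired": ["20", "40", "41", "42", "50", "51"],
    "Alive": ["01", "03", "04", "06", "07", "62", "63", "64"]
}

target_ternary_dict = {
    "Expired": ["20", "40", "41", "42", "50", "51"],
    # "06" (home health service) is listed under "Home" only, so that no code maps to two classes.
    "Facility": ["03", "04", "62", "63", "64"],
    "Home": ["01", "06", "07"]
}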
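An alternative to filtering once per class is to invert the label-to-codes dictionary and map every status code in a single pass. The sketch below is one possible approach, not the notebook's method: the helper `invert_label_dict` is hypothetical and the code lists are shortened toy values. It also raises an error if a code is listed under two labels.

```python
import pandas as pd


def invert_label_dict(label_to_codes):
    """Turn {label: [codes]} into {code: label}, rejecting codes listed twice."""
    code_to_label = {}
    for label, codes in label_to_codes.items():
        for code in codes:
            if code in code_to_label:
                raise ValueError(f"code {code!r} is listed under two labels")
            code_to_label[code] = label
    return code_to_label


# Shortened toy version of a label -> codes dictionary.
toy_label_dict = {"Expired": ["20", "41"], "Alive": ["01", "03"]}
code_to_label = invert_label_dict(toy_label_dict)

status = pd.Series(["20", "01", "30", "03", "41"], name="pat_status")

# Codes outside both classes (here "30") map to NaN and are dropped.
binary = status.map(code_to_label).dropna()
print(binary)
```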

In [146]:
# Copy the filtered frames so that pandas doesn't raise a SettingWithCopyWarning when we overwrite the target column.
expired_df = master_2011_df[master_2011_df[target].isin(target_binary_dict['Expired'])].copy()
alive_df = master_2011_df[master_2011_df[target].isin(target_binary_dict['Alive'])].copy()

In [147]:
# Now, let's set the value of target/pat_status for expired_df to 'expired' and for alive_df to 'alive'.

expired_df[target] = "expired"
alive_df[target] = "alive"

In [154]:
# Let's concatenate these frames back together. We can also set the record number as the index.
binary_df = pd.concat([expired_df, alive_df], ignore_index=True)
binary_df.set_index(index, inplace=True)

In [156]:
# Now we can set y as equal to the target column of binary_df, and this is a vector of 'alive' and 'expired'.
y = binary_df[target]
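Many estimators and metrics are easier to use with a numeric target, and it helps to know the class balance before modelling. The sketch below encodes a toy label vector as 0/1. Treating "expired" as the positive class is an assumption made here for illustration.

```python
import pandas as pd

# Toy label vector in the same form as y above.
y_toy = pd.Series(["expired", "alive", "expired", "expired"], name="pat_status")

# Assumption: "expired" is the positive class (1), "alive" the negative class (0).
y_encoded = (y_toy == "expired").astype(int)

# Share of each class, useful for spotting class imbalance.
class_share = y_toy.value_counts(normalize=True)
print(y_encoded.tolist())
print(class_share)
```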

In [158]:
diag_codes_predictors

['admitting_diagnosis',
 'princ_diag_code',
 'poa_princ_diag_code',
 'oth_diag_code_1',
 'poa_oth_diag_code_1',
 'oth_diag_code_2',
 'poa_oth_diag_code_2',
 'oth_diag_code_3',
 'poa_oth_diag_code_3',
 'oth_diag_code_4',
 'poa_oth_diag_code_4',
 'oth_diag_code_5',
 'poa_oth_diag_code_5',
 'oth_diag_code_6',
 'poa_oth_diag_code_6',
 'oth_diag_code_7',
 'poa_oth_diag_code_7',
 'oth_diag_code_8',
 'poa_oth_diag_code_8',
 'oth_diag_code_9',
 'poa_oth_diag_code_9',
 'oth_diag_code_10',
 'poa_oth_diag_code_10',
 'oth_diag_code_11',
 'poa_oth_diag_code_11',
 'oth_diag_code_12',
 'poa_oth_diag_code_12',
 'oth_diag_code_13',
 'poa_oth_diag_code_13',
 'oth_diag_code_14',
 'poa_oth_diag_code_14',
 'oth_diag_code_15',
 'poa_oth_diag_code_15',
 'oth_diag_code_16',
 'poa_oth_diag_code_16',
 'oth_diag_code_17',
 'poa_oth_diag_code_17',
 'oth_diag_code_18',
 'poa_oth_diag_code_18',
 'oth_diag_code_19',
 'poa_oth_diag_code_19',
 'oth_diag_code_20',
 'poa_oth_diag_code_20',
 'oth_diag_code_21',
 'poa_oth

The important predictors are likely to differ between in-hospital and out-of-hospital cardiac arrests. One way to account for this is to split the two populations manually.
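A manual split might look like the sketch below. It rests on two loudly labeled assumptions: that cardiac arrest appears in admitting_diagnosis as `"4275"` (ICD-9-CM 427.5 stored without the decimal point), and that transfers can be identified from source_of_admission. The transfer codes used here are hypothetical placeholders to be replaced from the data dictionary.

```python
import pandas as pd

# Assumption: ICD-9-CM 427.5 (cardiac arrest) stored without the decimal point.
CARDIAC_ARREST_CODE = "4275"
# Hypothetical placeholder codes for "transferred from another facility".
TRANSFER_SOURCES = {"4", "5", "6"}

toy = pd.DataFrame({
    "admitting_diagnosis": ["4275", "41071", "4275", "486"],
    "source_of_admission": ["1", "1", "4", "7"],
})

admitted_with_arrest = toy["admitting_diagnosis"] == CARDIAC_ARREST_CODE
transferred_in = toy["source_of_admission"].isin(TRANSFER_SOURCES)

# Default to in-hospital, then relabel the records admitted with an arrest.
toy["arrest_setting"] = "in_hospital"
toy.loc[admitted_with_arrest & ~transferred_in, "arrest_setting"] = "out_of_hospital"
toy.loc[admitted_with_arrest & transferred_in, "arrest_setting"] = "uncertain_transfer"
print(toy)
```

Keeping transferred patients in a separate "uncertain" bucket avoids guessing where their arrest happened.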

The columns that I would be interested in as predictors:
* provider_name (also the same as thcic_id)
* type_of_admission (exclude 4, which is newborn; 9, which is unknown, is probably okay to include)
* source_of_admission
* pat_state (can likely separate into in-state vs out-of-state patients)
* pat_zip (patient zip code)
* county (fips code)
* public_health_region
* sex_code
* race
* ethnicity
* admit_weekday (weekend vs weekday)
* length_of_stay
* pat_age (patient age on day of discharge)
* first_payment_src
* secondary_payment_src
* admitting_diagnosis (this would be how you would split the data between out-of-hospital vs in-hospital cardiac arrests)
* all of the other diagnosis codes (these identify the comorbidities for these patients; the most common (chronic?) comorbidities could be used as predictors, i.e. flag variables for the presence or absence of each one in any of the diagnosis code columns; the dataframe will need to be rearranged to build these flags)
* e-code (check whether particular external injury codes come up frequently, and if so use those as flag variables)
* procedure codes (use the procedures most frequently performed on these patients as flags; these mainly serve as a surrogate marker for disease severity, e.g. a patient who requires a cardiac cath is sick, and one who requires a CABG is probably sicker). With more time, the fields that record the day on which each procedure occurred could also be tied in.
* MS-MDC (Major Diagnostic Category; not sure whether it adds information beyond the diagnosis codes, but maybe)
* MS-DRG (diagnosis-related group)
* risk_mortality (risk of mortality score from the All Patient Refined (APR) Diagnosis Related Group (DRG), assigned by the 3M APR-DRG Grouper; indicates the likelihood of dying)
* illness_severity (severity of illness score from the APR-DRG, assigned by the 3M APR-DRG Grouper; indicates the extent of physiologic decompensation)
* attending_physician
* operating_physician
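The flag variables described for the diagnosis codes could be built roughly as follows. This is a sketch under stated assumptions: codes are ICD-9-CM strings without decimal points, the first three characters identify the category (250 for diabetes mellitus, 428 for heart failure), and the helper `flag_by_category` and toy data are hypothetical.

```python
import pandas as pd

# Toy records with a few diagnosis code columns.
toy = pd.DataFrame({
    "princ_diag_code": ["4275", "41071", "4275"],
    "oth_diag_code_1": ["25000", "4280", "51881"],
    "oth_diag_code_2": ["4280", None, "25012"],
})
diag_cols = ["princ_diag_code", "oth_diag_code_1", "oth_diag_code_2"]

# Illustrative comorbidity -> ICD-9 category prefixes.
comorbidity_categories = {"diabetes": ["250"], "heart_failure": ["428"]}


def flag_by_category(df, cols, categories):
    """True for rows where any listed column holds a code in the given categories."""
    matches = df[cols].apply(lambda s: s.str[:3].isin(categories))
    return matches.any(axis=1)


flags = pd.DataFrame({
    name: flag_by_category(toy, diag_cols, cats)
    for name, cats in comorbidity_categories.items()
})
print(flags)
```

The same pattern would apply to e-codes and procedure codes, with category prefixes chosen after looking at the most frequent values.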

Target variable:
* pat_status: for a binary classifier, separate it into expired (20, 40, 41, 42) or hospice (50, 51) vs alive at discharge. A three-class split is probably more useful: expired, discharged home, and discharged to some other facility (most likely a SNF or inpatient rehabilitation facility). A first step is to pull all the rows where the patient experienced a cardiac arrest and look at the distribution of patient status.


In [132]:
t_df['oth_diag_code_1'].value_counts()

51881    60
3481     11
5856      9
5849      7
2762      5
3485      5
570       5
0389      4
5845      4
25000     4
5185      3
486       3
42741     2
4280      1
43491     1
5789      1
25012     1
34590     1
41011     1
4233      1
4254      1
4019      1
41400     1
41519     1
9916      1
41071     1
V667      1
4413      1
34831     1
56089     1
5184      1
42781     1
2760      1
86350     1
20300     1
34830     1
29040     1
9971      1
41401     1
5990      1
4255      1
7802      1
7854      1
430       1
2639      1
49392     1
V4511     1
5070      1
1550      1
496       1
042       1
41090     1
7530      1
78001     1
3453      1
Name: oth_diag_code_1, dtype: int64