This notebook takes the original hospital readmissions data (as downloaded directly from Kaggle) and cleans it into a format that can be used with sklearn's `RandomForestClassifier` (RFC).

The cleaned data is then saved as `readmissions_clean.csv`, which is the file used in the experiments.

In [5]:
# Load Data
import pandas as pd
readmissions = pd.read_csv("hospital_readmissions.csv")

The original data looks as follows:

In [6]:
readmissions.head(3)

Unnamed: 0,age,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,medical_specialty,diag_1,diag_2,diag_3,glucose_test,A1Ctest,change,diabetes_med,readmitted
0,[70-80),8,72,1,18,2,0,0,Missing,Circulatory,Respiratory,Other,no,no,no,yes,no
1,[70-80),3,34,2,13,0,0,0,Other,Other,Other,Other,no,no,no,yes,no
2,[50-60),5,45,0,18,0,0,0,Missing,Circulatory,Circulatory,Circulatory,no,no,yes,yes,yes


We start by converting the age column from a text interval into a number representing the decade a patient is in (e.g. `[70-80)` becomes 70).

In [7]:
# Create new age decade column
readmissions["age_decade"] = readmissions["age"].apply(lambda x: int(x[1:3]))
# Drop old column
readmissions = readmissions.drop(["age"], axis=1)
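
The slice `x[1:3]` relies on every label starting with a bracket followed by a two-digit lower bound. As a minimal sketch of a more defensive alternative, the hypothetical helper `age_label_to_decade` below (not part of the original notebook) extracts the first run of digits with a regular expression; the sample labels are illustrative only.

```python
import pandas as pd

def age_label_to_decade(labels: pd.Series) -> pd.Series:
    """Return the lower bound of interval labels such as "[70-80)" as integers."""
    # expand=False returns a Series holding the first capture group
    return labels.str.extract(r"(\d+)", expand=False).astype(int)

# Illustrative labels, not taken from the full dataset
sample = pd.Series(["[70-80)", "[50-60)", "[90-100)"])
decades = age_label_to_decade(sample)
print(decades.tolist())
```

Because the pattern matches the first digit run rather than fixed positions, a label with a one- or three-digit lower bound would still parse.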

We then do some other cleanup, replacing "Missing" with a value that pandas recognizes as missing. Then, since we only need some of the columns for our analysis, we drop the three diagnosis columns and the medical specialty column for simplicity: they have many categories, which would greatly complicate encoding the data for RFC.

In [8]:
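# Replace "Missing" with NaN. An explicit NaN is used because pandas versions
# before 1.4 ignore value=None here and forward-fill the previous value instead.
import numpy as np
readmissions = readmissions.replace("Missing", np.nan)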
# Drop complex columns
readmissions = readmissions.drop(["diag_1", "diag_2", "diag_3", "medical_specialty"], axis=1)
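
To see which columns are affected by this step, one option is to count missing values and distinct categories per column before dropping anything. The sketch below assumes a toy frame built from the three rows shown earlier; the names `toy`, `missing_per_column`, and `categories_per_column` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame using values from the three rows displayed above
toy = pd.DataFrame({
    "medical_specialty": ["Missing", "Other", "Missing"],
    "diag_1": ["Circulatory", "Other", "Circulatory"],
    "n_procedures": [1, 2, 0],
})

# Mark the placeholder string as a real missing value
cleaned = toy.replace("Missing", np.nan)

# Missing values per column
missing_per_column = cleaned.isna().sum()

# Distinct non-missing categories in each non-numeric column
categories_per_column = cleaned.select_dtypes(exclude="number").nunique()

print(missing_per_column)
print(categories_per_column)
```

On the full dataset, columns with a large category count in the second summary are the ones that would expand into many indicator columns if they were one-hot encoded.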

In [9]:
readmissions.head(3)

Unnamed: 0,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,glucose_test,A1Ctest,change,diabetes_med,readmitted,age_decade
0,8,72,1,18,2,0,0,no,no,no,yes,no,70
1,3,34,2,13,0,0,0,no,no,no,yes,no,70
2,5,45,0,18,0,0,0,no,no,yes,yes,yes,50


Next, we convert the glucose test and A1C test variables from categorical variables into indicator variables: each indicator is 1 when the test result is "high" and 0 for any other value.

In [10]:
# Utility function for helping create a new variable which is an indicator for high glucose
def high_glucose(row):
    if row["glucose_test"] == "high":
        val = 1
    else:
        val = 0
    return val

# Utility function for helping create a new variable which is an indicator for high A1C
def high_A1C(row):
    if row["A1Ctest"] == "high":
        val = 1
    else:
        val = 0
    return val

In [11]:
# Use utility functions from above to create new indicators
readmissions["high_glucose"] = readmissions.apply(high_glucose, axis=1)
readmissions["high_A1C"] = readmissions.apply(high_A1C, axis=1)
# Drop original variables
readmissions = readmissions.drop(["glucose_test", "A1Ctest"], axis=1)
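
The same indicators can also be built without a row-wise `apply`, by comparing the whole column at once. This is a sketch of an equivalent vectorized form on a toy frame; the values "normal" and "high" in `toy` are illustrative assumptions, since only "no" appears in the rows displayed above.

```python
import pandas as pd

# Toy frame with illustrative test results
toy = pd.DataFrame({
    "glucose_test": ["no", "normal", "high"],
    "A1Ctest": ["high", "no", "normal"],
})

# Compare each column to "high", then cast True/False to 1/0
toy["high_glucose"] = (toy["glucose_test"] == "high").astype(int)
toy["high_A1C"] = (toy["A1Ctest"] == "high").astype(int)

# Drop the original text columns
toy = toy.drop(["glucose_test", "A1Ctest"], axis=1)
print(toy)
```

The column-wise comparison produces the same 0/1 values as the utility functions and is typically faster on large frames, because it avoids calling a Python function once per row.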

Finally, we convert the remaining "no" and "yes" values to 0 and 1, since RFC expects numeric features rather than text.

In [12]:
# Replace "no"/"yes" with 0/1
readmissions = readmissions.replace("no", 0).replace("yes", 1)
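
A frame-wide `replace` touches every column. A more targeted variant, sketched here on a toy frame with values from the rows shown earlier, maps only the columns known to hold "no"/"yes"; the list `yes_no_cols` is an assumption based on the displayed columns.

```python
import pandas as pd

# Toy frame using values from the three rows displayed above
toy = pd.DataFrame({
    "change": ["no", "no", "yes"],
    "diabetes_med": ["yes", "yes", "yes"],
    "readmitted": ["no", "no", "yes"],
    "n_procedures": [1, 2, 0],
})

# Columns assumed to contain only "no"/"yes"
yes_no_cols = ["change", "diabetes_med", "readmitted"]
for col in yes_no_cols:
    # Values outside the mapping become NaN, which exposes unexpected labels
    toy[col] = toy[col].map({"no": 0, "yes": 1})

print(toy)
```

Unlike `replace`, `map` turns any label missing from the dictionary into NaN, so an unexpected category shows up as a missing value instead of silently staying as text.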

The cleaned data now looks as follows:

In [13]:
readmissions.head(3)

Unnamed: 0,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,change,diabetes_med,readmitted,age_decade,high_glucose,high_A1C
0,8,72,1,18,2,0,0,0,1,0,70,0,0
1,3,34,2,13,0,0,0,0,1,0,70,0,0
2,5,45,0,18,0,0,0,1,1,1,50,0,0


In [31]:
# Save data to CSV
readmissions.to_csv("readmissions_clean.csv", index=False)
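
Since the cleaned file is meant for RFC, it can be worth confirming that no text columns remain. The hypothetical helper `non_numeric_columns` below is a minimal sketch of such a check, run here on a few columns from the three cleaned rows shown above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return [col for col in df.columns if not is_numeric_dtype(df[col])]

# A subset of the cleaned columns, using the three rows displayed above
toy = pd.DataFrame({
    "time_in_hospital": [8, 3, 5],
    "change": [0, 0, 1],
    "readmitted": [0, 0, 1],
    "age_decade": [70, 70, 50],
    "high_glucose": [0, 0, 0],
})

leftover = non_numeric_columns(toy)
print(leftover)
```

An empty list means every column is numeric; any name in the list points to a column that still needs encoding.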