## About the Data - Diabetes Dataset from 130 US Hospitals, 1999-2008
Data From: UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#<br>

Abstract: This data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes.<br>

Source: The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios (kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata Strack (strackb '@' vcu.edu). This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO).<br>

Data Set Information: The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.<br>
(1) It is an inpatient encounter (a hospital admission).<br>
(2) It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.<br>
(3) The length of stay was at least 1 day and at most 14 days.<br>
(4) Laboratory tests were performed during the encounter.<br>
(5) Medications were administered during the encounter.<br>
The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.<br>

## Final Project

* We will use this dataset to predict the likelihood of readmission.


In [2]:
### Import Dependencies
import numpy as np
import pandas as pd

### Set Display Options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth",0)
import warnings
warnings.filterwarnings("ignore")

## Data Dictionary & Mappings

In [3]:
### Read Data Dictionary
# data_dict = pd.read_csv("Resources/data_dictionary.csv")
# data_dict

In [4]:
### Read Data Mappings
# data_mapping_IDs = pd.read_csv("Resources/IDs_mapping.csv", encoding= "unicode_escape")
# data_mapping_IDs

drugs_mapping = pd.read_csv("Resources/data_dictionary_drugs.csv", encoding= "unicode_escape")
drugs_mapping.head()

Unnamed: 0,Data_Label,Drug_Class,Drug_Category,Drug_Description
0,acarbose,alpha-glucosidase inhibitor,inhibitor,Carbohydrates that are eaten are digested by enzymes in the intestine into smaller sugars which are absorbed into the body and increase blood sugar levels.
1,miglitol,alpha-glucosidase inhibitor,inhibitor,Miglitol is an oral medication used to control blood glucose (sugar) levels in type 2 diabetes.
2,metformin,biguanide,biguanide,"The term biguanide refers to a group of oral type 2 diabetes drugs that work by preventing the production of glucose in the liver, improving the body's sensitivity towards insulin and reducing the amount of sugar absorbed by the intestines."
3,metformin-pioglitazone,biguanide-sulfonylurea,combo drug therapy,Combo drug therapy
4,metformin-rosiglitazone,biguanide-thiazolidinedione,combo drug therapy,Combo drug therapy


In [5]:
### Read Diabetes Data
diabetes_df = pd.read_csv("Resources/diabetic_data.csv")
# diabetes_df.info()

In [6]:
diabetes_df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,?,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,?,?,1,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,?,?,59,0,18,0,0,0,276.0,250.01,255,9,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,?,?,11,5,13,2,0,1,648.0,250,V27,6,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,?,?,44,1,16,0,0,0,8.0,250.43,403,7,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,?,?,51,0,8,0,0,0,197.0,157,250,5,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes,NO


## Data Transformation

In [7]:
### Create Our Variables of Interest (Hospital Readmission)

### Binary Variable where 1 = readmit
diabetes_df["readmit"] = diabetes_df["readmitted"].map({"<30": 1, ">30": 1, "NO": 0})

### Binary Variable where 1 = readmit within 30 days of first hospitalization 
diabetes_df["readmit_<30_days"] = diabetes_df["readmitted"].map({"<30": 1, ">30": 0, "NO": 0})

In [8]:
### Recode Mapped ID Variables to their Original Categorical Descriptions 
diabetes_df["admission_source_id"] = diabetes_df["admission_source_id"].map({1: "Physician Referral", 2: "Clinic Referral",3: "HMO Referral",
                          4: "Transfer from a hospital", 5: "Transfer from a Skilled Nursing Facility (SNF)",
                          6: "Transfer from another health care facility", 7: "Emergency Room",
                          8: "Court/Law Enforcement", 9: "Not Available",10: "Transfer from critial access hospital",
                          11: "Normal Delivery",12: "Premature Delivery",13: "Sick Baby",14: "Extramural Birth",
                          15: "Not Available",17: "NULL",18: "Transfer From Another Home Health Agency",
                          19: "Readmission to Same Home Health Agency",20: "Not Mapped",21: "Unknown/Invalid",
                          22: "Transfer from hospital inpt/same fac reslt in a sep claim",23: "Born inside this hospital",
                          24: "Born outside this hospital",25: "Transfer from Ambulatory Surgery Center",
                          26: "Transfer from Hospice"})

diabetes_df["admission_type_id"] = diabetes_df["admission_type_id"].map({1: "Emergency", 2: "Urgent", 3:"Elective", 
                                4:"Newborn",5:"Not Available", 6:"NULL", 7:"Trauma Center", 8:"Not Mapped"})

diabetes_df["discharge_disposition_id"] = diabetes_df["discharge_disposition_id"].map({1: "Discharged to home",2: "Discharged/transferred to another short term hospital",
                                    3: "Discharged/transferred to SNF", 4 : "Discharged/transferred to ICF",
                                    5: "Discharged/transferred to another type of inpatient care institution",6: "Discharged/transferred to home with home health service",
                                    7: "Left AMA",8: "Discharged/transferred to home under care of Home IV provider",
                                    9: "Admitted as an inpatient to this hospital",10: "Neonate discharged to another hospital for neonatal aftercare",
                                    11: "Expired",12: "Still patient or expected to return for outpatient services",
                                    13: "Hospice / home",14: "Hospice / medical facility",15: "Discharged/transferred within this institution to Medicare approved swing bed",
                                    16: "Discharged/transferred/referred another institution for outpatient services",17: "Discharged/transferred/referred to this institution for outpatient services",
                                    18: "NULL",19: "Expired at home. Medicaid only, hospice.",
                                    20: "Expired in a medical facility. Medicaid only, hospice.",21: "Expired, place unknown. Medicaid only, hospice.",
                                    22: "Discharged/transferred to another rehab fac including rehab units of a hospital.",23: "Discharged/transferred to a long term care hospital.",
                                    24: "Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare.","25": "Not Mapped",
                                    26: "Unknown/Invalid","30": "Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere",
                                    27: "Discharged/transferred to a federal health care facility.",28: "Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital",
                                    29: "Discharged/transferred to a Critical Access Hospital (CAH)."})

### Recode Values of Categorical Variables for Simplification
diabetes_df["race"] = diabetes_df["race"].replace({"?": "Other/Unknown","Other": "Other/Unknown"})
diabetes_df["weight"] = diabetes_df["weight"].replace({"?": "Unknown"})

In [9]:
### Dropping Unused Variables - 

## We were unable to identify mappings/definitions for: number_outpatient, number_emergency, number_inpatient, change,payer_code,diag_1,diag_2,diag_3
## Drop "readmitted" because we've created our variables of interest from it already (it is no longer needed) 
## These are unique values not needed for this analysis: encounter_id,patient_nbr
diabetes_df = diabetes_df.drop(["number_outpatient", "number_emergency", "number_inpatient", "change","payer_code","diag_1",
                  "diag_2","diag_3","readmitted", "encounter_id","patient_nbr"], axis=1)

### Droping Individual Drugs Variables (we still have "diabetesMed" (yes/no))
diabetes_df = diabetes_df.drop(["acarbose","miglitol","metformin","metformin-pioglitazone","metformin-rosiglitazone",
                                "citoglipton","examide","insulin","repaglinide","nateglinide","acetohexamide",
                                "glipizide","chlorpropamide","glimepiride","glyburide","tolbutamide","tolazamide",
                                "glyburide-metformin","glipizide-metformin","glimepiride-pioglitazone","pioglitazone","rosiglitazone",
                                "troglitazone"], axis=1)

### Drop NULL Columns (if any)
diabetes_df = diabetes_df.dropna(axis="columns", how="all")

### Drop NULL Rows (if any)
diabetes_df = diabetes_df.dropna()

### Drop the 3 Observations (rows) where Gender is "Unknown/Invalid"
gender_unknown = diabetes_df[diabetes_df["gender"] == "Unknown/Invalid" ].index
diabetes_df = diabetes_df.drop(gender_unknown)

In [10]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100774 entries, 1 to 101765
Data columns (total 18 columns):
race                        100774 non-null object
gender                      100774 non-null object
age                         100774 non-null object
weight                      100774 non-null object
admission_type_id           100774 non-null object
discharge_disposition_id    100774 non-null object
admission_source_id         100774 non-null object
time_in_hospital            100774 non-null int64
medical_specialty           100774 non-null object
num_lab_procedures          100774 non-null int64
num_procedures              100774 non-null int64
num_medications             100774 non-null int64
number_diagnoses            100774 non-null int64
max_glu_serum               100774 non-null object
A1Cresult                   100774 non-null object
diabetesMed                 100774 non-null object
readmit                     100774 non-null int64
readmit_<30_days            

## Random Stratified Sample
* Dataset is large.
* We will take a random stratified sample of the dataset to use in model building.
* This needs to be done prior to converting categorical variables to binary/dummies.

In [11]:
## Source: https://stackoverflow.com/questions/44114463/stratified-sampling-in-pandas/44115314
## Function for Stratified Sample
def strat_sample(df, col, n_samples):
    n = min(n_samples, df[col].value_counts().min())
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    df_.index = df_.index.droplevel(0)
    return df_

### Stratefied Samples
diabetes_strat_race = strat_sample(diabetes_df, "race", 10000)
diabetes_strat_gender = strat_sample(diabetes_df, "gender", 10000)

In [12]:
### Convert Categorical Data To Dummy Variables
diabetes_strat_race = pd.get_dummies(diabetes_strat_race)
diabetes_strat_gender = pd.get_dummies(diabetes_strat_gender)

In [13]:
diabetes_strat_race.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3180 entries, 98834 to 31837
Columns: 121 entries, time_in_hospital to diabetesMed_Yes
dtypes: int64(7), uint8(114)
memory usage: 552.8 KB


In [14]:
diabetes_strat_gender.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 4494 to 10613
Columns: 150 entries, time_in_hospital to diabetesMed_Yes
dtypes: int64(7), uint8(143)
memory usage: 3.9 MB


In [15]:
### Export Diabetes Dataframe
diabetes_strat_race.to_csv("Resources/diabetes_strat_race.csv", encoding="utf-8", index=False)
diabetes_strat_gender.to_csv("Resources/diabetes_strat_gender.csv", encoding="utf-8", index=False)

## Ceating Dataframe for Logistic Classification

In [16]:
diabetes_df.head()

Unnamed: 0,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_diagnoses,max_glu_serum,A1Cresult,diabetesMed,readmit,readmit_<30_days
1,Caucasian,Female,[10-20),Unknown,Emergency,Discharged to home,Emergency Room,3,?,59,0,18,9,,,Yes,1,0
2,AfricanAmerican,Female,[20-30),Unknown,Emergency,Discharged to home,Emergency Room,2,?,11,5,13,6,,,Yes,0,0
3,Caucasian,Male,[30-40),Unknown,Emergency,Discharged to home,Emergency Room,2,?,44,1,16,7,,,Yes,0,0
4,Caucasian,Male,[40-50),Unknown,Emergency,Discharged to home,Emergency Room,1,?,51,0,8,5,,,Yes,0,0
5,Caucasian,Male,[50-60),Unknown,Urgent,Discharged to home,Clinic Referral,3,?,31,6,16,9,,,Yes,1,0


In [17]:
diabetes_top_features = diabetes_df.drop(["race","gender","age","weight","admission_type_id",
                                          "discharge_disposition_id","admission_source_id",
                                          "max_glu_serum", "A1Cresult","diabetesMed", "medical_specialty"], axis=1)

diabetes_top_features.head()

Unnamed: 0,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_diagnoses,readmit,readmit_<30_days
1,3,59,0,18,9,1,0
2,2,11,5,13,6,0,0
3,2,44,1,16,7,0,0
4,1,51,0,8,5,0,0
5,3,31,6,16,9,1,0


In [20]:
### Export Top Features Dataframe
diabetes_top_features.to_csv("Resources/diabetes_top_features.csv", encoding="utf-8", index=False)