# Exploratory Data Analysis of Hospital Readmission Data for Diabetes Patients

The Centers for Medicare & Medicaid Services (CMS), which is part of the Department of Health and Human Services (HHS), has created many programs to improve the quality of patient care as the healthcare system moves toward value-based care. One of them, the Hospital Readmissions Reduction Program (HRRP), reduces reimbursement to hospitals with above-average readmissions. For hospitals that are currently penalized under this program, one solution is to create interventions that provide additional assistance to patients at increased risk of readmission.
I propose to use predictive modeling to help identify patients at risk of hospital readmission.

Two datasets are available for this project:
1) Diabetic data, with all the details of the patients being admitted, and
2) IDS Mapping, which holds the mapping values for some of the columns in the diabetic data

I. Cleaning and Consolidating the Data: To consolidate the patient data, some unnecessary attributes were dropped to reduce dimensionality.
II. Missing Values: Some columns in the dataset had missing values, and a few inconsistencies in notation were adjusted for ease of future analysis.
    • dropna and fillna were used case by case: rows were dropped when the number of affected rows was insignificant, and missing values were replaced with the mean where applicable.
III. Outliers: There were few significant outliers that needed to be addressed.
We had different types of data: numerical and categorical.
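
The dropna/fillna strategy described above can be sketched on a small toy frame. This is only an illustration: the values below are made up, and the choice of which column to drop on and which to fill is an assumption, not the exact rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "diag_1": ["250", None, "428", "414"],
    "num_lab_procedures": [41.0, 59.0, np.nan, 44.0],
})

# Drop rows where a rarely missing column is null
toy_clean = toy.dropna(subset=["diag_1"]).copy()

# Replace the remaining numeric gaps with the column mean
mean_labs = toy_clean["num_lab_procedures"].mean()
toy_clean["num_lab_procedures"] = toy_clean["num_lab_procedures"].fillna(mean_labs)

print(toy_clean)
```

Dropping is reasonable only when few rows are affected; mean imputation keeps the row count but shrinks the variance of the filled column.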

To apply any model, preprocessing of the data is essential:
• Dealing with missing values by dropping the columns that had too many of them.
• Transforming the data, for example by standardization or a log transform.
• Converting categorical variables such as Readmitted into dummy variables suitable for ML techniques.
The challenge was that there were too many variables, so cleaning and making sense of the data was difficult, but a step-by-step approach helps!
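
Two of the transformations mentioned above, the log transform and dummy variables, can be sketched as follows. The toy values are hypothetical, and applying `log1p` to `number_inpatient` is an assumption chosen for illustration rather than a step taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with a skewed count and a categorical target (hypothetical values)
toy = pd.DataFrame({
    "number_inpatient": [0, 1, 0, 12],
    "readmitted": ["NO", ">30", "<30", "NO"],
})

# log1p computes log(1 + x), so it is defined at zero and compresses the long right tail
toy["number_inpatient_log"] = np.log1p(toy["number_inpatient"])

# One indicator column per category of 'readmitted'
dummies = pd.get_dummies(toy["readmitted"], prefix="readmitted")
toy = pd.concat([toy, dummies], axis=1)

print(toy.columns.tolist())
```

Each row has exactly one active indicator, so one of the dummy columns is redundant and can be dropped for models that are sensitive to collinearity.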


In [2]:
# Initial libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Preprocessing and train/test splitting
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split


## Dataset Exploration

Initial examination of the CSV file shows that null values are indicated by '?'.

In [3]:
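# Load the data indexed by encounter_id; '?' entries are parsed as NaN
diabetic_data = pd.read_csv('diabetic_data.csv', index_col='encounter_id', na_values="?", low_memory=False)
# Second load of the same file, without setting an index
eda_data = pd.read_csv('diabetic_data.csv', na_values="?", low_memory=False)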

In [4]:
print('Number of samples:',len(diabetic_data))

Number of samples: 101766


In [5]:
diabetic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101766 entries, 2278392 to 443867222
Data columns (total 49 columns):
patient_nbr                 101766 non-null int64
race                        99493 non-null object
gender                      101766 non-null object
age                         101766 non-null object
weight                      3197 non-null object
admission_type_id           101766 non-null int64
discharge_disposition_id    101766 non-null int64
admission_source_id         101766 non-null int64
time_in_hospital            101766 non-null int64
payer_code                  61510 non-null object
medical_specialty           51817 non-null object
num_lab_procedures          101766 non-null int64
num_procedures              101766 non-null int64
num_medications             101766 non-null int64
number_outpatient           101766 non-null int64
number_emergency            101766 non-null int64
number_inpatient            101766 non-null int64
diag_1                      1

In [6]:
diabetic_data.shape

(101766, 49)

In [7]:
# The column 'readmitted' indicates whether a patient was readmitted within 30 days (<30), after more than 30 days (>30), or not readmitted (NO).

diabetic_data.groupby('readmitted').size()

readmitted
<30    11357
>30    35545
NO     54864
dtype: int64
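
The class sizes above can be expressed as percentages to make the imbalance explicit. A minimal sketch, with the counts copied from the output above into a hypothetical `counts` Series:

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series({"<30": 11357, ">30": 35545, "NO": 54864}, name="readmitted")

# Share of each class as a percentage of all encounters
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

The `<30` class is the smallest of the three, which is worth keeping in mind when choosing evaluation metrics later on.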

In [8]:
# The 'discharge_disposition_id' column indicates what happened to the patient after hospitalization.
# From the csv file, ids 11, 13, 14, 19, 20 and 21 are related to death or hospice, so those encounters can be dropped.
# .copy() makes the filtered frame independent, so adding columns later does not raise SettingWithCopyWarning
diabetic_data = diabetic_data.loc[~diabetic_data['discharge_disposition_id'].isin([11,13,14,19,20,21])].copy()
diabetic_data.shape

(99343, 49)

In [9]:
# Binary label: 1 if the patient was not readmitted ('NO'), 0 if readmitted ('<30' or '>30')
diabetic_data['Readmission_label'] = (diabetic_data.readmitted == 'NO').astype('int')


In [10]:
diabetic_data.head()

encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,...,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted,Readmission_label
2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,,...,No,No,No,No,No,No,No,No,NO,1
149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,,...,Up,No,No,No,No,No,Ch,Yes,>30,0
64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,,...,No,No,No,No,No,No,No,Yes,NO,1
500364,82442376,Caucasian,Male,[30-40),,1,1,7,2,,...,Up,No,No,No,No,No,Ch,Yes,NO,1
16680,42519267,Caucasian,Male,[40-50),,1,1,7,1,,...,Steady,No,No,No,No,No,Ch,Yes,NO,1


In [11]:
# Get an idea of which features have missing values, and what share of each is missing:
def percent_null(data):
    # Returns a pandas Series with the percentage of NaN values in each feature of 'data',
    # keeping only the features with a non-zero percentage
    pc_null = (data.isnull().mean() * 100).round(2)
    return pc_null[pc_null != 0]
percent_null(diabetic_data)

race                  2.25
weight               96.85
payer_code           39.66
medical_specialty    48.94
diag_1                0.02
diag_2                0.36
diag_3                1.43
dtype: float64

In [12]:
# Make a copy
dd_df = diabetic_data.copy()
# The columns 'weight' and 'payer_code' can be dropped because a large share of their values are null
diabetic_data.drop(['weight', 'payer_code'], axis=1, inplace=True)

In [13]:
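# Encode the age brackets ('[0-10)', '[10-20)', ...) as integer codes in sorted order (0, 1, ...)
diabetic_data.age = LabelEncoder().fit_transform(diabetic_data.age)

In [14]:
# Map each code to the midpoint of its 10-year bracket (5, 15, ...)
diabetic_data.age = diabetic_data.age * 10 + 5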
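
`LabelEncoder` assigns codes in the sorted order of the labels, which happens to match the age order for these brackets. An alternative that does not depend on sort order is to parse the bounds out of each bracket string. The following is a sketch on a toy Series, not the approach used in this notebook:

```python
import pandas as pd

# A few bracket strings in the same format as the 'age' column
ages = pd.Series(["[0-10)", "[10-20)", "[90-100)"])

# Extract the lower and upper bound of each bracket as integers
bounds = ages.str.extract(r"\[(\d+)-(\d+)\)").astype(int)

# Midpoint of each bracket
midpoints = (bounds[0] + bounds[1]) // 2
print(midpoints.tolist())
```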

In [15]:
# Nominal and ordinal variables are treated as categorical, interval variables as integers. Numeric columns:
cols_nume = ['age','time_in_hospital','num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient','number_diagnoses']

In [16]:
diabetic_data[cols_nume].isnull().sum()

age                   0
time_in_hospital      0
num_lab_procedures    0
num_procedures        0
num_medications       0
number_outpatient     0
number_emergency      0
number_inpatient      0
number_diagnoses      0
dtype: int64

In [17]:
cols_cate = ['race', 'gender', 
       'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed']

In [18]:
diabetic_data[cols_cate].isnull().sum()

race                        2234
gender                         0
max_glu_serum                  0
A1Cresult                      0
metformin                      0
repaglinide                    0
nateglinide                    0
chlorpropamide                 0
glimepiride                    0
acetohexamide                  0
glipizide                      0
glyburide                      0
tolbutamide                    0
pioglitazone                   0
rosiglitazone                  0
acarbose                       0
miglitol                       0
troglitazone                   0
tolazamide                     0
insulin                        0
glyburide-metformin            0
glipizide-metformin            0
glimepiride-pioglitazone       0
metformin-rosiglitazone        0
metformin-pioglitazone         0
change                         0
diabetesMed                    0
dtype: int64

In [19]:
diabetic_data['race'] = diabetic_data['race'].fillna('UNK')
diabetic_data['medical_specialty'] = diabetic_data['medical_specialty'].fillna('UNK')

In [20]:
print('Number of medical specialties:', diabetic_data.medical_specialty.nunique())
diabetic_data.groupby('medical_specialty').size().sort_values(ascending=False)

Number of medical specialties: 73


medical_specialty
UNK                                  48616
InternalMedicine                     14237
Emergency/Trauma                      7419
Family/GeneralPractice                7252
Cardiology                            5279
Surgery-General                       3059
Nephrology                            1539
Orthopedics                           1392
Orthopedics-Reconstructive            1230
Radiologist                           1121
Pulmonology                            854
Psychiatry                             853
Urology                                682
ObstetricsandGynecology                669
Surgery-Cardiovascular/Thoracic        642
Gastroenterology                       538
Surgery-Vascular                       525
Surgery-Neuro                          462
PhysicalMedicineandRehabilitation      391
Oncology                               319
Pediatrics                             253
Neurology                              201
Hematology/Oncology                 

In [21]:
# Medical specialties outside the top 20 are grouped under 'Other'
top_20 = ['UNK', 'InternalMedicine', 'Emergency/Trauma',
          'Family/GeneralPractice', 'Cardiology', 'Surgery-General',
          'Nephrology', 'Orthopedics',
          'Orthopedics-Reconstructive', 'Radiologist', 'Pulmonology',
          'Psychiatry', 'Urology', 'ObstetricsandGynecology',
          'Surgery-Cardiovascular/Thoracic', 'Gastroenterology',
          'Surgery-Vascular', 'Surgery-Neuro',
          'PhysicalMedicineandRehabilitation', 'Oncology']

# make a new column with duplicated data
diabetic_data['med_spec_other'] = diabetic_data['medical_specialty'].copy()

# replace all specialties not in top 20 with 'Other' category
diabetic_data.loc[~diabetic_data.med_spec_other.isin(top_20), 'med_spec_other'] = 'Other'

In [22]:
diabetic_data.groupby('med_spec_other').size()


med_spec_other
Cardiology                            5279
Emergency/Trauma                      7419
Family/GeneralPractice                7252
Gastroenterology                       538
InternalMedicine                     14237
Nephrology                            1539
ObstetricsandGynecology                669
Oncology                               319
Orthopedics                           1392
Orthopedics-Reconstructive            1230
Other                                 2264
PhysicalMedicineandRehabilitation      391
Psychiatry                             853
Pulmonology                            854
Radiologist                           1121
Surgery-Cardiovascular/Thoracic        642
Surgery-General                       3059
Surgery-Neuro                          462
Surgery-Vascular                       525
UNK                                  48616
Urology                                682
dtype: int64
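
A hand-typed list of category names is easy to get wrong, since a missing comma or a misspelled label silently moves a category into 'Other'. As a sketch of an alternative, the most frequent categories can be derived from the data itself. The toy Series and the cutoff `n_top` below are hypothetical:

```python
import pandas as pd

# Toy specialty column (hypothetical values)
spec = pd.Series(["Cardiology", "Cardiology", "UNK", "UNK", "UNK", "Urology", "Neurology"])

n_top = 2  # hypothetical cutoff; the notebook above uses 20

# value_counts() sorts by frequency, so the first n_top index entries are the most common labels
top_n = spec.value_counts().index[:n_top]

# Keep labels in the top n, replace everything else with 'Other'
spec_grouped = spec.where(spec.isin(top_n), "Other")
print(spec_grouped.value_counts())
```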

In [23]:
# Standardize the numeric features to zero mean and unit variance

listnormal = ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications',
              'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']

normal = StandardScaler()

diabetic_data[listnormal] = normal.fit_transform(diabetic_data[listnormal])
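
The cell above fits the scaler on the full dataset. If the data are later split for modeling, a common alternative is to fit the scaler on the training portion only and reuse its statistics on the test portion, so that no information from the test rows reaches the preprocessing step. A minimal sketch on randomly generated toy data (the split size and seeds are arbitrary assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric frame with two of the columns standardized above (random, hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=100),
    "num_medications": rng.integers(1, 80, size=100),
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows only
test_scaled = scaler.transform(test)        # apply the same mean and std to the test rows

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```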
