# **Diabetic Patient Readmission -- Data Wrangling**

This dataset was analyzed by Virginia Commonwealth University faculty in a research article that is accompanied by feature descriptions. These can be found at https://www.hindawi.com/journals/bmri/2014/781670/tab1/.

In [1]:
import os
import pandas as pd
import numpy as np

df1 = pd.read_csv('diabetic_data.csv')
df2 = pd.read_csv('IDs_mapping.csv')

**Data Collection:**

In [2]:
print(df1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [3]:
print(df2.head(10))
print(df2.info())

          admission_type_id    description
0                         1      Emergency
1                         2         Urgent
2                         3       Elective
3                         4        Newborn
4                         5  Not Available
5                         6            NaN
6                         7  Trauma Center
7                         8     Not Mapped
8                       NaN            NaN
9  discharge_disposition_id    description
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   admission_type_id  65 non-null     object
 1   description        62 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB
None


**Data Organization:**

In [4]:
print(df2[df2.description == 'description'])

           admission_type_id  description
9   discharge_disposition_id  description
41       admission_source_id  description


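The 'IDs_mapping.csv' file contains *multiple* stacked tables, each mapping the values of one ID column in the primary dataset to text descriptions. The repeated header rows found above (indices 9 and 41) mark where the second and third tables begin.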
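The stacked layout described above can also be split by locating the embedded header rows instead of relying on fixed slice positions. The following is a minimal sketch under stated assumptions: the helper name `split_stacked_tables` and the toy frame `raw` are hypothetical, each embedded table is assumed to start with a row whose second column reads 'description', and blank separator rows are assumed to have a missing ID.

```python
import pandas as pd


def split_stacked_tables(raw: pd.DataFrame) -> dict:
    """Split a frame of stacked two-column lookup tables into one frame per table.

    Hypothetical helper; assumes embedded header rows repeat the literal word
    'description' and that separator rows have a missing ID.
    """
    raw = raw.reset_index(drop=True)
    first_col = raw.columns[0]

    # Positions of the embedded header rows (e.g. 'discharge_disposition_id', 'description').
    is_header = raw["description"] == "description"
    header_pos = is_header[is_header].index.tolist()

    # The first table is named by the file's own header; later ones by their embedded header.
    names = [first_col] + [raw.loc[p, first_col] for p in header_pos]
    starts = [0] + [p + 1 for p in header_pos]
    ends = header_pos + [len(raw)]

    tables = {}
    for name, start, end in zip(names, starts, ends):
        block = raw.iloc[start:end]
        # Drop blank separator rows, but keep IDs whose description is missing.
        block = block.dropna(subset=[first_col])
        tables[name] = block.rename(columns={first_col: name}).reset_index(drop=True)
    return tables


# Toy stand-in for the mapping file (values are illustrative, not the full file).
raw = pd.DataFrame({
    "admission_type_id": ["1", "2", "6", None,
                          "discharge_disposition_id", "1", "7", None,
                          "admission_source_id", "1"],
    "description": ["Emergency", "Urgent", None, None,
                    "description", "Discharged to home", "Left AMA", None,
                    "description", "Example source"],
})

tables = split_stacked_tables(raw)
for name, table in tables.items():
    print(name, len(table))
```

Hard-coded slices are perfectly adequate for a one-off file; a header-based split mainly helps if the mapping file is ever regenerated with a different number of rows.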

In [5]:
# df2 stacks three lookup tables; rows 9 and 41 are the embedded header rows.
admission_type_ids = df2[:8]  # rows 0-7; row 8 is a blank (NaN) separator
discharge_disposition_ids = df2[10:40]  # rows 10-39; row 40 is a blank (NaN) separator
# The ID column still carries the first table's name, so rename it for each later table.
discharge_disposition_ids = discharge_disposition_ids.reset_index(drop=True).rename(columns={'admission_type_id':'discharge_disposition_id'})
admission_source_ids = df2[42:]  # rows 42 onward
admission_source_ids = admission_source_ids.reset_index(drop=True).rename(columns={'admission_type_id':'admission_source_id'})
print(admission_type_ids)
print(discharge_disposition_ids)
print(admission_source_ids)

  admission_type_id    description
0                 1      Emergency
1                 2         Urgent
2                 3       Elective
3                 4        Newborn
4                 5  Not Available
5                 6            NaN
6                 7  Trauma Center
7                 8     Not Mapped
   discharge_disposition_id                                        description
0                         1                                 Discharged to home
1                         2  Discharged/transferred to another short term h...
2                         3                      Discharged/transferred to SNF
3                         4                      Discharged/transferred to ICF
4                         5  Discharged/transferred to another type of inpa...
5                         6  Discharged/transferred to home with home healt...
6                         7                                           Left AMA
7                         8  Discharged/transferred t

Because all analysis will run on the single primary dataset, there is no need to merge or join these lookup tables at this stage. Adding the text descriptions to the dataset would likely complicate future analysis and modeling.
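If a readable label is needed later (for a plot axis, say), it can be looked up on demand without changing the primary dataset. The sketch below uses toy frames with made-up encounter IDs; the name `encounters` is a hypothetical stand-in for the primary dataset. It also assumes the lookup IDs were read as strings (the mapping file's ID column has dtype object), so they are cast to integers before matching.

```python
import pandas as pd

# Toy lookup table shaped like the admission type table (IDs stored as strings).
admission_type_ids = pd.DataFrame({
    "admission_type_id": ["1", "2", "3"],
    "description": ["Emergency", "Urgent", "Elective"],
})

# Toy stand-in for the primary dataset (encounter IDs are invented).
encounters = pd.DataFrame({
    "encounter_id": [101, 102, 103],
    "admission_type_id": [3, 1, 2],
})

# Build an ID -> description Series, casting the string IDs to int so they
# match the integer IDs in the primary dataset.
lookup = (
    admission_type_ids
    .assign(admission_type_id=lambda t: t["admission_type_id"].astype(int))
    .set_index("admission_type_id")["description"]
)

# Map IDs to labels; the result is a separate Series, `encounters` is unchanged.
labels = encounters["admission_type_id"].map(lookup)
print(labels.tolist())
```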

In [6]:
# Save each ID lookup table to its own file for future reference
admission_type_ids.to_csv('admission_type_ids.csv', index=False)
discharge_disposition_ids.to_csv('discharge_disposition_ids.csv', index=False)
admission_source_ids.to_csv('admission_source_ids.csv', index=False)

**Data Definition:**

For feature (column) definitions, see https://www.hindawi.com/journals/bmri/2014/781670/tab1/.

In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [8]:
df1.loc[:,['encounter_id', 'patient_nbr', 'weight', 'payer_code', 'A1Cresult', 'insulin', 'change', 'readmitted']]

        encounter_id  patient_nbr  weight  payer_code  A1Cresult  insulin  change  readmitted
0            2278392      8222157       ?           ?        NaN       No      No          NO
1             149190     55629189       ?           ?        NaN       Up      Ch         >30
2              64410     86047875       ?           ?        NaN       No      No          NO
3             500364     82442376       ?           ?        NaN       Up      Ch          NO
4              16680     42519267       ?           ?        NaN   Steady      Ch          NO
...              ...          ...     ...         ...        ...      ...     ...         ...
101761     443847548    100162476       ?          MC         >8     Down      Ch         >30
101762     443847782     74694222       ?          MC        NaN   Steady      No          NO
101763     443854148     41088789       ?          MC        NaN     Down      Ch          NO
101764     443857166     31693671       ?          MC        NaN       Up      Ch          NO


In [9]:
df1.loc[:,['encounter_id','A1Cresult', 'diabetesMed', 'readmitted', 'diag_1', 'diag_2', 'diag_3']]

        encounter_id  A1Cresult  diabetesMed  readmitted  diag_1  diag_2  diag_3
0            2278392        NaN           No          NO  250.83       ?       ?
1             149190        NaN          Yes         >30     276  250.01     255
2              64410        NaN          Yes          NO     648     250     V27
3             500364        NaN          Yes          NO       8  250.43     403
4              16680        NaN          Yes          NO     197     157     250
...              ...        ...          ...         ...     ...     ...     ...
101761     443847548         >8          Yes         >30  250.13     291     458
101762     443847782        NaN          Yes          NO     560     276     787
101763     443854148        NaN          Yes          NO      38     590     296
101764     443857166        NaN          Yes          NO     996     285     998


'diag_1', 'diag_2', and 'diag_3' hold diagnosis codes (ICD-9); their meaning is covered in the feature descriptions linked above.

**An .html report was generated with pandas-profiling using the following code:**<br>
import pandas_profiling<br>
from pandas_profiling.utils.cache import cache_file<br>
profile_report = df1.profile_report(explorative=True, html={'style': {'full_width': True}})<br>
profile_report.to_file('df1_profile.html')<br>

In [10]:
df1.weight.describe()

count     101766
unique        10
top            ?
freq       98569
Name: weight, dtype: object

In [11]:
print("We only have", 101766-98569,'out of 101766 weights which equates to',100*98569/101766,'% missingness.')

We only have 3197 out of 101766 weights which equates to 96.85847925633315 % missingness.
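Because missing values are encoded as '?', `info()` reports every column as fully non-null. A hedged sketch of measuring missingness for all columns at once follows; it converts '?' to NaN first. The frame `toy` is made-up illustrative data, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the '?' placeholder convention (values are illustrative).
toy = pd.DataFrame({
    "weight": ["?", "?", "[75-100)", "?"],
    "payer_code": ["?", "MC", "MC", "?"],
    "readmitted": ["NO", ">30", "NO", "NO"],
})

# Treat '?' as a real missing value so pandas can count it.
cleaned = toy.replace("?", np.nan)

# Share of missing values per column, as a percentage, highest first.
missing_pct = cleaned.isna().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct)
```

The same conversion can be done at load time by passing `na_values='?'` to `pd.read_csv`, which avoids the separate `replace` step.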


In [12]:
df1.diabetesMed.describe()

count     101766
unique         2
top          Yes
freq       78363
Name: diabetesMed, dtype: object

In [13]:
df1.readmitted.describe()

count     101766
unique         3
top           NO
freq       54864
Name: readmitted, dtype: object
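`describe()` reports only the most frequent value. For a categorical target such as 'readmitted', the full class distribution is often more informative. A minimal sketch using `value_counts` on a made-up toy Series (not the real data):

```python
import pandas as pd

# Toy target column using the same three labels (counts are illustrative).
readmitted = pd.Series(["NO", ">30", "NO", "<30"], name="readmitted")

# Absolute counts and relative shares for every class, not just the top one.
counts = readmitted.value_counts()
shares = readmitted.value_counts(normalize=True)
print(counts)
print(shares)
```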

**Data Cleaning:**<br>
<br>
The data is in good enough shape to move forward. Further cleaning will be handled as the requirements of the modeling stage dictate.<br>
<br>
The feature descriptions linked above also report a missingness percentage for each feature.