# Diabetes Patients Early Readmissions Prediction

**Authors:** Peter Macinec, Frantisek Sefcik

## Data Preprocessing

In this Jupyter notebook, we preprocess the data, write the preprocessing functions, and combine them into a preprocessing pipeline.

### Setup and import libraries

#### TODO

- drop attributes with too many distinct values (or group the smallest classes together)
- encode categorical values (one-hot)
- oversampling/undersampling
- outlier removal?
- group categorical values that occur too rarely into "other"?
- convert `age` intervals to their average (midpoint)
- include parts from feature engineering
- copy everything into a file and wrap it in a single preprocessing function
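
Two of the TODO items above (converting `age` intervals to an average and grouping rare categories into "other") can be sketched with plain pandas. This is only an illustration on a toy frame: the helper names `age_bucket_to_midpoint` and `group_rare_categories`, the `min_count` threshold and the `"Other"` label are assumptions, not part of the project's code.

```python
import pandas as pd


def age_bucket_to_midpoint(bucket: str) -> float:
    """Convert an age interval such as '[70-80)' to its midpoint (75.0)."""
    low, high = bucket.strip("[)").split("-")
    return (int(low) + int(high)) / 2


def group_rare_categories(series: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep the value where it is not rare, otherwise substitute `other`
    return series.where(~series.isin(rare), other)


# Toy data that mimics the format of the `age` and `race` columns
toy = pd.DataFrame({
    "age": ["[0-10)", "[70-80)", "[70-80)", "[80-90)"],
    "race": ["Caucasian", "Caucasian", "Asian", "Caucasian"],
})
toy["age_mid"] = toy["age"].map(age_bucket_to_midpoint)
toy["race_grouped"] = group_rare_categories(toy["race"], min_count=2)
print(toy)
```

The midpoint keeps the ordering of the age groups while turning the column into a numeric feature; whether that helps the model would still have to be checked.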

In [1]:
# Automatically reloading imported modules
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import sys
sys.path.append('..')

from sklearn.pipeline import Pipeline
from src.preprocessing.transformers import *

In [3]:
df = pd.read_csv('../data/data.csv', na_values='?', low_memory=False)

In [4]:
df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | NaN | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | NaN | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | NaN | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | NaN | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | NaN | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |
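
The dataset marks missing values with `?`, which is why `na_values='?'` is passed to `pd.read_csv` above. A minimal sketch of the effect, using a made-up three-column CSV held in memory rather than the real file:

```python
import io

import pandas as pd

# Tiny in-memory CSV (invented rows) that uses '?' for missing values
raw = io.StringIO(
    "race,weight,readmitted\n"
    "Caucasian,?,NO\n"
    "?,[75-100),>30\n"
)
toy = pd.read_csv(raw, na_values="?")

# Each '?' is now a proper NaN and is counted by isna()
missing_per_column = toy.isna().sum().to_dict()
print(missing_per_column)
```

Without `na_values`, the `?` markers would be read as ordinary strings and missing-value checks such as `isna()` would not see them.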


### Preprocessing

TODO - description about building pipeline

#### Drop redundant columns

* `encounter_id`, `patient_nbr`: identifiers that are not useful for prediction
* `weight`, `payer_code`, `medical_specialty`: too many missing values
* `citoglipton`, `examide`: contain only one value

In [5]:
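# Columns removed explicitly by name. `weight` is not listed here, but it is
# absent from the pipeline output below, so another pipeline step removes it.
columns_to_drop = [
    'encounter_id', 'patient_nbr', 'payer_code', 'medical_specialty',
    'citoglipton', 'examide'
]
# Categorical columns to one-hot encode
columns_to_onehot = [
    'race', 'gender', 'A1Cresult', 'metformin', 'glipizide', 'glyburide',
    'insulin', 'change', 'diabetesMed'
]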
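
The two data-driven criteria in the list above (too many missing values, only one value) can be checked directly with pandas. The sketch below runs on a small invented frame; the 0.5 missing-value threshold is an assumption for illustration and is not taken from the project's transformers.

```python
import pandas as pd

# Toy frame: `weight` is mostly missing, `examide` has a single value
toy = pd.DataFrame({
    "weight": [None, None, None, "[75-100)", "[50-75)"],
    "examide": ["No", "No", "No", "No", "No"],
    "insulin": ["No", "Up", "Steady", "No", "Down"],
})

# Share of missing values per column; the 0.5 threshold is an assumption
nan_ratio = toy.isna().mean()
mostly_missing = nan_ratio[nan_ratio > 0.5].index.tolist()

# Columns with at most one distinct non-missing value
n_unique = toy.nunique(dropna=True)
constant = n_unique[n_unique <= 1].index.tolist()

print(mostly_missing, constant)
```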

#### Drop redundant rows

In [6]:
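# Collect the indices of all rows that should be removed
indices_to_drop = set()

In [7]:
# Rows where all three diagnoses are missing
indices_to_drop.update(df[df.diag_1.isna() & df.diag_2.isna() & df.diag_3.isna()].index)

In [8]:
# Rows whose gender is neither 'Male' nor 'Female'
indices_to_drop.update(df[~df.gender.isin(['Male', 'Female'])].index)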
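
To make the two row filters concrete, here is a self-contained sketch on a toy frame with the same column names. The values are invented, and `cleaned` is a hypothetical name; in the notebook the actual removal is done later by the `RowsFilter` step.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Female", "Male", "Unknown/Invalid", "Male"],
    "diag_1": ["250.83", None, "414", "428"],
    "diag_2": ["401", None, "411", None],
    "diag_3": ["255", None, "250", None],
})

to_drop = set()

# Rows with no diagnosis at all (all three diag_* columns missing)
no_diagnosis = toy[["diag_1", "diag_2", "diag_3"]].isna().all(axis=1)
to_drop.update(toy.index[no_diagnosis])

# Rows whose gender is neither 'Male' nor 'Female'
invalid_gender = ~toy["gender"].isin(["Male", "Female"])
to_drop.update(toy.index[invalid_gender])

cleaned = toy.drop(index=sorted(to_drop))
print(cleaned)
```

Collecting indices in a set first means a row that matches several criteria is still dropped only once.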

### Preprocessing using Pipeline

In [9]:
preprocessing_pipeline = Pipeline([
    ('cols_nan_filter', ColumnsNanFilter()),
    ('cols_filter', ColumnsFilter(columns_to_drop)),
    ('cols_values_diversity_filter', ColumnsValuesDiversityFilter(0.9)),
    ('rows_filter', RowsFilter(indices_to_drop)),
    ('one_hot_encoder', OneHotEncoder(columns_to_onehot))
])

In [10]:
df_preprocessed = preprocessing_pipeline.fit_transform(df)

ColumnsNanFilter transformation started.
ColumnsNanFilter transformation ended, took 0.03 seconds.
ColumnsFilter transformation started.
ColumnsFilter transformation ended, took 0.02 seconds.
ColumnsValuesDiversityFilter transformation started.
ColumnsValuesDiversityFilter transformation ended, took 0.01 seconds.
RowsFilter transformation started.
RowsFilter transformation ended, took 0.02 seconds.
OneHotEncoder transformation started.
OneHotEncoder transformation ended, took 0.4 seconds.
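
The transformers used above live in `src.preprocessing.transformers`, which is not shown in this notebook. As a rough sketch of how such DataFrame-to-DataFrame steps can plug into a scikit-learn `Pipeline`, the classes `DropColumns` and `DropRows` below are hypothetical stand-ins, not the project's `ColumnsFilter` and `RowsFilter`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a step that drops columns by name."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # errors='ignore' tolerates columns already removed upstream
        return X.drop(columns=self.columns, errors="ignore")


class DropRows(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a step that drops rows by index."""

    def __init__(self, indices):
        self.indices = indices

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.drop(index=list(self.indices), errors="ignore")


toy = pd.DataFrame({
    "encounter_id": [1, 2, 3],
    "age": ["[0-10)", "[10-20)", "[20-30)"],
})
pipe = Pipeline([
    ("cols", DropColumns(["encounter_id"])),
    ("rows", DropRows({1})),
])
result = pipe.fit_transform(toy)
print(result)
```

Inheriting from `TransformerMixin` provides `fit_transform`, so each step only needs `fit` and `transform` to be usable inside a `Pipeline`.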


In [11]:
df_preprocessed.columns

Index(['age', 'admission_type_id', 'discharge_disposition_id',
       'admission_source_id', 'time_in_hospital', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3',
       'number_diagnoses', 'readmitted', 'race_AfricanAmerican', 'race_Asian',
       'race_Caucasian', 'race_Hispanic', 'race_Other', 'race_nan',
       'gender_Female', 'gender_Male', 'A1Cresult_>7', 'A1Cresult_>8',
       'A1Cresult_None', 'A1Cresult_Norm', 'metformin_Down', 'metformin_No',
       'metformin_Steady', 'metformin_Up', 'glipizide_Down', 'glipizide_No',
       'glipizide_Steady', 'glipizide_Up', 'glyburide_Down', 'glyburide_No',
       'glyburide_Steady', 'glyburide_Up', 'insulin_Down', 'insulin_No',
       'insulin_Steady', 'insulin_Up', 'change_Ch', 'change_No',
       'diabetesMed_No', 'diabetesMed_Yes'],
      dtype='object')

In [12]:
df_preprocessed

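| | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | ... | glyburide_Steady | glyburide_Up | insulin_Down | insulin_No | insulin_Steady | insulin_Up | change_Ch | change_No | diabetesMed_No | diabetesMed_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0-10) | 6 | 25 | 1 | 1 | 41 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | [10-20) | 1 | 1 | 7 | 3 | 59 | 0 | 18 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 2 | [20-30) | 1 | 1 | 7 | 2 | 11 | 5 | 13 | 2 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | [30-40) | 1 | 1 | 7 | 2 | 44 | 1 | 16 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | [40-50) | 1 | 1 | 7 | 1 | 51 | 0 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101761 | [70-80) | 1 | 3 | 7 | 3 | 51 | 0 | 16 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 101762 | [80-90) | 1 | 4 | 5 | 5 | 33 | 3 | 18 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 101763 | [70-80) | 1 | 1 | 7 | 1 | 53 | 0 | 9 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 101764 | [80-90) | 2 | 3 | 7 | 10 | 45 | 2 | 21 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |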


In [14]:
one = OneHotEncoder(columns_to_onehot)

In [15]:
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [18]:
one.fit(df)

<src.preprocessing.transformers.OneHotEncoder at 0x7ff8e71d2048>

In [19]:
df_a = one.add_missing_columns(df, 'gender')

In [20]:
df_a.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted',
       'gender_Female', 'gender_Male', 'gender_Unknown/Invalid'],
      dtype='object')
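
The output above shows one dummy column per `gender` value, including `gender_Unknown/Invalid`. One common way to keep such one-hot columns consistent between the data an encoder was fitted on and new data is to remember the fitted columns and reindex. The sketch below uses `pd.get_dummies` on toy frames and is an assumption about the general idea, not a reproduction of the project's `OneHotEncoder` or its `add_missing_columns` method.

```python
import pandas as pd

train = pd.DataFrame({"gender": ["Female", "Male", "Unknown/Invalid"]})
new = pd.DataFrame({"gender": ["Male", "Male"]})

# "Fit": remember which dummy columns the training data produces
fitted_columns = pd.get_dummies(train["gender"], prefix="gender", dtype=int).columns

# "Transform": encode new data, then add any missing columns filled with 0
encoded = (
    pd.get_dummies(new["gender"], prefix="gender", dtype=int)
    .reindex(columns=fitted_columns, fill_value=0)
)
print(encoded)
```

Reindexing against the fitted columns also drops dummy columns for categories that were never seen during fitting, so the output always has the same shape.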