# Diabetes Patients Early Readmissions Prediction

**Authors:** Peter Macinec, Frantisek Sefcik

## Data Preprocessing

In this Jupyter notebook, we preprocess the data: we create the preprocessing functions and define the preprocessing pipeline.

### Setup and import libraries

**TODO**

- add other ordinal columns
- categorical columns: group classes with too few values into "other"?
- check against the analysis what should be done in the pipeline and for which columns
- describe outlier removal
- describe normalization (or add normalization?)
- copy into a file and wrap in one preprocessing function
- reorder the descriptions to follow the pipeline order
- write the conclusion


In [1]:
# Automatically reloading imported modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Make the project root importable so that the `src` package can be found
sys.path.append('..')

from src.preprocessing.transformers import *

In [3]:
df = pd.read_csv('../data/data.csv', na_values='?', low_memory=False)
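
Passing `na_values='?'` tells pandas to treat every `?` in the file as a missing value. A minimal sketch of the effect on a made-up two-row CSV (the sample data is hypothetical):

```python
import io

import pandas as pd

# Hypothetical miniature of the raw file: '?' stands for a missing value.
raw = io.StringIO(
    "race,gender,weight\n"
    "Caucasian,Female,?\n"
    "?,Male,[75-100)\n"
)

sample = pd.read_csv(raw, na_values='?')

# Every '?' has been parsed as NaN, so the usual missing-value tools apply.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```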

In [4]:
df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | NaN | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | NaN | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | NaN | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | NaN | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | NaN | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


### Preprocessing

TODO - describe how the pipeline is built

#### Drop redundant columns

* `encounter_id`, `patient_nbr`: identifiers that carry no predictive information
* `weight`, `payer_code`, `medical_specialty`: too many missing values
* `citoglipton`, `examide`: contain only a single value
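
The reasons in the list above can be checked directly on a DataFrame. A minimal sketch on toy data follows; the helper name `column_report` is hypothetical and not part of `src.preprocessing`. For every column it reports the fraction of missing values, the number of distinct values, and the share of the most frequent value among the non-missing entries.

```python
import pandas as pd


def column_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise missingness and value diversity for every column."""
    return pd.DataFrame({
        # Fraction of rows where the column is NaN.
        'missing_ratio': frame.isna().mean(),
        # Number of distinct non-missing values.
        'n_unique': frame.nunique(),
        # Share of the most frequent value among the non-missing entries.
        'top_value_share': frame.apply(
            lambda col: col.value_counts(normalize=True).iloc[0]
            if col.notna().any() else 0.0
        ),
    })


toy = pd.DataFrame({
    'weight': [None, None, None, '[75-100)'],
    'examide': ['No', 'No', 'No', 'No'],
    'gender': ['Female', 'Male', 'Female', 'Male'],
})

report = column_report(toy)
print(report)
```

Columns with a high `missing_ratio` or with `n_unique == 1` are candidates for dropping; the thresholds to use are a modelling decision.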

In [5]:
columns_to_drop = [
    'encounter_id', 'patient_nbr', 'payer_code', 'medical_specialty',
    # TODO: not all of these belong here; some should be filtered by another transformer
    'citoglipton', 'examide'
]

# TODO: move to the appropriate section
# TODO 2: add other mappings
age_mapping = {
    'age': {
        '[0-10)': 0, '[10-20)': 1, '[20-30)': 2, '[30-40)': 3, '[40-50)': 4,
        '[50-60)': 5, '[60-70)': 6, '[70-80)': 7, '[80-90)': 8, '[90-100)': 9
    }
}
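
`age` is an ordered category, so the mapping replaces each ten-year bracket with its rank. A minimal sketch of how such a nested mapping could be applied with plain pandas; how `ValueMapper` works internally is not shown in this notebook, so the use of `Series.map` here is an assumption:

```python
import pandas as pd

age_mapping = {
    'age': {
        '[0-10)': 0, '[10-20)': 1, '[20-30)': 2, '[30-40)': 3, '[40-50)': 4,
        '[50-60)': 5, '[60-70)': 6, '[70-80)': 7, '[80-90)': 8, '[90-100)': 9
    }
}

toy = pd.DataFrame({'age': ['[0-10)', '[70-80)', '[90-100)']})

mapped = toy.copy()
for column, mapping in age_mapping.items():
    # Series.map looks each value up in the dict; unknown values become NaN.
    mapped[column] = mapped[column].map(mapping)

print(mapped['age'].tolist())
```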

#### Fill missing values

**TODO** - maybe all columns should be listed here? All categorical and all numerical ones, each group with a different filling method.

In [6]:
columns_to_fillnan = ['diag_3', 'race']
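
In the pipeline these columns are filled with `strategy="most_frequent"`. A minimal pandas sketch of that strategy on toy data; it illustrates the idea only and is not the code of `MissingValuesImputer`:

```python
import pandas as pd

columns_to_fillnan = ['diag_3', 'race']

toy = pd.DataFrame({
    'race': ['Caucasian', None, 'Caucasian', 'Asian'],
    'diag_3': ['250', '401', None, '401'],
})

filled = toy.copy()
for column in columns_to_fillnan:
    # mode() ignores NaN; on ties it returns the values sorted, so take the first.
    most_frequent = filled[column].mode().iloc[0]
    filled[column] = filled[column].fillna(most_frequent)

print(filled)
```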

#### Drop redundant rows

In [7]:
indices_to_drop = set()

In [8]:
# Rows with no diagnosis at all (all three diagnosis columns are missing)
indices_to_drop.update(df[df.diag_1.isna() & df.diag_2.isna() & df.diag_3.isna()].index)

In [9]:
# Rows whose gender is neither 'Male' nor 'Female'
indices_to_drop.update(df[~df.gender.isin(['Male', 'Female'])].index)

#### Diagnoses encoding

In [10]:
diagnoses_cols = ['diag_1', 'diag_2', 'diag_3']
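
The `diag_*` columns hold ICD-9 codes, and the preprocessed output further down contains columns such as `diag_3_category_circulatory` and `diag_3_category_diabetes`, so each code is reduced to a coarse category. A minimal sketch of such a grouping, assuming the ICD-9 ranges commonly used with this dataset; the function name `icd9_to_category` and the exact ranges are assumptions and may differ from `DiagnosesCodesMapper`:

```python
import math


def icd9_to_category(code):
    """Map a raw ICD-9 code (str or NaN) to a coarse diagnosis group."""
    if code is None or (isinstance(code, float) and math.isnan(code)):
        return 'None'
    code = str(code)
    # Supplementary codes starting with 'V' or 'E' are not numeric.
    if code[0] in ('V', 'E'):
        return 'other'
    # Drop the decimal part: 250.83 belongs to the 250 family.
    number = int(float(code))
    if number == 250:
        return 'diabetes'
    if 390 <= number <= 459 or number == 785:
        return 'circulatory'
    if 460 <= number <= 519 or number == 786:
        return 'respiratory'
    if 520 <= number <= 579 or number == 787:
        return 'digestive'
    if 800 <= number <= 999:
        return 'injury'
    if 710 <= number <= 739:
        return 'musculoskeletal'
    if 580 <= number <= 629 or number == 788:
        return 'genitourinary'
    if 140 <= number <= 239:
        return 'neoplasm'
    return 'other'


codes = ['250.83', '428', 'V27', '486', float('nan')]
categories = [icd9_to_category(code) for code in codes]
print(categories)
```

On a DataFrame, a function like this could be applied per column, for example with `df['diag_1'].map(icd9_to_category)`.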

#### Number of all visits

In [11]:
visits_cols = ['number_emergency', 'number_outpatient', 'number_inpatient']
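
These three counters are combined into a single feature, which appears as `visits_sum` among the preprocessed columns. A minimal sketch of a row-wise sum on toy data, assuming the new column is simply the total of the three:

```python
import pandas as pd

visits_cols = ['number_emergency', 'number_outpatient', 'number_inpatient']

toy = pd.DataFrame({
    'number_emergency': [0, 1, 0],
    'number_outpatient': [2, 0, 0],
    'number_inpatient': [1, 3, 0],
})

# Sum across the visit columns for each encounter (axis=1 means per row).
toy['visits_sum'] = toy[visits_cols].sum(axis=1)

print(toy['visits_sum'].tolist())
```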

#### Medicaments features

* number of medicaments
* number of medicaments changes

In [12]:
medicaments_cols = [
    'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
    'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
    'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide',
    'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
    'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone'
]
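
Each medicament column takes one of the values `No`, `Steady`, `Up` or `Down`, as the one-hot column names in the preprocessed output show. A minimal vectorized sketch of the two features listed above, assuming that a medicament counts as taken when its value is not `No` and as changed when its value is `Up` or `Down`; these definitions are assumptions, since the transformer code is not part of this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    'metformin': ['No', 'Steady', 'Up'],
    'insulin': ['Down', 'No', 'Up'],
    'glipizide': ['No', 'No', 'Steady'],
})
medicaments = ['metformin', 'insulin', 'glipizide']

# A medicament is counted as taken when its value is anything but 'No'.
toy['number_medicaments'] = toy[medicaments].ne('No').sum(axis=1)

# A change is counted when the dosage went 'Up' or 'Down'.
toy['number_medicaments_changes'] = toy[medicaments].isin(['Up', 'Down']).sum(axis=1)

print(toy[['number_medicaments', 'number_medicaments_changes']])
```

Boolean operations on whole columns avoid a Python-level loop over rows, which can matter on a dataset of this size.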

#### One-hot encoding


In [13]:
columns_to_onehot = [
    'race', 'gender', 'A1Cresult', 'metformin', 'glipizide', 'glyburide', 
    'insulin', 'change', 'diabetesMed'
]

In [14]:
columns_to_onehot += [f'{column}_category' for column in diagnoses_cols]
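
One-hot encoding turns each listed column into one indicator column per distinct value, named `<column>_<value>` (for example `gender_Female`). A minimal sketch with `pd.get_dummies` on toy data; the custom `OneHotEncoder` used in the pipeline may be implemented differently:

```python
import pandas as pd

toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female'],
    'change': ['No', 'Ch', 'Ch'],
    'time_in_hospital': [1, 3, 2],
})

# Only the listed columns are expanded; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=['gender', 'change'], dtype='uint8')

print(encoded.columns.tolist())
```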

### Preprocessing using Pipeline

In [15]:
preprocessing_pipeline = make_pipeline(
    ColumnsNanFilter(),
    ColumnsFilter(columns_to_drop),
    ColumnsValuesDiversityFilter(0.9),
    RowsNanFilter(),
    RowsFilter(indices_to_drop),
    MissingValuesImputer(columns=columns_to_fillnan, strategy="most_frequent"),
    ValueMapper(age_mapping),
    NumberVisitsCreator(visits_cols),
    NumberMedicamentsChangesCreator(medicaments_cols),
    NumberMedicamentsCreator(medicaments_cols),
    DiagnosesCodesMapper(diagnoses_cols),
    OneHotEncoder(columns_to_onehot)
)
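
`make_pipeline` chains the steps in the order given, and each step has to expose `fit` and `transform`. A minimal sketch of a DataFrame-in, DataFrame-out transformer that satisfies this contract; the class `DropColumns` is hypothetical and only mirrors the role of `ColumnsFilter`, it is not the code from `src.preprocessing.transformers`:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that drops the given columns if present."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        # errors='ignore' skips columns that are not in the frame.
        return X.drop(columns=self.columns, errors='ignore')


toy = pd.DataFrame({'encounter_id': [1, 2], 'age': [3, 7]})
pipeline = make_pipeline(DropColumns(['encounter_id', 'payer_code']))
result = pipeline.fit_transform(toy)

print(result.columns.tolist())
```

Keeping every step as a small class with its own parameters makes the order of operations explicit and lets the same fitted pipeline be reused on new data.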

In [16]:
df_preprocessed = preprocessing_pipeline.fit_transform(df)

ColumnsNanFilter transformation started.
ColumnsNanFilter transformation ended, took 0.03 seconds.
ColumnsFilter transformation started.
ColumnsFilter transformation ended, took 0.03 seconds.
ColumnsValuesDiversityFilter transformation started.
ColumnsValuesDiversityFilter transformation ended, took 0.01 seconds.
RowsNanFilter transformation started.
RowsNanFilter transformation ended, took 0.14 seconds.
RowsFilter transformation started.
RowsFilter transformation ended, took 0.02 seconds.
MissingValuesImputer transformation started.
MissingValuesImputer transformation ended, took 0.03 seconds.
ValueMapper transformation started.
ValueMapper transformation ended, took 0.16 seconds.
NumberVisitsCreator transformation started.
NumberVisitsCreator transformation ended, took 0.01 seconds.
NumberMedicamentsChangesCreator transformation started.
NumberMedicamentsChangesCreator transformation ended, took 15.42 seconds.
NumberMedicamentsCreator transformation started.
NumberMedicamentsCreator 

In [17]:
df_preprocessed.columns

Index(['age', 'admission_type_id', 'discharge_disposition_id',
       'admission_source_id', 'time_in_hospital', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'number_diagnoses',
       'readmitted', 'visits_sum', 'number_medicaments_changes',
       'number_medicaments', 'race_AfricanAmerican', 'race_Asian',
       'race_Caucasian', 'race_Hispanic', 'race_Other', 'gender_Female',
       'gender_Male', 'A1Cresult_>7', 'A1Cresult_>8', 'A1Cresult_None',
       'A1Cresult_Norm', 'metformin_Down', 'metformin_No', 'metformin_Steady',
       'metformin_Up', 'glipizide_Down', 'glipizide_No', 'glipizide_Steady',
       'glipizide_Up', 'glyburide_Down', 'glyburide_No', 'glyburide_Steady',
       'glyburide_Up', 'insulin_Down', 'insulin_No', 'insulin_Steady',
       'insulin_Up', 'change_Ch', 'change_No', 'diabetesMed_No',
       'diabetesMed_Yes', 'diag_1_category_circulatory',
       'diag_1_category_diabet

In [18]:
df_preprocessed

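| | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | ... | diag_2_category_None | diag_3_category_circulatory | diag_3_category_diabetes | diag_3_category_digestive | diag_3_category_genitourinary | diag_3_category_injury | diag_3_category_musculoskeletal | diag_3_category_neoplasm | diag_3_category_other | diag_3_category_respiratory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 25 | 1 | 1 | 41 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 7 | 3 | 59 | 0 | 18 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2 | 1 | 1 | 7 | 2 | 11 | 5 | 13 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 3 | 1 | 1 | 7 | 2 | 44 | 1 | 16 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 1 | 1 | 7 | 1 | 51 | 0 | 8 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101761 | 7 | 1 | 3 | 7 | 3 | 51 | 0 | 16 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 101762 | 8 | 1 | 4 | 5 | 5 | 33 | 3 | 18 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 101763 | 7 | 1 | 1 | 7 | 1 | 53 | 0 | 9 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 101764 | 8 | 2 | 3 | 7 | 10 | 45 | 2 | 21 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |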


In [23]:
df_preprocessed.dtypes

age                                int64
admission_type_id                  int64
discharge_disposition_id           int64
admission_source_id                int64
time_in_hospital                   int64
                                   ...  
diag_3_category_injury             uint8
diag_3_category_musculoskeletal    uint8
diag_3_category_neoplasm           uint8
diag_3_category_other              uint8
diag_3_category_respiratory        uint8
Length: 76, dtype: object

In [21]:
df_preprocessed.dtypes.value_counts()

uint8     58
int64     17
object     1
dtype: int64
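
One column is still of type `object` after preprocessing. A minimal sketch of how to list such columns so they can be encoded before modelling; the data is a toy example, and which column it is in `df_preprocessed` is not shown in the output above:

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [0, 7],
    'gender_Male': pd.Series([1, 0], dtype='uint8'),
    'readmitted': ['NO', '>30'],
})

# select_dtypes with exclude='number' keeps only the non-numeric columns.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

print(non_numeric)
```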