# Diabetes Patients Early Readmissions Prediction

**Authors:** [Peter Macinec](https://github.com/pmacinec), [Frantisek Sefcik](https://github.com/FrantisekSefcik)

## Data Preprocessing

In this Jupyter notebook, we preprocess the data, create reusable preprocessing functions, and define the preprocessing pipeline.

### Setup and import libraries

In [1]:
# Automatically reloading imported modules
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import sys
sys.path.append('..')

from sklearn.pipeline import make_pipeline
from src.preprocessing.transformers import *
from sklearn.impute import SimpleImputer

Now, let's read the original data to be preprocessed: 

In [3]:
df = pd.read_csv('../data/data.csv', na_values='?', low_memory=False)

In [4]:
df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | NaN | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | NaN | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | NaN | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | NaN | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | NaN | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


### Preprocessing pipeline

The data will be preprocessed via a *preprocessing pipeline*. We decided to use a pipeline to ensure reproducibility without hard-coding the preprocessing steps. This way, other dataframes with the same attributes can be preprocessed in exactly the same way.
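
As a minimal sketch of the idea, a pipeline step can be written as a scikit-learn compatible transformer and chained with `make_pipeline`. The two classes below (`DropColumns`, `FillWithMedian`) are hypothetical stand-ins for illustration only; they are not the transformers from `src.preprocessing.transformers`, whose implementation is not shown in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drops the given columns from a dataframe."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the columns to drop are fixed up front.
        return self

    def transform(self, X):
        return X.drop(columns=self.columns, errors='ignore')


class FillWithMedian(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fills numerical NaNs with medians learned in fit."""

    def fit(self, X, y=None):
        # Medians are learned once and reused for any later dataframe.
        self.medians_ = X.median(numeric_only=True)
        return self

    def transform(self, X):
        return X.fillna(self.medians_)


toy = pd.DataFrame({'id': [1, 2, 3], 'value': [1.0, None, 3.0]})
toy_pipeline = make_pipeline(DropColumns(['id']), FillWithMedian())
toy_result = toy_pipeline.fit_transform(toy)
print(toy_result)
```

Because everything learned from the data lives in `fit`, the same fitted pipeline can later be applied to a new dataframe with `transform` only.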

During data analysis, we identified the following steps to be performed during preprocessing:

* drop columns with too many missing values (columns with more than 45% missing values)
* drop redundant columns (like `patient_nbr` or `encounter_id`)
* merge small classes (less than 5% frequency) in categorical attributes into a single 'other' class
* drop columns with low variability (e.g. more than 90% of samples share the same value in a given column)
* drop rows with missing values in too many attributes (value is missing in more than 30% of columns)
* drop specific rows (e.g. those with unknown gender or with all diagnoses missing)
* fill missing values in nominal attributes with the most frequent value
* fill missing values in numerical attributes with the median value
* map ordinal feature values to numbers
* create new features from analysis:
    * `visits_sum`
    * `number_medicaments_changes`
    * `number_medicaments`
* map diagnosis codes to diagnosis categories (according to the analysis)
* encode categorical features into numbers (one-hot encoding)

We have checked the numerical attributes carefully to see whether normalization and outlier removal should be performed. According to the analysis, the scales of the numerical attributes do not differ substantially, so normalization is not needed. We also checked for outliers, but did not find any extreme values that should be removed.

In the following sections, the individual preprocessing steps are described and prepared for the preprocessing pipeline.

#### Drop columns with too many missing values

First, columns with too many missing values (more than 45% of values missing) should be dropped. For this step, we use the `ColumnsNanFilter` transformer.
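
The rule itself can be sketched in plain pandas. The helper `drop_sparse_columns` below is a hypothetical illustration of the 45% threshold, not the actual `ColumnsNanFilter` implementation.

```python
import pandas as pd


def drop_sparse_columns(df, threshold=0.45):
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    return df.loc[:, missing_share <= threshold]


toy = pd.DataFrame({
    'weight': [None, None, None, 80.0],             # 75% missing, dropped
    'race': ['Caucasian', None, 'Asian', 'Other'],  # 25% missing, kept
})
print(drop_sparse_columns(toy).columns.tolist())  # ['race']
```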

#### Drop redundant columns

Some columns contain information that is redundant for our prediction and can be filtered out manually to avoid unnecessary bias. This is done using the `ColumnsFilter` transformer.

The following columns will be dropped:
* `encounter_id` and `patient_nbr` do not contain information that can help the prediction
* `payer_code` contains a lot of missing values (39%) and we consider it redundant (a more detailed explanation can be found in the analysis)

In [None]:
columns_to_drop = ['encounter_id', 'patient_nbr', 'payer_code']

#### Small categories reducing - TODO


`SmallCategoriesReducer`

In [None]:
columns_to_reduce = ['discharge_disposition_id', 'admission_source_id']
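
A minimal sketch of the reduction rule (categories below 5% frequency are merged into 'other') could look as follows. The function `merge_small_categories` is hypothetical; the real `SmallCategoriesReducer` may differ in details such as how missing values are treated.

```python
import pandas as pd


def merge_small_categories(series, min_share=0.05, other_label='other'):
    """Replace categories rarer than `min_share` with `other_label`."""
    shares = series.value_counts(normalize=True)
    small = shares[shares < min_share].index
    # Cast to object first so numeric codes and the string label can coexist.
    return series.astype(object).where(~series.isin(small), other_label)


# Toy column with 100 rows: codes 3 and 25 are each below 5% frequency.
toy = pd.Series([1] * 90 + [7] * 7 + [3] * 2 + [25] * 1)
reduced = merge_small_categories(toy)
print(reduced.value_counts())
```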

#### Drop low diversity columns

Some attributes are dominated by a single value and therefore have low variability (e.g. `citoglipton` and `examide` contain only one value). Attributes whose most frequent value covers more than 90% of samples will be dropped using the `ColumnsValuesDiversityFilter` transformer.
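
The 90% rule can be sketched as below. `drop_low_diversity_columns` is a hypothetical helper, and counting missing values as a value of their own (`dropna=False`) is an assumption rather than confirmed behaviour of `ColumnsValuesDiversityFilter`.

```python
import pandas as pd


def drop_low_diversity_columns(df, threshold=0.9):
    """Drop columns whose most frequent value covers more than `threshold` of rows."""
    keep = [
        column for column in df.columns
        # Assumption: NaN counts as a value when measuring dominance.
        if df[column].value_counts(normalize=True, dropna=False).max() <= threshold
    ]
    return df[keep]


toy = pd.DataFrame({
    'examide': ['No'] * 20,                                # a single value, dropped
    'insulin': ['No'] * 10 + ['Up'] * 5 + ['Steady'] * 5,  # top value at 50%, kept
})
filtered = drop_low_diversity_columns(toy)
print(filtered.columns.tolist())  # ['insulin']
```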

#### Filter rows with too many missing values

Rows with too many missing values (more than 30% of attributes) will be dropped using the `RowsNanFilter` transformer.
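
The row-wise variant of the same idea, again as a hypothetical sketch (`drop_sparse_rows` is not the actual `RowsNanFilter`):

```python
import pandas as pd


def drop_sparse_rows(df, threshold=0.3):
    """Drop rows in which more than `threshold` of the attributes are missing."""
    missing_share = df.isna().mean(axis=1)  # fraction of NaN values per row
    return df[missing_share <= threshold]


toy = pd.DataFrame({
    'race': ['Caucasian', None, 'Asian'],
    'diag_1': ['250', None, '428'],
    'diag_2': ['401', '250', None],
    'age': [3, 5, 7],
})
# Row 1 misses 2 of 4 attributes (50%) and is dropped; row 2 misses 1 of 4 (25%).
kept = drop_sparse_rows(toy)
print(kept.index.tolist())  # [0, 2]
```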

#### Filter specific rows - TODO

`RowsFilter`

In [None]:
indices_to_drop = set()

In [None]:
indices_to_drop.update(list(df[df.diag_1.isna() & df.diag_2.isna() & df.diag_3.isna()].index))

In [None]:
indices_to_drop.update(list(df[~df.gender.isin(['Male', 'Female'])].index))

#### Fill missing values - TODO

`MissingValuesImputer`

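Missing values should be filled for both categorical attributes (most frequent value) and numerical attributes (median).

**TODO** - maybe all columns should be covered here: all categorical and all numerical ones, each with its own filling method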

In [None]:
columns_to_fillnan = ['diag_3', 'race']
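
The two filling strategies listed among the preprocessing steps (most frequent value for nominal attributes, median for numerical ones) can be sketched in pandas as follows. `fill_missing` and the toy data are hypothetical; inside a real pipeline the fill values should be learned in `fit` so that they can be reused on new data.

```python
import pandas as pd


def fill_missing(df, categorical_cols, numerical_cols):
    """Fill NaNs with the mode (categorical) or the median (numerical)."""
    filled = df.copy()
    for column in categorical_cols:
        # mode() can return several values on ties; take the first one.
        filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    for column in numerical_cols:
        filled[column] = filled[column].fillna(filled[column].median())
    return filled


toy = pd.DataFrame({
    'race': ['Caucasian', 'Caucasian', None, 'Asian'],
    'num_medications': [10.0, None, 14.0, 30.0],
})
filled = fill_missing(toy, ['race'], ['num_medications'])
print(filled)
```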

#### Mapping ordinal attributes - TODO

`ValueMapper`

In [5]:
age_mapping = {  # TODO: move to the appropriate section; TODO 2: add other mappings
    'age': {
        '[0-10)': 0, '[10-20)': 1, '[20-30)': 2, '[30-40)': 3, '[40-50)': 4, 
        '[50-60)': 5, '[60-70)': 6, '[70-80)': 7, '[80-90)': 8, '[90-100)': 9
    }}
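
A mapping of this shape (column name to a value dictionary) can be applied with `Series.map`. The helper `map_values` is a hypothetical sketch, not the `ValueMapper` implementation; note that `map` turns any value missing from the dictionary into NaN.

```python
import pandas as pd


def map_values(df, mapping):
    """Apply a {column: {old_value: new_value}} mapping to a dataframe."""
    mapped = df.copy()
    for column, value_map in mapping.items():
        mapped[column] = mapped[column].map(value_map)
    return mapped


# Same age bins as in the notebook's mapping, built programmatically.
age_bins = {f'[{10 * i}-{10 * (i + 1)})': i for i in range(10)}

toy = pd.DataFrame({'age': ['[0-10)', '[70-80)', '[90-100)']})
mapped = map_values(toy, {'age': age_bins})
print(mapped['age'].tolist())  # [0, 7, 9]
```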

#### Feature engineering - TODO

Transformers used in this step:

* `NumberVisitsCreator(visits_cols)`
* `NumberMedicamentsChangesCreator(medicaments_cols)`
* `NumberMedicamentsCreator(medicaments_cols)`
* `DiagnosesCodesMapper(diagnoses_cols)`

In [None]:
visits_cols = ['number_emergency', 'number_outpatient', 'number_inpatient']

In [None]:
medicaments_cols = [
    'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide','glimepiride', 
    'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
    'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide',
    'citoglipton', 'insulin','glyburide-metformin', 'glipizide-metformin',
    'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone'
]
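
The three engineered features could be computed roughly as sketched below. The exact definitions are assumptions, since they are not given in this notebook: `visits_sum` is taken as the sum of the visit counters, `number_medicaments_changes` as the number of medicaments with value `Up` or `Down`, and `number_medicaments` as the number of medicaments with any value other than `No`.

```python
import pandas as pd


def add_engineered_features(df, visits_cols, medicaments_cols):
    """Add the three count features under the assumed definitions."""
    result = df.copy()
    result['visits_sum'] = result[visits_cols].sum(axis=1)
    # Assumption: a "change" is a dosage that went up or down.
    result['number_medicaments_changes'] = (
        result[medicaments_cols].isin(['Up', 'Down']).sum(axis=1)
    )
    # Assumption: a medicament counts as taken unless its value is 'No'.
    result['number_medicaments'] = (result[medicaments_cols] != 'No').sum(axis=1)
    return result


toy = pd.DataFrame({
    'number_emergency': [0, 1],
    'number_outpatient': [2, 0],
    'number_inpatient': [1, 0],
    'metformin': ['No', 'Steady'],
    'insulin': ['Up', 'Down'],
})
features = add_engineered_features(
    toy,
    ['number_emergency', 'number_outpatient', 'number_inpatient'],
    ['metformin', 'insulin'],
)
print(features[['visits_sum', 'number_medicaments_changes', 'number_medicaments']])
```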

In [None]:
diagnoses_cols = ['diag_1', 'diag_2', 'diag_3']
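
The mapping used by `DiagnosesCodesMapper` comes from the analysis and is not shown here. As an illustration only, the sketch below groups codes by standard ICD-9 chapter ranges, with category names taken from the preprocessed column names; both the ranges chosen and the handling of missing or non-numeric codes are assumptions.

```python
import pandas as pd

# Assumed ICD-9 chapter ranges; the notebook's actual mapping may differ.
ICD9_RANGES = [
    (140, 239, 'neoplasm'),
    (250, 250, 'diabetes'),
    (390, 459, 'circulatory'),
    (460, 519, 'respiratory'),
    (520, 579, 'digestive'),
    (580, 629, 'genitourinary'),
    (710, 739, 'musculoskeletal'),
    (800, 999, 'injury'),
]


def diagnosis_category(code):
    """Map a single ICD-9 code (as read from the CSV) to a coarse category."""
    if pd.isna(code):
        return 'None'  # assumption: missing diagnoses get their own category
    try:
        number = int(float(code))  # '250.83' -> 250
    except ValueError:
        return 'other'  # supplementary codes such as 'V45' or 'E909'
    for low, high, category in ICD9_RANGES:
        if low <= number <= high:
            return category
    return 'other'


toy = pd.Series(['250.83', '428', 'V45', None, '486'])
categories = toy.map(diagnosis_category)
print(categories.tolist())
```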

#### One-hot encoding - TODO

In [13]:
columns_to_onehot = [
    'race', 'gender', 'A1Cresult', 'metformin', 'glipizide', 'glyburide', 
    'insulin', 'change', 'diabetesMed'
]

In [14]:
columns_to_onehot += [f'{column}_category' for column in diagnoses_cols]
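
For reference, one-hot encoding of selected columns can be reproduced with `pd.get_dummies`. This is a sketch on toy data; the `OneHotEncoder` used in the pipeline is the project's own transformer and may behave differently (for example, in how it handles categories unseen during fitting).

```python
import pandas as pd

toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female'],
    'change': ['No', 'Ch', 'Ch'],
    'time_in_hospital': [1, 3, 2],
})

# Only the listed columns are encoded; the others are passed through unchanged.
encoded = pd.get_dummies(toy, columns=['gender', 'change'], dtype=int)
print(encoded.columns.tolist())
```

The resulting column names follow the `<column>_<value>` pattern, which matches names such as `gender_Female` and `change_Ch` in the preprocessed data.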

### Preprocessing using Pipeline

TODO - filling numerical?
    - then maybe pipeline "branching" should be used

In [16]:
preprocessing_pipeline = make_pipeline(
    ColumnsNanFilter(), 
    ColumnsFilter(columns_to_drop),
    SmallCategoriesReducer(columns_to_reduce),
    ColumnsValuesDiversityFilter(0.9),
    RowsNanFilter(),
    RowsFilter(indices_to_drop),
    MissingValuesImputer(columns=columns_to_fillnan, strategy='most_frequent'),
    ValueMapper(age_mapping),
    NumberVisitsCreator(visits_cols),
    NumberMedicamentsChangesCreator(medicaments_cols),
    NumberMedicamentsCreator(medicaments_cols),
    DiagnosesCodesMapper(diagnoses_cols),
    OneHotEncoder(columns_to_onehot)
)

In [17]:
df_preprocessed = preprocessing_pipeline.fit_transform(df)

ColumnsNanFilter transformation started.
ColumnsNanFilter transformation ended, took 0.03 seconds.
ColumnsFilter transformation started.
ColumnsFilter transformation ended, took 0.02 seconds.
SmallCategoriesReducer transformation started.
SmallCategoriesReducer transformation ended, took 0.18 seconds.
ColumnsValuesDiversityFilter transformation started.
ColumnsValuesDiversityFilter transformation ended, took 0.07 seconds.
RowsNanFilter transformation started.
RowsNanFilter transformation ended, took 0.12 seconds.
RowsFilter transformation started.
RowsFilter transformation ended, took 0.02 seconds.
MissingValuesImputer transformation started.
MissingValuesImputer transformation ended, took 0.02 seconds.
ValueMapper transformation started.
ValueMapper transformation ended, took 0.12 seconds.
NumberVisitsCreator transformation started.
NumberVisitsCreator transformation ended, took 0.01 seconds.
NumberMedicamentsChangesCreator transformation started.
NumberMedicamentsChangesCreator trans

Which features are available after preprocessing?

In [18]:
df_preprocessed.columns

Index(['age', 'admission_type_id', 'discharge_disposition_id',
       'admission_source_id', 'time_in_hospital', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'number_diagnoses',
       'readmitted', 'visits_sum', 'number_medicaments_changes',
       'number_medicaments', 'race_AfricanAmerican', 'race_Asian',
       'race_Caucasian', 'race_Hispanic', 'race_Other', 'gender_Female',
       'gender_Male', 'A1Cresult_>7', 'A1Cresult_>8', 'A1Cresult_None',
       'A1Cresult_Norm', 'metformin_Down', 'metformin_No', 'metformin_Steady',
       'metformin_Up', 'glipizide_Down', 'glipizide_No', 'glipizide_Steady',
       'glipizide_Up', 'glyburide_Down', 'glyburide_No', 'glyburide_Steady',
       'glyburide_Up', 'insulin_Down', 'insulin_No', 'insulin_Steady',
       'insulin_Up', 'change_Ch', 'change_No', 'diabetesMed_No',
       'diabetesMed_Yes', 'diag_1_category_circulatory',
       'diag_1_category_diabet

Let's check the data after preprocessing:

In [19]:
df_preprocessed

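| | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | ... | diag_2_category_None | diag_3_category_circulatory | diag_3_category_diabetes | diag_3_category_digestive | diag_3_category_genitourinary | diag_3_category_injury | diag_3_category_musculoskeletal | diag_3_category_neoplasm | diag_3_category_other | diag_3_category_respiratory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | other | 1 | 1 | 41 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | other | 3 | 59 | 0 | 18 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2 | 1 | 1 | other | 2 | 11 | 5 | 13 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 3 | 1 | 1 | other | 2 | 44 | 1 | 16 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 1 | 1 | other | 1 | 51 | 0 | 8 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101761 | 7 | 1 | other | other | 3 | 51 | 0 | 16 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 101762 | 8 | 1 | other | other | 5 | 33 | 3 | 18 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 101763 | 7 | 1 | 1 | other | 1 | 53 | 0 | 9 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 101764 | 8 | 2 | other | other | 10 | 45 | 2 | 21 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |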


In [None]:
df_preprocessed.dtypes.value_counts()

TODO - all numerical, explanation

### Conclusion