# Diabetes Patients Early Readmissions Prediction

**Authors:** [Peter Macinec](https://github.com/pmacinec), [Frantisek Sefcik](https://github.com/FrantisekSefcik)

## Data Preprocessing

In this Jupyter notebook, we preprocess the data, create functions for preprocessing, and define the preprocessing pipeline.

### Setup and import libraries

In [1]:
# Automatically reloading imported modules
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('..')

# Suppress deprecation warnings raised by imported libraries
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
pd.set_option('display.max_columns', None)

from src.preprocessing.helpers import load_dataset, transform_label
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from src.preprocessing.transformers import *

Now, let's read the original data to be preprocessed: 

In [4]:
df = load_dataset()

In [5]:
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,,,1,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,,,59,0,18,0,0,0,276.0,250.01,255,9,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,,,11,5,13,2,0,1,648.0,250.0,V27,6,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),,1,1,7,2,,,44,1,16,0,0,0,8.0,250.43,403,7,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),,1,1,7,1,,,51,0,8,0,0,0,197.0,157.0,250,5,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes,NO


### Initial preprocessing

At the beginning, we will filter some samples:

* drop rows with missing values in too many attributes (value is missing in more than 30% of columns)
* drop some rows we decided to exclude (e.g. those with unknown gender, or with all diagnoses missing)

These samples are better dropped than used for training machine learning algorithms.

#### Filter rows with too many missing values

Rows with too many missing values (more than 30% of attributes) will be dropped using the `RowsNanFilter` transformer.
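
`RowsNanFilter` lives in the project's `src.preprocessing.transformers` module, which is not shown in this notebook. The row filter can be sketched in plain pandas as follows; the function name `drop_sparse_rows` and the toy data are hypothetical, and only the 30% threshold comes from the text above.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df: pd.DataFrame, threshold: float = 0.3) -> pd.DataFrame:
    """Keep only rows whose share of missing values is at most `threshold`."""
    missing_share = df.isna().mean(axis=1)  # fraction of missing cells per row
    return df[missing_share <= threshold]


# Hypothetical toy data: row 1 misses 3 of 4 values, row 2 misses 1 of 4.
toy = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': ['x', None, 'z'],
    'c': [0.5, np.nan, np.nan],
    'd': [10, 20, 30],
})
filtered = drop_sparse_rows(toy)
print(filtered.index.tolist())
```

Only the row with 75% of its values missing is removed; the row with 25% missing stays.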

#### Filter specific rows

Filter specific rows using `RowsFilter`:

* drop rows that have missing values in all diagnosis attributes
* drop rows that have gender set to `unknown` (only a few rows meet this condition, so they can be dropped rather than handled separately)

In [6]:
indices_to_drop = set()

In [7]:
indices_to_drop.update(list(df[df.diag_1.isna() & df.diag_2.isna() & df.diag_3.isna()].index))

In [8]:
indices_to_drop.update(list(df[~df.gender.isin(['Male', 'Female'])].index))

In [9]:
initial_preprocessing_pipeline = make_pipeline(
    RowsNanFilter(),
    RowsFilter(indices_to_drop),
)

In [10]:
df = initial_preprocessing_pipeline.fit_transform(df)

RowsNanFilter transformation ended, took 0.33 seconds.
RowsFilter transformation ended, took 0.04 seconds.


### Train-test split

Next, we will split the data into train and test subsets before preprocessing (to avoid information leakage from the test data). The test subset will contain 20% of the samples.

In [11]:
X = df.drop('readmitted', axis=1)
y = transform_label(df['readmitted'])

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
X_train.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed
41511,128413584,23279256,Caucasian,Male,[70-80),,1,22,7,2,MC,Orthopedics,38,3,27,0,1,2,824.0,E888,401,7,,,No,No,No,No,No,No,Steady,No,No,Steady,No,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes
24079,81844290,94788,Caucasian,Female,[70-80),,1,1,7,4,,InternalMedicine,48,0,11,0,0,0,276.0,402,428,9,,Norm,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
91370,308424104,45904869,AfricanAmerican,Female,[40-50),,1,1,7,2,MD,,28,0,15,0,3,4,250.6,577,357,9,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,No,Yes
6237,31258956,18397782,Caucasian,Male,[80-90),,1,1,7,4,,,44,0,10,0,0,0,599.0,788,599,7,,,No,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes
72207,210690456,86230836,Caucasian,Female,[80-90),,1,3,7,2,MC,,65,2,23,0,0,1,453.0,578,280,9,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes


### Preprocessing pipeline

Data will be preprocessed via a *preprocessing pipeline*. We decided to use a pipeline to ensure reproducibility without hard-coding the preprocessing steps. This way, other dataframes with the same attributes can be preprocessed in exactly the same manner.

In data analysis, we identified the following steps to be done during preprocessing:

* drop columns with too many missing values (columns with more than 45% missing values)
* drop redundant columns (like `patient_nbr` or `encounter_id`)
* merge too small classes (less than 5% frequency) in categorical attributes into one 'other' class
* drop columns with low variability (e.g. more than 90% of samples have one same value in certain column)
* fill missing values in nominal attributes with most-frequent value
* fill missing values in numerical attributes with median value
* map ordinal features values into numbers
* create new features from analysis:
    * `visits_sum`
    * `number_medicaments_changes`
    * `number_medicaments`
* map diagnoses codes to diagnoses categories (according to analysis)
* encode categorical features into numbers (one-hot encoding)

We checked the numerical attributes carefully to see whether normalization and outlier removal should be performed. According to the analysis, the attributes do not differ much in scale, so normalization is not needed. We also checked for outliers, but did not find any extreme values that should be removed.

In the following sections, individual preprocessing steps will be described and prepared for the preprocessing pipeline.

#### Drop columns with too many missing values

First, columns with too many missing values (more than 45% of values are missing) should be dropped. For this step, we use the `ColumnsNanFilter` transformer.
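
The implementation of `ColumnsNanFilter` is not shown here. A minimal sketch of such a step, assuming it follows the usual fit/transform convention, could look like this. The class name `SparseColumnsDropper` and the toy data are hypothetical. The columns to drop are learned from the training data only and then reused for the test data, which matches the goal of avoiding information leakage.

```python
import numpy as np
import pandas as pd


class SparseColumnsDropper:
    """Drop columns whose share of missing values in the fitted data exceeds `threshold`."""

    def __init__(self, threshold=0.45):
        self.threshold = threshold

    def fit(self, X, y=None):
        missing_share = X.isna().mean()  # fraction of missing cells per column
        self.columns_to_drop_ = missing_share[missing_share > self.threshold].index.tolist()
        return self

    def transform(self, X):
        return X.drop(columns=self.columns_to_drop_)


# Hypothetical toy data: `weight` is missing in 2 of 3 training rows.
train = pd.DataFrame({
    'weight': [np.nan, np.nan, '[75-100)'],
    'race': ['Caucasian', None, 'Asian'],
})
test = pd.DataFrame({
    'weight': ['[50-75)', '[75-100)'],
    'race': ['Caucasian', 'Asian'],
})

dropper = SparseColumnsDropper().fit(train)
train_reduced = dropper.transform(train)
test_reduced = dropper.transform(test)  # `weight` is dropped here too
```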

#### Drop redundant columns

Some columns contain information that is redundant for our prediction and can be manually filtered out to avoid unnecessary bias. This will be done using the `ColumnsFilter` transformer.

The following columns will be dropped:
* `encounter_id`, `patient_nbr` do not contain information that can help prediction
* `payer_code` contains a lot of missing values (39%) and we consider it redundant (a more detailed explanation can be found in the analysis)

In [14]:
columns_to_drop = ['encounter_id', 'patient_nbr', 'payer_code']

#### Small categories reducing

According to data analysis, there are categorical attributes that contain one majority class and several very small classes. Those small classes can be merged together into one `other` class using `SmallCategoriesReducer`.

In [15]:
columns_to_reduce = ['discharge_disposition_id', 'admission_source_id']
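
How `SmallCategoriesReducer` works internally is not shown in this notebook. The sketch below illustrates the idea under the assumption that frequent categories are learned on the training data and everything else is replaced by `other`. The function names and toy values are hypothetical; the 5% threshold is the one listed among the preprocessing steps.

```python
import pandas as pd


def fit_frequent_categories(series, min_share=0.05):
    """Return the categories whose relative frequency is at least `min_share`."""
    shares = series.value_counts(normalize=True)
    return set(shares[shares >= min_share].index)


def reduce_small_categories(series, frequent, label='other'):
    """Replace every non-missing value outside `frequent` with `label`."""
    keep = series.isin(frequent) | series.isna()
    return series.where(keep, label)


# Hypothetical toy column: '6' covers only 2.5% of the training rows.
train = pd.Series(['1'] * 35 + ['7'] * 4 + ['6'])
test = pd.Series(['1', '9', '7'])  # '9' never appears in the training data

frequent = fit_frequent_categories(train)
train_reduced = reduce_small_categories(train, frequent)
test_reduced = reduce_small_categories(test, frequent)
print(test_reduced.tolist())
```

Categories unseen during fitting (such as `'9'` above) also fall into `other`, so the test data cannot introduce new one-hot columns later on.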

#### Drop low diversity columns

Some attributes are dominated by one value and therefore have low variability (e.g. `citoglipton` and `examide` contain only one value). Attributes whose most frequent value occurs in more than 90% of samples will be dropped using the `ColumnsValuesDiversityFilter` transformer.
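
As an illustration only, the low-diversity check can be approximated with `value_counts`. The helper `low_diversity_columns` and the toy data are hypothetical and not the project's `ColumnsValuesDiversityFilter`. Missing values are counted as a value of their own here, which is an assumption.

```python
import pandas as pd


def low_diversity_columns(df, max_share=0.9):
    """List columns whose most frequent value covers more than `max_share` of rows."""
    return [
        column for column in df.columns
        # value_counts sorts by frequency, so the first entry is the dominant value
        if df[column].value_counts(normalize=True, dropna=False).iloc[0] > max_share
    ]


# Hypothetical toy data with 20 rows per column.
toy = pd.DataFrame({
    'examide': ['No'] * 20,                                   # 100% 'No'
    'acarbose': ['No'] * 19 + ['Steady'],                     # 95% 'No'
    'insulin': ['No'] * 10 + ['Steady'] * 6 + ['Up'] * 4,     # 50% 'No'
})
to_drop = low_diversity_columns(toy)
toy_reduced = toy.drop(columns=to_drop)
```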

#### Fill missing values

Missing values will be filled with `MissingValuesImputer` transformer:
* for categorical attributes, *most frequent* value will be filled in,
* for numerical attributes, *median* value will be used.
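
The two strategies above can be sketched in plain pandas as shown below. This is not the project's `MissingValuesImputer`; the toy data and variable names are hypothetical. The fill values are computed on the training data only and then applied to both subsets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numerical and one categorical column.
train = pd.DataFrame({
    'num_lab_procedures': [10.0, 20.0, np.nan, 60.0],
    'race': ['Caucasian', 'Caucasian', None, 'Asian'],
})
test = pd.DataFrame({
    'num_lab_procedures': [np.nan, 5.0],
    'race': [None, 'Asian'],
})

numeric_cols = train.select_dtypes(include='number').columns
categorical_cols = train.columns.difference(numeric_cols)

# Median for numerical columns, most frequent value for categorical ones,
# both learned from the training data only.
fill_values = {
    **train[numeric_cols].median().to_dict(),
    **train[categorical_cols].mode().iloc[0].to_dict(),
}

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
```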

#### Mapping ordinal attributes

Ordinal attributes will be mapped using `ValueMapper`. Order of attribute values will be specified by appropriate numbers, e.g. in `age` case:
* `[0-10)` -> `0`
* `[10-20)` -> `1`
* ...

In [16]:
ordinal_mappings = {
    'age': {
        '[0-10)': 0, '[10-20)': 1, '[20-30)': 2, '[30-40)': 3, '[40-50)': 4,
        '[50-60)': 5, '[60-70)': 6, '[70-80)': 7, '[80-90)': 8, '[90-100)': 9
    }
}
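
Applying such a mapping could look like the sketch below. It uses plain `Series.map` instead of the project's `ValueMapper`, whose implementation is not shown here, and the sample values are hypothetical. `map` returns `NaN` for any value that is missing from the mapping, which also converts the column to floats.

```python
import pandas as pd

# Same age mapping as above, generated instead of typed out.
age_mapping = {f'[{10 * i}-{10 * (i + 1)})': i for i in range(10)}

# Hypothetical sample; 'unknown' is not a key of the mapping.
ages = pd.Series(['[0-10)', '[70-80)', '[90-100)', 'unknown'])
mapped = ages.map(age_mapping)
print(mapped.tolist())
```

Checking `mapped.isna().sum()` after the mapping is a cheap way to detect values the mapping does not cover.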

#### Feature engineering

In data analysis, we have found some new features that can be derived from data:
* number of all visits (`NumberVisitsCreator`)
* number of medicaments changes (`NumberMedicamentsChangesCreator`)
* number of medicaments (`NumberMedicamentsCreator`)
* diagnoses codes transformed into diagnoses categories (`DiagnosesCodesMapper`)

In [17]:
visits_cols = ['number_emergency', 'number_outpatient', 'number_inpatient']

In [18]:
medicaments_cols = [
    'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide','glimepiride', 
    'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
    'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide',
    'citoglipton', 'insulin','glyburide-metformin', 'glipizide-metformin',
    'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone'
]
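
The creator transformers are defined outside this notebook. A vectorized sketch of the three count features is given below, assuming that a medicament counts as taken when its value is not `No` and as changed when its value is `Up` or `Down`. These definitions are assumptions, and the toy data uses only three of the medicament columns.

```python
import pandas as pd

# Hypothetical toy data with a subset of the visit and medicament columns.
toy = pd.DataFrame({
    'number_outpatient': [0, 2],
    'number_emergency': [1, 0],
    'number_inpatient': [2, 1],
    'metformin': ['No', 'Up'],
    'glipizide': ['Steady', 'No'],
    'insulin': ['Steady', 'Down'],
})
visits = ['number_emergency', 'number_outpatient', 'number_inpatient']
drugs = ['metformin', 'glipizide', 'insulin']

features = pd.DataFrame({
    # total number of visits of any type
    'visits_sum': toy[visits].sum(axis=1),
    # assumed definition: dosage went up or down
    'number_medicaments_changes': toy[drugs].isin(['Up', 'Down']).sum(axis=1),
    # assumed definition: any value other than 'No'
    'number_medicaments': toy[drugs].ne('No').sum(axis=1),
})
print(features)
```

Column-wise operations like these avoid iterating over rows in Python, which tends to matter on a dataset of this size.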

In [19]:
diagnoses_cols = ['diag_1', 'diag_2', 'diag_3']
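
The exact mapping used by `DiagnosesCodesMapper` comes from the analysis and is not reproduced in this notebook. The sketch below shows one possible grouping of ICD-9 codes into the categories that appear among the final features. The code ranges and the function name `diagnosis_category` are assumptions, not the project's implementation.

```python
import pandas as pd


def diagnosis_category(code):
    """Map a single ICD-9 code to a coarse category (assumed ranges)."""
    if pd.isna(code):
        return None  # leave missing codes for the imputation step
    code = str(code)
    if code.startswith(('V', 'E')):
        return 'other'  # supplementary codes such as 'V27' or 'E888'
    number = int(float(code))  # '250.83' -> 250
    if number == 250:
        return 'diabetes'
    if 390 <= number <= 459 or number == 785:
        return 'circulatory'
    if 460 <= number <= 519 or number == 786:
        return 'respiratory'
    if 580 <= number <= 629 or number == 788:
        return 'genitourinary'
    return 'other'


codes = ['250.83', '428', 'V27', '786', '599.0', 'E888', '8.0']
categories = [diagnosis_category(code) for code in codes]
print(categories)
```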

#### One-hot encoding

All categorical features (which make up the majority of features) will be one-hot encoded at the end with `OneHotEncoder`, so that any machine learning algorithm can be used.
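
For illustration, one-hot encoding with a fixed set of output columns can be sketched with `pandas.get_dummies`. This is not the project's `OneHotEncoder`, and the toy data is hypothetical. The test data is re-indexed to the training columns so that both subsets end up with the same features.

```python
import pandas as pd

# Hypothetical toy data; `age` is already mapped to numbers and is not encoded.
train = pd.DataFrame({'gender': ['Female', 'Male', 'Female'], 'age': [7, 4, 8]})
test = pd.DataFrame({'gender': ['Male', 'Male'], 'age': [6, 2]})

columns_to_encode = ['gender']
train_encoded = pd.get_dummies(train, columns=columns_to_encode, dtype='uint8')

# Categories missing from the test data still get a column, filled with zeros.
test_encoded = (
    pd.get_dummies(test, columns=columns_to_encode, dtype='uint8')
    .reindex(columns=train_encoded.columns, fill_value=0)
)
print(test_encoded.columns.tolist())
```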

### Preprocessing using Pipeline

We define a pipeline that preprocesses the data in 4 main steps:

1. **Columns filtering**
    - filter redundant columns and columns with too many missing values
2. **Numerical features preprocessing**
    - fill in missing values using median value
    - create new `number of visits` feature
3. **Categorical features preprocessing**
    - map diagnosis codes to diagnosis categories
    - fill in missing values with most frequent value
    - merge small categories into `other` category
    - remove columns with too low diversity
    - map ordinal attributes into numbers
    - one-hot encoding
4. **Creating medicaments-based features**

In [20]:
# `np.object` was removed in NumPy 1.24; the 'object' string selects the same dtype
categorical_features = X_train.select_dtypes(include='object').columns.to_list()
numerical_features = X_train.select_dtypes(exclude='object').columns.to_list()
except_medicaments_cols = list(set(X_train.columns) - set(medicaments_cols))

In [21]:
preprocessing_pipeline = make_pipeline(
    ColumnsFilter(columns_to_drop),
    ColumnsNanFilter(),
    PandasFeatureUnion([
        ('numerical_features', make_pipeline(
            ColumnsFilter(categorical_features),
            MissingValuesImputer(strategy='median'),
            NumberVisitsCreator(visits_cols),
        )),
        ('categorical_features', make_pipeline(
            ColumnsFilter(numerical_features),
            DiagnosesCodesMapper(diagnoses_cols),
            MissingValuesImputer(strategy='most_frequent'),
            SmallCategoriesReducer(),
            ColumnsValuesDiversityFilter(0.9),
            ValueMapper(ordinal_mappings),
            OneHotEncoder(exclude_columns=['age'])
        )),
        ('medicaments_features', make_pipeline(
            ColumnsFilter(except_medicaments_cols),
            NumberMedicamentsChangesCreator(medicaments_cols),
            NumberMedicamentsCreator(medicaments_cols),
            ColumnsFilter(medicaments_cols)
        ))
    ])
)
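
`PandasFeatureUnion` is another project-specific class. Its role in the pipeline above can be sketched as follows, assuming every branch receives the same dataframe and the branch outputs are joined column-wise. The function `union_branches`, the branch functions and the toy data are all hypothetical.

```python
import pandas as pd


def union_branches(df, branches):
    """Apply every (name, function) branch to `df` and join the results column-wise."""
    parts = [branch(df) for _, branch in branches]
    return pd.concat(parts, axis=1)  # rows are aligned on the index


# Hypothetical toy data keeping the original row indices.
toy = pd.DataFrame(
    {'num_procedures': [3, 0], 'gender': ['Male', 'Female'], 'insulin': ['Up', 'No']},
    index=[41511, 24079],
)

result = union_branches(toy, [
    ('numerical', lambda d: d[['num_procedures']]),
    ('categorical', lambda d: pd.get_dummies(d[['gender']], dtype='uint8')),
    ('medicaments', lambda d: d[['insulin']].ne('No').sum(axis=1).to_frame('number_medicaments')),
])
print(result.columns.tolist())
```

Because the parts are aligned on the index, each branch has to return the same rows it received; a branch that drops or reorders rows would introduce missing values in the joined frame.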

In [22]:
X_train_prep = preprocessing_pipeline.fit_transform(X_train)
X_test_prep = preprocessing_pipeline.transform(X_test)

ColumnsFilter transformation ended, took 0.02 seconds.
ColumnsNanFilter transformation ended, took 0.02 seconds.
ColumnsFilter transformation ended, took 0.0 seconds.
MissingValuesImputer transformation ended, took 0.0 seconds.
NumberVisitsCreator transformation ended, took 0.0 seconds.
ColumnsFilter transformation ended, took 0.02 seconds.
DiagnosesCodesMapper transformation ended, took 5.19 seconds.
MissingValuesImputer transformation ended, took 0.25 seconds.
SmallCategoriesReducer transformation ended, took 2.7 seconds.
ColumnsValuesDiversityFilter transformation ended, took 0.01 seconds.
ValueMapper transformation ended, took 0.1 seconds.
OneHotEncoder transformation ended, took 0.46 seconds.
ColumnsFilter transformation ended, took 0.01 seconds.
NumberMedicamentsChangesCreator transformation ended, took 46.68 seconds.
NumberMedicamentsCreator transformation ended, took 41.97 seconds.
ColumnsFilter transformation ended, took 0.0 seconds.
ColumnsFilter transformation ended, took 0.

How many features are available after preprocessing?

In [23]:
X_train_prep.shape[1], X_test_prep.shape[1]

(53, 53)

Let's check the data after preprocessing:

In [24]:
X_train_prep

Unnamed: 0,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,visits_sum,age,race_AfricanAmerican,race_Caucasian,race_other,gender_Female,gender_Male,admission_type_id_1,admission_type_id_other,discharge_disposition_id_1,discharge_disposition_id_other,admission_source_id_1,admission_source_id_other,A1Cresult_>8,A1Cresult_None,A1Cresult_other,metformin_No,metformin_other,glipizide_No,glipizide_other,glyburide_No,glyburide_other,insulin_No,insulin_other,change_Ch,change_No,diabetesMed_No,diabetesMed_Yes,diag_1_category_circulatory,diag_1_category_diabetes,diag_1_category_genitourinary,diag_1_category_other,diag_1_category_respiratory,diag_2_category_circulatory,diag_2_category_diabetes,diag_2_category_genitourinary,diag_2_category_other,diag_2_category_respiratory,diag_3_category_circulatory,diag_3_category_diabetes,diag_3_category_genitourinary,diag_3_category_other,diag_3_category_respiratory,number_medicaments_changes,number_medicaments
41511,2,38,3,27,0,1,2,7,3,7.0,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,3
24079,4,48,0,11,0,0,0,9,0,7.0,0,1,0,1,0,1,0,1,0,0,1,0,0,1,1,0,1,0,1,0,1,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0
91370,2,28,0,15,0,3,4,9,7,4.0,1,0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1
6237,4,44,0,10,0,0,0,7,0,8.0,0,1,0,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1
72207,2,65,2,23,0,0,1,9,1,8.0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6265,2,35,0,12,0,0,0,9,0,7.0,0,1,0,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0
54887,4,42,2,18,0,0,0,9,0,6.0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,2
76822,4,30,1,16,0,0,2,6,2,,0,1,0,1,0,0,1,0,1,1,0,0,1,0,1,0,1,0,0,1,0,1,1,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,2
860,12,77,2,21,0,0,0,9,0,6.0,0,1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,2


In the model selection phase, different types of machine learning algorithms will be tried. Thus, all attributes were transformed into numbers in case any of the algorithms requires it.

In [25]:
X_train_prep.dtypes.value_counts()

uint8      41
int64      11
float64     1
dtype: int64

### Save preprocessed data

After preprocessing, data will be stored for further usage:

In [26]:
%%time
X_train_prep.to_csv('../data/X_train.csv', index=False)
X_test_prep.to_csv('../data/X_test.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)

CPU times: user 5.71 s, sys: 8.04 ms, total: 5.71 s
Wall time: 5.71 s


### Conclusion

This Jupyter notebook summarizes the preprocessing phase. Data were preprocessed using a preprocessing pipeline. The pipeline consists of two parts:
* resolving the problems in the data identified during data analysis,
* feature engineering to create new features.