# Preprocessing and Standardization

## Pipelines

We built a `preprocessing_pipeline` from the transformers described in the previous sections. It implements the following steps:

* `DateTimeHourTransformer`: Extracts the time of day from the `time` column and encodes the hour and minute as circular features (`hour_sin`, `hour_cos`)
* `DropColumnsTransformer`: Drops all columns for the parameters `activity` and `carbs`, as well as the `time` column
* `FillPropertyNaNsTransformer`: For `bg`, `insulin`, `cals` and `hr`, interpolates (limit 3), forward and backward fills (limit 1), and fills the remaining gaps with the median; missing `steps` values are set to zero
* `PropertyOutlierTransformer`: Finds outliers in `insulin` (negative values) and replaces them with zero
* `ExtractColumnsTransformer`: Extracts all specified columns, here:
    * hour_sin, hour_cos
    * bg-0:00 - bg-2:00
    * insulin-0:00 - insulin-2:00
    * cals-0:00 - cals-2:00
    * hr-0:00 - hr-2:00
    * steps-0:00 - steps-2:00
    * p_num
    * bg+1:00 (target)
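
The circular time encoding from the first step can be sketched as follows. This is a minimal illustration, not the project's `DateTimeHourTransformer`: the function name `encode_time_of_day` and the `HH:MM:SS` input format are assumptions. The formula maps the time of day onto a 24-hour circle, which is consistent with the `hour_sin` and `hour_cos` values in the transformed output (for example, 06:10 gives roughly 0.999048 and -0.043619).

```python
import numpy as np
import pandas as pd


def encode_time_of_day(times: pd.Series) -> pd.DataFrame:
    """Encode clock times as a point on the unit circle (hypothetical helper)."""
    # Assumption: times are strings such as "06:10:00"
    parsed = pd.to_datetime(times, format="%H:%M:%S")
    # Fractional hour of the day, e.g. 06:10 -> 6.1667
    hours = parsed.dt.hour + parsed.dt.minute / 60.0
    # One full turn of the circle corresponds to 24 hours
    angle = 2.0 * np.pi * hours / 24.0
    return pd.DataFrame(
        {"hour_sin": np.sin(angle), "hour_cos": np.cos(angle)},
        index=times.index,
    )


encoded = encode_time_of_day(pd.Series(["06:10:00", "23:55:00", "00:00:00"]))
```

With this encoding, 23:55 and 00:00 end up close together in feature space, which a plain 0 to 23 hour value would not achieve.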


```python
preprocessing_pipeline = Pipeline(steps=[
    ('date_time', DateTimeHourTransformer(time_column='time', result_column='hour', type='sin_cos', drop_time_column=True)),
    ('drop_parameter_cols', DropColumnsTransformer(starts_with=['activity', 'carbs'])),
    ('drop_others', DropColumnsTransformer(columns_to_delete=['time'])),
    ('fill_properties_nan_bg', FillPropertyNaNsTransformer(parameter='bg', how=['interpolate', 'median'], interpolate=3, ffill=1, bfill=1, precision=1)),
    ('fill_properties_nan_insulin', FillPropertyNaNsTransformer(parameter='insulin', how=['interpolate', 'median'], interpolate=3, ffill=1, bfill=1, precision=4)),
    ('fill_properties_nan_cals', FillPropertyNaNsTransformer(parameter='cals', how=['interpolate', 'median'], interpolate=3, ffill=1, bfill=1, precision=1)),
    ('fill_properties_nan_hr', FillPropertyNaNsTransformer(parameter='hr', how=['interpolate', 'median'], interpolate=3, ffill=1, bfill=1, precision=1)),
    ('fill_properties_nan_steps', FillPropertyNaNsTransformer(parameter='steps', how=['zero'], interpolate=3, ffill=1, bfill=1, precision=1)),
    ('drop_outliers', PropertyOutlierTransformer(parameter='insulin', filter_function=lambda x: x < 0, fill_strategy='zero')),
    ('extract_features', ExtractColumnsTransformer(columns_to_extract=columns_to_extract)),
])
```
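
The fill order used by the `fill_properties_nan_*` steps (interpolate, then forward and backward fill, then median) might look roughly like the sketch below. It is an assumption about how `FillPropertyNaNsTransformer` behaves, not its actual code: the function `fill_parameter_nans`, its argument names, and the choice to interpolate along each row (across the lagged columns of one parameter) are all hypothetical.

```python
import numpy as np
import pandas as pd


def fill_parameter_nans(df, prefix, interpolate_limit=3, fill_limit=1, precision=1):
    """Fill gaps in the lagged columns of one parameter (hypothetical helper)."""
    # Assumption: lagged columns are named like "bg-0:05", ordered oldest to newest
    cols = [c for c in df.columns if c.startswith(prefix + "-")]
    block = df[cols].astype(float)
    # 1) Linear interpolation along each row, at most `interpolate_limit` consecutive gaps
    block = block.interpolate(axis=1, limit=interpolate_limit)
    # 2) Forward fill, then backward fill, each bridging at most `fill_limit` gaps
    block = block.ffill(axis=1, limit=fill_limit).bfill(axis=1, limit=fill_limit)
    # 3) Anything still missing gets the median of its column
    block = block.fillna(block.median())
    out = df.copy()
    out[cols] = block.round(precision)
    return out


demo = pd.DataFrame({
    "bg-0:10": [5.0, np.nan, 7.0],
    "bg-0:05": [np.nan, np.nan, 7.5],
    "bg-0:00": [6.0, np.nan, 8.0],
    "p_num": ["p01", "p01", "p02"],
})
filled = fill_parameter_nans(demo, prefix="bg")
```

In the demo, the single gap in the first row is interpolated, while the second row has no values to interpolate from and falls through to the column medians.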

The `standardization_pipeline` contains:

* `GetDummiesTransformer`: One-hot encodes the `p_num` column
* `StandardScalerTransformer`: Standardizes the feature columns (the target column is excluded)

```python
standardization_pipeline = Pipeline(steps=[
    ('get_dummies', GetDummiesTransformer(columns=['hour', 'p_num'])),
    ('standard_scaler', StandardScalerTransformer(columns=columns_to_extract[3:-1]))
])
```
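
As a rough illustration of what these two steps do to the data, the sketch below one-hot encodes a categorical column and scales selected numeric columns to zero mean and unit variance using plain pandas. The function `one_hot_and_standardize` is hypothetical and is not the project's `GetDummiesTransformer` or `StandardScalerTransformer`; it uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import pandas as pd


def one_hot_and_standardize(df, categorical, numeric):
    """One-hot encode `categorical` and z-score `numeric` columns (hypothetical helper)."""
    # Each category becomes its own 0/1 column, e.g. p_num -> p_num_p01, p_num_p02
    encoded = pd.get_dummies(df, columns=categorical, dtype=int)
    means = encoded[numeric].mean()
    # Guard against division by zero for constant columns
    stds = encoded[numeric].std(ddof=0).replace(0, 1.0)
    encoded[numeric] = (encoded[numeric] - means) / stds
    return encoded


sample = pd.DataFrame({
    "p_num": ["p01", "p01", "p02"],
    "bg-0:00": [4.0, 6.0, 8.0],
    "bg+1:00": [5.0, 7.0, 9.0],  # target column, left unscaled
})
scaled = one_hot_and_standardize(sample, categorical=["p_num"], numeric=["bg-0:00"])
```

The target column is not listed in `numeric`, so it keeps its original scale.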

```python
import pandas as pd
from src.features.helpers.load_data import load_data
from src.models.model_2.model.pipelines_2h import pipeline

train_data, augmented_data, test_data = load_data('2_00h')

# Fit and transform the original and augmented training data together
all_train_data_transformed = pipeline.fit_transform(pd.concat([train_data, augmented_data]))

# train_data comes first in the concatenation, so (assuming the pipeline keeps
# row order and count) the first len(train_data) rows are the original data
n_train = len(train_data)
train_part = all_train_data_transformed.iloc[:n_train]
augmented_part = all_train_data_transformed.iloc[n_train:]

X_train, y_train = train_part.drop(columns=['bg+1:00']), train_part['bg+1:00']
X_augmented, y_augmented = augmented_part.drop(columns=['bg+1:00']), augmented_part['bg+1:00']

all_train_data_transformed.head()
```

| id | hour_sin | hour_cos | bg-2:00 | bg-1:55 | bg-1:50 | bg-1:45 | bg-1:40 | bg-1:35 | bg-1:30 | bg-1:25 | ... | p_num_p10 | p_num_p11 | p_num_p12 | p_num_p15 | p_num_p16 | p_num_p18 | p_num_p19 | p_num_p21 | p_num_p22 | p_num_p24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p01_0 | 0.999048 | -0.043619 | 2.815442 | 2.915458 | 3.044933 | 3.141255 | 3.176766 | 3.176801 | 3.208038 | 3.1787 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p01_1 | 0.994056 | -0.108867 | 3.138183 | 3.173915 | 3.174163 | 3.205851 | 3.176766 | 3.112147 | 3.078782 | 3.01699 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p01_2 | 0.984808 | -0.173648 | 3.202732 | 3.173915 | 3.109548 | 3.076659 | 3.015126 | 2.918186 | 2.852584 | 2.887622 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p01_3 | 0.971342 | -0.237686 | 3.073635 | 3.012379 | 2.915704 | 2.850574 | 2.885814 | 2.885859 | 2.917212 | 2.887622 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| p01_4 | 0.953717 | -0.300706 | 2.847716 | 2.883151 | 2.883396 | 2.915169 | 2.885814 | 2.885859 | 2.852584 | 2.725913 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
