# Modelling I

In this notebook, we will:
1. Load the preprocessed data
2. Preprocess the data
3. Engineer features (day phase)
4. Select and tune a model
5. Predict on the test set and write the submission file

In [1]:
import os
import pandas as pd

data_file = os.path.join('..', '..', 'data', 'interim', 'all_train.csv')
test_file = os.path.join('..', '..', 'data', 'raw', 'test.csv')
df = pd.read_csv(data_file)
df.head()

Unnamed: 0.1,Unnamed: 0,p_num,days_since_start,time,initial_resolution,bg,insulin,carbs,hr,steps,cals,activity,bg+1:00
0,2020-01-01 00:15:00,p01,0,00:15:00,15min,,0.0083,,,,,,
1,2020-01-01 00:20:00,p01,0,00:20:00,15min,,0.0083,,,,,,
2,2020-01-01 00:25:00,p01,0,00:25:00,15min,9.6,0.0083,,,,,,
3,2020-01-01 00:30:00,p01,0,00:30:00,15min,,0.0083,,,,,,
4,2020-01-01 00:35:00,p01,0,00:35:00,15min,,0.0083,,,,,,


# Data Preprocessing

## 1. Select only the time, bg (feature) and bg+1:00 (target) columns from the dataframe

In [2]:
# Copy the selection so later column assignments do not act on a view of the original frame
df = df[['time', 'bg', 'bg+1:00']].copy()
df.head()

Unnamed: 0,time,bg,bg+1:00
0,00:15:00,,
1,00:20:00,,
2,00:25:00,9.6,
3,00:30:00,,
4,00:35:00,,


# Clean Data

## Interpolate missing values in the bg column and drop rows with a missing target

In [3]:
df['bg'] = df['bg'].interpolate(method='linear').ffill().bfill()
df = df.dropna()
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 177024 entries, 71 to 235126
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   time     177024 non-null  object 
 1   bg       177024 non-null  float64
 2   bg+1:00  177024 non-null  float64
dtypes: float64(2), object(1)
memory usage: 5.4+ MB
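
To make the cleaning step concrete, here is a minimal sketch of the same `interpolate` / `ffill` / `bfill` chain on a toy series. The values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an interior gap, and a trailing gap (illustrative values).
bg = pd.Series([np.nan, 9.6, np.nan, np.nan, 8.4, np.nan])

# Linear interpolation fills the interior gap between the two known readings.
interpolated = bg.interpolate(method='linear')

# ffill() and bfill() cover whatever is still missing at the edges of the series.
filled = interpolated.ffill().bfill()
print(filled.tolist())
```

The interior gap is filled with evenly spaced values between 9.6 and 8.4, while the edges take the nearest known reading.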


# Feature Engineering

## 1. Create Day Phase feature

In [4]:
from src.features.transformers import DayPhaseTransformer

day_phase_transformer = DayPhaseTransformer(time_column='time', time_format='%H:%M:%S', result_column='day_phase',
                                            drop_time_column=True)
df = day_phase_transformer.fit_transform(X=df)
df.head()

Unnamed: 0,day_phase,bg,bg+1:00
71,morning,15.1,13.4
74,morning,14.4,12.8
77,morning,13.9,15.5
80,morning,13.8,14.8
83,morning,13.4,12.7
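
`DayPhaseTransformer` is project code that is not shown in this notebook. As a rough idea of what such a mapping might look like, the sketch below uses a hypothetical function `to_day_phase` with assumed hour boundaries; the real transformer may use different boundaries and a different interface.

```python
import pandas as pd

def to_day_phase(time_str: str, time_format: str = '%H:%M:%S') -> str:
    """Map a time string to a day phase. The hour boundaries are assumptions."""
    hour = pd.to_datetime(time_str, format=time_format).hour
    if 6 <= hour < 11:
        return 'morning'
    if 11 <= hour < 14:
        return 'noon'
    if 14 <= hour < 18:
        return 'afternoon'
    if 18 <= hour < 22:
        return 'evening'
    return 'night'

# Derive the feature, then drop the raw time column (mirrors drop_time_column=True).
sample = pd.DataFrame({'time': ['06:45:00', '11:25:00', '14:45:00', '04:30:00']})
sample['day_phase'] = sample['time'].map(to_day_phase)
sample = sample.drop(columns=['time'])
print(sample)
```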


# Model selection

## 1. Split the data into train and test sets

In [5]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['bg+1:00'])
y = df['bg+1:00']

X = pd.get_dummies(X, columns=['day_phase'], drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
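
As a small illustration of the encoding used above, the sketch below applies `pd.get_dummies` with `drop_first=True` to a toy frame. The first category in sorted order (here `afternoon`) gets no column of its own and becomes the baseline, represented by all dummy columns being false.

```python
import pandas as pd

# Toy frame with three day phases (illustrative values).
toy = pd.DataFrame({
    'bg': [9.6, 4.6, 8.0],
    'day_phase': ['morning', 'noon', 'afternoon'],
})

# drop_first=True removes the dummy column of the first category in sorted order.
encoded = pd.get_dummies(toy, columns=['day_phase'], drop_first=True)
print(encoded.columns.tolist())
```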

## 2. Use LazyPredict

In [6]:
from notebooks.helpers.LazyPredict import get_lazy_regressor

reg = get_lazy_regressor()
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
models

 97%|█████████▋| 38/39 [13:41<00:45, 45.07s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000400 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 255
[LightGBM] [Info] Number of data points in the train set: 141619, number of used features: 1
[LightGBM] [Info] Start training from score 8.276012


100%|██████████| 39/39 [13:41<00:00, 21.07s/it]


Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
GradientBoostingRegressor,0.5,0.5,2.14,3.39
LGBMRegressor,0.5,0.5,2.14,0.61
MLPRegressor,0.5,0.5,2.14,1.89
XGBRegressor,0.49,0.49,2.14,0.29
HistGradientBoostingRegressor,0.49,0.49,2.14,0.73
BaggingRegressor,0.49,0.49,2.14,0.39
ExtraTreesRegressor,0.49,0.49,2.15,2.26
ExtraTreeRegressor,0.49,0.49,2.15,0.03
DecisionTreeRegressor,0.49,0.49,2.15,0.05
SVR,0.49,0.49,2.16,468.34


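The top-ranked model is **GradientBoostingRegressor** with an **R2 score of 0.50** (RMSE 2.14). The leading models are nearly tied: LGBMRegressor and MLPRegressor report the same scores.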
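
The table reports identical values for R-Squared and Adjusted R-Squared. A short sketch of the standard adjusted R-squared formula shows why: with tens of thousands of rows and only a handful of features, the correction is negligible. The sketch assumes the adjusted score is computed on the held-out rows and that there are five feature columns (bg plus four day-phase dummies).

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Standard adjusted R-squared: penalises R-squared for the number of features."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# 177024 rows after cleaning, 141619 of them in the training split.
n_test = 177024 - 141619

# Assumed feature count: bg plus four day-phase dummy columns.
adj = adjusted_r2(0.50, n_test, 5)
print(round(adj, 4))
```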

## 3. Hyperparameter tuning

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 1]
}

gbr = GradientBoostingRegressor()
grid_search = GridSearchCV(gbr, params, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X=X_train, y=y_train)
grid_search.best_params_


Fitting 5 folds for each of 27 candidates, totalling 135 fits


{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}

The best hyperparameters are:

- n_estimators = 200
- max_depth = 3
- learning_rate = 0.1
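
As a side note, `GridSearchCV` refits the best candidate on the full training data by default (`refit=True`), so the tuned model is also available as `best_estimator_` without retyping the parameters. A minimal sketch on synthetic stand-in data (not the notebook's `X_train` / `y_train`, and with a smaller grid than the one used above):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 2))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    {'n_estimators': [10, 20], 'max_depth': [2, 3]},
    cv=3,
)
search.fit(X_toy, y_toy)

# With refit=True (the default), the best candidate is refitted on all supplied data.
best_model = search.best_estimator_
toy_pred = best_model.predict(X_toy)
print(search.best_params_)
```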

In [9]:
# train the model with best hyperparameters
gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
gbr.fit(X=X_train, y=y_train)
y_pred = gbr.predict(X=X_test)

## 4. Evaluate the model

In [10]:
# Evaluate the model
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)

0.5031366396998774
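
For reference, the sketch below computes R2 and RMSE by hand with NumPy, using the usual definitions (residual sum of squares against total sum of squares). The input values are a toy example and `r2_and_rmse` is a hypothetical helper, not part of the project.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R2, RMSE) from the residual and total sums of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return r2, rmse

# Toy targets and predictions (illustrative values).
r2, rmse = r2_and_rmse([13.4, 12.8, 15.5, 14.8], [13.0, 13.0, 15.0, 15.0])
print(r2, rmse)
```

An R2 of about 0.50 therefore means the model explains roughly half of the variance of bg+1:00 on the held-out rows.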

# Prepare test results

In [11]:
# Load the test data
test_data = pd.read_csv(test_file, index_col=0)
test_data.head()

id,p_num,time,bg-5:55,bg-5:50,bg-5:45,bg-5:40,bg-5:35,bg-5:30,bg-5:25,bg-5:20,...,activity-0:45,activity-0:40,activity-0:35,activity-0:30,activity-0:25,activity-0:20,activity-0:15,activity-0:10,activity-0:05,activity-0:00
p01_8459,p01,06:45:00,,9.2,,,10.2,,,10.3,...,,,,,,,,,,
p01_8460,p01,11:25:00,,,9.9,,,9.4,,,...,,,,,,,,Walk,Walk,Walk
p01_8461,p01,14:45:00,,5.5,,,5.5,,,5.2,...,,,,,,,,,,
p01_8462,p01,04:30:00,,3.4,,,3.9,,,4.7,...,,,,,,,,,,
p01_8463,p01,04:20:00,,,8.3,,,10.0,,,...,,,,,,,,,,


In [12]:
test_data = test_data[['time', 'bg-0:00']]
test_data = day_phase_transformer.transform(test_data)
test_data.head()

id,day_phase,bg-0:00
p01_8459,morning,9.6
p01_8460,noon,4.6
p01_8461,afternoon,8.0
p01_8462,night,9.9
p01_8463,night,5.3


## Fill missing values in the bg-0:00 column with the median

In [13]:
test_data.isna().sum()

day_phase      0
bg-0:00      132
dtype: int64

In [14]:
test_data['bg-0:00'] = test_data['bg-0:00'].fillna(test_data['bg-0:00'].median())

In [15]:
# Predict the bg+1:00 values
test_data.rename(columns={'bg-0:00': 'bg'}, inplace=True)
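# Encode day_phase, then align to the training columns so names and order match the fitted model
test_data = pd.get_dummies(test_data, columns=['day_phase']).reindex(columns=X.columns, fill_value=False)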
test_data['bg+1:00'] = gbr.predict(test_data)
test_data.head()

id,bg,day_phase_evening,day_phase_morning,day_phase_night,day_phase_noon,bg+1:00
p01_8459,9.6,False,True,False,False,9.03
p01_8460,4.6,False,False,False,True,6.57
p01_8461,8.0,False,False,False,False,7.76
p01_8462,9.9,False,False,True,False,9.42
p01_8463,5.3,False,False,True,False,5.98
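
Separate `pd.get_dummies` calls on training and test data yield matching columns only when both frames contain the same categories. The sketch below, using toy frames, shows one way to align test columns to the training columns with `reindex`. Here the test frame is encoded without `drop_first`, so that the dropped baseline is decided by the training data alone.

```python
import pandas as pd

# Toy training frame: categories afternoon, morning, noon (illustrative values).
train_raw = pd.DataFrame({'bg': [9.6, 4.6, 8.0],
                          'day_phase': ['morning', 'noon', 'afternoon']})
train_X = pd.get_dummies(train_raw, columns=['day_phase'], drop_first=True)

# Toy test frame: lacks 'noon' and 'afternoon', and contains an unseen 'night'.
test_raw = pd.DataFrame({'bg': [9.9, 5.3],
                         'day_phase': ['night', 'morning']})

# Encode every category, then keep exactly the training columns in training order.
# Missing columns are filled with False; columns unknown to training are dropped.
aligned = (pd.get_dummies(test_raw, columns=['day_phase'])
           .reindex(columns=train_X.columns, fill_value=False))
print(aligned.columns.tolist())
```

Note that a category unseen during training (here `night`) ends up with all dummy columns false, so the model treats it like the baseline category.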


## Prepare the submission file

In [16]:
submission = pd.DataFrame(test_data['bg+1:00'])
submission

id,bg+1:00
p01_8459,9.03
p01_8460,6.57
p01_8461,7.76
p01_8462,9.42
p01_8463,5.98
...,...
p24_256,6.63
p24_257,9.58
p24_258,6.84
p24_259,8.40


In [17]:
submission.to_csv(os.path.join('..', '..', 'data', 'processed', 'modelling_I_submission.csv'))