# Modelling the most important features

## Model Selection

In [1]:
import os
import pandas as pd

train_data_file = os.path.join('..', '..', '..', '..', 'data', 'raw', 'train.csv')
train_data = pd.read_csv(train_data_file, index_col=0, low_memory=False)

validation_data_file = os.path.join('..', '..', '..', '..', 'data', 'interim', 'all_test_4_55h.csv')
validation_data = pd.read_csv(validation_data_file, index_col=0, low_memory=False)

additional_train_data_file = os.path.join('..', '..', '..', '..', 'data', 'interim', 'all_test_3h.csv')
additional_train_data = pd.read_csv(additional_train_data_file, index_col=0, low_memory=False)
# drop additional rows whose ids also appear in the validation data (avoids leakage)
additional_train_data = additional_train_data[~additional_train_data.index.isin(validation_data.index.unique())]

# merge train and additional data
train_data = pd.concat([train_data, additional_train_data], axis=0)

# train only on patients that have to be predicted (those present in the test set)
test_data_file = os.path.join('..', '..', '..', '..', 'data', 'raw', 'test.csv')
test_data = pd.read_csv(test_data_file, index_col=0, low_memory=False)

unique_patients = test_data['p_num'].unique()
train_data = train_data[train_data['p_num'].isin(unique_patients)]
validation_data = validation_data[validation_data['p_num'].isin(unique_patients)]
test_data = test_data[test_data['p_num'].isin(unique_patients)]
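The two filters in the cell above can be checked on a toy example. This is a minimal sketch with made-up ids and values (the frames `validation`, `additional` and `test` are hypothetical stand-ins for the real files): rows whose id also appears in the validation set are dropped first, then only patients that occur in the test set are kept.

```python
import pandas as pd

# Toy stand-ins for the real files; ids and values are made up for illustration.
validation = pd.DataFrame(
    {"p_num": ["p01", "p02"], "bg+1:00": [6.1, 7.4]},
    index=["p01_0", "p02_0"],
)
additional = pd.DataFrame(
    {"p_num": ["p01", "p01", "p03"], "bg+1:00": [6.1, 5.9, 8.0]},
    index=["p01_0", "p01_1", "p03_0"],
)
test = pd.DataFrame({"p_num": ["p01", "p02"]}, index=["p01_9", "p02_9"])

# Drop rows whose id also appears in the validation set (prevents leakage).
additional = additional[~additional.index.isin(validation.index.unique())]

# Keep only patients that appear in the test set.
patients = test["p_num"].unique()
additional = additional[additional["p_num"].isin(patients)]

print(additional.index.tolist())  # ['p01_1']
```

Row `p01_0` is removed because it is shared with the validation set, and `p03_0` is removed because patient `p03` is not in the test set.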

## Loop through unique patients

In [2]:
from notebooks.helpers.LazyPredict import get_lazy_regressor
from pipelines import pipeline

lazy_predict_results = {}

for patient in unique_patients:
    patient_train_data = train_data[train_data['p_num'] == patient]
    patient_validation_data = validation_data[validation_data['p_num'] == patient]
    
    patient_train_data_transformed = pipeline.fit_transform(patient_train_data)
    patient_validation_data_transformed = pipeline.transform(patient_validation_data)
    
    X_train = patient_train_data_transformed.drop(columns=['bg+1:00'])
    y_train = patient_train_data_transformed['bg+1:00']
    
    X_test = patient_validation_data_transformed.drop(columns=['bg+1:00'])
    y_test = patient_validation_data_transformed['bg+1:00']

    reg = get_lazy_regressor(exclude=['SVR'])
    models = reg.fit(X_train, X_test, y_train, y_test)
    lazy_predict_results[patient] = models

 97%|█████████▋| 37/38 [01:56<00:00,  1.97it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004449 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 43979
[LightGBM] [Info] Number of data points in the train set: 13600, number of used features: 208
[LightGBM] [Info] Start training from score 8.854422


100%|██████████| 38/38 [01:57<00:00,  3.10s/it]
 97%|█████████▋| 37/38 [04:21<00:01,  1.00s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010555 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 46745
[LightGBM] [Info] Number of data points in the train set: 31000, number of used features: 208
[LightGBM] [Info] Start training from score 9.377212


100%|██████████| 38/38 [04:22<00:00,  6.91s/it]
 97%|█████████▋| 37/38 [04:01<00:01,  1.04s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.009008 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 46184
[LightGBM] [Info] Number of data points in the train set: 30505, number of used features: 208
[LightGBM] [Info] Start training from score 7.783695


100%|██████████| 38/38 [04:02<00:00,  6.39s/it]
 97%|█████████▋| 37/38 [02:03<00:00,  1.43it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006645 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 43485
[LightGBM] [Info] Number of data points in the train set: 14153, number of used features: 208
[LightGBM] [Info] Start training from score 8.269870


100%|██████████| 38/38 [02:05<00:00,  3.30s/it]
 97%|█████████▋| 37/38 [01:54<00:00,  1.70it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007661 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 42631
[LightGBM] [Info] Number of data points in the train set: 13315, number of used features: 208
[LightGBM] [Info] Start training from score 9.094037


100%|██████████| 38/38 [01:55<00:00,  3.05s/it]
 97%|█████████▋| 37/38 [05:07<00:00,  1.01it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006572 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 44019
[LightGBM] [Info] Number of data points in the train set: 29515, number of used features: 208
[LightGBM] [Info] Start training from score 6.490273


100%|██████████| 38/38 [05:08<00:00,  8.13s/it]
 97%|█████████▋| 37/38 [20:11<00:01,  1.03s/it]   

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010340 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 45589
[LightGBM] [Info] Number of data points in the train set: 29531, number of used features: 208
[LightGBM] [Info] Start training from score 9.256592


100%|██████████| 38/38 [20:12<00:00, 31.92s/it]
 97%|█████████▋| 37/38 [07:42<00:01,  1.67s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011509 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 45668
[LightGBM] [Info] Number of data points in the train set: 31754, number of used features: 208
[LightGBM] [Info] Start training from score 7.999423


100%|██████████| 38/38 [07:45<00:00, 12.25s/it]
 97%|█████████▋| 37/38 [00:59<00:00,  2.39it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007726 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 31316
[LightGBM] [Info] Number of data points in the train set: 6503, number of used features: 208
[LightGBM] [Info] Start training from score 8.169626


100%|██████████| 38/38 [01:00<00:00,  1.58s/it]
 97%|█████████▋| 37/38 [01:06<00:00,  1.96it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006659 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 32586
[LightGBM] [Info] Number of data points in the train set: 5670, number of used features: 208
[LightGBM] [Info] Start training from score 8.243407


100%|██████████| 38/38 [01:07<00:00,  1.79s/it]
 97%|█████████▋| 37/38 [01:01<00:00,  2.06it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006492 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 34368
[LightGBM] [Info] Number of data points in the train set: 5219, number of used features: 208
[LightGBM] [Info] Start training from score 10.117103


100%|██████████| 38/38 [01:02<00:00,  1.65s/it]
 97%|█████████▋| 37/38 [01:07<00:00,  2.03it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006118 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 41924
[LightGBM] [Info] Number of data points in the train set: 5519, number of used features: 208
[LightGBM] [Info] Start training from score 8.523470


100%|██████████| 38/38 [01:08<00:00,  1.81s/it]
 97%|█████████▋| 37/38 [00:55<00:00,  2.28it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.008051 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 40659
[LightGBM] [Info] Number of data points in the train set: 4991, number of used features: 208
[LightGBM] [Info] Start training from score 10.827031


100%|██████████| 38/38 [00:56<00:00,  1.49s/it]
 97%|█████████▋| 37/38 [00:48<00:00,  2.48it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008044 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 38120
[LightGBM] [Info] Number of data points in the train set: 4462, number of used features: 208
[LightGBM] [Info] Start training from score 8.097407


100%|██████████| 38/38 [00:49<00:00,  1.31s/it]
 97%|█████████▋| 37/38 [01:14<00:00,  1.83it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007982 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 43331
[LightGBM] [Info] Number of data points in the train set: 5957, number of used features: 208
[LightGBM] [Info] Start training from score 7.978553


100%|██████████| 38/38 [01:16<00:00,  2.00s/it]
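The helper `get_lazy_regressor` is a project function whose implementation is not shown here. As a rough illustration of what such a comparison does for each patient, the sketch below fits a few scikit-learn regressors on a train split and scores them on a hold-out split. The function name `compare_regressors`, the choice of models and the synthetic data are assumptions made for the example, not the notebook's actual code, and the table is sorted by RMSE here rather than by adjusted R-Squared.

```python
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(X_train, X_test, y_train, y_test):
    """Fit a few regressors and return their hold-out scores, lowest RMSE first."""
    candidates = {
        "LinearRegression": LinearRegression(),
        "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
        "ExtraTreesRegressor": ExtraTreesRegressor(n_estimators=50, random_state=0),
    }
    rows = []
    for name, model in candidates.items():
        start = time.time()
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        rows.append(
            {
                "Model": name,
                "R-Squared": r2_score(y_test, predictions),
                "RMSE": float(np.sqrt(mean_squared_error(y_test, predictions))),
                "Time Taken": time.time() - start,
            }
        )
    return pd.DataFrame(rows).set_index("Model").sort_values("RMSE")


# Synthetic data, only to make the sketch runnable.
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
scores = compare_regressors(X[:200], X[200:], y[:200], y[200:])
print(scores)
```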


In [4]:
lazy_predict_results.keys()

dict_keys(['p01', 'p02', 'p04', 'p05', 'p06', 'p10', 'p11', 'p12', 'p15', 'p16', 'p18', 'p19', 'p21', 'p22', 'p24'])

In [5]:
lazy_predict_results.values()

dict_values([(                                                              Adjusted R-Squared  \
Model                                                                              
ExtraTreesRegressor                                                         0.57   
LGBMRegressor                                                               0.44   
OrthogonalMatchingPursuit                                                   0.42   
OrthogonalMatchingPursuitCV                                                 0.42   
XGBRegressor                                                                0.41   
GradientBoostingRegressor                                                   0.41   
HistGradientBoostingRegressor                                               0.39   
LassoCV                                                                     0.34   
LassoLarsCV                                                                 0.34   
LarsCV                                                        

In [22]:
for p_num, models in lazy_predict_results.items():
    print(f'Patient: {p_num}')
    # the first element of the stored result is the score table
    display(models[0].head())

Patient: p01


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.57 | 0.79 | 1.7 | 36.34 |
| LGBMRegressor | 0.44 | 0.73 | 1.94 | 1.22 |
| OrthogonalMatchingPursuit | 0.42 | 0.73 | 1.97 | 0.07 |
| OrthogonalMatchingPursuitCV | 0.42 | 0.72 | 1.97 | 0.2 |
| XGBRegressor | 0.41 | 0.72 | 1.99 | 0.96 |


Patient: p02


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.01 | 0.69 | 1.97 | 78.54 |
| BaggingRegressor | -0.04 | 0.67 | 2.03 | 25.45 |
| XGBRegressor | -0.07 | 0.66 | 2.05 | 1.18 |
| HistGradientBoostingRegressor | -0.21 | 0.62 | 2.18 | 3.19 |
| LGBMRegressor | -0.25 | 0.6 | 2.22 | 1.49 |


Patient: p04


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.01 | 0.52 | 1.48 | 82.7 |
| XGBRegressor | -0.24 | 0.39 | 1.66 | 1.04 |
| BaggingRegressor | -0.29 | 0.37 | 1.69 | 26.16 |
| LGBMRegressor | -0.29 | 0.36 | 1.69 | 1.53 |
| HistGradientBoostingRegressor | -0.34 | 0.34 | 1.72 | 2.95 |


Patient: p05


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.49 | 0.81 | 1.38 | 35.13 |
| HistGradientBoostingRegressor | 0.36 | 0.76 | 1.54 | 2.97 |
| LGBMRegressor | 0.36 | 0.76 | 1.54 | 1.49 |
| XGBRegressor | 0.26 | 0.72 | 1.65 | 1.13 |
| LassoCV | 0.21 | 0.7 | 1.71 | 3.92 |


Patient: p06


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.6 | 0.82 | 2.23 | 35.6 |
| BaggingRegressor | 0.58 | 0.81 | 2.3 | 11.9 |
| LGBMRegressor | 0.58 | 0.81 | 2.3 | 1.41 |
| HistGradientBoostingRegressor | 0.56 | 0.8 | 2.36 | 2.04 |
| XGBRegressor | 0.54 | 0.8 | 2.4 | 1.15 |


Patient: p10


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | -2.85 | 0.43 | 1.44 | 97.1 |
| XGBRegressor | -3.32 | 0.36 | 1.53 | 1.21 |
| HistGradientBoostingRegressor | -3.38 | 0.35 | 1.54 | 3.93 |
| BaggingRegressor | -3.6 | 0.32 | 1.58 | 34.29 |
| LGBMRegressor | -3.67 | 0.31 | 1.59 | 1.63 |


Patient: p11


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.11 | 0.6 | 1.34 | 79.25 |
| BaggingRegressor | -0.08 | 0.51 | 1.47 | 20.86 |
| XGBRegressor | -0.22 | 0.44 | 1.57 | 1.21 |
| HistGradientBoostingRegressor | -0.23 | 0.44 | 1.57 | 3.58 |
| LGBMRegressor | -0.31 | 0.4 | 1.62 | 1.76 |


Patient: p12


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.46 | 0.74 | 1.53 | 121.55 |
| XGBRegressor | 0.39 | 0.71 | 1.62 | 2.21 |
| LGBMRegressor | 0.38 | 0.71 | 1.63 | 2.76 |
| HistGradientBoostingRegressor | 0.36 | 0.7 | 1.66 | 5.55 |
| BaggingRegressor | 0.35 | 0.69 | 1.67 | 95.39 |


Patient: p15


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | -0.16 | 0.46 | 2.17 | 17.55 |
| LGBMRegressor | -0.2 | 0.44 | 2.21 | 1.19 |
| HistGradientBoostingRegressor | -0.28 | 0.41 | 2.27 | 3.46 |
| XGBRegressor | -0.41 | 0.34 | 2.39 | 0.82 |
| GradientBoostingRegressor | -0.46 | 0.32 | 2.43 | 8.82 |


Patient: p16


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.17 | 0.58 | 1.02 | 18.78 |
| HistGradientBoostingRegressor | -0.05 | 0.47 | 1.15 | 3.34 |
| LGBMRegressor | -0.05 | 0.47 | 1.15 | 1.55 |
| XGBRegressor | -0.11 | 0.44 | 1.19 | 1.41 |
| GradientBoostingRegressor | -0.18 | 0.41 | 1.22 | 14.26 |


Patient: p18


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | -0.1 | 0.7 | 2.16 | 16.17 |
| XGBRegressor | -0.19 | 0.67 | 2.26 | 1.15 |
| LGBMRegressor | -0.23 | 0.66 | 2.29 | 1.31 |
| HistGradientBoostingRegressor | -0.28 | 0.65 | 2.34 | 3.54 |
| BaggingRegressor | -0.35 | 0.63 | 2.4 | 4.04 |


Patient: p19


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.22 | 0.69 | 1.43 | 17.15 |
| HistGradientBoostingRegressor | 0.14 | 0.66 | 1.51 | 3.14 |
| LGBMRegressor | 0.11 | 0.65 | 1.53 | 1.37 |
| BaggingRegressor | 0.09 | 0.65 | 1.54 | 4.86 |
| XGBRegressor | 0.06 | 0.63 | 1.57 | 1.23 |


Patient: p21


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.09 | 0.87 | 1.49 | 14.17 |
| XGBRegressor | -0.48 | 0.79 | 1.91 | 1.03 |
| HistGradientBoostingRegressor | -0.53 | 0.78 | 1.94 | 3.1 |
| LGBMRegressor | -0.55 | 0.78 | 1.96 | 1.52 |
| BaggingRegressor | -0.69 | 0.76 | 2.04 | 3.51 |


Patient: p22


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | -3.35 | 0.65 | 1.47 | 13.48 |
| XGBRegressor | -4.93 | 0.53 | 1.72 | 0.98 |
| HistGradientBoostingRegressor | -6.03 | 0.44 | 1.87 | 3.46 |
| LGBMRegressor | -6.27 | 0.42 | 1.9 | 1.28 |
| MLPRegressor | -6.45 | 0.41 | 1.92 | 2.69 |


Patient: p24


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | -0.37 | 0.62 | 1.47 | 20.97 |
| BaggingRegressor | -0.83 | 0.49 | 1.7 | 6.3 |
| HistGradientBoostingRegressor | -0.84 | 0.48 | 1.7 | 3.07 |
| LGBMRegressor | -0.85 | 0.48 | 1.71 | 1.62 |
| XGBRegressor | -0.89 | 0.47 | 1.72 | 1.34 |
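
Several patients (for example p10 and p22) show a clearly negative adjusted R-Squared next to a positive R-Squared. A likely explanation, assuming the table uses the standard adjusted R² formula with n equal to the number of validation rows and p equal to the number of feature columns, is that the validation sets are small relative to the roughly 208 features reported in the LightGBM log. The sketch below only illustrates the formula; the row counts are hypothetical, not the notebook's actual validation sizes.

```python
def adjusted_r_squared(r2: float, n_samples: int, n_features: int) -> float:
    """Standard adjusted R²: penalises R² for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# 208 features is taken from the LightGBM log; the row counts are hypothetical.
for n_rows in (250, 400, 2000):
    print(n_rows, round(adjusted_r_squared(0.65, n_rows, 208), 2))
```

With the same R² of 0.65, the adjusted value is negative at 250 rows and approaches the plain R² only once the number of rows is much larger than the number of features. For ranking models within one patient, R-Squared and RMSE are therefore the more readable columns.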


In [20]:
lazy_predict_results['p01'][0].head()

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| ExtraTreesRegressor | 0.57 | 0.79 | 1.7 | 36.34 |
| LGBMRegressor | 0.44 | 0.73 | 1.94 | 1.22 |
| OrthogonalMatchingPursuit | 0.42 | 0.73 | 1.97 | 0.07 |
| OrthogonalMatchingPursuitCV | 0.42 | 0.72 | 1.97 | 0.2 |
| XGBRegressor | 0.41 | 0.72 | 1.99 | 0.96 |
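
To choose a single model family across patients, the per-patient tables can be stacked and averaged. The sketch below is one possible way to do that, using a few RMSE values copied from the p01, p05 and p06 tables; in the notebook the full tables would come from `lazy_predict_results[patient][0]`. The names `per_patient`, `combined` and `mean_rmse` are introduced only for this example.

```python
import pandas as pd

models = ["ExtraTreesRegressor", "LGBMRegressor", "XGBRegressor"]

# RMSE values copied from the per-patient tables for three patients.
per_patient = {
    "p01": pd.DataFrame({"RMSE": [1.70, 1.94, 1.99]}, index=models),
    "p05": pd.DataFrame({"RMSE": [1.38, 1.54, 1.65]}, index=models),
    "p06": pd.DataFrame({"RMSE": [2.23, 2.30, 2.40]}, index=models),
}

# Stack the tables into one frame indexed by (patient, Model).
combined = pd.concat(per_patient, names=["patient", "Model"])

# Average RMSE per model across patients, best (lowest) first.
mean_rmse = combined.groupby(level="Model")["RMSE"].mean().sort_values()
print(mean_rmse)
```

On this subset `ExtraTreesRegressor` has the lowest mean RMSE, which matches its first place in every per-patient table, although its fit time is far higher than that of `LGBMRegressor` or `XGBRegressor`.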
