# Re-running LazyPredict on the Feature Subset

Using the insights from the SHAP analysis in the previous chapter, we identified a subset of the most important features. In this step, we re-run LazyPredict on this refined feature subset to evaluate model performance and compare it with previous results.

In [None]:
# Import get_lazy_regressor() from the helper script
import importlib.util
import sys

def import_function_from_path(file_path, function_name):
    # Load the module from the file path
    spec = importlib.util.spec_from_file_location("module_name", file_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules["module_name"] = module
    spec.loader.exec_module(module)
    
    # Retrieve the function from the loaded module
    func = getattr(module, function_name)
    return func

# NOTE: Below, PATH_TO_PROJECT is the location of the project folder (e.g. /home/sep24_bds_int_medical)
PATH_TO_SCRIPT = 'PATH_TO_PROJECT/notebooks/helpers/LazyPredict.py'
function_name = 'get_lazy_regressor'

get_lazy_regressor = import_function_from_path( PATH_TO_SCRIPT , function_name )

PATH_TO_SRC = '/Users/masaver/Desktop/masaver/data_science_projects/sep24_bds_int_medical'
sys.path.append( PATH_TO_SRC )

# Load other required libraries
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

# Import the preprocessing pipeline
from pipelines import *
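
The path-based import used above can be checked in isolation. The following is a minimal sketch, not the project's actual helper: it writes a throwaway module to a temporary directory (a stand-in for `notebooks/helpers/LazyPredict.py`) and loads one function from it. The names `demo_helper.py`, `greet` and `helper_module` are hypothetical, and the explicit `None` check is an addition for clearer error messages.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def import_function_from_path(file_path, function_name, module_name="helper_module"):
    """Load `function_name` from the Python file at `file_path`."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register so later imports can find it
    spec.loader.exec_module(module)    # actually run the file's top-level code
    return getattr(module, function_name)


# Throwaway module standing in for the real helper script (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    script = Path(tmp_dir) / "demo_helper.py"
    script.write_text("def greet(name):\n    return f'hello {name}'\n")
    greet = import_function_from_path(str(script), "greet")
    greeting = greet("lazy")

print(greeting)
```

If the project folder is already on `sys.path`, a regular `import` statement is usually simpler; loading by file path is mainly useful when the helper lives outside any importable package.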

In [None]:
# Read the train.csv and test.csv data
data_dir = '../../../../data/'
train_file = os.path.join( data_dir , 'raw' , 'train.csv' )
test_file = os.path.join( data_dir , 'raw' , 'test.csv' )

df_train = pd.read_csv(train_file, index_col = 0 , parse_dates = True )
df_test = pd.read_csv(test_file, index_col = 0 , parse_dates = True )

## Data Preprocessing

### Main steps
* Re-encoding the ``timestamp`` into a ``day-phase``
* Dropping the following columns: ``activity``, ``carbs``, ``steps``, ``p_num`` and ``time``
* Imputing NaNs in the remaining columns with interpolation and medians
* Replacing the two negative values in the ``insulin`` column with ``0``
* Re-encoding the ``day-phase`` column using ``pd.get_dummies()``
* Finally, transforming all columns with ``StandardScaler``
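
The steps above can be sketched on a toy frame. This is only an illustration under stated assumptions, not the project's `pipeline_s`: the column names, the sample values and the hour boundaries in `day_phase()` are all hypothetical, and the real pipeline may order or implement the steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; column names and values are illustrative assumptions
df = pd.DataFrame({
    "p_num": ["p01"] * 6,
    "time": ["02:00:00", "07:30:00", "12:15:00", "15:00:00", "19:45:00", "23:10:00"],
    "bg-0:00": [6.1, np.nan, 7.4, 8.0, np.nan, 9.2],
    "insulin-0:00": [0.1, -0.3, 0.0, 1.2, 0.4, np.nan],
    "hr-0:00": [60.0, 72.0, np.nan, 85.0, 90.0, 66.0],
    "carbs-0:00": [np.nan] * 6,
    "steps-0:00": [np.nan, 10.0, 50.0, np.nan, 5.0, 0.0],
    "activity-0:00": [None, "Walk", None, None, "Run", None],
})


def day_phase(hour):
    """Map an hour of the day to a phase (boundaries are assumed)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


# 1. Re-encode the timestamp into a day phase
hours = pd.to_datetime(df["time"], format="%H:%M:%S").dt.hour
df["day_phase"] = hours.map(day_phase)

# 2. Drop the unused column groups
drop_prefixes = ("activity", "carbs", "steps", "p_num", "time")
dropped = [col for col in df.columns if col.startswith(drop_prefixes)]
df = df.drop(columns=dropped)

# 3. Impute NaNs: interpolate first, then fall back to the column median
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].interpolate(limit_direction="both")
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 4. Replace negative insulin values with 0
df["insulin-0:00"] = df["insulin-0:00"].clip(lower=0)

# 5. One-hot encode the day phase
df = pd.get_dummies(df, columns=["day_phase"], dtype=float)

# 6. Standardize every column, keeping the DataFrame structure
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=df.columns, index=df.index
)
print(scaled.round(2))
```

In a real train/test workflow the scaler (and any medians) would be fitted on the training split only and then applied to the test split, which is what the `fit_transform` / `transform` pair in the next cell does.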

In [5]:
# Split the data into features and target variable,
# and standardize the features with the preprocessing pipeline
X = df_train.drop( 'bg+1:00' , axis = 1 )
y = df_train['bg+1:00']

# fix column names 
import re
X.columns = [re.sub(r'[^a-zA-Z0-9_]', '_', col) for col in X.columns]

# Train Test Split
x_train,x_test,y_train,y_test = train_test_split( X , y , test_size=0.2 , random_state=17 )
data_pipe = pipeline_s
x_train_s = data_pipe.fit_transform( x_train )
x_test_s = data_pipe.transform( x_test )

# Subset to keep the top features only
top_feat = ['hr_0_00', 'bg_0_15', 'day_phase_evening', 'bg_0_00', 'insulin_0_00', 'day_phase_night', 'bg_0_10']
x_train_s = x_train_s[ top_feat ]
x_test_s = x_test_s[ top_feat ]

display( x_train_s )
display( x_test_s )


| id | hr_0_00 | bg_0_15 | day_phase_evening | bg_0_00 | insulin_0_00 | day_phase_night | bg_0_10 |
|---|---|---|---|---|---|---|---|
| p11_1931 | -0.18 | 0.74 | -0.44 | 0.48 | -0.12 | 1.74 | 0.68 |
| p10_9422 | -0.34 | -0.66 | -0.44 | -0.79 | -0.12 | -0.58 | -0.66 |
| p06_8273 | 1.43 | -1.20 | -0.44 | -1.29 | -0.12 | -0.58 | -1.23 |
| p12_17133 | 1.87 | 0.34 | -0.44 | 0.17 | -0.12 | -0.58 | 0.31 |
| p03_2346 | -0.61 | 0.04 | -0.44 | 0.01 | -0.12 | -0.58 | 0.17 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| p02_17172 | -0.46 | 0.91 | -0.44 | 0.84 | -0.12 | -0.58 | 0.88 |
| p10_23964 | 0.47 | 1.18 | -0.44 | 0.98 | -0.12 | -0.58 | 1.18 |
| p03_7966 | -1.00 | -0.53 | -0.44 | -0.63 | -0.12 | 1.74 | -0.56 |
| p03_628 | -0.13 | -0.49 | -0.44 | -0.53 | -0.12 | -0.58 | -0.49 |


| id | hr_0_00 | bg_0_15 | day_phase_evening | bg_0_00 | insulin_0_00 | day_phase_night | bg_0_10 |
|---|---|---|---|---|---|---|---|
| p11_17351 | -0.13 | 0.01 | -0.44 | -0.09 | -0.12 | 1.74 | -0.09 |
| p10_15385 | -0.93 | -0.86 | -0.44 | -0.79 | -0.12 | -0.58 | -0.90 |
| p06_2496 | -0.70 | -0.06 | -0.44 | 0.04 | -0.12 | -0.58 | -0.03 |
| p10_8410 | 0.67 | -0.03 | -0.44 | 0.14 | -0.12 | -0.58 | 0.04 |
| p02_2562 | 1.22 | -0.49 | -0.44 | -0.36 | -0.12 | -0.58 | -0.49 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| p02_24761 | 0.32 | -0.63 | -0.44 | -0.69 | -0.12 | 1.74 | -0.59 |
| p10_22597 | 0.11 | 0.54 | 2.26 | 0.58 | 0.53 | -0.58 | 0.58 |
| p03_2231 | -1.55 | -0.33 | -0.44 | -0.36 | -0.12 | 1.74 | -0.36 |
| p12_4088 | 1.75 | -0.36 | 2.26 | -0.39 | -0.11 | -0.58 | -0.36 |


## Run LazyPredict Regressor

In [6]:
# Run a Lazy Regressor
reg = get_lazy_regressor( exclude = ['SVR','QuantileRegressor'] )
models, predictions = reg.fit( x_train_s , x_test_s , y_train , y_test )

 97%|█████████▋| 36/37 [01:31<00:01,  1.13s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001444 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1237
[LightGBM] [Info] Number of data points in the train set: 141619, number of used features: 7
[LightGBM] [Info] Start training from score 8.273489


100%|██████████| 37/37 [01:31<00:00,  2.47s/it]


In [7]:
display( models )

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| LGBMRegressor | 0.55 | 0.55 | 2.0 | 0.43 |
| HistGradientBoostingRegressor | 0.55 | 0.55 | 2.01 | 0.92 |
| MLPRegressor | 0.55 | 0.55 | 2.01 | 53.6 |
| XGBRegressor | 0.55 | 0.55 | 2.01 | 0.4 |
| GradientBoostingRegressor | 0.54 | 0.54 | 2.04 | 6.23 |
| SGDRegressor | 0.52 | 0.52 | 2.08 | 0.17 |
| Ridge | 0.52 | 0.52 | 2.08 | 0.06 |
| BayesianRidge | 0.52 | 0.52 | 2.08 | 0.05 |
| RidgeCV | 0.52 | 0.52 | 2.08 | 0.05 |
| TransformedTargetRegressor | 0.52 | 0.52 | 2.08 | 0.05 |
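
The Adjusted R-Squared and R-Squared columns are identical at two decimals. A short sketch of the standard adjusted R² formula shows why: with only 7 features and a large evaluation set, the correction term is negligible. The sample count of 35,000 is an assumption (roughly 20% of the data, given the 141,619 training rows reported in the LightGBM log), and the function name is hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted R^2: penalizes R^2 for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


n_test = 35_000  # assumed size of the evaluation set

# Large sample, few features: the penalty is tiny
adj_large = adjusted_r2(0.55, n_test, 7)

# Small sample, same features: the penalty becomes visible
adj_small = adjusted_r2(0.55, 50, 7)

print(f"n={n_test}: {adj_large:.4f}")
print(f"n=50: {adj_small:.4f}")
```

The adjusted score would only start to separate from plain R² if the feature count grew to a noticeable fraction of the sample count.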


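**Conclusion:** On the reduced feature set, the previously selected models **XGBoost**, **LGBM** and **HistGradientBoosting** remain among the best performers (R-Squared of about 0.55, RMSE of about 2.0). **MLPRegressor** matches their scores but takes far longer to fit (53.6 s versus under 1 s).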