# Boosting Model

In [1]:
import pandas as pd
import numpy as np

# Project helpers (utility.py) and importlib, used to reload them after edits
from notebooks import utility
import importlib

# Intel acceleration for scikit-learn; patch before importing any sklearn estimator
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Data import
Let's import the previously cleaned data.

In [2]:
X_train = pd.read_csv("../DWMProjectData/formodel/X_train.csv")
y_train = pd.read_csv("../DWMProjectData/formodel/y_train.csv")
X_valid = pd.read_csv("../DWMProjectData/formodel/X_valid.csv")
y_valid = pd.read_csv("../DWMProjectData/formodel/y_valid.csv")
X_test = pd.read_csv("../DWMProjectData/formodel/X_test.csv")
y_test = pd.read_csv("../DWMProjectData/formodel/y_test.csv")
# Flatten each y into a 1-dimensional array; this avoids a warning when fitting the model
y_train = np.ravel(y_train)
y_valid = np.ravel(y_valid)
y_test = np.ravel(y_test)

## Scale data
Gradient boosting does not need feature scaling, since it is built on decision trees, which split on thresholds (rules) rather than on distances ([source](https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35#:~:text=Algorithms%20that%20do,not%20require%20normalization.)).
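To illustrate why thresholds are insensitive to scaling, here is a minimal NumPy sketch of a single squared-error split (a "decision stump"). It is not the algorithm used by `HistGradientBoostingRegressor`, which bins features first; the helper `best_split_mask` and the synthetic data are assumptions made only for this example. The point is that standardising a feature keeps the order of its values, so the best split separates the same samples.

```python
import numpy as np

def best_split_mask(x, y):
    """Return the boolean 'left' mask of the split on x that minimises squared error."""
    order = np.argsort(x)
    best_sse, best_k = np.inf, 1
    for k in range(1, len(x)):  # the k smallest x values go to the left child
        left, right = y[order[:k]], y[order[k:]]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    mask = np.zeros(len(x), dtype=bool)
    mask[order[:best_k]] = True
    return mask

rng = np.random.default_rng(0)
x = rng.permutation(50).astype(float)  # distinct feature values (synthetic)
y = np.where(x > 20, 3.0, 0.0) + rng.normal(scale=0.1, size=50)

x_scaled = (x - x.mean()) / x.std()  # standardised copy of the same feature

# The split is chosen from the ordering of x only, so both versions agree
same_partition = np.array_equal(best_split_mask(x, y), best_split_mask(x_scaled, y))
print(same_partition)
```

A distance-based model such as k-nearest neighbours would not have this property, because rescaling one feature changes which points are close to each other.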

## Score function

I defined the score functions used for the regression. To keep the notebook clean, I wrote the function `print_metrics` in the file `utility.py`. It prints the following values so that models can be compared:
- mean absolute error
- mean squared error
- $r^2$, where the best score is 1 and a score above 0.7 is good
- explained variance score, where the best score is 1
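Since `utility.py` is not shown here, the following is only a sketch of how these four metrics are defined, written with plain NumPy. The helper name `regression_metrics` and the toy values are assumptions for illustration; the real `print_metrics` may be implemented differently (for example with `sklearn.metrics`).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the four metrics listed above with plain NumPy (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return {
        "mean absolute error": np.mean(np.abs(residuals)),
        "mean squared error": np.mean(residuals ** 2),
        # 1 - SS_res / SS_tot: penalises both the spread and any systematic bias
        "r^2": 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        # 1 - Var(residuals) / Var(y): ignores a constant offset in the predictions
        "explained variance score": 1.0 - np.var(residuals) / np.var(y_true),
    }

# Toy values, not taken from the project data
metrics = regression_metrics([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
for name, value in metrics.items():
    print(f"{name:<26} {value:.3f}")
```

With these definitions, an $r^2$ of 0 corresponds to always predicting the mean of the target, which is a useful baseline when reading the scores below.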


In [3]:
# Reload first so that edits to utility.py are picked up, then import the function
importlib.reload(utility)
from notebooks.utility import print_metrics

## Model Building
According to the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor), HistGradientBoostingRegressor is preferable to GradientBoostingRegressor for large datasets, since it is much faster (I tried both and can confirm it: at least 5x faster on this data).

As suggested [here](https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f#:~:text=Cross%2Dvalidation%20is%20usually%20the,just%20one%20train%2Dtest%20split.), given that
- I have a large dataset and
- I want a first model as a benchmark,

I decided to use the train-validation-test datasets instead of performing cross-validation.



In [38]:
from sklearn.ensemble import HistGradientBoostingRegressor

model_base = HistGradientBoostingRegressor().fit(X_train, y_train)
y_pred = model_base.predict(X_test)
print_metrics(y_test, y_pred)

+--------------------------+-------+
|          Method          | Value |
+--------------------------+-------+
| mean absolute error      | 0.074 |
+--------------------------+-------+
| mean squared error       | 0.030 |
+--------------------------+-------+
| r^2                      | 0.008 |
+--------------------------+-------+
| explained variance score | 0.009 |
+--------------------------+-------+


Although I tried changing the model's hyperparameters by hand, the results are still poor; to improve them I can run a systematic parameter tuning on the model. Inspiration for parameter tuning came [from here](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) and [from here](https://towardsdatascience.com/cross-validation-and-hyperparameter-tuning-how-to-optimise-your-machine-learning-model-13f005af9d7d).
I first used RandomizedSearchCV which, according to the documentation: _In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter_ ([more details here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)).
Then I tried GridSearchCV, in order to try all combinations.
[Here](https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html) is a comparison of the two.
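To make the cost difference concrete, the following plain-Python sketch counts the model fits each strategy needs for the grid used in the next cell. It only mimics the counting with `itertools` and `random`; it is not how scikit-learn implements either search, and `n_folds = 5` assumes the default 5-fold cross-validation.

```python
import itertools
import random

param_grid = {
    "learning_rate": [1, 0.5, 0.25, 0.1, 0.05, 0.01],
    "max_depth": [None, 3, 10, 15, 25, 50, 100, 150, 200, 300],
    "min_samples_leaf": [10, 20, 35, 60, 100],
}
n_folds = 5  # assumed: scikit-learn's default number of cross-validation folds

# Grid search: every combination of the listed values
names = list(param_grid)
all_candidates = [
    dict(zip(names, values)) for values in itertools.product(*param_grid.values())
]
grid_fits = len(all_candidates) * n_folds

# Randomized search over lists: n_iter combinations drawn without replacement
n_iter = 5
random.seed(0)
sampled = random.sample(all_candidates, n_iter)
random_fits = n_iter * n_folds

print(len(all_candidates), grid_fits, random_fits)  # 300 1500 25
```

The grid search therefore costs 60 times more fits than a randomized search with `n_iter=5`, in exchange for covering every combination.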

In [19]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

parameters_tuning = {
    'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01],
    'max_depth': [None, 3, 10, 15, 25, 50, 100, 150, 200, 300],
    'min_samples_leaf': [10, 20, 35, 60, 100],
}
model_base = HistGradientBoostingRegressor()
# model_tuned = RandomizedSearchCV(estimator=model_base,
#                                  param_distributions=parameters_tuning,
#                                  n_iter=5,
#                                  verbose=4,
#                                  n_jobs=1)

model_tuned = GridSearchCV(
    estimator=model_base,
    param_grid=parameters_tuning,
    verbose=4,
    n_jobs=1,
)

# Fit the grid search (5-fold cross-validation on the training set)
model_tuned.fit(X_train, y_train)

# Get the optimal parameters
print(f"Best parameters are: {model_tuned.best_params_}")

Fitting 5 folds for each of 300 candidates, totalling 1500 fits
[CV 1/5] END learning_rate=1, max_depth=None, min_samples_leaf=10;, score=-0.074 total time=   0.3s
[CV 2/5] END learning_rate=1, max_depth=None, min_samples_leaf=10;, score=-0.017 total time=   0.3s
[CV 3/5] END learning_rate=1, max_depth=None, min_samples_leaf=10;, score=-0.048 total time=   0.3s
[CV 4/5] END learning_rate=1, max_depth=None, min_samples_leaf=10;, score=-0.074 total time=   0.3s
[CV 5/5] END learning_rate=1, max_depth=None, min_samples_leaf=10;, score=-0.041 total time=   0.4s
[CV 1/5] END learning_rate=1, max_depth=None, min_samples_leaf=20;, score=-0.055 total time=   0.3s
[CV 2/5] END learning_rate=1, max_depth=None, min_samples_leaf=20;, score=0.024 total time=   0.3s
[CV 3/5] END learning_rate=1, max_depth=None, min_samples_leaf=20;, score=-0.009 total time=   0.2s
[CV 4/5] END learning_rate=1, max_depth=None, min_samples_leaf=20;, score=-0.061 total time=   0.3s
[CV 5/5] END learning_rate=1, max_dep

## Model re-building with best parameters + Metrics

In [34]:
# model_final = HistGradientBoostingRegressor(learning_rate=0.25, max_depth=10, min_samples_leaf=10) # hard-coded version
model_final = HistGradientBoostingRegressor(**model_tuned.best_params_)
# Refit on training + validation data before the final evaluation on the test set
X_train_n = pd.concat([X_train, X_valid])
y_train_n = np.concatenate([y_train, y_valid])
model_final.fit(X_train_n, y_train_n)

print_metrics(y_test, model_final.predict(X_test))

+--------------------------+-------+
|          Method          | Value |
+--------------------------+-------+
| mean absolute error      | 0.075 |
+--------------------------+-------+
| mean squared error       | 0.029 |
+--------------------------+-------+
| r^2                      | 0.041 |
+--------------------------+-------+
| explained variance score | 0.042 |
+--------------------------+-------+


It would be a good idea to print the feature importances to understand
 1. which features contribute the most
 2. whether my new feature is useful, and how much

Unfortunately, this is not yet implemented for HistGradientBoostingRegressor, as described in [this issue](https://github.com/scikit-learn/scikit-learn/issues/15132).