# **Regression model ensembles**

## Configuration:

Import necessary entities:

In [1]:
from typing import Any
from warnings import filterwarnings
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from pandas import (
    Series,
    DataFrame,
    read_csv,
)
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    train_test_split,
)
from sklearn.linear_model import (
    ElasticNet,
    HuberRegressor,
    LinearRegression,
)
from sklearn.ensemble import (
    VotingRegressor,
    BaggingRegressor,
    StackingRegressor,
    RandomForestRegressor,
    GradientBoostingRegressor,
)

Ignore all warnings:

In [2]:
filterwarnings("ignore", )

## Preprocessing:

Create a dictionary of file names and paths for the `read_csv()` function calls:

In [3]:
read_csv_params: dict[str, str] = {
    "target_file": "numerical.csv",
    "features_file": "features.csv",

    "targets_file_path": "../../../data/datasets/targets/",
    "features_file_path": "../../../data/datasets/processed/",
}

Read the `features.csv` data into a *Pandas* dataframe:

In [4]:
X: DataFrame = read_csv(
    read_csv_params["features_file_path"] + read_csv_params["features_file"],
    index_col=0,
)

Read the `numerical.csv` data into a *Pandas* dataframe:

In [5]:
y: DataFrame = read_csv(
    read_csv_params["targets_file_path"] + read_csv_params["target_file"],
    index_col=0,
)
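
`read_csv()` returns a `DataFrame` even when the file holds a single data column. If a one-dimensional `Series` target is preferred, one option is `DataFrame.squeeze("columns")`. A minimal sketch, assuming a small in-memory CSV (the `csv_text` sample is made up for illustration and only stands in for a one-column target file):

```python
from io import StringIO

from pandas import DataFrame, Series, read_csv

# Hypothetical stand-in for a one-column target file.
csv_text = ",rating\n0,2.5\n1,4.375\n2,3.75\n"

# read_csv() yields a DataFrame with one column named "rating".
y_frame: DataFrame = read_csv(StringIO(csv_text), index_col=0)

# squeeze("columns") collapses the single column into a Series.
y_series: Series = y_frame.squeeze("columns")

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```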

Check the contents of the `X` and `y` variables:

In [6]:
X.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
y.head()

Unnamed: 0,rating
0,2.5
1,4.375
2,3.75
3,5.0
4,3.125


Use the `train_test_split()` function to split the features `X` and the target `y` into training and test sets:

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=21,
)
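
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on toy arrays (`X_toy` and `y_toy` are made up for illustration). It shows the 80/20 row counts, that rows and targets stay aligned, and that a fixed `random_state` reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features; row i is [2i, 2i + 1] and its target is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=21,
)
# Same arguments, same seed: the split is identical.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=21,
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, y_te2))
```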

Check the contents of the `X_train`, `X_test`, `y_train` and `y_test` variables:

In [9]:
X_train.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
15035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1790,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8329,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
X_test.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
8900,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
17923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
y_train.head()

Unnamed: 0,rating
15035,0.0
15435,3.125
1790,0.0
8329,0.0
3195,3.75


In [12]:
y_test.head()

Unnamed: 0,rating
8900,4.375
17863,4.375
10688,4.375
17923,4.375
3607,4.375


## Prediction:

### *Bagging* model:

Create a model of *bagging regression*:

In [13]:
bagging_reg_model: BaggingRegressor = BaggingRegressor(
    n_jobs=-1,
    random_state=21,
)
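
As a rough illustration of what bagging does, the sketch below fits several trees on bootstrap resamples and averages their predictions. It is a hand-rolled approximation on synthetic data, not the internals of `BaggingRegressor`; the data and the names `X_toy`, `y_toy` and `X_new` are assumptions for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(21)

# Synthetic data, made up for illustration.
X_toy = rng.uniform(0.0, 6.0, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0.0, 0.3, size=200)

n_estimators = 5
trees = []
for _ in range(n_estimators):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(
        DecisionTreeRegressor(random_state=21).fit(X_toy[idx], y_toy[idx])
    )

X_new = np.array([[1.0], [3.0], [5.0]])
# One row of predictions per tree, then the average across trees.
per_tree = np.stack([tree.predict(X_new) for tree in trees])
bagged = per_tree.mean(axis=0)
print(bagged)
```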

Create a parameters grid for the *bagging regression* model:

In [14]:
bagging_reg_model_params_grid: dict[str, list[Any]] = {
    "estimator": [
        HuberRegressor(),
        LinearRegression(n_jobs=-1, ),
        ElasticNet(random_state=21, ),
        DecisionTreeRegressor(random_state=21, ),
    ],
    "n_estimators": [
        2,
        3,
        4,
        5,
    ],
}

Create a *gridsearch* model of the *bagging regression* model:

In [15]:
bagging_reg_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=bagging_reg_model,
    scoring="neg_root_mean_squared_error",
    param_grid=bagging_reg_model_params_grid,
)
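
The cost of a grid search grows with the product of the grid sizes and the number of folds. A small sketch using `ParameterGrid` counts the candidates for a grid of the same shape as the one above; estimator names are used as plain strings here, since only the counts matter:

```python
from sklearn.model_selection import ParameterGrid

# Strings stand in for the estimator objects; only the grid shape matters.
grid = {
    "estimator": [
        "HuberRegressor",
        "LinearRegression",
        "ElasticNet",
        "DecisionTreeRegressor",
    ],
    "n_estimators": [2, 3, 4, 5],
}

n_candidates = len(ParameterGrid(grid))  # 4 * 4 parameter combinations
n_folds = 5
n_fits = n_candidates * n_folds          # one fit per candidate per fold
print(n_candidates, n_fits)  # prints: 16 80
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on all the data passed to `fit()`.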

Train the *gridsearch* model of the *bagging regression* model:

In [16]:
bagging_reg_grid_search_model.fit(X, y, );

Print the best *bagging regression* model *RMSE* metric score:

In [17]:
print(
    f"The bagging regression model RMSE metric score is {
        -bagging_reg_grid_search_model.best_score_:.3f
    }.",
)

The bagging regression model RMSE metric score is 1.291.
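
The minus sign in front of `best_score_` is needed because scikit-learn scorers follow a "greater is better" convention, so `"neg_root_mean_squared_error"` reports the RMSE negated. A minimal sketch on synthetic data (the data and the choice of `LinearRegression` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(21)

# Synthetic linear data with noise, made up for illustration.
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.5, size=100)

scores = cross_val_score(
    LinearRegression(),
    X_toy,
    y_toy,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are non-positive; negate them to get the RMSE of each fold.
rmse_per_fold = -scores
print(scores)
print(rmse_per_fold.mean())
```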


### *Random forest tree* model:

Create a model of *random forest tree*:

In [18]:
tree_forest_model: RandomForestRegressor = RandomForestRegressor(
    n_jobs=-1,
    random_state=21,
)

Print the *random forest tree cross-validation* model *RMSE* metric score:

In [19]:
print(
    f"The random forest tree cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            estimator=tree_forest_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The random forest tree cross-validation model RMSE metric score is 1.355.


Create a parameters grid for the *random forest tree* model:

In [20]:
tree_forest_model_params_grid: dict[str, list[Any]] = {
    "n_estimators": [
        15,
        30,
        45,
    ],
    "criterion": [
        "poisson",
        "friedman_mse",
        "squared_error",
    ],
}

Create a *gridsearch* model of the *random forest tree* model:

In [21]:
tree_forest_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=tree_forest_model,
    scoring="neg_root_mean_squared_error",
    param_grid=tree_forest_model_params_grid,
)

Train the *gridsearch* model of the *random forest tree* model:

In [22]:
tree_forest_grid_search_model.fit(X, y, );

Print the best *random forest tree* model *RMSE* metric score:

In [23]:
print(
    f"The best random forest tree model RMSE metric score is {
        -tree_forest_grid_search_model.best_score_:.3f
    }.",
)

The best random forest tree model RMSE metric score is 1.343.


### *Gradient boosting* model:

Create a model of *gradient boosting*:

In [24]:
grad_boost_model: GradientBoostingRegressor = GradientBoostingRegressor(
    random_state=21,
)
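
The idea behind gradient boosting with squared-error loss can be sketched by hand: start from a constant prediction, then repeatedly fit a shallow tree to the current residuals and add a damped version of its output. This is a simplified illustration on synthetic data, not the actual `GradientBoostingRegressor` implementation; `learning_rate`, `n_stages` and the data are assumptions for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(21)

# Synthetic data, made up for illustration.
X_toy = rng.uniform(0.0, 6.0, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0.0, 0.2, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: a constant prediction (the mean minimises squared error).
prediction = np.full_like(y_toy, y_toy.mean())
initial_mse = np.mean((y_toy - prediction) ** 2)

for _ in range(n_stages):
    # For squared error, the negative gradient is the residual.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=21)
    tree.fit(X_toy, residuals)
    # Move the prediction a small step towards the residuals.
    prediction += learning_rate * tree.predict(X_toy)

final_mse = np.mean((y_toy - prediction) ** 2)
print(initial_mse, final_mse)
```

Smaller `learning_rate` values generally need more stages to reach the same training error, which is why the two are often tuned together.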

Print the *gradient boosting cross-validation* model *RMSE* metric score:

In [25]:
print(
    f"The gradient boosting cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=15,
            n_jobs=-1,
            estimator=grad_boost_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The gradient boosting cross-validation model RMSE metric score is 1.289.


Create a parameters grid for the *gradient boosting* model:

In [26]:
grad_boost_model_params_grid: dict[str, list[Any]] = {
    "criterion": [
        "friedman_mse",
        "squared_error",
    ],
    "n_estimators": [
        45,
        60,
        75,
    ],
    "learning_rate": [
        0.05,
        0.1,
        0.15,
    ],
    "loss": [
        "huber",
        "quantile",
        "squared_error",
    ],
}

Create a *gridsearch* model of the *gradient boosting* model:

In [27]:
grad_boost_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=grad_boost_model,
    scoring="neg_root_mean_squared_error",
    param_grid=grad_boost_model_params_grid,
)

Train the *gridsearch* model of the *gradient boosting* model:

In [28]:
grad_boost_grid_search_model.fit(X, y, );

Print the best *gradient boosting* model *RMSE* metric score:

In [29]:
print(
    f"The best gradient boosting model RMSE metric score is {
        -grad_boost_grid_search_model.best_score_:.3f
    }.",
)

The best gradient boosting model RMSE metric score is 1.288.


### *Stacking* model:

Create a model of *stacking regression*:

In [30]:
stacking_reg_model: StackingRegressor = StackingRegressor(
    n_jobs=-1,
    final_estimator=DecisionTreeRegressor(random_state=21, ),
    estimators=[
        ("hub_reg", HuberRegressor(), ),
        ("knn_reg", KNeighborsRegressor(n_jobs=-1, ), ),
        ("lin_reg_model", LinearRegression(n_jobs=-1, ), ),
        ("elastic_net_model", ElasticNet(random_state=21, ), ),
    ],
)
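
Conceptually, stacking trains the final estimator on out-of-fold predictions of the base estimators. The sketch below reproduces that idea by hand with `cross_val_predict` on synthetic data; it is a simplified approximation with only two base estimators, and the data and variable names are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(21)

# Synthetic data, made up for illustration.
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0.0, 0.3, size=120)

base_estimators = [LinearRegression(), KNeighborsRegressor()]

# Out-of-fold predictions: each row is predicted by a model that never saw it.
meta_features = np.column_stack(
    [cross_val_predict(est, X_toy, y_toy, cv=5) for est in base_estimators]
)

# The final estimator learns how to combine the base predictions.
final_estimator = DecisionTreeRegressor(random_state=21)
final_estimator.fit(meta_features, y_toy)

# At prediction time the base estimators are refit on all training rows.
fitted = [est.fit(X_toy, y_toy) for est in base_estimators]
X_new = rng.normal(size=(4, 3))
stacked_pred = final_estimator.predict(
    np.column_stack([est.predict(X_new) for est in fitted])
)
print(meta_features.shape, stacked_pred.shape)
```

An unrestricted decision tree as the final estimator can fit the meta-features very closely, which may hurt generalisation; a regularised linear model is a common alternative to try.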

Print the *stacking regression cross-validation* model *RMSE* metric score:

In [31]:
print(
    f"The stacking regression cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            estimator=stacking_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The stacking regression cross-validation model RMSE metric score is 1.783.


### *Voting regression* model:

Create a model of *voting regression*:

In [32]:
voting_reg_model: VotingRegressor = VotingRegressor(
    n_jobs=-1,
    estimators=[
        ("hub_reg", HuberRegressor(), ),
        ("knn_reg", KNeighborsRegressor(n_jobs=-1, ), ),
        ("lin_reg_model", LinearRegression(n_jobs=-1, ), ),
        ("tree", DecisionTreeRegressor(random_state=21, ), ),
        ("elastic_net_model", ElasticNet(random_state=21, ), ),
    ],
)
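
With no `weights` given, a `VotingRegressor` predicts the plain average of its fitted base estimators' predictions. A minimal check on synthetic data (the data and the two-estimator ensemble are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(21)

# Synthetic data, made up for illustration.
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy[:, 0] + rng.normal(0.0, 0.2, size=100)

voting = VotingRegressor(
    estimators=[
        ("lin", LinearRegression()),
        ("tree", DecisionTreeRegressor(random_state=21)),
    ],
).fit(X_toy, y_toy)

X_new = rng.normal(size=(5, 3))
# Average the predictions of the fitted clones stored in estimators_.
manual_mean = np.mean(
    [est.predict(X_new) for est in voting.estimators_], axis=0,
)
print(np.allclose(voting.predict(X_new), manual_mean))  # prints: True
```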

Print the *voting regression cross-validation* model *RMSE* metric score:

In [33]:
print(
    f"The voting regression cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            estimator=voting_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The voting regression cross-validation model RMSE metric score is 1.282.


## Model selection:

The *voting regression* model has the lowest cross-validation *RMSE* (1.282), so check its parameters:

In [34]:
voting_reg_model

VotingRegressor
parameter,value
estimators,"[('hub_reg', ...), ('knn_reg', ...), ...]"
weights,None
n_jobs,-1
verbose,False

hub_reg: HuberRegressor
parameter,value
epsilon,1.35
max_iter,100
alpha,0.0001
warm_start,False
fit_intercept,True
tol,1e-05

knn_reg: KNeighborsRegressor
parameter,value
n_neighbors,5
weights,'uniform'
algorithm,'auto'
leaf_size,30
p,2
metric,'minkowski'
metric_params,None
n_jobs,-1

lin_reg_model: LinearRegression
parameter,value
fit_intercept,True
copy_X,True
tol,1e-06
n_jobs,-1
positive,False

tree: DecisionTreeRegressor
parameter,value
criterion,'squared_error'
splitter,'best'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,None
random_state,21
max_leaf_nodes,None
min_impurity_decrease,0.0

elastic_net_model: ElasticNet
parameter,value
alpha,1.0
l1_ratio,0.5
fit_intercept,True
precompute,False
max_iter,1000
copy_X,True
tol,0.0001
warm_start,False
positive,False
random_state,21


Train the best regression model ensemble:

In [35]:
voting_reg_model.fit(X_train, y_train, );

Print the best regression model ensemble *RMSE* metric score on the test set:

In [36]:
print(
    f"The best regression model RMSE metric score is {
        mean_squared_error(
            y_test,
            voting_reg_model.predict(X_test, ),
        ) ** 0.5:.3f
    }.",
)

The best regression model RMSE metric score is 1.260.
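
The test-set RMSE above is the square root of the mean squared error. A tiny worked example with made-up numbers confirms that `mean_squared_error(...) ** 0.5` matches the formula computed by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up ratings and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])

# Errors are 0.5, 0.0 and -1.5, so the MSE is (0.25 + 0.0 + 2.25) / 3.
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(round(rmse_sklearn, 3))  # prints: 0.913
```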
