# **Regression models**

## Configuration:

Import necessary entities:

In [1]:
from typing import Any
from numpy import average
from sklearn.svm import SVR
from warnings import filterwarnings
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from pandas import (
    Series,
    DataFrame,
    read_csv,
)
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    train_test_split,
)
from sklearn.linear_model import (
    Lasso,
    Ridge,
    ElasticNet,
    HuberRegressor,
    LinearRegression,
)

Ignore all warnings:

In [2]:
filterwarnings("ignore", )

## Preprocessing:

Create a dictionary of file names and directory paths for the `read_csv()` calls:

In [3]:
read_csv_params: dict[str, str] = {
    "target_file": "numerical.csv",
    "features_file": "features.csv",

    "targets_file_path": "../../../data/datasets/targets/",
    "features_file_path": "../../../data/datasets/processed/",
}

Read the `features.csv` data into a *Pandas* dataframe:

In [4]:
X: DataFrame = read_csv(
    read_csv_params["features_file_path"] + read_csv_params["features_file"],
    index_col=0,
)

Read the `numerical.csv` target data into a *Pandas* dataframe:

In [5]:
y: DataFrame = read_csv(
    read_csv_params["targets_file_path"] + read_csv_params["target_file"],
    index_col=0,
)
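
`read_csv()` returns a `DataFrame` even when the file holds a single data column, so `y` is a one-column frame rather than a `Series`. If a true `Series` is wanted, `DataFrame.squeeze("columns")` is one way to get it. The sketch below uses a small in-memory CSV with illustrative rows instead of the real target file:

```python
from io import StringIO

from pandas import DataFrame, Series, read_csv

# Hypothetical stand-in for the target file: an index column and "rating".
csv_text = ",rating\n0,2.5\n1,4.375\n2,3.75\n"

y_frame: DataFrame = read_csv(StringIO(csv_text), index_col=0)
# Collapse the single column into a Series; the index is preserved.
y_series: Series = y_frame.squeeze("columns")

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```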

Check the contents of the `X` and `y` variables:

In [6]:
X.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
y.head()

Unnamed: 0,rating
0,2.5
1,4.375
2,3.75
3,5.0
4,3.125


Use the `train_test_split()` function to split `X` and `y` into training and test sets:

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=21,
)

Check `X_train`, `X_test`, `y_train`, `y_test` variables:

In [9]:
X_train.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
15035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1790,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8329,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
X_test.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
8900,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
17923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
y_train.head()

Unnamed: 0,rating
15035,0.0
15435,3.125
1790,0.0
8329,0.0
3195,3.75


In [12]:
y_test.head()

Unnamed: 0,rating
8900,4.375
17863,4.375
10688,4.375
17923,4.375
3607,4.375


## Prediction:

### Naive solution:

Print the *RMSE* metric score of the naive solution (predicting the mean rating for every sample):

In [13]:
print(
    f"The naive solution RMSE metric is {
        mean_squared_error(
            y,
            [round(average(y, ), 3, ), ] * len(y, ),
        ) ** 0.5:.3f
    }.",
)

The naive solution RMSE metric is 1.341.
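
For a constant mean prediction, the RMSE equals the population standard deviation of the target, which gives a quick sanity check for the baseline. A minimal sketch with made-up ratings (not the notebook's data):

```python
import numpy as np

# Hypothetical ratings, used only for illustration.
ratings = np.array([2.5, 4.375, 3.75, 5.0, 3.125, 0.0])

# Predict the mean for every sample.
baseline = np.full_like(ratings, ratings.mean())
rmse = np.sqrt(np.mean((ratings - baseline) ** 2))

# With the default ddof=0, the standard deviation is the same quantity.
print(f"RMSE: {rmse:.3f}, std: {ratings.std():.3f}")
```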


### *Linear regression* model:

Create a model of *linear regression*:

In [14]:
lin_reg_model: LinearRegression = LinearRegression(n_jobs=-1, )

Print the cross-validated *RMSE* metric score of the *linear regression* model:

In [15]:
print(
    f"The linear regression cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=15,
            n_jobs=-1,
            estimator=lin_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The linear regression cross-validation model RMSE metric score is 1.286.
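
The same `cross_val_score()` call is repeated for every model in this notebook. One way to cut the repetition is a small helper. Here `cv_rmse` is a hypothetical name, and the synthetic data only stands in for `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def cv_rmse(estimator, X, y, cv: int = 5) -> float:
    """Return the mean cross-validated RMSE as a positive number."""
    scores = cross_val_score(
        estimator,
        X,
        y,
        cv=cv,
        scoring="neg_root_mean_squared_error",
    )
    # scikit-learn reports the negated RMSE, so flip the sign.
    return float(-scores.mean())


# Synthetic stand-in data: a linear signal plus unit-variance noise.
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

demo_score = cv_rmse(LinearRegression(), X_demo, y_demo, cv=5)
print(f"{demo_score:.3f}")
```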


Create a parameters grid for the *linear regression* model:

In [16]:
lin_reg_model_params_grid: dict[str, list[bool]] = {
    "positive": [True, False, ],
    "fit_intercept": [True, False, ],
}

Create a *gridsearch* model of the *linear regression* model:

In [17]:
lin_reg_grid_search_model: GridSearchCV = GridSearchCV(
    cv=10,
    n_jobs=-1,
    estimator=lin_reg_model,
    param_grid=lin_reg_model_params_grid,
    scoring="neg_root_mean_squared_error",
)

Train the *gridsearch* model of the *linear regression* model:

In [18]:
lin_reg_grid_search_model.fit(X, y, );

Print the best *linear regression* model *RMSE* metric score:

In [19]:
print(
    f"The best linear regression model RMSE metric score is {
        -lin_reg_grid_search_model.best_score_:.3f
    }.",
)

The best linear regression model RMSE metric score is 1.286.
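
`best_score_` alone does not say which parameter combination won. After fitting, `best_params_` and `cv_results_` expose that information. A sketch on synthetic data (an assumption; the notebook fits on `X` and `y`):

```python
import numpy as np
from pandas import DataFrame
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with a clear intercept of 3.0.
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, 0.5]) + 3.0 + rng.normal(size=120)

search = GridSearchCV(
    estimator=LinearRegression(),
    param_grid={"positive": [True, False], "fit_intercept": [True, False]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# The winning combination and a ranked table of all candidates.
print(search.best_params_)
results = DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```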


### *Ridge linear regression* model:

Create a model of *ridge linear regression*:

In [20]:
ridge_lin_reg_model: Ridge = Ridge(random_state=21, )

Print the cross-validated *RMSE* metric score of the *ridge linear regression* model:

In [21]:
print(
    f"The ridge linear regression cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=20,
            n_jobs=-1,
            estimator=ridge_lin_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The ridge linear regression cross-validation model RMSE metric score is 1.284.


Create a parameters grid for the *ridge linear regression* model:

In [22]:
ridge_lin_reg_model_params_grid: dict[str, list[Any]] = {
    "positive": [True, False, ],
    "fit_intercept": [True, False, ],
    "alpha": [
        0,
        0.5,
        1,
        1.5,
    ],
    "solver": [
        "svd",
        "sag",
        "lsqr",
        "saga",
        "lbfgs",
        "cholesky",
        "sparse_cg",
    ],
}

Create a *gridsearch* model of the *ridge linear regression* model:

In [23]:
ridge_lin_reg_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=ridge_lin_reg_model,
    scoring="neg_root_mean_squared_error",
    param_grid=ridge_lin_reg_model_params_grid,
)

Train the *gridsearch* model of the *ridge linear regression* model:

In [24]:
ridge_lin_reg_grid_search_model.fit(X, y, );

Print the best *ridge linear regression* model *RMSE* metric score:

In [25]:
print(
    f"The best ridge linear regression model RMSE metric score is {
        -ridge_lin_reg_grid_search_model.best_score_:.3f
    }.",
)

The best ridge linear regression model RMSE metric score is 1.287.


### *Lasso linear regression* model:

Create a model of *lasso linear regression*:

In [26]:
lasso_lin_reg_model: Lasso = Lasso(random_state=21, )

Print the cross-validated *RMSE* metric score of the *lasso linear regression* model:

In [27]:
print(
    f"The lasso linear regression cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=20,
            n_jobs=-1,
            estimator=lasso_lin_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The lasso linear regression cross-validation model RMSE metric score is 1.340.


Create a parameters grid for the *lasso linear regression* model:

In [28]:
lasso_lin_reg_model_params_grid: dict[str, list[Any]] = {
    "positive": [True, False, ],
    "warm_start": [True, False, ],
    "fit_intercept": [True, False, ],
    "selection": ["random", "cyclic", ],
    "alpha": [
        0.1,
        0.5,
        1,
        1.5,
    ],
}

Create a *gridsearch* model of the *lasso linear regression* model:

In [29]:
lasso_lin_reg_grid_search_model: GridSearchCV = GridSearchCV(
    cv=10,
    n_jobs=-1,
    estimator=lasso_lin_reg_model,
    scoring="neg_root_mean_squared_error",
    param_grid=lasso_lin_reg_model_params_grid,
)

Train the *gridsearch* model of the *lasso linear regression* model:

In [30]:
lasso_lin_reg_grid_search_model.fit(X, y, );

Print the best *lasso linear regression* model *RMSE* metric score:

In [31]:
print(
    f"The best lasso linear regression model RMSE metric score is {
        -lasso_lin_reg_grid_search_model.best_score_:.3f
    }.",
)

The best lasso linear regression model RMSE metric score is 1.341.


### *Elastic net linear regression* model:

Create a model of *elastic net linear regression*:

In [32]:
elastic_net_lin_reg_model: ElasticNet = ElasticNet(random_state=21, )

Print the cross-validated *RMSE* metric score of the *elastic net linear regression* model:

In [33]:
print(
    "The elastic net linear regression cross-validation model RMSE metric " +
    f"score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=20,
            n_jobs=-1,
            estimator=elastic_net_lin_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The elastic net linear regression cross-validation model RMSE metric score is 1.340.


Create a parameters grid for the *elastic net linear regression* model:

In [34]:
elastic_net_lin_reg_model_params_grid: dict[str, list[Any]] = {
    "positive": [True, False, ],
    "warm_start": [True, False, ],
    "fit_intercept": [True, False, ],
    "selection": ["random", "cyclic", ],
    "l1_ratio": [
        0.25,
        0.5,
        0.75,
    ],
    "alpha": [
        0.25,
        0.5,
        1,
        1.5,
    ],
}
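
Grid size grows multiplicatively with every added parameter. `ParameterGrid` can count the candidates before an expensive search starts. For the elastic net grid above, searched with `cv=10`, this gives 192 candidates and 1920 cross-validation fits:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above for the elastic net model.
grid = {
    "positive": [True, False],
    "warm_start": [True, False],
    "fit_intercept": [True, False],
    "selection": ["random", "cyclic"],
    "l1_ratio": [0.25, 0.5, 0.75],
    "alpha": [0.25, 0.5, 1, 1.5],
}

n_candidates = len(ParameterGrid(grid))
n_folds = 10  # matches cv=10 used for this search
print(f"{n_candidates} candidates, {n_candidates * n_folds} fits")
```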

Create a *gridsearch* model of the *elastic net linear regression* model:

In [35]:
elastic_net_lin_reg_grid_search_model: GridSearchCV = GridSearchCV(
    cv=10,
    n_jobs=-1,
    estimator=elastic_net_lin_reg_model,
    scoring="neg_root_mean_squared_error",
    param_grid=elastic_net_lin_reg_model_params_grid,
)

Train the *gridsearch* model of the *elastic net linear regression* model:

In [36]:
elastic_net_lin_reg_grid_search_model.fit(X, y, );

Print the best *elastic net linear regression* model *RMSE* metric score:

In [37]:
print(
    f"The best elastic net linear regression model RMSE metric score is {
        -elastic_net_lin_reg_grid_search_model.best_score_:.3f
    }.",
)

The best elastic net linear regression model RMSE metric score is 1.341.


### *Huber regression* model:

Create a model of *Huber regression*:

In [38]:
hub_reg_model: HuberRegressor = HuberRegressor()

Print the cross-validated *RMSE* metric score of the *Huber regression* model:

In [39]:
print(
    f"The huber regression cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=10,
            n_jobs=-1,
            estimator=hub_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The huber regression cross-validation model RMSE metric score is 1.342.


Create a parameters grid for the *Huber regression* model:

In [40]:
hub_reg_model_params_grid: dict[str, list[Any]] = {
    "fit_intercept": [True, False, ],
    "epsilon": [
        1,
        1.2,
        1.4,
        1.6,
    ],
}

Create a *gridsearch* model of the *Huber regression* model:

In [41]:
hub_reg_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=hub_reg_model,
    param_grid=hub_reg_model_params_grid,
    scoring="neg_root_mean_squared_error",
)

Train the *gridsearch* model of the *Huber regression* model:

In [42]:
hub_reg_grid_search_model.fit(X, y, );

Print the best *Huber regression* model *RMSE* metric score:

In [43]:
print(
    f"The best huber regression model RMSE metric score is {
        -hub_reg_grid_search_model.best_score_:.3f
    }.",
)

The best huber regression model RMSE metric score is 1.329.


### *KNN regression* model:

Create a model of *KNN regression*:

In [44]:
knn_reg_model: KNeighborsRegressor = KNeighborsRegressor(n_jobs=-1, )

Print the cross-validated *RMSE* metric score of the *KNN regression* model:

In [45]:
print(
    f"The KNN regression cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=15,
            n_jobs=-1,
            estimator=knn_reg_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The KNN regression cross-validation model RMSE metric score is 1.398.


Create a parameters grid for the *KNN regression* model:

In [46]:
knn_reg_model_params_grid: dict[str, list[Any]] = {
    "n_neighbors": [
        3,
        4,
    ],
    "weights": [
        "uniform",
        "distance",
    ],
    "algorithm": [
        "brute",
        "kd_tree",
        "ball_tree",
    ],
}

Create a *gridsearch* model of the *KNN regression* model:

In [47]:
knn_reg_grid_search_model: GridSearchCV = GridSearchCV(
    cv=4,
    n_jobs=-1,
    estimator=knn_reg_model,
    param_grid=knn_reg_model_params_grid,
    scoring="neg_root_mean_squared_error",
)

Train the *gridsearch* model of the *KNN regression* model:

In [48]:
knn_reg_grid_search_model.fit(X, y, );

Print the best *KNN regression* model *RMSE* metric score:

In [49]:
print(
    f"The best KNN regression model RMSE metric score is {
        -knn_reg_grid_search_model.best_score_:.3f
    }.",
)

The best KNN regression model RMSE metric score is 1.423.


### *Decision tree* model:

Create a model of *decision tree*:

In [50]:
tree_model: DecisionTreeRegressor = DecisionTreeRegressor(random_state=21, )

Print the cross-validated *RMSE* metric score of the *decision tree* model:

In [51]:
print(
    f"The decision tree cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=15,
            n_jobs=-1,
            estimator=tree_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The decision tree cross-validation model RMSE metric score is 1.576.


Create a parameters grid for the *decision tree* model:

In [52]:
tree_model_params_grid: dict[str, list[Any]] = {
    "max_depth": range(1, 26, ),
    "splitter": ["random", "best", ],
    "ccp_alpha": [
        0,
        0.5,
        1,
    ],
    "criterion": [
        "poisson",
        "friedman_mse",
        "absolute_error",
    ],
    "max_features": [
        None,
        "sqrt",
        "log2",
    ],
}

Create a *gridsearch* model of the *decision tree* model:

In [53]:
tree_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=tree_model,
    param_grid=tree_model_params_grid,
    scoring="neg_root_mean_squared_error",
)

Train the *gridsearch* model of the *decision tree* model:

In [54]:
tree_grid_search_model.fit(X, y, );

Print the best *decision tree* model *RMSE* metric score:

In [55]:
print(
    f"The best decision tree model RMSE metric score is {
        -tree_grid_search_model.best_score_:.3f
    }.",
)

The best decision tree model RMSE metric score is 1.304.


### *SVR* model:

Create a model of *SVR*:

In [56]:
svr_model: SVR = SVR()

Print the cross-validated *RMSE* metric score of the *SVR* model:

In [57]:
print(
    f"The SVR cross-validation model RMSE metric score is {
        -cross_val_score(
            X=X,
            y=y,
            cv=4,
            n_jobs=-1,
            estimator=svr_model,
            scoring="neg_root_mean_squared_error",
        ).mean():.3f
    }.",
)

The SVR cross-validation model RMSE metric score is 1.325.


Create a parameters grid for the *SVR* model:

In [58]:
svr_model_params_grid: dict[str, list[str]] = {
    "gamma": ["auto", "scale", ],
    "kernel": [
        "rbf",
        "poly",
        "sigmoid",
    ],
}

Create a *gridsearch* model of the *SVR* model:

In [59]:
svr_grid_search_model: GridSearchCV = GridSearchCV(
    cv=2,
    n_jobs=-1,
    estimator=svr_model,
    param_grid=svr_model_params_grid,
    scoring="neg_root_mean_squared_error",
)

Train the *gridsearch* model of the *SVR* model:

In [60]:
svr_grid_search_model.fit(X, y, );

Print the best *SVR* model *RMSE* metric score:

In [61]:
print(
    f"The best SVR model RMSE metric score is {
        -svr_grid_search_model.best_score_:.3f
    }.",
)

The best SVR model RMSE metric score is 1.327.


## Model selection:

Check the best regression model parameters:

In [62]:
ridge_lin_reg_model

Parameter,Value
alpha,1.0
fit_intercept,True
copy_X,True
max_iter,None
tol,0.0001
solver,'auto'
positive,False
random_state,21
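
The choice of ridge regression follows from the scores printed in the prediction section. Collecting them in one place makes the comparison explicit. The values below are the cross-validation RMSE scores reported in this notebook; the dictionary name is hypothetical:

```python
# Cross-validation RMSE per model, copied from the outputs above.
cv_rmse_scores: dict[str, float] = {
    "linear": 1.286,
    "ridge": 1.284,
    "lasso": 1.340,
    "elastic_net": 1.340,
    "huber": 1.342,
    "knn": 1.398,
    "decision_tree": 1.576,
    "svr": 1.325,
}

# Lower RMSE is better.
best_name = min(cv_rmse_scores, key=cv_rmse_scores.get)
print(best_name, cv_rmse_scores[best_name])
```

The fold counts differ between models (from `cv=4` to `cv=20`), so this comparison is only approximate.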


Train the best regression model:

In [63]:
ridge_lin_reg_model.fit(X_train, y_train, );

Print the best regression model *RMSE* metric score on the test set:

In [64]:
print(
    f"The best regression model RMSE metric score is {
        mean_squared_error(
            y_test,
            ridge_lin_reg_model.predict(X_test, ),
        ) ** 0.5:.3f
    }.",
)

The best regression model RMSE metric score is 1.272.
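
The naive RMSE earlier was computed on the whole target, while this score comes from the held-out test split. For a like-for-like comparison, a mean baseline can be fitted on the training split and scored on the same test split. A sketch with `DummyRegressor` on synthetic data (an assumption, not the notebook's data; `holdout_rmse` is a hypothetical helper):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus unit-variance noise.
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([1.0, -1.0, 0.5, 2.0, 0.0]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21,
)


def holdout_rmse(model) -> float:
    """Fit on the training split and return RMSE on the test split."""
    model.fit(X_tr, y_tr)
    return float(mean_squared_error(y_te, model.predict(X_te)) ** 0.5)


baseline_rmse = holdout_rmse(DummyRegressor(strategy="mean"))
ridge_rmse = holdout_rmse(Ridge(random_state=21))
print(f"baseline: {baseline_rmse:.3f}, ridge: {ridge_rmse:.3f}")
```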
