# **Classification model ensembles**

## Configuration:

Import necessary entities:

In [1]:
from typing import Any
from xgboost import XGBClassifier
from warnings import filterwarnings
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from pandas import (
    Series,
    DataFrame,
    read_csv,
)
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    train_test_split,
)
from sklearn.ensemble import (
    VotingClassifier,
    StackingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)

Ignore all warnings:

In [2]:
filterwarnings("ignore", )
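
Silencing every warning also hides deprecation and convergence messages from the libraries used below. A narrower alternative is to filter by category. The sketch assumes that only `FutureWarning` noise is unwanted; that category and the `noisy()` helper are illustrative, not taken from this notebook:

```python
import warnings


def noisy() -> int:
    """Hypothetical helper that emits two different warning categories."""
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("solver did not converge", UserWarning)
    return 1


# Record warnings inside the block so the effect of the filter is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                             # show everything by default
    warnings.filterwarnings("ignore", category=FutureWarning)  # hide one category only
    noisy()

remaining = [str(w.message) for w in caught]
print(remaining)
```

Only the `UserWarning` message survives, so a real convergence problem would still be reported.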

## Preprocessing:

Create a dictionary of file names and directory paths for the `read_csv()` calls:

In [3]:
read_csv_params: dict[str, str] = {
    "features_file": "features.csv",
    "categorical_target": "categorical.csv",

    "targets_file_path": "../../../data/datasets/targets/",
    "features_file_path": "../../../data/datasets/processed/",
}

Read the `features.csv` data into a *Pandas* dataframe:

In [4]:
X: DataFrame = read_csv(
    read_csv_params["features_file_path"] + read_csv_params["features_file"],
    index_col=0,
)

Read the `categorical.csv` data into a *Pandas* dataframe:

In [5]:
y: DataFrame = read_csv(
    read_csv_params["targets_file_path"] +
    read_csv_params["categorical_target"],
    index_col=0,
)
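
`read_csv()` returns a `DataFrame` even when the file holds a single column, while scikit-learn estimators generally expect a one-dimensional target. One way to get a `Series`, sketched here on a small in-memory CSV (the sample rows and the names `frame` and `target` are assumptions for illustration), is `DataFrame.squeeze("columns")`:

```python
from io import StringIO

from pandas import DataFrame, Series, read_csv

# In-memory stand-in for a one-column target file with an unnamed index column.
csv_text = StringIO(",categorical_rating\n0,2\n1,4\n2,4\n")

frame: DataFrame = read_csv(csv_text, index_col=0)

# Collapse the single column into a Series; the column name becomes the Series name.
target: Series = frame.squeeze("columns")

print(type(frame).__name__, type(target).__name__, target.name)
```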

Check the first rows of `X` and `y`:

In [6]:
X.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
y.head()

Unnamed: 0,categorical_rating
0,2
1,4
2,4
3,5
4,3


Use the `train_test_split()` function to split `X` and `y` into training and test sets, stratified by the target:

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,
    test_size=0.2,
    random_state=21,
)
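
The `stratify` argument keeps the class proportions of the target the same in both splits. A minimal sketch on synthetic labels (the 20/60/20 class mix and the variable names are assumptions, not the notebook's data):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic target with an imbalanced 20% / 60% / 20% class mix.
labels = [0] * 20 + [1] * 60 + [2] * 20
features = [[i] for i in range(len(labels))]

_, _, y_tr, y_te = train_test_split(
    features,
    labels,
    stratify=labels,
    test_size=0.2,
    random_state=21,
)

# The 20-row test split mirrors the original class mix.
print(Counter(y_te))
```

Without `stratify`, a rare rating class could end up under-represented in the test set purely by chance.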

Check `X_train`, `X_test`, `y_train`, `y_test`:

In [9]:
X_train.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
16991,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8564,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15804,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
X_test.head()

Unnamed: 0,cod,fig,egg,gin,ham,oat,nut,pea,rum,rye,...,fortified wine,sparkling wine,sugar snap pea,beef tenderloin,cranberry sauce,pork tenderloin,poultry sausage,pomegranate juice,jerusalem artichoke,hominy/cornmeal/masa
5532,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14955,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1157,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10288,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4974,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
y_train.head()

Unnamed: 0,categorical_rating
16991,4
12957,3
8564,4
15804,4
10259,4


In [12]:
y_test.head()

Unnamed: 0,categorical_rating
5532,4
14955,0
1157,4
10288,3
4974,0


## Prediction:

### *XGBoost* model:

Create an *XGBoost* model:

In [13]:
xgb_model: XGBClassifier = XGBClassifier(random_state=21, )

Print the cross-validated *accuracy* and *precision* scores of the *XGBoost* model:

In [14]:
print(
    f"The XGBoost cross-validation model accuracy metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            scoring="accuracy",
            estimator=xgb_model,
        ).mean():.3f
    }.",
    f"\nThe XGBoost cross-validation model precision metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            estimator=xgb_model,
            scoring="precision_weighted",
        ).mean():.3f
    }.",
)

The XGBoost cross-validation model accuracy metric score is 0.667. 
The XGBoost cross-validation model precision metric score is 0.555.
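
The gap between accuracy and `precision_weighted` comes from how the latter is built: precision is computed per class and then averaged with weights equal to each class's share of the true labels. A class that is never predicted contributes a precision of 0. A pure-Python sketch of that definition on toy labels (the function name and the data are illustrative):

```python
from collections import Counter


def weighted_precision(y_true: list[int], y_pred: list[int]) -> float:
    """Per-class precision averaged by class support (count of true labels)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # A class that is never predicted gets precision 0.
        precision = correct / predicted if predicted else 0.0
        score += precision * count / total
    return score


y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}, weighted precision={weighted_precision(y_true, y_pred):.3f}")
```

Here class 2 is never predicted, so the weighted precision (about 0.633) falls below the accuracy (about 0.667), the same pattern as in the scores above.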


Create a parameter grid for the *XGBoost* model:

In [15]:
xgb_model_params_grid: dict[str, list[Any]] = {
    "n_estimators": [
        75,
        100,
    ],
    "learning_rate": [
        0.1,
        0.3,
        0.5,
    ],
}

Create a *grid search* over the *XGBoost* model:

In [16]:
xgb_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=xgb_model,
    scoring="precision_weighted",
    param_grid=xgb_model_params_grid,
)
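
A grid search evaluates every combination of the listed values on every fold, so its cost grows multiplicatively. The count for a grid of this shape can be sketched with `itertools.product` (the variable names are illustrative):

```python
from itertools import product

params_grid = {
    "n_estimators": [75, 100],
    "learning_rate": [0.1, 0.3, 0.5],
}
n_folds = 5

# Expand the grid into one dict per candidate parameter combination.
candidates = [
    dict(zip(params_grid, values))
    for values in product(*params_grid.values())
]

n_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

With `refit=True`, which is the `GridSearchCV` default, one more fit on the full data passed to `fit()` follows the search.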

Fit the *grid search* for the *XGBoost* model:

In [17]:
xgb_grid_search_model.fit(X, y, );

Print the best *XGBoost* model *precision* metric score:

In [18]:
print(
    f"The best XGBoost model precision metric score is {
        xgb_grid_search_model.best_score_:.3f
    }.",
)

The best XGBoost model precision metric score is 0.561.


### *Random forest* model:

Create a *random forest* model:

In [19]:
forest_model: RandomForestClassifier = RandomForestClassifier(
    n_jobs=-1,
    random_state=21,
)

Print the cross-validated *accuracy* and *precision* scores of the *random forest* model:

In [20]:
print(
    f"The random forest cross-validation model accuracy metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            scoring="accuracy",
            estimator=forest_model,
        ).mean():.3f
    }.",
    f"\nThe random forest cross-validation model precision metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            estimator=forest_model,
            scoring="precision_weighted",
        ).mean():.3f
    }.",
)

The random forest cross-validation model accuracy metric score is 0.649. 
The random forest cross-validation model precision metric score is 0.584.
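
Calling `cross_val_score()` once per metric refits the model on every fold twice. scikit-learn's `cross_validate()` accepts several scorers and fits each fold once. A sketch on synthetic data (the dataset, the decision tree, and the names `X_demo`, `y_demo` are stand-ins, not the notebook's objects):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Small synthetic three-class problem used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=8,
    n_informative=4,
    n_classes=3,
    random_state=21,
)

scores = cross_validate(
    DecisionTreeClassifier(random_state=21),
    X_demo,
    y_demo,
    cv=5,
    scoring=["accuracy", "precision_weighted"],
)

# Each metric is stored under a "test_<name>" key, one value per fold.
for name in ("accuracy", "precision_weighted"):
    print(f"{name}: {scores[f'test_{name}'].mean():.3f}")
```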


Create a parameter grid for the *random forest* model:

In [21]:
forest_model_params_grid: dict[str, list[Any]] = {
    "criterion": [
        "gini",
        "entropy",
        "log_loss",
    ],
    "n_estimators": [
        50,
        100,
        150,
    ],
}

Create a *grid search* over the *random forest* model:

In [22]:
forest_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=forest_model,
    scoring="precision_weighted",
    param_grid=forest_model_params_grid,
)

Fit the *grid search* for the *random forest* model:

In [23]:
forest_grid_search_model.fit(X, y, );

Print the best *random forest* model *precision* metric score:

In [24]:
print(
    f"The best random forest model precision metric score is {
        forest_grid_search_model.best_score_:.3f
    }.",
)

The best random forest model precision metric score is 0.586.


### *Extra trees* model:

Create an *extra trees* model:

In [25]:
extra_trees_model: ExtraTreesClassifier = ExtraTreesClassifier(
    n_jobs=-1,
    random_state=21,
)

Print the cross-validated *accuracy* and *precision* scores of the *extra trees* model:

In [26]:
print(
    f"The extra trees cross-validation model accuracy metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            scoring="accuracy",
            estimator=extra_trees_model,
        ).mean():.3f
    }.",
    f"\nThe extra trees cross-validation model precision metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            estimator=extra_trees_model,
            scoring="precision_weighted",
        ).mean():.3f
    }.",
)

The extra trees cross-validation model accuracy metric score is 0.627. 
The extra trees cross-validation model precision metric score is 0.571.


Create a parameter grid for the *extra trees* model:

In [27]:
extra_trees_model_params_grid: dict[str, list[Any]] = {
    "criterion": [
        "gini",
        "entropy",
        "log_loss",
    ],
    "n_estimators": [
        50,
        75,
        100,
    ],
}

Create a *grid search* over the *extra trees* model:

In [28]:
extra_trees_grid_search_model: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    estimator=extra_trees_model,
    scoring="precision_weighted",
    param_grid=extra_trees_model_params_grid,
)

Fit the *grid search* for the *extra trees* model:

In [29]:
extra_trees_grid_search_model.fit(X, y, );

Print the best *extra trees* model *precision* metric score:

In [30]:
print(
    f"The best extra trees model precision metric score is {
        extra_trees_grid_search_model.best_score_:.3f
    }.",
)

The best extra trees model precision metric score is 0.571.


### *Voting* model:

Create a *voting* model from a decision tree and a logistic regression:

In [31]:
voting_model: VotingClassifier = VotingClassifier(
    n_jobs=-1,
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=21, ), ),
        ("log_reg", LogisticRegression(
            n_jobs=-1,
            random_state=21,
            multi_class="multinomial",
        ), ),
    ],
)
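
`VotingClassifier` defaults to `voting="hard"`, where each base estimator casts one vote per sample and the most frequent label wins. With only two base estimators every disagreement is a tie, and scikit-learn documents that ties are resolved in ascending order of the class labels. A pure-Python sketch of that rule (the `hard_vote` helper is illustrative, not part of scikit-learn):

```python
from collections import Counter


def hard_vote(predictions: list[int]) -> int:
    """Majority vote over one sample's predictions; ties go to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


print(hard_vote([4, 4]))     # both estimators agree
print(hard_vote([4, 3]))     # tie between two estimators
print(hard_vote([2, 5, 5]))  # clear majority among three estimators
```

In a two-estimator ensemble, a disagreement is therefore always settled in favour of the lower rating class, which may partly explain why this ensemble does not outperform the single tree-based models.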

Print the cross-validated *accuracy* and *precision* scores of the *voting* model:

In [32]:
print(
    f"The voting cross-validation model accuracy metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=10,
            n_jobs=-1,
            scoring="accuracy",
            estimator=voting_model,
        ).mean():.3f
    }.",
    f"\nThe voting cross-validation model precision metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=10,
            n_jobs=-1,
            estimator=voting_model,
            scoring="precision_weighted",
        ).mean():.3f
    }.",
)

The voting cross-validation model accuracy metric score is 0.598. 
The voting cross-validation model precision metric score is 0.561.


### *Stacking* model:

Create a *stacking* model with a decision tree as the final estimator:

In [33]:
stacking_model: StackingClassifier = StackingClassifier(
    passthrough=True,
    final_estimator=DecisionTreeClassifier(random_state=21, ),
    estimators=[
        ("knn", KNeighborsClassifier(n_jobs=-1, ), ),
        ("log_reg", LogisticRegression(
            n_jobs=-1,
            random_state=21,
            multi_class="multinomial",
        ), ),
    ],
)
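
With `passthrough=True` the final estimator is trained on the base estimators' predictions concatenated with the original features. The shape of that meta-feature matrix can be inspected with `StackingClassifier.transform()`. A sketch on synthetic data (the dataset, `max_iter=1000`, and the `demo_` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Three classes and six features, so the expected width is easy to check by hand.
X_demo, y_demo = make_classification(
    n_samples=150,
    n_features=6,
    n_informative=4,
    n_classes=3,
    random_state=21,
)

demo_stack = StackingClassifier(
    passthrough=True,
    final_estimator=DecisionTreeClassifier(random_state=21),
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("log_reg", LogisticRegression(max_iter=1000)),
    ],
)
demo_stack.fit(X_demo, y_demo)

# 2 estimators x 3 class probabilities + 6 original features = 12 columns.
meta_features = demo_stack.transform(X_demo)
print(meta_features.shape)
```

On a wide, sparse feature matrix like the one in this notebook, the passed-through columns far outnumber the stacked predictions, which an unconstrained decision tree can easily overfit.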

Print the cross-validated *accuracy* and *precision* scores of the *stacking* model:

In [34]:
print(
    f"The stacking cross-validation model accuracy metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            scoring="accuracy",
            estimator=stacking_model,
        ).mean():.3f
    }.",
    f"\nThe stacking cross-validation model precision metric score is {
        cross_val_score(
            X=X,
            y=y,
            cv=5,
            n_jobs=-1,
            estimator=stacking_model,
            scoring="precision_weighted",
        ).mean():.3f
    }.",
)

The stacking cross-validation model accuracy metric score is 0.532. 
The stacking cross-validation model precision metric score is 0.512.


## Model selection:

The *random forest* grid search reached the highest cross-validated precision (0.586), so check its parameters:

In [35]:
forest_grid_search_model

GridSearchCV parameters:

parameter,value
estimator,RandomForestC...ndom_state=21)
param_grid,"{'criterion': ['gini', 'entropy', ...], 'n_estimators': [50, 100, ...]}"
scoring,'precision_weighted'
n_jobs,-1
refit,True
cv,5
verbose,0
pre_dispatch,'2*n_jobs'
error_score,
return_train_score,False

RandomForestClassifier parameters:

parameter,value
n_estimators,100
criterion,'entropy'
max_depth,
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,
min_impurity_decrease,0.0
bootstrap,True


Refit the best classification ensemble's grid search on the training split only, so that the test split stays unseen:

In [36]:
forest_grid_search_model.fit(X_train, y_train, );

Print the best classification ensemble model metrics scores on the test split:

In [37]:
print(
    f"The best classification ensemble model accuracy metric score is {
        accuracy_score(
            y_test,
            forest_grid_search_model.predict(X_test, ),
        ):.3f
    }.",
    f"\nThe best classification ensemble model precision metric score is {
        precision_score(
            y_test,
            forest_grid_search_model.predict(X_test, ),
            average="weighted",
        ):.3f
    }.",
)

The best classification ensemble model accuracy metric score is 0.649. 
The best classification ensemble model precision metric score is 0.588.
