<a href="https://colab.research.google.com/github/harishmuh/machine_learning_practices/blob/main/Ensemble_various_type_voting_and_stacking_tutorial_white_wine_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Ensemble Model**: `Various Types: Voting & Stacking`
---

We will demonstrate voting and stacking ensembles. First, import the Python libraries that will be needed.

In [1]:
# Import required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
import numpy as np
import pandas as pd

Create a synthetic binary classification problem with 1000 examples and 10 input features using make_classification().

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=10, random_state=2)

# Separate the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=2020, stratify=y
)

#### **1. Voting Ensembles**

**Definition**: A voting ensemble is a machine learning model that combines the predictions from multiple other models. It can be used for classification or regression. In regression, the ensemble prediction is the average of the models' predictions. In classification, the votes for each label are tallied and the label with the most votes is predicted.

There are two approaches to the majority vote prediction for classification.

- Hard Voting. Voting is calculated on the predicted output class.
- Soft Voting. Voting is calculated on the predicted probability of the output class.
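The difference between the two approaches can be sketched with plain NumPy. The probabilities below are hypothetical and chosen only so that the two rules disagree on one sample; the 0.5 threshold assumes a binary problem.

```python
import numpy as np

# Hypothetical positive-class probabilities from three classifiers (rows)
# for four samples (columns); the numbers are illustrative only.
proba = np.array([
    [0.45, 0.90, 0.10, 0.55],   # classifier A
    [0.40, 0.80, 0.20, 0.45],   # classifier B
    [0.95, 0.30, 0.30, 0.60],   # classifier C
])

# Hard voting: each classifier casts one vote for its predicted class,
# and the class with at least two of the three votes wins.
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities first, then pick the class.
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print('hard:', hard)
print('soft:', soft)
```

On the first sample two classifiers vote for class 0, so hard voting predicts 0, while the one confident classifier pulls the average probability above 0.5, so soft voting predicts 1.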

Use a voting ensemble if:

**It results in better performance than any model used in the ensemble.**

The voting ensemble is not guaranteed to provide better performance than any single model used in the ensemble. If any given model used in the ensemble performs better than the voting ensemble, that model should probably be used instead of the voting ensemble.

**It results in a lower variance than any model used in the ensemble.**

This is not always the case. A voting ensemble can offer lower variance in its predictions than the individual models. That stability may come at the cost of a slightly lower mean performance, a trade-off that can be worthwhile when consistency matters more than peak score.

**Notes:** A limitation of the voting ensemble is that it treats all models the same, meaning all models contribute equally to the prediction. This is a problem if some models are good in some situations and poor in others.
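One common way to soften this limitation is to give the models unequal weights. scikit-learn's `VotingClassifier` accepts a `weights` argument for this. The sketch below assumes the same synthetic data as above; the weights `[2, 1, 1]` are illustrative and not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)

# Illustrative weights: logistic regression counts twice as much
# as each of the other two models when probabilities are averaged.
weighted_voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('knn', KNeighborsClassifier()),
        ('dtree', DecisionTreeClassifier(random_state=0)),
    ],
    voting='soft',
    weights=[2, 1, 1],
)
weighted_voting.fit(X, y)

# Weighted average of the class probabilities for the first five samples
proba = weighted_voting.predict_proba(X[:5])
```

In practice the weights would be chosen with cross-validation rather than set by hand.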

**Practice:**

First, we create a function named get_voting() that builds each model and combines them into a hard voting ensemble.



In [3]:
# Voting ensembles
from sklearn.ensemble import VotingClassifier

# Define a function for voting ensemble
def get_voting():
    models=[]
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('dtree', DecisionTreeClassifier()))

    ensemble = VotingClassifier(estimators=models, voting='hard')
    return ensemble

In [4]:
get_voting()

The get_models() function below creates the list of models for us to evaluate. This lets us directly compare each standalone model with the ensemble in terms of the distribution of cross-validated F1 scores.

In [5]:
# Define a function that returns the standalone models and the voting ensemble
def get_models():
    models = {}
    models['LogisticRegression'] = LogisticRegression()
    models['KNeighborsClassifier'] = KNeighborsClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    models['Voting'] = get_voting()
    return models

In [6]:
get_models()

{'LogisticRegression': LogisticRegression(),
 'KNeighborsClassifier': KNeighborsClassifier(),
 'DecisionTreeClassifier': DecisionTreeClassifier(),
 'Voting': VotingClassifier(estimators=[('lr', LogisticRegression()),
                              ('knn', KNeighborsClassifier()),
                              ('dtree', DecisionTreeClassifier())])}

The evaluate_model() function below takes a model instance and returns an array of F1 scores from three repeats of stratified 10-fold cross-validation.

In [7]:
# Define a function to evaluate the voting ensemble with cross validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(
        model,
        X,
        y,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )
    return scores

Finally, we compare the voting ensemble with each of the standalone models.

In [8]:
models = get_models()

for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train)
    print(f'{name}: Mean: {np.mean(scores):.3f} - Std: {np.std(scores):.3f}')

LogisticRegression: Mean: 0.884 - Std: 0.030
KNeighborsClassifier: Mean: 0.841 - Std: 0.037
DecisionTreeClassifier: Mean: 0.846 - Std: 0.047
Voting: Mean: 0.875 - Std: 0.038


#### **2. Stacking Ensembles**

**Definition**: A stacking ensemble is a machine learning model that uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms.

Stacking is appropriate when multiple different machine learning models have skill on a dataset, but have skill in different ways. Another way to say this is that the predictions made by the models or the errors in predictions made by the models are uncorrelated or have a low correlation.
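A rough way to check this condition is to measure how strongly the base models' mistakes overlap. The sketch below is one possible diagnostic, not part of the original workflow: it assumes the same synthetic data and base models as earlier and uses out-of-fold predictions from `cross_val_predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)

base_models = {
    'lr': LogisticRegression(),
    'knn': KNeighborsClassifier(),
    'dtree': DecisionTreeClassifier(random_state=0),
}

# For each model: 1 where the out-of-fold prediction is wrong, 0 otherwise
errors = np.array([
    (cross_val_predict(model, X, y, cv=5) != y).astype(int)
    for model in base_models.values()
])

# Pairwise Pearson correlation between the models' error indicators
error_corr = np.corrcoef(errors)
print(np.round(error_corr, 2))
```

Off-diagonal values close to 1 mean the models fail on the same samples, which leaves the meta-model little to exploit; lower values suggest the models are complementary.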

The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.

- **Level-0 Models (Base-Models)**:
Models fit on the training data and whose predictions are compiled. Use a diverse range of models that make different assumptions about the prediction task.

- **Level-1 Model (Meta-Model)**:
Model that learns how to best combine the predictions of the base models. The meta-model is often simple, providing a smooth interpretation of the predictions made by the base models. As such, linear regression or logistic regression are often used as the meta-model.
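The two levels can be sketched by hand. This is a simplified illustration under stated assumptions (synthetic data, three default base models, positive-class probabilities as meta-features), not a reproduction of scikit-learn's `StackingClassifier` internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2020, stratify=y)

level0 = [LogisticRegression(), KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=0)]

# Level-0: out-of-fold positive-class probabilities become the meta-features,
# so the meta-model never sees a prediction made on data the base model
# was trained on.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in level0
])

# Level-1: a simple logistic regression learns how to combine them.
meta_model = LogisticRegression().fit(meta_train, y_tr)

# At prediction time the base models are refit on the full training set,
# and their test-set probabilities are passed to the meta-model.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in level0
])
y_pred = meta_model.predict(meta_test)
```

Using out-of-fold predictions matters: if the meta-model were trained on in-sample predictions, an overfit base model such as an unpruned tree would look perfect and receive too much weight.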

**Notes**
- Stacking is designed to improve modeling performance, although it is not guaranteed to result in an improvement in all cases.
- If a base-model performs as well as or better than the stacking ensemble, the base model should be used instead, given its lower complexity (e.g. it’s simpler to describe, train and maintain).

**Practice:**

First, we can evaluate a suite of different machine learning models on the dataset. Each algorithm will be evaluated using default model hyperparameters. The get_stacking() function below defines the StackingClassifier model by first defining a list of tuples for the three base models, then defining the logistic regression meta-model that combines the predictions from the base models using 5-fold cross-validation.



In [9]:
from sklearn.ensemble import StackingClassifier

# Define a function for stacking ensemble
def get_stacking():
    base=[]
    base.append(('lr', LogisticRegression()))
    base.append(('knn', KNeighborsClassifier()))
    base.append(('dtree', DecisionTreeClassifier()))
    meta = LogisticRegression()

    ensemble = StackingClassifier(estimators=base, final_estimator=meta, cv=5)
    return ensemble

In [10]:
get_stacking()

We can include the stacking ensemble in the list of models to evaluate, along with the standalone models.

In [11]:
# Define a function that returns the standalone models and the stacking ensemble
def get_models():
    models = {}
    models['LogisticRegression'] = LogisticRegression()
    models['KNeighborsClassifier'] = KNeighborsClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    models['Stacking'] = get_stacking()
    return models

As before, the evaluate_model() function takes a model instance and returns an array of F1 scores from three repeats of stratified 10-fold cross-validation.

In [12]:
# Define a function to evaluate the stacking ensemble with cross validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(
        model,
        X,
        y,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )
    return scores

Finally, we compare the stacking ensemble with each of the standalone models.

In [13]:
models = get_models()

for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train)
    print(f'{name}: Mean: {np.mean(scores):.3f} - Std: {np.std(scores):.3f}')

LogisticRegression: Mean: 0.884 - Std: 0.030
KNeighborsClassifier: Mean: 0.841 - Std: 0.037
DecisionTreeClassifier: Mean: 0.848 - Std: 0.048
Stacking: Mean: 0.875 - Std: 0.035


### `Application`


We will build ensemble models to classify white wine as good or poor quality. We will use the F1 score as the metric to evaluate model performance.



`Load Dataset`

In [14]:
# Load the white_wine.csv dataset
url = 'https://raw.githubusercontent.com/harishmuh/machine_learning_practices/refs/heads/main/datasets/white_wine.csv'
data = pd.read_csv(url)
data.sample(5, random_state=0)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
422,7.0,0.21,0.28,8.6,0.045,37.0,221.0,0.9954,3.25,0.54,10.4,6.0
107,7.1,0.23,0.35,16.5,0.04,60.0,171.0,0.999,3.16,0.59,9.1,6.0
253,5.8,0.24,0.44,3.5,0.029,5.0,109.0,0.9913,3.53,0.43,11.7,3.0
235,7.2,0.23,0.38,14.3,0.058,55.0,194.0,0.9979,3.09,0.44,9.0,6.0
311,5.0,0.55,0.14,8.3,0.032,35.0,164.0,0.9968,3.53,0.51,12.5,8.0


`Data Cleaning`

**Duplicated Value**

Duplicated data detection and quantification

In [15]:
print(f'Total number of duplicates: {data.duplicated().sum()}')
print(f'Total duplicate percentage: {data.duplicated().sum()/len(data)*100:.2f}%')

Total number of duplicates: 84
Total duplicate percentage: 16.15%


About 16.15% of the rows are flagged as duplicates, so we keep the first occurrence of each and drop the rest.

Handling duplicates

In [16]:
data.drop_duplicates(keep='first', inplace=True, ignore_index=True)
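How `duplicated()` and `drop_duplicates(keep='first')` treat repeated rows can be seen on a small frame. The values below are hypothetical and only serve to show the behavior.

```python
import pandas as pd

# Hypothetical toy frame: the first row appears three times in total
toy = pd.DataFrame({
    'pH': [3.0, 3.0, 3.3, 3.0],
    'alcohol': [8.8, 8.8, 9.5, 8.8],
})

# duplicated() flags only the later copies, not the first occurrence
n_dup = int(toy.duplicated().sum())

# keep='first' retains one copy of each distinct row
deduped = toy.drop_duplicates(keep='first', ignore_index=True)

print(n_dup, len(deduped))
```

This is why the duplicate count reported above equals the number of rows removed, not the number of rows involved in a duplicate group.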

**Missing Value**

Missing value detection and quantification

In [17]:
data.isna().sum().to_frame('Missing values')

Unnamed: 0,Missing values
fixed acidity,0
volatile acidity,0
citric acid,0
residual sugar,0
chlorides,0
free sulfur dioxide,0
total sulfur dioxide,0
density,0
pH,1
sulphates,1


There are missing values in the pH, sulphates, alcohol and quality columns. We will simply drop the rows that contain them.

In [18]:
data.dropna(inplace=True)

**Change Target**

Change the target to a binary label (good quality wine: 1, poor quality wine: 0). Wine with a quality score above six is labeled good; a score of six or below is labeled poor.

In [19]:
# Change target
data['quality'] = np.where(data['quality']> 6, 1, 0) # Good quality wine : 1, Poor quality wine : 0
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,0
4,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,0
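The boundary behavior of the `> 6` rule is easy to check on a few scores. The values below are hypothetical and chosen to straddle the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores on either side of the threshold
toy = pd.DataFrame({'quality': [3, 5, 6, 7, 8]})

# Same rule as above: strictly greater than 6 is good (1), otherwise poor (0)
toy['label'] = np.where(toy['quality'] > 6, 1, 0)

print(toy)
```

A score of exactly six falls into the poor class because the comparison is strict.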


`Data Splitting`

Split the data into train and test sets with an 80:20 ratio.

In [20]:
# Define features and target
X = data.drop(columns=['quality'])
y = data['quality']

# Separate data into train and test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

### `Model Experiment`

Set up the basic components used in the experiment.

In [21]:
from sklearn.preprocessing import RobustScaler
from imblearn.pipeline import Pipeline

scaler = RobustScaler()

**Benchmark Model**

In [22]:
# Define the models
logreg = LogisticRegression(max_iter=1000)
knn = KNeighborsClassifier()
dtree = DecisionTreeClassifier(max_depth=5, random_state=0)

# Base Models
logreg_pipeline = Pipeline([
    ('scaling', scaler),
    ('modeling', logreg)
])

knn_pipeline = Pipeline([
    ('scaling', scaler),
    ('modeling', knn)
])

dtree_pipeline = Pipeline([
    ('modeling', dtree)
])

# Meta learner
meta_logreg = LogisticRegression(max_iter=1000)

# Voting Classifier (Hard)
voting_clf_hard = VotingClassifier(
    estimators=[
        ('clf1', logreg_pipeline),
        ('clf2', knn_pipeline),
        ('clf3', dtree_pipeline)
    ], voting='hard'
)

# Voting Classifier (Soft)
voting_clf_soft = VotingClassifier(
    estimators=[
        ('clf1', logreg_pipeline),
        ('clf2', knn_pipeline),
        ('clf3', dtree_pipeline)
    ], voting='soft'
)

# Stacking Classifier
stacking_clf = StackingClassifier(
    estimators=[
        ('clf1', logreg_pipeline),
        ('clf2', knn_pipeline),
        ('clf3', dtree_pipeline)
    ],
    final_estimator=meta_logreg
)

In [23]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define models
models = [logreg_pipeline, knn_pipeline, dtree_pipeline,
          voting_clf_hard, voting_clf_soft, stacking_clf]
model_names = ['LogisticRegression', 'KNeigborsClassifier', 'DecisionTree',
               'Voting Classifier (Hard)', 'Voting Classifier (Soft)', 'Stacking Classifier']

# Create list to store evaluation score
f1_mean = []
f1_std = []
all_f1 = []

# Cross Validation
for model in models:

    skfold = StratifiedKFold(n_splits=5)

    model_cv = cross_val_score(
        model,
        X_train,
        y_train,
        cv=skfold,
        scoring='f1',
        n_jobs=-1
    )

    f1_mean.append(model_cv.mean())
    f1_std.append(model_cv.std())
    all_f1.append(model_cv.round(4))

In [24]:
pd.DataFrame({
    'model': model_names,
    'mean f1': f1_mean,
    'std f1': f1_std,
    'all score': all_f1
}).sort_values('mean f1', ascending=False)

Unnamed: 0,model,mean f1,std f1,all score
0,LogisticRegression,0.992,0.016,"[1.0, 0.96, 1.0, 1.0, 1.0]"
5,Stacking Classifier,0.975304,0.020204,"[1.0, 0.96, 0.96, 1.0, 0.9565]"
3,Voting Classifier (Hard),0.923789,0.049178,"[0.96, 0.8333, 0.96, 0.9565, 0.9091]"
4,Voting Classifier (Soft),0.902577,0.089754,"[0.96, 0.7273, 0.96, 0.9565, 0.9091]"
2,DecisionTree,0.85913,0.101001,"[0.96, 0.6957, 0.96, 0.88, 0.8]"
1,KNeigborsClassifier,0.806192,0.094593,"[0.8333, 0.6364, 0.7826, 0.9091, 0.8696]"


**Predict Benchmark Model to Test Set**

We now fit each model on the full training set and measure its final performance on the test set.

In [25]:
from sklearn.metrics import f1_score

# Models
models = [logreg_pipeline, knn_pipeline, dtree_pipeline,
          voting_clf_hard, voting_clf_soft, stacking_clf]

# Create list to store evaluation score
list_f1 = []
dict_pred = {}
dict_proba = {}

# Predict to test set
for model, name in zip(models, model_names):

    # fitting to train set
    model.fit(X_train, y_train)

    # predict to test set
    y_pred = model.predict(X_test)
    dict_pred[name] = y_pred

    if model in [logreg_pipeline, knn_pipeline, dtree_pipeline, voting_clf_soft]:
        y_proba = model.predict_proba(X_test)
        dict_proba[name] = y_proba[:, 1].round(4)

    # evaluate score
    score = f1_score(y_test, y_pred)
    list_f1.append(score)

In [26]:
pd.DataFrame({
    'model': model_names,
    'F1 score': list_f1
}).sort_values('F1 score', ascending=False)

Unnamed: 0,model,F1 score
5,Stacking Classifier,0.967742
2,DecisionTree,0.941176
4,Voting Classifier (Soft),0.903226
0,LogisticRegression,0.896552
3,Voting Classifier (Hard),0.896552
1,KNeigborsClassifier,0.785714


**Illustration of model performance**

`Hard Voting`

In [27]:
pd.DataFrame({
    'Pred Logreg': dict_pred['LogisticRegression'],
    'Pred KNN': dict_pred['KNeigborsClassifier'],
    'Pred Tree': dict_pred['DecisionTree'],
    'Pred Voting Hard': dict_pred['Voting Classifier (Hard)']
}).head(10)

Unnamed: 0,Pred Logreg,Pred KNN,Pred Tree,Pred Voting Hard
0,0,0,0,0
1,0,0,0,0
2,0,0,1,0
3,0,0,0,0
4,1,0,1,1
5,0,0,0,0
6,0,0,0,0
7,0,0,0,0
8,0,0,0,0
9,1,1,1,1


`Soft Voting`

In [28]:
pd.DataFrame({
    'Pred Logreg': dict_pred['LogisticRegression'],
    'Pred KNN': dict_pred['KNeigborsClassifier'],
    'Pred Tree': dict_pred['DecisionTree'],
    'Pred Voting Soft': dict_pred['Voting Classifier (Soft)']
}).head(10)

Unnamed: 0,Pred Logreg,Pred KNN,Pred Tree,Pred Voting Soft
0,0,0,0,0
1,0,0,0,0
2,0,0,1,1
3,0,0,0,0
4,1,0,1,1
5,0,0,0,0
6,0,0,0,0
7,0,0,0,0
8,0,0,0,0
9,1,1,1,1


In [29]:
pd.DataFrame({
    'Proba Logreg': dict_proba['LogisticRegression'],
    'Proba KNN': dict_proba['KNeigborsClassifier'],
    'Proba Tree': dict_proba['DecisionTree'],
    'Proba Voting Soft': dict_proba['Voting Classifier (Soft)']
}).head(10)

Unnamed: 0,Proba Logreg,Proba KNN,Proba Tree,Proba Voting Soft
0,0.0031,0.0,0.0,0.001
1,0.0157,0.0,0.0,0.0052
2,0.4632,0.4,1.0,0.6211
3,0.0,0.0,0.0,0.0
4,0.9434,0.4,1.0,0.7811
5,0.0148,0.0,0.0,0.0049
6,0.0165,0.0,0.0,0.0055
7,0.0,0.0,0.0,0.0
8,0.0021,0.0,0.0,0.0007
9,0.8881,1.0,1.0,0.9627
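The soft-voting column can be checked by hand: with equal weights it should be the mean of the three base-model probabilities. The sketch below copies the ten rows from the table above and recomputes both voting rules. It assumes hard votes can be recovered by thresholding each probability at 0.5, which matches the predicted labels shown earlier for these rows.

```python
import numpy as np

# Positive-class probabilities copied from the table above (rows 0-9)
logreg = np.array([0.0031, 0.0157, 0.4632, 0.0, 0.9434,
                   0.0148, 0.0165, 0.0, 0.0021, 0.8881])
knn = np.array([0.0, 0.0, 0.4, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 1.0])
tree = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
reported_soft = np.array([0.001, 0.0052, 0.6211, 0.0, 0.7811,
                          0.0049, 0.0055, 0.0, 0.0007, 0.9627])

stacked = np.array([logreg, knn, tree])

# Soft voting: unweighted mean of the probabilities, then threshold
soft_proba = stacked.mean(axis=0)
soft_label = (soft_proba >= 0.5).astype(int)

# Hard voting: majority of the individual class votes
hard_label = ((stacked >= 0.5).sum(axis=0) >= 2).astype(int)

# Rows where the two rules disagree
disagree = np.flatnonzero(hard_label != soft_label)
print(np.round(soft_proba, 4), disagree)
```

The recomputed means agree with the reported column up to rounding. The only disagreement is row 2: two models vote for class 0, but the tree's probability of 1.0 lifts the average to about 0.62, so soft voting predicts class 1.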


`Stacking`

In [30]:
# Fitting to train set
logreg_pipeline.fit(X_train, y_train)
knn_pipeline.fit(X_train, y_train)
dtree_pipeline.fit(X_train, y_train)

# Predict to train set
y_train_logreg = logreg_pipeline.predict(X_train)
y_train_knn = knn_pipeline.predict(X_train)
y_train_dtree = dtree_pipeline.predict(X_train)

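**Data to be learned by the meta learner**

This table is a simplified illustration. `StackingClassifier` actually trains the meta learner on cross-validated (out-of-fold) outputs of the base models, and by default it uses `predict_proba` where available rather than hard class labels.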

In [31]:
pd.DataFrame({
    'Pred Logreg': y_train_logreg,
    'Pred KNN': y_train_knn,
    'Pred Tree': y_train_dtree,
    'Y train': y_train
}).head(10)

Unnamed: 0,Pred Logreg,Pred KNN,Pred Tree,Y train
394,0,0,0,0
13,1,1,1,1
309,0,0,0,0
163,0,0,0,0
220,0,0,0,0
254,0,0,0,0
434,0,0,0,0
82,0,0,0,0
69,0,0,0,0
90,0,0,0,0


**Data to be predicted by the meta learner**

In [32]:
pd.DataFrame({
    'Pred Logreg': dict_pred['LogisticRegression'],
    'Pred KNN': dict_pred['KNeigborsClassifier'],
    'Pred Tree': dict_pred['DecisionTree'],
}).head(10)

Unnamed: 0,Pred Logreg,Pred KNN,Pred Tree
0,0,0,0
1,0,0,0
2,0,0,1
3,0,0,0
4,1,0,1
5,0,0,0
6,0,0,0
7,0,0,0
8,0,0,0
9,1,1,1


**Meta learner prediction results**

In [33]:
pd.DataFrame({
    'Pred Logreg': dict_pred['LogisticRegression'],
    'Pred KNN': dict_pred['KNeigborsClassifier'],
    'Pred Tree': dict_pred['DecisionTree'],
    'Pred Stacking': dict_pred['Stacking Classifier']
}).head(10)

Unnamed: 0,Pred Logreg,Pred KNN,Pred Tree,Pred Stacking
0,0,0,0,0
1,0,0,0,0
2,0,0,1,1
3,0,0,0,0
4,1,0,1,1
5,0,0,0,0
6,0,0,0,0
7,0,0,0,0
8,0,0,0,0
9,1,1,1,1
