# A Deep Dive into Stacking Ensemble Machine Learning - Part III

## How to fully understand stacking and use it effectively in machine learning by implementing a stacking model from scratch in Python and Jupyter

![tim-wildsmith-o2fc-C-Uotw-unsplash.jpg](attachment:tim-wildsmith-o2fc-C-Uotw-unsplash.jpg)
Photo by <a href="https://unsplash.com/@timwildsmith?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Tim Wildsmith</a> on <a href="https://unsplash.com/s/photos/stack-of-books?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

### Background
In my two recent articles on implementing ensemble machine learning algorithms using stacking I have explored how stacking works and how to build and understand a stacking algorithm in ``scikit-learn``.

https://towardsdatascience.com/a-deep-dive-into-stacking-ensemble-machine-learning-part-i-10476b2ade3

This final article of the series will take the understanding further by building a stacking algorithm from scratch and by using pipelining to link the Level 0 and Level 1 models.

### Getting Started
As with the other articles in the series we will need to import a set of libraries that are going to be used in the code and create some constants ...

In [1]:
import pandas as pd
import numpy as np
import copy as cp
from icecream import ic

from sklearn.datasets import make_classification

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score
from imblearn.pipeline import Pipeline

from typing import Tuple

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

RANDOM_STATE : int = 42
TARGET_NAME : str = "target"


### Getting Some Data
We will also stick with the same dataset that has been used in the other articles ...

In [2]:
def make_classification_dataframe(n_samples : int = 10000, n_features : int = 25, n_classes : int = 2, n_clusters_per_class : int = 2, feature_name_prefix : str = "feature_", target_name : str = "target", random_state : int = 42) -> pd.DataFrame:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, n_informative = n_classes * n_clusters_per_class, random_state=random_state)

    feature_names = [feature_name_prefix + str(v) for v in np.arange(1, n_features+1)]
    return pd.concat([pd.DataFrame(X, columns=feature_names), pd.DataFrame(y, columns=[target_name])], axis=1)

df_classification = make_classification_dataframe(n_samples=12000, target_name=TARGET_NAME, random_state=RANDOM_STATE)

X = df_classification.drop([TARGET_NAME], axis=1)
y = df_classification[TARGET_NAME]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=2000, random_state=RANDOM_STATE)

X_train.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_16,feature_17,feature_18,feature_19,feature_20,feature_21,feature_22,feature_23,feature_24,feature_25
9334,0.136638,-0.713842,0.157276,0.055179,-1.499486,-0.801896,2.413641,1.05526,-0.360709,0.390722,...,-0.951505,0.919107,0.327989,1.240349,1.916928,0.560735,0.149824,1.345824,1.194973,-0.079852
895,-1.966571,0.152219,0.859824,-2.668824,-1.029491,-1.125497,-2.036671,-0.798295,-0.256613,-0.717745,...,0.977914,0.464718,-0.742434,-1.61832,0.198326,-1.046548,0.523912,0.939122,-0.528114,0.179478
11264,-0.373778,-1.119354,-1.258326,-0.852311,-2.152573,-0.423335,-1.046887,0.50248,-0.140387,-0.276063,...,-1.698537,1.660224,-1.115649,-0.0828,-0.492556,1.075708,0.199834,1.419582,-0.767532,1.047571
7724,-0.755063,-1.821703,-1.222018,-0.566342,-1.165231,2.542285,0.20489,0.127901,1.21173,-2.145323,...,-0.380833,1.201113,1.500583,0.250396,1.149989,3.020883,-1.3429,-0.488408,0.632942,-0.5831
765,0.975995,-0.876509,0.98074,0.997508,-0.926742,5.945441,0.244172,-0.699326,2.134937,0.793963,...,-1.506136,-1.196388,3.100057,0.917948,0.522144,2.914381,-2.820182,0.18517,0.630581,-0.837476


### A Quick Reminder from Part II
If you would like all of the detail please have a look back to Part II (or back to Part I for a guide to the principles before any coding takes place).

In summary ``scikit-learn`` was used in Part II to implement the part of stacking where Level 0 and Level 1 models are combined to produce a classifier that improves performance by adding classification predictions as engineered features -

![stacking_step_2.png](attachment:stacking_step_2.png)

In [3]:
level_0_classifiers = dict()
level_0_classifiers["logreg"] = LogisticRegression(random_state=RANDOM_STATE)
level_0_classifiers["forest"] = RandomForestClassifier(random_state=RANDOM_STATE)
level_0_classifiers["xgboost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_STATE)
level_0_classifiers["xtrees"] = ExtraTreesClassifier(random_state=RANDOM_STATE)

level_1_classifier = ExtraTreesClassifier(random_state=RANDOM_STATE)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
stacking_model = StackingClassifier(estimators=list(level_0_classifiers.items()), final_estimator=level_1_classifier, passthrough=True, cv=kfold, stack_method="predict_proba")

level_0_columns = [f"{name}_prediction" for name in level_0_classifiers.keys()]
pd.DataFrame(stacking_model.fit_transform(X_train, y_train), columns=level_0_columns + list(X_train.columns))

Unnamed: 0,logreg_prediction,forest_prediction,xgboost_prediction,xtrees_prediction,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,...,feature_16,feature_17,feature_18,feature_19,feature_20,feature_21,feature_22,feature_23,feature_24,feature_25
0,0.746721,0.91,0.908653,1.0,0.136638,-0.713842,0.157276,0.055179,-1.499486,-0.801896,...,-0.951505,0.919107,0.327989,1.240349,1.916928,0.560735,0.149824,1.345824,1.194973,-0.079852
1,0.301752,0.07,0.058239,0.0,-1.966571,0.152219,0.859824,-2.668824,-1.029491,-1.125497,...,0.977914,0.464718,-0.742434,-1.618320,0.198326,-1.046548,0.523912,0.939122,-0.528114,0.179478
2,0.786726,0.89,0.963420,1.0,-0.373778,-1.119354,-1.258326,-0.852311,-2.152573,-0.423335,...,-1.698537,1.660224,-1.115649,-0.082800,-0.492556,1.075708,0.199834,1.419582,-0.767532,1.047571
3,0.681193,0.96,0.979050,1.0,-0.755063,-1.821703,-1.222018,-0.566342,-1.165231,2.542285,...,-0.380833,1.201113,1.500583,0.250396,1.149989,3.020883,-1.342900,-0.488408,0.632942,-0.583100
4,0.206296,0.03,0.008406,0.0,0.975995,-0.876509,0.980740,0.997508,-0.926742,5.945441,...,-1.506136,-1.196388,3.100057,0.917948,0.522144,2.914381,-2.820182,0.185170,0.630581,-0.837476
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.805857,0.91,0.972783,1.0,-0.110827,-1.415858,0.770497,-0.934064,0.392124,-0.365793,...,-0.841045,1.081216,0.603585,-0.391405,0.062470,1.166114,0.599579,-0.116912,-0.228063,-0.725702
9996,0.808834,1.00,0.994323,1.0,1.015395,0.206763,0.251755,0.512048,0.105519,-0.597608,...,-0.418663,0.237959,0.200852,-0.286942,0.459424,-1.125065,1.595078,-1.334188,-0.977567,-0.034062
9997,0.595495,0.93,0.984356,1.0,0.053380,2.320757,-0.843711,2.480900,0.470590,0.787763,...,-0.581459,-0.674763,-0.203324,-1.646048,-1.236306,-1.222219,-1.697821,1.534112,1.330090,0.602752
9998,0.655721,0.89,0.872995,1.0,0.114489,-0.611051,0.800832,0.411475,1.681659,0.123260,...,-1.325198,-1.955506,-0.665176,-0.851898,-0.283757,1.030944,-0.749034,-0.589223,1.008358,-0.312456


The next stage was to use the trained stacking model to generate a set of predictions ...

![stacking_step_4.png](attachment:stacking_step_4.png)

In [4]:
y_val_pred_sklearn = stacking_model.predict(X_val)
y_val_pred_sklearn

array([0, 1, 0, ..., 0, 0, 0])

### Building A Stacking Classifier from Scratch

The design for the scratch-built classifier is as follows -
1. Build a set of Level 0 models using the ``Transformer`` pattern.
2. Implement the Level 1 model as a simple classifier.
3. Link the two together using a pipeline.

#### 1. Building the Level 0 and Level 1 Models Using Object Orientation in Python
The Level 0 model is built using the ``Transformer`` pattern and the Level 1 model using the ``Estimator`` pattern as follows -

In [5]:
def copy_data(data_in):
    data_out = cp.deepcopy(data_in)
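    try:
        data_out.reset_index(drop=True, inplace=True)
    except AttributeError:
        # data_in is not a pandas object (e.g. a NumPy array), so there is no index to reset
        pass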
    return data_out

class Level0Stacker(BaseEstimator, TransformerMixin):
    
    def __init__(self, level_0_classifiers : dict, stack_method : str = "predict_proba", passthrough : bool = False, save_x : bool=False): # no *args or **kargs
        ic("Level0Stacker.init")
        self.level_0_classifiers = level_0_classifiers
        self.stack_method = stack_method
        self.passthrough = passthrough
        self.save_x = save_x

        self.X = None

    def fit(self, X, y=None):
        ic("Level0Stacker.fit")
        X_copy = copy_data(X) 

        for classifier in self.level_0_classifiers.values():
            classifier.fit(X_copy, y)

        return self

    def transform(self, X):
        ic("Level0Stacker.transform")
        X_copy = copy_data(X) 

        all_predictions = [None] * len(self.level_0_classifiers)

        for i, classifier in enumerate(self.level_0_classifiers.values()):
            if self.stack_method == "predict_proba":
                all_predictions[i] = classifier.predict_proba(X_copy)[:, 1]
            else:
                all_predictions[i] = classifier.predict(X_copy)

        df_stacking = pd.DataFrame(np.array(all_predictions).T, columns=[f"{name}_prediction" for name in self.level_0_classifiers.keys()])

        X_copy = pd.concat([df_stacking, X_copy], axis=1) if self.passthrough else df_stacking

        self.X = copy_data(X_copy) if self.save_x else None

        return X_copy
    
class Level1Stacker(BaseEstimator, ClassifierMixin):

    def __init__(self, model):
        ic("Level1Stacker.init")
        self.model = model

    def fit(self, X, y):
        ic("Level1Stacker.fit")
        self.model.fit(X, y)
        return self

    def predict(self, X):
        ic("Level1Stacker.predict")
        return self.model.predict(X)
    
    def predict_proba(self, X):
        ic("Level1Stacker.predict_proba")
        return self.model.predict_proba(X)
    
    @property
    def classes_(self):
        return self.model.classes_   

#### 2. Training the Stacking Model using a Pipeline

In [6]:
level_0 = Level0Stacker(cp.deepcopy(level_0_classifiers), passthrough=True, save_x=True)
level_1 = Level1Stacker(ExtraTreesClassifier(random_state=RANDOM_STATE))

scratch_stacking_model = Pipeline([
                                   ('level_0', level_0), 
                                   ('level_1', level_1) 
                                  ])

scratch_stacking_model.fit(X_train, y_train)
level_0.X

ic| 'Level0Stacker.init'
ic| 'Level1Stacker.init'
ic| 'Level0Stacker.fit'
ic| 'Level0Stacker.transform'
ic| 'Level1Stacker.fit'


Unnamed: 0,logreg_prediction,forest_prediction,xgboost_prediction,xtrees_prediction,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,...,feature_16,feature_17,feature_18,feature_19,feature_20,feature_21,feature_22,feature_23,feature_24,feature_25
0,0.746721,0.91,0.908653,1.0,0.136638,-0.713842,0.157276,0.055179,-1.499486,-0.801896,...,-0.951505,0.919107,0.327989,1.240349,1.916928,0.560735,0.149824,1.345824,1.194973,-0.079852
1,0.301752,0.07,0.058239,0.0,-1.966571,0.152219,0.859824,-2.668824,-1.029491,-1.125497,...,0.977914,0.464718,-0.742434,-1.618320,0.198326,-1.046548,0.523912,0.939122,-0.528114,0.179478
2,0.786726,0.89,0.963420,1.0,-0.373778,-1.119354,-1.258326,-0.852311,-2.152573,-0.423335,...,-1.698537,1.660224,-1.115649,-0.082800,-0.492556,1.075708,0.199834,1.419582,-0.767532,1.047571
3,0.681193,0.96,0.979050,1.0,-0.755063,-1.821703,-1.222018,-0.566342,-1.165231,2.542285,...,-0.380833,1.201113,1.500583,0.250396,1.149989,3.020883,-1.342900,-0.488408,0.632942,-0.583100
4,0.206296,0.03,0.008406,0.0,0.975995,-0.876509,0.980740,0.997508,-0.926742,5.945441,...,-1.506136,-1.196388,3.100057,0.917948,0.522144,2.914381,-2.820182,0.185170,0.630581,-0.837476
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.805857,0.91,0.972783,1.0,-0.110827,-1.415858,0.770497,-0.934064,0.392124,-0.365793,...,-0.841045,1.081216,0.603585,-0.391405,0.062470,1.166114,0.599579,-0.116912,-0.228063,-0.725702
9996,0.808834,1.00,0.994323,1.0,1.015395,0.206763,0.251755,0.512048,0.105519,-0.597608,...,-0.418663,0.237959,0.200852,-0.286942,0.459424,-1.125065,1.595078,-1.334188,-0.977567,-0.034062
9997,0.595495,0.93,0.984356,1.0,0.053380,2.320757,-0.843711,2.480900,0.470590,0.787763,...,-0.581459,-0.674763,-0.203324,-1.646048,-1.236306,-1.222219,-1.697821,1.534112,1.330090,0.602752
9998,0.655721,0.89,0.872995,1.0,0.114489,-0.611051,0.800832,0.411475,1.681659,0.123260,...,-1.325198,-1.955506,-0.665176,-0.851898,-0.283757,1.030944,-0.749034,-0.589223,1.008358,-0.312456


The first thing to note is that the scratch-built stacking model is producing identical output to ``scikit-learn`` at the training stage, which is a good validation of the implementation.

The second thing to note is what the ``icecream`` debugging output is telling us about how the pipeline works (note: I could have used ``print`` statements but ``icecream`` still works when the code is moved into an external library where ``print`` only works locally) -

- A pipeline must consist of one or more ``Transformer`` objects followed by exactly one ``Estimator`` object.
- The ``init`` method is called on every object before anything else happens.
- When the ``fit()`` method is called on the pipeline ...
    - For every ``Transformer`` the ``fit()`` method is called followed by the ``transform()`` method. This is the correct sequence of methods to call for the training data.
    - For the final object in the pipeline (the ``Estimator``) just the ``fit()`` method is called.
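
The call sequence described in the list above can be reproduced without ``icecream`` using a minimal sketch. The ``LoggingTransformer`` and ``LoggingClassifier`` classes and the tiny dataset are hypothetical stand-ins invented for this illustration, and the sketch uses the ``scikit-learn`` ``Pipeline`` rather than the ``imblearn`` one imported earlier, on the assumption that both behave the same way for this purpose.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.pipeline import Pipeline

calls = []  # records the order in which the pipeline invokes each method


class LoggingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that only logs its calls."""

    def fit(self, X, y=None):
        calls.append("transformer.fit")
        return self

    def transform(self, X):
        calls.append("transformer.transform")
        return X


class LoggingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical classifier that always predicts the first class it saw."""

    def fit(self, X, y):
        calls.append("classifier.fit")
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        calls.append("classifier.predict")
        return np.full(len(X), self.classes_[0])


X_demo = np.arange(12, dtype=float).reshape(6, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1])

pipeline = Pipeline([("step_1", LoggingTransformer()), ("step_2", LoggingClassifier())])

pipeline.fit(X_demo, y_demo)
fit_calls = list(calls)      # methods called while fitting

calls.clear()
pipeline.predict(X_demo)
predict_calls = list(calls)  # methods called while predicting

print(fit_calls)
print(predict_calls)
```

Run as written, this should print ``['transformer.fit', 'transformer.transform', 'classifier.fit']`` followed by ``['transformer.transform', 'classifier.predict']``, which mirrors the ``icecream`` output shown in this article.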
    
The reason I built the ``Level1Stacker`` class rather than directly adding an ``Estimator`` was just so I could add the debug statements to ``icecream`` to demonstrate exactly what methods are being called.

The ``Level0Stacker`` class is not very complicated. The ``fit()`` method is simply fitting each classifier in the Level 0 model to the whole of the training data. I have reviewed some code samples where out-of-fold predictions are used but ``scikit-learn`` fits its Level 0 estimators on the whole of X so that is good enough for me.

It is also worth considering what is going on with the ``copy_data`` helper function. I have found that when ``DataFrame`` objects are passed through a pipeline that pipeline will crash unless each step is strictly working on a deep copy of the data that has also had the index reset.
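
One plausible contributor to this behaviour (an assumption on my part, not something verified against the pipeline internals) is index alignment in ``pd.concat``. The minimal sketch below uses made-up values and shows what happens when a ``DataFrame`` of predictions with a fresh index is concatenated with features that still carry the shuffled index left behind by ``train_test_split``.

```python
import pandas as pd

# Features whose index is no longer 0..n-1, as happens after train_test_split
X_shuffled = pd.DataFrame({"feature_1": [0.1, 0.2, 0.3]}, index=[9334, 895, 11264])

# Predictions built from a NumPy array get a fresh 0..n-1 index
df_predictions = pd.DataFrame({"logreg_prediction": [0.7, 0.3, 0.8]})

# Without resetting the index, concat aligns on index labels, not on row position
misaligned = pd.concat([df_predictions, X_shuffled], axis=1)

# After resetting the index, the rows line up one-to-one
aligned = pd.concat([df_predictions, X_shuffled.reset_index(drop=True)], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))  # (6, 2) 6
print(aligned.shape, int(aligned.isna().sum().sum()))        # (3, 2) 0
```

The misaligned frame has twice as many rows as the input and is half filled with ``NaN`` values, which most classifiers will refuse to accept, so resetting the index before concatenating is a sensible precaution.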

The ``transform()`` method is simply iterating around the Level 0 models calling either ``predict`` or ``predict_proba`` depending on how the ``stack_method`` parameter is set and then adding the predictions as new features to the data.
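
To make the difference between the two stacking methods concrete, here is a minimal sketch on a small synthetic dataset (the data and variable names are illustrative assumptions, not taken from the article's notebook). It shows what a single Level 0 model contributes as a new feature in each case.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# stack_method="predict": one hard class label per row
hard_labels = clf.predict(X_demo)

# stack_method="predict_proba": probability of the positive class per row
all_proba = clf.predict_proba(X_demo)  # shape (200, 2), one column per class
positive_proba = all_proba[:, 1]       # shape (200,)

# For a binary problem the two columns sum to 1, so keeping only
# column 1 loses no information.
print(hard_labels[:5], positive_proba[:5].round(3))
```

Probabilities carry more information than hard labels because they tell the Level 1 model how confident each Level 0 model was, which is presumably why ``predict_proba`` is the default in ``Level0Stacker``.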

It is very important to fit the Level 0 models in the ``fit()`` method and then to make predictions in the ``transform()`` method as we will see next ...

#### 3. Making Predictions on the Test Data

In [7]:
y_val_pred_scratch = scratch_stacking_model.predict(X_val)
y_val_pred_scratch

ic| 'Level0Stacker.transform'
ic| 'Level1Stacker.predict'


array([1, 1, 0, ..., 0, 0, 0])

Once again the ``icecream`` debugging enables us to see exactly what is going on. Calling the ``predict()`` method on the pipeline calls the ``transform()`` method of every ``Transformer`` in turn followed by a call to the ``Estimator`` object ``predict()`` method.
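
The same sequence can be written out by hand. In the minimal sketch below a ``StandardScaler`` and a ``LogisticRegression`` stand in for the Level 0 and Level 1 steps (an assumption made purely to keep the example short), and the manual transform-then-predict route is compared with calling ``predict()`` on the pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
demo_pipeline.fit(X_demo, y_demo)

# What pipeline.predict does, written out step by step:
# transform with every fitted Transformer, then predict with the final Estimator
X_transformed = demo_pipeline.named_steps["scale"].transform(X_demo)
manual_predictions = demo_pipeline.named_steps["clf"].predict(X_transformed)

pipeline_predictions = demo_pipeline.predict(X_demo)
print(np.array_equal(manual_predictions, pipeline_predictions))  # True
```

Note that ``fit()`` is never called again at prediction time, which is why the Level 0 models must be trained in ``fit()`` and only used in ``transform()``.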

### Has It Worked?
Well, we have succeeded in building a stacking model from scratch. The step that trains the model and generates the Level 0 predictions as new data features certainly works because the output of the scratch-built model is identical to ``scikit-learn``.

However, the final predictions are not the same ...

In [8]:
print(f"Accuracy of scikit-learn stacking classifier: {accuracy_score(y_val, y_val_pred_sklearn)}")
print(f"Accuracy of scratch built stacking classifier: {accuracy_score(y_val, y_val_pred_scratch)}")

Accuracy of scikit-learn stacking classifier: 0.8825
Accuracy of scratch built stacking classifier: 0.8735


The documentation of the ``scikit-learn`` ``StackingClassifier`` states that - 

"``estimators_`` are fitted on the full X while ``final_estimator_`` is trained using cross-validated predictions of the base estimators using ``cross_val_predict``"

However, when I attempt to replicate that in the scratch-built stacker the accuracy is much lower than either the ``scikit-learn`` stacking model or the scratch-built model without cross-fold validation in the final step -

In [9]:
y_val_pred_scratch_kfold = cross_val_predict(level_1, X_val, y_val, cv=kfold, method="predict")
print(f"Accuracy of scratch built stacking classifier using level 1 cross-validation: {accuracy_score(y_val, y_val_pred_scratch_kfold)}")

ic| 'Level1Stacker.init'
ic| 'Level1Stacker.fit'
ic| 'Level1Stacker.predict'
ic| 'Level1Stacker.init'
ic| 'Level1Stacker.fit'
ic| 'Level1Stacker.predict'
ic| 'Level1Stacker.init'
ic| 'Level1Stacker.fit'
ic| 'Level1Stacker.predict'
ic| 'Level1Stacker.init'
ic| 'Level1Stacker.fit'
ic| 'Level1Stacker.predict'
ic| 'Level1Stacker.init'
ic| 'Level1Stacker.fit'
ic| 'Level1Stacker.predict'


Accuracy of scratch built stacking classifier using level 1 cross-validation: 0.8195


That is a bit unsatisfying, but I cannot be sure how the Level 1 part of the ``scikit-learn`` stacking model is implemented without studying the library's source code or talking to one of the developers. In lieu of doing that, I am satisfied that the investigation and research has achieved its objectives.
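
For readers who want to experiment further, the sketch below shows one possible reading of the documentation quoted earlier. It is an interpretation, not a verified reproduction of the library internals. Unlike the cell above, ``cross_val_predict`` is applied to the Level 0 models on the training data, and the resulting out-of-fold predictions become the features on which the Level 1 model is trained. The dataset, model choices and variable names are all assumptions made for this illustration, and the sketch makes no claim to reproduce the accuracy figures reported in this article.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

base_models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 1. Out-of-fold predictions on the TRAINING data become the Level 1 training features
oof_features = np.column_stack([
    cross_val_predict(clone(model), X_tr, y_tr, cv=cv, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# 2. The Level 1 model is trained on those features plus the original ones (passthrough)
final_model = ExtraTreesClassifier(n_estimators=50, random_state=0)
final_model.fit(np.hstack([oof_features, X_tr]), y_tr)

# 3. The Level 0 models are refitted on all of the training data for prediction time
fitted_models = [clone(model).fit(X_tr, y_tr) for model in base_models.values()]

# 4. Prediction: Level 0 features from the refitted models feed the Level 1 model
test_features = np.column_stack([m.predict_proba(X_te)[:, 1] for m in fitted_models])
y_pred = final_model.predict(np.hstack([test_features, X_te]))

print(f"Accuracy of out-of-fold sketch: {accuracy_score(y_te, y_pred):.4f}")
```

The design point is that the Level 1 model never sees a Level 0 prediction for a row that the Level 0 model was trained on, which guards against the Level 1 model learning to over-trust predictions that are artificially good on the training data.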

### Conclusion
Part I of this set of articles set out to provide an easy-to-follow explanation of what stacking is and how it works. Part II expanded on this by providing a fully working example using the ``scikit-learn`` library and a more detailed explanation. Part III completed the exploration by building a complete stacking model from scratch using Python object orientation and the ``Transformer`` and ``Estimator`` coding patterns.

Building a stacking model from scratch has turned out to be an exercise in completing my understanding of how stacking works. I was able to replicate the training phase of the ``scikit-learn`` model exactly but I could not quite replicate the way the library works for the final Level 1 predictions. In spite of this the research has equipped me with the knowledge of how to use stacking, when to use it and also when not to use it.

In future I will be using the ``scikit-learn`` implementation of stacking whenever I judge that the increased complexity of multiple models stacked together and the associated difficulties in explaining how the final model arrives at its predictions is offset by a bigger need to drive higher accuracy and improved performance.

I hope this series of articles has helped other data scientists to fully understand this effective and fascinating technique and to remove some of the mystery surrounding exactly what is going on inside stacking and how it works.

### Thank you for reading!
If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/? Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to data science and data analytics.

If you would like to get in touch to discuss any of these topics please look me up on LinkedIn — https://www.linkedin.com/in/grahamharrison1 or feel free to e-mail me at GHarrison@lincolncollege.ac.uk.

If you would like to support the author and 1000’s of others who contribute to article writing world-wide by subscribing, please use the following link (note: the author will receive a proportion of the fees if you sign up using this link at no extra cost to you).

https://grahamharrison-86487.medium.com/membership