# Bagging

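In this notebook I will cover bootstrap aggregating, or bagging: fitting many copies of a model to bootstrap samples of the training data and combining their predictions. I will use the breast cancer dataset from Articles II & III to test the performance of our ensemble algorithm.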
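As a rough illustration of the bootstrap step, here is a minimal sketch that draws one bootstrap sample of row indexes and finds the rows that were never drawn (the out-of-bag set). The toy size `n_rows` and the seed are assumptions made for the example, not values used elsewhere in this notebook. For large samples the expected share of distinct rows in a bootstrap sample approaches 1 - 1/e, or about 63%.

```python
import numpy as np

# hypothetical toy setup: 1000 row indexes and a fixed seed for repeatability
rng = np.random.default_rng(0)
n_rows = 1000
idx = np.arange(n_rows)

# draw one bootstrap sample: same size as the data, sampled with replacement
boot_idx = rng.choice(idx, size=n_rows, replace=True)

# rows that were never drawn form the out-of-bag set
oob_idx = np.setdiff1d(idx, boot_idx)

# share of distinct rows that made it into the bootstrap sample
unique_frac = np.unique(boot_idx).size / n_rows
print("unique fraction: %.3f" % unique_frac)
print("out-of-bag rows: %d" % oob_idx.size)
```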

In [31]:
## imports ##
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score

Let's build a class to contain our ensemble. I'm going to make use of the make_bootstraps function I built in Article IV:

In [74]:
## make an ensemble classifier based on decision trees ##
class BaggedTreeClassifier(object):
    #initializer
    def __init__(self,n_elements=100):
        self.n_elements = n_elements
        self.models     = []
    
    #destructor
    def __del__(self):
        del self.n_elements
        del self.models
        
    #private function to make bootstrap samples
    def __make_bootstraps(self,data):
        #initialize output dictionary & unique value count
        dc   = {}
        unip = 0
        #get sample size
        b_size = data.shape[0]
        #get list of row indexes
        idx = list(range(b_size))
        #loop through the required number of bootstraps
        for b in range(self.n_elements):
            #obtain bootstrap samples with replacement
            sidx   = np.random.choice(idx,replace=True,size=b_size)
            b_samp = data[sidx,:]
            #compute number of unique values contained in the bootstrap sample
            unip  += len(set(sidx))
            #obtain out-of-bag samples for the current b
            oidx   = list(set(idx) - set(sidx))
            o_samp = np.array([])
            if oidx:
                o_samp = data[oidx,:]
            #store results
            dc['boot_'+str(b)] = {'boot':b_samp,'test':o_samp}
        #return the bootstrap results
        return(dc)
        
    #train the ensemble
    def train(self,X_train,y_train):
        #package the input data
        training_data = np.concatenate((X_train,y_train.reshape(-1,1)),axis=1)
        #make bootstrap samples
        dcBoot = self.__make_bootstraps(training_data)
        #iterate through each bootstrap sample & fit a model
        cls = DecisionTreeClassifier(class_weight='balanced')
        for b in dcBoot:
            #make a clone of the model
            model = clone(cls)
            #fit a decision tree classifier to the current sample
            model.fit(dcBoot[b]['boot'][:,:-1],dcBoot[b]['boot'][:,-1])
            #append the fitted model
            self.models.append(model)
            
    #predict from the ensemble
    def predict(self,X):
        #check we've fit the ensemble
        if not self.models:
            print('You must train the ensemble before making predictions!')
            return(None)
        #loop through each fitted model
        predictions = []
        for m in self.models:
            #make predictions on the input X
            yp = m.predict(X)
            #append predictions to storage list
            predictions.append(yp.reshape(-1,1))
        #compute the ensemble prediction
        ypred = np.round(np.mean(np.concatenate(predictions,axis=1),axis=1)).astype(int)
        #return the prediction
        return(ypred)
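The aggregation in `predict` amounts to a majority vote for 0/1 labels: averaging the votes and rounding gives the class chosen by most models. Below is a small sketch of that step in isolation. The four prediction vectors are made up for illustration.

```python
import numpy as np

# hypothetical predictions from 4 models on the same 3 samples
preds = [np.array([1, 0, 1]),
         np.array([1, 0, 0]),
         np.array([0, 0, 1]),
         np.array([1, 1, 0])]

# stack as columns: one row per sample, one column per model
stacked = np.concatenate([p.reshape(-1, 1) for p in preds], axis=1)

# mean vote per sample, rounded to the nearest class label
mean_vote = stacked.mean(axis=1)
ensemble = np.round(mean_vote).astype(int)
print(mean_vote)
print(ensemble)
```

Note that `np.round` rounds exact halves to the nearest even value, so a tied vote (a mean of 0.5) resolves to class 0. Choosing an odd number of models avoids ties altogether.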

### Load Dataset

Here I'll load the breast cancer dataset, which I already analysed in Articles II & III, and do a train-test split. The classes are unbalanced, so the split is stratified on the labels and the decision trees in the ensemble are fitted with class_weight='balanced'.
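To give a sense of the imbalance, the sketch below reproduces the weights implied by `class_weight='balanced'`, which the scikit-learn documentation describes as `n_samples / (n_classes * np.bincount(y))`. To keep the example self-contained I rebuild the label vector from the dataset's class counts (212 malignant, 357 benign) rather than loading the data, and the weights are computed on the full dataset rather than on a training split.

```python
import numpy as np

# class counts of the scikit-learn breast cancer data:
# 212 malignant (label 0) and 357 benign (label 1)
y_all = np.repeat([0, 1], [212, 357])

# number of samples in each class
counts = np.bincount(y_all)

# 'balanced' heuristic: n_samples / (n_classes * class count)
weights = y_all.size / (counts.size * counts)

for label, weight in enumerate(weights):
    print("class %d weight: %.3f" % (label, weight))
```

The minority class receives the larger weight, so errors on malignant cases count for more when each tree is grown.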

In [3]:
## load classification dataset ##
data = load_breast_cancer()
X    = data.data
y    = data.target

In [5]:
## do train-test split ##
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,stratify=y)

### Build an Ensemble & Produce Predictions

Let's take our data and train an ensemble classifier based on decision trees. I will then generate predictions on the test set to check performance.

In [75]:
## declare an ensemble instance with default parameters ##
ens = BaggedTreeClassifier()

In [76]:
## train the ensemble ##
ens.train(X_train,y_train)

In [77]:
## make predictions on the test set ##
ypred = ens.predict(X_test)

In [81]:
## evaluate model performance ##
print("accuracy: %.2f" % accuracy_score(y_test,ypred))
print("precision: %.2f" % precision_score(y_test,ypred))
print("recall: %.2f" % recall_score(y_test,ypred))

accuracy: 0.96
precision: 0.99
recall: 0.94
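For reference, the three metrics can be computed by hand from the confusion counts. The sketch below uses made-up labels and predictions, not the test set above. It treats label 1 as the positive class, which matches the scikit-learn default `pos_label=1`; in this dataset label 1 is the benign class.

```python
import numpy as np

# hypothetical labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_hat  = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# confusion counts with label 1 as the positive class
tp = np.sum((y_true == 1) & (y_hat == 1))
fp = np.sum((y_true == 0) & (y_hat == 1))
fn = np.sum((y_true == 1) & (y_hat == 0))
tn = np.sum((y_true == 0) & (y_hat == 0))

# accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
# precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)
# recall: share of true positives that were found
recall = tp / (tp + fn)

print("accuracy: %.2f" % accuracy)
print("precision: %.2f" % precision)
print("recall: %.2f" % recall)
```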


These results look good: they represent an improvement over the lone decision tree classifier (see the results from Article III), especially in terms of the accuracy of the model predictions. Note that no hyperparameter tuning was done, so further improvements to our ensemble are possible.

Let's see how our custom-built ensemble compares to the one available through scikit-learn:

In [82]:
## import the scikit-learn model ##
from sklearn.ensemble import BaggingClassifier

In [83]:
## declare a bagging classifier instance ##
ens = BaggingClassifier(n_estimators=100)

In [84]:
## train the ensemble ##
ens.fit(X_train,y_train)

BaggingClassifier(n_estimators=100)

In [85]:
## make predictions on the test set ##
ypred = ens.predict(X_test)

In [86]:
## evaluate model performance ##
print("accuracy: %.2f" % accuracy_score(y_test,ypred))
print("precision: %.2f" % precision_score(y_test,ypred))
print("recall: %.2f" % recall_score(y_test,ypred))

accuracy: 0.96
precision: 0.99
recall: 0.94


The results here agree with those obtained with our custom ensemble classifier. As with the custom classifier, no hyperparameter tuning was done here. Optimising the hyperparameters of the base decision tree classifier could improve the results further.
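One possible way to start on that tuning is sketched below: a small cross-validated comparison of tree depths inside a scikit-learn bagging ensemble. The candidate depths, the 20 estimators, the 3 folds and the seeds are all assumptions chosen to keep the example quick. In the notebook itself the search would be run on `X_train` and `y_train` only, so that the test set stays untouched.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# load the data here only to keep the sketch self-contained
X_tune, y_tune = load_breast_cancer(return_X_y=True)

# hypothetical candidate depths; None lets each tree grow fully
depths = [2, 4, None]
scores = {}
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, class_weight='balanced', random_state=0)
    # the base estimator is passed positionally because its keyword
    # name differs between scikit-learn versions
    bag = BaggingClassifier(tree, n_estimators=20, random_state=0)
    # mean accuracy over 3 cross-validation folds
    scores[d] = cross_val_score(bag, X_tune, y_tune, cv=3, scoring='accuracy').mean()

# pick the depth with the highest mean cross-validated accuracy
best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth)
```

A manual loop is used instead of a grid search so that the sketch does not depend on the nested parameter name of the base estimator.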