In [1]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

For this exercise we use the Iris dataset, which ships with the *scikit-learn* package. The first step is to load the data with the `load_iris()` function and put it into a DataFrame.

In [2]:
iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['class'])
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


Bagging (bootstrap aggregating) is an ensemble technique that combines the predictions of multiple models, which usually gives more accurate predictions than any single model. It is mainly used to reduce the variance of high-variance algorithms such as decision trees. Here we implement our own bagging algorithm. The key idea is to draw a random sample of the training data with replacement, train the given model on it, and predict class probabilities for the test data. The process is repeated a given number of times and the predicted probabilities are averaged.
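
To make the sampling step concrete, here is a minimal sketch of what a single draw with replacement looks like. The row count and the seed are arbitrary values chosen for illustration, not taken from the notebook. When the sample is as large as the original data, roughly 63% of the original rows appear in it at least once, and the rest are duplicates.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n_rows = 10_000                 # hypothetical training-set size

# One bootstrap sample: n_rows row indices drawn uniformly with replacement.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Fraction of distinct original rows that made it into the sample.
unique_fraction = np.unique(bootstrap_idx).size / n_rows
print(f"distinct rows in the sample: {unique_fraction:.3f}")
print(f"theoretical limit 1 - 1/e:   {1 - np.exp(-1):.3f}")
```

Because every draw leaves out a different part of the data, each fitted model sees a slightly different training set, which is what makes averaging their predictions worthwhile.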

In [3]:
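def own_bagging(clf, n_estimators, X_train, y_train, X_test):
    # Keep features and target together so that each row is resampled as a unit.
    data = X_train.join(y_train)
    agg_probabilities = []

    for _ in range(n_estimators):
        # Bootstrap sample: same size as the training set, drawn with replacement.
        # A much smaller sample could miss a class, which would change the
        # number of columns returned by predict_proba.
        subsample = data.sample(n=data.shape[0], replace=True)
        # Fit on the subsample, not on the full training set.
        clf.fit(subsample.iloc[:, :-1], subsample.iloc[:, -1])
        agg_probabilities.append(clf.predict_proba(X_test))

    # Average the class probabilities over all fitted models.
    return np.mean(agg_probabilities, axis=0)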
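
For comparison, scikit-learn ships its own bagging implementation, `BaggingClassifier`. The sketch below is not part of the original exercise. It reloads the data under separate names (`X_demo`, `y_demo` and so on, all chosen here for illustration) so that it does not interfere with the variables used elsewhere, and the seeds are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagged_tree = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20, random_state=0)
bagged_tree.fit(X_fit, y_fit)

# Averaged class probabilities, one row per evaluation sample.
sk_probabilities = bagged_tree.predict_proba(X_eval)
print(sk_probabilities.shape)
```

The returned array plays the same role as the output of `own_bagging`: one row of averaged class probabilities per test sample.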

Before evaluating the implementation, we create two helper functions. The first picks the class with the highest probability for each sample and returns a list of predictions.

In [4]:
def predict_class(agg_probabilities):
    predictions = [np.argmax(probabilities) for probabilities in agg_probabilities]
    return predictions
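
As a quick check of what this helper does, the same result can be obtained in one vectorized call. The probability values below are made up for illustration.

```python
import numpy as np

# Hypothetical averaged probabilities: three test samples, three classes.
example_probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# argmax along axis=1 picks the most probable class in each row.
example_predictions = np.argmax(example_probabilities, axis=1)
print(example_predictions)  # [0 2 1]
```

The column index matches the class label here only because the Iris classes are encoded as 0, 1 and 2. With other labels, the index would have to be mapped through the fitted classifier's `classes_` attribute.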

The second function simply counts the false predictions.

In [5]:
def false_predictions(y_test, y_pred):
    # Count the positions where the predicted class differs from the true class.
    falses = 0
    for pred_value, test_value in zip(y_pred, y_test):
        if pred_value != test_value:
            falses += 1
    return falses
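
The same count can be written as an element-wise comparison, which also makes it easy to turn the count into an error rate. The label arrays below are invented for the example.

```python
import numpy as np

# Hypothetical true and predicted labels for five samples.
y_true_example = np.array([0, 1, 2, 2, 1])
y_pred_example = np.array([0, 2, 2, 2, 0])

# Element-wise comparison gives a boolean array; summing it counts the mismatches.
mismatches = int(np.sum(y_true_example != y_pred_example))
error_rate = mismatches / y_true_example.size
print(mismatches, error_rate)  # 2 0.4
```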

Now we select the features and the target variable and split the data into training and test sets in a 4:1 ratio.

In [6]:
from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
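
The split above is neither seeded nor stratified, so the test set changes on every run and the classes may be unevenly represented. As an optional variation (not what the original cell does), `train_test_split` accepts `random_state` and `stratify`. The seed value and the variable names below are assumptions for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

print(len(y_tr_s), len(y_te_s))
print(Counter(y_te_s.tolist()))
```

With 50 samples per class and a 4:1 split, the stratified test set holds 10 samples of each class.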

Finally, we can apply our implementation to the LDA and Decision Tree algorithms to spot the differences. We compare the number of false predictions for both models as the number of bagged estimators varies from 1 to 50.

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

classifiers = [LinearDiscriminantAnalysis(), DecisionTreeClassifier()]
names = ['LDA', 'Decision Tree']
n_estimator = [1, 2, 5, 10, 20, 50]

for name, clf in zip(names, classifiers):
    print('False predictions for {}:'.format(name))
    
    for n in n_estimator:
        agg_prob = own_bagging(clf, n, X_train, y_train, X_test)
        y_pred = predict_class(agg_prob)
        print('{} estimators: {}'.format(n, false_predictions(y_test, y_pred)))
    
    print('------------------\n')

False predictions for LDA:
1 estimators: 1
2 estimators: 1
5 estimators: 1
10 estimators: 1
20 estimators: 1
50 estimators: 1
------------------

False predictions for Decision Tree:
1 estimators: 3
2 estimators: 2
5 estimators: 2
10 estimators: 2
20 estimators: 2
50 estimators: 2
------------------



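For LDA the number of false predictions stayed the same, while for the Decision Tree it decreased slightly. LDA is a low-variance method, so bagging has little effect on it, whereas a decision tree is a high-variance method and benefits from averaging. Note that the test set contains only 30 samples and neither the split nor the sampling is seeded, so the exact counts will vary from run to run. The differences would likely be more visible on a more complex dataset.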
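
The variance argument can be illustrated with a small simulation. It is a simplified sketch: each "model" is represented by independent unit-variance noise around the true value, and the seed, trial count and ensemble size are arbitrary. Real bagged models are trained on overlapping data and are therefore correlated, so the reduction in practice is smaller than in this idealized case.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n_trials = 20_000
n_models = 25                   # hypothetical ensemble size

# A single noisy "prediction" per trial.
single = rng.normal(size=n_trials)

# The average of n_models independent noisy "predictions" per trial.
averaged = rng.normal(size=(n_trials, n_models)).mean(axis=1)

print(f"variance of a single prediction: {single.var():.3f}")
print(f"variance of the averaged one:    {averaged.var():.3f}")
```

For independent predictions the variance of the average drops by a factor of `n_models`. A method that already has low variance, such as LDA, has little left to gain from this.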