# Ensemble methods. Bagging

Bagging consists of the following steps:
1. create bootstrap samples $S_{i}$, $i=1,\dots,T$,
2. train a classifier $f_{i}$ on each sample,
3. vote: $f(x)=\arg\max_{y}\sum_{i=1}^{T}\mathbb{1}(f_{i}(x)=y)$.

In other words, we collect the predictions of all classifiers and return the label that occurs most often.
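
As a small worked example of the voting rule, the sketch below takes made-up predictions from three classifiers for four samples and picks the most frequent label per sample. The names `majority_vote` and `predictions` are illustrative and not part of this notebook, and the values are invented for the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label per sample.

    predictions: list of per-classifier prediction lists of equal length.
    """
    voted = []
    # zip(*predictions) groups the votes that all classifiers cast for one sample
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical outputs of T = 3 classifiers on 4 samples
predictions = [
    [0, 1, 1, 2],
    [0, 1, 0, 2],
    [1, 1, 0, 2],
]
result = majority_vote(predictions)
print(result)  # [0, 1, 0, 2]
```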

First, we restore the data set stored earlier with `%store`.

In [1]:
%store -r data_set
%store -r labels
%store -r test_data_set
%store -r test_labels
%store -r unique_labels

We use the accuracy metric and the `tree` package from scikit-learn.

In [2]:
from sklearn import tree
import numpy as np
from sklearn.metrics import accuracy_score

decision_tree = tree.DecisionTreeClassifier()

The bootstrap generation method draws rows from our data set at random, with replacement, and returns them as a bootstrap set of the same size.

In [3]:
def create_bootstrap_data():
    # Draw len(data_set) row indices with replacement
    bootstrap_ids = np.random.randint(0, len(data_set), size=len(data_set))
    return data_set[bootstrap_ids, :], labels[bootstrap_ids]
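
Because the indices are drawn with replacement, a bootstrap set contains duplicates and leaves out some of the original rows. The sketch below checks this on a hypothetical data set size `n`, drawing indices the same way as `create_bootstrap_data`. The seed, `n` and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 10_000                      # hypothetical number of rows

# Draw n indices with replacement, like create_bootstrap_data does
bootstrap_ids = rng.integers(0, n, size=n)

# Fraction of distinct original rows that made it into the bootstrap set
unique_fraction = len(np.unique(bootstrap_ids)) / n

# Probability that a given row is picked at least once; tends to 1 - 1/e
expected = 1 - (1 - 1 / n) ** n
print(unique_fraction, expected)
```

Roughly 63% of the rows are expected to appear at least once, so each classifier sees a somewhat different view of the data.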

We build a classifier based on a decision tree, which is used later to generate the predictions. Other classifiers can be used here as well.

In [4]:
def build_classifier(data_set, labels):
    decision_tree = tree.DecisionTreeClassifier()
    decision_tree.fit(data_set, labels)
    return decision_tree

Given the number of cases, we build that many classifiers, each trained on a different bootstrap data set.

In [5]:
def build_classifiers(cases):
    classifiers = []
    for case in range(cases):
        bootstrap_set, bootstrap_labels = create_bootstrap_data()
        classifier = build_classifier(bootstrap_set, bootstrap_labels)
        classifiers.append(classifier)
    return classifiers

The voting part counts the predicted labels for each test sample and returns the label that occurs most often.

In [6]:
def vote(classifiers, test_data):
    # Rows: classifiers, columns: test samples
    output = np.array([classifier.predict(test_data) for classifier in classifiers])
    predicted = []
    for i in range(len(test_data)):
        # Labels assigned to sample i by all classifiers
        classified = output[:, i]
        # np.bincount expects non-negative integer labels
        counts = np.bincount(classified)
        predicted.append(np.argmax(counts))
    return predicted
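
One detail of this voting scheme is how ties are resolved. `np.argmax` returns the index of the first maximum, so when two labels receive the same number of votes the smaller label wins. The sketch below shows this on made-up votes; the label values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical votes of two classifiers for one sample: they disagree
votes = np.array([2, 1])
counts = np.bincount(votes)      # counts for labels 0, 1, 2
winner = int(np.argmax(counts))  # first maximum wins, i.e. the smaller label
print(counts, winner)            # [0 1 1] 1

# A third voter breaks the tie
votes_odd = np.array([2, 1, 2])
winner_odd = int(np.argmax(np.bincount(votes_odd)))
print(winner_odd)                # 2
```

With only two classifiers every disagreement is a tie, so an odd number of classifiers is usually the more sensible choice for two-class problems.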

Finally, we can check the results based on two classifiers.

In [10]:
classifiers = build_classifiers(2)
predicted = vote(classifiers,test_data_set)
accuracy = accuracy_score(test_labels, predicted)
print(accuracy)

0.9


We can easily compare it to just one tree trained on a single bootstrap sample:

In [11]:
# Use separate names so the original data_set and labels are not overwritten
bootstrap_set, bootstrap_labels = create_bootstrap_data()
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(bootstrap_set, bootstrap_labels)
predicted = decision_tree.predict(test_data_set)
accuracy = accuracy_score(test_labels, predicted)
print(accuracy)

1.0
