# 05: Bagging and random forests

In [3]:
%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt

import mylib as my

The goal of ensemble methods is to reduce bias and/or variance and thereby help prevent overfitting. In this notebook we look at two ensemble methods: bagging and random forests.

## Bootstrap samples
Let's start by seeing how we can draw a bootstrap sample given a dataset $D$. A bootstrap sample is a sample drawn randomly with replacement from the given dataset such that the size of the sample is the same as the size of the original dataset. That means some examples will show up multiple times in the drawn sample, while others will not show up at all.

In the example below, we are using a subset of the car dataset with classes indicating whether the car is in acceptable or unacceptable condition. The description of the original car dataset can be found at [this page](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation).

In [4]:
df = pd.read_csv('datasets/ua_car.csv')
ds = my.DataSet(df, y=True)
print(df.iloc[:,-1].value_counts())

unacc    384
acc      384
Name: y, dtype: int64


In [5]:
train, test = ds.train_test_split(test_portion=.25, shuffle=True)
print(train)
print(test)

    buying maintenance  doors persons luggage safety      y
61   vhigh         med      2    more     med    low  unacc
121   high         med      4    more   small    med  unacc
128  vhigh        high      3       4   small    med  unacc
385  vhigh         med      2       4     med   high    acc
612    med        high      3    more     med    med    acc
..     ...         ...    ...     ...     ...    ...    ...
165   high         low      3       2   small    low  unacc
123    low         med  5more       2     big    med  unacc
87    high       vhigh      3       4     med    med  unacc
248  vhigh         med      2       4   small    low  unacc
416  vhigh         med  5more    more     med    med    acc

[576 rows x 7 columns]
    buying maintenance  doors persons luggage safety      y
64     med         med  5more    more   small    low  unacc
533   high         low      2    more     big    med    acc
208    med         low      2       2     big   high  unacc
425  vhigh      

Given the above training set, we can draw a bootstrap sample like this:

In [6]:
sample_indexes = np.random.randint(0, train.N, size=train.N)
# print(sample_indexes)
bootstrap_sample = train.examples.iloc[sample_indexes, :]
bootstrap_sample

Unnamed: 0,buying,maintenance,doors,persons,luggage,safety,y
435,vhigh,low,3,more,big,high,acc
600,med,high,2,4,small,high,acc
355,vhigh,high,4,more,med,low,unacc
548,high,low,4,4,big,high,acc
153,high,med,2,more,big,low,unacc
...,...,...,...,...,...,...,...
409,vhigh,med,4,more,big,high,acc
471,high,high,3,more,big,high,acc
458,high,high,2,4,big,med,acc
270,vhigh,high,2,2,med,med,unacc


of which:

In [7]:
print("{:.2%}".format(
    pd.unique(bootstrap_sample.index).shape[0] / len(bootstrap_sample)), 'are unique examples')
print("{:.2%}".format(
    1 - pd.unique(bootstrap_sample.index).shape[0] / len(bootstrap_sample)), 'are repeated examples')

62.67% are unique examples
37.33% are repeated examples
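
The share of unique examples is no accident. Each draw misses a particular example with probability $1 - 1/N$, so the chance that an example never appears in a bootstrap sample of size $N$ is $(1 - 1/N)^N \approx e^{-1} \approx 0.368$ for large $N$. We should therefore expect roughly 63% unique examples, which is in line with the output above. The following is a minimal sketch for checking this numerically; the function names are ours (hypothetical) and are not part of `mylib`.

```python
import numpy as np

def expected_unique_fraction(n):
    """Expected fraction of distinct examples in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_unique_fraction(n, trials=200, seed=0):
    """Average fraction of distinct indexes over several simulated bootstrap samples."""
    rng = np.random.default_rng(seed)
    fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(trials)]
    return float(np.mean(fractions))

print(expected_unique_fraction(576))   # 576 is the training-set size used above
print(simulated_unique_fraction(576))
```

The two numbers should be close to each other; a single bootstrap sample, like the one drawn above, will fluctuate a little around them.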


Sometimes, it's useful to be able to identify the examples that are included in a given sample and those that aren't. Here are two functions for doing so.

In [8]:
def examples_in_sample(examples, sample):
    return examples[examples.index.isin(sample.index)]

# can i just turn this into the in bag or out of bag?
def examples_not_in_sample(examples, sample):
    return examples[~examples.index.isin(sample.index)]

Here are the examples from the training set that are in the above bootstrap sample:

In [9]:
examples_in_sample(train.examples, bootstrap_sample)

Unnamed: 0,buying,maintenance,doors,persons,luggage,safety,y
61,vhigh,med,2,more,med,low,unacc
121,high,med,4,more,small,med,unacc
128,vhigh,high,3,4,small,med,unacc
385,vhigh,med,2,4,med,high,acc
612,med,high,3,more,med,med,acc
...,...,...,...,...,...,...,...
167,low,high,2,more,med,low,unacc
691,low,vhigh,3,more,med,med,acc
493,high,med,2,4,med,high,acc
248,vhigh,med,2,4,small,low,unacc


And here are the examples from the training set that are not in the above bootstrap sample:

In [10]:
examples_not_in_sample(train.examples, bootstrap_sample)

Unnamed: 0,buying,maintenance,doors,persons,luggage,safety,y
7,med,low,3,2,small,high,unacc
346,high,high,3,4,small,med,unacc
402,vhigh,med,4,4,med,high,acc
671,med,low,2,more,med,med,acc
141,med,high,4,2,small,med,unacc
...,...,...,...,...,...,...,...
404,vhigh,med,4,4,big,high,acc
445,vhigh,low,4,more,big,high,acc
165,high,low,3,2,small,low,unacc
123,low,med,5more,2,big,med,unacc


## Bagging
The simplest form of ensemble methods is called **bagging**, which stands for **bootstrap aggregation**. The idea is simple:
* take $T$ bootstrap samples from the given dataset
* for each bootstrap sample, train a decision tree (DT)
* the predicted label of an unseen example is the average (for regression problems) or the plurality vote (for classification problems) of the outputs predicted by the $T$ trained trees.

Here is a simple implementation of bagging.

In [11]:
class Bagger:
    def __init__(self, dataset, nTrees):
        self.ds = dataset
        self.nTrees = nTrees
        self.classifiers = []
        self.samples = []
        self.make_trees()

    def make_trees(self):
        indexes = np.random.randint(0, self.ds.N,(self.ds.N,self.nTrees))
        for i in range(self.nTrees):
            # Create bootstrap samples one for each tree
            self.samples.append(self.ds.examples.iloc[indexes[:, i], :])

            # Build classifiers
            self.classifiers.append(my.DecisionTreeClassifier(my.DataSet(self.samples[i])))

    def predict(self, unseen):
        """
        Returns the most probable label (or class) for each unseen input. The
        unseen needs to be a data series with the same features (as indexes) as the 
        training data. It can also be a data frame with the same features as 
        the training data.
        """
        if unseen.ndim == 1:
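            # One vote per tree; drop trees that could not make a prediction
            votes = [dt.predict(unseen) for dt in self.classifiers]
            votes = pd.Series([v for v in votes if v is not None], dtype=object)
            # Plurality vote; Series.mode() also works for string labels
            return votes.mode()[0] if len(votes) else None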
        
        else:
            return np.array([self.predict(unseen.iloc[i,:]) for i in range(len(unseen))]) 
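
The plurality vote inside `predict` can also be written with the standard library only. Here is a minimal sketch; the helper name `plurality_vote` is hypothetical and not part of `mylib`, and breaking ties in favor of the label that was seen first is just one possible choice.

```python
from collections import Counter

def plurality_vote(labels):
    """Return the most common label, ignoring None votes.

    Ties are broken in favor of the label that was encountered first.
    Returns None when there are no usable votes.
    """
    votes = [label for label in labels if label is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

print(plurality_vote(['acc', 'unacc', 'acc', None]))
```

Because it works on plain Python lists, this version does not depend on how a numerical library treats string labels.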

## Random forests
Bagging is not exclusive to decision trees; it can be used with other models. A random forest is bagging applied specifically to decision trees, with one addition: besides drawing $T$ random bootstrap samples, it also uses what is sometimes called **feature bagging**. Feature bagging means that only a randomly selected subset of the features is considered at each node during the construction of the decision tree.

That means we need to modify our implementation of the decision tree so that it takes a numeric parameter named `nFeatures`, which defaults to 0. If `nFeatures` is 0, the tree functions as normal. Otherwise, it picks that many features at random at each node and considers only the best of those when splitting. The provided `my.DecisionTreeClassifier` class already has these changes.
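
To make the idea concrete, here is a minimal sketch of the feature-selection step at a single node. The internals of `my.DecisionTreeClassifier` are not shown in this notebook, so the function `candidate_features` below is a hypothetical illustration of the behavior described above and not the library's actual code.

```python
import numpy as np

def candidate_features(features, nFeatures=0, rng=None):
    """Return the features a node is allowed to consider for its split.

    nFeatures == 0 (or a value >= the number of features) means "use all features".
    Otherwise, nFeatures distinct features are picked at random.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = list(features)
    if nFeatures <= 0 or nFeatures >= len(features):
        return features
    # Sample positions without replacement so that no feature is picked twice
    picked = rng.choice(len(features), size=nFeatures, replace=False)
    return [features[i] for i in picked]

features = ['buying', 'maintenance', 'doors', 'persons', 'luggage', 'safety']
print(candidate_features(features, nFeatures=3))
```

The tree would then evaluate its splitting criterion only on the returned features and repeat the random draw at every node.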

For prediction, a plurality vote of the $T$ predicted labels is returned. Here is a simple implementation of random forests. Think about the similarities and differences between these two classes.

In [12]:
class RandomForest:
    def __init__(self, dataset, nTrees, nFeatures=0):
        self.ds = dataset
        self.nTrees = nTrees
        self.nFeatures = nFeatures
        self.classifiers = []
        self.samples = []
        self.make_forest()

    def make_forest(self):
        indexes = np.random.randint(0, self.ds.N,(self.ds.N,self.nTrees))
        for i in range(self.nTrees):
            # Create bootstrap samples one for each tree
            self.samples.append(self.ds.examples.iloc[indexes[:, i], :])

            # Build classifiers
            self.classifiers.append(my.DecisionTreeClassifier(my.DataSet(self.samples[i]), nFeatures=self.nFeatures))

    def predict(self, unseen):
        """
        Returns the most probable label (or class) for each unseen input. The
        unseen needs to be a data series with the same features (as indexes) as the 
        training data. It can also be a data frame with the same features as 
        the training data.
        """
        if unseen.ndim == 1:
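            # One vote per tree; drop trees that could not make a prediction
            votes = [dt.predict(unseen) for dt in self.classifiers]
            votes = pd.Series([v for v in votes if v is not None], dtype=object)
            # Plurality vote; Series.mode() also works for string labels
            return votes.mode()[0] if len(votes) else None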
        
        else:
            return np.array([self.predict(unseen.iloc[i,:]) for i in range(len(unseen))]) 

## Testing

In [13]:
dt = my.DecisionTreeClassifier(train)
cm = my.confusion_matrix(test.target, dt.predict(test.examples.iloc[:,:-1]))
accuracy = np.trace(cm) / np.sum(cm)

print(cm)
print('Decision tree accuracy: ', accuracy)


bg = Bagger(train, 20)
cm = my.confusion_matrix(test.target, bg.predict(test.examples.iloc[:,:-1]))
accuracy = np.trace(cm) / np.sum(cm)

print(cm)
print('Bagger accuracy: ', accuracy)

rf = RandomForest(train, 20, nFeatures=3)
cm = my.confusion_matrix(test.target, rf.predict(test.examples.iloc[:,:-1]))
accuracy = np.trace(cm) / np.sum(cm)

print(cm)
print('Random forests accuracy: ', accuracy)

[[91  2]
 [ 4 95]]
Decision tree accuracy:  0.96875




[[90  3]
 [ 4 95]]
Bagger accuracy:  0.9635416666666666
[[93  0]
 [ 6 93]]
Random forests accuracy:  0.96875




You should try different values for `nTrees` and `nFeatures`. These variables are considered hyperparameters, and cross-validation can be used to determine the best values for them. Common values for `nFeatures` are $\sqrt{m}$ and $\log_2(m)$, where $m$ is the number of features.
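
As a quick worked example, the sketch below evaluates both heuristics for a given number of features. Rounding to the nearest integer and never going below 1 are our assumptions; other conventions (such as rounding down) are also in use, and the helper name `common_nfeatures` is hypothetical.

```python
import math

def common_nfeatures(m):
    """Two common heuristics for nFeatures, rounded to the nearest integer and kept >= 1."""
    return {
        "sqrt": max(1, round(math.sqrt(m))),
        "log2": max(1, round(math.log2(m))),
    }

print(common_nfeatures(6))  # the car dataset used above has 6 features
```

These values are only starting points for the search; cross-validation should have the final say.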

## Out of bag score
Another way of testing random forests is to calculate the so-called **out-of-bag** score. Such a score does not require splitting the dataset into training and test sets. One way to calculate it is to identify, for each example $x$ in the dataset, the list of trees that were trained on samples that do not include $x$; let's call this list of trees $D_x$. We then call the `predict` method of each tree in $D_x$ to get the list of classes predicted for $x$; let's call this list of classes $C_x$. Finally, we find the class that occurs most often in $C_x$ and report it as the predicted class of $x$; let's call it $h_x$.

Doing this for each example in the dataset gives us an array of predicted classes, which we can compare against the actual target classes of these examples. Using the confusion matrix we can report the accuracy as the out of bag score.

Notice that the above implementations of `Bagger` and `RandomForest` already give you access to the bootstrap samples and the classifiers that are trained on them. You can use them to find out which samples do not include a given example.
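
The procedure described above can be sketched as follows. This is only one possible reading of it, under several assumptions: the index labels of `examples` are unique, the last column of `examples` is the target, `samples[t]` is the bootstrap sample that `classifiers[t]` was trained on, and each classifier has a `predict` method that accepts a single row of features. The function name `oob_accuracy` is hypothetical, and it reports the accuracy directly instead of going through a confusion matrix.

```python
from collections import Counter

import pandas as pd

def oob_accuracy(examples, samples, classifiers):
    """Out-of-bag accuracy from bootstrap samples and the trees trained on them."""
    features = examples.iloc[:, :-1]
    target = examples.iloc[:, -1]
    # Index labels contained in each tree's bootstrap sample
    in_bag = [set(sample.index) for sample in samples]

    correct = voted = 0
    for idx in examples.index:
        x = features.loc[idx]
        # C_x: predictions of the trees in D_x, i.e. trees that never saw x
        c_x = [clf.predict(x) for clf, bag in zip(classifiers, in_bag) if idx not in bag]
        c_x = [c for c in c_x if c is not None]
        if not c_x:
            continue  # x is in every bag, so no tree can vote on it
        h_x = Counter(c_x).most_common(1)[0][0]  # plurality vote
        voted += 1
        correct += int(h_x == target.loc[idx])
    return correct / voted if voted else float('nan')
```

With the classes above, a call might look like `oob_accuracy(rf.ds.examples, rf.samples, rf.classifiers)` for a `RandomForest` object `rf`, provided that the assumptions listed above hold for `mylib`.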

## CHALLENGE
Write a function that calculates the out of bag score as described above given three arguments: a dataset, number of trees (`nTrees`), and number of features (`nFeatures`). The function should use these arguments to create a random forest object to use for calculating this score.

Test and report the out of bag scores for the whole car dataset and for when `nTrees` is 10, 15, and 20.

In [33]:
# TODO create the out_of_bag_score(dataset, nTrees, nFeatures) function 
# the "score" is the confusion matrix that you get from using the out-of-bag examples to test the trees

# TODO I think this is just one iteration so far. 
def out_of_bag_score(data, nTrees, nFeatures):

    # make random indices
    bag_samples_indexes = np.random.randint(0,data.N,size = data.N)
    
    # pull out sample from given dataset
    boot_sample = data.examples.iloc[bag_samples_indexes, :]

    # unique examples that appear in the bootstrap sample (used for training)
    in_bag = examples_in_sample(data.examples, boot_sample)
    print(f"In bag indices:\n{pd.unique(in_bag.index)}\n")
    
    # examples that never appear in the bootstrap sample (used for testing)
    out_bag = examples_not_in_sample(data.examples, boot_sample)
    print(f"Out of bag indices:\n{pd.unique(out_bag.index)}")

    # turn data frames into DataSet objects
    in_bag_ds = my.DataSet(in_bag)
    out_bag_ds = my.DataSet(out_bag)

    # create the random forest object using in bag dataset
    rf = RandomForest(in_bag_ds, nTrees, nFeatures)

    # test the random forest object using the out of bag dataset
    cm = my.confusion_matrix(out_bag_ds.target, rf.predict(out_bag_ds.examples.iloc[:,:-1]))

    # return a confusion matrix from predictions using the test bootstrap sample
    return cm

# test
ds = my.DataSet(df, y=True)
score_cm = out_of_bag_score(ds, 10, 3)

accuracy = np.trace(score_cm) / np.sum(score_cm)

print("\n",score_cm)
print('out of bag score accuracy: ', accuracy)



In bag indices:
[  0   2   3   4   6   9  11  12  14  15  17  18  19  20  21  24  25  26
  28  29  30  31  34  35  37  38  40  41  42  43  46  48  51  52  53  54
  57  58  61  62  65  66  67  69  70  73  74  75  76  77  78  79  80  82
  84  86  87  89  91  93  94  95  96  98  99 100 101 102 103 104 105 106
 107 109 110 111 112 113 115 116 117 118 119 121 124 125 126 127 130 131
 132 134 138 139 142 144 145 147 149 151 152 153 155 161 162 163 165 166
 168 169 170 172 173 174 179 180 181 182 185 187 188 189 190 191 192 193
 194 195 196 197 198 199 200 201 202 203 204 206 210 212 215 216 217 218
 222 224 226 227 229 231 234 235 236 238 240 241 243 245 247 248 250 251
 252 253 254 257 259 260 261 262 263 264 265 266 267 271 272 273 274 277
 280 282 284 285 286 288 290 291 293 295 296 297 298 299 302 303 304 306
 307 310 312 313 315 317 318 319 320 321 322 323 324 325 326 327 330 331
 332 334 335 336 337 338 341 343 344 345 347 353 355 356 357 358 359 360
 364 365 366 367 368 369 370 371 37

