# Bootstrap Aggregation

Although decision trees are a powerful, expressive and versatile algorithm, they tend to suffer from high variance. This means that they fit the training data too closely, so they can produce very different results depending on the training examples they are fed.

One way to counter this tendency to overfit is called __Bootstrap Aggregation__, or __Bagging__ for short. Bagging is an ensemble method, which means that it trains several predictors, combines their outputs in some way and then returns a final, unified prediction.

What is a __bootstrap__? It is just a sample of a dataset drawn with replacement. Put more simply, we select rows at random from the dataset, and any row that has already been selected can be selected again, either within the same sample or for any other sample.
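A quick calculation shows what sampling with replacement implies. Assuming we draw $n$ rows uniformly at random, with replacement, from a dataset of $n$ rows (the symbol $n$ is introduced here only for illustration), the probability that a particular row is never picked is:

$$P(\text{row not selected}) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow{\; n \to \infty \;} e^{-1} \approx 0.368$$

So a bootstrap that is as large as the original dataset contains, on average, roughly 63% of the distinct rows, and the remaining positions are filled by duplicates.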

Then, Bagging consists of training a given number of decision trees on different _bootstraps_ of data and then combining their predictions. 
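Why does combining predictors help with variance? A rough argument, borrowed from the regression setting where predictions are averaged rather than voted on, goes as follows. Assume $B$ predictors $f_1, \dots, f_B$, each with variance $\sigma^2$ and with pairwise correlation $\rho$ (all of these symbols are assumptions made for this sketch, not quantities computed in this notebook). The variance of their average is:

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

The second term shrinks as $B$ grows while the first one does not, which suggests that bagging benefits most when the individual trees are as different from each other as possible.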

Because every tree is trained on a resample of the same dataset, Bagging is a very useful approach when we do not have a lot of data available!

Let's start our implementation by loading the code and libraries we'll need. We will build our solution on top of the ones we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/decision_trees.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.DecisionTrees, DecisionTrees._
import scala.util.Random

## Data

We'll use the [Sonar](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data) dataset. The task is to predict whether an object is a mine or a rock given the strength of sonar returns at various angles. It is a binary classification problem, well suited to our bagged decision trees.


Let's load the data:

In [2]:
val BASE_DATA_PATH = "../../resources/data"
val sonarPath = s"$BASE_DATA_PATH/16/sonar.all-data.csv"

val rawData = loadCsv(sonarPath)
val numberOfRows = rawData.length
val numberOfColumns = rawData.head.length
println(s"Number of rows in dataset: $numberOfRows")
println(s"Number of columns in dataset: $numberOfColumns")

// Convert every feature column (all but the last) from text to numbers,
// then encode the label in the last column as an integer.
val (data, lookUpTable) = {
    val dataWithNumericColumns = (0 until (numberOfColumns - 1)).toVector.foldLeft(rawData) { (d, i) => textColumnToNumeric(d, i)}
    categoricalColumnToNumeric(dataWithNumericColumns, numberOfColumns - 1)
}

Number of rows in dataset: 208
Number of columns in dataset: 61


BASE_DATA_PATH: String = "../../resources/data"
sonarPath: String = "../../resources/data/16/sonar.all-data.csv"
rawData: Vector[Vector[Data]] = Vector(
  Vector(
    Text(0.0200),
    Text(0.0371),
    Text(0.0428),
    Text(0.0207),
    Text(0.0954),
    Text(0.0986),
    Text(0.1539),
    Text(0.1601),
    Text(0.3109),
    Text(0.2111),
...
numberOfRows: Int = 208
numberOfColumns: Int = 61
data: Vector[Vector[Data]] = Vector(
  Vector(
    Numeric(0.02),
    Numeric(0.0371),
    Numeric(0.0428),
    Numeric(0.0207),
    Numeric(0.0954),
    Numeric(0.0986),
    Numeric(0.1539),
    Numeric(0.1601),
    Numeric(0.3109),
    Numeric(0.2111),
...
lookUpTable: Map[Data, Int] = Map(

## Bootstrap Resample

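The first thing we need to do is implement a method to resample a subset of rows from a dataset. We can do this by selecting a random proportion of the rows of the training set. In particular, we shuffle the dataset, determine how many rows correspond to the supplied proportion and then take that number of rows from the shuffled dataset. Note that shuffling and taking the first rows never picks the same row twice, so this simplified version samples without replacement; the trees still differ from one another as long as the proportion is below 1.0.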

Let's implement a function that does this work for us:

In [3]:
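def subsample(dataset: Dataset, ratio: Double = 1.0) = {
  // Number of rows that corresponds to the requested proportion
  val nSample = math.round(dataset.length * ratio).toInt

  // Shuffle, then keep the first nSample rows (no row is picked twice)
  val shuffledDataset = Random.shuffle(dataset)

  shuffledDataset.take(nSample)
}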

defined function subsample

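Excellent. By default (ratio = 1.0), `subsample` just returns a permutation of the whole dataset, so every tree would be trained on the same set of rows. To get trees that differ from each other we need to pass a ratio below 1.0.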
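Since `subsample` shuffles and then takes a prefix, it never repeats a row. For comparison, here is a minimal sketch of how a sample with replacement could be drawn instead. The name `subsampleWithReplacement`, its generic signature and the `rng` parameter are assumptions made for this example; they are not part of the code imported from the previous notebook.

```scala
import scala.util.Random

// Hypothetical helper (not part of the original notebook): draws rows
// uniformly at random WITH replacement, so a row may appear more than once.
def subsampleWithReplacement[A](
    dataset: Vector[A],
    ratio: Double = 1.0,
    rng: Random = new Random()): Vector[A] = {
  require(dataset.nonEmpty, "dataset must not be empty")
  val nSample = math.round(dataset.length * ratio).toInt
  // Each position is filled by an independent random pick from the dataset.
  Vector.fill(nSample)(dataset(rng.nextInt(dataset.length)))
}

// Usage on a toy "dataset" of ten row ids, with a fixed seed for repeatability.
val toyRows = (0 until 10).toVector
val sample = subsampleWithReplacement(toyRows, ratio = 1.0, rng = new Random(42))
println(s"Sample: $sample")
println(s"Distinct rows: ${sample.distinct.length} of ${toyRows.length}")
```

With this variant, a ratio of 1.0 no longer yields the same rows for every tree, because each sample leaves some rows out and duplicates others.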

## Making Predictions

Given that the underlying models used in Bagging are decision trees, we just need to create a method to make predictions given a list of trained trees. We'll use each tree to make a prediction on a given row and then we'll select the mode among all predictions as the final label:

In [4]:
def baggingPredict(trees: List[TreeNode], row: Vector[Numeric]): Numeric = {
    val predictions = trees.map(t => predictWithTree(t, row))
    predictions.maxBy(p => predictions.count(_ == p))
}

defined function baggingPredict
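The mode computation can be looked at in isolation. The sketch below is one possible alternative that groups equal predictions once instead of recounting the list for every element. The name `majorityVote` is hypothetical, and ties are broken arbitrarily, which is an assumption of this sketch and may differ from how `baggingPredict` resolves them.

```scala
// Hypothetical helper: returns the most frequent value in a non-empty sequence.
// Ties are broken arbitrarily.
def majorityVote[A](predictions: Seq[A]): A = {
  require(predictions.nonEmpty, "need at least one prediction")
  predictions
    .groupBy(identity) // value -> all occurrences of that value
    .maxBy { case (_, occurrences) => occurrences.size }
    ._1
}

// Five toy votes: three for class 1.0 and two for class 0.0.
val votes = List(1.0, 0.0, 1.0, 1.0, 0.0)
println(majorityVote(votes)) // prints 1.0
```

Using an odd number of trees in a binary problem avoids ties altogether.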

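Now we can put the pieces together. The `bagging` function trains `numberOfTrees` trees, each on its own subsample of the training set, and then uses `baggingPredict` to label every row of the test set:

In [5]: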
def bagging(train: Dataset, test: Dataset, parameters: Parameters) = {
  val numberOfTrees = parameters("numberOfTrees").asInstanceOf[Int]
  val maxDepth = parameters("maxDepth").asInstanceOf[Int]
  val sampleSize = parameters("sampleSize").asInstanceOf[Double]
  val minSize = parameters("minSize").asInstanceOf[Int]
  val trees = (1 to numberOfTrees).toList.map(_ => buildTree(subsample(train, sampleSize), maxDepth, minSize))

  test.map { r =>
    baggingPredict(trees, r.asInstanceOf[Vector[Numeric]])
  }
}

defined function bagging

Good.

Let's now test our new algorithm on the Sonar dataset.

We'll first run a baseline model on it, then our freshly implemented Bagging algorithm, and finally compare their performance. For Bagging, we will use 1, 5, 10, 20, 50 and 100 trees.

As a baseline for classification we will use a __zero rule classifier__, which always predicts the most frequent class in the training set.
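To make the baseline concrete, here is a minimal, self-contained sketch of the zero rule idea on toy labels. The function `zeroRule` and its signature are hypothetical and only illustrate the technique; the notebook itself uses the `zeroRuleClassifier` imported from the previous notebook.

```scala
// Hypothetical stand-in for a zero rule classifier: it ignores the features
// and always predicts the most common label seen in the training set.
def zeroRule[A](trainLabels: Seq[A], numberOfTestRows: Int): Seq[A] = {
  require(trainLabels.nonEmpty, "need at least one training label")
  val mostCommon = trainLabels.groupBy(identity).maxBy(_._2.size)._1
  List.fill(numberOfTestRows)(mostCommon)
}

// Toy labels: "M" for mine, "R" for rock.
val toyTrainLabels = Seq("M", "M", "R", "M", "R")
val toyPredictions = zeroRule(toyTrainLabels, numberOfTestRows = 3)
println(toyPredictions) // prints List(M, M, M)
```

Any model worth keeping should beat this baseline, since it reflects nothing more than the class balance of the data.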

In [6]:
// Normalize data
val minMax = getDatasetMinAndMax(data)
val normalizedData = normalizeDataset(data, minMax)

val baselineAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        normalizedData, 
        (train, test, parameters) => zeroRuleClassifier(train, test), 
        Map.empty, 
        accuracy, 
        trainProportion=0.8)

println(s"Zero Rule accuracy: $baselineAccuracy")

Zero Rule accuracy: 0.5714285714285714


minMax: MinMaxData = Vector(
  Some((0.0015, 0.1371)),
  Some((6.0E-4, 0.2339)),
  Some((0.0015, 0.3059)),
  Some((0.0058, 0.4264)),
  Some((0.0067, 0.401)),
  Some((0.0102, 0.3823)),
  Some((0.0033, 0.3729)),
  Some((0.0055, 0.459)),
  Some((0.0075, 0.6828)),
  Some((0.0113, 0.7106)),
  Some((0.0289, 0.7342)),
...
normalizedData: Dataset = Vector(
  Vector(
    Numeric(0.1364306784660767),
    Numeric(0.15645092156022286),
    Numeric(0.13567674113009198),
    Numeric(0.03542558250118878),
    Numeric(0.22495561755008875),
    Numeric(0.2375705455522709),
    Numeric(0.4074675324675

In [7]:
for (nTrees <- List(1, 5, 10, 20, 50, 100)) {
    println(s"Using $nTrees trees.")
    val baggingAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        data,
        bagging,
        Map("maxDepth" -> 6, "minSize" -> 2, "sampleSize" -> 0.5, "numberOfTrees" -> nTrees),
        accuracy,
        trainProportion=0.8)
    
    println(s"Bagging accuracy: $baggingAccuracy")
}

Using 1 trees.
Bagging accuracy: 0.6666666666666666
Using 5 trees.
Bagging accuracy: 0.6666666666666666
Using 10 trees.
Bagging accuracy: 0.7142857142857143
Using 20 trees.
Bagging accuracy: 0.7380952380952381
Using 50 trees.
Bagging accuracy: 0.7142857142857143
Using 100 trees.
Bagging accuracy: 0.7380952380952381


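Even a single tree achieves better accuracy than the baseline defined above (about 0.67 versus 0.57). On its own, though, each tree is fairly shallow (`maxDepth` of 6) and is trained on only half of the training rows, so its predictive power is not very impressive. It is only when we combine a good number of them (10 or more in this run) that accuracy climbs to between 0.71 and 0.74 and we start seeing the benefits of bagging! Keep in mind that the test set is small (20% of 208 rows), so a difference of two or three hundredths in accuracy corresponds to a single test row.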
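One way to build intuition for why more voters help is an idealized calculation: if every tree were correct with the same probability and the trees erred independently of each other, the accuracy of the majority vote would follow from the binomial distribution. The sketch below computes it. Both function names are hypothetical, the value 0.67 is simply the single-tree accuracy observed above, and the independence assumption does not hold for bagged trees trained on overlapping data, so real gains are smaller than what this prints.

```scala
// n choose k, computed iteratively in Double to avoid integer overflow.
def binomial(n: Int, k: Int): Double =
  (1 to k).foldLeft(1.0)((acc, i) => acc * (n - k + i) / i)

// Probability that a strict majority of n independent voters is correct,
// when each voter is correct with probability p. n must be odd (no ties).
def majorityAccuracy(n: Int, p: Double): Double = {
  require(n % 2 == 1, "use an odd number of voters to avoid ties")
  ((n / 2 + 1) to n).map { k =>
    binomial(n, k) * math.pow(p, k) * math.pow(1 - p, n - k)
  }.sum
}

// Idealized majority-vote accuracy for a growing number of voters.
for (n <- List(1, 5, 11, 21, 51, 101)) {
  println(f"$n%3d voters: ${majorityAccuracy(n, 0.67)}%.3f")
}
```

The gap between these idealized numbers and the measured accuracies is a reminder that the trees in an ensemble are correlated.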