# Baseline Models

In the last two notebooks we explored several ways to measure the performance of an algorithm. We learned that there is a huge variety of evaluation metrics to choose from. We also looked at specific evaluation techniques such as train-test split and K-Fold cross-validation.

This is all fine and dandy, but against _what_ do we compare our models? How do we know whether the predictions of an algorithm are good or not? 

Enter __baseline models__!

A baseline model is an algorithm that produces a set of predictions based on some heuristic. Some of these heuristics are more clever and sophisticated than others. Today we'll explore two of the most popular baseline prediction algorithms:

- Random Prediction Algorithm.
- Zero Rule Algorithm.

Let's start our implementation by loading the code and libraries we'll need. We will build our solution on top of the ones we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/evaluation_metrics.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.EvaluationMetrics, EvaluationMetrics._
import scala.util.Random

import $ivy.$
import $file.$, EvaluationMetrics._
import scala.util.Random

### Data

#### Classification

For our classification tasks we will use the following __training__ set:

| X1 | X2 | Y |
|----|----|---|
| A  | 44 | 0 |
| Z  | 12 | 0 |
| Q  | 28 | 1 |
| E  | 81 | 0 |
| F  | 72 | 0 |
| S  | 33 | 0 |
| O  | 29 | 1 |
| N  | 47 | 0 |
| J  | 73 | 1 |
| Q  | 57 | 1 |

And this __test__ set:

| X1 | X2 | Y |
|----|----|---|
| P  | 42 | 1 |
| L  | 14 | 1 |
| O  | 5  | 1 |
| M  | 9  | 0 |

#### Regression

For regression, these are our __training__ and __test__ sets, respectively:

| X1          | X2      | X3   | Y      |
|-------------|---------|------|--------|
| vKSOgzRyjU  | 4214805 | 1102 | -2.900 |
| sOgdNGRhHz  | 141328  | 1521 | 3.340  |
| OuijvSyrrsU | 513968  | 1403 | 2.640  |
| MygFUrQnfDD | 15420   | 822  | 9.000  |
| DKKmnTUAqw  | 19665   | 636  | 6.560  |
| VjKXLhttIx  | 11818   | 82   | -5.380 |
| cqYuHvAlaf  | 19293   | 688  | 2.740  |
| rINbXLITsj  | 23911   | 486  | -3.030 |
| psELWcZXsI  | 13358   | 140  | 4.460  |
| HWmQYkKpzt  | 19294   | 161  | 4.460  |

| X1           | X2   | X3   | Y      |
|--------------|------|------|--------|
| vKSqwexzRyjU | 4805 | 1106 | -3.140 |
| sOgqwNGRhHz  | 1138 | 1478 | 3.009  |
| OuijvSyrrsU  | 139  | 786  | 1.715  |
| rINbXLITsj   | 231  | 980  | -4.220 |
| psEL23cZXsI  | 358  | 4543 | 4.180  |
| HWmQ12Kpzt   | 924  | 235  | 9.99   |

In [2]:
val mockClassificationTrainingSet: Dataset = Vector(
    ("A", 44, 0),
    ("Z", 12, 0),
    ("Q", 28, 1),
    ("E", 81, 0),
    ("F", 72, 0),
    ("S", 33, 0),
    ("O", 29, 1),
    ("N", 47, 0),
    ("J", 73, 1),
    ("Q", 57, 1))
.map { case (x1, x2, y) => Vector(Text(x1), Numeric(x2), Numeric(y)) }

val mockClassificationTestSet: Dataset = Vector(
    ("P", 42, 1),
    ("L", 14, 1),
    ("O", 5, 1),
    ("M", 9, 0)
).map { case (x1, x2, y) => Vector(Text(x1), Numeric(x2), Numeric(y)) }

val mockRegressionTrainingSet: Dataset = Vector(
    ("vKSOgzRyjU", 4214805, 1102, -2.900),
    ("sOgdNGRhHz", 141328, 1521, 3.340),
    ("OuijvSyrrsU", 513968, 1403, 2.640),
    ("MygFUrQnfDD", 15420, 822, 9.000),
    ("DKKmnTUAqw", 19665, 636, 6.560),
    ("VjKXLhttIx", 11818, 82, -5.380),
    ("cqYuHvAlaf", 19293, 688, 2.740),
    ("rINbXLITsj", 23911, 486, -3.030),
    ("psELWcZXsI", 13358, 140, 4.460),
    ("HWmQYkKpzt", 19294, 161, 4.460)
).map { 
    case (x1, x2, x3, y) => Vector(Text(x1), Numeric(x2), Numeric(x3), Numeric(y)) 
}

val mockRegressionTestSet: Dataset = Vector(
    ("vKSqwexzRyjU", 4805, 1106, -3.140),
    ("sOgqwNGRhHz", 1138, 1478, 3.009),
    ("OuijvSyrrsU", 139, 786, 1.715),
    ("rINbXLITsj", 231, 980, -4.220),
    ("psEL23cZXsI", 358, 4543, 4.180),
    ("HWmQ12Kpzt", 924, 235, 9.99)).map { 
    case (x1, x2, x3, y) => Vector(Text(x1), Numeric(x2), Numeric(x3), Numeric(y)) 
}

mockClassificationTrainingSet: Dataset = Vector(
  Vector(Text(A), Numeric(44.0), Numeric(0.0)),
  Vector(Text(Z), Numeric(12.0), Numeric(0.0)),
  Vector(Text(Q), Numeric(28.0), Numeric(1.0)),
  Vector(Text(E), Numeric(81.0), Numeric(0.0)),
  Vector(Text(F), Numeric(72.0), Numeric(0.0)),
  Vector(Text(S), Numeric(33.0), Numeric(0.0)),
  Vector(Text(O), Numeric(29.0), Numeric(1.0)),
  Vector(Text(N), Numeric(47.0), Numeric(0.0)),
  Vector(Text(J), Numeric(73.0), Numeric(1.0)),
  Vector(Text(Q), Numeric(57.0), Numeric(1.0))
)
mockClassificationTestSet: Dataset = Vector(
  Vector(Text(P), Numeric(42.0), Numeric(1.0)),
  Vector(Text(L), Numeric(14.0), Numeric(1.0)),
  Vector(Text(O), Numeric(5.0), Numeric(1.0)),
  Vector(Text(M), Numeric(9.0), Numeric(0.0))
)
mockRegressionTrainingSet

## Random Prediction Algorithm

This is one of the simplest baseline models. It works as follows:

__Training phase__:

1. Select the label column.
2. Keep only its unique values.
    
__Prediction phase__:

For each row in the test set, select a random label from the unique label set collected in the training phase.

Of course, it works with both classification and regression tasks.

Let's proceed to implement it (_**NOTE**_: we assume the last column in each dataset holds the labels):

In [3]:
// Handy helper function to select a particular column in a dataset. We'll use it in all of our implementations.
def selectColumn(dataset: Dataset, index: Int): Vector[Data] = {
  dataset.map(_(index))
}

def randomAlgorithm(train: Dataset, test: Dataset, seed: Int = 42): Vector[Data] = {
  val random = new Random(seed)

  val outputColumn = selectColumn(train, train.head.length - 1)
  val uniqueOutputs = outputColumn.distinct
  val numberOfUniqueOutputs = uniqueOutputs.length

  test.map { row =>
    val randomIndex = random.nextInt(numberOfUniqueOutputs)

    uniqueOutputs(randomIndex)
  }
}

defined function selectColumn
defined function randomAlgorithm

Let's now test our implementation with our mock dataset:

In [4]:
randomAlgorithm(mockClassificationTrainingSet, mockClassificationTestSet)

res3: Vector[Data] = Vector(Numeric(1.0), Numeric(0.0), Numeric(1.0), Numeric(0.0))

As we can see, we pass a seed to our algorithm for the sake of reproducibility. By default we use 42; after all, it is the [Answer to the Ultimate Question of Life, the Universe, and Everything](https://simple.wikipedia.org/wiki/42_(answer)) ;)

The predictions are random selections from the unique labels in the training set (in this case, 1 and 0).
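
To turn these predictions into a number that other models can be compared against, we can score them against the test labels. The snippet below is a minimal, self-contained sketch: it uses plain `Double` labels instead of the `Data` type from the previous notebook, and `accuracy` is a hypothetical helper written here for illustration (not necessarily the one defined in `EvaluationMetrics`).

```scala
// Hypothetical helper: fraction of predictions that match the actual labels.
def accuracy(actual: Vector[Double], predicted: Vector[Double]): Double = {
  require(actual.nonEmpty, "Cannot score an empty set of labels")
  require(actual.length == predicted.length, "Both vectors must have the same length")

  val hits = actual.zip(predicted).count { case (a, p) => a == p }
  hits.toDouble / actual.length
}

// Labels of the mock classification test set (rows P, L, O, M).
val actualLabels = Vector(1.0, 1.0, 1.0, 0.0)

// Predictions returned by randomAlgorithm above with the default seed.
val randomPredictions = Vector(1.0, 0.0, 1.0, 0.0)

// Three of the four predictions match, so the accuracy is 0.75.
val randomBaselineAccuracy = accuracy(actualLabels, randomPredictions)
```

With only four test rows this figure is very noisy, but the procedure is the same on a real dataset: a trained classifier should score clearly above this number.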

## Zero Rule Algorithm

Unlike the Random Prediction Algorithm, the Zero Rule Algorithm behaves slightly differently depending on the type of predictor being trained:

__Training phase__:

   * _Classifier_: It determines the label with the highest frequency (the mode).
   * _Regressor_: It calculates a measure of central tendency, such as the mean, mode or median. Usually the mean is used.

__Prediction phase__:

   * _Classifier_: Returns the mode for every row in the test set.
   * _Regressor_: Returns the measure of central tendency calculated in the training phase for every row in the test set.

Let's start by implementing a Zero Rule Algorithm for classification:

In [5]:
def zeroRuleClassifier(train: Dataset, test: Dataset): Vector[Data] = {
  val outputColumn = selectColumn(train, train.head.length - 1)

  val mode = outputColumn.groupBy(identity).maxBy(_._2.length)._1

  test.map(row => mode)
}

defined function zeroRuleClassifier

Cool. Let's test it with our mock dataset:

In [6]:
zeroRuleClassifier(mockClassificationTrainingSet, mockClassificationTestSet)

res5: Vector[Data] = Vector(Numeric(0.0), Numeric(0.0), Numeric(0.0), Numeric(0.0))

It works! Excellent! As we can see, the predicted class for every example in the test set is 0, as it is the most frequent label in the training set.
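
The same reasoning can be checked by hand. The following sketch uses plain `Int` labels rather than the notebook's `Data` type (an assumption made to keep it self-contained). It counts the label frequencies in the training set and then measures how often the majority label matches the test labels.

```scala
// Labels (Y column) of the mock classification training and test sets.
val trainLabels = Vector(0, 0, 1, 0, 0, 0, 1, 0, 1, 1)
val testLabels = Vector(1, 1, 1, 0)

// Frequency of each label in the training set: 0 appears 6 times, 1 appears 4 times.
val labelCounts: Map[Int, Int] =
  trainLabels.groupBy(identity).map { case (label, occurrences) => label -> occurrences.length }

// The majority label is what Zero Rule predicts for every test row.
val majorityLabel = labelCounts.maxBy(_._2)._1

// Fraction of test rows whose label equals the majority label: 1 out of 4.
val zeroRuleAccuracy = testLabels.count(_ == majorityLabel).toDouble / testLabels.length
```

Only one of the four test rows carries the label 0, so this baseline reaches 25% accuracy here. The mock test set is tiny and its label distribution differs from the training set, which is why the majority label performs poorly in this example.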

Let's now proceed to implement a Zero Rule Regressor:

In [7]:
// Used for type-safe selection of the measure of central tendency used by the zero rule algorithm
sealed trait Measure
case object Mean extends Measure
case object Mode extends Measure
case object Median extends Measure

def zeroRuleRegressor(train: Dataset, test: Dataset, measure: Measure = Mean): Vector[Data] = {
  def calculateMean(labels: Vector[Data]) = Numeric {
    val sum = labels.foldLeft(0.0) { 
        (accum, numericValue) => accum + getNumericValue(numericValue).get 
    }

    sum / labels.length
  }

  def calculateMedian(labels: Vector[Data]) = {
    val sortedLabels = labels.sortBy(getNumericValue(_).get)
    val evenNumberOfLabels = labels.length % 2 == 0

    if (evenNumberOfLabels) {
      val splitIndex = labels.length / 2

      Numeric {
        val firstCentricValue = getNumericValue(sortedLabels(splitIndex - 1)).get
        val secondCentricValue = getNumericValue(sortedLabels(splitIndex)).get
         (firstCentricValue + secondCentricValue) /  2
      }
    } else {
      val medianIndex = labels.length / 2
      sortedLabels(medianIndex)
    }
  }

  def calculateMode(labels: Vector[Data]) = {
    labels.groupBy(identity).maxBy(_._2.length)._1
  }

  val outputColumn = selectColumn(train, train.head.length - 1)

  val measureValue = measure match {
    case Mean => calculateMean(outputColumn)
    case Mode => calculateMode(outputColumn)
    case Median => calculateMedian(outputColumn)
  }

  test.map(row => measureValue)
}

defined trait Measure
defined object Mean
defined object Mode
defined object Median
defined function zeroRuleRegressor

We just implemented a flexible Zero Rule Regressor that can work with the mode, median or mean. Let's test it on the mock dataset using each measure, starting with the mode:

In [8]:
zeroRuleRegressor(mockRegressionTrainingSet, mockRegressionTestSet, Mode)

res7_0: Vector[Data] = Vector(
  Numeric(4.46),
  Numeric(4.46),
  Numeric(4.46),
  Numeric(4.46),
  Numeric(4.46),
  Numeric(4.46)
)
res7_1: Vector[Data] = Vector(
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04)
)
res7_2: Vector[Data] = Vector(
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189)
)

Good. It works: 4.460 is, in fact, the most frequent label in the mock training dataset (it appears twice). Let's now test our zero rule regressor using the median:

In [9]:
zeroRuleRegressor(mockRegressionTrainingSet, mockRegressionTestSet, Median)

res8: Vector[Data] = Vector(
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04),
  Numeric(3.04)
)

Since our training set has an even number of rows, the median is the mean of the two central values of the sorted labels (2.74 and 3.34). Let's remove one row from the training set and check that it still works; with an odd number of rows it should return the single central value:

In [10]:
zeroRuleRegressor(mockRegressionTrainingSet.init, mockRegressionTestSet, Median)

res9: Vector[Data] = Vector(
  Numeric(2.74),
  Numeric(2.74),
  Numeric(2.74),
  Numeric(2.74),
  Numeric(2.74),
  Numeric(2.74)
)
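
Both median results can be reproduced by hand. The sketch below is a simplified version that works on plain `Double` values instead of the `Data` type, so `median` and `trainingLabels` are names introduced here purely for illustration.

```scala
// Y column of the mock regression training set.
val trainingLabels = Vector(-2.900, 3.340, 2.640, 9.000, 6.560, -5.380, 2.740, -3.030, 4.460, 4.460)

// Median over plain Doubles: average the two central values when the count is even,
// otherwise return the single central value.
def median(values: Vector[Double]): Double = {
  require(values.nonEmpty, "Cannot compute the median of an empty vector")

  val sorted = values.sorted
  val middle = sorted.length / 2

  if (sorted.length % 2 == 0) (sorted(middle - 1) + sorted(middle)) / 2
  else sorted(middle)
}

// Sorted labels: -5.38, -3.03, -2.9, 2.64, 2.74, 3.34, 4.46, 4.46, 6.56, 9.0
val evenMedian = median(trainingLabels)      // average of 2.74 and 3.34
val oddMedian = median(trainingLabels.init)  // last row dropped; the central value is 2.74
```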

Cool, it works. Now, let's finally test it using the mean:

In [11]:
zeroRuleRegressor(mockRegressionTrainingSet, mockRegressionTestSet, Mean)

res10: Vector[Data] = Vector(
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189),
  Numeric(2.189)
)

Yep. That's actually the mean ;)
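
To use this prediction as an actual baseline, we need to score it with a regression metric. The following sketch computes the root mean squared error (RMSE) of the mean predictor on the mock test set. It works on plain `Double` values, and the names `regressionTrainLabels`, `regressionTestLabels`, `meanPrediction` and `baselineRmse` are assumptions introduced for this example rather than part of `EvaluationMetrics`.

```scala
// Y columns of the mock regression training and test sets.
val regressionTrainLabels = Vector(-2.900, 3.340, 2.640, 9.000, 6.560, -5.380, 2.740, -3.030, 4.460, 4.460)
val regressionTestLabels = Vector(-3.140, 3.009, 1.715, -4.220, 4.180, 9.99)

// Zero Rule prediction: the mean of the training labels (21.89 / 10).
val meanPrediction = regressionTrainLabels.sum / regressionTrainLabels.length

// RMSE obtained by predicting that same constant for every test row.
val squaredErrors = regressionTestLabels.map(y => math.pow(y - meanPrediction, 2))
val baselineRmse = math.sqrt(squaredErrors.sum / regressionTestLabels.length)
```

Any regression model worth keeping should reach a lower error than `baselineRmse` on the same test set.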