# Evaluation Metrics

Evaluation metrics are one of the most important building blocks of machine learning. Why? Well, even though having a computer make predictions on the data we supply is _cool_, not every predictor is useful, or even fit for the problem we are trying to solve with machine learning.

So, how do we tell how __good__ a certain model is? Using math, of course!

Enter __evaluation metrics__!

An evaluation metric, roughly speaking, is just a function that takes the predictions our model generates, compares them with the actual labels in the data, and gives us a number that indicates how well (or badly) our algorithm is doing. There are many evaluation metrics. Some of them are specific to regression and others to classification, and each has properties that make it stand out in specific circumstances. The ones we are going to explore in this notebook are:

- Accuracy.
- Confusion Matrix.
- Mean Absolute Error (MAE).
- Root Mean Squared Error (RMSE).
- Precision.
- Recall.
- F1.
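
As a rough sketch of that idea, every metric here that returns a single score has the same shape: two equally long vectors in, one number out. The `Metric` alias and the `errorRate` function below are illustrative names only; they are not part of the toy-ml code, and plain `Int` labels are assumed to keep the example self-contained.

```scala
// Hypothetical alias: a metric maps (actual, predicted) to a single score.
type Metric[A] = (Vector[A], Vector[A]) => Double

// Illustrative metric: the fraction of predictions that do NOT match the actual label.
val errorRate: Metric[Int] = (actual, predicted) => {
  require(actual.length == predicted.length, "vectors must have the same length")
  actual.zip(predicted).count { case (a, p) => a != p }.toDouble / actual.length
}

// One mismatch out of four predictions.
println(errorRate(Vector(0, 1, 1, 0), Vector(0, 1, 0, 0))) // 0.25
```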
    
Let's get started, shall we?

First, we load the code and libraries we'll need. We will build our solution on top of what we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/algorithm_evaluation.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.AlgorithmEvaluation, AlgorithmEvaluation._

Vector(Vector(Numeric(5.0)), Vector(Numeric(7.0)))
Vector(Vector(Numeric(3.0)), Vector(Numeric(2.0)))
Vector(Vector(Numeric(8.0)), Vector(Numeric(10.0)))
Vector(Vector(Numeric(9.0)), Vector(Numeric(6.0)))
Vector(Vector(Numeric(4.0)), Vector(Numeric(1.0)))



### Data

Throughout this notebook we'll use this data to test our implementations:

| ACTUAL 	|  PREDICTED  |
|:---:	|:----:|
|   0  |  0  |
|   0  |  1 |
|   0  |  0   |
|   0  |  0   |
|   0  |  0   |
|   1  |  1   |
|   1  |  0   |
|   1  |  1   |
|   1  |  1   |
|   1  |  0   |

In [2]:
val actual = Vector(0, 0, 0, 0, 0, 1, 1, 1, 1, 1).map(Numeric)
val predicted = Vector(0, 1, 0, 0, 0, 1, 0, 1, 1, 0).map(Numeric)

assert(actual.length == predicted.length)

actual: Vector[Numeric] = Vector(
  Numeric(0.0),
  Numeric(0.0),
  Numeric(0.0),
  Numeric(0.0),
  Numeric(0.0),
  Numeric(1.0),
  Numeric(1.0),
  Numeric(1.0),
  Numeric(1.0),
  Numeric(1.0)
)
predicted: Vector[Numeric] = Vector(
  Numeric(0.0),
  Numeric(1.0),
  Numeric(0.0),
  Numeric(0.0),
  Numeric(0.0),
  Numeric(1.0),
  Numeric(0.0),
  Numeric(1.0),
  Numeric(1.0),
  Numeric(0.0)
)

## Accuracy

Accuracy is the ratio of the predictions our model got right to the total number of predictions it made. It is one of the simplest and most intuitive evaluation metrics there is.

One of its clear disadvantages is that it only tells one side of the story: the number of correct predictions. What about the predictions the algorithm got wrong? Aren't they important?

It is also very sensitive to imbalanced datasets. In these cases, for instance, a model that always predicts the predominant label will achieve a very high accuracy, but under the hood it isn't predicting anything at all, just throwing a constant back!
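
A minimal sketch of that pitfall follows. The labels are made up, and `matchRatio` is a hypothetical stand-in for accuracy over plain `Int` labels so the snippet runs on its own.

```scala
// Made-up, heavily imbalanced labels: 95 negatives (0) and 5 positives (1).
val imbalancedActual: Vector[Int] = Vector.fill(95)(0) ++ Vector.fill(5)(1)

// A "model" that ignores its input and always answers with the majority label.
val alwaysNegative: Vector[Int] = Vector.fill(imbalancedActual.length)(0)

// Fraction of positions where both vectors agree.
def matchRatio(actual: Vector[Int], predicted: Vector[Int]): Double = {
  require(actual.length == predicted.length, "vectors must have the same length")
  actual.zip(predicted).count { case (a, p) => a == p }.toDouble / actual.length
}

println(matchRatio(imbalancedActual, alwaysNegative)) // 0.95
```

Despite the 95% score, this predictor never finds a single positive example.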

Despite its flaws, accuracy is very useful, in particular when we deal with somewhat balanced datasets and binary classification tasks. 

The formula for accuracy is:

$$ accuracy = \frac{\sum_{i=1}^{N} (if\ prediction_i\ =\ actual_i\ then\ 1\ else\ 0)}{N}$$
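
As a quick worked example, in the mock data above the predictions match the actual labels in 7 of the 10 rows (rows 2, 7 and 10 are the mismatches), so:

$$ accuracy = \frac{7}{10} = 0.7 $$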

Let's implement it.

In [3]:
def accuracy(actual: Vector[Data], predicted: Vector[Data]): Double = {
  // We can only compare vectors of equal length
  assert(actual.length == predicted.length)

  val indices = actual.indices
  val numberOfTotalPredictions = predicted.length

  val numberOfCorrectPredictions = indices.foldLeft(0.0) { (accumulated, index) =>
    accumulated + (if (actual(index) == predicted(index)) 1.0 else 0.0)
  }

  numberOfCorrectPredictions / numberOfTotalPredictions
}

defined function accuracy

Let's now test it on our mock dataset:

In [4]:
println(s"Accuracy is ${accuracy(actual, predicted) * 100}%")

Accuracy is 70.0%


## Confusion Matrix

A confusion matrix (also called an error matrix) is a table that summarizes the performance of the algorithm on each unique class in the labels. For a binary problem it is a 2x2 arrangement. The rows represent predicted classes and the columns the actual classes.

For instance, for a binary problem with only two classes $ \{0, 1\} $, a confusion matrix contains the following information (assuming $ 1 $ is the positive class):

| Predicted / Actual | 1 | 0 |
|:---:|:----:|:--:|
| **1** | True Positives | False Positives |
| **0** | False Negatives | True Negatives |

Where:

  - __True Positive__: The actual class was __1__ and the model predicted __1__.
  - __False Positive__: The actual class was __0__ but the model predicted __1__.
  - __False Negative__: The actual class was __1__ but the model predicted __0__.
  - __True Negative__: The actual class was __0__ and the model predicted __0__.

Let's implement it.

In [5]:
def confusionMatrix(actual: Vector[Data], predicted: Vector[Data], positiveLabel: Data): Map[String, Int] = {
  assert(actual.length == predicted.length)

  actual.indices.foldLeft(Map("TP" -> 0, "FP" -> 0, "FN" -> 0, "TN" -> 0)) { (matrix, index) =>
    val actualLabel = actual(index)
    val predictedLabel = predicted(index)

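    if (actualLabel == positiveLabel) {
      if (actualLabel == predictedLabel) {
        // Actual positive, predicted positive
        matrix + ("TP" -> (matrix("TP") + 1))
      } else {
        // Actual positive, predicted negative: the model missed a positive
        matrix + ("FN" -> (matrix("FN") + 1))
      }
    } else {
      if (actualLabel == predictedLabel) {
        // Actual negative, predicted negative
        matrix + ("TN" -> (matrix("TN") + 1))
      } else {
        // Actual negative, predicted positive (assumes binary labels)
        matrix + ("FP" -> (matrix("FP") + 1))
      }
    }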
  }
}

defined function confusionMatrix

The function we implemented returns a map whose keys are $ \{TP, FP, FN, TN\} $, which correspond to $ \{True\ Positive,\ False\ Positive,\ False\ Negative,\ True\ Negative\} $. The values are just the counts for each category.

We must supply the positive label to resolve the ambiguity about which class counts as positive.
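
To see why the positive label matters, here is a minimal sketch. It uses plain `Int` labels, made-up data, and a hypothetical `counts` helper rather than the notebook's `Data` type. It counts the four categories twice over the same predictions, once for each choice of positive class.

```scala
// Made-up labels, written as plain Ints to keep the sketch self-contained.
val sampleActual = Vector(1, 1, 1, 0, 0)
val samplePredicted = Vector(1, 0, 0, 0, 1)

// Counts (TP, FP, FN, TN) with respect to the chosen positive label.
def counts(actual: Vector[Int], predicted: Vector[Int], positive: Int): (Int, Int, Int, Int) =
  actual.zip(predicted).foldLeft((0, 0, 0, 0)) { case ((tp, fp, fn, tn), (a, p)) =>
    if (a == positive && p == positive) (tp + 1, fp, fn, tn)
    else if (a != positive && p == positive) (tp, fp + 1, fn, tn)
    else if (a == positive && p != positive) (tp, fp, fn + 1, tn)
    else (tp, fp, fn, tn + 1)
  }

println(counts(sampleActual, samplePredicted, positive = 1)) // (1,1,2,1)
println(counts(sampleActual, samplePredicted, positive = 0)) // (1,2,1,1)
```

Switching the positive label swaps true positives with true negatives and false positives with false negatives, so every metric derived from these counts depends on that choice.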

Let's now test it on our mock dataset:

In [6]:
val matrix = confusionMatrix(actual, predicted, Numeric(1.0))

println(s"True positives: ${matrix("TP")}")
println(s"False positives: ${matrix("FP")}")
println(s"False negatives: ${matrix("FN")}")
println(s"True negatives: ${matrix("TN")}")

True positives: 3
False positives: 1
False negatives: 2
True negatives: 4


matrix: Map[String, Int] = Map("TP" -> 3, "FP" -> 1, "FN" -> 2, "TN" -> 4)

## Mean Absolute Error

Mean Absolute Error (MAE) is a metric used in regression problems where our goal shifts from predicting the correct class or label to minimizing the error between the value our model outputs and the actual value. 

Of course, this means this metric only works with numeric data. 

MAE basically averages the absolute errors of our algorithm. Why absolute? So that positive and negative errors don't cancel each other out when we add them up.


The formula for MAE is:

$$ MAE = \frac{\sum_{i=1}^{N} |prediction_i - actual_i|}{N}$$

Let's implement it.

In [7]:
def meanAbsoluteError(actual: Vector[Numeric], predicted: Vector[Numeric]): Double = {
  assert(actual.length == predicted.length)

  val sumOfAbsoluteErrors = actual.indices.foldLeft(0.0) { (accumulated, index) =>
    accumulated + math.abs(actual(index).value - predicted(index).value)
  }

  sumOfAbsoluteErrors / actual.length
}

defined function meanAbsoluteError

Let's now test it on our mock dataset:

In [8]:
print(s"MAE is ${meanAbsoluteError(actual, predicted)}")

MAE is 0.3
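
Because the mock labels are all 0 or 1, every absolute error is either 0 or 1, so the MAE here is simply the fraction of wrong predictions (3 out of 10). To show the metric on an actual regression problem, here is a small sketch with made-up continuous values. The data and the `mae` helper over plain `Double`s are assumptions for illustration, not part of the notebook's code.

```scala
// Made-up regression targets and predictions (plain Doubles for self-containment).
val actualValues = Vector(3.0, 5.0, 2.5, 7.0)
val predictedValues = Vector(2.5, 5.0, 4.0, 8.0)

// Mean of the absolute differences between each pair of values.
def mae(actual: Vector[Double], predicted: Vector[Double]): Double = {
  require(actual.length == predicted.length, "vectors must have the same length")
  actual.zip(predicted).map { case (a, p) => math.abs(a - p) }.sum / actual.length
}

// Absolute errors are 0.5, 0.0, 1.5 and 1.0, so the mean is 3.0 / 4.
println(mae(actualValues, predictedValues)) // 0.75
```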

## Root Mean Squared Error

Root Mean Squared Error (RMSE) is another metric used in regression problems.

It is very similar to MAE and, again, it's only suited to numeric data.

The main advantage of RMSE is that squaring each error makes it non-negative and also penalizes larger errors more heavily than smaller ones. Taking the square root of the mean squared error (MSE) then returns the value to the original units.

The formula for RMSE is:

$$ RMSE = \sqrt{\frac{\sum_{i=1}^{N} (prediction_i - actual_i)^2}{N}}$$

Let's implement it.

In [9]:
def rootMeanSquaredError(actual: Vector[Numeric], predicted: Vector[Numeric]): Double = {
  assert(actual.length == predicted.length)

  val sumOfSquaredErrors = actual.indices.foldLeft(0.0) { (accumulated, index) =>
    accumulated + math.pow(actual(index).value - predicted(index).value, 2)
  }

  math.sqrt(sumOfSquaredErrors / actual.length)
}

defined function rootMeanSquaredError

Let's now test it on our mock dataset:

In [10]:
print(s"RMSE is ${rootMeanSquaredError(actual, predicted)}")

RMSE is 0.5477225575051661
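
The claim that RMSE penalizes larger errors can be sketched with two made-up sets of errors that have the same MAE. The helpers `meanAbs` and `rootMeanSq` are hypothetical and work directly on error values instead of on the notebook's `Numeric` vectors.

```scala
// Both helpers take the per-example errors (prediction - actual) directly.
def meanAbs(errors: Vector[Double]): Double =
  errors.map(e => math.abs(e)).sum / errors.length

def rootMeanSq(errors: Vector[Double]): Double =
  math.sqrt(errors.map(e => e * e).sum / errors.length)

// Same total absolute error (4.0), distributed very differently.
val evenErrors = Vector(1.0, 1.0, 1.0, 1.0)
val oneBigError = Vector(0.0, 0.0, 0.0, 4.0)

println(s"MAE:  ${meanAbs(evenErrors)} vs ${meanAbs(oneBigError)}")       // MAE:  1.0 vs 1.0
println(s"RMSE: ${rootMeanSq(evenErrors)} vs ${rootMeanSq(oneBigError)}") // RMSE: 1.0 vs 2.0
```

MAE cannot tell the two cases apart, while RMSE doubles when the same total error is concentrated in a single prediction.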

## Precision

Precision, also known as __Positive Predictive Value__, is just the ratio of the true positives (the predictions of the positive class when the actual class was also positive) to all the positives predicted by the model.

$$ precision = \frac{True\ Positives}{True\ Positives\ +\ False\ Positives}$$

It is, of course, a metric used in classification tasks.

Let's implement it.

In [11]:
def precision(actual: Vector[Data], predicted: Vector[Data], positiveLabel: Data): Double = {
  assert(actual.length == predicted.length)

  val matrix = confusionMatrix(actual, predicted, positiveLabel)

  matrix("TP").toDouble / (matrix("TP") + matrix("FP")).toDouble
}

defined function precision
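
One edge case the implementation above does not cover: if the model never predicts the positive class, both true positives and false positives are zero, and dividing `0.0` by `0.0` yields `NaN` on the JVM. A possible way to make that case explicit is sketched below. `safeRatio` and `precisionFromCounts` are hypothetical helpers, and returning an `Option` is just one design choice among several.

```scala
// Hypothetical helper: returns None when the denominator is zero instead of NaN.
def safeRatio(numerator: Int, denominator: Int): Option[Double] =
  if (denominator == 0) None else Some(numerator.toDouble / denominator)

// Precision from raw counts; None means "the model made no positive predictions".
def precisionFromCounts(truePositives: Int, falsePositives: Int): Option[Double] =
  safeRatio(truePositives, truePositives + falsePositives)

println(precisionFromCounts(3, 1)) // Some(0.75)
println(precisionFromCounts(0, 0)) // None
```

The same helper would work for recall by passing the false negatives instead of the false positives.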

Let's now test it on our mock dataset:

In [12]:
val ppv = precision(actual, predicted, Numeric(1))
println(s"Algorithm's precision is $ppv. This means that ${ppv * 100}% of its positive predictions are correct.")

Algorithm's precision is 0.75. This means that 75.0% of its positive predictions are correct.


ppv: Double = 0.75

## Recall

Recall is another useful evaluation metric for classification problems.

Recall, also known as __Sensitivity__, is the proportion of actual positive examples that the algorithm identified as such.

The main difference with __precision__ is the denominator: __recall__ measures how many of the examples whose **_actual_** class is positive the model predicted as positive, while __precision__ measures how many of the examples whose **_predicted_** class is positive (correctly or not) are actually positive.

$$ recall = \frac{True\ Positives}{True\ Positives\ +\ False\ Negatives}$$
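
A degenerate model makes the difference between the two metrics concrete. The sketch below uses made-up `Int` labels and a "model" that predicts the positive class for every example. All names are illustrative.

```scala
// Made-up labels: 5 negatives followed by 5 positives.
val labels = Vector(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
// A degenerate "model" that predicts the positive class (1) for every example.
val alwaysPositive = Vector.fill(labels.length)(1)

val pairs = labels.zip(alwaysPositive)
val tp = pairs.count { case (a, p) => a == 1 && p == 1 } // 5
val fp = pairs.count { case (a, p) => a == 0 && p == 1 } // 5
val fn = pairs.count { case (a, p) => a == 1 && p == 0 } // 0

val sampleRecall = tp.toDouble / (tp + fn)    // 1.0: every actual positive was found
val samplePrecision = tp.toDouble / (tp + fp) // 0.5: only half of the positive predictions are right

println(s"recall = $sampleRecall, precision = $samplePrecision")
```

Recall alone would call this model perfect, while precision exposes that half of its positive predictions are wrong.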


Let's implement it.

In [13]:
def recall(actual: Vector[Data], predicted: Vector[Data], positiveLabel: Data): Double = {
  assert(actual.length == predicted.length)

  val matrix = confusionMatrix(actual, predicted, positiveLabel)

  matrix("TP").toDouble / (matrix("TP") + matrix("FN")).toDouble
}

defined function recall

Let's now test it on our mock dataset:

In [14]:
val sensitivity = recall(actual, predicted, Numeric(1))
println(s"Algorithm's recall is $sensitivity. This means that it identifies ${sensitivity * 100}% of the actual positives.")

Algorithm's recall is 0.6. This means that it identifies 60.0% of the actual positives.


sensitivity: Double = 0.6

## F1 Score

F1 score is the harmonic mean of precision and recall. Its main advantage is that it summarizes precision and recall in a single quantity. On the flip side, it doesn't have as intuitive an interpretation as many of the other metrics we've seen in this notebook.

F1 reaches its highest value at 1.0 and its lowest at 0.0, where the former means the model has perfect precision and recall (practically impossible) and the latter means the model never correctly identifies a positive example (sad).

$$ F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$

It is, of course, a metric used in classification tasks.
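
Since F1 is the harmonic mean of precision and recall, it can also be written directly in terms of the confusion matrix counts. Substituting $ precision = \frac{TP}{TP + FP} $ and $ recall = \frac{TP}{TP + FN} $, with $TP$, $FP$ and $FN$ as shorthand for the counts, gives:

$$ F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{2 \cdot \frac{TP^2}{(TP + FP)(TP + FN)}}{\frac{TP \cdot (2\,TP + FP + FN)}{(TP + FP)(TP + FN)}} = \frac{2\,TP}{2\,TP + FP + FN} $$

For the mock data, where $TP = 3$ and $FP + FN = 3$, this gives $ F_1 = \frac{6}{9} \approx 0.667 $.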

Let's implement it.

In [15]:
def f1(actual: Vector[Data], predicted: Vector[Data], positiveLabel: Data): Double = {
  assert(actual.length == predicted.length)

  val precisionValue = precision(actual, predicted, positiveLabel)
  val recallValue = recall(actual, predicted, positiveLabel)

  // Harmonic mean of precision and recall
  2 * (precisionValue * recallValue) / (precisionValue + recallValue)
}

defined function f1

Let's now test it on our mock dataset:

In [16]:
f1(actual, predicted, Numeric(1))

res15: Double = 0.6666666666666665

An $F_1$ score of about 0.667 reflects moderate performance: when the model predicts the positive class it is right 75% of the time (precision), but it only identifies 60% of the actual positives (recall).