# Naïve Bayes

What a fun name for a classifier, don't you think? :)

Naïve Bayes, simply put, refers to a family of algorithms that apply Bayes' theorem with the _naïve_ assumption that all features are independent of each other (hence the name).

Although there are (very) rare situations where all features are independent, most of the time there is at least some degree of correlation between two or more of them. For instance, one could infer that an object that is a fruit, is red, has a somewhat round shape and grows on trees is most likely an apple. Why? Because each feature gives us more information about the others, and that increases our confidence in the output. Nonetheless, Naïve Bayes algorithms make the strong assumption that this correlation never occurs.

It doesn't sound like an accurate algorithm, right? Well, the thing is that in real life, despite this naïvety, it works really well and it also facilitates the computation __a lot__.

Bayes' Theorem gives us a way to calculate the probability of a piece of data belonging to a particular class, given our prior knowledge.

$$ P(class|data)=\frac{P(data|class)*P(class)}{P(data)} $$

It is somewhat confusing, but what this formula says is: "The posterior probability of the class given the data is the likelihood of the data given the class, multiplied by the prior probability of the class, divided by the evidence of the data". So much clearer now! (Fingers crossed)

A nicer way to represent this formula is:

$$ posterior=\frac{prior*likelihood}{evidence} $$
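
To make the formula concrete, here is a minimal sketch in Scala that applies it to two classes. The numbers and the names `priors`, `likelihoods`, `evidence` and `posteriors` are made up for illustration and do not come from the Iris data.

```scala
// Hypothetical numbers, chosen only to illustrate the formula.
// P(class): how common each class is before looking at the data.
val priors = Map("classA" -> 0.2, "classB" -> 0.8)
// P(data|class): how likely the observed data is under each class.
val likelihoods = Map("classA" -> 0.6, "classB" -> 0.1)

// P(data): the evidence, obtained by summing prior * likelihood over all classes.
val evidence = priors.map { case (label, prior) => prior * likelihoods(label) }.sum

// P(class|data) = prior * likelihood / evidence
val posteriors = priors.map { case (label, prior) =>
  label -> prior * likelihoods(label) / evidence
}
```

With these numbers the evidence is about 0.2, so the posteriors come out to roughly 0.6 for `classA` and 0.4 for `classB`, which add up to 1.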

Let's start our implementation by loading the code and libraries we'll need. We will build our solution on top of the functions we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/lperceptron.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.Perceptron, Perceptron._
import scala.util.Random

[32mimport [39m[36m$ivy.$                                      
[39m
[32mimport [39m[36m$file.$                                     , Perceptron._
[39m
[32mimport [39m[36mscala.util.Random[39m

## Data

We'll use the [Iris Flower](http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data) dataset. It involves the prediction of flower species given measurements of iris flowers. 

Let's load the data:

In [2]:
val BASE_DATA_PATH = "../../resources/data"
val irisPath = s"$BASE_DATA_PATH/12/iris.csv"

val rawData = loadCsv(irisPath)
val numberOfRows = rawData.length
val numberOfColumns = rawData.head.length
println(s"Number of rows in dataset: $numberOfRows")
println(s"Number of columns in dataset: $numberOfColumns")

val (data, lookUpTable) = {
    val dataWithNumericColumns = (0 until (numberOfColumns - 1)).toVector.foldLeft(rawData) { (d, i) => textColumnToNumeric(d, i)}
    categoricalColumnToNumeric(dataWithNumericColumns, numberOfColumns - 1)
}

Number of rows in dataset: 150
Number of columns in dataset: 5


[36mBASE_DATA_PATH[39m: [32mString[39m = [32m"../../resources/data"[39m
[36mirisPath[39m: [32mString[39m = [32m"../../resources/data/12/iris.csv"[39m
[36mrawData[39m: [32mVector[39m[[32mVector[39m[[32mData[39m]] = [33mVector[39m(
  [33mVector[39m(Text(5.1), Text(3.5), Text(1.4), Text(0.2), Text(Iris-setosa)),
  [33mVector[39m(Text(4.9), Text(3.0), Text(1.4), Text(0.2), Text(Iris-setosa)),
  [33mVector[39m(Text(4.7), Text(3.2), Text(1.3), Text(0.2), Text(Iris-setosa)),
  [33mVector[39m(Text(4.6), Text(3.1), Text(1.5), Text(0.2), Text(Iris-setosa)),
  [33mVector[39m(Text(5.0), Text(3.6), Text(1.4), Text(0.2), Text(Iris-setosa)),
  [33mVector[39m(Text(5.4), Text(3.9), Text(1.7), Text(0.4), Text(Iris-setosa)),
  [33mVector[39m(Text(4.6), Text(3.4), Text(1.4), Text(0.3), Text(Iris-setosa)),
  [33mVector[39m(Text(5.0), Text(3.4), Text(1.5), Text(0.2), Text(Iris-setosa)),
  [33mVector[39m(Text(4.4), Text(2.9), Text(1.4), Text(0.2), Text(Iris-setosa)),
 

## Separation by Class

In a moment we will need to calculate the probability of data by the class they belong to. In order to do that, we'll first need to separate our training set by class. Fairly easy:

In [3]:
def separateByClass(dataset: Dataset): Map[Data, Vector[Vector[Data]]] = {
  dataset.groupBy(_.last)
}

defined [32mfunction[39m [36mseparateByClass[39m

Let's test it on a mock dataset:

In [4]:
val mockDataset = Vector(
  (3.393533211, 2.331273381, 0),
  (3.110073483, 1.781539638, 0),
  (1.343808831, 3.368360954, 0),
  (3.582294042, 4.67917911, 0),
  (2.280362439, 2.866990263, 0),
  (7.42346942, 4.696522875, 1),
  (5.745051997, 3.533989803, 1),
  (9.172168622, 2.511101045, 1),
  (7.7922783481, 3.424088941, 1),
  (7.939820817, 0.791637231, 1)
) map { case (x1, x2, y) => Vector(Numeric(x1), Numeric(x2), Numeric(y))}

val separated = separateByClass(mockDataset)

[36mmockDataset[39m: [32mVector[39m[[32mVector[39m[[32mNumeric[39m]] = [33mVector[39m(
  [33mVector[39m([33mNumeric[39m([32m3.393533211[39m), [33mNumeric[39m([32m2.331273381[39m), [33mNumeric[39m([32m0.0[39m)),
  [33mVector[39m([33mNumeric[39m([32m3.110073483[39m), [33mNumeric[39m([32m1.781539638[39m), [33mNumeric[39m([32m0.0[39m)),
  [33mVector[39m([33mNumeric[39m([32m1.343808831[39m), [33mNumeric[39m([32m3.368360954[39m), [33mNumeric[39m([32m0.0[39m)),
  [33mVector[39m([33mNumeric[39m([32m3.582294042[39m), [33mNumeric[39m([32m4.67917911[39m), [33mNumeric[39m([32m0.0[39m)),
  [33mVector[39m([33mNumeric[39m([32m2.280362439[39m), [33mNumeric[39m([32m2.866990263[39m), [33mNumeric[39m([32m0.0[39m)),
  [33mVector[39m([33mNumeric[39m([32m7.42346942[39m), [33mNumeric[39m([32m4.696522875[39m), [33mNumeric[39m([32m1.0[39m)),
  [33mVector[39m([33mNumeric[39m([32m5.745051997[39m), [33mNumeric[39

Good. We can see it is working as expected by inspecting the rows that form each group.

## Summarize Dataset

The next step is to obtain two very important statistics for each feature (column) in the dataset:

 - Mean.
 - Standard Deviation.
 
In order to be a bit more efficient, we'll collect each of these statistics along with the row count per feature in one pass:

In [5]:
def summarizeDataset(dataset: Dataset) = {
  val numberOfColumns = dataset.head.length

  val means = getColumnsMeans(dataset)
  val standardDeviations = getColumnsStandardDeviations(dataset, means)
  // Every column has the same number of rows
  val counts = Vector.fill(numberOfColumns)(dataset.length)

  assert(List(means.length, standardDeviations.length, counts.length).forall(_ == numberOfColumns))

  // We ignore the labels column
  (0 until numberOfColumns - 1).toVector.map(i => (means(i).get, standardDeviations(i).get, counts(i)))
}

defined [32mfunction[39m [36msummarizeDataset[39m

Let's test it:

In [None]:
summarizeDataset(mockDataset)

The first triplet corresponds to the statistics of X1 and the second to the statistics of X2. In a table:

|                    | X1                | X2                |
|--------------------|-------------------|-------------------|
| __Mean__               | 5.178286121009999 | 2.9984683241      |
| __Standard Deviation__ | 2.766534398702562 | 1.218556343617447 |
| __Count__              | 10                | 10                |
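
As a cross-check on the table, the sketch below recomputes the X1 column with plain `Double`s. `getColumnsMeans` and `getColumnsStandardDeviations` come from the previous notebook and are not shown here, so `meanOf` and `sampleStandardDeviationOf` are hypothetical stand-ins. The sketch assumes the sample standard deviation (dividing by n - 1), which matches the X1 figure in the table.

```scala
// Hypothetical stand-ins for the helpers from the previous notebook.
def meanOf(values: Vector[Double]): Double = values.sum / values.length

// Sample standard deviation: the sum of squared deviations is divided by n - 1.
def sampleStandardDeviationOf(values: Vector[Double]): Double = {
  val mu = meanOf(values)
  val sumOfSquares = values.map(v => math.pow(v - mu, 2)).sum
  math.sqrt(sumOfSquares / (values.length - 1))
}

// X1 column of the mock dataset.
val x1 = Vector(3.393533211, 3.110073483, 1.343808831, 3.582294042, 2.280362439,
  7.42346942, 5.745051997, 9.172168622, 7.7922783481, 7.939820817)

val x1Mean = meanOf(x1)
val x1StandardDeviation = sampleStandardDeviationOf(x1)
```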

## Summarize Data by Class

With these two functions implemented we have all we need to summarize each subset of data corresponding to each class. We'll keep each summary in a map where the key is the class label and the value is a list of triplets (mean, standard deviation and count), one per feature.

In [None]:
def summarizeByClass(dataset: Dataset) = {
  val separated = separateByClass(dataset)

  separated.mapValues(summarizeDataset)
}

Let's test it:

In [None]:
val summaries = summarizeByClass(mockDataset)

## Gaussian Probability Density Function

Calculating the probability of observing a given value is quite hard. One of the tricks used to do it is to _assume_ that this value is drawn from a distribution (in this case, a Gaussian distribution).

The good thing about Gaussian distributions is that they can be summarized with only two numbers: the mean and the standard deviation. With the mighty power of math we can _estimate_ the probability (strictly speaking, the probability density) of some value _X_:

$$ P(X)=\frac{1}{\sqrt{2*\pi}*\sigma}e^{-\frac{(X - \mu)^2}{2*\sigma^2}}$$

Let's create a function that does the heavy lifting for us:

In [None]:
def calculateProbability(x: Double, mean: Double, standardDeviation: Double) = {
  val exponent = math.exp(-(math.pow(x - mean, 2) / (2 * standardDeviation * standardDeviation)))
  (1.0 / (math.sqrt(2 * math.Pi) * standardDeviation)) * exponent
}

calculateProbability(1.0, 1.0, 1.0)
calculateProbability(2.0, 1.0, 1.0)
calculateProbability(0.0, 1.0, 1.0)
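
As a sanity check, the three calls above can be worked out by hand with $\mu = 1$ and $\sigma = 1$:

$$ P(1)=\frac{1}{\sqrt{2*\pi}}e^{0}\approx 0.3989 $$

$$ P(2)=P(0)=\frac{1}{\sqrt{2*\pi}}e^{-\frac{1}{2}}\approx 0.2420 $$

The density peaks at the mean and is symmetric around it, which is why the values one standard deviation above and below the mean give the same result.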

## Class Probabilities

Now we need to compute the class probabilities for new data given the statistics summary per feature. These are calculated separately for each class in the dataset.

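The probability that some row of data belongs to a class is proportional to the likelihood of the data given that class times the prior of the class:

$$ P(class|data) \propto P(data|class)*P(class)$$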

This is not exactly Bayes' Theorem. What does dropping the denominator cause? Well, $P(data)$ is a normalization factor that makes the values for all classes add up to 1, so that they can be read as actual probabilities. Because it is the same for every class, dropping it doesn't change which class gets the highest value. Given that we are more interested in the classification itself than in the probability, we can save ourselves some computation and still pick the same class for each data point.
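
Written out, the naïve independence assumption is what lets $P(data|class)$ be computed one feature at a time. For a row with feature values $x_1, \dots, x_n$, the value computed for each class is:

$$ P(class)*\prod_{i=1}^{n}P(x_i|class) $$

Here each $P(x_i|class)$ is estimated with the Gaussian density from the previous section, using the mean and standard deviation of feature $i$ within that class.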

In [None]:
def calculateClassProbabilities(summaries: Map[Data, Vector[(Double, Double, Int)]], row: Vector[Data]) = {
  // Every feature summary of a class carries the same row count, so the first triplet is enough
  val totalRows = summaries.values.map(_.head._3).sum

  summaries.mapValues { classSummaries =>
    // Prior: fraction of the training rows that belong to this class
    val prior = classSummaries.head._3 / totalRows.toDouble

    // Multiply the prior by the Gaussian density of each feature value
    classSummaries.indices.foldLeft(prior) { (classProbability, i) =>
      val (mean, standardDeviation, _) = classSummaries(i)
      classProbability * calculateProbability(getNumericValue(row(i)).get, mean, standardDeviation)
    }
  }
}

In [None]:
val probabilities = calculateClassProbabilities(summaries, mockDataset.head)

We see that the probability of the first row belonging to the first class (0) is considerably higher than the probability that it belongs to the second class (1).
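
The values returned by `calculateClassProbabilities` are unnormalized scores. If actual probabilities are needed, one option is to divide each score by the sum of all scores, which plays the role of $P(data)$. The sketch below assumes a plain `Map[String, Double]` of made-up scores; `normalizeScores` is a hypothetical helper, not part of the notebook's code.

```scala
// Hypothetical helper: rescales unnormalized class scores so they sum to 1.
def normalizeScores(scores: Map[String, Double]): Map[String, Double] = {
  // The sum over all classes plays the role of the evidence P(data).
  val total = scores.values.sum
  scores.map { case (label, score) => label -> score / total }
}

// Made-up scores for two classes.
val exampleScores = Map("0" -> 0.03, "1" -> 0.01)
val normalizedScores = normalizeScores(exampleScores)
```

The ranking of the classes is the same before and after normalizing; only the scale changes.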

## Naïve Bayes

We can now make predictions using all the functions we've implemented so far. Let's see:

In [None]:
def predictNB(summaries: Map[Data, Vector[(Double, Double, Int)]], row: Vector[Data]): Data = {
  val probabilities = calculateClassProbabilities(summaries, row)

  val (Some(bestLabel), _) = probabilities.foldLeft((None: Option[Data], -1.0)) { (bestLabelAndProb, entry) =>
      entry match {
        case (label, classProbability) =>
          val (bestLabel, bestProbability) = bestLabelAndProb

          if (bestLabel.isEmpty || classProbability > bestProbability) {
            (Some(label), classProbability)
          } else {
            bestLabelAndProb
          }
      }
  }

  bestLabel
}
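
The fold in `predictNB` is an arg-max over the class scores. For comparison, the same idea can be sketched with the standard library's `maxBy`. `mostLikelyLabel` is a hypothetical name, the scores are made up, and the sketch assumes the map is non-empty.

```scala
// Hypothetical helper: returns the label with the highest score.
// Assumes `scores` is non-empty; maxBy throws on an empty collection.
def mostLikelyLabel[A](scores: Map[A, Double]): A =
  scores.maxBy { case (_, score) => score }._1

// Made-up scores for three classes.
val bestExampleLabel = mostLikelyLabel(
  Map("setosa" -> 0.7, "versicolor" -> 0.2, "virginica" -> 0.1))
```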


In [None]:
def naiveBayes(train: Dataset, test: Dataset) = {
  val summaries = summarizeByClass(train)

  test.map { row =>
    predictNB(summaries, row)
  }
}

Good.

Let's now test our new algorithm on the Iris dataset.

We'll start by running a baseline model on it, then our freshly implemented Gaussian Naïve Bayes algorithm, and finally we'll compare their performance.

As a baseline we will use a __zero rule classifier__.
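
`zeroRuleClassifier` comes from an earlier notebook and is not shown here. As a rough sketch of the idea, a zero rule classifier ignores the features and always predicts the most frequent label in the training set. The version below works on plain label vectors; `zeroRule` and its signature are assumptions made for illustration.

```scala
// Hypothetical sketch: predicts the most frequent training label for every test row.
// Assumes `trainLabels` is non-empty.
def zeroRule[A](trainLabels: Vector[A], testSize: Int): Vector[A] = {
  val (mostFrequent, _) = trainLabels
    .groupBy(identity)
    .map { case (label, occurrences) => label -> occurrences.length }
    .maxBy { case (_, count) => count }
  Vector.fill(testSize)(mostFrequent)
}

// "b" appears twice in the training labels, so it is predicted for all three test rows.
val zeroRulePredictions = zeroRule(Vector("a", "b", "b", "c"), testSize = 3)
```

Any classifier worth keeping should beat this baseline, since it uses no information from the features at all.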

In [None]:
// Normalize data
val minMax = getDatasetMinAndMax(data)
val normalizedData = normalizeDataset(data, minMax)

val baselineAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        normalizedData, 
        (train, test, parameters) => zeroRuleClassifier(train, test), 
        Map.empty, 
        accuracy, 
        trainProportion=0.8)

println(s"Zero Rule Algorithm accuracy: $baselineAccuracy")

In [None]:
val naiveBayesAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
    normalizedData,
    (train, test, parameters) => naiveBayes(train, test),
    Map.empty,
    accuracy,
    trainProportion=0.8)

println(s"Naive Bayes accuracy: $naiveBayesAccuracy")

Wow! It's evident that the 93.334% accuracy achieved by our Gaussian Naïve Bayes is dramatically better than the one achieved by the baseline model (23.334%).