# _k_-Nearest Neighbors

_k_-Nearest Neighbors is arguably one of the most intuitive machine learning algorithms to grasp.

Unlike **Linear regression** or **Perceptron**, it doesn't use the data to fit a hidden model or equation. In fact, training _k_-Nearest Neighbors involves no calculation at all: it simply stores the training set as a database. At prediction time, it compares each new instance with the stored ones and determines a class (classification) or value (regression) for it, based on some measure of similarity (usually Euclidean distance).

Of course, this very trait works against _k_-Nearest Neighbors at prediction time: it must compare the new instance with every instance stored in its database in order to select the _k_ most similar ones. Hence, the larger the training set, the slower the predictions will be.

Once the _k_ most similar neighbors have been found, one of the following actions is usually taken, depending on the case:

 - __Classification__: Label the new instance with the most common class among the neighbors. There are variations of this in which the class is selected by a weighted vote where, for instance, nearer neighbors contribute more to the outcome.
 - __Regression__: Calculate some measure of central tendency (mean, mode, median) from the neighbors' values. 
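
To make the weighted variant concrete, here is a minimal sketch of a distance-weighted vote. It assumes each neighbor is given as a `(label, distance)` pair and weights every vote by the inverse of its distance; `weightedVote` and `epsilon` are illustrative names, not part of the code used later in this notebook.

```scala
// Hypothetical helper: each neighbor is a (label, distance) pair.
// Every neighbor votes for its label with weight 1 / (distance + epsilon),
// so closer neighbors contribute more. epsilon avoids division by zero.
def weightedVote(neighbors: Seq[(Int, Double)], epsilon: Double = 1e-9): Int = {
  require(neighbors.nonEmpty, "at least one neighbor is needed")

  // Sum the weights of all votes cast for each label.
  val weightPerLabel = neighbors
    .groupBy { case (label, _) => label }
    .map { case (label, votes) =>
      label -> votes.map { case (_, distance) => 1.0 / (distance + epsilon) }.sum
    }

  // The label with the largest accumulated weight wins.
  weightPerLabel.maxBy { case (_, weight) => weight }._1
}

// One very close neighbor of class 0 against two distant neighbors of class 1.
val winner = weightedVote(Seq((0, 0.1), (1, 2.0), (1, 3.0)))
```

In this example a plain majority vote would pick class 1, while the weighted vote favors the single, much closer neighbor of class 0.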
 
Let's start our implementation by loading the code and libraries we'll need. We will build our solution on top of the ones we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/naive_bayes.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.NaiveBayes, NaiveBayes._
import scala.util.Random

Compiling NaiveBayes.sc


import $ivy.$
import $file.$, NaiveBayes._
import scala.util.Random

## Data

We'll use the [Abalone](http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data) dataset. It involves predicting the age of abalones from objective measurements of each individual. Although it is usually framed as a multiclass classification problem, we'll also treat it as a regression problem.

Let's load the data:

In [2]:
val BASE_DATA_PATH = "../../resources/data"
val abalonePath = s"$BASE_DATA_PATH/13/abalone.data.csv"

val rawData = loadCsv(abalonePath)
val numberOfRows = rawData.length
val numberOfColumns = rawData.head.length
println(s"Number of rows in dataset: $numberOfRows")
println(s"Number of columns in dataset: $numberOfColumns")

val (data, lookUpTable) = {
    val dataWithNumericColumns = (1 until numberOfColumns).toVector.foldLeft(rawData) { (d, i) => textColumnToNumeric(d, i)}
    categoricalColumnToNumeric(dataWithNumericColumns, 0)
}

Number of rows in dataset: 4177
Number of columns in dataset: 9


BASE_DATA_PATH: String = "../../resources/data"
abalonePath: String = "../../resources/data/13/abalone.data.csv"
rawData: Vector[Vector[Data]] = Vector(
  Vector(
    Text(M),
    Text(0.455),
    Text(0.365),
    Text(0.095),
    Text(0.514),
    Text(0.2245),
    Text(0.101),
    Text(0.15),
    Text(15)
  ),
...
numberOfRows: Int = 4177
numberOfColumns: Int = 9
data: Vector[Vector[Data]] = Vector(
  Vector(
    Numeric(0.0),
    Numeric(0.455),
    Numeric(0.365),
    Numeric(0.095),
    Numeric(0.514),
    Numeric(0.2245),
    Numeric(0.101),
    Numeric(0.15),
    Numeric(15.0)
  ),
...
lookUpTable: Map[Data, Int] = Map(Text(M) -> 0, Text(F) -> 1, Text(I) -

## Euclidean Distance

In this notebook we'll use Euclidean distance as a similarity measure between two rows or vectors. Here's the equation:

$$ distance(X,Y) = \sqrt{\sum_{i=1}^n{(X_i - Y_i)^2}}$$

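Let's implement a function to calculate this measure. Since the last element of each row holds the target value, it is left out of the sum:

In [3]:
def euclideanDistance(firstRow: Vector[Numeric], secondRow: Vector[Numeric]): Double = {
  assert(firstRow.length == secondRow.length)

  math.sqrt {
    // Skip the last index, which holds the target rather than a feature.
    val featureIndices = firstRow.indices.init

    // Accumulate the squared difference of each pair of features.
    featureIndices.foldLeft(0.0) { (accum, i) =>
      accum + math.pow(firstRow(i).value - secondRow(i).value, 2)
    }
  }
}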

defined function euclideanDistance

Good. Let's test it with a mock dataset:



In [4]:
val mockDataset = Vector(
  (2.7810836, 2.550537003, 0),
  (1.465489372, 2.362125076, 0),
  (3.396561688, 4.400293529, 0),
  (1.38807019, 1.850220317, 0),
  (3.06407232, 3.005305973, 0),
  (7.627531214, 2.759262235, 1),
  (5.332441248, 2.088626775, 1),
  (6.922596716, 1.77106367, 1),
  (8.675418651, -0.242068655, 1),
  (7.673756466, 3.508563011, 1)
) map { case (x1, x2, y) => Vector(Numeric(x1), Numeric(x2), Numeric(y))}

val testRow = mockDataset.head

mockDataset.foreach { r => 
    println(euclideanDistance(testRow, r))
}

0.0
1.3290173915275787
1.9494646655653247
1.5591439385540549
0.5356280721938492
4.850940186986411
2.592833759950511
4.214227042632867
6.522409988228337
4.985585382449795


mockDataset: Vector[Vector[Numeric]] = Vector(
  Vector(Numeric(2.7810836), Numeric(2.550537003), Numeric(0.0)),
  Vector(Numeric(1.465489372), Numeric(2.362125076), Numeric(0.0)),
  Vector(Numeric(3.396561688), Numeric(4.400293529), Numeric(0.0)),
  Vector(Numeric(1.38807019), Numeric(1.850220317), Numeric(0.0)),
  Vector(Numeric(3.06407232), Numeric(3.005305973), Numeric(0.0)),
  Vector(Numeric(7.627531214), Numeric(2.759262235), Numeric(1.0)),
  Vector(Numeric(5.332441248), Numeric(

Good, it works as expected.
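
As a quick sanity check on the second distance printed above, here is the arithmetic worked by hand for the first two rows of the mock dataset. Intermediate values are rounded, so the last digits are approximate:

$$ \sqrt{(2.7810836 - 1.465489372)^2 + (2.550537003 - 2.362125076)^2} \approx \sqrt{1.730788 + 0.035499} \approx 1.329017 $$

Only the two feature columns enter the sum; the third column (the class) is ignored.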

## Get Neighbors

Now that we have a way to calculate the distance between rows, the next step is to pick the _k_ nearest neighbors.

In [5]:
def getNeighbors(train: Dataset, testRow: Vector[Numeric], numberOfNeighbors: Int) = {
  // Pair every training row with its distance to the test row.
  val neighborsAndDistances = for {
    row <- train
    numericRow = row.asInstanceOf[Vector[Numeric]]
  } yield {
    val distance = euclideanDistance(numericRow, testRow)
    (numericRow, distance)
  }

  // Sort by ascending distance and keep only the closest rows.
  neighborsAndDistances.sortBy(_._2).take(numberOfNeighbors).map(_._1)
}

defined function getNeighbors
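
Sorting every distance is the simplest way to find the closest rows, but it does more work than needed when _k_ is small. As a rough sketch of one alternative, the code below keeps only the _k_ best candidates in a bounded max-heap while scanning the training set once. It is self-contained and uses plain `Vector[Double]` rows (last element is the target) instead of the notebook's `Numeric` type; `distance` and `nearestWithHeap` are illustrative names.

```scala
import scala.collection.mutable

// Euclidean distance over the features only (the last element is the target).
def distance(a: Vector[Double], b: Vector[Double]): Double =
  math.sqrt(a.init.zip(b.init).map { case (x, y) => (x - y) * (x - y) }.sum)

def nearestWithHeap(
    train: Vector[Vector[Double]],
    query: Vector[Double],
    k: Int
): Vector[Vector[Double]] = {
  require(k > 0, "k must be positive")

  type Scored = (Double, Vector[Double])
  // Max-heap on distance: the head is the farthest of the current candidates.
  val byDistance: Ordering[Scored] = Ordering.by[Scored, Double](_._1)
  val heap = mutable.PriorityQueue.empty[Scored](byDistance)

  train.foreach { row =>
    val d = distance(row, query)
    if (heap.size < k) heap.enqueue((d, row))
    else if (d < heap.head._1) {
      // The new row is closer than the current farthest candidate: swap them.
      heap.dequeue()
      heap.enqueue((d, row))
    }
  }

  // Only k elements remain, so this final sort is cheap.
  heap.toVector.sortBy(_._1).map(_._2)
}

val toyTrain = Vector(
  Vector(0.0, 0.0, 0.0),
  Vector(1.0, 1.0, 0.0),
  Vector(5.0, 5.0, 1.0),
  Vector(0.5, 0.5, 1.0)
)
val closestTwo = nearestWithHeap(toyTrain, Vector(0.0, 0.0, 0.0), 2)
```

Every training row is still visited once, so this does not remove the linear scan; it only avoids sorting the full list of distances.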

Let's get the 3 nearest neighbors of our test row. Because the test row is itself part of the mock dataset, it shows up as its own nearest neighbor:

In [6]:
val neighbors = getNeighbors(mockDataset, testRow, 3)

neighbors: Vector[Vector[Numeric]] = Vector(
  Vector(Numeric(2.7810836), Numeric(2.550537003), Numeric(0.0)),
  Vector(Numeric(3.06407232), Numeric(3.005305973), Numeric(0.0)),
  Vector(Numeric(1.465489372), Numeric(2.362125076), Numeric(0.0))
)

Excellent! We're all set to make predictions!

## Make Predictions

This time we'll test our algorithm on both classification and regression problems, so we need a prediction function for each case:


In [7]:
def predictClassification(train: Dataset, testRow: Vector[Numeric], numberOfNeighbors: Int) = {
  val neighbors = getNeighbors(train, testRow, numberOfNeighbors)
  val outputValues = neighbors.map(_.last)

  // Majority vote: pick the class that appears most often among the neighbors.
  outputValues.maxBy(o => outputValues.count(_ == o))
}

def predictRegression(train: Dataset, testRow: Vector[Numeric], numberOfNeighbors: Int) = {
  val neighbors = getNeighbors(train, testRow, numberOfNeighbors)
  val outputValues = neighbors.map(_.last)

  // Mean of the neighbors' target values.
  Numeric {
    outputValues.foldLeft(0.0) { (total, numeric) => total + numeric.value } / outputValues.length
  }
}

defined function predictClassification
defined function predictRegression

As we can see, for classification we are implementing a simple majority voting algorithm, while for regression we selected the _mean_ as the measure of central tendency.

In [8]:
type Predictor = (Dataset, Vector[Numeric], Int) => Numeric
def kNearestNeighbors(train: Dataset, test: Dataset, parameters: Parameters) = {
  val numberOfNeighbors = parameters("numberOfNeighbors").asInstanceOf[Int]
  val predictor = parameters("predictor").asInstanceOf[Predictor]
  
  test.map { row =>
    predictor(train, row.asInstanceOf[Vector[Numeric]], numberOfNeighbors)
  }
}

defined type Predictor
defined function kNearestNeighbors

Good.

Let's now test our new algorithm on the Abalone dataset.

We'll first run a baseline model on it, then our freshly implemented _k_-Nearest Neighbors algorithm, and finally compare their performance.

As a baseline for classification we will use a __zero rule classifier__, and for regression a __zero rule regressor__.
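
As a reminder, a zero rule model ignores the input features entirely: the classifier always predicts the most frequent class seen in training, and the regressor always predicts the mean of the training targets. The sketch below shows the idea under simplified assumptions (targets are plain `Double`s, and `zeroRuleClass` and `zeroRuleMean` are illustrative names). The actual `zeroRuleClassifier` and `zeroRuleRegressor` come from the imported script and may differ in their details.

```scala
// Hypothetical stand-ins for the imported baselines.
// Both receive only the training targets and the number of predictions to make.

// Always predict the most frequent class in the training targets.
def zeroRuleClass(trainTargets: Vector[Double], testSize: Int): Vector[Double] = {
  require(trainTargets.nonEmpty, "training targets must not be empty")
  val mostCommon = trainTargets
    .groupBy(identity)
    .maxBy { case (_, occurrences) => occurrences.size }
    ._1
  Vector.fill(testSize)(mostCommon)
}

// Always predict the mean of the training targets.
def zeroRuleMean(trainTargets: Vector[Double], testSize: Int): Vector[Double] = {
  require(trainTargets.nonEmpty, "training targets must not be empty")
  Vector.fill(testSize)(trainTargets.sum / trainTargets.size)
}

val classBaseline = zeroRuleClass(Vector(1.0, 2.0, 2.0, 3.0), 2)
val meanBaseline = zeroRuleMean(Vector(1.0, 2.0, 3.0), 2)
```

Any model worth keeping should beat these baselines, which is why they make a useful point of comparison.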

In [9]:
// Normalize data
val minMax = getDatasetMinAndMax(data)
val normalizedData = normalizeDataset(data, minMax)

val baselineAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        normalizedData, 
        (train, test, parameters) => zeroRuleClassifier(train, test), 
        Map.empty, 
        accuracy, 
        trainProportion=0.8)

println(s"Zero Rule Algorithm accuracy: $baselineAccuracy")

Zero Rule Algorithm accuracy: 0.14952153110047847


minMax: MinMaxData = Vector(
  Some((0.0, 2.0)),
  Some((0.075, 0.815)),
  Some((0.055, 0.65)),
  Some((0.0, 1.13)),
  Some((0.002, 2.8255)),
  Some((0.001, 1.488)),
  Some((5.0E-4, 0.76)),
  Some((0.0015, 1.005)),
  Some((1.0, 29.0))
)
normalizedData: Dataset = Vector(
  Vector(
    Numeric(0.0),
    Numeric(0.5135135135135135),
    Numeric(0.5210084033613446),
    Numeric(0.084070796460177),
    Numeric(0.18133522224189835),
    Numeric(0.15030262273032952),
    Numeric(0.13232389730085584),
    Numeric(0.14798206278026907),
    Numeric(0.5)
  ),
...
baselineAccuracy: Double = 0.14952153110047847
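
The normalized values above are consistent with min-max scaling, which maps every column to the range [0, 1]. Assuming that is what `normalizeDataset` does (its implementation lives in the imported script), the second column and the target of the first row work out as follows:

$$ x' = \frac{x - \min}{\max - \min}, \qquad \frac{0.455 - 0.075}{0.815 - 0.075} = \frac{0.38}{0.74} \approx 0.5135, \qquad \frac{15 - 1}{29 - 1} = 0.5 $$

Note that the target column is scaled too, so the RMSE values reported below are expressed on this normalized scale rather than in the original units.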

In [10]:
val kNearestNeighborsAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
    normalizedData,
    kNearestNeighbors,
    Map("numberOfNeighbors" -> 5, "predictor" -> predictClassification _),
    accuracy,
    trainProportion=0.8)

println(s"k-Nearest Neighbors accuracy: $kNearestNeighborsAccuracy")

k-Nearest Neighbors accuracy: 0.20574162679425836


kNearestNeighborsAccuracy: Double = 0.20574162679425836

In [11]:
val baselineRmse = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        normalizedData, 
        (train, test, parameters) => zeroRuleRegressor(train, test), 
        Map.empty, 
        rootMeanSquaredError, 
        trainProportion=0.8)

println(s"Zero Rule Algorithm RMSE: $baselineRmse")

Zero Rule Algorithm RMSE: 0.11533507708591567


baselineRmse: Double = 0.11533507708591567

In [12]:
val kNearestNeighborsRmse = evaluateAlgorithmUsingTrainTestSplit[Numeric](
    normalizedData,
    kNearestNeighbors,
    Map("numberOfNeighbors" -> 5, "predictor" -> predictRegression _),
    rootMeanSquaredError,
    trainProportion=0.8)

println(s"k-Nearest Neighbors RMSE: $kNearestNeighborsRmse")

k-Nearest Neighbors RMSE: 0.08299764762202569


kNearestNeighborsRmse: Double = 0.08299764762202569

As we can see, _k_-Nearest Neighbors beats the baseline in both settings (an accuracy of roughly 0.21 versus 0.15, and an RMSE of roughly 0.083 versus 0.115 on the normalized target), yet neither model performs as well as we might hope, due to the intricacy of the dataset. Although at first glance the target is numeric, in reality it is ordinal, given that what's being predicted is the age of the abalones from several input measures, and none of our models seems to capture that nuance. Nonetheless, this case study serves our purpose of demonstrating the power and usefulness of the most popular instance-based machine learning algorithm: _k_-Nearest Neighbors.

It is very likely that by tweaking the number of neighbors and experimenting with some data preprocessing techniques you could bump up the performance. Want to try? :)
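
As a starting point for that experiment, here is a small, self-contained sketch that compares several values of _k_ on the mock dataset from earlier. It swaps the train/test split used in this notebook for leave-one-out evaluation, simply because the toy data is tiny. All names (`points`, `classify`, `leaveOneOutAccuracy`) are illustrative, and the helpers are simplified stand-ins for the notebook's own functions.

```scala
// Toy data taken from the mock dataset: (features, label).
val points: Vector[(Vector[Double], Int)] = Vector(
  (Vector(2.7810836, 2.550537003), 0),
  (Vector(1.465489372, 2.362125076), 0),
  (Vector(3.396561688, 4.400293529), 0),
  (Vector(1.38807019, 1.850220317), 0),
  (Vector(3.06407232, 3.005305973), 0),
  (Vector(7.627531214, 2.759262235), 1),
  (Vector(5.332441248, 2.088626775), 1),
  (Vector(6.922596716, 1.77106367), 1),
  (Vector(8.675418651, -0.242068655), 1),
  (Vector(7.673756466, 3.508563011), 1)
)

def distance(a: Vector[Double], b: Vector[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Majority vote among the k nearest training points.
def classify(train: Vector[(Vector[Double], Int)], query: Vector[Double], k: Int): Int =
  train
    .sortBy { case (features, _) => distance(features, query) }
    .take(k)
    .groupBy { case (_, label) => label }
    .maxBy { case (_, votes) => votes.size }
    ._1

// Leave-one-out: predict each point from all the others, then average the hits.
def leaveOneOutAccuracy(data: Vector[(Vector[Double], Int)], k: Int): Double = {
  val hits = data.indices.count { i =>
    val (features, label) = data(i)
    val rest = data.patch(i, Nil, 1) // every point except the i-th
    classify(rest, features, k) == label
  }
  hits.toDouble / data.size
}

// Odd values of k avoid ties in this two-class problem.
val accuracyByK = Vector(1, 3, 5).map(k => k -> leaveOneOutAccuracy(points, k))
accuracyByK.foreach { case (k, acc) => println(s"k = $k, accuracy = $acc") }
```

On the Abalone data, the same loop could instead call `evaluateAlgorithmUsingTrainTestSplit` with a different `"numberOfNeighbors"` entry in the parameters map on each iteration.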