# Logistic Regression

Logistic regression is arguably one of the simplest algorithms used for binary classification.

Although it has _regression_ in its name, it is really a classification machine learning algorithm (weird...). At its core it uses the __logistic function__. As with __linear regression__, a logistic regression model is represented by an equation: the input variables are combined linearly using coefficients, and the result is passed through the logistic function to produce an output. That output is a real number between 0 and 1, which is thresholded to turn it into either __0__ or __1__. Usually, the threshold used is 0.5.

$$ \hat{y} = \frac{1}{1 + \mathrm{e}^{-(b_0+b_1*x_1+...+b_n*x_n)}}$$

In this equation, $ \mathrm{e} $ is Euler's number, the base of the natural logarithm. $ \hat{y} $ is the prediction, $ b_0 $ is the bias term or intercept, and $ b_1, ..., b_n$ are the coefficients for the variables $ x_1, ..., x_n$, respectively.
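As a minimal sketch of the equation above and of the 0.5 threshold, the snippet below works on plain `Double` values. The names `sigmoid`, `predictProbability`, and `classify` are illustrative only and are not part of the code loaded later in this notebook.

```scala
// Logistic (sigmoid) function: maps any real number into the interval (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Hypothetical helper: computes y-hat for a single observation.
// `bias` plays the role of b_0 and `weights` the role of b_1, ..., b_n.
def predictProbability(bias: Double, weights: Vector[Double], xs: Vector[Double]): Double = {
  require(weights.length == xs.length, "one weight per input variable")
  val z = bias + weights.zip(xs).map { case (b, x) => b * x }.sum
  sigmoid(z)
}

// Thresholding turns the real-valued output into a class label (0 or 1).
def classify(probability: Double, threshold: Double = 0.5): Int =
  if (probability >= threshold) 1 else 0

// With these inputs the linear combination is 0, so y-hat sits exactly on the threshold.
val yHat = predictProbability(bias = 0.0, weights = Vector(1.0, -1.0), xs = Vector(2.0, 2.0))
println(s"yHat=$yHat, label=${classify(yHat)}") // prints "yHat=0.5, label=1"
```

Treating a value exactly at the threshold as class 1 is a choice made for this sketch; it matches what rounding the output with `math.round` would do.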

How do we find the values of these coefficients? Using __gradient descent__, of course! For a deeper explanation, please refer to this [notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/multivariate_linear_regression.ipynb).

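Gradient descent updates each coefficient with the following rule, where $ \alpha $ is the learning rate, $ y $ is the expected output, and the input for the bias term $ b_0 $ is taken to be $ x_0 = 1 $: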

$$ b_i = b_i + \alpha*(y - \hat{y})*\hat{y}*(1-\hat{y})*x_i $$
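One way to see where the factor $ \hat{y}*(1-\hat{y}) $ comes from is to read the rule as gradient descent on the squared error of a single example. This is an interpretation, since the loss is not stated explicitly here. Writing $ z = b_0 + b_1*x_1 + ... + b_n*x_n $ and $ E = \frac{1}{2}(y - \hat{y})^2 $:

$$ \frac{\partial \hat{y}}{\partial z} = \hat{y}*(1-\hat{y}), \qquad \frac{\partial z}{\partial b_i} = x_i $$

$$ \frac{\partial E}{\partial b_i} = -(y - \hat{y})*\hat{y}*(1-\hat{y})*x_i $$

Stepping against the gradient, $ b_i = b_i - \alpha*\frac{\partial E}{\partial b_i} $, gives exactly the update rule above.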

Let's start our implementation by loading the code and libraries we'll need. We will build our solution on top of the code we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/multivariate_linear_regression.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.MultivariateLinearRegression, MultivariateLinearRegression._
import scala.util.Random


## Data

We'll use the [Pima Indians Diabetes](https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes) dataset. Let's load it:

In [2]:
val BASE_DATA_PATH = "../../resources/data"
val pimaIndiansPath = s"$BASE_DATA_PATH/9/pima-indians-diabetes.csv"

val rawData = loadCsv(pimaIndiansPath)
val numberOfRows = rawData.length
val numberOfColumns = rawData.head.length
println(s"Number of rows in dataset: $numberOfRows")
println(s"Number of columns in dataset: $numberOfColumns")

// Convert every column from text to numeric values, one column at a time.
val data = (0 until numberOfColumns).toVector.foldLeft(rawData) { (d, i) => textColumnToNumeric(d, i)}

Number of rows in dataset: 768
Number of columns in dataset: 9


BASE_DATA_PATH: String = "../../resources/data"
pimaIndiansPath: String = "../../resources/data/9/pima-indians-diabetes.csv"
rawData: Vector[Vector[Data]] = Vector(
  Vector(
    Text(6),
    Text(148),
    Text(72),
    Text(35),
    Text(0),
    Text(33.6),
    Text(0.627),
    Text(50),
    Text(1)
  ),
...
numberOfRows: Int = 768
numberOfColumns: Int = 9
data: Vector[Vector[Data]] = Vector(
  Vector(
    Numeric(6.0),
    Numeric(148.0),
    Numeric(72.0),
    Numeric(35.0),
    Numeric(0.0),
    Numeric(33.6),
    Numeric(0.627),
    Numeric(50.0),
    Numeric(1.0)
  ),
...

## Making Predictions

Let's implement a function that makes a prediction for a single row, given some fitted coefficients. It will be useful both during training and at test time.

In [3]:
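def predictLogisticRegression(row: Vector[Data], coefficients: Vector[Double]): Double = {
  // The last column holds the label, so only the remaining columns are inputs.
  val indices = row.indices.init

  // Linear combination b_0 + b_1 * x_1 + ... + b_n * x_n (coefficients.head is the bias).
  val z = indices.foldLeft(0.0) { (accumulator, index) =>
    accumulator + coefficients(index + 1) * getNumericValue(row(index)).get
  } + coefficients.head

  // Squash the linear combination into (0, 1) with the logistic function.
  1.0 / (1.0 + math.exp(-z))
}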

defined [32mfunction[39m [36mpredictLogisticRegression[39m

Let's test it on a mock dataset:

In [4]:
val mockDataset = Vector(
    (2.7810836, 2.550537003, 0),
    (1.465489372, 2.362125076, 0),
    (3.396561688, 4.400293529, 0),
    (1.38807019, 1.850220317, 0),
    (3.06407232, 3.005305973, 0),
    (7.627531214, 2.759262235, 1),
    (5.332441248, 2.088626775, 1),
    (6.922596716, 1.77106367, 1),
    (8.675418651, -0.242068655, 1),
    (7.673756466, 3.508563011, 1)).map{ case (x1, x2, y) => Vector(Numeric(x1), Numeric(x2), Numeric(y)) }

val mockCoefficients = Vector(-0.406605464, 0.852573316, -1.104746259)

mockDataset.foreach { case row @ Vector(Numeric(x1), Numeric(x2), Numeric(y)) => 
    val predicted = predictLogisticRegression(row, mockCoefficients)
    println(s"Expected=$y, Predicted=$predicted [${math.round(predicted)}]")
}

Expected=0.0, Predicted=0.2987569855650975 [0]
Expected=0.0, Predicted=0.14595105593031163 [0]
Expected=0.0, Predicted=0.08533326519733725 [0]
Expected=0.0, Predicted=0.21973731424800344 [0]
Expected=0.0, Predicted=0.24705900008926596 [0]
Expected=1.0, Predicted=0.9547021347460022 [1]
Expected=1.0, Predicted=0.8620341905282771 [1]
Expected=1.0, Predicted=0.9717729050420985 [1]
Expected=1.0, Predicted=0.9992954520878627 [1]
Expected=1.0, Predicted=0.9054893228110497 [1]


mockDataset: Vector[Vector[Numeric]] = Vector(
  Vector(Numeric(2.7810836), Numeric(2.550537003), Numeric(0.0)),
  Vector(Numeric(1.465489372), Numeric(2.362125076), Numeric(0.0)),
  Vector(Numeric(3.396561688), Numeric(4.400293529), Numeric(0.0)),
  Vector(Numeric(1.38807019), Numeric(1.850220317), Numeric(0.0)),
  Vector(Numeric(3.06407232), Numeric(3.005305973), Numeric(0.0)),
  Vector(Numeric(7.627531214), Numeric(2.759262235), Numeric(1.0)),
  Vector(Numeric(5.332441248), Numeric(

## Estimating Coefficients

Now that we have a prediction function in place, the next step is to implement a function that estimates the coefficients used later on in the pipeline.

We are implementing Stochastic Gradient Descent. It requires two parameters:

 - __Learning Rate__: Controls how much each coefficient is corrected on every update.
 - __Number of epochs__: Number of times the algorithm will loop over all the data, updating the coefficients.
 
The outline of the algorithm is as follows:

 1. Loop over each epoch.
 2. Loop over each row in the training set.
 3. Loop over each coefficient and update it for a row in an epoch.
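To make step 3 concrete, here is a sketch of a single update for one row, starting from all-zero coefficients. The input values are the first row of the mock dataset used earlier; the learning rate of 0.3 and the names `step` and `updated` are choices made for this illustration.

```scala
// One SGD update for a single row, starting from all-zero coefficients.
val learningRate = 0.3
val inputs = Vector(2.7810836, 2.550537003) // x_1, x_2
val expected = 0.0                          // y
val coefficients = Vector(0.0, 0.0, 0.0)    // b_0 (bias), b_1, b_2

// Prediction with the current coefficients: sigmoid(0) = 0.5.
val z = coefficients.head + coefficients.tail.zip(inputs).map { case (b, x) => b * x }.sum
val predicted = 1.0 / (1.0 + math.exp(-z))

// Factor shared by all coefficients: alpha * (y - yHat) * yHat * (1 - yHat).
val step = learningRate * (expected - predicted) * predicted * (1.0 - predicted)

// The bias uses an implicit input of 1; every other coefficient is scaled by its input.
val updated = (coefficients.head + step) +: coefficients.tail.zip(inputs).map { case (b, x) => b + step * x }

println(updated)
```

Here `step` is 0.3 * (-0.5) * 0.5 * 0.5 = -0.0375, so the updated coefficients are approximately -0.0375, -0.1043 and -0.0956: the model predicted 0.5 for a row labelled 0, and every coefficient is nudged so that the next prediction for this row is lower.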

In [5]:
def coefficientsLogisticRegressionSgd(train: Dataset, learningRate: Double, numberOfEpochs: Int): Vector[Double] = {
  // One coefficient per input column plus the bias (the label column accounts for the extra slot).
  var coefficients = Vector.fill(train.head.length)(0.0)

  for {
    _ <- 1 to numberOfEpochs
    row <- train
  } {
    val predicted = predictLogisticRegression(row, coefficients)
    val actual = getNumericValue(row.last).get
    val error = actual - predicted

    // Factor shared by every coefficient update for this row.
    val step = learningRate * error * predicted * (1.0 - predicted)

    // The bias has no associated input, so it is updated with the step alone.
    val bias = coefficients.head + step
    val indices = row.indices.init

    val remainingCoefficients = indices.foldLeft(coefficients) { (c, index) =>
      val input = getNumericValue(row(index)).get
      updatedVector(c, c(index + 1) + step * input, index + 1)
    }

    coefficients = Vector(bias) ++ remainingCoefficients.tail
  }

  coefficients
}

defined [32mfunction[39m [36mcoefficientsLogisticRegressionSgd[39m

Let's get the coefficients for our mock dataset:

In [6]:
coefficientsLogisticRegressionSgd(mockDataset, 0.3, 100)

res5: Vector[Double] = Vector(-0.8596443546618894, 1.5223825112460012, -2.2187002105650175)
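As a quick sanity check, the sketch below plugs the coefficients printed above back into the logistic equation and counts how many mock rows are classified correctly. It is self-contained and uses plain tuples rather than the notebook's `Numeric` wrapper; the names `fitted`, `rows`, and `predict` are illustrative.

```scala
// Coefficients copied from the output above (bias first, then b_1 and b_2).
val fitted = Vector(-0.8596443546618894, 1.5223825112460012, -2.2187002105650175)

// Mock dataset as plain tuples: (x1, x2, expected label).
val rows = Vector(
  (2.7810836, 2.550537003, 0),
  (1.465489372, 2.362125076, 0),
  (3.396561688, 4.400293529, 0),
  (1.38807019, 1.850220317, 0),
  (3.06407232, 3.005305973, 0),
  (7.627531214, 2.759262235, 1),
  (5.332441248, 2.088626775, 1),
  (6.922596716, 1.77106367, 1),
  (8.675418651, -0.242068655, 1),
  (7.673756466, 3.508563011, 1)
)

// Logistic regression prediction for two input variables.
def predict(x1: Double, x2: Double): Double = {
  val z = fitted(0) + fitted(1) * x1 + fitted(2) * x2
  1.0 / (1.0 + math.exp(-z))
}

// Round each prediction to 0 or 1 and compare it with the expected label.
val correct = rows.count { case (x1, x2, y) => math.round(predict(x1, x2)).toInt == y }
println(s"Correct: $correct of ${rows.length}") // prints "Correct: 10 of 10"
```

All ten rows land on the right side of the 0.5 threshold. That is expected for such a small, cleanly separated dataset and says nothing about performance on real data.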

## Logistic Regression

Now that we have all the pieces, defining logistic regression is easy. Let's implement it:

In [7]:
def logisticRegression(train: Dataset, test: Dataset, parameters: Parameters) = {
  val learningRate = parameters("learningRate").asInstanceOf[Double]
  val numberOfEpochs = parameters("numberOfEpochs").asInstanceOf[Int]

  val coefficients = coefficientsLogisticRegressionSgd(train, learningRate, numberOfEpochs)

  test.map { row =>
    Numeric(math.round(predictLogisticRegression(row, coefficients)))
  }
}

defined [32mfunction[39m [36mlogisticRegression[39m

Good. We just need to unpack the relevant parameters, use SGD to obtain the coefficients and then use them to make predictions on the test set.

Let's now test our new algorithm on the Pima Indians Diabetes dataset.

We'll first run a baseline model, then our freshly implemented logistic regression algorithm, and finally compare their performance.

As a baseline we will use a __random classifier__.

In [8]:
// Normalize data
val minMax = getDatasetMinAndMax(data)
val normalizedData = normalizeDataset(data, minMax)

val baselineAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        normalizedData, 
        (train, test, parameters) => randomAlgorithm(train, test), 
        Map.empty, 
        accuracy, 
        trainProportion=0.8)

println(s"Random Algorithm accuracy: $baselineAccuracy")

Random Algorithm accuracy: 0.474025974025974


minMax: MinMaxData = Vector(
  Some((0.0, 17.0)),
  Some((0.0, 199.0)),
  Some((0.0, 122.0)),
  Some((0.0, 99.0)),
  Some((0.0, 846.0)),
  Some((0.0, 67.1)),
  Some((0.078, 2.42)),
  Some((21.0, 81.0)),
  Some((0.0, 1.0))
)
normalizedData: Dataset = Vector(
  Vector(
    Numeric(0.35294117647058826),
    Numeric(0.7437185929648241),
    Numeric(0.5901639344262295),
    Numeric(0.35353535353535354),
    Numeric(0.0),
    Numeric(0.5007451564828614),
    Numeric(0.23441502988898377),
    Numeric(0.48333333333333334),
    Numeric(1.0)
  ),
...
baselineAccuracy: Double = 0.474025974025974

In [9]:
val logisticRegressionAccuracy = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        normalizedData, 
        logisticRegression, 
        Map("learningRate" -> 0.1, "numberOfEpochs" -> 100), 
        accuracy, 
        trainProportion=0.8)

logisticRegressionAccuracy: Double = 0.7337662337662337

We can see that our logistic regression algorithm performs much better than the random baseline we defined above (73.38% accuracy vs. 47.40%).

We could probably squeeze out more predictive power by tweaking the learning rate and the number of epochs. Feel free to experiment! :)
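One possible way to structure such an experiment is a small grid over both parameters. The sketch below is self-contained and runs on the mock dataset with plain vectors, so `Sample`, `predict`, `fit`, and `accuracy` are stand-ins written for this example rather than the notebook's own helpers. It also reports accuracy on the training data itself; on the real dataset you would score each combination on a held-out split instead.

```scala
// Self-contained sketch: tries a few hyperparameter combinations on a toy dataset.
type Sample = (Vector[Double], Double) // (inputs, label)

val samples: Vector[Sample] = Vector(
  (Vector(2.7810836, 2.550537003), 0.0),
  (Vector(1.465489372, 2.362125076), 0.0),
  (Vector(3.396561688, 4.400293529), 0.0),
  (Vector(1.38807019, 1.850220317), 0.0),
  (Vector(3.06407232, 3.005305973), 0.0),
  (Vector(7.627531214, 2.759262235), 1.0),
  (Vector(5.332441248, 2.088626775), 1.0),
  (Vector(6.922596716, 1.77106367), 1.0),
  (Vector(8.675418651, -0.242068655), 1.0),
  (Vector(7.673756466, 3.508563011), 1.0)
)

// Logistic regression prediction; coefficients.head is the bias.
def predict(inputs: Vector[Double], coefficients: Vector[Double]): Double = {
  val z = coefficients.head + coefficients.tail.zip(inputs).map { case (b, x) => b * x }.sum
  1.0 / (1.0 + math.exp(-z))
}

// Stochastic gradient descent with the same update rule, written against plain vectors.
def fit(data: Vector[Sample], learningRate: Double, epochs: Int): Vector[Double] = {
  val initial = Vector.fill(data.head._1.length + 1)(0.0)
  (1 to epochs).foldLeft(initial) { (perEpoch, _) =>
    data.foldLeft(perEpoch) { case (coefficients, (inputs, label)) =>
      val predicted = predict(inputs, coefficients)
      val step = learningRate * (label - predicted) * predicted * (1.0 - predicted)
      (coefficients.head + step) +: coefficients.tail.zip(inputs).map { case (b, x) => b + step * x }
    }
  }
}

// Fraction of samples whose rounded prediction matches the label.
def accuracy(data: Vector[Sample], coefficients: Vector[Double]): Double =
  data.count { case (inputs, label) =>
    math.round(predict(inputs, coefficients)).toDouble == label
  }.toDouble / data.length

// Evaluate every (learning rate, epochs) combination; the candidate values are arbitrary.
val grid = for {
  learningRate <- Vector(0.01, 0.1, 0.3)
  epochs <- Vector(10, 100)
} yield (learningRate, epochs, accuracy(samples, fit(samples, learningRate, epochs)))

grid.foreach { case (lr, epochs, acc) =>
  println(f"learningRate=$lr%.2f epochs=$epochs%3d accuracy=$acc%.2f")
}
```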