# Simple Linear Regression

Linear regression is one of the simplest machine learning algorithms, if not the simplest. As its name states, it assumes a linear relationship between the input variables, X, and the output variable, Y.

When we have only one input variable, the algorithm is known as __Simple Linear Regression__. With several input variables, it is known as __Multiple Linear Regression__ (sometimes loosely called multivariate linear regression).

Let's start by examining the linear regression model equation:

$$ y = b0 + b1 \cdot x $$

Here $b0$ (the intercept) and $b1$ (the slope) are the parameters we must estimate from our training data.

Once the coefficients have been determined, we can use this equation to make predictions on new data.
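
As a minimal sketch of this prediction step, the model equation translates directly into a small function. The name `predict` and the coefficient values below are hypothetical, chosen only for illustration; they are not estimated from any dataset.

```scala
// Hypothetical helper: evaluates y = b0 + b1 * x for a single input.
def predict(b0: Double, b1: Double)(x: Double): Double = b0 + b1 * x

// Illustrative coefficients (assumed, not estimated from data).
val exampleModel: Double => Double = predict(b0 = 1.0, b1 = 2.0)

val examplePredictions = Vector(0.0, 1.5, 4.0).map(exampleModel)
println(examplePredictions) // Vector(1.0, 4.0, 9.0)
```

Currying the coefficients separately from `x` lets us fix them once after training and reuse the resulting function on any number of new inputs.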

How can we estimate these coefficients? Using the following formulas: 

$$ b1 = \frac{\sum_{i=1}^{N}(x_i-mean(x))(y_i-mean(y))}{\sum_{i=1}^{N} (x_i-mean(x))^2} $$

$$ b0 = mean(y) - b1 \cdot mean(x) $$

The formula for $b1$ can also be expressed as:

$$ b1 = \frac{covariance(x, y)}{variance(x)} $$
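
The two forms agree because any normalization constant cancels in the ratio. As a short derivation, assuming the usual sample definitions with a $\frac{1}{N-1}$ factor (an assumption for illustration; those definitions are not stated above):

$$ \frac{covariance(x, y)}{variance(x)} = \frac{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-mean(x))(y_i-mean(y))}{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-mean(x))^2} = \frac{\sum_{i=1}^{N}(x_i-mean(x))(y_i-mean(y))}{\sum_{i=1}^{N}(x_i-mean(x))^2} = b1 $$

The same cancellation happens with a $\frac{1}{N}$ factor, so un-normalized sums are enough whenever only this ratio matters.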



Let's start our implementation by loading the code and libraries we'll need. We will build on the helpers we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/algorithm_test_harnesses.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.AlgorithmTestHarness, AlgorithmTestHarness._
import scala.util.Random

import $ivy.$
import $file.$, AlgorithmTestHarness._
import scala.util.Random

## Data

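This time we'll use the [Swedish Auto Insurance Dataset](http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/slr/frames/slr06.html). Each row has two columns: the number of claims (our input, x) and the total payment for all claims in thousands of Swedish kronor (our output, y). Let's load it: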

In [2]:
val BASE_DATA_PATH = "../../resources/data"
val swedishDataPath = s"$BASE_DATA_PATH/7/swedish.csv"

val rawData = loadCsv(swedishDataPath)
val numberOfRows = rawData.length
val numberOfColumns = rawData.head.length
println(s"Number of rows in dataset: $numberOfRows")
println(s"Number of columns in dataset: $numberOfColumns")

val data = (0 until numberOfColumns).toVector.foldLeft(rawData) { (d, i) => textColumnToNumeric(d, i)}

Number of rows in dataset: 63
Number of columns in dataset: 2


BASE_DATA_PATH: String = "../../resources/data"
swedishDataPath: String = "../../resources/data/7/swedish.csv"
rawData: Vector[Vector[Data]] = Vector(
  Vector(Text(108), Text(392.5)),
  Vector(Text(19), Text(46.2)),
  Vector(Text(13), Text(15.7)),
  Vector(Text(124), Text(422.2)),
  Vector(Text(40), Text(119.4)),
  Vector(Text(57), Text(170.9)),
  Vector(Text(23), Text(56.9)),
  Vector(Text(14), Text(77.5)),
  Vector(Text(45), Text(214)),
  Vector(Text(10), Text(65.3)),
  Vector(Text(5), Text(20.9)),
...
numberOfRows: Int = 63
numberOfColumns: Int = 2
data: Vector[Vector[Data]] = Vector(
  Vector(Numeric(108.0), Numeric(392.5)),
...

## Calculate the Mean

As the formulas above show, we need the mean and the variance to determine the values of the coefficients.

Let's start by recalling the mean formula:

$$ mean(x) = \frac{\sum_{i=1}^{N} x_i}{N}$$

Good. Let's now proceed to implement it:

In [3]:
def mean(values: Vector[Numeric]): Double = values.foldLeft(0.0) { (accumulator, numericValue) =>
  accumulator + numericValue.value
} / values.length

defined function mean

## Calculate the Variance

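Next, we need to calculate the variance.

Here is the formula. Note that it is the sum of squared deviations from the mean, without the usual division by $N$ (or $N - 1$). That factor would cancel out when we divide the covariance by the variance to obtain $b1$, provided the covariance is computed the same way:

$$ variance(x) = \sum_{i=1}^{N}(x_i - mean(x))^2 $$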

Good. Let's now proceed to implement it:

In [4]:
def variance(values: Vector[Numeric], mean: Double): Double = values.foldLeft(0.0) { (accumulator, numericValue) =>
  accumulator + math.pow(numericValue.value - mean, 2)
}

defined function variance

Great. Now that we have proper functions to calculate both the mean and variance of a group of values, let's test them on a small, mock dataset:

In [5]:
val mockData = Vector((1, 1), (2, 3), (4, 3), (3, 2), (5, 5)).map { case (x, y) => Vector(Numeric(x), Numeric(y)) }
val x = selectColumn(mockData, 0).asInstanceOf[Vector[Numeric]]
val y = selectColumn(mockData, 1).asInstanceOf[Vector[Numeric]]

val meanX = mean(x)
val meanY = mean(y)

val varianceX = variance(x, meanX)
val varianceY = variance(y, meanY)

println(s"X stats: mean=$meanX variance=$varianceX")
println(s"Y stats: mean=$meanY variance=$varianceY")

X stats: mean=3.0 variance=10.0
Y stats: mean=2.8 variance=8.8


mockData: Vector[Vector[Numeric]] = Vector(
  Vector(Numeric(1.0), Numeric(1.0)),
  Vector(Numeric(2.0), Numeric(3.0)),
  Vector(Numeric(4.0), Numeric(3.0)),
  Vector(Numeric(3.0), Numeric(2.0)),
  Vector(Numeric(5.0), Numeric(5.0))
)
x: Vector[Numeric] = Vector(Numeric(1.0), Numeric(2.0), Numeric(4.0), Numeric(3.0), Numeric(5.0))
y: Vector[Numeric] = Vector(Numeric(1.0), Numeric(3.0), Numeric(3.0), Numeric(2.0), Numeric(5.0))
...

## Calculate the Covariance

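Covariance describes how two variables change together: it is positive when they tend to increase together and negative when one tends to decrease as the other increases. Correlation is the covariance normalized by the standard deviations of both variables, which confines it to the range from -1 to 1.

As with the variance, we compute the un-normalized version, that is, the sum of the products of the deviations from each mean.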

Let's calculate it:

In [6]:
def covariance(x: Vector[Numeric], y: Vector[Numeric], meanX: Double, meanY: Double): Double = {
  assert(x.length == y.length)

  x.indices.foldLeft(0.0) { (accumulator, index) =>
    accumulator + ((x(index).value - meanX) * (y(index).value - meanY))
  }
}

defined function covariance

Good. Let's test it on our mock dataset.

In [7]:
val cov = covariance(x, y, meanX, meanY)

println(s"covariance between X and Y = $cov")


covariance between X and Y = 8.0


cov: Double = 8.0
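
To connect covariance back to correlation, here is a small sketch that derives the Pearson correlation from the same kind of un-normalized sums. It is self-contained and uses plain `Double` values instead of the notebook's `Numeric` wrapper; the names `average`, `sumOfProducts`, and `pearsonCorrelation` are hypothetical and not part of the test harness.

```scala
// Same mock data as above, as plain Doubles (assumption for self-containment).
val xs = Vector(1.0, 2.0, 4.0, 3.0, 5.0)
val ys = Vector(1.0, 3.0, 3.0, 2.0, 5.0)

def average(values: Vector[Double]): Double = values.sum / values.length

// Un-normalized sum of products of deviations, matching this notebook's covariance.
// Calling it with the same vector twice yields the un-normalized variance.
def sumOfProducts(a: Vector[Double], b: Vector[Double]): Double = {
  val (meanA, meanB) = (average(a), average(b))
  a.zip(b).map { case (ai, bi) => (ai - meanA) * (bi - meanB) }.sum
}

// Pearson correlation: covariance scaled by both standard deviations.
// The normalization factors cancel, so the un-normalized sums suffice.
def pearsonCorrelation(a: Vector[Double], b: Vector[Double]): Double =
  sumOfProducts(a, b) / math.sqrt(sumOfProducts(a, a) * sumOfProducts(b, b))

val r = pearsonCorrelation(xs, ys)
println(s"correlation between X and Y = $r")
```

With the values computed above (covariance 8.0, variances 10.0 and 8.8), this evaluates to 8 / sqrt(88), which is roughly 0.85, a fairly strong positive linear relationship.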

## Coefficients

We have all we need to calculate the coefficients. Let's recall the formulas:

$$ b1 = \frac{covariance(x, y)}{variance(x)} $$

$$ b0 = mean(y) - b1 \cdot mean(x) $$

In [8]:
def coefficients(dataset: Dataset) = {
  val x = selectColumn(dataset, 0).asInstanceOf[Vector[Numeric]]
  val y = selectColumn(dataset, 1).asInstanceOf[Vector[Numeric]]

  val xMean = mean(x)
  val yMean = mean(y)

  val b1 = covariance(x, y, xMean, yMean) / variance(x, xMean)
  val b0 = yMean - b1 * xMean

  (b0, b1)
}

defined function coefficients

Let's calculate the coefficients from the mock dataset:

In [9]:
val (b0, b1) = coefficients(mockData)

println(s"Coefficients: B0=$b0, B1=$b1")

Coefficients: B0=0.39999999999999947, B1=0.8


b0: Double = 0.39999999999999947
b1: Double = 0.8
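
One way to sanity-check these coefficients is to use a known property of a least-squares line with an intercept: its residuals sum to zero. The sketch below refits the mock data on plain `Double` values and checks that property. The helper `fitLine` and the other names are hypothetical, introduced only for this check.

```scala
// Same mock data as above, as plain Doubles (assumption for self-containment).
val inputs = Vector(1.0, 2.0, 4.0, 3.0, 5.0)
val targets = Vector(1.0, 3.0, 3.0, 2.0, 5.0)

// Hypothetical helper: returns (intercept, slope) using the formulas above.
def fitLine(x: Vector[Double], y: Vector[Double]): (Double, Double) = {
  require(x.length == y.length && x.nonEmpty, "x and y must be non-empty and of equal length")
  val meanX = x.sum / x.length
  val meanY = y.sum / y.length
  val slopeEstimate =
    x.zip(y).map { case (xi, yi) => (xi - meanX) * (yi - meanY) }.sum /
      x.map(xi => math.pow(xi - meanX, 2)).sum
  (meanY - slopeEstimate * meanX, slopeEstimate)
}

val (intercept, slope) = fitLine(inputs, targets)

// Predictions on the training inputs and the sum of the residuals.
val fitted = inputs.map(xi => intercept + slope * xi)
val residualSum = targets.zip(fitted).map { case (yi, fi) => yi - fi }.sum

println(s"intercept=$intercept slope=$slope")
println(s"sum of residuals: $residualSum") // close to 0, up to floating-point error
```

Note that `fitLine` would divide by zero if every input were identical, so a real implementation might want to guard against that case.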

## Implementing Simple Linear Regression

Now that we have a way to calculate the coefficients, we can implement a simple linear regression algorithm in just two steps:

In [10]:
def simpleLinearRegression(train: Dataset, test: Dataset) = {
  // Training step: Determine coefficients.
  val (b0, b1) = coefficients(train)

  // Test step: Use coefficients to predict the value.
  // Destructuring each row as Vector(x, _) works because simple linear regression
  // relates exactly TWO variables: one input (x) and one output, which we ignore here.
  test.map { case Vector(x, _) =>
    Numeric(b0 + b1 * getNumericValue(x).get)
  }
}

defined function simpleLinearRegression

Let's evaluate it on our mock dataset, training on 60% of the rows:

In [11]:
evaluateAlgorithmUsingTrainTestSplit[Numeric](
    mockData, 
    (train, test, parameters) => simpleLinearRegression(train, test), 
    Map.empty, 
    rootMeanSquaredError, 
    trainProportion=0.6)

res10: Double = 2.0275875100994067
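
The `rootMeanSquaredError` metric used above comes from the previous notebook, so its code is not shown here. As a rough sketch of what such a metric computes, assuming plain `Double` values rather than the harness's `Numeric` type (the name `rmse` is hypothetical):

```scala
// Hypothetical stand-in for an RMSE metric on plain Doubles.
// It is the square root of the mean of the squared prediction errors.
def rmse(actual: Vector[Double], predicted: Vector[Double]): Double = {
  require(actual.length == predicted.length && actual.nonEmpty,
    "actual and predicted must be non-empty and of equal length")
  val squaredErrors = actual.zip(predicted).map { case (a, p) => math.pow(a - p, 2) }
  math.sqrt(squaredErrors.sum / squaredErrors.length)
}

// Two perfect predictions and one that is off by 2.
val rmseExample = rmse(Vector(1.0, 2.0, 3.0), Vector(1.0, 2.0, 5.0))
println(s"RMSE: $rmseExample") // sqrt(4 / 3), roughly 1.15
```

Because the errors are squared before averaging, RMSE penalizes large mistakes more heavily, and the result is expressed in the same units as the output variable.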

Good. It works. Let's now apply it to the Swedish Auto Insurance dataset.

We'll start by running a baseline model on it, then run our freshly implemented simple linear regression algorithm, and finally compare their performance.

As a baseline we will use a __zero rule regressor__, which ignores the input and always predicts a single central value of the training outputs (typically the mean).

In [12]:
val baselineRmse = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        data, 
        (train, test, parameters) => zeroRuleRegressor(train, test), 
        Map.empty, 
        rootMeanSquaredError, 
        trainProportion=0.8)

println(s"Zero Rule Regressor RMSE: $baselineRmse")

Zero Rule Regressor RMSE: 120.45498620710497


baselineRmse: Double = 120.45498620710497
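
The baseline's `zeroRuleRegressor` is also defined in the previous notebook. Here is a minimal sketch of the idea, assuming it predicts the mean of the training outputs; the harness's actual implementation may differ (for example, it could use the median), and the name `zeroRuleMean` is hypothetical.

```scala
// Hypothetical stand-in for a zero rule regressor on plain Doubles.
// It ignores the inputs entirely and predicts the training mean for every test row.
def zeroRuleMean(trainOutputs: Vector[Double], testSize: Int): Vector[Double] = {
  require(trainOutputs.nonEmpty, "need at least one training output")
  val prediction = trainOutputs.sum / trainOutputs.length
  Vector.fill(testSize)(prediction)
}

val baselinePredictions = zeroRuleMean(Vector(10.0, 20.0, 30.0), testSize = 2)
println(baselinePredictions) // Vector(20.0, 20.0)
```

Any model that actually learns from the input should beat this baseline; if it does not, the input probably carries little useful signal or the model is a poor fit.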

In [13]:
val simpleLinearRegressionRmse = evaluateAlgorithmUsingTrainTestSplit[Numeric](
        data, 
        (train, test, parameters) => simpleLinearRegression(train, test), 
        Map.empty, 
        rootMeanSquaredError, 
        trainProportion=0.8)

println(s"Simple Linear Regressor RMSE: $simpleLinearRegressionRmse")

Simple Linear Regressor RMSE: 34.48536616564703


simpleLinearRegressionRmse: Double = 34.48536616564703

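The performance gap is substantial: the RMSE drops from about 120.45 for the baseline to about 34.49 for simple linear regression, a reduction of roughly 71%.

As we can see, although SLR is a basic algorithm, it can be very effective for the right kind of problem, namely one where the relationship between input and output is approximately linear.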