# Scale Machine Learning Data

A large number of machine learning algorithms make assumptions about the scale of the data and their range of values.

Popular algorithms such as **logistic** and **linear regression** learn a weight for each input feature, so features on very different scales can hurt their performance or slow down their learning process.

Other, more complex algorithms such as **artificial neural networks** combine their inputs in nontrivial ways. Hence, again, it is a good idea to put all inputs on a similar scale, which can also help mitigate problems such as exploding or vanishing gradients.

## Normalize Data

Let's start by exploring one of the two methods for data scaling that we'll address in this notebook: Normalization. 

The meaning of normalization varies with the context. In our context it means that we'll rescale the values of each column to the range [0, 1], using that column's minimum and maximum. We can achieve this by applying the following formula:

$$ value' = \frac{value - min}{max - min} $$
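To make the formula concrete, here is a minimal sketch that applies it to a single column of plain doubles. The name `minMaxScale` is a hypothetical helper used only for illustration (it is not part of the notebook's code), and it assumes the column is non-empty and contains at least two distinct values.

```scala
// Hypothetical helper: rescales every value of a column to the range [0, 1].
// Assumes the column is non-empty and that max != min.
def minMaxScale(column: Vector[Double]): Vector[Double] = {
  val min = column.min
  val max = column.max

  column.map(value => (value - min) / (max - min))
}

// The smallest value maps to 0.0, the largest to 1.0, and 20.0 lands halfway.
val scaled = minMaxScale(Vector(10.0, 20.0, 30.0))
```

The implementation built in the rest of this notebook follows the same arithmetic, but works column by column on a dataset that can also contain text.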

Good, let's start our implementation by loading the code and libraries we'll need. We will build on top of the code we implemented in the [previous notebook](https://github.com/jesus-a-martinez-v/toy-ml/blob/master/src/main/scala/notebooks/load_data_from_csv.ipynb).

In [1]:
import $ivy.`com.github.tototoshi::scala-csv:1.3.5`
import $file.^.datasmarts.ml.toy.scripts.LoadCsv, LoadCsv._

import $ivy.$

import $file.$, LoadCsv._

Now, let's define some type aliases and helper functions to make our lives easier:

In [2]:
type Dataset = Vector[Vector[Data]]
type MinMaxData = Vector[Option[(Double, Double)]]
type StatisticData = Vector[Option[Double]]

def isNumeric(data: Data) = data match {
  case _: Numeric => true
  case _ => false
}

def isText(data: Data) = !isNumeric(data)


def getNumericValue(data: Data): Option[Double] = data match {
  case Numeric(value) => Some(value)
  case _ => None
}

def getTextValue(data: Data): Option[String] = data match {
  case Text(value) => Some(value)
  case _ => None
}

defined [32mtype[39m [36mDataset[39m
defined [32mtype[39m [36mMinMaxData[39m
defined [32mtype[39m [36mStatisticData[39m
defined [32mfunction[39m [36misNumeric[39m
defined [32mfunction[39m [36misText[39m
defined [32mfunction[39m [36mgetNumericValue[39m
defined [32mfunction[39m [36mgetTextValue[39m

Good! Our **Dataset** representation is just a vector of rows, where each row is also a vector that contains an entry for each specific column.

In order to determine the minimum and maximum value of each column in the dataset, we define the **MinMaxData** type as a Vector of optional tuples of doubles, where the first element in the tuple corresponds to the minimum and the second to the maximum. Why optional? If a column is not numeric, then we'll return None, stating that min and max aren't well defined for text data.

Analogously, the **StatisticData** type refers to a vector of optional doubles. As in the **MinMaxData** case, a None means that the statistic (the mean or the standard deviation) cannot be calculated for a text column.

Finally, **isNumeric**, **isText**, **getNumericValue** and **getTextValue** allow us to determine the type of a data instance, and to get its value, respectively.
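As a quick illustration of how these helpers combine, the sketch below extracts only the numeric values of a row. The `Data`, `Numeric` and `Text` definitions here are an assumption about the shape of the types provided by `LoadCsv` in the previous notebook; they are repeated only so the snippet is self-contained, and should be skipped when running inside this notebook.

```scala
// Assumed shape of the Data hierarchy defined in LoadCsv (previous notebook).
sealed trait Data
case class Numeric(value: Double) extends Data
case class Text(value: String) extends Data

def getNumericValue(data: Data): Option[Double] = data match {
  case Numeric(value) => Some(value)
  case _ => None
}

val row: Vector[Data] = Vector(Numeric(50), Numeric(30), Text("A"))

// flatMap over the returned Options keeps only the numeric entries of the row.
val numericValues = row.flatMap(data => getNumericValue(data))
```

Here `numericValues` holds the two doubles of the row, and the text entry is silently dropped.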

### Mock dataset

Let's use the following dataset for testing our upcoming functions:

| X1 	|  X2  | X3   |
|:---:	|:----:| :--: |
|   50 	|  30  |   A  |
|   20  |  90  |   B  |
|   19  | 90.4 |   C  |

In [3]:
val dataset: Dataset = Vector(
    //       X1            X2          X3
    Vector(Numeric(50), Numeric(30), Text("A")),
    Vector(Numeric(20), Numeric(90), Text("B")),
    Vector(Numeric(19), Numeric(90.4), Text("C")))

dataset: Dataset = Vector(
  Vector(Numeric(50.0), Numeric(30.0), Text(A)),
  Vector(Numeric(20.0), Numeric(90.0), Text(B)),
  Vector(Numeric(19.0), Numeric(90.4), Text(C))
)

### Getting MIN and MAX values of a dataset

Let's now proceed to define a function to obtain the minimum and maximum values of each column in a dataset. The logic will be as follows:

- If the dataset is empty, return an empty vector.
- If not, for each column:
    - If the column is of type Text (judging by its value in the first row), then return None.
    - If the column is of type Numeric, sort its values in ascending order. The minimum value will be at the beginning of the vector and the maximum at the end. Return them in a tuple.

In [4]:
def getDatasetMinAndMax(dataset: Dataset): MinMaxData = {
  if (dataset.isEmpty) {
    Vector.empty
  } else {
    val numberOfColumns = dataset.head.length
    val columnIndicesRange = (0 until numberOfColumns).toVector
    val testRow = dataset.head

    for {
      columnIndex <- columnIndicesRange
    } yield {
      if (isText(testRow(columnIndex))) {
        None
      } else {
        val columnValues = dataset.map { row => 
          getNumericValue(row(columnIndex)).get
        }.sorted
        
        val max = columnValues.last
        val min = columnValues.head

        Some((min, max))
      }
    }
  }
}

defined [32mfunction[39m [36mgetDatasetMinAndMax[39m

Good! Let's now test it in our mock dataset:

In [5]:
val minMax = getDatasetMinAndMax(dataset)

minMax: MinMaxData = Vector(Some((19.0, 50.0)), Some((30.0, 90.4)), None)

As expected, the results are:

|  	    |  MIN | MAX  |
|:---:	|:----:| :--: |
|   **X1** 	|  19  |   50  |
|   **X2**  |  30  |   90.4  |
|   **X3**  | - |   -  |
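Sorting is not strictly required to find the extremes of a column. As a hedged alternative, the sketch below transposes the dataset and calls `min` and `max` on each column. The name `minMaxByColumn` is hypothetical, and the `Data` hierarchy is an assumed copy of the one provided by `LoadCsv`, repeated only to keep the snippet self-contained. Unlike `getDatasetMinAndMax`, this variant treats a column as numeric only if every entry is Numeric, instead of checking the first row.

```scala
// Assumed shape of the Data hierarchy defined in LoadCsv (previous notebook).
sealed trait Data
case class Numeric(value: Double) extends Data
case class Text(value: String) extends Data

type Dataset = Vector[Vector[Data]]
type MinMaxData = Vector[Option[(Double, Double)]]

// Hypothetical alternative to getDatasetMinAndMax that avoids sorting.
def minMaxByColumn(dataset: Dataset): MinMaxData =
  if (dataset.isEmpty) Vector.empty
  else dataset.transpose.map { column =>
    // Keep the numeric values; a column with any Text entry yields None.
    val values = column.collect { case Numeric(value) => value }

    if (values.length == column.length) Some((values.min, values.max))
    else None
  }

val mock: Dataset = Vector(
  Vector(Numeric(50), Numeric(30), Text("A")),
  Vector(Numeric(20), Numeric(90), Text("B")),
  Vector(Numeric(19), Numeric(90.4), Text("C")))

val result = minMaxByColumn(mock)
```

On the mock dataset this produces the same minimum and maximum values as the table above.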

### MIN-MAX normalizer

We are all set! Let's define a function to calculate the min-max normalization for each value in the dataset. This will only affect Numeric data. Text will remain untouched:

In [6]:
def normalizeDataset(dataset: Dataset, minMaxes: MinMaxData): Dataset = {
  if (dataset.isEmpty) {
    Vector.empty
  } else {
    val numberOfColumns = dataset.head.length
    val columnIndicesRange = (0 until numberOfColumns).toVector

    for {
      row <- dataset
    } yield {
      columnIndicesRange.map { columnIndex =>
        val rowData = row(columnIndex)

        minMaxes(columnIndex) match {
          case None => rowData
          case Some((min, max)) =>
            val rowValue = getNumericValue(rowData).get
            val normalizedRowValue = (rowValue - min) / (max - min)

            Numeric(normalizedRowValue)
        }
      }
    }
  }
}

defined [32mfunction[39m [36mnormalizeDataset[39m

Let's now test it in our mock dataset:

In [7]:
val minMaxNormalizedData = normalizeDataset(dataset, minMax)

minMaxNormalizedData: Dataset = Vector(
  Vector(Numeric(1.0), Numeric(0.0), Text(A)),
  Vector(Numeric(0.03225806451612903), Numeric(0.9933774834437085), Text(B)),
  Vector(Numeric(0.0), Numeric(1.0), Text(C))
)

The results after normalization are:

| X1 	|  X2  | X3   |
|:---:	|:----:| :--: |
|   1 	|  0  |   A  |
|   0.032  |  0.993  |   B  |
|   0  | 1 |   C  |
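One edge case worth keeping in mind: if every value in a column is the same, then max equals min and the formula divides zero by zero, which yields NaN for doubles. The sketch below shows one possible guard. The helper `safeMinMax` is hypothetical, and mapping a constant column to 0 is an assumption made for illustration, not something the notebook's `normalizeDataset` does.

```scala
// Hypothetical helper: min-max scaling for one value, guarding against max == min.
def safeMinMax(value: Double, min: Double, max: Double): Double =
  if (max == min) 0.0 // assumption: every value of a constant column maps to 0
  else (value - min) / (max - min)

val constantColumn = Vector(5.0, 5.0, 5.0)
val columnMin = constantColumn.min
val columnMax = constantColumn.max

// Without the guard, each entry is 0.0 / 0.0, which is NaN.
val unguarded = constantColumn.map(value => (value - columnMin) / (columnMax - columnMin))

// With the guard, each entry is 0.0.
val guarded = constantColumn.map(value => safeMinMax(value, columnMin, columnMax))
```

For columns with distinct values, `safeMinMax` behaves exactly like the original formula.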

## Standardize Data

The second method for scaling data is known as _standardization_. This is a rescaling technique that centers the distribution of each column on the value 0 and sets its standard deviation to 1. The mean and the standard deviation can be used together to summarize a normal (Gaussian) distribution.

The formula for the mean is:

$$ \mu = \frac{\sum_{i=1}^{N} value_i}{N}$$

The formula for the standard deviation is:

$$ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (value_i - \mu)^2}{N - 1}} $$

And finally, the formula for the standardization is:

$$ value' = \frac{value - \mu}{\sigma}  $$
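The three formulas can be chained on a single column of plain doubles, as in the minimal sketch below. The helper name `standardize` is hypothetical and used only for illustration; it uses the sample standard deviation (dividing by N - 1, as in the formula above) and assumes the column has at least two values that are not all equal, so that the standard deviation is non-zero.

```scala
// Hypothetical helper: standardizes one column using the sample standard deviation (N - 1).
// Assumes at least two values that are not all equal, so sigma is non-zero.
def standardize(column: Vector[Double]): Vector[Double] = {
  val mean = column.sum / column.length
  val variance = column.map(value => math.pow(value - mean, 2)).sum / (column.length - 1)
  val sigma = math.sqrt(variance)

  column.map(value => (value - mean) / sigma)
}

// For this column the mean is 4 and sigma is 2, so the result is Vector(-1.0, 0.0, 1.0).
val standardized = standardize(Vector(2.0, 4.0, 6.0))
```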

### Getting MEAN and STANDARD DEVIATIONS of a dataset

Let's now proceed to define the functions needed to obtain the mean and standard deviation of each column in a dataset. The logic will be as follows:

- If the dataset is empty, return an empty vector.
- If not, for each column:
    - If the column is of type Text (judging by its value in the first row), then return None.
    - If the column is of type Numeric, apply the corresponding formula to obtain the needed value.

In [8]:
def getColumnMeans(dataset: Dataset): StatisticData = {
  if (dataset.isEmpty) {
    Vector.empty
  } else {
    val numberOfColumns = dataset.head.length
    val testRow = dataset.head

    for {
      columnIndex <- (0 until numberOfColumns).toVector
    } yield {
      if (isText(testRow(columnIndex))) {
        None
      } else {
        val columnValues = dataset.map { row => 
            getNumericValue(row(columnIndex)).get
        }
        val sum = columnValues.sum
        val count = columnValues.length

        Some(sum / count)
      }
    }
  }
}

defined [32mfunction[39m [36mgetColumnMeans[39m

Let's now test it in our mock dataset:

In [9]:
val means = getColumnMeans(dataset)

means: StatisticData = Vector(Some(29.666666666666668), Some(70.13333333333334), None)

The results are:

|  	    |  $$ \mu $$ |
|:---:	|:----:|
|   **X1** 	|  29.667  |
|   **X2**  |  70.133  |
|   **X3**  | - |

In [10]:
def getColumnsStandardDeviations(dataset: Dataset, means: StatisticData): StatisticData = {
  if (dataset.isEmpty) {
    Vector.empty
  } else {
    val numberOfColumns = dataset.head.length
    val testRow = dataset.head
      
    for {
      columnIndex <- (0 until numberOfColumns).toVector
      
    } yield {
      if (isText(testRow(columnIndex))) {
        None
      } else {
        val columnMean = means(columnIndex).get
        val columnSquaredMeanDifferences = dataset.map { row => 
            val meanDifference = getNumericValue(row(columnIndex)).get - columnMean
            
            math.pow(meanDifference, 2)
        }
        val sum = columnSquaredMeanDifferences.sum
        val count = columnSquaredMeanDifferences.length
        val variance = sum / (count - 1)
        val standardDeviation = math.sqrt(variance)

        Some(standardDeviation)
      }
    }
  }
}

defined [32mfunction[39m [36mgetColumnsStandardDeviations[39m

Let's now test it in our mock dataset:

In [11]:
val standardDeviations = getColumnsStandardDeviations(dataset, means)

standardDeviations: StatisticData = Vector(Some(17.61628034896508), Some(34.757061632614075), None)

The results are:

|  	    |  $$ \sigma $$ |
|:---:	|:----:|
|   **X1** 	|  17.616  |
|   **X2** |  34.757  |
|   **X3**  | - |

Finally, let's use these functions to standardize a dataset:

In [12]:
def standardizeDataset(dataset: Dataset, means: StatisticData, standardDeviations: StatisticData): Dataset = {
  if (dataset.isEmpty) {
    Vector.empty
  } else {
    val numberOfColumns = dataset.head.length

    for {
      row <- dataset
      columnIndicesRange = (0 until numberOfColumns).toVector
    } yield {
      columnIndicesRange.map { columnIndex =>
        val rowData = row(columnIndex)

        if (isText(rowData)) {
          rowData
        } else {
          val columnMean = means(columnIndex).get
          val columnStandardDeviation = standardDeviations(columnIndex).get
          val rowValue = getNumericValue(rowData).get

          val standardizedRowValue = (rowValue - columnMean) / columnStandardDeviation

          Numeric(standardizedRowValue)
        }
      }
    }
  }
}

defined [32mfunction[39m [36mstandardizeDataset[39m

Let's now test it in our mock dataset:

In [13]:
val standardizedDataset = standardizeDataset(dataset, means, standardDeviations)

standardizedDataset: Dataset = Vector(
  Vector(Numeric(1.1542353397281098), Numeric(-1.1546814215064278), Text(A)),
  Vector(Numeric(-0.5487348336412325), Numeric(0.5715864844001916), Text(B)),
  Vector(Numeric(-0.6055005060868773), Numeric(0.5830949371062358), Text(C))
)

The results after standardization are:

| X1 	|  X2  | X3   |
|:---:	|:----:| :--: |
|   1.154 	|  -1.155  |   A  |
|   -0.549  |  0.572  |   B  |
|   -0.606  | 0.583 |   C  |
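A simple sanity check is to recompute the mean and the sample standard deviation of a standardized column: they should come out as approximately 0 and 1. The sketch below does this for the X1 column, with the values copied from the output above. The variable names are hypothetical and only serve this illustration.

```scala
// Standardized X1 column, copied from the output of standardizeDataset above.
val standardizedX1 = Vector(1.1542353397281098, -0.5487348336412325, -0.6055005060868773)

val checkMean = standardizedX1.sum / standardizedX1.length
val checkVariance =
  standardizedX1.map(value => math.pow(value - checkMean, 2)).sum / (standardizedX1.length - 1)
val checkSigma = math.sqrt(checkVariance)

// Up to floating-point error, checkMean is 0 and checkSigma is 1.
```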

## When to normalize? When to standardize?

When the data or any particular column doesn't follow a normal distribution, it is a good idea to apply MIN-MAX normalization because it doesn't make any assumptions regarding the values' distributions.

On the other hand, if the data or any particular column follows a Gaussian distribution, standardizing is usually the better choice.