# Data Preprocessing {#sec-data-preprocessing}

## Overview

Data come in various formats and can contain all sorts of different values. These traits make it difficult to correctly
interpret and model the data. In this chapter, we will explore various techniques for bringing the data into a form
more suitable for interpretation and modeling.
 
In particular, we will look into

- Dataset scaling
- Dataset normalization
- Dataset imputation

We will see that there are several ways to scale data. Throughout this chapter, we will use Python and the ```sklearn.preprocessing``` module.

## Data preprocessing 

Let's discuss a few methodologies we can use to scale our data.

### Data scaling

Algorithms such as logistic regression, see @sec-logistic-regression, or support vector machines, see @sec-support-vector-machines, 
perform better when every feature of the dataset has zero mean. Being able to scale the dataset is therefore very important.
In this section we will review

- Zero-centering
- $z$-score scaling
- Min-max scaling
- Range scaling
- Robust scaling

As the number of methods we will review suggests, each one has advantages and disadvantages
that we will mention as we discuss the particular approach.

**Zero-centering**

In this approach, every variable entering the model is shifted according to

$$\hat{x}_i = x_i - E\left[X\right]$$

where $E\left[X\right]$ is the mean of the variable over the dataset. The advantage of zero-centering is that it is reversible
and it does not alter the relationships among samples. Furthermore, zero-centering allows us to exploit the symmetry
that some activation functions have, which can speed up model convergence.
For a 1D dataset, zero-centering is trivial to perform, as the code snippet below demonstrates.

```python
import numpy as np

x = [float(i) for i in range (1, 50)]
mean = np.mean(x)
x = [i - mean for i in x]
```
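
For a dataset with several features, the mean is subtracted column by column. The following is a minimal sketch, assuming a small made-up array `X`; it uses ```StandardScaler``` with `with_std=False`, which only removes the mean, and checks that the operation can be undone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 2D dataset: 4 samples (rows), 2 features (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Option 1: subtract the per-column mean with NumPy broadcasting
X_centered = X - X.mean(axis=0)

# Option 2: StandardScaler with with_std=False only removes the mean
centerer = StandardScaler(with_std=False)
X_centered_sk = centerer.fit_transform(X)

# Zero-centering is reversible: the stored mean restores the original data
X_restored = centerer.inverse_transform(X_centered_sk)
```

Both options give the same result; the scaler version is convenient when the same shift must later be applied to new data.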

**$z$-score scaling**

$z$-score or standard scaling scales the data points so that each feature has a mean of 0 and a variance of 1; the scaled values can be negative.
$z$-score standardization is less sensitive to outliers than min-max scaling, but it does not guarantee that all features end up in exactly the same range. Normally, we standardize each feature (column) of the data array independently. The benefit of doing so is that a scaled value tells us how many standard deviations a particular observation's feature value lies from the mean.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

size = 300
mu = [1.0, 1.0]
cov_mat = [[2.0, 0.0],[0.0, 0.8]]
X = np.random.multivariate_normal(mean=mu, cov=cov_mat, size=size)

ss = StandardScaler()
X_ss = ss.fit_transform(X)

x_ss=[]
y_ss=[]

for point in X_ss:
    x_ss.append(point[0])
    y_ss.append(point[1])

plt.scatter(x_ss, y_ss)
plt.title("Standard scaler data")
plt.axhline(0.0, color='r', linestyle='--')
plt.axvline(0.0, color='r', linestyle='--')
plt.show()
```
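
As a quick sanity check, the same transformation can be written by hand. This is a sketch under the assumption that the scaler divides by the population standard deviation (NumPy's default `ddof=0`); the random data are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: 300 samples drawn from a 2D normal distribution
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[1.0, 1.0],
                            cov=[[2.0, 0.0], [0.0, 0.8]],
                            size=300)

# Manual z-score: subtract the column mean, divide by the column standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation with scikit-learn
X_ss = StandardScaler().fit_transform(X)

# Each feature now has (numerically) zero mean and unit standard deviation
feature_means = X_ss.mean(axis=0)
feature_stds = X_ss.std(axis=0)
```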

**Min-Max scaling**

Min-max standardization scales the values of a feature to lie between 0 and 1. With scikit-learn we are not restricted to this interval: the ```feature_range``` parameter of ```MinMaxScaler``` lets us choose other bounds.
Min-max standardization has a harder time dealing with outliers, because a single extreme value determines the minimum or the maximum, so if our data have many outliers, it is generally better to stick with $z$-score standardization. The following snippet shows how to perform min-max scaling with scikit-learn.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

np.random.seed(42)

size = 300
mu = [1.0, 1.0]
cov_mat = [[2.0, 0.0],[0.0, 0.8]]
X = np.random.multivariate_normal(mean=mu, cov=cov_mat, size=size)

mm = MinMaxScaler(feature_range=(-1, 1))
X_mm = mm.fit_transform(X)

x_mm=[]
y_mm=[]

for point in X_mm:
    x_mm.append(point[0])
    y_mm.append(point[1])

plt.scatter(x_mm, y_mm)
plt.title("Min-Max scaler data")
plt.axhline(0.0, color='r', linestyle='--')
plt.axvline(0.0, color='r', linestyle='--')
plt.show()
```

**Range scaling**

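The ```MinMaxScaler``` can also be used to perform range scaling or range compression, i.e. mapping each feature onto an arbitrary interval $[a, b]$ through its ```feature_range``` parameter:

$$\hat{x}_i = a + \frac{\left(x_i - \min X\right)\left(b - a\right)}{\max X - \min X}$$

where $\min X$ and $\max X$ are the smallest and largest values of the feature.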
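
The sketch below illustrates range scaling onto an interval $[a, b]$. It assumes the usual min-max formula, a shift by the feature minimum followed by a rescaling to the target width, and compares a hand-written version with ```MinMaxScaler```. The array `X` and the bounds `a`, `b` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature dataset
X = np.array([[1.0], [3.0], [5.0], [9.0]])

# Target interval [a, b]
a, b = -1.0, 1.0

# Manual range scaling: map the feature minimum to a and the maximum to b
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_manual = a + (X - X_min) * (b - a) / (X_max - X_min)

# The same operation with scikit-learn
X_mm = MinMaxScaler(feature_range=(a, b)).fit_transform(X)
```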

**Robust scaling**

Very often the data we are dealing with contain outliers. Outliers can skew the results of a modeling effort and thus invalidate it.
They should be handled with care and thought, as simply dropping them may also invalidate the modeling approach.
In general terms, an outlier is a data point that lies significantly further away from the other data points. How far
away a point must be to count as an outlier depends on the application.

The data scaling methods from the previous sections are both affected by outliers. Standardization uses each feature's mean and standard deviation, while range scaling uses the maximum and minimum feature values, meaning that they are both susceptible to being skewed by outlier values.

We can scale the data robustly, i.e. limit the influence of outliers, by using the data's median and interquartile range (IQR). Since the median and IQR are percentile measurements of the data (the 50th percentile for the median, the 25th to 75th percentiles for the IQR), they are far less affected by a few extreme values. To scale, we subtract the median from each data value and then divide by the IQR.

In scikit-learn, we perform robust scaling with the ```RobustScaler``` class. It is another transformer object, with the same ```fit```, ```transform```, and ```fit_transform``` methods as the scalers used above.


```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler

np.random.seed(42)

size = 300
mu = [1.0, 1.0]
cov_mat = [[2.0, 0.0],[0.0, 0.8]]
X = np.random.multivariate_normal(mean=mu, cov=cov_mat, size=size)
rs = RobustScaler()
X_rs = rs.fit_transform(X)

x_rs=[]
y_rs=[]

for point in X_rs:
    x_rs.append(point[0])
    y_rs.append(point[1])

plt.scatter(x_rs, y_rs)
plt.title("Robust scaler data")
plt.axhline(0.0, color='r', linestyle='--')
plt.axvline(0.0, color='r', linestyle='--')
plt.show()
```
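
To make the median and IQR computation concrete, here is a small sketch that scales a made-up feature containing one outlier by hand and compares the result with ```RobustScaler``` using its default settings (centering on the median, scaling by the 25th to 75th percentile range).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature dataset; 100.0 plays the role of an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

# Manual robust scaling: subtract the median, divide by the IQR
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
X_manual = (X - median) / (q3 - q1)

# The same operation with scikit-learn
X_rs = RobustScaler().fit_transform(X)
```

The outlier is still far from the other points after scaling, but it did not distort the center and spread used to scale them.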

### Data normalization

So far, each of the scaling techniques we've used has been applied to the data features (i.e. columns). However, in certain cases we want to scale the individual data observations (i.e. rows). For instance, when clustering data we often apply L2 normalization to each row, because the dot product of two L2-normalized rows equals their cosine similarity.
L2 normalization applied to a particular row of a data array divides each value in that row by the row's L2 norm. The L2 norm of a row is just the square root of the sum of squared values for the row. For a row with entries $x_i$, the $L_2$ norm is defined as

$$L_2 = \sqrt{\sum_i x_{i}^{2}}$$
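
In scikit-learn, row-wise normalization is available through the ```Normalizer``` class, see [4]. The following is a minimal sketch on a made-up array; it checks that every row ends up with unit L2 norm and that the dot product of two normalized rows gives their cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical dataset: 3 observations (rows), 2 features (columns)
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

# Divide every row by its own L2 norm
X_l2 = Normalizer(norm='l2').fit_transform(X)

# After normalization every row has unit L2 norm
row_norms = np.linalg.norm(X_l2, axis=1)

# Rows 0 and 2 point in the same direction, so their cosine similarity is 1
cos_02 = X_l2[0] @ X_l2[2]
```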



### Data imputation

The previous sections covered some methods we can use to scale the data. Scaling is important when we deal with features that have different scales.
However, values are often missing from the dataset, and we need methodologies to deal with this problem consistently.
In this section we will discuss such methodologies. Note that the techniques we will discuss are effective when the amount of missing data is small.
How much counts as small is of course subjective and depends on the application.


#### Common imputation methods

Some common imputation methods are

- Using the mean value
- Using the median value
- Using the most frequent value
- Filling in missing values with a constant

The ```SimpleImputer``` class from scikit-learn can be used for this. The code snippets below show how to use this class.
For more details, check the official scikit-learn documentation: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">SimpleImputer</a>.


```python

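import numpy as np
from sklearn.impute import SimpleImputer

# Small example array; np.nan marks the missing entries
data = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imp_median = SimpleImputer(strategy='median')
transformed = imp_median.fit_transform(data)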

```

```python
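import numpy as np
from sklearn.impute import SimpleImputer

# Small example array; np.nan marks the missing entries
data = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imp_constant = SimpleImputer(strategy='constant',
                             fill_value=10)
transformed = imp_constant.fit_transform(data)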
```
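
The remaining two strategies from the list, the mean and the most frequent value, follow the same pattern. The sketch below uses a small made-up array and also reads the fitted fill values from the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical dataset; np.nan marks the missing entries
data = np.array([[1.0, 2.0],
                 [np.nan, 2.0],
                 [1.0, np.nan],
                 [4.0, 8.0]])

# Replace missing values with the column mean
imp_mean = SimpleImputer(strategy='mean')
data_mean = imp_mean.fit_transform(data)

# Replace missing values with the most frequent value of the column
imp_mode = SimpleImputer(strategy='most_frequent')
data_mode = imp_mode.fit_transform(data)

# The fill value learned for each column is stored after fitting
mean_fill = imp_mean.statistics_
mode_fill = imp_mode.statistics_
```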

----
**Remark**

When working with numpy, missing data is typically represented by the ```np.nan``` value.

----

#### End-of-tail imputation

End-of-tail imputation is a special type of arbitrary imputation in which the constant value we use to fill in missing values is based on the distribution of the feature. The value lies at the end of the distribution. Unlike imputing with the mean or median, this method still has the benefit of calling out missing values as being different from the rest of the values, and it has the added benefit that the fill value is derived automatically from the data rather than picked by hand.

- If our variable is normally distributed, our arbitrary value is the mean + 3 × the standard deviation. Using 3 as a multiplier is common but also can be changed at the data scientist’s discretion.
- If our data are skewed, then we can use the IQR (interquartile range) rule to place values at either end of the distribution by adding 1.5 times the IQR (which is the 75th percentile minus the 25th percentile) to the 75th or subtracting 1.5 times the IQR from the 25th percentile.
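
The two rules above can be sketched as follows. The helper `end_of_tail_value` is a hypothetical name introduced for illustration, it only covers the upper end of the distribution, and the multipliers 3 and 1.5 are the common defaults mentioned above.

```python
import numpy as np

def end_of_tail_value(values, skewed=False):
    """Return a fill value at the upper end of the feature's distribution."""
    observed = values[~np.isnan(values)]  # ignore the missing entries
    if skewed:
        # IQR rule: 75th percentile plus 1.5 times the IQR
        q1, q3 = np.percentile(observed, [25, 75])
        return q3 + 1.5 * (q3 - q1)
    # Normal rule: mean plus 3 standard deviations
    return observed.mean() + 3.0 * observed.std()

# Hypothetical feature with two missing values
x = np.array([2.0, 4.0, np.nan, 6.0, 8.0, np.nan])

fill = end_of_tail_value(x)
x_imputed = np.where(np.isnan(x), fill, x)
```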

#### Mode imputation

As with numerical data, there are many ways we can impute missing categorical data. One such method is called most-frequent 
category imputation or mode imputation. As the name suggests, we simply replace missing values with the most common non-missing value.
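
A minimal sketch of mode imputation for a categorical feature is shown next. It assumes the categories are stored as strings in a NumPy array of `dtype=object` with ```np.nan``` marking the missing entries; the color values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical feature with two missing entries
colors = np.array([['red'], ['blue'], [np.nan], ['red'], [np.nan]],
                  dtype=object)

# Fill the gaps with the most frequent non-missing category
imp_mode = SimpleImputer(strategy='most_frequent')
colors_imputed = imp_mode.fit_transform(colors)
```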

#### Arbitrary value imputation

Similar to arbitrary value imputation for numerical values, we can apply this to categorical values 
by either creating a new category, called Missing or Unknown, that the machine learning algorithm will have to learn about or by making an assumption about the missing values and filling in the values based on that assumption.
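
Creating a dedicated category can be sketched with the constant strategy of ```SimpleImputer```. The label `'Missing'` and the example values are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical feature with one missing entry
colors = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)

# Put every missing value into its own 'Missing' category
imp_missing = SimpleImputer(strategy='constant', fill_value='Missing')
colors_flagged = imp_missing.fit_transform(colors)
```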

#### Binning

Binning refers to the act of creating a new categorical (usually ordinal) feature from a numerical or categorical feature. The most common way to bin data is to group numerical data into bins based on threshold cutoffs, similar to how a histogram is created.
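
Threshold-based binning can be sketched with ```np.digitize```. The ages, cutoffs and labels below are made up for illustration.

```python
import numpy as np

# Hypothetical numerical feature
ages = np.array([5, 17, 18, 40, 64, 65, 90])

# Threshold cutoffs: below 18, from 18 up to (not including) 65, 65 and above
thresholds = [18, 65]
labels = np.array(['child', 'adult', 'senior'])

# np.digitize returns, for each value, the index of the bin it falls into
bin_index = np.digitize(ages, thresholds)
age_group = labels[bin_index]
```

Each value is assigned to the bin whose lower cutoff it reaches, so 18 falls into the second bin and 65 into the third.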

The ```SimpleImputer``` object only implements the four imputation methods listed under the common imputation methods above. However, data imputation is not limited to those four methods.

There are also more advanced imputation methods, such as k-Nearest Neighbors (filling in missing values based on similarity scores from the kNN algorithm) and MICE, which applies multiple chained imputations and assumes that the missing values are randomly distributed across observations, see [2].
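
For the kNN approach, scikit-learn provides the ```KNNImputer``` class in ```sklearn.impute```. The sketch below uses a small made-up array and `n_neighbors=2`, so each missing entry is replaced by the average of that feature over the two nearest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical dataset; np.nan marks the missing entries
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Impute each missing value from the 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
X_knn = imputer.fit_transform(X)
```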


## Summary

In this chapter we reviewed several methods for preprocessing the available data. In particular, we reviewed methodologies to scale, normalize and impute the data.
We also saw how to apply these methods to simple datasets using the scikit-learn API.

The <a href="https://scikit-learn.org/stable/api/sklearn.preprocessing.html">sklearn.preprocessing</a> module offers many more techniques than the ones
we touched on here, so there is plenty left to explore.

## References

1. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">SimpleImputer</a>
2. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/">MICE</a>
3. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler">RobustScaler</a>
4. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer">Normalizer</a>

The original, unscaled data used in the examples above can be generated and plotted as follows.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

size = 300
mu = [1.0, 1.0]
cov_mat = [[2.0, 0.0],[0.0, 0.8]]

X = np.random.multivariate_normal(mean=mu, cov=cov_mat, size=size)

x=[]
y=[]

for point in X:
    x.append(point[0])
    y.append(point[1])

plt.scatter(x, y)
plt.title("Original data")
# Horizontal line at the mean of the second feature, vertical line at the mean of the first
plt.axhline(mu[1], color='r', linestyle='--')
plt.axvline(mu[0], color='r', linestyle='--')
plt.show()
```