# Preprocessing

## Introduction

Preprocessing refers to any manipulation of the dataset before it is run through the model.

We have already seen some preprocessing. In the TensorFlow intro we saved the data to an `.npz` file with `np.savez`, and all the training data came from there.

In this section we will focus on data transformation rather than reordering.

We preprocess for a number of reasons:
- Compatibility: TensorFlow takes tensors rather than Excel or CSV files.
- Orders of magnitude: if one input is much larger than another, they have to be rescaled so that each stays relevant relative to the other, especially when they are combined through matrix arithmetic.
- Generalisation: problems that seem different can often be solved by the same models. If you can put the data into a general form, you may be able to reuse models built before.
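As a rough illustration of the compatibility point, the sketch below turns CSV text into numeric arrays and stores them in an `.npz` archive. The column names, the values and the array names `inputs` and `targets` are made up for this example, and an in-memory buffer stands in for a real file.

```python
import io
import numpy as np

# Made-up CSV content standing in for a file on disk.
csv_text = "size,price\n50,200\n80,310\n120,450\n"

# Parse the CSV into a numeric array, skipping the header row.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

inputs = data[:, :1]   # first column, kept 2-D (one row per sample)
targets = data[:, 1]   # second column as a 1-D array

# Save both arrays into one .npz archive (an in-memory buffer here;
# a filename such as "data.npz" would work the same way).
buffer = io.BytesIO()
np.savez(buffer, inputs=inputs, targets=targets)

# Load the archive back and read the arrays by name.
buffer.seek(0)
npz = np.load(buffer)
print(npz["inputs"].shape, npz["targets"].shape)  # (3, 1) (3,)
```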


## Types of Basic Preprocessing

We are often interested in a relative value rather than an absolute one. With stock prices, for example, the percentage change is often more important than the price itself.

Relative metrics are especially useful when we have time-series data.

We can transform these relative changes into logarithms, which have clearer relationships and more homogeneous variance. The resulting values also have lower orders of magnitude, and computations on them are often faster.
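A minimal sketch of both transformations, assuming a made-up price series. One of the "clearer relationships" it shows is that log changes add up over time: their sum equals the log of the overall price ratio.

```python
import numpy as np

# Hypothetical daily closing prices, made up for illustration.
prices = np.array([100.0, 102.0, 99.0, 104.0])

# Relative (percentage) change between consecutive observations.
pct_change = prices[1:] / prices[:-1] - 1.0

# Log change: the logarithm of the ratio of consecutive prices.
log_change = np.diff(np.log(prices))

# Log changes add up across time: their sum equals the log of the
# ratio between the last and the first price.
total_log_change = log_change.sum()

print(pct_change)
print(log_change)
print(total_log_change, np.log(prices[-1] / prices[0]))
```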

## Standardisation

One of the most common problems when working with numerical data is dealing with different magnitudes.

The fix for magnitude issues is standardisation, also called feature scaling and sometimes, loosely, normalisation.

This is the process of transforming data into a standard scale.

A common way of doing this is by subtracting the mean and dividing by the standard deviation:

$\text{standardised variable} = \frac{x - \mu}{\sigma}$

Regardless of the dataset, the result always has a mean of zero and a standard deviation of one, provided the original standard deviation is not zero.
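The formula can be checked with a short NumPy sketch. The data is made up, and the code assumes that each column is one variable and that no column is constant.

```python
import numpy as np

# Made-up data: column 0 is in the thousands, column 1 is a small fraction.
X = np.array([
    [1200.0, 0.02],
    [1500.0, 0.05],
    [900.0, 0.01],
    [1800.0, 0.04],
])

mu = X.mean(axis=0)     # per-column mean
sigma = X.std(axis=0)   # per-column (population) standard deviation

# Assumes no column is constant, otherwise sigma would be 0.
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

After the transformation both columns are on the same scale, even though they originally differed by several orders of magnitude.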

Normalisation can refer to several methods. One of them rescales each vector (for example, each row of a data matrix) to unit length by dividing it by its L1 or L2 norm.
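A small sketch of norm-based normalisation, assuming each row of a made-up matrix is one sample and that no row is all zeros:

```python
import numpy as np

# Made-up data: each row is one sample.
X = np.array([
    [3.0, 4.0],
    [1.0, -1.0],
])

# L2 norm (Euclidean length) and L1 norm (sum of absolute values) per row.
l2 = np.linalg.norm(X, ord=2, axis=1, keepdims=True)
l1 = np.linalg.norm(X, ord=1, axis=1, keepdims=True)

# Assumes no row is all zeros, otherwise the norm would be 0.
X_l2 = X / l2
X_l1 = X / l1

print(X_l2)  # first row becomes [0.6, 0.8]
print(X_l1)  # first row becomes [3/7, 4/7]
```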

Another method is PCA (principal components analysis), a dimension-reduction technique that combines several variables into a smaller number of latent variables. For example, if you have data on religion, voting history and upbringing, you can combine them into a single value that might represent attitude towards immigration, typically rescaled to have a mean of zero and a standard deviation of one.
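The idea can be sketched with NumPy alone, using a singular value decomposition of the centred data. The data here is synthetic: three observed variables are generated from one hidden factor plus noise, so one component should capture almost all of the variance. This is one possible way to compute PCA, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: three observed variables driven by one hidden factor plus noise.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, 0.6]]) + 0.1 * rng.normal(size=(200, 3))

# Centre each column, then take the SVD of the centred data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of the total variance captured by each principal component.
explained = S**2 / np.sum(S**2)

# Project onto the first principal component: one latent value per row.
scores = Xc @ Vt[0]

# Rescale the scores to mean 0 and standard deviation 1.
scores_std = (scores - scores.mean()) / scores.std()

print(explained)
```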

Whitening is another technique sometimes used for preprocessing. It is often applied after PCA, and it removes most of the underlying correlations between the variables. Whitening can be useful when the underlying data should be uncorrelated but that is not reflected in the observations.
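A minimal sketch of PCA-based whitening on synthetic, deliberately correlated data. It assumes all eigenvalues of the covariance matrix are clearly positive; other whitening variants exist and are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up correlated data: the second column depends on the first.
a = rng.normal(size=500)
b = 0.9 * a + 0.3 * rng.normal(size=500)
X = np.column_stack([a, b])

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigendecomposition of the covariance matrix (the PCA step).
eigvals, eigvecs = np.linalg.eigh(cov)

# Rotate onto the principal axes, then rescale each axis to unit variance.
# Assumes all eigenvalues are clearly positive (no degenerate directions).
X_white = (Xc @ eigvecs) / np.sqrt(eigvals)

print(np.round(np.cov(X, rowvar=False), 3))        # off-diagonals far from 0
print(np.round(np.cov(X_white, rowvar=False), 3))  # approximately the identity
```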

There are many more techniques, and the right strategy is problem-specific.

## Preprocessing Categorical Data

Most of what we have seen so far are examples of numerical data. Often we must also deal with categorical data, i.e. groups or categories, such as cat or dog. An ML algorithm only takes numbers, so we need to be able to convert cat or dog to a number, or a tensor.

How do you convert categories to numbers? One solution is to say cat = 1, dog = 2. Unfortunately this implies that there is some order between the categories, which is not true: 2 × cat is not a dog.

So how do we encode categories in a way that is useful for an ML algorithm?

There are two main methods here:
- one-hot encoding
- binary encoding

## Binary and One-Hot Encoding

Binary encoding starts by assigning an arbitrary number to each category. You then convert those numbers into binary: 1 = 01, 2 = 10, 3 = 11. Each binary digit then becomes a variable, so number 1 has a 0 for var1 and a 1 for var2, number 2 has a 1 for var1 and a 0 for var2, and number 3 has a 1 for both var1 and var2.

There are still some implied correlations between them, however. For instance, 2 and 3 seem correlated on var1, 1 and 3 seem correlated on var2, and 1 and 2 seem to be the opposite of each other.

Binary encoding is therefore an improvement over plain numbering of the categories, but not a great one.
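The binary scheme described above can be sketched in plain Python. The category names, their order and the helper `binary_encode` are hypothetical, chosen only for illustration.

```python
# Hypothetical categories, numbered from 1 in an arbitrary order.
categories = ["cat", "dog", "fish"]
codes = {name: i for i, name in enumerate(categories, start=1)}

# Number of binary digits needed to write the largest code.
n_bits = max(codes.values()).bit_length()


def binary_encode(name):
    """Return the category's code as a list of 0/1 digits, var1 first."""
    return [int(bit) for bit in format(codes[name], f"0{n_bits}b")]


for name in categories:
    print(name, binary_encode(name))
# cat [0, 1]
# dog [1, 0]
# fish [1, 1]
```

The output shows the implied relationships: the second and third categories share a 1 in var1, and the first and third share a 1 in var2.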

One-hot encoding is simple and widely adopted. It starts by creating as many columns as there are possible values, e.g. for 3 products there must be 3 variables, or columns. Each product then gets a 1 in the column that represents it and a 0 in all the others. This means the products are uncorrelated and unambiguous.
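A plain-Python sketch of one-hot encoding for three products. The product names and the helper `one_hot` are hypothetical.

```python
# Hypothetical product names, used only for illustration.
products = ["chair", "table", "lamp"]
index = {name: i for i, name in enumerate(products)}


def one_hot(name):
    """Return a list with a 1 in the product's own column and 0 elsewhere."""
    row = [0] * len(products)
    row[index[name]] = 1
    return row


for name in products:
    print(name, one_hot(name))
# chair [1, 0, 0]
# table [0, 1, 0]
# lamp [0, 0, 1]
```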

The problem with one-hot encoding is that it requires a lot of new variables. For example, IKEA offers 12,000 products, and we would not want to include 12,000 extra variables in our models. With binary encoding the same problem needs only 14 variables, since 14 binary digits can represent up to 16,384 values. The number of binary variables grows only logarithmically with the number of categories, so it is far lower than the requirement for one-hot encoding. In these circumstances binary encoding is usually the practical choice, even though it introduces unnecessary correlations between some products.
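The column counts can be checked directly. This sketch assumes the categories are numbered from 1 and computes the smallest number of binary columns that can hold the largest number. Both function names are hypothetical.

```python
def one_hot_columns(n_categories):
    """One-hot encoding needs one column per category."""
    return n_categories


def binary_columns(n_categories):
    """Binary digits needed when categories are numbered 1..n_categories."""
    return n_categories.bit_length()


n = 12_000
print(one_hot_columns(n))  # 12000
print(binary_columns(n))   # 14, since 2**13 = 8192 < 12000 < 2**14 = 16384
```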










