# Data preprocessing

One of the most important parts of machine learning is feature engineering and data preprocessing. Why distinguish between the two? Feature engineering is a much broader topic than the data preprocessing we will be discussing here. While feature engineering might include creating and deleting certain features, filling in missing values and dealing with outliers, to name a few, data preprocessing transforms the final features in some mathematical way, so that they are easier for our model to digest.

Some models, like K-Nearest Neighbours or SVM, require the data to be standardized, while models like Decision Trees or Random Forest can work on data that has only been cleaned.

In [1]:
# let's import the tools
import numpy as np # for linear algebra and some additional transformations
import pandas as pd # for some examples
import seaborn as sns # for dataset

# preprocessing tools from sklearn
from sklearn.preprocessing import StandardScaler, LabelEncoder, MaxAbsScaler, MinMaxScaler, Normalizer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

In [2]:
# let's load the dataset for examples
df = sns.load_dataset("diamonds")
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


## Data leakage
Data leakage is an issue that may occur when we build machine learning models: the model is fed data that encodes information it should not have at training time, such as the target itself or statistics (spread, typical values etc.) of data outside the training set.

But why worry about data leakage? Depending on its kind, leakage may be harmless to our model, but it can also lead to overfitting, with the model memorising the predictions instead of learning patterns. It might also lead to some features being excluded or having their importance diminished, even though they could reveal additional patterns in the data.

We can distinguish a few types of data leakage:
- **Target leakage** - in which we leak the predicted value into our training and testing data. This might be a result of:
    - Badly done target encoding;
    - Wrong recognition of the provided features during EDA - some of the features might be results of our predictions, not features leading to our prediction;
- **Train-Test contamination** - where we preprocess on the whole dataset. In other words - we do not want the model to learn anything from the testing and validation sets (more on this split in the next notebook). Otherwise, once we feed it brand new information, we will get worse results. This might be:
    - Scaling numeric data on the whole dataset; 
    - Filling the NaN values with any kind of center metrics (e.g. mean, median);
    - Target encoding using mean/aggregation values;
    - Any feature engineering that uses whole dataset aggregation.
    
Any data scaling must be fitted on the training split only and then applied to the test split, whereas **one-hot** and **label encoding** of string values can be done at any point (either before or right after the split).
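
As a minimal sketch of avoiding train-test contamination when scaling, the snippet below learns the scaling statistics on the training split and reuses them on the test split. The data are synthetic and purely illustrative (an assumption, not the diamonds dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from train only
X_test_scaled = scaler.transform(X_test)        # the same statistics reused on test

# The training split is centred on 0; the test split only approximately so
print(X_train_scaled.mean(), X_test_scaled.mean())
```

The test split is never passed to `fit`, so nothing about its spread reaches the scaler.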

### How to prevent data leakage?
To avoid data leakage and ensure our model behaves as similarly as possible on any given data (whether training data or brand new data), we should apply, or at least check, the following:
1. Check the data manually - deep EDA and understanding are needed. We should examine every feature that is highly correlated with our target (is it a cause or a result?) and any features that are in some way co-dependent;
2. Split the data into train and test (and validation) sets and use cross-validation to check how the model behaves on different set-ups - if the cross-validation scores differ greatly each time, we might be dealing with data leakage;
3. Minimize the number of features going into the model by using feature selection and extraction techniques (more on them in the third notebook);
4. Use **pipelines** to automate the process, so that the data are scaled at the right time and not in advance. Pipelines are also really handy!
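
Points 2 and 4 can be combined in a short sketch: with a pipeline inside cross-validation, the scaler is refitted on the training part of every fold. The dataset comes from `make_classification` and the choice of KNN is only an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (hypothetical, for illustration only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Scaling is a step of the model, so it is fitted per fold, never on held-out rows
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold
print(scores.mean(), scores.std())
```

A large spread between the fold scores would be a reason to look closer at the data and the preprocessing.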

> **NOTE!** In the following part, we will not worry about data leakage while preprocessing the data, as these are only examples. The appropriate way to prevent data leakage will be shown in the third notebook.

## String encoding preprocessing
Machine Learning models generally cannot process string or object data types directly. Therefore we need to transform our string or object data into a numeric representation. I already covered this using different methods in my data analytics notes on my GitHub.

Scikit-Learn also provides options for encoding string data, namely one-hot and label encoding. This notebook works through both methods.

### One-hot encoding with sklearn
As written in the mentioned notes, one-hot encoding creates additional columns in which the label data are encoded in 0 (is not) and 1 (is). This creates an additional, often sparse table, attached at the end of our dataset. 

It works well for binary labels (Yes or No questions) and for string/object columns without many unique values. The more labels there are, the larger our table gets; it takes up more space, which leads to longer training. As a rule of thumb, it is best used when there are up to 10-ish unique labels to encode.

In [3]:
# How we use the OneHotEncoder from sklearn
onehot = OneHotEncoder(handle_unknown='ignore', # categories not seen during fit are encoded as all zeros instead of raising an error
                       drop='first', # this way we will have one less column
                       dtype='uint8' # this would be the smallest in terms of memory
                      )

X = onehot.fit_transform(df[['color']]).toarray() # we need to put this into array
onehot.categories_ # so we see the categories - remember, we dropped the first one

[array(['D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype=object)]

Now that we have created an array of 0s and 1s, we can attach it to the original dataframe. For that, however, we need to create new columns and know the label for each. That's why I used `onehot.categories_` above, to know the order in which these were encoded. But we can use slicing as well 😉

In [4]:
df[onehot.categories_[0][1:]] = X # this way we add this to the df
# we could also immediately assign the one-hot columns to the table
df.head(1)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,E,F,G,H,I,J
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,1,0,0,0,0,0


In [5]:
# since we won't be using the color feature anymore, we can drop it
df = df.drop(['color'], axis=1)

Now, we might wonder what the difference is between `OneHotEncoder` and the `pandas.get_dummies()` function, which is also used for one-hot encoding of categorical data.
- **pandas.get_dummies()** creates an additional dataframe, which we can immediately append to the original dataframe;
- **OneHotEncoder()** creates an array and is useful when creating pipelines.
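
A small side-by-side sketch of the two approaches, on a tiny made-up column (the `toy` dataframe is hypothetical). With the first category dropped in both, the resulting 0/1 values should match.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up column, for illustration only
toy = pd.DataFrame({"color": ["D", "E", "F", "E", "D"]})

# pandas: returns a labelled dataframe straight away
dummies = pd.get_dummies(toy["color"], drop_first=True, dtype=np.uint8)

# sklearn: returns a (sparse) array; the labels live in encoder.categories_
encoder = OneHotEncoder(drop="first", dtype=np.uint8)
encoded = encoder.fit_transform(toy[["color"]]).toarray()

print(dummies.columns.tolist())
print(np.array_equal(dummies.to_numpy(), encoded))
```

The fitted `encoder` can later be reused with `transform` on new data, which is what makes it convenient inside pipelines.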

### Label encoding with sklearn
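While there are at least a few ways to create label encoding using pandas and standard Python, the sklearn library provides a dedicated class for this operation: `LabelEncoder()`. Note that the scikit-learn documentation intends `LabelEncoder` for target values; `OrdinalEncoder` is its counterpart for feature columns.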

In [6]:
# How to use LabelEncoder()
labeler = LabelEncoder() # here we do not need to define anything inside

df['cut'] = labeler.fit_transform(df['cut']) # notice that there is no double square bracket
df.head(1)

Unnamed: 0,carat,cut,clarity,depth,table,price,x,y,z,E,F,G,H,I,J
0,0.23,2,SI2,61.5,55.0,326,3.95,3.98,2.43,1,0,0,0,0,0


In [7]:
# if we would like to check what are the representations of each category
cut_labels = dict(zip(labeler.transform(labeler.classes_), labeler.classes_)) # create a dictionary
cut_labels

{0: 'Fair', 1: 'Good', 2: 'Ideal', 3: 'Premium', 4: 'Very Good'}
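
For comparison, here is a sketch of one pandas way to get the same codes, using a categorical dtype, together with `inverse_transform` to go back from codes to labels. The `cuts` series is a small made-up sample, not the full column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small made-up sample of cut labels, for illustration only
cuts = pd.Series(["Ideal", "Premium", "Good", "Premium", "Fair"])

le = LabelEncoder()
sk_codes = le.fit_transform(cuts)                        # classes sorted alphabetically
pd_codes = cuts.astype("category").cat.codes.to_numpy()  # categories sorted the same way

print(np.array_equal(sk_codes, pd_codes))
print(le.inverse_transform(sk_codes))  # back to the original strings
```

Both approaches assign codes in alphabetical order, which does not reflect the quality order of the cuts.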

In [8]:
# since we will be using this dataset in the following examples, we might as well encode the clarity
labeler1 = LabelEncoder()

# encode the data
df['clarity'] = labeler1.fit_transform(df['clarity']) 

# create a dictionary
clarity_labels = dict(zip(labeler1.transform(labeler1.classes_), labeler1.classes_))
print(clarity_labels)

#display changes
df.head(1)

{0: 'I1', 1: 'IF', 2: 'SI1', 3: 'SI2', 4: 'VS1', 5: 'VS2', 6: 'VVS1', 7: 'VVS2'}


Unnamed: 0,carat,cut,clarity,depth,table,price,x,y,z,E,F,G,H,I,J
0,0.23,2,3,61.5,55.0,326,3.95,3.98,2.43,1,0,0,0,0,0


## Numeric scaling preprocessing
Now that we have handled some of the string data, we need to talk about the numeric data and why it is important to often, if not always, scale them for machine learning algorithms. Many algorithms rely on distances or gradients. If the values are widely spread out within one feature and narrow within another, the narrowly spread feature may end up playing a diminished role.

To avoid this, we often scale our data so that all features share one scale and therefore have the same starting point to affect the target. Gradient-based models also tend to converge faster on scaled data. A few models that need data scaling are KNN and SVM. Neural Networks also generally need scaled inputs to train well.

It is not necessary to scale the data we encoded in the previous part; however, if a label-encoded column has a lot of categories, we might want to scale it a bit, so that instead of range(0, 20) the range would be (0, 4) with float values.

### Standard Scaler from Sklearn
Standard Scaler is one of the most commonly used scaling algorithms. It repositions each value relative to the mean and standard deviation of its column: the mean maps to 0, values smaller than the mean end up below 0 and larger values above 0, at a distance measured in standard deviations. Values more than about 3 standard deviations from the mean in either direction are often treated as outliers. `StandardScaler` itself always works with the mean and standard deviation; to scale with the median and a more robust measure of spread, we need a different scaler, such as the `RobustScaler` covered later in this notebook.

Standard scaler does not change the shape of the distribution: it is a linear transformation that only shifts and rescales the data, so skewed data stay skewed and are not turned into a normal distribution. It is best to use StandardScaler on data that is already more or less normally distributed.
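
The arithmetic behind StandardScaler can be checked by hand: subtract the column mean and divide by the column standard deviation. The array `x` below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single column, for illustration only
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# z = (x - mean) / std, with the population standard deviation (ddof=0, numpy's default)
manual = (x - x.mean()) / x.std()
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1, while the relative spacing between the values stays the same.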

> **Attention!**  
> For this notebook we will scale each column differently. In a real-life scenario we would rather scale all the data in the same way, and sometimes stack a few scaling methods on top of each other, rather than scaling each numerical column in a different way.

In [9]:
# because I want to save the encoding for future sets:
df1 = df.copy()

In [10]:
# let's use standard scaler on the depth
scaler = StandardScaler(with_mean=True, # True centers the data on the mean; False skips centering
                       with_std=True # True divides by the standard deviation; False skips this step
                       )

X = scaler.fit_transform(df1[['depth']]) # transformation
df1['depth'] = X # assigning the value

# if we are scaling the whole dataset, we can omit the assignment and create X_scaled for the whole dataframe

df1.head(1)

Unnamed: 0,carat,cut,clarity,depth,table,price,x,y,z,E,F,G,H,I,J
0,0.23,2,3,-0.174092,55.0,326,3.95,3.98,2.43,1,0,0,0,0,0


### Maximum Absolute Scaler from Sklearn
MaxAbsScaler is another common scaling technique. It takes the absolute values of the data (especially important to remember when we are dealing with both positive and negative values) and uses the largest one as the reference. It then divides every original value by this maximum absolute value. If the value with the largest magnitude was originally negative, it ends up as -1, and if it was positive, it ends up as 1. All the remaining values fall within the range [-1, 1].

MaxAbsScaler does not alter the distribution of data from the original distribution. It is good to use when we want to preserve the proportions among the data.
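
A quick hand calculation of the same idea, on a made-up column that mixes negative and positive values:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up column with negative and positive values, for illustration only
x = np.array([[-8.0], [-2.0], [0.0], [4.0]])

manual = x / np.abs(x).max()  # divide by the largest absolute value (here 8)
scaled = MaxAbsScaler().fit_transform(x)

print(scaled.ravel())
print(np.allclose(manual, scaled))
```

The value with the largest magnitude (-8) lands on -1, zero stays at zero, and the ratios between the values are preserved.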

In [11]:
# now let's use the MaxAbsScaler on table
maxabs = MaxAbsScaler() # no need to define anything inside

X = maxabs.fit_transform(df1[['table']])
df1['table'] = X # assigning the value

df1.head(1)

Unnamed: 0,carat,cut,clarity,depth,table,price,x,y,z,E,F,G,H,I,J
0,0.23,2,3,-0.174092,0.578947,326,3.95,3.98,2.43,1,0,0,0,0,0


### Robust Scaler from Sklearn
RobustScaler is yet another popular scaling method, commonly used when dealing with outliers in the data being scaled (we don't have them in our current dataset, but don't tell them about it). Robust scaler is in a way similar to StandardScaler, but it uses different measures of center and spread. For the transformation, it subtracts the median from the value and then divides it by the interquartile range (IQR).

RobustScaler keeps the shape of the distribution the same as the original, since it only shifts and rescales the values.
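
A short sketch of the median and IQR arithmetic, on a made-up column with one obvious outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up column with one extreme value, for illustration only
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x)                    # 3.0
q1, q3 = np.percentile(x, [25, 75])      # 2.0 and 4.0
manual = (x - median) / (q3 - q1)        # (x - median) / IQR

scaled = RobustScaler().fit_transform(x)  # default quantile_range is (25.0, 75.0)

print(scaled.ravel())
print(np.allclose(manual, scaled))
```

The outlier stays far away from the rest, but it barely influences the median and IQR, so the ordinary values keep a sensible scale.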

In [12]:
# for the Robust Scaler let's use the carat column
robust = RobustScaler() # we can set up a few things in here, but I won't

X = robust.fit_transform(df1[['carat']])
df1['carat'] = X # assigning the value

df1.head(1)

Unnamed: 0,carat,cut,clarity,depth,table,price,x,y,z,E,F,G,H,I,J
0,-0.734375,2,3,-0.174092,0.578947,326,3.95,3.98,2.43,1,0,0,0,0,0


### Min Max Scaler from Sklearn
MinMaxScaler is also one of the key scaling methods a data scientist needs to learn. It is quite common thanks to its narrow default range [0, 1] and is simple to understand. It takes each value from our column, subtracts the minimum value of the column and divides the result by the difference between the maximum and the minimum value of the column.

MinMaxScaler does not alter the distribution of the data.
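
The same formula written out by hand, on a made-up column, together with a custom `feature_range` for comparison:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up column, for illustration only
x = np.array([[2.0], [4.0], [6.0], [10.0]])

manual = (x - x.min()) / (x.max() - x.min())  # (x - min) / (max - min)
scaled = MinMaxScaler().fit_transform(x)      # default feature_range=(0, 1)

# A different target range stretches the same result
scaled_wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(scaled.ravel())
print(scaled_wide.ravel())
```

The minimum always maps to the lower bound of the range and the maximum to the upper bound.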

In [13]:
# for the MinMaxScaler, let's create a new column: cube (we could also scale the x, y and z parameters directly, all in the same way)
df['cube'] = df['x'] * df['y'] * df['z']
df = df.drop(['x', 'y', 'z'], axis=1)
df1['cube'] = df['cube']

In [14]:
minmax = MinMaxScaler(feature_range=(0,1), # this way we can assign the range ourselves!
                     )

X = minmax.fit_transform(df1[['cube']])
df1['cube'] = X # assigning the value
df1.head(1)

Unnamed: 0,carat,cut,clarity,depth,table,price,x,y,z,E,F,G,H,I,J,cube
0,-0.734375,2,3,-0.174092,0.578947,326,3.95,3.98,2.43,1,0,0,0,0,0,0.009947


In [15]:
# At this point our dataset is ready to be used in the following examples, so we can move on with it
df.to_csv('Diamonds_encoded.csv', index=False)

### Honorable mentions
There are of course different encoding and scaling methods that have not been discussed here, since we ran out of features to scale :lol:

For the encoding, the most common method besides these two would be `target encoding`, which in itself has a few variants. It encodes each category using an aggregate (e.g. the mean) of the target values observed for that category. It could however lead to data leakage, so as tempting as it is, we must encode these data thoughtfully.
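
A very rough sketch of mean target encoding, assuming the simplest variant: category means are computed on the training rows only and then mapped onto the test rows. The `train` and `test` frames and their values are made up for illustration.

```python
import pandas as pd

# Made-up training and test rows, for illustration only
train = pd.DataFrame({
    "cut": ["Good", "Ideal", "Good", "Ideal", "Fair"],
    "price": [300, 500, 340, 520, 250],
})
test = pd.DataFrame({"cut": ["Ideal", "Good", "Premium"]})

# The statistics come from the training split only
cut_means = train.groupby("cut")["price"].mean()
global_mean = train["price"].mean()

train["cut_encoded"] = train["cut"].map(cut_means)
# Categories never seen in training fall back to the overall training mean
test["cut_encoded"] = test["cut"].map(cut_means).fillna(global_mean)

print(test)
```

Even this version lets each training row see its own target value, so in practice smoothing or out-of-fold encoding is often used on top of it.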

As for scaling, there are a few more tools worth mentioning that might come in handy; perhaps I will extend this note in the future with the following:
- `sklearn.preprocessing.Normalizer`
- `sklearn.preprocessing.PowerTransformer`
- `sklearn.preprocessing.QuantileTransformer`
- `np.log`
- `scipy.stats transformations`
- `feature-engine transformations`
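
As a brief taste of two items from this list, the sketch below applies `Normalizer` (which rescales each row, not each column, to unit length) and a log transform. The arrays are made up, and `np.log1p` is used here as a zero-safe variant of `np.log`.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up rows, for illustration only
rows = np.array([[3.0, 4.0], [1.0, 0.0]])
unit_rows = Normalizer(norm="l2").fit_transform(rows)  # each row divided by its L2 norm
print(unit_rows)

# Made-up skewed values; log1p(x) = log(1 + x) compresses the large ones
prices = np.array([326.0, 2401.0, 18823.0])
log_prices = np.log1p(prices)
print(log_prices)
```

Unlike the scalers above, the log transform is not linear, so it does change the shape of the distribution.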