# Data pre-processing
In this notebook we look at some transformations that are commonly applied to a data set before training a model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
print("numpy version: %s"%np.__version__)
print("pandas version: %s"%pd.__version__)

numpy version: 1.23.1
pandas version: 1.4.3


In [130]:
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2'],
                   ['blue', 'L', 14.5, 'class2']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

| | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class2 |
| 1 | red | L | 13.5 | class1 |
| 2 | blue | XL | 15.3 | class2 |
| 3 | blue | L | 14.5 | class2 |


## Transforming categorical values
Categorical values given as strings often have to be converted to integers before a model can use them. One way is to define a mapping from each string value of a feature, e.g. color, to an integer value.

In [131]:
color_mapping = {'green': 3,
                'red': 2,
                'blue': 1}

df['color'] = df['color'].map(color_mapping)
df

| | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | 3 | M | 10.1 | class2 |
| 1 | 2 | L | 13.5 | class1 |
| 2 | 1 | XL | 15.3 | class2 |
| 3 | 1 | L | 14.5 | class2 |


We can convert the integers back to strings by applying the map() method again with the inverted mapping, or equivalently by applying an anonymous function that performs the same lookup.

In [132]:
inv_color_mapping = {v: k for k, v in color_mapping.items()}
colors = df['color']
#df['color'] = colors.apply(lambda color: inv_color_mapping[color])
df['color'] = colors.map(inv_color_mapping)
df

| | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class2 |
| 1 | red | L | 13.5 | class1 |
| 2 | blue | XL | 15.3 | class2 |
| 3 | blue | L | 14.5 | class2 |
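
The same map() approach is a natural fit for a feature whose values do have an order, such as size. The snippet below is a minimal sketch: the name `size_mapping`, the new column `size_code` and the chosen integers are assumptions made for illustration, not part of the original notebook. Note that map() returns NaN for any value missing from the dictionary, so it is worth counting unmapped entries.

```python
import pandas as pd

# Rebuild the small example DataFrame so the snippet is self-contained
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2'],
                   ['blue', 'L', 14.5, 'class2']],
                  columns=['color', 'size', 'price', 'classlabel'])

# Hypothetical ordinal mapping: the integers follow the order M < L < XL
size_mapping = {'M': 1, 'L': 2, 'XL': 3}

# Store the codes in a new column (illustrative name) and keep the original
df['size_code'] = df['size'].map(size_mapping)

# map() yields NaN for values that are not keys of the dictionary
unmapped = df['size_code'].isna().sum()
print(df[['size', 'size_code']])
print("unmapped values:", unmapped)
```

Here the integer order carries real information, which is not the case for color.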


## One-hot encoding
Mapping strings to integers can cause a problem because integers are ordered. While red > blue makes no sense, 2 > 1 does, and a machine learning algorithm may wrongly interpret the codes as an ordering of the colors. A solution is to represent each distinct value as a new binary feature of the data set. If there are N distinct values we need only N-1 new features. For the three colors in our example two new features are enough, since the third can be inferred from the other two: a row with color_green = 0 and color_red = 0 must be blue. Dropping one column avoids adding features that are not independent of the others, which can cause problems when a matrix has to be inverted: the determinant of a matrix with linearly dependent columns is zero, and it is close to zero when two columns are nearly collinear.

In [134]:
pd.get_dummies(df, columns=['color'], drop_first=True)

| | size | price | classlabel | color_green | color_red |
|---|---|---|---|---|---|
| 0 | M | 10.1 | class2 | 1 | 0 |
| 1 | L | 13.5 | class1 | 0 | 1 |
| 2 | XL | 15.3 | class2 | 0 | 0 |
| 3 | L | 14.5 | class2 | 0 | 0 |
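
The dependence between one-hot columns can be checked numerically. The sketch below assumes a design matrix with an intercept column, as in a linear model; the names `X_full` and `X_reduced` are illustrative. With all three color columns the intercept equals their sum, so the matrix does not have full column rank; with `drop_first=True` it does.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['green', 'red', 'blue', 'blue'], name='color')

# dtype=float keeps the result numeric regardless of the pandas version
full = pd.get_dummies(colors, dtype=float)                      # blue, green, red
reduced = pd.get_dummies(colors, drop_first=True, dtype=float)  # green, red

# Assumed setting: a column of ones (intercept) next to the dummy columns
intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full.to_numpy()])        # 4 columns
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

A rank lower than the number of columns means that the matrix X^T X used by least squares is singular and cannot be inverted.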


## The Wine data set
We will use the [Wine](https://archive.ics.uci.edu/ml/datasets/wine) data set from UCI for the next sections of this notebook.

In [140]:
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
wine_df.shape

(178, 14)

In [149]:
wine_df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']
wine_df.head(3)

| | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |


## Data partition
We split the Wine data set into two parts: 70% for training and 30% for testing. The stratify=y argument keeps the class proportions the same in both parts.

In [152]:
from sklearn.model_selection import train_test_split

X, y = wine_df.iloc[:, 1:].values, wine_df.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
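
The effect of a stratified split can be verified by comparing class proportions. The sketch below is an illustration under one assumption: it loads the copy of the Wine data bundled with scikit-learn (`load_wine`, whose labels are 0, 1, 2 instead of 1, 2, 3) so that it runs without a download. The helper `class_fractions` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: use the Wine copy shipped with scikit-learn (no network needed)
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

def class_fractions(labels):
    """Return the fraction of samples that belongs to each class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

print("sizes:", len(X_train), len(X_test))
print("all:  ", class_fractions(y).round(3))
print("train:", class_fractions(y_train).round(3))
print("test: ", class_fractions(y_test).round(3))
```

The three rows of fractions should agree up to the rounding imposed by integer class counts.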

## Normalization and standardization
Features in a data set usually have different ranges of values. Many learning algorithms work better when the features are on a comparable scale, so we normalize or standardize them first. Normalization (min-max scaling) rescales the values of a feature to the range [0, 1] according to the rule

$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$$

In [162]:
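# X does not contain the class label, so X[:, 1] is the 'Malic acid' column
# ('Alcohol' is X[:, 0]); the values below are those of 'Malic acid'
alcohol = X[:, 1]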
alcohol_min = alcohol.min()
alcohol_max = alcohol.max()
alcohol_min, alcohol_max

(0.74, 5.8)

In [164]:
alcohol_norm = (alcohol - alcohol_min) / (alcohol_max - alcohol_min)
alcohol_norm[:3]

array([0.1916996, 0.2055336, 0.3201581])
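
In practice the minimum and maximum are usually estimated on the training set only and then reused for the test set. The sketch below shows this with scikit-learn's `MinMaxScaler`; it is not part of the original notebook and assumes the Wine copy bundled with scikit-learn (`load_wine`) so that it runs without a download. It also checks one column against the formula above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: use the Wine copy shipped with scikit-learn (no network needed)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)  # learns min and max per feature
X_test_norm = scaler.transform(X_test)        # reuses the training min and max

# Manual min-max formula for one column of the training set
col = X_train[:, 1]
manual = (col - col.min()) / (col.max() - col.min())
print(np.allclose(X_train_norm[:, 1], manual))
```

Because the test set is scaled with the training statistics, its values may fall slightly outside [0, 1].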

Standardization rescales a feature so that it has zero mean and unit standard deviation, according to the rule

$$x_{std}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$

In [168]:
alcohol_mean = alcohol.mean()
alcohol_sigma = alcohol.std()
alcohol_mean, alcohol_sigma

(2.3363483146067416, 1.1140036269797893)

In [169]:
alcohol_standard = (alcohol - alcohol_mean) / alcohol_sigma
alcohol_standard[:3]

array([-0.5622498 , -0.49941338,  0.02123125])
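
The same transformation is available in scikit-learn as `StandardScaler`. The sketch below is not part of the original notebook and assumes the Wine copy bundled with scikit-learn (`load_wine`). As with normalization, the mean and standard deviation are learned on the training set and reused for the test set. Both NumPy's std() and `StandardScaler` use the population standard deviation (ddof=0) by default, so the manual formula gives the same values.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: use the Wine copy shipped with scikit-learn (no network needed)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns mean and sigma per feature
X_test_std = scaler.transform(X_test)        # reuses the training mean and sigma

# Manual standardization of one training column (np.std uses ddof=0)
col = X_train[:, 1]
manual = (col - col.mean()) / col.std()
print(np.allclose(X_train_std[:, 1], manual))
```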