# Dataset preprocessing

Data preprocessing refers to the process of preparing the raw data and making it suitable for a machine learning model. It is the first and crucial step during the construction of a machine learning model.

The vast majority of the real-world datasets do not contain clean and formatted data. More specifically, a dataset generally includes noisy features, and/or missing values. Moreover, it may be in a format that prevents it from being applied directly to a machine learning algorithm. Examples of such formats are raw text and raw images.

Consequently, the data must be cleaned, augmented, and transformed into a supported (vectorized) format. This not only makes the data usable, but can also improve the accuracy and efficiency of a machine learning model.

In this notebook we examine the first problem, namely, data cleaning and augmentation. In the next notebook (MLLAB-13) we deal with the second problem, that is, the transformation of the data into an appropriate format.


In [1]:
import sys
import numpy as np
import pandas as pd
from numpy import genfromtxt
import matplotlib.pyplot as plt

DATASET_LOCATION = "datasets/Processed_NASDAQ.csv"


## Loading the dataset into a pandas dataframe

The [UCI machine learning repository](https://archive.ics.uci.edu/ml/) contains numerous datasets in the form of redistributable CSV files. One of them is [CNNpred](https://archive.ics.uci.edu/ml/datasets/CNNpred%3A+CNN-based+stock+market+prediction+using+a+diverse+set+of+variables), a collection of daily features of the US S&P 500, NASDAQ, Dow Jones, RUSSELL 2000 and NYSE stock indices from 2010 to 2017. It covers features from various categories: technical indicators, futures contracts, commodity prices, major market indices around the world, the prices of large companies in the US market, etc. A detailed description of the features is given in the paper entitled ["CNNpred: CNN-based stock market prediction using a diverse set of variables"](https://www.sciencedirect.com/science/article/abs/pii/S0957417419301915).

In the following example we load the `Processed_NASDAQ.csv` file (containing the NASDAQ index data) into a [pandas](https://pandas.pydata.org/) dataframe.


In [2]:
# Load the input CSV file into a Pandas dataframe
df = pd.read_csv(DATASET_LOCATION, sep=',')

# The head(n) command prints the n first rows of the dataframe
print("Dataframe shape:", df.shape)
df.head(10)


Dataframe shape: (1984, 84)


Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,...,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
0,2009-12-31,2269.149902,,,,,,,,,...,0.03,0.26,-1.08,-1.0,-0.11,-0.08,-0.06,-0.48,0.3,0.39
1,2010-01-04,2308.419922,0.560308,0.017306,,,,,,,...,1.52,3.26,1.61,1.62,-0.57,-0.59,-0.42,3.12,3.91,2.1
2,2010-01-05,2308.709961,0.225994,0.000126,0.017306,,,,,,...,-0.07,1.96,-0.2,0.31,0.43,0.03,0.12,-0.9,1.42,-0.12
3,2010-01-06,2301.090088,-0.048364,-0.0033,0.000126,0.017306,,,,,...,0.56,2.15,-0.02,0.07,-0.56,-0.24,-0.17,2.62,2.25,1.77
4,2010-01-07,2300.050049,0.007416,-0.000452,-0.0033,0.000126,0.017306,,,,...,-0.72,0.94,0.5,0.4,0.58,0.58,0.54,-1.85,0.22,-0.58
5,2010-01-08,2317.169922,-0.054915,0.007443,-0.000452,-0.0033,0.000126,2.116212,,,...,0.61,0.68,0.64,0.35,-0.98,-0.58,-0.56,2.07,1.26,0.38
6,2010-01-11,2312.409912,-0.031463,-0.002054,0.007443,-0.000452,-0.0033,0.172845,,,...,0.64,-0.13,-1.01,0.09,-0.66,-0.64,-0.61,1.08,0.65,1.44
7,2010-01-12,2282.310059,0.139772,-0.013017,-0.002054,0.007443,-0.000452,-1.143491,,,...,-0.47,-2.36,-0.67,-0.74,0.22,-0.05,-0.06,-6.33,-1.78,-2.19
8,2010-01-13,2307.899902,-0.021099,0.011212,-0.013017,-0.002054,0.007443,0.295939,,,...,0.26,1.62,0.82,0.66,-0.15,-0.17,-0.13,-0.51,1.97,0.98
9,2010-01-14,2316.73999,-0.027683,0.00383,0.011212,-0.013017,-0.002054,0.725634,,,...,0.27,0.57,0.76,0.33,0.12,-0.13,-0.16,-1.49,0.32,0.39


The raw dataset contains 1984 rows (samples) and 84 columns (features). Please note these numbers, because we will compare them against the filtered versions of the dataset below.


## Handling missing values

In real-world applications, we usually deal with training examples that are missing one or more feature values for various reasons (e.g. errors in the data collection process). Such missing values can cause multiple problems for machine learning tools and make them produce unpredictable outputs.

In this section, we will work through several practical techniques for dealing with missing values by removing entries from our dataset or imputing missing values from other training examples and features.

### Counting missing values


In [3]:
# Count the number of missing values in each column of the dataframe
df.isnull().sum()


Date              0
Close             0
Volume            1
mom               1
mom1              2
                 ..
Dollar index-F    0
Dollar index      0
wheat-F           2
XAG               0
XAU               0
Length: 84, dtype: int64
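
The per-column counts above can be complemented with percentages and a per-row check. The following is a minimal sketch on a hypothetical toy dataframe (`toy` and its values are made up for illustration and are not part of the NASDAQ data); the same calls should apply to `df` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Close":  [10.0, 11.0, 12.0, 13.0],
    "Volume": [np.nan, 0.5, 0.2, 0.1],
    "ROC_5":  [np.nan, np.nan, np.nan, 1.2],
})

missing_count = toy.isnull().sum()         # missing values per column
missing_pct = toy.isnull().mean() * 100    # same, as a percentage of the rows
rows_with_missing = toy.isnull().any(axis=1).sum()  # rows with at least one gap

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("missing", ascending=False))
print("Rows with at least one missing value:", rows_with_missing)
```

Sorting by the number of missing values makes it easier to spot the columns that are responsible for most of the gaps before deciding how to treat them.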

### Removing rows (samples) with even one missing value

The [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method of pandas removes the rows (selected with `axis=0`) that have a missing value in at least one of their columns. It returns a new dataframe and leaves `df` itself unchanged.


In [4]:
# dropna: Remove missing values along a specific axis.
# Remove the rows which have missing values. Observe the removal of rows 204, 205, 1916, 1917, 1918,...
df.dropna(axis = 0)


Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,...,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
201,2010-10-19,2436.949951,0.299836,-0.017620,0.004816,0.013710,-0.002396,0.787041,1.546771,2.410493,...,-1.84,-2.59,-1.74,-1.23,1.27,1.66,1.62,-2.58,-4.14,-2.51
202,2010-10-20,2457.389893,-0.100395,0.008388,-0.017620,0.004816,0.013710,0.661958,3.223055,3.401127,...,1.40,1.03,0.91,0.94,-1.01,-1.29,-1.29,2.52,2.44,0.82
203,2010-10-21,2459.669922,0.058861,0.000928,0.008388,-0.017620,0.004816,0.997382,3.188361,3.844002,...,-0.98,-3.69,-0.53,0.09,0.64,0.28,0.32,-3.09,-3.26,-1.55
206,2010-10-26,2497.290039,0.096219,0.002585,0.004622,0.008017,0.000928,2.476050,3.282578,4.061119,...,-0.52,2.42,-0.23,0.00,1.17,0.79,0.79,3.65,0.85,-0.04
207,2010-10-27,2503.260010,0.051657,0.002391,0.002585,0.004622,0.008017,1.866619,2.540933,5.149837,...,0.03,-2.97,-0.26,-0.34,0.47,0.60,0.57,0.61,-1.22,-1.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1914,2017-08-09,6352.330078,0.062199,-0.002846,-0.002085,0.005071,0.001770,-0.162194,-1.096414,-0.512291,...,0.12,2.90,-0.86,0.01,-1.10,-0.10,-0.11,0.60,2.64,1.10
1915,2017-08-10,6216.870117,0.084051,-0.021324,-0.002846,-0.002085,0.005071,-1.947368,-2.590331,-2.709388,...,-0.87,1.20,-1.97,-1.44,-0.11,-0.14,-0.16,-4.24,1.21,0.84
1919,2017-08-16,6345.109863,0.142191,0.001911,-0.001139,0.013373,0.006384,-0.113662,-0.275672,-1.208830,...,1.06,1.21,0.07,0.15,-0.70,-0.34,-0.33,-2.50,2.58,0.79
1920,2017-08-17,6221.910156,0.131039,-0.019416,0.001911,-0.001139,0.013373,0.081070,-1.867876,-2.511360,...,-0.42,0.67,-2.00,-1.54,-0.31,0.10,0.09,-1.25,-0.30,0.43


Compared to the 1984 rows of the original dataset, the filtered one contains only 1114. Consequently, this filter removes 870 rows from the dataset. This in turn means that approximately 44% of the dataset is discarded.

### Removing columns (features) with even one missing value

We may also remove entire columns (declared by `axis=1`) which have missing values.


In [5]:
# Remove the columns which have missing values.
df.dropna(axis = 1)


Unnamed: 0,Date,Close,DTB4WK,DTB3,DTB6,DGS5,DGS10,DAAA,DBAA,TE1,...,Nikkei-F,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,XAG,XAU
0,2009-12-31,2269.149902,0.04,0.06,0.20,2.69,3.85,5.33,6.39,3.81,...,0.67,0.03,0.26,-1.08,-1.00,-0.11,-0.08,-0.06,0.30,0.39
1,2010-01-04,2308.419922,0.05,0.08,0.18,2.65,3.85,5.35,6.39,3.80,...,0.31,1.52,3.26,1.61,1.62,-0.57,-0.59,-0.42,3.91,2.10
2,2010-01-05,2308.709961,0.03,0.07,0.17,2.56,3.77,5.24,6.30,3.74,...,0.47,-0.07,1.96,-0.20,0.31,0.43,0.03,0.12,1.42,-0.12
3,2010-01-06,2301.090088,0.03,0.06,0.15,2.60,3.85,5.30,6.34,3.82,...,0.19,0.56,2.15,-0.02,0.07,-0.56,-0.24,-0.17,2.25,1.77
4,2010-01-07,2300.050049,0.02,0.05,0.16,2.62,3.85,5.31,6.33,3.83,...,-0.09,-0.72,0.94,0.50,0.40,0.58,0.58,0.54,0.22,-0.58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1979,2017-11-09,6750.049805,1.05,1.22,1.33,2.01,2.33,3.60,4.26,1.28,...,-0.17,-0.24,-0.62,-0.34,-0.27,-0.61,-0.44,-0.45,-0.26,0.32
1980,2017-11-10,6750.939941,1.03,1.21,1.34,2.06,2.40,0.00,0.00,1.37,...,-1.66,-0.27,-0.58,-0.20,-0.17,0.18,-0.07,-0.05,-0.71,-0.80
1981,2017-11-13,6757.600098,1.04,1.22,1.35,2.08,2.40,3.69,4.33,1.36,...,-1.07,-0.38,0.72,-0.04,0.10,0.06,0.12,0.11,0.83,0.16
1982,2017-11-14,6737.870117,1.04,1.24,1.37,2.06,2.38,3.66,4.31,1.34,...,0.67,-0.39,0.17,-0.21,-0.15,-0.70,-0.71,-0.70,0.01,0.24


Compared to the 84 columns of the original dataset, the filtered one contains only 39. Consequently, this filter removes 45 columns from the dataset. This in turn means that approximately 54% of the features are discarded.

**Both of these filters remove massive amounts of data.**
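
To make the cost of the two filters concrete, the sketch below applies them to a small hypothetical dataframe (`toy` is invented for illustration) and computes the fraction of rows and columns that each one discards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two columns contain a single missing value each
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

rows_kept = toy.dropna(axis=0)   # drop every row that has a missing value
cols_kept = toy.dropna(axis=1)   # drop every column that has a missing value

row_loss = 1 - len(rows_kept) / len(toy)
col_loss = 1 - cols_kept.shape[1] / toy.shape[1]

print(f"Rows discarded:    {row_loss:.0%}")
print(f"Columns discarded: {col_loss:.0%}")
print("Original shape is unchanged:", toy.shape)
```

Only two values are missing out of twelve, yet half of the rows or two thirds of the columns are lost, which mirrors what happens with the NASDAQ dataset.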


### Removing samples that have all their features missing

The filter below will remove the rows that have all their columns empty. In this particular example, there are no such samples, so the dataset stays intact.


In [6]:
# This removes the rows that have all their columns empty.
df.dropna( how = 'all' )


Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,...,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
0,2009-12-31,2269.149902,,,,,,,,,...,0.03,0.26,-1.08,-1.00,-0.11,-0.08,-0.06,-0.48,0.30,0.39
1,2010-01-04,2308.419922,0.560308,0.017306,,,,,,,...,1.52,3.26,1.61,1.62,-0.57,-0.59,-0.42,3.12,3.91,2.10
2,2010-01-05,2308.709961,0.225994,0.000126,0.017306,,,,,,...,-0.07,1.96,-0.20,0.31,0.43,0.03,0.12,-0.90,1.42,-0.12
3,2010-01-06,2301.090088,-0.048364,-0.003300,0.000126,0.017306,,,,,...,0.56,2.15,-0.02,0.07,-0.56,-0.24,-0.17,2.62,2.25,1.77
4,2010-01-07,2300.050049,0.007416,-0.000452,-0.003300,0.000126,0.017306,,,,...,-0.72,0.94,0.50,0.40,0.58,0.58,0.54,-1.85,0.22,-0.58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1979,2017-11-09,6750.049805,0.058830,-0.005755,0.003153,-0.002750,0.003252,0.522862,2.947790,2.194980,...,-0.24,-0.62,-0.34,-0.27,-0.61,-0.44,-0.45,0.53,-0.26,0.32
1980,2017-11-10,6750.939941,-0.116863,0.000132,-0.005755,0.003153,-0.002750,-0.199573,0.741356,1.838727,...,-0.27,-0.58,-0.20,-0.17,0.18,-0.07,-0.05,0.70,-0.71,-0.80
1981,2017-11-13,6757.600098,-0.000091,0.000987,0.000132,-0.005755,0.003153,-0.424963,0.875362,2.592598,...,-0.38,0.72,-0.04,0.10,0.06,0.12,0.11,-1.85,0.83,0.16
1982,2017-11-14,6737.870117,0.005087,-0.002920,0.000987,0.000132,-0.005755,-0.441942,0.151616,2.113229,...,-0.39,0.17,-0.21,-0.15,-0.70,-0.71,-0.70,1.00,0.01,0.24


### Removing samples that have a specific number of their features missing

In the following example, [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) will remove a row from the dataframe unless it has at least 80 non-empty columns.


In [7]:
# This will remove a row from the dataframe unless it has at least 80 non-empty columns.
df.dropna(thresh = 80)


Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,...,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
15,2010-01-25,2210.800049,-0.242499,0.002499,-0.026663,-0.011151,-0.012562,-3.373701,-4.590508,-2.571441,...,0.62,0.46,-0.57,0.14,-0.14,-0.10,-0.11,-0.05,0.29,0.48
16,2010-01-26,2203.729980,0.106313,-0.003198,0.002499,-0.026663,-0.011151,-5.028009,-4.699856,-4.535134,...,-0.90,-1.66,-0.81,-0.48,0.56,0.33,0.31,-0.98,-2.34,-0.02
17,2010-01-27,2221.409912,0.055741,0.008023,-0.003198,0.002499,-0.026663,-3.048122,-2.668356,-3.781335,...,-0.23,-1.37,1.00,0.67,0.38,0.33,0.31,-1.67,-1.08,-1.02
18,2010-01-28,2179.000000,0.135089,-0.019091,0.008023,-0.003198,0.002499,-3.826630,-5.585160,-5.305750,...,-0.16,-2.49,-1.85,-1.39,0.20,0.29,0.29,0.46,-2.11,-0.05
19,2010-01-29,2147.350098,0.092217,-0.014525,-0.019091,0.008023,-0.003198,-2.627316,-7.311562,-6.638984,...,-0.52,-0.12,-0.74,-0.81,0.84,0.70,0.71,-2.85,-0.06,-0.44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1979,2017-11-09,6750.049805,0.058830,-0.005755,0.003153,-0.002750,0.003252,0.522862,2.947790,2.194980,...,-0.24,-0.62,-0.34,-0.27,-0.61,-0.44,-0.45,0.53,-0.26,0.32
1980,2017-11-10,6750.939941,-0.116863,0.000132,-0.005755,0.003153,-0.002750,-0.199573,0.741356,1.838727,...,-0.27,-0.58,-0.20,-0.17,0.18,-0.07,-0.05,0.70,-0.71,-0.80
1981,2017-11-13,6757.600098,-0.000091,0.000987,0.000132,-0.005755,0.003153,-0.424963,0.875362,2.592598,...,-0.38,0.72,-0.04,0.10,0.06,0.12,0.11,-1.85,0.83,0.16
1982,2017-11-14,6737.870117,0.005087,-0.002920,0.000987,0.000132,-0.005755,-0.441942,0.151616,2.113229,...,-0.39,0.17,-0.21,-0.15,-0.70,-0.71,-0.70,1.00,0.01,0.24


### Removing samples that have a specific feature missing

In some problems we cannot tolerate the absence of a specific feature from a sample, mainly because this feature is too important to be ignored. Therefore, all the samples that do not possess this specific feature must be removed from the dataset.

In the following example, [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) will remove all the samples that have their `ROC_15` feature missing.


In [8]:
# This will remove all the samples that have their ROC_15 feature missing.
df.dropna(subset = ['ROC_15'])


Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,...,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
15,2010-01-25,2210.800049,-0.242499,0.002499,-0.026663,-0.011151,-0.012562,-3.373701,-4.590508,-2.571441,...,0.62,0.46,-0.57,0.14,-0.14,-0.10,-0.11,-0.05,0.29,0.48
16,2010-01-26,2203.729980,0.106313,-0.003198,0.002499,-0.026663,-0.011151,-5.028009,-4.699856,-4.535134,...,-0.90,-1.66,-0.81,-0.48,0.56,0.33,0.31,-0.98,-2.34,-0.02
17,2010-01-27,2221.409912,0.055741,0.008023,-0.003198,0.002499,-0.026663,-3.048122,-2.668356,-3.781335,...,-0.23,-1.37,1.00,0.67,0.38,0.33,0.31,-1.67,-1.08,-1.02
18,2010-01-28,2179.000000,0.135089,-0.019091,0.008023,-0.003198,0.002499,-3.826630,-5.585160,-5.305750,...,-0.16,-2.49,-1.85,-1.39,0.20,0.29,0.29,0.46,-2.11,-0.05
19,2010-01-29,2147.350098,0.092217,-0.014525,-0.019091,0.008023,-0.003198,-2.627316,-7.311562,-6.638984,...,-0.52,-0.12,-0.74,-0.81,0.84,0.70,0.71,-2.85,-0.06,-0.44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1979,2017-11-09,6750.049805,0.058830,-0.005755,0.003153,-0.002750,0.003252,0.522862,2.947790,2.194980,...,-0.24,-0.62,-0.34,-0.27,-0.61,-0.44,-0.45,0.53,-0.26,0.32
1980,2017-11-10,6750.939941,-0.116863,0.000132,-0.005755,0.003153,-0.002750,-0.199573,0.741356,1.838727,...,-0.27,-0.58,-0.20,-0.17,0.18,-0.07,-0.05,0.70,-0.71,-0.80
1981,2017-11-13,6757.600098,-0.000091,0.000987,0.000132,-0.005755,0.003153,-0.424963,0.875362,2.592598,...,-0.38,0.72,-0.04,0.10,0.06,0.12,0.11,-1.85,0.83,0.16
1982,2017-11-14,6737.870117,0.005087,-0.002920,0.000987,0.000132,-0.005755,-0.441942,0.151616,2.113229,...,-0.39,0.17,-0.21,-0.15,-0.70,-0.71,-0.70,1.00,0.01,0.24
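
The `how`, `thresh` and `subset` arguments used in the last three examples can be compared side by side on a small frame. This is only a sketch: `toy` is a hypothetical dataframe whose values were chosen so that each argument keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; row 2 is entirely empty
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
})

no_empty_rows = toy.dropna(how="all")      # drop rows where every column is missing
mostly_complete = toy.dropna(thresh=3)     # keep rows with at least 3 non-missing values
c_present = toy.dropna(subset=["c"])       # drop rows where column "c" is missing

print("how='all'    keeps rows:", list(no_empty_rows.index))
print("thresh=3     keeps rows:", list(mostly_complete.index))
print("subset=['c'] keeps rows:", list(c_present.index))
```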


### Imputing missing values

In some cases it is not desirable to remove entire rows (i.e., samples) or columns (i.e., features), as this action discards massive amounts of valuable data. In these cases, various techniques can be used to compute an estimate for the missing values. These techniques rely on data that exists elsewhere in the dataset, and attempt to compute a replacement value by exploiting that data.

In the following example we will use the [fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) function of the pandas library to fill in the empty `ROC_15` features. Here we calculate the average of the existing values and we "fill" the missing values with this average.

Note that alternative statistics (such as the median) or fixed values (like 0) can be used instead.


In [9]:
df2 = df.copy()
# Fill the missing ROC_15 values with the column mean. Assigning the result back
# avoids the chained in-place fillna, which newer pandas versions deprecate.
df2['ROC_15'] = df2['ROC_15'].fillna(df2['ROC_15'].mean())
df2.head(10)


Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,...,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
0,2009-12-31,2269.149902,,,,,,,,0.890409,...,0.03,0.26,-1.08,-1.0,-0.11,-0.08,-0.06,-0.48,0.3,0.39
1,2010-01-04,2308.419922,0.560308,0.017306,,,,,,0.890409,...,1.52,3.26,1.61,1.62,-0.57,-0.59,-0.42,3.12,3.91,2.1
2,2010-01-05,2308.709961,0.225994,0.000126,0.017306,,,,,0.890409,...,-0.07,1.96,-0.2,0.31,0.43,0.03,0.12,-0.9,1.42,-0.12
3,2010-01-06,2301.090088,-0.048364,-0.0033,0.000126,0.017306,,,,0.890409,...,0.56,2.15,-0.02,0.07,-0.56,-0.24,-0.17,2.62,2.25,1.77
4,2010-01-07,2300.050049,0.007416,-0.000452,-0.0033,0.000126,0.017306,,,0.890409,...,-0.72,0.94,0.5,0.4,0.58,0.58,0.54,-1.85,0.22,-0.58
5,2010-01-08,2317.169922,-0.054915,0.007443,-0.000452,-0.0033,0.000126,2.116212,,0.890409,...,0.61,0.68,0.64,0.35,-0.98,-0.58,-0.56,2.07,1.26,0.38
6,2010-01-11,2312.409912,-0.031463,-0.002054,0.007443,-0.000452,-0.0033,0.172845,,0.890409,...,0.64,-0.13,-1.01,0.09,-0.66,-0.64,-0.61,1.08,0.65,1.44
7,2010-01-12,2282.310059,0.139772,-0.013017,-0.002054,0.007443,-0.000452,-1.143491,,0.890409,...,-0.47,-2.36,-0.67,-0.74,0.22,-0.05,-0.06,-6.33,-1.78,-2.19
8,2010-01-13,2307.899902,-0.021099,0.011212,-0.013017,-0.002054,0.007443,0.295939,,0.890409,...,0.26,1.62,0.82,0.66,-0.15,-0.17,-0.13,-0.51,1.97,0.98
9,2010-01-14,2316.73999,-0.027683,0.00383,0.011212,-0.013017,-0.002054,0.725634,,0.890409,...,0.27,0.57,0.76,0.33,0.12,-0.13,-0.16,-1.49,0.32,0.39
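
As an illustration of the alternatives mentioned above, the sketch below fills one column with its median through pandas and then imputes every column at once with scikit-learn's `SimpleImputer`. The `toy` dataframe and its values are hypothetical; `SimpleImputer` is not used elsewhere in this notebook and is shown only as one possible option.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps in both columns
toy = pd.DataFrame({
    "ROC_15": [np.nan, 1.0, 2.0, 9.0],
    "Close":  [10.0, 11.0, np.nan, 13.0],
})

# pandas: fill a single column with its median (less sensitive to outliers than the mean)
median_filled = toy["ROC_15"].fillna(toy["ROC_15"].median())

# scikit-learn: replace the gaps of every column with that column's mean
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(median_filled)
print(imputed)
```

An advantage of the scikit-learn imputer is that the statistics learned with `fit` on one dataset can later be applied to another one with `transform`.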


## Handling categorical features

Categorical features receive their values from predefined sets of fixed values. Country names are a representative example of a categorical feature. We mainly distinguish two types of categorical features: nominal and ordinal.

Nominal features receive values that either cannot be sorted, or, if sorted, no useful information can be inferred. Examples of nominal features include movie, music and video game genres, country names, food types, etc. 

On the other hand, the values of the ordinal features can be sorted and their sorting leads to meaningful information. For example, clothing size is an ordinal feature, because we can define a logical order: XL > L > M > S.

To demonstrate categorical features, we will work with an experimental toy dataset. The following code creates the dataset and stores it into a pandas dataframe. In this context, `color` and `size` are categorical features because they receive distinct values from fixed sets of values. Furthermore, `color` is a nominal feature, whereas `size` is an ordinal feature.


In [10]:
import pandas as pd

dfCat_original = pd.DataFrame([
                      ['green',  'M',  10.1, 'T-shirt' ],
                      ['red',    'L',  13.5, 'Blouse'  ],
                      ['blue',   'XL', 15.3, 'T-shirt' ],
                      ['yellow', 'XL', 17.1, 'T-shirt' ],
                      ['grey',   'L',   8.9, 'Blouse'  ],
                      ['red',    'M',  11.5, 'Shirt'   ],
                      ['red',    'L',  12.5, 'Shirt'   ],
                      ['blue',   'M',  14.9, 'Blouse'  ],
                      ['grey',   'XL', 15.0, 'Dress'   ],
                      ['green',  'L',  15.0, 'Dress'   ]
                     ])

dfCat_original.columns = ['color', 'size', 'price', 'classlabel']
dfCat = dfCat_original.copy()
dfCat


Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,T-shirt
1,red,L,13.5,Blouse
2,blue,XL,15.3,T-shirt
3,yellow,XL,17.1,T-shirt
4,grey,L,8.9,Blouse
5,red,M,11.5,Shirt
6,red,L,12.5,Shirt
7,blue,M,14.9,Blouse
8,grey,XL,15.0,Dress
9,green,L,15.0,Dress


### Encoding the ordinal features

As mentioned earlier, the only ordinal feature in the input dataset is the `size` in the second column. To feed a dataset that contains ordinal features into a machine learning model, we must convert the string values into numerical values.

The following code uses a dictionary to perform the following mappings:

* the XL size is mapped to 3,
* the L size is mapped to 2, and
* the M size is mapped to 1


In [11]:
# Map the "size" ordinal feature to an integer value
map_lexicon = {'XL': 3, 'L': 2, 'M': 1}
dfCat['size'] = dfCat['size'].map(map_lexicon)

dfCat_original = dfCat.copy()
dfCat


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,T-shirt
1,red,2,13.5,Blouse
2,blue,3,15.3,T-shirt
3,yellow,3,17.1,T-shirt
4,grey,2,8.9,Blouse
5,red,1,11.5,Shirt
6,red,2,12.5,Shirt
7,blue,1,14.9,Blouse
8,grey,3,15.0,Dress
9,green,2,15.0,Dress


If we need to roll back to the previous version of the dataset, we first define a reverse mapping dictionary `inv_map_lexicon`. Then we apply `inv_map_lexicon` to the column of the transformed features through the pandas [map](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) function.


In [12]:
# Convert the integers back to their meaningful categorical features.
# Employ another dictionary to reverse the previous mapping (observe that the reversed dictionary
# derives from the original lexicon (map_lexicon) by interchanging the keys with values and vice-versa).
inv_map_lexicon = {v: k for k, v in map_lexicon.items()}

print(inv_map_lexicon)

dfCat['size'] = dfCat['size'].map(inv_map_lexicon)
dfCat


{3: 'XL', 2: 'L', 1: 'M'}


Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,T-shirt
1,red,L,13.5,Blouse
2,blue,XL,15.3,T-shirt
3,yellow,XL,17.1,T-shirt
4,grey,L,8.9,Blouse
5,red,M,11.5,Shirt
6,red,L,12.5,Shirt
7,blue,M,14.9,Blouse
8,grey,XL,15.0,Dress
9,green,L,15.0,Dress


In [13]:
dfCat = dfCat_original.copy()
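
One caveat of dictionary-based mapping is worth checking: pandas `map` returns `NaN` for any value that is absent from the dictionary. The sketch below illustrates this with a hypothetical `sizes` series that contains an `S` entry, which the three-key dictionary used above does not cover.

```python
import pandas as pd

# Hypothetical series with a size ("S") that the mapping does not cover
sizes = pd.Series(["M", "L", "XL", "S"])
map_lexicon = {"XL": 3, "L": 2, "M": 1}

encoded = sizes.map(map_lexicon)              # "S" becomes NaN
unmapped = sizes[encoded.isnull()].unique()   # values the dictionary missed

print(encoded)
print("Values without a mapping:", list(unmapped))
```

Inspecting the unmapped values after each encoding step is a cheap way to avoid silently introducing new missing values into the dataset.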


### Encoding the class labels

The majority of machine learning libraries require the target variables (which in classification problems are commonly referred to as class labels) to be encoded as integers. Although most scikit-learn classification algorithms automatically convert class labels to integers, it is good practice to provide class labels in an integer form to avoid problems.

Similarly to the aforementioned mapping technique for the ordinal features, the class labels can be encoded as integers by using mapping dictionaries. However, compared to the ordinal features, the class labels have a significant difference: they do not imply an ordering. Therefore, it does not matter which integer is used to encode a particular label.

There are two ways to encode class labels: 

#### Encoding the class labels with mapping dictionaries

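Initially we create the mapping dictionary `class_map_lexicon` for the class labels. Note that the values used below are arbitrary distinct numbers (floats in this example); in practice, integer codes are the usual choice: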


In [14]:
# class_mapping = {label: idx for idx, label in enumerate(np.unique(dfCat['classlabel']))}
class_map_lexicon = {"Blouse" : 0.3, "Shirt" : 0.5, "T-shirt" : 0.7, "Dress" : 0.9}
class_map_lexicon


{'Blouse': 0.3, 'Shirt': 0.5, 'T-shirt': 0.7, 'Dress': 0.9}

Next, we apply `class_map_lexicon` to the respective column of the dataframe:


In [15]:
dfCat['classlabel'] = dfCat['classlabel'].map(class_map_lexicon)
dfCat


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,0.7
1,red,2,13.5,0.3
2,blue,3,15.3,0.7
3,yellow,3,17.1,0.7
4,grey,2,8.9,0.3
5,red,1,11.5,0.5
6,red,2,12.5,0.5
7,blue,1,14.9,0.3
8,grey,3,15.0,0.9
9,green,2,15.0,0.9


Similarly to the previous case, we can roll back to the original form of the dataset by employing another dictionary that reverses the encoding of the class labels.


In [16]:
dfCat_original = dfCat.copy()

# Decoding
# Employ another dictionary to reverse the previous mapping (observe that the reversed dictionary
# derives from the original lexicon (class_map_lexicon) by interchanging the keys with the values).
inv_class_map_lexicon = {v: k for k, v in class_map_lexicon.items()}

dfCat['classlabel'] = dfCat['classlabel'].map(inv_class_map_lexicon)
#dfCat['classlabel'].map(inv_class_map_lexicon)
dfCat


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,T-shirt
1,red,2,13.5,Blouse
2,blue,3,15.3,T-shirt
3,yellow,3,17.1,T-shirt
4,grey,2,8.9,Blouse
5,red,1,11.5,Shirt
6,red,2,12.5,Shirt
7,blue,1,14.9,Blouse
8,grey,3,15.0,Dress
9,green,2,15.0,Dress
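
The commented-out line in the `class_map_lexicon` cell hints at how an integer mapping could be generated automatically instead of being typed by hand. The following sketch expands on that idea; the `labels` series is a hypothetical stand-in for the `classlabel` column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the classlabel column
labels = pd.Series(["T-shirt", "Blouse", "T-shirt", "Shirt", "Dress"])

# np.unique returns the distinct labels in sorted order; enumerate assigns 0, 1, 2, ...
class_mapping = {label: idx for idx, label in enumerate(np.unique(labels))}
inv_class_mapping = {idx: label for label, idx in class_mapping.items()}

encoded = labels.map(class_mapping)
decoded = encoded.map(inv_class_mapping)

print(class_mapping)
print(list(encoded))
print(list(decoded))
```

Since the class labels carry no order, the particular integers are irrelevant; what matters is that the mapping is one-to-one so that it can be reversed.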


#### Encoding the class labels with the `LabelEncoder` object

Here, instead of manually constructing mapping (and reverse mapping) dictionaries we employ the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) object of scikit-learn.


In [17]:
from sklearn.preprocessing import LabelEncoder

# Label encoding with sklearn's LabelEncoder
class_le = LabelEncoder()
dfCat['classlabel'] = class_le.fit_transform(dfCat['classlabel'].values)

print("Classes of Label Encoder:", class_le.classes_)
dfCat


Classes of Label Encoder: ['Blouse' 'Dress' 'Shirt' 'T-shirt']


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,3
1,red,2,13.5,0
2,blue,3,15.3,3
3,yellow,3,17.1,3
4,grey,2,8.9,0
5,red,1,11.5,2
6,red,2,12.5,2
7,blue,1,14.9,0
8,grey,3,15.0,1
9,green,2,15.0,1


In [18]:
# Reverse mapping
dfCat['classlabel'] = class_le.inverse_transform(dfCat['classlabel'])
print("Inversed classes of Label Encoder:", class_le.classes_)

dfCat


Inversed classes of Label Encoder: ['Blouse' 'Dress' 'Shirt' 'T-shirt']


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,T-shirt
1,red,2,13.5,Blouse
2,blue,3,15.3,T-shirt
3,yellow,3,17.1,T-shirt
4,grey,2,8.9,Blouse
5,red,1,11.5,Shirt
6,red,2,12.5,Shirt
7,blue,1,14.9,Blouse
8,grey,3,15.0,Dress
9,green,2,15.0,Dress


In [19]:
dfCat['classlabel'] = class_le.fit_transform(dfCat['classlabel'].values)
# dfCat


### Encoding the nominal features

Similarly to the ordinal features, the nominal features are also of categorical nature. Nevertheless, although their simple transformation into integer values is technically possible, it hides a serious logical error. We will demonstrate this error with an example. 

So let us employ the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) object of scikit-learn to map the `color` feature to distinct numerical values.


In [20]:
color_le = LabelEncoder()
dfCat['color'] = color_le.fit_transform(dfCat['color'])
dfCat


Unnamed: 0,color,size,price,classlabel
0,1,1,10.1,3
1,3,2,13.5,0
2,0,3,15.3,3
3,4,3,17.1,3
4,2,2,8.9,0
5,3,1,11.5,2
6,3,2,12.5,2
7,0,1,14.9,0
8,2,3,15.0,1
9,1,2,15.0,1


If this data is passed to a classification model as is, a logical error creeps in.

The problem is that although the `color` values do not imply any particular order, the encoding (blue=0, green=1, grey=2, red=3, yellow=4) leads a learning algorithm to assume that red is larger than green, and green is larger than blue. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would likely not be optimal.

So let us reverse the previous transformation.


In [21]:
dfCat['color'] = color_le.inverse_transform(dfCat['color'])
dfCat

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,3
1,red,2,13.5,0
2,blue,3,15.3,3
3,yellow,3,17.1,3
4,grey,2,8.9,0
5,red,1,11.5,2
6,red,2,12.5,2
7,blue,1,14.9,0
8,grey,3,15.0,1
9,green,2,15.0,1


#### The one-hot-encoding method for nominal features (and class labels)

One-hot encoding is a popular way of addressing this problem. The idea is to create a new binary feature for each possible value of a nominal feature (in our case, one feature for each `color`). Binary values can then be used to indicate the particular `color` of a sample; for example, a blue example can be encoded as blue=1, green=0, grey=0, red=0, and yellow=0.


In [22]:
from sklearn.preprocessing import OneHotEncoder

# Create a new DataFrame X that contains only the input variables of the dataset (that is, the class labels are excluded).
X = dfCat[['color', 'size', 'price']].values

# OneHotEncoder() object
color_ohe = OneHotEncoder()

# Get the first column of X and pass it to the One Hot Encoder.
XColor = X[:, 0:1]
print(XColor)

color_ohe.fit_transform( XColor ).toarray()


[['green']
 ['red']
 ['blue']
 ['yellow']
 ['grey']
 ['red']
 ['red']
 ['blue']
 ['grey']
 ['green']]


array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
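
To read such an array, it helps to know which column corresponds to which color. The sketch below, written on a hypothetical `colors` array with one sample per color, inspects the `categories_` attribute of a fitted `OneHotEncoder`; the columns of the encoded array follow the order of that list.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input with one sample per color
colors = np.array([["green"], ["red"], ["blue"], ["yellow"], ["grey"]])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors).toarray()

# categories_ holds one array per input column; its order defines the output columns
categories = list(ohe.categories_[0])

print("Column order:", categories)
print("Encoding of 'green':", encoded[0])
```

Each row contains exactly one 1, placed in the column of the sample's color.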

In [23]:
from sklearn.compose import ColumnTransformer

X = dfCat[['color', 'size', 'price']].values
c_transf = ColumnTransformer([ ('onehot', OneHotEncoder(), [0]), ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)


array([[ 0. ,  1. ,  0. ,  0. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  0. ,  1. ,  0. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  0. ,  0. ,  3. , 15.3],
       [ 0. ,  0. ,  0. ,  0. ,  1. ,  3. , 17.1],
       [ 0. ,  0. ,  1. ,  0. ,  0. ,  2. ,  8.9],
       [ 0. ,  0. ,  0. ,  1. ,  0. ,  1. , 11.5],
       [ 0. ,  0. ,  0. ,  1. ,  0. ,  2. , 12.5],
       [ 1. ,  0. ,  0. ,  0. ,  0. ,  1. , 14.9],
       [ 0. ,  0. ,  1. ,  0. ,  0. ,  3. , 15. ],
       [ 0. ,  1. ,  0. ,  0. ,  0. ,  2. , 15. ]])

In [24]:
pd.get_dummies(dfCat[['price', 'color', 'size']])


Unnamed: 0,price,size,color_blue,color_green,color_grey,color_red,color_yellow
0,10.1,1,0,1,0,0,0
1,13.5,2,0,0,0,1,0
2,15.3,3,1,0,0,0,0
3,17.1,3,0,0,0,0,1
4,8.9,2,0,0,1,0,0
5,11.5,1,0,0,0,1,0
6,12.5,2,0,0,0,1,0
7,14.9,1,1,0,0,0,0
8,15.0,3,0,0,1,0,0
9,15.0,2,0,1,0,0,0


## Creating training and test sets through input dataset partitioning

Most supervised learning algorithms follow a similar workflow. At first, a machine learning model is trained by using a subset of the original dataset. This subset is known as the training set. Then, the performance of the model is evaluated by using the rest of the dataset. This second subset of the original dataset is called the test set.

scikit-learn offers a powerful method for partitioning a dataset into a training and a test set: [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). In the following example we demonstrate the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method and we describe its properties.


In [25]:
# Load the IRIS dataset
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
print('Class labels:', np.unique(y))

print(y)


Class labels: [0 1 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1, stratify=y)
print(y_test)


[2 1 2 0 1 0 0 2 1 2 0 1 1 2 0]


Using the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function with `test_size=0.1`, we randomly partitioned `X` and `y` into a test set (10% of the dataset, that is, 15 samples) and a training set (90% of the dataset, that is, 135 training examples). Note that the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function shuffles the dataset internally before the partitioning takes place.

Without shuffling, all samples from class 0 and class 1 would have ended up in the training set, and the test set would consist of 15 examples, all coming from class 2. The `random_state` parameter provides a fixed random seed (e.g. `random_state=1`) to the pseudo-random number generator used for shuffling. Using this constant integer ensures that our results can be reproduced.

Finally, we took advantage of the built-in stratification support via the `stratify=y` argument. In this context, stratification means that the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function produces a training set and a test set that contain the same proportions of class labels as the original dataset. In our example, the IRIS dataset is balanced, that is, 33.3\% of its samples are from class 0, 33.3\% of its samples are from class 1, and 33.3\% of its samples are from class 2. With `stratify=y` we ensure that these percentages are retained in both the training set and the test set.
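The class proportions can be verified by counting the labels in each subset. The sketch below repeats the split from the cell above with the same arguments and uses `np.bincount` to count the occurrences of each class label.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1, stratify=y)

# np.bincount returns the number of samples per class label (0, 1, 2)
print('Full dataset:', np.bincount(y))         # [50 50 50]
print('Training set:', np.bincount(y_train))   # [45 45 45]
print('Test set:    ', np.bincount(y_test))    # [5 5 5]
```

Each subset keeps one third of its samples in every class, matching the original dataset.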


## Feature scaling

Feature scaling is the process of adjusting the feature values so that they lie in comparable ranges. This step is crucial because, without it, the features with significantly larger values than the others will dominate the optimization of the cost function. This in turn diminishes the influence of the features that have smaller values.

### min-max Normalization and Standardization

There are two common approaches to feature scaling: **Normalization** and **Standardization**. In most cases, the term normalization refers to rescaling the feature values to the interval $[0, 1]$. In other words, feature normalization is a special case of min-max scaling:

\begin{equation}
x_i = \frac{x_i - x_{\min}} {x_{\max} - x_{\min}}
\end{equation}

On the other hand, standardization is expressed by the following equation:

\begin{equation}
x_i = \frac{x_i - \mu_{x}} {\sigma_x}
\end{equation}

where $\mu_x$ is the mean and $\sigma_x$ is the standard deviation of the feature.


In [27]:
# MinMax Normalization
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

print(X_train[:10, :])
print(X_train_norm[:10, :])


[[6.7 3.  5.  1.7]
 [5.8 2.8 5.1 2.4]
 [6.5 3.2 5.1 2. ]
 [6.9 3.1 5.4 2.1]
 [5.1 3.4 1.5 0.2]
 [4.6 3.6 1.  0.2]
 [6.4 2.7 5.3 1.9]
 [5.5 2.4 3.8 1.1]
 [5.6 2.7 4.2 1.3]
 [7.7 3.  6.1 2.3]]
[[0.66666667 0.41666667 0.6779661  0.66666667]
 [0.41666667 0.33333333 0.69491525 0.95833333]
 [0.61111111 0.5        0.69491525 0.79166667]
 [0.72222222 0.45833333 0.74576271 0.83333333]
 [0.22222222 0.58333333 0.08474576 0.04166667]
 [0.08333333 0.66666667 0.         0.04166667]
 [0.58333333 0.29166667 0.72881356 0.75      ]
 [0.33333333 0.16666667 0.47457627 0.41666667]
 [0.36111111 0.29166667 0.54237288 0.5       ]
 [0.94444444 0.41666667 0.86440678 0.91666667]]


In [28]:
# Standardization
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

print(X_train[:10, :])
print(X_train_std[:10, :])


[[6.7 3.  5.  1.7]
 [5.8 2.8 5.1 2.4]
 [6.5 3.2 5.1 2. ]
 [6.9 3.1 5.4 2.1]
 [5.1 3.4 1.5 0.2]
 [4.6 3.6 1.  0.2]
 [6.4 2.7 5.3 1.9]
 [5.5 2.4 3.8 1.1]
 [5.6 2.7 4.2 1.3]
 [7.7 3.  6.1 2.3]]
[[ 0.98322472 -0.17043914  0.68152706  0.65474041]
 [-0.10874789 -0.63062482  0.73786886  1.57684128]
 [ 0.74056414  0.28974654  0.73786886  1.0499265 ]
 [ 1.2258853   0.0596537   0.90689425  1.18165519]
 [-0.95805992  0.74993221 -1.29043581 -1.32119004]
 [-1.56471137  1.21011789 -1.5721448  -1.32119004]
 [ 0.61923385 -0.86071766  0.85055245  0.9181978 ]
 [-0.47273876 -1.55099617  0.00542551 -0.13563177]
 [-0.35140847 -0.86071766  0.23079269  0.12782562]
 [ 2.19652763 -0.17043914  1.30128683  1.44511259]]
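In both cells above the scaler is fitted on the training set and then applied to the test set with `transform`. The following sketch, which rebuilds the same split so that it can run on its own, illustrates what this means: the statistics stored in the fitted scalers (`data_min_`, `data_max_`, `mean_`, `scale_`) are computed from the training data only, and the test data is scaled with those same values. The names `manual_test_norm` and `manual_test_std` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=1, stratify=iris.target)

mms = MinMaxScaler().fit(X_train)
stdsc = StandardScaler().fit(X_train)

# The learned statistics come from the training set only
print(np.allclose(mms.data_min_, X_train.min(axis=0)))
print(np.allclose(stdsc.mean_, X_train.mean(axis=0)))
print(np.allclose(stdsc.scale_, X_train.std(axis=0)))

# transform() applies the training statistics to the test set
manual_test_norm = (X_test - mms.data_min_) / (mms.data_max_ - mms.data_min_)
manual_test_std = (X_test - stdsc.mean_) / stdsc.scale_
print(np.allclose(manual_test_norm, mms.transform(X_test)))
print(np.allclose(manual_test_std, stdsc.transform(X_test)))
```

All five checks print `True`. A consequence of this design is that normalized test values are not guaranteed to stay inside $[0, 1]$, since a test sample may fall outside the range observed in the training set.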


Another example, this time applying both formulas directly with NumPy:

In [29]:
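ex = np.array([0, 1, 2, 3, 4, 5])

# standardize: subtract the mean, divide by the (population) standard deviation
print('standardized:', (ex - ex.mean()) / ex.std())

# normalize: min-max scaling to the interval [0, 1]
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))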

standardized: [-1.46385011 -0.87831007 -0.29277002  0.29277002  0.87831007  1.46385011]
normalized: [0.  0.2 0.4 0.6 0.8 1. ]
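One detail worth keeping in mind when standardizing by hand: NumPy's `std` computes the population standard deviation by default (`ddof=0`), which is also what `StandardScaler` uses, whereas pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`). The short sketch below shows the difference on the same array.

```python
import numpy as np
import pandas as pd

ex = np.array([0, 1, 2, 3, 4, 5])

pop_std = ex.std()                  # NumPy default: ddof=0 (divide by n)
sample_std = ex.std(ddof=1)         # sample estimate (divide by n - 1)
pandas_std = pd.Series(ex).std()    # pandas default: ddof=1

print('population std:', pop_std)
print('sample std:    ', sample_std)
print('pandas std:    ', pandas_std)
```

Standardizing a pandas column with its own `std()` therefore gives slightly different values than `StandardScaler` unless `ddof=0` is passed explicitly.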


### L2 Normalization

Another popular normalization method divides the feature values by the L2 norm (or Euclidean norm) of the vector to which they belong. More specifically, the $i$-th component of the vector $\mathbf{x}=(x_1, x_2, \dots, x_n)$ is normalized according to L2 as follows:

\begin{equation}
x_i = \frac{x_i}{||\mathbf{x}||_2}=\frac{x_i}{\sqrt{x_1^2+x_2^2+ \dots + x_n^2}}=\frac{x_i}{\big (\sum_{j=1}^{n} x_{j}^{2}\big)^{1/2}}
\end{equation}

There are additional strategies for normalizing features that use other norms (the L1 norm, the Frobenius norm for matrices, etc.). Some of these methods have been widely adopted by state-of-the-art solutions that appear in the relevant scientific literature. The choice of normalization can affect the effectiveness of a machine learning algorithm, and the effect can range from subtle to significant.
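As a minimal sketch of L2 normalization, the example below uses the `normalize` function from `sklearn.preprocessing`, which by default rescales each row (sample vector) to unit norm, and compares it with a direct NumPy implementation of the formula. The small array `V` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two sample vectors, one per row (illustrative values)
V = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# Manual L2 normalization: divide each row by its Euclidean norm
manual = V / np.linalg.norm(V, axis=1, keepdims=True)

V_l2 = normalize(V, norm='l2')   # rows scaled to unit L2 norm
V_l1 = normalize(V, norm='l1')   # rows scaled so absolute values sum to 1

print(V_l2)                          # first row becomes [0.6 0.8]
print(np.allclose(manual, V_l2))     # True
print(np.abs(V_l1).sum(axis=1))      # each row sums to 1
```

Unlike min-max normalization and standardization, which operate on each feature (column) across all samples, this kind of normalization operates on each sample (row) independently.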
