# Data Preprocessing
Preprocessing the features of a dataset is a fundamental step in preparing data for a machine learning algorithm. While dataset design and data analysis ensure that a dataset contains only meaningful data, preprocessing ensures that the data is represented in a form suitable as input to a neural network.

The preprocessing of a dataset consists of 4 phases:
1.   identifying and handling missing values;
2.   encoding the categorical features;
3.   splitting the dataset into training and test sets;
4.   scaling the features of the dataset.

In the next sections of this practical lesson we'll see how to handle the preprocessing of the Titanic Survivals dataset, so let's import it. 

In [154]:
import pandas as pd
import numpy as np


df = pd.read_csv('http://ailab.uniud.it/wp-content/uploads/2020/10/titanic_survivals.csv', sep=',', index_col=None)  # Loading the dataset in a Pandas Dataframe
df[1:10]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,Cumings; Mrs. John Bradley (Florence Briggs Th...,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Heikkinen; Miss. Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,Futrelle; Mrs. Jacques Heath (Lily May Peel),female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,Allen; Mr. William Henry,male,35.0,0,0,373450,8.05,,S
5,6,0,3,Moran; Mr. James,male,,0,0,330877,8.4583,,Q
6,7,0,1,McCarthy; Mr. Timothy J,male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,Palsson; Master. Gosta Leonard,male,2.0,3,1,349909,21.075,,S
8,9,1,3,Johnson; Mrs. Oscar W (Elisabeth Vilhelmina Berg),female,27.0,0,2,347742,11.1333,,S
9,10,1,2,Nasser; Mrs. Nicholas (Adele Achem),female,14.0,1,0,237736,30.0708,,C


# Handling missing values
As we saw when printing the first rows of the dataset, some columns have missing values, marked with the value *NaN* (Not a Number). Such undefined values, while usually ignored by Pandas operations like *median()* and *mean()*, can cause problems in other built-in and user-defined operations; moreover, a neural network cannot handle missing values at all. To avoid future problems, we have to get rid of the missing values in the dataset. This can be done in three ways:

1.   by picking a strategy to fill the holes in the dataset;
2.   by dropping from the dataset the rows that have a missing value;
3.   by dropping an entire column from the dataset if it has too many missing values that cannot be replaced.

These strategies are not mutually exclusive and can be combined. To decide how to proceed, let's check which columns have missing values and how many they have.

In [155]:
for column in df.columns:
  
  number_of_nans_in_column = df[column].isnull().sum()  # Counting the number of missing values in the column
  print(f"The column {column} has {number_of_nans_in_column} missing values\n")  # Printing the column with missing values and how many such values are in it

The column PassengerId has 0 missing values

The column Survived has 0 missing values

The column Pclass has 0 missing values

The column Name has 0 missing values

The column Sex has 0 missing values

The column Age has 177 missing values

The column SibSp has 0 missing values

The column Parch has 0 missing values

The column Ticket has 0 missing values

The column Fare has 0 missing values

The column Cabin has 687 missing values

The column Embarked has 2 missing values



As we can see, the column with the most missing entries is Cabin, with 687 missing values, while the column Age has 177 missing values. Among the columns with missing entries, Embarked has the fewest, with only 2 missing values.
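The per-column loop above can also be written as a single vectorized call. The sketch below runs on a small made-up dataframe (the name `toy` and its values are assumptions for illustration, not the Titanic data) and also computes the share of missing entries per column, which helps when deciding whether a column is worth keeping.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_per_column = toy.isnull().sum()   # Number of NaN entries in each column
missing_fraction = toy.isnull().mean()    # Share of NaN entries in each column

print(missing_per_column[missing_per_column > 0])  # Only the columns with missing values
print(missing_fraction)
```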

Since the column Cabin has too many missing values (687 out of 891) for either of the other two approaches to be viable, the best course of action is to drop the column altogether from the dataset. To do so, we use the Pandas function *drop()* while specifying the column we want to drop.

In [156]:
df = df.drop(columns=['Cabin'])  # Dropping the Cabin column

print(df.shape)

(891, 11)


As we can see by printing the new shape of the dataframe, the dataset now has one fewer column.

Now, to handle the missing values in the column Age, there are two possible courses of action: either eliminating from the dataset the rows that have a *NaN* value in the Age column, or replacing the missing values with a new value. In some cases dropping the NaNs isn't a viable solution because too many rows would be lost. Some of the most common strategies used to replace NaN values without eliminating the row from the dataset (commonly called **data filling** strategies) are:
*    replacing the NaNs with the median value of the dataset column (*median filling*);
*    replacing the NaNs with the mean value of the dataset column (*mean filling*);
*    propagating the last valid observation forward until the next valid one (*forward filling*);
*    propagating the next valid observation backward until the previous valid one (*backward filling*).

Eliminating the rows that have a *NaN* value in the Age column would erase 177 rows, roughly 20% of the entire dataset. Since that is too many rows to drop, the better alternative is to replace the *NaN* values with a new value.
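To make the four data filling strategies concrete, here is a minimal sketch on a made-up Series of ages (the values and variable names are assumptions for illustration). It uses the Pandas methods *fillna()*, *ffill()* and *bfill()*.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0])  # Made-up ages with two holes

median_filled = ages.fillna(ages.median())  # Median of the valid values: 30.5
mean_filled = ages.fillna(ages.mean())      # Mean of the valid values: 34.25
forward_filled = ages.ffill()               # Each hole takes the last valid value before it
backward_filled = ages.bfill()              # Each hole takes the next valid value after it

print(pd.DataFrame({
    "original": ages,
    "median": median_filled,
    "mean": mean_filled,
    "ffill": forward_filled,
    "bfill": backward_filled,
}))
```

Note that forward filling cannot fill a hole at the very start of a column and backward filling cannot fill one at the very end, since there is no valid value to propagate.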

Let's fill the NaN values of the column "Age" by replacing them with the mean age of the passengers (mean filling). We'll do so by using the *replace()* command of Pandas.


In [157]:
import numpy as np

mean_age_passengers = df["Age"].mean()  # Getting the mean age of all the passengers
print(mean_age_passengers)

# IMPORTANT! In dataframes the NaN values correspond to the value "np.nan" of the numpy library (here shortened to np)
column_without_nans = df["Age"].replace(np.nan, mean_age_passengers)  # Replacing in the Age column the NaN values with the mean age of the passengers

df["Age"] = column_without_nans # Assigning to the original column

print(df["Age"].value_counts())  # By using value_counts() we can see we replaced successfully the 177 NaN values in the Age column with the mean age of the passengers

29.69911764705882
29.699118    177
24.000000     30
22.000000     27
18.000000     26
28.000000     25
            ... 
55.500000      1
53.000000      1
20.500000      1
23.500000      1
0.420000       1
Name: Age, Length: 89, dtype: int64


Now that we've dealt with the missing values of the column Age, the last column with *NaN* entries is Embarked. Since there are only 2 empty entries in this column, the easiest way to proceed is to drop the rows with such entries from the dataset with the *dropna()* command.

In [158]:
df = df.dropna()  # Applying dropna to the dataset
df.reset_index(inplace=True)  # IMPORTANT! When removing rows from a dataset it's always recommended to reset the index of the dataframe (the old index is kept as a new column named "index")
print(df.shape)  # Printing the new shape of the dataset

(889, 12)


With the command *dropna()* the number of rows of the dataset went from 891 to 889, so only the 2 rows containing *NaN* values were dropped. Now let's check again that the dataset doesn't have empty entries:

In [159]:
for column in df.columns:

  number_of_nans_in_column = df[column].isnull().sum()  # Counting the number of missing values in the column
  print(f"The column {column} has {number_of_nans_in_column} missing values\n")  # Printing the column with missing values and how many such values are in it

The column index has 0 missing values

The column PassengerId has 0 missing values

The column Survived has 0 missing values

The column Pclass has 0 missing values

The column Name has 0 missing values

The column Sex has 0 missing values

The column Age has 0 missing values

The column SibSp has 0 missing values

The column Parch has 0 missing values

The column Ticket has 0 missing values

The column Fare has 0 missing values

The column Embarked has 0 missing values
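As a variation on the cleanup above, the sketch below (on made-up data; the name `toy` is an assumption) restricts *dropna()* to a single column through its `subset` parameter, so that rows are dropped only when Embarked is missing, and passes `drop=True` to *reset_index()* so that the old index is discarded instead of being stored in an extra "index" column.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing Age, one missing Embarked
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

cleaned = toy.dropna(subset=["Embarked"])  # Drops only the rows where Embarked is NaN
cleaned = cleaned.reset_index(drop=True)   # Renumbers the rows without adding an "index" column

print(cleaned)
```

With this variant the row with a missing Age is kept, which is useful when different columns are handled with different strategies.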



# Encoding categorical features
The categorical features of a dataset are all those features that don't actually represent values, but categories; such categories can be represented as strings or characters identifying a certain class. Neural networks don't know how to handle strings or characters and so that type of data needs to be encoded in a numerical form, so that the neural network can process it.

Examples of categorical features in the Titanic Survivals dataset are the column Pclass, in which the ticket class of the passenger is *already encoded* as an integer (1 = 1st, 2 = 2nd, 3 = 3rd class), the column Sex, in which the sex of the passenger is represented by a string ("male" for men, "female" for women) that needs to be encoded, and the column Embarked, in which the port of embarkation is represented by a character that needs to be encoded ('C'=Cherbourg, 'Q'=Queenstown, 'S'=Southampton).

The two most popular methods used to encode categorical features are **integer encoding** and **one-hot encoding**.

With integer encoding (which was used for the column Pclass), each unique category value of a feature is assigned an integer, which replaces the original value. While simple, integer encoding should be used *carefully*, since it can establish an ordinal relationship where there is not supposed to be one. For instance, let's say we map the values of the feature Sex in the following way:

'male' -> 0; 'female' -> 1;

By doing so we also implicitly define an ordinal relationship, since 1>0, but such a relationship doesn't apply to the original data (the value "female" isn't "greater than" the value "male").
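The mapping above can be sketched with the Pandas method *map()*. The Series and the dictionary below are made up for illustration; any category that is missing from the dictionary would be turned into *NaN*, so the mapping should cover every value of the column.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])  # Made-up values of the feature Sex

mapping = {"male": 0, "female": 1}  # One integer per category
sex_encoded = sex.map(mapping)      # Replaces each category with its integer

print(sex_encoded.tolist())
```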

One-hot encoding is often used to avoid the issues of integer encoding. In one-hot encoding the feature to be encoded is removed and a new binary variable is added for each unique categorical value. In the Sex example, there are 2 categories ("male" and "female") and therefore 2 binary variables are needed. For each row, a "1" is placed in the binary variable that corresponds to the row's category and a "0" in all the others. Let's see an example by using only the first 10 rows of the dataset:


In [160]:
dataframe_to_encode = df.head(10)  # Getting the first 10 rows of the original dataset
print(dataframe_to_encode)

values_to_encode = dataframe_to_encode[["Sex"]]  # Inserting the values to be encoded in a dataframe (hence the double square brackets)
print(f"Unencoded values of the column Sex:\n{values_to_encode}\n")

from sklearn.preprocessing import OneHotEncoder  # The most used implementation of the one-hot encoder is the one of the sklearn library for preprocessing 
encoder = OneHotEncoder()  # Initializing the encoder
encoded_values = encoder.fit_transform(values_to_encode)  # Fitting the encoder with the data to encode and making it transform such data
encoded_values = encoded_values.toarray()  # Putting the encoded values in a numpy array
print(f"Array of the values encoded by the one-hot encoder:\n{encoded_values}\n")  # Checking which categorical value was encoded first

# Converting the encoded values in a dataframe, while naming the new columns according to the order in which the values were encoded
encoded_values = pd.DataFrame(encoded_values, columns=["isFemale", "isMale"])
print(f"Dataframe of the values encoded by the one-hot encoder:\n{encoded_values}\n")

dataframe_to_encode = dataframe_to_encode.join(encoded_values)  # Adding the columns with the encoded values to the dataframe
dataframe_to_encode = dataframe_to_encode.drop(columns=["Sex"])  # Dropping the original unencoded column
print(f"Dataframe with the encoded values:\n{dataframe_to_encode}\n")  # Printing the dataset with the encoded data


   index  PassengerId  Survived  Pclass  \
0      0            1         0       3   
1      1            2         1       1   
2      2            3         1       3   
3      3            4         1       1   
4      4            5         0       3   
5      5            6         0       3   
6      6            7         0       1   
7      7            8         0       3   
8      8            9         1       3   
9      9           10         1       2   

                                                Name     Sex        Age  \
0                            Braund; Mr. Owen Harris    male  22.000000   
1  Cumings; Mrs. John Bradley (Florence Briggs Th...  female  38.000000   
2                             Heikkinen; Miss. Laina  female  26.000000   
3       Futrelle; Mrs. Jacques Heath (Lily May Peel)  female  35.000000   
4                           Allen; Mr. William Henry    male  35.000000   
5                                   Moran; Mr. James    male  29.699118   
6

With one-hot encoding the original column of the dataset is **replaced** by *n* new columns, where *n* is the number of unique categorical values of the feature to be encoded. One-hot encoding can cause problems when a categorical feature has too many unique categorical values. For instance, encoding a categorical feature of a dataset with 300 unique categorical values with one-hot encoding adds 300 new columns to the dataset, while only the original column is dropped, thus effectively adding 299 new columns to the dataset!
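The column arithmetic described above can be checked on a small made-up dataframe. This sketch uses the Pandas function *get_dummies()* instead of the sklearn encoder used in this lesson, simply because it names the new columns automatically; the dataframe `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up dataframe with one categorical feature that has 3 unique values
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.46, 8.05],
    "Embarked": ["S", "C", "Q", "S"],
})

# The column Embarked is removed and one binary column per category is added
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(encoded)
print(f"Columns before: {toy.shape[1]}, columns after: {encoded.shape[1]}")
```

Here 3 columns are added and 1 is dropped, so the dataframe gains 2 columns overall.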

Now let's try using one-hot encoding to encode the feature Embarked, since this feature has only 3 unique categorical values.

In [162]:
df = df.copy()  # Working on a copy of the dataframe cleaned in the previous steps

values_to_encode = df[["Embarked"]]  # Selecting the values to be encoded in a dataframe
print(f"First ten values of the feature to be encoded:\n{values_to_encode.head(10)}\n")  # Checking the values of the unencoded categorical values

encoder = OneHotEncoder()
encoded_values = encoder.fit_transform(values_to_encode)
encoded_values = encoded_values.toarray()
print(f"First ten values of the encoded feature:\n{encoded_values[:10]}\n")  # Checking the order in which the categorical values were encoded

encoded_values = pd.DataFrame(encoded_values, columns=["inCherbourg", "inQueenstown", "inSouthampton"])

df = df.join(encoded_values)
df = df.drop(columns=['Embarked'])
print(f"Dataframe with the encoded values:\n{df}\n")  # Printing the dataset with the encoded data

First ten values of the feature to be encoded:
  Embarked
0        S
1        C
2        S
3        S
4        S
5        Q
6        S
7        S
8        S
9        C

First ten values of the encoded feature:
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]]

Dataframe with the encoded values:
     index  PassengerId  Survived  Pclass  \
0        0            1         0       3   
1        1            2         1       1   
2        2            3         1       3   
3        3            4         1       1   
4        4            5         0       3   
..     ...          ...       ...     ...   
884    886          887         0       2   
885    887          888         1       1   
886    888          889         0       3   
887    889          890         1       1   
888    890          891         0       3   

                                                  Name     Sex        Age  \
0              

# Splitting the dataset
Splitting the dataset before feeding the data to a neural network is a fundamental step in data preprocessing. Datasets are commonly split into two disjoint parts:
*   a **training set**, whose data is used during the training phase of a machine learning algorithm in order to fit the parameters (weights) of the neural network to the input data;
*   a **testing set**, used to benchmark the performance of the trained neural network.

As a rule of thumb, datasets are typically divided between training and testing sets with an 80-20 proportion.

Let's see how to split the Titanic Survivals dataset by using the sklearn library.

In [163]:
from sklearn.model_selection import train_test_split  # Importing the splitting function from the model_selection functionalities of sklearn

train, test = train_test_split(df, test_size=0.2)  # Applying the splitting function to the dataset, which randomly selects 20% of the dataset rows for the testing set

print(f"The training set is:\n{train.shape}\n")
print(f"The testing set is:\n{test.shape}\n")

The training set is:
(711, 14)

The testing set is:
(178, 14)
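Since the split is random, running the cell again selects different rows. The sketch below, on a made-up dataframe, shows two optional parameters of *train_test_split()* that are not used in this lesson: `random_state`, which makes the split reproducible, and `stratify`, which keeps the proportion of each class of the given column the same in both parts. The seed value 0 and the name `toy` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataframe: 20 rows, half survivors and half non-survivors
toy = pd.DataFrame({"Fare": range(20), "Survived": [0, 1] * 10})

# Same seed and same stratification column: the two calls select the same rows
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=0, stratify=toy["Survived"])
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=0, stratify=toy["Survived"])

print(train_a.shape, test_a.shape)
print(test_a["Survived"].value_counts())  # Both classes are equally represented in the testing set
```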



Sometimes the dataset can also be further divided in order to obtain a **validation set**. A validation dataset is a dataset of examples used to tune the hyperparameters of a neural network in order to avoid the problem of overfitting during the training process, thus improving the performance of the network in the testing phase.

When using a training, validation and testing set, the original dataset is typically split with a 60-20-20 proportion.

Now let's try dividing the training set in order to obtain a validation dataset following such proportion (and thus with the same number of rows as the testing set).
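When the testing set has already been taken out, the remaining data is 80% of the original, so the validation split has to take 25% of what remains: 0.25 × 0.8 = 0.2. The short calculation below checks this on the 889 rows of our dataset, under the assumption that the size of the held-out part is rounded up, which is consistent with the shapes printed in this lesson.

```python
import math

n_rows = 889                            # Rows left after dropping the rows with missing Embarked

n_test = math.ceil(0.2 * n_rows)        # 20% of the dataset for the testing set
n_remaining = n_rows - n_test           # What is left for training and validation

n_val = math.ceil(0.25 * n_remaining)   # 25% of the remaining 80%, i.e. 20% of the whole
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
```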

In [164]:
# Applying the splitting function to the training set to get a validation set
train, validation = train_test_split(train, test_size=0.25)

if validation.shape[0] == test.shape[0]:  # Checking that the validation and testing set have the same number of rows
  print("The validation and testing sets have the same number of rows!")
  print(f"The new training set is:\n{train}\n")
  print(f"The validation set is:\n{validation}\n")
else:
  raise Exception("Sorry, the validation and testing sets don't have the same number of rows")
print("train:", train.shape, "val:", validation.shape, "test:", test.shape)

The validation and testing sets have the same number of rows!
The new training set is:
     index  PassengerId  Survived  Pclass  \
504    505          506         0       1   
252    253          254         0       3   
665    666          667         0       2   
514    515          516         0       1   
487    488          489         0       3   
..     ...          ...       ...     ...   
568    569          570         1       3   
790    791          792         0       2   
77      78           79         1       2   
386    387          388         1       2   
831    833          834         0       3   

                                           Name     Sex    Age  SibSp  Parch  \
504  Penasco y Castellana; Mr. Victor de Satode    male  18.00      1      0   
252                    Lobb; Mr. William Arthur    male  30.00      1      0   
665                 Butler; Mr. Reginald Fenton    male  25.00      0      0   
514                Walker; Mr. William Anderson    m

Splitting the dataset into training, validation and testing sets isn't enough: before feeding the data to a neural network, the target variable has to be separated from the other features and given to the network *separately*. If the target variable is left among the input features, the neural network will simply output it directly, without learning anything.

Let's suppose we want to train a neural network to classify whether a given passenger will survive or not. In this case, the target variable is Survived. Now let's separate the target variable from the other features of the training, validation and testing sets.

In [165]:
x_training = train[train.columns.difference(['Survived'])]  # Selecting all the features of the training set but the column Survived
y_training = train['Survived']  # Selecting the target variable of the training set
print(f"Features of the training set:\n{x_training.shape}\nTarget variable of the training set:\n{y_training.shape}\n")

x_validation = validation[validation.columns.difference(['Survived'])]
y_validation = validation['Survived']
print(f"Features of the validation set:\n{x_validation.shape}\nTarget variable of the validation set:\n{y_validation.shape}\n")

x_testing = test[test.columns.difference(['Survived'])]
y_testing = test['Survived']
print(f"Features of the testing set:\n{x_testing.shape}\nTarget variable of the testing set:\n{y_testing.shape}\n")

Features of the training set:
(533, 13)
Target variable of the training set:
(533,)

Features of the validation set:
(178, 13)
Target variable of the validation set:
(178,)

Features of the testing set:
(178, 13)
Target variable of the testing set:
(178,)
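One detail worth knowing about the selection used above: *columns.difference()* returns the remaining column names in sorted order, so the features end up alphabetically reordered. An alternative that keeps the original column order is the Pandas function *drop()*. The sketch below compares the two on a made-up dataframe (the name `toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Made-up dataframe with the target variable as first column
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Fare": [7.25, 71.28],
    "Age": [22.0, 38.0],
})

x_sorted = toy[toy.columns.difference(["Survived"])]  # Features in alphabetical order
x_ordered = toy.drop(columns=["Survived"])            # Features in their original order
y = toy["Survived"]                                   # Target variable

print(list(x_sorted.columns))
print(list(x_ordered.columns))
```

Either selection works for training, as long as the same one is applied to the training, validation and testing sets.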



# Feature scaling
The scaling of the features of the dataset is typically the last step of the preprocessing of a dataset. Since the range of values of the features of a dataset usually varies widely, scaling is done in order to obtain a uniform scale of values between all the features of the dataset; doing so typically facilitates the training phase of a machine learning algorithm.

Two approaches are typically used in order to scale the data of a dataset: **normalization** and **standardization**. With normalization the values of the features are scaled in order to have values between 0 and 1, while standardization transforms the data of a feature to have a mean of zero and a standard deviation of 1.

The *standard scaler* is among the most used methods of feature scaling through standardization. The standard scaler acts on a feature by removing the mean and scaling to unit variance. The scaled value *z* of a sample *x* of a feature is calculated as:

*z = (x - u) / s*

where *u* is the mean of the samples in the feature, and *s* is the standard deviation of the samples of the feature.
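The formula can be applied by hand with numpy, which makes explicit that *u* and *s* are computed on the training data only and then reused for any other data. The fares below are made up for illustration; the sketch uses the population standard deviation (numpy's default), which to our knowledge is also what the sklearn standard scaler uses.

```python
import numpy as np

train_fare = np.array([8.0, 13.0, 26.0, 53.0])  # Made-up training fares
test_fare = np.array([13.0, 100.0])             # Made-up testing fares

u = train_fare.mean()  # Mean of the training samples
s = train_fare.std()   # Standard deviation of the training samples (population formula)

train_scaled = (train_fare - u) / s  # z = (x - u) / s
test_scaled = (test_fare - u) / s    # Same u and s, taken from the training set

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

After scaling, the training values have a mean of zero and a standard deviation of 1, while the testing values generally do not, since they were scaled with statistics that belong to the training set.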

Now, let's try using the standard scaler on the feature Fare of the training, validation and testing sets.

In [181]:
x = x_training[["Fare"]]
print(x)

print(x_training.shape, y_training.shape)

         Fare
504  108.9000
252   16.1000
665   13.0000
514   34.0208
487    8.0500
..        ...
568    7.8542
790   26.0000
77    29.0000
386   13.0000
831    7.8542

[533 rows x 1 columns]
(533, 13) (533,)


In [186]:
data_to_scale_training = x_training[["Fare"]]  # Dataframe with the values of Fare in the training set
data_to_scale_validation = x_validation[["Fare"]]  # Dataframe with the values of Fare in the validation set
data_to_scale_testing = x_testing[["Fare"]]  # Dataframe with the values of Fare in the testing set

from sklearn.preprocessing import StandardScaler  # Importing the scaler
standard_scaler = StandardScaler()

X_train_std = standard_scaler.fit_transform(data_to_scale_training)  # Fitting the scaler on the training examples and transforming the values 
# IMPORTANT! The scaler must be fit on the training set data ONLY
X_val_std = standard_scaler.transform(data_to_scale_validation)
X_test_std = standard_scaler.transform(data_to_scale_testing)


print(f"First 10 values of the feature Fare in the training set before standard scaling are:\n{data_to_scale_training.head(10)}\nafter they are:\n{X_train_std[:10]}\n\n")
print(f"First 10 values of the feature Fare in the validation set before standard scaling are:\n{data_to_scale_validation.head(10)}\nafter they are:\n{X_val_std[:10]}\n\n")
print(f"First 10 values of the feature Fare in the testing set before standard scaling are:\n{data_to_scale_testing.head(10)}\nafter they are:\n{X_test_std[:10]}")

First 10 values of the feature Fare in the training set before standard scaling are:
         Fare
504  108.9000
252   16.1000
665   13.0000
514   34.0208
487    8.0500
720    7.0542
317  164.8667
6     51.8625
31   146.5208
98    26.0000
after they are:
[[ 1.46570951]
 [-0.33750122]
 [-0.39773779]
 [ 0.01072053]
 [-0.49392198]
 [-0.51327152]
 [ 2.55320685]
 [ 0.35740528]
 [ 2.19672492]
 [-0.14513284]]


First 10 values of the feature Fare in the validation set before standard scaling are:
         Fare
885   30.0000
863   13.0000
100    7.8958
380   15.7417
225   10.5000
392  113.2750
788   79.2000
376  211.5000
12     8.0500
182   39.0000
after they are:
[[-0.06740824]
 [-0.39773779]
 [-0.49691827]
 [-0.34446341]
 [-0.44631567]
 [ 1.55072079]
 [ 0.88860435]
 [ 3.45934551]
 [-0.49392198]
 [ 0.10747211]]


First 10 values of the feature Fare in the testing set before standard scaling are:
        Fare
406  18.7500
168  56.4958
620  52.5542
509   7.7500
681   9.2250
291  12.8750
82   47.

The *minmax scaler* is among the most used methods of feature scaling through normalization. The minmax scaler maps the maximum value of a feature to 1 and its minimum value to 0. The other entries of the feature are then scaled linearly according to the new range of values.
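Written as a formula, the scaled value of a sample *x* is *(x - min) / (max - min)*, where *min* and *max* are the minimum and maximum of the feature. The numpy sketch below applies it by hand to made-up fares, taking *min* and *max* from the training data only; the variable names and values are assumptions for illustration.

```python
import numpy as np

train_fare = np.array([0.0, 13.0, 26.0, 52.0])  # Made-up training fares
val_fare = np.array([13.0, 104.0])              # Made-up validation fares

fare_min = train_fare.min()  # Minimum of the training samples
fare_max = train_fare.max()  # Maximum of the training samples

train_scaled = (train_fare - fare_min) / (fare_max - fare_min)
val_scaled = (val_fare - fare_min) / (fare_max - fare_min)  # Same min and max, taken from the training set

print(train_scaled)
print(val_scaled)
```

Since the range is learned from the training set, a validation or testing value larger than the training maximum is mapped to a value above 1, as happens here for the fare of 104.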

Now, let's try using the minmax scaler on the feature Fare of the training, validation and testing sets.

In [187]:
from sklearn.preprocessing import MinMaxScaler  # Importing the scaler
minmax_scaler = MinMaxScaler()

X_train_norm = minmax_scaler.fit_transform(data_to_scale_training)
X_val_norm = minmax_scaler.transform(data_to_scale_validation)
X_test_norm = minmax_scaler.transform(data_to_scale_testing)

print(f"First 10 values of the feature Fare in the training set before minmax scaling are:\n{data_to_scale_training.head(10)}\nafter they are:\n{X_train_norm[:10]}\n\n")
print(f"First 10 values of the feature Fare in the validation set before minmax scaling are:\n{data_to_scale_validation.head(10)}\nafter they are:\n{X_val_norm[:10]}\n\n")
print(f"First 10 values of the feature Fare in the testing set before minmax scaling are:\n{data_to_scale_testing.head(10)}\nafter they are:\n{X_test_norm[:10]}")

First 10 values of the feature Fare in the training set before minmax scaling are:
         Fare
504  108.9000
252   16.1000
665   13.0000
514   34.0208
487    8.0500
720    7.0542
317  164.8667
6     51.8625
31   146.5208
98    26.0000
after they are:
[[0.21255864]
 [0.03142511]
 [0.02537431]
 [0.06640418]
 [0.01571255]
 [0.01376888]
 [0.32179837]
 [0.10122886]
 [0.28598956]
 [0.05074862]]


First 10 values of the feature Fare in the validation set before minmax scaling are:
         Fare
885   30.0000
863   13.0000
100    7.8958
380   15.7417
225   10.5000
392  113.2750
788   79.2000
376  211.5000
12     8.0500
182   39.0000
after they are:
[[0.0585561 ]
 [0.02537431]
 [0.01541158]
 [0.03072575]
 [0.02049464]
 [0.22109808]
 [0.1545881 ]
 [0.41282051]
 [0.01571255]
 [0.07612293]]


First 10 values of the feature Fare in the testing set before minmax scaling are:
        Fare
406  18.7500
168  56.4958
620  52.5542
509   7.7500
681   9.2250
291  12.8750
82   47.1000
798  24.1500
416  13.