# Introduction to Data Preprocessing

We will investigate common practices of data preprocessing with a toy example.


In [None]:
import pandas as pd
import numpy as np 
import sklearn

In [None]:
df = pd.read_csv("../Datasets/titanic3.csv")
df.head()

In [None]:
df.shape

In [None]:
df.info()  # age, fare, cabin, embarked, boat, body and home.dest have fewer non-null entries than rows, so they contain missing data

In [None]:
df.isna().sum()  # alternatively, count the missing values in each column directly
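
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from the dataset): since `isna()` returns booleans, taking the mean gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3, 151.6, np.nan, 151.6],
    "cabin": [np.nan, np.nan, np.nan, "C22"],
})

missing_count = toy.isna().sum()   # number of missing entries per column
missing_share = toy.isna().mean()  # fraction of missing entries per column

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```

A column that is mostly empty is often a candidate for removal rather than imputation.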

We have to decide how to handle the dataset. A first step is to consider which of the columns are actually useful for the analysis.

In [None]:
# drop the columns name, ticket, cabin, boat, body and home.dest
df = df.drop(["name","ticket","cabin","boat","body", "home.dest"], axis=1)
df.head()

We still have to deal with missing data. We can either remove the rows that contain missing values (at the risk of losing a lot of information) or impute the missing values.

If you want to remove the rows with missing values you can simply type `df = df.dropna()`. We will proceed with the imputation instead.
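
To see how much information row removal can cost, here is a small sketch on a hypothetical toy frame (made-up values, not the real dataset). It compares the number of rows before and after `dropna()`, and shows the `subset` argument, which restricts the check to selected columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, np.nan, 8.05, 51.86],
    "survived": [0, 1, 1, 0, 0],
})

rows_before = len(toy)
rows_after = len(toy.dropna())                       # keep only fully observed rows
rows_after_subset = len(toy.dropna(subset=["age"]))  # only require 'age' to be present

print(rows_before, rows_after, rows_after_subset)
```

Here more than half of the rows disappear even though each column is missing only one or two values, which is why imputation is often preferred.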

In [None]:
df['age'] = df['age'].fillna(df['age'].mean())  # replace the missing values with the column mean
df['fare'] = df['fare'].fillna(df['fare'].mean())
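
The mean is only one possible choice of fill value. As a sketch on a hypothetical toy frame (made-up values), the median is less sensitive to extreme values, and for a categorical column the most frequent category can play the same role. These are alternatives for comparison, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one extreme fare, only for illustration
toy = pd.DataFrame({
    "fare": [7.0, 8.0, 9.0, np.nan, 500.0],
    "embarked": ["S", "S", "C", np.nan, "Q"],
})

# mean imputation is pulled upwards by the extreme value
mean_filled = toy["fare"].fillna(toy["fare"].mean())

# median imputation stays close to the typical fare
median_filled = toy["fare"].fillna(toy["fare"].median())

# for a categorical column, fill with the most frequent category
embarked_filled = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(mean_filled[3], median_filled[3], embarked_filled[3])
```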

Now we can preprocess the categorical data. Since most models cannot handle categorical variables directly, we encode them as dummy variables.

In [None]:
# one-hot encode the categorical columns: get_dummies replaces the original columns
# with prefixed dummy columns (e.g. pclass_1, sex_male, embarked_S), so all column names remain strings
df = pd.get_dummies(df, columns=["pclass", "sex", "embarked"])
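
The sketch below uses a hypothetical toy frame (made-up values) to show what the encoded columns look like when `pd.get_dummies` is given a `columns` argument. It also shows `drop_first=True`, which drops one level per variable; this can be useful for models that are sensitive to perfectly collinear columns. The `dtype=int` argument is only there to get 0/1 values instead of booleans.

```python
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, 22.0, 40.0, 31.0],
})

# full one-hot encoding: one column per category, prefixed with the original column name
encoded = pd.get_dummies(toy, columns=["pclass", "sex"], dtype=int)

# reduced encoding: the first level of each variable is dropped
reduced = pd.get_dummies(toy, columns=["pclass", "sex"], dtype=int, drop_first=True)

print(list(encoded.columns))
print(list(reduced.columns))
```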

In [None]:
df.head()   #we can see the new columns were added

Another common preprocessing step is the normalization of the numerical variables.

In [None]:
# we will rescale 'age' and 'fare'
# 'sibsp' and 'parch' count the siblings/spouses and parents/children aboard the ship, and can be left untouched

# for example, we can divide by the maximum absolute value; since both features are non-negative, the result lies in [0, 1]
def absolute_maximum_scaler(series):
    return series/series.abs().max()

for col in ['age', 'fare']:
    df[col] = absolute_maximum_scaler(df[col])
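
For comparison, here is a sketch of two other common rescalings on a hypothetical series (made-up values). The helper names `min_max_scaler` and `standard_scaler` are illustrative and not part of any library. Unlike division by the maximum absolute value, min-max scaling maps the smallest value exactly to 0, while standardization gives zero mean and unit variance.

```python
import pandas as pd

def absolute_maximum_scaler(series):
    # divide by the largest absolute value
    return series / series.abs().max()

def min_max_scaler(series):
    # map the minimum to 0 and the maximum to 1
    return (series - series.min()) / (series.max() - series.min())

def standard_scaler(series):
    # subtract the mean and divide by the (population) standard deviation
    return (series - series.mean()) / series.std(ddof=0)

# Hypothetical ages, only for illustration
ages = pd.Series([10.0, 20.0, 30.0, 40.0])

abs_scaled = absolute_maximum_scaler(ages)
min_max_scaled = min_max_scaler(ages)
standardized = standard_scaler(ages)

print(abs_scaled.tolist())
print(min_max_scaled.tolist())
print(standardized.tolist())
```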

In [None]:
df.head()

Lastly, you would usually separate the response variable from the covariates.

In [None]:
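y = df['survived']
X = df.iloc[:,1:]  # all columns except the first one, which is 'survived' at this point

# or, without relying on the column order, you could write
#y = df['survived']
#X = df.drop(['survived'], axis=1)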
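
The two ways of building `X` agree only while `survived` is the first column. The sketch below, on a hypothetical toy frame (made-up values), illustrates the difference: selecting by position silently keeps the response among the covariates if the column order changes, whereas dropping by name does not.

```python
import pandas as pd

# Hypothetical toy frame with the response in the first column, only for illustration
toy = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [0.36, 0.27, 0.50],
    "sex_male": [0, 1, 0],
})

y = toy["survived"]
X_by_position = toy.iloc[:, 1:]             # relies on 'survived' being first
X_by_name = toy.drop(columns=["survived"])  # independent of the column order
same_result = X_by_position.equals(X_by_name)

# after moving the response to the end, the positional selection is no longer correct
reordered = toy[["age", "sex_male", "survived"]]
leaky_X = reordered.iloc[:, 1:]
safe_X = reordered.drop(columns=["survived"])

print(same_result, list(leaky_X.columns), list(safe_X.columns))
```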