# Data Preprocessing

### Loading the packages

In [1]:
import pandas as pd

### Preparing the data

Loading the `train` and `test` data frames and combining them (without the `Survived` target column) to have a look at the whole dataset.

In [2]:
train = pd.read_csv("../data/raw/train.csv")
test = pd.read_csv("../data/raw/test.csv")

data = pd.concat([train.drop("Survived", axis=1), test])
data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Getting a description of the whole dataset.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Pclass       1309 non-null   int64  
 2   Name         1309 non-null   object 
 3   Sex          1309 non-null   object 
 4   Age          1046 non-null   float64
 5   SibSp        1309 non-null   int64  
 6   Parch        1309 non-null   int64  
 7   Ticket       1309 non-null   object 
 8   Fare         1308 non-null   float64
 9   Cabin        295 non-null    object 
 10  Embarked     1307 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB
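
The index line in the output above reads `0 to 417` because `pd.concat` keeps each frame's original row labels, so labels repeat across the two parts. The sketch below illustrates this on two toy frames (hypothetical values, not the Titanic data) and shows `ignore_index=True` as one way to get unique labels if they are needed later.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical values).
a = pd.DataFrame({"PassengerId": [1, 2, 3]})
b = pd.DataFrame({"PassengerId": [4, 5]})

# Default behaviour: the original labels of each frame are kept.
kept = pd.concat([a, b])

# ignore_index=True discards the old labels and numbers rows from 0.
fresh = pd.concat([a, b], ignore_index=True)

print(list(kept.index), kept.index.is_unique)
print(list(fresh.index), fresh.index.is_unique)
```

Duplicate labels are harmless for the summary calls used here, but label-based selection such as `.loc[0]` would return more than one row.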


In [4]:
data.isnull().sum()

PassengerId       0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64
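
The raw counts are easier to compare as shares of the 1309 rows: `Cabin` is missing in 1014 rows (about 77%) and `Age` in 263 (about 20%). A minimal sketch of computing such shares, shown on a small toy frame with hypothetical values rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps similar in kind to the real data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# The mean of the boolean mask is the fraction of missing entries per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame the same expression would be `data.isnull().mean()`.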

In [5]:
data.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,655.0,2.294882,29.881138,0.498854,0.385027,33.295479
std,378.020061,0.837836,14.413493,1.041658,0.86556,51.758668
min,1.0,1.0,0.17,0.0,0.0,0.0
25%,328.0,2.0,21.0,0.0,0.0,7.8958
50%,655.0,3.0,28.0,0.0,0.0,14.4542
75%,982.0,3.0,39.0,1.0,0.0,31.275
max,1309.0,3.0,80.0,8.0,9.0,512.3292


### Preprocessing

The columns `Name`, `Ticket` and `Cabin` are to be dropped.<br>
The `Sex` column is encoded as 0 (male) and 1 (female).<br>
Of the remaining columns, `Age`, `Fare` and `Embarked` contain missing values.<br>
The missing `Age` values we will replace with the mean age.<br>
The missing `Fare` value we will replace with the most frequent fare.<br>
The `Embarked` column we will turn into one-hot-encoded features; a row with a missing port gets 0 in every indicator column, so no missing values remain.<br>
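
How one-hot encoding absorbs the missing `Embarked` entries can be seen on a toy column (hypothetical values). By default `pd.get_dummies` creates no indicator for `NaN`, so the affected row is simply all zeros.

```python
import numpy as np
import pandas as pd

# Toy column with one missing port of embarkation.
embarked = pd.Series(["S", "C", np.nan, "Q"], name="Embarked")

# One indicator column per observed category; NaN gets no column of its own.
dummies = pd.get_dummies(embarked, dtype=int)
print(dummies)

# The row that was NaN (position 2) has 0 in every indicator column.
print(dummies.iloc[2].tolist())
```

If the missingness itself should be kept as a feature, `pd.get_dummies` also accepts `dummy_na=True`, which adds a separate indicator column for `NaN`.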

In [6]:
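def process(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of `df` containing only numeric features."""
    # Drop the free-text columns; drop() returns a new frame, so the input is untouched.
    df = df.drop(columns=["Name", "Ticket", "Cabin"])

    # Encode sex as 0 (male) / 1 (female).
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    # Impute missing ages with the mean age.
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    # Impute missing fares with the most frequent fare.
    # mode() returns the value itself; value_counts().max() would return its count.
    df["Fare"] = df["Fare"].fillna(df["Fare"].mode()[0])

    # One-hot encode the port of embarkation; missing ports become all-zero rows.
    df = pd.concat(
        [df.drop(["Embarked"], axis=1), pd.get_dummies(df["Embarked"], dtype=int)],
        axis=1
    )

    return df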
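
When imputing with the most frequent value, it helps to keep the value itself apart from how often it occurs, since `value_counts()` exposes both. A small sketch on a toy fare column (hypothetical values):

```python
import pandas as pd

# Toy fare column in which 8.05 occurs three times.
fare = pd.Series([8.05, 8.05, 8.05, 13.0, 26.0])

counts = fare.value_counts()

most_frequent_value = counts.idxmax()  # the fare that occurs most often
its_count = counts.max()               # how many times that fare occurs
mode_value = fare.mode()[0]            # same value, obtained via mode()

print(most_frequent_value, its_count, mode_value)
```

Only `idxmax()` or `mode()[0]` yield a fare that is suitable as a fill value; `max()` on the counts yields a number of rows.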

Taking a look at the preprocessed dataset.

In [7]:
process(data).info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Pclass       1309 non-null   int64  
 2   Sex          1309 non-null   int64  
 3   Age          1309 non-null   float64
 4   SibSp        1309 non-null   int64  
 5   Parch        1309 non-null   int64  
 6   Fare         1309 non-null   float64
 7   C            1309 non-null   int32  
 8   Q            1309 non-null   int32  
 9   S            1309 non-null   int32  
dtypes: float64(2), int32(3), int64(5)
memory usage: 97.2 KB


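Saving the preprocessed data for later use. `process` is applied to `train` and `test` separately, so the imputation values are computed per frame.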

In [8]:
process(train).to_csv("../data/train.csv", index=False)
process(test).to_csv("../data/test.csv", index=False)
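
Because each frame is encoded on its own, the one-hot columns match only if every port appears in both frames. The sketch below shows one possible guard using `reindex` with a fixed category list. The toy frames and the `ports` list are assumptions for illustration and are not part of the notebook above.

```python
import pandas as pd

# Toy frames (hypothetical values); the second one contains no "Q".
train_part = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_part = pd.DataFrame({"Embarked": ["S", "S", "C"]})

# Assumed fixed list of categories that both outputs should share.
ports = ["C", "Q", "S"]


def encode_ports(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode Embarked and force the same columns in the same order."""
    dummies = pd.get_dummies(df["Embarked"], dtype=int)
    # Categories absent from this frame are added as all-zero columns.
    return dummies.reindex(columns=ports, fill_value=0)


train_enc = encode_ports(train_part)
test_enc = encode_ports(test_part)

print(list(train_enc.columns) == list(test_enc.columns))
```

Without the `reindex` step the second toy frame would come out with only the `C` and `S` columns.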