# Titanic Survivor Predictions

This project aims to predict whether Titanic passengers survived, based on factors such as gender and socioeconomic status. The dataset is part of Kaggle's introductory machine learning competition and can be downloaded <a href="https://www.kaggle.com/c/titanic/data?select=test.csv">here</a>.

This project is part of my machine learning journey, and I will explore different aspects of machine learning, from data cleaning and feature engineering to model selection.

### 1. Dataset exploration and cleaning

In this section, we will explore the dataset and clean it where needed to ensure it is ready to be fed into machine learning models.

In [1]:
# import all the required analysis modules minus model selection (we will import this later on)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("fivethirtyeight")
pd.set_option("display.max_columns", None)

In [2]:
# import both datasets 
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [3]:
# display both datasets' datatypes
print(train.info())
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pcl

As we can see from the information above, most of the columns are fully populated. However, three columns in the training set have missing values: Age, Cabin and Embarked. The Cabin column has the most missing values (~77% of rows), and it does not seem to provide much additional information, as it only contains the code of the cabin each passenger was assigned to.

Given this, and the high proportion of missing values, we drop this column from both the train and test datasets.

In [4]:
train = train.drop("Cabin", axis=1)
test = test.drop("Cabin", axis=1)
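
The ~77% figure quoted above can be checked directly. The snippet below is a minimal sketch: `missing_share` is a hypothetical helper name, and `toy` is a made-up frame standing in for `train`, so the printed shares apply to the toy data only. The last two lines redo the Cabin arithmetic from the `info()` output (204 non-null values out of 891 rows).

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Hypothetical toy frame standing in for train (the real data comes from train.csv)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
share = missing_share(toy)
print(share)

# Cabin in the training set: 204 of 891 rows are non-null according to info()
cabin_missing = 1 - 204 / 891
print(round(cabin_missing, 3))
```

On the real data, `missing_share(train)` would rank the columns the same way, which makes it easy to spot candidates for dropping or imputation.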

Let's now handle the missing values in the other two columns. The Age column has missing values in both the training and testing datasets. However, the Embarked column only has missing values in the training set.

For the Age column, we could use either the mode or the median, computed separately for each gender; the cell below uses the median. For the Embarked column, we will fill the gaps with the mode of the column.

In [8]:
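# fill the missing values in the Age column using the per-gender median
# select "Age" before aggregating so median() only sees a numeric column
age_median_train = train.groupby("Sex")["Age"].median()
male_median_age_train = age_median_train["male"]
female_median_age_train = age_median_train["female"]

# assign through .loc: fillna(inplace=True) on train[mask]["Age"] only changes a temporary copy
male_train = train["Sex"] == "male"
female_train = train["Sex"] == "female"
train.loc[male_train, "Age"] = train.loc[male_train, "Age"].fillna(male_median_age_train)
train.loc[female_train, "Age"] = train.loc[female_train, "Age"].fillna(female_median_age_train)

age_median_test = test.groupby("Sex")["Age"].median()
male_median_age_test = age_median_test["male"]
female_median_age_test = age_median_test["female"]

male_test = test["Sex"] == "male"
female_test = test["Sex"] == "female"
test.loc[male_test, "Age"] = test.loc[male_test, "Age"].fillna(male_median_age_test)
test.loc[female_test, "Age"] = test.loc[female_test, "Age"].fillna(female_median_age_test)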
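
Per-gender median imputation can also be written more compactly with `groupby(...).transform("median")`, and the mode fill described for Embarked follows the same pattern. The snippet below is a sketch under assumptions, not the notebook's own cell: `fill_age_and_embarked` is a hypothetical helper, and `toy` is a made-up frame used only to show the behaviour.

```python
import numpy as np
import pandas as pd


def fill_age_and_embarked(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Age filled by the per-Sex median and Embarked by its mode."""
    out = df.copy()
    # transform("median") returns a Series aligned with out, holding each row's group median
    out["Age"] = out["Age"].fillna(out.groupby("Sex")["Age"].transform("median"))
    # mode() returns a Series because ties are possible; take the first entry
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return out


# Hypothetical toy frame (not the Kaggle data)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 30.0, np.nan, 18.0, np.nan, 40.0],
    "Embarked": ["S", "S", "C", np.nan, "S", "Q"],
})
filled = fill_age_and_embarked(toy)
print(filled)
```

Because the helper works on a copy and returns it, the result has to be assigned back, for example `train = fill_age_and_embarked(train)`. A quick `train["Age"].isna().sum()` afterwards confirms that no gaps remain.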