# Titanic: Machine Learning from Disaster



# Table of Contents

* [1. Introduction](#introduction)
* [2. Loading the Data](#loading-data)
* [3. Exploratory Data Analysis (EDA)](#eda)
* [4. Feature Engineering & Data Wrangling](#fe-dw)

# 1. Introduction <a class="anchor" id="introduction"></a>
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

# 2. Loading the Data <a class="anchor" id="loading-data"></a>

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")

In [None]:
train_df.columns

In [None]:
test_df.columns

#### Notes:
* `SibSp`: # of siblings / spouses aboard the Titanic
* `Parch`: # of parents / children aboard the Titanic

* `Ticket`: Ticket number
* `Cabin`: Cabin number

In [None]:
#preview data
train_df.head()

###### PassengerId
The first column is the passenger ID, a number that merely identifies each passenger in this dataset. It carries no information we should care about for prediction.
We can either drop this column or make it the index of the dataset. Let's make it the index, either while reading the file (`index_col`) or afterwards with the `df.set_index` method.

In [None]:
train_df = pd.read_csv('./data/train.csv', index_col="PassengerId")
test_df = pd.read_csv('./data/test.csv', index_col="PassengerId")

In [None]:
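# PassengerId is already the index after the read above; round-trip it to demonstrate set_index
train_df = train_df.reset_index().set_index("PassengerId")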

In [None]:
train_df.head()

In [None]:
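# Drop a leftover PassengerId column if one exists (it is the index now, so this may be a no-op)
train_df = train_df.drop(columns="PassengerId", errors="ignore")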

In [None]:
train_df

In [None]:
train_df.head()

## 2.1. Feature Classification: Categorical vs Numerical

* This helps us select the appropriate plots for visualization.

#### Which features are categorical?

* Categorical features: `nominal` or `ordinal`
* They classify the samples into sets of similar samples

#### Which features are numerical?
* Numerical features: `discrete`, `continuous`, or `timeseries` (measured on an `interval` or `ratio` scale)
* These values change from sample to sample
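
As a rough illustration of this split, the sketch below separates categorical from numerical columns by dtype. The `toy_df` frame and its rows are made up for the example; only the column names mirror the Titanic data.

```python
import pandas as pd

# Toy frame with made-up rows (an assumption for illustration, not the real data)
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# Nominal feature: categories without any order
toy_df["Sex"] = toy_df["Sex"].astype("category")
# Ordinal feature: categories with an explicit order
toy_df["Pclass"] = pd.Categorical(toy_df["Pclass"], categories=[1, 2, 3], ordered=True)

# Split the columns by dtype
categorical_cols = toy_df.select_dtypes(include="category").columns.tolist()
numerical_cols = toy_df.select_dtypes(include="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Once the categorical columns are declared explicitly, `select_dtypes` can feed each group to the matching kind of plot.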

In [None]:
train_df.info()

In [None]:
test_df.info()

  - Categorical: `Survived`, `Sex`, `Embarked`, `Pclass` (ordinal), `SibSp`, `Parch`
      - `Embarked`: Port of Embarkation - C = Cherbourg, Q = Queenstown, S = Southampton
  - Numerical: `Age`, `Fare` (continuous); `SibSp` and `Parch` are discrete counts but are treated as categorical here
  
  - Mixed data types: `Ticket`, `Cabin`
  - May contain errors or typos: `Name`
  - Blank or null values (most to least): `Cabin` > `Age` > `Embarked`
  - Data types present: string, int, float
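
One way to check the missing-value ranking is to count nulls per column and sort. This is a minimal sketch on a made-up `toy_df` (the rows are illustrative assumptions); in the notebook the same two lines would be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names come from the dataset
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count nulls per column, keep only columns with gaps, largest first
missing = toy_df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)

# Share of missing values per column, useful for deciding between dropping and imputing
missing_share = toy_df.isna().mean()
print(missing_share)
```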
  
According to the data dictionary, a passenger marked with 1 survived, so the values 1 and 0 are flags for survival rather than quantities. Yet the data type of the column is int64, which is a numerical type. We can change that with the following command.

In [None]:
train_df["Survived"] = train_df["Survived"].astype("category")

In [None]:
train_df["Survived"].dtype

In [None]:
train_df.info()

In [None]:
features = ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]

def convert_cat(df, features):
    """Cast the given columns to the pandas 'category' dtype, in place."""
    for feature in features:
        # The column name is a variable, so use bracket access (df["Pclass"]), not df.Pclass
        df[feature] = df[feature].astype("category")

convert_cat(train_df, features)
convert_cat(test_df, features)

In [None]:
train_df.info()

### 2.1.1. Distribution of Numerical feature values across the samples

In [None]:
train_df.describe()

### 2.1.2. Distribution of Categorical features

In [None]:
train_df.describe(include=['category'])

# 3. Exploratory Data Analysis (EDA)<a class="anchor" id="eda"></a>

## 3.1. Correlating categorical features
- Categorical: `Survived`, `Sex`, `Embarked`, `Pclass` (ordinal),  `SibSp` , `Parch`

### Target Variable: `Survived`

In [None]:
train_df["Survived"].value_counts().to_frame()

In [None]:
train_df["Survived"].value_counts(normalize=True).to_frame()

Only about 38% of the passengers in the training set survived, so the classes are imbalanced. The imbalance is mild, however, so I will not apply resampling techniques to correct it.
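
To put a number on "mild", one option is to look at the majority-class share, which is also the accuracy of a model that always predicts the majority class. The sketch below uses made-up labels, and the variable names are hypothetical. On the real training set the same calculation would give roughly 0.62, since about 38% survived.

```python
import pandas as pd

# Made-up labels for illustration: 0 = did not survive, 1 = survived
survived = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 1], name="Survived")

# Share of each class
class_share = survived.value_counts(normalize=True)

# Accuracy of always predicting the most frequent class
majority_baseline = class_share.max()

# Majority-to-minority ratio; values close to 1 mean a balanced target
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Majority baseline: {majority_baseline:.2f}, ratio: {imbalance_ratio:.2f}")
```

Any model trained later should beat this baseline to be considered useful.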

### `Sex`

In [None]:
train_df['Sex'].value_counts().to_frame()

In [None]:
sns.countplot(data=train_df, x='Sex', hue='Survived', palette='Blues');

- Remaining Categorical Feature Columns

In [None]:
cols = ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch']

n_rows = 2
n_cols = 3

fig, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols*3.5, n_rows*3.5))

for r in range(0, n_rows):
    for c in range(0, n_cols):
        i = r*n_cols + c #index to loop through list "cols"
        if i < len(cols):
            ax_i = ax[r,c]
            sns.countplot(data=train_df, x=cols[i], hue="Survived", palette="Blues", ax=ax_i)
            ax_i.set_title(f"Figure {i+1}: Survival Count by {cols[i]}")
            ax_i.legend(title='', loc='upper right', labels=['Not Survived', 'Survived'])
ax.flat[-1].set_visible(False) #Remove the last subplot
plt.tight_layout()
plt.show()

### Observation:

* **Survival Rate**:
    - Fig 1: Females had a higher survival rate than males.
    - Fig 2: Most passengers embarked at Southampton, which also had the largest number of non-survivors.
    - Fig 3: 1st-class passengers had a higher survival rate.
    - Fig 4: Most passengers travelling with 0 `SibSp` did not survive; passengers with 1-2 siblings/spouses aboard had a better chance of survival.
    - Fig 5: Most passengers travelling with 0 `Parch` did not survive.
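
Count plots show absolute numbers, so group sizes can hide the actual rates. A possible cross-check is to compute the survival rate per category directly. The helper `survival_rate_by` and the rows of `toy_df` are hypothetical, introduced only for this sketch.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Survived": [0, 1, 0, 1, 1, 0],
})

def survival_rate_by(df, feature):
    """Return the share of survivors within each level of `feature`."""
    # astype(int) also covers a 'category' Survived column holding 0/1 labels
    survived = df["Survived"].astype(int)
    return survived.groupby(df[feature], observed=True).mean()

rate_by_sex = survival_rate_by(toy_df, "Sex")
rate_by_class = survival_rate_by(toy_df, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

Because the mean of a 0/1 column is the fraction of ones, each value is the survival rate of that group regardless of its size.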
    
## 3.2. EDA for Numerical Features
- Numerical Features: (continuous) `Age`, `Fare`

### Age

In [None]:
sns.histplot(data=train_df, x='Age', hue='Survived', bins=40, kde=True);

- The majority of passengers were between 18 and 40 years old
- Children had a better chance of survival than other age groups
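
The age observations can be checked numerically by binning `Age` and computing the survival rate per bin. This is a sketch under stated assumptions: the bin edges, the labels, and the rows of `toy_df` are all illustrative choices, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration, including one missing age
toy_df = pd.DataFrame({
    "Age": [4.0, 9.0, 16.0, 22.0, 35.0, 38.0, 54.0, 71.0, np.nan],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Hypothetical age bands; intervals are right-closed, e.g. (0, 12], (12, 18]
age_bins = [0, 12, 18, 40, 60, np.inf]
age_labels = ["Child", "Teen", "Adult", "Middle-aged", "Senior"]
age_group = pd.cut(toy_df["Age"], bins=age_bins, labels=age_labels)

# Rows with a missing age fall into no bin and are left out of the grouping
rate_by_age = toy_df["Survived"].groupby(age_group, observed=True).mean()
print(rate_by_age)
```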

### Fare

In [None]:
train_df["Fare"].describe()

In [None]:
sns.histplot(data=train_df, x='Fare', hue='Survived', bins = 40, palette='Blues');

In [None]:
# Name the four fare quartiles: 0-25%, 25-50%, 50-75%, 75-100%

fare_categories = ['Economic', 'Standard', 'Expensive', 'Luxury']
quartile_data = pd.qcut(train_df['Fare'], 4, labels=fare_categories)

sns.countplot(x=quartile_data, hue=train_df['Survived'], palette='Blues');

In [None]:
train_df['Fare']

- Distribution of Fare
    - Fare does not follow a normal distribution; most fares are concentrated in the `$0` to `$100` range.
    - The distribution is skewed to the right (a long tail of high fares): `75%` of fares are under `$31`, while the maximum fare is `$512`.
- Quartile plot:
    - Passengers in the `Expensive` and `Luxury` fare quartiles had a better chance of survival.
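
The direction of the skew can be confirmed with `Series.skew()`, where a positive value indicates a long right tail. The sketch below uses a handful of made-up fares and also shows one common way to reduce the skew, a `log1p` transform. That transform is a suggestion for illustration, not a step from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration: many cheap tickets and one very expensive one
fare = pd.Series([7.25, 7.90, 8.05, 13.00, 26.00, 31.00, 71.30, 512.30], name="Fare")

# Positive skewness means the tail extends to the right (high fares)
raw_skew = fare.skew()

# log1p computes log(1 + x), so a fare of 0 stays well defined
log_fare = np.log1p(fare)
log_skew = log_fare.skew()

print(f"Mean: {fare.mean():.2f}, median: {fare.median():.2f}")
print(f"Skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A mean well above the median is another quick sign of right skew.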

# 4. Feature Engineering & Data Wrangling <a class="anchor" id="fe-dw"></a>