# <center> Case study: Survival classification with Titanic dataset </center>

## 1. Quick review
||Regression|Classification|
|-|-|-|
|Data|Attribute-value description|Attribute-value description|
|Target|Continuous|Nominal|
|Evaluation methods|Cross-validation, train test split ...|Cross-validation, train test split ...|
|Errors|MSE, MAE, RMSE|1-accuracy|
|Algorithms|Linear regression|Logistic Regression, Decision Tree, Naive Bayes ...|
|Baseline|Mean of target|Majority class|
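
The baseline and error rows of the table can be illustrated with a minimal sketch. The target values below are invented purely for illustration and are not taken from the Titanic data.

```python
import numpy as np

# Toy targets, invented purely for illustration
y_reg = np.array([10.0, 12.0, 14.0, 20.0])   # continuous target (regression)
y_clf = np.array([0, 0, 0, 1, 1])            # nominal target (classification)

# Regression baseline: always predict the mean of the target
pred_reg = np.full_like(y_reg, y_reg.mean())
mse = np.mean((y_reg - pred_reg) ** 2)
mae = np.mean(np.abs(y_reg - pred_reg))
rmse = np.sqrt(mse)

# Classification baseline: always predict the majority class
values, counts = np.unique(y_clf, return_counts=True)
majority = values[counts.argmax()]
error = 1 - np.mean(y_clf == majority)       # error = 1 - accuracy

print(mse, mae, rmse, error)
```

A trained model is only useful if it beats these numbers on held-out data.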

## 2. Titanic dataset
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving involved some element of luck, some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data as explained below:

|Variable| Definition| Key|
|-|-|-|
|survived| Survival| 0 = No, 1 = Yes|
|pclass| Ticket class| 1 = 1st, 2 = 2nd, 3 = 3rd|
|sex| Sex| male, female|
|age| Age in years||
|sibsp| # of siblings / spouses aboard the Titanic||
|parch| # of parents / children aboard the Titanic||
|fare| Passenger fare||
|embarked| Port of Embarkation|  C=Cherbourg, Q=Queenstown, S=Southampton|
|deck| Deck of the passenger's cabin||

The goal is to predict whether a passenger survived the sinking of the Titanic (`survived`), based on the passenger’s attributes.

# Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load and quick view data
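
The cells below use a DataFrame named `data` with columns such as `who`, `adult_male`, `alone` and `deck`. These match the copy of the Titanic data bundled with seaborn, so the sketch below assumes that source. `load_titanic` and `quick_view` are hypothetical helper names introduced here for illustration.

```python
import pandas as pd

def load_titanic():
    """Load seaborn's copy of the Titanic data (assumed source; downloaded on first use)."""
    import seaborn as sns  # imported here so quick_view works without seaborn
    return sns.load_dataset("titanic")

def quick_view(df):
    """Hypothetical helper: collect the usual first-look summaries of a DataFrame."""
    return {
        "shape": df.shape,            # (rows, columns)
        "dtypes": df.dtypes,          # type of each column
        "missing": df.isna().sum(),   # number of missing values per column
        "head": df.head(),            # first five rows
    }

# Typical use in the notebook:
# data = load_titanic()
# quick_view(data)["missing"]
```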

https://www.youtube.com/watch?v=vcbMinm_1Q8

1.5 IQR Rule

# EDA

## Descriptive analysis

In [None]:
# Numeric feature columns (the target 'survived' is excluded)
df_num = data.select_dtypes(['int64', 'float64']).drop('survived', axis=1)

# Figure
fig = plt.figure(figsize=(10, 7))

# Define grid; add_subplot needs integer row and column counts
ncols = 3
nrows = int(np.ceil(df_num.shape[1] / ncols))

# Add one histogram per numeric column
for i, c in enumerate(df_num.columns):
    ax = fig.add_subplot(nrows, ncols, i + 1)
    df_num[c].plot.hist(bins=20, ax=ax)
    ax.set_title(c)
    ax.set_ylabel('')

# Other
plt.tight_layout()
plt.show()

## Diagnostic Analysis

In [None]:
df = data[['sex', 'embarked', 'who', 'adult_male', 'alone']]
# Figure
fig = plt.figure(figsize=(10, 7))

# Define grid; add_subplot needs integer row and column counts
ncols = 3
nrows = int(np.ceil(df.shape[1] / ncols))

# Add subplots
for i, c in enumerate(df.columns):
    ax = fig.add_subplot(nrows, ncols, i + 1)
    # Count passengers per category and outcome.
    # crosstab counts every row, including passengers whose age is missing.
    tmp = pd.crosstab(data[c], data['survived'])
    tmp.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title('Survived by {}'.format(c))
    ax.set_ylabel('')
    ax.tick_params(axis='x', labelrotation=0)

# Other
plt.tight_layout()
plt.show()

In [None]:
df = data[['age', 'sibsp', 'parch', 'fare']]
# Figure
fig = plt.figure(figsize=(18, 12))

# Define grid; add_subplot needs integer row and column counts
ncols = 3
nrows = int(np.ceil(df.shape[1] / ncols))

# Add subplots
for i, c in enumerate(df.columns):
    ax = fig.add_subplot(nrows, ncols, i + 1)
    if c != 'fare':
        sns.histplot(data=data, x=c, hue='survived', kde=True, ax=ax)
    else:
        # Restrict to fares <= 200 so a few very high fares do not squash the plot
        sns.histplot(data=data[data['fare'] <= 200], x=c, hue='survived', bins=20, kde=True, ax=ax)
    ax.set_title('Survived by {}'.format(c))
    ax.set_ylabel('')

# Other
plt.tight_layout()
plt.show()

In [29]:
# Share of passengers who survived (1) and who did not (0)
survival_counts = data['survived'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(survival_counts, labels=survival_counts.index, autopct='%.2f%%', explode=[0, 0.05], shadow=True)
plt.title('Survival rate')
plt.legend()
plt.show()

# Data preprocessing

## 1. Handle Missing Values

### Age

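- Easy imputation: fill with the mean, median, or mode of the column
- Domain imputation: fill using knowledge of the data, e.g. other attributes of the same passenger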

In [30]:
## Easy imputation

In [36]:
from sklearn.impute import SimpleImputer

In [37]:
# Fill missing ages with the median age
sip = SimpleImputer(strategy='median')
age_imputed = sip.fit_transform(data[['age']])  # 2-D array of shape (n_rows, 1)

In [38]:
## Domain imputation
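
One possible form of domain imputation, sketched under the assumption that age depends on ticket class and sex: fill each missing age with the median age of passengers in the same `pclass` and `sex` group. The sample rows are made up and the column `age_filled` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with the same column names as the Titanic data
sample = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "sex":    ["female", "female", "female", "male", "male", "male"],
    "age":    [38.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each (pclass, sex) group, aligned to the original rows
group_median = sample.groupby(["pclass", "sex"])["age"].transform("median")

# Fill only the missing ages; fall back to the overall median
# in case every age in a group is missing
sample["age_filled"] = (
    sample["age"].fillna(group_median).fillna(sample["age"].median())
)
```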

### Embarked
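
For a nominal column with only a few missing values, filling with the most frequent value (the mode) is a common simple choice. The sketch below assumes that approach and uses a made-up Series in place of `data['embarked']`.

```python
import numpy as np
import pandas as pd

# Made-up sample; in the notebook this would be data["embarked"]
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores missing values and returns a Series; take its first entry
most_common = embarked.mode()[0]
embarked_filled = embarked.fillna(most_common)
```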

### Deck
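
When a column such as `deck` has many missing values, imputing a single value can be misleading. Two alternatives are sketched below on a made-up Series: a binary "was the deck recorded" feature, and an explicit `Unknown` category. Both names (`has_deck`, `deck_filled`) are illustrative, and the sketch assumes the column is stored as a pandas categorical.

```python
import numpy as np
import pandas as pd

# Made-up sample stored as a categorical column
deck = pd.Series(["C", np.nan, "E", np.nan, np.nan], dtype="category")

# Option 1: keep only the information "deck was recorded or not"
has_deck = deck.notna().astype(int)

# Option 2: treat missing as its own category
# (a categorical column needs the new category added before filling)
deck_filled = deck.cat.add_categories("Unknown").fillna("Unknown")
```

Dropping the column entirely is a third option if neither feature helps the model.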

## 2. Outliers

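- **Trimming**: drop the rows that contain outliers (easier, but loses data)
- **Capping**: replace outliers with the nearest allowed bound, keeping every row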

### Trimming
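
A minimal sketch of trimming with the 1.5 IQR rule, using made-up fares rather than the real `fare` column:

```python
import pandas as pd

# Made-up fares with one extreme value
df = pd.DataFrame({"fare": [7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0]})

# Quartiles and the 1.5 * IQR bounds
q1, q3 = df["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: keep only the rows whose fare lies inside [lower, upper]
trimmed = df[df["fare"].between(lower, upper)]
```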

### Capping

In [60]:
# IQR = Q3 (75th percentile) - Q1 (25th percentile)
# Upper bound = Q3 + 1.5 * IQR
# Lower bound = Q1 - 1.5 * IQR
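
The bounds above can be applied with `Series.clip`. This is a sketch on made-up fares; `cap_iqr` is a hypothetical helper name, and the result is stored as a DataFrame `fare_new` with a `fare_new` column only to match the names used in the plotting cell.

```python
import pandas as pd

def cap_iqr(s, k=1.5):
    """Clip a numeric Series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up fares; in the notebook this would be data["fare"]
fare = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0], name="fare")

# Same number of rows as the input: extreme values are replaced, not dropped
fare_new = pd.DataFrame({"fare_new": cap_iqr(fare)})
```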

In [None]:
# Compare the fare distribution before and after capping.
# Assumes fare_new is a DataFrame holding the capped fares in column 'fare_new'.
plt.figure(figsize=(16, 6))

plt.subplot(2, 2, 1)
plt.title('KDE before capping')
sns.kdeplot(data=data['fare'])

plt.subplot(2, 2, 2)
plt.title('Boxplot before capping')
sns.boxplot(data=data['fare'])

plt.subplot(2, 2, 3)
plt.title('KDE after capping')
sns.kdeplot(data=fare_new['fare_new'])

plt.subplot(2, 2, 4)
plt.title('Boxplot after capping')
sns.boxplot(data=fare_new['fare_new'])

plt.tight_layout()
plt.show()

## 3. Scale data

### Standard Scaler
![](https://i.stack.imgur.com/Yr42l.png)

### MinMax Scaler
![](https://androidkt.com/wp-content/uploads/2020/10/Selection_060.png)

### Robust Scaler
![](https://i.stack.imgur.com/G3V7C.png)

### Distribution after scaling
![](https://curiousily.com/static/c9cf00949c60d2eacb1fb27d24d1544d/3e3fe/scaling-overview.png)

![](https://miro.medium.com/max/1400/1*A9d4SEX0t_bAAPzZeVqwAQ.png)

In [64]:
from sklearn.preprocessing import StandardScaler
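
A short sketch comparing the three scalers on a made-up feature with one large value. Each scikit-learn result is also written out by hand to show the formula it applies with default settings. The array `X` is illustrative and not taken from the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Made-up single feature with one large value, shaped (n_samples, 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)     # (x - mean) / std
X_minmax = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min)
X_robust = RobustScaler().fit_transform(X)    # (x - median) / IQR

# The same results written out by hand
manual_std = (X - X.mean(axis=0)) / X.std(axis=0)
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
q1, med, q3 = np.percentile(X, [25, 50, 75], axis=0)
manual_robust = (X - med) / (q3 - q1)
```

In a modelling workflow the scaler would normally be fitted on the training split only and then applied to the test split with `transform`.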

## 4. Feature transformation

In [68]:
# Age in [0, 100] -> bin into 5 groups
# pd.cut() for equal-width bins, pd.qcut() for equal-frequency bins
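
The two binning functions can be compared on a made-up set of ages. The bin edges and labels below are illustrative choices, not values prescribed by the dataset.

```python
import pandas as pd

# Made-up ages; in the notebook this would be data["age"]
age = pd.Series([2, 9, 17, 24, 31, 38, 45, 52, 66, 80])

# pd.cut: 5 equal-width bins over the fixed range [0, 100]
age_cut = pd.cut(
    age,
    bins=[0, 20, 40, 60, 80, 100],
    labels=["0-20", "21-40", "41-60", "61-80", "81-100"],
    include_lowest=True,   # so that an age of exactly 0 falls in the first bin
)

# pd.qcut: 5 bins that each hold roughly the same number of passengers
age_qcut = pd.qcut(age, q=5)
```

`pd.cut` keeps the groups easy to interpret, while `pd.qcut` avoids nearly empty groups when the distribution is skewed.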

## 5. Imbalanced labels

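- Before training (resample the data):
    + Down-sampling the majority class
    + Up-sampling the minority class
    + Robust sampling
- Inside the model (parameters):
    + `class_weight` (e.g. `class_weight='balanced'` in scikit-learn classifiers)
    + imbalanced (presumably the `imbalanced-learn` package)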
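
Down-sampling and up-sampling can be sketched with plain pandas. The sample below is made up, and the variable names `down` and `up` are illustrative.

```python
import pandas as pd

# Made-up imbalanced sample: 6 negatives, 2 positives
df = pd.DataFrame({
    "fare":     [7, 8, 9, 10, 11, 12, 80, 90],
    "survived": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["survived"] == 0]
minority = df[df["survived"] == 1]

# Down-sampling: shrink the majority class to the size of the minority class
down = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Up-sampling: draw from the minority class with replacement
# until it matches the size of the majority class
up = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])
```

Resampling is usually applied to the training split only, so that the test split keeps the original class balance.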