<span style="font-family: Segoe UI; font-size: 2.5em; font-weight: 300;">THE TITANIC PROJECT</span>

![](https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/5095eabce4b06cb305058603/5095eabce4b02d37bef4c24c/1352002236895/100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8.jpg)

<span style="font-family: Segoe UI; font-size: 2EM; font-weight: 300;">Summary</span>
* [1. Introduction](#introduction)
* [2. Environment Preparation](#envprep)
    - [2.1 Library Imports](#libimport)
        - [2.1.1 Data Cleaning](#datacleaning)
        - [2.1.2 Data Visualization](#datavisualization)
        - [2.1.3 Data Engineering](#)
        - [2.1.4 Data Modelling](#)
        - [2.1.5 Settings](#)
    - [2.2 Utils](#)
    - [2.3 Data Imports](#)
* [3. A bit of Exploratory Data Analysis](#)
    - [3.1 Age](#)
    - [3.2 Fare](#)
    - [3.3 Pclass](#)
    - [3.4 Sex](#)
    - [3.5 SibSp](#)
    - [3.6 Parch](#)
    - [3.7 Embarked](#)
* [4. Feature Engineering & Data Cleaning](#)
    - [4.1 Merge Train & Test for Transformation](#)
    - [4.2 Encoding Sex](#)
    - [4.3 Title](#)
    - [4.4 Name Length](#)
    - [4.5 One-hot Encode Embarked & Label Encode Title](#)
    - [4.6 Family Size](#)
    - [4.7 Label Encoding Family Size](#)
    - [4.8 FamilyName](#)
    - [4.9 Cabin](#)
    - [4.10 Cleaning Ticket](#)
    - [4.11 Ticket Frequency](#)
    - [4.12 One-hot Encoding Ticket](#)
    - [4.13 Fare into Categorical Bins](#)
    - [4.14 Additional Derived Features from Feature Relationships](#)
    - [4.15 Remove Constant Columns](#)
* [5. Imputation of Missing Values](#)
* [6. Final Adjustments](#)
    - [6.1 Create Age Band](#)
    - [6.2 Obtain Features for Children & Seniors](#)
    - [6.3 Standard Scaling Data](#)
    - [6.4 Split Data Back to Train & Test](#)
* [7. Checking Feature Importance & Correlations](#)
* [8. Preparation of Train & Test Data](#)
    - [8.1 Split the Data](#)
    - [8.2 Cross Validation (K-Fold)](#)
* [9. Model Development](#)
    - [9.1 Model Evaluation](#)
    - [9.2 Prediction](#)
        - [9.2.1 AdaBoost](#)
        - [9.2.2 Bagging](#)
        - [9.2.3 Gradient Boosting](#)
        - [9.2.4 Extra Trees](#)
        - [9.2.5 Random Forest](#)
        - [9.2.6 Gaussian Process](#)
        - [9.2.7 Logistic Regression](#)
        - [9.2.8 Ridge](#)
        - [9.2.9 Perceptron](#)
        - [9.2.10 Passive Aggressive](#)
        - [9.2.11 SGD](#)
        - [9.2.12 Gaussian Naive Bayes](#)
        - [9.2.13 Bernoulli](#)
        - [9.2.14 K-Nearest Neighbors](#)
        - [9.2.15 Support Vector Classification](#)
        - [9.2.16 Linear SVC](#)
        - [9.2.17 NuSVC](#)
        - [9.2.18 Decision Tree](#)
        - [9.2.19 Linear Discriminant Analysis](#)
        - [9.2.20 XGBoost](#)
        - [9.2.21 Keras](#)
    - [9.3 Model Performance](#)
    - [9.4 Stack](#)
    - [9.5 Voting](#)
    - [9.6 Tuning Parameters](#)
* [10. Submission](#)
    - [10.1. Using Stack](#)
    - [10.2. Using Keras](#)
    - [10.3. Using Voting](#)
    - [10.4. Final Adjustments](#)
* [11. Credits](#)

<a id="introduction"></a>
# 1. Introduction

#### Goal
* The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

#### Details & Description of Features:

* PassengerId
* Survived - (0 = No, 1 = Yes)
* Pclass - Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)
* Name
* Sex
* Age
* SibSp - Number of Siblings/Spouses Aboard
* Parch - Number of Parents/Children Aboard
* Ticket - Ticket Number
* Fare - Passenger Fare in British pounds
* Cabin - Cabin Number
* Embarked - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

<a id="envprep"></a>
# 2. Environment Preparation

<a id="libimport"></a>
### 2.1. Library Imports

<a id="datacleaning"></a>
#### 2.1.1. Data Cleaning

In [None]:
import numpy as np
import pandas as pd

<a id="datavisualization"></a>
#### 2.1.2. Data Visualization

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

import seaborn as sns
sns.set(rc={"font.size":18,"axes.titlesize":30,"axes.labelsize":18,
            "axes.titlepad":22, "axes.labelpad":18, "legend.fontsize":15,
            "legend.title_fontsize":15, "figure.titlesize":35})

#### 2.1.3. Data Engineering

In [None]:
from optbinning import OptimalBinning
from statsmodels.stats.outliers_influence import variance_inflation_factor

#### 2.1.4. Modelling

In [None]:
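# The wildcard import exposes submodules such as `ensemble` and `linear_model`;
# names that later cells call without a module prefix are imported explicitly.
from sklearn import *
from sklearn.ensemble import AdaBoostClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score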
from xgboost import XGBClassifier

import tensorflow as tf
from tensorflow.keras import *

#### 2.1.5. Settings

In [None]:
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### 2.2. Utils

In [None]:
def get_missing(df):
    """Return the count and percentage of missing values for columns that have any."""
    missing = df.isnull().sum()
    missing_percentage = round(missing / len(df) * 100, 1)
    missing_data = pd.concat([missing, missing_percentage], axis=1, keys=['Total', '%'])
    missing_data = missing_data[missing_data['Total'] > 0].sort_values(by=['%'], ascending=False)

    return missing_data

def calc_vif(X):
    """Return the variance inflation factor of every column in X."""
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return vif
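
As a quick check of what `get_missing` reports, the sketch below applies the same logic to a small made-up frame. The toy data and the name `missing_report` are illustrative assumptions, not part of the competition files.

```python
import numpy as np
import pandas as pd

def get_missing(df):
    """Count and percentage of missing values per column (same logic as the helper above)."""
    missing = df.isnull().sum()
    pct = (missing / len(df) * 100).round(1)
    out = pd.concat([missing, pct], axis=1, keys=["Total", "%"])
    return out[out["Total"] > 0].sort_values(by="%", ascending=False)

# Hypothetical toy frame, not the competition data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_report = get_missing(toy)
print(missing_report)
```

Columns without missing values (here `Fare`) are filtered out, and the rest are sorted by percentage in descending order.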

### 2.3. Data Imports

In [None]:
df_train = pd.read_csv("../data/train.csv")
df_test = pd.read_csv("../data/test.csv")
df_submission = pd.read_csv("../data/gender_submission.csv")
df_train.head(5)

In [None]:
print(df_train.shape, df_test.shape)

# 3. A bit of Exploratory Data Analysis

In [None]:
df_train.info()

### 3.1. Age

In [None]:
fig = plt.figure(figsize=(22,8))
kde = sns.kdeplot(x="Age", data=df_train, cut=0, hue="Survived", fill=True, legend=True, palette="plasma_r")

plt.xlim(0)

kde.xaxis.set_major_locator(ticker.MultipleLocator(1))
kde.xaxis.set_major_formatter(ticker.ScalarFormatter())

fig.suptitle("AGE BY SURVIVED", x=0.125, y=1.0, ha='left', fontweight=100, fontfamily='Segoe UI', size=39);

In [None]:
fig = plt.figure(figsize=(22, 8))
hist = sns.histplot(df_train["Age"], color="springgreen", kde=True, bins=50, label="Train")
hist = sns.histplot(df_test["Age"], color="gold", kde=True, bins=50, label="Test")

plt.xlim(0)

title = fig.suptitle("DISTRIBUTION OF AGE IN TRAIN & TEST", x=0.125, y=1.01, ha='left', 
                     fontweight=100, fontfamily='Segoe UI', size=39)

hist.xaxis.set_major_locator(ticker.MultipleLocator(1))
hist.xaxis.set_major_formatter(ticker.ScalarFormatter())

plt.legend()
plt.show()

### 3.2. Fare

In [None]:
fig = plt.figure(figsize=(22,8))
kde = sns.kdeplot(x="Fare", data=df_train, cut=0, hue="Survived", fill=True, legend=True, palette="mako_r")

kde.xaxis.set_major_locator(ticker.MultipleLocator(10))
kde.xaxis.set_major_formatter(ticker.ScalarFormatter())

plt.xlim(0)

fig.suptitle("FARE BY SURVIVED", x=0.125, y=1.01, ha='left',fontweight=100, fontfamily='Segoe UI', size=39);

In [None]:
fig = plt.figure(figsize=(20,8))
kde = sns.kdeplot(x="Fare", data=df_train, cut=0, clip=[0,180], hue="Survived", fill=True, legend=True, palette="mako_r")

plt.xlim(0)

kde.xaxis.set_major_locator(ticker.MultipleLocator(4))
kde.xaxis.set_major_formatter(ticker.ScalarFormatter())

fig.suptitle("FARE BY SURVIVED - CLIPPED TO REMOVE OUTLIERS", x=0.12, y=1.01, ha='left', 
             fontweight=100, fontfamily='Segoe UI', size=37);

In [None]:
fig = plt.figure(figsize=(20,8))

dist = sns.histplot(df_train[(df_train.Fare > 0) & (df_train.Fare <=180)]['Fare'],
                    color="gold", 
                    kde=True, 
                    bins=50, 
                    label='Train')

dist = sns.histplot(df_test[(df_test.Fare > 0) & (df_test.Fare <=180)]['Fare'],
                    color="crimson", 
                    kde=True, 
                    bins=50, 
                    label='Test')

title = fig.suptitle("DISTRIBUTION OF FARE IN TRAIN & TEST", 
                     x=0.12, 
                     y=1.01, 
                     ha='left',
                     fontweight=100, 
                     fontfamily='Segoe UI', 
                     size=37)

plt.xlim(0)

dist.xaxis.set_major_locator(ticker.MultipleLocator(4))
dist.xaxis.set_major_formatter(ticker.ScalarFormatter())

plt.legend()
plt.show()

### 3.3. Pclass

In [None]:
c1 = sns.catplot(x="Pclass", 
                 hue="Survived", 
                 kind="count", 
                 data=df_train,
                 aspect = 3.5, 
                 legend=True, 
                 palette="YlGnBu")

title = c1.fig.suptitle("COUNT BY PCLASS", 
                        x=0.04, 
                        y=1.12, 
                        ha='left', 
                        fontweight=100, 
                        fontfamily='Segoe UI', 
                        size=42)

### 3.4. Sex

In [None]:
c1 = sns.catplot(x="Sex", 
                 hue="Survived", 
                 kind="count", 
                 data=df_train,
                 aspect = 3.5, 
                 legend=True, 
                 palette="YlGnBu")

title = c1.fig.suptitle("COUNT BY SEX", 
                        x=0.04, 
                        y=1.12, 
                        ha='left', 
                        fontweight=100, 
                        fontfamily='Segoe UI', 
                        size=42)

In [None]:
g = sns.catplot(x="Sex", y="Survived", data=df_train,kind="bar", palette = "YlGnBu")
g.set_ylabels("Survival probability")

### 3.5. SibSp

In [None]:
c1 = sns.catplot(x="SibSp", 
                 hue="Survived", 
                 kind="count", 
                 data=df_train,
                 aspect = 3.5, 
                 legend=True, 
                 palette="YlGnBu")

title = c1.fig.suptitle("COUNT BY SibSp", 
                        x=0.04, 
                        y=1.12, 
                        ha='left', 
                        fontweight=100, 
                        fontfamily='Segoe UI', 
                        size=42)

### 3.6. Parch

In [None]:
c1 = sns.catplot(x="Parch", 
                 hue="Survived", 
                 kind="count", 
                 data=df_train,
                 aspect = 3.5, 
                 legend=True, 
                 palette="YlGnBu")

title = c1.fig.suptitle("COUNT BY Parch", 
                        x=0.04, 
                        y=1.12, 
                        ha='left', 
                        fontweight=100, 
                        fontfamily='Segoe UI', 
                        size=42)

### 3.7. Embarked

In [None]:
c1 = sns.catplot(x="Embarked", 
                 hue="Survived", 
                 kind="count", 
                 data=df_train,
                 aspect = 3.5, 
                 legend=True, 
                 palette="YlGnBu")

title = c1.fig.suptitle("COUNT BY Embarked", 
                        x=0.04, 
                        y=1.12, 
                        ha='left', 
                        fontweight=100, 
                        fontfamily='Segoe UI', 
                        size=42)

# 4. Feature Engineering & Data Cleaning

### 4.1. Merge Train & Test for Transformation


In [None]:
full_df = pd.concat([df_train, df_test]).reset_index(drop=True)

df_train_test = df_train.sample(frac=0.2,random_state=123)
y_train_test = df_train_test[["Survived", "PassengerId"]]
df_train_test = df_train_test.drop(["Survived"], axis=1)
list_index = df_train_test.index.values.tolist()
df_train_train = df_train[~df_train.index.isin(list_index)]
full_df_model = pd.concat([df_train_test, df_train_train])

train_shape = df_train.shape
test_shape = df_test.shape
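
The cell above carves a 20% hold-out set out of the training data with `sample` and takes the complement by index. A minimal sketch of that pattern on a made-up ten-row frame (the names `toy`, `holdout` and `remainder` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical 10-row frame standing in for df_train.
toy = pd.DataFrame({"PassengerId": range(1, 11), "Survived": [0, 1] * 5})

holdout = toy.sample(frac=0.2, random_state=123)   # 20% of the rows, reproducible
remainder = toy[~toy.index.isin(holdout.index)]    # every row not in the hold-out

# The two parts are disjoint and together cover the whole frame.
print(len(holdout), len(remainder))
print(set(holdout.index).isdisjoint(remainder.index))
```

Fixing `random_state` keeps the split stable across notebook runs, which matters because later cells look rows up again by `PassengerId`.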

In [None]:
print(full_df.shape, df_train.shape, df_train_train.shape, df_train_test.shape, full_df_model.shape)

### 4.2. Encoding Sex

In [None]:
full_df.loc[:, 'Sex'] = (full_df.loc[:, 'Sex'] == 'female').astype(int)

### 4.3. Title

In [None]:
full_df.Name.head(5)

In [None]:
# The title is the first run of letters that is directly followed by a period.
full_df["Title"] = full_df["Name"].str.extract(r"([A-Za-z]+)\.", expand=True)

c1 = sns.catplot(x="Title", hue="Survived", kind="count", data=full_df[:train_shape[0]],
                 aspect = 3.5, legend=True, palette="YlGnBu")

title = c1.fig.suptitle("COUNT BY TITLE", x=0.04, y=1.12, ha='left',
             fontweight=100, fontfamily='Segoe UI', size=42)

# Replacing rare titles 
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs', 'Major': 'Other', 
           'Col': 'Other', 'Dr' : 'Other', 'Rev' : 'Other', 'Capt': 'Other', 
           'Jonkheer': 'Royal', 'Sir': 'Royal', 'Lady': 'Royal', 
           'Don': 'Royal', 'Countess': 'Royal', 'Dona': 'Royal'}
           
full_df.replace({'Title': mapping}, inplace=True)

c2 = sns.catplot(x="Title", hue="Survived", kind="count", data=full_df[:train_shape[0]],
                 aspect = 3.5, legend=True, palette="YlGnBu")
c2.fig.suptitle("COUNT BY TITLE AGGREGATED", x=0.04, y=1.12, ha='left',
             fontweight=100, fontfamily='Segoe UI', size=42);
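
To see what the title pattern captures, the sketch below runs it on four names written in the dataset's `Surname, Title. Given names` format. Only a subset of the mapping is shown, and the variable names are illustrative assumptions.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Sagesser, Mlle. Emma",
])

# First run of letters that is immediately followed by a period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Fold rare variants into broader groups, as the mapping above does.
titles = titles.replace({"Mlle": "Miss", "Countess": "Royal"})
print(titles.tolist())
```

The surname is never captured because it is followed by a comma rather than a period.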

### 4.4. Name Length

In [None]:
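# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default.
full_df["Name_Length"] = full_df.Name.str.replace("[^a-zA-Z]", "", regex=True).str.len()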

fig, ax = plt.subplots(ncols=1, figsize=(20,8))
kde = sns.kdeplot(x="Name_Length", data=full_df[:train_shape[0]], cut=True,
                  hue="Survived", fill=True, ax=ax, palette="mako_r")

kde.xaxis.set_major_locator(ticker.MultipleLocator(1))
kde.xaxis.set_major_formatter(ticker.ScalarFormatter())

fig.suptitle("NAME_LENGTH BY SURVIVED", x=0.125, y=1.01, ha='left',
             fontweight=100, fontfamily='Lato', size=42);

### 4.5. One-hot Encode Embarked & Label Encode Title

In [None]:
full_df['Title_C'] = full_df['Title']

full_df['Embarked'] = full_df['Embarked'].fillna('S')
full_df = pd.get_dummies(full_df, columns=["Embarked","Title_C"],prefix=["Emb","Title"], drop_first=False)

title_dict = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Other': 4, 'Royal': 5, 'Master': 6}
full_df['Title'] = full_df['Title'].map(title_dict).astype('int')
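
The difference between the two encodings used here can be seen on a small made-up frame: one-hot encoding creates one indicator column per category, while label encoding maps each category to a single integer. The frame and the reduced `title_codes` dictionary are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Embarked": ["S", None, "C", "Q"],
    "Title": ["Mr", "Miss", "Mrs", "Master"],
})

# Fill the missing port with 'S', as the cell above does, then one-hot encode it.
toy["Embarked"] = toy["Embarked"].fillna("S")
one_hot = pd.get_dummies(toy, columns=["Embarked"], prefix=["Emb"])

# Label encoding: each title becomes one integer code.
title_codes = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 6}
one_hot["Title"] = one_hot["Title"].map(title_codes).astype(int)
print(one_hot)
```

Note that `map` returns NaN for any title missing from the dictionary, in which case the `astype(int)` call would raise, so the dictionary has to cover every value.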

### 4.6. Derive Family Size Feature

In [None]:
full_df['Family_Size'] = full_df['Parch'] + full_df['SibSp'] + 1
full_df['Fsize_Cat'] = full_df['Family_Size'].map(lambda val: 'Alone' if val <= 1 else ('Small' if val < 5 else 'Big'))
full_df["isAlone"] = full_df.Family_Size.apply(lambda x: 1 if x==1 else 0)

In [None]:
fig, ax = plt.subplots(ncols=1, figsize=(20,8))
kde = sns.kdeplot(x="Family_Size", data=full_df[:train_shape[0]], cut=True,
                  hue="Survived", fill=True, ax=ax, palette="mako_r")

kde.xaxis.set_major_locator(ticker.MultipleLocator(1))
kde.xaxis.set_major_formatter(ticker.ScalarFormatter())

plt.xlim(1)

fig.suptitle("FAMILY SIZE BY SURVIVED", x=0.125, y=1.01, ha='left',
             fontweight=100, fontfamily='Segoe UI', size=42);

In [None]:
c1 = sns.catplot(x="Fsize_Cat", 
                 hue="Survived", 
                 kind="count", 
                 data=full_df[:train_shape[0]],
                 aspect = 3.5, 
                 legend=True, 
                 palette="YlGnBu")

title = c1.fig.suptitle("COUNT BY Fsize_Cat", 
                        x=0.04, 
                        y=1.12, 
                        ha='left', 
                        fontweight=100, 
                        fontfamily='Segoe UI', 
                        size=42)

### 4.7. Label Encoding Family Size

In [None]:
Fsize_dict = {'Alone':3, 'Small':2, 'Big':1}
full_df['Fsize_Cat'] = full_df['Fsize_Cat'].map(Fsize_dict).astype('int')

### 4.8. Extract FamilyName Feature from Name

In [None]:
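# Raw strings avoid invalid-escape warnings; the pattern captures the surname before the first comma.
full_df['Family_Name'] = full_df['Name'].str.extract(r'([A-Za-z]+.[A-Za-z]+),', expand=True)
full_df_model['Family_Name'] = full_df_model['Name'].str.extract(r'([A-Za-z]+.[A-Za-z]+),', expand=True)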

In [None]:
MEAN_SURVIVAL_RATE = round(np.mean(df_train['Survived']), 4)

full_df['Family_Friends_Surv_Rate'] = MEAN_SURVIVAL_RATE
full_df['Surv_Rate_Invalid'] = 1

for _, grp_df in full_df[['Survived', 'Family_Name', 'Fare', 'Ticket', 'PassengerId']].groupby(['Family_Name', 'Fare']):                       
    if (len(grp_df) > 1):
        if(grp_df['Survived'].isnull().sum() != len(grp_df)):
            for ind, row in grp_df.iterrows():
                full_df.loc[full_df['PassengerId'] == row['PassengerId'],
                            'Family_Friends_Surv_Rate'] = round(grp_df['Survived'].mean(), 4)
                full_df.loc[full_df['PassengerId'] == row['PassengerId'],
                            'Surv_Rate_Invalid'] = 0

for _, grp_df in full_df[['Survived', 'Family_Name', 'Fare', 'Ticket', 'PassengerId', 
                          'Family_Friends_Surv_Rate']].groupby('Ticket'):
    if (len(grp_df) > 1):
        for ind, row in grp_df.iterrows():
            if (row['Family_Friends_Surv_Rate'] == 0.) | (row['Family_Friends_Surv_Rate'] == MEAN_SURVIVAL_RATE):
                if(grp_df['Survived'].isnull().sum() != len(grp_df)):
                    full_df.loc[full_df['PassengerId'] == row['PassengerId'],
                                'Family_Friends_Surv_Rate'] = round(grp_df['Survived'].mean(), 4)
                    full_df.loc[full_df['PassengerId'] == row['PassengerId'],
                                'Surv_Rate_Invalid'] = 0

In [None]:
MEAN_SURVIVAL_RATE = round(np.mean(df_train_train['Survived']), 4)

full_df_model['Family_Friends_Surv_Rate'] = MEAN_SURVIVAL_RATE
full_df_model['Surv_Rate_Invalid'] = 1

for _, grp_df in full_df_model[['Survived', 'Family_Name', 'Fare', 'Ticket', 'PassengerId']].groupby(['Family_Name', 'Fare']):                       
    if (len(grp_df) > 1):
        if(grp_df['Survived'].isnull().sum() != len(grp_df)):
            for ind, row in grp_df.iterrows():
                full_df_model.loc[full_df_model['PassengerId'] == row['PassengerId'],
                            'Family_Friends_Surv_Rate'] = round(grp_df['Survived'].mean(), 4)
                full_df_model.loc[full_df_model['PassengerId'] == row['PassengerId'],
                            'Surv_Rate_Invalid'] = 0

for _, grp_df in full_df_model[['Survived', 'Family_Name', 'Fare', 'Ticket', 'PassengerId', 
                          'Family_Friends_Surv_Rate']].groupby('Ticket'):
    if (len(grp_df) > 1):
        for ind, row in grp_df.iterrows():
            if (row['Family_Friends_Surv_Rate'] == 0.) | (row['Family_Friends_Surv_Rate'] == MEAN_SURVIVAL_RATE):
                if(grp_df['Survived'].isnull().sum() != len(grp_df)):
                    full_df_model.loc[full_df_model['PassengerId'] == row['PassengerId'],
                                'Family_Friends_Surv_Rate'] = round(grp_df['Survived'].mean(), 4)
                    full_df_model.loc[full_df_model['PassengerId'] == row['PassengerId'],
                                'Surv_Rate_Invalid'] = 0

In [None]:
full_df = full_df.drop(["Name", "Family_Name"], axis=1)
full_df_model = full_df_model.drop(["Name", "Family_Name"], axis=1)
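
The first family survival-rate loop in this section (grouping by `Family_Name` and `Fare`) can also be written with grouped transforms instead of `iterrows`. The sketch below is one possible vectorized formulation on made-up data, not the notebook's own code; `default_rate` is a hypothetical stand-in for `MEAN_SURVIVAL_RATE`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Family_Name": ["Smith", "Smith", "Jones", "Smith"],
    "Fare":        [30.0,    30.0,    8.05,    30.0],
    "Survived":    [1.0,     0.0,     1.0,     np.nan],  # NaN marks a test row
})
default_rate = 0.38  # hypothetical stand-in for MEAN_SURVIVAL_RATE

grp = toy.groupby(["Family_Name", "Fare"])["Survived"]
size = grp.transform("size")    # rows per group, unlabeled rows included
known = grp.transform("count")  # rows per group with a known label
rate = grp.transform("mean")    # mean over the known labels only

# A group is usable when it has more than one member and at least one known label.
valid = (size > 1) & (known > 0)
toy["Family_Friends_Surv_Rate"] = rate.where(valid, default_rate).round(4)
toy["Surv_Rate_Invalid"] = (~valid).astype(int)
print(toy)
```

As in the loops above, the group mean includes each passenger's own label, so the feature carries target information for labeled rows; that is worth keeping in mind when reading validation scores.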

### 4.9. Cleaning & Encoding of the Cabin

In [None]:
# Replace missing values with 'U' for Cabin
full_df['Cabin_Clean'] = full_df['Cabin'].fillna('U')
full_df['Cabin_Clean'] = full_df['Cabin_Clean'].str.strip(' ').str[0]
# Label Encoding
cabin_dict = {'A':9, 'B':8, 'C':7, 'D':6, 'E':5, 'F':4, 'G':3, 'T':2, 'U':1}
full_df['Cabin_Clean'] = full_df['Cabin_Clean'].map(cabin_dict).astype('int')
full_df.drop(["Cabin"], axis=1, inplace=True)

In [None]:
fig, ax = plt.subplots(ncols=1, figsize=(23,8))
sns.histplot(x="Cabin_Clean", data=full_df[:train_shape[0]], hue="Survived", fill=True, ax=ax, palette="nipy_spectral", 
             kde=True)
fig.suptitle("CABIN_CLEAN BY SURVIVED", x=0.125, y=1.01, ha='left',
             fontweight=100, fontfamily='Segoe UI', size=42);

### 4.10. Cleaning the Ticket

In [None]:
import re
def clean_ticket(each_ticket):
    prefix = re.sub(r'[^a-zA-Z]', '', each_ticket)
    if(prefix):
        return prefix
    else:
        return "NUM"

full_df["Tkt_Clean"] = full_df.Ticket.apply(clean_ticket)

fig, ax = plt.subplots(ncols=1, figsize=(23,8))
sns.countplot(x="Tkt_Clean", data=full_df[:train_shape[0]], hue="Survived", fill=True, ax=ax, palette="bwr_r")
fig.suptitle("TKT_CLEAN BY SURVIVED", x=0.125, y=1.01, ha='left',
             fontweight=100, fontfamily='Segoe UI', size=42);

### 4.11. Derive the Ticket Frequency

In [None]:
# ticket_group = df_train_train.groupby('Ticket')['Ticket'].count()
# df_ticket = pd.DataFrame({'Ticket':ticket_group.index, 'Ticket_Frequency':ticket_group.values})
# full_df = full_df.merge(df_ticket, on='Ticket', how='left')

In [None]:
full_df['Ticket_Frequency'] = full_df.groupby('Ticket')['Ticket'].transform('count')
full_df.drop(["Ticket"], axis=1, inplace=True)
fig, ax = plt.subplots(ncols=1, figsize=(23,8))
sns.countplot(x="Ticket_Frequency", data=full_df[:train_shape[0]], hue="Survived", fill=True, ax=ax, palette="PiYG_r")

fig.suptitle("TICKET_FREQUENCY BY SURVIVED", x=0.125, y=1.01, ha='left',
             fontweight=100, fontfamily='Lato', size=42);
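
The ticket prefix and the ticket frequency are both derived from the raw `Ticket` string. A small sketch on made-up ticket values, following the same two steps (the names `prefix` and `frequency` are illustrative):

```python
import re
import pandas as pd

tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "PC 17599", "STON/O2. 3101282"])

# Keep only the letters; purely numeric tickets get the placeholder "NUM".
prefix = tickets.map(lambda t: re.sub(r"[^a-zA-Z]", "", t) or "NUM")

# Number of passengers that share each exact ticket string.
frequency = tickets.groupby(tickets).transform("count")

print(prefix.tolist())
print(frequency.tolist())
```

Because the frequency is counted on the combined train and test frame, a ticket shared between the two sets contributes to the count in both.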

### 4.12. One-hot Encoding Ticket

In [None]:
full_df = pd.get_dummies(full_df, columns=["Tkt_Clean"], prefix=["Tkt"], drop_first=True)

### 4.13. Fare into Categorical Bins

In [None]:
def fare_cat(fare):
    # Bins: fare <= 7 -> 1, 7 < fare <= 39 -> 2, anything else (including NaN) -> 3
    if fare <= 7.0:
        return 1
    elif fare <= 39:
        return 2
    else:
        return 3

full_df.loc[:, 'Fare_Cat'] = full_df['Fare'].apply(fare_cat).astype('int')
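
The same three bins can be expressed with `pd.cut`, which makes the interval boundaries explicit. This is an alternative sketch on made-up fares, not the notebook's own code.

```python
import numpy as np
import pandas as pd

fares = pd.Series([0.0, 7.0, 7.25, 39.0, 39.6875, 512.3292])

# Right-closed intervals: (-inf, 7], (7, 39], (39, inf)
fare_cat = pd.cut(fares, bins=[-np.inf, 7.0, 39.0, np.inf], labels=[1, 2, 3]).astype(int)
print(fare_cat.tolist())
```

One difference to be aware of: `pd.cut` leaves a missing fare as NaN, whereas the `fare_cat` function above sends it to category 3 because both comparisons evaluate to False.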

### 4.14. Additional Derived Features from Feature Relationships

In [None]:
full_df.loc[:, 'Fare_Family_Size'] = full_df['Fare']/full_df['Family_Size']

full_df.loc[:, 'Fare_Cat_Pclass'] = full_df['Fare_Cat']*full_df['Pclass']
full_df.loc[:, 'Fare_Cat_Title'] = full_df['Fare_Cat']*full_df['Title']

full_df.loc[:, 'Fsize_Cat_Title'] = full_df['Fsize_Cat']*full_df['Title']
full_df.loc[:, 'Fsize_Cat_Fare_Cat'] = full_df['Fare_Cat']/full_df['Fsize_Cat'].astype('int')

full_df.loc[:, 'Pclass_Title'] = full_df['Pclass']*full_df['Title']
full_df.loc[:, 'Fsize_Cat_Pclass'] = full_df['Fsize_Cat']*full_df['Pclass']

### 4.15. Remove Constant Columns

In [None]:
colsToRemove = []
cols = ['Tkt_AQ', 'Tkt_AS', 'Tkt_C', 'Tkt_CA',
         'Tkt_CASOTON', 'Tkt_FC', 'Tkt_FCC', 'Tkt_Fa', 'Tkt_LINE', 'Tkt_LP',
         'Tkt_NUM', 'Tkt_PC', 'Tkt_PP', 'Tkt_PPP', 'Tkt_SC', 'Tkt_SCA',
         'Tkt_SCAH', 'Tkt_SCAHBasle', 'Tkt_SCOW', 'Tkt_SCPARIS', 'Tkt_SCParis',
         'Tkt_SOC', 'Tkt_SOP', 'Tkt_SOPP', 'Tkt_SOTONO', 'Tkt_SOTONOQ',
         'Tkt_SP', 'Tkt_STONO', 'Tkt_STONOQ', 'Tkt_SWPP', 'Tkt_WC', 
         'Tkt_WEP', 'Fare_Cat', 'Fare_Family_Size', 'Fare_Cat_Pclass',
         'Fare_Cat_Title', 'Fsize_Cat_Title', 'Fsize_Cat_Fare_Cat', 
         'Pclass_Title', 'Fsize_Cat_Pclass']

for col in cols:
    if full_df[col][:train_shape[0]].std() == 0: 
        colsToRemove.append(col)

# remove constant columns in the training set
full_df.drop(colsToRemove, axis=1, inplace=True)
print("Removed `{}` Constant Columns\n".format(len(colsToRemove)))
print(colsToRemove)
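
A column is treated as constant when it takes a single value across the training rows, even if it varies in the test rows. The sketch below shows the same idea with `nunique` on a made-up frame; `n_train` is a hypothetical stand-in for `train_shape[0]`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Tkt_A": [0, 0, 0, 1],            # varies only in the last (test) row
    "Tkt_B": [0, 1, 0, 0],
    "Fare":  [7.25, 8.05, 7.25, 30.0],
})
n_train = 3  # hypothetical: the first three rows play the role of the training set

train_part = toy.iloc[:n_train]
constant_cols = [c for c in toy.columns if train_part[c].nunique(dropna=False) <= 1]
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

Checking `nunique` rather than a hand-written column list also covers columns that are added later.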

# 5. Imputation of Missing Values

In [None]:
features = ["Survived",'Family_Friends_Surv_Rate','Surv_Rate_Invalid']
df = full_df.copy()
df.loc[df.PassengerId.isin(full_df_model.PassengerId), features] = full_df_model[features]
passenger_list = full_df_model["PassengerId"].tolist()
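# .copy() makes this an independent frame, so later .loc assignments do not write into a slice of df
full_df_model = df[df["PassengerId"].isin(passenger_list)].copy()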

In [None]:
imp_features = ['Pclass', 
                'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Title',
                 'Name_Length',
                'Emb_C', 'Emb_Q', 'Emb_S','Family_Size',
                 'Fsize_Cat', 'Family_Friends_Surv_Rate', 'Surv_Rate_Invalid',
                 'Cabin_Clean','Ticket_Frequency', 'Tkt_AS', 'Tkt_C', 'Tkt_CA',
                 'Tkt_CASOTON', 'Tkt_FC', 'Tkt_FCC', 'Tkt_Fa', 'Tkt_LINE',
                 'Tkt_NUM', 'Tkt_PC', 'Tkt_PP', 'Tkt_PPP', 'Tkt_SC', 'Tkt_SCA',
                 'Tkt_SCAH', 'Tkt_SCAHBasle', 'Tkt_SCOW', 'Tkt_SCPARIS', 'Tkt_SCParis',
                 'Tkt_SOC', 'Tkt_SOP', 'Tkt_SOPP', 'Tkt_SOTONO', 'Tkt_SOTONOQ',
                 'Tkt_SP', 'Tkt_STONO', 'Tkt_SWPP', 'Tkt_WC', 
                 'Tkt_WEP', 'Fare_Cat', 'Fare_Family_Size', 'Fare_Cat_Pclass',
                 'Fare_Cat_Title', 'Fsize_Cat_Title', 'Fsize_Cat_Fare_Cat', 
                 'Pclass_Title', 'Fsize_Cat_Pclass']

imputer = KNNImputer(n_neighbors=10, missing_values=np.nan)
imputer.fit(full_df[imp_features])
full_df.loc[:, imp_features] = pd.DataFrame(imputer.transform(full_df[imp_features]), 
                                            index=full_df.index, columns = imp_features)

In [None]:
imputer = KNNImputer(n_neighbors=10, missing_values=np.nan)
imputer.fit(full_df_model[imp_features])
full_df_model.loc[:, imp_features] = pd.DataFrame(imputer.transform(full_df_model[imp_features]), 
                                            index=full_df_model.index, columns = imp_features)
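
`KNNImputer` fills a missing value with the mean of that feature over the nearest rows, where distance is measured on the features both rows have. A minimal sketch on made-up data with a single neighbour, so the result is easy to follow:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    "Pclass": [1.0, 1.0, 3.0, 3.0],
    "Age":    [40.0, np.nan, 20.0, 22.0],
})

# With one neighbour, the missing Age is copied from the closest row by Pclass.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(imputer.fit_transform(toy), index=toy.index, columns=toy.columns)
print(filled)
```

Since the distance is computed on raw feature values, features with a large numeric range (such as `Fare`) tend to dominate the neighbour search when imputation runs before scaling, as it does in this notebook.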

# 6. Final Adjustments

### 6.1. Create Age Band

In [None]:
optb = OptimalBinning(name="Age", dtype="numerical", solver="cp")
x = full_df[:train_shape[0]]["Age"].values
y_train = full_df[:train_shape[0]]["Survived"]
y = y_train[y_train.index.isin(df_train.index)]
optb.fit(x, y)

In [None]:
binning_table = optb.binning_table
binning_table.build()

In [None]:
list_index = full_df.index.values.tolist()
col = full_df["Age"].values
x_transform = optb.transform(col, metric="event_rate")
x_transform = pd.Series(x_transform, index=list_index)
x_transform.value_counts()
x_transform = x_transform.rename("Age_Band")
full_df = pd.concat((full_df, x_transform), axis=1)

In [None]:
optb = OptimalBinning(name="Age", dtype="numerical", solver="cp")
x = df_train_train["Age"].values
y_train = df_train_train["Survived"]
y = y_train[y_train.index.isin(df_train.index)]
optb.fit(x, y)

In [None]:
binning_table = optb.binning_table
binning_table.build()

In [None]:
list_index = full_df_model.index.values.tolist()
col = full_df_model["Age"].values
x_transform = optb.transform(col, metric="event_rate")
x_transform = pd.Series(x_transform, index=list_index)
x_transform.value_counts()
x_transform = x_transform.rename("Age_Band")
full_df_model = pd.concat((full_df_model, x_transform), axis=1)

### 6.2. Obtain Features for Children & Seniors

In [None]:
full_df['Child'] = full_df['Age'].map(lambda val:1 if val<18 else 0)
full_df['Senior'] = full_df['Age'].map(lambda val:1 if val>70 else 0)

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(23,8))

sns.countplot(x="Child", data=full_df[:train_shape[0]], hue="Survived", fill=True, ax=ax[0], palette="PiYG_r")
sns.countplot(x="Senior", data=full_df[:train_shape[0]], hue="Survived", fill=True, ax=ax[1], palette="PiYG_r")

### 6.3. Standard Scaling Data

In [None]:
from sklearn.preprocessing import StandardScaler
scaler_cols = ['Age', 'Fare', 'Name_Length', 'Family_Size',
               'Ticket_Frequency', 'Fare_Family_Size', 'Fare_Cat_Pclass']
std = StandardScaler()
std.fit(full_df[scaler_cols])

In [None]:
df_std = pd.DataFrame(std.transform(full_df[scaler_cols]), index=full_df.index, columns = scaler_cols)
full_df.drop(scaler_cols, axis=1, inplace=True)
full_df = pd.concat((full_df, df_std), axis=1)
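
Standard scaling subtracts the mean and divides by the population standard deviation (`ddof=0`), which is what `StandardScaler` computes. The sketch below reproduces that with pandas and, as a variation on the cell above, takes the statistics from the training rows only; `n_train` and the toy values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age":  [20.0, 30.0, 40.0, 60.0],
    "Fare": [10.0, 20.0, 30.0, 100.0],
})
n_train = 3  # hypothetical: the first three rows play the role of the training set

mean = toy.iloc[:n_train].mean()
std = toy.iloc[:n_train].std(ddof=0)   # population standard deviation
scaled = (toy - mean) / std
print(scaled.round(3))
```

Fitting on the combined frame, as the notebook does, is common in competition settings; fitting on the training rows alone keeps test-set statistics out of the transformation.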

### 6.4. Split Data back to Train and Test

In [None]:
features = ["Survived",'Family_Friends_Surv_Rate','Surv_Rate_Invalid', "Age_Band"]
df = full_df.copy()
df.loc[df.PassengerId.isin(full_df_model.PassengerId), features] = full_df_model[features]
passenger_list = full_df_model["PassengerId"].tolist()
# .copy() makes this an independent frame rather than a slice of df
full_df_model = df[df["PassengerId"].isin(passenger_list)].copy()

In [None]:
df_train_final = full_df[:train_shape[0]]
df_test_final = full_df[train_shape[0]:]

In [None]:
# Reassign instead of dropping in place, since df_test_final is a slice of full_df
df_test_final = df_test_final.drop(["Survived"], axis=1)

# 7. Checking Feature Importance by Correlation Analysis

In [None]:
df_train_final.head(5)

In [None]:
corr_mat = df_train_final.astype(float).corr()
corr_mat_fil = corr_mat.loc[:, 'Survived'].sort_values(ascending=False)
corr_mat_fil = pd.DataFrame(data=corr_mat_fil[1:])

In [None]:
plt.figure(figsize=(15,14))
bar = sns.barplot(x=corr_mat_fil.Survived, y=corr_mat_fil.index, data=corr_mat_fil, palette="Spectral")
title = bar.set_title("FEATURE CORRELATION", x=0.0, y=1.01, ha='left',
             fontweight=100, fontfamily='Segoe UI', size=30)

In [None]:
df_corr = df_train_final.drop(["PassengerId"], axis=1)
corrmat = df_corr.corr()
sorted_corrs = corrmat['Survived'].abs().sort_values(ascending=False)
print(sorted_corrs)

In [None]:
corr = df_train_final.corr()
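# Select by absolute correlation with the target (the threshold of 0.0 keeps every feature)
top_corr_cols = corr[corr.Survived.abs() >= .0].Survived.sort_values(ascending=False).keys()
top_corr = corr.loc[top_corr_cols, top_corr_cols]
# Boolean mask that hides the diagonal and the upper triangle
dropSelf = np.zeros(top_corr.shape, dtype=bool)
dropSelf[np.triu_indices_from(dropSelf)] = True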
plt.figure(figsize=(13, 13))
sns.heatmap(top_corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=False, fmt=".2f", mask=dropSelf)
plt.show()

In [None]:
X = df_train_final.drop(["Survived", "PassengerId"], axis=1)
X = X.assign(const=1)
calc_vif(X)
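
The variance inflation factor of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it with NumPy on synthetic data to show how a near-duplicate column inflates the value; the helper name `vif` and the data are assumptions for illustration.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the remaining columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)             # nearly a copy of `a`
X = np.column_stack([a, b, c, np.ones(200)])   # last column is the intercept

vifs = [vif(X, i) for i in range(3)]
print([round(v, 1) for v in vifs])
```

The constant column plays the same role as `X.assign(const=1)` in the cell above: without an intercept among the regressors, the R² values are not centered and the factors are harder to interpret.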

# 8. Preparation of Train & Test Data

### 8.1. Split the data

In [None]:
passenger_train = df_train_train["PassengerId"].tolist()
df_train = full_df_model[full_df_model["PassengerId"].isin(passenger_train)]

In [None]:
passenger_test = df_train_test["PassengerId"].tolist()
# .copy() so the Survived assignment below writes to an independent frame
df_test = full_df_model[full_df_model["PassengerId"].isin(passenger_test)].copy()
df_test.loc[df_test.PassengerId.isin(y_train_test.PassengerId), "Survived"] = y_train_test["Survived"]

In [None]:
X_train = df_train.drop(["Survived", "PassengerId"], axis=1)
y_train = df_train["Survived"]
X_test = df_test.drop(["Survived", "PassengerId"], axis=1)
y_test = df_test["Survived"]

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
all_passenger = passenger_train + passenger_test
df_train_final = full_df[full_df["PassengerId"].isin(all_passenger)]
df_test_final = full_df[~full_df["PassengerId"].isin(all_passenger)]

### 8.2. Cross Validation (K-Fold)

In [None]:
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

# 9. Model Development

### 9.1. Model Evaluation

In [None]:
def get_kfold_accuracy(model):
    score = cross_val_score(model, X_train, y_train, cv=k_fold, scoring="accuracy")
    print("KFold Score:", round(np.mean(score) * 100, 2))
    
    return score

def get_accuracy(prediction):
    # accuracy_score expects (y_true, y_pred); accuracy itself is symmetric in the two
    score = round(accuracy_score(y_test, prediction)*100,2)
    print("Accuracy", score)
    
    return score
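
The two helpers report different things: `get_accuracy` scores one fixed hold-out set, while `get_kfold_accuracy` averages over ten train/validation splits of the training data. A self-contained sketch of the K-fold part on synthetic data (the dataset and the names `X_toy`, `y_toy`, `folds` are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification problem, not the Titanic data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         cv=folds, scoring="accuracy")

# One accuracy per fold; the spread shows how much the estimate depends on the split.
print(round(scores.mean() * 100, 2), round(scores.std() * 100, 2))
```

Reporting the standard deviation next to the mean helps when comparing models whose mean scores differ by less than the fold-to-fold variation.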


### 9.2. Prediction

#### 9.2.1. AdaBoost

In [None]:
ada_boost = AdaBoostClassifier()
ada_boost.fit(X_train, y_train)
prediction = ada_boost.predict(X_test)
ada_boost_score = get_accuracy(prediction)
ada_boost_kfold_score = get_kfold_accuracy(ada_boost)

#### 9.2.2. Bagging Classifier

In [None]:
bagging = ensemble.BaggingClassifier()
bagging.fit(X_train, y_train)
prediction = bagging.predict(X_test)
bagging_score = get_accuracy(prediction)
bagging_kfold_score = get_kfold_accuracy(bagging)

#### 9.2.3. Gradient Boosting Classifier

In [None]:
gradient_boosting = ensemble.GradientBoostingClassifier()
gradient_boosting.fit(X_train, y_train)
prediction = gradient_boosting.predict(X_test)
gradient_boosting_score = get_accuracy(prediction)
gradient_boosting_kfold_score = get_kfold_accuracy(gradient_boosting)

#### 9.2.4. Extra Trees Classifier

In [None]:
extra_trees = ensemble.ExtraTreesClassifier()
extra_trees.fit(X_train, y_train)
prediction = extra_trees.predict(X_test)
extra_trees_score = get_accuracy(prediction)
extra_trees_kfold_score = get_kfold_accuracy(extra_trees)

#### 9.2.5. Random Forest

In [None]:
random_forest = ensemble.RandomForestClassifier()
random_forest.fit(X_train, y_train)
prediction = random_forest.predict(X_test)
random_forest_score = get_accuracy(prediction)
random_forest_kfold_score = get_kfold_accuracy(random_forest)

#### 9.2.6. Gaussian Process Classifier

In [None]:
gaussian_process = GaussianProcessClassifier()
gaussian_process.fit(X_train, y_train)
prediction = gaussian_process.predict(X_test)
gaussian_process_score = get_accuracy(prediction)
gaussian_process_kfold_score = get_kfold_accuracy(gaussian_process)

#### 9.2.7. Logistic Regression

In [None]:
logistic_regression_cv = linear_model.LogisticRegressionCV(max_iter=100000)
logistic_regression_cv.fit(X_train, y_train)
prediction = logistic_regression_cv.predict(X_test)
logistic_regression_cv_score = get_accuracy(prediction)
logistic_regression_cv_kfold_score = get_kfold_accuracy(logistic_regression_cv)

In [None]:
logistic_regression = LogisticRegression(random_state=1, max_iter=10000)
logistic_regression.fit(X_train, y_train)
prediction = logistic_regression.predict(X_test)
logistic_regression_score = get_accuracy(prediction)
logistic_regression_kfold_score = get_kfold_accuracy(logistic_regression)

#### 9.2.8. Ridge Classifier

In [None]:
ridge = linear_model.RidgeClassifierCV()
ridge.fit(X_train, y_train)
prediction = ridge.predict(X_test)
ridge_score = get_accuracy(prediction)
ridge_kfold_score = get_kfold_accuracy(ridge)

#### 9.2.9. Perceptron

In [None]:
perceptron = linear_model.Perceptron()
perceptron.fit(X_train, y_train)
prediction = perceptron.predict(X_test)
perceptron_score = get_accuracy(prediction)
perceptron_kfold_score = get_kfold_accuracy(perceptron)

#### 9.2.10. Passive Aggressive Classifier

In [None]:
passive_aggressive = linear_model.PassiveAggressiveClassifier()
passive_aggressive.fit(X_train, y_train)
prediction = passive_aggressive.predict(X_test)
passive_aggressive_score = get_accuracy(prediction)
passive_aggressive_kfold_score = get_kfold_accuracy(passive_aggressive)

#### 9.2.11. SGDClassifier

In [None]:
sdg = linear_model.SGDClassifier()
sdg.fit(X_train, y_train)
prediction = sdg.predict(X_test)
sdg_score = get_accuracy(prediction)
sdg_kfold_score = get_kfold_accuracy(sdg)

#### 9.2.12. Gaussian Naive Bayes

In [None]:
gaussian_nb = naive_bayes.GaussianNB()
gaussian_nb.fit(X_train, y_train)
prediction = gaussian_nb.predict(X_test)
gaussian_nb_score = get_accuracy(prediction)
gaussian_nb_kfold_score = get_kfold_accuracy(gaussian_nb)

#### 9.2.13. Bernoulli NB

In [None]:
bernoulli_nb = naive_bayes.BernoulliNB()
bernoulli_nb.fit(X_train, y_train)
prediction = bernoulli_nb.predict(X_test)
bernoulli_nb_score = get_accuracy(prediction)
bernoulli_nb_kfold_score = get_kfold_accuracy(bernoulli_nb)

#### 9.2.14. kNN

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors = 13)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)
knn_score = get_accuracy(prediction)
knn_kfold_score = get_kfold_accuracy(knn)

#### 9.2.15. SVC

In [None]:
svc = SVC(random_state=1, kernel='linear')
svc.fit(X_train, y_train)
prediction = svc.predict(X_test)
svc_score = get_accuracy(prediction)
svc_kfold_score = get_kfold_accuracy(svc)

#### 9.2.16. Linear SVC

In [None]:
svc_linear = LinearSVC(random_state=1, max_iter=100000)
svc_linear.fit(X_train, y_train)
prediction = svc_linear.predict(X_test)
svc_linear_score = get_accuracy(prediction)
svc_linear_kfold_score = get_kfold_accuracy(svc_linear)

#### 9.2.17. NuSVC

In [None]:
svc_nu = svm.NuSVC(probability=True)
svc_nu.fit(X_train, y_train)
prediction = svc_nu.predict(X_test)
svc_nu_score = get_accuracy(prediction)
svc_nu_kfold_score = get_kfold_accuracy(svc_nu)

#### 9.2.18. Decision Tree

In [None]:
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
prediction = decision_tree.predict(X_test)
decision_tree_score = get_accuracy(prediction)
decision_tree_kfold_score = get_kfold_accuracy(decision_tree)

#### 9.2.19. Linear Discriminant Analysis

In [None]:
linear_discriminant = LinearDiscriminantAnalysis()
linear_discriminant.fit(X_train, y_train)
prediction = linear_discriminant.predict(X_test)
linear_discriminant_score = get_accuracy(prediction)
linear_discriminant_kfold_score = get_kfold_accuracy(linear_discriminant)

#### 9.2.20. XGBoost

In [None]:
# 'logloss' is the binary counterpart of 'mlogloss' and matches the binary:logistic objective
xgboost = XGBClassifier(random_state=1, objective="binary:logistic", n_estimators=10,
                        eval_metric='logloss', use_label_encoder=False)
xgboost.fit(X_train, y_train)
prediction = xgboost.predict(X_test)
xgboost_score = get_accuracy(prediction)
xgboost_kfold_score = get_kfold_accuracy(xgboost)

#### 9.2.21. Keras

In [None]:
metrics = ['accuracy', 
           Precision(),
           Recall()]

def create_model():
    model = Sequential()
    # Input expects a shape tuple: one entry per feature column
    model.add(Input(shape=(X_train.shape[1],), name='Input_'))
    model.add(Dense(8, activation='relu', kernel_initializer='glorot_normal', kernel_regularizer=l2(0.001)))
    model.add(Dense(16, activation='relu', kernel_initializer='glorot_normal', kernel_regularizer=l2(0.1)))
    model.add(Dropout(0.5))
    model.add(Dense(16, activation='relu', kernel_initializer='glorot_normal', kernel_regularizer=l2(0.1)))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_normal'))

    # `lr` is a deprecated alias; `learning_rate` is the current argument name
    optimize = Adam(learning_rate = 0.0001)
    model.compile(optimizer = optimize,loss = 'binary_crossentropy',metrics = metrics)
    
    return model

In [None]:
keras = KerasClassifier(build_fn = create_model, epochs = 600, batch_size = 32, verbose = 0)
keras.fit(X_train, y_train)
prediction = keras.predict(X_test)
keras_score = get_accuracy(prediction)
# keras_kfold_score = get_kfold_accuracy(keras)

### 9.3. Model Performance

In [None]:
# "Model" and "Accuracy" are parallel lists: entry i of one must describe entry i of the other
model_performance = pd.DataFrame({
    "Model": ["Ada Boost",
              "Bagging",
              "Keras",
              "XGBClassifier",
              "Linear Discriminant Analysis",
              "Extra Trees",
              "Decision Tree",
              "SVM Nu",
              "SVM Linear",
              "SVM",
              "kNN",
              "Bernoulli Naive Bayes",
              "Gaussian Naive Bayes",
              "SGD",
              "Passive Aggressive",
              "Perceptron",
              "Ridge",
              "Logistic Regression",
              "Logistic Regression CV",
              "Gaussian Process",
              "Random Forest",
              "Gradient Boosting"],

    "Accuracy": [ada_boost_score,
                 bagging_score,
                 keras_score,
                 xgboost_score,
                 linear_discriminant_score,
                 extra_trees_score,
                 decision_tree_score,
                 svc_nu_score,
                 svc_linear_score,
                 svc_score,
                 knn_score,
                 bernoulli_nb_score,
                 gaussian_nb_score,
                 sdg_score,
                 passive_aggressive_score,
                 perceptron_score,
                 ridge_score,
                 logistic_regression_score,
                 logistic_regression_cv_score,
                 gaussian_process_score,
                 random_forest_score,
                 gradient_boosting_score]
})

model_performance.sort_values(by="Accuracy", ascending=False)
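
The table above relies on two parallel lists staying in the same order. One possible alternative, sketched here in plain Python, is to keep each score next to its model name in a dictionary and sort the pairs. The names `toy_scores` and `ranking`, the model labels, and the numbers are all placeholders invented for this example; they are not results from this notebook.

```python
# Hypothetical scores keyed by model name; the numbers are placeholders,
# not results from this notebook.
toy_scores = {
    "model_a": 81.5,
    "model_b": 79.0,
    "model_c": 83.25,
}

# Sort by score, highest first, mirroring sort_values(ascending=False).
ranking = sorted(toy_scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name:<10} {score:6.2f}")
```

A list of `(name, score)` pairs like `ranking` can be passed to `pd.DataFrame(ranking, columns=["Model", "Accuracy"])` to get the same two-column table.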

### 9.4. Stack

In [None]:
estimators = [('Gaussian Process',gaussian_process), 
              ('Linear Discriminant', linear_discriminant),
              ('kNN', knn)]

stack = StackingClassifier(estimators=estimators)
stack.fit(X_train, y_train)
prediction = stack.predict(X_test)
stack_score = get_accuracy(prediction)

### 9.5. Voting

In [None]:
voting = VotingClassifier(
    estimators = [
        ('Gaussian Process',gaussian_process),
        ('Linear Discriminant Analysis',linear_discriminant),
        ("SVM Nu", svc_nu),
        ("Knn", knn),
    ],
    voting = 'hard'
)

In [None]:
voting.fit(X_train, y_train)
prediction = voting.predict(X_test)
voting_score = get_accuracy(prediction)
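
With `voting = 'hard'`, each estimator casts one vote per sample and the most common label wins. The sketch below shows that idea in plain Python. The function `hard_vote` and the toy votes are assumptions made for this example. Ties are resolved by taking the smallest label, following the ascending-order rule described in the scikit-learn documentation for `VotingClassifier`.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine per-model label lists into one majority-vote label list."""
    combined = []
    # zip(*...) regroups the labels so each tuple holds every model's vote for one sample.
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # On a tie, pick the smallest label among the most frequent ones.
        combined.append(min(label for label, n in counts.items() if n == top))
    return combined


# Toy votes from four hypothetical models on three passengers.
toy_votes = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
majority = hard_vote(toy_votes)
print(majority)
```

The third passenger gets a 2-2 split, which is possible here because the ensemble has an even number of estimators. Using an odd number of estimators avoids such ties for a binary target.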

### 9.6. Tuning Parameters

In [None]:
log_reg = LogisticRegression(random_state=1)

parameters = {
    'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
    'C' : np.logspace(1, -1),  # 50 values from 10 down to 0.1
    'solver' : ['liblinear', "newton-cg", "lbfgs", "sag", "saga"]
}

# Not every penalty/solver pair is supported, so some candidates will fail to fit
grid_cv = GridSearchCV(log_reg, parameters, scoring = make_scorer(accuracy_score))
grid_cv = grid_cv.fit(X_training, y_training)

best_estimator = grid_cv.best_estimator_
best_score = grid_cv.best_score_
best_params = grid_cv.best_params_

best_model = grid_cv.best_estimator_
best_model.fit(train_data, target)
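
The `parameters` grid above crosses every penalty with every solver, but `LogisticRegression` only accepts some of those pairs. The sketch below enumerates the pairs in plain Python and builds a per-solver grid in the list-of-dicts form that `GridSearchCV` accepts. The compatibility table follows the scikit-learn `LogisticRegression` documentation and should be checked against the installed version. The names `SUPPORTED_PENALTIES`, `valid_pairs` and `param_grid` are assumptions made for this example.

```python
from itertools import product

# Penalties each solver accepts, per the scikit-learn LogisticRegression
# documentation (verify against your installed version).
SUPPORTED_PENALTIES = {
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}

penalties = ["l1", "l2", "elasticnet", "none"]
solvers = ["liblinear", "newton-cg", "lbfgs", "sag", "saga"]

# Every combination the original grid would try, and the subset that can fit.
all_pairs = list(product(penalties, solvers))
valid_pairs = [(p, s) for p, s in all_pairs if p in SUPPORTED_PENALTIES[s]]

# One small grid per solver, in the list-of-dicts form GridSearchCV accepts.
param_grid = [
    {"solver": [s], "penalty": [p for p in penalties if p in SUPPORTED_PENALTIES[s]]}
    for s in solvers
]

print(len(valid_pairs), "of", len(all_pairs), "penalty/solver pairs are valid")
```

Two caveats apply when using such a grid: `'elasticnet'` additionally needs an `l1_ratio` value, and recent scikit-learn releases expect `None` rather than the string `'none'`.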


# 10. Submission

In [None]:
all_passenger = passenger_train + passenger_test
df_train_final = full_df[full_df["PassengerId"].isin(all_passenger)]
X_train = df_train_final.drop(["PassengerId", "Survived"], axis=1)
y_train = df_train_final["Survived"]

df_test_final = full_df[~full_df["PassengerId"].isin(all_passenger)]
X_test = df_test_final.drop(["PassengerId", "Survived"], axis=1)

### 10.1. Submitting Using Keras

In [None]:
keras = KerasClassifier(build_fn = create_model, epochs = 600, batch_size = 32, verbose = 0)
keras.fit(X_train, y_train)
prediction = keras.predict(X_test)
# Flatten the (n_samples, 1) predictions into a flat list of labels
y_pred = [y[0] for y in prediction]

### 10.2. Submitting Using Stack

In [None]:
gaussian_process = GaussianProcessClassifier()
gaussian_process.fit(X_train, y_train)

linear_discriminant = LinearDiscriminantAnalysis()
linear_discriminant.fit(X_train, y_train)

knn = neighbors.KNeighborsClassifier(n_neighbors = 13)
knn.fit(X_train, y_train)

estimators = [('Gaussian Process',gaussian_process), 
              ('Linear Discriminant', linear_discriminant),
              ('kNN', knn)]

stack = StackingClassifier(estimators=estimators)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)

### 10.3. Submitting Using Voting

In [None]:
voting = VotingClassifier(
    estimators = [
        ('Gaussian Process',gaussian_process),
        ('Linear Discriminant Analysis',linear_discriminant),
        ("SVM Nu", svc_nu),
        ("Knn", knn),
    ],
    voting = 'hard'
)

voting.fit(X_train, y_train)
# Predictions from the voting ensemble
y_pred = voting.predict(X_test)

### 10.4. Final Submission Adjustments

In [None]:
submission = pd.DataFrame({ 
    "PassengerId": df_test_final["PassengerId"],
    "Survived": y_pred
})
submission.Survived = submission.Survived.astype(int)
submission.to_csv(r"../data/submission.csv", index=False)
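
Before uploading, it can help to sanity-check the written file. The sketch below assumes the two-column layout produced by the cell above (`PassengerId`, `Survived`) with `Survived` restricted to `0` or `1`. The function `check_submission` and the toy CSV text are hypothetical and only illustrate the kind of checks one might run.

```python
import csv
import io


def check_submission(csv_text):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header must be exactly the two expected columns, in order.
    if reader.fieldnames != ["PassengerId", "Survived"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    seen_ids = set()
    for row_number, row in enumerate(reader, start=1):
        # Labels must be the integer strings 0 or 1, not floats such as 1.0.
        if row["Survived"] not in {"0", "1"}:
            problems.append(f"row {row_number}: Survived is {row['Survived']!r}")
        if row["PassengerId"] in seen_ids:
            problems.append(f"row {row_number}: duplicate PassengerId {row['PassengerId']}")
        seen_ids.add(row["PassengerId"])
    return problems


# Toy CSV text (made-up rows); the second row has a float label on purpose.
toy_csv = "PassengerId,Survived\n892,0\n893,1.0\n894,1\n"
issues = check_submission(toy_csv)
print(issues)
```

The float label in the toy data is the kind of value the `astype(int)` cast above is meant to prevent.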

In [None]:
submission.head(10)

# 11. Credits
https://www.kaggle.com/sreevishnudamodaran/ultimate-eda-fe-neural-network-model-top-2