# Kaggle: Titanic

#### Copmetition Specification:
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

#### Approach:
* Feature creation using pandas
* Create pipeline to manage preprocessing (transformation) and stacking (model)
* Use grid search to optimise pipelines parameters

#### Objective is to demonstrate:
* feature creation
* sklearn pipelines
* target encoding
* XGBoost, Random Forest
* Cross-validation
* Performance metrics: accuracy, precision/recall, ROC
* Stacking
* Hyperparameter searching/tuning

##### Limitations:
* no eda: there are many other notebooks for Titanic competition with in depth eda to read

##### Further Enhancements:
* The groups in which people were travelling were a very important indicator of survival. Take a deeper look into what defines those group e.g. nanny-child groups, solo male/female travellers, other segments.
* Manually label cabins to locate cabin with in a deck. Think of the location of cabin as a 3D coordinate. Currently you have two of the three coordinates. Manual labelling is the only way to access the third.
* Think more carefully about how to define miss as adult vs child

## Import Libraries

In [1]:
# kaggle API
from kaggle.api.kaggle_api_extended import KaggleApi

# file handling
import requests
import os
import zipfile

# data hnadling
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import seaborn as sb
sb.set()

# classification models
from xgboost import XGBClassifier

# pre-processing
from sklearn.preprocessing import Imputer

# encoding methods
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce

# sklearn pipelines
from sklearn.pipeline import Pipeline


# model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

# metrics to assess classification models
from sklearn.metrics import precision_recall_fscore_support

# other imports
from pprint import pprint

os.environ['KMP_DUPLICATE_LIB_OK']='True'



## To Do:
* graphs and words
* for each deck encode the location of the cabin on an x/y plane
* mayb interaction terms and feature selection?

## Resources:
* XGBoost (summary): https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
* XGBoost (is feature scaling required?): https://www.quora.com/Why-is-it-not-a-good-idea-to-do-feature-scaling-for-xgboost
* XGBoost (parameter tuning): https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/


* skLearn pipeelines (introduction): https://www.kaggle.com/dansbecker/pipelines
* skLearn pipelines (tutorial): https://www.kaggle.com/aashita/advanced-pipelines-tutorial
* skLearn pipelines (examples): https://www.codementor.io/bruce3557/beautiful-machine-learning-pipeline-with-scikit-learn-uiqapbxuj
* skLearn pipelines (custom transformers - for feature creation): https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
* sklearn pipelines (with XGBoost): https://stackoverflow.com/questions/42793709/how-to-optimize-a-sklearn-pipeline-using-xgboost-for-a-different-eval-metric


* stacking api: https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#methods
* model stacking: https://datascience.stackexchange.com/questions/43436/stacking-doesnt-improve-accuracy

* extracting feature names: https://stackoverflow.com/questions/35376293/extracting-selected-feature-names-from-scikit-pipeline

## Helper Functions

In [2]:
def test_train_split(dep_variable_name, df_train, df_test):
    # move dependant variable to last column of frame
    dep_variable = df_train[dep_variable_name]
    df_train.drop(dep_variable_name, axis=1, inplace=True)
    df_train = df_train.merge(dep_variable, left_index=True, right_index=True)

    # split dependant and independant variables
    X_train = df_train.to_numpy()[:,:-1]
    y_train = df_train.to_numpy()[:,-1:].ravel()

    X_test = df_test.to_numpy()
    # y_test unknown: part of project submission
    y_test = 0
    
    return X_train, y_train, X_test, y_test

def create_submission_df(df_passengerID, df_pred):
    df_submission = df_passengerID.merge(df_pred, left_index=True, right_index=True);

    return df_submission


## Get Data

In [3]:
save_folder = "DataSets/"

# if save folder does not exist, create
if not os.path.exists(save_folder):
    os.mkdir(save_folder)
    
# authenticate and download competition
api = KaggleApi()
api.authenticate()
competition = "titanic"
files = api.competition_download_files(competition, path = save_folder)

# unzip files and remove zip file
dataSet_folder = save_folder + competition

with zipfile.ZipFile(dataSet_folder + ".zip","r") as zip_ref:
    zip_ref.extractall(dataSet_folder)
    
os.remove(dataSet_folder + ".zip")



## Import Data

In [4]:
save_folder = "DataSets/"
competition = "titanic"
dataSet_folder = save_folder + competition

train_csv = 'train.csv'
test_csv = 'test.csv'

trainData = os.path.join(dataSet_folder, train_csv)
testData = os.path.join(dataSet_folder, test_csv)

df_train = pd.read_csv(trainData)
df_test = pd.read_csv(testData)

## Feature Creation

Create dataFrame to be used later for submission.

In [226]:
df_passengerID = pd.DataFrame(df_test['PassengerId'])

Merge dataFrames so all preprocessing can be done on one dataFrame

In [227]:
#### Merge DataFrames
df_test['Survived'] = np.NaN
df_test['Test/Train'] = 'Test'
df_train['Test/Train'] = 'Train'
df_all = pd.concat([df_train.copy(), df_test.copy()], sort=True)

### Missing Values

##### Age

Impute moved to after Title Group is extracted to get a more accurate imput.

##### Embarked

In [228]:
mode_embark = df_all['Embarked'].value_counts().idxmax()

In [229]:
df_all['Embarked'] = df_all['Embarked'].fillna(mode_embark)

##### Fare

In [230]:
mean_Fare = df_all['Fare'].mean(skipna = True)
df_all['Fare'] = df_all['Fare'].fillna(mean_Fare,)

## Feature Creation

We will look deeper into the dataset to see if we can create new features to provide further insight and improve the accuracy score of our model.

### Where were the idividuals at the time of the crash?

#### Deck

There were 10 decks in total. From top to bottom they were the Boat Deck, the Promenade Deck (deck A), passenger decks B to G, Orlop Deck, and the Tank Top. Orlop deck is below deck G and Tank Top is below the Orlop Deck.

There were no passangers on the Orlop or Tank Top decks.

In [232]:
df_all['Deck'] = df_all['Cabin']
df_all['Deck'] = df_all['Deck'].fillna('Z');
df_all['Deck'] = df_all['Deck'].str[:1]

#### Vertical Location

In [234]:
df_all['Vertical_Location'] = df_all['Cabin']
df_all['Vertical_Location'] = df_all['Cabin'].fillna('Z00');
df_all['Vertical_Location'] = df_all['Vertical_Location'].str[1:]
df_all['Vertical_Location'].replace('23 C25 C27', '23', inplace=True)
df_all['Vertical_Location'].replace(' G73', '73', inplace=True)
df_all['Vertical_Location'].replace('10 D12', '10', inplace=True)
df_all['Vertical_Location'].replace('58 B60', '58', inplace=True)
df_all['Vertical_Location'].replace(' E69', '69', inplace=True) 
df_all['Vertical_Location'].replace('22 C26', '22', inplace=True)
df_all['Vertical_Location'].replace('57 B59 B63 B66', '57', inplace=True)
df_all['Vertical_Location'].replace('96 B98', '96', inplace=True) 
df_all['Vertical_Location'].replace('51 B53 B55', '51', inplace=True) 
df_all['Vertical_Location'].replace(' G63', '63', inplace=True) 
df_all['Vertical_Location'].replace('62 C64', '62', inplace=True)
df_all['Vertical_Location'].replace('82 B84', '82', inplace=True) 
df_all['Vertical_Location'].replace('55 C57', '55', inplace=True) 
df_all['Vertical_Location'].replace('39 E41', '39', inplace=True) 
df_all['Vertical_Location'].replace('52 B54 B56', '52', inplace=True)
df_all['Vertical_Location'].replace(' E46', '46', inplace=True)
df_all['Vertical_Location'].replace(' E57', '57', inplace=True) 
df_all['Vertical_Location'] = pd.to_numeric(df_all['Vertical_Location'])

def title_group_conditions(x):
    if x['Deck'] in ['Z','T']:
        return 'unknown'
    if (x['Deck'] not in ['Z','T']) and (x['Vertical_Location']%2==0.0) and (x['Deck'] in ['A','B','C','D']):
        return 'Top'
    else:
        return 'Bottom'

In [235]:
df_all['Vertical_Location'] = df_all.apply(title_group_conditions, axis=1)

### What further information can we extract about the idividual?

##### Surname

In [231]:
df_all['Surname'] = df_all.Name.str.extract('([A-Za-z]+),')

872

##### Title

In [236]:
# use ... to extract title from name
df_all['Title'] = df_all.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

In [237]:
# Replace French Titles with English Equivalents
df_all['Title'] = df_all['Title'].replace('Mlle', 'Miss')
df_all['Title'] = df_all['Title'].replace('Ms', 'Miss')
df_all['Title'] = df_all['Title'].replace('Mme', 'Mrs')
df_all['Title'] = df_all['Title'].replace('Ms', 'Miss')

#### Title Group

In [238]:
def title_group_conditions(x):
    if ((x['Title'] not in ['Mr', 'Miss', 'Mrs', 'Master'] ) and (x['Sex']=='male')):
        return 'Mr'
    elif ((x['Title'] not in ['Mr', 'Miss', 'Mrs', 'Master'] ) and (x['Sex']=='female')):
        return 'Mrs'
    else:
        return x['Title']

In [239]:
df_all['Title Group'] = df_all.apply(title_group_conditions, axis=1)

In [240]:
#### Impute Ages Based on Groups

In [241]:
mean_Age = df_all['Age'].mean(skipna = True)
df_all['Age'] = df_all['Age'].fillna(mean_Age,)

mr_mean_age = df_all[df_all['Title Group'] == 'Mr']['Age'].mean()
master_mean_age = df_all[df_all['Title Group'] == 'Master']['Age'].mean()
mrs_mean_age = df_all[df_all['Title Group'] == 'Mrs']['Age'].mean()
miss_mean_age = df_all[df_all['Title Group'] == 'Miss']['Age'].mean()

mr_index = list(df_all[(df_all['Age'].isna()) & (df_all['Title Group'] == 'Mr')]['Age'].index)
master_index = list(df_all[(df_all['Age'].isna()) & (df_all['Title Group'] == 'Master')]['Age'].index)
mrs_index = list(df_all[(df_all['Age'].isna()) & (df_all['Title Group'] == 'Mrs')]['Age'].index)
miss_idnex = list(df_all[(df_all['Age'].isna()) & (df_all['Title Group'] == 'Miss')]['Age'].index)

df_all.set_value(mr_index, 'Age', mr_mean_age);
df_all.set_value(master_index, 'Age', master_mean_age);
df_all.set_value(mrs_index, 'Age', mrs_mean_age);
df_all.set_value(miss_idnex, 'Age', miss_mean_age);

  
  from ipykernel import kernelapp as app
  app.launch_new_instance()


##### Famliy Size

In [242]:
df_all['Family Size'] = df_all['SibSp'] + df_all['Parch'] + 1

##### is Boy Man or Female ?
Women and children. Women encompasses both children and adults. The distinction that needs to be drawn here is what defines an adult vs child boy? 

In [243]:
def is_boy_man_female_conditions(x):
    if ((x['Title Group'] in ['Master']) and (x['Sex'] == 'male')):
        return 'boy'
    elif ((x['Title Group'] in ['Mr']) and (x['Sex'] == 'male')):
        return 'man'
    else:
        return 'female'

In [244]:
df_all['boy_man_female'] = df_all.apply(is_boy_man_female_conditions, axis=1)

##### isParent

In [245]:
def isadult_conditions(x):
    if ((x['Title Group'] in ['Mr','Mrs'])):
        return 1
    else:
        return 0

In [246]:
df_all['isAdult'] = df_all.apply(isadult_conditions, axis=1)

##### isChild

In [247]:
def ischild_conditions(x):
    if ((x['Title Group'] in ['Master','Miss'])):
        return 1
    else:
        return 0

In [248]:
df_all['isChild'] = df_all.apply(ischild_conditions, axis=1)

##### fareBins

In [249]:
# Making Bins
df_all['FareBin'] = pd.qcut(df_all['Fare'], 5).astype(str)

#####  ageBins

In [250]:
# Making Bins
df_all['AgeBin'] = pd.qcut(df_all['Age'], 5).astype(str)

### What group was the person travelling with and what information can we obtain about these groups?

##### FamilyGroupID

In [251]:
##### Surname+Ticket Count
check_shape = df_all.shape

df_all['FamilyGroup_ID'] = df_all['Surname'] +  df_all['Ticket'] + df_all['Embarked'] + df_all['Pclass'].astype(str) + df_all['Fare'].astype(str)

assert((df_all.shape[1] - check_shape[1])==1)

##### Ticket Count

In [252]:
check_shape = df_all.shape

df_temp = df_all.groupby('Ticket').count()['Survived'].reset_index()
df_temp.rename(columns={'Survived':'Ticket_Count'}, inplace=True)
df_all = df_all.merge(df_temp, on='Ticket',how='left')

assert((df_all.shape[1] - check_shape[1])==1)

df_all['Ticket_Count'].unique()

array([1, 2, 4, 3, 7, 5, 6, 0])

##### FamilyID Count (excluding self)

In [253]:
check_shape = df_all.shape

df_temp = df_all.groupby('FamilyGroup_ID').count()['Survived'].reset_index()
df_temp.rename(columns={'Survived':'FamilyGroup_ID_Count'}, inplace=True)
df_temp['FamilyGroup_ID_Count'] = df_temp['FamilyGroup_ID_Count'] - 1
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')

assert((df_all.shape[1] - check_shape[1])==1)

df_all['FamilyGroup_ID_Count'].unique()


array([ 0,  1,  3,  2,  6,  4,  5, -1])

##### FamilyID Child Count

In [254]:
check_shape = df_all.shape

df_temp = df_all.groupby('FamilyGroup_ID').sum()['isChild'].reset_index()
df_temp.rename(columns={'isChild':'FamilyGroup_Child_Count'}, inplace=True)
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')

assert((df_all.shape[1] - check_shape[1])==1)

##### FamilyID Adult Count

In [255]:
'''
check_shape = df_all.shape

df_temp = df_all.groupby('FamilyGroup_ID').sum()['isAdult'].reset_index()
df_temp.rename(columns={'isAdult':'FamilyGroup_Parent_Count'}, inplace=True)
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')

assert((df_all.shape[1] - check_shape[1])==1)
'''

"\ncheck_shape = df_all.shape\n\ndf_temp = df_all.groupby('FamilyGroup_ID').sum()['isAdult'].reset_index()\ndf_temp.rename(columns={'isAdult':'FamilyGroup_Parent_Count'}, inplace=True)\ndf_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')\n\nassert((df_all.shape[1] - check_shape[1])==1)\n"

##### FamilyID Female Adult Count

In [256]:
check_shape = df_all.shape

df_temp = df_all.groupby(['FamilyGroup_ID','Sex']).sum()['isAdult'].reset_index()
df_temp = df_temp[df_temp['Sex']=='female']
df_temp.rename(columns={'isAdult':'FamilyGroup_FemaleAdult_Count'}, inplace=True)
df_temp.drop('Sex',axis=1 ,inplace=True)
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')
df_all['FamilyGroup_FemaleAdult_Count'].fillna(value=-1, inplace=True)

assert((df_all.shape[1] - check_shape[1])==1)
df_all['FamilyGroup_FemaleAdult_Count'].unique()

array([-1.,  1.,  0.])

##### FamilyID Male Adult Count

In [257]:
check_shape = df_all.shape

df_temp = df_all.groupby(['FamilyGroup_ID','Sex']).sum()['isAdult'].reset_index()
df_temp = df_temp[df_temp['Sex']=='male']
df_temp.rename(columns={'isAdult':'FamilyGroup_MaleAdult_Count'}, inplace=True)
df_temp.drop('Sex',axis=1 ,inplace=True)
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')
df_all['FamilyGroup_MaleAdult_Count'].fillna(value=-1, inplace=True)

assert((df_all.shape[1] - check_shape[1])==1)
df_all['FamilyGroup_MaleAdult_Count'].unique()


array([ 1., -1.,  0.,  2.,  3.,  4.])

##### FramilyGroupID Survival Group

In [258]:
# list of family IDs in both test and train set
set_test = set(df_all[df_all['Test/Train']=='Test']['FamilyGroup_ID'].unique())
set_train = set(df_all[df_all['Test/Train']=='Train']['FamilyGroup_ID'].unique())
family_id_both = set_test.intersection(set_train)

##### list of family IDs in only train set
family_id_train = set_train - set_test

##### list of family IDs in only test set
family_id_test = set_test - set_train

assert(family_id_train.intersection(family_id_both) == set())
assert(family_id_test.intersection(family_id_both) == set())


In [259]:
#### survival count
check_shape = df_all.shape

df_temp = df_all.groupby('FamilyGroup_ID').sum()['Survived'].reset_index()
df_temp.rename(columns={'Survived':'FamilyGroup_SurvivalCount'}, inplace=True)
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')

# create survived temp and fill nan with 0 (this is so when we deduct 
# indvidiual survival count from group survival count we don't return a nan)
df_all['Survived_temp'] = df_all['Survived']
df_all['Survived_temp'].fillna(value=0, inplace=True)

df_all['FamilyGroup_SurvivalCount'] = df_all['FamilyGroup_SurvivalCount'] - df_all['Survived_temp']

df_all.drop('Survived_temp', axis=1, inplace=True)

assert((df_all.shape[1] - check_shape[1])==1)

In [260]:
# family_group_id_sizes == 1 do not encode anything as you are inputing the target variable into the model as a
# feature, this will cause massive overfitting

# family_group_id_survival_count - individual survival count (same as above)

## family sizes 1
# G - family group sizes exactly 1

## whole group survival information known
# A - exactly 1
# B - exactly 0
# C - whole group survives

## not whole group information known
# D - at least 1 (at least 1 survives known with some unknowns)
# E - maybe 1 (0 survived with unkowns)
# F - unknown (all other family members survival not known)

def group_survival_status(x):
    
    if x['FamilyGroup_ID_Count'] == 0:
        return 'G'

    elif ((x['FamilyGroup_ID_Count']!=0) and (x['FamilyGroup_ID'] in family_id_train) and 
          (x['FamilyGroup_SurvivalCount'] == 1)):
    #family survival known and #family survival == 1:
        return 'A'

    elif ((x['FamilyGroup_ID_Count']!=0) and (x['FamilyGroup_ID'] in family_id_train) and 
          (x['FamilyGroup_SurvivalCount'] == 0)):
    #family survival known and #family survival == 0:
        return 'B'

    #elif ((x['FamilyGroup_ID_Count']!=0) and (x['FamilyGroup_ID'] in family_id_train) and 
          #(x['FamilyGroup_SurvivalCount'] == x['FamilyGroup_ID_Count'])):
    #family survival known and #family survival == family size:
        #return 'C'

    elif ((x['FamilyGroup_ID'] in family_id_both) and (x['FamilyGroup_ID_Count']!=0) and 
          (x['FamilyGroup_SurvivalCount'] > 0)):
    #family survival not known fully #at least one survives:
        return 'D'

    elif ((x['FamilyGroup_ID'] in family_id_both) and (x['FamilyGroup_ID_Count']!=0) and 
          (x['FamilyGroup_SurvivalCount'] == 0)):
    #family survival not known and #zero survived:
        return 'E'

    elif ((x['FamilyGroup_ID'] in family_id_test) and (x['FamilyGroup_ID_Count']!=0)):
    #family survival not known 
        return 'F'

    else:
    # just encase
        return 'I'

In [261]:
df_all['FamilyID_Survival_Group'] = df_all.apply(group_survival_status, axis=1)

In [262]:
df_all.drop('FamilyGroup_SurvivalCount', axis=1 ,inplace=True)

##### Woman and Child Group

In [263]:
def is_woman_and_child_group_conditions(x):
    if x['FamilyGroup_FemaleAdult_Count']>0 and x['FamilyGroup_Child_Count']>0:
        return 1
    else:
        return 0

In [264]:
df_all['isWomanAndChild'] = df_all.apply(is_woman_and_child_group_conditions, axis=1)

##### Woman and Child Count

In [265]:
df_all['womanAndChildCount'] = df_all['FamilyGroup_Child_Count'] + df_all['FamilyGroup_FemaleAdult_Count'] - 1


##### Woman and Child Survival Ratio

In [266]:

df_temp = df_all.groupby(['FamilyGroup_ID','isChild']).sum()['Survived'].reset_index()
df_temp = df_temp[df_temp['isChild']==1]
df_temp.drop('isChild', axis=1, inplace=True)
df_temp.rename(columns={'Survived':'FamilyGroup_ChildSurvivalCount'}, inplace=True)
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')


In [267]:

df_temp = df_all.groupby(['FamilyGroup_ID','isAdult', 'Sex']).sum()['Survived'].reset_index()
df_temp = df_temp[df_temp['isAdult']==1]
df_temp = df_temp[df_temp['Sex']=='female']
df_temp.drop(['isAdult','Sex'], axis=1, inplace=True)
df_temp.rename(columns={'Survived':'FamilyGroup_FemaleSurvivalCount'}, inplace=True)
df_all = df_all.merge(df_temp, on='FamilyGroup_ID',how='left')


In [268]:
df_all['Survived_temp'] = df_all['Survived']
df_all['Survived_temp'].fillna(value=0, inplace=True)


In [269]:
df_all['womanAndChildSurvivalRatio'] = ((df_all['FamilyGroup_ChildSurvivalCount'] + 
                                        df_all['FamilyGroup_FemaleSurvivalCount'] - df_all['Survived_temp']) / 
                                        df_all['womanAndChildCount'])

df_all['womanAndChildSurvivalCount'] = ((df_all['FamilyGroup_ChildSurvivalCount'] + 
                                        df_all['FamilyGroup_FemaleSurvivalCount'] - df_all['Survived_temp'])) 


df_all.drop('Survived_temp', axis=1, inplace=True)

In [270]:
def group_survival_status(x):
    
    if x['isWomanAndChild'] == 0:
        return 'Z'

    elif ((x['isWomanAndChild']!=0) and (x['FamilyGroup_ID'] in family_id_train) and 
          (x['womanAndChildSurvivalRatio'] == 1)):
    #family survival known and #family survival == 1:
        return 'A'

    elif ((x['isWomanAndChild']!=0) and (x['FamilyGroup_ID'] in family_id_train) and 
          (x['womanAndChildSurvivalRatio'] == 0)):
    #family survival known and #family survival == 0:
        return 'B'

    #elif ((x['FamilyGroup_ID_Count']!=0) and (x['FamilyGroup_ID'] in family_id_train) and 
          #(x['FamilyGroup_SurvivalCount'] == x['FamilyGroup_ID_Count'])):
    #family survival known and #family survival == family size:
        #return 'C'

    elif ((x['FamilyGroup_ID'] in family_id_both) and (x['isWomanAndChild']!=0) and 
          (x['womanAndChildSurvivalCount'] > 0)):
    #family survival not known fully #at least one survives:
        return 'D'

    elif ((x['FamilyGroup_ID'] in family_id_both) and (x['isWomanAndChild']!=0) and 
          (x['womanAndChildSurvivalCount'] == 0)):
    #family survival not known and #zero survived:
        return 'E'

    elif ((x['FamilyGroup_ID'] in family_id_test) and (x['isWomanAndChild']!=0)):
    #family survival not known 
        return 'F'

    else:
    # just encase
        return 'I'

In [271]:
df_all['womanAndChild_Survival'] = df_all.apply(group_survival_status, axis=1)

In [272]:
df_all.drop('womanAndChildSurvivalRatio', axis=1, inplace=True)
df_all.drop('womanAndChildSurvivalCount', axis=1, inplace=True)
df_all.drop('FamilyGroup_ChildSurvivalCount', axis=1, inplace=True)
df_all.drop('FamilyGroup_FemaleSurvivalCount', axis=1, inplace=True)


### Drop Columns

In [273]:
drop_columns = ['Name', 'Cabin', 'PassengerId', 'Ticket', 'FamilyGroup_ID', 'Surname', 'Title', 'Age', 'Fare', 'isAdult']
df_all.drop(drop_columns, axis=1, inplace=True)

### Recreate Train and Test DataFrames

In [274]:
#### Split DataFrames
df_train = df_all[df_all['Test/Train']=='Train'].copy()
df_test = df_all[df_all['Test/Train']=='Test'].copy()
df_train.drop(['Test/Train'], axis=1, inplace=True)
df_test.drop(['Test/Train','Survived'], axis=1, inplace=True)

## ML Pipeline

In [275]:
from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
import category_encoders as ce
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA

### Feature Groups

In [276]:
X_train = df_train.drop('Survived', axis=1)
y_train = df_train['Survived']

X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.2, random_state=0)

In [277]:
all_Feat = X_train.columns.tolist()

In [278]:
num_feat = []
cat_feat = list(set(X_train.columns.tolist()) - set(num_feat))

assert (set(num_feat).union(set(cat_feat)) == set(all_Feat))

### Transformation Components

##### Component for Numerical Values

In [279]:
num_comp = ("numerical_features", ColumnTransformer([
                ("numerical", Pipeline(steps=[(
                    "impute_stage", SimpleImputer(missing_values=np.nan, strategy="mean"))]), num_feat
                )])
            )

##### Component for  Categorical Values

In [280]:
cat_comp = ("categorical_features", ColumnTransformer([
                ("categorical_mode", Pipeline(steps=[
                    ("impute_stage", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
                    ("label_encoder", ce.TargetEncoder(handle_unknown="impute"))]), cat_feat 
                )])
           )

##### Component for Transformation 

In [281]:
trans_comp = ("features", FeatureUnion([num_comp, cat_comp]))

### XGBoost Model Component

In [282]:
model_comp_xgb = ("classifiers", XGBClassifier(random_state=0))

### XGBoost Full Pipeline 

In [283]:
full_pipeline_xgb  = Pipeline(steps=[trans_comp, model_comp_xgb])

## Train Pipeline - XGBoost

In [293]:
full_pipeline_xgb.fit(X_train, y_train);

In [294]:
scores_accuracy = cross_val_score(full_pipeline_xgb, X_train, y_train, cv=10, scoring = "accuracy")
scores_precision = cross_val_score(full_pipeline_xgb, X_train, y_train, cv=10, scoring = "precision")
scores_recall = cross_val_score(full_pipeline_xgb, X_train, y_train, cv=10, scoring = "recall")
scores_rocauc = cross_val_score(full_pipeline_xgb, X_train, y_train, cv=10, scoring = "roc_auc")

print("CV Accuracy score: " + str(scores_accuracy.mean()))
print("CV Precision score: " + str(scores_precision.mean()))
print("CV Recall score: " + str(scores_recall.mean()))
print("CV ROC score: " + str(scores_rocauc.mean()))

CV Accuracy score: 0.8329370668455175
CV Precision score: 0.8172616391473015
CV Recall score: 0.735978835978836
CV ROC score: 0.8794676764880253


In [295]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

In [296]:
y_pred = full_pipeline_xgb.predict(X_val)

In [297]:
print("Test Accuracy score: " + str(accuracy_score(y_val,y_pred)))
print("Test Precision score: " + str(precision_score(y_val,y_pred)))
print("Test Recall score: " + str(recall_score(y_val,y_pred)))
print("Test ROC score: " + str(roc_auc_score(y_val,y_pred)))

Test Accuracy score: 0.8379888268156425
Test Precision score: 0.8125
Test Recall score: 0.7536231884057971
Test ROC score: 0.822266139657444


### Feature Importances

In [155]:
from sklearn.feature_selection import SelectFromModel

In [156]:
model = full_pipeline_xgb.steps[1][1]
selection_xgb = SelectFromModel(model, threshold=0, prefit=True)
feature_names_example = X_train.columns.to_numpy().reshape(1, -1)
selected_features = selection_xgb.transform(feature_names_example)
selected_features = list(selected_features[0])
selected_features.reverse()

importances = list(full_pipeline_xgb.named_steps['classifiers'].feature_importances_)
importances.sort()
importances.reverse()

for i in range(len(importances)):
    print(selected_features[i] + ' has importances: ' + str(round(importances[i] * 100,2)) +'%')

womanAndChild_Survival has importances: 51.37%
womanAndChildCount has importances: 7.89%
isWomanAndChild has importances: 7.12%
FamilyID_Survival_Group has importances: 4.99%
FamilyGroup_MaleAdult_Count has importances: 3.71%
FamilyGroup_FemaleAdult_Count has importances: 3.13%
FamilyGroup_Child_Count has importances: 2.97%
FamilyGroup_ID_Count has importances: 2.38%
Ticket_Count has importances: 2.08%
AgeBin has importances: 1.8%
FareBin has importances: 1.73%
isChild has importances: 1.53%
isAdult has importances: 1.52%
Family Size has importances: 1.35%
Title Group has importances: 1.28%
Vertical_Location has importances: 1.19%
Deck has importances: 1.17%
SibSp has importances: 1.16%
Sex has importances: 1.02%
Pclass has importances: 0.59%
Parch has importances: 0.0%
Embarked has importances: 0.0%


## Tune Pipeline - XGBoost

##### Iteration 1

In [306]:
new_params= {'classifiers__max_depth': list(range(2,4,1)),
             'classifiers__min_child_weight': list(range(0,2,1))}

In [307]:
grid_searchCV = GridSearchCV(estimator=full_pipeline_xgb, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8342696629213483,
 {'classifiers__max_depth': 2, 'classifiers__min_child_weight': 0})

##### Iteration 2

In [300]:
new_params= {'classifiers__max_depth': [2],
             'classifiers__min_child_weight': [0],
             'classifiers__gamma': [i/10.0 for i in range(0,5)]}

In [301]:
grid_searchCV = GridSearchCV(estimator=full_pipeline_xgb, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8342696629213483,
 {'classifiers__gamma': 0.0,
  'classifiers__max_depth': 2,
  'classifiers__min_child_weight': 0})

##### Iteration 3

In [312]:
new_params= {'classifiers__max_depth': [2],
             'classifiers__min_child_weight': [0],
             'classifiers__gamma': [0.0],
             'classifiers__colsample_bytree': [i/10.0 for i in range(8,11)],
             'classifiers__subsample': [i/10.0 for i in range(8,11)]}

In [313]:
grid_searchCV = GridSearchCV(estimator=full_pipeline_xgb, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8342696629213483,
 {'classifiers__colsample_bytree': 1.0,
  'classifiers__gamma': 0.0,
  'classifiers__max_depth': 2,
  'classifiers__min_child_weight': 0,
  'classifiers__subsample': 1.0})

##### Iteration 4

In [318]:
new_params= {'classifiers__max_depth': [2],
             'classifiers__min_child_weight': [0],
             'classifiers__gamma': [0],
             'classifiers__colsample_bytree': [1],
             'classifiers__subsample': [1],
             'classifiers__reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]}

In [319]:
grid_searchCV = GridSearchCV(estimator=full_pipeline_xgb, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8356741573033708,
 {'classifiers__colsample_bytree': 1,
  'classifiers__gamma': 0,
  'classifiers__max_depth': 2,
  'classifiers__min_child_weight': 0,
  'classifiers__reg_alpha': 1,
  'classifiers__subsample': 1})

### Validation Set Score

In [320]:
y_pred = grid_searchCV.predict(X_val)

print("Test Accuracy score: " + str(accuracy_score(y_val,y_pred)))
print("Test Precision score: " + str(precision_score(y_val,y_pred)))
print("Test Recall score: " + str(recall_score(y_val,y_pred)))
print("Test ROC score: " + str(roc_auc_score(y_val,y_pred)))

Test Accuracy score: 0.8324022346368715
Test Precision score: 0.8
Test Recall score: 0.7536231884057971
Test ROC score: 0.8177206851119895


### Train Model on Full Training Set

In [321]:
X_train = df_train.drop('Survived', axis=1)
y_train = df_train['Survived']

In [322]:
grid_searchCV.fit(X_train, y_train);
grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8473625140291807,
 {'classifiers__colsample_bytree': 1,
  'classifiers__gamma': 0,
  'classifiers__max_depth': 2,
  'classifiers__min_child_weight': 0,
  'classifiers__reg_alpha': 1,
  'classifiers__subsample': 1})

#### Submission

In [323]:
X_test = df_test
y_pred = grid_searchCV.predict(X_test).astype(int)

In [324]:
y_pred = pd.Series(y_pred, name='Survived')
df_submission = create_submission_df(df_passengerID, y_pred)

In [325]:
save_folder = "/Users/IainMac/Desktop/Learning/01 Portfolio/02 Machine Learning/02 Classification/00 Kaggle Submissions/"
save_name = 'submission.csv'
df_submission.to_csv(path_or_buf = save_folder + save_name, index=False)

In [326]:
api.competition_submit(save_folder + save_name, message='skLearn Pipeline with XGBoost: Submission' ,competition=competition)

100%|██████████| 2.77k/2.77k [00:12<00:00, 223B/s]  


Successfully submitted to Titanic: Machine Learning from Disaster

## Train Pipeline - Stacked

No submissions were made using this pipeline, this is simply to show how to build a stacked model using sklearn pipelines.

### Stacked Model Component

In [284]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

xgb = XGBClassifier(random_state=0)
rfr = RandomForestClassifier(random_state=0, n_estimators=100)
lgr = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=0)
mlgr = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=0)
gnb = GaussianNB()
knn = KNeighborsClassifier()

sclf = StackingClassifier(classifiers=[xgb, gnb, knn], meta_classifier=mlgr)

In [285]:
model_comp = ("classifiers", sclf)

### Stacked Full Pipeline

In [286]:
full_pipeline = Pipeline(steps=[trans_comp, model_comp])

In [62]:
full_pipeline.fit(X_train, y_train);

In [63]:
scores_accuracy = cross_val_score(full_pipeline, X_train, y_train, cv=10, scoring = "accuracy")
scores_precision = cross_val_score(full_pipeline, X_train, y_train, cv=10, scoring = "precision")
scores_recall = cross_val_score(full_pipeline, X_train, y_train, cv=10, scoring = "recall")
scores_rocauc = cross_val_score(full_pipeline, X_train, y_train, cv=10, scoring = "roc_auc")

print("CV Accuracy score: " + str(scores_accuracy.mean()))
print("CV Precision score: " + str(scores_precision.mean()))
print("CV Recall score: " + str(scores_recall.mean()))
print("CV ROC score: " + str(scores_rocauc.mean()))

CV Accuracy score: 0.8314883746926002
CV Precision score: 0.8089678724851138
CV Recall score: 0.7433862433862434
CV ROC score: 0.8429136324485162


In [58]:
y_pred = full_pipeline.predict(X_val)

In [59]:
print("Test Accuracy score: " + str(accuracy_score(y_val,y_pred)))
print("Test Precision score: " + str(precision_score(y_val,y_pred)))
print("Test Recall score: " + str(recall_score(y_val,y_pred)))
print("Test ROC score: " + str(roc_auc_score(y_val,y_pred)))

Test Accuracy score: 0.8212290502793296
Test Precision score: 0.7936507936507936
Test Recall score: 0.7246376811594203
Test ROC score: 0.8032279314888011


## Tune Pipeline - Stacked

### Tune Pipeline - RandomForest

### Tune Pipeline - XGBoost

##### Iteration 1

In [477]:
new_params= {'classifiers__randomforestclassifier__n_estimators':[30],
              'classifiers__randomforestclassifier__min_samples_split': [2],
              'classifiers__randomforestclassifier__min_samples_leaf': [3],
              'classifiers__randomforestclassifier__max_features': ['auto'],
              'classifiers__randomforestclassifier__max_depth':[10],
             'classifiers__xgbclassifier__max_depth': list(range(2,4,1)),
             'classifiers__xgbclassifier__min_child_weight': list(range(0,2,1))}

In [478]:
grid_searchCV = GridSearchCV(estimator=full_pipeline, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8342696629213483,
 {'classifiers__randomforestclassifier__max_depth': 10,
  'classifiers__randomforestclassifier__max_features': 'auto',
  'classifiers__randomforestclassifier__min_samples_leaf': 3,
  'classifiers__randomforestclassifier__min_samples_split': 2,
  'classifiers__randomforestclassifier__n_estimators': 30,
  'classifiers__xgbclassifier__max_depth': 3,
  'classifiers__xgbclassifier__min_child_weight': 1})

##### Iteration 2

In [479]:
new_params= {'classifiers__randomforestclassifier__n_estimators':[30],
              'classifiers__randomforestclassifier__min_samples_split': [2],
              'classifiers__randomforestclassifier__min_samples_leaf': [3],
              'classifiers__randomforestclassifier__max_features': ['auto'],
              'classifiers__randomforestclassifier__max_depth':[10],
             'classifiers__xgbclassifier__max_depth': [3],
             'classifiers__xgbclassifier__min_child_weight': [1],
             'classifiers__xgbclassifier__gamma': [i/10.0 for i in range(0,5)]}

In [480]:
grid_searchCV = GridSearchCV(estimator=full_pipeline, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8370786516853933,
 {'classifiers__randomforestclassifier__max_depth': 10,
  'classifiers__randomforestclassifier__max_features': 'auto',
  'classifiers__randomforestclassifier__min_samples_leaf': 3,
  'classifiers__randomforestclassifier__min_samples_split': 2,
  'classifiers__randomforestclassifier__n_estimators': 30,
  'classifiers__xgbclassifier__gamma': 0.1,
  'classifiers__xgbclassifier__max_depth': 3,
  'classifiers__xgbclassifier__min_child_weight': 1})

##### Iteration 3

In [481]:
new_params= {'classifiers__randomforestclassifier__n_estimators':[30],
              'classifiers__randomforestclassifier__min_samples_split': [2],
              'classifiers__randomforestclassifier__min_samples_leaf': [3],
              'classifiers__randomforestclassifier__max_features': ['auto'],
              'classifiers__randomforestclassifier__max_depth':[10],
             'classifiers__xgbclassifier__max_depth': [3],
             'classifiers__xgbclassifier__min_child_weight': [1],
             'classifiers__xgbclassifier__gamma': [0.1],
             'classifiers__xgbclassifier__colsample_bytree': [i/10.0 for i in range(6,10)],
             'classifiers__xgbclassifier__subsample': [i/10.0 for i in range(6,10)]}

In [482]:
grid_searchCV = GridSearchCV(estimator=full_pipeline, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8398876404494382,
 {'classifiers__randomforestclassifier__max_depth': 10,
  'classifiers__randomforestclassifier__max_features': 'auto',
  'classifiers__randomforestclassifier__min_samples_leaf': 3,
  'classifiers__randomforestclassifier__min_samples_split': 2,
  'classifiers__randomforestclassifier__n_estimators': 30,
  'classifiers__xgbclassifier__colsample_bytree': 0.7,
  'classifiers__xgbclassifier__gamma': 0.1,
  'classifiers__xgbclassifier__max_depth': 3,
  'classifiers__xgbclassifier__min_child_weight': 1,
  'classifiers__xgbclassifier__subsample': 0.6})

##### Iteration 4

In [485]:
new_params= {'classifiers__randomforestclassifier__n_estimators':[30],
              'classifiers__randomforestclassifier__min_samples_split': [2],
              'classifiers__randomforestclassifier__min_samples_leaf': [3],
              'classifiers__randomforestclassifier__max_features': ['auto'],
              'classifiers__randomforestclassifier__max_depth':[10],
             'classifiers__xgbclassifier__max_depth': [3],
             'classifiers__xgbclassifier__min_child_weight': [1],
             'classifiers__xgbclassifier__gamma': [0.1],
             'classifiers__xgbclassifier__colsample_bytree': [i/20.0 for i in range(13,16)],
             'classifiers__xgbclassifier__subsample': [i/20.0 for i in range(11,14)]}

In [486]:
grid_searchCV = GridSearchCV(estimator=full_pipeline, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8398876404494382,
 {'classifiers__randomforestclassifier__max_depth': 10,
  'classifiers__randomforestclassifier__max_features': 'auto',
  'classifiers__randomforestclassifier__min_samples_leaf': 3,
  'classifiers__randomforestclassifier__min_samples_split': 2,
  'classifiers__randomforestclassifier__n_estimators': 30,
  'classifiers__xgbclassifier__colsample_bytree': 0.7,
  'classifiers__xgbclassifier__gamma': 0.1,
  'classifiers__xgbclassifier__max_depth': 3,
  'classifiers__xgbclassifier__min_child_weight': 1,
  'classifiers__xgbclassifier__subsample': 0.6})

##### Iteration 5

In [487]:
new_params= {'classifiers__randomforestclassifier__n_estimators':[30],
              'classifiers__randomforestclassifier__min_samples_split': [2],
              'classifiers__randomforestclassifier__min_samples_leaf': [3],
              'classifiers__randomforestclassifier__max_features': ['auto'],
              'classifiers__randomforestclassifier__max_depth':[10],
             'classifiers__xgbclassifier__max_depth': [3],
             'classifiers__xgbclassifier__min_child_weight': [1],
             'classifiers__xgbclassifier__gamma': [0.1],
             'classifiers__xgbclassifier__colsample_bytree': [0.7],
             'classifiers__xgbclassifier__subsample': [0.6],
             'classifiers__xgbclassifier__reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]}

In [488]:
grid_searchCV = GridSearchCV(estimator=full_pipeline, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8398876404494382,
 {'classifiers__randomforestclassifier__max_depth': 10,
  'classifiers__randomforestclassifier__max_features': 'auto',
  'classifiers__randomforestclassifier__min_samples_leaf': 3,
  'classifiers__randomforestclassifier__min_samples_split': 2,
  'classifiers__randomforestclassifier__n_estimators': 30,
  'classifiers__xgbclassifier__colsample_bytree': 0.7,
  'classifiers__xgbclassifier__gamma': 0.1,
  'classifiers__xgbclassifier__max_depth': 3,
  'classifiers__xgbclassifier__min_child_weight': 1,
  'classifiers__xgbclassifier__reg_alpha': 1e-05,
  'classifiers__xgbclassifier__subsample': 0.6})

### Logistic Regression: Meta Classifier

In [507]:
new_params= {'classifiers__randomforestclassifier__n_estimators':[30],
              'classifiers__randomforestclassifier__min_samples_split': [2],
              'classifiers__randomforestclassifier__min_samples_leaf': [3],
              'classifiers__randomforestclassifier__max_features': ['auto'],
              'classifiers__randomforestclassifier__max_depth':[10],
             'classifiers__xgbclassifier__max_depth': [3],
             'classifiers__xgbclassifier__min_child_weight': [1],
             'classifiers__xgbclassifier__gamma': [0.1],
             'classifiers__xgbclassifier__colsample_bytree': [0.7],
             'classifiers__xgbclassifier__subsample': [0.6],
             'classifiers__xgbclassifier__reg_alpha': [1e-5],
            'classifiers__meta_classifier__C': [1]}



In [508]:
grid_searchCV = GridSearchCV(estimator=full_pipeline, param_grid=new_params, cv=5, scoring='accuracy')

np.random.seed(seed=0)
grid_searchCV.fit(X_train, y_train);

grid_searchCV.best_score_, grid_searchCV.best_params_

(0.8398876404494382,
 {'classifiers__meta_classifier__C': 1,
  'classifiers__randomforestclassifier__max_depth': 10,
  'classifiers__randomforestclassifier__max_features': 'auto',
  'classifiers__randomforestclassifier__min_samples_leaf': 3,
  'classifiers__randomforestclassifier__min_samples_split': 2,
  'classifiers__randomforestclassifier__n_estimators': 30,
  'classifiers__xgbclassifier__colsample_bytree': 0.7,
  'classifiers__xgbclassifier__gamma': 0.1,
  'classifiers__xgbclassifier__max_depth': 3,
  'classifiers__xgbclassifier__min_child_weight': 1,
  'classifiers__xgbclassifier__reg_alpha': 1e-05,
  'classifiers__xgbclassifier__subsample': 0.6})

### Cross Validated Score

In [509]:
y_pred = grid_searchCV.predict(X_val)

print("Test Accuracy score: " + str(accuracy_score(y_val,y_pred)))
print("Test Precision score: " + str(precision_score(y_val,y_pred)))
print("Test Recall score: " + str(recall_score(y_val,y_pred)))
print("Test ROC score: " + str(roc_auc_score(y_val,y_pred)))

Test Accuracy score: 0.8547486033519553
Test Precision score: 0.8524590163934426
Test Recall score: 0.7536231884057971
Test ROC score: 0.8359025032938077
