# An XGBoost  solution to the Titanic Survivor Dataset

In this exercise i will be working with the titanic data set from Kaggle. This dataset contains information about the passengers who were on board when the titanic sank. 

## The Objective

My goal for this project is to make predictions on whether a person survived or not. This will be a supervised binary classification problem. 

I will be using an XGBoost Classifier as my model of choice. 

In [2]:
#Modules

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import GridSearchCV,RandomizedSearchCV,train_test_split
import xgboost as xgb 
from sklearn.metrics import classification_report, accuracy_score,confusion_matrix


In [3]:
#importing the data 
df=pd.read_csv('titanic_train.csv',index_col='PassengerId') 

# EDA

I will begin by performing some initial exploratory data analysis on the data. I will be looking to see if there is any missing data or cleaning that i need to perform. I will also be looking to see if there are any obvious relationships in the data. 

In [4]:
df.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


from the initial look at the dataset there are a few obvious issues that need to be fixed. There are categorical columns which need to be encoded using dictvectorizer. 

The cabin column as some NaN values so i will need to see if it is worth keeping this column or whether i should just ignore it. 

There are also some numeric categorical columns that need to be fixed too. 

The name and ticket columns will also be dropped. I'm sure some extra performance could be extracted but i don't think it is worth the additional effort in this case. 

In [5]:
# I will first drop the name and ticket columns from my dataset

df.drop(inplace=True,columns=['Name','Ticket'])

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 69.6+ KB


From looking at the info sheet above i think it is worth dropping the cabin column completely as most of the values are missing. The information would also be hard to work with as it is not in any sort of category. 

Their are only a few missing values from embarked so i will drop these rows of data.

for the age values i will fill them using a mean or median strategy. 


In [7]:
df.drop(inplace=True,columns=['Cabin','SibSp','Parch'])

In [8]:
df.dropna(inplace=True,subset=['Embarked'])

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 1 to 891
Data columns (total 8 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Sex         889 non-null object
Age         712 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 62.5+ KB


In [10]:
df.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,889.0,889.0,712.0,889.0,889.0,889.0
mean,0.382452,2.311586,29.642093,0.524184,0.382452,32.096681
std,0.48626,0.8347,14.492933,1.103705,0.806761,49.697504
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.0,0.0,0.0,7.8958
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [11]:
df.fillna(df.mean(),inplace=True)

df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

All null values have been removed as well as redundant columns. 

Next up lets take a look at some visual relations between the data. 

Converting my data using Onehotencoder and labelencoder

In [12]:
from sklearn.preprocessing import OneHotEncoder

In [13]:
df_new = pd.get_dummies(df,columns=['Pclass','Sex','Embarked'])



In [14]:
df_new.head()

Unnamed: 0_level_0,Survived,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,0,22.0,1,0,7.25,0,0,1,0,1,0,0,1
2,1,38.0,1,0,71.2833,1,0,0,1,0,1,0,0
3,1,26.0,0,0,7.925,0,0,1,1,0,0,0,1
4,1,35.0,1,0,53.1,1,0,0,1,0,0,0,1
5,0,35.0,0,0,8.05,0,0,1,0,1,0,0,1


In [15]:
y=df['Survived']
X=df_new.drop('Survived',axis=1)





In [16]:
#Instanstiate a stock classifier
clf = xgb.XGBClassifier()

#fit to data

clf.fit(X_train,y_train)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

In [17]:
y_preds= clf.predict(X_test)

print(accuracy_score(y_test,y_preds))
print(confusion_matrix(y_test,y_preds))
print(classification_report(y_test,y_preds))

0.8258426966292135
[[98  7]
 [24 49]]
              precision    recall  f1-score   support

           0       0.80      0.93      0.86       105
           1       0.88      0.67      0.76        73

   micro avg       0.83      0.83      0.83       178
   macro avg       0.84      0.80      0.81       178
weighted avg       0.83      0.83      0.82       178



The initial classifier has worked well. Now time to see how tuning will work

In [None]:
from sklearn.model_selection import GridSearchCV

In [21]:
clf_2 = xgb.XGBClassifier()

param_grid = {
    'learning_rate':[0.0001,0.001,0.005,0.01],
    'colsample_bytree':(0.3,1,0.1),
    'n_estimators':[20,50,100,200,300,500,1000],
    'max_depth':range(2,20),
    'base_score':[0.2,0.3,0.4,0.5]
    
}


cv_random = GridSearchCV(clf_2,cv=2,param_grid=param_grid,scoring='accuracy',verbose=1,n_jobs=-1)

cv_random.fit(X_train,y_train)

print(cv_random.best_params_)
print(cv_random.best_score_)

Fitting 2 folds for each of 6048 candidates, totalling 12096 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 126 tasks      | elapsed:   16.0s
[Parallel(n_jobs=-1)]: Done 276 tasks      | elapsed:   40.2s
[Parallel(n_jobs=-1)]: Done 526 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 876 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 1326 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 1876 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 2526 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 3276 tasks      | elapsed: 15.9min
[Parallel(n_jobs=-1)]: Done 4126 tasks      | elapsed: 20.6min
[Parallel(n_jobs=-1)]: Done 5076 tasks      | elapsed: 31.3min
[Parallel(n_jobs=-1)]: Done 6126 tasks      | elapsed: 35.8min
[Parallel(n_jobs=-1)]: Done 7276 tasks      | elapsed: 42.7min
[Parallel(n_jobs=-1)]: Done 8526 tasks      | elapsed: 49.0min
[Parallel(n_jobs=-1)]: Done 9876 tasks      | elapsed: 53.5min
[Parallel(n_jobs=-1)]: Done 11326 tasks      |

{'base_score': 0.4, 'colsample_bytree': 1, 'learning_rate': 0.005, 'max_depth': 3, 'n_estimators': 300}
0.819971870604782


{'n_estimators': 1000, 'max_depth': 15, 'learning_rate': 0.001, 'colsample_bytree': 1, 'base_score': 0.4}
0.810126582278481

{'n_estimators': 500, 'max_depth': 20, 'learning_rate': 0.01, 'colsample_bytree': 0.3, 'base_score': 0.3}
0.8143459915611815

{'n_estimators': 300, 'max_depth': 20, 'learning_rate': 0.01, 'colsample_bytree': 0.3, 'base_score': 0.5}
0.8129395218002813


In [22]:
clf_3 = xgb.XGBClassifier(n_estimators=500,max_depth=20,learning_rate=0.05,colsample_bytree=1,base_score=0.4)

clf_3.fit(X_train,y_train)

y_pred=clf_3.predict(X_test)

print(accuracy_score(y_test,y_preds))
print(confusion_matrix(y_test,y_preds))
print(classification_report(y_test,y_preds))



0.8258426966292135
[[98  7]
 [24 49]]
              precision    recall  f1-score   support

           0       0.80      0.93      0.86       105
           1       0.88      0.67      0.76        73

   micro avg       0.83      0.83      0.83       178
   macro avg       0.84      0.80      0.81       178
weighted avg       0.83      0.83      0.82       178



In [23]:
clf_3.fit(X,y)

XGBClassifier(base_score=0.4, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.05,
       max_delta_step=0, max_depth=20, min_child_weight=1, missing=None,
       n_estimators=500, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

In [29]:
test_df = pd.read_csv('titanic_test.csv',index_col='PassengerId')

test_df.head()

Unnamed: 0_level_0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [31]:
test_df.drop(['Name','Ticket','Cabin'],axis=1,inplace=True)

In [32]:
test_df=pd.get_dummies(test_df,columns=['Pclass','Sex','Embarked'])

In [33]:
test_df.fillna(df.mean(),inplace=True)

test_df.isnull().sum()

Age           0
SibSp         0
Parch         0
Fare          0
Pclass_1      0
Pclass_2      0
Pclass_3      0
Sex_female    0
Sex_male      0
Embarked_C    0
Embarked_Q    0
Embarked_S    0
dtype: int64

In [34]:
test_df.head()

Unnamed: 0_level_0,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
892,34.5,0,0,7.8292,0,0,1,0,1,0,1,0
893,47.0,1,0,7.0,0,0,1,1,0,0,0,1
894,62.0,0,0,9.6875,0,1,0,0,1,0,1,0
895,27.0,0,0,8.6625,0,0,1,0,1,0,0,1
896,22.0,1,1,12.2875,0,0,1,1,0,0,0,1


In [None]:
test_y_preds = clf_3.predict(test_df)



In [62]:
submission=pd.DataFrame(test_y_preds.reshape(418,1))

submission.columns=['Survived']
submission.index +=892

submission.index.name= 'PassengerId'

submission.to_csv('titanic_submission.csv')

In [63]:
pd.read_csv('titanic_submission.csv')

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,1
4,896,1
5,897,0
6,898,0
7,899,0
8,900,1
9,901,0
