# An XGBoost solution to the Titanic Survivor Dataset

In this exercise I will be working with the Titanic dataset from Kaggle. This dataset contains information about the passengers who were on board when the Titanic sank.

## The Objective

My goal for this project is to make predictions on whether a person survived or not. This will be a supervised binary classification problem. 

I will be using an XGBoost Classifier as my model of choice. 

## Version 2

This is the second iteration of this project. I've made a few adjustments and quality-of-life changes since my first attempt. My first model overfit the training data and performed badly in the Kaggle contest; I think I placed around 10,000th on the leaderboard. This version was much more effective and achieved a result in the top 2,000.

In [1]:
#Modules

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import GridSearchCV,RandomizedSearchCV,train_test_split
import xgboost as xgb 
from sklearn.metrics import classification_report, accuracy_score,confusion_matrix


In [2]:
#importing the data 
df=pd.read_csv('titanic_train.csv',index_col='PassengerId') 

FileNotFoundError: [Errno 2] File b'titanic_train.csv' does not exist: b'titanic_train.csv'

# EDA

I will begin with some initial exploratory data analysis. I will be checking for missing data and for any cleaning that I need to perform. I will also be looking for any obvious relationships in the data.

In [None]:
df.head()

From the initial inspection of the data there are some issues that need fixing. I will be dropping Name and Ticket from my dataset completely as they are free-text columns. I could have used some NLP techniques on the Name column, but I didn't think the likely performance gain was worth the extra work.

I also have categorical columns that need to be converted. I will be using pd.get_dummies to do this.
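
As a minimal sketch of what this encoding does, here is `pd.get_dummies` applied to a small made-up frame. The values are invented for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy rows with invented values; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "male"],
    "Fare": [71.28, 7.25, 13.00],
})

# Each listed column is replaced by one indicator column per category.
# Columns that are not listed (Fare here) are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"])

print(encoded.columns.tolist())
```

Listing `Pclass` explicitly matters because it is stored as a number; without the `columns` argument pandas would only encode the text columns and leave the class as an ordinal integer.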

I can also see some NaN values in the Cabin column, so I will need to check the rest of the dataset for NaN values too.

In [None]:
# I will first drop the name and ticket columns from my dataset

df.drop(inplace=True,columns=['Name','Ticket'])

In [None]:
df.info()

Looking at the info summary above, I think it is worth dropping the Cabin column completely as most of its values are missing. The information would also be hard to work with as it does not fall into any clear set of categories.

There are only a few missing values in Embarked, so I will drop those rows.

For Age, I will fill the missing values with the mean of the column.
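
The three strategies above can be sketched on a small made-up frame before applying them to the real data. The values below are invented; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Invented values; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Count the missing values per column before deciding what to do.
missing_before = toy.isnull().sum()

cleaned = (
    toy.drop(columns=["Cabin"])       # mostly empty: drop the whole column
       .dropna(subset=["Embarked"])   # only a few gaps: drop those rows
)

# Fill the remaining Age gaps with the mean of the rows that are left.
cleaned = cleaned.fillna({"Age": cleaned["Age"].mean()})

print(missing_before)
print(cleaned)
```

Passing a dict to `fillna` restricts the fill to the named column, which avoids accidentally imputing anything else.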


In [None]:
#Dropping cabin columns
df.drop(inplace=True,columns=['Cabin'])

In [None]:
#Dropping rows with a missing value in the Embarked column
df.dropna(inplace=True,subset=['Embarked'])

In [None]:
#Now I only have the Age column to fix

df.info()

In [None]:
#Checking the data to make sure there aren't any strange results. Also checking what the mean value for Age is and making sure
# it is plausible. 29 seems like a realistic mean age to me.
df.describe()

In [None]:
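#Filling the missing Age values with the column mean (df.mean() on the whole frame fails in recent pandas because of the text columns)
df['Age'] = df['Age'].fillna(df['Age'].mean())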

# a final check for NA values
df.isnull().sum()

## Visual relationships

A quick look at the correlations shows some clear relationships in the data.

None of them will surprise anyone who has watched the Titanic movie. Passenger class, sex and fare all appear strongly related to survival. Age, siblings and parents do not factor in quite so much.
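
Since Sex is still a text column at this stage, it will not show up in a numeric correlation matrix by itself. A minimal sketch of one way to include it is to map it to 0/1 first. The rows below are invented, and `is_female` is a hypothetical helper column, not part of the dataset.

```python
import pandas as pd

# Invented rows for illustration; real correlations come from the full dataset.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 1],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Fare": [80.0, 7.9, 26.0, 8.1, 53.1, 13.0],
})

# Map the text column to 0/1 so it can take part in the correlation matrix.
toy["is_female"] = (toy["Sex"] == "female").astype(int)

# Drop the original text column, then read off each feature's correlation with the target.
corr_with_target = toy.drop(columns=["Sex"]).corr()["Survived"]

print(corr_with_target)
```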

In [None]:
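# Only numeric columns can be correlated; Sex and Embarked are still text at this point
sns.heatmap(df.select_dtypes('number').corr(),annot=True)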

All null values have been removed as well as redundant columns. 

Next I will convert the categorical columns of my dataframe using pd.get_dummies.

In [None]:
df_new = pd.get_dummies(df,columns=['Pclass','Sex','Embarked'])



In [None]:
df_new.head()

## Training the classifier 

In previous tests I split my data into a train and a test set. I've decided that it is more effective to keep a single training set, as there is only a limited amount of training data to fit a model on. I will also be using GridSearchCV, whose cross-validated scoring should help guard against overfitting during parameter selection.
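
To illustrate why a cross-validated score is a more honest guide than accuracy on the data the model was fitted on, here is a rough sketch on synthetic data. Two assumptions to note: the data comes from `make_classification` rather than the Titanic files, and scikit-learn's `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` so the example does not depend on xgboost being installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, roughly the size of the Titanic training set.
X_demo, y_demo = make_classification(n_samples=891, n_features=12, random_state=0)

# Stand-in model; an XGBClassifier could be dropped in here instead.
model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)

# Each fold is scored on rows the model did not see while fitting.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

# Accuracy on the rows used for fitting, for comparison.
train_accuracy = model.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"mean CV accuracy:  {cv_scores.mean():.3f}")
print(f"training accuracy: {train_accuracy:.3f}")
```

The cross-validated figure is the one to compare against a leaderboard score; the training figure is typically more flattering.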


In [None]:
#Creating my target and feature variables
y=df['Survived']
X=df_new.drop('Survived',axis=1)





In [None]:
from sklearn.model_selection import GridSearchCV

In [16]:
#I used a grid search as my parameter grid was not too big. I experimented with a lot more params in my first version and found
# this range to work best.


clf_2 = xgb.XGBClassifier()

param_grid = {
    'learning_rate':[0.0001,0.001,0.005],
    'colsample_bytree':[0.8,0.9,1],
    'n_estimators':[50,100,200,500,1000],
    'max_depth':range(2,5) 
}


cv_random = GridSearchCV(clf_2,cv=2,param_grid=param_grid,scoring='accuracy',verbose=1,n_jobs=-1)

cv_random.fit(X,y)

print(cv_random.best_params_)
print(cv_random.best_score_)

Fitting 2 folds for each of 135 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   14.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:  1.7min finished


{'colsample_bytree': 0.8, 'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 200}
0.8065241844769404


In [None]:
#Creating a new classifier with the best parameters from the grid search and fitting it to the whole data set

clf_3 = xgb.XGBClassifier(n_estimators=200,max_depth=3,learning_rate=0.0001,colsample_bytree=0.8)

clf_3.fit(X,y)

#Creating my predictions and then printing off metric scores.
#These are scores on the training data itself, so they will look better than performance on unseen data.
y_preds=clf_3.predict(X)
print(accuracy_score(y,y_preds))
print(confusion_matrix(y,y_preds))
print(classification_report(y,y_preds))



In [None]:
test_df = pd.read_csv('titanic_test.csv',index_col='PassengerId')

test_df.head()

In [None]:
test_df.drop(['Name','Ticket','Cabin'],axis=1,inplace=True)

In [None]:
test_df=pd.get_dummies(test_df,columns=['Pclass','Sex','Embarked'])

In [None]:
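# Filling gaps in the test set with the training set means (numeric columns only, as df still has text columns)
test_df.fillna(df.mean(numeric_only=True),inplace=True)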

test_df.isnull().sum()

In [None]:
test_df.head()

In [None]:
#Making my predictions for the test set
test_y_preds = clf_3.predict(test_df)



In [None]:
# Creating my submission file as a csv for the Kaggle contest.
# Reusing the test set's PassengerId index avoids hard-coding the row count and the starting id.
submission=pd.DataFrame({'Survived':test_y_preds},index=test_df.index)

submission.to_csv('titanic_submission.csv')

In [None]:
# checking to make sure it looks ok.
pd.read_csv('titanic_submission.csv')