# Predicting which passengers survived the Titanic

## Background
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we will complete the analysis of what sorts of people were likely to survive. In particular, we will apply the tools of machine learning to predict which passengers survived the tragedy.

## Get the data ready

In [2]:
### import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
%matplotlib inline
import matplotlib.patches as mpatches
from sklearn.model_selection import train_test_split


In [77]:
### read data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
titanic = pd.concat([train.drop('Survived', axis=1), test])

# 'PassengerId', 'Name' and 'Ticket' are treated as irrelevant to this study and are dropped.
titanic = titanic.drop(['PassengerId', 'Ticket', 'Name'], axis=1)

In [78]:
titanic['Without_Age']= titanic['Age'].isnull()
titanic['Without_Cabin']= titanic['Cabin'].isnull()

Passengers without age or cabin information appear to have a lower survival rate. This may be because it was harder to gather the age information of people who did not survive the trip.
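One way to check a claim like this is to group the training rows by whether a column is missing and compare the mean of 'Survived' in each group. The following is a minimal sketch on a toy frame: the values are invented for illustration and `survival_rate_by_missing` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; these values are invented for illustration only.
toy = pd.DataFrame({
    "Survived": [1, 0, 0, 1, 1, 1, 0, 0],
    "Age":      [22.0, np.nan, 35.0, 4.0, np.nan, 58.0, np.nan, 30.0],
    "Cabin":    ["C85", None, None, "E46", None, "B28", None, None],
})

def survival_rate_by_missing(df, column):
    """Mean of 'Survived', split by whether `column` is missing (hypothetical helper)."""
    status = np.where(df[column].isnull(), "missing", "present")
    return df.groupby(status)["Survived"].mean()

age_rates = survival_rate_by_missing(toy, "Age")
cabin_rates = survival_rate_by_missing(toy, "Cabin")
print(age_rates)
print(cabin_rates)
```

Applied to the real `train` frame, the same call would give the survival rates behind the observation above.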

In this study, NaN values in 'Age' are replaced with the mean age of the passenger's 'Sex' and 'Pclass' group. The 'Cabin' column is replaced by the 'Without_Cabin' flag created above, which records whether cabin information is missing for the passenger.

In [79]:
# Replace NaN ages with the mean age of the same Sex/Pclass group
titanic['Age'] = titanic.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.mean()))
test['Age'] = test.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.mean()))
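To see what the group-wise fill does, here is a small sketch on a toy frame (the rows are invented for illustration). Each missing age is replaced by the mean of the known ages in the same Sex/Pclass group, and known ages are left untouched.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: two groups, one missing age in each.
toy = pd.DataFrame({
    "Sex":    ["female", "female", "female", "male", "male", "male"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [30.0, 40.0, np.nan, 20.0, np.nan, 26.0],
})

# Same pattern as the cell above: fill NaN with the mean of the row's group.
filled = toy.groupby(["Sex", "Pclass"])["Age"].transform(lambda x: x.fillna(x.mean()))
print(filled.tolist())  # [30.0, 40.0, 35.0, 20.0, 23.0, 26.0]
```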

## Feature Engineering

In [80]:
survived = train['Survived']

# Encode string columns as integer codes. Categories are sorted alphabetically,
# so Sex becomes female=0, male=1 and Embarked becomes C=0, Q=1, S=2.
titanic["Sex"] = titanic["Sex"].astype("category").cat.codes
# Note: .cat.codes assigns -1 to missing Embarked values.
titanic["Embarked"] = titanic["Embarked"].astype("category").cat.codes

# Boolean missing-value flags become 0/1.
titanic["Without_Age"] = titanic["Without_Age"].astype(int)
titanic["Without_Cabin"] = titanic["Without_Cabin"].astype(int)

titanic = titanic.drop(['Cabin'], axis=1)
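As a quick illustration of how pandas turns a string column into integers, the sketch below reads the integer codes of a categorical column through `.cat.codes`. The toy series is invented. The point to note is that a missing value gets the code -1 rather than one of the category numbers.

```python
import numpy as np
import pandas as pd

# Toy series, invented for illustration; one value is missing.
embarked = pd.Series(["S", "C", np.nan, "Q", "S"])
as_cat = embarked.astype("category")

categories = list(as_cat.cat.categories)
codes = as_cat.cat.codes.tolist()
print(categories)  # ['C', 'Q', 'S']
print(codes)       # [2, 0, -1, 1, 2]
```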

In [81]:
# .copy() makes these independent frames, so adding a column does not
# trigger pandas' SettingWithCopyWarning.
test = titanic.iloc[len(train):].copy()
train = titanic.iloc[:len(train)].copy()
train['Survived'] = survived

training, testing = train_test_split(train, test_size=0.2, random_state=0)
print("Total sample size = %i; training sample size = %i, testing sample size = %i"
      % (train.shape[0], training.shape[0], testing.shape[0]))

Total sample size = 891; training sample size = 712, testing sample size = 179


## Classification Models

In [82]:
training.head()
#training.describe()

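| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Without_Age | Without_Cabin | Survived |
|---|---|---|---|---|---|---|---|---|---|---|
| 140 | 3 | 0 | 22.185329 | 0 | 2 | 15.2458 | 0 | 1 | 1 | 0 |
| 439 | 2 | 1 | 31.0 | 0 | 0 | 10.5 | 2 | 0 | 1 | 0 |
| 817 | 2 | 1 | 31.0 | 1 | 1 | 37.0042 | 0 | 0 | 1 | 0 |
| 378 | 3 | 1 | 20.0 | 0 | 0 | 4.0125 | 0 | 0 | 1 | 0 |
| 491 | 3 | 1 | 21.0 | 0 | 0 | 7.25 | 2 | 0 | 1 | 0 |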


### SVM

In [96]:
from sklearn import svm
X = training.drop(['Survived'],axis=1)
y = training['Survived']
clf = svm.SVC()
clf.fit(X, y)

# Held-out features and labels
test_x = testing.drop(['Survived'], axis=1)
test_y = testing['Survived']

print("Accuracy: {}".format(clf.score(test_x, test_y)))

Accuracy: 0.6871508379888268
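The SVC above is fit on unscaled features, where Age and Fare span much larger ranges than the 0/1 flags. Kernel SVMs are generally sensitive to feature scale, so standardizing the inputs may be worth trying. The following is a minimal sketch on synthetic data (not the Titanic set; `X_demo`, `y_demo` and the two made-up features are assumptions for illustration) that compares a plain `SVC` with a `StandardScaler` + `SVC` pipeline. It makes no claim about the accuracy this would give on the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
n = 400
small = rng.uniform(0, 1, n)    # informative feature on a small range
large = rng.uniform(0, 500, n)  # uninformative feature on a large range
X_demo = np.column_stack([small, large])
y_demo = (small > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Plain SVC on raw features versus SVC on standardized features.
raw_svc = SVC().fit(X_tr, y_tr)
scaled_svc = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

raw_acc = raw_svc.score(X_te, y_te)
scaled_acc = scaled_svc.score(X_te, y_te)
print("unscaled: {:.3f}, scaled: {:.3f}".format(raw_acc, scaled_acc))
```

Putting the scaler inside the pipeline means it is fit on the training split only, so no statistics from the held-out rows leak into training.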


### Decision Tree

In [100]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier()
tree.fit(X, y)
pred_y = tree.predict(test_x)
# accuracy_score expects (y_true, y_pred)
print("Accuracy: {}".format(accuracy_score(test_y, pred_y)))

Accuracy: 0.8044692737430168


### XGBoost

In [108]:
# Let's try the XGBoost algorithm to see if we can get better results
import xgboost

xgb = xgboost.XGBClassifier(n_estimators=100, learning_rate=0.08, gamma=0, subsample=0.75,
                            colsample_bytree=1, max_depth=7)

xgb.fit(X, y)

pred_y = xgb.predict(test_x)
print("Accuracy: {}".format(accuracy_score(test_y, pred_y)))

Accuracy: 0.8603351955307262
