## Can we create a model to predict survivability?

1. Prepare the data. Select the features ('Age', 'Pclass', 'Embarked', 'Sex') and the target variable ('Survived').
2. Split the data into training and testing sets, so the model is trained on one part and its accuracy is checked on rows it has not seen.
3. Pick a model to experiment with: Logistic Regression, Decision Trees, or Random Forests.
4. Train the model.
5. Test the model.
6. Explore which model or training adjustments would make it more accurate.
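
The steps above can be sketched end to end for all three model families named in step 3. This is a minimal illustration on synthetic stand-in data, not the Titanic data; the names `X_demo`, `y_demo`, `candidates`, and `scores` are hypothetical and chosen only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, NOT the Titanic dataset):
# four numeric features and a binary target driven mostly by the first two.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
noise = rng.normal(scale=0.5, size=400)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Step 2: hold out a third of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Step 3: the three candidate model families.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_leaf_nodes=10, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Steps 4 and 5: train each model, then score it on the held-out rows.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Because every scikit-learn classifier exposes the same `fit` and `predict` methods, swapping models only changes the entry in the dictionary.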

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

raw_data = pd.read_csv('./titanic-data-ext/full.csv')
raw_data.shape

(1309, 21)

In [2]:
raw_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'WikiId', 'Name_wiki',
       'Age_wiki', 'Hometown', 'Boarded', 'Destination', 'Lifeboat', 'Body',
       'Class'],
      dtype='object')

In [3]:
# clean data (work on a copy so raw_data stays untouched)
df = raw_data.copy()

# rows with unknown survival are assumed to be deaths (0.0);
# note this labels every unknown outcome as "did not survive"
df['Survived'] = df['Survived'].fillna(0.0)
# remove rows where Age, Pclass, or Embarked is unknown
df = df.dropna(subset=['Age', 'Pclass', 'Embarked'])
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked,WikiId,Name_wiki,Age_wiki,Hometown,Boarded,Destination,Lifeboat,Body,Class
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,S,691.0,"Braund, Mr. Owen Harris",22.0,"Bridgerule, Devon, England",Southampton,"Qu'Appelle Valley, Saskatchewan, Canada",,,3.0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,C,90.0,"Cumings, Mrs. Florence Briggs (née Thayer)",35.0,"New York, New York, US",Cherbourg,"New York, New York, US",4,,1.0
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,...,S,865.0,"Heikkinen, Miss Laina",26.0,"Jyväskylä, Finland",Southampton,New York City,14?,,3.0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,...,S,127.0,"Futrelle, Mrs. Lily May (née Peel)",35.0,"Scituate, Massachusetts, US",Southampton,"Scituate, Massachusetts, US",D,,1.0
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,...,S,627.0,"Allen, Mr. William Henry",35.0,"Birmingham, West Midlands, England",Southampton,New York City,,,3.0
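
Filling unknown outcomes with 0.0 is a modeling assumption, and it lowers the apparent survival rate. One alternative, sketched here on a tiny hypothetical frame (`toy`, `filled`, and `labeled` are illustrative names, not part of the notebook), is to keep only the rows whose outcome is actually recorded.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame: two of the five passengers have no recorded outcome.
toy = pd.DataFrame({
    "Survived": [1.0, 0.0, np.nan, 1.0, np.nan],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Approach used in the cell above: treat unknown outcomes as deaths.
filled = toy.assign(Survived=toy["Survived"].fillna(0.0))

# Alternative: keep only rows whose outcome is actually known.
labeled = toy.dropna(subset=["Survived"])

print(filled["Survived"].mean())   # survival rate with unknowns counted as 0
print(labeled["Survived"].mean())  # survival rate among known outcomes only
```

In this toy frame the two approaches give survival rates of 0.4 and about 0.67, which shows how much the choice can shift the target distribution.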


In [4]:
# convert Sex into a number so it will work with different models 
# Define mapping dictionary
sex_mapping = {'male': 0, 'female': 1}

# Replace values using map()
df['SexNum'] = df['Sex'].map(sex_mapping)

# Embarked also needs a numeric code (C = Cherbourg, S = Southampton, Q = Queenstown)
port_mapping = {'C': 0, 'S': 1, 'Q': 2}

# Replace values using map()
df['EmbarkedNum'] = df['Embarked'].map(port_mapping)
df.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Name_wiki,Age_wiki,Hometown,Boarded,Destination,Lifeboat,Body,Class,SexNum,EmbarkedNum
1300,1301,0.0,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.775,...,"Peacock, Miss Treasteall",4.0,"Southampton, Hampshire, England",Southampton,"Elizabeth, New Jersey, US",,,3.0,1,1
1302,1303,0.0,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0,...,"Minahan, Mrs. Lillian E. (née Thorpe)",37.0,"Fond du Lac, Wisconsin, US",Southampton,"Fond du Lac, Wisconsin, US",14.0,,1.0,1,2
1303,1304,0.0,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0,0,347086,7.775,...,"Henriksson, Miss Jenny Lovisa",28.0,"Stockholm, Sweden",Southampton,"Iron Mountain, Michigan, US",,3MB,3.0,1,1
1305,1306,0.0,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,...,"and maid, Doña Fermina Oliva y Ocana",39.0,"Madrid, Spain",Cherbourg,"New York, New York, US",8.0,,1.0,1,0
1306,1307,0.0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,...,"Sæther, Mr. Simon Sivertsen",43.0,"Skaun, Sør-Trøndelag, Norway",Southampton,US,,32MB,3.0,0,1
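
The integer codes for 'Embarked' imply an ordering (C < S < Q) that the ports do not really have. A decision tree can usually work around that, but models such as logistic regression may read it as a numeric scale. A possible alternative is one-hot encoding with `pd.get_dummies`, sketched here on a small hypothetical frame (`toy` and `encoded` are illustrative names).

```python
import pandas as pd

# Small hypothetical frame with the same port codes as the Titanic data.
toy = pd.DataFrame({
    "Embarked": ["C", "S", "Q", "S"],
    "Age": [38.0, 22.0, 30.0, 35.0],
})

# One 0/1 indicator column per port instead of a single ordered integer.
encoded = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The resulting indicator columns could then be listed in `features` in place of 'EmbarkedNum'.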


In [5]:
features = ['SexNum','Pclass','EmbarkedNum','Age']
target = ['Survived']

X = df[features]
y = df[target]

X.iloc[2]

SexNum          1.0
Pclass          3.0
EmbarkedNum     1.0
Age            26.0
Name: 2, dtype: float64

In [6]:
y

Unnamed: 0,Survived
0,0.0
1,1.0
2,1.0
3,1.0
4,0.0
...,...
1300,0.0
1302,0.0
1303,0.0
1305,0.0


In [7]:
# Split the dataset into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

# Summarize the training target. A notebook cell displays only its last
# expression, so y_train.describe() is the only output shown.
y_train.describe()

Unnamed: 0,Survived
count,699.0
mean,0.264664
std,0.44147
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
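
The classes are imbalanced (a mean of about 0.26 in the training target), so a purely random split can leave the training and test sets with somewhat different survival rates. `train_test_split` accepts a `stratify` argument that keeps the class proportions similar in both parts. The sketch below uses hypothetical labels (`y_demo`) rather than the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30 positives out of 100 rows.
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo asks for the same class balance in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # both should be close to 0.30
```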


In [8]:
X_test.head()

Unnamed: 0,SexNum,Pclass,EmbarkedNum,Age
734,0,2,1,23.0
1160,0,3,1,17.0
1065,0,3,1,40.0
610,1,3,1,39.0
1191,0,3,1,32.0


In [9]:
type(X_test)

pandas.core.frame.DataFrame

In [10]:
dtc = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
# fit the tree on the training data
dtc.fit(X_train, y_train)

# predict survival for the held-out test data
dtc_predictions = dtc.predict(X_test)

accuracy = accuracy_score(y_true = y_test, y_pred = dtc_predictions)
print(accuracy)

0.7652173913043478
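
An accuracy figure is easier to judge next to a baseline. The training labels summarized earlier have a mean of about 0.265, so always predicting "did not survive" would already be right for roughly 73.5% of the training rows. The sketch below shows one way to measure such a baseline with scikit-learn's `DummyClassifier`; the arrays are hypothetical stand-ins with a similar class balance, not the notebook's split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with roughly the class balance of the training target
# (about 26% positives).
y_train_demo = np.array([1] * 26 + [0] * 74)
y_test_demo = np.array([1] * 13 + [0] * 37)
X_train_demo = np.zeros((100, 1))  # features are ignored by the dummy model
X_test_demo = np.zeros((50, 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(baseline_accuracy)
```

Running the same two lines against `X_train`, `y_train`, `X_test`, and `y_test` would show how much of the decision tree's accuracy comes from the features rather than from the class balance.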


In [11]:
# Can we increase accuracy by adjusting the maximum number of leaf nodes?
def testModel(max_leaf_nodes, X_train, y_train, X_test, y_test):
    """Fit a decision tree on the training data and return its test accuracy."""
    dtc = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    dtc.fit(X_train, y_train)
    dtc_predictions = dtc.predict(X_test)
    return accuracy_score(y_true=y_test, y_pred=dtc_predictions)

for i in [5, 10, 15, 20]:
    result = testModel(i, X_train, y_train, X_test, y_test)
    print(result)

0.7681159420289855
0.7652173913043478
0.7768115942028986
0.7565217391304347
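
The four scores differ by only about two percentage points, and all of them come from a single test split, so picking the best `max_leaf_nodes` this way risks tuning to that particular split. A common alternative is cross-validation on the training data, for example with `GridSearchCV`. The sketch below runs on synthetic data (`X_demo` and `y_demo` are hypothetical) and is one possible setup, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): the target depends on an
# interaction between the first two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

# Try the same candidate values, scoring each with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [5, 10, 15, 20]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With this approach the test set is used only once, at the end, to report the accuracy of the chosen setting.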


In [12]:
# What if we add the family features (Parch, SibSp)?
features = ['SexNum','Pclass','EmbarkedNum','Age','Parch','SibSp']
X = df[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

for i in [5, 10, 15, 20]:
    result = testModel(i, X_train, y_train, X_test, y_test)
    print(result)

0.7913043478260869
0.7913043478260869
0.7739130434782608
0.7565217391304347
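
To see which features a fitted tree actually relies on, scikit-learn exposes a `feature_importances_` attribute on the trained classifier. The sketch below uses a hypothetical frame with one informative column and two noise columns (`demo`, `target`, and `tree` are illustrative names); applying the same idea to the notebook's model would require keeping a reference to the fitted tree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: only the "signal" column determines the target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
target = (demo["signal"] > 0).astype(int)

tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
tree.fit(demo, target)

# Importances sum to 1; larger values mean the feature drove more splits.
importances = pd.Series(tree.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```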
