## Who Survived the Titanic?

In [13]:
import pandas as pd 
tdf = pd.read_csv('titanic.csv', sep = ',', header=0)

### Fix sklearn's CART Implementation
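Decision trees are deterministic when implemented correctly. scikit-learn's implementation, however, randomly permutes the features at each split, so when several candidate splits are equally good the one it picks can differ from run to run. When no `random_state` is passed, scikit-learn draws from NumPy's global random number generator, so seeding that generator makes the results reproducible.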

In [14]:
import numpy as np
np.random.seed(101)
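
An alternative to seeding the global generator is to pass `random_state` directly to the classifier. The sketch below is illustrative only: it uses synthetic data (not the Titanic set) with two identical feature columns, so every split is a tie, and checks that two trees built with the same `random_state` come out identical. The names `X_demo`, `y_demo`, `tree_a`, and `tree_b` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): both columns are identical, so every
# candidate split ties between feature 0 and feature 1.
rng = np.random.RandomState(0)
col = rng.rand(200)
X_demo = np.column_stack([col, col])
y_demo = (col > 0.5).astype(int)

# Same random_state means the same tie-breaking, hence the same tree.
tree_a = DecisionTreeClassifier(random_state=101).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=101).fit(X_demo, y_demo)

# tree_.feature holds the feature index used at each node.
same_structure = np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
print("identical trees:", same_structure)
```

Passing `random_state` keeps the seed local to the model, which avoids surprises if other code also draws from NumPy's global generator.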

### Set up the data for the decision tree analysis

In [15]:
# Only keep the features we want to use and place the "target" at the end
tdf = tdf[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']].copy()

# Encode Sex as a number: DecisionTreeClassifier() accepts strings in the target, but not in the input features
tdf['Sex'] = tdf['Sex'].map({'male': 0, 'female': 1})

# Drop rows with missing fields
tdf = tdf.dropna()
# print(tdf.info())
columns = list(tdf)
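
To see what the encoding and the `dropna()` call do, here is a minimal sketch on a hypothetical three-row frame (`toy` is a stand-in, not the real `titanic.csv`). It also counts how many rows are discarded, which is worth checking on the real data before trusting the model.

```python
import pandas as pd

# Hypothetical three-row frame standing in for titanic.csv
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 35.0],   # the second passenger has no recorded age
    'Survived': [0, 1, 1],
})

# Same encoding as above: male -> 0, female -> 1
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# Drop incomplete rows and record how many were lost
rows_before = len(toy)
toy = toy.dropna()
rows_dropped = rows_before - len(toy)

print(toy)
print("rows dropped:", rows_dropped)
```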

### Separate the independent variables (AKA Features) from the dependent labels (AKA Target)

In [16]:
X = tdf.iloc[:, 0:6]   # features: the first six columns, as a DataFrame
Y = tdf.iloc[:, 6]     # target: the Survived column, as a Series

### Split the Training and Testing Data

In [17]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.1, random_state=1)
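
The effect of `test_size` and `random_state` can be checked on a small example. The sketch below uses a hypothetical 20-row array (`X_toy`, `y_toy`) rather than the Titanic data: with `test_size=0.1`, two rows go to the test set and 18 to the training set, and repeating the call with the same `random_state` yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=1)

# Repeat the call with the same seed to confirm the split is reproducible
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=1)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("same test rows both times:", np.array_equal(X_te, X_te2))
```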

### Generating and evaluating the model

In [18]:
from sklearn.tree import DecisionTreeClassifier 

# Use entropy as the split criterion, with no minimum on samples per leaf (default min_samples_leaf=1)
model_ent = DecisionTreeClassifier(criterion='entropy').fit(X_train, y_train) 
y_ent_pred = model_ent.predict(X_test)

# Use Gini impurity (the default criterion) and require at least 4 samples per leaf
model_gini = DecisionTreeClassifier(min_samples_leaf=4).fit(X_train, y_train)
y_gini_pred = model_gini.predict(X_test)

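# NOTE: When comparing models, vary only one thing at a time. These two models differ in
# both the criterion and min_samples_leaf, so the accuracy gap cannot be attributed to either one alone.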
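
One way to follow the note above is to hold `min_samples_leaf` fixed and change only the criterion. The sketch below does this on synthetic data from `make_classification`, an assumption made so the example runs without `titanic.csv`; the variable names are illustrative and the scores will not match the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 500 rows, 6 features
X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=1)

# Only the criterion changes; min_samples_leaf and random_state stay fixed
scores = {}
for criterion in ('gini', 'entropy'):
    model = DecisionTreeClassifier(
        criterion=criterion, min_samples_leaf=4, random_state=101)
    model.fit(X_tr, y_tr)
    scores[criterion] = accuracy_score(y_te, model.predict(X_te))

for criterion, score in scores.items():
    print("{} accuracy: {:.1f}%".format(criterion, score * 100))
```

The same loop can be pointed at `min_samples_leaf` instead, keeping the criterion fixed, to isolate the effect of the leaf-size limit.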

In [19]:
# Generate an accuracy score
from sklearn.metrics import accuracy_score

print("Entropy accuracy is : {}%".format(accuracy_score(y_test, y_ent_pred)*100))
print("Gini accuracy is : {}%".format(accuracy_score(y_test, y_gini_pred)*100))

Entropy accuracy is : 77.7777777778%
Gini accuracy is : 83.3333333333%


## Visualizing the results

In [20]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(
    confusion_matrix(y_test, y_gini_pred),
    columns=['Predicted Died', 'Predicted Survived'],
    index=['True Died', 'True Survived']
)

| | Predicted Died | Predicted Survived |
|---|---|---|
| True Died | 43 | 5 |
| True Survived | 7 | 17 |
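
The confusion matrix above carries more than the accuracy figure. As a small worked example, the sketch below copies its four counts by hand and derives accuracy, plus precision and recall for the "Survived" class. The variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix above
# (rows: true Died / true Survived, columns: predicted Died / predicted Survived)
cm = np.array([[43, 5],
               [7, 17]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all passengers classified correctly
precision = tp / (tp + fp)        # of those predicted to survive, share who did
recall = tp / (tp + fn)           # of those who survived, share predicted to

print("accuracy:  {:.4f}".format(accuracy))
print("precision: {:.4f}".format(precision))
print("recall:    {:.4f}".format(recall))
```

The accuracy works out to 60/72, which matches the 83.33% reported for the Gini model. Recall for survivors is lower, at 17/24, so the model misses survivors more often than the headline accuracy suggests.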
