# Exploring Titanic Survival Rates with the scikit-learn Library

For an overview of the decision tree classifier and how to visualize it, check out

https://www.datacamp.com/community/tutorials/decision-tree-classification-python

## Getting Started
In this lab, you will see how decision trees work by implementing a decision tree in sklearn.

We'll start by loading the dataset and displaying some of its rows.

In [4]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

# Set a random seed (this seeds only Python's built-in random module, not NumPy or scikit-learn)
import random
random.seed(42)

# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Recall that these are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we're interested in the outcome of survival for each passenger, we will remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets.  
Before doing that, let's take a quick look at the data. The code cell that removes **Survived** and stores it in `outcomes` follows the exploration below.


Let's look at the numeric variables:

In [45]:
full_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Let's explore the non-numeric variables:


In [51]:
full_data[['Ticket', 'Sex', 'Cabin', 'Embarked']].describe()

Unnamed: 0,Ticket,Sex,Cabin,Embarked
count,891,891,204,889
unique,681,2,147,3
top,347082,male,G6,S
freq,7,577,4,644
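
The column list in the cell above is typed by hand. As a minimal sketch, the same kind of summary can be produced by selecting every non-numeric column automatically. The `toy` frame below is a made-up stand-in for `full_data`, so its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are illustrative, not real data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# Keep every column that is not numeric, then summarize it
summary = toy.select_dtypes(exclude="number").describe()
print(summary)
```

On the real data this would also include `Name`, which the hand-written list leaves out.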


Check for missing values:

In [41]:
full_data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
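
Raw counts are easier to judge as shares of the 891 rows: 687 missing `Cabin` values is about 77% of the data, and 177 missing `Age` values is about 20%. The sketch below shows one way to compute such shares with `isna().mean()`. It uses a small made-up frame named `toy` rather than the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in with deliberately missing entries (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# The mean of a boolean mask is the fraction of True values per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```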

In [5]:
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
features_raw.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
outcomes.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

The same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the outcomes of survival) are still paired by index. That means for any passenger `features_raw.loc[i]`, the survival outcome is `outcomes[i]`.

In [12]:
# Column 1 of full_data is 'Survived'; it should match outcomes row for row
print(full_data.iloc[0:10, 1])
print(outcomes[0:10])

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64
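
The pairing between features and outcomes can also be checked directly instead of by eye. This is a minimal sketch on a three-row made-up frame. The names `toy`, `features_toy` and `outcomes_toy` are hypothetical and only mirror `full_data`, `features_raw` and `outcomes`.

```python
import pandas as pd

# Toy stand-in for full_data (illustrative values only)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

# Same separation as in the notebook: target in one object, features in another
outcomes_toy = toy["Survived"]
features_toy = toy.drop("Survived", axis=1)

# Both objects keep the original row index, so row i refers to the same passenger
same_index = features_toy.index.equals(outcomes_toy.index)
print(same_index)
print("Survived" in features_toy.columns)
```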


## Preprocessing the data

Now, let's do some data preprocessing. First, we'll remove the names of the passengers, then one-hot encode the categorical features, and finally replace the remaining missing values with zeros.



In [71]:
# Removing the names
features_no_names = features_raw.drop(['Name'], axis=1)

# One-hot encoding
features = pd.get_dummies(features_no_names)

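# Replace all missing values with zeros. After one-hot encoding, only Age still
# contains NaN (a missing Cabin or Embarked is already an all-zero set of dummies),
# so this sets unknown ages to 0, which is a crude placeholder.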
features = features.fillna(0)
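
To see what the `fillna(0)` step actually touches, here is a small sketch on made-up data (the frame `toy` is hypothetical). With its default settings, `pd.get_dummies` turns a missing category into a row of all-zero dummies, so only numeric columns can still hold `NaN` afterwards.

```python
import numpy as np
import pandas as pd

# Toy frame: one numeric and one categorical column, each with a missing value
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "C"],
})

encoded = pd.get_dummies(toy)

# Columns that still contain NaN after one-hot encoding
nan_columns = encoded.columns[encoded.isna().any()].tolist()
print(nan_columns)

# fillna(0) therefore only changes the numeric column
filled = encoded.fillna(0)
print(filled)
```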

In [72]:
display(features.head(10))

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,Ticket_110413,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
5,6,3,0.0,0,0,8.4583,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
6,7,1,54.0,0,0,51.8625,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
7,8,3,2.0,3,1,21.075,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
8,9,3,27.0,0,2,11.1333,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,10,2,14.0,1,0,30.0708,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
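
Row 5 in the table above now has `Age` 0.0 because the age was missing. One common alternative, shown here only as a sketch and not as the notebook's method, is to fill missing ages with the median and keep a separate flag column. The column name `Age_missing` is hypothetical, and whether this helps a decision tree on this dataset is not established here.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing entry (illustrative values only)
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})

# Hypothetical indicator column: 1 where the age was originally missing
toy["Age_missing"] = toy["Age"].isna().astype(int)

# Fill the gap with the median of the observed ages instead of 0
median_age = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(median_age)
print(toy)
```

In a full pipeline the median would normally be computed on the training split only, so that no information from the test set leaks into preprocessing.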


## Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [74]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)
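
With `test_size=0.2` and 891 rows, the split yields 712 training rows and 179 test rows, which matches the `count` values in the summaries that follow. The sketch below reproduces just the sizes on placeholder arrays. `X` and `y` here are made up and carry no Titanic information.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Titanic file
X = np.arange(891).reshape(-1, 1)
y = np.arange(891) % 2

# Same arguments as in the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))
```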

In [75]:
X_train.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,Ticket_110413,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
count,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,...,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0
mean,448.234551,2.330056,23.698511,0.553371,0.379213,32.586276,0.344101,0.655899,0.004213,0.002809,...,0.002809,0.004213,0.002809,0.001404,0.002809,0.004213,0.001404,0.175562,0.08427,0.73736
std,256.731423,0.824584,17.507272,1.176404,0.791669,51.969529,0.475408,0.475408,0.06482,0.052963,...,0.052963,0.06482,0.052963,0.037477,0.052963,0.06482,0.037477,0.380714,0.277987,0.440378
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,224.75,2.0,5.0,0.0,0.0,7.925,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,453.5,3.0,24.0,0.0,0.0,14.4542,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,673.5,3.0,35.0,1.0,0.0,30.5,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,891.0,3.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [77]:
X_test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,Ticket_110413,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
count,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0,...,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0
mean,437.111732,2.223464,24.200168,0.402235,0.391061,30.684473,0.385475,0.614525,0.0,0.005587,...,0.0,0.0,0.005587,0.0,0.0,0.005587,0.0,0.240223,0.094972,0.664804
std,260.349331,0.877501,17.989179,0.730703,0.863185,39.447725,0.488073,0.488073,0.0,0.074744,...,0.0,0.0,0.074744,0.0,0.0,0.074744,0.0,0.428417,0.293999,0.473383
min,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,217.5,1.0,9.0,0.0,0.0,7.8958,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,423.0,3.0,23.5,0.0,0.0,14.5,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,656.0,3.0,36.0,1.0,0.0,32.4104,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,890.0,3.0,71.0,4.0,5.0,262.375,1.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0


## Define and fit the Decision Tree model

In [78]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# Define the classifier, and fit it to the data
dtmodel = DecisionTreeClassifier()
dtmodel.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## Testing the model
Now let's see how our model does by calculating the accuracy on both the training and the testing sets.

In [79]:
# Making predictions
y_train_pred = dtmodel.predict(X_train)
y_test_pred = dtmodel.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.8156424581005587


# Exercise: Improving the model

OK, the training accuracy is perfect (1.0) while the testing accuracy is noticeably lower (about 0.82). That gap suggests the tree is overfitting the training set.
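
One way to see why an unconstrained tree reaches a perfect training score is to look at how large it grows. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, and assumes a scikit-learn version that provides `get_depth()` and `get_n_leaves()`. The variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (assumption: stands in for the real features)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree versus a depth-limited one
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unconstrained", full_tree), ("max_depth=3", small_tree)]:
    print(name,
          "depth:", tree.get_depth(),
          "leaves:", tree.get_n_leaves(),
          "train:", tree.score(X_tr, y_tr),
          "test:", tree.score(X_te, y_te))
```

The unconstrained tree keeps splitting until every training row is classified correctly, which is what a training accuracy of 1.0 reflects.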

So now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`

You can use your intuition, trial and error, or even better, feel free to use Grid Search!

**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook next.
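
For the Grid Search suggestion, a minimal sketch with `GridSearchCV` over the three parameters listed above could look like this. The data is synthetic and the grid values are arbitrary assumptions, not recommended settings. On the real data, `X_tr` and `y_tr` would be replaced by `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with the Titanic split)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values for each hyperparameter (arbitrary illustrative grid)
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 10],
}

# 5-fold cross-validation on the training set for every combination
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("test accuracy:", search.score(X_te, y_te))
```

Because the search scores each combination by cross-validation on the training data, the test set is only used once at the end.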

## Setting and playing with max_depth

In [101]:
# Define the classifier, and fit it to the data
dtmodel_maxd = DecisionTreeClassifier(max_depth = 6)
dtmodel_maxd.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## Testing the model
Now let's see how the depth-limited model does by calculating the accuracy on both the training and the testing sets.


In [103]:

# Making predictions
y_train_predmaxd = dtmodel_maxd.predict(X_train)
y_test_predmaxd = dtmodel_maxd.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracymaxd = accuracy_score(y_train, y_train_predmaxd)
test_accuracymaxd = accuracy_score(y_test, y_test_predmaxd)
print('The training accuracy with max depth is', train_accuracymaxd)
print('The test accuracy with max depth  is', test_accuracymaxd)

The training accuracy with max depth is 0.8707865168539326
The test accuracy with max depth  is 0.8212290502793296
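
Rather than trying one depth at a time, several values can be compared in a loop. The following is a sketch on synthetic data. The depth values and the `results` dictionary are illustrative assumptions, and the numbers it prints say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with the Titanic split)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Map each max_depth setting to (training accuracy, test accuracy)
results = {}
for depth in [2, 4, 6, 8, None]:  # None means no depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    results[depth] = (accuracy_score(y_tr, model.predict(X_tr)),
                      accuracy_score(y_te, model.predict(X_te)))

for depth, (train_acc, test_acc) in results.items():
    print("max_depth =", depth, "train:", train_acc, "test:", test_acc)
```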


## Playing with min_samples_leaf

In [106]:
# Keep max_depth = 6 and additionally set min_samples_leaf

# Define the classifier, and fit it to the data
dtmodel_msl = DecisionTreeClassifier(max_depth = 6, min_samples_leaf = 5)
dtmodel_msl.fit(X_train, y_train)

# Testing the model

# Making predictions
y_train_predmsl = dtmodel_msl.predict(X_train)
y_test_predmsl = dtmodel_msl.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracymsl = accuracy_score(y_train, y_train_predmsl)
test_accuracymsl = accuracy_score(y_test, y_test_predmsl)
print('The training accuracy with max depth and min samples leaf is', train_accuracymsl)
print('The test accuracy with max depth and min samples leaf is', test_accuracymsl)

The training accuracy with max depth and min samples leaf is 0.8735955056179775
The test accuracy with max depth and min samples leaf is 0.8547486033519553
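
On a 179-row test set, the difference between 0.821 and 0.855 amounts to six passengers, so a single split gives a fairly noisy comparison. As a hedged sketch, cross-validation averages the score over several splits. The data below is synthetic, and the parameter values simply mirror the last cell.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with features and outcomes)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Same hyperparameters as the final model in the notebook
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=5, random_state=0)

# Accuracy on each of 5 folds, then the mean and spread across folds
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```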
