In [None]:
# plotting
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl

# data visualization
import seaborn as sns
from helper_functions import plot_setup, plot_confusion_matrix
sns.set_style('white')
plot_setup()

# data analysis
import pandas as pd

# data mining & ML
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

In this part, we'll look at Decision Trees.

First, let's load the Titanic dataset previously prepared.

In [None]:
titanic = pd.read_csv('titanic_processed.csv')

Let's extract the feature and label vectors.

In [None]:
X = titanic.drop('survived', axis = 1)
y = titanic['survived']

# Decision Trees

Decision trees are one of the easiest ML algorithms to understand conceptually.

We'll learn about this concept by jumping into code and exploring the idea.

#### Gender Feature

In our first notebook, during data exploration, we noticed that gender was a significant factor in predicting survival. Let's look at that again.

In [None]:
# Display counts of survivors for each gender category
sns.countplot(data = titanic, x = 'gender', hue = 'survived')

We see that gender is a significant factor. So what would happen if we just used gender to predict if an individual survives?

In [1]:
# Predict that women (gender == 0) survive and that everyone else does not

def predict(features):
    if features['gender'] == 0:
        return 1
    else:
        return 0

How well does this simple predictor perform?

Let's pull out a test set using `train_test_split`. We won't be using the train set for now since our predictor model is already built based just on gender.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a data frame from the train data to use in our simple predictor
# This is the data that we will analyze (we never want to look at our test data
# or change our model based on it)
train_df = X_train.copy()
train_df['survived'] = y_train

In [None]:
y_pred = [predict(row) for _, row in X_test.iterrows() ]
accuracy_score(y_test, y_pred)

That's not a bad start. We of course know that this is too simple a classifier, though.

Let's get a bit more specific. Age seemed to play a role in survival. Let's look at age by gender and see if we can improve our model.

**Age**

Women's survival by age:

In [None]:
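# distplot is deprecated in recent seaborn releases; histplot with kde = True draws a comparable plot
sns.histplot(train_df['age'][(train_df['survived'] == 1) & (train_df['gender'] == 0)].dropna(), kde = True, stat = 'density', label = 'survived')
sns.histplot(train_df['age'][(train_df['survived'] == 0) & (train_df['gender'] == 0)].dropna(), kde = True, stat = 'density', label = 'did not survive')
plt.legend()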

Men's survival by age:

In [None]:
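# distplot is deprecated in recent seaborn releases; histplot with kde = True draws a comparable plot
sns.histplot(train_df['age'][(train_df['survived'] == 1) & (train_df['gender'] == 1)].dropna(), kde = True, stat = 'density', label = 'survived')
sns.histplot(train_df['age'][(train_df['survived'] == 0) & (train_df['gender'] == 1)].dropna(), kde = True, stat = 'density', label = 'did not survive')
plt.legend()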

For men, there seems to be a much higher rate of survival if they were younger than 15. We can add that to our model.

In [None]:
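def predict(features):
    # One possible solution: predict survival for women (gender == 0) and for males younger than 15
    return int(features['gender'] == 0 or features['age'] < 15)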

How accurate is our predict function for our test data?

In [None]:
y_pred = [ predict(row) for _, row in X_test.iterrows() ]
accuracy_score(y_test, y_pred)

We see an increase of almost 1.5% in accuracy from also distinguishing by age.

We would like to better distinguish and predict for women. Right now we are saying that all women survived. That isn't true and it is affecting the accuracy of our predictions.

Let's explore some other features that we might use to better predict for female passengers.

**Ticket Class**

In [None]:
# Survival counts for women by ticket class (training data only, so we don't peek at the test set)
sns.countplot(data = train_df[train_df["gender"] == 0], x = 'pclass', hue = 'survived')

Almost all women with 1st and 2nd class tickets survived, but many women with 3rd class tickets did not. Let's include this in our prediction function.

In [None]:
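def predict(features):
    # One possible solution, assuming pclass is coded 1/2/3: women outside 3rd class and males under 15 survive
    woman = features['gender'] == 0
    return int((woman and features['pclass'] < 3) or (not woman and features['age'] < 15))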

How accurate are our predictions now?

In [None]:
y_pred = [ predict(row) for _, row in X_test.iterrows() ]
accuracy_score(y_test, y_pred)

We see that our accuracy went down slightly. This suggests that the last breakdown did not provide much information gain (more below).

On Your Own: Keep playing around with data breakdowns. What combinations of conditions indicate that a passenger was likely to survive or not survive?

You can continue to expand on the `predict` function.

#### Back to Decision Trees

What we did above is exactly the beginnings of a decision tree.

A decision tree is a logical tree structure that asks a question at each branch. Based on the answer to that question, you go either right or left from that branch. You proceed down the tree until you reach a node that gives you the predicted value.

Suppose you get a new data point, for example a new passenger.
You ask a series of questions about the passenger, and once all of those questions are answered, you have a prediction of whether the passenger survived or not.
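
To make the structure concrete, here is a minimal sketch of the gender and age rules from earlier written as an explicit tree. The dictionary layout, the `traverse` helper, and the sample passenger are all hypothetical and chosen only for illustration; this is not how scikit-learn stores its trees.

```python
# Hypothetical layout: internal nodes ask a question, leaves hold a prediction.
# Encoding assumption carried over from the notebook: gender == 0 means female.
toy_tree = {
    'question': lambda p: p['gender'] == 0,     # Is the passenger a woman?
    'yes': {'prediction': 1},                   # leaf: predict survived
    'no': {
        'question': lambda p: p['age'] < 15,    # Is the passenger younger than 15?
        'yes': {'prediction': 1},
        'no': {'prediction': 0},
    },
}

def traverse(node, passenger):
    """Follow the answers down the tree until a leaf is reached."""
    while 'prediction' not in node:
        branch = 'yes' if node['question'](passenger) else 'no'
        node = node[branch]
    return node['prediction']

new_passenger = {'gender': 1, 'age': 40}        # made-up example passenger
print(traverse(toy_tree, new_passenger))        # 0 (predicted not to survive)
```

Each call walks exactly one path from the root to a leaf, which is why predicting with a decision tree is fast even when the tree has many branches.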

**Important Concepts:**   
*Entropy*   
An important part of forming a decision tree is something called [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)).  
Entropy is essentially a measure of how varied (or random) the values in a set are. A set in which all items are identical has low entropy. A set in which every item is different from the others has high entropy.

When creating decision trees, the goal is to reduce the entropy of the resulting subsets at each step. You want to find questions that separate your data into groups that are each as uniform as possible.

We are predicting survival: either 1 (survived) or 0 (did not survive). Every question we ask splits our data into subsets. In each split, we want as little entropy in the resulting subsets as possible.

Our ideal question would split surviving passengers from passengers who did not survive. Unfortunately we don't have any one question like that, but with a series of questions we can get closer and closer to that goal.
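
As a rough illustration, Shannon entropy for a list of labels can be computed in a few lines. The `entropy` helper and the label lists below are made up for this sketch and are not part of the notebook's data or of scikit-learn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes present."""
    total = len(labels)
    counts = Counter(labels)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

all_survived = [1, 1, 1, 1]     # made-up labels: every item identical
evenly_mixed = [1, 1, 0, 0]     # made-up labels: a 50/50 mix

print(entropy(all_survived))    # 0.0, no uncertainty at all
print(entropy(evenly_mixed))    # 1.0, the maximum for two classes
```

With two classes, entropy ranges from 0 (a perfectly uniform set) to 1 bit (an even mix), so a good question is one that pushes each resulting subset toward 0.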

*Information Gain*  
Information gain is the reduction in entropy between your data set before a conditional branch and the subsets after it (i.e. before and after you asked your question). A very good condition in your decision tree will give you a high information gain. In our example, that means it would do a great job of splitting passengers who survived from passengers who did not.

A bad condition would leave us with almost no more information after the split than before it. We would know hardly anything more about whether a passenger survived or not.
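
A small sketch can make the contrast between a good and a bad condition concrete. It computes information gain as the parent set's entropy minus the size-weighted entropy of the subsets. The helper names and the label lists are illustrative assumptions, not notebook data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]             # made-up survival labels

good_split = [[1, 1, 1, 1], [0, 0, 0, 0]]     # separates the classes perfectly
bad_split = [[1, 1, 0, 0], [1, 1, 0, 0]]      # each side is as mixed as the parent

print(information_gain(parent, good_split))   # 1.0
print(information_gain(parent, bad_split))    # 0.0
```

Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion='entropy'` makes it choose splits by entropy instead.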


### Decision Tree Classifier with Sklearn

We don't need to create a decision tree manually, though. scikit-learn already has tools that will automatically generate a decision tree for us and use it to predict new data. Let's see how it performs.

[Scikit Learn Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

In [None]:
from sklearn import tree

clf = tree.DecisionTreeClassifier()

In [None]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
plot_confusion_matrix(y_test, y_pred)

Lastly, let's visualize our decision tree classifier.

In [None]:
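import graphviz

# Export the fitted tree in DOT format, labelling each node with its feature name
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=list(X.columns), filled=True)
graph = graphviz.Source(dot_data)
graph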

This graph shows us a very important point about decision tree classifiers. The automatically generated ones can become incredibly complicated, and the more complicated they are, the more likely it is that they are overfitted, or highly specialized to our training data.

This could be why our sklearn decision tree underperformed our manually determined one. This is something to keep in mind when working with decision trees.

There are some parameters of our model that we can change to reduce overfitting and the complexity of our model.

**On Your Own:**
Play around with the `max_depth` and `min_samples_split` parameters. See how they change your resulting visualized decision tree.
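
As a starting point, here is one possible way to compare a few settings side by side. To stay self-contained, the sketch uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic set); the variable names and the parameter grid are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch, not the Titanic set)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = []
for max_depth in (None, 3, 5):              # None lets the tree grow without a depth limit
    for min_samples_split in (2, 20):       # 2 is the scikit-learn default
        model = DecisionTreeClassifier(max_depth=max_depth,
                                       min_samples_split=min_samples_split,
                                       random_state=42)
        model.fit(X_tr, y_tr)
        results.append({
            'max_depth': max_depth,
            'min_samples_split': min_samples_split,
            'tree_depth': model.get_depth(),
            'leaves': model.get_n_leaves(),
            'train_acc': model.score(X_tr, y_tr),
            'test_acc': model.score(X_te, y_te),
        })

for row in results:
    print(row)
```

To try this on the notebook's data, swap `X_tr`, `X_te`, `y_tr`, `y_te` for `X_train`, `X_test`, `y_train`, `y_test`. A large gap between training and test accuracy is a common sign of overfitting.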