## Decision Trees

#### Table of Contents

- [Preliminaries](#Preliminaries)
- [Classification](#Classification)
- [Pruning](#Pruning)
- [Regression](#Regression)

We can use decision trees for classification or regression:

- `sklearn.tree.DecisionTreeClassifier()`
- `sklearn.tree.DecisionTreeRegressor()`

They operate in the same manner, but for different problems.
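As a minimal sketch of that parallel, the snippet below fits both estimators on a tiny made-up data set (the arrays `X_toy`, `y_class`, and `y_reg` are hypothetical and not part of the Titanic data). The calls are the same; only the type of target differs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data (hypothetical): one feature, six rows
X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_class = np.array(['no', 'no', 'no', 'yes', 'yes', 'yes'])  # class labels
y_reg = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])             # numeric outcome

# Same interface for both: construct, fit, predict
clf = DecisionTreeClassifier(max_leaf_nodes = 2, random_state = 0).fit(X_toy, y_class)
reg = DecisionTreeRegressor(max_leaf_nodes = 2, random_state = 0).fit(X_toy, y_reg)

pred_class = clf.predict([[2], [11]])  # predicted labels
pred_reg = reg.predict([[2], [11]])    # predicted leaf means
print(pred_class, pred_reg)
```

The classifier returns the majority label in each terminal node, while the regressor returns the mean of the target in each terminal node.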

*******
# Preliminaries
[TOP](#Decision-Trees)

We will use the Titanic data from lecture to show how to implement classification and regression decision trees.

In [None]:
# utilities
import pandas as pd

# processing
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV

# algorithms
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

# plotting
import matplotlib.pyplot as plt

In [None]:
titanic = pd.read_csv('titanic3.csv')
titanic.head()

Select the following variables to match [Varian (2014)](https://www.aeaweb.org/articles?id=10.1257/jep.28.2.3):

- `pclass`
- `survived`
- `sex`
- `age`
- `sibsp`

In [None]:
df = titanic[['pclass', 'survived', 'sex', 'age', 'sibsp']]

Check for any `NAs`. 

In [None]:
df.isnull().any()

If there are any `NAs`, drop them.

In [None]:
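# reassign rather than use inplace = True: df is a slice of titanic,
# so an in-place edit can trigger a SettingWithCopyWarning
df = df.dropna()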

Print the head. Notice that we have three categorical variables:

1. `survived` - our label
2. `sex` - string
3. `pclass` - an _**ordered**_ numeric categorical variable

In [None]:
df.head()

Decision trees only need dummies for unordered categorical variables.
Convert `sex` to a dummy variable and leave `pclass` as it is.

In [None]:
df = df.drop(columns = 'sex').join(pd.get_dummies(df['sex'], drop_first = True))
df.head()
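To make the dummy step concrete, here is a small sketch on a made-up frame (the `toy` data is hypothetical). With `drop_first = True`, pandas drops the first category in sorted order (`female`) and keeps a single `male` indicator. The `dtype = int` argument is an addition here so the column shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical four-row frame standing in for the Titanic columns
toy = pd.DataFrame({'sex': ['male', 'female', 'female', 'male'],
                    'age': [22, 38, 26, 35]})

# One indicator column is enough for a two-level variable
dummies = pd.get_dummies(toy['sex'], drop_first = True, dtype = int)

# Swap the string column for the indicator
toy_encoded = toy.drop(columns = 'sex').join(dummies)
print(toy_encoded)
```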

We are going to use the whole data set (no train-test split), because the data cover the entire universe of Titanic passengers.

Define `x` and `y`.

Convert `y` to a `string` and then a `category`.

In [None]:
y = df['survived'].astype('string').astype('category')
x = df.drop(columns = 'survived') 
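One plausible reason for the conversion (an assumption, since the notebook does not say) is that the category labels can later be handed to `plot_tree` as string class names. A minimal sketch on a made-up series:

```python
import pandas as pd

# Hypothetical 0/1 outcome column
survived = pd.Series([0, 1, 1, 0])

# integer -> string -> category, as in the cell above
y_toy = survived.astype('string').astype('category')

# The categories are the sorted unique labels, now as strings
labels = list(y_toy.cat.categories)
print(labels)
```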

***********
# Classification
[TOP](#Decision-Trees)
    
The tree in Varian (2014) has 7 terminal nodes.
Let's recreate it!

In [None]:
plt.figure(figsize = (8, 8))
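# named fit_7 to match the 7 terminal nodes
fit_7 = DecisionTreeClassifier(random_state = 490,
                               max_leaf_nodes = 7)
fit_7.fit(x, y)

# plain lists for the names work across scikit-learn versions
_ = plot_tree(fit_7,
              feature_names = list(x.columns),
              class_names = list(y.cat.categories),
              filled = True)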
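If the plot is hard to read, a fitted tree can also be inspected as text. The sketch below uses simulated data (the arrays and the feature names `x0` and `x1` are hypothetical) to show `get_n_leaves()` and `export_text()`, both of which are part of `sklearn.tree`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated data (hypothetical): two features, binary label
rng = np.random.default_rng(0)
X_toy = rng.normal(size = (200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

toy_fit = DecisionTreeClassifier(random_state = 490,
                                 max_leaf_nodes = 7).fit(X_toy, y_toy)

# max_leaf_nodes is an upper bound, so check how many leaves were grown
n_leaves = toy_fit.get_n_leaves()

# Plain-text version of the splits
rules = export_text(toy_fit, feature_names = ['x0', 'x1'])
print(n_leaves)
print(rules)
```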

*********
# Pruning
[TOP](#Decision-Trees)

Pruning here just means using cross-validation to choose the number of terminal nodes.
I hope CV is becoming familiar by now.

In [None]:
param_grid = {
    # max_leaf_nodes must be at least 2; a value of 1 makes the fit fail
    'max_leaf_nodes': range(2, 11)
}

tree_cv = DecisionTreeClassifier(random_state = 490)

grid_search = GridSearchCV(tree_cv, param_grid,
                          cv = 5,
                          scoring = 'accuracy',
                          n_jobs = 10).fit(x, y)
best = grid_search.best_params_
best
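Beyond `best_params_`, it can help to look at the whole cross-validation table to see how close the candidates are. This is a sketch on simulated data (`X_toy` and `y_toy` are hypothetical); the column names come from the `cv_results_` attribute of `GridSearchCV`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Simulated data (hypothetical): only the first feature matters
rng = np.random.default_rng(0)
X_toy = rng.normal(size = (300, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_search = GridSearchCV(DecisionTreeClassifier(random_state = 490),
                          {'max_leaf_nodes': range(2, 11)},
                          cv = 5,
                          scoring = 'accuracy').fit(X_toy, y_toy)

# One row per candidate, best-ranked first
cols = ['param_max_leaf_nodes', 'mean_test_score',
        'std_test_score', 'rank_test_score']
cv_table = pd.DataFrame(toy_search.cv_results_)[cols].sort_values('rank_test_score')
print(cv_table)
```

When several sizes have nearly the same mean score, the smaller tree is usually easier to interpret.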

Fit the optimal model and plot the tree.

In [None]:
plt.figure(figsize = (8, 8))
fit_best = DecisionTreeClassifier(random_state = 490,
                                 max_leaf_nodes = best['max_leaf_nodes'])
fit_best.fit(x, y)

_ = plot_tree(fit_best,
              feature_names = list(x.columns),
              class_names = list(y.cat.categories),
              filled = True)

***********
# Regression
[TOP](#Decision-Trees)

Regression works in the same way as classification. 
Let's do an example!

Let's predict the `fare` column.

In [None]:
df2 = df.join(titanic['fare'])

In [None]:
df2.isnull().any()

In [None]:
df2.dropna(inplace = True)

In [None]:
y = df2['fare']
x = df2.drop(columns = 'fare')

Let's fit a regression decision tree with five terminal nodes.

In [None]:
plt.figure(figsize = (8, 10))
fit_5 = DecisionTreeRegressor(random_state = 490,
                              max_leaf_nodes = 5)
fit_5.fit(x, y)

_ = plot_tree(fit_5,
              feature_names = list(x.columns),
              filled = True)

We could prune this tree too, but the procedure is identical to the one for the classification decision tree, so we skip it here.
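For reference, a minimal sketch of what pruning a regression tree could look like is below. It runs on simulated data (`X_toy` and `y_toy` are hypothetical, not the `fare` data). The one change from the classification case is the scoring rule: accuracy does not apply to a numeric target, so this sketch assumes `'neg_mean_squared_error'`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Simulated data (hypothetical): a step function in the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size = (300, 2))
y_toy = np.where(X_toy[:, 0] > 5, 20.0, 5.0) + rng.normal(scale = 1.0, size = 300)

reg_search = GridSearchCV(DecisionTreeRegressor(random_state = 490),
                          {'max_leaf_nodes': range(2, 11)},
                          cv = 5,
                          scoring = 'neg_mean_squared_error').fit(X_toy, y_toy)

best_leaves = reg_search.best_params_['max_leaf_nodes']
best_mse = -reg_search.best_score_  # flip the sign to get a positive MSE
print(best_leaves, best_mse)
```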