### Introduction to Decision Trees and Forests

**OBJECTIVES**

- Understand how a decision tree is built for classification and regression
- Fit decision tree models using `scikit-learn`
- Control overfitting by grid searching and cross validating
- Understand the random forest model and fit models using `scikit-learn`
- Examine feature importance of fit tree and forest models

In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.preprocessing import OneHotEncoder

#### Titanic Dataset

In [None]:
#load the data
titanic = sns.load_dataset('titanic')
titanic.head(5) #shows first five rows of data

In [None]:
#subset the first five rows (labels 0 through 4) to binary columns
data = titanic.loc[:4, ['alone', 'adult_male', 'survived']]
data

Suppose you want to use a single column to predict if a passenger survives or not.  Which column will do a better job predicting survival in the sample dataset above?
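
One way to approach this question is to compare the survival rate within each group of a column. The sketch below assumes the five rows match the hand-typed values in `sample` (a hypothetical stand-in for `data`; check them against your own output) and uses `groupby` to compute the share of survivors for each value.

```python
import pandas as pd

# Hand-typed stand-in for the five-row subset (assumed values; verify against `data`)
sample = pd.DataFrame({
    'alone': [False, False, True, False, True],
    'adult_male': [True, False, False, False, True],
    'survived': [0, 1, 1, 1, 0],
})

# Share of survivors within each group of a single column
survival_by_alone = sample.groupby('alone')['survived'].mean()
survival_by_adult_male = sample.groupby('adult_male')['survived'].mean()

print(survival_by_alone)
print(survival_by_adult_male)
```

With these values, `adult_male` separates survivors from non-survivors completely, while both `alone` groups remain mixed.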

In [None]:
#survival by alone


In [None]:
#survival by adult_male


### Entropy

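One way to quantify the quality of a split is a quantity called **entropy**, which measures how mixed the class labels in a group are:

$$H = - \sum_i p_i \log_2 p_i $$

Here $p_i$ is the proportion of observations in the group that belong to class $i$. A group containing a single class has entropy 0, and a 50/50 mix of two classes has entropy 1. When building a decision tree, the idea is to split on the feature whose resulting groups have the lowest weighted entropy.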
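
The formula can be written as a small helper. This is a minimal sketch: the function name `entropy` is an assumption for illustration, and it uses base-2 logarithms so the result is in bits.

```python
import numpy as np

def entropy(probabilities):
    """Entropy in bits of a list of class proportions (hypothetical helper)."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0 by dropping empty classes
    return float(-np.sum(p * np.log2(p))) + 0.0  # + 0.0 avoids printing -0.0

print(entropy([1.0]))       # all the same class
print(entropy([0.5, 0.5]))  # half and half
```

The first call covers the "all the same" case and the second the "half and half" case.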

In [None]:
#all the same -- probability = 1


In [None]:
#half and half -- probability = .5


In [None]:
#subset the data to age, pclass, and survived five rows
data = titanic.loc[:4, ['age', 'pclass', 'survived']]
data

In [None]:
#compute entropy for pclass
#first class entropy


In [None]:
#pclass entropy


In [None]:
#weighted sum of these


In [None]:
#splitting on age < 30
entropy_left = None
entropy_right = None
entropy_age = None

In [None]:
#original entropy
original_entropy = -((3/5)*np.log2(3/5) + (2/5)*np.log2(2/5))

In [None]:
# improvement based on pclass


In [None]:
#improvement based on age < 30
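
The improvement from a split is the original entropy minus the weighted entropy of the groups the split creates. The sketch below works this through for both candidate splits. It assumes the five rows match the hand-typed values in `sample` (verify against `data`), and the helpers `entropy_of` and `weighted_entropy` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for the five-row subset (assumed values; verify against `data`)
sample = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'pclass': [3, 1, 3, 1, 3],
    'survived': [0, 1, 1, 1, 0],
})

def entropy_of(labels):
    """Entropy in bits of a Series of class labels."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-np.sum(p * np.log2(p))) + 0.0

def weighted_entropy(labels, groups):
    """Average entropy of each group, weighted by group size."""
    total = len(labels)
    return sum(len(part) / total * entropy_of(part)
               for _, part in labels.groupby(groups))

original = entropy_of(sample['survived'])
gain_pclass = original - weighted_entropy(sample['survived'], sample['pclass'])
gain_age = original - weighted_entropy(sample['survived'], sample['age'] < 30)

print(round(gain_pclass, 4), round(gain_age, 4))
```

With these values the improvement from `pclass` is about 0.42 bits, against about 0.02 bits for `age < 30`, so `pclass` is the better first split on this tiny sample.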


**EXAMPLE**

#### Using `sklearn`

The `DecisionTreeClassifier` can use entropy as its splitting criterion (`criterion = 'entropy'`) to build a full decision tree model.  Below we build and visualize such a model.
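
The instantiate, fit, score, and predict steps can be sketched as follows. The five rows are hand-typed assumed values standing in for `data`, and `random_state = 0` is an added assumption so that ties between equally good splits are broken the same way on every run.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hand-typed stand-in for the five-row subset (assumed values; verify against `data`)
sample = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'pclass': [3, 1, 3, 1, 3],
    'survived': [0, 1, 1, 1, 0],
})
X_small = sample[['age', 'pclass']]
y_small = sample['survived']

# instantiate with entropy as the splitting criterion
small_tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
# fit
small_tree.fit(X_small, y_small)
# score: accuracy on the same rows used for fitting
train_accuracy = small_tree.score(X_small, y_small)
# predictions for those rows
predictions = small_tree.predict(X_small)

print(train_accuracy, predictions)
```

Because no two rows share the same feature values, an unconstrained tree can classify all five training rows correctly.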

In [None]:
X = data[['age', 'pclass']]
y = data['survived']

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
DecisionTreeClassifier?

In [None]:
#instantiate


In [None]:
#fit


In [None]:
#score it


In [None]:
#predictions


#### Visualizing the results

The `plot_tree` function draws a decision tree model after it has been fit.  It has many options for controlling how the tree is drawn.
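
A few of the commonly used options are sketched below on a tree fit to hand-typed assumed values for the five-row subset. The class labels `'died'` and `'survived'` are display names chosen for illustration, listed in ascending order of the 0/1 target.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hand-typed stand-in for the five-row subset (assumed values; verify against `data`)
sample = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'pclass': [3, 1, 3, 1, 3],
    'survived': [0, 1, 1, 1, 0],
})
X_small = sample[['age', 'pclass']]
y_small = sample['survived']
small_tree = DecisionTreeClassifier(criterion = 'entropy',
                                    random_state = 0).fit(X_small, y_small)

fig, ax = plt.subplots(figsize = (10, 6))
annotations = plot_tree(
    small_tree,
    feature_names = list(X_small.columns),  # label splits with column names
    class_names = ['died', 'survived'],     # illustrative labels for classes 0 and 1
    filled = True,                          # shade nodes by majority class
    ax = ax,
    fontsize = 10,
)
```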

In [None]:
from sklearn.tree import plot_tree

In [None]:
#plot_tree


**Larger Example**

In [None]:
bigger_data = titanic[['pclass', 'age', 'fare', 'survived']].dropna()
bigger_data.info()

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

In [None]:
X = bigger_data.drop('survived', axis = 1)
y = bigger_data['survived']

In [None]:
#instantiate
bigger_tree = DecisionTreeClassifier(criterion = 'entropy')
#fit
bigger_tree.fit(X, y)

In [None]:
#evaluate -- accuracy (fraction classified correctly) on the data used for fitting
bigger_tree.score(X, y)

In [None]:
1 - bigger_tree.score(X, y) #error rate

In [None]:
#pass a plain list; some scikit-learn releases reject a pandas Index here
plot_tree(bigger_tree, feature_names = list(X.columns));

A fully grown tree like this one can memorize the training data, so its accuracy on that same data says little about how well it generalizes.  It is important to keep using the train/test split and cross validation we discussed to assess the quality of the model fit.
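
The gap between training accuracy and cross-validated accuracy can be sketched with `cross_val_score`. To keep the example self-contained it uses synthetic data from `make_classification` as a stand-in for the Titanic features; the names `X_demo` and `y_demo` and all of the settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the Titanic data)
X_demo, y_demo = make_classification(n_samples = 300, n_features = 3,
                                     n_informative = 3, n_redundant = 0,
                                     flip_y = 0.2, random_state = 0)

# Accuracy on the same data the unconstrained tree was fit on
full_tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
train_accuracy = full_tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy on held-out folds: each fold is scored by a tree that never saw it
cv_scores = cross_val_score(
    DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
    X_demo, y_demo, cv = 5)

print(train_accuracy, cv_scores.mean())
```

The held-out scores are the more honest estimate of how the tree would do on new passengers.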

In [None]:
#fix random_state so the split, and the scores below, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [None]:
bigger_tree.fit(X_train, y_train)

In [None]:
bigger_tree.score(X_train, y_train)

In [None]:
bigger_tree.score(X_test, y_test)

#### Control `max_depth`

There are many hyperparameters in the decision tree model.  One thing we may want to control is how many decisions in a row the tree is allowed to make.  The `max_depth` parameter sets this: it caps the number of splits along any path from the root to a leaf, so a tree with `max_depth = 3` asks at most three questions before making a prediction.
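
The effect of `max_depth` on the size of a tree can be checked with the `get_depth` and `get_n_leaves` methods. This sketch uses synthetic data from `make_classification` as a stand-in for the Titanic features; the data settings and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic data)
X_demo, y_demo = make_classification(n_samples = 300, n_features = 3,
                                     n_informative = 3, n_redundant = 0,
                                     flip_y = 0.2, random_state = 0)

full = DecisionTreeClassifier(criterion = 'entropy',
                              random_state = 0).fit(X_demo, y_demo)
shallow = DecisionTreeClassifier(criterion = 'entropy', max_depth = 3,
                                 random_state = 0).fit(X_demo, y_demo)

# A depth-3 binary tree can have at most 2**3 = 8 leaves
print('full:   ', full.get_depth(), 'levels,', full.get_n_leaves(), 'leaves')
print('shallow:', shallow.get_depth(), 'levels,', shallow.get_n_leaves(), 'leaves')
```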

In [None]:
#decision tree with depth of 3
constrained_tree = DecisionTreeClassifier(criterion = 'entropy',
                                          max_depth = 3)

In [None]:
#fit on train
constrained_tree.fit(X_train, y_train)

In [None]:
#score on train
constrained_tree.score(X_train, y_train)

In [None]:
#score on test
constrained_tree.score(X_test, y_test)

In [None]:
#plot results
fig, ax = plt.subplots(figsize = (20, 10))
#pass a plain list; some scikit-learn releases reject a pandas Index here
plot_tree(constrained_tree, feature_names = list(X.columns), ax = ax, fontsize = 14);

### Selecting the Best Tree

In [None]:
train_scores = []
test_scores = []
for d in range(1, 20):
    dtree = DecisionTreeClassifier(criterion = 'entropy',
                                 max_depth = d).fit(X_train, y_train)
    train_scores.append(dtree.score(X_train, y_train))
    test_scores.append(dtree.score(X_test, y_test))

In [None]:
plt.plot(range(1, 20), train_scores, '--o', label = 'train')
plt.plot(range(1, 20), test_scores, '--o', label = 'test')
plt.grid()
plt.legend()
plt.xticks(range(1, 20))
plt.xlabel('max depth')
plt.ylabel('accuracy')
plt.title('Depth vs. train and test score');

In [None]:
#refit at a depth read off the plot above (6 here; your plot may suggest another value)
dtree = DecisionTreeClassifier(criterion = 'entropy',
                               max_depth = 6).fit(X_train, y_train)

In [None]:
dtree.score(X_train, y_train)

In [None]:
dtree.score(X_test, y_test)
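
Comparing depths by their test score means the chosen depth has already seen the test set. One alternative is to pick the depth by cross validation on the training data alone with `GridSearchCV`, keeping the test set for a single final check. The sketch below assumes synthetic stand-in data from `make_classification`; the variable names and the 5-fold setting are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic data)
X_demo, y_demo = make_classification(n_samples = 300, n_features = 3,
                                     n_informative = 3, n_redundant = 0,
                                     flip_y = 0.2, random_state = 0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state = 0)

# Try every depth from 1 to 19, scoring each with 5-fold cross validation
grid = GridSearchCV(
    DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
    param_grid = {'max_depth': list(range(1, 20))},
    cv = 5)
grid.fit(X_tr, y_tr)

best_depth = grid.best_params_['max_depth']
test_accuracy = grid.score(X_te, y_te)  # scored by the refit best tree
print(best_depth, test_accuracy)
```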

#### Evaluating Importance

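Each split reduces impurity (here, entropy) by some amount.  Adding up these reductions over every split that uses a given feature, weighting each by the share of samples that reach the split, and normalizing so the totals sum to 1 gives the "importance" of each feature.  These values are stored in the `feature_importances_` attribute of the fit model.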

In [None]:
#feature importances
dtree.feature_importances_

In [None]:
#features
dtree.feature_names_in_

In [None]:
pd.DataFrame({'features': dtree.feature_names_in_, 'importance': dtree.feature_importances_})

#### Problem

Use a decision tree model to predict heart conditions using the data below.  

In [None]:
heart = pd.read_csv('data/heart_cleveland_upload.csv')

In [None]:
heart.head()