# Classification trees
_Learn by splitting the feature space_

# Decision boundaries
![Decision Tree](images/dt.png)

# Hunt's Algorithm

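* Start with all training points at the root node, then for each node:
    * If the node is pure (all of its points belong to one class), make it a leaf node
    * If the node is impure, make it an internal node
        * Consider all candidate binary splits, one feature at a time
            * For each split, measure the impurity of the resulting child nodes
            * Use the split with the smallest impurity as the node's test

Repeat recursively on each child node until a stopping condition is met.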
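
The split-selection step can be sketched for a single numeric feature as follows. This is a minimal illustration, not scikit-learn's implementation: the helper names `gini` and `best_split` and the toy data are hypothetical, and Gini impurity is assumed as the impurity measure.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, impurity_of_node) if no split lowers the impurity.
    """
    n = len(values)
    distinct = sorted(set(values))
    best_threshold, best_impurity = None, gini(labels)
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of a split: child impurities weighted by child size.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Toy data (made up for illustration): one feature, two classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```

A full tree builder would call `best_split` for every feature, keep the best one, and recurse into the two child nodes.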

# Stopping conditions

* Node is pure
* Node impurity is below a chosen threshold
* Node contains fewer datapoints than a chosen minimum
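
In scikit-learn, conditions like these are set through constructor parameters of `DecisionTreeClassifier`. The sketch below is illustrative only and its threshold values are arbitrary. Note that `min_impurity_decrease` is related to the impurity condition but not identical to it: it thresholds how much a split reduces impurity, not the impurity of the node itself.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Default settings: keep splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Early stopping; the values below are arbitrary, chosen for illustration.
pruned_tree = DecisionTreeClassifier(
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    min_samples_split=10,        # do not split nodes with fewer than 10 samples
    random_state=0,
).fit(iris.data, iris.target)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```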

# Impurity measures

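* $Entropy(t) = -\sum_{i=1}^{C} p(i) \log p(i)$

* $Gini(t) = 1 - \sum_{i=1}^{C} p(i)^2$

* $Error(t) = 1 - \max_{i \in \{1, \dots, C\}} p(i)$

Here $p(i)$ is the fraction of datapoints at node $t$ that belong to class $i$, and $C$ is the number of classes.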

![gini](images/gini.png)
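
The three measures can be computed directly from a node's class distribution. This is a small sketch with hypothetical function names. The formulas above leave the logarithm base open, so base 2 is assumed here.

```python
import math

def entropy(p):
    """Entropy in bits; written as sum p*log2(1/p), which equals -sum p*log2(p)."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    return 1.0 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """Misclassification error: 1 minus the largest class fraction."""
    return 1.0 - max(p)

# Class distributions for a pure node, a mostly pure node, and an evenly mixed node.
for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, entropy(p), gini(p), classification_error(p))
```

All three measures are zero for a pure node and largest when the classes are evenly mixed.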

In [41]:
%load_ext autoreload
%autoreload 2

In [48]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from tree_util import draw_tree_to_file

# Example

In [45]:
iris = load_iris()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)

In [46]:
draw_tree_to_file(clf, iris['feature_names'], "images/dt-iris.dot")

In [47]:
!dot -Tpng images/dt-iris.dot -o images/dt-iris.png

![decision-tree](images/dt-iris.png)
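
If Graphviz or the local `tree_util` helper is not available, the fitted tree can also be inspected as plain text. The sketch below uses scikit-learn's `export_text` as an alternative to the drawing step above, and refits the classifier so that it is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)

# Render the split rules as indented text, one line per node.
tree_text = export_text(clf, feature_names=list(iris["feature_names"]))
print(tree_text)
```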

# Random Forests
_Bootstrap with many decision trees_

# One tree is not enough

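Training a decision tree on the _same_ dataset will give the _same_ tree every time.

A single tree keeps choosing the most informative features, so it can miss less informative ones.

**Solution**:

* _Sample the training data with replacement_ (bootstrap), so that each tree sees a different dataset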
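
Sampling with replacement can be sketched with the standard library alone. The seed and the sample size below are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples = 150         # e.g. the number of rows in the iris dataset

# Draw n_samples row indices uniformly at random, with replacement.
bootstrap_indices = [rng.randrange(n_samples) for _ in range(n_samples)]

# Rows that were never drawn are "out of bag" for this sample.
out_of_bag = set(range(n_samples)) - set(bootstrap_indices)
print(len(set(bootstrap_indices)), len(out_of_bag))
```

On average a bootstrap sample contains about 63% of the distinct rows, so every tree is trained on a noticeably different dataset.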

# Algorithm

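* For each tree, draw a bootstrap sample of the training data
* Select a random subset of features (in Breiman's formulation, a fresh subset at each split)
* Run the decision tree algorithm on that sample, using only those features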
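
A hand-rolled version of this loop might look like the sketch below. It is an illustration under stated assumptions, not how `RandomForestClassifier` is implemented: each tree is fitted on a bootstrap sample of the rows, `max_features="sqrt"` makes each split consider a random subset of the features, and the number of trees is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

n_trees = 25  # arbitrary, for illustration
forest = []
for _ in range(n_trees):
    # Bootstrap sample: row indices drawn with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(
        max_features="sqrt",  # random subset of features at each split
        random_state=int(rng.integers(1_000_000)),
    )
    forest.append(tree.fit(X[rows], y[rows]))

print(len(forest))
```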

# Prediction - bagging

* Each tree predicts a class; take the majority vote across all trees
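
Majority voting for a single sample can be sketched as follows. The function name and the example predictions are hypothetical. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the per-tree predictions for one sample."""
    # On a tie, Counter returns the label that was encountered first.
    return Counter(predictions).most_common(1)[0][0]

# Made-up predictions from five trees for one sample.
tree_predictions = ["setosa", "versicolor", "versicolor", "virginica", "versicolor"]
print(majority_vote(tree_predictions))  # versicolor
```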

# Summary

## Pros

* Can model highly non-linear decision boundaries
* Interpretable: a single tree reads as a sequence of rules

## Cons

* Single trees overfit easily
* Expensive to train when many trees are used

# Further reading

* Introduction to data mining by Tan, Steinbach, Kumar
    * [Chapter 4](https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf)
* Random Forests, Leo Breiman. 2001
    * [Machine Learning, Springer](http://link.springer.com/article/10.1023%2FA%3A1010933404324)

# Exercise

In [51]:
from sklearn.ensemble import RandomForestClassifier

In [52]:
# Fit on the iris data loaded in the Example section above
RandomForestClassifier().fit(iris.data, iris.target)