# Classification trees
_Learn by splitting the feature space_

# Decision boundaries
![Decision Tree](images/dt.png)

# Feature Splits
![Decision Tree](images/dt-boundaries.png)

# Tree
![Decision Tree](images/dt-tree.png)

# Hunt's Algorithm

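* Take one feature at a time
    * If a node is pure\*, it is a leaf node
    * If a node is impure, create a new internal node
        * Consider all binary splits of the given feature space
            * For each split, measure the impurity of the resulting child nodes
            * Take the split with the smallest impurity as the node value

Repeat on each child node until a stopping condition is met

\* A node is pure when all of its datapoints belong to the same class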
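The split search in the inner steps above can be sketched in plain Python. This is a minimal illustration for a single numeric feature, not the exact procedure used by any library. The helper names `gini` and `best_binary_split`, the midpoint thresholds, and the weighting of child impurities by node size are all assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (hypothetical helper)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, impurity_of_node) when no split improves on the node itself.
    """
    n = len(labels)
    distinct = sorted(set(values))
    best_threshold, best_impurity = None, gini(labels)
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each child's impurity by the share of datapoints it receives.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Toy feature: small values belong to class "a", large values to class "b".
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_binary_split(values, labels)
print(threshold, impurity)
```

A full tree builder would call this for every feature at a node, keep the best split overall, and recurse on the two children.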

# Stopping conditions

* Node is pure
* Node has small impurity
* Node has small number of datapoints
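In scikit-learn these conditions map roughly onto constructor parameters of `DecisionTreeClassifier`. The sketch below assumes that `min_samples_leaf` stands in for "small number of datapoints" and that `min_impurity_decrease` is the closest analogue of "small impurity" (it thresholds the impurity reduction of a split rather than the impurity itself). The specific values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Unconstrained tree: keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Constrained tree: stops early on small nodes or negligible impurity gains.
pruned_tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # every leaf keeps at least 5 datapoints
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=0,
).fit(iris.data, iris.target)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```

Comparing the two leaf counts gives a quick feel for how much the stopping conditions shrink the tree.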

![gini](images/gini.png)

# Impurity measures

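All measure how mixed the classes are at a node $t$, where $p(i)$ is the fraction of datapoints at $t$ that belong to class $i$ and $C$ is the number of classes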

* $Entropy(t) = -\sum_{i=1}^{C} p(i) \log p(i)$

* $Gini(t) = 1 - \sum_{i=1}^{C} p(i)^2$

* $Error(t) = 1 - \max_{i \in \{1, \dots, C\}} p(i)$
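The three formulas can be evaluated directly on a vector of class proportions. This is a small sketch: the function names are made up for the example, and the base-2 logarithm is an assumption, since the formula above does not fix a base.

```python
import math


def entropy(p):
    """Entropy in bits; classes with p(i) = 0 contribute nothing by convention."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)


def gini(p):
    """Gini impurity: one minus the sum of squared class proportions."""
    return 1.0 - sum(pi ** 2 for pi in p)


def classification_error(p):
    """Error rate when always predicting the most frequent class."""
    return 1.0 - max(p)


# A pure node, a mostly pure node, and a maximally mixed two-class node.
for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, round(entropy(p), 3), round(gini(p), 3), round(classification_error(p), 3))
```

All three measures are zero for the pure node and reach their two-class maximum at the 50/50 node.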

In [41]:
%load_ext autoreload
%autoreload 2

In [48]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from tree_util import draw_tree_to_file

# Example

In [45]:
iris = load_iris()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)

In [46]:
draw_tree_to_file(clf, iris['feature_names'], "images/dt-iris.dot")

In [47]:
!dot -Tpng images/dt-iris.dot -o images/dt-iris.png

![decision-tree](images/dt-iris.png)
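If Graphviz or the `tree_util` helper is not available, scikit-learn's `export_text` offers a text-only view of the same kind of tree. The sketch below is an alternative to the rendering above, not a replacement for it. Limiting the tree to `max_depth=2` is an assumption made only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned splits as indented text, one line per node.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```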

# Random Forests
_Bootstrap with many decision trees_

# One tree is not enough

Running the decision tree algorithm on the _same_ dataset will give the _same_ tree.

A single tree tends to miss less informative features, because the most informative ones win the splits.

![rf](images/random-forest.png)

# Algorithm

* Draw a bootstrap sample of the datapoints
* Select a random subset of features
* Run the decision tree algorithm
* Repeat to build many trees

# Prediction - bagging

* Take majority vote
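The forest-building and voting steps can be sketched by hand on the iris data. This is a simplified illustration under stated assumptions: the feature subset is drawn once per tree, whereas `RandomForestClassifier` draws candidate features at each split. The choices of 25 trees and 2 features per tree are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

n_trees, n_sub_features = 25, 2
forest = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # Random feature subset for this tree.
    cols = rng.choice(X.shape[1], size=n_sub_features, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))

# Each tree votes; the most frequent class per datapoint wins.
votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])  # (n_trees, n_samples)
majority = np.array([np.bincount(column).argmax() for column in votes.T])
print("training accuracy:", (majority == y).mean())
```

Each tree must be given the same feature columns at prediction time that it saw during training, which is why the column indices are stored alongside the tree.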

# Summary

## Pros

* Highly non-linear
* Interpretable

## Cons

* Easy to overfit
* Expensive

# Further reading

* Introduction to data mining by Tan, Steinbach, Kumar
    * [Chapter 4](https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf)
* Random Forests, Leo Breiman. 2001
    * [Machine Learning, Springer](http://link.springer.com/article/10.1023%2FA%3A1010933404324)

# Exercise

In [44]:
#import the classifiers
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

#import the functions to use the dataset
from pathogenicity_predictor import prepare_variants, concat_training_data, partition_into_training_and_test, plot_line_graph

In [45]:
data, feature_names = prepare_variants('../data/variants.json.gz')
variants, labels = concat_training_data(data)
training_vars, test_vars, training_labels, test_labels = partition_into_training_and_test(variants, labels, 0.8)

In [56]:
# Build a decision tree!
tree_classifier = DecisionTreeClassifier().fit(training_vars, training_labels)

# Run random forest classification!
rf_classifier = RandomForestClassifier().fit(training_vars, training_labels)

# Get probabilities on the held-out test set!
# (scoring the training data would overstate performance)
pathogenicity_probs = rf_classifier.predict_proba(test_vars)[:,1]

# Feature importances

Decision trees and random forests can tell you how much each feature contributes to the classifier: every split on a feature reduces impurity, and these reductions are summed per feature and normalized. The result is a feature importances vector that you can see with `classifier.feature_importances_`.
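As a quick illustration on the iris data rather than the exercise dataset, the importances can be paired with feature names and sorted. The `n_estimators` and `random_state` values are arbitrary choices for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(iris.data, iris.target)

# Pair each feature name with its importance and list the most important first.
ranked = sorted(
    zip(iris.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are normalized, so they sum to 1 across features.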

# Question 1
Do the results make sense?

# Question 2
How does this compare between a random forest and a single decision tree?

# Question 3
Build a ROC curve for the random forest classifier. Is it a good classifier?
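As a starting point, the mechanics of `roc_curve` and `roc_auc_score` can be tried on a synthetic dataset. This sketch deliberately does not use the variants data: `make_classification`, `train_test_split`, and all parameter values here are stand-ins chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (stand-in for the real data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score the held-out data with the probability of the positive class.
probs = rf.predict_proba(X_test)[:, 1]

# One (false positive rate, true positive rate) point per threshold.
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
print(f"AUC = {auc:.3f}")
```

Plotting `tpr` against `fpr` gives the curve. An AUC near 0.5 corresponds to random guessing and an AUC near 1.0 to near-perfect ranking.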