<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Decision-Tree-and-Random-Forest-Models" data-toc-modified-id="Decision-Tree-and-Random-Forest-Models-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Decision Tree and Random Forest Models</a></span><ul class="toc-item"><li><span><a href="#Decision-Trees" data-toc-modified-id="Decision-Trees-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Decision Trees</a></span><ul class="toc-item"><li><span><a href="#Simple-example" data-toc-modified-id="Simple-example-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Simple example</a></span></li><li><span><a href="#Decision-tree-example-with-real-data" data-toc-modified-id="Decision-tree-example-with-real-data-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Decision tree example with real data</a></span></li><li><span><a href="#Feature-importance" data-toc-modified-id="Feature-importance-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Feature importance</a></span></li></ul></li><li><span><a href="#Random-Forests" data-toc-modified-id="Random-Forests-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Random Forests</a></span><ul class="toc-item"><li><span><a href="#Random-Forest-for-Cancer-Data" data-toc-modified-id="Random-Forest-for-Cancer-Data-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Random Forest for Cancer Data</a></span></li><li><span><a href="#Random-Forest-for-Digits-Data" data-toc-modified-id="Random-Forest-for-Digits-Data-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Random Forest for Digits Data</a></span></li><li><span><a href="#Add-on:-Try-a-more-complicated-dataset:-2-spirals." 
data-toc-modified-id="Add-on:-Try-a-more-complicated-dataset:-2-spirals.-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Add-on: Try a more complicated dataset: 2 spirals.</a></span></li></ul></li></ul></li><li><span><a href="#Further-reading" data-toc-modified-id="Further-reading-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Further reading</a></span></li><li><span><a href="#Exercise:-Feed-rotated-features-to-the-Classifier" data-toc-modified-id="Exercise:-Feed-rotated-features-to-the-Classifier-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exercise: Feed rotated features to the Classifier</a></span></li></ul></div>

## Decision Tree and Random Forest Models

### Decision Trees

Decision trees are a further important model category. The basic principle is easy to understand:  
 a hierarchical series of **if/else questions**

*Example:* Game where you need to distinguish four kinds of animals:  
* *Bear, Dolphin, Penguin, Hawk*

Goal is to use as few questions as possible.

One possible solution:

![](figures/DT_animals.png)
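
Written as code, such a tree is nothing more than nested if/else statements. The sketch below is only an illustration: the three questions (`has_feathers`, `can_fly`, `has_fins`) are assumed here and need not match the ones in the figure.

```python
def classify_animal(has_feathers, can_fly, has_fins):
    """Hypothetical hand-written decision tree for the four animals."""
    if has_feathers:
        # birds: one more question separates hawk and penguin
        return "Hawk" if can_fly else "Penguin"
    # no feathers: one more question separates dolphin and bear
    return "Dolphin" if has_fins else "Bear"


print(classify_animal(has_feathers=True, can_fly=False, has_fins=False))
```

Every animal is identified after two questions, which is the minimum for four classes with yes/no answers.
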

#### Simple example 
Illustrate DT with half-moon data, a simple dataset with half-moon shaped data distributions:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier

from sklearn.datasets import make_moons


In [None]:
# create and plot dataset
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
plt.figure()

plt.scatter(X[:, 0], X[:, 1], c=y,cmap='rainbow');


**Try previous models first**

In [None]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
print("GaussianNB: accuracy = %.2f" % model.fit(X, y).score(X, y)) # 3. fit model to data


In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
print("KNeighborsClassifier: accuracy = %.2f" % model.fit(X, y).score(X, y))


**Now the decision tree**

In [None]:
# run DT
max_depth=3
model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
tree=model.fit(X,y)
tree.score(X,y)

In [None]:
# helper function to visualize DT
#
def visualize_classifier(model, X, y, ax=None, cmap='RdBu', plot_proba=False):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap, edgecolors="black",
               clim=(y.min(), y.max()), zorder=3)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
        
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    X = np.stack([xx.ravel(), yy.ravel()], axis=1)
    
    if not plot_proba:
        Z = model.predict(X).reshape(xx.shape)
    else:
        Z = model.predict_proba(X)[:,1].reshape(xx.shape)
    
    # Create a color plot with the results
    n_classes = len(np.unique(y))
    if not plot_proba:
        ax.contourf(xx, yy, Z, alpha=0.3,
                    levels=np.arange(n_classes + 1) - 0.5,
                    cmap=cmap, zorder=1)
    else:
        ax.pcolormesh(xx, yy, Z, cmap=cmap, shading="auto")

    ax.set(xlim=xlim, ylim=ylim)

In [None]:
# run DT with varying depth and visualize limits
max_depth=9
model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
tree=model.fit(X,y)
print ('Depth, score: ', max_depth, tree.score(X,y))
visualize_classifier(model, X, y, cmap = "rainbow")

At high depth, the tree clearly goes into over-training: it starts to carve out regions around individual noisy points.
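
One way to make the over-training visible in numbers is to hold back part of the data and compare the accuracy on the training part and on the held-back part for a few depths. This is a minimal sketch, not part of the original notebook; the split, the chosen depths and the variable names (`X_m`, `depth_scores`, ...) are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# same half-moon data as above, but split into a training and a held-out part
X_m, y_m = make_moons(n_samples=100, noise=0.25, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, stratify=y_m, random_state=0)

depth_scores = {}
for depth in (1, 3, 9):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    depth_scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("max_depth=%d: train=%.2f, held-out=%.2f" % ((depth,) + depth_scores[depth]))
```

The training accuracy can only go up with depth; whether the held-out accuracy follows is what distinguishes a useful model from an over-trained one.
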
***

**Decision Trees** work well in principle; however, they are rather sensitive to over-training  
&rarr; Validation curve left for exercises

***
#### Decision tree example with real data

A frequently used data set for ML is a data set for *breast cancer diagnosis*

In [None]:
# load dataset
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print (cancer.feature_names)
print (cancer.DESCR)

In [None]:
# apply decision-tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

Without limiting the depth, the DT is grown until it reaches perfect accuracy on the training set.

But this is not really useful &rarr; over-training

Better approach:

In [None]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))


Note that the performance on the test set has improved by introducing a maximum depth of the tree. (The fact that we no longer get perfect classification on the training sample is not relevant.)
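
A depth-limited tree is small enough to be read directly. As a sketch (not part of the original notebook), `sklearn.tree.export_text` prints the learned if/else questions; the variable names `data`, `shallow_tree` and `rules` are chosen here for illustration, and the split repeats the one used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42)

# same settings as the depth-limited tree above
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
shallow_tree.fit(Xc_train, yc_train)

# text dump of the tree: one line per question or leaf
rules = export_text(shallow_tree, feature_names=list(data.feature_names))
print(rules)
```
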

#### Feature importance

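A very useful additional result of DT classification is the *feature importance*.
This gives each feature a rating between 0 and 1 for how important it is for the classification (in scikit-learn the ratings of all features sum to 1):
* 0 means the feature is not used in any split of the tree
* 1 means the feature alone accounts for all of the separation achieved by the tree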
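
The scale of these numbers is easier to interpret with a quick check. In scikit-learn the importances of a fitted tree are normalized so that they sum to 1, and a feature that is never used in a split gets exactly 0. The following sketch (variable names `demo_tree`, `importances` are assumptions; the tree is fitted on the full cancer data only for illustration) verifies both points.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
demo_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
demo_tree.fit(data.data, data.target)

importances = demo_tree.feature_importances_
print("sum of importances:", importances.sum())
print("features never used in a split:", int((importances == 0).sum()))

# three most important features, largest first
for i in np.argsort(importances)[::-1][:3]:
    print("%-25s %.3f" % (data.feature_names[i], importances[i]))
```

A tree of depth 4 contains at most 15 questions, so at least half of the 30 features necessarily end up with importance 0. An importance of 0 therefore does not prove that a feature is uninformative; it may simply be redundant with a feature that was picked instead.
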

In [None]:
print("Feature importances:\n{}".format(tree.feature_importances_))

In [None]:
# better to visualize
def plot_feature_importance( model):
    n_features = cancer.data.shape[1]
    plt.figure(figsize = (6,8))
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features);

    
plot_feature_importance( tree )

Clearly shows that the feature `worst radius` has the largest impact.

### Random Forests

Decision trees are potentially very powerful models, but they are also very sensitive to overtraining (overfitting); therefore they are normally not used directly in practice. 

However, one can mitigate or solve this problem by using an ensemble of decision trees and not just a single DT.  
The main trick is randomization:
* train many DTs but
    * each DT sees different parts of the data
    * or different set of features

This approach is called **Random Forest**:  
Many randomized trees contribute and the final decision is made by some sort of majority voting.
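
The idea can be sketched by hand in a few lines: draw a bootstrap sample (sampling with replacement) for each tree, let each tree consider only a random subset of features per split, and combine the predictions by majority vote. This is a simplified illustration under assumed settings (25 trees, `max_features=1`, names such as `X_demo` and `majority`), not a reproduction of the library implementation.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=100, noise=0.25, random_state=3)
rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary vote cannot end in a tie

trees = []
for i in range(n_trees):
    # each tree sees a different bootstrap sample of the data ...
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # ... and considers only one randomly chosen feature per split
    t = DecisionTreeClassifier(max_features=1, random_state=i)
    trees.append(t.fit(X_demo[idx], y_demo[idx]))

# majority vote over the individual tree predictions (labels are 0/1)
all_votes = np.array([t.predict(X_demo) for t in trees])
majority = (all_votes.mean(axis=0) > 0.5).astype(int)
hand_made_accuracy = (majority == y_demo).mean()
print("accuracy of hand-made forest on its training data: %.2f" % hand_made_accuracy)
```

Note that scikit-learn's `RandomForestClassifier` combines the trees by averaging their predicted class probabilities rather than by counting hard votes; for well-separated cases both give the same answer.
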

Test with half moon data using 5 DTs:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.25, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)
# random forest with 5 DT
forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))


In [None]:
visualize_classifier(forest, X, y, cmap="rainbow", plot_proba=False)

Effectively, boundaries are more complex

#### Random Forest for Cancer Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))

Much better accuracy on the test set
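
The single tree and the forest above were evaluated on different train/test splits (the tree on a stratified split with `random_state=42`, the forest on an unstratified one with `random_state=0`), so the two test accuracies are not strictly comparable. A fairer comparison, sketched here with 5-fold cross-validation as an assumed choice, evaluates both models on the same folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()

# identical folds for both models (cv=5 gives deterministic stratified folds)
tree_cv = cross_val_score(DecisionTreeClassifier(max_depth=4, random_state=0),
                          data.data, data.target, cv=5)
forest_cv = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                            data.data, data.target, cv=5)

print("single tree  : %.3f +/- %.3f" % (tree_cv.mean(), tree_cv.std()))
print("random forest: %.3f +/- %.3f" % (forest_cv.mean(), forest_cv.std()))
```
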

**Feature importance** also instructive:

In [None]:
plot_feature_importance(forest)

* many more features contribute
* `worst radius` feature no longer dominant

&rarr; Random Forests much better in classification and better exploit information of features 

Main drawback: the decision process is much less transparent than for a single DT.

An alternative to Random Forests are **Boosted** Decision Trees  
&rarr; literature
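
For orientation only, here is a minimal sketch of one common boosting variant, gradient boosting, as available in scikit-learn. Trees are added one after another, each one correcting the errors of the current ensemble. The default hyperparameters and the variable names (`gbdt`, `Xb_train`, ...) are assumptions; nothing here is tuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    data.data, data.target, random_state=0)

# default settings: many shallow trees fitted sequentially
gbdt = GradientBoostingClassifier(random_state=0)
gbdt.fit(Xb_train, yb_train)

gbdt_test_accuracy = gbdt.score(Xb_test, yb_test)
print("Accuracy on training set: {:.3f}".format(gbdt.score(Xb_train, yb_train)))
print("Accuracy on test set: {:.3f}".format(gbdt_test_accuracy))
```
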


#### Random Forest for Digits Data

We come back to the example of digit classification and apply a random forest to it:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

In [None]:
# apply RFC to digit data
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target,
                                                random_state=0)
model = RandomForestClassifier(n_estimators=1000)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

***
**Evaluate model**

In [None]:
print("Accuracy on training set: {:.3f}".format(model.score(Xtrain, ytrain)))
print("Accuracy on test set: {:.3f}".format(model.score(Xtest, ytest)))

In [None]:
from sklearn import metrics
print(metrics.classification_report(ytest, ypred))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');

***
**Excellent performance of Random Forest!**

#### Add-on: Try a more complicated dataset: 2 spirals.

In [None]:
N=200
p = sorted(np.random.random(N)*np.pi*5)
v1 = p*np.cos(p)
v2 = p*np.sin(p)
p2 = sorted(np.random.random(N)*np.pi*5+np.pi)
w1 = p*np.cos(p2)
w2 = p*np.sin(p2)
plt.scatter(v1, v2)
plt.scatter(w1, w2)
y1 = np.zeros((N, 1))
y2 = np.ones((N, 1))

In [None]:
X = np.stack([np.concatenate([v1, w1]), np.concatenate([v2, w2])], axis = 1)
y = np.concatenate([y1, y2]).ravel()

In [None]:
y.shape

In [None]:
# apply RFC to spiral data
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=1000)
forest.fit(Xtrain, ytrain)
ypred = forest.predict(Xtest)

In [None]:
(ypred==ytest).mean()

In [None]:
visualize_classifier(forest, Xtest, ytest, cmap="rainbow", plot_proba=False)

Color by predicted probability:

In [None]:
visualize_classifier(forest, Xtest, ytest, cmap="Blues", plot_proba=True)

---

Next topic: [preprocessing (scaling) of input features](DataRescaling.ipynb)

---

## Further reading
There is a nice interactive tool that helps to understand how decision trees work:

[![Screenshot](figures/screenshot_BDT_playground.png)](https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html)

This tool also lets you use rotated decision trees, originally proposed in [2006](https://ieeexplore.ieee.org/document/1677518). You can read more about this e.g. [here](https://jmlr.csail.mit.edu/papers/volume17/blaser16a/blaser16a.pdf).

## Exercise: Feed rotated features to the Classifier

A similar effect to using "rotated decision trees" (or more precisely: rotations of the feature space in ensemble learning) can be achieved by feeding the classifier a set of additional features which are rotated versions of the original input features (called augmentation of the input data).

*Your task*: Try feeding the x and y coordinates with 10 different rotations from $0$ to $\frac{\pi}{2}$ and see 
how this improves the decision contour on the moons or spiral dataset.

You can try to solve this completely on your own or follow the guidelines to a solution that are given below.

One way to rotate the input coordinates is via a rotation matrix:

In [None]:
def rotate(X, angle):
    m = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle), np.cos(angle)]])
    return np.dot(m, X.T).T
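
A quick sanity check of the convention used here (counter-clockwise rotation, points stored as rows): rotating by $\frac{\pi}{2}$ should map the point $(1, 0)$ to $(0, 1)$ and $(0, 1)$ to $(-1, 0)$, and lengths must not change. The sketch repeats the function definition so that it runs on its own; the test points are chosen for illustration.

```python
import numpy as np


def rotate(X, angle):
    # same definition as above: counter-clockwise rotation of row vectors
    m = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle), np.cos(angle)]])
    return np.dot(m, X.T).T


points = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
rotated = rotate(points, 0.5 * np.pi)
print(np.round(rotated, 6))

# a rotation preserves the distance of each point from the origin
print(np.allclose(np.linalg.norm(points, axis=1), np.linalg.norm(rotated, axis=1)))
```
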

You can then concatenate e.g. 10 rotated vectors to get the new "augmented" input. You can use a [sklearn Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) to automate this. First, define a "Transformer" that augments the input coordinates with rotated versions:

In [None]:
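from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

In [None]:
# BaseEstimator adds get_params/set_params, which sklearn utilities such as clone rely on
class RotationAugmentor(TransformerMixin, BaseEstimator):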
    
    def __init__(self, rotation_angles):
        self.rotation_angles = rotation_angles
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.concatenate([rotate(X, angle) for angle in self.rotation_angles], axis=1)

In [None]:
rotation_angles = np.linspace(0, 0.5 * np.pi, 10)
rotation_angles

In [None]:
rotation_augmentor = RotationAugmentor(rotation_angles)

We can have a look at one of the input data points to see the rotated versions:

In [None]:
Xrotated = rotation_augmentor.transform(X)
Xrotated.shape

In [None]:
plt.scatter(*Xrotated[0].reshape(-1, 2).T)

Now, define an [sklearn Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) that first transforms the input data and then fits a new `RandomForestClassifier`.

Then, fit it and plot the results.