## Objectives
- Apply tree classifiers and learn ensemble techniques such as Bagging, Boosting and Random Forest

## Decision Trees
Let's walk through a simple decision tree. I'll draw one out.

## Brainstorm & Discussion
- In what real-life contexts do you think this technique could be useful?
- Where do you think it would be less useful?

## The Mechanics of Trees
I'll draw out a tree in a 2-dimensional space. Let's see how classification and accuracy work. 
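
As a rough companion to the drawing, here is a minimal sketch of a depth-2 tree written as plain `if` statements. The points, labels and split thresholds are all made up for illustration; the idea is only that each split is an axis-aligned line in the 2-dimensional space, and that accuracy is the fraction of points landing in a region with the right label.

```python
# Toy 2-D points (x1, x2) with made-up labels; not taken from any dataset in these notes.
points = [(1.0, 1.0), (2.0, 4.0), (3.0, 1.5), (6.0, 2.0), (7.0, 5.0), (8.0, 1.0)]
labels = ["A", "B", "A", "B", "B", "A"]


def predict(x1, x2):
    """A hand-written depth-2 tree: each `if` is one axis-aligned split."""
    if x2 <= 3.0:        # first split: horizontal line at x2 = 3
        if x1 <= 5.0:    # second split: vertical line at x1 = 5 (lower region only)
            return "A"
        return "B"
    return "B"


predictions = [predict(x1, x2) for x1, x2 in points]

# Accuracy = number of correct predictions / number of points.
correct = sum(pred == truth for pred, truth in zip(predictions, labels))
accuracy = correct / len(labels)
print(predictions)
print(f"accuracy: {accuracy:.3f}")
```

The last point, `(8.0, 1.0)`, falls in the lower-right region and is misclassified, so this tree gets 5 of the 6 toy points right.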

## Worksheet 1
- [Tree Worksheet 1](https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/worksheets/Decision+Trees/DT_wksht_2.pdf)
   

## Inside the Black Box
Let's work through an example and define the procedure.

- [Tree Worksheet 2](https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/worksheets/Decision+Trees/DT_wksht_3.pdf)

### Notes
- Gini impurity index
- Feature importance
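
To make the Gini impurity note concrete, here is a small sketch in plain Python. The helper names `gini_impurity` and `split_impurity` are ours, not part of any library, and the label lists are toy data. A tree picks the split whose children have the lowest weighted impurity; impurity-based feature importance is, roughly, the total impurity reduction credited to each feature.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Weighted average impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = ["M", "M", "M", "B", "B", "B"]  # toy node: 3 malignant, 3 benign

print(gini_impurity(parent))                                # 0.5 (worst case for two classes)
print(split_impurity(["M", "M", "M"], ["B", "B", "B"]))     # 0.0 (a perfect split)
print(split_impurity(["M", "M", "B"], ["M", "B", "B"]))     # in between: a weaker split
```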

## The Problem with Trees

* Strengths and weaknesses:
  * Prone to overfitting, esp. as you make the tree deeper.
  * The model is very understandable, and provides clear rules for predictions.
  * Flexible with respect to training data (continuous and categorical features both work well, and missing values can themselves be used as a splitting criterion).
  * Depending on how big a tree you want, fitting can be expensive.
  * But... honestly, they are often still much faster than something like a Neural Network.
  * Not great at regression: predictions are piecewise constant, so a tree cannot extrapolate beyond the target values it saw in training.
  * Like many learning algorithms, trees perform poorly with small amounts of training data.


### Bagging (Bootstrap Aggregation) and Feature Selection

* Bagging trains each tree on a bootstrap sample, i.e. rows drawn from the training data with replacement, and then aggregates the trees' predictions.
    * It's common to draw as many rows as the training set contains, but this is configurable.
    * Because the sampling uses replacement, a given sample will very likely contain duplicates.
    * Can help reduce overfitting, and causes each tree to make different splitting decisions.
* Feature Selection means that at each split we randomly select a subset of the features that are eligible to be split on.
    * Common choices for the number of features are sqrt(n) and log(n), where n is the total number of features.
    * Forces each tree to make different splitting decisions.
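
The two sources of randomness above can be sketched with nothing but the standard library. The sizes (`n_rows`, `n_features`) and the seed are illustrative assumptions; a real implementation would index into an actual training set.

```python
import math
import random

rng = random.Random(0)  # seeded only so the sketch is repeatable

n_rows = 10       # assumed size of a tiny training set
n_features = 30   # e.g. thirty feature columns

# Bagging: draw n_rows row indices WITH replacement, so duplicates are expected.
bootstrap_rows = [rng.randrange(n_rows) for _ in range(n_rows)]

# Rows that were never drawn are "out of bag" for this particular tree.
left_out = sorted(set(range(n_rows)) - set(bootstrap_rows))

# Feature selection: at a single split, draw sqrt(n) features WITHOUT replacement.
k = max(1, int(math.sqrt(n_features)))
candidate_features = rng.sample(range(n_features), k)

print("bootstrap rows:    ", bootstrap_rows)
print("rows left out:     ", left_out)
print("features to try:   ", candidate_features)
```

Re-running the feature draw at every split, and the row draw for every tree, is what pushes the trees toward different splitting decisions.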

## Read & Discuss
Read this [Intuitive explanation of Random Forest](https://www.quora.com/How-does-randomization-in-a-random-forest-work/answer/Edwin-Chen-1?share=1&srid=h5QG) and discuss with your group.

Separate into pairs and practice explaining to the best of your understanding how Random Forest works.

This is good practice for every model you learn going forward.

## The Random Forest Procedure
Let's lay it out. What are the steps?

1. Pick the number of trees and define their hyperparameters (max depth, min samples per split, ...).
1. For each tree, perform bagging to create the training sample for that tree.
    1. At each split, randomly sample the features that are eligible for that split.
    1. Determine the best split among those features, and split!
    1. Repeat the two sub-steps above on each new node until every branch ends in a leaf node.
1. Repeat step 2 until all the trees are built.

At classification time, run the new sample through all the trees, and pick the class based on the vote of all the individual trees.
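
The voting step can be sketched as follows. The three "trees" here are hypothetical stand-ins with made-up thresholds, not fitted models; the point is only how the individual votes are combined. (As far as we know, scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes, so treat this as the textbook version.)

```python
from collections import Counter


# Hypothetical stand-ins for fitted trees: each maps a sample to a class label.
def tree_a(sample):
    return "M" if sample["x1"] > 15 else "B"


def tree_b(sample):
    return "M" if sample["x2"] > 15 else "B"


def tree_c(sample):
    return "M" if sample["x1"] > 12 and sample["x2"] > 10 else "B"


def forest_predict(trees, sample):
    """Run the sample through every tree and return the most common label."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]          # an odd number of trees avoids ties
sample = {"x1": 17.99, "x2": 10.38}

print([tree(sample) for tree in trees])   # individual votes
print(forest_predict(trees, sample))      # majority class
```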

## Let's apply what we've learned

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../data/breast-cancer.csv")

In [3]:
df.head()

| Unnamed: 0 | ID | Diagnosis | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | ... | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


## Exercise
1. Import the `DecisionTreeClassifier` from `sklearn`.
2. Look through the documentation page on `sklearn`. What do you notice? What are our dials that we can use to tune the model? Any questions or thoughts?
3. Apply the model to the dataset. What are the results?
4. Call `train_test_split` multiple times so that each run gets a different random split. Each time you reshuffle and rerun the model, what happens? What do you notice?

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [5]:
X = df.drop(columns=['Diagnosis', 'ID'])
y = df['Diagnosis'] == 'M' # "True" will mean malignant, "False" will mean benign

for _ in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    print(score)

0.9361702127659575
0.9095744680851063
0.8936170212765957
0.9202127659574468
0.925531914893617
0.9042553191489362
0.925531914893617
0.9308510638297872
0.9414893617021277
0.925531914893617


## Introducing `cross_val_score`

- Let's save ourselves some time with this nice `sklearn` utility.
- Let's look at the mean score and the standard deviation of the score. What do they mean, and why is each of them important?
  * Mean is the average accuracy over different folds of the data.
      * Lower values mean we're not generally predicting well.
  * Std. gives us an idea about the error bounds of our model (on this data).
      * Higher values mean our model has high variance across subsets of the data.
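
To see what these two numbers summarize, here is a small sketch that computes them by hand for the ten accuracies printed by the train/test loop earlier. It uses only the standard library; `statistics.pstdev` is the population standard deviation, which is also what `np.std` computes by default.

```python
import statistics

# The ten test accuracies printed by the repeated train/test split loop above.
split_scores = [
    0.9361702127659575, 0.9095744680851063, 0.8936170212765957,
    0.9202127659574468, 0.925531914893617, 0.9042553191489362,
    0.925531914893617, 0.9308510638297872, 0.9414893617021277,
    0.925531914893617,
]

mean_score = statistics.mean(split_scores)    # how well we predict on average
std_score = statistics.pstdev(split_scores)   # how much the score moves between splits

print(f"mean: {mean_score:.3}")
print(f"Std. Deviation: {std_score:.3}")
print(f"min: {min(split_scores):.3}, max: {max(split_scores):.3}")
```

A single split could have reported anything between the min and the max, which is why one score on its own is a shaky basis for comparing models.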

In [6]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree, X, y, cv=10)

print(scores)
print(f"mean: {np.mean(scores):.3}")
print(f"Std. Deviation: {np.std(scores):.3}")

[0.9122807  0.85964912 0.92982456 0.87719298 0.94736842 0.89473684
 0.89473684 0.94736842 0.92982456 0.89285714]
mean: 0.909
Std. Deviation: 0.0282


## Exercise
1. Now apply `RandomForestClassifier`, tune it with the different dials and describe the results.


### Notes
- Let's examine feature importance. Random forest gives us a nice way to do that, remember?

In [7]:
from sklearn.ensemble import RandomForestClassifier

In [8]:
# One example, but you should try more combinations of different params.
# You could even consider a "grid search"
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

rf = RandomForestClassifier(min_samples_split=10, max_depth=5, min_samples_leaf=10)

scores = cross_val_score(rf, X, y, cv=10)
print(scores)
print(f"mean: {np.mean(scores):.3}")
print(f"Std. Deviation: {np.std(scores):.3}")

[1.         0.87719298 0.92982456 0.94736842 0.98245614 0.96491228
 0.94736842 0.98245614 0.92982456 0.98214286]
mean: 0.954
Std. Deviation: 0.0344
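
For the feature importance note, a fitted random forest exposes `feature_importances_`. The sketch below is self-contained, so it makes one loud assumption: it uses scikit-learn's bundled `load_breast_cancer` dataset as a stand-in for `../data/breast-cancer.csv`, which means the column names will differ from `x1`...`x30`. The `_demo` names and `random_state=0` are ours.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# ASSUMPTION: the bundled dataset stands in for ../data/breast-cancer.csv.
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

# Same hyperparameters as the example above; random_state only makes reruns repeatable.
rf_demo = RandomForestClassifier(
    min_samples_split=10, max_depth=5, min_samples_leaf=10, random_state=0
)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

print(importances.head(10))
```

With the notebook's own data, the equivalent would be to fit `rf` on `X` and `y` and build the series from `rf.feature_importances_` and `X.columns`.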
