## Objectives
- Apply tree classifiers and learn ensemble techniques such as bagging, boosting, and random forests

## Worksheet 1
- [Tree Worksheet 1](https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/worksheets/Decision+Trees/DT_wksht_2.pdf)
    1. 3-doors high safety --> circle.
    2.
    ```
    def classify(safety, number_of_doors):
        if safety == "medium" or safety == "high":
            # Note: this could be condensed, since both regions to the right of slice one classify as O
            if number_of_doors >= 5:
                return "O"
            else:
                return "O"

        if safety == "low":
            if number_of_doors >= 5:
                return "O"
            else:
                return "X"
    ```
    
- [Tree Worksheet 2](https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/worksheets/Decision+Trees/DT_wksht_3.pdf)
    1. 0 + 1/2 => 1/2
    2. 0 + 1/4 => 1/4
    3. 2/6 + 1/2 => 5/6
    4. s4 doesn't exist, not sure why this question does...
    5. 1/4 + 1/2 => 3/4
    6. 3/7 + 0 => 3/7
    
   

## Let's apply what we've learned

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("../data/breast-cancer.csv")

In [4]:
df.head()

| Unnamed: 0 | ID | Diagnosis | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | ... | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


## Implement a Decision Tree

1. Import the `DecisionTreeClassifier` from `sklearn`.
2. Look through the `sklearn` documentation page for it. What do you notice? What dials (hyperparameters) can we use to tune the model? Any questions or thoughts?
3. Apply the model to the dataset. What are the results?
4. Call `train_test_split` multiple times so that each run uses a different random split. Each time you reshuffle and refit the model, what happens? What do you notice?

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [6]:
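# Note: if "Unnamed: 0" (a leftover row index from the CSV) is a real column, it stays in X; consider dropping it too.
X = df.drop(columns=['Diagnosis', 'ID'])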
y = df['Diagnosis'] == 'M' # "True" will mean malignant, "False" will mean benign

for _ in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    print(score)

0.9649122807017544
0.9298245614035088
0.9035087719298246
0.8771929824561403
0.9210526315789473
0.9298245614035088
0.956140350877193
0.9210526315789473
0.9473684210526315
0.9210526315789473
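
The scores above move around from run to run because each call to `train_test_split` draws a different random split. As a minimal sketch of one way to make that variation repeatable while also trying one of the dials, the snippet below fixes `random_state` on both the split and the tree and compares a few `max_depth` values. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `../data/breast-cancer.csv` (so the numbers will not match the outputs above), and the depth values and seed range are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): sklearn's bundled breast cancer dataset.
# Note its target is coded 0 = malignant, 1 = benign, the opposite of y above.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

results = {}
for max_depth in (1, 3, None):  # None: grow until leaves are pure or too small to split
    scores = []
    for seed in range(5):  # a fixed seed per split makes each run repeatable
        X_train, X_test, y_train, y_test = train_test_split(
            X_demo, y_demo, test_size=0.20, random_state=seed
        )
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
        tree.fit(X_train, y_train)
        scores.append(tree.score(X_test, y_test))
    results[max_depth] = scores
    print(f"max_depth={max_depth}: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```

Rerunning this cell should print the same numbers each time, which makes it easier to tell whether a change in score comes from the dial or from the split.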


## Introducing `cross_val_score`

- Let's save ourselves some time with this nice `sklearn` utility.
- Let's look at the importance of the mean score and the standard deviation of the score. What do they mean, and why is each of them important?
  * Mean is the average accuracy over different folds of the data.
      * Lower values mean we're not generally predicting well.
  * Std. gives us an idea about the error bounds of our model (on this data).
      * Higher values mean our model has high variance across subsets of the data.

In [10]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree, X, y, cv=10)

print(scores)
print(f"mean: {np.mean(scores):.3}")
print(f"Std. Deviation: {np.std(scores):.3}")

[0.92982456 0.84210526 0.92982456 0.89473684 0.9122807  0.89473684
 0.9122807  0.94736842 0.92982456 0.94642857]
mean: 0.914
Std. Deviation: 0.0297
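
To make the two summary numbers concrete, here is a small sketch that recomputes them from the fold scores printed above. The `mean ± 2 * std` range is only a rough rule of thumb (an assumption for illustration, not a formal confidence interval), and `ddof=1` is shown because `np.std` defaults to the population formula (`ddof=0`).

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
fold_scores = np.array([
    0.92982456, 0.84210526, 0.92982456, 0.89473684, 0.9122807,
    0.89473684, 0.9122807, 0.94736842, 0.92982456, 0.94642857,
])

mean = fold_scores.mean()
std_population = fold_scores.std()      # ddof=0, what np.std(scores) reports
std_sample = fold_scores.std(ddof=1)    # slightly larger with only 10 folds

# Rough spread of fold accuracies (rule of thumb, not a confidence interval).
low, high = mean - 2 * std_population, mean + 2 * std_population

print(f"mean: {mean:.3f}")
print(f"std (ddof=0): {std_population:.4f}, std (ddof=1): {std_sample:.4f}")
print(f"rough range: {low:.3f} to {high:.3f}")
```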


## Implement a Random Forest

1. Now apply `RandomForestClassifier`, tune it with the different dials and describe the results.


### Notes
- Let's examine feature importance. Random forest gives us a nice way to do that, remember?
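
As a hedged sketch of what examining feature importance could look like: a fitted `RandomForestClassifier` exposes a `feature_importances_` attribute (impurity-based scores that sum to 1). The example uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV above, so the feature names differ from `x1`...`x30`; `n_estimators=100` and `random_state=0` are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): sklearn's bundled breast cancer dataset.
data = load_breast_cancer()
X_demo = pd.DataFrame(data.data, columns=data.feature_names)
y_demo = data.target  # here 0 = malignant, 1 = benign

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
top_features = importances.sort_values(ascending=False).head(5)
print(top_features)
```

Impurity-based importances can favor features with many distinct values, so treat the ranking as a starting point rather than a verdict.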

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
# One example, but you should try more combinations of different params.
# You could even consider a "grid search"
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

rf = RandomForestClassifier(min_samples_split=10, max_depth=5, min_samples_leaf=10)

scores = cross_val_score(rf, X, y, cv=10)
print(scores)
print(f"mean: {np.mean(scores):.2}")
print(f"Std. Deviation: {np.std(scores):.3}")

[0.98245614 0.87719298 0.9122807  0.94736842 0.98245614 0.96491228
 0.94736842 0.98245614 0.92982456 0.98214286]
mean: 0.95
Std. Deviation: 0.034
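
The grid search mentioned in the comments could be sketched as follows. This is one possible setup, not a recommended grid: the parameter values, `cv=5`, `n_estimators=50`, and `random_state=0` are illustrative assumptions, and scikit-learn's bundled `load_breast_cancer` data stands in for the CSV.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption): sklearn's bundled breast cancer dataset.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# A deliberately small, hypothetical grid so the search finishes quickly.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 10],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

`GridSearchCV` refits the best combination on all the data by default, so `search.best_estimator_` can be used directly afterwards.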
