# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with a Bagging classifier. We will do that on a few datasets, starting with the ones offered by scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?


#### 1. Load the data and create X and y

In [2]:
# The usual default imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Get the breast cancer dataset from scikit-learn
from sklearn.datasets import load_breast_cancer

In [4]:
# Load the data into a variable
data = load_breast_cancer()

In [5]:
# The data from scikit-learn is stored in a dictionary-like object, so we can view the keys.
data.keys()

['target_names', 'data', 'target', 'DESCR', 'feature_names']

In [6]:
# Converting data into a dataframe structure 
X = pd.DataFrame(data['data'], columns=data['feature_names'])
# Setting up our Y value as well
y = pd.Series(data['target'])

In [7]:
# Quick view of the data
X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [8]:
# Checking the distribution of malignant and benign
y.value_counts()/y.count()
# Class 0: Malignant
# Class 1: Benign

1    0.627417
0    0.372583
dtype: float64

#### 2. Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds

In [9]:
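# Let's get to bagging our decision trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
# cross_val_score lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import cross_val_score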

In [10]:
# Set up the Decision Tree Classifier
dt = DecisionTreeClassifier()

In [11]:
# Creating a function to easily do cross validations on different models
def do_cross_val(model):
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    return scores.mean(), scores.std()

do_cross_val(dt)

(0.91919969218930342, 0.014955371601639518)

#### 3. Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.

In [12]:
bdt = BaggingClassifier(DecisionTreeClassifier())
# Remember, you have to put a classifier into the bag first

In [13]:
do_cross_val(bdt)

(0.94739515198153135, 0.016274615508727188)

#### 4. Which score is better? Are the scores significantly different? How can you judge that?

- The bagged model scores slightly higher (mean accuracy 94.7%) than the single decision tree (91.9%).
- The standard deviation across folds is low for both models (about 0.015 for the tree and 0.016 for the bagged model), so the scores from the individual folds are very similar. The gap between the means (about 2.8 points) is less than twice either standard deviation, so with only five folds the improvement is suggestive rather than clearly significant.
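
One way to judge the difference more directly is to score both models on exactly the same folds and look at the per-fold differences. The sketch below is only an illustration: the names `X_bc`, `y_bc`, `folds`, `tree_scores`, `bag_scores` and `diff` are introduced here, and the fixed `random_state` values are an assumption made so that the comparison is repeatable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Fixed, shuffled folds so both models are evaluated on identical splits.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_bc, y_bc, cv=folds
)
bag_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
    X_bc, y_bc, cv=folds,
)

# Per-fold differences (bagging minus single tree).
diff = bag_scores - tree_scores
print("tree:", tree_scores.round(3))
print("bag: ", bag_scores.round(3))
print("mean diff = %.3f, std of diff = %.3f" % (diff.mean(), diff.std()))
```

If the mean difference is small compared with the spread of the differences, the two models are hard to tell apart. With only five folds this remains a rough check rather than a formal test.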



------

### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Does the score improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?



#### 1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.

In [14]:
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline

In [15]:
# Pipeline for just a decision tree with a robust scaler.
# You can use whatever scaler you like.
pipedt = make_pipeline(RobustScaler(),
                       DecisionTreeClassifier())

pipebdt = make_pipeline(RobustScaler(),
                        BaggingClassifier(DecisionTreeClassifier()))

Robust scaling centers each feature on its median and divides by its interquartile range (IQR, the 75th percentile minus the 25th), i.e. `(x - median) / IQR`. Because the median and the IQR are barely affected by extreme values, it is an effective scaling technique for data that contains outliers.

It is included here mainly to introduce you to another scaling method.
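
As a small check of what `RobustScaler` does with its default settings, the sketch below compares it with a manual `(x - median) / IQR` computation. The toy matrix `X_toy` and the injected outlier are made-up illustration data, not part of the lab.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.RandomState(0)

# Hypothetical toy data: two features on very different scales.
X_toy = np.column_stack([
    rng.normal(10, 2, size=200),
    rng.normal(1000, 300, size=200),
])
X_toy[0] = [100.0, 50000.0]  # inject one extreme outlier per column

# Manual robust scaling: subtract the median, divide by the IQR.
median = np.median(X_toy, axis=0)
q25, q75 = np.percentile(X_toy, [25, 75], axis=0)
manual = (X_toy - median) / (q75 - q25)

# RobustScaler with default settings (centering on, 25-75 quantile range).
scaled = RobustScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
```

The outlier barely moves the median or the IQR, so the remaining rows end up on a comparable scale in both columns.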

#### 2. Which score is better? Are the scores significantly different? How can you judge that?

In [16]:
do_cross_val(pipedt)

(0.91919969218930342, 0.014955371601639518)

In [17]:
do_cross_val(pipebdt)

(0.94739515198153135, 0.016274615508727188)

- Even though the features have been scaled, the results are identical to the unscaled ones. This is expected: a decision tree splits each feature on a threshold, and rescaling a feature moves the threshold without changing which samples fall on each side, so tree-based models are largely unaffected by feature scaling.
- Just like in our previous answer, the scores have very low standard deviations, indicating that the results vary little between cross-validation folds.

#### 3. Are the scores different from the non-scaled data?
- No, the scores are exactly the same.
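
To see the effect of scaling without run-to-run randomness, the comparison can be repeated with a fixed seed and fixed folds. This is a minimal sketch: the names `raw_tree`, `scaled_tree`, `raw_scores` and `scaled_scores` are introduced here, and the seeds are an assumption made for repeatability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Same tree settings and seed, with and without a scaling step.
raw_tree = DecisionTreeClassifier(random_state=0)
scaled_tree = make_pipeline(RobustScaler(), DecisionTreeClassifier(random_state=0))

raw_scores = cross_val_score(raw_tree, X_bc, y_bc, cv=folds)
scaled_scores = cross_val_score(scaled_tree, X_bc, y_bc, cv=folds)

print("unscaled:", raw_scores.round(3), raw_scores.mean().round(3))
print("scaled:  ", scaled_scores.round(3), scaled_scores.mean().round(3))
```

The two sets of scores should be the same or very nearly so. Any small difference would come from numerical details of how thresholds are chosen, not from the scaling carrying new information.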

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?

- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search

- Does the score improve for the Grid-searched Bagging Classifier?
> Yes
- Which score is better? Are the scores significantly different? How can you judge that?
> The grid search bagging classifier is the best one

#### 1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier

In [18]:
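# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV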

#### 2. Search for a few values of the parameters in order to improve the score of the classifier

In [19]:
# Be careful when setting grid search params, as some of them can step on
# each other's toes with the Decision Tree.
params = {"max_depth": [3,5,10,20],
          "max_features": [None, "auto"],
          "min_samples_leaf": [1, 3, 5, 7, 10],
          "min_samples_split": [2, 5, 7]
         }
    

gsdt = GridSearchCV(dt, params, n_jobs=-1, cv=5)

#### 3. Use the whole X, y dataset for your test

In [20]:
# Surprisingly, this does not take too long
gsdt.fit(X, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [None, 'auto'], 'min_samples_split': [2, 5, 7], 'max_depth': [3, 5, 10, 20], 'min_samples_leaf': [1, 3, 5, 7, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [21]:
gsdt.best_params_

{'max_depth': 5,
 'max_features': 'auto',
 'min_samples_leaf': 5,
 'min_samples_split': 5}

#### 4. Check the best\_score\_ once you've trained it. Is it better than before?

In [22]:
gsdt.best_score_

0.95079086115992972

It did improve!

In [23]:
bdt.get_params()

{'base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=None, splitter='best'),
 'base_estimator__class_weight': None,
 'base_estimator__criterion': 'gini',
 'base_estimator__max_depth': None,
 'base_estimator__max_features': None,
 'base_estimator__max_leaf_nodes': None,
 'base_estimator__min_samples_leaf': 1,
 'base_estimator__min_samples_split': 2,
 'base_estimator__min_weight_fraction_leaf': 0.0,
 'base_estimator__presort': False,
 'base_estimator__random_state': None,
 'base_estimator__splitter': 'best',
 'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

#### 5. How does the score of the Grid-searched DT compare with the score of the Bagging DT?

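- It is better by about 0.3 percentage points (0.951 vs. 0.947). That gap is much smaller than the fold-to-fold standard deviation of the Bagging DT (about 0.016), so the two are effectively tied.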

#### 6. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier

In [24]:
# One more time, more features and bootstrapped
params = {"base_estimator__max_depth": [3,5,10,20],
          "base_estimator__max_features": [None, "auto"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }
    

gsbdt = GridSearchCV(bdt, params, n_jobs=-1, cv=5)

#### 7. Repeat the search

In [25]:
# This took a really long time to run.  
gsbdt.fit(X, y)
# Like, I made a grilled cheese and ate it before this was over.
# 10-20 mins

KeyboardInterrupt: 

Grid searches this slow are why we run them towards the end of the modelling phase, once we understand the data and know which parameters matter, so that we are not spending time on parameter values that are unlikely to help.
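
The size of a grid can be counted before launching the search. The sketch below rebuilds the bagging grid from above, counts its candidates with `ParameterGrid`, and then tries a fixed budget of randomly sampled candidates with `RandomizedSearchCV`. The random search is a swapped-in technique, not what the lab ran. Two more assumptions: `"sqrt"` stands in for `"auto"` (newer scikit-learn releases no longer accept `"auto"`), and the `prefix` lookup handles the wrapped model being called either `base_estimator` or `estimator` depending on the installed version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# The wrapped model is exposed as "estimator" in newer releases and
# "base_estimator" in older ones; use whichever this install provides.
prefix = "estimator" if "estimator" in bag.get_params() else "base_estimator"

grid = {
    prefix + "__max_depth": [3, 5, 10, 20],
    prefix + "__max_features": [None, "sqrt"],  # "sqrt" replaces "auto"
    prefix + "__min_samples_leaf": [1, 3, 5, 7, 10],
    prefix + "__min_samples_split": [2, 5, 7],
    "bootstrap_features": [False, True],
    "max_features": [0.5, 0.7, 1.0],
    "max_samples": [0.5, 0.7, 1.0],
    "n_estimators": [2, 5, 10, 20],
}

# Every combination is one candidate; each candidate is fit once per fold.
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5
print(n_candidates, "candidates ->", n_fits, "fits with 5-fold CV")

# Sample only 20 candidates instead of trying all of them.
search = RandomizedSearchCV(bag, grid, n_iter=20, cv=5, random_state=0)
search.fit(X_bc, y_bc)
print(search.best_score_, search.best_params_)
```

A random sample will not necessarily find the best combination in the grid, but it gives a quick sense of which parameter ranges are worth a finer exhaustive search.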

In [26]:
gsbdt.best_params_

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

#### 8. Does the score improve for the Grid-searched Bagging Classifier?



In [None]:
# But that accuracy score...
gsbdt.best_score_


#### 9. Which score is better? Are the scores significantly different? How can you judge that?

- An accuracy score of 97% is really good; however, that is just the best mean score, so it would be important to also consider the standard deviation of the scores across folds.
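
A fitted `GridSearchCV` keeps the per-candidate mean and standard deviation in `cv_results_`, so the spread behind `best_score_` can be read directly. The sketch below uses a deliberately small decision-tree grid (`small_grid`, an assumption chosen so it runs quickly) rather than the long bagging search.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Small illustrative grid; not the grid used in the lab.
small_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=5)
search.fit(X_bc, y_bc)

# best_index_ points at the winning row of cv_results_.
best = search.best_index_
mean_best = search.cv_results_["mean_test_score"][best]
std_best = search.cv_results_["std_test_score"][best]
print("best mean accuracy: %.3f +/- %.3f" % (mean_best, std_best))
```

Comparing `std_best` with the gap between two models' best scores gives a rough sense of whether that gap is meaningful.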

## 2. Diabetes and Regression

scikit-learn has a dataset of diabetic patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?

- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?



#### 1. Load the data and create X and y

In [41]:
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = data['data']
y = data['target']

In [42]:
import collections 
collections.Counter(y)

Counter({25.0: 1,
         31.0: 1,
         37.0: 1,
         39.0: 2,
         40.0: 1,
         42.0: 3,
         43.0: 1,
         44.0: 1,
         45.0: 1,
         47.0: 2,
         48.0: 3,
         49.0: 3,
         50.0: 1,
         51.0: 3,
         52.0: 4,
         53.0: 4,
         54.0: 1,
         55.0: 4,
         57.0: 1,
         58.0: 1,
         59.0: 4,
         60.0: 3,
         61.0: 2,
         63.0: 4,
         64.0: 3,
         65.0: 4,
         66.0: 2,
         67.0: 2,
         68.0: 3,
         69.0: 3,
         70.0: 2,
         71.0: 5,
         72.0: 6,
         73.0: 1,
         74.0: 2,
         75.0: 2,
         77.0: 4,
         78.0: 3,
         79.0: 1,
         80.0: 1,
         81.0: 2,
         83.0: 3,
         84.0: 4,
         85.0: 4,
         86.0: 1,
         87.0: 2,
         88.0: 4,
         89.0: 2,
         90.0: 5,
         91.0: 4,
         92.0: 2,
         93.0: 2,
         94.0: 3,
         95.0: 2,
         96.0: 4,
         9

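We have a lot of different target values. The most frequent one (200) appears only 6 times in our target data. If we predicted 200 for every sample, we would match only 6 of the 442 targets exactly (0.01357). Exact matches are not a meaningful yardstick for regression, though: with R², a model that always predicts the same constant scores at most 0, so any positive R² beats that baseline. It is a pretty low bar to beat to create a useful model.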
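
A constant-prediction baseline can be scored with the same cross-validation and metric as the real models. This is a minimal sketch using scikit-learn's `DummyRegressor`. The mean is used as the constant here (an assumption, since the text above considers the value 200) because the mean is the usual R² reference point.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

X_db, y_db = load_diabetes(return_X_y=True)

# Always predicts the mean of the training targets, ignoring the features.
baseline = DummyRegressor(strategy="mean")
baseline_scores = cross_val_score(baseline, X_db, y_db, cv=5, scoring="r2")

print("baseline R2 per fold:", baseline_scores.round(4))
print("mean = %.4f, std = %.4f" % (baseline_scores.mean(), baseline_scores.std()))
```

Each fold's score is at or just below zero, because the training mean is never a better constant than the held-out fold's own mean. Any model worth keeping should score clearly above this.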

#### 2. Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?

- I used R² (`scoring='r2'`).

In [43]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

def do_cross_val(model):
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='r2')
    return scores.mean(), scores.std()


In [44]:
dtr = DecisionTreeRegressor()
do_cross_val(dtr)

(-0.11437863045387864, 0.12602191112786265)

#### 3. Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.

In [45]:
bdtr = BaggingRegressor(DecisionTreeRegressor())
do_cross_val(bdtr)

(0.33561365300123336, 0.054215295215910203)

#### 4. Which score is better? Are the scores significantly different? How can you judge that?

> The Bagging Regressor is better (R² of 0.34 vs. -0.11). The R² of the baseline decision tree is really bad: a value below zero means it does worse than always predicting the mean.
> Additionally, look at the standard deviation of the scores from the cross_val_score output. While the bagged DT's standard deviation (0.054) is not great, it is a significant improvement over the baseline DT's (0.126).

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
> Answer: they are the same (within error); grid search improved the score of the simple DT
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search

- Does the score improve for the Grid-searched Bagging Regressor?
> Yes
- Which score is better? Are the scores significantly different? How can you judge that?
> The grid search bagging regressor is the best one

#### 1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor

In [46]:
dtr = DecisionTreeRegressor()


#### 2. Search for a few values of the parameters in order to improve the score of the regressor

In [47]:
params = {"splitter": ['best', 'random'],
          "max_depth": [3,5,10,20],
          "max_features": [None, "auto"],
          "min_samples_leaf": [1, 3, 5, 7, 10],
          "min_samples_split": [2, 5, 7]
         }
    

gsdtr = GridSearchCV(dtr, params, n_jobs=-1, cv=5)

#### 3. Use the whole X, y dataset for your test

In [48]:
gsdtr.fit(X, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [None, 'auto'], 'splitter': ['best', 'random'], 'min_samples_split': [2, 5, 7], 'max_depth': [3, 5, 10, 20], 'min_samples_leaf': [1, 3, 5, 7, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [49]:
gsdtr.best_params_

{'max_depth': 20,
 'max_features': 'auto',
 'min_samples_leaf': 10,
 'min_samples_split': 2,
 'splitter': 'random'}

#### 4. Check the best_score_ once you've trained it. Is it better than before?

In [50]:
gsdtr.best_score_

0.38995337926147983

I'd say it's better: 0.39 against the untuned tree's -0.11.

#### 5. How does the score of the Grid-searched DT compare with the score of the Bagging DT?

The best score from the grid-searched DT (0.39) is about 5 points higher than the mean score of the Bagging DT (0.34). However, the Bagging DT's standard deviation was 0.05, so the gap is only about one standard deviation, and it is possible that the Bagging DT scored 0.39 or higher on some folds. The two are not clearly different.

Might as well combine the two and see what happens.

#### 6. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor

In [51]:
params = {"base_estimator__splitter": ['best', 'random'],
          "base_estimator__max_depth": [3,5,10,20],
          "base_estimator__max_features": [None, "auto"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }
    

gsbdtr = GridSearchCV(bdtr, params, n_jobs=-1, cv=5)

#### 7. Repeat the search

In [52]:
gsbdtr.fit(X, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [2, 5, 10, 20], 'max_samples': [0.5, 0.7, 1.0], 'base_estimator__min_samples_split': [2, 5, 7], 'base_estimator__max_depth': [3, 5, 10, 20], 'bootstrap_features': [False, True], 'base_estimator__splitter': ['best', 'random'], 'max_features': [0.5, 0.7, 1.0], 'base_estimator__min_samples_leaf': [1, 3, 5, 7, 10], 'base_estimator__max_features': [None, 'auto']},
       pre_dispatch='2*

In [53]:
gsbdtr.best_params_

{'base_estimator__max_depth': 10,
 'base_estimator__max_features': 'auto',
 'base_estimator__min_samples_leaf': 7,
 'base_estimator__min_samples_split': 5,
 'base_estimator__splitter': 'random',
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 5}

#### 8. Does the score improve for the Grid-searched Bagging Regressor?

In [54]:
gsbdtr.best_score_

0.48609184780132547

#### 9. Which score is better? Are the scores significantly different? How can you judge that?

This one: its R² is about 0.10 higher (0.49 vs. 0.39), but it is still a relatively low score. What is probably more important to us is how this last model uses the data, so we can find distinguishing features and values.

# Questions:
#### Can you use `criterion` in a grid search?
> Yes. However, Gini vs. entropy only applies to Decision Tree Classifiers. Decision Tree Regressors take error measures such as MSE or MAE as their criterion.
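
As a minimal sketch, `criterion` goes into the parameter grid like any other setting. The grid values and the names `criterion_grid` and `criterion_search` are illustrative. Only the classifier is shown, because the accepted regressor criterion names differ between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# criterion is searched alongside an ordinary tree parameter.
criterion_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5, 10]}
criterion_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), criterion_grid, cv=5
)
criterion_search.fit(X_bc, y_bc)

print(criterion_search.best_params_, criterion_search.best_score_)
```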

#### How to bag a list of models.
> There is no logical reason to oppose the use of multiple different classifiers in an ensemble model. However, scikit-learn's bagging classifier will only accept a single classifier type. Combining different classifiers to reach an aggregate conclusion is usually called "voting" when their predictions are simply combined, or "stacking" when another model learns how to combine them.
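
One way to combine different model types in scikit-learn is `VotingClassifier`, which takes a list of named estimators and aggregates their predictions. The three member models below are an arbitrary illustration, and the scaling pipelines around two of them are an assumption made because those models are sensitive to feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Three different model types; each casts one vote per sample.
voter = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="hard",  # majority vote on predicted class labels
)

vote_scores = cross_val_score(voter, X_bc, y_bc, cv=5)
print(vote_scores.mean(), vote_scores.std())
```

Unlike bagging, the diversity here comes from the different algorithms rather than from resampling the training data.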

#### Cross Validation: does it just fit one model and test against many information sets?
> Cross validation is a model evaluation/validation technique. Ensembling is a model building/tuning technique. We must keep in mind that ensembles are themselves models, like a team of models working together as one. We are not assessing the performance of each of the member models, but using them together. One important feature of some ensembles is the ability to randomly select features, which does not occur in cross validation.
> Cross validation does not fit just one model: for each fold it fits a fresh copy of the model on the remaining folds and tests it on the held-out fold, so it does not matter whether the model you pass in was already fit. The spread of scores across the folds (data subsets) is a way to assess how well your model handles variance and/or bias.
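
The loop that `cross_val_score` performs can be written out by hand. This sketch assumes a fixed seed and an explicit `KFold` object so that both versions use identical splits. `manual_scores` and `library_scores` are names introduced here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual cross validation: a fresh, unfitted copy of the model per fold.
manual_scores = []
for train_idx, test_idx in folds.split(X_bc):
    fold_model = clone(model)
    fold_model.fit(X_bc[train_idx], y_bc[train_idx])
    manual_scores.append(fold_model.score(X_bc[test_idx], y_bc[test_idx]))

library_scores = cross_val_score(model, X_bc, y_bc, cv=folds)

print(np.allclose(manual_scores, library_scores))
# The model passed in is never fitted itself; only its clones are.
print(hasattr(model, "tree_"))
```

Five separate models are trained and then discarded. Only their scores are kept.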

#### Is it possible to create a model where the output is two independent features/predictions?
> Yes, for some models. It is possible to use the same dataset with multiple y values, and several scikit-learn estimators, decision trees among them, accept a two-dimensional y and predict all targets at once. For models whose coefficients are fit to a single target, the usual approach is to fit one model per target.
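
For reference, scikit-learn's decision trees accept a target with more than one column and predict all columns at once. The toy data below (`X_toy`, `Y_toy`) is made up purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Hypothetical data: one input feature, two targets derived from it.
X_toy = rng.uniform(-3, 3, size=(200, 1))
Y_toy = np.column_stack([np.sin(X_toy[:, 0]), np.cos(X_toy[:, 0])])

# A single tree is fit to both target columns together.
multi_tree = DecisionTreeRegressor(max_depth=5, random_state=0)
multi_tree.fit(X_toy, Y_toy)

pred = multi_tree.predict(X_toy[:3])
print(pred.shape)  # one row per sample, one column per target
```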