# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with a Bagging classifier. We will do that on a few datasets, starting with the ones offered by scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y.
2. Initialize a Decision Tree Classifier and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
3. Wrap a Bagging Classifier around the Decision Tree Classifier and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How can you judge that?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

In [5]:
## 2 - initialize decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier()


def cross_validate(model):
    return cross_val_score(model, X, y, cv=5, n_jobs=-1)


base = cross_validate(tree)
base.mean()



0.90861100423239716

In [6]:
## 3 - bagging classifier

bag = BaggingClassifier(tree)
bag_score = cross_validate(bag)
bag_score.mean()



0.95093497499038082

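Bagging scores better: its mean 5-fold accuracy is about 0.951, against about 0.909 for the single decision tree. The means alone do not show whether the difference is significant; for that, compare the per-fold scores and their spread.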
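
One rough way to judge whether the difference is meaningful is to look at the per-fold scores rather than only their means. The sketch below is an illustration, not part of the original lab: it fixes `random_state` (an assumption added for repeatability) and uses its own variable names (`X_bc`, `y_bc`) so it does not disturb the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# random_state=0 is an arbitrary choice made here so the run is repeatable
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_bc, y_bc, cv=5)
bag_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
    X_bc, y_bc, cv=5)

# Both models are scored on the same 5 folds, so the differences are paired
fold_diff = bag_scores - tree_scores

print("tree:    %.3f +/- %.3f" % (tree_scores.mean(), tree_scores.std()))
print("bagging: %.3f +/- %.3f" % (bag_scores.mean(), bag_scores.std()))
print("per-fold difference (bagging - tree):", np.round(fold_diff, 3))
```

If the gap between the means is small compared with the fold-to-fold standard deviation, or the per-fold differences change sign, the evidence for a real improvement is weak. With only five folds this is a rough check rather than a formal test.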

### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
Let's practice pipelines and scaling!

1. Create 2 pipelines, each with a scaling preprocessing step followed by either a decision tree or a bagged decision tree.
2. Which score is better? Are the scores significantly different? How can you judge that?
3. Are the scores different from those on the non-scaled data?

In [7]:
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline

base_pipe = make_pipeline(RobustScaler(),
                       DecisionTreeClassifier())

bagged_pipe = make_pipeline(RobustScaler(),
                        BaggingClassifier(DecisionTreeClassifier()))

print(cross_validate(base_pipe).mean())
print(cross_validate(bagged_pipe).mean())

0.908611004232
0.95093497499


The scores are the same as on the non-scaled data.
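
This is what one would expect: a decision tree splits on thresholds, so a per-feature rescaling that preserves the ordering of the values should leave the learned partitions essentially unchanged. The sketch below is a small check under assumptions that are not in the original lab (a single `train_test_split` and a fixed `random_state`). It fits the same tree on raw and on scaled features and measures how often the two sets of predictions agree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=0, stratify=y_bc)

# Fit the scaler on the training split only, as a pipeline would
scaler = RobustScaler().fit(X_train)

# Same tree settings and seed; only the feature scale differs
raw_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

agreement = np.mean(raw_pred == scaled_pred)
print("fraction of identical predictions:", agreement)
```

Agreement at or very close to 1.0 indicates that scaling does not change what the tree learns. Small deviations can still occur through floating-point effects in threshold selection.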

### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Classifier.
2. Search over a few values of the parameters in order to improve the score of the classifier.
3. Use the whole X, y dataset for your test.
4. Check the best\_score\_ once you've trained it. Is it better than before?
5. How does the score of the grid-searched DT compare with the score of the Bagging DT?
6. Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Classifier.
    - This can take a really long time.
7. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid space too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
8. Does the score improve for the grid-searched Bagging Classifier?
9. Which score is better? Are the scores significantly different? How can you judge that?
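
When the tree is wrapped in a bagging ensemble, its parameters are addressed with a double-underscore prefix. The prefix depends on the scikit-learn version (older releases call the wrapped model `base_estimator`, newer ones `estimator`), so rather than assuming one, a quick sketch like the following lists the names the installed version accepts. The variable names `nested` and `top_level` are illustrative only.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(DecisionTreeClassifier())

# get_params() returns every name that GridSearchCV accepts in param_grid
param_names = sorted(bag.get_params().keys())

# Names containing "__" belong to the wrapped decision tree
nested = [name for name in param_names if "__" in name]
top_level = [name for name in param_names if "__" not in name]

print("ensemble parameters:", top_level)
print("nested tree parameters:", nested)
```

Any key printed here can be used as-is in the `param_grid` dictionary passed to `GridSearchCV`.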

In [8]:
from sklearn.model_selection import GridSearchCV

params = {"max_depth": [3,5,10,20],
          "max_features": [None, "auto"],
          "min_samples_leaf": [1, 3, 5, 7, 10],
          "min_samples_split": [2, 5, 7]
         }
    

tree_grid = GridSearchCV(tree, params, n_jobs=-1, cv=5)
tree_grid.fit(X,y)


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [None, 'auto'], 'min_samples_split': [2, 5, 7], 'max_depth': [3, 5, 10, 20], 'min_samples_leaf': [1, 3, 5, 7, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [9]:
print(tree_grid.best_params_)
print(tree_grid.best_score_)

{'max_features': 'auto', 'min_samples_split': 2, 'max_depth': 20, 'min_samples_leaf': 7}
0.947275922671


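The grid-searched tree is better than the untuned tree: the best cross-validated score rises from about 0.909 to about 0.947. It is still slightly below the untuned bagging classifier (about 0.951).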

In [None]:
params = {"base_estimator__max_depth": [3,5,10,20],
          "base_estimator__max_features": [None, "auto"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          'bootstrap_features': [False, True],
          'max_features': [0.5, 0.7, 1.0],
          'max_samples': [0.5, 0.7, 1.0],
          'n_estimators': [2, 5, 10, 20],
         }
    

bagging_grid = GridSearchCV(bag, params, n_jobs=-1, cv=5)
bagging_grid.fit(X,y)

In [None]:
print(bagging_grid.best_params_)
print(bagging_grid.best_score_)
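
The grid above has 4 × 2 × 5 × 3 × 2 × 3 × 3 × 4 = 8,640 combinations, which means 43,200 fits with 5-fold cross-validation. If that is too slow, one alternative (a different technique from the exhaustive search the lab asks for) is `RandomizedSearchCV`, which samples a fixed number of combinations. The sketch below is illustrative: `n_iter=10`, the reduced value lists, and `random_state=0` are arbitrary choices, and the `prefix` variable is only there to cope with the parameter rename across scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
bag_clf = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# Pick whichever nested prefix the installed scikit-learn version exposes
if "estimator__max_depth" in bag_clf.get_params():
    prefix = "estimator__"
else:
    prefix = "base_estimator__"

param_distributions = {
    prefix + "max_depth": [3, 5, 10, 20],
    prefix + "min_samples_leaf": [1, 3, 5, 7, 10],
    "max_features": [0.5, 0.7, 1.0],
    "max_samples": [0.5, 0.7, 1.0],
    "n_estimators": [5, 10, 20],
}

# Sample 10 combinations instead of trying all of them
search = RandomizedSearchCV(bag_clf, param_distributions, n_iter=10, cv=5,
                            random_state=0)
search.fit(X_bc, y_bc)

print(search.best_params_)
print(search.best_score_)
```

Passing `n_jobs=-1` to `RandomizedSearchCV` runs the fits in parallel, as with `GridSearchCV`. A random search does not guarantee finding the best combination in the grid; it trades coverage for time.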

## 2. Diabetes and Regression

scikit-learn includes a dataset of diabetes patients, taken from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y.
2. Initialize a Decision Tree Regressor and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
3. Wrap a Bagging Regressor around the Decision Tree Regressor and use `cross_val_score` to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How can you judge that?

In [24]:
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

diabetes = load_diabetes()
X = diabetes['data']
y = diabetes['target']

# cross_validate reads the global X and y, so it now scores on the diabetes data
tree_r = DecisionTreeRegressor()
print(cross_validate(tree_r).mean())

# The target is continuous, so wrap the tree in BaggingRegressor, not BaggingClassifier
bagging_reg = BaggingRegressor(tree_r)
print(cross_validate(bagging_reg).mean())


-0.150959103114
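
For regressors, `cross_val_score` defaults to the estimator's own `score` method, which is R², so a negative value means the model did worse than always predicting the mean. The sketch below makes the metric explicit and, because the target is continuous, uses `BaggingRegressor` for the ensemble. The fixed `random_state`, the second metric (`neg_mean_squared_error`), and the variable names are additions for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_db, y_db = load_diabetes(return_X_y=True)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(DecisionTreeRegressor(random_state=0),
                                random_state=0),
}

results = {}
for name, model in models.items():
    r2 = cross_val_score(model, X_db, y_db, cv=5, scoring="r2")
    # scikit-learn negates MSE so that larger is always better
    neg_mse = cross_val_score(model, X_db, y_db, cv=5,
                              scoring="neg_mean_squared_error")
    results[name] = {"r2": r2, "neg_mse": neg_mse}
    print("%-8s R^2 = %.3f   MSE = %.1f" % (name, r2.mean(), -neg_mse.mean()))
```

Reporting both metrics helps when answering "which score will you use?": R² is unitless and easy to compare across datasets, while MSE is in squared units of the target.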


### 2.b Grid Search

Repeat the grid search as above:

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Regressor.
2. Search over a few values of the parameters in order to improve the score of the regressor.
3. Use the whole X, y dataset for your test.
4. Check the best\_score\_ once you've trained it. Is it better than before?
5. How does the score of the grid-searched DT compare with the score of the Bagging DT?
6. Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Regressor.
7. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid space too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
8. Does the score improve for the grid-searched Bagging Regressor?
9. Which score is better? Are the scores significantly different? How can you judge that?
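
A minimal sketch of these steps for the regression case follows. It is an illustration under stated assumptions rather than a reference solution: the parameter values are arbitrary small grids, `random_state=0` is added for repeatability, and the `prefix` variable exists only to handle the `base_estimator`/`estimator` rename across scikit-learn versions.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_db, y_db = load_diabetes(return_X_y=True)

# Grid search over a single regression tree (scored with R^2 by default)
tree_params = {"max_depth": [2, 3, 5, 10],
               "min_samples_leaf": [1, 5, 10, 20]}
tree_grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                         tree_params, cv=5)
tree_grid.fit(X_db, y_db)

# Grid search over the bagged tree, kept small so it finishes quickly
bag_reg = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)
if "estimator__max_depth" in bag_reg.get_params():
    prefix = "estimator__"
else:
    prefix = "base_estimator__"
bag_params = {prefix + "max_depth": [3, 5, None],
              "max_samples": [0.5, 1.0],
              "n_estimators": [10, 20]}
bag_grid = GridSearchCV(bag_reg, bag_params, cv=5)
bag_grid.fit(X_db, y_db)

print("tree:   ", tree_grid.best_params_, round(tree_grid.best_score_, 3))
print("bagging:", bag_grid.best_params_, round(bag_grid.best_score_, 3))
```

Both `best_score_` values are mean R² over the same 5 folds, so they can be compared directly with each other and with the untuned scores from 2.a.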